Multi-sensor large-scale dataset for multi-view 3D reconstruction


Oleg Voynov1, Gleb Bobrovskikh1, Pavel Karpyshev1, Andrei-Timotei Ardelean1,
Arseniy Bozhenko1, Saveliy Galochkin1, Ekaterina Karmanova1, Pavel Kopanev1,
Yaroslav Labutin-Rymsho2, Ruslan Rakhimov1, Aleksandr Safin1, Valerii Serpiva1,
Alexey Artemov1, Evgeny Burnaev1, Dzmitry Tsetserukou1, Denis Zorin3

1 Skolkovo Institute of Science and Technology, 2 Huawei Research Moscow, 3 New York University

{oleg.voinov, g.bobrovskih, pavel.karpyshev, timotei.ardelean}@skoltech.ru,
{a.bozhenko, saveliy.galochkin, e.karmanova, pavel.kopanev}@skoltech.ru,
labutin.rymsho.yaroslav@huawei.com, {ruslan.rakhimov, aleksandr.safin}@skoltech.ru,
{v.serpiva, a.artemov, e.burnaev, d.tsetserukou}@skoltech.ru, dzorin@cs.nyu.edu

arXiv:2203.06111v1 [cs.CV] 11 Mar 2022

Figure 1. A representative set of objects from our dataset. We focus on challenging cases for depth sensors or 3D reconstruction algorithms. Evaluation of common 3D reconstruction methods using our dataset demonstrates its potential value (lower row).

Abstract

We present a new multi-sensor dataset for 3D surface reconstruction. It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and a structured-light scanner. The data for each scene is obtained under a large number of lighting conditions, and the scenes are selected to emphasize a diverse set of material properties challenging for existing algorithms. In the acquisition process, we aimed to maximize high-resolution depth data quality for challenging cases to provide reliable ground truth for learning algorithms. Overall, we provide over 1.4 million images of 110 different scenes acquired at 14 lighting conditions from 100 viewing directions. We expect our dataset will be useful for evaluation and training of 3D reconstruction algorithms of different types and for other related tasks. Our dataset and accompanying software will be available online at adase.group/3ddl/projects/sk3d.

1. Introduction

Reconstruction of the 3D geometry of physical objects and scenes is an important task for a broad range of applications. Sensor data used in 3D reconstruction range from highly specialized and expensive CT, laser, and structured-light scanners to video from commodity cameras and depth sensors; computational 3D reconstruction methods are typically tailored to a particular type of data. Yet even commodity hardware increasingly provides multi-sensor data: for example, many recent phones have multiple RGB cameras as well as lower-resolution depth sensors. Using data from different sensors, RGB-D data in particular, has the potential to considerably improve the quality of 3D reconstruction. For example, multi-view stereo (MVS) algorithms produce high-quality 3D geometry from RGB data but may miss featureless surfaces; supplementing RGB images with depth sensor data makes it possible to obtain more complete reconstructions. Conversely, commodity depth sensors often lack the resolution provided by RGB cameras.
Combining multi-view RGB and depth data in a single algorithm is challenging; fortunately, recent learning-based techniques substantially simplify this task. For single-modality data, learning-based algorithms also have the promise of being more robust to variations in reflection properties and lighting conditions. However, learning methods require suitable datasets for training. A number of excellent datasets were developed over time, with new datasets introduced in parallel with advances in sensors, as well as to provide more varied or challenging data (e.g., indoor and outdoor scenes, challenging surface properties, varying lighting conditions) or data suitable for learning. Our dataset aims to complement existing ones in all of these ways, as discussed in more detail in Sections 2 and 3.

The structure of our dataset is expected to benefit research on 3D reconstruction in several ways.
• Multi-sensor data. We provide aligned data from seven different devices, including low-resolution depth data from commodity sensors, high-resolution geometry data from a structured-light scanner, and RGB data at different resolutions and from different cameras. This enables supervised learning for reconstruction methods relying on different combinations of sensor data, in particular the increasingly common combination of high-resolution RGB with low-resolution depth data. In addition, multi-sensor data simplifies comparison of methods relying on different types of sensors (RGB, depth, and RGB-D).
• Lighting and pose variability. We aimed to make the dataset large enough (1.44 M images of different modalities in total) to enable training machine learning algorithms, and to provide the systematic variability in camera poses (100 per object), lighting (14 lighting setups), and reflection properties that these algorithms need.
• Object selection. Among the 110 objects in our dataset, we primarily include objects that may present challenges to the existing algorithms mentioned above (see examples in Figure 1); we made a special effort to improve the quality of the high-resolution 3D structured-light data for these objects.

Our focus is on RGB and depth data for individual objects similar to [24], in a laboratory setting, rather than on complex scenes with natural lighting (e.g., [27, 43]). This provides the means for systematic exploration and isolation of different factors contributing to the strengths and weaknesses of different algorithms, and complements the more holistic evaluation and training data provided by datasets with complex scenes.

While our evaluation is restricted to multi-view 3D reconstruction methods, the dataset can be used for testing and training in several other related tasks (Section 3).

Dataset           Sensor types   RGB, MPix   Depth, MPix   Hi-res. geom.   Poses/scene   Lighting   # Scenes   # Frames
DTU [24]          RGB (2)        2                         ✓               49/64         8          80         27K
                  SLS
ETH3D [43]        RGB            24                        ✓               10–70         U          24         11K
                  TLS            —
TnT [27]          RGB            8                         ✓               150–300       U          21         148K
                  TLS            —
BlendedMVG [66]   unknown        3/0.4                                     20–1000       U          502        110K
BigBIRD [47]      RGB (5)        12                        —               600           1          120        144K
                  RGB-D (5)      1.2         0.3
ScanNet [11]      RGB-D          1.3         0.3                           NA            U          1513       2.5M
Ours              RGB (2)        5           —             ✓               100           14         110        913K
                  RGB-D 1 (2)    40          0.04
                  RGB-D 2        2           0.2
                  RGB-D 3        2           0.9
                  SLS            —           —

Table 1. Comparison of our dataset to the most widely used related datasets. U indicates uncontrolled lighting; frames are counted per sensor, i.e., all data from an RGB-D sensor are counted as a single frame. The number of separate images acquired may be considerably larger (1.4 M for our dataset). All scenes, from both training and testing sets, were counted.

2. Related work on datasets

Many datasets for tasks related to 3D reconstruction have been developed (see, for example, [33] for a survey of datasets related to simultaneous localization and mapping (SLAM), including a survey of 3D reconstruction datasets); we only discuss the datasets most closely related to ours. We review how these datasets are used to evaluate and train methods for a range of computer vision tasks in Section 3.

Sensors. For multi-view stereo (MVS) datasets, high-resolution RGB, either photo [1, 43, 66] or video [27], is standard; in many cases, a structured-light scanner (SLS) [1] or a terrestrial laser scanner (TLS) [27, 43] is used to obtain high-resolution 3D ground truth. Datasets designed for tasks like SLAM, object classification, and segmentation often include low-resolution depth data acquired using devices like Microsoft Kinect or Intel RealSense [11, 38, 47, 48], but do not include high-resolution depth data. Our dataset aims to improve on the existing ones and enable new 3D reconstruction tasks by providing aligned image and depth data from multiple sensors, including low- and high-resolution depth.

The recently proposed RGB-D-D dataset [21] makes a step in a similar direction to ours, pairing data from a 240 × 180 phone ToF camera with a medium-resolution 640 × 480 Lucid Helios time-of-flight (ToF) depth camera. Our collection provides depth data at three levels of accuracy, including, in addition to similar depth sensors, high-resolution data from a structured-light scanner.

We should also mention synthetic benchmarks such as [3, 20, 31]. SyB3R [31] can generate large training sets easily using a generator simulating an actual acquisition process. However, real data are required to model sensors faithfully, train generators, and test trained algorithms.

Scene choice, lighting, and poses. Datasets focusing on individual objects and controlled lighting include Middlebury [44], TUM [9], and the most widely used DTU MVS dataset [1, 24]. Our dataset is significantly larger than DTU on a number of parameters, as we show in Table 1.

Most other MVS datasets, while containing some images of isolated objects, focus on complete scenes, often collected with hand-held, freely positioned cameras [27, 43, 51]. The Redwood dataset [38] contains the largest number of objects (over 10 K with raw RGB-D data, including 398 with 3D reconstructions), but only low-resolution depth data. In robotics, several datasets were developed for SLAM: CoRBS [56] with high-resolution 3D data, but only 4 scenes; BigBIRD [47] with 120 objects, but only with low-resolution depth and no lighting variation. Among datasets with high-resolution 3D scanner data, we provide the largest number of objects, the largest number of lighting conditions, and the most challenging objects, as we show in Tables 1 and 2.

Dataset                           Featureless   Glossy   Reflective   Transparent   Relief   Periodic
BlendedMVG [66], buildings        46            6        5            1             47       3
BlendedMVG [66], other objects    99            4        7            4             2        1
ETH3D [43]                        19            9        8            1             1        3
TnT [27]                          7             11       9            5             9        4
DTU [24]                          52            10       17           15            21       1
Ours                              52            58       45           21            17       7

Table 2. Statistics of scenes with different surface properties across datasets. Compared to existing datasets, ours includes a larger number of scenes with challenging surface properties.

3. Motivating tasks

Most datasets closely related to ours are designed to support the development and testing of RGB image-based multi-view stereo (MVS) and structure-from-motion (SfM, SLAM) algorithms. We expand the range of supported algorithms to include reconstruction algorithms relying on combining RGB and depth data of different resolutions; our dataset can also be used for multiple additional tasks described below. At the same time, we do not aim to directly support a number of common tasks, such as camera pose estimation, object classification or segmentation, or SLAM.

RGB image-to-geometry reconstruction. In this task, high-resolution RGB images are used as input to produce either depth maps or complete 3D geometry descriptions. Standard MVS methods compute depth maps, which are fused into point clouds. This category includes PatchMatch-based [2] algorithms [40, 63, 64], recent learning-based approaches [6, 19, 29, 30, 34, 54, 59, 65, 69], and hybrid methods [15, 29]. These methods are predominantly tested and trained on DTU [24] and, to a lesser extent, on ETH3D [43], Tanks and Temples (TnT) [27], BlendedMVS (BlendedMVG) [66], and other datasets.

Several recent methods [36, 37, 55, 67] reconstruct a compact implicit surface representation encoded by a neural network from a set of RGB images. All these methods use the DTU dataset for evaluation, with some also using BlendedMVG and other datasets. Several subsets of our dataset can be used instead of or together with DTU and similar datasets. For methods of this type, we contribute higher-quality ground truth for objects with challenging surface properties, a larger set of such objects, and more lighting variation.

Another class of learning-based methods directly reconstructs a voxelized representation of an implicit surface [35, 52]. These methods are trained and tested using the ScanNet [11] and 7-Scenes [18] datasets, with relatively low resolution and high noise in the depth data, and are likely to benefit from including our high-resolution depth data in training.

Depth fusion and RGB-D reconstruction. Starting with the classical method [10], many techniques aim to fuse depth maps into a truncated signed distance function (TSDF). Learning-based methods have been proposed for this type of task [57, 58]. The challenge of evaluating and training these methods using the existing datasets is the absence of high-resolution depth sensor data to be used as ground truth. For this reason, synthetic datasets [5, 61] are used for training, or a single-resolution depth is subsampled to obtain depth maps to serve as inputs.

Several methods use data from RGB-D sensors, like Microsoft Kinect [13, 42, 60], producing voxel-based TSDFs or surfel clouds as output. The synthetic ICL-NUIM SLAM benchmark [20] is the one most commonly used for comparing these methods. Learning-based algorithms for this task have started appearing [22]; these are, so far, trained on synthetic data like [5]. One exception is [12] and [14], trained on real data from Matterport3D [4] and SUNCG [49] with inputs obtained by subsampling. Our dataset contains inputs from depth sensors of low and high resolutions along with associated registered higher-resolution RGB images, providing a framework for evaluating and training both depth fusion and RGB-D fusion algorithms, as well as developing new ones.

Depth super-resolution and completion. Improving the depth maps obtained either by MVS or by direct sensor measurement is often considered a separate task in the reconstruction pipeline, or may be of direct value in applications. Due to the absence of datasets with sensor data at different resolutions, depth map super-resolution methods [23, 50, 53, 62] rely on artificial pairs of low- and high-resolution depth maps constructed by subsampling. Depth map completion work [25, 70] suffers from similar issues. Our dataset provides real-world pairs for supervised learning in this context.

View synthesis and relighting. A range of methods [7, 17, 39] synthesize novel views of objects from collections of existing images. The large number of poses and lighting setups in our dataset facilitates learning in these tasks, and the multi-view depth data can be used in related tasks such as synthesis of depth maps for novel views or pixelwise visibility estimation.

Sensor modeling and inter-sensor generalization. As datasets of the type we describe are difficult to collect, there is considerable interest in (pre-)training on synthetic data. The success of this approach depends on how faithfully such data reproduce sensor behavior [28, 31], which is often difficult to model. Our dataset includes data from multiple commodity sensors, on which image and depth data synthesis algorithms can be trained to reproduce the behavior of specific sensors. At the same time, our dataset supports testing sensor-to-sensor generalization of learning-based methods by offering aligned images for multiple sensors.

4. Dataset

4.1. Overview

Our dataset consists of 110 scenes with a single everyday object or a small group of objects on a black background; see examples in Figure 1 and all scenes in the supplementary material. For collection of the dataset we used a set of sensors mounted on a Universal Robots UR10 robotic arm with 6 degrees of freedom and sub-millimeter position repeatability. We used the following sensors, shown in Figure 2 on the left:
• RangeVision Spectrum structured-light scanner (SLS),
• two The Imaging Source DFK 33UX250 industrial RGB cameras,
• two Huawei Mate 30 Pro phones with ToF sensors,
• Intel RealSense D435 active stereo RGB-D camera,
• Microsoft Kinect v2 ToF RGB-D camera.

Figure 2. Our acquisition setup (view in zoom). We included a diverse set of seven commonly used RGB and RGB-D sensors, mounting them on a shared metal rig to aid data alignment. We constructed a metal frame surrounding the scanning area and installed various light sources to provide 14 lighting setups.

We surrounded the scanning area with a metal frame to which we attached the light sources: seven directional lights, three diffuse soft-boxes, and LED strips, shown in Figure 2 on the right. We also used the flashlights of the phones.

For each scene, we moved the camera rig through 100 fixed positions on a sphere with a radius of 70 cm and collected the data using 14 lighting setups. For each device, except the SLS, we collected raw RGB, depth, and infrared (IR) images, including both left and right IR for RealSense. In total, we collected 15 raw images per scene, camera position, and lighting setup: 6 RGB, 5 IR, and 4 depth images, as illustrated in Table 3 and Figure 3. As the data from the ToF sensors of the phones and Kinect is unaffected by the lighting conditions, we captured this data once per camera position. For the SLS we collected partial scans from 27 positions.

Device           #   RGB   Depth   IR    Intr.   Extr.   Rec.
DFK 33UX250      2   ✓*    —       —     ✓       ✓       —
Mate 30 Pro      2   ✓*    ✓       ✓     ✓       ✓       —
RealSense D435   1   ✓*    ✓*      ✓✓*   ✓       ✓       —
Kinect v2        1   ✓*    ✓       ✓     ✓       ✓       —
Spectrum         1   —     ✓       —     —       —       ✓

Table 3. Composition of our dataset. We provide RGB, depth, and IR images, intrinsic (Intr.) and extrinsic (Extr.) calibration parameters, and a reference mesh reconstruction (Rec.). The data marked with * is captured per lighting setup.

Figure 3. Images captured from all sensors and the reference mesh reconstruction.

In addition to the raw images and partial structured-light scans, we include in our dataset the intrinsic parameters of the cameras and their positions, RGB and depth images with lens distortion removed, cleaned-up structured-light scans, and meshes reconstructed from the complete set of scans.

4.2. Design decisions

The design of our dataset was primarily determined by the requirements of methods for high-quality 3D reconstruction based on large collections of sensor data (in contrast to tasks such as real-time or monocular reconstruction).
Choice of sensors. We aimed to include a variety of RGB and depth sensors commonly used in practice, and high-resolution sensors that can be used to generate high-quality reference data. Smartphones with a depth sensor are increasingly widely available but have the lowest depth resolution and accuracy; Kinect is another consumer-level ToF sensor with better accuracy; RealSense active stereo devices are widely used in robotics; finally, a structured-light scanner has the highest depth resolution and accuracy and is typically used as the source of ground truth for evaluation of 3D reconstruction quality. All these devices except the SLS are also sources of consumer-level RGB data with different fields of view and resolutions. We supplemented them with industrial RGB cameras with high-quality optics and low-noise sensors.

Laboratory setting and lighting. We chose to focus on a setup with controlled (but variable) lighting and a fixed set of camera positions for all scenes. While a more complex type of data with natural light and trajectory, as in the ETH3D or TnT datasets, is excellent for stress-testing algorithms, identifying specific sources of weaknesses of a particular approach is likely to be easier if lighting and camera positions vary in a controlled way, consistently across the whole dataset. Furthermore, the laboratory setting considerably simplified collecting well-aligned multi-sensor data.

We aimed to provide a broad range of realistic lighting conditions, illustrated in Figure 5: directional light sources and the flashlights of the phones provide eight samples of "hard" light, typical, for example, for streetlight; soft-boxes provide three samples of diffuse light, typical for indoor illumination; LED strips imitate ambient light, typical for cloudy weather. To minimize the level of light reflected from objects outside the scene we used a black cloth as the background.

Figure 5. Four types of lighting in our dataset. Lighting can affect the depth captured with the active stereo camera of RealSense.

Scene selection. In our choice of specific objects, we aimed to ensure that, on the one hand, a variety of material and geometric properties are represented, and, on the other hand, there are enough samples of objects with material properties of the same type. While a few scenes in our dataset contain multiple objects, we did not aim to present objects in a "natural" cluttered environment.

Preparation of objects for 3D scanning. Our goal was to include objects with surface reflection properties that challenge common sensors and existing algorithms. However, these objects often challenge structured-light scanning too, making it hard to obtain reliable high-resolution depth suitable for use as ground truth. To get the highest-quality structured-light scans we applied a temporary coating to the scanned objects (see Figure 4), as we describe in Section 4.3.

Figure 4. A partial scan without (left) and with coating (right).

4.3. Data acquisition

We outline the most important aspects of our acquisition process and data post-processing here and provide additional details in the supplementary material.

Camera calibration, i.e., reducing the data from different sensors to a single coordinate system, is essential for using this data jointly in computer vision tasks. We used the calibration pipeline of [41] for generic camera models, which allow for more accurate calibration than the parametric camera models used most commonly, including in DTU and ETH3D. The original implementation of this pipeline supports calibration of rigid camera rigs; however, its straightforward application to our camera rig as a single rigid whole was numerically unstable due to the properties of our setup. Firstly, the included sensors have a large variation in field of view (30–90°) and resolution (0.04–40 MPix). Secondly, the focus of the phone cameras, being fixed programmatically, fluctuates slightly over time, which we relate to thermal deformations of the device (see [16] for a study of such an effect). Finally, the camera rig deforms slightly depending on its tilt in different scanning positions. To avoid the loss of accuracy we split the calibration procedure into several steps.

First, we obtained intrinsic camera models for each sensor independently. Then, for the SLS and all RGB sensors except the sensors of the phones, i.e., the sensors with a relatively high resolution and stable focus, we estimated the pose within the rig for its "neutral" tilt, as if the rig were rigid. Next, we estimated the poses of the RGB sensors of the phones within the rig, keeping the estimated poses of the other sensors fixed. After that, we estimated the pose of each RGB sensor and the SLS for the different scanning positions of the robotic arm, individually for each sensor, and then transformed all the poses into the same space "through" a position with a neutral rig tilt. Finally, we estimated the pose of each depth/IR sensor assuming it is attached rigidly to its RGB companion.

This whole procedure required capturing thousands of images of a calibration pattern, which we made almost fully automated with the use of the robotic arm, except for several manual reorientations of the pattern. As a result, for all RGB and depth sensors and the SLS we obtained central generic camera models with 2.3–4.7 K parameters, depending on the sensor resolution, and the poses of the sensor in the global space for each scanning position of the robotic arm. The mean calibration error for different sensors at different steps of the procedure was in the range 0.04–0.4 px, or approximately in the range 0.024–0.15 mm. We explain these measurements in the supplementary material.
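The pose chaining in the last step can be illustrated compactly. The sketch below is a simplified Python/NumPy example with hypothetical 4×4 camera-to-world pose matrices, not the actual calibration code (which operates on generic camera models): it derives the pose of a depth sensor at an arbitrary arm position from the per-position pose of its RGB companion and the fixed relative transform estimated at the neutral rig tilt.

```python
import numpy as np

def inv_se3(T):
    """Invert a 4x4 rigid transform."""
    R, t = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T
    Tinv[:3, 3] = -R.T @ t
    return Tinv

# Hypothetical camera-to-world poses estimated at the neutral rig tilt.
T_world_rgb_neutral = np.eye(4)     # RGB sensor (placeholder values)
T_world_depth_neutral = np.eye(4)   # its depth/IR companion (placeholder values)

# Fixed depth-to-RGB transform, assumed constant because the pair is rigidly mounted.
T_rgb_depth = inv_se3(T_world_rgb_neutral) @ T_world_depth_neutral

# Pose of the RGB sensor estimated individually at scanning position p ...
T_world_rgb_p = np.eye(4)           # placeholder
# ... yields the depth sensor pose at p by chaining through the rigid pair.
T_world_depth_p = T_world_rgb_p @ T_rgb_depth
```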

The data acquisition procedure for each scene consisted of the following steps:
1. Object placement ensured that a greater area of the object with features of interest was visible to the sensors.
2. Sensor adjustment set the optimal values of camera exposure and gain, and laser power for RealSense.
3. For structured-light scanning, the object was covered with a vanishing opaque matte coating and scanned from 27 positions. After the coating vanished, 5 additional scans were made and later registered to the scans of the coated object to verify that it had not been deformed.
4. Finally, RGB and low-resolution depth sensor data were acquired.

Since we observed slight variations of the intrinsic camera parameters depending on the temperature, we warmed up all devices at the beginning of each day by scanning empty space until the parameters were stable (about 1 hour), and then kept the devices warm by scanning constantly.

Sensor exposure/gain adjustment was a critical step for obtaining data useful in computer vision tasks. Using universal settings for all scenes would lead to low image quality caused by over- or underexposure due to variations of object surface properties. Adjusting the settings automatically for each scene using the hardware auto-exposure would also be suboptimal, since the black background, occupying a substantial part of the image, would cause overexposure of the object. To collect high-quality data we employed a custom auto-exposure algorithm that was inspired by [45, 46] and worked reasonably well.

For each sensor, we extracted the foreground mask of the scene from the images with and without the object using the method of [32], and then, for each lighting, we obtained the minimal-noise setting by setting the gain to the minimum and maximizing the Shannon entropy of the foreground image w.r.t. the exposure value. Additionally, we fixed the optimal exposure and optimized the entropy w.r.t. the gain value to prevent underexposure of very dark objects (typically, the gain stayed close to the minimum). For the flash and ambient lighting, we also obtained a real-time / high-noise setting by setting the exposure to 30 FPS and optimizing the gain, and then fixing the gain and optimizing the exposure.

We did the exposure/gain adjustment at one fixed position of the rig per scene, for each RGB sensor except Kinect, for which these controls are not available, and for the RealSense IR sensor. Each optimization loop required 10–50 iterations. For RealSense we also picked the optimal power of the IR projector by capturing depth at all 12 settings and picking the one with the lowest number of pixels with missing values.
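The entropy criterion used in this adjustment can be sketched as follows. This is a minimal Python illustration, not the actual implementation: capture(exposure, gain) is a hypothetical function returning an 8-bit grayscale image, mask is a precomputed boolean foreground mask, and the optimization loop described above is replaced with a simple grid search over exposure at minimum gain.

```python
import numpy as np

def foreground_entropy(image, mask):
    """Shannon entropy (bits) of the 8-bit intensity histogram over foreground pixels."""
    hist = np.bincount(image[mask].ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def pick_exposure(capture, mask, exposures, gain_min=0):
    """Grid-search stand-in for the optimization loop: fix the gain at its minimum
    and keep the exposure value that maximizes the foreground entropy."""
    scores = [foreground_entropy(capture(exp, gain_min), mask) for exp in exposures]
    return exposures[int(np.argmax(scores))]
```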
Reduction of sensor cross-talk. We aimed, whenever possible, to minimize the effects of the sensors on each other. For example, the IR projector of the Microsoft Kinect v2 cannot be turned off and affects the other depth sensors, so we added an external shutter to close the projector while the other sensors are imaging. Similarly, the time-of-flight sensors of the phones affect each other, so we stopped the camera application on one of the phones during depth capture with the other one.

Data post-processing. To simplify the use of the structured-light data for evaluation and training, we obtained clean partial scans and mesh reconstructions from the raw scans. For this, we first globally aligned the raw partial scans using the method of [8], initialized with the poses obtained during calibration. For each point in each partial scan we calculated the distance to the closest point in all other scans and manually rejected the scenes with bad alignment based on the statistics of this value. Then, from the aligned scans we reconstructed the surface using Screened Poisson reconstruction [26] with the cell width set to 0.3 mm, which is a conservative estimate of the accuracy of our SLS. After that, we manually removed the parts of this surface corresponding to outlier scanning artefacts, and automatically removed the fill artefacts of the reconstruction by keeping only the vertices within 0.3 mm of the raw scans. Finally, we filtered out artefacts in the partial scans by keeping only the points within 0.3 mm of the clean reconstructed surface.
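The reconstruction and fill-artefact removal steps can be approximated with off-the-shelf tools such as Open3D, as in the sketch below. It is a simplified stand-in, not the exact pipeline: it assumes the partial scans are already globally aligned, stored at hypothetical paths, and expressed in millimetres.

```python
import numpy as np
import open3d as o3d

# Hypothetical paths to the globally aligned partial scans of one scene.
aligned_scan_paths = ["scan_00.ply", "scan_01.ply"]

# Merge the scans into a single point cloud.
merged = o3d.geometry.PointCloud()
for path in aligned_scan_paths:
    merged += o3d.io.read_point_cloud(path)
if not merged.has_normals():       # SLS scans usually come with normals already
    merged.estimate_normals()

# Screened Poisson reconstruction targeting a finest octree cell width of 0.3 mm
# (if `width` has no effect in your Open3D version, choose `depth` to reach ~0.3 mm cells).
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(merged, width=0.3)

# Remove fill-in artefacts: keep only mesh vertices within 0.3 mm of the raw scans.
dist = np.asarray(
    o3d.geometry.PointCloud(mesh.vertices).compute_point_cloud_distance(merged))
mesh.remove_vertices_by_mask(dist > 0.3)
```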
In our experiments, we used the clean partial scans as the reference data for evaluation, and the clean reconstructed surface for ray-tracing the depth maps needed for training.
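Generating such depth maps by ray-tracing the reconstructed surface can be approximated as in the sketch below, which uses Open3D's ray casting under a pinhole camera assumption with placeholder intrinsics K, world-to-camera extrinsics T, and image size W × H; the actual pipeline uses the generic camera models obtained during calibration.

```python
import numpy as np
import open3d as o3d

# Placeholder pinhole intrinsics, world-to-camera extrinsics, and image size.
K = np.array([[7000.0, 0.0, 1224.0],
              [0.0, 7000.0, 1024.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
W, H = 2448, 2048

mesh = o3d.io.read_triangle_mesh("reference_mesh.ply")     # hypothetical path
scene = o3d.t.geometry.RaycastingScene()
scene.add_triangles(o3d.t.geometry.TriangleMesh.from_legacy(mesh))

rays = o3d.t.geometry.RaycastingScene.create_rays_pinhole(
    intrinsic_matrix=o3d.core.Tensor(K, o3d.core.Dtype.Float64),
    extrinsic_matrix=o3d.core.Tensor(T, o3d.core.Dtype.Float64),
    width_px=W, height_px=H)
hits = scene.cast_rays(rays)

t_hit = hits["t_hit"].numpy()              # distance along each ray, inf where no hit
dirs = rays.numpy()[..., 3:]               # ray directions in world coordinates
depth = t_hit * (dirs @ T[2, :3])          # project onto the camera z-axis -> z-depth map
depth[~np.isfinite(depth)] = 0.0           # zero out pixels that miss the surface
```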
For the temporary coating of the scanned objects we used Aesub Blue scanning spray. It sublimates from the surface at room temperature in a few hours, which we reduced to 5–15 minutes by slightly heating the object with a heat gun. To check whether this led to a deformation of the object, we calculated the distance from the scans made without the coating to the full scan made with the coating. We then manually rejected full scenes with observable deformation, which included a few objects made of soft plastic, or cut out the deformed parts from the scans, such as power cords of electronic devices.

5. Experimental evaluation

5.1. Setup

To demonstrate possible applications of our dataset, we used it for testing several methods of 3D reconstruction from multi-view RGB and depth data, and also used it for training one RGB-to-3D reconstruction method and one depth-to-3D reconstruction method. Additionally, we applied an RGB-D reconstruction method to our data.

Methods. ACMP [64] is a PatchMatch-based non-learnable multi-view stereo (MVS) method with strong performance on benchmarks such as Tanks and Temples (TnT). VisMVSNet [69] is a learning-based MVS method based on the plane-sweeping approach, one of the best-performing learnable methods on the TnT benchmark with a publicly available implementation. NeuS [55] is a recent rendering-based method producing a neural representation of a TSDF directly from RGB images. RoutedFusion [57] is a state-of-the-art depth fusion method that performs online TSDF reconstruction using two learnable networks: a "routing" network and a fusion network. SurfelMeshing [42] is an online RGB-D reconstruction method that produces a triangle mesh and uses a surfel cloud as the intermediate representation.

Data and training. For our experiments, we selected 24 testing scenes representing the different surface types present in our dataset, and used the remaining 86 scenes for training. As the input for ACMP, VisMVSNet, and NeuS we used the undistorted images from the right industrial camera. For RoutedFusion we used the depth maps from RealSense. For SurfelMeshing we combined the undistorted RGB images from the right industrial camera with the depth maps from RealSense aligned to these RGB images. We used the RGB images from the industrial camera instead of the RealSense since the former captures images of higher quality. In all experiments, we used the intrinsic camera parameters and camera positions obtained during calibration.

To test the learning-based methods, VisMVSNet and RoutedFusion, we trained two versions of each method. For VisMVSNet, we trained the first version on the BlendedMVG dataset (an extended version of BlendedMVS), and the second version on our dataset, using as targets the depth maps obtained from the surface reconstructed from the SL scans via ray-tracing. For RoutedFusion, we trained the first version on the ModelNet dataset [61], and the second version on our dataset, using as targets for the routing network the depth maps obtained from the SL surface, and as targets for the fusion network the TSDF volumes obtained from the SL surface via the authors' processing pipeline. In all cases, we followed the original authors' training regime.

Quality measures. The five evaluated methods produce the reconstruction in different forms: the full pipeline of ACMP and VisMVSNet produces a point cloud; NeuS produces a neural TSDF with virtually infinite resolution; RoutedFusion produces a TSDF volume with a very limited resolution; and SurfelMeshing produces a triangle mesh. For quantitative evaluation of these methods on our dataset we used two different strategies: one for ACMP, VisMVSNet, and NeuS, and the other one for RoutedFusion. We describe them below. For SurfelMeshing we did not perform any quantitative evaluation, as we explain in Section 5.2.

For quantitative evaluation of ACMP and VisMVSNet we compared the produced point cloud with the reference structured-light data. For NeuS, we extracted the mesh from the TSDF at a resolution of around 0.5 mm, sampled a point cloud from this mesh with point density close to that of the SL scans, and measured the quality of this point cloud. Following common practice (e.g., [27, 43]), we used the threshold-based Precision (accuracy), Recall (completeness), and F-score quality measures. Precision is based on the per-point distance from the reconstructed to the reference data: if all reconstructed points are close to the reference, the result is accurate. Recall measures the opposite one-sided per-point distance, from the reference to the reconstructed data: if for every reference point there is a close point in the reconstruction, the result is complete. To calculate Precision and Recall, we selected a threshold related to the resolution of the reference data, specifically 0.3 mm, and computed the percentage of points for which the distance is less than the threshold. We then calculated the F-score as the harmonic mean of these two numbers; both Precision and Recall must be high for the F-score to be high.
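In code, this threshold-based evaluation reduces to nearest-neighbor distances between the two point clouds. The sketch below is a simplified Python version that assumes both clouds are N×3 arrays in millimetres and omits the occlusion handling and density reweighting discussed next.

```python
import numpy as np
from scipy.spatial import cKDTree

def precision_recall_fscore(reconstructed, reference, tau=0.3):
    """Threshold-based Precision, Recall, and F-score (in percent) for two Nx3 point arrays."""
    d_rec = cKDTree(reference).query(reconstructed)[0]   # accuracy: reconstruction -> reference
    d_ref = cKDTree(reconstructed).query(reference)[0]   # completeness: reference -> reconstruction
    precision = 100.0 * float(np.mean(d_rec < tau))
    recall = 100.0 * float(np.mean(d_ref < tau))
    fscore = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, fscore
```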
For a careful calculation of Precision and Recall for 3D point clouds, two problems have to be considered. First, the reference data is available only for a part of the 3D space, and for the other part it is unknown whether the space is free or occupied by the surface of the object. Second, both the reconstructed and the reference point clouds may have varying point densities, which would cause an uneven contribution of different parts of the surface to the value of the measure. We addressed these problems similarly to [43], with the main difference related to the different properties of our reference data: dense structured-light scans instead of sparser terrestrial laser scans. We describe the details of the calculation of the quality measures in the supplementary material.

Scene                     ACMP          VisMVSNet,    VisMVSNet,
                                        BlendedMVG    Ours
                          AL  FB  H4    AL  FB  H4    AL  FB  H4
amber vase                34  35  30    35  29  28    38  50  54
blue sticky roller        24  28  30     9  32  30     5  50  50
brown relief pot          32  39  36    37  30  34    32  49  53
candlestick thing         20  28  31    17  31  33    17  48  50
dragon                    57  58  64    63  61  67    45  46  54
dumbbells                 14  33  33     4  21  25     5  35  44
fencing mask              37  43  36    33  40  35    38  51  49
green carved pot          28  33  26    26  33  24    39  45  44
grey braided box          52  51  46    47  57  46    64  75  67
large candles             19  29  32     5  28  31     9  50  50
large coral backpack      24  38  35     8  40  37     5  51  48
large white jug           23  26  25    11  25  24    13  42  43
mittens                   18  32  26     7  27  20     3  37  39
orange cash register      28  29  30    11  27  28    20  43  44
orange mini vacuum        25  36  36    15  33  39    19  46  54
painted samovar           28  38  39    21  40  40    23  54  54
skate                     47  59  59    39  67  63    17  49  57
steel grater              29  31  34    21  31  32    39  48  50
white box                 10  25  28     6  23  28     3  32  35
white castle land         21  22  23     1  23  25     5  42  45
white castle towers       23  31  35     2  38  42     9  48  53
white ceramic elephant    28  33  36    20  34  34    21  44  48
white christmas star      21  37  30     8  40  29     4  49  44
white starry jug          31  34  26    31  31  25    46  53  52

Table 4. F-score per scene for the RGB-based methods for three lighting setups: ambient lighting (AL), flash lighting (FB), and hard lighting (H4). The scene names are color-coded in the original table by the dominant surface property: featureless, glossy, reflective, transparent, relief, or periodic textures.

For quantitative evaluation of RoutedFusion we calculated the intersection-over-union (IoU), reported as a percentage, between the produced TSDF volume and the TSDF volume calculated from the SL data. We decided to opt for this quality measure instead of Precision, Recall, and F-score since the
[Table: mean F-score over the test scenes for ACMP, VisMVSNet (trained on BlendedMVG), VisMVSNet (trained on our data), and NeuS, broken down by lighting setup (AL, FB, H4) and by surface-property group (all, featureless, glossy, reflective, transparent, relief, periodic).]

[Table: per-scene results for NeuS and for RoutedFusion trained on ModelNet and on our data.]
RoutedFusion, ModelNet   14 13 11 11                                                9              6 12                           Method
RoutedFusion, Ours       17 17 13 18 24 21 18                                                                                     NeuS                     21           8   20 16 31 14                        8               9     16      5               17 10
Table 5. Average reconstruction quality per surface type and on                                                                   RoutedFusion, ModelNet   11           5    8   10 17          9              26 14                  8      7                8    18
the whole test set, F-score for RGB-based methods and IoU for                                                                     RoutedFusion, Ours       11           9   23 22 20 16 16 28 27 15 21 10
RoutedFusion. The last three columns show average F-scores per                                                                    Table 6. Reconstruction quality per scene, F-score for NeuS and
three selected lighting setups.                                                                                                   IoU for RoutedFusion.
results of RoutedFusion were not comparable to the results
of RGB-based reconstruction methods anyway, while the
IoU score could be directly compared to the results reported
by the authors of the method.
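To make the two metrics concrete, the sketch below shows one common way to compute them on this kind of data: a point-cloud F-score at a fixed distance threshold (precision against the reference scan, recall against the reconstruction) and an IoU between voxel occupancy grids. This is a minimal illustration only; the threshold `tau`, the point sampling, and the occupancy grids are assumptions, and the exact evaluation protocol behind the tables in this paper may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_pts, ref_pts, tau=0.005):
    """Point-cloud F-score at distance threshold tau (threshold is an assumed value, in scene units).

    precision: share of predicted points within tau of the reference scan,
    recall:    share of reference points within tau of the prediction.
    """
    d_pred_to_ref, _ = cKDTree(ref_pts).query(pred_pts, k=1)
    d_ref_to_pred, _ = cKDTree(pred_pts).query(ref_pts, k=1)
    precision = np.mean(d_pred_to_ref < tau)
    recall = np.mean(d_ref_to_pred < tau)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def voxel_iou(pred_occ, ref_occ):
    """IoU between two boolean occupancy grids of the same shape."""
    pred_occ, ref_occ = pred_occ.astype(bool), ref_occ.astype(bool)
    union = np.logical_or(pred_occ, ref_occ).sum()
    return np.logical_and(pred_occ, ref_occ).sum() / union if union else 0.0
```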

5.2. Experimental results
In Table 4 we show the F-score per scene for ACMP and VisMVSNet for three lighting setups: ambient lighting with the real-time / high-noise setting, flash lighting with the minimal-noise setting, and one of the hard lights with the minimal-noise setting. In Table 6 we show the F-score for NeuS and the IoU for RoutedFusion for one lighting setup per scene. In Table 5 we show the results for all methods averaged over scenes featuring specific surface types; for ACMP and VisMVSNet we additionally show the average F-score for three lighting setups. We provide additional quantitative and qualitative results in the supplementary material.

Scene dependence. We observe that the performance of the methods strongly depends on the scene and lighting, with the F-score varying from as low as 2% to 57%; ACMP demonstrates greater stability but a narrower range of F-scores. NeuS reconstruction performs worse than the MVS methods, as it tends to smooth out surfaces; however, it is less sensitive to the surface type. Some scenes, e.g., “dragon”, are reconstructed well by all RGB-based algorithms, while most scenes with a Featureless surface are, not surprisingly, difficult for them. Based on the average scores per surface type in Table 5, we observe that the Featureless type is by far the most difficult one. We note, however, that many scenes in our dataset feature multiple surface types at once, so average results per surface type may be ambiguous.
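To illustrate where this ambiguity comes from, the following sketch shows one possible way such per-type averages could be computed when every scene carries a list of surface-type tags. The DataFrame layout and the column names "f_score" and "types" are hypothetical, not the actual evaluation code of this paper; the point is that a scene contributes to the average of every type it is tagged with, so the per-type means are not computed over disjoint sets of scenes.

```python
import pandas as pd

def average_per_surface_type(scores: pd.DataFrame) -> pd.Series:
    """Average a per-scene metric over surface types.

    Expects one row per scene with a numeric column "f_score" and a
    column "types" holding the list of surface-type tags for that scene
    (e.g. ["glossy", "featureless"]).  A scene with several tags is
    counted in the average of each of its types.
    """
    return (
        scores.explode("types")          # one row per (scene, type) pair
              .groupby("types")["f_score"]
              .mean()
    )
```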
Light dependence. We demonstrate the dependence of MVS results on lighting in the right part of Table 5. While flash and directional lighting do not differ much, ambient lighting is strikingly different, with many nearly total reconstruction failures, which we believe to be due to noisy data in low light. We observe that ACMP is much less sensitive to lighting than the learning-based methods; on the other hand, the learning-based methods substantially outperform ACMP under better lighting.

Reconstruction from RGB-D. Finally, we experimented with reconstruction from our RGB-D data using SurfelMeshing. However, the quality of the reconstruction was insufficient for a meaningful quantitative evaluation of this method, as we illustrate in Figure 6; we hypothesize this is due to the relatively small number of frames per scene.

Figure 6. The structured-light scan of “moon pillow” (left) and the SurfelMeshing reconstruction (right).

6. Conclusions

In this paper, we presented a new dataset for the evaluation and training of 3D reconstruction algorithms. Compared to other available datasets, the distinguishing features of ours include a large number of sensors of different modalities and resolutions (depth sensors in particular), a selection of scenes presenting difficulties for many existing algorithms, and high-quality reference data for these objects. Our dataset can support training and evaluation of methods for many variations of 3D reconstruction tasks. We plan to expand our dataset in the future, following improved versions of the process we have developed. More significant extensions we are considering include randomized camera trajectories and capturing videos.

Acknowledgements. We acknowledge the use of computational resources of the Skoltech CDISE supercomputer Zhores [68] for obtaining the results presented in this paper. E. Burnaev, O. Voynov, and A. Artemov were supported by the Analytical Center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021).

References

[1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120(2):153–168, 2016. 2
[2] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 28(3), Aug. 2009. 3
[3] Matthew Berger, Joshua A Levine, Luis Gustavo Nonato, Gabriel Taubin, and Claudio T Silva. A benchmark for surface reconstruction. ACM Transactions on Graphics (TOG), 32(2):1–17, 2013. 2
[4] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017. 3
[5] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015. 3
[6] Jaesung Choe, Sunghoon Im, Francois Rameau, Minjun Kang, and In So Kweon. Volumefusion: Deep depth fusion for 3d scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16086–16095, October 2021. 3
[7] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. Extreme view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7781–7790, 2019. 3
[8] Sungjoon Choi, Qian-Yi Zhou, and Vladlen Koltun. Robust reconstruction of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5556–5565, 2015. 6
[9] D. Cremers and K. Kolev. Multiview stereo and silhouette consistency via convex functionals over convex domains. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6):1161–1174, 2011. 2
[10] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques - SIGGRAPH ’96, pages 303–312, 1996. ACM Press. 3
[11] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 2, 3
[12] Angela Dai, Christian Diller, and Matthias Niessner. Sg-nn: Sparse generative neural networks for self-supervised scene completion of rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 3
[13] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. BundleFusion: Real-Time Globally Consistent 3D Reconstruction Using On-the-Fly Surface Reintegration. ACM Transactions on Graphics, 36(3):1–18, July 2017. 3
[14] Angela Dai, Yawar Siddiqui, Justus Thies, Julien Valentin, and Matthias Niessner. Spsg: Self-supervised photometric scene generation from rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1747–1756, June 2021. 3
[15] Simon Donne and Andreas Geiger. Learning non-volumetric depth fusion using successive reprojections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 3
[16] Melanie Elias, Anette Eltner, Frank Liebold, and Hans-Gerd Maas. Assessing the influence of temperature changes on the geometric stability of smartphone- and raspberry pi cameras. Sensors, 20(3):643, 2020. 5
[17] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2367–2376, 2019. 3
[18] Ben Glocker, Shahram Izadi, Jamie Shotton, and Antonio Criminisi. Real-time rgb-d camera relocalization. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 173–179. IEEE, 2013. 3
[19] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 3
[20] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE Intl. Conf. on Robotics and Automation, ICRA, Hong Kong, China, May 2014. 2, 3
[21] Lingzhi He, Hongguang Zhu, Feng Li, Huihui Bai, Runmin Cong, Chunjie Zhang, Chunyu Lin, Meiqin Liu, and Yao Zhao. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9229–9238, June 2021. 2
[22] Jiahui Huang, Shi-Sheng Huang, Haoxuan Song, and Shi-Min Hu. Di-fusion: Online implicit 3d reconstruction with deep priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8932–8941, June 2021. 3
[23] Tak-Wai Hui, Chen Change Loy, and Xiaoou Tang. Depth map super-resolution by deep multi-scale guidance. In Proceedings of European Conference on Computer Vision (ECCV), pages 353–369, 2016. 3
[24] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413. IEEE, 2014. 2, 3
[25] Junho Jeon and Seungyong Lee. Reconstruction-based pairwise depth dataset for depth image enhancement using cnn. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. 3
[26] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG), 32(3):1–13, 2013. 6
[27] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017. 2, 3, 7
[28] Sebastian Koch, Yurii Piadyk, Markus Worchel, Marc Alexa, Cláudio Silva, Denis Zorin, and Daniele Panozzo. Hardware design and accurate simulation for benchmarking of 3D reconstruction algorithms. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. 4
[29] Andreas Kuhn, Christian Sormann, Mattia Rossi, Oliver Erdler, and Friedrich Fraundorfer. Deepc-mvs: Deep confidence prediction for multi-view stereo reconstruction. In 2020 International Conference on 3D Vision (3DV), pages 404–413, 2020. 3
[30] Vincent Leroy, Jean-Sebastien Franco, and Edmond Boyer. Shape reconstruction using volume sweeping and learned photoconsistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. 3
[31] Andreas Ley, Ronny Hänsch, and Olaf Hellwich. Syb3r: A realistic synthetic benchmark for 3d reconstruction from images. In European Conference on Computer Vision, pages 236–251. Springer, 2016. 2, 4
[32] Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8762–8771, 2021. 6
[33] Yuanzhi Liu, Yujia Fu, Fengdong Chen, Bart Goossens, Wei Tao, and Hui Zhao. Simultaneous localization and mapping related datasets: A comprehensive survey. arXiv preprint arXiv:2102.04036, 2021. 2
[34] Xinjun Ma, Yue Gong, Qirui Wang, Jingwei Huang, Lei Chen, and Fan Yu. Epp-mvsnet: Epipolar-assembling based depth prediction for multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5732–5740, October 2021. 3
[35] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In European Conference on Computer Vision (ECCV), 2020. 3
[36] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3504–3515, 2020. 3
[37] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. arXiv preprint arXiv:2104.10078, 2021. 3
[38] Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Colored point cloud registration revisited. In ICCV, 2017. 2, 3
[39] Gernot Riegler and Vladlen Koltun. Free view synthesis. In European Conference on Computer Vision, pages 623–640. Springer, 2020. 3
[40] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision (ECCV), 2016. 3
[41] Thomas Schops, Viktor Larsson, Marc Pollefeys, and Torsten Sattler. Why having 10,000 parameters in your camera model is better than twelve. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535–2544, 2020. 5
[42] Thomas Schops, Torsten Sattler, and Marc Pollefeys. SurfelMeshing: Online Surfel-Based Mesh Reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2494–2507, Oct. 2020. 3, 7
[43] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017. 2, 3, 7
[44] Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), volume 1, pages 519–528. IEEE, 2006. 2
[45] Inwook Shim, Tae-Hyun Oh, Joon-Young Lee, Jinwook Choi, Dong-Geol Choi, and In So Kweon. Gradient-based camera exposure control for outdoor mobile platforms. IEEE Transactions on Circuits and Systems for Video Technology, 29(6):1569–1583, 2018. 6
[46] Ukcheol Shin, Jinsun Park, Gyumin Shim, Francois Rameau, and In So Kweon. Camera exposure control for robust robot vision with noise-aware image quality assessment. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1165–1172. IEEE, 2019. 6
[47] Arjun Singh, James Sha, Karthik S Narayan, Tudor Achim, and Pieter Abbeel. BigBIRD: A large-scale 3D database of object instances. In 2014 IEEE international conference on robotics and automation (ICRA), pages 509–516. IEEE, 2014. 2, 3
[48] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015. 2
[49] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3
[50] Xibin Song, Yuchao Dai, Dingfu Zhou, Liu Liu, Wei Li, Hongdong Li, and Ruigang Yang. Channel attention based iterative residual learning for depth map super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 3
[51] Christoph Strecha, Wolfgang Von Hansen, Luc Van Gool, Pascal Fua, and Ulrich Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In 2008 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2008. 3
[52] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. Neuralrecon: Real-time coherent 3d reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15598–15607, June 2021. 3
[53] Oleg Voynov, Alexey Artemov, Vage Egiazarian, Alexander Notchenko, Gleb Bobrovskikh, Evgeny Burnaev, and Denis Zorin. Perceptual deep depth super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5653–5663, 2019. 3
[54] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. Patchmatchnet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 3
[55] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021. 3, 6
[56] Oliver Wasenmüller, Marcel Meyer, and Didier Stricker. CoRBS: Comprehensive rgb-d benchmark for slam using kinect v2. In IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, March 2016. 3
[57] Silvan Weder, Johannes Schonberger, Marc Pollefeys, and Martin R. Oswald. RoutedFusion: Learning Real-Time Depth Map Fusion. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4886–4896, Seattle, WA, USA, June 2020. IEEE. 3, 6
[58] Silvan Weder, Johannes L. Schonberger, Marc Pollefeys, and Martin R. Oswald. Neuralfusion: Online depth fusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3162–3172, June 2021. 3
[59] Zizhuang Wei, Qingtian Zhu, Chen Min, Yisong Chen, and Guoping Wang. Aa-rmvsnet: Adaptive aggregation recurrent multi-view stereo network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6187–6196, 2021. 3
[60] Thomas Whelan, Renato F Salas-Moreno, Ben Glocker, Andrew J Davison, and Stefan Leutenegger. ElasticFusion: Real-time dense SLAM and light source estimation. The International Journal of Robotics Research, 35(14):1697–1716, Dec. 2016. 3
[61] Zhirong Wu, Shuran Song, Aditya Khosla, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shape modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, USA, June 2015. 3, 7
[62] Chuhua Xian, Kun Qian, Zitian Zhang, and Charlie C. L. Wang. Multi-Scale Progressive Fusion Learning for Depth Map Super-Resolution. arXiv e-prints, page arXiv:2011.11865, Nov. 2020. 3
[63] Qingshan Xu and Wenbing Tao. Multi-scale geometric consistency guided multi-view stereo. Computer Vision and Pattern Recognition (CVPR), 2019. 3
[64] Qingshan Xu and Wenbing Tao. Planar prior assisted patchmatch multi-view stereo. AAAI Conference on Artificial Intelligence (AAAI), 2020. 3, 6
[65] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV), 2018. 3
[66] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020. 2, 3
[67] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020. 3
[68] Igor Zacharov, Rinat Arslanov, Maksim Gunin, Daniil Stefonishin, Andrey Bykov, Sergey Pavlov, Oleg Panarin, Anton Maliutin, Sergey Rykovanov, and Maxim Fedorov. “Zhores”-petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology. Open Engineering, 9(1):512–520, 2019. 8
[69] Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, and Tian Fang. Visibility-aware multi-view stereo network. British Machine Vision Conference (BMVC), 2020. 3, 6
[70] Yinda Zhang and Thomas Funkhouser. Deep Depth Completion of a Single RGB-D Image. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 175–185, Salt Lake City, UT, USA, June 2018. IEEE. 3