Multi-sensor large-scale dataset for multi-view 3D reconstruction

Oleg Voynov1, Gleb Bobrovskikh1, Pavel Karpyshev1, Andrei-Timotei Ardelean1, Arseniy Bozhenko1, Saveliy Galochkin1, Ekaterina Karmanova1, Pavel Kopanev1, Yaroslav Labutin-Rymsho2, Ruslan Rakhimov1, Aleksandr Safin1, Valerii Serpiva1, Alexey Artemov1, Evgeny Burnaev1, Dzmitry Tsetserukou1, Denis Zorin3

1 Skolkovo Institute of Science and Technology, 2 Huawei Research Moscow, 3 New York University

{oleg.voinov, g.bobrovskih, pavel.karpyshev, timotei.ardelean}@skoltech.ru, {a.bozhenko, saveliy.galochkin, e.karmanova, pavel.kopanev}@skoltech.ru, labutin.rymsho.yaroslav@huawei.com, {ruslan.rakhimov, aleksandr.safin}@skoltech.ru, {v.serpiva, a.artemov, e.burnaev, d.tsetserukou}@skoltech.ru, dzorin@cs.nyu.edu

arXiv:2203.06111v1 [cs.CV] 11 Mar 2022

Figure 1. A representative set of objects from our dataset. We focus on challenging cases for depth sensors or 3D reconstruction algorithms. Evaluation of common 3D reconstruction methods using our dataset demonstrates its potential value (lower row).

Abstract

We present a new multi-sensor dataset for 3D surface reconstruction. It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and structured-light scanner. The data for each scene is obtained under a large number of lighting conditions, and the scenes are selected to emphasize a diverse set of material properties challenging for existing algorithms. In the acquisition process, we aimed to maximize high-resolution depth data quality for challenging cases, to provide reliable ground truth for learning algorithms. Overall, we provide over 1.4 million images of 110 different scenes acquired at 14 lighting conditions from 100 viewing directions. We expect our dataset will be useful for evaluation and training of 3D reconstruction algorithms of different types and for other related tasks. Our dataset and accompanying software will be available online at adase.group/3ddl/projects/sk3d.

1. Introduction

Reconstruction of 3D geometry of physical objects and scenes is an important task for a broad range of applications. Sensor data used in 3D reconstruction range from highly specialized and expensive CT, laser, and structured-light scanners to video from commodity cameras and depth sensors; computational 3D reconstruction methods are typically tailored to a particular type of data. Yet, even commodity hardware increasingly provides multi-sensor data: for example, many recent phones have multiple RGB cameras as well as lower resolution depth sensors. Using data from different sensors, RGB-D data in particular, has the potential to considerably improve the quality of 3D reconstruction. For example, multi-view stereo (MVS) algorithms produce high-quality 3D geometry from RGB data, but may miss featureless surfaces; supplementing RGB images with depth sensor data makes it possible to have more complete reconstructions. Conversely, commodity depth sensors often lack resolution provided by RGB cameras.

Combining multi-view RGB and depth data in a single algorithm is challenging; fortunately, recent learning-based techniques substantially simplify this task.
For single-modality data, learning-based algorithms also have the promise of being more robust to variations in reflection properties and lighting conditions. However, learning methods require suitable datasets for training. A number of excellent datasets were developed over time, with new datasets introduced in parallel with advances in sensors, as well as to provide more varied or challenging data (e.g., indoor and outdoor scenes, challenging surface properties, varying lighting conditions) or data suitable for learning. Our dataset aims to complement existing ones in all of these ways, as discussed in more detail in Sections 2 and 3.

The structure of our dataset is expected to benefit research on 3D reconstruction in several ways.
• Multi-sensor data. We provide aligned data from seven different devices, including low-resolution depth data from commodity sensors, high-resolution geometry data from a structured-light scanner, and RGB data at different resolutions and from different cameras. This enables supervised learning for reconstruction methods relying on different combinations of sensor data, in particular the increasingly common combination of high-resolution RGB with low-resolution depth data. In addition, multi-sensor data simplifies comparison of methods relying on different types of sensors (RGB, depth, and RGB-D).
• Lighting and pose variability. We aimed to make the dataset large enough (1.44 M images of different modalities in total) to enable training machine learning algorithms, and to provide the systematic variability in camera poses (100 per object), lighting (14 lighting setups), and reflection properties that these algorithms need.
• Object selection. Among the 110 objects in our dataset, we include primarily objects that may present challenges to the existing algorithms mentioned above (see examples in Figure 1); we made a special effort to improve the quality of high-resolution structured-light 3D data for these objects.

Our focus is on RGB and depth data for individual objects similar to [24], in a laboratory setting, rather than on complex scenes with natural lighting (e.g., [27, 43]). This provides means for systematic exploration and isolation of different factors contributing to strengths and weaknesses of different algorithms, and complements the more holistic evaluation and training data provided by datasets with complex scenes.

While our evaluation is restricted to multi-view 3D reconstruction methods, the dataset can be used for testing and training in several other related tasks (Section 3).

Table 1. Comparison of our dataset to the most widely used related datasets. U indicates uncontrolled lighting; frames are counted per sensor, i.e., all data from an RGB-D sensor are counted as a single frame. The number of separate images acquired may be considerably larger (1.4 M for our dataset). All scenes, from both training and testing sets, were counted.

Dataset | Sensor types | RGB, MPix | Depth, MPix | Hi-res. geom. | Poses/scene | Lighting | # Scenes | # Frames
DTU [24] | RGB (2) | 2 | — | ✓ | 49/64 | 8 | 80 | 27K
 | SLS | — | — | | | | |
ETH3D [43] | RGB | 24 | — | ✓ | 10–70 | U | 24 | 11K
 | TLS | — | — | | | | |
TnT [27] | RGB | 8 | — | ✓ | 150–300 | U | 21 | 148K
 | TLS | — | — | | | | |
BlendedMVG [66] | unknown | 3/0.4 | — | — | 20–1000 | U | 502 | 110K
BigBIRD [47] | RGB (5) | 12 | — | — | 600 | 1 | 120 | 144K
 | RGB-D (5) | 1.2 | 0.3 | | | | |
ScanNet [11] | RGB-D | 1.3 | 0.3 | — | NA | U | 1513 | 2.5M
Ours | RGB (2) | 5 | — | ✓ | 100 | 14 | 110 | 913K
 | RGB-D 1 (2) | 40 | 0.04 | | | | |
 | RGB-D 2 | 2 | 0.2 | | | | |
 | RGB-D 3 | 2 | 0.9 | | | | |
 | SLS | — | — | | | | |

2. Related work on datasets

Many datasets for tasks related to 3D reconstruction were developed (see, for example, [33] for a survey of datasets related to simultaneous localization and mapping (SLAM), including a survey of 3D reconstruction datasets); we only discuss datasets most closely related to ours. We review how these datasets are used to evaluate and train methods for a range of computer vision tasks in Section 3.

Sensors. For multi-view stereo (MVS) datasets, high-resolution RGB, either photo [1, 43, 66] or video [27], is standard; in many cases, a structured-light scanner (SLS) [1] or a terrestrial laser scanner (TLS) [27, 43] is used to obtain high-resolution 3D ground truth. Datasets designed for tasks like SLAM, object classification, and segmentation often include low-resolution depth data acquired using devices like Microsoft Kinect or Intel RealSense [11, 38, 47, 48], but do not include high-resolution depth data. Our dataset aims to improve on the existing ones and enable new 3D reconstruction tasks by providing aligned image and depth data from multiple sensors, including low- and high-resolution depth.

A recently proposed dataset, RGB-D-D [21], makes a step in a similar direction to ours, pairing data from a 240 × 180 phone depth camera with a medium-resolution 640 × 480 Lucid Helios time-of-flight (ToF) depth camera. Our collection provides depth data at three levels of accuracy, including, in addition to similar depth sensors, high-resolution data from a structured-light scanner.

We should also mention synthetic benchmarks such as [3, 20, 31]. SyB3R [31] can generate large training sets easily using a generator simulating an actual acquisition process. However, real data are required to model sensors faithfully, train generators, and test trained algorithms.

Scene choice, lighting and poses. Datasets focusing on individual objects and controlled lighting include Middlebury [44], TUM [9], and the most widely used DTU MVS dataset [1, 24]. Our dataset is significantly larger than DTU on a number of parameters, as we show in Table 1.
Most of the other MVS datasets, while containing some images of isolated objects, focus on complete scenes, often collected with hand-held, freely positioned cameras [27, 43, 51]. The Redwood dataset [38] contains the largest number of objects (over 10 K with raw RGB-D data, including 398 with 3D reconstructions), but only low-resolution depth data. In robotics, several datasets were developed for SLAM: CoRBS [56] with high-resolution 3D data, but only 4 scenes; BigBIRD [47] with 120 objects, but only with low-resolution depth and no lighting variation. Among datasets with high-resolution 3D scanner data, we provide the largest number of objects, the largest number of lighting conditions, and the most challenging objects, as we show in Tables 1 and 2.

Table 2. Statistics of scenes with different surface properties across datasets. Compared to existing datasets, ours includes a larger number of scenes with challenging surface properties.

Dataset | Featureless | Glossy | Reflective | Transparent | Relief | Periodic
BlendedMVG [66], buildings | 46 | 6 | 5 | 1 | 47 | 3
BlendedMVG [66], other objects | 99 | 4 | 7 | 4 | 2 | 1
ETH3D [43] | 19 | 9 | 8 | 1 | 1 | 3
TnT [27] | 7 | 11 | 9 | 5 | 9 | 4
DTU [24] | 52 | 10 | 17 | 15 | 21 | 1
Ours | 52 | 58 | 45 | 21 | 17 | 7

3. Motivating tasks

Most datasets closely related to ours are designed to support the development and testing of RGB image-based multi-view stereo (MVS) and structure-from-motion (SfM, SLAM) algorithms. We expand the range of supported algorithms to include reconstruction algorithms relying on combining RGB and depth data of different resolutions; our dataset can also be used for multiple additional tasks described below. At the same time, we do not aim to directly support a number of common tasks, such as camera pose estimation, object classification or segmentation, or SLAM.

RGB image-to-geometry reconstruction. In this task, high-resolution RGB images are used as input to produce either depth maps or complete 3D geometry descriptions. Standard MVS methods compute depth maps, which are fused into point clouds. This category includes PatchMatch-based [2] algorithms [40, 63, 64], recent learning-based approaches [6, 19, 29, 30, 34, 54, 59, 65, 69], and hybrid methods [15, 29]. These methods are predominantly tested and trained on DTU [24] and, to a lesser extent, on ETH3D [43], Tanks and Temples (TnT) [27], BlendedMVS (BlendedMVG) [66], and other datasets.

Several recent methods [36, 37, 55, 67] reconstruct a compact implicit surface representation encoded by a neural network from a set of RGB images. All these methods use the DTU dataset for evaluation, with some also using BlendedMVG and other datasets. Several subsets of our dataset can be used instead of or together with DTU and similar datasets. For methods of this type, we contribute higher-quality ground truth for objects with challenging surface properties, a larger set of such objects, and more lighting variation.

Another class of learning-based methods directly reconstructs a voxelized representation of an implicit surface [35, 52]. These methods are trained and tested using the ScanNet [11] and 7-Scenes [18] datasets with relatively low resolution and high noise in the depth data, and are likely to benefit from including our high-resolution depth data in training.

Depth fusion and RGB-D reconstruction. Starting with the classical method [10], many techniques aim to fuse depth maps into a truncated signed distance function (TSDF). Learning-based methods were proposed for this type of task [57, 58]. The challenge of evaluating or training these methods using the existing datasets is the absence of high-resolution depth sensor data to be used as ground truth. For this reason, synthetic datasets [5, 61] are used for training, or a single-resolution depth is subsampled to obtain depth maps to serve as inputs.

Several methods use data from RGB-D sensors, like Microsoft Kinect [13, 42, 60], producing voxel-based TSDF or surfel clouds as output. The synthetic ICL-NUIM SLAM benchmark [20] is the most commonly used for comparing these methods. Learning-based algorithms for this task have started appearing [22]; these are, so far, trained on synthetic data like [5]. One exception is [12] and [14], trained on real data from Matterport3D [4] and SUNCG [49] with inputs obtained by subsampling. Our dataset contains inputs from depth sensors of low and high resolutions along with associated registered higher-resolution RGB images, providing a framework for evaluating and training both depth fusion and RGB-D fusion algorithms, as well as developing new ones.

Depth super-resolution and completion. Improving the depth maps obtained either by MVS or direct sensor measurement is often considered a separate task in the reconstruction pipeline, or may be of direct value in applications. Due to the absence of datasets with sensor data at different resolutions, depth map super-resolution methods [23, 50, 53, 62] rely on artificial pairs of low- and high-resolution depth maps constructed by subsampling. Depth map completion work [25, 70] suffers from similar issues. Our dataset provides real-world pairs for supervised learning in this context.

View synthesis and relighting. A range of methods [7, 17, 39] synthesize novel views of objects from collections of existing images. The large number of poses and lighting setups in our dataset facilitates learning in these tasks, and the multi-view depth data can be used in related tasks such as synthesis of depth maps for novel views or pixelwise visibility estimation.

Sensor modeling and inter-sensor generalization. As datasets of the type we describe are difficult to collect, there
is considerable interest in (pre-)training on synthetic data. The success of this approach depends on how faithfully such data reproduces sensor behavior [28, 31], which is often difficult to model. Our dataset includes data from multiple commodity sensors, on which image and depth data synthesis algorithms can be trained to reproduce the behavior of specific sensors. At the same time, our dataset supports testing sensor-to-sensor generalization of learning-based methods by offering aligned images for multiple sensors.

4. Dataset

4.1. Overview

Our dataset consists of 110 scenes with a single everyday object or a small group of objects on a black background; see examples in Figure 1 and all scenes in the supplementary material. For the collection of the dataset we used a set of sensors mounted on a Universal Robots UR10 robotic arm with 6 degrees of freedom and sub-millimeter position repeatability. We used the following sensors, shown in Figure 2 on the left:
• RangeVision Spectrum structured-light scanner (SLS),
• two The Imaging Source DFK 33UX250 industrial RGB cameras,
• two Huawei Mate 30 Pro phones with ToF sensors,
• Intel RealSense D435 active stereo RGB-D camera,
• Microsoft Kinect v2 ToF RGB-D camera.

Figure 2. Our acquisition setup (view in zoom). We included a diverse set of seven commonly used RGB and RGB-D sensors, mounting them on a shared metal rig to aid data alignment. We constructed a metal frame surrounding the scanning area and installed various light sources to provide 14 lighting setups.

We surrounded the scanning area with a metal frame to which we attached the light sources: seven directional lights, three diffuse soft-boxes, and LED strips, shown in Figure 2 on the right. We also used the flashlights of the phones.

For each scene, we moved the camera rig through 100 fixed positions on a sphere with a radius of 70 cm and collected the data using 14 lighting setups. For each device, except the SLS, we collected raw RGB, depth, and infrared (IR) images, including both left and right IR for RealSense. In total, we collected 15 raw images per scene, camera position, and lighting setup: 6 RGB, 5 IR, and 4 depth images, as illustrated in Table 3 and Figure 3. As the data from the ToF sensors of the phones and Kinect is unaffected by the lighting conditions, we captured this data once per camera position. For the SLS, we collected partial scans from 27 positions.

Table 3. Composition of our dataset. We provide RGB, depth, and IR images, intrinsic (Intr.) and extrinsic (Extr.) calibration parameters, and a reference mesh reconstruction (Rec.). The data marked with * is captured per lighting setup.

Device | # | RGB | Depth | IR | Intr. | Extr. | Rec.
DFK 33UX250 | 2 | ✓* | — | — | ✓ | ✓ | —
Mate 30 Pro | 2 | ✓* | ✓ | ✓ | ✓ | ✓ | —
RealSense D435 | 1 | ✓* | ✓* | ✓✓* | ✓ | ✓ | —
Kinect v2 | 1 | ✓* | ✓ | ✓ | ✓ | ✓ | —
Spectrum | 1 | — | ✓ | — | — | — | ✓

Figure 3. Images captured from all sensors and the reference mesh reconstruction.

In addition to the raw images and partial structured-light scans, we include in our dataset the intrinsic parameters of the cameras and their positions, RGB and depth images with lens distortion removed, cleaned-up structured-light scans, and meshes reconstructed from the complete set of scans.
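To make this structure concrete, the sketch below enumerates the image records implied by the numbers above (100 positions, 14 lighting setups, 6 RGB + 5 IR + 4 depth images per shot, with ToF data captured once per position). The sensor identifiers and the record layout are our own illustration, not the naming used in the released dataset.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

# Illustrative sensor lists consistent with Table 3; the actual identifiers
# in the released dataset may differ.
RGB_SENSORS = ["dfk_left", "dfk_right", "phone_left", "phone_right",
               "realsense", "kinect"]                          # 6 RGB images
IR_PER_LIGHTING = ["realsense_left", "realsense_right"]        # 2 IR images, lighting-dependent
IR_PER_POSITION = ["phone_left", "phone_right", "kinect"]      # 3 IR images, ToF
DEPTH_PER_LIGHTING = ["realsense"]                             # active stereo depth
DEPTH_PER_POSITION = ["phone_left", "phone_right", "kinect"]   # ToF depth
N_POSITIONS, N_LIGHTINGS = 100, 14

@dataclass(frozen=True)
class Record:
    scene: str
    position: int             # 0..99, rig position on the 70 cm sphere
    lighting: Optional[int]   # None for data captured once per position
    sensor: str
    modality: str             # "rgb", "ir", or "depth"

def enumerate_records(scene: str):
    """Yield one record per stored image of a scene (SLS scans excluded)."""
    for pos, light in product(range(N_POSITIONS), range(N_LIGHTINGS)):
        for s in RGB_SENSORS:
            yield Record(scene, pos, light, s, "rgb")
        for s in IR_PER_LIGHTING:
            yield Record(scene, pos, light, s, "ir")
        for s in DEPTH_PER_LIGHTING:
            yield Record(scene, pos, light, s, "depth")
    # ToF data is unaffected by lighting and is stored once per position.
    for pos in range(N_POSITIONS):
        for s in IR_PER_POSITION:
            yield Record(scene, pos, None, s, "ir")
        for s in DEPTH_PER_POSITION:
            yield Record(scene, pos, None, s, "depth")
```

Under these assumptions a scene yields 100 × 14 × 9 + 100 × 6 = 13,200 images, and 110 scenes give roughly 1.45 million images, consistent with the 1.4 M total quoted in Section 1.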
4.2. Design decisions

The design of our dataset was primarily determined by the requirements of methods for high-quality 3D reconstruction based on large collections of sensor data (in contrast to tasks such as real-time or monocular reconstruction).

Choice of sensors. We aimed to include a variety of RGB and depth sensors commonly used in practice, and high-resolution sensors that can be used to generate high-quality reference data. Smartphones with a depth sensor are increasingly widely available but have the lowest depth resolution and accuracy; Kinect is another consumer-level ToF sensor with better accuracy; RealSense active stereo devices are widely used in robotics; finally, a structured-light scanner has the highest depth resolution and accuracy and is typically used as the source of ground truth for evaluation of 3D reconstruction quality. All these devices except the SLS are also sources of consumer-level RGB data with different fields of view and resolutions. We supplemented them with industrial RGB cameras with high-quality optics and low-noise sensors.

Laboratory setting and lighting. We chose to focus on a setup with controlled (but variable) lighting and a fixed set of camera positions for all scenes. While a more complex type of data with natural light and trajectory, as in the ETH3D or TnT datasets, is excellent for stress-testing algorithms, identifying specific sources of weaknesses of a particular approach is likely to be easier if lighting and camera positions vary in a controlled way, consistently across the whole dataset. Furthermore, the laboratory setting considerably simplified collecting well-aligned multi-sensor data.

We aimed to provide a broad range of realistic lighting conditions, illustrated in Figure 5: directional light sources and the flashlights of the phones provide eight samples of "hard" light, typical, for example, of streetlight; soft-boxes provide three samples of diffuse light, typical of indoor illumination; LED strips imitate ambient light, typical of cloudy weather. To minimize the level of light reflected from objects outside the scene we used a black cloth as the background.

Figure 5. Four types of lighting in our dataset. Lighting can affect the depth captured with the active stereo camera of RealSense.

Scene selection. In our choice of specific objects, we aimed to ensure that, on the one hand, a variety of material and geometric properties are represented, and, on the other hand, there are enough samples of objects with material properties of the same type. While a few scenes in our dataset contain multiple objects, we did not aim to present objects in a "natural" cluttered environment.

Preparation of objects for 3D scanning. Our goal was to include objects with surface reflection properties that challenge common sensors and existing algorithms. However, these objects often challenge structured-light scanning too, making it hard to obtain reliable high-resolution depth suitable for use as the ground truth. To get the highest-quality structured-light scans we applied a temporary coating to the scanned objects (see Figure 4), as we describe in Section 4.3.

Figure 4. A partial scan without (left) and with coating (right).

4.3. Data acquisition

We outline the most important aspects of our acquisition process and data postprocessing here and provide additional details in the supplementary material.

Camera calibration, i.e., reducing the data from different sensors to a single coordinate system, is essential for using this data jointly in computer vision tasks. We used the calibration pipeline of [41] for generic camera models, which allow for more accurate calibration than the parametric camera models used most commonly, including in DTU and ETH3D. The original implementation of this pipeline supports calibration of rigid camera rigs; however, its straightforward application to our camera rig as a single rigid whole was numerically unstable due to the properties of our setup. Firstly, the included sensors have a large variation in field of view (30–90°) and resolution (0.04–40 MPix). Secondly, the focus of the phone cameras, while fixed programmatically, fluctuates slightly over time, which we relate to thermal deformations of the device (see [16] for a study of such an effect). Finally, the camera rig deforms slightly depending on its tilt in different scanning positions. To avoid the loss of accuracy we split the calibration procedure into several steps.

First, we obtained intrinsic camera models for each sensor independently. Then, for the SLS and all RGB sensors except those of the phones, i.e., the sensors with a relatively high resolution and stable focus, we estimated the pose within the rig for its "neutral" tilt, as if the rig were rigid. Next, we estimated the poses of the RGB sensors of the phones within the rig, keeping the estimated poses of the other sensors fixed. After that, we estimated the pose of each RGB sensor and the SLS for different scanning positions of the robotic arm, individually for each sensor, and then transformed all the poses into the same space "through" a position with a neutral rig tilt. Finally, we estimated the pose of each depth/IR sensor assuming it is attached rigidly to its RGB companion.

This whole procedure required capturing thousands of images of the calibration pattern, which we made almost fully automated with the use of the robotic arm, except for several manual reorientations of the pattern. As a result, for all RGB and depth sensors and the SLS we obtained central generic camera models with 2.3–4.7 K parameters, depending on the sensor resolution, and the poses of the sensors in the global space for each scanning position of the robotic arm. The mean calibration error for different sensors at different steps of the procedure was in the range 0.04–0.4 px, or approximately 0.024–0.15 mm. We explain these measurements in the supplementary material.
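As an illustration of the last two steps, propagating a sensor's rig-relative pose, estimated once at the neutral tilt, to an arbitrary scanning position is a composition of rigid 4x4 transforms. The sketch below shows only this algebra, with our own naming conventions; it is not the calibration code used for the dataset.

```python
import numpy as np

def inv(T):
    """Invert a rigid 4x4 camera-to-world transform."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def propagate_pose(T_world_ref, T_world_ref_neutral, T_world_sensor_neutral):
    """
    T_world_ref:            pose of a reference sensor (e.g., an industrial
                            camera) at some scanning position.
    T_world_ref_neutral:    pose of the same reference sensor at the neutral
                            rig tilt.
    T_world_sensor_neutral: pose of another sensor at the neutral rig tilt.
    Returns the pose of the other sensor at the scanning position, assuming
    the rig is rigid between the two sensors.
    """
    # Fixed sensor-to-reference transform, estimated once at the neutral tilt.
    T_ref_sensor = inv(T_world_ref_neutral) @ T_world_sensor_neutral
    return T_world_ref @ T_ref_sensor
```

The same composition, applied with a depth or IR sensor rigidly attached to its RGB companion, gives the per-position poses used in the final calibration step.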
The data acquisition procedure for each scene consisted of the following steps:
1. Object placement ensured that a larger area of the object with features of interest was visible to the sensors.
2. Sensor adjustment set optimal values of camera exposure and gain, and laser power for RealSense.
3. For structured-light scanning, the object was covered with a vanishing opaque matte coating and scanned from 27 positions. After the coating vanished, an additional 5 scans were made and later registered to the scans of the coated object to verify that it was not deformed.
4. Finally, RGB and low-resolution depth sensor data was acquired.

Since we observed slight variations of the intrinsic camera parameters depending on the temperature, we warmed up all devices at the beginning of each day by scanning empty space until the parameters were stable (about 1 hour), and then kept the devices warm by constantly scanning.

Sensor exposure/gain adjustment was a critical step for obtaining data useful in computer vision tasks. Using universal settings for all scenes would lead to low image quality caused by over- or underexposure due to variations of object surface properties. Adjusting the settings automatically for each scene using the hardware auto-exposure would also be suboptimal, since the black background, occupying a substantial part of the image, would cause overexposure of the object. To collect data of high quality we employed a custom auto-exposure algorithm that was inspired by [45, 46] and worked reasonably well.

For each sensor, we extracted the foreground mask of the full scene from the images with and without the object using the method of [32], and then, for each lighting, we obtained the minimal-noise setting by setting the gain to the minimum and maximizing the Shannon entropy of the foreground image w.r.t. the exposure value. Additionally, we fixed the optimal exposure and optimized the entropy w.r.t. the gain value to prevent underexposure of very dark objects (typically, the gain stayed close to the minimum). For the flash and ambient lighting, we also obtained the real-time / high-noise setting by setting the exposure to 30 FPS and optimizing the gain, and then fixing the gain and optimizing the exposure.

We did the exposure/gain adjustment at one fixed position of the rig per scene, for each RGB sensor except Kinect, for which these controls are not available, and for the RealSense IR sensor. Each optimization loop required 10–50 iterations. For RealSense we also picked the optimal power of the IR projector, by capturing depth at all 12 settings and picking the one with the lowest number of pixels with missing values.
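The entropy criterion behind this search can be sketched as follows. Here `capture` stands for a hypothetical callback around the camera API, and a simple grid search replaces the iterative optimization loop described above.

```python
import numpy as np

def foreground_entropy(image_gray, mask):
    """Shannon entropy (bits) of the 8-bit intensity histogram inside the mask."""
    hist, _ = np.histogram(image_gray[mask], bins=256, range=(0, 255))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def pick_exposure(capture, mask, exposures):
    """Pick the exposure maximizing foreground entropy with the gain fixed
    to its minimum; capture(exposure) returns a grayscale uint8 image."""
    scores = [foreground_entropy(capture(e), mask) for e in exposures]
    return exposures[int(np.argmax(scores))]
```

The same routine can then be reused with the exposure fixed and the gain varied, mirroring the second optimization pass described above.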
Reduction of sensor cross-talk. We aimed, whenever possible, to minimize the effects of the sensors on each other. For example, the IR projector of Microsoft Kinect v2 cannot be turned off and affects other depth sensors, so we added an external shutter to close the projector while the other sensors are imaging. Similarly, the time-of-flight sensors of the phones affect each other, so we stopped the camera application on one of the phones during depth capture with the other one.

Data post-processing. To simplify the use of the structured-light data for evaluation and training, we obtained clean partial scans and mesh reconstructions from the raw scans. For this, we first globally aligned the raw partial scans using the method of [8] initialized with the poses obtained during calibration. For each point in each partial scan we calculated the distance to the closest point in all other scans and manually rejected the scenes with bad alignment based on the statistics of this value. Then, from the aligned scans we reconstructed the surface using Screened Poisson reconstruction [26] with the cell width set to 0.3 mm, which is a conservative estimate of the accuracy of our SLS. After that, we manually removed the parts of this surface corresponding to outlier scanning artefacts, and automatically removed the fill artefacts of the reconstruction by only keeping the vertices within 0.3 mm of the raw scans. Finally, we filtered out artefacts in the partial scans by only keeping the points within 0.3 mm of the clean reconstructed surface. In our experiments, we used the clean partial scans as the reference data for evaluation, and the clean reconstructed surface for ray-tracing the depth maps needed for training.

For the temporary coating of the scanned objects we used Aesub Blue scanning spray. It sublimates from the surface at room temperature in a few hours, which we reduced to 5–15 minutes by slightly heating the object with a heat gun. To check whether this led to a deformation of the object, we calculated the distance from the scans made without the coating to the full scan made with the coating. We then manually rejected the scenes with observable deformation, which included a few objects made of soft plastic, or cut out the deformed parts from the scans, such as the power cords of electronic devices.
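A minimal Open3D sketch of the two distance-based filtering stages is shown below; the global alignment with [8] and the Screened Poisson reconstruction [26] are assumed to have been run already, and the 0.3 mm threshold assumes coordinates stored in meters.

```python
import numpy as np
import open3d as o3d

EPS = 0.0003  # 0.3 mm, assuming the scans are stored in meters

def filter_reconstruction(raw_scans, mesh):
    """Remove Poisson fill-in far from the raw scans, then drop scan points
    far from the cleaned surface.

    raw_scans: list of aligned o3d.geometry.PointCloud partial scans.
    mesh:      o3d.geometry.TriangleMesh from Screened Poisson reconstruction.
    """
    merged = o3d.geometry.PointCloud()
    for scan in raw_scans:
        merged += scan

    # Keep only mesh vertices supported by the raw scans.
    vertices = o3d.geometry.PointCloud(mesh.vertices)
    d_vert = np.asarray(vertices.compute_point_cloud_distance(merged))
    mesh.remove_vertices_by_mask(d_vert > EPS)

    # Keep only scan points supported by the cleaned surface; the surface is
    # sampled roughly as densely as the merged scans for the distance query.
    surface = mesh.sample_points_uniformly(number_of_points=len(merged.points))
    clean_scans = []
    for scan in raw_scans:
        d = np.asarray(scan.compute_point_cloud_distance(surface))
        clean_scans.append(scan.select_by_index(np.flatnonzero(d <= EPS)))
    return mesh, clean_scans
```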
5. Experimental evaluation

5.1. Setup

To demonstrate possible applications of our dataset we used it for testing several methods of 3D reconstruction from multi-view RGB and depth data, and also used it for training one RGB-to-3D reconstruction method and one depth-to-3D reconstruction method. Additionally, we applied an RGB-D reconstruction method to our data.

Methods. ACMP [64] is a PatchMatch-based non-learnable multi-view stereo (MVS) method with strong performance on benchmarks such as Tanks and Temples (TnT). VisMVSNet [69] is a learning-based MVS method based on the plane-sweeping approach, one of the best-performing learnable methods on the TnT benchmark with a publicly available implementation. NeuS [55] is a recent rendering-based method producing a neural representation of a TSDF directly from RGB images. RoutedFusion [57] is a state-of-the-art depth fusion method that performs online TSDF reconstruction using two learnable networks: a "routing" network and a fusion network. SurfelMeshing [42] is an online RGB-D reconstruction method that produces a triangle mesh and uses a surfel cloud as the intermediate representation.

Data and training. For our experiments, we selected 24 testing scenes representing different surface types present in our dataset, and used the remaining 86 scenes for training. As the input for ACMP, VisMVSNet, and NeuS we used the undistorted images from the right industrial camera. For RoutedFusion we used the depth maps from RealSense. For SurfelMeshing we combined the undistorted RGB images from the right industrial camera with the depth maps from RealSense aligned to these RGB images. We used the RGB images from the industrial camera instead of the RealSense since the former captures images of higher quality. In all experiments, we used the intrinsic camera parameters and camera positions obtained during calibration.

To test the learning-based methods, VisMVSNet and RoutedFusion, we trained two versions of each method. For VisMVSNet, we trained the first version on the BlendedMVG dataset (an extended version of BlendedMVS), and the second version on our dataset, using as targets the depth maps obtained from the surface reconstructed from the SL scans via ray-tracing. For RoutedFusion, we trained the first version on the ModelNet dataset [61], and the second version on our dataset, using, as targets for the routing network, the depth maps obtained from the SL surface, and, as targets for the fusion network, TSDF volumes obtained from the SL surface via the authors' processing pipeline. In all cases, we followed the original authors' training regime.
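For reference, rendering such target depth maps from the reconstructed surface can be done with an off-the-shelf ray caster. The sketch below uses Open3D's RaycastingScene and our own argument conventions (a 3x3 intrinsic matrix and a 4x4 world-to-camera extrinsic as NumPy arrays); it is not the exact tooling used for the dataset.

```python
import numpy as np
import open3d as o3d

def render_depth(mesh_legacy, K, T_world_to_cam, width, height):
    """Ray-cast a reference mesh into a camera to obtain a z-depth image.

    Returns a float32 (height, width) depth map; pixels whose rays miss the
    mesh are set to 0.
    """
    scene = o3d.t.geometry.RaycastingScene()
    scene.add_triangles(o3d.t.geometry.TriangleMesh.from_legacy(mesh_legacy))
    rays = scene.create_rays_pinhole(
        intrinsic_matrix=o3d.core.Tensor(np.asarray(K, np.float64)),
        extrinsic_matrix=o3d.core.Tensor(np.asarray(T_world_to_cam, np.float64)),
        width_px=width, height_px=height)
    hit = scene.cast_rays(rays)
    t = hit["t_hit"].numpy()               # hit distance along each ray, inf on miss
    d = rays.numpy()[..., 3:]              # ray directions in world coordinates
    # Convert ray distance to camera z-depth: z = t * (R_wc @ d)_z, since the
    # ray origin is the camera center.
    z_dir = (T_world_to_cam[:3, :3] @ d.reshape(-1, 3).T)[2].reshape(t.shape)
    return np.where(np.isfinite(t), t * z_dir, 0.0).astype(np.float32)
```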
Quality measures. The five evaluated methods produce the reconstruction in different forms: the full pipeline of ACMP and VisMVSNet produces a point cloud; NeuS produces a neural TSDF with virtually infinite resolution; RoutedFusion produces a TSDF volume with a very limited resolution; and SurfelMeshing produces a triangle mesh. For quantitative evaluation of these methods on our dataset we used two different strategies: one for ACMP, VisMVSNet, and NeuS, and another one for RoutedFusion. We describe them below. For SurfelMeshing we did not perform any quantitative evaluation, as we explain in Section 5.2.

For quantitative evaluation of ACMP and VisMVSNet we compared the produced point cloud with the reference structured-light data. For NeuS, we extracted the mesh from the TSDF at a resolution of around 0.5 mm, sampled a point cloud from this mesh with a point density close to that of the SL scans, and measured the quality of this point cloud. Following common practice (e.g., [27, 43]), we used the threshold-based Precision (accuracy), Recall (completeness), and F-score quality measures. Precision is based on the per-point distance from the reconstructed to the reference data: if all reconstructed points are close to the reference, the result is accurate. Recall measures the opposite one-sided per-point distance from the reference to the reconstructed data: if for every reference point there is a close point in the reconstruction, the result is complete. To calculate Precision and Recall, we selected a threshold related to the resolution of the reference data, specifically 0.3 mm, and computed the percentage of points for which the distance is less than the threshold. We then calculated the F-score as the harmonic mean of these two numbers; both Precision and Recall are required to be high for the F-score to be high.

For a careful calculation of Precision and Recall for 3D point clouds, two problems have to be considered. First, the reference data is available only for a part of the 3D space, and for the other part it is unknown whether the space is free or occupied by the surface of the object. Second, both the reconstructed and the reference point clouds may have varying point densities, which causes uneven contribution of different parts of the surface to the value of the measure. We addressed these problems similarly to [43], with the main difference related to the different properties of our reference data: dense structured-light scans instead of sparser terrestrial laser scans. We describe the details of the calculation of the quality measures in the supplementary material.

For quantitative evaluation of RoutedFusion we calculated the intersection-over-union (IoU), reported as a percentage, between the produced TSDF volume and the TSDF volume calculated from the SL data. We opted for this quality measure instead of Precision, Recall, and F-score since the results of RoutedFusion were not comparable to the results of RGB-based reconstruction methods anyway, while the IoU score could be directly compared to the results reported by the authors of the method.
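A plain version of these measures, without the visibility masking and density reweighting mentioned above, can be written in a few lines. The 0.3 mm threshold again assumes coordinates in meters, and the occupancy convention used for the IoU (TSDF < 0) is our assumption rather than a detail given here.

```python
import numpy as np
from scipy.spatial import cKDTree

def precision_recall_fscore(reconstructed, reference, tau=0.0003):
    """Threshold-based measures for two (N, 3) point clouds, in percent."""
    d_rec = cKDTree(reference).query(reconstructed)[0]   # reconstruction -> reference
    d_ref = cKDTree(reconstructed).query(reference)[0]   # reference -> reconstruction
    precision = 100.0 * np.mean(d_rec < tau)
    recall = 100.0 * np.mean(d_ref < tau)
    f_score = 0.0 if precision + recall == 0 else \
        2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score

def tsdf_iou(tsdf_a, tsdf_b):
    """IoU (in percent) of the occupied regions of two aligned TSDF volumes,
    with occupancy taken as TSDF < 0."""
    occ_a, occ_b = tsdf_a < 0, tsdf_b < 0
    union = np.logical_or(occ_a, occ_b).sum()
    return 100.0 * np.logical_and(occ_a, occ_b).sum() / max(union, 1)
```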
5.2. Experimental results

In Table 4 we show the F-score per scene for ACMP and VisMVSNet for three lighting setups: ambient lighting with the real-time / high-noise setting, flash lighting with the minimal-noise setting, and one of the hard lights with the minimal-noise setting. In Table 6 we show the F-score for NeuS and the IoU for RoutedFusion for one lighting setup per scene. In Table 5 we show the results for all the methods averaged over the scenes featuring specific surface types, and for ACMP and VisMVSNet we additionally show the average F-score for the three lighting setups. We show additional quantitative and qualitative results in the supplementary material.

Table 4. F-score per scene for RGB-based methods for three lighting setups: ambient lighting AL, flash lighting FB, and hard lighting H4.

Scene | ACMP (AL / FB / H4) | VisMVSNet, BlendedMVG (AL / FB / H4) | VisMVSNet, Ours (AL / FB / H4)
amber vase | 34 / 35 / 30 | 35 / 29 / 28 | 38 / 50 / 54
blue sticky roller | 24 / 28 / 30 | 9 / 32 / 30 | 5 / 50 / 50
brown relief pot | 32 / 39 / 36 | 37 / 30 / 34 | 32 / 49 / 53
candlestick | 20 / 28 / 31 | 17 / 31 / 33 | 17 / 48 / 50
dragon | 57 / 58 / 64 | 63 / 61 / 67 | 45 / 46 / 54
dumbbells | 14 / 33 / 33 | 4 / 21 / 25 | 5 / 35 / 44
fencing mask | 37 / 43 / 36 | 33 / 40 / 35 | 38 / 51 / 49
green carved thing | 28 / 33 / 26 | 26 / 33 / 24 | 39 / 45 / 44
grey braided pot | 52 / 51 / 46 | 47 / 57 / 46 | 64 / 75 / 67
large candles box | 19 / 29 / 32 | 5 / 28 / 31 | 9 / 50 / 50
large coral backpack | 24 / 38 / 35 | 8 / 40 / 37 | 5 / 51 / 48
large white jug | 23 / 26 / 25 | 11 / 25 / 24 | 13 / 42 / 43
mittens | 18 / 32 / 26 | 7 / 27 / 20 | 3 / 37 / 39
orange cash register | 28 / 29 / 30 | 11 / 27 / 28 | 20 / 43 / 44
orange mini vacuum | 25 / 36 / 36 | 15 / 33 / 39 | 19 / 46 / 54
painted samovar | 28 / 38 / 39 | 21 / 40 / 40 | 23 / 54 / 54
skate | 47 / 59 / 59 | 39 / 67 / 63 | 17 / 49 / 57
steel grater | 29 / 31 / 34 | 21 / 31 / 32 | 39 / 48 / 50
white box | 10 / 25 / 28 | 6 / 23 / 28 | 3 / 32 / 35
white castle land | 21 / 22 / 23 | 1 / 23 / 25 | 5 / 42 / 45
white castle towers | 23 / 31 / 35 | 2 / 38 / 42 | 9 / 48 / 53
white ceramic elephant | 28 / 33 / 36 | 20 / 34 / 34 | 21 / 44 / 48
white christmas star | 21 / 37 / 30 | 8 / 40 / 29 | 4 / 49 / 44
white starry jug | 31 / 34 / 26 | 31 / 31 / 25 | 46 / 53 / 52

Table 5. Average reconstruction quality per surface type and on the whole test set: F-score for RGB-based methods and IoU for RoutedFusion. The last three columns show average F-scores for three selected lighting setups.

Method | Transparent | Featureless | Reflective | Periodic | Glossy | Relief | All | AL | FB | H4
ACMP | 34 | 30 | 30 | 39 | 33 | 30 | 33 | 28 | 35 | 35
VisMVSNet, BlendedMVG | 32 | 26 | 26 | 37 | 30 | 27 | 30 | 20 | 35 | 34
VisMVSNet, Ours | 41 | 34 | 39 | 42 | 43 | 37 | 39 | 22 | 47 | 49
NeuS | 18 | 16 | 14 | 17 | 13 | 15 | 16 | | |
RoutedFusion, ModelNet | 14 | 13 | 11 | 11 | 9 | 6 | 12 | | |
RoutedFusion, Ours | 17 | 17 | 13 | 18 | 24 | 21 | 18 | | |

Table 6. Reconstruction quality per scene: F-score for NeuS and IoU for RoutedFusion.

Scene | NeuS | RoutedFusion, ModelNet | RoutedFusion, Ours
amber vase | 15 | 16 | 12
blue sticky roller | 10 | 5 | 19
brown relief pot | 21 | 10 | 6
candlestick | 8 | 11 | 16
dragon | 43 | 19 | 20
dumbbells | 17 | 9 | 13
fencing mask | 11 | 7 | 27
green carved thing | fail | 11 | 7
grey braided pot | 4 | 11 | 30
large candles box | 17 | 13 | 12
large coral backpack | 19 | 22 | 31
large white jug | 23 | 16 | 16
mittens | 21 | 11 | 11
orange cash register | 8 | 5 | 9
orange mini vacuum | 20 | 8 | 23
painted samovar | 16 | 10 | 22
skate | 31 | 17 | 20
steel grater | 14 | 9 | 16
white box | 8 | 26 | 16
white castle land | 9 | 14 | 28
white castle towers | 16 | 8 | 27
white ceramic elephant | 5 | 7 | 15
white christmas star | 17 | 8 | 21
white starry jug | 10 | 18 | 10

Scene dependence. We observe that the performance of the methods strongly depends on the scene and lighting, with the F-score varying from as low as 2% to 57%, with ACMP demonstrating greater stability but a smaller range of F-scores. NeuS reconstruction performs worse than the MVS methods as the method tends to smooth out surfaces; however, it is less sensitive to the surface type. Some scenes, e.g., "dragon", are reconstructed well by all RGB-based algorithms, while most scenes with a Featureless surface are, not surprisingly, difficult for them. Based on the average scores per surface type in Table 5, we observe that the Featureless type is, by far, the most difficult one. We note, however, that many scenes in our dataset feature multiple different surface types at once, so average results per surface type may be ambiguous.

Light dependence. We demonstrate the dependence of the MVS results on lighting in the right part of Table 5. While flash and directional lighting do not differ much, ambient lighting is strikingly different, with many nearly total reconstruction failures, which we believe to be due to noisy data in low light. We observe that ACMP is much less sensitive compared to the learning-based methods. On the other hand, the learning-based methods substantially outperform ACMP for better lighting.

Reconstruction from RGB-D. Finally, we experimented with reconstruction from our RGB-D data using SurfelMeshing. However, the quality of the reconstruction was insufficient for a meaningful quantitative evaluation of this method, as we illustrate in Figure 6, which we hypothesize is due to the relatively small number of frames per scene.

Figure 6. The structured-light scan of "moon pillow" (left) and the SurfelMeshing reconstruction (right).

6. Conclusions

In this paper, we presented a new dataset for evaluation and training of 3D reconstruction algorithms. Compared to other available datasets, the distinguishing features of ours include a large number of sensors of different modalities and resolutions, depth sensors in particular, a selection of scenes presenting difficulties for many existing algorithms, and high-quality reference data for these objects. Our dataset can support training and evaluation of methods for many variations of 3D reconstruction tasks. We plan to expand our dataset in the future, following improved versions of the process we have developed. More significant extensions we are considering include randomized trajectories of the camera and capturing videos.

Acknowledgements. We acknowledge the use of computational resources of the Skoltech CDISE supercomputer Zhores [68] for obtaining the results presented in this paper. E. Burnaev, O. Voynov and A. Artemov were supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021).
References

[1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120(2):153–168, 2016.
[2] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 28(3), Aug. 2009.
[3] Matthew Berger, Joshua A Levine, Luis Gustavo Nonato, Gabriel Taubin, and Claudio T Silva. A benchmark for surface reconstruction. ACM Transactions on Graphics (TOG), 32(2):1–17, 2013.
[4] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV), 2017.
[5] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015.
[6] Jaesung Choe, Sunghoon Im, Francois Rameau, Minjun Kang, and In So Kweon. VolumeFusion: Deep depth fusion for 3D scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16086–16095, October 2021.
[7] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. Extreme view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7781–7790, 2019.
[8] Sungjoon Choi, Qian-Yi Zhou, and Vladlen Koltun. Robust reconstruction of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5556–5565, 2015.
[9] D. Cremers and K. Kolev. Multiview stereo and silhouette consistency via convex functionals over convex domains. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6):1161–1174, 2011.
[10] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '96), pages 303–312. ACM Press, 1996.
[11] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
[12] Angela Dai, Christian Diller, and Matthias Niessner. SG-NN: Sparse generative neural networks for self-supervised scene completion of RGB-D scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[13] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics, 36(3):1–18, July 2017.
[14] Angela Dai, Yawar Siddiqui, Justus Thies, Julien Valentin, and Matthias Niessner. SPSG: Self-supervised photometric scene generation from RGB-D scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1747–1756, June 2021.
[15] Simon Donne and Andreas Geiger. Learning non-volumetric depth fusion using successive reprojections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[16] Melanie Elias, Anette Eltner, Frank Liebold, and Hans-Gerd Maas. Assessing the influence of temperature changes on the geometric stability of smartphone and Raspberry Pi cameras. Sensors, 20(3):643, 2020.
[17] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2367–2376, 2019.
[18] Ben Glocker, Shahram Izadi, Jamie Shotton, and Antonio Criminisi. Real-time RGB-D camera relocalization. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 173–179. IEEE, 2013.
[19] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[20] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE Intl. Conf. on Robotics and Automation (ICRA), Hong Kong, China, May 2014.
[21] Lingzhi He, Hongguang Zhu, Feng Li, Huihui Bai, Runmin Cong, Chunjie Zhang, Chunyu Lin, Meiqin Liu, and Yao Zhao. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9229–9238, June 2021.
[22] Jiahui Huang, Shi-Sheng Huang, Haoxuan Song, and Shi-Min Hu. DI-Fusion: Online implicit 3D reconstruction with deep priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8932–8941, June 2021.
[23] Tak-Wai Hui, Chen Change Loy, and Xiaoou Tang. Depth map super-resolution by deep multi-scale guidance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 353–369, 2016.
[24] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413. IEEE, 2014.
[25] Junho Jeon and Seungyong Lee. Reconstruction-based pairwise depth dataset for depth image enhancement using CNN. In
Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[26] Michael Kazhdan and Hugues Hoppe. Screened Poisson surface reconstruction. ACM Transactions on Graphics (ToG), 32(3):1–13, 2013.
[27] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
[28] Sebastian Koch, Yurii Piadyk, Markus Worchel, Marc Alexa, Cláudio Silva, Denis Zorin, and Daniele Panozzo. Hardware design and accurate simulation for benchmarking of 3D reconstruction algorithms. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[29] Andreas Kuhn, Christian Sormann, Mattia Rossi, Oliver Erdler, and Friedrich Fraundorfer. DeepC-MVS: Deep confidence prediction for multi-view stereo reconstruction. In 2020 International Conference on 3D Vision (3DV), pages 404–413, 2020.
[30] Vincent Leroy, Jean-Sebastien Franco, and Edmond Boyer. Shape reconstruction using volume sweeping and learned photoconsistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[31] Andreas Ley, Ronny Hänsch, and Olaf Hellwich. SyB3R: A realistic synthetic benchmark for 3D reconstruction from images. In European Conference on Computer Vision, pages 236–251. Springer, 2016.
[32] Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8762–8771, 2021.
[33] Yuanzhi Liu, Yujia Fu, Fengdong Chen, Bart Goossens, Wei Tao, and Hui Zhao. Simultaneous localization and mapping related datasets: A comprehensive survey. arXiv preprint arXiv:2102.04036, 2021.
[34] Xinjun Ma, Yue Gong, Qirui Wang, Jingwei Huang, Lei Chen, and Fan Yu. EPP-MVSNet: Epipolar-assembling based depth prediction for multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5732–5740, October 2021.
[35] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3D scene reconstruction from posed images. In European Conference on Computer Vision (ECCV), 2020.
[36] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3504–3515, 2020.
[37] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. arXiv preprint arXiv:2104.10078, 2021.
[38] Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Colored point cloud registration revisited. In ICCV, 2017.
[39] Gernot Riegler and Vladlen Koltun. Free view synthesis. In European Conference on Computer Vision, pages 623–640. Springer, 2020.
[40] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[41] Thomas Schops, Viktor Larsson, Marc Pollefeys, and Torsten Sattler. Why having 10,000 parameters in your camera model is better than twelve. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535–2544, 2020.
[42] Thomas Schops, Torsten Sattler, and Marc Pollefeys. SurfelMeshing: Online surfel-based mesh reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2494–2507, Oct. 2020.
[43] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017.
[44] Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pages 519–528. IEEE, 2006.
[45] Inwook Shim, Tae-Hyun Oh, Joon-Young Lee, Jinwook Choi, Dong-Geol Choi, and In So Kweon. Gradient-based camera exposure control for outdoor mobile platforms. IEEE Transactions on Circuits and Systems for Video Technology, 29(6):1569–1583, 2018.
[46] Ukcheol Shin, Jinsun Park, Gyumin Shim, Francois Rameau, and In So Kweon. Camera exposure control for robust robot vision with noise-aware image quality assessment. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1165–1172. IEEE, 2019.
[47] Arjun Singh, James Sha, Karthik S Narayan, Tudor Achim, and Pieter Abbeel. BigBIRD: A large-scale 3D database of object instances. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 509–516. IEEE, 2014.
[48] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
[49] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[50] Xibin Song, Yuchao Dai, Dingfu Zhou, Liu Liu, Wei Li, Hongdong Li, and Ruigang Yang. Channel attention based iterative residual learning for depth map super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[51] Christoph Strecha, Wolfgang Von Hansen, Luc Van Gool, Pascal Fua, and Ulrich Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[52] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15598–15607, June 2021.
[53] Oleg Voynov, Alexey Artemov, Vage Egiazarian, Alexander Notchenko, Gleb Bobrovskikh, Evgeny Burnaev, and Denis Zorin. Perceptual deep depth super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5653–5663, 2019.
[54] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. PatchmatchNet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[55] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
[56] Oliver Wasenmüller, Marcel Meyer, and Didier Stricker. CoRBS: Comprehensive RGB-D benchmark for SLAM using Kinect v2. In IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, March 2016.
[57] Silvan Weder, Johannes Schonberger, Marc Pollefeys, and Martin R. Oswald. RoutedFusion: Learning real-time depth map fusion. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4886–4896, June 2020.
[58] Silvan Weder, Johannes L. Schonberger, Marc Pollefeys, and Martin R. Oswald. NeuralFusion: Online depth fusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3162–3172, June 2021.
[59] Zizhuang Wei, Qingtian Zhu, Chen Min, Yisong Chen, and Guoping Wang. AA-RMVSNet: Adaptive aggregation recurrent multi-view stereo network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6187–6196, 2021.
[60] Thomas Whelan, Renato F Salas-Moreno, Ben Glocker, Andrew J Davison, and Stefan Leutenegger. ElasticFusion: Real-time dense SLAM and light source estimation. The International Journal of Robotics Research, 35(14):1697–1716, Dec. 2016.
[61] Zhirong Wu, Shuran Song, Aditya Khosla, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shape modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, USA, June 2015.
[62] Chuhua Xian, Kun Qian, Zitian Zhang, and Charlie C. L. Wang. Multi-scale progressive fusion learning for depth map super-resolution. arXiv e-prints, arXiv:2011.11865, Nov. 2020.
[63] Qingshan Xu and Wenbing Tao. Multi-scale geometric consistency guided multi-view stereo. Computer Vision and Pattern Recognition (CVPR), 2019.
[64] Qingshan Xu and Wenbing Tao. Planar prior assisted patchmatch multi-view stereo. AAAI Conference on Artificial Intelligence (AAAI), 2020.
[65] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV), 2018.
[66] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020.
[67] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020.
[68] Igor Zacharov, Rinat Arslanov, Maksim Gunin, Daniil Stefonishin, Andrey Bykov, Sergey Pavlov, Oleg Panarin, Anton Maliutin, Sergey Rykovanov, and Maxim Fedorov. "Zhores": petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology. Open Engineering, 9(1):512–520, 2019.
[69] Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, and Tian Fang. Visibility-aware multi-view stereo network. British Machine Vision Conference (BMVC), 2020.
[70] Yinda Zhang and Thomas Funkhouser. Deep depth completion of a single RGB-D image. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 175–185. IEEE, June 2018.