Neural Implicit 3D Shapes from Single Images with Spatial Patterns
Yixin Zhuang¹, Yunzhe Liu¹, Yujie Wang², Baoquan Chen¹
¹ Center on Frontiers of Computing Studies, Peking University   ² Shandong University
arXiv:2106.03087v2 [cs.CV] 2 Oct 2021

Abstract

3D shape reconstruction from a single image has been a long-standing problem in computer vision. Recent advances have led to 3D representation learning, wherein pixel-aligned 3D reconstruction methods show impressive performance. However, it is normally hard to exploit meaningful local image features to describe 3D point samplings from the aligned pixels when large variations of occlusions, views, and appearances exist. In this paper, we study a general kernel that encodes local image features while accounting for the geometric relationships of point samplings on the underlying surfaces. The kernel is derived from the proposed spatial pattern: its kernel points are obtained as the 2D projections of a number of 3D pattern points around a sampling. Supported by the spatial pattern, the 2D kernel encodes geometric information that is essential for 3D reconstruction tasks, whereas traditional 2D kernels mainly consider appearance information. Furthermore, to enable the network to discover more adaptive spatial patterns that capture non-local contextual information, the spatial pattern is devised to be deformable. Experimental results on both synthetic and real datasets demonstrate the superiority of the proposed method. Pre-trained models, code, and data are available at https://github.com/yixin26/SVR-SP.

Introduction

3D shape reconstruction from a single image has been one of the central problems in computer vision. Empowering machines with the ability to perceive imagery and infer the underlying 3D shapes can benefit various downstream tasks, such as augmented reality and robot navigation. However, the problem is ambiguous and ill-posed, and thus remains highly challenging, owing to the information loss and occlusion that occur during image capture.

In recent years, many deep learning based methods have been proposed to infer 3D shapes from single images. In general, these methods rely on learning shape priors from large shape collections in order to reason about the underlying shape of unseen images. To this end, various learning frameworks have been proposed that exploit different 3D shape representations, including point sets (Fan, Su, and Guibas 2017; Achlioptas et al. 2018), voxels (Wu et al. 2016, 2018), polygonal meshes (Groueix et al. 2018; Wang et al. 2018), and implicit fields (Mescheder et al. 2019; Park et al. 2019; Chen and Zhang 2019). In particular, implicit field-based models have shown impressive performance compared to the others.

Implicit field-based networks take a set of 3D samplings as input and predict a corresponding value for each, under varying representations (e.g., occupancy, signed distance, etc.). Once the network is trained, 3D shapes are identified as iso-surfaces of the predicted scalar fields with meshing methods such as Marching Cubes (Lorensen and Cline 1987). By directly conditioning 3D shape generation on the global feature extracted from the input image (Mescheder et al. 2019; Chen and Zhang 2019), implicit networks are well-suited to reconstructing 3D shapes from single images. However, this straightforward combination often fails to reconstruct fine geometric details and produces overly smoothed surfaces.

To address this issue, DISN (Xu et al. 2019) proposes a pixel-aligned network in which each input point sampling is described by a learned local image feature, obtained by projecting the sampling onto the image plane using a predicted camera pose. With local image features, the network predicts a rectifying field that refines the final predicted field. However, the strategy of associating 3D point samplings with learned local image features becomes less effective when samplings are occluded from the observation view. Hence, to represent each point sampling with a meaningful local image feature, Ladybird (Xu et al. 2020) additionally uses the feature extracted from the 2D projection of its symmetric point, obtained from the self-reflective symmetry of the object. By associating a pair of local features with each 3D sampling, the reconstruction quality is significantly improved over DISN. Nevertheless, the strategy used in Ladybird is not sufficiently generic, as the feature may have no intuitive meaning when the symmetric point is non-visible or the symmetry assumption does not hold.

In this paper, we investigate a more general point structure, the spatial pattern, from which a kernel operating in image space is derived to better encode local image features of 3D point samplings. Specifically, the pattern is formed by a fixed number of affinities around a point sampling, whose 2D projections are used as the operation positions of the kernel.

Although a traditional 2D convolution can also encode contextual information for the central point, it ignores the underlying geometric relations between pixels in the original 3D space and is constrained by its regular local support.
A 2D deformable kernel (Dai et al. 2017) is able to operate on irregular neighborhoods, but it still does not explicitly consider the underlying 3D geometric relations, which are essential in 3D reconstruction tasks.

Figure 1 shows the pipeline of the 3D spatial pattern guided 2D kernel. As shown in Figure 1 (c-d), the kernels operate on points determined by spatial patterns for different point samplings. Specifically, the proposed kernel finds kernel points adaptively for each pixel, considering its geometrically related positions (e.g., symmetry locations) in the underlying 3D space rather than relying only on appearance information. Furthermore, the spatial pattern is devised to be deformable, enabling the network to discover more adaptive geometric relations for point samplings. In the experiments section, we analyze the learned 3D spatial pattern with visualization and statistics.

Figure 1: Illustration of the pipeline of the spatial pattern guided kernel. (a) shows that each 3D point sampling (colored differently) of the depicted shape is aligned to a 2D pixel by the given camera pose. Compared with a 2D convolution kernel (b) that only considers neighbors located within a 2D regular local patch, the kernels in (d), derived from the proposed spatial patterns (c), explicitly exploit the underlying geometric relations for each pixel. As a result, the kernels in (d) encode local image features that capture both image contextual information and point geometry relations.

To demonstrate the effectiveness of the spatial pattern guided kernel, we integrate it into a network based on a deep implicit network (Xu et al. 2019) and extensively evaluate our model on a large collection of 3D shapes – the ShapeNet Core dataset (Chang et al. 2015) and the Pix3D dataset (Sun et al. 2018). The experiments show that our method produces state-of-the-art 3D shape reconstruction results from single images compared to previous works. Ablation experiments and analyses are conducted to show the performance of different spatial pattern variants and the importance of individual points within the spatial pattern.

Related Work

There has been a lot of research on the single-image reconstruction task. Recent works involve 3D representation learning, including points (Fan, Su, and Guibas 2017; Lin, Kong, and Lucey 2018; Mandikal et al. 2018), voxels (Choy et al. 2016; Wu et al. 2018; Xie et al. 2019), meshes (Groueix et al. 2018; Wang et al. 2018, 2019; Gkioxari, Malik, and Johnson 2019), and primitives (Niu, Li, and Xu 2018; Tang et al. 2019; Wu et al. 2020). The representation can also be learned without knowing the underlying ground truth 3D shapes (Kato, Ushiku, and Harada 2018; Liu et al. 2019, 2020a; Yan et al. 2016; Insafutdinov and Dosovitskiy 2018; Lin, Kong, and Lucey 2018). In this line of research, AtlasNet (Groueix et al. 2018) represents 3D shapes as the union of several surface elements generated from learned multilayer perceptrons (MLPs). Pixel2Mesh (Wang et al. 2018) generates genus-zero shapes as deformations of an ellipsoid template mesh; the mesh is progressively refined to higher resolutions using a graph convolutional neural network conditioned on multi-scale image features. 3DN (Wang et al. 2019) also deforms a template mesh to the target, trained with a differentiable mesh sampling operator that pushes sampled points to the target positions.

Explicit 3D representations are usually limited by fixed shape resolution or topology. Alternatively, implicit functions for 3D objects have shown advantages at representing complicated geometry (Mescheder et al. 2019; Chen and Zhang 2019; Xu et al. 2019, 2020; Li and Zhang 2020; Niemeyer et al. 2020; Liu et al. 2020b; Jiang et al. 2020; Saito et al. 2019). ImNet (Chen and Zhang 2019) uses an MLP-based neural network to approximate the signed distance field (SDF) of 3D shapes and shows improved results in contrast to explicit surface representations. OccNet (Mescheder et al. 2019) generates an implicit volumetric shape by inferring the probability of each grid cell being occupied or not; the shape resolution is refined by repeatedly subdividing the cells of interest. While those methods are capable of capturing the global shape structure, geometric details are usually missing. In addition to the holistic shape description, DISN (Xu et al. 2019) adds a local image feature for each 3D point, computed by aligning the image to the 3D shape using an estimated camera pose. With global and local features, DISN recovers much better geometric details and outperforms state-of-the-art methods.
The local image feature of each 3D point sampling can be further augmented with that of its self-symmetric point in situations of self-occlusion, as shown in Ladybird (Xu et al. 2020). Compared to Ladybird, we investigate a more general point structure, the spatial pattern, along with a deformable 2D kernel derived from the pattern, to encode geometric relationships into local image features.

Figure 2: The overview of our method. Given an image, our network predicts the signed distance field (SDF) of the underlying 3D object. To predict the SDF value for each point p, besides utilizing the global feature encoded from the image and the point feature directly inferred from p, local image features are fully exploited. In particular, the local feature of a 3D point is encoded with a kernel in the image space whose kernel points are derived from a spatial pattern. *, © and ⊕ denote convolution, concatenation, and sum operations, respectively.

Method

Overview

Given an RGB image of an object, our goal is to reconstruct the complete 3D shape of the object with high-quality geometric details. We use signed distance fields (SDFs) to represent the 3D objects and approximate the SDFs with a neural network. Our network takes 3D points p = (x, y, z) ∈ R³ and an image I as input and outputs the signed distance s at each input location. With an SDF, the surface of an object can be extracted as the iso-surface SDF(·) = 0 through the Marching Cubes algorithm. In general, our network consists of a fully convolutional image encoder m and a continuous implicit function f represented as multi-layer perceptrons (MLPs), from which the SDF is generated as

    f(p, F_l(a), F_g) = s,  s ∈ R,                                 (1)

where a = π(p) is the 2D projection of p, F_l(a) = m(I(a)) is the local feature at image location a, and F_g represents the global image feature. The feature F_l(a) integrates multi-scale local image features from the feature maps of m, where the local image features are localized by aligning the 3D points to the image pixels via the estimated camera c.

By integrating a spatial pattern at each 3D point sampling, the feature F_l(a) of the sampling is modulated by the local image features of the pattern points. We devise a feature encoding kernel h, attached to the image encoder m, that encodes a new local image feature from the features extracted from the image feature map. Our model is then reformulated as

    f(p, h(F_l(a), F_l(a_1), ..., F_l(a_n)), F_g) = s,             (2)

where the pixels a_1, ..., a_n are the 2D projections of the 3D points p_1, ..., p_n belonging to the spatial pattern of the point sampling p. The encoding kernel h is an MLP network that fuses the local image features, and n is the number of pattern points. The points p_1, ..., p_n are generated by a spatial pattern generator, which is addressed in the following subsection.

In general, our pipeline is designed to better exploit contextual information from local image features extracted according to the predicted 3D spatial patterns, resulting in geometry-sensitive image feature descriptions for 3D point samplings and ultimately improving 3D reconstruction from single-view images. A schematic illustration of the proposed model is given in Figure 2.
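To make the formulation concrete, the following is a minimal PyTorch-style sketch of Equation 2. It assumes the camera projection is given as a callable that returns normalized pixel coordinates, uses bilinear sampling to fetch F_l at the projected locations, and collapses the separate global and local branches of Figure 2 into a single regression MLP; all module names, layer widths, and feature dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPatternSDF(nn.Module):
    """Sketch of Eq. (2): s = f(p, h(F_l(a), F_l(a_1..a_n)), F_g)."""

    def __init__(self, local_dim=256, global_dim=1024, n_pattern=6):
        super().__init__()
        # h: fuses the local features of p and its n pattern points.
        self.h = nn.Sequential(
            nn.Linear((n_pattern + 1) * local_dim, local_dim), nn.ReLU())
        # f: regresses the signed distance from the point, local and global features.
        self.f = nn.Sequential(
            nn.Linear(3 + local_dim + global_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def gather_local(self, feat_map, pix):
        # feat_map: (B, local_dim, H, W); pix: (B, N, 2) coords in [-1, 1].
        # Bilinearly sample F_l at the projected locations a = pi(p).
        sampled = F.grid_sample(feat_map, pix.unsqueeze(2), align_corners=True)
        return sampled.squeeze(-1).permute(0, 2, 1)            # (B, N, local_dim)

    def forward(self, feat_map, f_global, points, pattern, project):
        # points: (B, N, 3); pattern: (B, N, n, 3);
        # project: callable mapping 3D points to normalized 2D pixel coords.
        B, N, n, _ = pattern.shape
        a = project(points)                                    # (B, N, 2)
        a_k = project(pattern.reshape(B, N * n, 3))            # (B, N*n, 2)
        f_p = self.gather_local(feat_map, a)                   # (B, N, C)
        f_k = self.gather_local(feat_map, a_k).reshape(B, N, n, -1)
        local = self.h(torch.cat([f_p.unsqueeze(2), f_k], dim=2).flatten(2))
        f_g = f_global.unsqueeze(1).expand(-1, N, -1)
        return self.f(torch.cat([points, local, f_g], dim=-1)).squeeze(-1)
```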
Spatial Pattern Generator

Figure 3: Illustration of the spatial pattern generator. For an input point sampling, a pattern is initialized from n nearby points around it, and the offset of each surrounding point is predicted by an MLP network. The final pattern is created as the sum of the initial points and the corresponding offsets.

Our spatial pattern generator takes as input a 3D point sampling p and outputs n 3D coordinates, i.e., p_1, ..., p_n. As in deformable convolution networks (Dai et al. 2017), the position of a pattern point is computed as the sum of an initial location and a predicted offset. A schematic illustration of the spatial pattern generator is shown in Figure 3. With proper initialization, the pattern can be learned efficiently and is highly effective for geometric reasoning.
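A minimal sketch of the generator described above, assuming the point sampling and its initial pattern are processed by shared point-wise layers before a second MLP predicts Tanh-bounded offsets; the layer widths are assumptions that only loosely follow the appendix architecture table.

```python
import torch
import torch.nn as nn

class SpatialPatternGenerator(nn.Module):
    """Predicts offsets for n initial pattern points around a sampling p."""

    def __init__(self, n_pattern=6, width=256):
        super().__init__()
        self.n = n_pattern
        # Shared point-wise encoder applied to p and every initial pattern point.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, width), nn.ReLU())
        # Fusion MLP producing one bounded 3D offset per pattern point.
        self.offset_mlp = nn.Sequential(
            nn.Linear((n_pattern + 1) * width, width), nn.ReLU(),
            nn.Linear(width, n_pattern * 3), nn.Tanh())

    def forward(self, p, init_pattern):
        # p: (B, 3); init_pattern: (B, n, 3) initial pattern points around p.
        feats = self.point_mlp(torch.cat([p.unsqueeze(1), init_pattern], dim=1))
        offsets = self.offset_mlp(feats.flatten(1)).view(-1, self.n, 3)
        # Final pattern = initial positions + predicted offsets.
        return init_pattern + offsets
```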
Initialization. We consider two different sampling methods for spatial pattern initialization, i.e., uniform and non-uniform 3D point sampling. For simplicity, the input shapes are normalized to a unified cube centered at the origin.

For a uniform pattern, we uniformly sample n points from a cube centered at point p = (x, y, z) with edge length l. For example, with n = 6 and l = 0.2, each pattern point p_i is obtained by offsetting one coordinate of p by ±0.1, such that the pattern points lie at the centers of the six side faces of the cube.

Unlike the uniform sampling method, non-uniform sampling does not have a commonly used strategy, except for random sampling. Randomly sampled points usually have no intuitive geometric meaning and hardly appear in any kernel point selection method. To capture non-local geometric relations, we instead design the pattern points as the symmetric points of p with respect to the global axes or the planes passing through the origin of the 3D coordinate frame. Since the shape is normalized and centered at the origin, the symmetric points also lie within the 3D shape space. Similarly, we let n = 6 and use the x, y, z axes and the xy, yz, xz planes to compute the symmetric points, so the pattern points p_i can be drawn from the sign combinations p_i = (±x, ±y, ±z).

The two types of spatial pattern initialization are shown in Figure 4.

Figure 4: Examples of spatial pattern initialization obtained by non-uniform sampling (I) and uniform sampling (II) strategies. In (I), a non-uniform pattern is formed by 3D points (in black) that are symmetric to the input point p with respect to the x, y, z axes and the xy, yz, xz planes; in (II), a uniform pattern is created by 3D points (in black) lying at the centers of the side faces of a cube centered at point p.

After initialization, the pattern points, along with the input sampling, are passed to an MLP network that generates the offset of each pattern point, and the final pattern is the sum of the initial pattern positions and the predicted offsets.
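Both initializations follow directly from the definitions above. A short sketch, where the batched tensor layout is an assumption:

```python
import torch

def uniform_init(p, l=0.2):
    # p: (B, 3). Six points at the face centers of a cube of edge l around p.
    offsets = torch.tensor([[ l / 2, 0, 0], [-l / 2, 0, 0],
                            [0,  l / 2, 0], [0, -l / 2, 0],
                            [0, 0,  l / 2], [0, 0, -l / 2]],
                           dtype=p.dtype, device=p.device)
    return p.unsqueeze(1) + offsets                      # (B, 6, 3)

def non_uniform_init(p, n=6):
    # p: (B, 3). Rows 1-3: reflections of p across the xy, yz, xz planes
    # (used for n=3); rows 4-6: reflections across the z, x, y axes (n=6).
    signs = torch.tensor([[ 1,  1, -1], [-1,  1,  1], [ 1, -1,  1],
                          [-1, -1,  1], [ 1, -1, -1], [-1,  1, -1]],
                         dtype=p.dtype, device=p.device)
    return p.unsqueeze(1) * signs[:n]                    # (B, n, 3)
```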
Loss Function. Given a collection of 3D shapes and the implicit fields generated from the images I, the loss is defined with the L1 distance:

    L_SDF = Σ_{I} Σ_{p} ω | f(p, F_l^I, F_g^I) − SDF^I(p) |,       (3)

summing over all training images I and point samplings p, where SDF^I denotes the ground truth SDF corresponding to image I and f(·) is the predicted field. The weight ω is set to ω1 if SDF^I(p) < δ, and to ω2 otherwise.
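Equation 3 translates into a few lines; the batched tensor layout is an assumption, and the default weights follow the values ω1 = 4, ω2 = 1, δ = 0.01 given in the implementation details below.

```python
import torch

def sdf_loss(pred_sdf, gt_sdf, w1=4.0, w2=1.0, delta=0.01):
    # Weighted L1 loss of Eq. (3): samples with ground-truth SDF below delta
    # (inside or near the surface) receive the larger weight w1, others w2.
    w = torch.where(gt_sdf < delta,
                    torch.full_like(gt_sdf, w1),
                    torch.full_like(gt_sdf, w2))
    return (w * (pred_sdf - gt_sdf).abs()).sum()
```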
Experiments

In this section, we show qualitative and quantitative results of our method on single-view 3D reconstruction, along with comparisons against state-of-the-art methods. We also conduct a study on variants of the spatial pattern with quantitative results. To understand the effectiveness of the points in the spatial pattern, we provide an analysis of the learned spatial pattern with visualization and statistics.

Implementation Details. We use DISN (Xu et al. 2019) as our backbone network, which consists of a VGG-style fully convolutional neural network as the image encoder and multi-layer perceptrons (MLPs) to represent the implicit function. Our spatial pattern generator and local feature aggregation kernel are based on MLPs. The parameters of the loss function in Equation 3 are set to ω1 = 4, ω2 = 1, and δ = 0.01.

Dataset and Training Details. We use the ShapeNet Core dataset (Chang et al. 2015) and the Pix3D dataset (Sun et al. 2018) for evaluation. The proposed model is trained across all categories. In testing, for the ShapeNet dataset, the camera parameters are estimated from the input images, and we use the trained camera model from DISN (Xu et al. 2019) for fair comparisons. For the Pix3D dataset, ground truth camera parameters and image masks are used for evaluation.

Evaluation Metrics. The quantitative results are obtained by computing the similarity between generated surfaces and ground truth surfaces. We use the standard metrics, including Chamfer Distance (CD), Earth Mover's Distance (EMD), and Intersection over Union (IoU).

Please refer to the supplemental materials for more details on data processing, network architecture, training procedures, and spatial pattern computation.

Quantitative and Qualitative Evaluations

Impact of Spatial Pattern Configuration. To figure out the influence of different spatial patterns, we designed several variants of the pattern. Specifically, two factors are considered: the initialization and the capacity, i.e., the pattern point sampling strategy and the number of points in a pattern. As described before, we consider non-uniform and uniform sampling methods for pattern initialization, and set the number of points to three or six to change the capacity. The methods derived from the combinations of these two factors are denoted as:
• Ours-uniform-6p, in which six points are uniformly sampled on a cube of edge length l centered at the point sampling.
• Ours-non-uniform-6p, in which six points are non-uniformly sampled at the symmetric locations in the shape space with respect to the xy, yz, and xz planes and the x, y, and z axes.
• Ours-non-uniform-3p, in which three points are non-uniformly sampled at the symmetric locations in the shape space with respect to the xy, yz, and xz planes.

In Table 1, we report the numerical results of these variants using ground truth camera poses. In general, Ours-non-uniform-6p achieves the best performance.

Table 1: Quantitative results of the variants of our method using different spatial pattern configurations. Metrics include CD (multiplied by 1000, lower is better), EMD (multiplied by 100, lower is better), and IoU (%, higher is better). CD and EMD are computed on 2048 points. Columns: plane, bench, cabinet, car, chair, display, lamp, speaker, rifle, sofa, table, phone, watercraft, mean.

CD(×1000)↓
Ours-uniform-6p      3.72 3.73 7.09 3.93 4.59 4.78 7.77 9.19 2.02 4.64 6.71 3.62 4.17 | 5.07
Ours-non-uniform-6p  3.27 3.38 6.88 3.93 4.40 5.40 6.77 8.48 1.58 4.38 6.49 4.02 4.01 | 4.85
Ours-non-uniform-3p  3.33 3.51 6.88 3.87 4.38 4.58 7.22 8.76 3.00 4.45 6.66 3.63 4.11 | 4.95

EMD(×100)↓
Ours-uniform-6p      2.07 2.02 2.60 2.38 2.19 2.11 2.86 2.85 1.55 2.16 2.41 1.78 2.01 | 2.23
Ours-non-uniform-6p  1.91 1.90 2.58 2.36 2.17 2.08 2.66 2.75 1.52 2.11 2.36 1.77 1.99 | 2.17
Ours-non-uniform-3p  1.96 1.94 2.58 2.35 2.16 2.07 2.81 2.81 1.58 2.13 2.39 1.78 2.00 | 2.20

IoU(%)↑
Ours-uniform-6p      66.1 59.5 59.6 80.0 65.8 66.7 53.8 63.7 74.7 74.1 60.8 79.6 68.0 | 67.1
Ours-non-uniform-6p  68.2 63.1 61.4 80.7 66.8 67.9 55.9 65.0 75.0 75.2 62.6 81.0 68.9 | 68.6
Ours-non-uniform-3p  67.4 62.0 60.5 80.5 66.8 67.5 54.1 64.2 73.6 75.1 61.8 80.2 68.7 | 67.9
By reducing the capacity to three points, the performance decreases, as shown by Ours-non-uniform-3p. This indicates that some critical points in Ours-non-uniform-6p that have high responses to the query point do not appear in Ours-non-uniform-3p. Notably, the sampling strategy is more important: both Ours-non-uniform-6p and Ours-non-uniform-3p outperform Ours-uniform-6p by large margins. Thus, initialization with non-uniform sampling makes the learning of effective spatial patterns easier. This implies that optimizing the pattern positions in the continuous 3D space is challenging, and that with proper initialization the spatial pattern can be learned more efficiently. To better understand the learned spatial pattern and which pattern points are preferred by the network, we provide an analysis with visualization and statistics in the next section. Before that, we evaluate the performance of our method by comparing it with several state-of-the-art methods. Specifically, we use Ours-non-uniform-6p as our final method for comparison.

Table 2: Quantitative results on the ShapeNet Core dataset for various methods. Columns: plane, bench, cabinet, car, chair, display, lamp, speaker, rifle, sofa, table, phone, watercraft, mean.

CD(×1000)↓
AtlasNet     5.98 6.98 13.76 17.04 13.21 7.18 38.21 15.96 4.59 8.29 18.08 6.35 15.85 | 13.19
Pixel2Mesh   6.10 6.20 12.11 13.45 11.13 6.39 31.41 14.52 4.51 6.54 15.61 6.04 12.66 | 11.28
3DN          6.75 7.96 8.34 7.09 17.53 8.35 12.79 17.28 3.26 8.27 14.05 5.18 10.20 | 9.77
IMNET        12.65 15.10 11.39 8.86 11.27 13.77 63.84 21.83 8.73 10.30 17.82 7.06 13.25 | 16.61
3DCNN        10.47 10.94 10.40 5.26 11.15 11.78 35.97 17.97 6.80 9.76 13.35 6.30 9.80 | 12.30
OccNet       7.70 6.43 9.36 5.26 7.67 7.54 26.46 17.30 4.86 6.72 10.57 7.17 9.09 | 9.70
DISN         9.96 8.98 10.19 5.39 7.71 10.23 25.76 17.90 5.58 9.16 13.59 6.40 11.91 | 10.98
Ladybird     5.85 6.12 9.10 5.13 7.08 8.23 21.46 14.75 5.53 6.78 9.97 5.06 6.71 | 8.60
Ours-cam     5.40 5.59 8.43 5.01 6.17 8.54 14.96 14.07 3.82 6.70 8.97 5.42 6.19 | 7.64
Ours         3.27 3.38 6.88 3.93 4.40 5.40 6.77 8.48 1.58 4.38 6.49 4.02 4.01 | 4.85

EMD(×100)↓
AtlasNet     3.39 3.22 3.36 3.72 3.86 3.12 5.29 3.75 3.35 3.14 3.98 3.19 4.39 | 3.67
Pixel2Mesh   2.98 2.58 3.44 3.43 3.52 2.92 5.15 3.56 3.04 2.70 3.52 2.66 3.94 | 3.34
3DN          3.30 2.98 3.21 3.28 4.45 3.91 3.99 4.47 2.78 3.31 3.94 2.70 3.92 | 3.56
IMNET        2.90 2.80 3.14 2.73 3.01 2.81 5.85 3.80 2.65 2.71 3.39 2.14 2.75 | 3.13
3DCNN        3.36 2.90 3.06 2.52 3.01 2.85 4.73 3.35 2.71 2.60 3.09 2.10 2.67 | 3.00
OccNet       2.75 2.43 3.05 2.56 2.70 2.58 3.96 3.46 2.27 2.35 2.83 2.27 2.57 | 2.75
DISN         2.67 2.48 3.04 2.67 2.67 2.73 4.38 3.47 2.30 2.62 3.11 2.06 2.77 | 2.84
Ladybird     2.48 2.29 3.03 2.65 2.60 2.61 4.20 3.32 2.22 2.42 2.82 2.06 2.46 | 2.71
Ours-cam     2.35 2.15 2.90 2.66 2.49 2.49 3.59 3.20 2.04 2.40 2.70 2.05 2.40 | 2.57
Ours         1.91 1.90 2.58 2.36 2.17 2.08 2.66 2.75 1.52 2.11 2.36 1.77 1.99 | 2.17

IoU(%)↑
AtlasNet     39.2 34.2 20.7 22.0 25.7 36.4 21.3 23.2 45.3 27.9 23.3 42.5 28.1 | 30.0
Pixel2Mesh   51.5 40.7 43.4 50.1 40.2 55.9 29.1 52.3 50.9 60.0 31.2 69.4 40.1 | 47.3
3DN          54.3 39.8 49.4 59.4 34.4 47.2 35.4 45.3 57.6 60.7 31.3 71.4 46.4 | 48.7
IMNET        55.4 49.5 51.5 74.5 52.2 56.2 29.6 52.6 52.3 64.1 45.0 70.9 56.6 | 54.6
3DCNN        50.6 44.3 52.3 76.9 52.6 51.5 36.2 58.0 50.5 67.2 50.3 70.9 57.4 | 55.3
OccNet       54.7 45.2 73.2 73.1 50.2 47.9 37.0 65.3 45.8 67.1 50.6 70.9 52.1 | 56.4
DISN         57.5 52.9 52.3 74.3 54.3 56.4 34.7 54.9 59.2 65.9 47.9 72.9 55.9 | 57.0
Ladybird     60.0 53.4 50.8 74.5 55.3 57.8 36.2 55.6 61.0 68.5 48.6 73.6 61.3 | 58.2
Ours-cam     60.6 54.4 52.9 74.7 56.0 59.2 38.3 56.1 62.9 68.8 49.3 74.7 60.6 | 59.1
Ours         68.2 63.1 61.4 80.7 66.8 67.9 55.9 65.0 75.0 75.2 62.6 81.0 68.9 | 68.6
Comparison with Various Methods. We compare our method with state-of-the-art methods on the single-image 3D reconstruction task. All the methods, including AtlasNet (Groueix et al. 2018), Pixel2Mesh (Wang et al. 2018), 3DN (Wang et al. 2019), ImNet (Chen and Zhang 2019), 3DCNN (Xu et al. 2019), OccNet (Mescheder et al. 2019), DISN (Xu et al. 2019), and Ladybird (Xu et al. 2020), are trained across all 13 categories. The method denoted Ours uses ground truth cameras, while Ours-cam denotes the version of Ours using estimated camera poses.

A quantitative evaluation on the ShapeNet dataset is reported in Table 2 in terms of CD, EMD, and IoU. CD and EMD are evaluated on points sampled from the generated triangulated mesh; IoU is computed on the solid voxelization of the mesh. In general, our method outperforms the other methods. In particular, among the methods that share a similar backbone network, including DISN, Ladybird, and Ours, Ours achieves much better performance.

In Figure 7, we show qualitative results generated by Mesh R-CNN (Gkioxari, Malik, and Johnson 2019), OccNet (Mescheder et al. 2019), DISN (Xu et al. 2019), and Ladybird (Xu et al. 2020). We use the pre-trained models of Mesh R-CNN, OccNet, and DISN. For Ladybird, we re-implement their network and carry out training according to the specifications in their paper. All the methods are able to capture the general structure of the shapes, while shapes generated from DISN, Ladybird, and Ours are more aligned with the ground truth shapes. Specifically, our method is visually better at the non-visible regions and fine geometric variations.
The quantitative evaluation on the Pix3D dataset is provided in Table 3. Both Ours and Ladybird are trained and evaluated on the same train/test split, using ground truth camera poses and masks. Specifically, 80% of the images are randomly sampled from the dataset for training, while the remaining images are used for testing. In general, our method outperforms Ladybird on the metrics used.

Table 3: Quantitative results on the Pix3D dataset.

            CD(×1000)↓         EMD(×100)↓        IoU(%)↑
            Ladybird  Ours     Ladybird  Ours    Ladybird  Ours
bed         9.84      8.76     2.80      2.70    70.7      73.2
bookcase    10.94     14.70    2.91      3.32    44.3      41.8
chair       14.05     9.81     2.82      2.72    57.3      57.3
desk        18.87     15.38    3.18      2.91    51.2      60.7
misc        36.77     30.94    4.45      4.00    29.8      44.0
sofa        4.56      3.77     2.02      1.92    86.7      87.6
table       21.66     14.04    2.96      2.78    56.9      58.8
tool        7.78      16.24    3.70      3.57    41.3      38.2
wardrobe    4.80      5.60     1.92      2.01    87.5      87.5
mean        14.36     13.25    2.97      2.88    58.4      61.0

Analysis of Learned Spatial Patterns

We have demonstrated the effectiveness of the proposed spatial pattern by achieving better performance than the alternatives, and the experiments on different variants of the spatial pattern show the influence of initialization and capacity. To better understand the importance of individual pattern points, we visualize several learned patterns in Figure 5 and calculate the mean offsets of the predicted pattern points, visualized in Figure 6.

Figure 5: Visualization of spatial pattern points with different shapes and colors. From the examples in (I) and (II), the learned pattern points (pink cones) from the non-uniform initialization (pink balls) are relatively stationary, while the points (blue cones) learned from the uniform initialization (blue balls) have much larger deviations from their original positions. Some stationary points, p1, p2, and p6, are highlighted in dashed circles.

Figure 6: Statistics on the offsets of the spatial pattern points. The offset of an individual pattern point is computed as the mean distance between its initial and predicted positions. Among all points, p1, p2, and p6 have the smallest learned offsets from the non-uniform initialization (pink bars), while for the uniform initialization (blue bars), all the predicted points have much larger deviations from their original locations.
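The per-point offset statistic reported in Figure 6 is simply the mean distance between the initial and predicted pattern points; a short sketch, with the tensor layout assumed:

```python
import torch

def mean_pattern_offsets(init_pattern, pred_pattern):
    # init_pattern, pred_pattern: (num_samples, n, 3) initial and predicted
    # pattern points. Returns the mean Euclidean offset of each of the n points.
    return (pred_pattern - init_pattern).norm(dim=-1).mean(dim=0)
```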
From Figure 5, we can see that some learned pattern points from the non-uniform initialization are almost stationary, e.g., the points p1, p2, and p6 highlighted by dashed circles. Also, as shown in Figure 6, the mean offsets of points p1, p2, and p6 are close to zero. To figure out the importance of these stationary pattern points, we train the network using only the points p1, p2, and p6 as the spatial pattern and keep their positions fixed during training. As shown in Table 4, the performance of this selected rigid pattern is better than Ours-non-uniform-3p and Ours-uniform-6p and slightly lower than Ours-non-uniform-6p. This reveals that the pattern points discovered by the network are useful and ultimately lead to a better reconstruction of the underlying geometry. Even though it consumes more time and memory, utilizing the auxiliary contextual information brought by the other points p3, p4, and p5 achieves only a slight improvement in performance. The analysis shows that the position of an individual pattern point is more important than the number of points. Furthermore, it also shows that encoding geometric relationships with the 2D kernel derived from the proposed spatial pattern is effective for the single-image 3D reconstruction task.

Table 4: Quantitative results of a rigid spatial pattern formed by three pattern points selected from the stationary points of the learned spatial pattern. Columns: plane, bench, cabinet, car, chair, display, lamp, speaker, rifle, sofa, table, phone, watercraft, mean.

CD(×1000)  3.38 3.44 7.06 3.87 4.50 4.57 7.30 8.98 1.66 4.53 6.61 3.45 4.17 | 4.89
EMD(×100)  1.97 1.92 2.58 2.37 2.16 2.07 2.77 2.82 1.52 2.12 2.35 1.80 2.01 | 2.19
IoU(%)     67.4 62.8 60.5 80.5 66.6 67.4 54.9 64.5 74.9 75.0 62.5 80.1 68.5 | 68.1

Figure 7: Qualitative comparison results for various methods.

Conclusion

In this paper, we propose a new feature encoding scheme for deep implicit field regression models for the task of 3D shape reconstruction from single images. We present the spatial pattern, from which a kernel operating in image space is derived to better encode local image features of 3D point samplings. Using the spatial pattern enables the 2D kernel point selection to explicitly consider the underlying 3D geometric relations, which are essential in 3D reconstruction tasks, while traditional 2D kernels mainly consider appearance information. To better understand the spatial pattern, we study several variants of the spatial pattern design with regard to the pattern capacity and the way of initialization, and we analyze the importance of individual pattern points. Results on large synthetic and real datasets show the superiority of the proposed method on widely used metrics.

A key limitation is that the model is sensitive to camera parameters. As shown in Table 2, when the camera parameters are exactly correct, the performance improves significantly. One possible direction to investigate is to incorporate the camera estimation process into the loop of the 3D reconstruction pipeline, for example by jointly optimizing the camera pose and the implicit field within a framework with multiple objectives.
Appendix

Data Processing

Datasets. The ShapeNet Core dataset (Chang et al. 2015) includes 13 object categories, and for each object, 24 views are rendered at a resolution of 137×137 as in (Choy et al. 2016). The Pix3D dataset (Sun et al. 2018) contains 9 object categories with real-world images and exact mask images; the number of views and the image resolution vary across shapes. We process all shapes and images into the same format for the two datasets. Specifically, all shapes are normalized to the range [-1,1] and all images are scaled to a resolution of 137×137. We show the data processing workflow with an example from the Pix3D dataset in Figure 8; for the ShapeNet dataset, the images already satisfy this format.

Figure 8: The data processing workflow, including 3D point sampling, image preparation, and 3D-to-2D camera projection.

3D Point Sampling. For each shape, 2048 points are sampled for training. We first normalize the shapes to a unified cube with their centers of mass at the origin. We then uniformly sample 256³ grid points from the cube and compute the signed distance field (SDF) values for all grid samples. Following the sampling process of Ladybird (Xu et al. 2020), the 256³ points are downsampled in two stages. In the first stage, 32,768 points are randomly sampled from the four SDF ranges [-0.10,-0.03], [-0.03,0.00], [0.00,0.03], and [0.03,0.10] with equal probabilities. In the second stage, 2048 points are uniformly sampled from the 32,768 points using the farthest point sampling strategy.

In testing, 65³ grid points are sampled and fed to the network, which outputs their SDF values. The object mesh is extracted as the zero iso-surface of the generated SDF using the Marching Cubes algorithm.
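A minimal sketch of this test-time extraction using scikit-image's Marching Cubes; the interface of the trained SDF network (sdf_net) and its conditioning on image features are assumptions:

```python
import numpy as np
import torch
from skimage import measure

def extract_mesh(sdf_net, image_feats, res=65, bound=1.0):
    # Build a res^3 grid covering the normalized shape cube [-bound, bound]^3.
    axis = np.linspace(-bound, bound, res, dtype=np.float32)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)
    with torch.no_grad():
        # sdf_net is assumed to map (N, 3) points plus image features to SDF values.
        sdf = sdf_net(torch.from_numpy(grid), image_feats).cpu().numpy()
    volume = sdf.reshape(res, res, res)
    # Zero iso-surface; spacing converts voxel indices back to world units.
    verts, faces, _, _ = measure.marching_cubes(
        volume, level=0.0, spacing=(2 * bound / (res - 1),) * 3)
    return verts - bound, faces
```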
3D-to-2D Camera Projection. The pixel coordinate a of a 3D point sampling p is computed in two stages. First, the point is converted from the world coordinate system to the local camera coordinate system c based on the rigid transformation matrix A_c, such that p_c = A_c p. Then, in camera space, the point p_c = (x_c, y_c, z_c) is projected onto the 2D canvas via the perspective transformation π(p_c) = (x_c / z_c, y_c / z_c). A projected pixel whose coordinate lies outside the image is reset to 0 or 136 (the input image resolution is fixed to 137×137 in our experiments).
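The two-stage projection can be sketched as follows. The [R | t] layout of A_c is an assumption, and the intrinsic scaling from normalized image-plane coordinates to pixels is omitted (assumed to be folded into the projection); the clamp mirrors the border reset described above.

```python
import torch

def project_to_pixel(p, A_c, res=137):
    # p: (N, 3) world-space points; A_c: (3, 4) rigid world-to-camera matrix [R | t].
    p_h = torch.cat([p, torch.ones_like(p[:, :1])], dim=1)   # homogeneous (N, 4)
    p_cam = p_h @ A_c.t()                                    # (N, 3) camera space
    # Perspective projection: pi(p_c) = (x_c / z_c, y_c / z_c).
    a = p_cam[:, :2] / p_cam[:, 2:3]
    # Coordinates falling outside the image are reset to the border (0 or res - 1).
    return a.clamp(min=0, max=res - 1)
```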
Spatial Pattern Computation

For a 3D point sampling p = (x, y, z), the pattern points of p are drawn from p_i = (x ± l/2, y ± l/2, z ± l/2) when using the uniform initialization, and are computed as p_i = (±x, ±y, ±z) when adopting the non-uniform initialization, where i is the point index, i = 1, ..., n, and n is the number of pattern points. Specifically, the spatial patterns for the variants of our method are computed as follows:
• For Ours-uniform-6p, the pattern configuration parameters are set to n = 6 and l = 0.2, and the pattern points are p1 = (x, y, z + 0.1), p2 = (x + 0.1, y, z), p3 = (x, y + 0.1, z), p4 = (x, y, z − 0.1), p5 = (x − 0.1, y, z), and p6 = (x, y − 0.1, z).
• For Ours-non-uniform-6p, the pattern configuration parameter is set to n = 6, and the pattern points are p1 = (x, y, −z), p2 = (−x, y, z), p3 = (x, −y, z), p4 = (−x, −y, z), p5 = (x, −y, −z), and p6 = (−x, y, −z).
• For Ours-non-uniform-3p, the pattern configuration parameter is set to n = 3, and the pattern points are p1 = (x, y, −z), p2 = (−x, y, z), and p3 = (x, −y, z).

As introduced in the analysis section, the rigid spatial pattern formed by the selected stationary points uses the points p1, p2, and p6 from the pattern points of Ours-non-uniform-6p.

Network and Training Details

Training Policy. We implement our method in the PyTorch framework. For training on the ShapeNet dataset, we use the Adam optimizer with a learning rate of 1e-4, a decay rate of 0.9, and a decay step size of 5 epochs. The network was trained for 30 epochs with a batch size of 20. For training on the Pix3D dataset, we use the Adam optimizer with a constant learning rate of 1e-4 and a smaller batch size of 5. For the ShapeNet dataset, at each epoch we randomly sample a subset of images from each category; specifically, a maximum of 36,000 images are sampled per category. The total number of images in an epoch is 411,384, resulting in 20,570 iterations. Our model is trained across all categories.

Network Architecture. We introduce the details of the image encoder m, the spatial pattern generator g, the feature aggregation module h, and the SDF regression network f used in our paper.

We use the convolutional part of VGG-16 as our image encoder, which generates multi-resolution feature maps. Similar to DISN (Xu et al. 2019), we reshape the feature maps to the original image size with bilinear interpolation and collect the local image features of a pixel from all scales of feature maps. Specifically, the local feature contains six sub-features from the six feature maps, with dimensions of {64, 128, 256, 512, 512, 512}, respectively.

Table 5: Spatial Pattern Generator g. Input: 1 + n 3-dim points → output: n 3-dim points.
Operation                   Output Shape
Conv1D+ReLU                 (64,)
Conv1D+ReLU                 (256,)
Conv1D+ReLU                 (512,)
Concat({1 + n features})    ((1 + n) × 512,)
Conv1D+ReLU                 (512,)
Conv1D+ReLU                 (256,)
Conv1D+Tanh                 (n × 3,)

The spatial pattern generator receives 1 + n 3D points: a point sampling and the initial spatial pattern belonging to it. All points are first promoted from R³ to 512 dimensions using a multi-layer perceptron (MLP). Then all point features are concatenated and passed to another MLP that outputs a vector of dimension n × 3, i.e., n 3D offsets. The final pattern is the sum of the initial pattern and the predicted offsets.

Table 6: Feature Aggregation Module h. Input: 1 + n D-dim feature vectors → output: a D-dim feature vector.
Operation                   Output Shape
Concat({1 + n features})    ((1 + n) × D,)
Conv1D+ReLU                 (D,)

The feature aggregation network is attached to the image encoder, from which it receives local image features, and produces a new local image feature. Since the local image feature contains six sub-features, we devise six aggregation modules. Without loss of generality, we set the dimension of the sub-feature to D.
The SDF regression network f contains two MLPs: a global MLP receiving the global image feature and a local MLP receiving the aggregated local image feature. Without loss of generality, we describe a network whose input feature has dimension S. The 3D coordinate is first promoted to a 512-dim point feature P_f by three fully connected layers. Then the point feature P_f and the input feature F_a are concatenated and passed to another three fully connected layers to generate the SDF value. The SDF values from the global MLP and the local MLP are added together as the final output.

Table 7: Signed Distance Function f. Input: a 3-dim point and an S-dim feature vector F_a → output: a 1-dim scalar.
Operation                   Output Shape
Conv1D+ReLU                 (64,)
Conv1D+ReLU                 (256,)
Conv1D+ReLU                 (512,), denoted as P_f
Concat(F_a, P_f)            (S + 512,)
Conv1D+ReLU                 (512,)
Conv1D+ReLU                 (256,)
Conv1D+ReLU                 (1,)
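A sketch of the two-branch regression: two instances of f (Table 7), one fed the global image feature and one fed the aggregated local feature, with their outputs summed. The Conv1D layers operate point-wise over a batch of samples, as in the table; the final activation is left linear in this sketch so that negative distances can be produced, which is an assumption rather than a literal reading of the last table row.

```python
import torch
import torch.nn as nn

class SDFBranch(nn.Module):
    """One instance of f (Table 7): a 3D point plus an S-dim feature -> scalar SDF."""

    def __init__(self, feat_dim):
        super().__init__()
        # Point encoder: 3 -> 64 -> 256 -> 512, producing P_f.
        self.point_enc = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 512, 1), nn.ReLU())
        # Head on Concat(F_a, P_f): (S + 512) -> 512 -> 256 -> 1.
        self.head = nn.Sequential(
            nn.Conv1d(feat_dim + 512, 512, 1), nn.ReLU(),
            nn.Conv1d(512, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 1, 1))

    def forward(self, points, feat):
        # points: (B, 3, N); feat: (B, S, N), the feature broadcast per point.
        p_f = self.point_enc(points)
        return self.head(torch.cat([feat, p_f], dim=1))      # (B, 1, N)

def predict_sdf(points, f_global, f_local, global_branch, local_branch):
    # The global-branch and local-branch outputs are summed to give the final SDF.
    return global_branch(points, f_global) + local_branch(points, f_local)
```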
References

Achlioptas, P.; Diamanti, O.; Mitliagkas, I.; and Guibas, L. 2018. Learning representations and generative models for 3d point clouds. In International Conference on Machine Learning, 40–49. PMLR.
Chang, A. X.; Funkhouser, T.; Guibas, L. J.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; Xiao, J.; Yi, L.; and Yu, F. 2015. ShapeNet: An Information-Rich 3D Model Repository. arXiv:1512.03012 [cs.GR].
Chen, Z.; and Zhang, H. 2019. Learning Implicit Fields for Generative Shape Modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Choy, C. B.; Xu, D.; Gwak, J.; Chen, K.; and Savarese, S. 2016. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision, 628–644. Springer.
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 764–773.
Fan, H.; Su, H.; and Guibas, L. J. 2017. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 605–613.
Gkioxari, G.; Malik, J.; and Johnson, J. 2019. Mesh R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9785–9795.
Groueix, T.; Fisher, M.; Kim, V. G.; Russell, B. C.; and Aubry, M. 2018. A Papier-Mâché Approach to Learning 3D Surface Generation. In Proc. CVPR, 216–224.
Insafutdinov, E.; and Dosovitskiy, A. 2018. Unsupervised learning of shape and pose with differentiable point clouds. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2807–2817.
Jiang, Y.; Ji, D.; Han, Z.; and Zwicker, M. 2020. SDFDiff: Differentiable rendering of signed distance fields for 3d shape optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1251–1261.
Kato, H.; Ushiku, Y.; and Harada, T. 2018. Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3907–3916.
Li, M.; and Zhang, H. 2020. D2IM-Net: Learning Detail Disentangled Implicit Fields from Single Images. arXiv preprint arXiv:2012.06650.
Lin, C.-H.; Kong, C.; and Lucey, S. 2018. Learning efficient point cloud generation for dense 3d object reconstruction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Liu, L.; Gu, J.; Zaw Lin, K.; Chua, T.-S.; and Theobalt, C. 2020a. Neural Sparse Voxel Fields. Advances in Neural Information Processing Systems, 33.
Liu, S.; Chen, W.; Li, T.; and Li, H. 2019. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. arXiv preprint arXiv:1901.05567.
Liu, S.; Zhang, Y.; Peng, S.; Shi, B.; Pollefeys, M.; and Cui, Z. 2020b. DIST: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019–2028.
Lorensen, W. E.; and Cline, H. E. 1987. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4): 163–169.
Mandikal, P.; Navaneet, K.; Agarwal, M.; and Babu, R. V. 2018. 3D-LMNet: Latent embedding matching for accurate and diverse 3D point cloud reconstruction from a single image. arXiv preprint arXiv:1807.07796.
Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; and Geiger, A. 2019. Occupancy Networks: Learning 3D Reconstruction in Function Space. In Proc. CVPR.
Niemeyer, M.; Mescheder, L.; Oechsle, M.; and Geiger, A. 2020. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3504–3515.
Niu, C.; Li, J.; and Xu, K. 2018. Im2Struct: Recovering 3d shape structure from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4521–4529.
Park, J. J.; Florence, P.; Straub, J.; Newcombe, R.; and Lovegrove, S. 2019. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In CVPR.
Saito, S.; Huang, Z.; Natsume, R.; Morishima, S.; Kanazawa, A.; and Li, H. 2019. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2304–2314.
Sun, X.; Wu, J.; Zhang, X.; Zhang, Z.; Zhang, C.; Xue, T.; Tenenbaum, J. B.; and Freeman, W. T. 2018. Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tang, J.; Han, X.; Pan, J.; Jia, K.; and Tong, X. 2019. A skeleton-bridged deep learning approach for generating meshes of complex topologies from single rgb images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4541–4550.
Wang, N.; Zhang, Y.; Li, Z.; Fu, Y.; Liu, W.; and Jiang, Y.-G. 2018. Pixel2Mesh: Generating 3d mesh models from single rgb images. In ECCV, 52–67.
Wang, W.; Ceylan, D.; Mech, R.; and Neumann, U. 2019. 3DN: 3d deformation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1038–1046.
Wu, J.; Zhang, C.; Xue, T.; Freeman, B.; and Tenenbaum, J. 2016. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, 82–90.
Wu, J.; Zhang, C.; Zhang, X.; Zhang, Z.; Freeman, W. T.; and Tenenbaum, J. B. 2018. Learning shape priors for single-view 3d completion and reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 646–662.
Wu, R.; Zhuang, Y.; Xu, K.; Zhang, H.; and Chen, B. 2020. PQ-NET: A generative part seq2seq network for 3d shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 829–838.
Xie, H.; Yao, H.; Sun, X.; Zhou, S.; and Zhang, S. 2019. Pix2Vox: Context-aware 3d reconstruction from single and multi-view images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2690–2698.
Xu, Q.; Wang, W.; Ceylan, D.; Mech, R.; and Neumann, U. 2019. DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction. arXiv preprint arXiv:1905.10711.
Xu, Y.; Fan, T.; Yuan, Y.; and Singh, G. 2020. Ladybird: Quasi-Monte Carlo sampling for deep implicit field based 3d reconstruction with symmetry. In European Conference on Computer Vision, 248–263. Springer.
Yan, X.; Yang, J.; Yumer, E.; Guo, Y.; and Lee, H. 2016. Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 1704–1712.