DEEPMULTICAP: PERFORMANCE CAPTURE OF MULTIPLE CHARACTERS USING SPARSE MULTIVIEW CAMERAS
Yang Zheng1,*, Ruizhi Shao1,*, Yuxiang Zhang1, Tao Yu1, Zerong Zheng1, Qionghai Dai1, and Yebin Liu1
1Tsinghua University
arXiv:2105.00261v1 [cs.CV] 1 May 2021
*Equal contribution. Project page: http://liuyebin.com/dmc/dmc.html

Figure 1: Given only sparse multi-view RGB videos (6 views for the left and middle examples, 8 views for the right), our method reconstructs various kinds of 3D shapes with temporally varying surface details, even under the challenging occlusions of multi-person interactive scenarios.

Abstract

We propose DeepMultiCap, a novel method for multi-person performance capture using sparse multi-view cameras. Our method can capture time-varying surface details without pre-scanned template models. To tackle the severe occlusions of closely interacting scenes, we combine a recently proposed pixel-aligned implicit function with a parametric model for robust reconstruction of the invisible surface areas. An effective attention-aware module is designed to obtain fine-grained geometry details from multi-view images, from which high-fidelity results can be generated. In addition to this spatial attention method, for video inputs we further propose a novel temporal fusion method to alleviate noise and temporal inconsistencies in moving character reconstruction. For quantitative evaluation, we contribute a high-quality multi-person dataset, MultiHuman, which consists of 150 static scenes with different levels of occlusion and ground-truth 3D human models. Experimental results demonstrate the state-of-the-art performance of our method and its good generalization to real multi-view video data, outperforming prior works by a large margin.

1. Introduction

Recent years have witnessed great progress in vision-based human performance capture, which promises to enable various applications (e.g., tele-presence, sportscasting, gaming and mixed reality) with enhanced interactive and immersive experiences. To achieve highly detailed geometry and texture reconstruction, dense camera rigs, sometimes equipped with sophisticated lighting systems, have been introduced [40, 4, 22, 21, 15]. However, these extremely expensive and professional setups limit their popularity. Although other light-weight multi-view human performance capture systems have achieved impressive results even in real time, they still rely on pre-scanned templates [27, 26] or custom-designed RGBD sensors [11, 9], or are limited to single-person reconstruction [36, 13, 19, 14].

Benefiting from the fast improvement of deep implicit functions for 3D representation, recent methods [33, 34, 24] are able to recover 3D body shape from only a single RGB image. Compared with voxel-based [37, 19, 48] or mesh-based [30, 1] representations, an implicit function guides deep learning models to capture geometric details in a more efficient way. Specifically, PIFu [33, 24] achieves plausible single-human reconstruction using only RGB images, and PIFuHD [34] further utilizes normal maps and high-resolution images to generate more detailed results.
Despite their prominent performance in digitizing the 3D human body, both PIFu [33] and PIFuHD [34] suffer from several drawbacks when the frameworks are extended to multi-person scenarios and multi-view setups. Firstly, the average-pooling-based multi-view feature fusion strategy in PIFu leads to over-smoothed outputs when high-frequency details are included in the multi-view features, i.e., the normal maps used in PIFuHD. More importantly, in both approaches, good reconstruction results are only guaranteed for ideal input images without the severe occlusions of multi-person performance capture scenarios. The reconstruction performance of [33, 34] deteriorates significantly due to the lack of observations caused by severe occlusions.

To address the aforementioned problems, we propose a novel framework for multi-person reconstruction from multi-view images. First of all, inspired by [38], we design a spatial attention-aware module to adaptively aggregate information from multi-view inputs. The module effectively captures and merges the geometric details from different viewpoints, which contributes to a significant improvement of results under multi-view setups. Moreover, for multi-person reconstruction, we further combine the attention module with a parametric model, i.e., SMPL, to enhance robustness while maintaining fine-grained details. The SMPL model serves as a 3D geometry proxy which compensates for the missing information where occlusions take place. With the semantic information provided by SMPL, the network is capable of reconstructing complete human bodies even under close interactions. Finally, when dealing with moving characters in video, we propose a temporal fusion method that weights the signed distance field (SDF) across the time domain, which further enhances the temporal consistency of the reconstructed dynamic 3D sequences.

Another pressing problem is that the lack of high-quality scans of multi-person interactive scenarios in the community makes it difficult to accurately evaluate multi-person performance capture systems like ours. To fill this gap and better evaluate the performance of our system, we contribute a novel dataset, MultiHuman, which consists of 150 high-quality scans, each containing 1 to 3 persons in interactive actions (including both natural and close interactions). The dataset is further divided into several categories according to the level of occlusion and the number of persons in the scene, so that a detailed evaluation can be conducted. Experimental results demonstrate the state-of-the-art performance and good generalization capacity of our approach. In summary, the main contributions of this work are as follows:

• We propose a novel framework for high-fidelity multi-view reconstruction of multi-person interactive scenarios. By leveraging human shape and pose priors to resolve the ambiguities introduced by severe occlusions, we achieve state-of-the-art performance even with only partial observations in each view.

• We design an efficient spatial attention-aware module to obtain fine-grained details in multi-view setups, and introduce a novel temporal fusion method to reduce reconstruction inconsistencies for moving characters in video inputs.

• We contribute an extremely high-quality 3D model dataset containing 150 multi-person interacting scenes. The dataset can be used for training and evaluation of related topics in future research.

2. Related Work

Single-view performance capture. Many methods have been proposed to reconstruct detailed geometry from single-view inputs. Typical techniques include silhouette estimation [30], depth estimation [12, 35] and template-based deformation [1, 49, 16]. Moreover, SMPL [28] regression or optimization can be incorporated to generate more reliable and robust outputs, as shown in [48, 47]. Real-time methods can be implemented with the aid of a single depth sensor [43, 44] or by innovating computation and rendering algorithms [24]. Regarding the 3D representations used in these methods, we can split them into two categories: explicit [37, 30, 48] and implicit [33, 34, 19, 20] reconstruction methods. Compared with traditional explicit 3D representations, implicit representations show certain advantages in domain-specific shape learning and detail preservation. For example, PIFu [33] defines the surface as a level set of a function f. Similarly, [19] defines a probability field of surface points, and ARCH [20] predicts a 3D occupancy map. However, all of the methods above focus mainly on single-person reconstruction, and it remains difficult for them to achieve accurate reconstruction in multi-person scenarios.

Multi-view performance capture. Motion capture has been developed to make accurate motion predictions in multi-person interaction scenes [2, 27, 26, 21], and some methods even achieve real-time performance [3, 8, 46]. However, these works only capture skeleton motions instead of detailed geometry. Regarding multi-view geometry reconstruction, previous studies use template-based deformation methods [6, 39, 13], skeleton tracking [39, 13] or fusion-based techniques [10]. Aside from the long computation time, these methods often show deficiencies in mapping textures, handling topology changes or dealing with drastic frame-to-frame motion. Moreover, the aforementioned methods also show limited adaptability for multi-person capture, as they cannot effectively deal with occlusions. Robust, high-quality reconstruction methods often come with prohibitive dependencies and constraints. Some methods depend on dense viewpoints [4, 22] and even controlled lighting [40, 15] to reconstruct detailed geometry.
Another branch of work, multi-view RGBD systems [11, 9], achieves impressive real-time performance capture results even for multi-person scenarios, benefiting from strong depth observations. Note that Huang et al. [19] also present a volumetric capture approach that accomplishes quality results using very sparse-view RGB inputs, but they only focus on single-person reconstruction without considering how to resolve the challenges introduced by multi-person occlusions.

Figure 2: Pipeline of our framework. With estimated SMPL models and segmented multi-view images, we design a spatial attention-aware network and a temporal fusion method to reconstruct each character separately. Robust results with fine-grained details are generated even in closely interactive scenes.

Attention-based networks. Apart from the huge success of the attention mechanism in natural language processing [38, 7], attention-based networks have achieved prominent performance in visual tasks, including image classification [41], image segmentation [45, 42, 25], super-resolution [5], multi-view stereo [29] and hand pose estimation [18]. In these works, the attention mechanism is applied to capture the correlation of embedding features or the contextual relationships of hierarchical structure. In particular, Luo et al. [29] propose an attention-aware network, AttMVS, to synthesize contextual information from multi-view scenes, where an attention-guided regularization module is used for more robust prediction. In [18], Lin et al. design a non-autoregressive transformer to learn the structural correlations among hand joints, which achieves real-time speed and state-of-the-art performance for 3D hand-object pose estimation.

3. Overview

An overview of our approach is illustrated in Figure 2. The input of DeepMultiCap is the segmented single-person multi-view images together with the corresponding SMPL models, and the system outputs the reconstructed 3D humans. The results are combined together directly with no need to modify their relative positions, since the multi-view setting ensures the 3D spatial relationship between different individuals. To obtain the inputs, we first fit SMPL-X [31] models to 3D keypoints estimated from multi-view images by a 4D association algorithm [46]. For multi-person segmentation, we refer to a self-correction method [23] and use SMPL projection maps to track different characters across the multi-view scenes. Finally, the 3D humans are generated by the spatial attention-aware network based on the pixel-aligned implicit function, and further polished by the temporal fusion method when temporal information is available in video inputs, as described in detail in Section 4.

3.1. Preliminary

Our method builds on implicit functions. An implicit function represents the surface of a 3D model as a level set of an occupancy field function F, e.g. F(X) = 0.5. Specifically, PIFu [33] combines 3D points with conditional variables to formulate a pixel-aligned implicit function:

    F(Φ(x, I), z(X)) = s,  s ∈ [0, 1]    (1)

where, for an image I and a given 3D point X, x = Π(X) is the 2D projection coordinate on the image plane, z(X) is the depth value in the camera coordinate space, and Φ(x, I) is the image feature at location x. In PIFu, a multi-layer perceptron (MLP) is trained to fit the implicit function F. To improve the quality of reconstruction results, PIFuHD [34] keeps the original PIFu framework as a coarse-level prediction while adding high-resolution images to a fine-level network:

    F^H(Φ(x, I^H, N^F, N^B), Ω(X)) = s,  s ∈ [0, 1]    (2)

where I^H, N^F, N^B are the high-resolution image and the predicted frontal and back normal maps, and Ω(X) is the 3D embedding extracted from the intermediate features of the coarse level. More detailed human models can be reconstructed with the additional information brought by the increased resolution and the high-frequency details in the normal maps.

For multi-view images, a naive strategy is proposed in PIFu to synthesize multi-view features, i.e., performing mean pooling on the features from the intermediate layer of the MLP. However, this simple method may lead to loss of detail and even collapse in real-world cases, especially when the multi-view features are inconsistent due to the varying depths in different views and occlusions.
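For concreteness, the following is a minimal sketch, not the authors' released code, of a pixel-aligned query in the spirit of Eqn. 1, extended naively to multiple views by mean pooling as discussed above: each 3D point is projected into every view, the image feature map is sampled bilinearly at the projected pixel, and an MLP maps the feature and depth to an occupancy value. The projection model, feature dimensions and the pooling of per-view predictions (PIFu in fact pools intermediate MLP features) are simplifying assumptions.

import torch
import torch.nn.functional as F

def pixel_aligned_query(points, feat_maps, calibs, mlp):
    """Sketch of F(Phi(x, I), z(X)) for a batch of 3D points (Eqn. 1), with naive multi-view mean pooling.

    points:    (B, N, 3) query points in world space
    feat_maps: (B, V, C, H, W) per-view image feature maps
    calibs:    (B, V, 3, 4) projection matrices (treated as affine/orthographic here for simplicity)
    mlp:       maps (C + 1) features to an occupancy value in [0, 1]
    """
    B, V, C, H, W = feat_maps.shape
    occ_per_view = []
    for v in range(V):
        # x = Pi(X): project the points; z(X) is the depth in the camera frame
        homo = torch.cat([points, torch.ones_like(points[..., :1])], dim=-1)        # (B, N, 4)
        cam = torch.einsum('bij,bnj->bni', calibs[:, v], homo)                      # (B, N, 3)
        xy, z = cam[..., :2], cam[..., 2:3]
        # Phi(x, I): bilinear sampling of the feature map (grid_sample expects coords in [-1, 1])
        phi = F.grid_sample(feat_maps[:, v], xy.unsqueeze(2), align_corners=True)   # (B, C, N, 1)
        phi = phi.squeeze(-1).permute(0, 2, 1)                                      # (B, N, C)
        occ_per_view.append(mlp(torch.cat([phi, z], dim=-1)))                       # (B, N, 1)
    # naive multi-view fusion by mean pooling (PIFu averages intermediate MLP features instead)
    return torch.stack(occ_per_view, dim=0).mean(dim=0)

# illustrative usage with random tensors (6 views, 64-channel feature maps)
mlp = torch.nn.Sequential(torch.nn.Linear(64 + 1, 128), torch.nn.ReLU(),
                          torch.nn.Linear(128, 1), torch.nn.Sigmoid())
occ = pixel_aligned_query(torch.rand(1, 1024, 3), torch.rand(1, 6, 64, 128, 128),
                          torch.rand(1, 6, 3, 4), mlp)
print(occ.shape)  # torch.Size([1, 1024, 1])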
Figure 3: Architecture of our attention-aware network. We leverage a two-level coarse-to-fine framework (left) with a multi-view feature fusion module based on self-attention (right). The human body prior SMPL is used in the coarse level to ensure the robustness of reconstruction, and a specially designed SMPL global normal map helps the fine-level network better capture details. To synthesize multi-view features efficiently, we leverage the self-attention mechanism to extract meta information from different observations, which significantly improves the reconstruction quality.

4. Single Person Reconstruction

Reconstruction of a single person from multiple views is a challenging problem. The main concern is to extract the meta information of the observations from different views. To this end, we propose a novel feature fusion module based on the self-attention mechanism, which effectively helps the network become aware of the geometry details visible in multi-view scenes. To tackle the inconsistencies and loss of information brought by occlusions, we combine the attention module with a parametric model to enhance the robustness of reconstruction while preserving fine-grained details. The architecture of our network is illustrated in Figure 3. Following PIFuHD [34], our method builds on a two-level coarse-to-fine framework. The coarse level, conditioned on images and SMPL models, ensures a confident result, and the fine level refines the reconstruction by utilizing high-resolution image feature maps (512 × 512). The results can be further polished by a temporal fusion method when temporal information is available in video inputs. With the proposed spatial attention and temporal fusion framework, the reconstruction remains robust and of high quality.

4.1. Attention-aware Multi-view Feature Fusion

In PIFu [33], the simple strategy for multi-view reconstruction is to average the multi-view feature embeddings from the intermediate layer of the MLP. We argue that this method is not efficient enough to synthesize the geometry details of multi-view scenes, and can lose information. As shown in Figure 4, when this strategy is applied to PIFuHD [34], we obtain a smoother output. In particular, the geometry features may not remain consistent, since the visible regions change from view to view, and the mean pooling method cannot handle these cases effectively.

To capture correlations between different views, inspired by [38], we propose a multi-view feature fusion method based on the self-attention mechanism. The detailed architecture of the module is illustrated in Figure 3. Firstly, the input multi-view features φ_m are embedded with three different linear layers, and self-attention is applied:

    attention(φ_q, φ_s, φ_t) = softmax(φ_q^T φ_s / √d_k) φ_t    (3)

where φ_q = φ_m W_q, φ_s = φ_m W_s, φ_t = φ_m W_t are the query feature, source feature and target feature embedded by linear weights, and d_k is the embedding size. The dot-product result is divided by √d_k to prevent the vanishing gradient problem.

Multi-head attention is used in our method, i.e., W_q, W_s, W_t ∈ R^{n_head × d_k} encode the multi-view features into n_head different embedding subspaces, which allows the model to notice the different geometry patterns across the views jointly. The weights of the views in the target feature φ_t are obtained through the softmax function by calculating the similarity between views in the query feature φ_q and the source feature φ_s. Confident observations in each view tend to receive large weights and are maintained, while invisible regions lead to small weights and have little influence on the outputs.

We stack the linear and attention layers to form the self-attention encoder as proposed in [38]. Finally, the meta-view prediction is generated as:

    F^T(X) = g^T(T(φ_m))    (4)

where φ_m is the multi-view features, T(φ_m) is the feature output of the self-attention encoder, and the implicit function g^T predicts the occupancy field. The output meta-view feature is expected to contain the global spatial information. As demonstrated in Figure 4, when combining the attention module with PIFuHD [34], we are able to capture and preserve details with an increasing number of observations.
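As a concrete reference for Eqns. 3 and 4, the following is a minimal single-head sketch, not the authors' implementation: the per-view features of a query point are embedded by three linear layers, fused with scaled dot-product attention, and reduced to one meta-view feature that is then fed to the implicit-function MLP. The paper stacks multi-head attention layers as in [38]; the layer sizes and the final mean over views are illustrative assumptions.

import torch
import torch.nn as nn

class MultiViewAttentionFusion(nn.Module):
    """Fuse per-view pixel-aligned features with self-attention (Eqn. 3), single-head sketch."""

    def __init__(self, feat_dim=256, embed_dim=128):
        super().__init__()
        self.Wq = nn.Linear(feat_dim, embed_dim, bias=False)  # query embedding
        self.Ws = nn.Linear(feat_dim, embed_dim, bias=False)  # source (key) embedding
        self.Wt = nn.Linear(feat_dim, embed_dim, bias=False)  # target (value) embedding
        self.scale = embed_dim ** 0.5

    def forward(self, phi_m):
        # phi_m: (N, V, feat_dim) -- features of N query points observed in V views
        q, s, t = self.Wq(phi_m), self.Ws(phi_m), self.Wt(phi_m)
        # attention(phi_q, phi_s, phi_t) = softmax(phi_q phi_s^T / sqrt(d_k)) phi_t
        attn = torch.softmax(q @ s.transpose(-2, -1) / self.scale, dim=-1)   # (N, V, V)
        fused = attn @ t                                                     # (N, V, embed_dim)
        # reduce the view axis to one meta-view feature per point
        # (how the encoder output is pooled is an implementation choice in this sketch)
        return fused.mean(dim=1)

fusion = MultiViewAttentionFusion()
meta = fusion(torch.rand(1024, 6, 256))   # 1024 query points observed from 6 views
print(meta.shape)                         # torch.Size([1024, 128])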
4.2. Embedded with Parametric Body Model

Although the attention-aware feature fusion module is effective in synthesizing details from multiple views, without auxiliary 3D information the network struggles to make a reasonable prediction when information is lost due to occlusions. To address this limitation, we combine the strength of the attention mechanism and parametric models.

Figure 4: When extending PIFuHD [34] to multiple views, the mean pooling method (left) leads to smoother outputs, while the attention module (right) helps to preserve details.

A parametric body model, e.g., SMPL, contains the pose and shape information of human bodies. The semantic feature of SMPL is extracted by a 3D convolution network for geometry inference. To improve the efficiency of the attention module, inspired by the positional encoding introduced in [38], we further design an informative view representation by rendering SMPL global normal maps. The global maps offer guidance for the network to identify the particular visible body parts in multi-view observations, so that geometry features can be synthesized for the corresponding parts. Specifically, to render the global normal maps, SMPL is transformed to the canonical model coordinate system, where RGB color is obtained from the normal vector and the standard rendering procedure can be applied. In multi-person scenes, although the images of a single person can be fragmentary due to occlusions, the extra information provided by SMPL compensates for the missing parts and remains consistent across different views, which significantly improves the quality and robustness of the reconstruction results.

With SMPL, we rewrite the two-level pixel-aligned functions in Eqn. 1 and Eqn. 2. The coarse level has the formulation:

    F^L(X) = g^L(Φ^L(x, I), Ψ(X, V_M))    (5)

and the fine level:

    F^H(X) = g^H(Φ^H(x, I, N^F, N^{V_M}), Ω^L(X))    (6)

where V_M is the volumetric representation of SMPL, Ψ(X, V_M) is the SMPL semantic feature, N^F, N^{V_M} are the predicted frontal normal map and the rendered SMPL global normal map under the camera view, and Ω^L(X) is the 3D embedding from the coarse level. Combined with the attention module, the multi-view features φ_m in Eqn. 4 can be obtained by concatenating image features with the 3D embeddings, as illustrated in Figure 3.

4.3. Temporal Fusion

For moving characters in video inputs, inconsistencies can arise between consecutive frames due to the change of visible parts. To address this limitation, we propose a simple temporal fusion method. Suppose p_{i,t} is a vertex of the reconstructed mesh at time t; we first calculate its blending weight by:

    W_{t,i} = Σ_{j ∈ N_{t,i}} (w_{t,j→i} / w_{t,i}) W_j    (7)

where N_{t,i} is the set of nearest SMPL vertices of p_{i,t}, W_j is the blending weight of SMPL vertex v_{j,t}, and

    w_{j→i} = exp(−‖p_{i,t} − v_{j,t}‖ / σ²),  w_i = Σ_{j ∈ N_{t,i}} w_{j→i}    (8)

is the weight of vertex v_{j,t}. Given the estimated SMPL at times t and t', the reconstructed vertices V_t can then be warped to time t' through standard blend skinning:

    V_{t'←t} = W(W^{−1}(V_t, J(β_t), θ_t, W_t), J(β_{t'}), θ_{t'}, W_t)    (9)

where W refers to the skinning procedure and J, β, θ are the SMPL parameters. With the warped meshes, we calculate the signed distance field (SDF) and perform mean pooling to generate temporally continuous reconstructions:

    S_fusion(t) = (1/h) Σ_{t' ∈ F_t} S(t ← t')    (10)

where S denotes the SDF and F_t is a sliding time window of size h. In our approach, h is set to 3 to obtain consistent results while maintaining details.
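A rough sketch of the temporal fusion step follows, covering Eqns. 7, 8 and 10 only; it assumes the SDFs of neighboring frames have already been warped into frame t via the blend skinning of Eqn. 9, and the nearest-neighbor count, σ and array shapes are illustrative rather than taken from the paper.

import numpy as np
from scipy.spatial import cKDTree

def interpolate_blend_weights(verts, smpl_verts, smpl_weights, k=4, sigma=0.05):
    """Eqns. 7-8 sketch: blending weights of reconstructed vertices from their nearest SMPL vertices."""
    # N_{t,i}: the k nearest SMPL vertices of each reconstructed vertex p_{i,t}
    nn_dist, nn_idx = cKDTree(smpl_verts).query(verts, k=k)      # (N, k), (N, k)
    w = np.exp(-nn_dist / sigma ** 2)                            # w_{j->i} = exp(-||p_{i,t} - v_{j,t}|| / sigma^2)
    w /= w.sum(axis=1, keepdims=True)                            # divide by w_i = sum_j w_{j->i}
    return np.einsum('nk,nkc->nc', w, smpl_weights[nn_idx])      # W_{t,i}, shape (N, n_joints)

def fuse_sdf(warped_sdfs):
    """Eqn. 10 sketch: average the SDFs warped into frame t over the sliding window F_t."""
    return np.mean(np.stack(warped_sdfs, axis=0), axis=0)

# illustrative usage with random data (6890 SMPL-like vertices, 24 joints, window size h = 3)
verts = np.random.rand(1000, 3)
smpl_verts = np.random.rand(6890, 3)
smpl_weights = np.random.dirichlet(np.ones(24), size=6890)       # each row sums to 1
W = interpolate_blend_weights(verts, smpl_verts, smpl_weights)
sdf_t = fuse_sdf([np.random.randn(64, 64, 64) for _ in range(3)])
print(W.shape, sdf_t.shape)                                      # (1000, 24) (64, 64, 64)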
5. Extend to Multi-person Reconstruction

Multi-person reconstruction is implemented by reconstructing each individual separately. The key challenge is to train the network so that it remains robust against the occlusions of interactive scenes. To this end, we utilize several strategies during training. We first collect 1700 single-human models from Twindom (https://web.twindom.com/) to construct a large-scale dataset. To simulate multi-person cases, we render images via taichi [17] and randomly project other persons onto the masks, so that various situations can be generated, from non-occluded to heavily occluded scenes. Besides, to help the network become aware of visible details and leverage the SMPL information for robust reconstruction, we use a sampling method based on the visibility of points. The input points during training are sampled from a Gaussian distribution centered at surface points with standard deviation σ, as introduced in [33]. We further choose a small standard deviation σ_0 for visible points, to guide the network to learn fine-grained geometry details, and a larger σ_1 for invisible points, to avoid unreasonable predictions; we find this contributes to the improvement of performance in occluded scenes.
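The visibility-based sampling described above can be sketched as follows; this is a simplified illustration rather than the released training code, the visibility test itself is assumed to be given, and σ_0, σ_1 and the sample count are placeholders.

import numpy as np

def sample_training_points(surface_pts, visible_mask, sigma_visible=0.01, sigma_occluded=0.05,
                           n_samples=5000, seed=0):
    """Visibility-dependent Gaussian sampling around ground-truth surface points (Section 5 sketch).

    surface_pts:  (M, 3) points on the ground-truth surface
    visible_mask: (M,) bool, True if the point is visible in the occluded multi-view renderings
    Returns (n_samples, 3) query points for occupancy supervision.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(surface_pts), size=n_samples)
    # small sigma_0 near visible regions (fine details), larger sigma_1 near occluded regions
    sigma = np.where(visible_mask[idx], sigma_visible, sigma_occluded)[:, None]
    return surface_pts[idx] + rng.normal(size=(n_samples, 3)) * sigma

# illustrative usage
pts = np.random.rand(10000, 3)
vis = np.random.rand(10000) > 0.3
print(sample_training_points(pts, vis).shape)  # (5000, 3)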
6. Dataset and Experiments

6.1. MultiHuman Dataset

Since no multi-person dataset is available to evaluate our method, we propose a high-quality 3D model dataset, MultiHuman, which is collected using a dense camera rig equipped with 128 DSLRs and commercial photogrammetry software. The dataset contains 150 multi-person static scenes with 278 characters in total, mostly university students wearing casual clothes, dresses, etc. Each scene contains 1 to 3 persons, and each model consists of about 300,000 triangles.

To evaluate the proposed approach, we divide the dataset into different categories by the level of occlusion and the number of persons. In particular, we split the dataset into 30 single-human scenes, 18 occluded single-human scenes (occluded by different objects), 46 naturally interacting two-person scenes, 30 closely interacting two-person scenes, and 26 scenes with three persons. For more examples from the MultiHuman dataset, please refer to the supplementary materials.

Figure 5: Performance on the ZJU-MoCap dataset [32]. Our method outperforms state-of-the-art approaches including DeepVisualHull [19], PIFuHD [34] and Neural Body [32].

6.2. Evaluation

Performance on MultiHuman. We compare our method with current state-of-the-art approaches, i.e., PIFu [33], PIFuHD [34] and PaMIR [47] (PIFu + SMPL). All the methods are trained with the same strategies on the dataset described in Section 5. For PIFuHD, the backside normal maps are not used in our implementation, and multi-view features are fused by mean pooling as introduced in [33, 47]. During testing, the ground-truth models are normalized to 180 centimeters in height and we render 6-view images as the inputs. The point-to-surface (P2S) distance and the Chamfer distance between the reconstruction and the ground-truth geometry are used as evaluation metrics. Quantitative results are shown in Table 1. As occlusions intensify with an increasing number of persons and interacting elements, the errors of prior methods grow rapidly while ours remain competitive. Qualitative results are illustrated in Figure 6, which indicates the prominence of our method and the large gap between prior works and ours in handling occlusions in multi-person scenes. Our method is able to reconstruct highly detailed 3D humans robustly even in closely interactive scenes.
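For reference, the following is one common way to approximate the point-to-surface and Chamfer distances reported in Table 1, computed on point clouds sampled from the reconstructed and ground-truth meshes; the exact sampling density, normalization and units follow the evaluation protocol rather than this sketch.

import numpy as np
from scipy.spatial import cKDTree

def p2s_and_chamfer(recon_pts, gt_pts):
    """Point-to-surface (reconstruction to ground truth) and symmetric Chamfer distance, point-cloud approximation."""
    d_recon_to_gt = cKDTree(gt_pts).query(recon_pts)[0]    # distance of each reconstructed point to the GT cloud
    d_gt_to_recon = cKDTree(recon_pts).query(gt_pts)[0]    # distance of each GT point to the reconstructed cloud
    p2s = d_recon_to_gt.mean()
    chamfer = 0.5 * (d_recon_to_gt.mean() + d_gt_to_recon.mean())
    return p2s, chamfer

# illustrative usage with random point clouds
print(p2s_and_chamfer(np.random.rand(10000, 3), np.random.rand(10000, 3)))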
Performance on Real World Data. We evaluate our method on the ZJU-MoCap dataset [32], a multi-view real-world dataset, comparing against DeepVisualHull [19], a volumetric performance capture method for sparse multi-view inputs, PIFuHD [34], and Neural Body [32], a differentiable rendering method trained directly on the test image sequence. For inference, we re-implement DeepVisualHull and use the released code and pretrained models of PIFuHD and Neural Body. Figure 5 shows the state-of-the-art performance of our method on this benchmark. Reconstructions from real-world images (6 views for our data and 8 views for the CMU dataset [22]) are demonstrated in Figure 1. For more results, please refer to our supplementary video.

6.3. Ablation Study

This section aims to identify the factors that contribute to the prominence of our method. We achieve state-of-the-art performance mainly by leveraging a self-attention network combined with SMPL, together with a temporal fusion method for consistent results. We now examine how these components improve the reconstruction in different situations.

Variant 1: Self-attention module. We design a self-attention module to better capture details from different observations. To assess the strength of our multi-view feature fusion method, we combine the attention module with PIFu [33] (PIFu + Att) and PIFuHD [34] (PIFuHD + Att), and further evaluate our method without the module (replaced by mean pooling). Quantitative results in Table 1 show that the module benefits the baseline models in both non-occluded and occluded scenes. PIFuHD with the attention module even outperforms ours on single-human reconstruction, since the limitations brought by SMPL (Section 6.4) can lead to lower accuracy for our method. For PIFu the improvement is marginal, indicating that the module is more effective at merging multi-view features when detailed geometry information is offered by image normal maps. For our method, we lose the competitive performance without the module. Qualitative examples in Figure 4 further demonstrate how the module helps the baseline model maintain geometry details as the number of views increases.
Figure 6: Reconstruction results on the MultiHuman dataset for single-person, occluded single-person, two-person naturally interacting, two-person closely interacting and three-person scenes (top to bottom). Our method (e) generates robust and highly detailed humans, significantly narrowing the gap between the ground truth (f) and the performance of current state-of-the-art methods.

Variant 2: Use of SMPL. SMPL is used in our method as a 3D proxy that helps the network generate a reasonable output, and we further design a SMPL global normal map (described in Section 4.2) to improve the robustness of reconstruction against occlusions while preserving details. The huge gap between PaMIR [47] and ours indicates that SMPL is not the only factor contributing to our advantages. Table 1 also shows the performance of our method without the designed global maps (Ours w/o SN). The results demonstrate lower reconstruction accuracy, which implies the effectiveness of the global maps as a visual reference that guides the attention network in merging multi-view information.

Figure 7: Our method generates robust results even when parts of the human are not visible in closely interactive scenes.
Table 1: Quantitative evaluation on the MultiHuman dataset (Chamfer distance / point-to-surface distance, lower is better). We compare our method with PIFu [33], PIFuHD [34] and PaMIR [47] using the mean pooling feature fusion of [33], and with several alternatives: PIFu + Att (attention module), PIFuHD + Att, our method without attention (w/o Att) and our method without the SMPL global normal maps (w/o SN).

Method                     | Single         | Occluded single | Two (natural)   | Two (close)     | Three
                           | Chamfer  P2S   | Chamfer  P2S    | Chamfer  P2S    | Chamfer  P2S    | Chamfer  P2S
PIFu (Mview + Mean) [33]   | 1.131   1.220  | 1.402   1.522   | 1.578   1.620   | 1.745   1.831   | 1.780   1.564
PIFuHD (Mview + Mean) [34] | 0.914   0.948  | 1.365   1.406   | 1.353   1.376   | 1.614   1.655   | 1.814   1.496
PaMIR (Mview + Mean) [47]  | 1.173   1.113  | 1.362   1.309   | 1.227   1.110   | 1.400   1.198   | 1.414   1.281
PIFu (Mview + Att)         | 1.054   1.174  | 1.343   1.479   | 1.566   1.605   | 1.773   1.845   | 1.541   1.383
PIFuHD (Mview + Att)       | 0.845   0.867  | 1.195   1.189   | 1.278   1.272   | 1.515   1.450   | 1.468   1.287
Ours (w/o Att)             | 1.015   0.967  | 1.251   1.181   | 1.020   0.925   | 1.264   1.088   | 1.309   1.246
Ours (w/o SN)              | 1.063   1.017  | 1.277   1.233   | 1.126   0.989   | 1.357   1.141   | 1.334   1.155
Ours                       | 0.895   0.887  | 1.041   1.021   | 0.956   0.927   | 1.134   1.067   | 1.130   1.078

Variant 3: Temporal fusion. Figure 8 illustrates the results of our method with and without temporal fusion on a real-world image sequence. The temporal fusion method further enhances the consistency of the reconstruction, which can be seen more clearly in our supplementary video.

Figure 8: Performance on a real-world image sequence. (b) shows the original per-frame results; (c) shows the results polished by temporal fusion.

Robustness against occlusions. Quantitative results (Table 1) show that our method maintains high accuracy as occlusions increase. Figure 7 illustrates that when body parts are invisible due to heavy occlusion, smooth results are still generated.

6.4. Limitations

Since we use SMPL as a 3D reference, our method cannot reconstruct objects other than humans. For challenging clothes, Figure 9 demonstrates that we are able to reconstruct a tight dress, while for loose clothing such as a wind coat, the reconstruction can be unstable.

Figure 9: Reconstruction of challenging clothes.

Besides, our method relies on an accurately fitted SMPL, i.e., an SMPL body lying within the correct corresponding region. An inaccurate SMPL can lead to artifacts and failure cases (Figure 10).

Figure 10: Failure case with inaccurate SMPL input. Our method is misled by the incorrect SMPL information.

7. Discussion and Future Works

Though our method is capable of reconstructing multiple persons from real-world images, we rely on SMPL as a 3D reference.
The camera parameters are required to estimate the 3D keypoints and fit the SMPL models. Besides, the attention network and image encoders are extremely memory-consuming, which restricts the inference efficiency. Future work can focus on network pruning to achieve real-time inference, and on designing more sophisticated approaches that are free from SMPL, which will make the system more widely applicable.

References

[1] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. Tex2shape: Detailed full human body geometry from a single image. In ICCV, 2019.
[2] Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, and Slobodan Ilic. 3d pictorial structures for multiple human pose estimation. In CVPR, pages 1669–1676, 2014.
[3] Lewis Bridgeman, Marco Volino, Jean-Yves Guillemaut, and Adrian Hilton. Multi-person 3d pose estimation and tracking in sports. In CVPR, 2019.
[4] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (TOG), 34(4):1–13, 2015.
[5] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In CVPR, pages 11065–11074, 2019.
[6] Edilson De Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. In ACM SIGGRAPH 2008 Papers, pages 1–10, 2008.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] Junting Dong, Wen Jiang, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Fast and robust multi-person 3d pose estimation from multiple views. In CVPR, pages 7792–7801, 2019.
[9] Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. Motion2fusion: Real-time volumetric performance capture. ACM Transactions on Graphics (TOG), 36(6):1–16, 2017.
[10] Mingsong Dou, Henry Fuchs, and Jan-Michael Frahm. Scanning and tracking dynamic objects with commodity depth cameras. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 99–106. IEEE, 2013.
[11] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG), 35(4):1–13, 2016.
[12] Valentin Gabeur, Jean-Sébastien Franco, Xavier Martin, Cordelia Schmid, and Gregory Rogez. Moulding humans: Non-parametric 3d human shape estimation from single images. In ICCV, pages 2232–2241, 2019.
[13] Juergen Gall, Carsten Stoll, Edilson De Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter Seidel. Motion capture using joint skeleton tracking and surface estimation. In CVPR, pages 1746–1753. IEEE, 2009.
[14] Andrew Gilbert, Marco Volino, John P. Collomosse, and Adrian Hilton. Volumetric performance capture from minimal camera viewpoints. In ECCV, volume 11215, pages 591–607. Springer, 2018.
[15] Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen, Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. The relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (TOG), 38(6):1–19, 2019.
[16] Marc Habermann, Weipeng Xu, Michael Zollhofer, Gerard Pons-Moll, and Christian Theobalt. Deepcap: Monocular human performance capture using weak supervision. In CVPR, pages 5052–5063, 2020.
[17] Yuanming Hu, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. Taichi: A language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics (TOG), 38(6):201, 2019.
[18] Lin Huang, Jianchao Tan, Jingjing Meng, Ji Liu, and Junsong Yuan. Hot-net: Non-autoregressive transformer for 3d hand-object pose estimation. In Proceedings of the 28th ACM International Conference on Multimedia, pages 3136–3145, 2020.
[19] Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe LeGendre, Linjie Luo, Chongyang Ma, and Hao Li. Deep volumetric video from very sparse multi-view performance capture. In ECCV, pages 336–354, 2018.
[20] Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. Arch: Animatable reconstruction of clothed humans. In CVPR, pages 3093–3102, 2020.
[21] H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. Godisart, B. Nabbe, I. Matthews, et al. Panoptic studio: A massively multiview system for social interaction capture. TPAMI, 41(1):190–204, 2019.
[22] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In CVPR, pages 8320–8329. IEEE, 2018.
[23] Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self-correction for human parsing. arXiv preprint arXiv:1910.09777, 2019.
[24] Ruilong Li, Yuliang Xiu, Shunsuke Saito, Zeng Huang, Kyle Olszewski, and Hao Li. Monocular real-time volumetric performance capture. In ECCV, 2020.
[25] Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In CVPR, pages 7026–7035, 2019.
[26] Yebin Liu, Juergen Gall, Carsten Stoll, Qionghai Dai, Hans-Peter Seidel, and Christian Theobalt. Markerless motion capture of multiple characters using multiview image segmentation. TPAMI, 35(11):2720–2735, 2013.
[27] Yebin Liu, Carsten Stoll, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. Markerless motion capture of interacting characters using multi-view image segmentation. In CVPR, pages 1249–1256. IEEE, 2011.
[28] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015.
[29] Keyang Luo, Tao Guan, Lili Ju, Yuesong Wang, Zhuo Chen, and Yawei Luo. Attention-aware multi-view stereo. In CVPR, pages 1590–1599, 2020.
[30] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo Morishima. Siclope: Silhouette-based clothed people. In CVPR, pages 4480–4490, 2019.
[31] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, pages 10975–10985, 2019.
[32] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. arXiv preprint arXiv:2012.15838, 2020.
[33] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, 2019.
[34] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In CVPR, 2020.
[35] David Smith, Matthew Loper, Xiaochen Hu, Paris Mavroidis, and Javier Romero. Facsimile: Fast and accurate scans from an image in less than a second. In ICCV, pages 5330–5339, 2019.
[36] Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applications, 27(3):21–31, 2007.
[37] Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3d human body shapes. In ECCV, pages 20–36, 2018.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[39] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH 2008 Papers, pages 1–9, 2008.
[40] Daniel Vlasic, Pieter Peers, Ilya Baran, Paul Debevec, Jovan Popović, Szymon Rusinkiewicz, and Wojciech Matusik. Dynamic shape capture using multi-view photometric stereo. In ACM SIGGRAPH Asia 2009 Papers, pages 1–11, 2009.
[41] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In CVPR, pages 3156–3164, 2017.
[42] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, pages 1857–1866, 2018.
[43] Tao Yu, Kaiwen Guo, Feng Xu, Yuan Dong, Zhaoqi Su, Jianhui Zhao, Jianguo Li, Qionghai Dai, and Yebin Liu. Bodyfusion: Real-time capture of human motion and surface geometry using a single depth camera. In ICCV, pages 910–919, 2017.
[44] Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, and Yebin Liu. Doublefusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In ICCV, pages 7287–7296, 2018.
[45] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, pages 7151–7160, 2018.
[46] Yuxiang Zhang, Liang An, Tao Yu, Xiu Li, Kun Li, and Yebin Liu. 4d association graph for realtime multi-person motion capture using multiple video cameras. In CVPR, pages 1324–1333, 2020.
[47] Zerong Zheng, Tao Yu, Yebin Liu, and Qionghai Dai. Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. arXiv preprint arXiv:2007.03858, 2020.
[48] Zerong Zheng, Tao Yu, Yixuan Wei, Qionghai Dai, and Yebin Liu. Deephuman: 3d human reconstruction from a single image. In ICCV, pages 7739–7749, 2019.
[49] Hao Zhu, Xinxin Zuo, Sen Wang, Xun Cao, and Ruigang Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. In CVPR, pages 4491–4500, 2019.