DEEPMULTICAP: PERFORMANCE CAPTURE OF MULTIPLE CHARACTERS USING SPARSE MULTIVIEW CAMERAS
Yang Zheng1,*, Ruizhi Shao1,*, Yuxiang Zhang1, Tao Yu1, Zerong Zheng1, Qionghai Dai1, and Yebin Liu1
1Tsinghua University
arXiv:2105.00261v1 [cs.CV] 1 May 2021
*Equal contribution. Project page: http://liuyebin.com/dmc/dmc.html

Figure 1: Given only sparse multi-view RGB videos (6 views for the left and middle examples, 8 views for the right), our method reconstructs various kinds of 3D shapes with temporally varying surface details, even under the challenging occlusions of multi-person interactive scenarios.

Abstract

We propose DeepMultiCap, a novel method for multi-person performance capture using sparse multi-view cameras. Our method can capture time-varying surface details without pre-scanned template models. To tackle the severe occlusions of closely interacting scenes, we combine a recently proposed pixel-aligned implicit function with a parametric model for robust reconstruction of the invisible surface areas. An effective attention-aware module is designed to obtain fine-grained geometry details from multi-view images, from which high-fidelity results can be generated. In addition to this spatial attention method, for video inputs we further propose a novel temporal fusion method to alleviate noise and temporal inconsistencies in moving character reconstruction. For quantitative evaluation, we contribute a high-quality multi-person dataset, MultiHuman, which consists of 150 static scenes with different levels of occlusion and ground-truth 3D human models. Experimental results demonstrate the state-of-the-art performance of our method and its good generalization to real multi-view video data, outperforming prior works by a large margin.

1. Introduction

Recent years have witnessed great progress in vision-based human performance capture, which promises to enable various applications (e.g., tele-presence, sportscasting, gaming and mixed reality) with enhanced interactive and immersive experiences. To achieve highly detailed geometry and texture reconstruction, dense camera rigs, sometimes equipped with sophisticated lighting systems, have been introduced [40, 4, 22, 21, 15]. However, these extremely expensive and professional setups limit their popularity. Although other light-weight multi-view human performance capture systems have achieved impressive results even in real time, they still rely on pre-scanned templates [27, 26] or custom-designed RGBD sensors [11, 9], or are limited to single-person reconstruction [36, 13, 19, 14].

Benefiting from the fast improvement of deep implicit functions for 3D representation, recent methods [33, 34, 24] are able to recover 3D body shape from only a single RGB image. Compared with voxel-based [37, 19, 48] or mesh-based [30, 1] representations, an implicit function guides deep learning models to capture geometric details in a more efficient way. Specifically, PIFu [33, 24] achieves plausible single-human reconstruction using only RGB images, and PIFuHD [34] further utilizes normal maps and high-resolution images to generate more detailed results.
Despite their prominent performance in digitizing the 3D human body, both PIFu [33] and PIFuHD [34] suffer from several drawbacks when the frameworks are extended to multi-person scenarios and multi-view setups. Firstly, the average-pooling-based multi-view feature fusion strategy in PIFu leads to over-smoothed outputs when high-frequency details are included in the multi-view features, i.e., the normal maps used in PIFuHD. More importantly, in both approaches, good reconstruction results are only guaranteed for ideal input images without the severe occlusions of multi-person performance capture scenarios. The reconstruction performance of [33, 34] deteriorates significantly due to the lack of observations caused by severe occlusions.

To address the aforementioned problems, we propose a novel framework for multi-person reconstruction from multi-view images. First of all, inspired by [38], we design a spatial attention-aware module to adaptively aggregate information from multi-view inputs. The module effectively captures and merges the geometric details from different viewpoints, which contributes to a significant improvement of results under multi-view setups. Moreover, for multi-person reconstruction, we further combine the attention module with a parametric model, i.e., SMPL, to enhance robustness while maintaining fine-grained details. The SMPL model serves as a 3D geometry proxy which compensates for the missing information where occlusions take place. With the semantic information provided by SMPL, the network is capable of reconstructing complete human bodies even under close interactions. Finally, when dealing with moving characters in video, we propose a temporal fusion method that weights the signed distance field (SDF) across the time domain, which further enhances the temporal consistency of the reconstructed dynamic 3D sequences.

Another pressing problem is that the lack of high-quality scans of multi-person interactive scenarios in the community makes it difficult to accurately evaluate multi-person performance capture systems like ours. To fill this gap and better evaluate the performance of our system, we contribute a novel dataset, MultiHuman, which consists of 150 high-quality scans, each containing 1 to 3 persons in interactive actions (including both natural and close interactions). The dataset is further divided into several categories according to the level of occlusion and the number of persons in the scene, so that a detailed evaluation can be conducted. Experimental results demonstrate the state-of-the-art performance and good generalization capacity of our approach. In summary, the main contributions of this work are as follows:

• We propose a novel framework for high-fidelity multi-view reconstruction of multi-person interactive scenarios. By leveraging human shape and pose priors to resolve the ambiguities introduced by severe occlusions, we achieve state-of-the-art performance even with only partial observations in each view.

• We design an efficient spatial attention-aware module to obtain fine-grained details in multi-view setups, and introduce a novel temporal fusion method to reduce reconstruction inconsistencies for moving characters in video inputs.

• We contribute an extremely high-quality 3D model dataset containing 150 multi-person interacting scenes. The dataset can be used for training and evaluation of related topics in future research.

2. Related Work

Single-view performance capture. Many methods have been proposed to reconstruct detailed geometry from single-view inputs. Typical techniques include silhouette estimation [30], depth estimation [12, 35] and template-based deformation [1, 49, 16]. Moreover, SMPL [28] regression or optimization can be incorporated to generate more reliable and robust outputs, as shown in [48, 47]. Real-time methods can be implemented with the aid of a single depth sensor [43, 44] or by innovating computation and rendering algorithms [24]. Regarding the 3D representations used in these methods, we can split them into two categories: explicit [37, 30, 48] and implicit [33, 34, 19, 20] reconstruction methods. Compared with traditional explicit 3D representations, implicit representations show certain advantages in domain-specific shape learning and detail preservation. For example, PIFu [33] defines the surface as a level set of a function f. Similarly, [19] defines a probability field of surface points, and ARCH [20] predicts a 3D occupancy map. However, all of the methods above focus mainly on single-person reconstruction, and it remains difficult for them to achieve accurate reconstruction in multi-person scenarios.

Multi-view performance capture. Motion capture has been developed to make accurate motion predictions in multi-person interaction scenes [2, 27, 26, 21], and some methods even achieve real-time performance [3, 8, 46]. However, these works only capture skeleton motions instead of detailed geometry. Regarding multi-view geometry reconstruction, previous studies use template-based deformation methods [6, 39, 13], skeleton tracking [39, 13] or fusion-based techniques [10]. Aside from the long computation time, these methods often show deficiencies in mapping textures, handling topology changes or dealing with drastic frame-to-frame motion. Moreover, the aforementioned methods also show limited adaptability for multi-person capture, as they cannot effectively deal with occlusions. Robust, high-quality reconstruction methods often come with prohibitive dependencies and constraints. Some methods depend on dense viewpoints [4, 22] and even controlled lighting [40, 15] to reconstruct detailed geometry.
Another branch of work, multi-view RGBD systems [11, 9], achieves impressive real-time performance capture results even for multi-person scenarios, benefiting from strong depth observations. Note that Huang et al. [19] also present a volumetric capture approach that accomplishes quality results using very sparse-view RGB inputs, but they only focus on single-person reconstruction without considering how to resolve the challenges introduced by multi-person occlusions.

Figure 2: Pipeline of our framework. With estimated SMPL models and segmented multi-view images, we design a spatial attention-aware network and a temporal fusion method to reconstruct each character separately. Robust results with fine-grained details are generated even in closely interactive scenes.

Attention-based networks. Apart from the huge success of the attention mechanism in natural language processing [38, 7], attention-based networks have achieved prominent performance in visual tasks, including image classification [41], image segmentation [45, 42, 25], super-resolution [5], multi-view stereo [29] and hand pose estimation [18]. In these works, the attention mechanism is applied to capture the correlation of embedding features or the contextual relationships of hierarchical structure. In particular, Luo et al. [29] propose an attention-aware network, AttMVS, to synthesize contextual information from multi-view scenes, where an attention-guided regularization module is used for more robust prediction. In [18], Lin et al. design a non-autoregressive transformer to learn the structural correlations among hand joints, which achieves real-time speed and state-of-the-art performance for 3D hand-object pose estimation.

3. Overview

An overview of our approach is illustrated in Figure 2. The input of DeepMultiCap is the segmented single-person multi-view images together with the corresponding SMPL models, and the system outputs the reconstructed 3D humans. The results are combined together directly with no need to modify their relative positions, since the multi-view setting ensures the 3D spatial relationship between different individuals. To obtain the inputs, we first fit SMPL-X [31] models to 3D keypoints estimated from multi-view images by a 4D association algorithm [46]. For multi-person segmentation, we refer to a self-correction method [23] and use SMPL projection maps to track different characters across the multi-view scenes. Finally, the 3D humans are generated by the spatial attention-aware network based on the pixel-aligned implicit function, and further polished by the temporal fusion method when temporal information is available in video inputs, as described in detail in Section 4.

3.1. Preliminary

Our method builds on implicit functions. An implicit function represents the surface of a 3D model as a level set of an occupancy field function F, e.g. F(X) = 0.5. Specifically, PIFu [33] combines 3D points with conditional variables to formulate a pixel-aligned implicit function:

    F(Φ(x, I), z(X)) = s,  s ∈ [0, 1]    (1)

where, for an image I and a given 3D point X, x = Π(X) is the 2D projection coordinate on the image plane, z(X) is the depth value in the camera coordinate space, and Φ(x, I) is the image feature at location x. In PIFu, a multi-layer perceptron (MLP) is trained to fit the implicit function F. To improve the quality of reconstruction results, PIFuHD [34] keeps the original PIFu framework as a coarse-level prediction while adding high-resolution images to a fine-level network:

    F^H(Φ(x, I^H, N^F, N^B), Ω(X)) = s,  s ∈ [0, 1]    (2)

where I^H, N^F, N^B are the high-resolution image and the predicted frontal and back normal maps, and Ω(X) is the 3D embedding extracted from the intermediate features of the coarse level. More detailed human models can be reconstructed with the additional information brought by the increased resolution and the high-frequency details in the normal maps.

For multi-view images, a naive strategy is proposed in PIFu to synthesize multi-view features, i.e., performing mean pooling on the features from the intermediate layer of the MLP. However, this simple method may lead to loss of detail and even collapse in real-world cases, especially when the multi-view features are inconsistent due to the varying depths in different views and occlusions.
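For concreteness, the following is a minimal sketch, not the authors' released code, of a pixel-aligned query in the spirit of Eqn. 1, extended naively to multiple views by mean pooling as discussed above: each 3D point is projected into every view, the image feature map is sampled bilinearly at the projected pixel, and an MLP maps the feature and depth to an occupancy value. The projection model, feature dimensions and the pooling of per-view predictions (PIFu in fact pools intermediate MLP features) are simplifying assumptions.

import torch
import torch.nn.functional as F

def pixel_aligned_query(points, feat_maps, calibs, mlp):
    """Sketch of F(Phi(x, I), z(X)) for a batch of 3D points (Eqn. 1), with naive multi-view mean pooling.

    points:    (B, N, 3) query points in world space
    feat_maps: (B, V, C, H, W) per-view image feature maps
    calibs:    (B, V, 3, 4) projection matrices (treated as affine/orthographic here for simplicity)
    mlp:       maps (C + 1) features to an occupancy value in [0, 1]
    """
    B, V, C, H, W = feat_maps.shape
    occ_per_view = []
    for v in range(V):
        # x = Pi(X): project the points; z(X) is the depth in the camera frame
        homo = torch.cat([points, torch.ones_like(points[..., :1])], dim=-1)        # (B, N, 4)
        cam = torch.einsum('bij,bnj->bni', calibs[:, v], homo)                      # (B, N, 3)
        xy, z = cam[..., :2], cam[..., 2:3]
        # Phi(x, I): bilinear sampling of the feature map (grid_sample expects coords in [-1, 1])
        phi = F.grid_sample(feat_maps[:, v], xy.unsqueeze(2), align_corners=True)   # (B, C, N, 1)
        phi = phi.squeeze(-1).permute(0, 2, 1)                                      # (B, N, C)
        occ_per_view.append(mlp(torch.cat([phi, z], dim=-1)))                       # (B, N, 1)
    # naive multi-view fusion by mean pooling (PIFu averages intermediate MLP features instead)
    return torch.stack(occ_per_view, dim=0).mean(dim=0)

# illustrative usage with random tensors (6 views, 64-channel feature maps)
mlp = torch.nn.Sequential(torch.nn.Linear(64 + 1, 128), torch.nn.ReLU(),
                          torch.nn.Linear(128, 1), torch.nn.Sigmoid())
occ = pixel_aligned_query(torch.rand(1, 1024, 3), torch.rand(1, 6, 64, 128, 128),
                          torch.rand(1, 6, 3, 4), mlp)
print(occ.shape)  # torch.Size([1, 1024, 1])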
Figure 3: Architecture of our attention-aware network. We leverage a two-level coarse-to-fine framework (left) with a multi-view feature fusion module based on self-attention (right). The human body prior SMPL is used in the coarse level to ensure the robustness of reconstruction, and a specially designed SMPL global normal map helps the fine-level network better capture details. To synthesize multi-view features efficiently, we leverage the self-attention mechanism to extract meta information from different observations, which significantly improves the reconstruction quality.

4. Single Person Reconstruction

Reconstruction of a single person from multiple views is a challenging problem. The main concern is to extract the meta information of the observations from different views. To this end, we propose a novel feature fusion module based on the self-attention mechanism, which effectively helps the network become aware of the geometry details visible in multi-view scenes. To tackle the inconsistencies and loss of information brought by occlusions, we combine the attention module with a parametric model to enhance the robustness of reconstruction while preserving fine-grained details. The architecture of our network is illustrated in Figure 3. Following PIFuHD [34], our method builds on a two-level coarse-to-fine framework. The coarse level, conditioned on images and SMPL models, ensures a confident result, and the fine level refines the reconstruction by utilizing high-resolution image feature maps (512 × 512). The results can be further polished by a temporal fusion method when temporal information is available in video inputs. With the proposed spatial attention and temporal fusion framework, the reconstruction remains robust and of high quality.

4.1. Attention-aware Multi-view Feature Fusion

In PIFu [33], the simple strategy for multi-view reconstruction is to average the multi-view feature embeddings from the intermediate layer of the MLP. We argue that this method is not efficient enough to synthesize the geometry details of multi-view scenes, and can lose information. As shown in Figure 4, when this strategy is applied to PIFuHD [34], we obtain a smoother output. In particular, the geometry features may not remain consistent, since the visible regions change from view to view, and the mean pooling method cannot handle these cases effectively.

To capture correlations between different views, inspired by [38], we propose a multi-view feature fusion method based on the self-attention mechanism. The detailed architecture of the module is illustrated in Figure 3. Firstly, the input multi-view features φ_m are embedded with three different linear layers, and self-attention is applied:

    attention(φ_q, φ_s, φ_t) = softmax(φ_q^T φ_s / √d_k) φ_t    (3)

where φ_q = φ_m W_q, φ_s = φ_m W_s, φ_t = φ_m W_t are the query feature, source feature and target feature embedded by linear weights, and d_k is the embedding size. The dot-product result is divided by √d_k to prevent the vanishing gradient problem.

Multi-head attention is used in our method, i.e., W_q, W_s, W_t ∈ R^{n_head × d_k} encode the multi-view features into n_head different embedding subspaces, which allows the model to notice the different geometry patterns across the views jointly. The weights of the views in the target feature φ_t are obtained through the softmax function by calculating the similarity between views in the query feature φ_q and the source feature φ_s. Confident observations in each view tend to receive large weights and are maintained, while invisible regions lead to small weights and have little influence on the outputs.

We stack the linear and attention layers to form the self-attention encoder as proposed in [38]. Finally, the meta-view prediction is generated as:

    F^T(X) = g^T(T(φ_m))    (4)

where φ_m is the multi-view features, T(φ_m) is the feature output of the self-attention encoder, and the implicit function g^T predicts the occupancy field. The output meta-view feature is expected to contain the global spatial information. As demonstrated in Figure 4, when combining the attention module with PIFuHD [34], we are able to capture and preserve details with an increasing number of observations.
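As a concrete reference for Eqns. 3 and 4, the following is a minimal single-head sketch, not the authors' implementation: the per-view features of a query point are embedded by three linear layers, fused with scaled dot-product attention, and reduced to one meta-view feature that is then fed to the implicit-function MLP. The paper stacks multi-head attention layers as in [38]; the layer sizes and the final mean over views are illustrative assumptions.

import torch
import torch.nn as nn

class MultiViewAttentionFusion(nn.Module):
    """Fuse per-view pixel-aligned features with self-attention (Eqn. 3), single-head sketch."""

    def __init__(self, feat_dim=256, embed_dim=128):
        super().__init__()
        self.Wq = nn.Linear(feat_dim, embed_dim, bias=False)  # query embedding
        self.Ws = nn.Linear(feat_dim, embed_dim, bias=False)  # source (key) embedding
        self.Wt = nn.Linear(feat_dim, embed_dim, bias=False)  # target (value) embedding
        self.scale = embed_dim ** 0.5

    def forward(self, phi_m):
        # phi_m: (N, V, feat_dim) -- features of N query points observed in V views
        q, s, t = self.Wq(phi_m), self.Ws(phi_m), self.Wt(phi_m)
        # attention(phi_q, phi_s, phi_t) = softmax(phi_q phi_s^T / sqrt(d_k)) phi_t
        attn = torch.softmax(q @ s.transpose(-2, -1) / self.scale, dim=-1)   # (N, V, V)
        fused = attn @ t                                                     # (N, V, embed_dim)
        # reduce the view axis to one meta-view feature per point
        # (how the encoder output is pooled is an implementation choice in this sketch)
        return fused.mean(dim=1)

fusion = MultiViewAttentionFusion()
meta = fusion(torch.rand(1024, 6, 256))   # 1024 query points observed from 6 views
print(meta.shape)                         # torch.Size([1024, 128])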
4.2. Embedded with Parametric Body Model

Although the attention-aware feature fusion module is effective in synthesizing details from multiple views, without auxiliary 3D information the network struggles to make a reasonable prediction when information is lost due to occlusions. To address this limitation, we combine the strength of the attention mechanism and parametric models.

Figure 4: When extending PIFuHD [34] to multiple views, the mean pooling method (left) leads to smoother outputs, while the attention module (right) helps to preserve details.

A parametric body model, e.g., SMPL, contains the pose and shape information of human bodies. The semantic feature of SMPL is extracted by a 3D convolution network for geometry inference. To improve the efficiency of the attention module, inspired by the positional encoding introduced in [38], we further design an informative view representation by rendering SMPL global normal maps. The global maps offer guidance for the network to identify the particular visible body parts in multi-view observations, so that geometry features can be synthesized for the corresponding parts. Specifically, to render the global normal maps, SMPL is transformed to the canonical model coordinate system, where RGB color is obtained from the normal vector and the standard rendering procedure can be applied. In multi-person scenes, although the images of a single person can be fragmentary due to occlusions, the extra information provided by SMPL compensates for the missing parts and remains consistent across different views, which significantly improves the quality and robustness of the reconstruction results.

With SMPL, we rewrite the two-level pixel-aligned functions in Eqn. 1 and Eqn. 2. The coarse level has the formulation:

    F^L(X) = g^L(Φ^L(x, I), Ψ(X, V_M))    (5)

and the fine level:

    F^H(X) = g^H(Φ^H(x, I, N^F, N^{V_M}), Ω^L(X))    (6)

where V_M is the volumetric representation of SMPL, Ψ(X, V_M) is the SMPL semantic feature, N^F, N^{V_M} are the predicted frontal normal map and the rendered SMPL global normal map under the camera view, and Ω^L(X) is the 3D embedding from the coarse level. Combined with the attention module, the multi-view features φ_m in Eqn. 4 can be obtained by concatenating image features with the 3D embeddings, as illustrated in Figure 3.

4.3. Temporal Fusion

For moving characters in video inputs, inconsistencies can arise between consecutive frames due to the change of visible parts. To address this limitation, we propose a simple temporal fusion method. Suppose p_{i,t} is a vertex of the reconstructed mesh at time t; we first calculate its blending weight by:

    W_{t,i} = Σ_{j ∈ N_{t,i}} (w_{t,j→i} / w_{t,i}) W_j    (7)

where N_{t,i} is the set of nearest SMPL vertices of p_{i,t}, W_j is the blending weight of SMPL vertex v_{j,t}, and

    w_{j→i} = exp(−‖p_{i,t} − v_{j,t}‖ / σ²),  w_i = Σ_{j ∈ N_{t,i}} w_{j→i}    (8)

is the weight of vertex v_{j,t}. Given the estimated SMPL at times t and t', the reconstructed vertices V_t can then be warped to time t' through standard blend skinning:

    V_{t'←t} = W(W^{−1}(V_t, J(β_t), θ_t, W_t), J(β_{t'}), θ_{t'}, W_t)    (9)

where W refers to the skinning procedure and J, β, θ are the SMPL parameters. With the warped meshes, we calculate the signed distance field (SDF) and perform mean pooling to generate temporally continuous reconstructions:

    S_fusion(t) = (1/h) Σ_{t' ∈ F_t} S(t ← t')    (10)

where S denotes the SDF and F_t is a sliding time window of size h. In our approach, h is set to 3 to obtain consistent results while maintaining details.
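A rough sketch of the temporal fusion step follows, covering Eqns. 7, 8 and 10 only; it assumes the SDFs of neighboring frames have already been warped into frame t via the blend skinning of Eqn. 9, and the nearest-neighbor count, σ and array shapes are illustrative rather than taken from the paper.

import numpy as np
from scipy.spatial import cKDTree

def interpolate_blend_weights(verts, smpl_verts, smpl_weights, k=4, sigma=0.05):
    """Eqns. 7-8 sketch: blending weights of reconstructed vertices from their nearest SMPL vertices."""
    # N_{t,i}: the k nearest SMPL vertices of each reconstructed vertex p_{i,t}
    nn_dist, nn_idx = cKDTree(smpl_verts).query(verts, k=k)      # (N, k), (N, k)
    w = np.exp(-nn_dist / sigma ** 2)                            # w_{j->i} = exp(-||p_{i,t} - v_{j,t}|| / sigma^2)
    w /= w.sum(axis=1, keepdims=True)                            # divide by w_i = sum_j w_{j->i}
    return np.einsum('nk,nkc->nc', w, smpl_weights[nn_idx])      # W_{t,i}, shape (N, n_joints)

def fuse_sdf(warped_sdfs):
    """Eqn. 10 sketch: average the SDFs warped into frame t over the sliding window F_t."""
    return np.mean(np.stack(warped_sdfs, axis=0), axis=0)

# illustrative usage with random data (6890 SMPL-like vertices, 24 joints, window size h = 3)
verts = np.random.rand(1000, 3)
smpl_verts = np.random.rand(6890, 3)
smpl_weights = np.random.dirichlet(np.ones(24), size=6890)       # each row sums to 1
W = interpolate_blend_weights(verts, smpl_verts, smpl_weights)
sdf_t = fuse_sdf([np.random.randn(64, 64, 64) for _ in range(3)])
print(W.shape, sdf_t.shape)                                      # (1000, 24) (64, 64, 64)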
5. Extend to Multi-person Reconstruction

Multi-person reconstruction is implemented by reconstructing each individual separately. The key challenge is to train the network so that it remains robust against the occlusions of interactive scenes. To this end, we utilize several strategies during training. We first collect 1700 single-human models from Twindom (https://web.twindom.com/) to construct a large-scale dataset. To simulate multi-person cases, we render images via taichi [17] and randomly project other persons onto the masks, so that various situations can be generated, from non-occluded to heavily occluded scenes. Besides, to help the network become aware of visible details and leverage the SMPL information for robust reconstruction, we use a sampling method based on the visibility of points. The input points during training are sampled from a Gaussian distribution centered at surface points with standard deviation σ, as introduced in [33]. We further choose a small standard deviation σ_0 for visible points, to guide the network to learn fine-grained geometry details, and a larger σ_1 for invisible points, to avoid unreasonable predictions; we find this contributes to the improvement of performance in occluded scenes.
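The visibility-based sampling described above can be sketched as follows; this is a simplified illustration rather than the released training code, the visibility test itself is assumed to be given, and σ_0, σ_1 and the sample count are placeholders.

import numpy as np

def sample_training_points(surface_pts, visible_mask, sigma_visible=0.01, sigma_occluded=0.05,
                           n_samples=5000, seed=0):
    """Visibility-dependent Gaussian sampling around ground-truth surface points (Section 5 sketch).

    surface_pts:  (M, 3) points on the ground-truth surface
    visible_mask: (M,) bool, True if the point is visible in the occluded multi-view renderings
    Returns (n_samples, 3) query points for occupancy supervision.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(surface_pts), size=n_samples)
    # small sigma_0 near visible regions (fine details), larger sigma_1 near occluded regions
    sigma = np.where(visible_mask[idx], sigma_visible, sigma_occluded)[:, None]
    return surface_pts[idx] + rng.normal(size=(n_samples, 3)) * sigma

# illustrative usage
pts = np.random.rand(10000, 3)
vis = np.random.rand(10000) > 0.3
print(sample_training_points(pts, vis).shape)  # (5000, 3)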
6. Dataset and Experiments

6.1. MultiHuman Dataset

Since no multi-person dataset is available to evaluate our method, we propose a high-quality 3D model dataset, MultiHuman, which is collected using a dense camera rig equipped with 128 DSLRs and commercial photogrammetry software. The dataset contains 150 multi-person static scenes with 278 characters in total, mostly university students wearing casual clothes, dresses, etc. Each scene contains 1 to 3 persons, and each model consists of about 300,000 triangles.

To evaluate the proposed approach, we divide the dataset into different categories by the level of occlusion and the number of persons. In particular, we split the dataset into 30 single-human scenes, 18 occluded single-human scenes (occluded by different objects), 46 naturally interacting two-person scenes, 30 closely interacting two-person scenes, and 26 scenes with three persons. For more examples from the MultiHuman dataset, please refer to the supplementary materials.

Figure 5: Performance on the ZJU-MoCap dataset [32]. Our method outperforms state-of-the-art approaches including DeepVisualHull [19], PIFuHD [34] and Neural Body [32].

6.2. Evaluation

Performance on MultiHuman. We compare our method with current state-of-the-art approaches, i.e., PIFu [33], PIFuHD [34] and PaMIR [47] (PIFu + SMPL). All the methods are trained with the same strategies on the dataset described in Section 5. For PIFuHD, the backside normal maps are not used in our implementation, and multi-view features are fused by mean pooling as introduced in [33, 47]. During testing, the ground-truth models are normalized to 180 centimeters in height and we render 6-view images as the inputs. The point-to-surface (P2S) distance and the Chamfer distance between the reconstruction and the ground-truth geometry are used as evaluation metrics. Quantitative results are shown in Table 1. As occlusions intensify with an increasing number of persons and interacting elements, the errors of prior methods grow rapidly while ours remain competitive. Qualitative results are illustrated in Figure 6, which indicates the prominence of our method and the large gap between prior works and ours in handling occlusions in multi-person scenes. Our method is able to reconstruct highly detailed 3D humans robustly even in closely interactive scenes.
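For reference, the following is one common way to approximate the point-to-surface and Chamfer distances reported in Table 1, computed on point clouds sampled from the reconstructed and ground-truth meshes; the exact sampling density, normalization and units follow the evaluation protocol rather than this sketch.

import numpy as np
from scipy.spatial import cKDTree

def p2s_and_chamfer(recon_pts, gt_pts):
    """Point-to-surface (reconstruction to ground truth) and symmetric Chamfer distance, point-cloud approximation."""
    d_recon_to_gt = cKDTree(gt_pts).query(recon_pts)[0]    # distance of each reconstructed point to the GT cloud
    d_gt_to_recon = cKDTree(recon_pts).query(gt_pts)[0]    # distance of each GT point to the reconstructed cloud
    p2s = d_recon_to_gt.mean()
    chamfer = 0.5 * (d_recon_to_gt.mean() + d_gt_to_recon.mean())
    return p2s, chamfer

# illustrative usage with random point clouds
print(p2s_and_chamfer(np.random.rand(10000, 3), np.random.rand(10000, 3)))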
Performance on Real World Data. We evaluate our method on the ZJU-MoCap dataset [32], a multi-view real-world dataset, comparing against DeepVisualHull [19], a volumetric performance capture method for sparse multi-view inputs, PIFuHD [34], and Neural Body [32], a differentiable rendering method trained directly on the test image sequence. For inference, we re-implement DeepVisualHull and use the released code and pretrained models of PIFuHD and Neural Body. Figure 5 shows the state-of-the-art performance of our method on this benchmark. Reconstructions from real-world images (6 views for our data and 8 views for the CMU dataset [22]) are demonstrated in Figure 1. For more results, please refer to our supplementary video.

6.3. Ablation Study

This section aims to identify the factors that contribute to the prominence of our method. We achieve state-of-the-art performance mainly by leveraging a self-attention network combined with SMPL, together with a temporal fusion method for consistent results. We now examine how these components improve the reconstruction in different situations.

Variant 1: Self-attention module. We design a self-attention module to better capture details from different observations. To assess the strength of our multi-view feature fusion method, we combine the attention module with PIFu [33] (PIFu + Att) and PIFuHD [34] (PIFuHD + Att), and further evaluate our method without the module (replaced by mean pooling). Quantitative results in Table 1 show that the module benefits the baseline models in both non-occluded and occluded scenes. PIFuHD with the attention module even outperforms ours on single-human reconstruction, since the limitations brought by SMPL (Section 6.4) can lead to lower accuracy for our method. For PIFu the improvement is marginal, indicating that the module is more effective at merging multi-view features when detailed geometry information is offered by image normal maps. For our method, we lose the competitive performance without the module. Qualitative examples in Figure 4 further demonstrate how the module helps the baseline model maintain geometry details as the number of views increases.
Figure 6: Reconstruction results on the MultiHuman dataset for single-person, occluded single-person, two-person naturally interacting, two-person closely interacting and three-person scenes (top to bottom). Our method (e) generates robust and highly detailed humans, significantly narrowing the gap between the ground truth (f) and the performance of current state-of-the-art methods.

Variant 2: Use of SMPL. SMPL is used in our method as a 3D proxy that helps the network generate a reasonable output, and we further design a SMPL global normal map (described in Section 4.2) to improve the robustness of reconstruction against occlusions while preserving details. The huge gap between PaMIR [47] and ours indicates that SMPL is not the only factor contributing to our advantages. Table 1 also shows the performance of our method without the designed global maps (Ours w/o SN). The results demonstrate lower reconstruction accuracy, which implies the effectiveness of the global maps as a visual reference that guides the attention network in merging multi-view information.

Figure 7: Our method generates robust results even when parts of the human are not visible in closely interactive scenes.
Table 1: Quantitative evaluation on the MultiHuman dataset (Chamfer distance / point-to-surface distance, lower is better). We compare our method with PIFu [33], PIFuHD [34] and PaMIR [47] using the mean pooling feature fusion of [33], and with several alternatives: PIFu + Att (attention module), PIFuHD + Att, our method without attention (w/o Att) and our method without the SMPL global normal maps (w/o SN).

Method                     | Single         | Occluded single | Two (natural)   | Two (close)     | Three
                           | Chamfer  P2S   | Chamfer  P2S    | Chamfer  P2S    | Chamfer  P2S    | Chamfer  P2S
PIFu (Mview + Mean) [33]   | 1.131   1.220  | 1.402   1.522   | 1.578   1.620   | 1.745   1.831   | 1.780   1.564
PIFuHD (Mview + Mean) [34] | 0.914   0.948  | 1.365   1.406   | 1.353   1.376   | 1.614   1.655   | 1.814   1.496
PaMIR (Mview + Mean) [47]  | 1.173   1.113  | 1.362   1.309   | 1.227   1.110   | 1.400   1.198   | 1.414   1.281
PIFu (Mview + Att)         | 1.054   1.174  | 1.343   1.479   | 1.566   1.605   | 1.773   1.845   | 1.541   1.383
PIFuHD (Mview + Att)       | 0.845   0.867  | 1.195   1.189   | 1.278   1.272   | 1.515   1.450   | 1.468   1.287
Ours (w/o Att)             | 1.015   0.967  | 1.251   1.181   | 1.020   0.925   | 1.264   1.088   | 1.309   1.246
Ours (w/o SN)              | 1.063   1.017  | 1.277   1.233   | 1.126   0.989   | 1.357   1.141   | 1.334   1.155
Ours                       | 0.895   0.887  | 1.041   1.021   | 0.956   0.927   | 1.134   1.067   | 1.130   1.078

Variant 3: Temporal fusion. Figure 8 illustrates the results of our method with and without temporal fusion on a real-world image sequence. The temporal fusion method further enhances the consistency of the reconstruction, which can be seen more clearly in our supplementary video.

Figure 8: Performance on a real-world image sequence. (b) shows the original per-frame results; (c) shows the results polished by temporal fusion.

Robustness against occlusions. Quantitative results (Table 1) show that our method maintains high accuracy as occlusions increase. Figure 7 illustrates that when body parts are invisible due to heavy occlusion, smooth results are still generated.

6.4. Limitations

Since we use SMPL as a 3D reference, our method cannot reconstruct objects other than humans. For challenging clothes, Figure 9 demonstrates that we are able to reconstruct a tight dress, while for loose clothing such as a wind coat, the reconstruction can be unstable.

Figure 9: Reconstruction of challenging clothes.

Besides, our method relies on an accurately fitted SMPL, i.e., an SMPL body lying within the correct corresponding region. An inaccurate SMPL can lead to artifacts and failure cases (Figure 10).

Figure 10: Failure case with inaccurate SMPL input. Our method is misled by the incorrect SMPL information.

7. Discussion and Future Works

Though our method is capable of reconstructing multiple persons from real-world images, we rely on SMPL as a 3D reference.
The camera parameters are required to estimate the 3D keypoints and fit the SMPL models. Besides, the attention network and image encoders are extremely memory-consuming, which restricts the inference efficiency. Future work can focus on network pruning to achieve real-time inference, and on designing more sophisticated approaches that are free from SMPL, which will make the system more widely applicable.

References

[1] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. Tex2shape: Detailed full human body geometry from a single image. In ICCV, 2019.
[2] Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, and Slobodan Ilic. 3d pictorial structures for multiple human pose estimation. In CVPR, pages 1669–1676, 2014.
[3] Lewis Bridgeman, Marco Volino, Jean-Yves Guillemaut, and Adrian Hilton. Multi-person 3d pose estimation and tracking in sports. In CVPR, 2019.
[4] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (TOG), 34(4):1–13, 2015.
[5] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In CVPR, pages 11065–11074, 2019.
[6] Edilson De Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. In ACM SIGGRAPH 2008 Papers, pages 1–10, 2008.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] Junting Dong, Wen Jiang, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Fast and robust multi-person 3d pose estimation from multiple views. In CVPR, pages 7792–7801, 2019.
[9] Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. Motion2fusion: Real-time volumetric performance capture. ACM Transactions on Graphics (TOG), 36(6):1–16, 2017.
[10] Mingsong Dou, Henry Fuchs, and Jan-Michael Frahm. Scanning and tracking dynamic objects with commodity depth cameras. In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 99–106. IEEE, 2013.
[11] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG), 35(4):1–13, 2016.
[12] Valentin Gabeur, Jean-Sébastien Franco, Xavier Martin, Cordelia Schmid, and Gregory Rogez. Moulding humans: Non-parametric 3d human shape estimation from single images. In ICCV, pages 2232–2241, 2019.
[13] Juergen Gall, Carsten Stoll, Edilson De Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter Seidel. Motion capture using joint skeleton tracking and surface estimation. In CVPR, pages 1746–1753. IEEE, 2009.
[14] Andrew Gilbert, Marco Volino, John P. Collomosse, and Adrian Hilton. Volumetric performance capture from minimal camera viewpoints. In ECCV, volume 11215, pages 591–607. Springer, 2018.
[15] Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen, Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. The relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (TOG), 38(6):1–19, 2019.
[16] Marc Habermann, Weipeng Xu, Michael Zollhofer, Gerard Pons-Moll, and Christian Theobalt. Deepcap: Monocular human performance capture using weak supervision. In CVPR, pages 5052–5063, 2020.
[17] Yuanming Hu, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. Taichi: A language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics (TOG), 38(6):201, 2019.
[18] Lin Huang, Jianchao Tan, Jingjing Meng, Ji Liu, and Junsong Yuan. Hot-net: Non-autoregressive transformer for 3d hand-object pose estimation. In Proceedings of the 28th ACM International Conference on Multimedia, pages 3136–3145, 2020.
[19] Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe LeGendre, Linjie Luo, Chongyang Ma, and Hao Li. Deep volumetric video from very sparse multi-view performance capture. In ECCV, pages 336–354, 2018.
[20] Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. Arch: Animatable reconstruction of clothed humans. In CVPR, pages 3093–3102, 2020.
[21] H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. Godisart, B. Nabbe, I. Matthews, et al. Panoptic studio: A massively multiview system for social interaction capture. TPAMI, 41(1):190–204, 2019.
[22] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In CVPR, pages 8320–8329. IEEE, 2018.
[23] Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self-correction for human parsing. arXiv preprint arXiv:1910.09777, 2019.
[24] Ruilong Li, Yuliang Xiu, Shunsuke Saito, Zeng Huang, Kyle Olszewski, and Hao Li. Monocular real-time volumetric performance capture. In ECCV, 2020.
[25] Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In CVPR, pages 7026–7035, 2019.
[26] Yebin Liu, Juergen Gall, Carsten Stoll, Qionghai Dai, Hans-Peter Seidel, and Christian Theobalt. Markerless motion capture of multiple characters using multiview image segmentation. TPAMI, 35(11):2720–2735, 2013.
[27] Yebin Liu, Carsten Stoll, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. Markerless motion capture of interacting characters using multi-view image segmentation. In CVPR, pages 1249–1256. IEEE, 2011.
[28] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015.
[29] Keyang Luo, Tao Guan, Lili Ju, Yuesong Wang, Zhuo Chen, and Yawei Luo. Attention-aware multi-view stereo. In CVPR, pages 1590–1599, 2020.
[30] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo Morishima. Siclope: Silhouette-based clothed people. In CVPR, pages 4480–4490, 2019.
[31] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, pages 10975–10985, 2019.
[32] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. arXiv preprint arXiv:2012.15838, 2020.
[33] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, 2019.
[34] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In CVPR, 2020.
[35] David Smith, Matthew Loper, Xiaochen Hu, Paris Mavroidis, and Javier Romero. Facsimile: Fast and accurate scans from an image in less than a second. In ICCV, pages 5330–5339, 2019.
[36] Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applications, 27(3):21–31, 2007.
[37] Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3d human body shapes. In ECCV, pages 20–36, 2018.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[39] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH 2008 Papers, pages 1–9, 2008.
[40] Daniel Vlasic, Pieter Peers, Ilya Baran, Paul Debevec, Jovan Popović, Szymon Rusinkiewicz, and Wojciech Matusik. Dynamic shape capture using multi-view photometric stereo. In ACM SIGGRAPH Asia 2009 Papers, pages 1–11, 2009.
[41] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In CVPR, pages 3156–3164, 2017.
[42] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, pages 1857–1866, 2018.
[43] Tao Yu, Kaiwen Guo, Feng Xu, Yuan Dong, Zhaoqi Su, Jianhui Zhao, Jianguo Li, Qionghai Dai, and Yebin Liu. Bodyfusion: Real-time capture of human motion and surface geometry using a single depth camera. In ICCV, pages 910–919, 2017.
[44] Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, and Yebin Liu. Doublefusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In ICCV, pages 7287–7296, 2018.
[45] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, pages 7151–7160, 2018.
[46] Yuxiang Zhang, Liang An, Tao Yu, Xiu Li, Kun Li, and Yebin Liu. 4d association graph for realtime multi-person motion capture using multiple video cameras. In CVPR, pages 1324–1333, 2020.
[47] Zerong Zheng, Tao Yu, Yebin Liu, and Qionghai Dai. Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. arXiv preprint arXiv:2007.03858, 2020.
[48] Zerong Zheng, Tao Yu, Yixuan Wei, Qionghai Dai, and Yebin Liu. Deephuman: 3d human reconstruction from a single image. In ICCV, pages 7739–7749, 2019.
[49] Hao Zhu, Xinxin Zuo, Sen Wang, Xun Cao, and Ruigang Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. In CVPR, pages 4491–4500, 2019.