Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation
Zicong Fan1,2  Adrian Spurr1  Muhammed Kocabas1,2  Siyu Tang1  Michael J. Black2  Otmar Hilliges1
1 ETH Zürich, Switzerland   2 Max Planck Institute for Intelligent Systems, Tübingen
arXiv:2107.00434v1 [cs.CV] 1 Jul 2021

Figure 1. When estimating the 3D pose of interacting hands, state-of-the-art methods struggle to disambiguate the appearance of the two hands and their parts. In this example, significant uncertainty between the left and right wrist arises (1.1), resulting in erroneous pose estimation (1.2). Our model, DIGIT, reduces the ambiguity by predicting and leveraging a probabilistic part segmentation volume (2.1) to produce reliable pose estimates even when the two hands are in direct contact and under significant occlusion (2.2, 2.3).

Abstract

In natural conversation and interaction, our hands often overlap or are in contact with each other. Due to the homogeneous appearance of hands, this makes estimating the 3D pose of interacting hands from images difficult. In this paper we demonstrate that self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands and their parts, is a major cause of the final 3D pose error. Motivated by this insight, we propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image. The method consists of two interwoven branches that process the input imagery into a per-pixel semantic part segmentation mask and a visual feature volume. In contrast to prior work, we do not decouple the segmentation from the pose estimation stage, but rather leverage the per-pixel probabilities directly in the downstream pose estimation task. To do so, the part probabilities are merged with the visual features and processed via fully-convolutional layers. We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M [30] dataset for both single and interacting hands across all metrics. We provide detailed ablation studies to demonstrate the efficacy of our method and to provide insights into how the modelling of pixel ownership affects single and interacting hand pose estimation. Our code will be released for research purposes.

1. Introduction

Hands are our primary means of interaction with the physical world, for example when manipulating objects. Consequently, a method for estimating 3D hand pose from monocular images would have many applications in human-computer interaction, AR/VR, and robotics. We often use both hands in a concerted manner and, as a consequence, our hands are often close to or in contact with each other. The vast majority of 3D hand pose estimation methods assume that inputs contain only a single hand [5, 9, 10, 18, 30, 32, 47, 48, 50, 54, 58]. This is for good reason: hands display a large amount of self-similarity and are very dexterous. This leads to self-occlusion, which, together with the inherent depth ambiguities, results in a challenging pose reconstruction problem. Estimating two interacting hands is even more difficult due to the self-similar appearance and complex occlusion patterns, where often large areas of the hands are unobservable.

Recently, Moon et al. [30] proposed a large-scale annotated dataset, captured via a massive multi-view setup, allowing for the study of the 3D interacting hand pose estimation task. Their method shows feasibility but struggles with interacting hands.
One of the main sources of difficulty in the task is the ambiguity caused by the relatively homogeneous appearance of hands and fingers. Even the fingers of a single hand can be difficult to tell apart if only parts of the hand are visible. Considering hands in close interaction only makes this problem more pronounced. Consider the example in Fig. 1. Here an interacting hand pose estimator struggles to disambiguate the wrists of the two hands, reflected in a bi-modal and dispersed heatmap, resulting in a poor 3D pose estimate.

To address this problem, we introduce DIGIT (DIsambiGuating hands in InTeraction), a novel method for learning-based reconstruction of 3D hand poses for interacting hands. The key insight is to explicitly reason about the per-pixel segmentation of the images into the separate hands and their parts, thus assigning ownership of each pixel to a specific part of one of the hands. We show that this reduces the ambiguities brought on by the self-similarity of hands and, in turn, significantly improves the accuracy of 3D pose estimates. While prior work on hand- [4, 58] and body-pose estimation [38, 41, 55] and hand tracking [12, 33, 53] has leveraged some form of segmentation, most often silhouettes or per-pixel masks, this is typically done as a pre-processing step. In contrast, our ablations show that integrating a semantic segmentation branch into an end-to-end trained architecture already increases pose estimation accuracy. We also demonstrate that leveraging the per-pixel probabilities, rather than class labels, alongside the image features further improves the accuracy of the 3D pose estimation task. Finally, our experiments reveal that the proposed approach not only helps to disambiguate interacting hands but also improves the accuracy of single hands, and the estimation of relative positions between hands, via a reduction of uncertainty due to self-similarity.

More precisely, DIGIT is an end-to-end trainable network architecture (see Fig. 2) that uses two separate, but interwoven, branches for the tasks of semantic segmentation and pose estimation respectively. Importantly, the output of the segmentation branch is per-pixel logits (i.e., the full probability distribution) rather than the more commonly used discrete class labels. These probabilities are then merged with the visual features and processed via fully convolutional layers to attain a fused feature representation that is ultimately used for the final pose estimates. The network is supervised via a 3D pose estimation loss and a semantic segmentation loss. We show in ablation studies that all design choices are necessary in order to attain the best performing architecture, and that the final proposed method reaches state-of-the-art performance on the InterHand2.6M [30] dataset across all metrics. In summary, we contribute:

1. An analysis showing that SOTA hand-pose methods are sensitive to self-occlusions and ambiguities brought on by interacting hands.
2. A novel end-to-end trainable architecture for 3D pose estimation from monocular images that depict two hands, often under self-contact.
3. An approach to incorporate a semantic part-segmentation network and means to combine the per-pixel probabilities with visual features for the final task of 3D pose estimation.
4. Detailed ablation studies revealing a reduction of uncertainty due to self-similarity in interacting hands and improvements in the accuracy of single hands, and in the estimation of relative depth between hands.
5. Our method reaches state-of-the-art performance across all metrics in single and interacting hand pose estimation on the InterHand2.6M [30] dataset.

2. Related work

Here we briefly review related work in monocular hand pose estimation, reconstruction of the 3D pose of bi-manual interaction, and the use of segmentation in related tasks.

Monocular 3D hand pose estimation. Monocular RGB 3D hand pose estimation has a long history beginning with Rehg and Kanade [42]. Surface-based approaches estimate dense hand surfaces by either fitting a hand model to observations or by regressing model parameters directly from pixels [7, 11, 13, 14, 16, 17, 23, 25, 28, 29, 36, 39, 42, 43, 57]. More closely related to ours are keypoint-based approaches that regress the 3D joint positions [5, 9, 10, 18, 30, 32, 47, 48, 50, 54, 58]. For example, Zimmermann et al. [58] propose the first convolutional network for RGB hand pose estimation. Iqbal et al. [18] introduce a 2.5D representation, allowing training on in-the-wild 2D annotations. However, all of the above approaches assume single-hand images. Recently, Moon et al. [30] introduce a large-scale dataset and a 3D hand pose estimator for both single and interacting hands. In our work, we show that existing approaches struggle with occlusions and appearance ambiguity. To this end, we propose a novel method that can better disambiguate strongly interacting hands and thus improves interacting hand pose estimation.

Interacting hand tracking and pose estimation. Model-based approaches to tracking of interacting hands have been proposed [3, 33, 37, 46, 51, 53], as well as multi-view methods to reconstruct the pose of interacting hands [15, 30, 45]. Oikonomidis et al. [37] provide a formulation to track interacting hands using Particle Swarm Optimization from RGBD videos. Ballan et al. [3] introduce an offline method to capture hand motion during hand-hand and hand-object interaction in a multi-camera setup. Tzionas et al. [51] extend the idea in [3] with a physical model. Mueller et al. [33] and Wang et al. [53] propose interacting hand tracking methods by predicting left/right-hand silhouettes and correspondence masks used in a post-processing energy-minimization step. Smith et al. [46] propose a multi-view system that constrains a vision-based tracking
algorithm with a physical model. In 3D interacting hand pose estimation, Simon et al. [45] propose a multi-view bootstrapping technique to triangulate full-body 2D keypoints into 3D. He et al. [15] incorporate epipolar geometry into a transformer network [52]. Moon et al. [30] propose a large-scale dataset and a model for interacting hand pose estimation. The model from Moon et al. [30] is the most closely related to this paper since it is the only prior work that estimates the 3D hand pose of interacting hands from a single RGB image. Compared to hand tracking [3, 33, 37, 46, 51, 53], our method does not require RGB video or depth image sequences. In contrast to the existing interacting pose estimation frameworks [15, 30, 45], we explicitly model the uncertainty caused by appearance ambiguity in interacting hands, and we do not require multi-view supervision [15, 45].

Segmentation in pose estimation. Segmentation has been used in 3D hand pose estimation, 3D human pose estimation, and hand tracking, and can be grouped into four categories: as a localization step [1, 19, 34, 35, 56, 58], as a training loss [2, 4], as an optimization term [6, 33, 53], or as an intermediate representation [38, 41, 55]. Most single-hand pose estimation approaches follow Zimmermann et al. [58] in localizing a hand in an image by predicting the hand silhouette, which is used to crop the input image before performing pose estimation. Boukhayma et al. [4] predict a dense hand surface and use a neural rendering technique to obtain a silhouette loss. In contrast, we leverage part segmentation to explicitly address self-similarity in hand pose estimation. In tracking interacting hands, left- and right-hand masks can be predicted from either depth images [33] or monocular RGB images [53], which are used in an optimization-based post-processing step. Our method requires neither RGB video nor depth image sequences. In 3D human pose and shape estimation, existing methods predict part segmentation maps [38, 55] or silhouettes [41] from RGB images and use the predicted masks as an intermediate representation. Specifically, they decouple the image-to-pose problem into image-to-segmentation and segmentation-to-pose, and the two models of the subtasks are trained separately. Our method, on the other hand, trains image-to-pose in an end-to-end fashion. The most closely related method is that of Omran et al. [38]. In addition to not training the networks end-to-end, they use discrete part segments while we preserve uncertainty with a probabilistic segmentation map. Our experiments show that end-to-end training and the use of probabilistic segmentation maps significantly reduce hand pose estimation errors.

Figure 2. An illustration of our hand pose estimation model (DIGIT). Given an image, DIGIT extracts visual features (F) and predicts a part segmentation probability volume (S). The segmentation volume is projected into latent semantic features (S′). The visual features (F) and the semantic features (S′) are fused across multiple scales and used for interacting hand pose estimation (illustrated in Fig. 3).

3. Method

3.1. Overview

At the core of DIGIT lies the observation that the self-similarity between joints and the similarity between hands, which are especially pronounced during hand-to-hand interaction, are major sources of errors for monocular 3D hand pose estimation. Our experiments (see Fig. 8) show that standard approaches do not have a mechanism to cope with such ambiguity. Embracing this challenge, we propose a simple yet effective framework for interacting hand pose estimation by modelling per-pixel ownership via probabilistic part segmentation maps. Our method leverages both visual features and distinctive semantic features to address the ambiguity caused by self-similarity.

In contrast to prior work, which separates the image-to-segmentation and segmentation-to-pose steps [38, 41, 55], we propose a holistic approach that is trained to jointly reason about pixel-to-part assignment and 3D joint locations. In particular, given an input image, our model identifies individual parts of each hand in the form of probabilistic segmentation maps, which are used to encourage the local influence of visual features for estimating the corresponding 3D joint. We fuse the probabilistic segmentation maps and the visual feature maps across multiple scales using a convolutional fusion layer. Our experiments show that end-to-end training and the use of probabilistic segmentation maps significantly improve hand pose estimation.

3.2. Segmentation-aware pose estimation

Fig. 2 illustrates the main components of our framework. Given an image region I ∈ R^{W_I×H_I×3}, cropped by a bounding box including all hands, the goal of our model is to estimate the 3D hand pose P_3D ∈ R^{2J×3}
in the camera coordinate frame for 2J joints, where J is the number of joints in one hand. In particular, we first extract a feature map F ∈ R^{W_F×H_F×D_F} from the image I using a CNN backbone network to provide visual features for pose estimation and part segmentation. Here W_F, H_F, and D_F denote the width, height, and channel dimension of the feature map.

Figure 3. Our interacting hand pose estimator.

Figure 4. Part segmentation classes. Each of the left hand and the right hand is partitioned into 16 classes shown in different colors. Including the background, there are 33 classes in total.

Probabilistic segmentation. Since there is an inherent self-similarity between different parts of the hands, we learn a part segmentation network to predict a probabilistic segmentation volume S ∈ R^{W_S×H_S×C}, which is directly supervised by groundtruth part segmentation maps. Each pixel of the segmentation volume S is a channel of probability logits over C classes, where C is the number of categories including the parts of the two hands and the background (see Fig. 4). Note that, to preserve the uncertainty in the segmentation prediction, we do not pick the class with the highest response among the C classes for each pixel in the segmentation volume. For display purposes only, we show the class with the highest probability in Fig. 2. Finally, since the segmentation volume S has a higher resolution than F, we perform a series of convolution and downsampling operations to obtain semantic features S′ ∈ R^{W_F×H_F×D_S}, where D_S is the channel dimension.

Visual-semantic fusion. The visual features F and the semantic features S′ are concatenated along the channel dimension to provide rich visual cues for estimating accurate 3D hand poses and distinctive semantic features for avoiding appearance ambiguity. However, a naive concatenation does not provide global context from the semantic features. Therefore, we fuse the visual and semantic features across different scales using a custom and lightweight UNet [44] to obtain a fused feature map F′ ∈ R^{W_F×H_F×(D_F+D_S)} for 2D pose estimation (see Sup. Mat. for the UNet details).
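To make the fusion step concrete, the sketch below shows one way to project the segmentation logits to semantic features and concatenate them with the visual features, following the description above. It is a minimal PyTorch sketch, not the authors' released code: the channel sizes, the two stride-2 convolutions standing in for the downsampling operations, and the single convolution standing in for the lightweight UNet are all our assumptions.

```python
import torch
import torch.nn as nn

class VisualSemanticFusion(nn.Module):
    """Minimal sketch: fuse visual features F with semantic features S'
    derived from the probabilistic part-segmentation logits S."""

    def __init__(self, d_f=256, d_s=32, num_classes=33):
        super().__init__()
        # Project the C-channel segmentation logits to D_S semantic channels and
        # downsample them toward the spatial size of F. Two stride-2 convolutions
        # stand in for the "series of convolution and downsampling operations";
        # the 4x resolution ratio between S and F is an assumption.
        self.to_semantic = nn.Sequential(
            nn.Conv2d(num_classes, d_s, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(d_s, d_s, kernel_size=3, stride=2, padding=1),
        )
        # Stand-in for the lightweight UNet that fuses the concatenated features.
        self.fuse = nn.Conv2d(d_f + d_s, d_f + d_s, kernel_size=3, padding=1)

    def forward(self, feats, seg_logits):
        # feats:      (B, D_F, H_F, W_F) visual features F from the backbone
        # seg_logits: (B, C, H_S, W_S)   probabilistic segmentation volume S
        sem = self.to_semantic(seg_logits)      # latent semantic features S'
        fused = torch.cat([feats, sem], dim=1)  # concatenate along channels
        return self.fuse(fused)                 # fused feature map F'
```

The important detail is that the raw logits, not argmax class labels, enter the concatenation, so the per-pixel uncertainty survives into F′.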
Interacting hand pose estimation. To estimate the final 3D hand pose of both hands, we learn a function F: F′ → P_2.5D that maps the fused feature map F′ to a 2.5D pose. The 2.5D representation P_2.5D ∈ R^{2J×3} consists of individual 2.5D joints (x_i, y_i, z_i) ∈ R^3, where (x_i, y_i) is the 2D projection of the 3D joint (X_i, Y_i, Z_i) ∈ R^3 and z_i = Z_i − Z_root(i). The notation root(i) denotes the hand root of joint i. During inference, the 3D pose can be recovered by applying an inverse perspective projection to (x_i, y_i) using the depth estimate Z_i. To model the function F, we use a custom estimator inspired by [18]. We found that the 2.5D representation of [18] performs on par with the interacting pose estimator by Moon et al. [30], while being more memory efficient since it does not require a volumetric heatmap representation (see Sup. Mat.).

Figure 3 shows a schematic of our proposed pose estimator. Similar to [30], our model estimates the handedness (h_L, h_R) ∈ [0, 1]^2, the 2.5D left- and right-hand pose P_2.5D, and the right-hand-relative left-hand depth z^{R→L} ∈ R, where L and R denote the left and right hands. Since our model predicts 2.5D joints (x_i, y_i, z_i), and converting from 2.5D to 3D requires an inverse perspective projection, we need to estimate the depth Z_i for a joint i by Z_i = z_i + Z_root(i). The root-relative depth z^{R→L} is used to obtain the left-hand root depth when both hands are present.

Handedness and relative root depth. The handedness (h_L, h_R) detects the presence of the two hands and z^{R→L} measures the depth of the left root relative to the right root. We repeatedly convolve and downsample F′ to a latent vector x, which is used to estimate (h_L, h_R) and z^{R→L} by two separate multi-layer perceptron (MLP) networks. For z^{R→L}, we use the MLP to estimate a 1D heatmap p ∈ R^{D_z} that is softmax-normalized, representing the probability distribution over D_z possible values for z^{R→L}. The final relative depth z^{R→L} is obtained by

    z^{R→L} = Σ_{k=0}^{D_z−1} k · p[k].    (1)

2.5D hand pose estimator (F). Inspired by [18], our pose estimator predicts the latent 2D heatmap H*_2D ∈ R^{W_F×H_F×2J} for the 2D joint locations and the latent root-relative depth map H*_z ∈ R^{W_F×H_F×2J} for the root-relative depth of each joint. The heatmap H*_2D is spatially softmax-normalized to a probability map H_2D ∈ R^{W_F×H_F×2J}. Since H_2D indicates potential 2D joint locations, to focus the depth values on the joint locations, H_2D is element-wise multiplied with the latent depth map H*_z to obtain the depth map H_z = H*_z ⊙ H_2D. To allow our network to be fully differentiable, we use soft-argmax [26] to convert the 2D heatmap H_2D to 2D keypoints {(x_i, y_i)}_{i=1}^{2J}. Finally, we sum the values on each slice of the depth map H_z to obtain the root-relative depth z_i for joint i.
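The operations of this 2.5D head (spatial softmax, soft-argmax [26], masking of the latent depth map, and the soft-binned relative root depth of Eq. 1) can be summarized in a few lines. This is a hedged sketch: tensor layouts, function names, and the omission of any mapping from depth bins to metric depth are our assumptions; the exact layer configuration is in the Sup. Mat.

```python
import torch
import torch.nn.functional as F


def soft_argmax_2d(latent_heatmaps):
    """latent_heatmaps: (B, 2J, H, W) latent 2D heatmaps H*_2D.
    Returns normalized heatmaps H_2D and soft-argmax keypoints (B, 2J, 2)."""
    b, j, h, w = latent_heatmaps.shape
    probs = F.softmax(latent_heatmaps.view(b, j, -1), dim=-1).view(b, j, h, w)
    # Expected (x, y) location under each joint's probability map.
    xs = torch.linspace(0, w - 1, w, device=probs.device)
    ys = torch.linspace(0, h - 1, h, device=probs.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginalize rows, then E[x]
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginalize cols, then E[y]
    return probs, torch.stack([x, y], dim=-1)


def root_relative_depths(probs, latent_depth_maps):
    """H_z = H*_z ⊙ H_2D, then sum each slice to obtain z_i: (B, 2J)."""
    return (probs * latent_depth_maps).sum(dim=(2, 3))


def relative_root_depth(logits_1d):
    """Eq. (1): expectation over a softmax-normalized 1D heatmap p ∈ R^{D_z}."""
    p = F.softmax(logits_1d, dim=-1)                                 # (B, D_z)
    bins = torch.arange(p.shape[-1], device=p.device, dtype=p.dtype)
    return (p * bins).sum(dim=-1)                                    # (B,)
```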
From 2.5D pose to 3D. To convert P_2.5D to the 3D pose P_3D, following [30], we apply an inverse perspective projection to map the 2D keypoints to the 3D camera coordinate frame by

    P^L_3D = Π(T^{−1} P^L_2.5D + Z^L)    (2)
    P^R_3D = Π(T^{−1} P^R_2.5D + Z^R)    (3)

where Π and T^{−1} are the camera back-projection operation and the inverse affine transformation (undoing cropping and resizing). The projection requires the absolute depths of the left and right roots, Z^L and Z^R (written in vector form):

    Z^L = [0, 0, z^L]^T              if h^R < 0.5,
    Z^L = [0, 0, z^R + z^{R→L}]^T    otherwise,    (4)

    Z^R = [0, 0, z^R]^T    (5)

where z^L and z^R are the absolute depths of the roots of the left and the right hands. Following [30], we use the estimates from RootNet [27] for z^L and z^R.

Training loss. The loss used to train our model is

    L = L_h + L_2.5D + L_z + λ_s L_s + λ_b L_b,    (6)

where the terms are the handedness loss L_h, the 2.5D hand pose loss L_2.5D, the right-hand-relative left-hand depth loss L_z, the segmentation loss L_s, and a bone regularization loss L_b. In particular, we use the multi-label binary cross-entropy loss to supervise the handedness prediction. For segmentation, we use the multi-class cross-entropy loss

    L_s = − Σ_{m=1}^{W_F} Σ_{n=1}^{H_F} Σ_{j=1}^{C} T_j[m, n] log(σ(S_j[m, n]))    (7)

where T[m, n] ∈ R^C is a one-hot vector with one positive class and C − 1 negative classes according to the groundtruth for the segmentation pixel at (m, n), and σ(·) is a softmax normalization operation so that Σ_{j=1}^{C} σ(S_j[m, n]) = 1. We use the L1 loss to supervise the 2.5D hand pose and the relative root depth.

Kinematic consistency. In our experiments we observe that both the method by Moon et al. [30] and our own baseline yield asymmetric predictions in terms of bone length between the left and right hand, due to the appearance ambiguities (with an average difference of 8mm and 10mm for the baseline and the InterHand2.6M [30] model on the validation set). To encourage more physically plausible predictions we propose a bone vector loss:

    L_b = Σ_{(i,j)∈E} ‖(P^{2.5D}_i − P^{2.5D}_j) − (P̄^{2.5D}_i − P̄^{2.5D}_j)‖,    (8)

where (i, j) ∈ E denotes a bone from the edge set E of the directed kinematic tree connecting the groundtruth joints P̄^{2.5D}_i and P̄^{2.5D}_j. This loss encourages the predicted bones to have lengths similar to those in the groundtruth.
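A minimal sketch of the bone vector loss in Eq. 8 is given below. The edge list format, the use of an L2 norm, the optional validity mask, and the mean reduction are assumptions on our part; the paper only specifies that the loss compares predicted and groundtruth bone vectors and is applied when both joints of a bone are labeled.

```python
import torch


def bone_vector_loss(pred_25d, gt_25d, edges, valid=None):
    """Eq. (8): penalize the difference between predicted and groundtruth
    bone vectors over the directed kinematic tree.

    pred_25d, gt_25d: (B, 2J, 3) 2.5D joints (x, y, z_rel)
    edges:            list of (i, j) joint-index pairs defining the bones
    valid:            optional (B, 2J) float mask; a bone contributes only if
                      both of its joints are labeled.
    """
    i_idx = torch.tensor([i for i, _ in edges], device=pred_25d.device)
    j_idx = torch.tensor([j for _, j in edges], device=pred_25d.device)

    pred_bones = pred_25d[:, i_idx] - pred_25d[:, j_idx]   # (B, E, 3)
    gt_bones = gt_25d[:, i_idx] - gt_25d[:, j_idx]         # (B, E, 3)
    diff = (pred_bones - gt_bones).norm(dim=-1)            # (B, E)

    if valid is not None:
        both = valid[:, i_idx] * valid[:, j_idx]           # (B, E)
        return (diff * both).sum() / both.sum().clamp(min=1)
    return diff.mean()
```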
4. Experiments

In this paper we want to demonstrate that self-similarity, and the resulting ambiguities between joints, is a major cause of 3D hand pose error, and that the ambiguity becomes more severe during interaction. We hypothesize that the ambiguity problem can be alleviated by modelling pixel ownership via part segmentation. To demonstrate the efficacy of our approach, we first compare our model with the state-of-the-art. We then investigate the benefits of modelling part segmentation in an ablation study that sheds further light on when and why our approach works.

Dataset. We evaluate our interacting hand pose estimation model using the InterHand2.6M [30] dataset. It is the only large-scale dataset for modeling hand interaction and includes images of both single-hand (SH) and interacting-hand (IH) sequences of 21 subjects in the training set. The validation and test sets contain 1 and 8 subjects, respectively. We use the initial release of the 5 frames-per-second (FPS) subset for our experiments because the full dataset had not been released at the time of submission. The subset uses an official split [31] containing 371K single-hand and 367K interacting-hand images for training, 113K single-hand and 71K interacting-hand images for validation, and 198K single-hand and 155K interacting-hand images for testing.

Evaluation metrics. We use the three metrics from [30] for hand pose evaluation. The average precision of handedness estimation (AP) measures the accuracy of the handedness prediction. The root-relative mean per joint position error (MPJPE) measures the error in root-relative 3D hand pose estimation. It is the Euclidean distance between the predicted 3D joint locations and the groundtruth after root alignment; for interacting sequences, the alignment is done on the two hands separately. To measure the performance in estimating the relative position between the left and the right root in interacting sequences, we use the mean relative-root position error (MRRPE). It is defined as the Euclidean distance between the predicted and the groundtruth left-hand root position after aligning the left-hand root by the root of the right hand.
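The two pose metrics translate directly into code. The sketch below mirrors the textual definitions (per-hand root alignment for MPJPE, right-root alignment for MRRPE); joint ordering, root indices, and array layouts are assumptions.

```python
import numpy as np


def mpjpe(pred, gt, root_idx=0):
    """Root-relative mean per joint position error for one hand.
    pred, gt: (J, 3) joints in camera coordinates (mm)."""
    pred_rel = pred - pred[root_idx]
    gt_rel = gt - gt[root_idx]
    return np.linalg.norm(pred_rel - gt_rel, axis=-1).mean()


def mrrpe(pred_left_root, pred_right_root, gt_left_root, gt_right_root):
    """Mean relative-root position error: distance between the predicted and
    groundtruth left-hand roots after aligning by the right-hand root."""
    pred_rel = pred_left_root - pred_right_root
    gt_rel = gt_left_root - gt_right_root
    return np.linalg.norm(pred_rel - gt_rel)
```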
4.1. Implementation details

We implement our models in PyTorch [40] using an HRNet-W32 [49] backbone pre-trained on the ImageNet dataset [8]. Following [30], we crop the hand region using the groundtruth bounding box for both training and testing images and resize the cropped image to 256×256 before feeding it to the network. The spatial dimensions of the 2D heatmap H_2D and the 2D depth map H_z are 64×64. We obtain groundtruth part segmentation for training our segmentation network by rendering groundtruth hand meshes from InterHand2.6M [30] with a neural renderer [20, 22] using a custom texture map (see Fig. 4). Details of our network are in the Sup. Mat. To balance the loss in Eq. 6, we choose λ_b = 1.0 and λ_s = 10.0 based on the average MPJPE for single and interacting images on the validation set. We do not apply L_s for models without a segmentation network.

Training procedure. We train all models with both single-hand and interacting-hand sequences using the Adam optimizer [21] with an initial learning rate of 10^{−4} and a batch size of 64. For experiments comparing to the state-of-the-art, we train our models for 50 epochs and decay the learning rate at epoch 40. For the ablation experiments, since the InterHand2.6M [30] subset has a large number of frames (738K frames), for time efficiency we train the models for 30 epochs and decay the learning rate at epochs 10 and 20. We use a factor of 10 for all learning rate decays.

4.2. Comparison with the state-of-the-art

Figure 5. Qualitative results from our model. The lighting is adjusted for display purposes (not model input). Best viewed zoomed in.

Table 1. Comparison with the state-of-the-art. The left and right of the slash are for single and interacting images.

Methods                  | MPJPE Val   | MRRPE Val | MPJPE Test  | MRRPE Test
InterHand2.6M [30]       | 14.82/20.59 | 35.99     | 12.63/17.36 | 34.49
Baseline                 | 14.64/20.24 | 35.05     | 12.32/17.23 | 32.70
Ours                     | 13.54/18.28 | 32.21     | 11.32/15.57 | 30.51
% improvement over [30]  | 8.64/11.22  | 10.50     | 10.37/10.31 | 11.54

Table 1 shows the root-relative mean per joint position error (MPJPE) in millimeters for both single and interacting hand sequences. The results show that, despite having fewer parameters, our baseline network slightly improves over InterHand2.6M [30]. For estimating the root-relative hand pose, measured by MPJPE, our proposed model outperforms InterHand2.6M [30] by 1.31mm/1.79mm for single and interacting hand images on the test set. For estimating the relative position between the left- and right-hand roots, our model outperforms InterHand2.6M [30] by 3.98mm on the test set. The average precision of InterHand2.6M [30], our baseline, and our proposed model on the validation set is 98.35, 98.13, and 98.12 percent, respectively. Fig. 5 shows qualitative results from our model. The figure shows the input image, 2D keypoints, and the segmentation masks overlaid on the image. The predicted 3D pose is shown in two views. More examples are in the Sup. Mat.

4.3. Ablation study

Here we aim to provide further insights into how, when, and why our proposed method (Fig. 2) improves over the baseline (Fig. 6a). In particular, we examine the impact of the bone loss L_b (BL), the part segmentation loss L_s (SL), and the segmentation features S (SF) for hand pose estimation.

From Table 2, comparing the performance of the baseline with and without the bone loss, using the same architecture (Fig. 6a), we see a 0.32mm and 1.17mm improvement in 3D pose estimation for single and interacting hands on the test set. The bone loss improves both single and interacting hand pose because the loss is applied whenever both joints of a bone are labeled.

Inspired by the multi-task learning paradigm [24], we investigate whether predicting part segmentation regularizes hand pose estimation. In particular, we train a pose estimator with an additional head to predict the part segmentation of hands (see Fig. 6b) but do not use the segmentation for pose estimation. The result shows that the segmentation loss improves pose estimation over the baseline with bone loss by 0.84mm/0.49mm for single and interacting hands on the test set. For the relative root position, the segmentation loss dramatically improves the MRRPE metric by 5.26mm on the test set over the baseline with bone loss. The reason is that, to satisfy the loss L_s, the backbone has to distinguish the left and the right roots, and this distinction reduces appearance ambiguity, resulting in better hand root localization.
Figure 6. Pose estimators for the ablation studies: (a) without segmentation, (b) with segmentation but only for applying a loss, (c) segmentation for pose estimation (no visual features).

Table 2. Effects of bone loss (BL), segmentation loss (SL), and segmentation features (SF). The symbol * denotes not using segm. supervision. Left/right denotes single and interacting images.

Ablation Study             | MPJPE Val   | MRRPE Val | MPJPE Test  | MRRPE Test
Baseline                   | 15.27/21.91 | 36.73     | 13.15/18.71 | 34.05
Baseline + BL              | 14.97/20.43 | 38.84     | 12.83/17.54 | 37.36
Baseline + BL + SL         | 14.14/19.72 | 34.51     | 11.99/17.05 | 32.10
Baseline + BL + SF*        | 14.50/20.31 | 40.09     | 12.46/17.60 | 38.20
Baseline + BL + SF (ours)  | 13.82/19.05 | 34.14     | 11.64/16.55 | 31.39

Table 3. Effects of end-to-end training. The entry with § trains the segm. and the pose networks separately [38, 55]. The left and right of the slash are for the single and interacting images.

Ablation Study     | MPJPE Val   | MRRPE Val | MPJPE Test  | MRRPE Test
Part segm.§        | 16.68/23.52 | 41.99     | 14.35/20.57 | 38.87
Part segm. (ours)  | 14.06/20.01 | 35.13     | 12.30/17.22 | 32.88

Finally, in addition to the bone loss and the segmentation loss, our model (see Fig. 2) makes use of both the segmentation and visual features for pose estimation. Compared to the baseline with bone loss, our model reduces the MPJPE for single and interacting hands further by 1.15mm/1.38mm on the validation set and 1.19mm/0.99mm on the test set. Further, we improve MRRPE by 5.97mm on the test set. We also trained a network with the same architecture but not supervised by the segmentation loss (Baseline + BL + SF*). Since the performance of Baseline + BL is similar to Baseline + BL + SF*, we can assume that our improvement is not due to the additional network parameters.

4.4. Analysis of our proposed model

Here we first provide a qualitative analysis to show how segmentation helps with appearance ambiguity. We then investigate the hand pose performance of InterHand2.6M [30], our baseline, and our final model under different degrees of interaction to show that modeling pixel ownership via segmentation helps to reduce errors in hand pose estimation.

Qualitative Analysis. We investigate how segmentation helps to reduce appearance ambiguity in hand pose estimation. To build intuition, we provide the 2D estimations of individual joints in Fig. 7. The cross sign indicates the groundtruth 2D location of the joint of interest and the plus sign is the predicted location. The example in the first column shows that without segmentation the baseline model's predictions (Baseline + BL in Table 2) contain significant uncertainty due to the presence of the other hand in the image, as indicated by the dispersed 2D heatmap with modes on both hands (see the same behaviour for the InterHand2.6M [30] model in the Sup. Mat). As a result, the 2D prediction is centered between the hands after the soft-argmax [26]. In contrast, with segmentation, our network (Baseline + BL + SF in Table 2) disambiguates the different hands and provides a single-mode estimate. A similar observation can be made in the single-hand case for ambiguity between fingers.

Impact of interaction and occlusion. We study how pose estimation performance is affected by the degree of interaction. In particular, we use the IoU between the groundtruth left/right masks (not the part segmentation) to measure the degree of interaction and occlusion; a higher IoU implies more occlusion. Fig. 8 compares the results on the validation set. The bars show the MPJPE over annotated joints for each IoU range, while the half-length of the error bars corresponds to 0.5 times the MPJPE standard deviation in that range (for better display). Typical hand masks are shown above the bars of each IoU range. We observe more errors in interacting-hand cases even when the two hands do not intersect, which indicates that appearance ambiguity applies as long as two hands are present. For non-degenerate occlusion (IoU ≤ 0.67), our method shows consistent improvement over InterHand2.6M [30]. The improvement in the single-hand case is smaller because appearance ambiguity is amplified in interacting hands. In the high-IoU regime (> 0.67), the improvement levels off, which is expected since the second hand is almost entirely invisible and the problem is no longer caused by ambiguities; it would be extremely challenging to reliably estimate the correct pose from a single image.
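The degree-of-interaction measure used for Fig. 8 reduces to the IoU of the two groundtruth hand masks, which can be computed as below; the bin edges in the sketch are illustrative assumptions, since the paper only states the 0.67 threshold explicitly.

```python
import numpy as np


def hand_mask_iou(left_mask, right_mask):
    """IoU between binary left/right hand masks of shape (H, W)."""
    inter = np.logical_and(left_mask, right_mask).sum()
    union = np.logical_or(left_mask, right_mask).sum()
    return inter / union if union > 0 else 0.0


def iou_bin(iou, edges=(0.0, 0.17, 0.33, 0.5, 0.67, 1.0)):
    """Assign a sample to an IoU range so that MPJPE can be averaged per bin."""
    return int(np.digitize(iou, edges[1:-1]))
```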
End-to-end training. Table 3 shows the effect of end-to-end training compared to training the image-to-segm. and segm.-to-pose models separately [38]. We use the architecture in Fig. 6c. The first row summarizes the accuracy of a variant where we train the segmentation network alongside the backbone for 25 epochs and then train the pose network for another 25 epochs, while freezing the segmentation and the backbone networks. For the end-to-end condition, we train the two models jointly for a total of 25 epochs. This ensures the same number of training epochs for the segmentation network in both conditions. The results show that end-to-end training outperforms two-stage training.

Full probabilities vs. part labels. Existing body-pose methods [38, 55] decouple the image-to-pose problem into image-to-segm. and segm.-to-pose. They use class-label segmentation maps, while our method leverages the full segmentation probability distribution. Table 4 compares the two for pose estimation. All models use the network in Fig. 6c. The models using class-label maps are marked with †. Compared to probabilistic maps, the class-label maps lose a substantial amount of information for pose estimation.
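The difference between the two intermediate representations compared in Table 4 is easiest to see in code: the probabilistic variant forwards the raw per-pixel logits, whereas the class-label variant quantizes them with an argmax and a one-hot encoding before the pose network sees them. This is a sketch and the function names are ours.

```python
import torch
import torch.nn.functional as F


def probabilistic_maps(seg_logits):
    # Ours: pass the unnormalized per-pixel logits (B, C, H, W) downstream,
    # preserving the uncertainty over the C part classes.
    return seg_logits


def class_label_maps(seg_logits):
    # Class-label variant [38, 55]: keep only the argmax class per pixel,
    # re-encoded as a one-hot map. The per-pixel uncertainty is discarded.
    labels = seg_logits.argmax(dim=1)                    # (B, H, W)
    one_hot = F.one_hot(labels, seg_logits.shape[1])     # (B, H, W, C)
    return one_hot.permute(0, 3, 1, 2).float()           # (B, C, H, W)
```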
Figure 7. Qualitative results of how segmentation reduces appearance ambiguity. The image lighting is adjusted for display purposes (not model input). The dispersed 2D heatmap issue also arises in the model proposed in InterHand2.6M [30]. See the Sup. Mat for a more in-depth analysis; the skeleton notation is also provided in the Sup. Mat. Plus: prediction; Cross: groundtruth. Best viewed zoomed in.

Figure 8. Comparing pose estimation performance by the degree of interaction/occlusion. The IoU between groundtruth left/right masks measures the degree of interaction. SH and IH denote single and interacting images. The left (yellow) and right (blue) hand masks provide interaction examples in each IoU range.

Table 4. Different segmentation types as intermediate representations. With the model in Fig. 6c, we show the effects of left/right masks and part segm. Entries with † use class labels [38, 55] instead of probabilistic maps. The left and right of the slash are for the single and interacting images.

Ablation Study     | MPJPE Val   | MRRPE Val | MPJPE Test  | MRRPE Test
LR segm.†          | 28.72/36.05 | 50.85     | 25.75/31.46 | 46.98
Part segm.†        | 17.69/25.49 | 46.00     | 15.16/22.08 | 41.46
LR segm. (ours)    | 14.87/21.19 | 34.70     | 12.92/18.40 | 32.13
Part segm. (ours)  | 14.03/20.01 | 35.26     | 12.29/17.23 | 32.88

5. Discussion

Our insight is that self-similarity in hands can be addressed by modeling pixel ownership. While our approach resembles 3D body-pose methods [38, 55] in terms of leveraging part segmentation, the key difference is that we incorporate the segmentation task in an end-to-end fashion, leading to more informative representations and a significant improvement in the main task. In particular, we pass the unnormalized segmentation probability distribution (i.e., logits) to the pose estimator, preserving the uncertainty for the downstream task. In contrast, the methods from [38, 55] take the quantized information (i.e., class labels). Our simple yet effective formulation also enables fully-differentiable end-to-end learning of hand pose estimation in conjunction with hand segmentation. We empirically show that our end-to-end multi-task setup achieves better performance compared to separate training of tasks (see Table 3) and class-label inputs (see Table 4) as in [38, 55].

6. Conclusion

In this paper, we introduce a framework for interacting 3D hand pose estimation that explicitly addresses self-similarity between joints. Our method consists of two interwoven branches that process an input image into a per-pixel semantic part segmentation mask and a visual feature volume. The part segmentation mask provides semantic features for visually similar hand regions, while the visual feature volume provides rich visual cues for accurate pose estimation. Our experiments show that our proposed method achieves state-of-the-art performance on the InterHand2.6M [30] dataset across all metrics. Detailed ablation studies show the efficacy of our method and provide insights into how the modeling of pixel ownership addresses self-ambiguity in single and interacting hand pose estimation.

Acknowledgement. The authors want to thank Emre Aksan and Dimitrios Tzionas for their valuable feedback.

Disclosure. MJB has received research gift funds from Adobe, Intel, Nvidia, Facebook, and Amazon. While MJB is a part-time employee of Amazon, his research was performed solely at, and funded solely by, Max Planck. MJB has financial interests in Amazon, Datagen Technologies, and Meshcapade GmbH.
References a deformable model. In Proceedings of the Second Interna- tional Conference on Automatic Face and Gesture Recogni- [1] Vassilis Athitsos and Stan Sclaroff. Estimating 3d hand pose tion, pages 140–145. Ieee, 1996. 2 from a cluttered image. In CVPR, volume 2, pages II–432. [17] Umar Iqbal, Andreas Doering, Hashim Yasin, Björn Krüger, IEEE, 2003. 3 Andreas Weber, and Juergen Gall. A dual-source approach [2] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Push- for 3d human pose estimation from single images. Comput. ing the envelope for rgb-based dense 3d hand pose estimation Vis. Image Underst., 172:37–49, 2018. 2 via neural rendering. In CVPR, pages 1067–1076. Computer [18] Umar Iqbal, Pavlo Molchanov, Thomas Breuel Juergen Gall, Vision Foundation / IEEE, 2019. 3 and Jan Kautz. Hand pose estimation via latent 2.5D [3] Luca Ballan, Aparna Taneja, Jürgen Gall, Luc Van Gool, heatmap regression. In ECCV, pages 118–134, 2018. 1, 2, 4 and Marc Pollefeys. Motion capture of hands in action us- [19] Byeongkeun Kang, Kar-Han Tan, Nan Jiang, Hung-Shuo ing discriminative salient points. In ECCV, pages 640–653. Tai, Daniel Tretter, and Truong Nguyen. Hand segmentation Springer, 2012. 2, 3 for hand-object interaction from depth map. In GlobalSIP, [4] Adnane Boukhayma, Rodrigo de Bem, and Philip H. S. Torr. pages 259–263. IEEE, 2017. 3 3d hand shape and pose from images in the wild. In CVPR, [20] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neu- pages 10843–10852. Computer Vision Foundation / IEEE, ral 3d mesh renderer. In CVPR, 2018. 6 2019. 2, 3 [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for [5] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, stochastic optimization. In ICLR (Poster), 2015. 6 Junsong Yuan, and Nadia Magnenat-Thalmann. Exploit- [22] Nikos Kolotouros. Pytorch implementation of the neural ing spatial-temporal relationships for 3d pose estimation via mesh renderer, 2018. 6 graph convolutional networks. In ICCV, pages 2272–2281. [23] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, IEEE, 2019. 1, 2 Michael M. Bronstein, and Stefanos Zafeiriou. Weakly- [6] Yunlong Che and Yue Qi. Dynamic projected segmentation supervised mesh-convolutional hand reconstruction in the networks for hand pose estimation. In ICRA, pages 477–482. wild. In CVPR, pages 4989–4999. IEEE, 2020. 2 IEEE, 2018. 3 [24] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi [7] Martin de La Gorce, David J Fleet, and Nikos Para- Parikh, and Stefan Lee. 12-in-1: Multi-task vision and gios. Model-based 3d hand pose estimation from monocular language representation learning. In CVPR, pages 10437– video. TPAMI, 33(9):1793–1805, 2011. 2 10446, 2020. 6 [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, [25] Shan Lu, Dimitris Metaxas, Dimitris Samaras, and John and Li Fei-Fei. Imagenet: A large-scale hierarchical image Oliensis. Using multiple cues for hand tracking and model database. In CVPR, pages 248–255. Ieee, 2009. 5 refinement. In CVPR, volume 2, pages II–443. IEEE, 2003. [9] Bardia Doosti, Shujon Naha, Majid Mirbagheri, and David J. 2 Crandall. Hope-net: A graph-based model for hand-object [26] Diogo C Luvizon, Hedi Tabia, and David Picard. Human pose estimation. In CVPR, pages 6607–6616. IEEE, 2020. pose regression by combining indirect part detection and 1, 2 contextual information. Computers & Graphics, 85:15–22, [10] Zhipeng Fan, Jun Liu, and Yao Wang. Adaptive computa- 2019. 
4, 7 tionally efficient network for monocular 3d hand pose esti- [27] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. mation. In ECCV, pages 127–144. Springer, 2020. 1, 2 Camera distance-aware top-down approach for 3d multi- [11] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying person pose estimation from a single rgb image. In ICCV, Wang, Jianfei Cai, and Junsong Yuan. 3D hand shape and pages 10133–10142, 2019. 5 pose estimation from a single rgb image. In CVPR, pages [28] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image- 10833–10842, 2019. 2 to-lixel prediction network for accurate 3d human pose and [12] Shangchen Han, Beibei Liu, Randi Cabezas, Christopher D mesh estimation from a single rgb image. In ECCV, 2020. 2 Twigg, Peizhao Zhang, Jeff Petkau, Tsz-Ho Yu, Chun-Jung [29] Gyeongsik Moon, Takaaki Shiratori, and Kyoung Mu Lee. Tai, Muzaffer Akbay, Zheng Wang, et al. Megatrack: Deephandmesh: A weakly-supervised deep encoder-decoder monochrome egocentric articulated hand-tracking for virtual framework for high-fidelity hand mesh modeling. In CVPR, reality. TOG, 39(4):87–1, 2020. 2 2020. 2 [13] Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, [30] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, Marc Pollefeys, and Cordelia Schmid. Leveraging photomet- and Kyoung Mu Lee. Interhand2.6m: A dataset and baseline ric consistency over time for sparsely supervised hand-object for 3d interacting hand pose estimation from a single rgb im- reconstruction. In CVPR, pages 568–577. IEEE, 2020. 2 age. In ECCV, 2020. 1, 2, 3, 4, 5, 6, 7, 8 [14] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kale- [31] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, vatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. and Kyoung Mu Lee. InterHand2.6M: A dataset and baseline Learning joint reconstruction of hands and manipulated ob- for 3d interacting hand pose estimation from a single rgb im- jects. In CVPR, pages 11807–11816. Computer Vision Foun- age. https://github.com/facebookresearch/ dation / IEEE, 2019. 2 InterHand2.6M, 2020. 5 [15] Yihui He, Rui Yan, Katerina Fragkiadaki, and Shoou-I Yu. [32] Franziska Mueller, Florian Bernard, Oleksandr Sotny- Epipolar transformers. In CVPR, pages 7779–7788, 2020. 2, chenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and 3 Christian Theobalt. Ganerated hands for real-time 3d hand [16] Tony Heap and David Hogg. Towards 3d hand tracking using tracking from monocular RGB. In CVPR, pages 49–59.
IEEE Computer Society, 2018. 1, 2 Springer, 2015. 4 [33] Franziska Mueller, Micah Davis, Florian Bernard, Oleksandr [45] Tomas Simon, Hanbyul Joo, Iain A. Matthews, and Yaser Sotnychenko, Mickeal Verschoor, Miguel A Otaduy, Dan Sheikh. Hand keypoint detection in single images using Casas, and Christian Theobalt. Real-time pose and shape multiview bootstrapping. In CVPR, pages 4645–4653. IEEE reconstruction of two interacting hands with a single depth Computer Society, 2017. 2, 3 camera. TOG, 38(4):1–13, 2019. 2, 3 [46] Breannan Smith, Chenglei Wu, He Wen, Patrick Peluse, [34] Markus Oberweger and Vincent Lepetit. Deepprior++: Im- Yaser Sheikh, Jessica K Hodgins, and Takaaki Shiratori. proving fast and accurate 3d hand pose estimation. In ICCV Constraining dense hand surface tracking with elasticity. Workshops, pages 585–594, 2017. 3 TOG, 39(6):1–14, 2020. 2, 3 [35] Markus Oberweger, Paul Wohlhart, and Vincent Lepetit. [47] Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Otmar Hilliges, Hands deep in deep learning for hand pose estimation. arXiv and Jan Kautz. Weakly supervised 3d hand pose estimation preprint arXiv:1502.06807, 2015. 3 via biomechanical constraints. In ECCV, 2020. 1, 2 [36] Iasonas Oikonomidis, Nikolaos Kyriazis, and Antonis A. Ar- [48] Adrian Spurr, Jie Song, Seonwook Park, and Otmar Hilliges. gyros. Full DOF tracking of a hand interacting with an object Cross-modal deep variational hand pose estimation. In by modeling occlusions and physical constraints. In Dim- CVPR, pages 89–98. IEEE Computer Society, 2018. 1, 2 itris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc Van [49] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep Gool, editors, ICCV, pages 2088–2095. IEEE Computer So- high-resolution representation learning for human pose esti- ciety, 2011. 2 mation. In CVPR, 2019. 5 [37] Iasonas Oikonomidis, Nikolaos Kyriazis, and Antonis A Ar- [50] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: uni- gyros. Tracking the articulated motion of two strongly inter- fied egocentric recognition of 3d hand-object poses and in- acting hands. In CVPR, pages 1862–1869. IEEE, 2012. 2, teractions. In CVPR, pages 4511–4520. Computer Vision 3 Foundation / IEEE, 2019. 1, 2 [38] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Pe- [51] Dimitrios Tzionas, Luca Ballan, Abhilash Srikantha, Pablo ter Gehler, and Bernt Schiele. Neural body fitting: Unifying Aponte, Marc Pollefeys, and Juergen Gall. Capturing hands deep learning and model based human pose and shape esti- in action using discriminative salient points and physics sim- mation. In 3DV, pages 484–494. IEEE, 2018. 2, 3, 7, 8 ulation. IJCV, 118(2):172–193, 2016. 2, 3 [39] Paschalis Panteleris, Iason Oikonomidis, and Antonis Argy- [52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- ros. Using a single rgb frame for real time 3d hand pose es- reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia timation in the wild. In WACV, pages 436–445. IEEE, 2018. Polosukhin. Attention is all you need. In NIPS, pages 5998– 2 6008, 2017. 3 [40] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, [53] Jiayi Wang, Franziska Mueller, Florian Bernard, Suzanne James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Sorli, Oleksandr Sotnychenko, Neng Qian, Miguel A Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Otaduy, Dan Casas, and Christian Theobalt. 
Rgb2hands: Andreas Köpf, Edward Yang, Zachary DeVito, Martin Rai- real-time tracking of 3d hand interactions from monocular son, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, rgb video. TOG, 39(6):1–16, 2020. 2, 3 Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An [54] Linlin Yang and Angela Yao. Disentangling latent hands for imperative style, high-performance deep learning library. In image synthesis and pose estimation. In CVPR, pages 9877– NeurIPS, pages 8024–8035, 2019. 5 9886. Computer Vision Foundation / IEEE, 2019. 1, 2 [41] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas [55] Andrei Zanfir, Eduard Gabriel Bazavan, Hongyi Xu, Daniilidis. Learning to estimate 3d human pose and shape William T Freeman, Rahul Sukthankar, and Cristian Smin- from a single color image. In CVPR, pages 459–468, 2018. chisescu. Weakly supervised 3d human pose and shape re- 2, 3 construction with normalizing flows. In ECCV, pages 465– [42] James M. Rehg and Takeo Kanade. Visual tracking of high 481. Springer, 2020. 2, 3, 7, 8 dof articulated structures: An application to human hand [56] Cairong Zhang, Guijin Wang, Xinghao Chen, Pengwei Xie, tracking. In Jan-Olof Eklundh, editor, ECCV, pages 35–46, and Toshihiko Yamasaki. Weakly supervised segmentation Berlin, Heidelberg, 1994. Springer Berlin Heidelberg. 2 guided hand pose estimation during interaction with un- [43] Javier Romero, Dimitrios Tzionas, and Michael J. Black. known objects. In ICASSP, pages 2673–2677. IEEE, 2020. Embodied hands: modeling and capturing hands and bodies 3 together. ACM Trans. Graph., 36(6):245:1–245:17, 2017. 2 [57] Xiong Zhang, Qiang Li, Hong Mo, Wenbo Zhang, and Wen [44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- Zheng. End-to-end hand mesh recovery from a monocular net: Convolutional networks for biomedical image segmen- RGB image. In ICCV, pages 2354–2364. IEEE, 2019. 2 tation. In International Conference on Medical image com- [58] Christian Zimmermann and Thomas Brox. Learning to esti- puting and computer-assisted intervention, pages 234–241. mate 3d hand pose from single RGB images. In ICCV, pages 4913–4921. IEEE Computer Society, 2017. 1, 2, 3