KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from Numerous Human Annotations
Yang You, Yujing Lou*, Chengkun Li*, Zhoujun Cheng, Liangwei Li, Lizhuang Ma, Weiming Wang†, Cewu Lu
Shanghai Jiao Tong University, China
arXiv:2002.12687v6 [cs.CV] 7 Aug 2020
* These authors contributed equally. † Weiming Wang is the corresponding author.

Abstract

Detecting 3D object keypoints is of great interest to both the graphics and computer vision communities. There have been several 2D and 3D keypoint datasets aiming to address this problem in a data-driven way. These datasets, however, either lack scalability or bring ambiguity to the definition of keypoints. Therefore, we present KeypointNet: the first large-scale and diverse 3D keypoint dataset that contains 103,450 keypoints and 8,234 3D models from 16 object categories, built by leveraging numerous human annotations. To handle the inconsistency between annotations from different people, we propose a novel method to aggregate these keypoints automatically, through minimization of a fidelity loss. Finally, ten state-of-the-art methods are benchmarked on our proposed dataset. Our code and data are available on https://github.com/qq456cvb/KeypointNet.

Figure 1. We propose a large-scale KeypointNet dataset. It contains 8K+ models and 103K+ keypoint annotations.

1. Introduction

Detection of 3D keypoints is essential in many applications such as object matching, object tracking, shape retrieval and registration [22, 6, 37]. Using keypoints to match 3D objects has the advantage of providing semantically significant features, and such keypoints are usually made invariant to rotations, scales and other transformations.

In the trend of deep learning, 2D semantic point detection has been boosted with the help of a large quantity of high-quality datasets [3, 23]. However, there are few 3D datasets focusing on the keypoint representation of an object. Dutagaci et al. [11] collect 43 models and label them according to annotations from various persons; annotations from different persons are finally aggregated by geodesic clustering. The ShapeNetCore keypoint dataset [42] and a similar dataset [15], in another way, resort to an expert's annotation of keypoints, making them vulnerable and biased.

In order to alleviate the bias of experts' definitions of keypoints, we ask a large group of people to annotate various keypoints according to their own understanding. Challenges arise in that different people may annotate different keypoints, and we need to identify the consensus and patterns in these annotations. Finding such patterns is not trivial when a large set of keypoints spreads across the entire model. A simple clustering would require a predefined distance threshold and fail to identify closely spaced keypoints. As shown in Figure 1, there are four closely spaced keypoints on each airplane empennage, and it is extremely hard for simple clustering methods to distinguish them. Besides, clustering algorithms do not give semantic labels of keypoints, since it is ambiguous to link clustered groups with each other. In addition, people's annotations are not always exact, and errors in annotated keypoint locations are inevitable. To solve these problems, we propose a novel method to aggregate a large number of keypoint annotations from distinct people by optimizing a fidelity loss. After this automatic aggregation process, we verify the generated keypoints based on some simple priors such as symmetry.
In this paper, we build the first large-scale and diverse dataset named KeypointNet, which contains 8,234 models with 103,450 keypoints. These keypoints are of high fidelity and rich in structural or semantic meaning. Some examples are given in Figure 1. We hope this dataset can boost semantic understanding of common objects.

In addition, we propose two large-scale keypoint prediction tasks: keypoint saliency estimation and keypoint correspondence estimation. We benchmark ten state-of-the-art algorithms with mIoU, mAP and PCK metrics. Results show that the detection and identification of keypoints remain a challenging task.

In summary, we make the following contributions:
• To the best of our knowledge, we provide the first large-scale dataset on 3D keypoints, both in number of categories and keypoints.
• We come up with a novel approach for aggregating people's annotations of keypoints, even if their annotations are independent from each other.
• We experiment with ten state-of-the-art benchmarks on our dataset, including point cloud, graph, voxel and local geometry based keypoint detection methods.

2. Related Work

2.1. Detection of Keypoints

Detection of 3D keypoints has been a very important task for 3D object understanding, used in many applications such as object pose estimation, reconstruction, matching and segmentation. Researchers have proposed various methods to produce interest points on objects to help further object processing. Traditional methods like 3D Harris [31], HKS [32], Salient Points [7], Mesh Saliency [18], Scale Dependent Corners [24], CGF [14] and SHOT [35] exploit local reference frames (LRF) to extract geometric features as local descriptors. However, these methods only consider local geometric information without semantic knowledge, which forms a gap between detection algorithms and human understanding.

Recent deep learning methods like SyncSpecCNN [42] and deep functional dictionaries [33] have been proposed to detect keypoints. Unlike traditional ones, these methods do not handle rotations well. Though some recent methods like S2CNN [9] and PRIN [43] try to fix this, deep learning methods still rely on ground-truth keypoint labels annotated by humans with expert verification.

2.2. Keypoint Datasets

Keypoint datasets have their origin in 2D images, where plenty of datasets on human skeletons and object interest points have been proposed. For human skeletons, the MPII human pose dataset [3], the MSCOCO keypoint challenge [1] and PoseTrack [2] annotate millions of keypoints on humans. For more general objects, SPair-71k [23] contains 70,958 image pairs with diverse variations in viewpoint and scale, with a number of corresponding keypoints on each image pair. CUB [36] provides 15 part locations on 11,788 images from 200 bird categories, and PASCAL [5] provides keypoint annotations for 20 object categories. HAKE [19] provides numerous annotations on human interactiveness keypoints. ADHA [25] annotates key adverbs in videos, which are sequences of 2D images.

Keypoint datasets on 3D objects include Dutagaci et al. [11], SyncSpecCNN [42] and Kim et al. [15]. Dutagaci et al. [11] aggregate multiple annotations from different people with an ad-hoc method, but the dataset is extremely small. Though SyncSpecCNN [42], Pavlakos et al. [26] and Kim et al. [15] give relatively large keypoint datasets, they rely on a manually designed template of keypoints, which is inevitably biased and flawed. GraspNet [12] gives dense annotations for 3D object grasping. The differences between these datasets and ours are illustrated in Table 1.

3. KeypointNet: A Large-scale 3D Keypoint Dataset

3.1. Data Collection

KeypointNet is built on ShapeNetCore [8]. ShapeNetCore covers 55 common object categories with about 51,300 unique 3D models. We filter out models that deviate from the majority and keep at most 1,000 instances for each category in order to provide a balanced dataset. In addition, a consistent canonical orientation is established (e.g., upright and front) for every category because of the incomplete alignment in ShapeNetCore.

We let annotators determine which points are important, and the same keypoint indices should indicate the same meanings for each annotator. Though annotators are free to give their own keypoints, three general principles should be obeyed: (1) each keypoint should describe an object's semantic information shared across instances of the same object category, (2) keypoints of an object category should spread over the whole object and (3) different keypoints should have distinct semantic meanings. After that, we utilize a heuristic method to aggregate these points, which will be discussed in Section 4.

Keypoints are annotated on meshes, and these annotated meshes are then downsampled to 2,048 points. Our final dataset is a collection of point clouds with keypoint indices.

3.2. Annotation Tools

We develop an easy-to-use web annotation tool based on NodeJS. Every user is allowed to click up to 24 interest points according to his/her own understanding.
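The point clouds released with the dataset are 2,048-point downsamplings of the annotated meshes (Section 3.1). Below is a minimal sketch of one plausible way to produce such clouds — plain farthest point sampling over mesh vertices, with annotated keypoints snapped to their nearest sampled point. The function names, sampling scheme, and stand-in data are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def farthest_point_sample(points: np.ndarray, n_samples: int = 2048) -> np.ndarray:
    """Greedy farthest point sampling; returns indices of the sampled vertices."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)                 # arbitrary seed vertex
    for i in range(1, n_samples):
        gap = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, gap)           # distance to nearest selected vertex
        selected[i] = int(np.argmax(min_dist))         # pick the farthest remaining vertex
    return selected

# Hypothetical usage with stand-in data: downsample vertices, re-index keypoints.
vertices = np.random.rand(50000, 3)                    # stand-in for mesh vertices
keypoints = vertices[np.random.choice(50000, 5)]       # stand-in for annotated keypoints
idx = farthest_point_sample(vertices, 2048)
cloud = vertices[idx]
# each keypoint is stored as the index of its nearest point in the 2,048-point cloud
kp_idx = np.argmin(np.linalg.norm(cloud[None] - keypoints[:, None], axis=-1), axis=1)
```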
Table 1. Comparison of existing keypoint datasets.

Dataset               Domain   Correspondence  Template-free  Instances  Categories  Keypoints  Format
FAUST [4]             human    ✓               ×              100        1           689K       mesh
SyncSpecCNN [42]      chair    ✓               ×              6243       1           ~60K       point cloud
Dutagaci et al. [11]  general  ×               ✓              43         16          —          —
Figure 3 (panels: Airplane, Bottle, Cap, Chair, Skateboard, Table). Dataset visualization. Here we plot ground-truth keypoints for several categories. We can see that by utilizing our automatic aggregation method, keypoints of high fidelity are extracted.

where dθ is the squared L2 distance between two vectors in embedding space:

    d_\theta(a, b) = \| g_\theta(a) - g_\theta(b) \|_2^2,

and

    n^* = \arg\min_n d_\theta(l_{m,n}^{(c)}, y_m).

Unlike previous methods such as Dutagaci et al. [11], where a simple geodesic average of human-labeled points is taken as the ground-truth points, we seek a point whose expected embedding distance to all human-labeled points is smallest. The reason is that geodesic distance is sensitive to misannotated keypoints and cannot distinguish closely spaced keypoints, while embedding distance is more robust to noisy points, as the embedding space encodes the semantic information of an object.

Equation 1 involves both θ and Y, and it is impractical to solve this problem in closed form. In practice, we use alternating minimization with a deep neural network to approximate the embedding function gθ, so that we solve the following dual problem instead (by slightly loosening the constraints):

    \theta^* = \arg\min_\theta \Big[ H(g_\theta) + \lambda \sum_{m_1, m_2} \sum_{k}^{K_m} \| g_\theta(y_{m_1,k}) - g_\theta(y_{m_2,k}) \|_2^2 \Big],    (2)
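The embedding distance dθ and the consistency term of Equation 2 translate directly into a few lines of tensor code. Below is a minimal PyTorch-style sketch, assuming the embedding network maps an (M, K, 3) batch of keypoints to (M, K, D) features; `embed_net` and `fidelity_term` (standing in for H(gθ) of Equation 1, which is not reproduced above) are hypothetical placeholders, not the authors' released implementation.

```python
import torch

def embedding_distance(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
    """d_theta(a, b) = ||g_theta(a) - g_theta(b)||_2^2, on already-embedded points."""
    return ((g_a - g_b) ** 2).sum(dim=-1)

def consistency_loss(embed_net, keypoints: torch.Tensor) -> torch.Tensor:
    """Second term of Eq. 2: keypoints sharing index k should embed identically across
    models. `keypoints` is (M, K, 3): M models, K aggregated keypoints each."""
    emb = embed_net(keypoints)                                      # (M, K, D) embeddings
    return embedding_distance(emb[:, None], emb[None, :]).mean()   # all model pairs, all k

def dual_objective(embed_net, keypoints, fidelity_term, lam: float = 1.0) -> torch.Tensor:
    """theta* = argmin [ H(g_theta) + lambda * consistency ] as in Eq. 2; `fidelity_term`
    stands in for H(g_theta) from Eq. 1, which is not reproduced in this transcription."""
    return fidelity_term(embed_net) + lam * consistency_loss(embed_net, keypoints)
```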
Figure 4 (pipeline: Raw Annotations → Embedding Space → Fidelity Error Map → NMS → Potential Keypoints → t-SNE Clustering → Clustered Keypoints → Human Verification → Aggregated Keypoints). Keypoint aggregation pipeline. We first infer dense embeddings from human-labeled raw annotations. Then fidelity error maps are calculated by summing embedding distances to human-labeled keypoints. Non-minimum suppression is conducted to form a potential set of keypoints. These keypoints are then projected onto a 2D subspace with t-SNE and verified by humans.

Table 2. Keypoint dataset statistics. Left: number of models in each category. Right: number of keypoints in each category.

Models per category:
Category     Train  Val  Test  All   #Annotators
Airplane     715    102  205   1022  26
Bathtub      344    49   99    492   15
Bed          102    14   30    146   8
Bottle       266    38   76    380   9
Cap          26     4    8     38    6
Car          701    100  201   1002  17
Chair        699    100  200   999   15
Guitar       487    70   140   697   13
Helmet       62     10   18    90    8
Knife        189    27   54    270   5
Laptop       307    44   88    439   12
Motorcycle   208    30   60    298   7
Mug          130    18   38    186   9
Skateboard   98     14   29    141   9
Table        786    113  225   1124  17
Vessel       637    91   182   910   18
Total        5757   824  1653  8234  194

Keypoints per category:
Category     Train   Val    Test   All
Airplane     9695    1379   2756   13830
Bathtub      5519    772    1589   7880
Bed          1276    188    377    1841
Bottle       4366    625    1269   6260
Cap          154     24     48     226
Car          14976   2133   4294   21403
Chair        8488    1180   2395   12063
Guitar       4112    591    1197   5900
Helmet       529     85     162    776
Knife        946     136    276    1358
Laptop       1842    264    528    2634
Motorcycle   2690    394    794    3878
Mug          1427    198    418    2043
Skateboard   903     137    283    1323
Table        6325    913    1809   9047
Vessel       9168    1241   2579   12988
Total        72416   10260  20774  103450

    Y^* = \arg\min_Y \sum_{m=1}^{M} \sum_{c=1}^{C} \sum_{k=1}^{K_m} \int_{\mathcal{M}_m} d_\theta(x, y_{m,k}) \frac{\phi(l_{m,n^*}^{(c)}, x)}{Z(\phi)} \, dx,
    \quad \text{s.t.}\ g_\theta(y_{m_1,k}) \equiv g_\theta(y_{m_2,k}), \ \forall m_1, m_2, k,    (3)

and alternate between the two equations until convergence. By solving this problem, we find both an optimal embedding function gθ and intra-class consistent ground-truth keypoints Y, while keeping their embedding distance from the human-labeled keypoints as small as possible. The ground-truth keypoints can be viewed as the projection of the human-labeled data onto the embedding space.

Non-Minimum Suppression    Equation 3 may be hard to solve since Km is also unknown beforehand. For each model Mm, the fidelity error associated with each potential keypoint ym ∈ Mm is:

    f(y_m, g_\theta) = \sum_{m' \neq m} \sum_{c=1}^{C} \int_{\mathcal{M}_{m'}} d_\theta(x, y_{m'}) \frac{\phi(l_{m',n^*}^{(c)}, x)}{Z(\phi)} \, dx,    (4)

where y_{m'} = \arg\min_{y_{m'} \in \mathcal{M}_{m'}} \| g_\theta(y_{m'}) - g_\theta(y_m) \|_2^2. Then y*_m is found by conducting non-minimum suppression (NMS), such that:

    f(y_m^*, g_\theta) \le f(y_m, g_\theta), \quad \forall y_m \in \mathcal{M}_m,\ d_\theta(y_m, y_m^*) < \delta,    (5)

where δ is some neighborhood threshold.
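Equation 5 amounts to keeping local minima of the fidelity error map. A small sketch of this non-minimum suppression step is given below, assuming the per-point fidelity error of Equation 4 and the per-point embeddings have already been computed for one model; the default threshold value and the function name are illustrative.

```python
import numpy as np

def non_minimum_suppression(embeddings: np.ndarray, fidelity_error: np.ndarray,
                            delta: float = 0.5) -> np.ndarray:
    """Keep candidates whose fidelity error (Eq. 4) is a local minimum: a point survives
    only if no already-kept point lies within embedding distance delta of it (Eq. 5)."""
    order = np.argsort(fidelity_error)                 # visit lowest fidelity error first
    suppressed = np.zeros(len(embeddings), dtype=bool)
    keep = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(int(i))                            # i is a local minimum of the error map
        # d_theta in Eq. 5 is the squared L2 distance in embedding space
        d = ((embeddings - embeddings[i]) ** 2).sum(axis=1)
        suppressed |= d < delta                        # suppress the delta-neighborhood
    return np.asarray(keep)
```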
Figure 5 (columns: Ground Truth, RSNet, RSCNN, DGCNN, GraphCNN, ISS3D, Harris3D; rows: Table, Chair, Mug). Visualizations of detected keypoints for six algorithms.

Table 3. mIoU and mAP results (in percentage) for compared methods with distance threshold 0.01. Entries are mIoU/mAP; the traditional detectors report mIoU only.

Method      Airplane   Bathtub    Bed        Bottle     Cap        Car        Chair      Guitar     Helmet     Knife      Laptop     Motor      Mug        Skate      Table      Vessel     Average
PointNet    10.4/9.3   0.8/6.3    3.0/9.4    0.5/8.2    0.0/0.3    7.3/9.2    7.7/6.5    0.1/1.4    0.0/1.3    0.0/0.9    23.6/32.0  0.1/4.0    0.2/3.9    0.0/2.9    20.9/23.1  7.1/9.1    5.1/8.0
PointNet++  33.5/44.6  18.5/32.3  16.4/23.4  22.1/38.2  13.2/12.1  24.0/41.0  18.2/29.4  24.3/27.7  7.0/7.4    19.0/22.1  30.9/60.0  21.9/36.0  16.3/21.4  13.7/16.1  29.3/47.4  16.7/26.9  20.3/30.4
RSNet       33.7/38.8  23.1/36.0  28.4/42.5  29.4/44.3  26.7/26.9  30.4/42.3  22.7/26.2  30.5/34.2  15.8/21.3  21.7/31.5  39.8/62.0  31.5/43.2  24.1/35.2  21.6/25.9  39.7/54.6  18.3/22.3  27.3/36.7
SpiderCNN   24.6/24.7  13.0/16.4  16.4/17.1  3.6/8.2    0.0/0.5    11.8/17.2  18.2/20.4  7.4/9.9    0.0/0.5    14.9/18.7  25.6/40.0  12.0/13.3  2.2/4.0    1.8/3.3    27.9/40.5  12.9/14.6  12.0/15.6
PointConv   40.0/51.2  29.0/44.7  25.1/43.7  33.6/46.5  0.0/7.0    35.9/55.6  25.5/39.8  28.0/39.3  7.7/8.3    26.6/38.1  43.1/63.5  35.9/50.6  19.8/32.0  9.3/10.1   44.1/64.0  21.7/31.8  26.6/39.1
RSCNN       31.1/45.0  20.4/34.9  26.2/40.1  25.6/40.1  7.0/8.9    25.4/44.7  22.0/33.0  27.3/34.1  8.9/10.8   21.1/27.2  35.2/58.8  23.6/37.0  15.9/18.6  21.9/30.1  27.4/49.8  18.1/28.9  22.3/33.9
DGCNN       39.2/52.0  25.5/40.6  32.4/42.8  40.3/61.6  0.0/2.0    27.0/48.6  24.9/34.6  31.9/44.1  17.1/19.9  31.5/39.9  44.8/69.6  30.3/41.4  23.9/33.2  22.6/29.8  41.0/59.0  24.4/35.6  28.5/40.9
GraphCNN    0.7/0.7    17.0/23.7  22.2/31.5  23.3/42.6  0.0/4.1    23.0/36.5  0.6/0.6    17.0/21.6  9.7/12.5   18.7/23.3  33.3/49.7  0.6/0.7    0.5/0.6    11.7/12.8  24.5/34.7  15.1/19.3  13.6/19.7
Harris3D    1.6/-      0.3/-      0.9/-      0.5/-      0.0/-      0.9/-      0.7/-      1.2/-      0.0/-      1.9/-      0.3/-      0.7/-      0.0/-      1.0/-      0.4/-      1.2/-      0.7/-
SIFT3D      1.2/-      1.0/-      0.5/-      1.1/-      0.4/-      1.0/-      0.7/-      0.7/-      0.4/-      0.6/-      0.2/-      0.8/-      0.5/-      0.7/-      0.3/-      1.0/-      0.7/-
ISS3D       1.1/-      1.4/-      0.9/-      1.8/-      0.6/-      1.7/-      0.7/-      0.7/-      0.8/-      0.4/-      0.1/-      0.8/-      0.9/-      0.8/-      0.3/-      1.1/-      0.9/-

After NMS, we obtain several ground-truth points y_{m,1}, y_{m,2}, ..., y_{m,k} for each manifold Mm. However, the arbitrarily assigned index k within each model does not provide a consistent semantic correspondence across different models. Therefore we cluster these points according to their embeddings by first projecting them onto a 2D subspace with t-SNE [21].

Ground-truth Verification    Though the above method automatically aggregates a set of potential keypoints with high precision, it omits some keypoints in some cases. As the last step, experts manually verify these keypoints based on some simple priors such as the rotational symmetry and centrosymmetry of an object.

4.4. Implementation Details

At the start of the alternating minimization, we initialize Y to be sampled from the raw annotations and then run one iteration, which is enough for convergence. We choose PointConv with hidden dimension 128 as the embedding function g. During the optimization of Equation 3, we classify each point into K classes with a SoftMax layer and extract the feature of the last-but-one layer as the embedding. The learning rate is 1e-3 and the optimizer is Adam [17].
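Read as code, Section 4.4 describes a short alternating loop. The sketch below follows the stated choices (Adam, learning rate 1e-3, a K-way SoftMax whose penultimate features serve as embeddings, a single outer iteration) but treats the backbone, the data loader and the NMS re-estimation step as caller-supplied stand-ins; it is an illustration of the procedure under these assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def aggregate_keypoints(backbone, dataloader, init_keypoints, reestimate_keypoints,
                        lam: float = 1.0, n_outer_iters: int = 1):
    """Alternating minimization for Section 4.4: with Y fixed, fit g_theta under the
    Eq. 2 objective; with theta fixed, re-estimate Y via fidelity error and NMS (Eqs. 3-5).
    `backbone(x)` is assumed to return (per-point K-class logits, per-point embeddings),
    and `reestimate_keypoints(backbone)` wraps the NMS step; both are caller-supplied."""
    optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)   # stated optimizer / lr
    keypoints = init_keypoints              # Y initialized by sampling the raw annotations
    for _ in range(n_outer_iters):          # one outer iteration is reported to suffice
        for clouds, labels in dataloader:   # labels: per-point keypoint class labels
            logits, _ = backbone(clouds)    # embeddings are the penultimate-layer features
            data_term = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                                        labels.reshape(-1))        # stand-in for H(g_theta)
            kp_emb = backbone(keypoints)[1]                        # (M, K, D)
            consistency = ((kp_emb[:, None] - kp_emb[None, :]) ** 2).sum(-1).mean()
            loss = data_term + lam * consistency                   # Eq. 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        keypoints = reestimate_keypoints(backbone)   # theta fixed: fidelity error + NMS
    return keypoints
```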
4.5. Pipeline

The whole pipeline is shown in Figure 4. We first infer dense embeddings from human-labeled raw annotations. Then fidelity error maps are calculated by summing embedding distances to the human-labeled keypoints. Non-minimum suppression is conducted to form a potential set of keypoints. These keypoints are then projected onto a 2D subspace with t-SNE and verified by humans.

5. Tasks and Benchmarks

In this section, we propose two keypoint prediction tasks: keypoint saliency estimation and keypoint correspondence estimation. Keypoint saliency estimation requires evaluated methods to give a set of potential indistinguishable keypoints, while keypoint correspondence estimation asks to localize a fixed number of distinguishable keypoints.

5.1. Keypoint Saliency Estimation

Dataset Preparation    For keypoint saliency estimation, we only consider whether a point is a keypoint or not, without giving its semantic label. Our dataset is split into train, validation and test sets with the ratio 70%, 10%, 20%.
Evaluation Metrics    Two metrics are adopted to evaluate the performance of keypoint saliency estimation. Firstly, we evaluate the mean Intersection over Union [34] (mIoU), which can be calculated as

    \mathrm{IoU} = \frac{TP}{TP + FP + FN}.    (6)

mIoU is calculated under different error tolerances from 0 to 0.1. Secondly, for those methods that output keypoint probabilities, we evaluate their mean Average Precision (mAP) over all categories.

Benchmark Algorithms    We benchmark eight state-of-the-art algorithms on point cloud semantic analysis: PointNet [27], PointNet++ [28], RSNet [13], SpiderCNN [41], PointConv [39], RSCNN [20], DGCNN [38] and GraphCNN [10]. Three traditional local geometric keypoint detectors are also considered: Harris3D [31], SIFT3D [29] and ISS3D [30].

Figure 6. mIoU results under various distance thresholds (0-0.1) for compared algorithms.

Figure 7. mAP results under various distance thresholds (0-0.1) for compared algorithms.

Evaluation Results    For deep learning methods, we use the default network architectures and hyperparameters to predict the keypoint probability of each point, and mIoU and mAP are adopted to evaluate their performance. For local geometry based methods, mIoU is used. Each method is tested with various geodesic error thresholds. In Table 3, we report mIoU and mAP results under a restrictive threshold of 0.01. Figure 6 shows the mIoU curves under distance thresholds from 0 to 0.1, and Figure 7 shows the mAP results. We can see that under a restrictive distance threshold of 0.01, geometric and deep learning methods both fail to predict qualified keypoints.

Figure 5 shows some visualizations of the results from RSNet, RSCNN, DGCNN, GraphCNN, ISS3D and Harris3D. Deep learning methods can predict some of the ground-truth keypoints, although predicted keypoints are sometimes missing. Local geometry based methods like ISS3D and Harris3D give many more interest points spread over the entire model, but these points are agnostic of semantic information. Learning discriminative features for accurately localizing distinct keypoints across various objects is still a challenging task.

5.2. Keypoint Correspondence Estimation

Keypoint correspondence estimation is a more challenging task, where one needs to predict not only the keypoints but also their semantic labels. The semantic labels should be consistent across different objects in the same category.

Dataset Preparation    For keypoint correspondence estimation, each keypoint is labeled with a semantic index. For those keypoints that do not exist on some objects, index -1 is given. Similar to SyncSpecCNN [42], the maximum number of keypoints is fixed, and the data split is the same as for keypoint saliency estimation.

Evaluation Metric    The prediction of the network is evaluated by the percentage of correct keypoints (PCK), which is used to evaluate the accuracy of keypoint prediction in many previous works [42, 33].
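Once a distance tolerance is fixed, both metrics reduce to simple counts. The sketch below computes the IoU of Equation 6 for an unordered set of predicted keypoints and PCK for semantically indexed predictions. It uses Euclidean distance, a greedy one-to-one matching, and NaN rows for absent keypoints — all assumptions for illustration, whereas the paper reports geodesic error thresholds and marks absent keypoints with index -1.

```python
import numpy as np

def keypoint_iou(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.01) -> float:
    """IoU = TP / (TP + FP + FN) (Eq. 6): a prediction is a true positive if it lies
    within `thresh` of a not-yet-matched ground-truth keypoint."""
    matched = np.zeros(len(gt), dtype=bool)
    tp = 0
    for p in pred:
        d = np.linalg.norm(gt - p, axis=1)
        d[matched] = np.inf
        if d.size and d.min() < thresh:
            matched[int(d.argmin())] = True
            tp += 1
    fp, fn = len(pred) - tp, len(gt) - tp
    return tp / (tp + fp + fn) if (tp + fp + fn) else 0.0

def pck(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.01) -> float:
    """Percentage of correct keypoints: row i of `pred` and `gt` share a semantic index;
    rows that are all NaN mark keypoints absent from the object (illustrative choice)."""
    valid = ~np.isnan(gt).any(axis=1)
    if not valid.any():
        return 0.0
    err = np.linalg.norm(pred[valid] - gt[valid], axis=1)
    return float((err < thresh).mean())
```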
Figure 8 (columns: Ground Truth, PointNet, RSCNN, PointConv, SpiderCNN, DGCNN, GraphCNN; rows: Bathtub, Helmet, Motorbike). Visualizations of detected keypoints and their semantic labels. Same colors indicate same semantic labels.

Table 4. PCK results under distance threshold 0.01 for various deep learning networks.

Method      Airplane  Bath  Bed   Bottle  Cap   Car   Chair  Guitar  Helmet  Knife  Laptop  Motor  Mug   Skate  Table  Vessel  Average
PointNet    65.4      57.2  55.4  78.9    16.7  66.6  43.2   65.4    42.6    32.6   76.1    62.1   60.7  50.4   63.8   42.5    55.0
PointNet++  64.1      45.4  44.2  10.2    18.8  55.1  38.0   55.4    23.6    43.8   64.6    49.9   36.4  42.3   57.1   37.9    42.9
RSNet       68.9      56.0  69.2  78.2    47.9  65.7  49.4   63.7    33.8    56.9   80.3    67.4   61.7  69.2   72.3   46.7    61.7
SpiderCNN   54.9      36.8  43.9  50.4    0.0   47.1  36.9   37.7    5.6     32.2   63.4    36.6   16.4  15.8   61.0   36.8    36.0
PointConv   70.3      54.4  58.6  64.6    25.0  65.7  51.1   62.6    16.2    47.4   74.2    61.7   51.2  51.8   69.5   46.7    54.4
RSCNN       66.8      47.9  52.4  59.4    18.8  56.5  45.0   60.0    21.3    51.4   66.9    52.8   29.5  51.7   64.8   41.2    49.2
DGCNN       66.8      51.5  56.3  8.9     39.6  58.6  46.0   62.4    36.1    52.1   73.9    57.7   51.4  52.3   70.5   46.8    51.9
GraphCNN    41.8      19.5  35.4  32.1    8.3   36.9  27.0   28.9    8.8     20.0   65.9    34.4   15.6  26.2   17.3   22.1    27.5

Benchmark Algorithms    We benchmark eight state-of-the-art algorithms: PointNet [27], PointNet++ [28], RSNet [13], SpiderCNN [41], PointConv [39], RSCNN [20], DGCNN [38] and GraphCNN [10].

Evaluation Results    Similarly, we use the default network architectures. Table 4 shows the PCK results under distance threshold 0.01. Figure 9 illustrates the percentage-of-correct-keypoints curves with distance thresholds varied from 0 to 0.1. RSNet performs relatively better than the other methods under all thresholds. However, all eight methods face difficulty in giving exactly consistent semantic keypoints under a threshold of 0.01.

Figure 9. PCK results under various distance thresholds (0-0.1) for compared algorithms.

Figure 8 shows some visualizations of the results for different methods. Same colors denote same semantic labels. We can see that most methods can accurately predict some of the keypoints. However, there are still some missing keypoints and inaccurate localizations.

Keypoint saliency estimation and keypoint correspondence estimation are both important for object understanding. Keypoint saliency estimation gives a sparse representation of an object by extracting meaningful points. Keypoint correspondence estimation establishes relations between points on different objects. From the results above, we can see that these two tasks still remain challenging. The reason is that object keypoints, from the human perspective, are not simply geometrically salient points but abstract the semantic meanings of the object.

6. Conclusion

In this paper, we propose a large-scale and high-quality KeypointNet dataset. In order to generate ground-truth keypoints from raw human annotations, where identification of their modes is non-trivial, we transform the problem into an optimization problem and solve it in an alternating fashion. By optimizing a fidelity loss, ground-truth keypoints, together with their correspondences, are generated. In addition, we evaluate and compare several state-of-the-art methods on our proposed dataset, and we hope this dataset can boost the semantic understanding of 3D objects.
Acknowledgements

This work is supported in part by the National Key R&D Program of China (No. 2017YFA0700800), National Natural Science Foundation of China under Grants 61772332, 51675342, 51975350, SHEITC (2018-RGZN-02046) and Shanghai Science and Technology Committee.

References

[1] MSCOCO keypoint challenge 2016, 2016.
[2] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.
[3] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[4] Federica Bogo, Javier Romero, Matthew Loper, and Michael J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Piscataway, NJ, USA, June 2014. IEEE.
[5] Lubomir Bourdev and Jitendra Malik. Poselets: Body part detectors trained using 3D human pose annotations. In 2009 IEEE 12th International Conference on Computer Vision, pages 1365–1372. IEEE, 2009.
[6] M. Bueno, J. Martínez-Sánchez, H. González-Jorge, and H. Lorenzo. Detection of geometric keypoints and its application to point cloud coarse registration. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 41, 2016.
[7] Umberto Castellani, Marco Cristani, Simone Fantoni, and Vittorio Murino. Sparse points matching by combining 3D mesh saliency with statistical descriptors. In Computer Graphics Forum, volume 27, pages 643–652. Wiley Online Library, 2008.
[8] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[9] Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.
[10] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[11] Helin Dutagaci, Chun Pan Cheung, and Afzal Godil. Evaluation of 3D interest point detection techniques via human-generated ground truth. The Visual Computer, 28(9):901–917, 2012.
[12] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. GraspNet: A large-scale clustered and densely annotated dataset for object grasping. arXiv preprint arXiv:1912.13470, 2019.
[13] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3D segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.
[14] Marc Khoury, Qian-Yi Zhou, and Vladlen Koltun. Learning compact geometric features. In Proceedings of the IEEE International Conference on Computer Vision, pages 153–161, 2017.
[15] Vladimir G. Kim, Wilmot Li, Niloy J. Mitra, Siddhartha Chaudhuri, Stephen DiVerdi, and Thomas Funkhouser. Learning part-based templates from large collections of 3D shapes. ACM Transactions on Graphics (TOG), 32(4):70, 2013.
[16] Vladimir G. Kim, Wilmot Li, Niloy J. Mitra, Stephen DiVerdi, and Thomas Funkhouser. Exploring collections of 3D models using fuzzy correspondences. ACM Transactions on Graphics (TOG), 31(4):1–11, 2012.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Chang Ha Lee, Amitabh Varshney, and David W. Jacobs. Mesh saliency. ACM Transactions on Graphics (TOG), 24(3):659–666, 2005.
[19] Yong-Lu Li, Liang Xu, Xijie Huang, Xinpeng Liu, Ze Ma, Mingyang Chen, Shiyi Wang, Hao-Shu Fang, and Cewu Lu. HAKE: Human activity knowledge engine. arXiv preprint arXiv:1904.06539, 2019.
[20] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019.
[21] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[22] Ajmal S. Mian, Mohammed Bennamoun, and Robyn Owens. Three-dimensional model-based object recognition and segmentation in cluttered scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1584–1601, 2006.
[23] Juhong Min, Jongmin Lee, Jean Ponce, and Minsu Cho. SPair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019.
[24] John Novatnack and Ko Nishino. Scale-dependent 3D geometric features. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
[25] Bo Pang, Kaiwen Zha, Yifan Zhang, and Cewu Lu. Further understanding videos through adverbs: A new video task. In AAAI, 2020.
[26] Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. 6-DoF object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2011–2018. IEEE, 2017.
[27] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[29] Blaine Rister, Mark A. Horowitz, and Daniel L. Rubin. Volumetric image registration from invariant keypoints. IEEE Transactions on Image Processing, 26(10):4900–4910, 2017.
[30] Samuele Salti, Federico Tombari, Riccardo Spezialetti, and Luigi Di Stefano. Learning a descriptor-specific 3D keypoint detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 2318–2326, 2015.
[31] Ivan Sipiran and Benjamin Bustos. Harris 3D: a robust extension of the Harris operator for interest point detection on 3D meshes. The Visual Computer, 27(11):963, 2011.
[32] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Computer Graphics Forum, volume 28, pages 1383–1392. Wiley Online Library, 2009.
[33] Minhyuk Sung, Hao Su, Ronald Yu, and Leonidas J. Guibas. Deep functional dictionaries: Learning consistent semantic structures on 3D models from functions. In Advances in Neural Information Processing Systems, pages 485–495, 2018.
[34] Leizer Teran and Philippos Mordohai. 3D interest point detection via discriminative learning. In European Conference on Computer Vision, pages 159–173. Springer, 2014.
[35] Federico Tombari, Samuele Salti, and Luigi Di Stefano. Unique signatures of histograms for local surface description. In European Conference on Computer Vision, pages 356–369. Springer, 2010.
[36] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[37] Hanyu Wang, Jianwei Guo, Dong-Ming Yan, Weize Quan, and Xiaopeng Zhang. Learning 3D keypoint descriptors for non-rigid shape matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[38] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019.
[39] Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.
[40] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
[41] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. SpiderCNN: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.
[42] Li Yi, Hao Su, Xingwen Guo, and Leonidas J. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2282–2290, 2017.
[43] Yang You, Yujing Lou, Qi Liu, Yu-Wing Tai, Weiming Wang, Lizhuang Ma, and Cewu Lu. PRIN: Pointwise rotation-invariant network. arXiv preprint arXiv:1811.09361, 2018.