Learning from Crowds with Sparse and Imbalanced Annotations
Ye Shi¹, Shao-Yuan Li*¹,², Sheng-Jun Huang¹
¹ Ministry of Industry and Information Technology Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
² State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
{shiye1998, lisy, huangsj}@nuaa.edu.cn

* This research was supported by National Natural Science Foundation of China (61906089), Jiangsu Province Basic Research Program (BK20190408), and China Postdoc Science Foundation (2019TQ0152). Shao-Yuan Li is the corresponding author.

arXiv:2107.05039v1 [cs.LG] 11 Jul 2021

Abstract

Traditional supervised learning requires ground truth labels for the training data, whose collection can be difficult in many cases. Recently, crowdsourcing has established itself as an efficient labeling solution by resorting to non-expert crowds. To reduce the labeling error effects, one common practice is to distribute each instance to multiple workers, whereas each worker only annotates a subset of the data, resulting in the sparse annotation phenomenon. In this paper, we note that when meeting with class-imbalance, i.e., when the ground truth labels are class-imbalanced, the sparse annotations are prone to be skewed, which can severely bias the learning algorithm. To combat this issue, we propose one self-training based approach named Self-Crowd, which progressively adds confident pseudo-annotations and rebalances the annotation distribution. Specifically, we propose one distribution aware confidence measure to select confident pseudo-annotations, which adopts the resampling strategy to oversample the minority annotations and undersample the majority annotations. On one real-world crowdsourcing image classification task, we show that the proposed method yields more balanced annotations throughout training than distribution agnostic methods and substantially improves the learning performance at different annotation sparsity levels.

Figure 1: Class distributions of ground truth labels, observed annotations and two intermediate steps (iteration 5 and 10) of confidence based self-training. In each iteration, 10,000 pseudo-annotations are added into the training data.

1 Introduction

The achievements of deep learning rely on large amounts of ground truth labels, but collecting them is difficult in many cases. To alleviate this problem, crowdsourcing provides a time- and cost-efficient solution by collecting non-expert annotations from crowd workers. Learning from training data with crowdsourcing annotations is called learning from crowds or crowdsourcing learning, and has attracted much attention in recent years [Thierry et al., 2010].

As the crowds can make mistakes, one core task in crowdsourcing learning is to deal with the annotation noise, for which purpose many approaches have been proposed [Philip and M, 1979; Raykar et al., 2010a; Zhou et al., 2012; Filipe and Francisco, 2018]. In this paper, we move one step further by noticing the sparsity and class-imbalance phenomenon of crowdsourcing annotations in real-world applications.

In crowdsourcing, annotation sparsity is common. For example, to reduce the labeling error effects, repetitive labeling is employed to introduce labeling redundancy, i.e., each instance is distributed to more than one worker. At the same time, to collect annotations efficiently, a rather large number of workers are employed, whereas each worker only annotates a subset of the data. This results in sparse annotations.

We note that when meeting with class-imbalance, i.e., when the ground truth labels of the concerned task are class-imbalanced, the sparsity can lead to an inevitably skewed annotation distribution, which may severely bias the learning algorithm. Here we show one real-world example.
LabelMe [Russell et al., 2008] is an image crowdsourcing dataset, consisting of 1,000 training images with annotations collected from 59 workers through the Amazon Mechanical Turk (AMT) platform. On average, each image is annotated by 2.547 workers, and each worker is assigned 43.169 images. This sparsity on one hand makes estimating each worker's expertise quite challenging. On the other hand, Figure 1 shows the effects of sparsity when encountering class-imbalance. Besides the ground truth labels and the observed crowd annotations, we also show the results of two intermediate steps of normal confidence based self-training, i.e., the most confident pseudo-annotations are iteratively added into the training data for updating the model. For the ground truth labels and the observed annotations, the standard deviations of the class proportions are respectively 1.85% and 2.95%, i.e., the observed annotations are more skewed than the ground truth labels. Moreover, the skewness has biased the self-training algorithm to prefer majority class annotations and ignore the minority classes, which in turn leads to more severely skewed annotations and learning bias. We will show in the experiment section that this bias significantly hurts the learning performance. Nevertheless, this issue has rarely been paid attention to in crowdsourcing learning.

In this paper, we propose one distribution aware self-training based approach to combat this issue. At a high level, we iteratively add confident pseudo-annotations and rebalance the annotation distribution. Within each iteration, we efficiently train a deep learning model using the available annotations, and then use it as a teacher model to generate pseudo-annotations. To alleviate the imbalance issue, we propose to select the most confident pseudo-annotations using resampling strategies, i.e., we undersample the majority classes and oversample the minority classes. Then the learning model is retrained on the combination of observed and pseudo-annotations. We name our approach Self-Crowd, and empirically show its positive effect at different sparsity levels on the LabelMe dataset.

2 The Self-Crowd Framework

With X ⊂ R^d denoting the feature space and Y = {1, 2, ..., C} the label space, we use x ∈ X to denote an instance, and y ∈ Y and y^r ∈ Y to denote its ground truth label and the annotation given by worker r respectively. Let D = {(x_i, ȳ_i)}_{i=1}^N denote the training data with N instances, where ȳ_i = {y_i^r}_{r=1}^R collects the crowdsourcing annotations from the R workers.

In the following, we first introduce the deep crowdlayer model proposed by [Filipe and Francisco, 2018] as our base learning model, then propose the distribution aware confidence measure to deal with annotation sparsity and class-imbalance, and finally summarize the algorithm procedure.

2.1 Deep Crowdlayer Base Learning Model

With the ubiquitous success of deep neural networks (DNNs), deep crowdsourcing learning has been studied by combining the strength of DNNs with crowdsourcing. As one of the pioneers in this direction, [Shadi et al., 2016] extended the classic DS model [Philip and M, 1979] by using a convolutional neural network (CNN) classifier as the latent true label prior, and conducted optimization using the expectation-maximization (EM) algorithm. To avoid the computational overhead of the iterative EM procedure, [Filipe and Francisco, 2018] introduced the crowdlayer model and conducted efficient end-to-end SGD optimization.

In detail, using f(x; θ) ∈ [0, 1]^{1×C} to denote the softmax output of the deep neural network classifier f(·) with parameter θ for some instance x, the crowdlayer model introduces R parameters {W^r ∈ R^{C×C}}_{r=1,...,R} to capture the annotating process of the crowds, i.e., the annotation of x given by worker r is derived as:

    p̂(y^r) = softmax(f(x; θ) · W^r).    (1)

While W^r is real valued without any structural constraints, it can be used to represent the worker's annotating expertise, i.e., W^r(i, j) can denote the process that instances belonging to class i are annotated with class label j by worker r. Larger diagonal values mean better worker expertise. Given a specific loss function ℓ, e.g., the cross entropy loss used in this paper, the loss over the crowdsourcing training data D is defined as:

    L := Σ_{i=1}^{N} Σ_{r=1}^{R} I[y_i^r ≠ 0] · ℓ(p̂(y_i^r), y_i^r).    (2)

Here I is the indicator function. Then, regarding W^r as one crowdlayer after the neural network classifier f(·), [Filipe and Francisco, 2018] proposed to simultaneously optimize the classifier parameter θ and W^r in an end-to-end manner by minimizing the loss defined in Eq. 2.

Figure 2: The network architecture of the deep crowdlayer model: instances pass through the classifier, whose softmax output is fed into the crowdlayer to produce the crowd outputs.

The network architecture of the crowdlayer is shown in Figure 2. This architecture and the end-to-end optimization of the loss in Eq. 2 have been the cornerstone of various deep crowdsourcing learning approaches [Tanno et al., 2019; Chu et al., 2021; Zhijun et al., 2020], which mainly differ in the specific structural regularization imposed on the expertise parameters W^r with different motivations. In this paper, we adopt the straightforward crowdlayer as our base learning model for simplicity, and focus on using the self-training idea to deal with the sparse and class-imbalanced annotations in crowdsourcing.
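To make the base model concrete, the following PyTorch-style sketch implements the per-worker transformation of Eq. 1 and the masked loss of Eq. 2. It is a minimal illustration under our own naming and encoding assumptions (missing annotations encoded as 0 and observed labels as 1..C, matching the indicator in Eq. 2), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrowdLayer(nn.Module):
    """Crowdlayer of Eq. 1: one C x C matrix W^r per worker on top of the classifier output."""

    def __init__(self, num_classes: int, num_workers: int):
        super().__init__()
        # Identity initialization encodes the prior that workers mostly annotate correctly.
        self.W = nn.Parameter(torch.eye(num_classes).repeat(num_workers, 1, 1))

    def forward(self, p_true: torch.Tensor) -> torch.Tensor:
        # p_true: (batch, C) softmax output f(x; theta); returns p_hat(y^r) of shape (batch, R, C).
        logits = torch.einsum('bc,rcd->brd', p_true, self.W)
        return F.softmax(logits, dim=-1)

def crowd_loss(p_workers: torch.Tensor, annotations: torch.Tensor) -> torch.Tensor:
    """Masked cross entropy of Eq. 2. annotations: (batch, R), 0 = missing, 1..C = class label."""
    mask = (annotations != 0).float()
    target = (annotations - 1).clamp(min=0)                         # shift to 0-based class indices
    nll = -torch.log(p_workers.gather(-1, target.unsqueeze(-1)).squeeze(-1) + 1e-12)
    return (nll * mask).sum() / mask.sum().clamp(min=1.0)
```

In each training step, the classifier and the crowdlayer are updated jointly by backpropagating crowd_loss, which corresponds to the end-to-end optimization described above.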
2.2 Distribution Aware Confidence Measure

To combat the annotation sparsity and class-imbalance, we use the crowdlayer model as the base model, and propose one distribution aware confidence measure to conduct self-training. During training, we progressively predict pseudo-annotations on the unannotated instances of each worker, add some of them into the training data, and then update the learning model. The most confident pseudo-annotations which contribute to rebalancing the annotation distribution are selected. Next, we explain the measure in detail.

Confidence. Confidence is a commonly used measure in self-training, which measures how confident the prediction of the current model is for some instance. Using p̂(y^r) defined in Eq. 1 to denote the pseudo-annotation probability of worker r on some unannotated instance x, we propose to use entropy to measure its confidence:

    entropy(y^r) = − Σ_{c=1}^{C} p̂_c(y^r) · log p̂_c(y^r),    (3)

where p̂_c(y^r) is the c-th entry of p̂(y^r). The pseudo-annotations with lower entropy values are considered to be more confident and more likely to be correct. Following the traditional self-training motivation, the pseudo-annotations with the least entropy values should be selected as authentic ones. However, as discussed in the introduction, without taking the class-imbalance issue into account, the learning algorithm would be biased towards selecting majority class annotations and ignoring the minority annotations. More seriously, this bias can accumulate throughout the training process, which inevitably damages the performance. In the following, we propose our distribution aware confidence measure.

Distribution Aware Confidence. Resampling is a common strategy for addressing the class-imbalance problem. It intuitively oversamples the minority classes or undersamples the majority classes to avoid the dominant effect of the majority data. In this paper, we adopt the resampling strategy within each class, i.e., the M_c most confident pseudo-annotations of each class c ∈ {1, ..., C} are selected:

    M_c = t_c · M,    Σ_{c=1}^{C} t_c = 1.    (4)

Here M denotes the total number of selected pseudo-annotations within each iteration, which is a hyperparameter set by the user. t_c denotes the normalized fraction coefficient of class c, which is inversely proportional to the number N′_c of pseudo-annotations of class c among all the generated pseudo-annotations:

    t_c ∝ 1 / N′_c.    (5)

Algorithm 1 The Self-Crowd Framework
1: Input:
2:   D = {(x_i, ȳ_i)}_{i=1}^N: crowdsourcing training data
3:   ℓ: loss function
4: Output: classifier f
5: Initialization:
6:   train the crowdlayer model using the loss in Eq. 2 on D
7:   obtain the pseudo-annotation predictions of each worker on its unannotated instances using Eq. 1
8: Repeat:
9:   for each pseudo-annotation, calculate its confidence score according to Eq. 3
10:  for each class c, calculate the corresponding selection number M_c according to Eq. 4
11:  select the M_c most confident pseudo-annotations within each class
12:  add the selected pseudo-annotations into the training data and retrain the crowdlayer model
13: Until expected performance reached

Algorithm 1 summarizes the main steps of the Self-Crowd approach. We iteratively predict the unobserved annotations and add the most confident ones into the training data. The pseudo-annotations with lower entropy values that rebalance the annotation distribution are selected according to Eq. 4-5. Then the learning model is retrained on the combination of observed and pseudo-annotations. This process repeats until the expected performance is reached.
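A minimal sketch of this selection step (Eq. 3-5), assuming the pseudo-annotation probabilities of all unannotated (instance, worker) pairs are stacked into one array; the function and variable names are ours.

```python
import numpy as np

def select_pseudo_annotations(p_workers: np.ndarray, M: int) -> np.ndarray:
    """Select about M pseudo-annotations per iteration, rebalanced across classes (Eq. 3-5).

    p_workers: (num_candidates, C) array holding p_hat(y^r) for every unannotated
               (instance, worker) pair. Returns the indices of the selected candidates.
    """
    eps = 1e-12
    entropy = -(p_workers * np.log(p_workers + eps)).sum(axis=1)      # Eq. 3
    pseudo_label = p_workers.argmax(axis=1)                           # predicted class per candidate

    C = p_workers.shape[1]
    counts = np.bincount(pseudo_label, minlength=C).astype(float)     # N'_c
    t = 1.0 / np.maximum(counts, 1.0)                                 # Eq. 5: t_c proportional to 1 / N'_c
    t /= t.sum()                                                      # normalize so that the t_c sum to 1
    M_c = np.round(t * M).astype(int)                                 # Eq. 4 (rounding may shift the total slightly)

    selected = []
    for c in range(C):
        candidates = np.where(pseudo_label == c)[0]
        order = candidates[np.argsort(entropy[candidates])]           # lowest entropy = most confident
        selected.extend(order[:M_c[c]].tolist())
    return np.asarray(selected, dtype=int)
```

Minority classes thus receive a larger share t_c of the selection budget, which is what counteracts the drift toward majority classes during self-training.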
c=1 Network and Optimization For a fair comparison, we im- The pseudo-annotations with lower entropy values are con- plement the methods following the setting in [Filipe and Fran- sidered to be more confident and more likely to be cor- cisco, 2018]. Specifically, we use the pretrained CNN layers rect. Based on traditional self-training motivation, pseudo- of the VGG-16 deep neural network [Simonyan and Zisser- annotations with the least entropy values should be selected man, 2015] with one fully connected layer with 128 units and as authentic ones. However, as we discussed in the introduc- ReLU activations and one output layer on top. Besides, 50% tion, without taking the class-imbalance issue into account, random dropout is exploited. The training is conducted by the learning algorithm would be biased towards selecting ma- using Adam optimizer[Kingma and Ba, 2015] for 25 epochs jority class annotations and ignore the minority annotations. with batch size 512 and a learning rate of 0.001. L2 weight More seriously, this bias can accumulate throughout the train- decay regularization is used on all layers with λ = 0.0005. ing process, which will inevitably damage the performance. Baselines To assess the performance of the proposed ap- In the following, we propose our distribution aware confi- proach, we conduct comparisons for the following implemen- dence measure. tations: Distribution Aware Confidence Resampling is a common Self-Crowdr : which randomly selects the pseudo-annotations strategy for addressing the class-imbalance problem. It intu- and is used for training. itively oversamples the majority classes or undersamples the Self-Crowdc : which selects the most confident pseudo- minority classes to avoid the dominant effect of majority data. annotations with the least entropy values according to Eq. 3 In this paper, we adopt the resampling strategy within each and used for training without considering class-imbalance.
Baselines. To assess the performance of the proposed approach, we conduct comparisons among the following implementations:
Self-Crowd_r: randomly selects the pseudo-annotations used for training.
Self-Crowd_c: selects the most confident pseudo-annotations with the least entropy values according to Eq. 3, without considering class-imbalance.
Self-Crowd: selects the pseudo-annotations taking the class-imbalance issue into account according to Eq. 3-5.
We examine the classification accuracy on the test images. To avoid the influence of randomness, we repeat the experiments 20 times and report the average results.

3.2 Results

Figure 3: Comparison of Self-Crowd, Self-Crowd_r and Self-Crowd_c from three different perspectives: (a) test accuracy, (b) R value of the pseudo-annotations, (c) R value of the combined annotations.

Figure 3 (a) shows the test accuracy of the compared methods on the LabelMe dataset as the self-training process iterates. Here results for 14 iterations are recorded, and in each iteration 10,000 pseudo-annotations are selected without replacement. As we can see, the test accuracy of Self-Crowd_c and Self-Crowd_r decreases rapidly as the self-training proceeds. In contrast, Self-Crowd stably improves.

To examine what happens during the learning procedure, we define one class-imbalance ratio R as follows:

    R = (N_max − N_min) / N_anno.    (6)

Here N_max and N_min respectively denote the number of generated annotations for the most frequent and the least frequent class, and N_anno denotes the total number of generated annotations over all classes. R ranges in [0, 1], with smaller values meaning more balanced annotations.
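The ratio in Eq. 6 is computed directly from the per-class counts of the generated annotations; a small sketch (function name ours):

```python
import numpy as np

def imbalance_ratio(annotation_labels: np.ndarray, num_classes: int = 8) -> float:
    """Eq. 6: R = (N_max - N_min) / N_anno over the generated annotation labels (0-based)."""
    counts = np.bincount(annotation_labels, minlength=num_classes)
    return float(counts.max() - counts.min()) / max(len(annotation_labels), 1)
```

A perfectly balanced annotation set gives R = 0, while a set concentrated on a single class approaches R = 1.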
We record the variation of R for the pseudo-annotations selected by the three methods during self-training in Figure 3 (b). The R value of Self-Crowd_c increases rapidly, indicating that the confidence based measure mostly selects majority class pseudo-annotations, leading to a severely imbalanced annotation distribution, which in turn badly hurts the learning performance as shown in Figure 3 (a). The random selection strategy is much better than the confidence based measure but is still biased by the imbalance issue. The proposed distribution aware strategy is more robust and achieves improved performance.

Combining the original observed annotations and the selected pseudo-annotations, Figure 3 (c) shows the variation of the R value of the combined annotations. The solid and dashed black lines respectively represent the R value of the original observed annotations and of the ground truth labels. It can be seen that our proposed method greatly alleviates the class-imbalance issue during learning, whereas the random and confidence based selection measures always lead to more imbalanced annotations. This explains the performance decline of Self-Crowd_c and Self-Crowd_r.

3.3 Various Sparsity Level Study

To examine the effectiveness of our approach at different sparsity levels, we remove fractions of the original observed annotations of LabelMe and conduct experiments. Specifically, we remove a fraction p of the observed annotations, with p ranging from 0% to 90%, in a uniformly random manner. To alleviate the effect of randomness, we repeat each experiment 5 times and report the average results. For the self-training process, 5 iterations are conducted, with 10,000 pseudo-annotations selected within each iteration. Figure 4 shows the results.

Figure 4: Test accuracy with different sparsity levels (observed annotation fraction from 100% down to 10%).
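The sparsification protocol above, removing a fraction p of the observed annotations uniformly at random, can be simulated as follows; this is a sketch under our encoding assumption (0 = missing, as in Eq. 2) with our own function name.

```python
import numpy as np

def sparsify_annotations(annotations: np.ndarray, p: float, seed: int = 0) -> np.ndarray:
    """Randomly drop a fraction p of the observed crowd annotations.

    annotations: (N, R) array with 0 = missing and 1..C = observed class label.
    """
    rng = np.random.default_rng(seed)
    sparse = annotations.copy()
    observed = np.argwhere(sparse != 0)                              # (instance, worker) index pairs
    drop = rng.choice(len(observed), size=int(p * len(observed)), replace=False)
    rows, cols = observed[drop].T
    sparse[rows, cols] = 0                                           # mark the removed annotations as missing
    return sparse
```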
The yellow line represents the test accuracy when only the observed annotations are used for training without self-training, i.e., self-training iteration t = 0. It can be seen that our approach always achieves the best and most stable performance. However, the confidence based approach Self-Crowd_c decreases rapidly as the observed annotations decrease, and Self-Crowd_r performs stably but worse than the crowdlayer baseline.

4 Conclusion

In this paper, we propose a self-training based method, Self-Crowd, to deal with the sparsity and class-imbalance issues in crowdsourcing learning. To combat the selection bias towards majority class annotations, we propose a distribution aware confidence measure to select the most confident pseudo-annotations and rebalance the annotation distribution. Experiments on a real-world crowdsourcing dataset show the effectiveness of our approach. As a preliminary attempt at sparse and imbalanced crowdsourcing learning, the proposed method can be extended by combining it with more sophisticated deep crowdsourcing learning models and selection measures.

References

[Buda et al., 2018] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249-259, 2018.
[Chu et al., 2021] Zhendong Chu, Jing Ma, and Hongning Wang. Learning from crowds by modeling common confusions. In AAAI, pages 5832-5840, 2021.
[Devansh et al., 2017] Arpit Devansh, Jastrzębski Stanisław, Ballas Nicolas, Krueger David, Bengio Emmanuel, Kanwal Maxinder S., Maharaj Tegan, Fischer Asja, Courville Aaron, Bengio Yoshua, et al. A closer look at memorization in deep networks. In ICML, pages 233-242, 2017.
[Filipe and Francisco, 2018] Rodrigues Filipe and Pereira C. Francisco. Deep learning from crowds. In AAAI, 2018.
[Guan et al., 2018] Melody Y. Guan, Varun Gulshan, Andrew M. Dai, and Geoffrey E. Hinton. Who said what: Modeling individual labelers improves classification. In AAAI, pages 3109-3118, 2018.
[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[Philip and M, 1979] Dawid Alexander Philip and Skene Allan M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28:20-28, 1979.
[Raykar et al., 2010a] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297-1322, 2010.
[Raykar et al., 2010b] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. J. Mach. Learn. Res., 11:1297-1322, 2010.
[Russell et al., 2008] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis., 77:157-173, 2008.
[Samira et al., 2018] Pouyanfar Samira, Tao Yudong, Mohan Anup, Tian Haiman, Kaseb Ahmed S., Gauen Kent, Dailey Ryan, Aghajanzadeh Sarah, Lu Yung-Hsiang, and Chen Shu-Ching. Dynamic sampling in convolutional neural networks for imbalanced data classification. In MIPR, pages 112-117, 2018.
[Shadi et al., 2016] Albarqouni Shadi, Baur Christoph, Achilles Felix, Belagiannis Vasileios, Demirci Stefanie, and Navab Nassir. AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans. Medical Imaging, pages 1313-1321, 2016.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[Tanno et al., 2019] Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C. Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. In CVPR, pages 11244-11253, 2019.
[Thierry et al., 2010] Buecheler Thierry, Sieg Jan Henrik, Füchslin Rudolf Marcel, and Pfeifer Rolf. Crowdsourcing, open innovation and collective intelligence in the scientific method: a research agenda and operational framework. In ALIFE, pages 679-686, 2010.
[Venanzi et al., 2014] Matteo Venanzi, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi. Community-based Bayesian aggregation models for crowdsourcing. In WWW, pages 155-164, 2014.
[Zhijun et al., 2020] Chen Zhijun, Wang Huimin, Sun Hailong, Chen Pengpeng, Han Tao, Liu Xudong, and Yang Jie. Structured probabilistic end-to-end learning from crowds. In IJCAI, pages 1512-1518, 2020.
[Zhou et al., 2012] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2195-2203, 2012.