ON THE POWER OF DEEP BUT NAIVE PARTIAL LABEL LEARNING
Junghoon Seo¹†   Joon Suk Huh²†
¹ SI Analytics Co. Ltd, South Korea   ² UW–Madison, USA
jhseo@si-analytics.ai   jhuh23@wisc.edu

arXiv:2010.11600v2 [cs.LG] 8 Feb 2021

ABSTRACT

Partial label learning (PLL) is a class of weakly supervised learning where each training instance consists of a data point and a set of candidate labels containing a unique ground truth label. To tackle this problem, a majority of current state-of-the-art methods employ either label disambiguation or averaging strategies. So far, PLL methods without such techniques have been considered impractical. In this paper, we challenge this view by revealing the hidden power of the oldest and naivest PLL method when it is instantiated with deep neural networks. Specifically, we show that, with deep neural networks, the naive model can achieve competitive performance against the other state-of-the-art methods, suggesting it as a strong baseline for PLL. We also address the question of how and why such a naive model works well with deep neural networks. Our empirical results indicate that deep neural networks trained on partially labeled examples generalize very well even in the over-parametrized regime and without label disambiguation or regularization. We point out that existing learning theories on PLL are vacuous in the over-parametrized regime, hence they cannot explain why the deep naive method works. We propose an alternative theory on how deep learning generalizes in PLL problems.

Index Terms— classification, partial label learning, weakly supervised learning, deep neural network, empirical risk minimization

1. INTRODUCTION

State-of-the-art performance on the standard classification task has been improving at one of the fastest rates in the field of machine learning. In the standard classification setting, a learner requires an unambiguously labeled dataset. However, it is often hard or even impossible to obtain completely labeled datasets in the real world. Many pieces of research have formulated problem settings under which classifiers are trainable with incompletely labeled datasets. These settings are often denoted as weakly supervised. Learning from similar vs. dissimilar pairs [1], learning from positive vs. unlabeled data [2, 3], and multiple instance learning [4, 5] are some examples of weakly supervised learning.

In this paper, we focus on partial label learning [6] (PLL), which is one of the most classic examples of weakly supervised learning. In the PLL problem, classifiers are trained with a set of candidate labels, among which only one label is the ground truth. Web mining [7], ecoinformatics [8], and automatic image annotation [9] are notable examples of real-world instantiations of the PLL problem.

The majority of state-of-the-art parametric methods for PLL involve two types of parameters. One is associated with the label confidence, and the other is the model parameters. These methods iteratively and alternately update the two types of parameters. This type of method is denoted as identification-based. On the other hand, average-based methods [10, 11] treat all the candidate labels equally, assuming they contribute equally to the trained classifier. Average-based methods do not require any label disambiguation process, so they are much simpler than identification-based methods. However, numerous works [6, 12, 13, 14, 15] pointed out that label disambiguation processes are essential to achieving high performance in PLL problems; hence, attempts to build a high-performance PLL model through the average-based scheme have been avoided.

Contrary to this common belief, we show that one of the naivest and oldest average-based methods can train accurate classifiers in real PLL problems. Specifically, our main contributions are two-fold:

1. We generalize the classic naive model of [6] to the modern deep learning setting. Specifically, we present a naive surrogate loss for deep PLL. We test our deep naive model's performance and show that it outperforms the existing state-of-the-art methods despite its simplicity.¹

2. We empirically analyze the unreasonable effectiveness of the naive loss with deep neural networks. Our experiments show closing generalization gaps in the over-parametrized regime, where bounds from existing learning theories are vacuous. We propose an alternative explanation of the working of deep PLL based on observations of Valle-Perez et al. [16].

† Both authors contributed equally to this work.
¹ All code for the experiments in this paper is public at https://github.com/mikigom/DNPL-PyTorch.
2. DEEP NAIVE MODEL FOR PLL

2.1. Problem Formulation

We denote x ∈ X as a data point, y ∈ Y = {1, ..., K} as a label, and a set S ∈ S = 2^Y \ {∅} such that y ∈ S as a partial label. A partial label data distribution is defined by a joint data-label distribution p(x, y) and a partial label generating process p(S|x, y), where p(S|x, y) = 0 if y ∉ S. A learner's task is to output a model θ with small Err(θ) = E_{(x,y)∼p(x,y)} I(h_θ(x) ≠ y), given a finite number of partially labeled samples {(x_i, S_i)}_{i=1}^n, where each (x_i, S_i) is independently sampled from p(x, S).

2.2. Deep Naive Loss for PLL

The work of Jin and Ghahramani [6], which is the first pioneering work on PLL, proposed a simple baseline method for PLL denoted as the 'naive model'. It is defined as follows:

    θ̂ = arg max_{θ∈Θ} Σ_{i=1}^{n} log( (1/|S_i|) Σ_{y∈S_i} p(y|x_i; θ) ).    (1)

We denote the naive loss as the negative of the objective above. In [6], the authors proposed the disambiguation strategy as a better alternative to the naive model. Moreover, many works on PLL [12, 13, 14, 15] considered this naive model to be low-performing, and it is still commonly believed that label disambiguation processes are crucial to achieving high performance.

In this work, we propose the following differentiable loss to instantiate the naive loss with deep neural networks:

    ℓ̂_n(θ) = −(1/n) Σ_{i=1}^{n} log ⟨S_{θ,i}, s_i⟩,    (2)
    S_{θ,i} = SOFTMAX(f_θ(x_i)),    (3)

where f_θ(x_i) ∈ R^K is the output of the neural network and s_i ∈ {0,1}^K is the indicator vector of the candidate set S_i. The softmax layer is used to make the outputs of the neural network lie in the probability simplex. One can see that the above loss is almost identical to the naive loss in (1) up to constant factors; hence we denote (2) as the deep naive loss, while a model trained from it is denoted as a deep naive model.

The above loss can be identified as a surrogate of the partial label risk defined as follows:

    R_p(θ) = E_{(x,S)∼p(x,S)} I(h_θ(x) ∉ S),    (4)

where I(·) is the indicator function. We denote R̂_{p,n}(θ) as an empirical estimator of R_p(θ) over n samples. When h_θ(x) = arg max_i f_{θ,i}(x), one can easily see that the deep naive loss (2) is a surrogate of the partial-label risk (4).
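For concreteness, the snippet below is a minimal PyTorch sketch of the deep naive loss of Eqs. (2)-(3), assuming candidate sets are encoded as binary indicator vectors; the function and variable names are illustrative choices of this transcription and are not necessarily those of the released DNPL-PyTorch code.

```python
# Minimal sketch of the deep naive loss of Eqs. (2)-(3); assumes `logits` holds the
# raw network outputs f_theta(x) with shape (n, K) and `candidate_mask` is a float
# tensor of shape (n, K) with ones at the candidate labels of each example.
import torch

def deep_naive_loss(logits: torch.Tensor, candidate_mask: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits, dim=1)                  # S_theta,i of Eq. (3)
    candidate_prob = (probs * candidate_mask).sum(dim=1)  # <S_theta,i, s_i> of Eq. (2)
    return -torch.log(candidate_prob + 1e-12).mean()      # small constant for numerical safety

# Toy usage: 4 examples, 5 classes, each candidate set containing the hidden true label.
logits = torch.randn(4, 5, requires_grad=True)
mask = torch.tensor([[1., 1., 0., 0., 0.],
                     [0., 1., 0., 1., 0.],
                     [1., 0., 0., 0., 1.],
                     [0., 0., 1., 1., 1.]])
deep_naive_loss(logits, mask).backward()
```

Note that the per-example factor 1/|S_i| in Eq. (1) only shifts the objective by a constant inside the logarithm, which is why it can be dropped in Eq. (2).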
2.3. Existing Theories of Generalization in PLL

In this subsection, we review two existing learning theories and their implications which may explain the effectiveness of deep naive models.

2.3.1. EPRM Learnability

Under a mild assumption on data distributions, Liu and Dietterich [17] proved that minimizing an empirical partial label risk gives a correct classifier. Formally, they proved a finite sample complexity bound for the empirical partial risk minimizer (EPRM):

    θ̂_n = arg min_{θ∈Θ} R̂_{p,n}(θ),    (5)

under a mild distributional assumption called the small ambiguity degree condition. The ambiguity degree [11] quantifies the hardness of a PLL problem and is defined as

    γ = sup_{(x,y)∈X×Y: p(x,y)>0, ȳ∈Y: ȳ≠y} Pr_{S∼p(S|x,y)} [ȳ ∈ S].    (6)

When γ is less than 1, we say the small ambiguity degree condition is satisfied. Intuitively, it measures how often a specific non-ground-truth label co-occurs with a specific ground-truth label. When such a distractor label co-occurs with a ground-truth label in every instance, it is impossible to disambiguate the label, hence PLL is not EPRM learnable. With the mild assumption that γ < 1, Liu and Dietterich showed the following sample complexity bound for PLL.

Theorem 1 (PLL sample complexity bound [17]). Suppose the ambiguity degree of a PLL problem is small, 0 ≤ γ < 1. Let η = log(2/(1+γ)) and d_H be the Natarajan dimension of the hypothesis space H. Define

    n_0(H, ε, δ) = (4/(ηε)) ( d_H (log 4d_H + 2 log K + log(1/η)) + log(1/δ) + 1 );

then when n > n_0(H, ε, δ), Err(θ̂_n) < ε with probability at least 1 − δ.

We denote this result as Empirical Partial Risk Minimization (EPRM) learnability.

2.3.2. Classifier-consistency

A very recent work by Feng et al. [18] proposed new PLL risk estimators by viewing the partial label generation process as a multiple complementary label generation process [19, 20]. One of the proposed estimators is called the classifier-consistent (CC) risk R_cc(θ). For any multi-class loss function L : R^K × Y → R_+, R_cc(θ) is defined as follows:

    R_cc(θ) = E_{(x,S)∼p(x,S)} L(Q^⊤ p(y|x; θ), s),    (7)

where Q ∈ R^{K×K} is a label transition matrix in the context of multiple complementary label learning and s is a uniformly randomly chosen label from S. R̂_{cc,n}(θ) is denoted as the empirical risk of Eq. 7.

Feng et al.'s main contribution is to prove an estimation error bound for the CC risk (7). Let θ̂_n = arg min_{θ∈Θ} R̂_{cc,n}(θ) and θ* = arg min_{θ∈Θ} R_cc(θ) denote the empirical and the true minimizer, respectively. Additionally, H_y refers to the model hypothesis space for label y. Then, the estimation error bound for the CC risk is given as follows.

Theorem 2 (Estimation error bound for the CC risk [18]). Assume the loss function L(Q^⊤ p(y|x; θ), s) is ρ-Lipschitz with respect to the first argument in the 2-norm and upper-bounded by M. Then, for any δ > 0, with probability at least 1 − δ,

    R_cc(θ̂_n) − R_cc(θ*) ≤ 8ρ Σ_{y=1}^{K} R_n(H_y) + 2M √( log(2/δ) / (2n) ),

where R_n(H_y) refers to the expected Rademacher complexity of the hypothesis space for label y, H_y, with sample size n.

If the uniform label transition probability is assumed, i.e., Q_ij = δ_ij I(j ∈ S_j) / (2^{K−1} − 1), Eq. 7 becomes equivalent to our deep naive loss (Eq. 2) up to some constant factors. Hence, Theorems 1 and 2 give generalization bounds on the partial risk and the CC risk (the same as Eq. 2), respectively.

2.4. Alternative Explanation of Generalization in DNPL

Since the work of [26], the mystery of deep learning's generalization ability has been widely investigated in the standard supervised learning setting. While it is still not fully understood why over-parametrized deep neural networks generalize well, several studies suggest that deep learning models are inherently biased toward simple functions [16, 27]. In particular, Valle-Perez et al. [16] empirically observed that solutions from stochastic gradient descent (SGD) are biased toward neural networks with smaller complexity. They observed the following universal scaling behavior in the output distribution p(θ) of SGD:

    p(θ) ≲ exp(−a C(θ) + b),    (8)

where C(θ) is a computable proxy of the (uncomputable) Kolmogorov complexity and a, b are θ-independent constants. One example of a complexity measure C(θ) is the Lempel-Ziv complexity [16], which is roughly the length of θ compressed with a ZIP compressor.

In deep naive PLL, the model parameter is a minimizer of the empirical partial label risk R̂_{p,n}(θ) (Eq. 4). The minima of R̂_{p,n}(θ) are wide because there are many model parameters that perfectly fit the given partially labeled examples. The support of SGD's output distribution will lie in these wide minima. According to Eq. 8, this distribution is heavily biased toward parameters with small complexities. One crucial observation is that models fitting inconsistent labels will generally have large complexities, since they have to memorize each example. According to Eq. 8, such models are exponentially unlikely to be output by SGD. Hence the most likely output of the deep naive PLL method is a classifier with small error. As a result, the implications of both Theorems 1 and 2 appear to be empirically correct in spite of the vacuity of their model-complexity terms.
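To make the complexity proxy C(θ) in Eq. (8) concrete, one way to instantiate the ZIP-compression idea attributed to [16] above is to compress a description of the learned function, for example the classifier's predictions on a fixed reference set. The sketch below is our own illustration of this idea rather than the exact measure of [16]; the reference-set choice, byte encoding, and all names are assumptions.

```python
# Illustrative ZIP-based complexity proxy in the spirit of Eq. (8): compress the
# classifier's predicted labels on a fixed reference set and take the compressed
# length as C(theta). Smaller values indicate a "simpler" learned function.
import zlib
import torch

@torch.no_grad()
def zip_complexity(model: torch.nn.Module, reference_inputs: torch.Tensor) -> int:
    preds = model(reference_inputs).argmax(dim=1)  # predicted labels on the reference set
    raw = bytes(preds.to(torch.uint8).tolist())    # serialize the label sequence (assumes < 256 classes)
    return len(zlib.compress(raw, 9))              # compressed length as the complexity proxy

# Toy usage with a small multilayer perceptron and random reference inputs.
model = torch.nn.Sequential(torch.nn.Linear(64, 512), torch.nn.ELU(),
                            torch.nn.Linear(512, 256), torch.nn.ELU(),
                            torch.nn.Linear(256, 10))
print(zip_complexity(model, torch.randn(1024, 64)))
```

Under the picture sketched in Section 2.4, a network that memorizes inconsistent candidate labels would produce a less compressible prediction sequence, and Eq. (8) makes such solutions exponentially unlikely outputs of SGD.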
3. EXPERIMENTS

In this section we give the reader two points. First, deep neural network classifiers trained with the naive loss can achieve competitive performance on real-world benchmarks. Second, the generalization gaps of trained classifiers effectively decrease as the training set size increases.

3.1. Benchmarks on Real-world PLL Datasets

3.1.1. Datasets and Comparing Methods

We use four real-world datasets: Lost [28], MSRCv2 [8], Soccer Player [9], and Yahoo! News [29]. All real-world datasets can be found on this website². We denote the suggested method as Deep Naive Partial label Learning (DNPL). We compare DNPL with eleven baseline methods. There are eight parametric methods: CLPL [11], CORD [13], ECOC [21], PL-BLC [15], PL-LE [23], PRODEN [24], SDIM [14], SURE [25], and three non-parametric methods: GM-PLL [22], IPAL [12], PLKNN [10]. Note that both CORD and PL-BLC are deep learning-based PLL methods which include label identification or mean-teaching techniques.

² http://palm.seu.edu.cn/zhangml/

3.1.2. Models and Hyperparameters

We employ a neural network with the following architecture: d_in − 512 − 256 − d_out, where the numbers represent the dimensions of the layers and d_in (d_out) is the input (output) dimension. The neural network has the same size as that of PL-BLC. Batch normalization [30] is applied after each layer, followed by an ELU activation layer [31]. The Yogi optimizer [32] is used with a fixed learning rate of 10^-3 and default momentum parameters (0.9, 0.999).

3.1.3. Benchmark Results

Table 1 reports means and standard deviations of observed accuracies. Accuracies of the naive model are measured over 5 repeated 10-fold cross-validations, and accuracies of the others are measured over 10-fold cross-validation.

Table 1. Benchmark results (mean accuracy ± std) on the real-world datasets. Numbers in parentheses represent the rankings of the comparing methods, and the sixth column is the average ranking. Best methods are emphasized in boldface. •/◦ indicates whether our method (DNPL) is better/worse than the comparing method with respect to an unpaired Welch t-test at the 5% significance level.

Method | Lost | MSRCv2 | Soccer Player | Yahoo! News | Avg. Rank | Reference | Presented at
DNPL | 81.1±3.7% (2) | 54.4±4.3% (1) | 57.3±1.4% (2) | 69.1±0.9% (1) | 1.50 | This work | —
CLPL | 74.2±3.8% (7) • | 41.3±4.1% (12) • | 36.8±1.0% (12) • | 46.2±0.9% (12) • | 10.75 | [11] | JMLR 11
CORD | 80.6±2.6% (4) | 47.4±4.0% (9) • | 45.7±1.3% (11) • | 62.4±1.0% (9) • | 8.25 | [13] | AAAI 17
ECOC | 70.3±5.2% (9) • | 50.5±2.7% (6) • | 53.7±2.0% (7) • | 66.2±1.0% (5) • | 6.75 | [21] | TKDE 17
GM-PLL | 73.7±4.3% (8) • | 53.0±1.9% (3) | 54.9±0.9% (4) • | 62.9±0.7% (8) • | 5.75 | [22] | TKDE 19
IPAL | 67.8±5.3% (10) • | 52.9±3.9% (4) | 54.1±1.6% (5) • | 60.9±1.1% (10) • | 7.25 | [12] | AAAI 15
PL-BLC | 80.6±3.2% (4) | 53.6±3.7% (2) | 54.0±0.8% (6) • | 67.9±0.5% (2) • | 3.50 | [15] | AAAI 20
PL-LE | 62.9±5.6% (11) • | 49.9±3.7% (7) • | 53.6±2.0% (8) • | 65.3±0.6% (6) • | 8.00 | [23] | AAAI 19
PLKNN | 43.2±5.1% (12) • | 41.7±3.4% (11) • | 49.5±1.8% (10) • | 48.3±1.1% (11) • | 11.00 | [10] | IDA 06
PRODEN | 81.6±3.5% (1) | 43.4±3.3% (10) • | 55.3±5.6% (3) • | 67.5±0.7% (3) • | 4.25 | [24] | ICML 20
SDIM | 80.1±3.1% (5) | 52.0±3.7% (5) | 57.7±1.6% (1) | 66.3±1.3% (4) • | 3.75 | [14] | IJCAI 19
SURE | 78.0±3.6% (6) • | 48.1±3.6% (8) • | 53.3±1.7% (9) • | 64.4±1.5% (7) • | 7.50 | [25] | AAAI 19

The benchmark results indicate that DNPL achieves state-of-the-art performance over all four datasets. In particular, DNPL outperforms PL-BLC, which uses a neural network of the same size as ours, on those datasets. Unlike PL-BLC or CORD, DNPL does not need computationally expensive processes like label identification and mean-teaching. This means that by simply plugging our surrogate loss into a deep learning classifier, we can build a sufficiently competitive PLL model.

We also observe that, on the Soccer Player and Yahoo! News datasets, DNPL outperforms almost all of the comparing methods. Given the large-scale, high-dimensional nature of Soccer Player and Yahoo! News compared to the other datasets, this observation suggests that DNPL is particularly advantageous on large-scale, high-dimensional datasets.

3.2. Generalization Gaps of Deep Naive PLL

In this section, we empirically show that conventional learning theories (Theorems 1 and 2) cannot explain the learning behaviors of DNPL. Figure 1 shows how the gap |Err(θ̂_n) − R̂_{p,n}(θ̂_n)| and the CC risk³ R_cc(θ̂_n) decrease as the dataset size n increases. We observe these gap-closing behaviors even though the neural networks are over-parametrized, i.e., the number of parameters (∼10^5) is much larger than the training set size (∼10^4).

Fig. 1. Generalization gaps with respect to training set size for (a) the Yahoo! News dataset and (b) the Soccer Player dataset. Error bars represent standard deviations over 10 repeated experiments. The same experiments were performed on the two smaller datasets (Lost / MSRCv2); those results are omitted because they show the same tendency.

³ We have always observed that, with our over-parameterized neural network, zero risk can be achieved for R_cc(θ*). Therefore, we omit this term.

4. CONCLUSIONS

This work showed that a simple naive loss is applicable to training high-performance deep classifiers with partially labeled examples. Moreover, this method does not require any label disambiguation or explicit regularization. Our observations indicate that the deep naive method's unreasonable effectiveness cannot be explained by existing learning theories. This raises interesting questions deserving further study: 1) To what extent does label disambiguation help learning with partial labels? 2) How does deep learning generalize in partial label learning?
5. REFERENCES

[1] Yen-Chang Hsu, Zhaoyang Lv, Joel Schlosser, Phillip Odom, and Zsolt Kira, "Multi-class classification without multi-class labels," in ICLR, 2019.
[2] Ryuichi Kiryo, Gang Niu, Marthinus C. du Plessis, and Masashi Sugiyama, "Positive-unlabeled learning with non-negative risk estimator," in NeurIPS, 2017.
[3] Hirotaka Kaji, Hayato Yamaguchi, and Masashi Sugiyama, "Multi task learning with positive and unlabeled data and its application to mental state prediction," in ICASSP, 2018.
[4] Oded Maron and Tomás Lozano-Pérez, "A framework for multiple-instance learning," in NeurIPS, 1998.
[5] Yun Wang, Juncheng Li, and Florian Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," in ICASSP, 2019.
[6] Rong Jin and Zoubin Ghahramani, "Learning with multiple labels," in NeurIPS, 2003.
[7] Jie Luo and Francesco Orabona, "Learning from candidate labeling sets," in NeurIPS, 2010.
[8] Liping Liu and Thomas G. Dietterich, "A conditional multinomial mixture model for superset label learning," in NeurIPS, 2012.
[9] Zinan Zeng, Shijie Xiao, Kui Jia, Tsung-Han Chan, Shenghua Gao, Dong Xu, and Yi Ma, "Learning by associating ambiguously labeled images," in CVPR, 2013.
[10] Eyke Hüllermeier and Jürgen Beringer, "Learning from ambiguously labeled examples," Intelligent Data Analysis, 2006.
[11] Timothee Cour, Ben Sapp, and Ben Taskar, "Learning from partial labels," JMLR, 2011.
[12] Min-Ling Zhang and Fei Yu, "Solving the partial label learning problem: An instance-based approach," in AAAI, 2015.
[13] Cai-Zhi Tang and Min-Ling Zhang, "Confidence-rated discriminative partial label learning," in AAAI, 2017.
[14] Lei Feng and Bo An, "Partial label learning by semantic difference maximization," in IJCAI, 2019.
[15] Yan Yan and Yuhong Guo, "Partial label learning with batch label correction," in AAAI, 2020.
[16] Guillermo Valle-Perez, Chico Q. Camargo, and Ard A. Louis, "Deep learning generalizes because the parameter-function map is biased towards simple functions," in ICLR, 2018.
[17] Liping Liu and Thomas Dietterich, "Learnability of the superset label learning problem," in ICML, 2014.
[18] Lei Feng, Jiaqi Lv, Bo Han, Miao Xu, Gang Niu, Xin Geng, Bo An, and Masashi Sugiyama, "Provably consistent partial-label learning," in NeurIPS, 2020.
[19] Lei Feng and Bo An, "Learning from multiple complementary labels," in ICML, 2020.
[20] Yuzhou Cao and Yitian Xu, "Multi-complementary and unlabeled learning for arbitrary losses and models," in ICML, 2020.
[21] Min-Ling Zhang, Fei Yu, and Cai-Zhi Tang, "Disambiguation-free partial label learning," IEEE Transactions on Knowledge and Data Engineering, 2017.
[22] Gengyu Lyu, Songhe Feng, Tao Wang, Congyan Lang, and Yidong Li, "GM-PLL: Graph matching based partial label learning," IEEE Transactions on Knowledge and Data Engineering, 2019.
[23] Ning Xu, Jiaqi Lv, and Xin Geng, "Partial label learning via label enhancement," in AAAI, 2019.
[24] Jiaqi Lv, Miao Xu, Lei Feng, Gang Niu, Xin Geng, and Masashi Sugiyama, "Progressive identification of true labels for partial-label learning," in ICML, 2020.
[25] Lei Feng and Bo An, "Partial label learning with self-guided retraining," in AAAI, 2019.
[26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, "Understanding deep learning requires rethinking generalization," in ICLR, 2017.
[27] Giacomo De Palma, Bobak Kiani, and Seth Lloyd, "Random deep neural networks are biased towards simple functions," in NeurIPS, 2019.
[28] Gabriel Panis, Andreas Lanitis, Nicholas Tsapatsoulis, and Timothy F. Cootes, "Overview of research on facial ageing using the FG-NET ageing database," IET Biometrics, 2016.
[29] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid, "Multiple instance metric learning from automatically labeled bags of faces," in ECCV, 2010.
[30] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.
[31] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in ICLR, 2016.
[32] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar, "Adaptive methods for nonconvex optimization," in NeurIPS, 2018.