Learning Curves for Analysis of Deep Networks
Derek Hoiem 1, Tanmay Gupta 2, Zhizhong Li 1, Michal M. Shlapentokh-Rothman 1

1 University of Illinois at Urbana-Champaign. 2 PRIOR @ Allen Institute for AI. Correspondence to: Derek Hoiem. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Learning curves model a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to evaluate design choices, such as pretraining, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations. Our experiments exemplify use of learning curves for analysis and yield several interesting observations.

1. Introduction

The performance of a learning system depends strongly on the number of training samples, but standard evaluations use a fixed train/test split. While some works measure performance using subsets of training data, the lack of a systematic way to measure and report performance as a function of training size is a barrier to progress in machine learning research, particularly in areas like representation learning, data augmentation, and low-shot learning that specifically address the limited-data regime. What gets measured gets optimized, so we need better measures of learning ability to design better classifiers for the spectrum of data availability.

In this paper, we establish and demonstrate use of learning curves to improve evaluation of classifiers (see Fig. 1). Learning curves, which model error as a function of training set size, were introduced nearly thirty years ago by Cortes et al. (1993) to accelerate model selection of deep networks, and recent works have demonstrated the predictability of performance improvements with more data (Hestness et al., 2017; Johnson & Nguyen, 2017; Kaplan et al., 2020; Rosenfeld et al., 2020) or more network parameters (Kaplan et al., 2020; Rosenfeld et al., 2020). But such studies aim to extrapolate rather than evaluate and have typically required large-scale experiments that are outside the computational budgets of many research groups.

Experimentally, we find that the extended power law e_test(n) = α + η n^γ yields a well-fitting learning curve, where e_test is test error and n is the number of training samples (or "training size"), but that the parameters {α, η, γ} are individually unstable under measurement perturbations. To facilitate curve comparisons, we abstract the curve into two key parameters, eN and βN, that have intuitive meanings and are more stable under measurement variance. eN is the test error at n = N, and βN is a measure of data-reliance, how much a classifier's error will change if the training size changes. Our experiments show that learning curves provide insights that cannot be obtained by single-point comparisons of performance. Our aim is to promote the use of learning curves as part of a standard learning system evaluation.

Our key contributions:

• Investigate how to best model, estimate, characterize, and display learning curves for use in classifier analysis.

• Exemplify use of learning curves with analysis of the impact on error and data-reliance of network architecture, optimization, depth, width, fine-tuning, data augmentation, and pretraining.

Table 1 shows validated and rejected popular beliefs that single-point comparisons often overlook. In the following sections, we investigate how to model learning curves (Sec. 2), how to estimate them (Sec. 3), and what they can tell us about the impact of design decisions (Sec. 4), with discussion of limitations and future work in Sec. 5.

2. Modeling Learning Curves

The learning curve measures test error e_test as a function of the number of training samples n for a given classification model and learning method. Previous empirical observations suggest a functional form e_test(n) = α + η n^γ, with bias-variance trade-off and generalization theories typically indicating γ = −0.5. We summarize what bias-variance trade-off and generalization theories (Sec. 2.1) and empirical studies (Sec. 2.2) can tell us about learning curves, and describe our proposed abstraction in Sec. 2.3.
[Figure 1 contains four panels: (a) typical classifier evaluation and reporting: error on the full dataset, Model A 27.92 vs. Model B 32.40; (b) the proposed evaluation, which also characterizes the change in performance with changing training size as in (d): Model A (e400 = 27.91, β400 = 18.23, γ = −0.41) vs. Model B (e400 = 32.42, β400 = 5.61, γ = −0.35); (c) model comparison via learning curves, plotting error against n^−0.5, where B is better than A at small training sizes and A is better than B at large ones; (d) characterization of a learning curve via error (eN) and data-reliance (βN) using the local linear approximation at N, with ẽ_N/4 = eN + βN, ẽ_N = eN, ẽ_4N = eN − 0.5βN, and ẽ_∞ = eN − βN, where βN = −2ηγN^γ.]

Figure 1: Evaluation with learning curves vs single-point comparison: Comparing error of models trained on the full dataset, as shown in (a), is standard practice but provides an incomplete view of a classifier's performance. We propose a methodology to estimate and visualize learning curves that model a classifier's performance with varying amounts of training data (b,c). We also propose a succinct summary of a model's curve in terms of error and data-reliance (b,d).

Popular belief | Supported? | Exp. figures
Pre-training on similar domains nearly always helps compared to training from scratch. | Y | 5a, 5b, 6
Pre-training, even on similar domains, introduces bias that would harm performance with a large enough training set. | U | 6
Self-/un-supervised training performs better than supervised pre-training for small datasets. | N | 6
Fine-tuning the entire network (vs. just the classification layer) is only helpful if the training set is large. | N | 5a, 5b
Increasing network depth, when fine-tuning, harms performance for small training sets, due to an overly complex model. | N | 7a
Increasing network depth, when fine-tuning, is more helpful for larger training sets than smaller ones. | N | 7a
Increasing network depth, if the backbone is frozen, is more helpful for smaller training sets than larger ones. | N | 7d
Increasing depth or width improves more than ensembles of smaller networks with the same number of parameters. | Y | 7f
Data augmentation is roughly equivalent to using an m-times larger training set for some m. | Y | 8

Table 1: Deep learning quiz! We encourage our readers to judge each claim as T (true) or F (false) and then see if our experimental results support the claim (Yes/No/Unsure). In many cases, particularly regarding fine-tuning and network depth, the results surprised the authors. Our experiments show that learning curve analysis provides a systematic way to investigate these suppositions and others.

2.1. Bias-variance Trade-off and Generalization Theory

The bias-variance trade-off is an intuitive and theoretically sound way to think about generalization. The "bias" is error due to the inability of the classifier to encode the optimal decision function, and the "variance" is error due to limited availability of training samples for parameter estimation. This is called a trade-off because a classifier with more parameters tends to have less bias but higher variance. Geman et al. (1992) decompose mean squared regression error into bias and variance and explore the implications for neural networks, leading to the conclusion that "identifying the right preconditions is the substantial problem in neural modeling". This conclusion foreshadows the importance of pretraining, though Geman et al. thought the preconditions must be built in rather than learned. Domingos (2000) extends the analysis to classification. Theoretically, the mean squared error (MSE) can be modeled as e²_test(n) = bias² + noise² + var(n), where "noise" is irreducible error due to a non-unique mapping from inputs to labels, and variance can be modeled as var(n) = σ²/n for n training samples.

The η n^γ term in e_test(n) appears throughout machine learning generalization theory, usually with γ = −0.5. For example, the bounds based on hypothesis VC-dimension (Vapnik & Chervonenkis, 1971) and Rademacher complexity (Gnecco & Sanguineti, 2008) are both O(c n^−0.5), where c depends on the complexity of the classification model. More recent work also follows this form, e.g. (Neyshabur et al., 2018; Bartlett et al., 2017; Arora et al., 2018; Bousquet & Elisseeff, 2002). One caveat is that the exponent γ can deviate from −0.5 if the classifier parameters depend directly on n. For example, Tsybakov (2008) shows that setting the bandwidth of a kernel density estimator based on n causes the dominant term to be bias, rather than variance, changing γ in the bound. In our experiments, all training and model parameters are fixed for each curve, except the learning schedule, but we note that the learning curve depends on optimization methods and hyperparameters as well as the classifier model.

2.2. Empirical Studies

Some recent empirical studies (e.g. Sun et al. (2017)) claim a log-linear relationship between error and training size, but this holds only when asymptotic error is zero. Hestness et al. (2017) model error as e_test(n) = α + η n^γ but often find γ much smaller in magnitude than −0.5 and suggest that poor fits indicate a need for better hyperparameter tuning. This empirically supports that data sensitivity depends both on the classification model and on the efficacy of the optimization algorithm and parameters. Johnson & Nguyen (2017) also find a better fit with this extended power law model than by restricting γ = −0.5 or α = 0.

In the language domain, learning curves are used in a fascinating study by Kaplan et al. (2020). For natural language transformers, they show that a power law relationship between logistic loss, model size, compute time, and dataset size is maintained if, and only if, each is increased in tandem. We draw some similar conclusions to their study, such as that increasing model size tends to improve performance especially for small training sets (which surprised us). However, the studies are largely complementary, as we study convolutional nets in computer vision, classification error instead of logistic loss, and a broader range of design choices such as data augmentation, pretraining source, architecture, and optimization. Also related, Rosenfeld et al. (2020) model error as a function of both training size and number of model parameters with a five-parameter function that accounts for training size, model parameter size, and chance performance. A key difference in our work is that we focus on how to best draw insights about design choices from learning curves, rather than on extrapolation. As such, we propose methods to estimate learning curves and their variance from a relatively small number of trained models.

2.3. Proposed Characterization of Learning Curves for Evaluation

Our experiments in Sec. 4.1 show that the learning curve model e(n) = α + η n^γ results in excellent leave-one-size-out RMS error and extrapolation. However, α, η, and γ cannot be meaningfully compared across curves because the parameters have high covariance under small data perturbations, and comparing η values is not meaningful unless γ is fixed and vice-versa. This would prevent tabular comparisons and makes it harder to draw quantitative conclusions.

To overcome this problem, we propose to report error and sensitivity to training size in a way that can be derived from various learning curve models and is insensitive to data perturbations. The curve is characterized by error eN = α + ηN^γ and data-reliance βN, and we typically choose N as the full dataset size. Noting that most learning curves are locally well approximated by a model linear in n^−0.5, we compute data-reliance as βN = N^−0.5 · ∂e/∂n^−0.5 |_(n=N) = −2ηγN^γ. When the error is plotted against n^−0.5, βN is the slope at N scaled by N^−0.5, with the scaling chosen to make the practical implications of βN more intuitive. This yields a simple predictor for error when changing training size by a factor of d:

    ẽ(d·N) = eN + (1/√d − 1) βN.    (1)

This is a first order Taylor expansion of e(n) around n = N with respect to n^−0.5. By this linearized estimate, asymptotic error is eN − βN, a 4-fold increase in data (e.g. 400 → 1600) reduces error by 0.5βN, and using only one quarter of the dataset (e.g. 400 → 100) increases the error by βN. For two models with similar eN, the one with a larger βN would outperform with more data but underperform with less. (eN, βN, γ) is a complete re-parameterization of the extended power law, with γ + 0.5 indicating the curvature in the n^−0.5 scale. See Fig. 1d for illustration.
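To make the (eN, βN) abstraction concrete, the following minimal sketch (ours, not from the paper; the function names and the toy parameter values α = 10, η = 200, γ = −0.5, N = 400 are illustrative assumptions) computes eN and βN from a fitted extended power law and applies the linearized predictor of Eq. (1).

```python
import numpy as np

def characterize(alpha, eta, gamma, N):
    """Summarize an extended power law e(n) = alpha + eta * n**gamma
    at reference size N as (e_N, beta_N), following Sec. 2.3."""
    e_N = alpha + eta * N ** gamma
    beta_N = -2.0 * eta * gamma * N ** gamma   # beta_N = -2 * eta * gamma * N^gamma
    return e_N, beta_N

def predict_scaled(e_N, beta_N, d):
    """Linearized error estimate of Eq. (1) when training size changes from N to d*N."""
    return e_N + (1.0 / np.sqrt(d) - 1.0) * beta_N

# Illustrative (made-up) curve: alpha = 10, eta = 200, gamma = -0.5, N = 400.
e_N, beta_N = characterize(alpha=10.0, eta=200.0, gamma=-0.5, N=400)
print(e_N, beta_N)                        # 20.0 and 10.0
print(predict_scaled(e_N, beta_N, 4))     # 4x data: e_N - 0.5*beta_N = 15.0
print(predict_scaled(e_N, beta_N, 0.25))  # 1/4 data: e_N + beta_N = 30.0
```

With these two summary numbers, curves fit under different γ can still be compared at a common training size without refitting, which is the point of the abstraction.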
3. Estimating Learning Curves

We now describe the method for estimating the learning curve from error measurements, with confidence bounds on the estimate. Let eij denote the random variable corresponding to test error when the model is trained on the j-th fold of ni samples (either per class or in total). We assume {eij}, j = 1..Fi, are i.i.d. according to N(µi, σi²). We want to estimate learning curve parameters α (asymptotic error), η, and γ, such that eij = α + η ni^γ + εij, where εij ~ N(0, σi²) and µij = E[eij] = µi. Sections 3.1 and 3.2 describe how to estimate the mean and variance of α and η for a given γ, and Sec. 3.3 describes our approach for estimating γ.

3.1. Weighted Least Squares Formulation

We estimate learning curve parameters {α, η} by optimizing a weighted least squares objective:

    G(γ) = min_{α,η} Σ_{i=1..S} Σ_{j=1..Fi} wij (eij − α − η ni^γ)²    (2)

where wij = 1/(Fi σi²). Fi is the number of models trained with data size ni and is used to normalize the weight so that the total weight for observations from each training size does not depend on Fi. The factor of σi² accounts for the variance of εij. Assuming constant σi² and removing the Fi factor would yield unweighted least squares.

The variance of the estimate of σi² from Fi samples is 2σi⁴/Fi, which can lead to over- or under-weighting data for particular i if Fi is small. Recall that each sample eij requires training an entire model, so Fi is always small in our experiments. We would expect the variance to have the form σi² = σ0² + σ̂²/ni, where σ0² is the variance due to random initialization and optimization and σ̂²/ni is the variance due to randomness in selecting ni samples. We validated this variance model by averaging over the variance estimates for many different network models on the CIFAR-100 (Krizhevsky, 2012) dataset. We use σ0² = 0.02 in all experiments and then use a least squares fit to estimate a single σ̂² parameter from all samples e in a given learning curve. This attention to wij may seem fussy, but without such care we find that the learning curve often fails to account sufficiently for all the data in some cases.
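As an illustration of Sec. 3.1 (a minimal sketch under stated assumptions, not the authors' released code: the fold errors and the σ̂² value are made up, and σ0² = 0.02 follows the text), the snippet below builds the weights wij = 1/(Fi σi²) from the variance model σi² = σ0² + σ̂²/ni and solves the inner weighted least squares of Eq. (2) for {α, η} at a fixed γ.

```python
import numpy as np

# Toy measurements: errors (in %) from F_i folds at each training size n_i.
# These numbers are illustrative, not results from the paper.
errors = {25: [52.1, 53.0, 51.4, 52.6],   # n_i -> fold errors e_ij
          50: [44.8, 45.5, 44.1],
          100: [38.9, 39.4],
          200: [33.2, 33.8],
          400: [29.1]}

sigma0_sq = 0.02      # variance from random initialization/optimization (fixed in the paper)
sigma_hat_sq = 40.0   # assumed value here; the paper fits a single sigma_hat^2 per curve

def fit_alpha_eta(errors, gamma):
    """Solve the inner weighted least squares of Eq. (2) for (alpha, eta) at fixed gamma."""
    n, e, w = [], [], []
    for n_i, e_list in errors.items():
        F_i = len(e_list)
        var_i = sigma0_sq + sigma_hat_sq / n_i        # sigma_i^2 = sigma_0^2 + sigma_hat^2 / n_i
        for e_ij in e_list:
            n.append(float(n_i)); e.append(e_ij); w.append(1.0 / (F_i * var_i))
    n, e, w = np.array(n), np.array(e), np.array(w)
    A = np.stack([np.ones_like(n), n ** gamma], axis=1)  # columns: [1, n^gamma]
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * e, rcond=None)
    G = np.sum((sw * (e - A @ theta)) ** 2)              # weighted residual, i.e. G(gamma)
    return theta[0], theta[1], G

alpha, eta, G = fit_alpha_eta(errors, gamma=-0.5)
print(f"alpha = {alpha:.2f}, eta = {eta:.2f}, G(-0.5) = {G:.3f}")
```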
3.2. Solving for Learning Curve Mean and Variance

Concatenating errors across dataset sizes (indexed by i) and folds results in an error vector e of dimension D = Σ_{i=1..S} Fi. For each d ∈ {1, ..., D}, e[d] is an observation of error at dataset size n_id that follows N(µ_id, σ²_id), with id mapping d to the corresponding i.

The weighted least squares problem can be formulated as solving a system of linear equations W^(1/2) e = W^(1/2) A θ, where W ∈ R^(D×D) is a diagonal matrix of weights with W_dd = w_d, A ∈ R^(D×2) is a matrix with A[d, :] = [1  n_d^γ], and θ = [α  η]^T are the parameters of the learning curve, treating γ as fixed for now. The estimator for the learning curve is then given by θ̂ = (W^(1/2) A)⁺ W^(1/2) e = M e, where M ∈ R^(2×D) and ⁺ is the pseudo-inverse operator.

The covariance of the estimator is given by Σθ̂ = M Σe M^T, where Σθ̂ ∈ R^(2×2) and Σe ∈ R^(D×D) is the diagonal covariance of e with Σe[d, d] = σ²_id. We compute our empirical estimate of σi² as described in Sec. 3.1.

Since the estimated curve is given by ê(n) = [1  n^γ] θ̂, the 95% bounds at any n can be computed as ê(n) ± 1.96 · σ̂(n) with

    σ̂²(n) = [1  n^γ] Σθ̂ [1  n^γ]^T.    (3)

For a given γ, these confidence bounds reflect the variance in the estimated curve due to variance in error measurements.
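The closed-form estimator, its covariance, and the bounds of Eq. (3) could be sketched as follows (again illustrative, with made-up observations; the pseudo-inverse route mirrors the equations above but is not the authors' implementation).

```python
import numpy as np

# Illustrative observations: training size n_d, error e_d (in %), and per-size
# variance sigma_d^2 from the model of Sec. 3.1 (all numbers are made up).
n = np.array([25, 25, 25, 25, 50, 50, 50, 100, 100, 200, 200, 400], dtype=float)
e = np.array([52.1, 53.0, 51.4, 52.6, 44.8, 45.5, 44.1, 38.9, 39.4, 33.2, 33.8, 29.1])
var = 0.02 + 40.0 / n                               # sigma_d^2 = sigma_0^2 + sigma_hat^2 / n_d
folds = np.array([np.sum(n == x) for x in n])       # F_i for each observation's training size
w = 1.0 / (folds * var)                             # diagonal of W

gamma = -0.5
A = np.stack([np.ones_like(n), n ** gamma], axis=1)         # A[d, :] = [1, n_d^gamma]
M = np.linalg.pinv(np.sqrt(w)[:, None] * A) * np.sqrt(w)    # M such that theta_hat = M e
theta = M @ e                                               # [alpha_hat, eta_hat]
cov_theta = M @ np.diag(var) @ M.T                          # Sigma_theta = M Sigma_e M^T

def curve_with_bounds(n_query):
    """Estimated error and 95% confidence bounds at training size n_query (Eq. 3)."""
    a = np.array([1.0, n_query ** gamma])
    mean = a @ theta
    std = np.sqrt(a @ cov_theta @ a)
    return mean, mean - 1.96 * std, mean + 1.96 * std

print("alpha, eta:", theta)
print("e(1600) with 95% bounds:", curve_with_bounds(1600))
```

Because the weights are fixed before solving, the estimate is linear in the observed errors, which is what makes the covariance propagation a single matrix product.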
3.3. Estimating γ

We search for the γ that minimizes the weighted least squares objective with an L1 prior that slightly encourages values close to −0.5. Specifically, we solve

    min_{γ ∈ (−1, 0)} G(γ) + λ|γ + 0.5|    (4)

by searching over γ ∈ {−0.99, ..., −0.01} with λ = 5 (with error on a 100-point scale) for our experiments.
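The search of Eq. (4) reduces to a one-dimensional grid scan. A sketch, reusing the same toy observations and inlining a small version of the Eq. (2) solver (λ = 5 as in the text; everything else is assumed for illustration):

```python
import numpy as np

# Toy observations (same illustrative numbers as the earlier sketches).
n = np.array([25]*4 + [50]*3 + [100]*2 + [200]*2 + [400], dtype=float)
e = np.array([52.1, 53.0, 51.4, 52.6, 44.8, 45.5, 44.1, 38.9, 39.4, 33.2, 33.8, 29.1])
w = 1.0 / (np.array([np.sum(n == x) for x in n]) * (0.02 + 40.0 / n))

def G(gamma):
    """Weighted least squares residual of Eq. (2) at a fixed gamma."""
    A = np.stack([np.ones_like(n), n ** gamma], axis=1)
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * e, rcond=None)
    return np.sum((sw * (e - A @ theta)) ** 2)

lam = 5.0                             # lambda, with error on a 100-point scale
grid = np.linspace(-0.99, -0.01, 99)  # gamma in {-0.99, ..., -0.01}
best_gamma = min(grid, key=lambda g: G(g) + lam * abs(g + 0.5))  # objective of Eq. (4)
print(f"selected gamma = {best_gamma:.2f}")
```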
4. Experiments

We validate our choice of learning curve model and estimation method in Sec. 4.1 and use the learning curves to explore the impact of design decisions on error and data-reliance in Sec. 4.2.

Setup: Each learning curve is fit to test errors measured after training the classifier on various subsets of the training data. When less than the full training set is used, we train multiple classifiers using different partitions of the data (e.g. four classifiers on four quarters of the data). We avoid use of benchmark test sets, since the curves are used for analysis and ablation. Instead, training data are split 80/20, with 80% of the data used for training and validation and the remaining 20% for testing. For each curve, all models are trained using an initial learning rate that is selected using one subset of training data, and the learning rate schedule is set for each n based on validation. Unless otherwise noted, models are pretrained on ImageNet. Other hyperparameters are fixed for all experiments. Unless otherwise noted, we use the Ranger optimizer (Wright, 2019), which combines Rectified Adam (Liu et al., 2020), Look Ahead (Zhang et al., 2019), and Gradient Centralization (Yong et al., 2020), as preliminary experiments showed its effectiveness. For "linear", we train only the final classification layer with the other weights frozen to initialized values. All weights are trained when "fine-tuning". Tests are on Cifar100 (Krizhevsky, 2012), Cifar10, Places365 (Zhou et al., 2017), or Caltech-101 (Fei-Fei et al., 2006). See supplemental materials (Appendix A) for more implementation details.

4.1. Evaluation of Learning Curves Model and Fitting

We validate our learning curve model using leave-one-size-out prediction error, e.g. predicting empirical mean performance with 400 samples per class based on observing error from models trained on 25, 50, 100, and 200 samples.

Weighting Schemes. In the Fig. 2 table, learning curve prediction error is averaged for 16 Cifar100 classifiers with varying design decisions. We compare three weighting schemes (the w's in Eq. 2): wij = 1 is unweighted; wij = 1/σi² weights by the estimated size-dependent variance; wij = 1/(Fi σi²) makes the total weight for a given dataset size invariant to the number of folds. On average our proposed weighting performs best, with high significance compared to unweighted. The p-value is a paired t-test of difference of means calculated across all dataset sizes.

[Figure 2 (table): leave-one-size-out prediction RMSE and R², averaged over 16 Cifar100 classifiers, for different learning curve parameterizations and weighting schemes.]

Params | Weights | R² | RMSE (25) | RMSE (50) | RMSE (100) | RMSE (200) | RMSE (400) | avg RMSE | p-value
α, η, γ | 1/(σi² Fi) | 0.998 | 2.40 | 0.86 | 0.54 | 0.57 | 0.85 | 1.04 | -
α, η, γ | 1 | 0.999 | 2.38 | 0.83 | 0.69 | 0.54 | 1.08 | 1.10 | 0.06
α, η, γ | 1/σi² | 0.998 | 2.66 | 0.86 | 0.79 | 0.50 | 1.26 | 1.21 | 0.008
α, η | 1/(σi² Fi) | 0.988 | 3.41 | 1.09 | 0.69 | 0.72 | 1.21 | 1.42 |

Model Choice. We consider other parameterizations that are special cases of e(n) = α + η n^γ + δ n^(2γ). Setting δ = 0 (top row of the table) yields the model described in Sec. 2.3.
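A leave-one-size-out check of the kind described above could be sketched as follows (toy fold errors, γ fixed at −0.5 for brevity, and an assumed σ̂²; this is not the paper's evaluation code).

```python
import numpy as np

# Toy fold errors per training size (illustrative numbers only).
errors = {25: [52.1, 53.0, 51.4, 52.6], 50: [44.8, 45.5, 44.1],
          100: [38.9, 39.4], 200: [33.2, 33.8], 400: [29.1]}

def fit(errs, gamma=-0.5):
    """Weighted least squares fit of (alpha, eta), as in Secs. 3.1-3.2."""
    n, e, F = [], [], []
    for ni, es in errs.items():
        for eij in es:
            n.append(float(ni)); e.append(eij); F.append(len(es))
    n, e, F = np.array(n), np.array(e), np.array(F)
    w = 1.0 / (F * (0.02 + 40.0 / n))
    A = np.stack([np.ones_like(n), n ** gamma], axis=1)
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * e, rcond=None)
    return theta  # alpha, eta

held_out = 400  # leave one training size out and predict its mean error
alpha, eta = fit({ni: es for ni, es in errors.items() if ni != held_out})
pred = alpha + eta * held_out ** -0.5
observed = np.mean(errors[held_out])
print(f"predicted {pred:.2f} vs. observed {observed:.2f}")
```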
[Figure 4 plots error against n^−0.5 for different optimizers, with fitted (e4000; β4000; γ). Without pretraining (left): Adam (5.51; 9.89; −0.43); Ranger (5.08; 7.73; −0.5); SGD (6.03; 8.5; −0.5); SGD w/o momentum (7.3; 12.66; −0.32). With ImageNet pretraining (right): Adam (2.86; 2.82; −0.47); Ranger (2.98; 2.97; −0.4); SGD (2.76; 2.85; −0.38); SGD w/o momentum (2.64; 3.08; −0.36).]

Figure 4: Optimization on Cifar10 with ResNet-18. Without pretraining on left; with ImageNet pretraining on right.

[Figure 5 plots error against n^−0.5 for ResNet-18 with and without ImageNet pretraining, training either the whole network ("Finetune") or only the classification layer ("Linear"), with fitted (e400; β400; γ). (a) Transfer: ImageNet to Cifar100: No Pretr, Linear (79.29; 1.32; −0.84); No Pretr, Finetune (27.91; 18.23; −0.41); Pretr, Linear (32.42; 5.61; −0.35); Pretr, Finetune (18.86; 7.28; −0.57). (b) Transfer: ImageNet to Places365: No Pretr, Linear (92.79; 0.96; −0.5); No Pretr, Finetune (57.89; 12.86; −0.26); Pretr, Linear (59.91; 4.16; −0.38); Pretr, Finetune (54.0; 7.33; −0.28).]

Figure 5: Pretraining and fine-tuning with ResNet-18.

Pretraining and fine-tuning: Some prior works examine the effectiveness of pretraining. Kornblith et al. (2019) show that fine-tuned pretrained models outperform randomly initialized models across many architectures and datasets, but the gap is often small and narrows as training size grows. For object detection, He et al. (2019) find that, with long learning schedules, randomly initialized networks approach the performance of networks pretrained for classification, even on smaller training sizes, and Zoph et al. (2020) show that pretraining can sometimes harm performance when strong data augmentation is used.

Compared to prior works, our use of learning curve models enables extrapolation beyond available data and numerical comparison of data reliance. In Fig. 5 we see that, without fine-tuning ("linear"), pretraining leads to a huge improvement in e400 for all training sizes. When fine-tuning, the pretraining greatly reduces data-reliance β400 and also reduces e400, providing strong advantages with smaller training sizes that may disappear with enough data.

Pretraining data sources: In Fig. 6, we test on Cifar100 and Caltech-101 to compare different pretrained models: randomly initialized, supervised on ImageNet or Places365 (Zhou et al., 2017), and self-supervised on ImageNet (MOCO by He et al. (2020)). The strongest impact is on data-reliance, which leads to consistent orderings of models across training sizes for both datasets. Supervised ImageNet pretraining has the lowest eN and βN, then self-supervised MOCO, then supervised Places365, with random initialization trailing far behind all pretrained models. The original papers excluded either analysis of training size (MOCO) or fine-tuning (Places365), so ours is the first analysis on image benchmarks to compare finetuned models from different pretraining sources as a function of training size. Newly proposed methods for representation learning would benefit from further learning curve analysis.

[Figure 6 plots error against n^−0.5 for ResNet-50 with different pretraining sources. Cifar100 (left), with (eN; βN; γ): No Pretr (27.66; 20.93; −0.35); ImageNet Pretr (18.33; 4.32; −0.67); ImageNet MOCO Pretr (18.74; 11.21; −0.3); Places Pretr (20.49; 12.53; −0.24). Caltech-101 (right), with (e57; β57; γ): No Pretr (21.63; 22.92; −0.5); ImageNet Pretr (5.91; 6.25; −0.5); ImageNet MOCO Pretr (8.39; 10.46; −0.5); Places Pretr (11.27; 12.86; −0.5).]

Figure 6: Pretraining sources: test on Cifar100 (left) and Caltech-101 (right).

Network depth, width, and ensembles: The classical view is that smaller datasets need simpler models to avoid overfitting. In Figs. 7a and 7d, we show that not only do deeper networks have better potential at higher data sizes, their data reliance does not increase (the curves are nearly parallel, and data-reliance even drops a little for fine-tuning), making deeper networks perfectly suitable for smaller datasets. For linear classifiers (Fig. 7d), the deeper networks provide better features, leading to a consistent drop in e400. The small jump in data reliance between Resnet-34 and Resnet-50 may be due to the increase in the last-layer input size from 512 to 2048 nodes.
When increasing width, the fine-tuned networks (Fig. 7b) have reduced e400 without much change to data-reliance. With linear classifiers (Fig. 7e), increasing the width leads to little change or even an increase in e400, with a slight decrease in data-reliance.

An alternative to using a deeper or wider network is forming ensembles. Figure 7c shows that, while an ensemble of six ResNet-18's (each 11.7M parameters) improves over a single model, it has higher e400 and data-reliance than ResNet-101 (44.5M), Wide-ResNet-50 (68.9M), and Wide-ResNet-101 (126.9M). Three ResNet-50's (each 25.6M) underperform Wide-ResNet-50 on e400 but outperform for small amounts of data due to lower data reliance. Fig. 7f tabulates data reliance and error to simplify comparison.

Rosenfeld et al. (2020) show that error can be modeled as a function of either training size, model size, or both. Modeling both jointly can provide additional capabilities such as selecting model size based on data size, but requires many more experiments to fit the curve. Our experiments show more clearly the effect on data reliance due to different ways of changing model size.

[Figure 7 plots error against n^−0.5 on Cifar100, with (e400; β400; γ). (a) Depth, finetune: Resnet-18 (18.86; 7.28; −0.57); Resnet-34 (18.75; 4.36; −0.73); Resnet-50 (18.33; 4.32; −0.67); Resnet-101 (14.97; 5.03; −0.62). (b) Width, finetune: Resnet-50 (18.33; 4.32; −0.67); 2xWide-Resnet-50 (14.04; 6.0; −0.57); Resnet-101 (14.97; 5.03; −0.62); 2xWide-Resnet-101 (13.37; 5.81; −0.5). (c) Ensemble, finetune: 6xResnet-18 (15.96; 7.09; −0.49); 1xResnet-18 (18.86; 7.28; −0.57); 3xResnet-50 (15.96; 4.15; −0.64); 1xResnet-50 (18.33; 4.32; −0.67). (d) Depth, linear: Resnet-18 (32.42; 5.61; −0.35); Resnet-34 (30.02; 5.44; −0.33); Resnet-50 (28.63; 6.66; −0.22); Resnet-101 (25.86; 6.17; −0.21). (e) Width, linear: Resnet-50 (28.63; 6.66; −0.22); 2xWide-Resnet-50 (28.77; 4.89; −0.35); Resnet-101 (25.86; 6.17; −0.21); 2xWide-Resnet-101 (27.76; 5.41; −0.25). (f) Ensemble vs. depth vs. width, finetune:]

Model | #params | e400 | β400
Resnet 18 | 11.7M | 18.86 | 7.28
Resnet 50 | 25.6M | 18.33 | 4.32
Resnet 101 | 44.5M | 14.97 | 5.03
2x Wide Resnet 50 | 68.9M | 14.04 | 6.00
2x Wide Resnet 101 | 126.9M | 13.37 | 5.81
6x Ensemble Resnet 18 | 70.2M | 15.96 | 7.09
3x Ensemble Resnet 50 | 76.8M | 15.96 | 4.15

Figure 7: Depth, width, and ensembles on Cifar100. The table (f) compactly compares several curves.

Data Augmentation: We are not aware of previous studies on the interaction between data augmentation and training size. For example, a large survey (Shorten & Khoshgoftaar, 2019) compares different augmentation methods only at the full training size. One may expect that data augmentation acts as a regularizer with reduced effect for large training sizes, or even possibly a negative effect due to introducing bias. However, Fig. 8 shows that data augmentation on Places365 reduces error for all training sizes with little or no change to data-reliance when fine-tuning. e(n) with augmentation roughly equals e(1.8n) without it, supporting the view that augmentation acts as a multiplier on the value of an example. For the linear classifier, data augmentation has little apparent effect due to low data-reliance, but the results are still consistent with this multiplier.

[Figure 8 plots error against n^−0.5 for ResNet-18 on Places365 with and without data augmentation, with (e400; β400; γ): No Pretr, Finetune, w/ Aug (58.01; 12.44; −0.28); No Pretr, Finetune, w/o Aug (63.73; 13.01; −0.24); Pretr, Finetune, w/ Aug (53.88; 7.99; −0.22); Pretr, Finetune, w/o Aug (56.68; 7.94; −0.23); Pretr, Linear, w/ Aug (60.01; 3.83; −0.43); Pretr, Linear, w/o Aug (60.48; 4.52; −0.36).]

Figure 8: Data augmentation on Places365.

5. Discussion

Evaluation methodology is the foundation of research, impacting how we choose problems and rank solutions. Large train and test sets now serve as the fuel and crucible to refine machine learning methods. We now discuss the need for learning curves, limitations of our experiments, and directions for future work.

Case for learning curve models. Compared to individual runs, learning curves often provide more accurate or confident conclusions, which is crucial, as any misleading or partial conclusions can block progress. For example, in Fig. 5b, comparison at N = 1600 may indicate that pretraining has little impact on error when fine-tuning, but the curves show large differences in data-reliance, leading to a large gap when less data is available. In Fig. 6 (left), comparison at N = 400 may indicate that MOCO and supervised ImageNet pretraining are equally effective, but the curves show that the supervised ImageNet pretrained classifier has lower data-reliance (β = 4.32 vs. β = 11.21) and thus outperforms with less data. Generally, individual runs cannot reveal the sample efficiency of a learner. One curve may dominate another because it has less bias (leading to uniformly lower error, as with "linear" classifiers of increasing depth in Fig. 7) or less variance (leading to different curve slopes, as with different pretrained models in Fig. 6), and distinguishing the two helps us understand and improve.

Compared to a piecewise affine fit (i.e., plotting a line through observed error points), our modeling and characterization in terms of error@N and data-reliance has important advantages. First, our two-parameter characterization can be tabulated (Fig. 7f), facilitating comparison within and across papers. We can learn from object detection research, where the practice of reporting PR/ROC curves gave way to reporting AP, making it easier to average and compare within and across works when working with multiple datasets. Second, a piecewise affine fit is less stable under measurement error, leading to significantly poorer extrapolations. Finally, the parametric model helps to identify poor hyperparameter selection, indicated by poor model fit or extreme values of γ.
Cause and impact of γ: We speculate that γ is largely determined by hyperparameters and optimization rather than model design. This presents an opportunity to identify poor training regimes and improve them. Intuitively, one would expect that more negative γ values are better (i.e. γ = −1 preferable to γ = −0.5), since the error is O(n^γ), but we find that high-magnitude γ tends to come with high asymptotic error, indicating that the efficiency comes at the cost of over-commitment to initial conditions. We speculate (but with some disagreement among the authors) that γ ≈ −0.5 is an indication of a well-trained curve and will generally outperform curves with higher or lower γ, given the same classification model.

Small training sets: Error is bounded, and classifier performance with small training sets may be modeled as transitioning from random guessing to informed prediction, as shown by Rosenfeld et al. (2020). For simplicity, we do not model performance with very small training sizes, but studying the small-data regime could be interesting, particularly to determine whether design decisions have an impact at small sizes that is not apparent at larger sizes.

Losses and Prediction types: We analyze multiclass classification error, but the same analysis could likely be extended to other prediction types. For example, Kaplan et al. (2020) analyze learning curves of cross-entropy loss for language model transformers. Learning curve models of cross-entropy loss could also be used for problems like object detection or semantic segmentation that typically use more complex aggregate evaluations.

More design parameters and interactions: The interaction between data scale, model scale, and performance is well explored by Kaplan et al. (2020) and Rosenfeld et al. (2020), but it could also be interesting to explore interactions, e.g. between the class of architecture (e.g. VGG, ResNet, EfficientNet (Tan & Le, 2019)) and other design parameters, to see the impact of ideas such as skip-connections, residual layers, and bottlenecks. More extensive evaluation of data augmentation, representation learning, optimization, and regularization would also be interesting.
Unbalanced class distributions: In most of our experiments, we use an equal number of samples per class. Our experiments on Caltech-101 in Fig. 6 (right) and on Pets and Sun397 (Fig. 10 in the supplemental) use the original imbalanced distributions, demonstrating that the same learning curve model applies, but further experiments are needed to examine the impact of class imbalance.

Limits to extrapolation: Our experiments indicate good extrapolation up to 4x the observed training size. However, we caution against drawing conclusions about asymptotic error, since estimation of α is highly sensitive to data perturbations. eN − βN provides a more stable indicator of large-sample performance.

Note: The supplemental material contains implementation details (Appendix A); a user guide to fitting, displaying, and using learning curves (Appendix B); experiments on additional datasets (Appendix C); and a table of learning curve parameters for all experiments, also comparing eN and βN produced by two learning curve models (Appendix D). Code is currently available at prior.allenai.org/projects/lcurve.

6. Conclusion

Performance depends strongly on training size, so size-variant analysis more fully shows the impact of innovations. A reluctant researcher may protest that such analysis is too complicated, too computationally expensive, or requires too much space to display. We show that learning curve models can be easily fit (Sec. 3) from a few trials (Fig. 2) and compactly summarized (Fig. 7f), leaving our evasive experimenter no excuse. Our experiments serve as examples that learning curve analysis can yield interesting observations. Learning curves can further inform training methodology, continual learning, and representation learning, among other problems, providing a better understanding of contributions and ultimately leading to faster progress in machine learning.

Acknowledgements: This research is supported in part by ONR MURI Award N000141612007. We thank the reviewers for prompting further discussion.

References

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML), 2018.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In Conference on Neural Information Processing Systems (NIPS), 2017.

Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

Cortes, C., Jackel, L. D., Solla, S. A., Vapnik, V., and Denker, J. S. Learning curves: Asymptotic values and rate of convergence. In Advances in Neural Information Processing Systems (NIPS), 1993.

Domingos, P. A unified bias-variance decomposition for zero-one and squared loss. In National Conference on Artificial Intelligence and Conference on Innovative Applications of Artificial Intelligence, 2000.

Falcon, W. PyTorch Lightning. GitHub, https://github.com/PyTorchLightning/pytorch-lightning, 2019.

Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, January 1992.

Gnecco, G. and Sanguineti, M. Approximation error bounds via Rademacher's complexity. Applied Mathematical Sciences, 2(4), 2008.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision (ICCV), 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

He, K., Girshick, R., and Dollár, P. Rethinking ImageNet pre-training. In International Conference on Computer Vision (ICCV), 2019.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, 2017.

Johnson, M. and Nguyen, D. Q. How much data is enough? Predicting how accuracy varies with training data size. Technical report, 2017.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Kornblith, S., Shlens, J., and Le, Q. V. Do better ImageNet models transfer better? In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Krizhevsky, A. Learning multiple layers of features from tiny images. University of Toronto, 2012.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems (NIPS), 2012.

Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28:594–611, April 2006.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations (ICLR), 2020.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR), 2018.

Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., and Shavit, N. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations (ICLR), 2020.

Shorten, C. and Khoshgoftaar, T. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019. doi: 10.1186/s40537-019-0197-0.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In International Conference on Computer Vision (ICCV), 2017.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), 2019.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Publishing Company, 1st edition, 2008. ISBN 0387790519.

Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971. doi: 10.1137/1116025.

Wright, L. New deep learning optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both. GitHub, https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer, 2019.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Yong, H., Huang, J., Hua, X., and Zhang, L. Gradient centralization: A new optimization technique for deep neural networks. In European Conference on Computer Vision (ECCV), 2020.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In British Machine Vision Conference (BMVC), 2016.

Zhai, X., Puigcerver, J., Kolesnikov, A., Ruyssen, P., Riquelme, C., Lucic, M., Djolonga, J., Pinto, A. S., Neumann, M., Dosovitskiy, A., Beyer, L., Bachem, O., Tschannen, M., Michalski, M., Bousquet, O., Gelly, S., and Houlsby, N. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2020.

Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. In Conference on Neural Information Processing Systems (NeurIPS), 2019.

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Zoph, B., Ghiasi, G., Lin, T., Cui, Y., Liu, H., Cubuk, E. D., and Le, Q. Rethinking pre-training and self-training. In Conference on Neural Information Processing Systems (NeurIPS), 2020.