Learning Curves for Analysis of Deep Networks
Derek Hoiem 1, Tanmay Gupta 2, Zhizhong Li 1, Michal M. Shlapentokh-Rothman 1

1 University of Illinois at Urbana-Champaign. 2 PRIOR @ Allen Institute for AI. Correspondence to: Derek Hoiem. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Learning curves model a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to evaluate design choices, such as pretraining, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations. Our experiments exemplify use of learning curves for analysis and yield several interesting observations.

1. Introduction

The performance of a learning system depends strongly on the number of training samples, but standard evaluations use a fixed train/test split. While some works measure performance using subsets of training data, the lack of a systematic way to measure and report performance as a function of training size is a barrier to progress in machine learning research, particularly in areas like representation learning, data augmentation, and low-shot learning that specifically address the limited-data regime. What gets measured gets optimized, so we need better measures of learning ability to design better classifiers for the spectrum of data availability.

In this paper, we establish and demonstrate use of learning curves to improve evaluation of classifiers (see Fig. 1). Learning curves, which model error as a function of training set size, were introduced nearly thirty years ago by Cortes et al. (1993) to accelerate model selection of deep networks, and recent works have demonstrated the predictability of performance improvements with more data (Hestness et al., 2017; Johnson & Nguyen, 2017; Kaplan et al., 2020; Rosenfeld et al., 2020) or more network parameters (Kaplan et al., 2020; Rosenfeld et al., 2020). But such studies aim to extrapolate rather than evaluate and have typically required large-scale experiments that are outside the computational budgets of many research groups.

Experimentally, we find that the extended power law e_test(n) = α + η n^γ yields a well-fitting learning curve, where e_test is test error and n is the number of training samples (or "training size"), but that the parameters {α, η, γ} are individually unstable under measurement perturbations. To facilitate curve comparisons, we abstract the curve into two key parameters, eN and βN, that have intuitive meanings and are more stable under measurement variance. eN is the test error at n = N, and βN is a measure of data-reliance, how much a classifier's error will change if the training size changes. Our experiments show that learning curves provide insights that cannot be obtained by single-point comparisons of performance. Our aim is to promote the use of learning curves as part of a standard learning system evaluation.

Our key contributions:

• Investigate how to best model, estimate, characterize, and display learning curves for use in classifier analysis.

• Exemplify use of learning curves with analysis of the impact on error and data-reliance of network architecture, optimization, depth, width, fine-tuning, data augmentation, and pretraining.

Table 1 shows validated and rejected popular beliefs that single-point comparisons often overlook. In the following sections, we investigate how to model learning curves (Sec. 2), how to estimate them (Sec. 3), and what they can tell us about the impact of design decisions (Sec. 4), with discussion of limitations and future work in Sec. 5.

2. Modeling Learning Curves

The learning curve measures test error e_test as a function of the number of training samples n for a given classification model and learning method. Previous empirical observations suggest a functional form e_test(n) = α + η n^γ, with bias-variance trade-off and generalization theories typically indicating γ = −0.5. We summarize what bias-variance trade-off and generalization theories (Sec. 2.1) and empirical studies (Sec. 2.2) can tell us about learning curves, and describe our proposed abstraction in Sec. 2.3.
[Figure 1 contains four panels: (a) typical classifier evaluation and reporting: error on the full dataset, Model A 27.92 vs. Model B 32.40; (b) the proposed evaluation, which also characterizes the change in performance with changing training size as in (d): Model A (e400 = 27.91, β400 = 18.23, γ = −0.41) vs. Model B (e400 = 32.42, β400 = 5.61, γ = −0.35); (c) model comparison via learning curves, plotting error against n^−0.5, where B is better than A at small training sizes and A is better than B at large ones; (d) characterization of a learning curve via error (eN) and data-reliance (βN) using the local linear approximation at N, with ẽ_N/4 = eN + βN, ẽ_N = eN, ẽ_4N = eN − 0.5βN, and ẽ_∞ = eN − βN, where βN = −2ηγN^γ.]

Figure 1: Evaluation with learning curves vs single-point comparison: Comparing error of models trained on the full dataset, as shown in (a), is standard practice but provides an incomplete view of a classifier's performance. We propose a methodology to estimate and visualize learning curves that model a classifier's performance with varying amounts of training data (b,c). We also propose a succinct summary of a model's curve in terms of error and data-reliance (b,d).

Popular belief | Supported? | Exp. figures
Pre-training on similar domains nearly always helps compared to training from scratch. | Y | 5a, 5b, 6
Pre-training, even on similar domains, introduces bias that would harm performance with a large enough training set. | U | 6
Self-/un-supervised training performs better than supervised pre-training for small datasets. | N | 6
Fine-tuning the entire network (vs. just the classification layer) is only helpful if the training set is large. | N | 5a, 5b
Increasing network depth, when fine-tuning, harms performance for small training sets, due to an overly complex model. | N | 7a
Increasing network depth, when fine-tuning, is more helpful for larger training sets than smaller ones. | N | 7a
Increasing network depth, if the backbone is frozen, is more helpful for smaller training sets than larger ones. | N | 7d
Increasing depth or width improves more than ensembles of smaller networks with the same number of parameters. | Y | 7f
Data augmentation is roughly equivalent to using an m-times larger training set for some m. | Y | 8

Table 1: Deep learning quiz! We encourage our readers to judge each claim as T (true) or F (false) and then see if our experimental results support the claim (Yes/No/Unsure). In many cases, particularly regarding fine-tuning and network depth, the results surprised the authors. Our experiments show that learning curve analysis provides a systematic way to investigate these suppositions and others.

2.1. Bias-variance Trade-off and Generalization Theory

The bias-variance trade-off is an intuitive and theoretically sound way to think about generalization. The "bias" is error due to the inability of the classifier to encode the optimal decision function, and the "variance" is error due to limited availability of training samples for parameter estimation. This is called a trade-off because a classifier with more parameters tends to have less bias but higher variance. Geman et al. (1992) decompose mean squared regression error into bias and variance and explore the implications for neural networks, leading to the conclusion that "identifying the right preconditions is the substantial problem in neural modeling". This conclusion foreshadows the importance of pretraining, though Geman et al. thought the preconditions must be built in rather than learned. Domingos (2000) extends the analysis to classification. Theoretically, the mean squared error (MSE) can be modeled as e²_test(n) = bias² + noise² + var(n), where "noise" is irreducible error due to a non-unique mapping from inputs to labels, and variance can be modeled as var(n) = σ²/n for n training samples.

The η n^γ term in e_test(n) appears throughout machine learning generalization theory, usually with γ = −0.5. For example, the bounds based on hypothesis VC-dimension (Vapnik & Chervonenkis, 1971) and Rademacher complexity (Gnecco & Sanguineti, 2008) are both O(c n^−0.5), where c depends on the complexity of the classification model. More recent work also follows this form, e.g. (Neyshabur et al., 2018; Bartlett et al., 2017; Arora et al., 2018; Bousquet & Elisseeff, 2002). One caveat is that the exponent γ can deviate from −0.5 if the classifier parameters depend directly on n. For example, Tsybakov (2008) shows that setting the bandwidth of a kernel density estimator based on n causes the dominant term to be bias, rather than variance, changing γ in the bound. In our experiments, all training and model parameters are fixed for each curve, except the learning schedule, but we note that the learning curve depends on optimization methods and hyperparameters as well as the classifier model.

2.2. Empirical Studies

Some recent empirical studies (e.g. Sun et al. (2017)) claim a log-linear relationship between error and training size, but this holds only when asymptotic error is zero. Hestness et al. (2017) model error as e_test(n) = α + η n^γ but often find γ much smaller in magnitude than −0.5 and suggest that poor fits indicate a need for better hyperparameter tuning. This empirically supports that data sensitivity depends both on the classification model and on the efficacy of the optimization algorithm and parameters. Johnson & Nguyen (2017) also find a better fit with this extended power law model than by restricting γ = −0.5 or α = 0.

In the language domain, learning curves are used in a fascinating study by Kaplan et al. (2020). For natural language transformers, they show that a power law relationship between logistic loss, model size, compute time, and dataset size is maintained if, and only if, each is increased in tandem. We draw some similar conclusions to their study, such as that increasing model size tends to improve performance especially for small training sets (which surprised us). However, the studies are largely complementary, as we study convolutional nets in computer vision, classification error instead of logistic loss, and a broader range of design choices such as data augmentation, pretraining source, architecture, and optimization. Also related, Rosenfeld et al. (2020) model error as a function of both training size and number of model parameters with a five-parameter function that accounts for training size, model parameter size, and chance performance. A key difference in our work is that we focus on how to best draw insights about design choices from learning curves, rather than on extrapolation. As such, we propose methods to estimate learning curves and their variance from a relatively small number of trained models.

2.3. Proposed Characterization of Learning Curves for Evaluation

Our experiments in Sec. 4.1 show that the learning curve model e(n) = α + η n^γ results in excellent leave-one-size-out RMS error and extrapolation. However, α, η, and γ cannot be meaningfully compared across curves because the parameters have high covariance under small data perturbations, and comparing η values is not meaningful unless γ is fixed and vice-versa. This would prevent tabular comparisons and makes it harder to draw quantitative conclusions.

To overcome this problem, we propose to report error and sensitivity to training size in a way that can be derived from various learning curve models and is insensitive to data perturbations. The curve is characterized by error eN = α + ηN^γ and data-reliance βN, and we typically choose N as the full dataset size. Noting that most learning curves are locally well approximated by a model linear in n^−0.5, we compute data-reliance as βN = N^−0.5 · ∂e/∂n^−0.5 |_(n=N) = −2ηγN^γ. When the error is plotted against n^−0.5, βN is the slope at N scaled by N^−0.5, with the scaling chosen to make the practical implications of βN more intuitive. This yields a simple predictor for error when changing training size by a factor of d:

    ẽ(d·N) = eN + (1/√d − 1) βN.    (1)

This is a first order Taylor expansion of e(n) around n = N with respect to n^−0.5. By this linearized estimate, asymptotic error is eN − βN, a 4-fold increase in data (e.g. 400 → 1600) reduces error by 0.5βN, and using only one quarter of the dataset (e.g. 400 → 100) increases the error by βN. For two models with similar eN, the one with a larger βN would outperform with more data but underperform with less. (eN, βN, γ) is a complete re-parameterization of the extended power law, with γ + 0.5 indicating the curvature in the n^−0.5 scale. See Fig. 1d for illustration.
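To make the (eN, βN) abstraction concrete, the following minimal sketch (ours, not from the paper; the function names and the toy parameter values α = 10, η = 200, γ = −0.5, N = 400 are illustrative assumptions) computes eN and βN from a fitted extended power law and applies the linearized predictor of Eq. (1).

```python
import numpy as np

def characterize(alpha, eta, gamma, N):
    """Summarize an extended power law e(n) = alpha + eta * n**gamma
    at reference size N as (e_N, beta_N), following Sec. 2.3."""
    e_N = alpha + eta * N ** gamma
    beta_N = -2.0 * eta * gamma * N ** gamma   # beta_N = -2 * eta * gamma * N^gamma
    return e_N, beta_N

def predict_scaled(e_N, beta_N, d):
    """Linearized error estimate of Eq. (1) when training size changes from N to d*N."""
    return e_N + (1.0 / np.sqrt(d) - 1.0) * beta_N

# Illustrative (made-up) curve: alpha = 10, eta = 200, gamma = -0.5, N = 400.
e_N, beta_N = characterize(alpha=10.0, eta=200.0, gamma=-0.5, N=400)
print(e_N, beta_N)                        # 20.0 and 10.0
print(predict_scaled(e_N, beta_N, 4))     # 4x data: e_N - 0.5*beta_N = 15.0
print(predict_scaled(e_N, beta_N, 0.25))  # 1/4 data: e_N + beta_N = 30.0
```

With these two summary numbers, curves fit under different γ can still be compared at a common training size without refitting, which is the point of the abstraction.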
3. Estimating Learning Curves

We now describe the method for estimating the learning curve from error measurements, with confidence bounds on the estimate. Let eij denote the random variable corresponding to test error when the model is trained on the j-th fold of ni samples (either per class or in total). We assume {eij}, j = 1..Fi, are i.i.d. according to N(µi, σi²). We want to estimate learning curve parameters α (asymptotic error), η, and γ, such that eij = α + η ni^γ + εij, where εij ~ N(0, σi²) and µij = E[eij] = µi. Sections 3.1 and 3.2 describe how to estimate the mean and variance of α and η for a given γ, and Sec. 3.3 describes our approach for estimating γ.

3.1. Weighted Least Squares Formulation

We estimate learning curve parameters {α, η} by optimizing a weighted least squares objective:

    G(γ) = min_{α,η} Σ_{i=1..S} Σ_{j=1..Fi} wij (eij − α − η ni^γ)²    (2)

where wij = 1/(Fi σi²). Fi is the number of models trained with data size ni and is used to normalize the weight so that the total weight for observations from each training size does not depend on Fi. The factor of σi² accounts for the variance of εij. Assuming constant σi² and removing the Fi factor would yield unweighted least squares.

The variance of the estimate of σi² from Fi samples is 2σi⁴/Fi, which can lead to over- or under-weighting data for particular i if Fi is small. Recall that each sample eij requires training an entire model, so Fi is always small in our experiments. We would expect the variance to have the form σi² = σ0² + σ̂²/ni, where σ0² is the variance due to random initialization and optimization and σ̂²/ni is the variance due to randomness in selecting ni samples. We validated this variance model by averaging over the variance estimates for many different network models on the CIFAR-100 (Krizhevsky, 2012) dataset. We use σ0² = 0.02 in all experiments and then use a least squares fit to estimate a single σ̂² parameter from all samples e in a given learning curve. This attention to wij may seem fussy, but without such care we find that the learning curve often fails to account sufficiently for all the data in some cases.
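As an illustration of Sec. 3.1 (a minimal sketch under stated assumptions, not the authors' released code: the fold errors and the σ̂² value are made up, and σ0² = 0.02 follows the text), the snippet below builds the weights wij = 1/(Fi σi²) from the variance model σi² = σ0² + σ̂²/ni and solves the inner weighted least squares of Eq. (2) for {α, η} at a fixed γ.

```python
import numpy as np

# Toy measurements: errors (in %) from F_i folds at each training size n_i.
# These numbers are illustrative, not results from the paper.
errors = {25: [52.1, 53.0, 51.4, 52.6],   # n_i -> fold errors e_ij
          50: [44.8, 45.5, 44.1],
          100: [38.9, 39.4],
          200: [33.2, 33.8],
          400: [29.1]}

sigma0_sq = 0.02      # variance from random initialization/optimization (fixed in the paper)
sigma_hat_sq = 40.0   # assumed value here; the paper fits a single sigma_hat^2 per curve

def fit_alpha_eta(errors, gamma):
    """Solve the inner weighted least squares of Eq. (2) for (alpha, eta) at fixed gamma."""
    n, e, w = [], [], []
    for n_i, e_list in errors.items():
        F_i = len(e_list)
        var_i = sigma0_sq + sigma_hat_sq / n_i        # sigma_i^2 = sigma_0^2 + sigma_hat^2 / n_i
        for e_ij in e_list:
            n.append(float(n_i)); e.append(e_ij); w.append(1.0 / (F_i * var_i))
    n, e, w = np.array(n), np.array(e), np.array(w)
    A = np.stack([np.ones_like(n), n ** gamma], axis=1)  # columns: [1, n^gamma]
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * e, rcond=None)
    G = np.sum((sw * (e - A @ theta)) ** 2)              # weighted residual, i.e. G(gamma)
    return theta[0], theta[1], G

alpha, eta, G = fit_alpha_eta(errors, gamma=-0.5)
print(f"alpha = {alpha:.2f}, eta = {eta:.2f}, G(-0.5) = {G:.3f}")
```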
3.2. Solving for Learning Curve Mean and Variance

Concatenating errors across dataset sizes (indexed by i) and folds results in an error vector e of dimension D = Σ_{i=1..S} Fi. For each d ∈ {1, ..., D}, e[d] is an observation of error at dataset size n_id that follows N(µ_id, σ²_id), with id mapping d to the corresponding i.

The weighted least squares problem can be formulated as solving a system of linear equations W^(1/2) e = W^(1/2) A θ, where W ∈ R^(D×D) is a diagonal matrix of weights with W_dd = w_d, A ∈ R^(D×2) is a matrix with A[d, :] = [1  n_d^γ], and θ = [α  η]^T are the parameters of the learning curve, treating γ as fixed for now. The estimator for the learning curve is then given by θ̂ = (W^(1/2) A)⁺ W^(1/2) e = M e, where M ∈ R^(2×D) and ⁺ is the pseudo-inverse operator.

The covariance of the estimator is given by Σθ̂ = M Σe M^T, where Σθ̂ ∈ R^(2×2) and Σe ∈ R^(D×D) is the diagonal covariance of e with Σe[d, d] = σ²_id. We compute our empirical estimate of σi² as described in Sec. 3.1.

Since the estimated curve is given by ê(n) = [1  n^γ] θ̂, the 95% bounds at any n can be computed as ê(n) ± 1.96 · σ̂(n) with

    σ̂²(n) = [1  n^γ] Σθ̂ [1  n^γ]^T.    (3)

For a given γ, these confidence bounds reflect the variance in the estimated curve due to variance in error measurements.
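The closed-form estimator, its covariance, and the bounds of Eq. (3) could be sketched as follows (again illustrative, with made-up observations; the pseudo-inverse route mirrors the equations above but is not the authors' implementation).

```python
import numpy as np

# Illustrative observations: training size n_d, error e_d (in %), and per-size
# variance sigma_d^2 from the model of Sec. 3.1 (all numbers are made up).
n = np.array([25, 25, 25, 25, 50, 50, 50, 100, 100, 200, 200, 400], dtype=float)
e = np.array([52.1, 53.0, 51.4, 52.6, 44.8, 45.5, 44.1, 38.9, 39.4, 33.2, 33.8, 29.1])
var = 0.02 + 40.0 / n                               # sigma_d^2 = sigma_0^2 + sigma_hat^2 / n_d
folds = np.array([np.sum(n == x) for x in n])       # F_i for each observation's training size
w = 1.0 / (folds * var)                             # diagonal of W

gamma = -0.5
A = np.stack([np.ones_like(n), n ** gamma], axis=1)         # A[d, :] = [1, n_d^gamma]
M = np.linalg.pinv(np.sqrt(w)[:, None] * A) * np.sqrt(w)    # M such that theta_hat = M e
theta = M @ e                                               # [alpha_hat, eta_hat]
cov_theta = M @ np.diag(var) @ M.T                          # Sigma_theta = M Sigma_e M^T

def curve_with_bounds(n_query):
    """Estimated error and 95% confidence bounds at training size n_query (Eq. 3)."""
    a = np.array([1.0, n_query ** gamma])
    mean = a @ theta
    std = np.sqrt(a @ cov_theta @ a)
    return mean, mean - 1.96 * std, mean + 1.96 * std

print("alpha, eta:", theta)
print("e(1600) with 95% bounds:", curve_with_bounds(1600))
```

Because the weights are fixed before solving, the estimate is linear in the observed errors, which is what makes the covariance propagation a single matrix product.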
3.3. Estimating γ

We search for the γ that minimizes the weighted least squares objective with an L1 prior that slightly encourages values close to −0.5. Specifically, we solve

    min_{γ ∈ (−1, 0)} G(γ) + λ|γ + 0.5|    (4)

by searching over γ ∈ {−0.99, ..., −0.01} with λ = 5 (with error on a 100-point scale) for our experiments.
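The search of Eq. (4) reduces to a one-dimensional grid scan. A sketch, reusing the same toy observations and inlining a small version of the Eq. (2) solver (λ = 5 as in the text; everything else is assumed for illustration):

```python
import numpy as np

# Toy observations (same illustrative numbers as the earlier sketches).
n = np.array([25]*4 + [50]*3 + [100]*2 + [200]*2 + [400], dtype=float)
e = np.array([52.1, 53.0, 51.4, 52.6, 44.8, 45.5, 44.1, 38.9, 39.4, 33.2, 33.8, 29.1])
w = 1.0 / (np.array([np.sum(n == x) for x in n]) * (0.02 + 40.0 / n))

def G(gamma):
    """Weighted least squares residual of Eq. (2) at a fixed gamma."""
    A = np.stack([np.ones_like(n), n ** gamma], axis=1)
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * e, rcond=None)
    return np.sum((sw * (e - A @ theta)) ** 2)

lam = 5.0                             # lambda, with error on a 100-point scale
grid = np.linspace(-0.99, -0.01, 99)  # gamma in {-0.99, ..., -0.01}
best_gamma = min(grid, key=lambda g: G(g) + lam * abs(g + 0.5))  # objective of Eq. (4)
print(f"selected gamma = {best_gamma:.2f}")
```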
4. Experiments

We validate our choice of learning curve model and estimation method in Sec. 4.1 and use the learning curves to explore the impact of design decisions on error and data-reliance in Sec. 4.2.

Setup: Each learning curve is fit to test errors measured after training the classifier on various subsets of the training data. When less than the full training set is used, we train multiple classifiers using different partitions of the data (e.g. four classifiers on four quarters of the data). We avoid use of benchmark test sets, since the curves are used for analysis and ablation. Instead, training data are split 80/20, with 80% of the data used for training and validation and the remaining 20% for testing. For each curve, all models are trained using an initial learning rate that is selected using one subset of training data, and the learning rate schedule is set for each n based on validation. Unless otherwise noted, models are pretrained on ImageNet. Other hyperparameters are fixed for all experiments. Unless otherwise noted, we use the Ranger optimizer (Wright, 2019), which combines Rectified Adam (Liu et al., 2020), Look Ahead (Zhang et al., 2019), and Gradient Centralization (Yong et al., 2020), as preliminary experiments showed its effectiveness. For "linear", we train only the final classification layer with the other weights frozen to initialized values. All weights are trained when "fine-tuning". Tests are on Cifar100 (Krizhevsky, 2012), Cifar10, Places365 (Zhou et al., 2017), or Caltech-101 (Fei-Fei et al., 2006). See supplemental materials (Appendix A) for more implementation details.

4.1. Evaluation of Learning Curves Model and Fitting

We validate our learning curve model using leave-one-size-out prediction error, e.g. predicting empirical mean performance with 400 samples per class based on observing error from models trained on 25, 50, 100, and 200 samples.

Weighting Schemes. In the Fig. 2 table, learning curve prediction error is averaged for 16 Cifar100 classifiers with varying design decisions. We compare three weighting schemes (the w's in Eq. 2): wij = 1 is unweighted; wij = 1/σi² weights by the estimated size-dependent variance; wij = 1/(Fi σi²) makes the total weight for a given dataset size invariant to the number of folds. On average our proposed weighting performs best, with high significance compared to unweighted. The p-value is a paired t-test of difference of means calculated across all dataset sizes.

[Figure 2 (table): leave-one-size-out prediction RMSE and R², averaged over 16 Cifar100 classifiers, for different learning curve parameterizations and weighting schemes.]

Params | Weights | R² | RMSE (25) | RMSE (50) | RMSE (100) | RMSE (200) | RMSE (400) | avg RMSE | p-value
α, η, γ | 1/(σi² Fi) | 0.998 | 2.40 | 0.86 | 0.54 | 0.57 | 0.85 | 1.04 | -
α, η, γ | 1 | 0.999 | 2.38 | 0.83 | 0.69 | 0.54 | 1.08 | 1.10 | 0.06
α, η, γ | 1/σi² | 0.998 | 2.66 | 0.86 | 0.79 | 0.50 | 1.26 | 1.21 | 0.008
α, η | 1/(σi² Fi) | 0.988 | 3.41 | 1.09 | 0.69 | 0.72 | 1.21 | 1.42 |

Model Choice. We consider other parameterizations that are special cases of e(n) = α + η n^γ + δ n^(2γ). Setting δ = 0 (top row of the table) yields the model described in Sec. 2.3.
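A leave-one-size-out check of the kind described above could be sketched as follows (toy fold errors, γ fixed at −0.5 for brevity, and an assumed σ̂²; this is not the paper's evaluation code).

```python
import numpy as np

# Toy fold errors per training size (illustrative numbers only).
errors = {25: [52.1, 53.0, 51.4, 52.6], 50: [44.8, 45.5, 44.1],
          100: [38.9, 39.4], 200: [33.2, 33.8], 400: [29.1]}

def fit(errs, gamma=-0.5):
    """Weighted least squares fit of (alpha, eta), as in Secs. 3.1-3.2."""
    n, e, F = [], [], []
    for ni, es in errs.items():
        for eij in es:
            n.append(float(ni)); e.append(eij); F.append(len(es))
    n, e, F = np.array(n), np.array(e), np.array(F)
    w = 1.0 / (F * (0.02 + 40.0 / n))
    A = np.stack([np.ones_like(n), n ** gamma], axis=1)
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * e, rcond=None)
    return theta  # alpha, eta

held_out = 400  # leave one training size out and predict its mean error
alpha, eta = fit({ni: es for ni, es in errors.items() if ni != held_out})
pred = alpha + eta * held_out ** -0.5
observed = np.mean(errors[held_out])
print(f"predicted {pred:.2f} vs. observed {observed:.2f}")
```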
[Figure 4 plots error against n^−0.5 for different optimizers, with fitted (e4000; β4000; γ). Without pretraining (left): Adam (5.51; 9.89; −0.43); Ranger (5.08; 7.73; −0.5); SGD (6.03; 8.5; −0.5); SGD w/o momentum (7.3; 12.66; −0.32). With ImageNet pretraining (right): Adam (2.86; 2.82; −0.47); Ranger (2.98; 2.97; −0.4); SGD (2.76; 2.85; −0.38); SGD w/o momentum (2.64; 3.08; −0.36).]

Figure 4: Optimization on Cifar10 with ResNet-18. Without pretraining on left; with ImageNet pretraining on right.

[Figure 5 plots error against n^−0.5 for ResNet-18 with and without ImageNet pretraining, training either the whole network ("Finetune") or only the classification layer ("Linear"), with fitted (e400; β400; γ). (a) Transfer: ImageNet to Cifar100: No Pretr, Linear (79.29; 1.32; −0.84); No Pretr, Finetune (27.91; 18.23; −0.41); Pretr, Linear (32.42; 5.61; −0.35); Pretr, Finetune (18.86; 7.28; −0.57). (b) Transfer: ImageNet to Places365: No Pretr, Linear (92.79; 0.96; −0.5); No Pretr, Finetune (57.89; 12.86; −0.26); Pretr, Linear (59.91; 4.16; −0.38); Pretr, Finetune (54.0; 7.33; −0.28).]

Figure 5: Pretraining and fine-tuning with ResNet-18.

Pretraining and fine-tuning: Some prior works examine the effectiveness of pretraining. Kornblith et al. (2019) show that fine-tuned pretrained models outperform randomly initialized models across many architectures and datasets, but the gap is often small and narrows as training size grows. For object detection, He et al. (2019) find that, with long learning schedules, randomly initialized networks approach the performance of networks pretrained for classification, even on smaller training sizes, and Zoph et al. (2020) show that pretraining can sometimes harm performance when strong data augmentation is used.

Compared to prior works, our use of learning curve models enables extrapolation beyond available data and numerical comparison of data reliance. In Fig. 5 we see that, without fine-tuning ("linear"), pretraining leads to a huge improvement in e400 for all training sizes. When fine-tuning, the pretraining greatly reduces data-reliance β400 and also reduces e400, providing strong advantages with smaller training sizes that may disappear with enough data.

Pretraining data sources: In Fig. 6, we test on Cifar100 and Caltech-101 to compare different pretrained models: randomly initialized, supervised on ImageNet or Places365 (Zhou et al., 2017), and self-supervised on ImageNet (MOCO by He et al. (2020)). The strongest impact is on data-reliance, which leads to consistent orderings of models across training sizes for both datasets. Supervised ImageNet pretraining has the lowest eN and βN, then self-supervised MOCO, then supervised Places365, with random initialization trailing far behind all pretrained models. The original papers excluded either analysis of training size (MOCO) or fine-tuning (Places365), so ours is the first analysis on image benchmarks to compare finetuned models from different pretraining sources as a function of training size. Newly proposed methods for representation learning would benefit from further learning curve analysis.

[Figure 6 plots error against n^−0.5 for ResNet-50 with different pretraining sources. Cifar100 (left), with (eN; βN; γ): No Pretr (27.66; 20.93; −0.35); ImageNet Pretr (18.33; 4.32; −0.67); ImageNet MOCO Pretr (18.74; 11.21; −0.3); Places Pretr (20.49; 12.53; −0.24). Caltech-101 (right), with (e57; β57; γ): No Pretr (21.63; 22.92; −0.5); ImageNet Pretr (5.91; 6.25; −0.5); ImageNet MOCO Pretr (8.39; 10.46; −0.5); Places Pretr (11.27; 12.86; −0.5).]

Figure 6: Pretraining sources: test on Cifar100 (left) and Caltech-101 (right).

Network depth, width, and ensembles: The classical view is that smaller datasets need simpler models to avoid overfitting. In Figs. 7a and 7d, we show that not only do deeper networks have better potential at higher data sizes, their data reliance does not increase (the curves are nearly parallel, and data-reliance even drops a little for fine-tuning), making deeper networks perfectly suitable for smaller datasets. For linear classifiers (Fig. 7d), the deeper networks provide better features, leading to a consistent drop in e400. The small jump in data reliance between Resnet-34 and Resnet-50 may be due to the increase in the last-layer input size from 512 to 2048 nodes.
When increasing width, the fine-tuned networks (Fig. 7b) have reduced e400 without much change to data-reliance. With linear classifiers (Fig. 7e), increasing the width leads to little change or even an increase in e400, with a slight decrease in data-reliance.

An alternative to using a deeper or wider network is forming ensembles. Figure 7c shows that, while an ensemble of six ResNet-18's (each 11.7M parameters) improves over a single model, it has higher e400 and data-reliance than ResNet-101 (44.5M), Wide-ResNet-50 (68.9M), and Wide-ResNet-101 (126.9M). Three ResNet-50's (each 25.6M) underperform Wide-ResNet-50 on e400 but outperform for small amounts of data due to lower data reliance. Fig. 7f tabulates data reliance and error to simplify comparison.

Rosenfeld et al. (2020) show that error can be modeled as a function of either training size, model size, or both. Modeling both jointly can provide additional capabilities such as selecting model size based on data size, but requires many more experiments to fit the curve. Our experiments show more clearly the effect on data reliance due to different ways of changing model size.

[Figure 7 plots error against n^−0.5 on Cifar100, with (e400; β400; γ). (a) Depth, finetune: Resnet-18 (18.86; 7.28; −0.57); Resnet-34 (18.75; 4.36; −0.73); Resnet-50 (18.33; 4.32; −0.67); Resnet-101 (14.97; 5.03; −0.62). (b) Width, finetune: Resnet-50 (18.33; 4.32; −0.67); 2xWide-Resnet-50 (14.04; 6.0; −0.57); Resnet-101 (14.97; 5.03; −0.62); 2xWide-Resnet-101 (13.37; 5.81; −0.5). (c) Ensemble, finetune: 6xResnet-18 (15.96; 7.09; −0.49); 1xResnet-18 (18.86; 7.28; −0.57); 3xResnet-50 (15.96; 4.15; −0.64); 1xResnet-50 (18.33; 4.32; −0.67). (d) Depth, linear: Resnet-18 (32.42; 5.61; −0.35); Resnet-34 (30.02; 5.44; −0.33); Resnet-50 (28.63; 6.66; −0.22); Resnet-101 (25.86; 6.17; −0.21). (e) Width, linear: Resnet-50 (28.63; 6.66; −0.22); 2xWide-Resnet-50 (28.77; 4.89; −0.35); Resnet-101 (25.86; 6.17; −0.21); 2xWide-Resnet-101 (27.76; 5.41; −0.25). (f) Ensemble vs. depth vs. width, finetune:]

Model | #params | e400 | β400
Resnet 18 | 11.7M | 18.86 | 7.28
Resnet 50 | 25.6M | 18.33 | 4.32
Resnet 101 | 44.5M | 14.97 | 5.03
2x Wide Resnet 50 | 68.9M | 14.04 | 6.00
2x Wide Resnet 101 | 126.9M | 13.37 | 5.81
6x Ensemble Resnet 18 | 70.2M | 15.96 | 7.09
3x Ensemble Resnet 50 | 76.8M | 15.96 | 4.15

Figure 7: Depth, width, and ensembles on Cifar100. The table (f) compactly compares several curves.

Data Augmentation: We are not aware of previous studies on the interaction between data augmentation and training size. For example, a large survey (Shorten & Khoshgoftaar, 2019) compares different augmentation methods only at the full training size. One may expect that data augmentation acts as a regularizer with reduced effect for large training sizes, or even possibly a negative effect due to introducing bias. However, Fig. 8 shows that data augmentation on Places365 reduces error for all training sizes with little or no change to data-reliance when fine-tuning. e(n) with augmentation roughly equals e(1.8n) without it, supporting the view that augmentation acts as a multiplier on the value of an example. For the linear classifier, data augmentation has little apparent effect due to low data-reliance, but the results are still consistent with this multiplier.

[Figure 8 plots error against n^−0.5 for ResNet-18 on Places365 with and without data augmentation, with (e400; β400; γ): No Pretr, Finetune, w/ Aug (58.01; 12.44; −0.28); No Pretr, Finetune, w/o Aug (63.73; 13.01; −0.24); Pretr, Finetune, w/ Aug (53.88; 7.99; −0.22); Pretr, Finetune, w/o Aug (56.68; 7.94; −0.23); Pretr, Linear, w/ Aug (60.01; 3.83; −0.43); Pretr, Linear, w/o Aug (60.48; 4.52; −0.36).]

Figure 8: Data augmentation on Places365.

5. Discussion

Evaluation methodology is the foundation of research, impacting how we choose problems and rank solutions. Large train and test sets now serve as the fuel and crucible to refine machine learning methods. We now discuss the need for learning curves, limitations of our experiments, and directions for future work.

Case for learning curve models. Compared to individual runs, learning curves often provide more accurate or confident conclusions, which is crucial, as any misleading or partial conclusions can block progress. For example, in Fig. 5b, comparison at N = 1600 may indicate that pretraining has little impact on error when fine-tuning, but the curves show large differences in data-reliance, leading to a large gap when less data is available. In Fig. 6 (left), comparison at N = 400 may indicate that MOCO and supervised ImageNet pretraining are equally effective, but the curves show that the supervised ImageNet pretrained classifier has lower data-reliance (β = 4.32 vs. β = 11.21) and thus outperforms with less data. Generally, individual runs cannot reveal the sample efficiency of a learner. One curve may dominate another because it has less bias (leading to uniformly lower error, as with "linear" classifiers of increasing depth in Fig. 7) or less variance (leading to different curve slopes, as with different pretrained models in Fig. 6), and distinguishing the two helps us understand and improve.

Compared to a piecewise affine fit (i.e., plotting a line through observed error points), our modeling and characterization in terms of error@N and data-reliance has important advantages. First, our two-parameter characterization can be tabulated (Fig. 7f), facilitating comparison within and across papers. We can learn from object detection research, where the practice of reporting PR/ROC curves gave way to reporting AP, making it easier to average and compare within and across works when working with multiple datasets. Second, a piecewise affine fit is less stable under measurement error, leading to significantly poorer extrapolations. Finally, the parametric model helps to identify poor hyperparameter selection, indicated by poor model fit or extreme values of γ.
Cause and impact of γ: We speculate that γ is largely determined by hyperparameters and optimization rather than model design. This presents an opportunity to identify poor training regimes and improve them. Intuitively, one would expect that more negative γ values are better (i.e. γ = −1 preferable to γ = −0.5), since the error is O(n^γ), but we find that high-magnitude γ tends to come with high asymptotic error, indicating that the efficiency comes at the cost of over-commitment to initial conditions. We speculate (but with some disagreement among the authors) that γ ≈ −0.5 is an indication of a well-trained curve and will generally outperform curves with higher or lower γ, given the same classification model.

Small training sets: Error is bounded, and classifier performance with small training sets may be modeled as transitioning from random guessing to informed prediction, as shown by Rosenfeld et al. (2020). For simplicity, we do not model performance with very small training sizes, but studying the small-data regime could be interesting, particularly to determine whether design decisions have an impact at small sizes that is not apparent at larger sizes.

Losses and Prediction types: We analyze multiclass classification error, but the same analysis could likely be extended to other prediction types. For example, Kaplan et al. (2020) analyze learning curves of cross-entropy loss for language model transformers. Learning curve models of cross-entropy loss could also be used for problems like object detection or semantic segmentation that typically use more complex aggregate evaluations.

More design parameters and interactions: The interaction between data scale, model scale, and performance is well explored by Kaplan et al. (2020) and Rosenfeld et al. (2020), but it could also be interesting to explore interactions, e.g. between the class of architecture (e.g. VGG, ResNet, EfficientNet (Tan & Le, 2019)) and other design parameters, to see the impact of ideas such as skip-connections, residual layers, and bottlenecks. More extensive evaluation of data augmentation, representation learning, optimization, and regularization would also be interesting.
Unbalanced class distributions: In most of our experiments, we use an equal number of samples per class. Our experiments on Caltech-101 in Fig. 6 (right) and on Pets and Sun397 (Fig. 10 in the supplemental) use the original imbalanced distributions, demonstrating that the same learning curve model applies, but further experiments are needed to examine the impact of class imbalance.

Limits to extrapolation: Our experiments indicate good extrapolation up to 4x the observed training size. However, we caution against drawing conclusions about asymptotic error, since estimation of α is highly sensitive to data perturbations. eN − βN provides a more stable indicator of large-sample performance.

Note: The supplemental material contains implementation details (Appendix A); a user guide to fitting, displaying, and using learning curves (Appendix B); experiments on additional datasets (Appendix C); and a table of learning curve parameters for all experiments, also comparing eN and βN produced by two learning curve models (Appendix D). Code is currently available at prior.allenai.org/projects/lcurve.

6. Conclusion

Performance depends strongly on training size, so size-variant analysis more fully shows the impact of innovations. A reluctant researcher may protest that such analysis is too complicated, too computationally expensive, or requires too much space to display. We show that learning curve models can be easily fit (Sec. 3) from a few trials (Fig. 2) and compactly summarized (Fig. 7f), leaving our evasive experimenter no excuse. Our experiments serve as examples that learning curve analysis can yield interesting observations. Learning curves can further inform training methodology, continual learning, and representation learning, among other problems, providing a better understanding of contributions and ultimately leading to faster progress in machine learning.

Acknowledgements: This research is supported in part by ONR MURI Award N000141612007. We thank the reviewers for prompting further discussion.

References

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML), 2018.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In Conference on Neural Information Processing Systems (NIPS), 2017.

Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

Cortes, C., Jackel, L. D., Solla, S. A., Vapnik, V., and Denker, J. S. Learning curves: Asymptotic values and rate of convergence. In Advances in Neural Information Processing Systems (NIPS), 1993.

Domingos, P. A unified bias-variance decomposition for zero-one and squared loss. In National Conference on Artificial Intelligence and Conference on Innovative Applications of Artificial Intelligence, 2000.

Falcon, W. PyTorch Lightning. GitHub, https://github.com/PyTorchLightning/pytorch-lightning, 2019.

Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, January 1992.

Gnecco, G. and Sanguineti, M. Approximation error bounds via Rademacher's complexity. Applied Mathematical Sciences, 2(4), 2008.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision (ICCV), 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

He, K., Girshick, R., and Dollár, P. Rethinking ImageNet pre-training. In International Conference on Computer Vision (ICCV), 2019.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, 2017.

Johnson, M. and Nguyen, D. Q. How much data is enough? Predicting how accuracy varies with training data size. Technical report, 2017.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Kornblith, S., Shlens, J., and Le, Q. V. Do better ImageNet models transfer better? In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Krizhevsky, A. Learning multiple layers of features from tiny images. University of Toronto, 2012.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems (NIPS), 2012.

Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28:594–611, April 2006.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations (ICLR), 2020.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR), 2018.

Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., and Shavit, N. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations (ICLR), 2020.

Shorten, C. and Khoshgoftaar, T. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019. doi: 10.1186/s40537-019-0197-0.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In International Conference on Computer Vision (ICCV), 2017.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), 2019.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Publishing Company, 1st edition, 2008. ISBN 0387790519.

Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971. doi: 10.1137/1116025.

Wright, L. New deep learning optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both. GitHub, https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer, 2019.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Yong, H., Huang, J., Hua, X., and Zhang, L. Gradient centralization: A new optimization technique for deep neural networks. In European Conference on Computer Vision (ECCV), 2020.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In British Machine Vision Conference (BMVC), 2016.

Zhai, X., Puigcerver, J., Kolesnikov, A., Ruyssen, P., Riquelme, C., Lucic, M., Djolonga, J., Pinto, A. S., Neumann, M., Dosovitskiy, A., Beyer, L., Bachem, O., Tschannen, M., Michalski, M., Bousquet, O., Gelly, S., and Houlsby, N. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2020.

Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. In Conference on Neural Information Processing Systems (NeurIPS), 2019.

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Zoph, B., Ghiasi, G., Lin, T., Cui, Y., Liu, H., Cubuk, E. D., and Le, Q. Rethinking pre-training and self-training. In Conference on Neural Information Processing Systems (NeurIPS), 2020.