Mathematics of Data: From Theory to Computation
Prof. Volkan Cevher (volkan.cevher@epfl.ch)
Lecture 8: Double descent curves and overparametrization
Laboratory for Information and Inference Systems (LIONS)
École Polytechnique Fédérale de Lausanne (EPFL)
EE-556 (Fall 2021)
License Information for Mathematics of Data Slides
This work is released under a Creative Commons License with the following terms:
▸ Attribution: the licensor permits others to copy, distribute, display, and perform the work. In return, licensees must give the original authors credit.
▸ Non-Commercial: the licensor permits others to copy, distribute, display, and perform the work. In return, licensees may not use the work for commercial purposes unless they get the licensor's permission.
▸ Share Alike: the licensor permits others to distribute derivative works only under a license identical to the one that governs the licensor's work.
▸ Full Text of the License
Outline
This lecture:
▸ The classical trade-off between model complexity and risk
▸ The generalization mystery in deep learning
▸ Implicit regularization of optimization algorithms
▸ Double descent curves
▸ Generalization bounds based on algorithmic stability
Next lecture:
▸ Optimization in deep learning
Understanding the trade-off between model complexity and expected risk
Models: let {X_i : i = 1, 2, …} be a nested sequence of parameter domains, i.e., X_i ⊆ X_{i+1}. For example, let X_i = neural networks with i neurons.
1. x_i⋆ with R_n(x_i⋆) = min_{x ∈ X_i} R_n(x): the ERM solution over X_i
2. R(x_i⋆): true risk of the ERM solution over X_i
3. sup_{x ∈ X_i} |R(x) − R_n(x)|: worst-case generalization error over X_i
Practical performance of the ERM estimator:
    R(x_i⋆) ≤ min_{x ∈ X_i} R_n(x) + sup_{x ∈ X_i} |R(x) − R_n(x)|    (1)
As we increase the index i → i + 1 of the parameter domain, i.e., choose a larger (more complex) model:
1. The minimum empirical risk decreases: min_{x ∈ X_i} R_n(x) ≥ min_{x ∈ X_{i+1}} R_n(x).
2. The worst-case generalization error increases: sup_{x ∈ X_i} |R(x) − R_n(x)| ≤ sup_{x ∈ X_{i+1}} |R(x) − R_n(x)|.
3. What happens to the true risk R(x_i⋆)?
The classical trade-off between model complexity and risk
Figure: Bias–variance trade-off [9].
Occam's Razor: simple is better than complex.
The dangers of complex function classes: sévère (cevher) overfitting
Figure: The ground-truth model, a degree-3 polynomial. Training over a complex function class can lead to overfitting.
The dangers of complex function classes: sévère (cevher) overfitting
Figure: Noisy samples around the ground truth. Annotations: min_{x ∈ X} R_n(x) ↗, sup_{x ∈ X} |R(x) − R_n(x)| ↘. Training over a complex function class can lead to overfitting.
The dangers of complex function classes: sévère (cevher) overfitting
Figure: Degree-19 polynomial fit to the noisy samples: min_{x ∈ X} R_n(x) ↘ while sup_{x ∈ X} |R(x) − R_n(x)| ↗. Training over a complex function class can lead to overfitting.
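A minimal numpy sketch of this effect (the cubic ground truth, noise level, and sample size are illustrative assumptions, not the exact settings behind the figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 20 noisy samples of a cubic ground truth on [-1, 1].
n = 20
a = np.linspace(-1.0, 1.0, n)
truth = lambda t: 2 * t**3 - t                 # assumed degree-3 ground truth
b = truth(a) + 0.3 * rng.standard_normal(n)

a_test = np.linspace(-1.0, 1.0, 1000)
b_test = truth(a_test)

for degree in (3, 19):                          # numpy may warn that degree 19 is ill-conditioned
    coeffs = np.polyfit(a, b, deg=degree)       # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, a) - b) ** 2)
    test_mse = np.mean((np.polyval(coeffs, a_test) - b_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3e}")
```

The degree-19 fit drives the training error to (numerically) zero while the test error explodes, which is exactly the regime depicted in the figure.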
"Hidden" regularization in neural networks: the complexity vs. risk trade-off in practice (I)
• Fitting a single-hidden-layer neural network with SGD and no regularization: the test error continues to go down! Where is the fundamental trade-off?
Figure: Training (empirical) and test (true) error for one-hidden-layer networks of increasing width, trained with SGD [20].
• There exist global minima with a large number of hidden units whose test error equals 1. The empirical error becomes zero for a wide enough network; what should happen for even wider networks?
  – But among the global minima, SGD is somehow converging to "good" ones.
Source: https://www.neyshabur.net/papers/inductive_bias_poster.pdf
"Hidden" regularization in neural networks: the complexity vs. risk trade-off in practice (II)
• Fitting a single-hidden-layer neural network with SGD and no regularization: the test error continues to go down! Where is the fundamental trade-off?
Figure: Training (empirical) and test (true) error for one-hidden-layer networks of increasing width, trained with SGD [20].
• There exist global minima with a large number of hidden units whose test error equals 1. Yet the test error continues to go down even as we keep increasing the complexity of the model!
  – But among the global minima, SGD is somehow converging to "good" ones.
Source: https://www.neyshabur.net/papers/inductive_bias_poster.pdf
The benefits of overparametrization
Figure: Degree-200 polynomial fit to the noisy samples: min_{x ∈ X} R_n(x) ↘ and sup_{x ∈ X} |R(x) − R_n(x)| ↘. Overparametrization leads to benign overfitting.
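One possible way to reproduce this behaviour numerically (again with assumed data, not the slide's exact experiment): take the overparametrized fit to be the minimum-ℓ2-norm interpolant over a Legendre polynomial feature map, computed with the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: same kind of 20 noisy cubic samples as in the previous sketch.
n = 20
a = np.linspace(-1.0, 1.0, n)
truth = lambda t: 2 * t**3 - t
b = truth(a) + 0.3 * rng.standard_normal(n)
a_test = np.linspace(-1.0, 1.0, 1000)

def min_norm_poly_fit(degree):
    Phi = np.polynomial.legendre.legvander(a, degree)        # Legendre features
    coef = np.linalg.pinv(Phi) @ b                            # min-l2-norm interpolant
    pred = np.polynomial.legendre.legvander(a_test, degree) @ coef
    return np.mean((Phi @ coef - b) ** 2), np.mean((pred - truth(a_test)) ** 2)

for degree in (19, 200):
    train_mse, test_mse = min_norm_poly_fit(degree)
    print(f"degree {degree:3d}: train MSE {train_mse:.2e}, test MSE {test_mse:.3e}")
```

Both fits interpolate the data (train MSE ≈ 0), but the heavily overparametrized minimum-norm fit is typically far tamer than the one at the interpolation threshold.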
The generalization mystery in deep learning: a gap between theory and practice
◦ In practice, simple algorithms like SGD can train neural networks to zero training error and still achieve low test error.
◦ This happens even for large and complex neural network architectures.
◦ Complexity measures like the Rademacher complexity suggest the opposite behaviour (overfitting).
Multiple global minimizers of the empirical risk
Figure: Empirical risk and true risk as a function of the parameter, with the set of global optima marked.
◦ The global minimum value is R_n⋆, but many parameters can attain this value.
◦ Each minimizer of the empirical risk might have a different true risk.
Not all global minimizers are the same
◦ Consider a simple 2D classification task, and train a neural network with fixed step-size SGD.
◦ The plots below correspond to two different global minimizers. [Figure: decision regions of the two global minimizers.]
SGD never lands on the global minimum on the right! Why?
Understanding the implicit bias of optimization algorithms
◦ SGD seems to be biased towards good global minimizers (low true risk).
◦ Some optimization algorithms have an implicit bias towards certain kinds of global minimizers.
◦ Can we characterize this implicit bias?
Definition (Algorithm)
We refer to a function (deterministic or randomized) A : Z → X, mapping Z ↦ A_Z, as an algorithm with input Z ∈ Z and output A_Z ∈ X.
Example (Gradient descent algorithm)
We denote by GD(T, α, x_0, ∇f) the output of T steps of gradient descent with step size α, starting from x_0 and using the gradient ∇f.
What is implicit regularization?
Definition (Implicit regularization of a deterministic algorithm)
Consider a minimization problem
    F⋆ = min_{x ∈ X} F(x)    (2)
and let A be a deterministic algorithm with input Z ∈ Z and output A_Z ∈ X. We say that A solves problem (2) and has implicit regularization H : X × Z → R if
    A_Z ∈ arg min_{F(x) = F⋆} H(x, Z).
Given the input Z ∈ Z, the algorithm outputs a global minimizer of F that, additionally, minimizes H(·, Z).
Implicit bias of gradient descent for linear regression
◦ Consider, for example, an underdetermined linear system Ax = b with A ∈ R^{n×p} and n < p, so that there are infinitely many solutions. Which of these solutions does gradient descent converge to?
Same problem and same initialization vs. different algorithms and different solutions
Consider the following simple 2D example:
    [1  2] [x_1; x_2] = 5,   i.e., Ax = b with A = [1, 2] and b = 5.
Different solutions: gradient descent and AdaGrad converge to different points on the line Ax = b, even though both drive ‖Ax_t − b‖²₂ to zero.
Figure: The solution line Ax = b, the common initialization, the Euclidean norm ball, and the distinct limit points of GD and AdaGrad; convergence of ‖Ax_t − b‖²₂ versus t for both methods.
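A small numpy sketch of this example (step sizes and iteration counts are illustrative choices): both methods solve x_1 + 2x_2 = 5 from the origin, but only gradient descent lands on the minimum-norm point.

```python
import numpy as np

# The 2D system from the slide: x1 + 2*x2 = 5, i.e. A = [[1, 2]], b = [5].
A = np.array([[1.0, 2.0]])
b = np.array([5.0])

def gd(x0, alpha=0.1, steps=2000):
    x = x0.copy()
    for _ in range(steps):
        x -= alpha * A.T @ (A @ x - b)           # gradient of 0.5*||Ax - b||^2
    return x

def adagrad(x0, alpha=0.5, eps=1e-8, steps=2000):
    x, G = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = A.T @ (A @ x - b)
        G += g ** 2                               # per-coordinate accumulation
        x -= alpha * g / (np.sqrt(G) + eps)       # AdaGrad preconditioning
    return x

x0 = np.zeros(2)
x_min_norm = A.T @ np.linalg.solve(A @ A.T, b)    # = (1, 2): closest point to 0
print("GD limit      :", gd(x0))                  # ≈ (1, 2), the min-norm solution
print("AdaGrad limit :", adagrad(x0))             # a different point on the line
print("min-norm point:", x_min_norm)
```

The per-coordinate preconditioning is what lets the AdaGrad iterates leave rowspan(A), which is exactly the point made in the proof two slides below.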
Implicit bias of gradient descent for linear regression
◦ Gradient descent seems to converge to the closest solution in terms of the ℓ2-norm.
Theorem (Implicit bias of gradient descent [10])
For the underdetermined, realizable linear system
    F⋆ = min_{x ∈ X} F(x),   F(x) = ½ ‖Ax − b‖²₂,
the gradient descent algorithm GD(T, α, x_0, ∇F), for T = ∞, any x_0 ∈ R^p, and any valid step size α, has implicit bias H(x) = ‖x − x_0‖₂, i.e.,
    GD(T = ∞, α, x_0, ∇F) = arg min_{F(x) = F⋆} ‖x − x_0‖₂.
Remark:
◦ The theorem also holds for stochastic gradient descent; see [2].
Same problem and same initialization vs. different algorithms and different solutions
Proof: For simplicity, take x_0 = 0.
▸ The gradient of F is Aᵀ(Ax − b).
▸ This implies that for all x, ∇F(x) ∈ colspan(Aᵀ).
▸ Gradient descent is therefore constrained to the subspace colspan(Aᵀ) = rowspan(A): the GD iterates stay in the row span, so the limit point at T = ∞ lies in rowspan(A).
▸ Note that, because of its preconditioning, AdaGrad can leave rowspan(A).
[Figure: the line Ax = b, the initialization, the Euclidean norm ball, and the GD and AdaGrad trajectories.]
Same problem and same initialization vs. different algorithms and different solutions
Proof (continued):
▸ The minimum-norm solution x̂_candidate = arg min_{x : Ax = b} ‖x‖²₂ also lies in rowspan(A).
▸ So both x̂_candidate and the limit point of GD are solutions of Ax = b that lie in rowspan(A).
▸ Since any two solutions differ by an element of null(A) and null(A) ∩ rowspan(A) = {0}, there can be only one solution in rowspan(A); hence x⋆_GD = x̂_candidate.
[Figure: the solution set x♮ + h, h ∈ null(A).]
Implicit bias for linear models
◦ We can extend this analysis to linear models:
    arg min_{x ∈ R^p} F(x) := Σ_{i=1}^n L(⟨x, a_i⟩, b_i).
◦ If the observations are realizable and there are many global minima Glob = {x : F(x) = 0}, then:
Theorem (Implicit bias of gradient descent [10])
If the loss L is convex and has a unique (attained) minimum, then the iterates x_t of gradient descent converge to the global minimum that is closest to the initialization x_0 in ℓ2-distance:
    x_t → arg min_{x ∈ Glob} ‖x − x_0‖₂   as t → ∞.
Proof (sketch): The assumption on L implies that the problem reduces to a linear system: if x is a global minimum, we must have ⟨x, a_i⟩ = b_i for all i ∈ {1, …, n}. We can then recycle the results we have just seen.
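A quick numerical check of this theorem under the squared loss (dimensions, iteration count, and step size are arbitrary choices): gradient descent from a random x_0 should land on the Euclidean projection of x_0 onto the solution set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Realizable linear model with n < p, so there are many global minima.
n, p = 20, 60
A = rng.normal(size=(n, p))
b = A @ rng.normal(size=p)                        # realizable labels

x0 = rng.normal(size=p)                           # arbitrary initialization
x = x0.copy()
alpha = 1.0 / np.linalg.norm(A, 2) ** 2           # safe step size
for _ in range(50_000):
    x -= alpha * A.T @ (A @ x - b)                # GD on 0.5*||Ax - b||^2

# Closest global minimum to x0: project x0 onto the affine set {x : Ax = b}.
x_closest = x0 + A.T @ np.linalg.solve(A @ A.T, b - A @ x0)

print("||A x_gd - b||          :", np.linalg.norm(A @ x - b))
print("||x_gd - closest to x0||:", np.linalg.norm(x - x_closest))
```

Both printed norms should be close to machine precision, matching the arg min ‖x − x_0‖₂ characterization above.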
Implicit bias for linearly separable datasets
◦ For linearly separable datasets, we know of an algorithm capable of finding a separating hyperplane.
◦ It maximizes the margin (i.e., the distance between the decision boundary and the nearest training point).
Hard-margin support vector machine
The hard-margin SVM solves the following optimization problem:
    arg min_{x ∈ R^p} ‖x‖₂  subject to  y_i ⟨x, a_i⟩ ≥ 1 for all i.
It finds a hyperplane that maximizes the margin, and it does so by design.
Implicit bias for linearly separable datasets
◦ What happens if we do not explicitly enforce margin maximization?
Theorem (Implicit bias of gradient descent on separable data [23, 10])
For the logistic loss (and some other strictly monotonically decreasing losses) and for linearly separable datasets, the direction of the gradient descent iterates x_t converges, for any initialization, to the hard-margin SVM direction:
    x_t / ‖x_t‖₂ → x⋆_SVM / ‖x⋆_SVM‖₂  as t → ∞,   where x⋆_SVM = arg min_{x ∈ R^p} ‖x‖₂ subject to y_i ⟨x, a_i⟩ ≥ 1.
Remarks:
◦ Here, without explicit instructions, gradient descent maximizes the margin.
◦ The rate of this convergence is O(1/log t).
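A sketch of this phenomenon on an assumed toy 2D dataset (not the lecture's example): the hard-margin SVM direction is obtained with a generic constrained solver (scipy's SLSQP), and the direction of gradient descent on the logistic loss is compared to it; directional convergence is slow, at rate O(1/log t).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)

# Assumed toy data: a linearly separable 2D dataset.
n = 40
a_pos = rng.normal(loc=[+2.0, +1.0], scale=0.5, size=(n // 2, 2))
a_neg = rng.normal(loc=[-2.0, -1.0], scale=0.5, size=(n // 2, 2))
A = np.vstack([a_pos, a_neg])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

# Reference: hard-margin SVM, min ||x||^2 s.t. y_i <x, a_i> >= 1.
svm = minimize(
    fun=lambda x: 0.5 * x @ x,
    x0=np.ones(2),
    jac=lambda x: x,
    constraints=[{"type": "ineq",
                  "fun": lambda x: y * (A @ x) - 1.0,
                  "jac": lambda x: y[:, None] * A}],
    method="SLSQP",
)
x_svm = svm.x / np.linalg.norm(svm.x)

# Gradient descent on the (unregularized) logistic loss.
def grad_logistic(x):
    margins = y * (A @ x)
    return -(A * (y * expit(-margins))[:, None]).mean(axis=0)

x, alpha = np.zeros(2), 0.5                     # illustrative step size
for _ in range(200_000):
    x -= alpha * grad_logistic(x)

x_gd = x / np.linalg.norm(x)
# Should approach 1, but only slowly (O(1/log t) directional convergence).
print("cosine(GD direction, SVM direction) =", x_gd @ x_svm)
```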
Implicit bias for linearly separable datasets
◦ A similar result can be established for stochastic gradient descent with the logistic loss on separable datasets.
Theorem (Implicit bias of stochastic gradient descent on separable data [18])
For any initialization and any small enough fixed step size, the direction of the SGD iterates x_t converges almost surely to the hard-margin SVM direction:
    ‖ x_t / ‖x_t‖₂ − x⋆_SVM / ‖x⋆_SVM‖₂ ‖₂ = O(1/log t).
Remarks:
◦ This result is particularly interesting as it establishes convergence of fixed step-size SGD.
◦ Both SGD and GD have the same implicit bias towards maximizing the margin.
Implicit bias for non-convex objectives
◦ Characterizing the implicit bias of stochastic gradient descent for non-convex objectives is an active research area.
◦ Some papers study deep matrix factorization as a first step towards results for neural networks.
Deep matrix factorization
Deep matrix factorization consists of parametrizing a matrix M as a product of N matrices,
    M = X_N X_{N−1} ⋯ X_1,
which can be understood as parametrizing M by a depth-N "linear neural network", i.e., a neural network with no activations and with weight matrices X_i.
Implicit bias for deep matrix completion
◦ The matrix completion problem consists of filling in the missing entries of a partially observed matrix.
◦ The deep matrix factorization approach solves the following problem with gradient descent:
    arg min_{X_N, X_{N−1}, …, X_1} Σ_{(i,j) ∈ Ω} ([X_N X_{N−1} ⋯ X_1]_{i,j} − b_{i,j})².
◦ It was conjectured in 2017 [11] that gradient descent is biased towards solutions with small nuclear norm.
Theorem (Implicit regularization may not be explainable by norms (2020) [22])
For deep matrix completion, the implicit bias cannot be expressed as a function of a norm or semi-norm.
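A minimal gradient-descent sketch of deep matrix completion (the matrix size, depth, observation rate, initialization scale, and step size are illustrative assumptions; if the loss plateaus, increase the number of steps). It also reports the nuclear norm of the recovered matrix, the quantity featured in the 2017 conjecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy instance: complete a 20x20 rank-2 matrix from ~40% of its entries
# with a depth-3 factorization M = X3 X2 X1 trained by full-batch GD from a
# small random initialization.
d, rank, depth = 20, 2, 3
B = rng.normal(size=(d, rank)) @ rng.normal(size=(rank, d)) / np.sqrt(d)
mask = rng.random((d, d)) < 0.4                       # observed index set Omega

Xs = [0.1 * rng.normal(size=(d, d)) for _ in range(depth)]   # [X1, X2, X3]

def product(Xs):
    P = Xs[-1]
    for X in reversed(Xs[:-1]):
        P = P @ X
    return P                                          # X_N ... X_1

alpha = 0.02                                          # illustrative step size
for step in range(20001):
    P = product(Xs)
    G = 2.0 * mask * (P - B)                          # gradient of the loss w.r.t. P
    if step % 5000 == 0:
        print(f"step {step:5d}: loss on Omega = {np.sum((mask * (P - B)) ** 2):.3e}")
    new_Xs = []
    for k in range(depth):
        left = np.eye(d)                              # product of factors above X_{k+1}
        for X in reversed(Xs[k + 1:]):
            left = left @ X
        right = np.eye(d)                             # product of factors below X_{k+1}
        for X in reversed(Xs[:k]):
            right = right @ X
        new_Xs.append(Xs[k] - alpha * left.T @ G @ right.T)   # chain rule
    Xs = new_Xs

P = product(Xs)
print("nuclear norm of recovered M :", np.linalg.norm(P, ord="nuc"))
print("nuclear norm of ground truth:", np.linalg.norm(B, ord="nuc"))
```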
Double descent: a failure of conventional wisdom
Figure: (A) The classical U-shaped risk curve (dashed: training risk, solid: test risk) arising from the bias–variance trade-off. (B) The double-descent risk curve, which subsumes the U-shaped curve (the "classical" regime) together with the observed behavior of high-capacity function classes (the "modern" interpolating regime), separated by the interpolation threshold; predictors to the right of the threshold have zero training risk. Source: [7].
▸ Classical setting: the large-sample limit n → ∞ with p fixed.
▸ High-dimensional setting: n and p comparably large.
Double descent curve in practice (I)
Typical examples:
▸ linear/nonlinear regression [13]
▸ random features, random forests, and shallow neural networks [7]
Figure: Experiments on MNIST. (a) Random Fourier features model: train/test zero-one and squared loss versus the number of random Fourier features N (×10³), with the minimum-norm solution h_{n,∞} of the original kernel as reference. (b) A fully connected neural network: error versus the number of parameters/weights (×10³). Source: [7].
Double descent curve in practice (II)
Figure: Left: train and test error as a function of model size, for ResNet18s of varying width on CIFAR-10 with 15% label noise. Right: test error for varying numbers of training epochs. All models trained with Adam for 4K epochs; the largest model (width 64) corresponds to the standard ResNet18. Source: [19].
Underparametrized regime
Figure: Low generalization error but high empirical error.
Figure: Sweet spot for the model complexity.
Interpolation threshold
Figure: The unique degree-19 polynomial that fits 20 samples exactly.
Benign overfitting in the overparametrized regime
Figure: A degree-200 polynomial that harmlessly fits 20 noisy points.
Figure: Double descent of the test error as a function of the degree of the polynomial fit.
Benign overfitting [6]: good prediction with zero training error for the regression loss.
▸ Statistical wisdom: a predictor should not fit the training data too well.
▸ Deep networks fit noisy training data perfectly and still generalize well on test data.
A simple case: linear regression with Gaussian data
Problem setting
◦ Linear model:
  ▸ training data {a_i}_{i=1}^n with a_i ∼ N(0, I_p), stacked as A = [a_1, a_2, ⋯, a_n]ᵀ
  ▸ labels b_i = ⟨x♮, a_i⟩ + w_i, with target vector x♮ ∼ N(0, (1/p) I_p) and noise w_i ∼ N(0, σ²)
◦ Min-norm solution:
  ▸ x⋆ = arg min_x {‖x‖₂ : Ax = b} = (AᵀA)† Aᵀ b
◦ Excess risk, for a test point a:
  ▸ R(x⋆; x♮) = E[(⟨a, x⋆⟩ − ⟨a, x♮⟩)² | A]
Theorem [13]
Under the above problem setting, assume that ‖x♮‖²₂ = r². As n, p → ∞ with p/n → γ, we have
    R(x⋆, x♮) → σ² γ/(1 − γ)                      for γ < 1,
    R(x⋆, x♮) → r² (1 − 1/γ) + σ²/(γ − 1)          for γ > 1.
Remark: the asymptotic risk curves depend on γ and the SNR r²/σ².
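A simulation of this setting (sample size, noise level, and number of repetitions are arbitrary choices) comparing the empirical excess risk of the min-norm solution with the asymptotic formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_norm_excess_risk(n, p, sigma):
    x_nat = rng.normal(size=p) / np.sqrt(p)           # x♮ ~ N(0, I_p / p)
    A = rng.normal(size=(n, p))
    b = A @ x_nat + sigma * rng.normal(size=n)
    x_star = np.linalg.pinv(A) @ b                    # min-norm least squares
    # For a ~ N(0, I_p), E[(<a, x* - x♮>)^2 | A] = ||x* - x♮||^2.
    return np.sum((x_star - x_nat) ** 2)

def asymptotic_risk(gamma, r2, sigma2):
    if gamma < 1:
        return sigma2 * gamma / (1 - gamma)
    return r2 * (1 - 1 / gamma) + sigma2 / (gamma - 1)

n, sigma = 400, 0.5
for gamma in (0.5, 0.8, 1.5, 3.0):
    p = int(gamma * n)
    empirical = np.mean([min_norm_excess_risk(n, p, sigma) for _ in range(20)])
    print(f"gamma = {gamma:3.1f}: empirical {empirical:.3f} "
          f"vs. asymptotic {asymptotic_risk(gamma, 1.0, sigma ** 2):.3f}")
```

Here r² = E‖x♮‖²₂ = 1, so the SNR is 1/σ²; the two columns should agree up to finite-n fluctuations, with the risk blowing up as γ approaches 1 from either side.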
Alternatives to complexity-based generalization bounds
◦ So far we have seen that complexity-based generalization bounds:
  ▸ characterize the worst-case scenario,
  ▸ are not tight in practice,
  ▸ disregard the effect of the optimization algorithm.
Can we understand generalization as a property of an optimization algorithm? YES!
Formal definition of stability (I)
Definition (Uniform stability [12])
Let A : Z → H be a randomized algorithm with input a finite sample S and output a function A_S ∈ H. The algorithm A has uniform stability (β_n)_{n≥1} with respect to the loss function L if, for all samples S, S′ ⊆ A × B with |S| = |S′| = n that differ in at most one sample,
    sup_{(a,b) ∈ A×B} E|L(A_S(a), b) − L(A_{S′}(a), b)| ≤ β_n.
The expectation is taken with respect to the randomness of the algorithm A.
Misnomer: lower "stability" (a small value of β_n) means that the difference in the output of the algorithm is smaller, i.e., the algorithm is more stable.
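A rough Monte Carlo proxy for this quantity (an assumed setup, not part of the lecture): the algorithm is ridge regression, S and S′ differ in one sample, and the supremum over (a, b) is crudely approximated by random probe points.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge(A, b, lam):
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)

def stability_proxy(n, p, lam, trials=200):
    worst = 0.0
    for _ in range(trials):
        A = rng.normal(size=(n, p))
        b = rng.normal(size=n)
        A2, b2 = A.copy(), b.copy()
        A2[0], b2[0] = rng.normal(size=p), rng.normal()      # S' replaces one sample of S
        x, x2 = ridge(A, b, lam), ridge(A2, b2, lam)
        a_test, b_test = rng.normal(size=p), rng.normal()    # random probe point (a, b)
        diff = abs((a_test @ x - b_test) ** 2 - (a_test @ x2 - b_test) ** 2)
        worst = max(worst, diff)
    return worst

for n in (50, 200, 800):
    print(f"n = {n:4d}: empirical stability proxy ≈ {stability_proxy(n, 10, 1.0):.4f}")
```

The proxy shrinks as n grows, consistent with the idea that replacing one sample out of many should barely change the learned predictor.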
Formal definition of stability (II)
Figure: Algorithm B is less stable than algorithm A.
Generalization bounds based on uniform stability: definitions
Definition (Empirical risk on a set)
Let S := [(a_1, b_1), …, (a_n, b_n)] be an i.i.d. sample drawn from a distribution on A × B, let L : B × B → R be a loss function, and let H be a class of functions h : A → B. The empirical risk of h ∈ H on the set S is defined as
    R_S(h) := (1/n) Σ_{i=1}^n L(h(a_i), b_i).
This is (almost) the same definition as before; it makes explicit the dependence on the set S.
Definition (Expected generalization error)
Let A : Z → H be a randomized algorithm that takes as input a finite sample S of arbitrary size and outputs a function A_S ∈ H. Suppose that S = [(a_1, b_1), …, (a_n, b_n)] is an i.i.d. sample from a probability distribution on A × B. The expected generalization error on a sample of size n is the value
    E[R_S(A_S) − R(A_S)],
where the expectation is taken with respect to the draw of the sample S and the randomness of A.
Generalization bounds based on uniform stability: fundamental theorem (I)
Theorem (Hardt et al. 2016 [12])
Let A be uniformly stable with stability (β_n)_{n≥1}. Then, for a random i.i.d. sample S of size n, the expected generalization error is bounded as
    E[|R_S(A_S) − R(A_S)|] ≤ β_n.
Proof. Let S = [(a_1, b_1), …, (a_n, b_n)] and S′ = [(a′_1, b′_1), …, (a′_n, b′_n)] be two i.i.d. samples of size n. Denote by
    S^(i) := [(a_1, b_1), …, (a_{i−1}, b_{i−1}), (a′_i, b′_i), (a_{i+1}, b_{i+1}), …, (a_n, b_n)]
the sample that results from replacing (a_i, b_i) by (a′_i, b′_i) in S. By exchangeability of the i.i.d. samples,
    E[R_S(A_S)] = E[(1/n) Σ_{i=1}^n L(A_S(a_i), b_i)] = E[(1/n) Σ_{i=1}^n L(A_{S^(i)}(a′_i), b′_i)]
    = E[(1/n) Σ_{i=1}^n (L(A_{S^(i)}(a′_i), b′_i) − L(A_S(a′_i), b′_i))] + E[(1/n) Σ_{i=1}^n L(A_S(a′_i), b′_i)].
Generalization bounds based on uniform stability: fundamental theorem (II)
Proof (continued). We have
    E[R_S(A_S)] = E[(1/n) Σ_{i=1}^n (L(A_{S^(i)}(a′_i), b′_i) − L(A_S(a′_i), b′_i))] + E[(1/n) Σ_{i=1}^n L(A_S(a′_i), b′_i)].
Note that S and S^(i) differ in only one sample, so uniform stability bounds the first term:
    E[(1/n) Σ_{i=1}^n (L(A_{S^(i)}(a′_i), b′_i) − L(A_S(a′_i), b′_i))] = (1/n) Σ_{i=1}^n E[L(A_{S^(i)}(a′_i), b′_i) − L(A_S(a′_i), b′_i)] ≤ β_n.
Finally, note that because the samples (a′_i, b′_i) are independent of S, we have
    E[(1/n) Σ_{i=1}^n L(A_S(a′_i), b′_i)] = E[R(A_S)].
Analogously, we can show E[R(A_S) − R_S(A_S)] ≤ β_n. ∎
The stability of SGD
◦ Let h_x ∈ H_X be an element of a parametric function class. Consider the ERM optimization objective
    f(x) := (1/n) Σ_{i=1}^n f_i(x),   f_i(x) := L(h_x(a_i), b_i).
◦ The SGD iterates for t = 0, …, T are x_{t+1} = x_t − α_t ∇_x f_i(x_t), with i ∼ Unif[n].
Table: Summary of stability upper bounds for different assumptions on the objective function [12].
| Algorithm | Assumptions on f_i | Stability β_n |
| SGD | convex, L-smooth, β-Lipschitz, α_t ≤ 2/L | (β²/n) Σ_{t=0}^T α_t |
| SGD | μ-strongly convex, L-smooth, β-Lipschitz, α_t ≤ 2/L | β²/(nμ) |
| SGD | μ-strongly convex, L-smooth, β-Lipschitz, α_t = 1/(μt) | (β² + Lρ)/(nμ) |
| SGD, averaged iterate | convex, L-smooth, β-Lipschitz | β²T/(nL) |
| SGD | non-convex, L-smooth, β-Lipschitz, α_t = 1/t | ((1 + 1/L)/n) (2β²)^{1/(L+1)} T^{L/(L+1)} |
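A tiny helper for the convex row of the table (the constants β, T, n, and the step-size schedules below are placeholders, not values from the lecture):

```python
# Stability bound from the convex row of the table above: (β² Σ_{t≤T} α_t) / n.
def sgd_stability_bound_convex(lipschitz_beta, step_sizes, n):
    return (lipschitz_beta ** 2) * sum(step_sizes) / n

T, n, beta = 10_000, 50_000, 1.0
constant = [1e-2] * T                                    # constant step size
decaying = [1e-2 / (t + 1) ** 0.5 for t in range(T)]     # 1/sqrt(t) decay
print("constant steps:", sgd_stability_bound_convex(beta, constant, n))
print("decaying steps:", sgd_stability_bound_convex(beta, decaying, n))
```

The bound grows with the total step-size budget Σ_t α_t, which is why running SGD for longer (next slide) is expected to degrade stability.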
Effect of the number of iterations on the stability of SGD and the generalization error
Figure: Normalized parameter distance between two networks trained on two datasets S, S′ differing in only one sample, together with the training error, test error, and generalization error (0–1 loss) on CIFAR-10 [12].
◦ Parameter distance is a stronger notion than stability.
◦ More iterations ⇒ the parameter distance increases (we expect the stability constant β_n to increase, i.e., stability to degrade).
◦ The generalization error follows the same behavior as the parameter distance (a proxy for stability).
Wrap up!
◦ Visualizations can be deceiving when trying to understand high-dimensional behavior.
◦ Are we really in the interpolation regime in machine learning?
Theorem (Probability of interpolation [5])
Given a p-dimensional dataset A_n = {a_1, …, a_n} of i.i.d. samples a_i ∼ N(0, I), the probability that a new sample a ∼ N(0, I) is in the interpolation regime (i.e., lies within the convex hull of A_n) has the following limiting behavior:
    lim_{p→∞} P(a ∈ convex hull(A_n)) = 1 if n > 2^{p/2}/p,  and 0 if n < 2^{p/2}/p.
◦ We are most likely in the extrapolation regime [4].
◦ Recitation 4 on Friday!
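A Monte Carlo illustration of the convex-hull test behind this theorem (the dimensions and sample sizes are small and illustrative; the asymptotic threshold itself is not verified here). Membership in a convex hull is a linear-programming feasibility problem: a ∈ conv{a_1, …, a_n} iff there exist λ ≥ 0 with Σ λ_i = 1 and Σ λ_i a_i = a.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def in_convex_hull(point, points):
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, n))])    # sum λ_i a_i = a, sum λ_i = 1
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.status == 0                           # 0 = a feasible optimum was found

for p in (2, 5, 10, 20):
    n, trials = 100, 200
    hits = sum(in_convex_hull(rng.normal(size=p), rng.normal(size=(n, p)))
               for _ in range(trials))
    print(f"p = {p:2d}, n = {n}: estimated P(a in hull) ≈ {hits / trials:.2f}")
```

Even at these modest dimensions the estimated probability collapses quickly with p unless n grows exponentially, which is the message behind "we are most likely in the extrapolation regime".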
⋆ From neural networks to random features models [14, 21]
A one-hidden-layer neural network with m neurons (fully connected architecture): let X_1 ∈ R^{m×p} (weights), a ∈ R^p (input), X_2 ∈ R^m (output weights), μ_1 ∈ R^m and μ_2 ∈ R (biases), and
    h_x(a) := X_2ᵀ σ(X_1 a + μ_1) + μ_2,   x := [X_1, X_2, μ_1, μ_2],
where the hidden layer σ(X_1 a + μ_1) acts as fixed random features:
▸ X_1: Gaussian initialization, then kept fixed
▸ X_2: to be learned
▸ overparameterized model: #neurons m > #training data n
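A minimal random-features sketch (the data, activation, and dimensions are assumptions; this is not the lecture's MNIST experiment): X_1 and μ_1 are drawn once and frozen, and only the output layer X_2 is learned, here by minimum-norm least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, m = 200, 10, 2000                         # m > n: overparameterized
teacher = rng.normal(size=p) / np.sqrt(p)       # assumed teacher direction
A = rng.normal(size=(n, p))
b = np.sin(A @ teacher) + 0.1 * rng.normal(size=n)

X1 = rng.normal(size=(m, p)) / np.sqrt(p)       # fixed Gaussian first layer
mu1 = rng.uniform(0.0, 2 * np.pi, size=m)       # fixed random biases
phi = lambda Z: np.cos(Z @ X1.T + mu1)          # sigma = cos (random Fourier features)

X2 = np.linalg.pinv(phi(A)) @ b                 # learned output layer (min-norm)

A_test = rng.normal(size=(1000, p))
b_test = np.sin(A_test @ teacher)
print("train MSE:", np.mean((phi(A) @ X2 - b) ** 2))
print("test  MSE:", np.mean((phi(A_test) @ X2 - b_test) ** 2))
```

Only a linear problem in X_2 remains once X_1 and μ_1 are frozen, which is what makes the model amenable to the precise analyses cited on the next slides.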
⋆ Double descent: random features model (I)
◦ High dimensions: #training data n, #neurons m, and feature dimension p are comparably large.
Figure: Test MSE, bias, and variance of random-features regression as a function of the ratio m/n on MNIST (digit 3 vs. 7), for p = 784 and n = 600, with the Gaussian kernel: (a) SGD vs. min-norm solution, (b) bias ≲ B1 + B2 + B3, (c) variance ≲ V1 + V2 + V3. Source: [16].
▸ random-features regression solved by SGD: interplay between excess risk and optimization
▸ bias–variance decomposition for understanding multiple sources of randomness
▸ monotonically decreasing bias and unimodal variance ⇒ double descent
⋆ Double descent: random features model (II)
Table: Comparison of representative random-features analyses of double descent.
| Work | Data assumption | Solution type | Result on risk curve |
| [13] | Gaussian | closed-form | variance ↗↘ |
| [17] | i.i.d. on sphere | closed-form | variance, bias ↗↘ |
| [8] | Gaussian | closed-form | refined decomposition of the variance |
| [1] | Gaussian | closed-form | full decomposition of the variance |
| [15] | general | closed-form | ↗↘ |
| [3] | Gaussian | GD | variance ↗↘ |
| [16] | sub-exponential | SGD | variance ↗↘, bias ↘ |
◦ multiple sources of randomness: data sampling, label noise, initialization
◦ phase transition due to the non-monotonic variance
References I
[1] Ben Adlam and Jeffrey Pennington. Understanding double descent requires a fine-grained bias-variance decomposition. In Advances in Neural Information Processing Systems, 2020.
[2] Navid Azizan and Babak Hassibi. Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. In International Conference on Learning Representations, 2019.
[3] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Denny Wu, and Tianzong Zhang. Generalization of two-layer neural networks: an asymptotic viewpoint. In International Conference on Learning Representations, pages 1–8, 2020.
[4] Randall Balestriero, Jerome Pesenti, and Yann LeCun. Learning in high dimension always amounts to extrapolation. arXiv preprint arXiv:2110.09485, 2021.
[5] Imre Bárány and Zoltán Füredi. On the shape of the convex hull of random points. Probability Theory and Related Fields, 77(2):231–240, 1988.
[6] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
References II
[7] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
[8] Stéphane d'Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala. Double trouble in double descent: Bias and variance(s) in the lazy regime. In International Conference on Machine Learning, pages 2280–2290, 2020.
[9] Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.
[10] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, 2018.
[11] Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems 30, pages 6151–6159, 2017.
References III
[12] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
[13] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
[14] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1-3):489–501, 2006.
[15] Zhenyu Liao, Romain Couillet, and Michael Mahoney. A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent. In Advances in Neural Information Processing Systems, 2020.
[16] Fanghui Liu, Johan A. K. Suykens, and Volkan Cevher. On the double descent of random features models trained with SGD. arXiv preprint arXiv:2110.06910, 2021.
References IV
[17] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
[18] Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. In International Conference on Artificial Intelligence and Statistics, 2019.
[19] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2020.
[20] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
[21] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.
[22] Noam Razin and Nadav Cohen. Implicit regularization in deep learning may not be explainable by norms. In Advances in Neural Information Processing Systems, 2020.
[23] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.