Variable selection in high-dimensional genetic data

                            Sahir Rai Bhatnagar
      Department of Epidemiology, Biostatistics, and Occupational Health
                    Department of Diagnostic Radiology
                             McGill University

                           sahirbhatnagar.com

                              Feb. 23, 2021

Classical Methods

          Betting on Sparsity

          A Thought Experiment

          Motivating Dataset

          Our proposal

          Results

          Other applications

          Extensions
             Nonconvex penalty functions
             Survival analysis

          Future directions

Classical Methods
Setting

               • We are concerned with the analysis of data in which we are
                 attempting to predict an outcome Y using a number of explanatory
                 factors X1 , X2 , X3 , . . ., some of which may not be particularly useful
               • Although the methods we will discuss can be used solely for
                 prediction (i.e., as a “black box”), I will adopt the perspective that we
                 would like the statistical methods to be interpretable and to explain
                 something about the relationship between X and Y
               • Regression models are an attractive framework for approaching
                 problems of this type, and the focus today will be on extending
                 classical regression modeling to deal with high-dimensional data

Classical Methods
               • A nice and powerful toolbox for analyzing the more traditional datasets where
                 the sample size (N) is far greater than the number of covariates (p):
                   ▶ linear regression, logistic regression, LDA, QDA, glm,
                   ▶ regression spline, smoothing spline, kernel smoothing, local smoothing,
                       GAM,
                   ▶ Neural Network, SVM, Boosting, Random Forest, ...
                                                                           
\[
X_{n \times p} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
\]

Classical Linear Regression

          Data: (x1 , y1 ), . . . , (xn , yn ) iid from

                                                 y = xT β + ϵ

          where E(ϵ|x) = 0, and dim(x) = p. To include an intercept, we can set
          x1 ≡ 1. Using Matrix notation:

                                                 y = Xβ + ϵ

          The least squares estimator

              \hat{\beta}_{LS} = \arg\min_{\beta} \lVert y - X\beta \rVert^2

              \hat{\beta}_{LS} = (X^{\top} X)^{-1} X^{\top} y

               • Question: How to find the important variables xj ?
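To make the closed-form estimator concrete, here is a minimal sketch in Python/NumPy (simulated data; all names and sizes are illustrative and not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# Closed form: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable equivalent
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
```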

Best-subset Selection (Beale et al. 1967, Biometrika)

Which variables are important?

               • Scientists know that only a small subset of variables (such as genes) are
                 important for the response variable.
               • An old idea: try all possible subset models and pick the best one.
               • Fit the linear regression model on a subset of predictors. Let S be the
                 subset of predictors, e.g., S = {1, 3, 7}.

                     C_p = \frac{RSS_S}{\sigma^2} - (n - 2|S|) = \frac{RSS_S}{\sigma^2} + 2|S| - n

               • We pick the model with the smallest Cp value.
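A brute-force sketch of best-subset selection scored by Cp (Python/NumPy). Estimating σ² from the full model is an assumption on my part; the slide does not say how σ² is obtained:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p = 60, 6
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(size=n)

def rss(X_S, y):
    """Residual sum of squares of the least-squares fit on the columns in S."""
    beta, *_ = np.linalg.lstsq(X_S, y, rcond=None)
    r = y - X_S @ beta
    return r @ r

sigma2 = rss(X, y) / (n - p)        # sigma^2 estimated from the full model (assumption)

best = None
for size in range(1, p + 1):
    for S in combinations(range(p), size):      # all 2^p - 1 non-empty subsets
        cols = list(S)
        # Mallows' Cp as written on the slide: RSS_S / sigma^2 + 2|S| - n
        cp = rss(X[:, cols], y) / sigma2 + 2 * len(cols) - n
        if best is None or cp < best[0]:
            best = (cp, cols)

print("best subset:", best[1], "Cp =", round(best[0], 2))
```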

Model selection criteria

          Minimizing Cp is equivalent to minimizing

                     \lVert y - X_S \hat{\beta}_S \rVert^2 + 2|S|\sigma^2,

          which is the AIC score.
          Many popular model selection criteria can be written as

                     \lVert y - X_S \hat{\beta}_S \rVert^2 + \lambda|S|\sigma^2.

               • AIC uses λ = 2; BIC uses λ = log(n).

Remarks

          Best subset selection plus model selection criteria (AIC, BIC, etc.)
            • Computing all possible subset models is a combinatorial optimization
               problem (NP hard)
            • Instability in the selection process (Breiman, 1996)

Ridge Regression (Hoerl & Kennard 1970, Technometrics)

               • \hat{\beta}_{ridge} = \arg\min_{\beta} \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_2^2

               • \lVert \beta \rVert_2^2 = \sum_{j=1}^{p} \beta_j^2

               • \hat{\beta}_{ridge} = (X^{\top} X + \lambda I)^{-1} X^{\top} y → exact solution

               • \hat{\beta}_{LS} = (X^{\top} X)^{-1} X^{\top} y

               • If X^{\top} X = I_{p \times p}, then

                     \hat{\beta}_j(ridge) = \frac{\hat{\beta}_j(LS)}{1 + \lambda}
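A minimal sketch of the ridge solution (Python/NumPy); the second half simply checks the orthonormal-design shrinkage formula β̂j(ridge) = β̂j(LS)/(1 + λ) from the slide:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 50, 10, 3.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Ridge: (X'X + lambda*I) is invertible even when X'X is not
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Orthonormal-design special case: ridge = LS / (1 + lambda)
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))     # Q'Q = I_p
y2 = Q @ rng.normal(size=p) + rng.normal(size=n)
beta_ls = Q.T @ y2                                # LS estimate when X'X = I
beta_r = np.linalg.solve(Q.T @ Q + lam * np.eye(p), Q.T @ y2)
print(np.allclose(beta_r, beta_ls / (1 + lam)))   # True
```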
Least squares vs. Ridge

High-dimensional data (n < p)

Why can’t we fit OLS to high-dimensional data?

High-dimensional data (n < p)
                    ▶ However, the ideas we discuss in this course are also relevant to many
                      situations in which p < n; for example, if n = 100 and p = 80, we
                      probably don’t want to use ordinary least squares
A fundamental picture for data science

                    ESL, Hastie et al. 2009
Betting on Sparsity
Bet on Sparsity Principle

[Figure: the response Y at the centre of a ring of candidate predictors X1–X72]
Bet on Sparsity Principle

            Use a procedure that does well in sparse problems, since
                  no procedure does well in dense problems.1

                • We often don’t have enough data to estimate so many parameters

                • Even when we do, we might want to identify a relatively small
                  number of predictors (k < N) that play an important role

                • Faster computation, easier to understand, and stable predictions on
                  new datasets.

                 1 Hastie, Tibshirani & Friedman. The Elements of Statistical Learning. Springer Series in Statistics, 2001.
A Thought Experiment
How would you schedule a meeting of 20 people?
Motivating Dataset
UK Biobank
               • Genotype data come from 500,000 individuals of Caucasian ancestry
                 recruited in the United Kingdom
               • The UK Biobank genotyping array includes more than 800,000 SNPs
               • Large number of response variables (e.g., disease, bone mineral density)
               • Objective: which explanatory variables are associated with the response
                 variable?
A sample

                      ID   Response   Gene1  Gene2  Gene3  Gene4  Gene5  Gene6
             1   2610781     −1.255     1      2      0      0      0      1
             2   4114347     −0.339     1      2      0      2      0      1
             3   4399930     −0.6       1      2      1      1      0      1
             4   2081319      0.809     1      2      0      1      0      2
             5   1347380      0.279     2      2      0      0      0      0
             6   3262449     −0.421     2      2      0      1      0      1
             7   4870063     −0.454     2      2      0      0      0      2
             8   1141212      1.383     2      2      1      1      1      0
             9   2997954     −2.29      1      2      0      0      0      1
            10   5805218      2.289     1      2      0      1      1      1
GWAS

[Figure: overview of GWAS — (a) genome-wide association: a population of cases and controls (or an unselected sample for a quantitative trait), genotyping by SNP array and imputation or WGS, meta-analysis, and a Manhattan plot of −log10(P value) across chromosomes 1–22 with linkage-disequilibrium blocks of SNPs; (b) functional characterization: chromatin accessibility, histone modification, DNA methylation, transcription factor binding, conservation, eQTL; (c) experimental validation; (d) GWAS variants]

                1 Tam V. et al. Benefits and limitations of genome-wide association studies. Nat Rev Genet (2019)
Confounding

Population structure

               • GWAS compare “unrelated” individuals, but “unrelated” really means
                 that the relationships are unknown and presumed to be distant.

[Figure: schematic contrast between linkage and association studies. Linkage — relatedness is known, recent, and useful. Association — relatedness is unknown, distant, and a nuisance.]

               Astle and Balding. Population structure and cryptic relatedness in genetic association studies.
               Statistical Science (2009)
The observations are not independent
                                • Observations are correlated, but this relationship is often
                                  unknown
                                • However, it can be estimated from the data

                      ID   Response   Gene1  Gene2  Gene3  Gene4  Gene5  Gene6
             1   2610781     −1.255     1      2      0      0      0      1
             2   4114347     −0.339     1      2      0      2      0      1
             3   4399930     −0.6       1      2      1      1      0      1
             4   2081319      0.809     1      2      0      1      0      2
             5   1347380      0.279     2      2      0      0      0      0
             6   3262449     −0.421     2      2      0      1      0      1
             7   4870063     −0.454     2      2      0      0      0      2
             8   1141212      1.383     2      2      1      1      1      0
             9   2997954     −2.29      1      2      0      0      0      1
            10   5805218      2.289     1      2      0      1      1      1
The kinship matrix

               • Let kinship denote a list of SNPs used to estimate the kinship matrix
               • Let Xkinship be a normalized n × q genotype matrix.
               • A kinship matrix (Φ) can be computed as:

                     \Phi = \frac{1}{q-1} X_{kinship} X_{kinship}^{\top}                  (1)
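A small sketch of equation (1) (Python/NumPy). Treating “normalized” as column-standardized (mean 0, variance 1 per SNP) is an assumption; other normalizations are possible:

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 500, 1000
G = rng.integers(0, 3, size=(n, q)).astype(float)   # 0/1/2 genotypes for the kinship SNPs

# Assumed normalization: each SNP centered and scaled to unit variance
X_kinship = (G - G.mean(axis=0)) / G.std(axis=0)

# Equation (1): Phi = X_kinship X_kinship' / (q - 1)
Phi = X_kinship @ X_kinship.T / (q - 1)
print(Phi.shape, np.allclose(Phi, Phi.T))            # (n, n), symmetric
```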
Association testing with a linear mixed model (LMM)

                     Y = \sum_{j=1}^{p} \beta_j \cdot \mathrm{SNP}_j + P + \varepsilon                 (2)

                             P ∼ N (0, ησ²Φ)         ε ∼ N (0, (1 − η)σ²I)

               • σ² is the total variance of the phenotype
               • η ∈ [0, 1] is the heritability of the phenotype
               • Y|(η, σ²) ∼ N (0, ησ²Φ + (1 − η)σ²I)
Ridge regression (Hoerl & Kennard 1970, Technometrics),
Lasso (Tibshirani 1996, JRSSB)

               • \hat{\beta}_{ridge} = \arg\min_{\beta} \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_2^2

               • \hat{\beta}_{lasso} = \arg\min_{\beta} \frac{1}{2} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

           Lasso, ridge, etc. are not directly applicable to the LMM
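For contrast, a plain lasso fit on the genotype matrix (Python/scikit-learn). As the slide notes, this treats individuals as independent and ignores the kinship structure:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p = 200, 1000
X = rng.integers(0, 3, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[:5] = 1.0                                   # a few causal SNPs
y = X @ beta + rng.normal(size=n)

# Cross-validated lasso; assumes i.i.d. rows, which fails when individuals are related
fit = LassoCV(cv=5).fit(X, y)
print("selected SNPs:", np.flatnonzero(fit.coef_))
```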
Two-step procedure

               • Step 1: Fit an LMM under the null hypothesis, with a single random
                 effect

                                              Y = P + ε
                              P ∼ N (0, ησ²Φ)     ε ∼ N (0, (1 − η)σ²I)

               • Step 2: Use the residuals from Step 1 as a new, independent response
Two-step procedure

[Figure: Step 1 fits Y ≈ P + E1; Step 2 models the residuals from Step 1 as the new response (plus error E2)]

               • In association testing, this approach is known to suffer substantial
                 losses of power (Oualkacha et al. Genet. Epi. (2013))
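A rough sketch of the two-step idea (Python/NumPy + scikit-learn). Fitting the null LMM by a grid search over η with σ² profiled out is my simplification; real implementations use REML/ML machinery, and the lasso penalty value is arbitrary here:

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_null_lmm(y, Phi, grid=np.linspace(0.01, 0.99, 50)):
    """Step 1: Y = P + eps, P ~ N(0, eta*sigma2*Phi), eps ~ N(0, (1-eta)*sigma2*I).
    Grid-search eta, profiling out sigma2 from the Gaussian log-likelihood."""
    n = len(y)
    best = None
    for eta in grid:
        V = eta * Phi + (1 - eta) * np.eye(n)
        _, logdet = np.linalg.slogdet(V)
        sigma2 = y @ np.linalg.solve(V, y) / n          # profiled variance
        loglik = -0.5 * (n * np.log(sigma2) + logdet + n)
        if best is None or loglik > best[0]:
            best = (loglik, eta, sigma2, V)
    return best[1], best[2], best[3]

def two_step(y, X, Phi, lam=0.1):
    eta, sigma2, V = fit_null_lmm(y, Phi)
    P_hat = eta * Phi @ np.linalg.solve(V, y)           # BLUP of the polygenic effect
    resid = y - P_hat
    # Step 2: treat the residuals as a new, "independent" response
    return Lasso(alpha=lam).fit(X, resid)
```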
Our proposal
Our proposal
                     • We propose ggmix, a one-step procedure that simultaneously controls for
                       population structure and performs variable selection in linear mixed
                       models

                  Bhatnagar SR, Yang Y, Lu T, Schurr E, Loredo-Osti JC, Forest M, Oualkacha K,
                  Greenwood CMT. Simultaneous SNP selection and adjustment for population structure
                  in high dimensional prediction models. PLOS Genetics (2020)

                     1 R package: sahirbhatnagar.com/ggmix, https://cran.r-project.org/package=ggmix
ggmix: a one-step procedure

[Figure: Y ≈ Xβ + P + E — the SNP effects and the polygenic random effect are fit simultaneously]

                     1 R package: sahirbhatnagar.com/ggmix, https://cran.r-project.org/package=ggmix
Data and Model

               • Phenotype: Y = (y1 , . . . , yn ) ∈ R^n
               • SNPs: X = (X1 , . . . , Xn )^T ∈ R^{n×p} , where p ≫ n
               • Twice the kinship matrix, or realized relationship matrix: Φ ∈ R^{n×n}
               • Regression coefficients: β = (β1 , . . . , βp )^T ∈ R^p
               • Polygenic random effect: P = (P1 , . . . , Pn ) ∈ R^n
               • Error: ε = (ε1 , . . . , εn ) ∈ R^n
               • We consider the following LMM with a single random effect:

                                                  Y = Xβ + P + ε
                                P ∼ N (0, ησ²Φ)        ε ∼ N (0, (1 − η)σ²I)

               • σ² is the total variance of the phenotype
               • η ∈ [0, 1] is the phenotype heritability (narrow sense)
               • Y|(β, η, σ²) ∼ N (Xβ, ησ²Φ + (1 − η)σ²I)
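A small simulation sketch of this model (Python/NumPy). The kinship matrix is built from a random genotype block as in equation (1); all sizes and effect values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 300, 1000, 2000
eta, sigma2 = 0.3, 1.0

# Kinship from a standardized genotype block, as in equation (1)
G = rng.integers(0, 3, size=(n, q)).astype(float)
Gs = (G - G.mean(axis=0)) / G.std(axis=0)
Phi = Gs @ Gs.T / (q - 1)

# Design matrix and a sparse coefficient vector
X = rng.integers(0, 3, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[rng.choice(p, 10, replace=False)] = 0.5

# P ~ N(0, eta*sigma2*Phi), eps ~ N(0, (1-eta)*sigma2*I)
L = np.linalg.cholesky(Phi + 1e-8 * np.eye(n))   # small jitter for numerical stability
P = np.sqrt(eta * sigma2) * (L @ rng.normal(size=n))
eps = np.sqrt((1 - eta) * sigma2) * rng.normal(size=n)
y = X @ beta + P + eps
```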
Likelihood

                     • The negative log-likelihood is given by

                         -\ell(\Theta) \propto \frac{n}{2}\log(\sigma^2) + \frac{1}{2}\log\det(V) + \frac{1}{2\sigma^2}(Y - X\beta)^{\top} V^{-1} (Y - X\beta)

                                                   V = ηΦ + (1 − η)I
                     • Assume the spectral decomposition of Φ:

                                                       Φ = UDU⊤

                     • U is an n × n orthogonal matrix and D is an n × n diagonal matrix
                     • One can write
                                          V = U(ηD + (1 − η)I)U⊤ = UWU⊤
                       with W = diag(w1 , . . . , wn ), wi = ηDii + (1 − η)
Likelihood

                     • Projection of Y (and of the columns of X) into Span(U) leads to a simplified
                       correlation structure for the transformed data: Ỹ = U⊤ Y
                     • Ỹ|(β, η, σ²) ∼ N (X̃β, σ²W), with X̃ = U⊤ X
                     • The negative log-likelihood can then be expressed as

                         -\ell(\Theta) \propto \frac{n}{2}\log(\sigma^2) + \frac{1}{2}\sum_{i=1}^{n}\log(w_i) + \frac{1}{2\sigma^2}\big(\tilde{Y} - \tilde{X}\beta\big)^{\top} W^{-1} \big(\tilde{Y} - \tilde{X}\beta\big)

                     • For fixed σ² and η, solving for β is a weighted least squares problem
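A sketch of the rotation and the resulting weighted least squares update for β (Python/NumPy), with η and σ² held fixed as on the slide; the unpenalized solve at the end stands in for the penalized update ggmix actually uses:

```python
import numpy as np

def rotate(y, X, Phi):
    """Spectral decomposition Phi = U D U'; transform the data: y_tilde = U'y, X_tilde = U'X."""
    D, U = np.linalg.eigh(Phi)               # Phi is symmetric
    return U.T @ y, U.T @ X, D

def beta_wls(y_tilde, X_tilde, D, eta):
    """For fixed (eta, sigma2), beta solves a weighted least squares problem
    with weights 1/w_i, where w_i = eta*D_ii + (1 - eta)."""
    w = eta * D + (1 - eta)
    sw = 1.0 / np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X_tilde * sw[:, None], y_tilde * sw, rcond=None)
    return beta
```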
Penalized Maximum Likelihood Estimator

               • Define the objective function:

                     Q_{\lambda}(\Theta) = -\ell(\Theta) + \lambda \sum_{j} p_j(\beta_j)

               • pj (·) is a penalty term on β1 , . . . , βp
               • An estimate of the model parameters Θ̂λ is obtained by

                     \hat{\Theta}_{\lambda} = \arg\min_{\Theta} Q_{\lambda}(\Theta)
Block Relaxation (De Leeuw, 1994)

          To solve the optimization problem we use a block relaxation technique.

          Set k ← 0, initial values for the parameter vector Θ^(0) , and ϵ;
          for λ ∈ {λmax , . . . , λmin } do
              repeat
                  For j = 1, . . . , p:   β_j^(k+1) ← arg min_{β_j} Qλ(β_{−j}^(k), η^(k), σ²^(k))
                  η^(k+1) ← arg min_{η} Qλ(β^(k+1), η, σ²^(k))
                  σ²^(k+1) ← arg min_{σ²} Qλ(β^(k+1), η^(k+1), σ²)
                  k ← k + 1
              until convergence criterion is satisfied: ||Θ^(k+1) − Θ^(k)||₂ < ϵ;
          end
                        Algorithm 1: Block Relaxation Algorithm
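A skeleton of Algorithm 1 (Python). The three update functions are hypothetical placeholders passed in by the caller; the slide does not spell out their exact form:

```python
import numpy as np

def block_relaxation(theta0, lambdas, update_beta, update_eta, update_sigma2,
                     tol=1e-6, max_iter=500):
    """Generic block-relaxation loop over Theta = (beta, eta, sigma2),
    warm-starting along a decreasing lambda sequence."""
    beta, eta, sigma2 = theta0
    path = {}
    for lam in lambdas:
        for _ in range(max_iter):
            old = np.concatenate([beta, [eta, sigma2]])
            beta = update_beta(beta, eta, sigma2, lam)      # block 1: beta (coordinate-wise)
            eta = update_eta(beta, eta, sigma2, lam)        # block 2: eta
            sigma2 = update_sigma2(beta, eta, sigma2, lam)  # block 3: sigma2
            new = np.concatenate([beta, [eta, sigma2]])
            if np.linalg.norm(new - old) < tol:             # ||Theta^(k+1) - Theta^(k)||_2 < eps
                break
        path[lam] = (beta.copy(), eta, sigma2)
    return path
```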
Coordinate Gradient Descent Method

               • We take advantage of the smoothness of ℓ(Θ)
               • We approximate Qλ (Θ) by a strictly convex quadratic function (using the
                 gradient)
               • We use CGD to calculate a descent direction
               • To achieve the descent property for the objective function, we additionally
                 employ a line search

           Theorem [Convergence]¹:
           If {Θ(k) , k = 0, 1, 2, . . .} is a sequence of iterates generated by the iteration
           map of Algorithm 1, then each cluster point (i.e., limit point) of
           {Θ(k) , k = 0, 1, 2, . . .} is a stationary point of Qλ (Θ)

                 1 Tseng P & Yun S. Math. Program., Ser. B, (2009)
Choice of the tuning parameter

               • We use the BIC (a sketch of this selection rule follows this slide):

                                     BICλ = −2ℓ(β̂, σ̂ 2 , η̂) + c · d̂fλ

               • d̂fλ is the number of non-zero elements in β̂λ plus two 1
               • Several authors 2 have used this criterion for variable selection in
                 mixed models with c = log n
               • Other authors 3 have proposed c = log(log(n)) · log(n)

                1 Zou et al. The Annals of Statistics, (2007)
                2 Bondell et al. Biometrics (2010)
                3 Wang et al. JRSS (Ser. B), (2009)
Notre proposition                                                                       46 / 76   .
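           As an illustration of this selection rule, the following R sketch computes BICλ
           over a grid of tuning parameters. The inputs loglik and beta_path are hypothetical
           (ggmix computes these quantities internally), and the "+ 2" term follows the
           degrees-of-freedom definition on the slide above.

            ## Sketch: choose lambda by BIC, given per-lambda log-likelihoods and coefficients
            select_lambda_bic <- function(loglik, beta_path, n, c = log(n)) {
              # loglik:    numeric vector, maximized log-likelihood at each lambda
              # beta_path: p x n_lambda matrix of estimated coefficients
              df_hat <- colSums(beta_path != 0) + 2    # non-zero betas plus two (per the slide)
              bic <- -2 * loglik + c * df_hat
              list(bic = bic, best_index = which.min(bic))
            }
            # alternatives: c = log(n) or c = log(log(n)) * log(n)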
Classical Methods

            Betting on Sparsity

            A Thought Experiment

            Motivating Dataset

            Notre proposition

            Résultats

            Other applications

            Extensions
               Fonctions de pénalité non convexes
               Analyse de survie

            Orientations futures

Résultats                                           47 / 76   .
Simulation study
             • We simulated data from the model Y = Xβ + P + ε, where P is the
               polygenic random effect (a schematic sketch of one replicate follows this
               slide)
             • We used heritability η ∈ {0.1, 0.3}, number of covariates p = 5,000,
               number of kinship SNPs k = 10,000, percentage of causal SNPs
               c ∈ {0%, 1%}, and σ 2 = 1.
            • In addition to these parameters, we also varied the amount of overlap
              between the causal list and the kinship list:
                1. None of the causal SNPs are included in kinship set.
                2. All of the causal SNPs are included in the kinship set.
            • These were meant to contrast the model behavior when causal SNPs
              are included in both the main effects and random effects vs. when the
              causal SNPs are only included in the main effects.
            • These scenarios are motivated by the current standard of practice in
              GWAS where the candidate marker is excluded from the calculation of
              the kinship matrix.
            • This approach becomes much more difficult to apply in large-scale
              multivariable models where there is likely to be overlap between the
              variables in the design matrix and kinship matrix.

Résultats                                                                             48 / 76   .
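           Below is a schematic R sketch of one simulation replicate. The sample size and the
           exact variance decomposition (P ~ N(0, η σ² K), ε ~ N(0, (1 − η) σ² I)) are
           assumptions made for illustration and may differ from the settings used in the paper.

            ## Illustrative sketch of one replicate (assumed parametrization; n is arbitrary)
            set.seed(1)
            n <- 500; p <- 5000; k <- 10000
            eta <- 0.3; sigma2 <- 1; n_causal <- round(0.01 * p)

            X     <- matrix(rbinom(n * p, 2, 0.3), n, p)   # SNPs entering the design matrix
            X_kin <- matrix(rbinom(n * k, 2, 0.3), n, k)   # SNPs used to build the kinship
            K <- tcrossprod(scale(X_kin)) / k              # empirical kinship matrix

            beta <- numeric(p)
            beta[sample(p, n_causal)] <- 1                 # causal effects (overlap with the
                                                           # kinship SNPs is varied in the paper)
            R <- chol(K + 1e-6 * diag(n))                  # jitter for numerical stability
            P <- sqrt(eta * sigma2) * drop(crossprod(R, rnorm(n)))  # polygenic effect
            eps <- rnorm(n, sd = sqrt((1 - eta) * sigma2))
            y <- drop(X %*% beta) + P + eps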
Simulation study results

            • Both the lasso+PC and twostep selected more false positives
              compared to ggmix
            • Overall, we observed that variable selection results and RMSE for
              ggmix were similar regardless of whether the causal SNPs were in the
              kinship matrix or not.
            • This result is encouraging since in practice the kinship matrix is
              constructed from a random sample of SNPs across the genome, some
              of which are likely to be causal, particularly in polygenic traits.
            • In particular, our simulation results show that the principal
              component adjustment method may not be the best approach to
              control for confounding by population structure, particularly when
              variable selection is of interest.

Résultats                                                                            49 / 76   .
Real data applications

            1. UK Biobank
                 ▶ 10,000 LD-pruned SNPs (essentially uncorrelated variables) to predict
                   standing height in 18k related individuals
                 ▶ Standing height is highly polygenic (many variables associated with the
                   response)
             2. GAW20 Simulated dataset
                 ▶ 50,000 SNPs (all on chromosome 1) to predict high-density lipoproteins
                   in 679 related individuals
                 ▶ Not much correlation between the causal SNP and the others
                 ▶ Very sparse signal (only 1 causal variant)
             3. Mouse Crosses
                 ▶ Find loci associated with mouse sensitivity to mycobacterial infection
                 ▶ 189 samples and 625 microsatellite markers
                 ▶ Highly correlated variables

Résultats                                                                                  50 / 76   .
Results: UK Biobank

             [Figure: (A) RMSE in the model selection set across the λ index for twostep,
             lasso, and ggmix. (B) RMSE in the test set versus log2(number of active
             variables) for twostep, lasso, ggmix, and bslmm.]
Résultats                                                                                                                                                        51 / 76   .
Results: GAW20

                   Method     Median number of active variables (inter-quartile range)   RMSE (SD)
                   twostep    1 (1 – 11)                                                 0.3604 (0.0242)
                   lasso      1 (1 – 15)                                                 0.3105 (0.0199)
                   ggmix      1 (1 – 12)                                                 0.3146 (0.0210)
                   BSLMM      40,737 (39,901 – 41,539)                                   0.2503 (0.0099)

             Table: Summary of model performance based on 200 GAW20 simulations. Five-fold
             cross-validation root-mean-square error was reported for each simulation replicate.

Résultats                                                                                         52 / 76   .
Results: Mouse crosses
             [Figure: marker selection results across chromosomes 1–19 and X for
             (a) twostep, (b) lasso, and (c) ggmix on the mouse crosses data.]

Résultats                                                                                                                                                                                                                                         53 / 76   .
Discussion

             • The two-step procedure leads to a large number of false positives and
               false negatives
             • The principal component adjustment in the lasso may not be sufficient to
               control confounding, particularly when there is a lot of correlation
               between observations
             • ggmix performs well even when the causal variables are used in the
               computation of the kinship matrix
             • ggmix showed the largest improvement over twostep and lasso when
               there were highly correlated variables with a lot of structure (the mouse
               crosses example)

Résultats                                                                            54 / 76   .
ggmix R package
            library(ggmix)
            data("admixed")
            ## fit the penalized linear mixed model over a sequence of tuning parameters;
            ## element names of `admixed` are assumed from the package vignette
            fit <- ggmix(x = admixed$xtrain, y = admixed$ytrain,
                         kinship = admixed$kin_train)
ggmix R package
            ## select the tuning parameter with the high-dimensional BIC (GIC)
            ## and inspect the selected model
            hdbic <- gic(fit)
            plot(hdbic)
            coef(hdbic)
Classical Methods

           Betting on Sparsity

           A Thought Experiment

           Motivating Dataset

           Notre proposition

           Résultats

           Other applications

           Extensions
              Fonctions de pénalité non convexes
              Analyse de survie

           Orientations futures

Other applications                                 57 / 76   .
gSOS

Other applications   58 / 76   .
gSOS

Other applications   59 / 76   .
Classical Methods

             Betting on Sparsity

             A Thought Experiment

             Motivating Dataset

             Notre proposition

             Résultats

             Other applications

             Extensions
                Fonctions de pénalité non convexes
                Analyse de survie

             Orientations futures

Extensions                                           60 / 76   .
SCAD (Fan and Li, JASA, 2001), MCP (Zhang, Ann. Stat., 2010)

Extensions                                               61 / 76   .
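           For reference, the standard definitions of these two penalties from the cited
           papers, written here in LaTeX; a > 2 and γ > 1 are their respective concavity
           parameters.

            % SCAD penalty (Fan and Li, 2001), for a > 2
            p^{\mathrm{SCAD}}_{\lambda}(\beta) =
              \begin{cases}
                \lambda |\beta| & |\beta| \le \lambda \\[4pt]
                \dfrac{2 a \lambda |\beta| - \beta^2 - \lambda^2}{2(a - 1)} & \lambda < |\beta| \le a\lambda \\[4pt]
                \dfrac{(a + 1)\lambda^2}{2} & |\beta| > a\lambda
              \end{cases}

            % MCP (Zhang, 2010), for \gamma > 1
            p^{\mathrm{MCP}}_{\lambda}(\beta) =
              \begin{cases}
                \lambda |\beta| - \dfrac{\beta^2}{2\gamma} & |\beta| \le \gamma\lambda \\[4pt]
                \dfrac{\gamma \lambda^2}{2} & |\beta| > \gamma\lambda
              \end{cases}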
Computational challenges

              • Past approaches to optimization for SCAD/MCP rely upon descent
                methods, first- or second-order

              • e.g., sparsenet (Mazumder et al. 2011) uses coordinate descent with
                full step size, whose coordinate update cycles through

                      β̃j = Sγk ( ∑_{i=1}^n (yi − ỹi^(j) ) xij , λℓ ),   where ỹi^(j) = ∑_{k≠j} xik β̃k

                (a generic sketch of such an update follows this slide)

              • However, coordinate descent is difficult to vectorize, and its rate of
                convergence is difficult to establish – though past literature suggests an
                O(1/k) rate of convergence for ISTA

Extensions                                                                               62 / 76   .
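           To illustrate the kind of update quoted above, here is a generic R sketch of one
           full coordinate descent cycle with an MCP-style thresholding operator. It assumes
           standardized columns with ∑i xij² = n and γ > 1, and is an illustration of the idea
           rather than sparsenet's actual implementation.

            ## Generic thresholded coordinate-descent cycle (illustration, not sparsenet code)
            threshold_mcp <- function(z, lambda, gamma) {
              soft <- sign(z) * pmax(abs(z) - lambda, 0)         # soft-thresholding
              ifelse(abs(z) <= gamma * lambda, soft / (1 - 1 / gamma), z)
            }

            cd_cycle <- function(X, y, beta, lambda, gamma) {
              n <- nrow(X)
              for (j in seq_len(ncol(X))) {
                r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]    # partial residual excluding x_j
                z_j <- sum(X[, j] * r_j) / n                     # univariate least-squares solution
                beta[j] <- threshold_mcp(z_j, lambda, gamma)     # non-convex thresholding update
              }
              beta
            }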
 Our proposal: Accelerated gradient (AG) method

            Yang K, Asgharian M, Bhatnagar SR. Improving Convergence for Nonconvex
            Composite Programming. 1

              • Ghadimi and Lan proposed an algorithm for high-dimensional nonconvex
                composite problems, but there is no explicit rule for choosing its tuning
                parameters so that the algorithm converges quickly
              • We interpret their scheme as a damped accelerated gradient method and,
                based on this interpretation, propose an approach for selecting the
                parameters to improve convergence
              • Numerical studies on high-dimensional nonconvex sparse learning problems
                (image denoising, statistical genetics) with over 10,000 variables show
                convergence that is, on average, considerably faster than ISTA

              1 https://arxiv.org/abs/2009.10629
Extensions                                                                               63 / 76   .
Numerical Study for SCAD

              [Figure: log(gk − g∗ ) versus iteration k (0 to 1000) for AG and ISTA on the
              SCAD-penalized problem.]

              xi ~ iid N (0, I), εi ~ iid N (0, σ 2 ), y = Xτgenerate + ε, σ 2 = ∥τgenerate ∥2 / 3,
              τgenerate ∈ R^10006 is a sparse constant vector with 6 values of
              1.23 (intercept), 3, 4, 5, 6, 59 as true effect coefficients and 10000 values of
              0. Start point: τ0 = 1_{10006} , a = 3.7, λ = 0.6.
Extensions                                                                                       64 / 76   .
Numerical Study for MCP

              [Figure: log(gk − g∗ ) versus iteration k (0 to 3000) for AG and ISTA on the
              MCP-penalized problem.]

              Simulation settings are the same as for SCAD above, with γ = 2.5, λ = 0.6.

Extensions                                                                               65 / 76   .
casebase

              Bhatnagar SR, Turgeon M, Islam J, Hanley JA, Saarela O. casebase: An
              Alternative Framework For Survival Analysis and Comparison of Event Rates. 1

              "In epidemiological studies of time-to-event data, a quantity of interest to the
              clinician and the patient is the risk of an event given a covariate profile."

              1 https://arxiv.org/abs/2009.10264, https://cran.r-project.org/package=casebase
Extensions                                                                                      66 / 76   .
Case-base sampling
              [Figure: population size versus follow-up time in the control and screening
              groups, with the sampled person-moments shown as the case series and the
              base series.]
Extensions                                                                                                                        67 / 76   .
Case-base sampling

             • The unit of analysis is a person-moment.
             • Case-base sampling reduces the model fitting to a familiar logistic
               regression.
             • The sampling process is taken into account using an offset term.
             • By sampling a large base series, the information loss eventually
               becomes negligible.
              • This framework can easily be used with time-varying covariates (e.g.,
                a time-varying exposure). We can fit any hazard λ of the following
                form:
                                    log λ(t; α, β) = g(t; α) + βX
              • Different choices of the function g lead to familiar parametric
                families (a schematic sketch of the sampling and fit follows this slide):
                  ▶ Exponential: g is constant.
                  ▶ Gompertz: g(t; α) = αt.
                  ▶ Weibull: g(t; α) = α log t.

Extensions                                                                            68 / 76   .
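           The schematic R sketch below mimics the sampling and fitting steps described
           above for single-record survival data (time, status, one covariate). The casebase
           package automates this workflow (via fitSmoothHazard()), and its implementation
           details may differ from this illustration.

            ## Conceptual sketch of case-base sampling + logistic fit with an offset
            case_base_fit <- function(time, status, x, ratio = 100) {
              n <- length(time)
              B <- n * ratio                               # size of the base series
              total_pt <- sum(time)                        # total person-time in the study base

              # base series: person-moments sampled uniformly over all follow-up time
              id_b <- sample(n, B, replace = TRUE, prob = time / total_pt)
              base <- data.frame(t = runif(B) * time[id_b], x = x[id_b], event = 0)

              # case series: the observed event times
              case <- data.frame(t = time[status == 1], x = x[status == 1], event = 1)

              dat <- rbind(case, base)
              # each sampled base moment represents total_pt / B units of person-time
              dat$offset <- log(total_pt / B)

              # g(t; alpha) = alpha * log(t) gives a Weibull-type hazard (cf. the slide above)
              glm(event ~ log(t) + x + offset(offset), family = binomial, data = dat)
            }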
Classical Methods

           Betting on Sparsity

           A Thought Experiment

           Motivating Dataset

           Notre proposition

           Résultats

           Other applications

           Extensions
              Fonctions de pénalité non convexes
              Analyse de survie

           Orientations futures

Orientations futures                               69 / 76   .
Orientations futures

                 • ggmix is limited by the number of individuals (it does not scale to the
                   full UK Biobank cohort of 500k) → low-rank approximations of the
                   kinship matrix
                 • Memory issues when the number of covariates in the model exceeds 50k
                   → memory-mapping strategies (e.g., biglasso by Zeng and Breheny
                   (2017))
                 • Extension to multivariate and longitudinal data, and to combinations of
                   several cohorts → multiple random effects.

Orientations futures                                                                    70 / 76   .
Acknowledgements

Orientations futures   71 / 76   .
Acknowledgements

Orientations futures   72 / 76   .
Acknowledgements

Orientations futures   73 / 76   .
Acknowledgements

         • Masoud Asgharian (McGill)
         • Tianyuan Lu (McGill)
         • Yi Yang (McGill)
         • Karim Oualkacha (UQÀM)
         • Celia Greenwood (Lady Davis
           Institute)
         • Erica Moodie (McGill)
         • James Hanley (McGill)
         • Maxime Turgeon (UManitoba)
         • Olli Saarela (UofT)
         • Luda Diatchenko (McGill)
         • UK Biobank Resource under project
           number 27449. We appreciate the
           generosity of UK Biobank volunteers

Orientations futures                             74 / 76   .
References

               1. Yang K, Asgharian M, Bhatnagar SR (2020+). Improving Rate of
                  Convergence for Nonconvex Composite Programming. Submitted to
                  Optimization Letters. https://arxiv.org/abs/2009.10629.
               2. Bhatnagar SR, Turgeon M, Islam J, Hanley JA, Saarela O (2020+).
                  casebase: An Alternative Framework For Survival Analysis and
                  Comparison of Event Rates. Submitted to Journal of Statistical
                  Software. https://arxiv.org/abs/2009.10264.
               3. Bhatnagar SR, Yang Y, Lu T, Schurr E, Loredo-Osti JC, Forest M,
                  Oualkacha K, Greenwood CMT (2020). Simultaneous SNP selection
                  and adjustment for population structure in high dimensional
                  prediction models. PLoS Genetics 16(5): e1008766. DOI
                  10.1371/journal.pgen.1008766.

           sahirbhatnagar.com

Orientations futures                                                                75 / 76   .
Session Info

           R version 4.0.2 (2020-06-22)
           Platform: x86_64-pc-linux-gnu (64-bit)
           Running under: Pop!_OS 20.10

           Matrix products: default
           BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
           LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.10.so

           attached base packages:
           [1] stats     graphics grDevices utils      datasets    methods   base

           other attached packages:
           [1] ggmix_0.0.1 knitr_1.31

           loaded via a namespace (and not attached):
            [1] lattice_0.20-41 codetools_0.2-16 glmnet_4.1-1       foreach_1.5.1
            [5] grid_4.0.2       magrittr_2.0.1   evaluate_0.14     highr_0.8
            [9] stringi_1.5.3    Matrix_1.2-18    splines_4.0.2     iterators_1.0.13
           [13] tools_4.0.2      stringr_1.4.0    survival_3.2-3    xfun_0.21
           [17] compiler_4.0.2   shape_1.4.5

Orientations futures                                                                    76 / 76   .