Statistical Science
2020, Vol. 35, No. 4, 609–613
https://doi.org/10.1214/20-STS808
Main article: https://doi.org/10.1214/19-STS701, https://doi.org/10.1214/19-STS733
© Institute of Mathematical Statistics, 2020

Modern Variable Selection in Action:
Comment on the Papers by HTT and BPV
Edward I. George

Edward I. George is the Universal Furniture Professor of Statistics and Economics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA (e-mail: edgeorge@wharton.upenn.edu).

Let me begin by congratulating the authors of these two papers, hereafter HTT and BPV, for their superb contributions to the comparisons of methods for variable selection problems in high dimensional regression. The methods considered are truly some of today's leading contenders for coping with the size and complexity of big data problems of so much current importance. Not surprisingly, there is no clear winner here because the terrain of comparisons is so vast and complex, and no single method can dominate across all situations. The considered setups vary greatly in terms of the number of observations n, the number of predictors p, the number and relative sizes of the underlying nonzero regression coefficients, predictor correlation structures and signal-to-noise ratios (SNRs). And even these only scratch the surface of the infinite possibilities. Further, there is the additional issue as to which performance measure is most important. Is the goal of an analysis exact variable selection or prediction or both? And what about computational speed and scalability? All these considerations would naturally depend on the practical application at hand.

The methods compared by HTT and BPV have been unleashed by extraordinary developments in computational speed, and so it is tempting to distinguish them primarily by their novel implementation algorithms. In particular, the recent integer optimization related algorithms for variable selection differ in fundamental ways from the now widely adopted coordinate ascent algorithms for the lasso related methods. Undoubtedly, the impressive improvements in computational speed unleashed by these algorithms are critical for the feasibility of practical applications. However, the more fundamental story behind the performance differences has to do with the differences between the criteria that their algorithms are seeking to optimize. In an important sense, they are being guided by different solutions to the general variable selection problem.

Focusing first on the paper of HTT, its main thrust appears to have been kindled by the computational breakthrough of Bertsimas, King and Mazumder (2016) (hereafter BKM), which had proposed a mixed integer optimization approach (MIO) for best subsets selection in problems with p in the thousands. Requiring the optimization of ℓ0-constrained least squares, conventional wisdom had long considered best subsets to be the computationally elusive gold standard for variable selection, having defied computation for p much larger than 30. Finally breaking this seemingly impenetrable barrier, MIO had suddenly unleashed a feasible implementation of best subsets for application in sparse high dimensional regression.

Illustrating the performance of MIO, BKM carried out simulation comparisons with some of its most prominent alternatives, including forward stepwise selection and the lasso. A close cousin of best subsets, stepwise is one of the most routinely used computable heuristic approximations for large p. The lasso, on the other hand, differs fundamentally from best subsets by its very nature. Obtained by optimizing an ℓ1-penalized least squares criterion rather than the best subsets ℓ0-constrained criterion, it substitutes a rapidly computable convex optimization problem for an NP-hard nonconvex optimization problem. The BKM simulations demonstrated setups where best subsets substantially dominated both stepwise and the lasso in terms of both predictive squared error loss and variable selection precision, appearing to confirm the gold standard promise of best subsets.

Concerned that BKM's simulation terrain was too limited to come to such a universal conclusion, HTT set out to perform broader simulation comparisons. In particular, the terrain of comparisons has been expanded to include setups with a broader range of SNRs. As opposed to BKM, HTT now include setups with weaker SNRs corresponding to PVE values that more realistically characterize applications often encountered in practice. In addition to comparing best subsets, stepwise and the lasso, HTT include a new contender, the (simplified) relaxed lasso, driven by an interesting combination of the lasso and least squares.

From this broader terrain of comparisons presented by HTT, new patterns of relative performance emerge. To begin with, the performance of stepwise now appears very similar to that of best subsets throughout. The major differences between stepwise and best subsets found by BKM disappear when stepwise is tuned by cross-validation (here on external validation data) rather than AIC. This is valuable to see because the choice of stopping rule has been controversial for applications of stepwise in practice. HTT's comparisons suggest that cross-validated stopping may be the best way to go.
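To make the two criteria at the heart of these comparisons concrete, the following minimal Python sketch fits both on a small simulated problem. The best subsets solution is obtained here by brute-force enumeration over supports of size k, purely for illustration (this is not BKM's MIO formulation, which is what makes the problem tractable for much larger p), and the lasso penalty level alpha = 0.1 is an arbitrary placeholder rather than a tuned value.

    from itertools import combinations

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p, k = 100, 10, 3
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:k] = [3.0, -2.0, 1.5]
    y = X @ beta_true + rng.standard_normal(n)

    # Best subsets: minimize ||y - X b||^2 subject to ||b||_0 <= k,
    # here by exhaustive search over all subsets of size k.
    best_rss, best_support = np.inf, None
    for support in combinations(range(p), k):
        Xs = X[:, list(support)]
        b = np.linalg.lstsq(Xs, y, rcond=None)[0]
        rss = np.sum((y - Xs @ b) ** 2)
        if rss < best_rss:
            best_rss, best_support = rss, support

    # Lasso: minimize (1 / (2n)) ||y - X b||^2 + alpha ||b||_1,
    # a convex criterion solved by coordinate descent in scikit-learn.
    lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)

    print("best subsets support:", best_support)
    print("lasso support:", tuple(np.flatnonzero(lasso.coef_)))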

More importantly, HTT's performance evaluations highlight interesting differences between best subsets and both the lasso and relaxed lasso. In predictive performance, the lasso now appears better than best subsets for smaller SNRs, but worse for larger SNRs, with a performance crossover that varies from setup to setup. The same is true for HTT's selection accuracy metrics, though the lasso regularly produces too many nonzero estimates, especially in the higher SNR settings. An exception where best subsets dominates throughout occurs when the known mutual incoherence (MI) conditions for lasso consistency seem to fail. Remarkably, the relaxed lasso fares extremely well throughout the comparisons. It appears to adaptively move closer to the best of the methods for each SNR value, occasionally even outperforming them all. Although no one of the four procedures is best in every setting, the relaxed lasso certainly comes the closest. Finally, it is notable that in terms of explained variation, namely PVE, no essential differences between the four methods appeared.

The observed performance differences between these methods can be intuitively understood as stemming from how their induced coefficient estimators are driven to balance bias-variance tradeoffs in order to minimize prediction error on the validation data. For best subsets and stepwise, this induced estimator is simply least squares on a selected subset of coefficients. For the lasso, it is a shrinkage adjusted least squares estimator which has been soft-thresholded by a selected λ. And for the relaxed lasso, it is a γ-weighted mixture of a λ-soft-thresholded lasso estimator with a least squares estimator on the lasso's nonzero coordinates, for a selected λ and γ. Note that each of these induced estimators consists of both zero and nonzero estimates of the components of the actual underlying β. Thus, the squared prediction error over the validation set will include the squared bias of the zero estimates in addition to the variance and squared bias of the nonzero estimates.

From this bias-variance tradeoff perspective, it is straightforward to see why best subsets and stepwise provide desirable variable selection and prediction when the variance of the subset least squares estimates is small enough relative to the sizes of the underlying nonzero coefficients. This is exactly what occurs throughout HTT's setups as SNR increases (with β and n fixed) and the subset least squares estimates become more and more accurate. However, when the variance of subset least squares is relatively large compared to the sizes of the nonzero coefficients, which is what occurs in HTT's setups where SNR is small, the accuracy and stability of the least squares estimates can be substantially improved by shrinkage. This is precisely where we see the lasso improvements over best subsets and stepwise. However, there is also a notable conflict between the goals of variable selection and prediction for the lasso. With the choice of λ guided by prediction error, the lasso is forced to shrink less in order not to over-bias the nonzero estimates of the larger sized coefficients. This is what is leading to the lasso's increased number of nonsparse estimates that appears in so many of the plots.

It is very interesting to understand how the relaxed lasso is able to better negotiate the tension between selection and prediction faced by the lasso. Combining the lasso and least squares by unshrinking the nonzero lasso estimates back toward the least squares estimates, what makes the relaxed lasso so effective is that it does this adaptively. At first glance, it may seem puzzling that the relaxed lasso would yield fewer nonzero estimates than the lasso. However, this is explained by the fact that both λ, the amount of lasso shrinkage, and γ, its mixing parameter, are being simultaneously tuned by prediction error on external validation data. By controlling the shrinkage of its nonzero estimates for better predictions, the relaxed lasso can balance increasing thresholding of the smaller estimates with less biasing of the larger estimates. Via the external validation data, it adapts this balance to the data at hand. Thus the relaxed lasso is able to offer good accuracy and prediction error across all the SNR settings.

In a nutshell, the least squares component of the relaxed lasso is serving to stabilize the estimates of the larger size coefficients so that they do not interfere with the lasso variable selection, thereby resulting in fewer false nonzero coefficient estimates. It may be of interest to note that this aspect of the relaxed lasso is similar in spirit to the scalable, fully Bayesian spike-and-slab lasso (SSL) of Ročková and George (2018). The SSL also stabilizes the larger size estimates, but by using the log of a mixture of spike-and-slab Laplace priors as a regularization criterion. In effect, the SSL adaptively applies simultaneous strong thresholding lasso shrinkage to smaller coefficients and weak lasso shrinkage to stabilize the larger coefficients. By including a prior on the mixing weights, the SSL becomes even more self-adaptive, avoiding the need to use cross-validation for mixing weight estimation.
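To make the relaxed lasso's joint tuning of λ and γ concrete, here is a minimal sketch of the simplified relaxed lasso as described above, assuming the data have already been split into a training set and an external validation set and centered so that no intercept is needed; the grids of λ and γ values are illustrative placeholders, and this is only a sketch of the estimator, not HTT's implementation.

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    def relaxed_lasso(X_tr, y_tr, X_va, y_va,
                      lambdas=np.logspace(-2, 1, 20),
                      gammas=np.linspace(0.0, 1.0, 11)):
        """Simplified relaxed lasso, with (lambda, gamma) tuned on validation data."""
        best_err, best_beta = np.inf, None
        for lam in lambdas:
            beta_lasso = Lasso(alpha=lam, fit_intercept=False,
                               max_iter=10000).fit(X_tr, y_tr).coef_
            support = np.flatnonzero(beta_lasso)
            # Least squares refit ("unshrinking") on the lasso's nonzero coordinates.
            beta_ls = np.zeros_like(beta_lasso)
            if support.size:
                beta_ls[support] = LinearRegression(fit_intercept=False).fit(
                    X_tr[:, support], y_tr).coef_
            for gam in gammas:
                # gamma = 1 recovers the lasso; gamma = 0 is the least squares refit.
                beta = gam * beta_lasso + (1.0 - gam) * beta_ls
                err = np.mean((y_va - X_va @ beta) ** 2)
                if err < best_err:
                    best_err, best_beta = err, beta
        return best_beta

Because the search ranges over the full grid of (λ, γ) pairs, the selected estimator can land anywhere between full lasso shrinkage and a pure least squares refit on the lasso support, which is exactly the adaptivity described above.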

Turning now to focus on the paper by BPV, which followed the paper by HTT, its main thrust revolves around CIS and SS, two newer approaches introduced by the authors in Bertsimas, Pauphilet and Van Parys (2017) and Bertsimas and Van Parys (2020). In contrast to the ℓ0-constrained best subsets criterion of MIO, both CIS and SS are driven by an ℓ0-constrained, ℓ2-penalized least squares criterion for variable selection in linear regression. CIS does this with a cutting-plane algorithm for fast integer optimization, while SS applies a dual subgradient algorithm to a continuous Boolean relaxation of the criterion. These algorithms are impressive and effective advances, especially in terms of their speed and ability to handle larger problems. As seen across the simulations, CIS and SS offer dramatic improvements in speed over MIO, coping now with problems of size n = 1000 and p = 20,000 within seconds of the glmnet implementation of the lasso. Apparently, they can even scale further to feasibly handle problems of sizes n, p of 100,000 or n = 10,000 and p = 1,000,000 within minutes, a major step forward.

Beyond this increased feasibility for larger problems, which is obviously a huge boon for practicality, the statistical value of CIS and SS ultimately rests on the inferential consequences of adding an ℓ2-penalization to the ℓ0-constrained least squares criterion of MIO. With this addition, an ℓ2-penalized subset least squares estimator, adaptively tuned by cross-validation, is now induced to reduce prediction error. Such ℓ2-penalization was implicit in the early shrinkage proposals of ridge regression and Stein estimation for improved prediction as well as stabilization of least squares estimation (Hoerl and Kennard, 1970; Stein, 1960; James and Stein, 1961). But in contrast to ℓ1-induced shrinkage, ℓ2-penalization alone does not lead to thresholding and so is not directly useful for selection. However, when combined with the ℓ0-constrained criterion as BPV have done, the improved accuracy of the induced coefficient estimator can be expected to lead to both improved variable selection and prediction, especially in settings where the variance of least squares is large relative to the sizes of the nonzero regression coefficients. It may be of interest to note that such thresholding of an ℓ2 shrinkage estimator is akin to the automatic thresholding by the positive-part James–Stein estimator, which also has the attractive theoretical property of being classically minimax under squared predictive loss. Further, more powerful minimax versions of such positive-part shrinkage estimators for selection were developed by Zhou and Hwang (2005).

Illustrating the statistical potential of CIS and SS, BPV go on to compare their performance to three prominent shrinkage-selection methods that use enhanced lasso related penalties, namely ENet (the elastic net), MCP (minimax concave penalty) and SCAD (smoothly clipped absolute deviation). ENet enhances the lasso criterion with the addition of an ℓ2-penalization, in the same way that CIS and SS enhance the best subsets criterion. More precisely, ENet is driven by a mixture of ℓ1 and ℓ2 penalties, parametrized by a shrinkage parameter λ and a mixing parameter α, both of which are then tuned by cross-validation. The combination of the ℓ1 and ℓ2 penalization serves to stabilize the shrinkage of subset least squares, especially in the context of highly correlated predictors. On the other hand, both SCAD and MCP are driven by carefully tailored nonconvex relaxations of the ℓ1 criterion that automatically allow for lighter shrinkage of the estimates of the larger coefficients, in that way mitigating the tension between selection and prediction faced by the lasso. Both MCP and SCAD are parametrized by both a shrinkage parameter λ and a shape parameter γ, which are adaptively tuned to the data via cross-validation. Just as for the HTT comparisons, the statistical performance differences between all these methods can be intuitively understood from the bias-variance tradeoffs faced by their induced coefficient estimators as they are predictively tuned by cross-validation.

BPV carry out simulated comparisons of CIS, SS, ENet, MCP and SCAD across various versions of the basic underlying Gaussian structure used by HTT. For their terrain of comparisons, they consider six noise/correlation settings with three noise levels: low (SNR = 6, k = 100, p = 20,000), medium (SNR = 1, k = 50, p = 10,000) and high (SNR = 0.05, k = 10, p = 2000), and two predictor correlation levels: ρ = 0.2, 0.7. Although there are lots of moving parts across these setups, one can say roughly that the low noise settings, where the signal is easiest to measure, are the least challenging for variable selection, whereas the high noise settings, where the signal is most hidden, are the most challenging and perhaps most like what is most often encountered in practice. As elaborated by HTT, both the low and medium noise SNR values correspond to PVE values that may be less frequently encountered in practice. It should also be noted that unlike HTT, BPV vary the sample size n within each of their settings, thereby also illustrating performance differences as n increases. Revealing the limiting behavior of the methods, increasing n has the effect of making the problems statistically easier in terms of measuring the signal. By decreasing the implicit variance of the subset least squares estimates on which shrinkage is applied, increasing n thus diminishes the importance of shrinkage for each of the methods. In this way, the effect of increasing n parallels the effect of increasing SNR.

Beyond these settings, BPV also provide valuable comparisons on simulated data where the MI conditions for lasso consistency fail, as well as comparisons using data generated with a real design matrix X with p = 14,858 gene expressions for n = 1145 lung cancer patients. With this X, they simulate hypothetical continuous patient outcomes from various Gaussian linear models with SNR levels varying from 0.05 to 6, as well as discrete 0–1 patient outcomes by thresholding these linear models. Extending the methods for the classification data, they apply CIS and SS driven by hinge loss, and ENet, MCP and SCAD driven by logistic loss.
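Before turning to the performance results, a toy sketch may help make concrete the ℓ0-constrained, ℓ2-penalized least squares criterion that drives CIS and SS in the regression setting discussed above. The brute-force enumeration below is feasible only for very small p and is emphatically not BPV's cutting-plane or dual subgradient machinery; it is meant only to show how ridge shrinkage is applied within each candidate subset, with the subset size k and ridge penalty γ treated as given (in practice both would be tuned by cross-validation).

    from itertools import combinations

    import numpy as np

    def l0_l2_subset(X, y, k, gamma):
        """Brute-force minimizer of ||y - X b||^2 + gamma ||b||^2 over ||b||_0 <= k."""
        p = X.shape[1]
        best_obj, best_beta = np.inf, None
        for support in combinations(range(p), k):
            idx = list(support)
            Xs = X[:, idx]
            # Ridge-regularized least squares restricted to the candidate subset
            # (gamma > 0 keeps the linear system well conditioned).
            b = np.linalg.solve(Xs.T @ Xs + gamma * np.eye(k), Xs.T @ y)
            resid = y - Xs @ b
            obj = resid @ resid + gamma * (b @ b)
            if obj < best_obj:
                beta = np.zeros(p)
                beta[idx] = b
                best_obj, best_beta = obj, beta
        return best_beta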

For performance evaluations of the competing methods, BPV highlight their potential for variable selection with accuracy A (the fraction of true nonzeros found), FDR (the false discovery rate), and ROC curves of true positives versus false positives. Taken together, these reveal the selection tradeoffs obtained by each of the methods. For predictive performance, they report MSE, casting prediction primarily as a "practical implication of accuracy." And to convey the practicality of their CIS and SS implementations, they also report speed comparisons across various settings, gauging the speed of every method against the gold standard fastest speed of the lasso.

Across all these simulation comparisons, many different patterns of relative accuracy, false discovery rates and prediction quality emerge. Consider first the accuracy and MSE comparisons when the actual underlying sparsity level k_true is treated as fixed and known, so that A = 1 − FDR. This constraint forces all the methods to explicitly trade correct selection for false detections. It is hence not surprising that ENet is the least accurate, as it also suffers from the lasso's tension between selection and prediction. In contrast, it is impressive that both CIS and MCP are the most accurate throughout, MCP apparently avoiding this tradeoff tension with its relaxed penalty. More pronounced in the low noise and high correlation settings, these differences diminish rapidly, as we would expect, when n and/or the noise level is increased, virtually disappearing in the high noise setting. For prediction in the low and medium noise settings, CIS is best, slightly beating SS, MCP and SCAD, with ENet clearly worst, though again all these differences seem to disappear in the more statistically difficult high noise setting. The speed comparisons here demonstrate that CIS and SS are dramatically faster than MIO, and even computationally comparable with MCP and SCAD for larger n. Indeed, in the high noise setting, SS appears to be nearly as fast as ENet, which is as fast as the lasso in every setting.

For the more realistic settings where the sparsity level k_true is unknown and estimated via cross-validation, we first see from the ROC curves that CIS, SS and MCP essentially provide the most desirable tradeoffs between true and false selection in the low and medium noise settings, with ENet performing worst. However, these differences become much less pronounced and even reversed in the high noise setting, especially when the predictor correlations are high. For prediction in the low and medium noise settings, CIS continues to be best up to around k_true, with MCP becoming best at the higher levels of estimated sparsity, and with ENet worst throughout. It is interesting that the predictions of MCP, SCAD and ENet continue to improve with increasing k, suggesting that they add more variables to improve their predictive cross-validation tuning. In the high noise setting, it appears that MCP and SCAD predict best, with ENet second under high correlation, although the wide error bars for these comparisons suggest such differences may not be real. In terms of accuracy, ENet, MCP and SCAD are best, followed by CIS and then SS in the low and medium noise settings, all differences disappearing as n increases. In the high noise setting, these differences are more pronounced and persistent. However, in terms of FDR, CIS is clearly best and ENet persistently worst in all settings. SS and MCP are also sometimes best, improving along with SCAD as n increases. As BPV suggest, this overall observed tendency of ENet, MCP and SCAD to select many more variables than are selected by CIS and SS can be understood to be a consequence of using soft penalization rather than a hard cardinality constraint to drive variable selection and prediction.

For the settings where the MI conditions for lasso consistency fail, ENet is seen, not surprisingly, to be among the least accurate across all three noise settings, with an accuracy converging to less than 1 for large n. In contrast, the accuracies of the other four methods, which do not rely on the MI conditions, converge to 1 in the low and medium noise settings. CIS and SS are best in all settings, along with MCP, which is also one of the best in the low noise setting. CIS and SS are also seen to dominate in prediction in the low and medium noise settings. However, in the high noise setting, the predictive performances of all the methods become indistinguishable.

Lastly, we consider the simulated comparisons based on the real data design matrix, where the predictor correlations are apt to be less symmetrically structured and perhaps more realistic. For the linear model setup, as SNR increases, ENet is increasingly the least accurate with the most false detections, while MCP is increasingly the most accurate with the fewest false detections. However, these differences all disappear at the lowest SNR levels, where the variable selection problem is most challenging. In terms of the ROC curve tradeoffs, ENet goes from best to worst and MCP goes from worst to best as the SNR level is increased from 0.05 to 6. For the classification model setup, we see ENet, MCP and SCAD generally providing better accuracy compared to CIS and SS across all the noise settings, with the differences diminishing as n, the noise level and the predictor correlations are increased. But in terms of false detection rates, this ordering is reversed, except for MCP, which also obtains lower false detection in the low noise settings. For the classifications obtained by their respective cross-validated support selections, SS appears to provide the most desirable tradeoffs between accuracy, false detection and sparsity.
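As a brief recap of the selection metrics used throughout these comparisons, the following sketch computes A and FDR from a true and an estimated coefficient vector; the function name and the convention of reporting FDR as 0 when nothing is selected are illustrative assumptions.

    import numpy as np

    def selection_metrics(beta_true, beta_hat):
        """Accuracy A (fraction of true nonzeros found) and false discovery rate."""
        true_support = np.flatnonzero(beta_true)
        selected = np.flatnonzero(beta_hat)
        tp = np.intersect1d(true_support, selected).size
        A = tp / true_support.size if true_support.size else 0.0
        FDR = 1.0 - tp / selected.size if selected.size else 0.0
        return A, FDR

When the number of selected variables is fixed at the true sparsity level k_true, the selected set and the true support have the same size, and the identity A = 1 − FDR noted above follows immediately.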

In concluding, let me thank HTT and BPV for their illuminating comparisons of some of the leading approaches for variable selection in high dimensional regression, and for their many insightful summary conclusions with which I agree. Together, they have provided convincing evidence that the relaxed lasso, CIS and SS will be very valuable additions to the arsenal of modern variable selection approaches. Of course, as so aptly demonstrated throughout the simulations, no single approach will dominate all others in every setting, especially as pertains to the ease of detecting the underlying signal, which is typically unknown. So in practice it makes good sense to consider a collection of methods together, for example, beginning an application with lasso predictive screening and then proceeding from there with other methods, as BPV have suggested. Finally, it should be emphasized that the arsenal of valuable modern variable selection approaches goes well beyond the methods considered here, as HTT have indicated. In particular, in line with the underlying Bayesian nature of penalized likelihood methods, a growing number of promising scalable Bayesian variable selection approaches, for example, the spike-and-slab lasso described earlier, are valuable additions to the arsenal and should not be overlooked.

ACKNOWLEDGMENT

This work was supported by NSF Grant DMS-1916245.

REFERENCES

Bertsimas, D., King, A. and Mazumder, R. (2016). Best subset selection via a modern optimization lens. Ann. Statist. 44 813–852. MR3476618 https://doi.org/10.1214/15-AOS1388
Bertsimas, D., Pauphilet, J. and Van Parys, B. (2017). Sparse classification: A scalable discrete optimization perspective. Under revision.
Bertsimas, D. and Van Parys, B. (2020). Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. Ann. Statist. 48 300–323. MR4065163 https://doi.org/10.1214/18-AOS1804
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 55–67.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I 361–379. Univ. California Press, Berkeley, CA. MR0133191
Ročková, V. and George, E. I. (2018). The spike-and-slab LASSO. J. Amer. Statist. Assoc. 113 431–444. MR3803476 https://doi.org/10.1080/01621459.2016.1260469
Stein, C. M. (1960). Multiple regression. In Essays in Honor of Harold Hotelling. Stanford Univ. Press, Palo Alto, CA.
Zhou, H. H. and Hwang, J. T. G. (2005). Minimax estimation with thresholding and its application to wavelet analysis. Ann. Statist. 33 101–125. MR2157797 https://doi.org/10.1214/009053604000000977