Modern Variable Selection in Action: Comment on the Papers by HTT and BPV
Statistical Science 2020, Vol. 35, No. 4, 609–613
https://doi.org/10.1214/20-STS808
Main article: https://doi.org/10.1214/19-STS701, https://doi.org/10.1214/19-STS733
© Institute of Mathematical Statistics, 2020

Edward I. George

Edward I. George is the Universal Furniture Professor of Statistics and Economics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA (e-mail: edgeorge@wharton.upenn.edu).

Let me begin by congratulating the authors of these two papers, hereafter HTT and BPV, for their superb contributions to the comparisons of methods for variable selection problems in high dimensional regression. The methods considered are truly some of today's leading contenders for coping with the size and complexity of big data problems of so much current importance. Not surprisingly, there is no clear winner here because the terrain of comparisons is so vast and complex, and no single method can dominate across all situations. The considered setups vary greatly in terms of the number of observations n, the number of predictors p, the number and relative sizes of the underlying nonzero regression coefficients, predictor correlation structures and signal-to-noise ratios (SNRs). And even these only scratch the surface of the infinite possibilities. Further, there is the additional issue as to which performance measure is most important. Is the goal of an analysis exact variable selection or prediction or both? And what about computational speed and scalability? All these considerations would naturally depend on the practical application at hand.

The methods compared by HTT and BPV have been unleashed by extraordinary developments in computational speed, and so it is tempting to distinguish them primarily by their novel implementation algorithms. In particular, the recent integer optimization related algorithms for variable selection differ in fundamental ways from the now widely adopted coordinate ascent algorithms for the lasso related methods. Undoubtedly, the impressive improvements in computational speed unleashed by these algorithms are critical for the feasibility of practical applications. However, the more fundamental story behind the performance differences has to do with the differences between the criteria that their algorithms are seeking to optimize. In an important sense, they are being guided by different solutions to the general variable selection problem.

Focusing first on the paper of HTT, its main thrust appears to have been kindled by the computational breakthrough of Bertsimas, King and Mazumder (2016) (hereafter BKM), which had proposed a mixed integer optimization approach (MIO) for best subsets selection in problems with p as large as in the thousands. Requiring the optimization of ℓ0-constrained least squares, conventional wisdom had long considered best subsets to be the computationally elusive gold standard for variable selection, having defied computation for p much larger than 30. Finally breaking this seemingly impenetrable barrier, MIO had suddenly unleashed a feasible implementation of best subsets for application in sparse high dimensional regression.

Illustrating the performance of MIO, BKM carried out simulation comparisons with some of its most prominent alternatives, including forward stepwise selection and the lasso. A close cousin of best subsets, stepwise is one of the most routinely used computable heuristic approximations for large p. The lasso, on the other hand, differs fundamentally from best subsets by its very nature. Obtained by optimizing an ℓ1-penalized least squares criterion rather than the best subsets ℓ0-constrained criterion, it substitutes a rapidly computable convex optimization problem for an NP-hard nonconvex optimization problem. The BKM simulations demonstrated setups where best subsets substantially dominated both stepwise and the lasso in terms of both predictive squared error loss and variable selection precision, appearing to confirm the gold standard promise of best subsets.

Concerned that BKM's simulation terrain was too limited to come to such a universal conclusion, HTT set out to perform broader simulation comparisons. In particular, the terrain of comparisons has been expanded to include setups with a broader range of SNRs. As opposed to BKM, HTT now include setups with weaker SNRs corresponding to PVE values that more realistically characterize applications often encountered in practice. In addition to comparing best subsets, stepwise and the lasso, HTT include a new contender, the (simplified) relaxed lasso, driven by an interesting combination of the lasso and least squares.

From this broader terrain of comparisons presented by HTT, new patterns of relative performance emerge. To begin with, the performance of stepwise now appears very similar to that of best subsets throughout. The major differences between stepwise and best subsets found by BKM disappear when stepwise is tuned by cross-validation (here on external validation data) rather than AIC. This is valuable to see because the choice of stopping rule has been controversial for applications of stepwise in practice. HTT's comparisons suggest that cross-validated stopping may be the best way to go.

More importantly, HTT's performance evaluations highlight interesting differences between best subsets and both the lasso and relaxed lasso. In predictive performance, the lasso now appears better than best subsets for smaller SNRs, but worse for larger SNRs, with a performance crossover that varies from setup to setup. The same is true for HTT's selection accuracy metrics, though the lasso regularly produces too many nonzero estimates, especially in the higher SNR settings. An exception where best subsets dominates throughout occurs when the known mutual incoherence (MI) conditions for lasso consistency seem to fail. Remarkably, the relaxed lasso fares extremely well throughout the comparisons. It appears to adaptively move closer to the best of the methods for each SNR value, occasionally even outperforming them all. Although no one of the four procedures is best in every setting, the relaxed lasso certainly comes the closest. Finally, it is notable that in terms of explained variation, namely PVE, no essential differences between the four methods appeared.

The observed performance differences between these methods can be intuitively understood as stemming from how their induced coefficient estimators are driven to balance bias-variance tradeoffs in order to minimize prediction error on the validation data. For best subsets and stepwise, this induced estimator is simply least squares on a selected subset of coefficients. For the lasso, it is a shrinkage adjusted least squares estimator which has been soft-thresholded by a selected λ. And for the relaxed lasso, it is a γ-weighted mixture of a λ-soft-thresholded lasso estimator with a least squares estimator on the lasso's nonzero coordinates, for a selected λ and γ. Note that each of these induced estimators consists of both zero and nonzero estimates of the components of the actual underlying β. Thus, the squared prediction error over the validation set will include the squared bias of the zero estimates in addition to the variance and squared bias of the nonzero estimates.

From this bias-variance tradeoff perspective, it is straightforward to see why best subsets and stepwise provide desirable variable selection and prediction when the variance of the subset least squares estimates is small enough relative to the sizes of the underlying nonzero coefficients. This is exactly what occurs throughout HTT's setups as SNR increases (with β and n fixed) and the subset least squares estimates become more and more accurate. However, when the variance of subset least squares is relatively large compared to the sizes of the nonzero coefficients, which is what occurs in HTT's setups where SNR is small, the accuracy and stability of the least squares estimates can be substantially improved by shrinkage. This is precisely where we see the lasso improvements over best subsets and stepwise. However, there is also a notable conflict between the goals of variable selection and prediction for the lasso. With the choice of λ guided by prediction error, the lasso is forced to shrink less in order not to over-bias the nonzero estimates of the larger sized coefficients. This is what is leading to the lasso's increased number of nonsparse estimates that appears in so many of the plots.

It is very interesting to understand how the relaxed lasso is able to better negotiate the tension between selection and prediction faced by the lasso. Combining the lasso and least squares by unshrinking the nonzero lasso estimates back toward the least squares estimates, what makes the relaxed lasso so effective is that it does this adaptively. At first glance, it may seem puzzling that the relaxed lasso would yield fewer nonzero estimates than the lasso. However, this is explained by the fact that both λ, the amount of lasso shrinkage, and γ, its mixing parameter, are being simultaneously tuned by prediction error on external validation data. By controlling the shrinkage of its nonzero estimates for better predictions, the relaxed lasso can balance increasing thresholding of the smaller estimates with less biasing of the larger estimates. Via the external validation data, it adapts this balance to the data at hand. Thus the relaxed lasso is able to offer good accuracy and prediction error across all the SNR settings. In a nutshell, the least squares component of the relaxed lasso is serving to stabilize the estimates of the larger size coefficients so that they do not interfere with the lasso variable selection, thereby resulting in fewer false nonzero coefficient estimates. It may be of interest to note that this aspect of the relaxed lasso is similar in spirit to the scalable, fully Bayesian spike-and-slab lasso (SSL) of Ročková and George (2018). The SSL also stabilizes the larger size estimates, but by using the log of a mixture of spike-and-slab Laplace priors as a regularization criterion. In effect, the SSL adaptively applies simultaneous strong thresholding lasso shrinkage to smaller coefficients and weak lasso shrinkage to stabilize the larger coefficients. By including a prior on the mixing weights, the SSL becomes even more self-adaptive, avoiding the need to use cross-validation for mixing weight estimation.

Turning now to focus on the paper by BPV, which followed the paper by HTT, its main thrust revolves around CIS and SS, two newer approaches introduced by the authors in Bertsimas, Pauphilet and Van Parys (2017) and Bertsimas and Van Parys (2020). In contrast to the ℓ0-constrained best subsets criterion of MIO, both CIS and SS are driven by an ℓ0-constrained, ℓ2-penalized least squares criterion for variable selection in linear regression. CIS does this with a cutting-plane algorithm for fast integer optimization, while SS applies a dual subgradient algorithm to a continuous Boolean relaxation of the criterion. These algorithms are impressive and effective advances, especially in terms of their speed and ability to handle larger problems.
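To make the relaxed lasso mechanism concrete, here is a minimal numpy sketch of the induced estimator described above: a γ-weighted mixture of a lasso fit and least squares on the lasso's nonzero set, with (λ, γ) tuned by prediction error on external validation data. This is my own bare-bones illustration, not HTT's implementation (which builds on the glmnet machinery); the helper names `lasso_cd` and `relaxed_lasso` are hypothetical, and the coordinate-descent lasso is deliberately minimal.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso for (1/(2n))||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]        # partial residual excluding b_j
            rho = X[:, j] @ r_j / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

def relaxed_lasso(X, y, Xv, yv, lams, gammas):
    """For each lambda, blend the lasso fit with least squares restricted to the
    lasso's nonzero set; pick (lambda, gamma) by squared error on (Xv, yv)."""
    best_err, best_b = np.inf, None
    for lam in lams:
        b_lasso = lasso_cd(X, y, lam)
        S = np.flatnonzero(b_lasso)                 # lasso's selected support
        if S.size == 0:
            continue
        b_ls = np.zeros_like(b_lasso)
        b_ls[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
        for g in gammas:
            b = g * b_lasso + (1 - g) * b_ls        # gamma-weighted mixture
            err = np.mean((yv - Xv @ b) ** 2)
            if err < best_err:
                best_err, best_b = err, b
    return best_b
```

With γ = 1 this recovers the lasso and with γ = 0 the least squares refit on the lasso support, so tuning γ on the validation data lets the data choose how much of the lasso's shrinkage to undo.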
As seen across the simulations, CIS and SS offer dramatic improvements in speed over MIO, coping now with problems of size n = 1000 and p = 20,000 within seconds of the glmnet implementation of the lasso. Apparently, they can even scale further to feasibly handle problems of sizes n, p of 100,000, or n = 10,000 and p = 1,000,000, within minutes, a major step forward.

Beyond this increased feasibility for larger problems, which is obviously a huge boon for practicality, the statistical value of CIS and SS ultimately rests on the inferential consequences of adding an ℓ2-penalization to the ℓ0-constrained least squares criterion of MIO. With this addition, an ℓ2-penalized subset least squares estimator, adaptively tuned by cross-validation, is now induced to reduce prediction error. Such ℓ2-penalization was implicit in the early shrinkage proposals of ridge regression and Stein estimation for improved prediction as well as stabilization of least squares estimation (Hoerl and Kennard, 1970; Stein, 1960; James and Stein, 1961). But in contrast to ℓ1 induced shrinkage, ℓ2-penalization alone does not lead to thresholding and so is not directly useful for selection. However, when combined with the ℓ0-constrained criterion as BPV have done, the improved accuracy of the induced coefficient estimator can be expected to lead to both improved variable selection and prediction, especially in settings where the variance of least squares is large relative to the sizes of the nonzero regression coefficients. It may be of interest to note that such thresholding of an ℓ2 shrinkage estimator is akin to the automatic thresholding by the positive-part James–Stein estimator, which also has the attractive theoretical property of being classically minimax under squared predictive loss. Further, more powerful minimax versions of such positive-part shrinkage estimators for selection were developed by Zhou and Hwang (2005).

Illustrating the statistical potential of CIS and SS, BPV go on to compare their performance to three prominent shrinkage-selection methods that use enhanced lasso related penalties, namely ENet (the elastic net), MCP (the minimax concave penalty) and SCAD (smoothly clipped absolute deviation). ENet enhances the lasso criterion with the addition of an ℓ2-penalization, in the same way that CIS and SS enhance the best subsets criterion. More precisely, ENet is driven by a mixture of ℓ1 and ℓ2 penalties, parametrized by a shrinkage parameter λ and a mixing parameter α, both of which are then tuned by cross-validation. The combination of the ℓ1 and ℓ2 penalization serves to stabilize the shrinkage of subset least squares, especially in the context of highly correlated predictors. On the other hand, both SCAD and MCP are driven by carefully tailored nonconvex relaxations of the ℓ1-criterion that automatically allow for lighter shrinkage of the estimates of the larger coefficients, in that way mitigating the tension between selection and prediction faced by the lasso. Both MCP and SCAD are parametrized by both a shrinkage parameter λ and a shape parameter γ, which are adaptively tuned to the data via cross-validation. Just as for the HTT comparisons, the statistical performance differences between all these methods can be intuitively understood from the bias-variance tradeoffs faced by their induced coefficient estimators as they are predictively tuned by cross-validation.

BPV carry out simulated comparisons of CIS, SS, ENet, MCP and SCAD across various versions of the basic underlying Gaussian structure used by HTT. For their terrain of comparisons, they consider six noise/correlation settings with three noise levels: low (SNR = 6, k = 100, p = 20,000), medium (SNR = 1, k = 50, p = 10,000) and high (SNR = 0.05, k = 10, p = 2000), and two predictor correlation levels: ρ = 0.2, 0.7. Although there are lots of moving parts across these setups, one can say roughly that the low noise settings, where the signal is easiest to measure, are the least challenging for variable selection, whereas the high noise settings, where the signal is most hidden, are the most challenging and perhaps most like what is most often encountered in practice. As elaborated by HTT, both the low and medium noise SNR values correspond to PVE values that may be less frequently encountered in practice. It should also be noted that unlike HTT, BPV vary the sample size n within each of their settings, thereby also illustrating performance differences as n increases. Revealing the limiting behavior of the methods, increasing n has the effect of making the problems statistically easier in terms of measuring the signal. By decreasing the implicit variance of the subset least squares estimates on which shrinkage is applied, increasing n thus diminishes the importance of shrinkage for each of the methods. In this way, the effect of increasing n parallels the effect of increasing SNR.

Beyond these settings, BPV also provide valuable comparisons on simulated data where the MI conditions for lasso consistency fail, as well as comparisons using data generated with a real design matrix X with p = 14,858 gene expressions for n = 1145 lung cancer patients. With this X, they simulate hypothetical continuous patient outcomes from various Gaussian linear models with SNR levels varying from 0.05 to 6, as well as discrete 0–1 patient outcomes by thresholding these linear models. Extending the methods for the classification data, they apply CIS and SS driven by hinge loss, and ENet, MCP and SCAD driven by logistic loss.

For performance evaluations of the competing methods, BPV highlight their potential for variable selection with accuracy A (the fraction of true nonzeros found), FDR (the false discovery rate), and ROC curves of true positives versus false positives. Taken together, these reveal the selection tradeoffs obtained by each of the methods.
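The two selection metrics just defined are simple to compute from an estimated and a true coefficient vector. The following is an illustrative sketch built from the definitions above, not BPV's code; `selection_metrics` is a hypothetical helper name.

```python
import numpy as np

def selection_metrics(b_hat, b_true):
    """Accuracy A = fraction of true nonzeros recovered;
    FDR = fraction of selected variables that are false discoveries."""
    sel = np.flatnonzero(b_hat)                     # estimated support
    true = np.flatnonzero(b_true)                   # true support
    tp = np.intersect1d(sel, true).size             # true positives
    A = tp / true.size if true.size else 1.0
    fdr = (sel.size - tp) / sel.size if sel.size else 0.0
    return A, fdr
```

Note that when the number of selected variables equals the true sparsity k_true, every false discovery displaces a true nonzero, which is the source of the identity A = 1 − FDR exploited in the comparisons that treat k_true as fixed and known.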
For predictive performance, they report MSE, casting prediction primarily as a "practical implication of accuracy." And to convey the practicality of their CIS and SS implementations, they also report speed comparisons across various settings, gauging the speed of every method against the gold standard fastest speed of the lasso.

Across all these simulation comparisons, many different patterns of relative accuracy, false discovery rates and prediction quality emerge. Begin with the accuracy and MSE comparisons when the actual underlying sparsity level k_true is treated as fixed and known, so that A = 1 − FDR. This constraint forces all the methods to explicitly trade correct selection for false detections. It is hence not surprising that ENet is the least accurate, as it also suffers from the lasso's tension between selection and prediction. In contrast, it is impressive that both CIS and MCP are the most accurate throughout, MCP apparently avoiding this tradeoff tension with its relaxed penalty. More pronounced in the low noise and high correlation settings, these differences diminish rapidly, as we would expect, when n and/or the noise level is increased, virtually disappearing in the high noise setting. For prediction in the low and medium noise settings, CIS is best, slightly beating SS, MCP and SCAD, and with ENet clearly worst, though again all these differences seem to disappear in the more statistically difficult high noise setting. The speed comparisons here demonstrate that CIS and SS are dramatically faster than MIO, and even computationally comparable with MCP and SCAD for larger n. Indeed, in the high noise setting, SS appears to be nearly as fast as ENet, which is as fast as the lasso in every setting.

For the more realistic settings where the sparsity level k_true is unknown and estimated via cross-validation, we first see from the ROC curves that CIS, SS and MCP essentially provide the most desirable tradeoffs between true and false selection in the low and medium noise settings, with ENet performing worst. However, these differences become much less pronounced, and even reversed, in the high noise setting, especially when the predictor correlations are high. For prediction in the low and medium noise settings, CIS continues to be best up to around k_true, with MCP becoming best at the higher levels of estimated sparsity, and with ENet worst throughout. It is interesting that the predictions of MCP, SCAD and ENet continue to improve with increasing k, suggesting that they add more variables to improve their predictive cross-validation tuning. In the high noise setting, it appears that MCP and SCAD predict best, with ENet second under high correlation, although the wide error bars for these comparisons suggest such differences may not be real. In terms of accuracy, ENet, MCP and SCAD are best, followed by CIS and then SS in the low and medium noise settings, all differences disappearing as n increases. In the high noise setting, these differences are more pronounced and persistent. However, in terms of FDR, CIS is clearly best and ENet persistently worst in all settings. SS and MCP are also sometimes best, improving along with SCAD as n increases. As BPV suggest, this overall observed tendency of ENet, MCP and SCAD to select many more variables than are selected by CIS and SS can be understood to be a consequence of using soft penalization rather than a hard cardinality constraint to drive variable selection and prediction.

For the settings where the MI conditions for lasso consistency fail, ENet is seen, not surprisingly, to be among the least accurate across all three noise settings, with an accuracy converging to less than 1 for large n. In contrast, the accuracies of the other four methods, which do not rely on the MI conditions, converge to 1 in the low and medium noise settings. CIS and SS are best in all settings, along with MCP, which is also one of the best in the low noise setting. CIS and SS are also seen to dominate in prediction in the low and medium noise settings. However, in the high noise setting, the predictive performances of all the methods become indistinguishable.

Lastly, we consider the simulated comparisons based on the real data design matrix, where the predictor correlations are apt to be less symmetrically structured and perhaps more realistic. For the linear model setup, as SNR increases, ENet is increasingly the least accurate with the most false detections, while MCP is increasingly the most accurate with the fewest false detections. However, these differences all disappear at the lowest SNR levels, where the variable selection problem is most challenging. In terms of the ROC curve tradeoffs, ENet goes from best to worst and MCP goes from worst to best as the SNR level is increased from 0.05 to 6. For the classification model setup, we see ENet, MCP and SCAD generally providing better accuracy compared to CIS and SS across all the noise settings, with the differences diminishing as n, the noise level and the predictor correlations are increased. But in terms of false detection rates, this ordering is reversed, except for MCP, which also obtains lower false detection in the low noise settings. For the classifications obtained by their respective cross-validated support selections, SS appears to provide the most desirable tradeoffs between accuracy, false detection and sparsity.
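The distinction between soft penalization and a hard cardinality constraint comes down to two different maps applied to coefficients. As a minimal numpy sketch (my own illustration with hypothetical helper names, not the authors' code): the ℓ1 proximal operator soft-thresholds, coupling selection with shrinkage of every coordinate it keeps, while projection onto an ℓ0 ball keeps the k largest coordinates untouched.

```python
import numpy as np

def soft_threshold(z, lam):
    """Prox of lam * ||.||_1: shrinks everything toward zero and zeroes
    coordinates below lam -- selection and shrinkage are coupled."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_top_k(z, k):
    """Projection onto {b : ||b||_0 <= k}: keeps the k largest coordinates
    unshrunk -- the cardinality is controlled directly."""
    b = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]
    b[keep] = z[keep]
    return b
```

This is one way to see why predictively tuned soft penalties tend to admit extra small nonzero coordinates: reducing λ enough to unbias the large coefficients necessarily lets smaller ones through, whereas the hard constraint fixes the selected count directly.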
In concluding, let me thank HTT and BPV for their illuminating comparisons of some of the leading approaches for variable selection in high dimensional regression, and for their many insightful summary conclusions, with which I agree. Together, they have provided convincing evidence that the relaxed lasso, CIS and SS will be very valuable additions to the arsenal of modern variable selection approaches. Of course, as so aptly demonstrated throughout the simulations, no single approach will dominate all others in every setting, especially as pertains to the ease of detecting the underlying signal, which is typically unknown. So in practice it makes good sense to consider a collection of methods together, for example, beginning an application with lasso predictive screening and then proceeding from there with other methods, as BPV have suggested. Finally, it should be emphasized that the arsenal of valuable modern variable selection approaches goes well beyond the methods considered here, as HTT have indicated. In particular, in line with the underlying Bayesian nature of penalized likelihood methods, a growing number of promising scalable Bayesian variable selection approaches, for example, the spike-and-slab lasso described earlier, are valuable additions to the arsenal and should not be overlooked.

ACKNOWLEDGMENT

This work was supported by NSF Grant DMS-1916245.

REFERENCES

BERTSIMAS, D., KING, A. and MAZUMDER, R. (2016). Best subset selection via a modern optimization lens. Ann. Statist. 44 813–852. MR3476618 https://doi.org/10.1214/15-AOS1388
BERTSIMAS, D., PAUPHILET, J. and VAN PARYS, B. (2017). Sparse classification: A scalable discrete optimization perspective. Under revision.
BERTSIMAS, D. and VAN PARYS, B. (2020). Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. Ann. Statist. 48 300–323. MR4065163 https://doi.org/10.1214/18-AOS1804
HOERL, A. E. and KENNARD, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 55–67.
JAMES, W. and STEIN, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I 361–379. Univ. California Press, Berkeley, CA. MR0133191
ROČKOVÁ, V. and GEORGE, E. I. (2018). The spike-and-slab LASSO. J. Amer. Statist. Assoc. 113 431–444. MR3803476 https://doi.org/10.1080/01621459.2016.1260469
STEIN, C. M. (1960). Multiple regression. In Essays in Honor of Harold Hotelling. Stanford Univ. Press, Palo Alto, CA.
ZHOU, H. H. and HWANG, J. T. G. (2005). Minimax estimation with thresholding and its application to wavelet analysis. Ann. Statist. 33 101–125. MR2157797 https://doi.org/10.1214/009053604000000977