A Look at Robustness and Stability of ℓ1- versus ℓ0-Regularization: Discussion of Papers by Bertsimas et al. and Hastie et al.
Statistical Science 2020, Vol. 35, No. 4, 614–622
https://doi.org/10.1214/20-STS809
Main article: https://doi.org/10.1214/19-STS701, https://doi.org/10.1214/19-STS733
© Institute of Mathematical Statistics, 2020

A Look at Robustness and Stability of ℓ1- versus ℓ0-Regularization: Discussion of Papers by Bertsimas et al. and Hastie et al.

Yuansi Chen, Armeen Taeb and Peter Bühlmann

Abstract. We congratulate the authors Bertsimas, Pauphilet and van Parys (hereafter BPvP) and Hastie, Tibshirani and Tibshirani (hereafter HTT) for providing fresh and insightful views on the problem of variable selection and prediction in linear models. Their contributions at the fundamental level provide guidance for more complex models and procedures.

Key words and phrases: Distributional robustness, high-dimensional estimation, latent variables, low-rank estimation, variable selection.

1. A BIT OF HISTORY ON SUBSET SELECTION

The scientific debate on whether to use ℓ0 or ℓ1 regularization to induce sparsity has always been active. ℓ0 regularization in linear and other models has taken a prominent role, perhaps because of Occam's razor principle of simplicity. Information criteria like AIC (Akaike, 1973) and BIC (Schwarz, 1978) have been a breakthrough for complexity regularization with ℓ0 regularization; see also Kotz and Johnson (1992). Miller's book on subset selection (Miller, 1990) provided a comprehensive view of the state of the art 30 years ago. But the landscape has changed since then.

Breiman introduced the nonnegative garrote for better subset selection (Breiman, 1995), mentioned instability of forward selection (Breiman, 1996b) and promoted bagging (Breiman, 1996a) as a way to address it. A bit earlier, Basis pursuit was invented by Chen and Donoho (1994), just before Tibshirani (1996) proposed the famous Lasso with ℓ1 norm regularization that has seen massive use in statistics and machine learning; see also Chen, Donoho and Saunders (2001) and Fuchs (2004). The view and fact that ℓ1 norm regularization can be seen as a powerful convex relaxation (Donoho, 2006) for the ℓ0 problem has perhaps overshadowed that there are other statistical aspects in favor of ℓ1 norm regularization, as pointed out now again by HTT in their current paper.

One way to tackle the computational shortcoming of the exact ℓ0 approach is through greedy algorithms. Tropp (2004) justified the good old greedy forward selection from a theoretical viewpoint with no noise, and Cai and Wang (2011) generalize this result to noisy situations, both assuming fairly restrictive coherence conditions on the design; the conditions in the latter work for noisy problems are weakened in Zhang (2011). Also matching pursuit (Mallat and Zhang, 1993) has been a contribution in favor of an algorithmic forward approach, perhaps mostly for computational reasons at that time. The similarity of matching pursuit to ℓ1 norm penalization was made more clear by Least Angle Regression (Efron et al., 2004) and L2 boosting (Bühlmann, 2006). It is worth mentioning that the recent work by Bertsimas, King and Mazumder (2016) illustrates that solving moderate-size exact ℓ0 problems is no longer as slow as people used to think. Apart from their smart implementations, the speed-up they are able to attain benefits from tremendous progress in modern optimization theory and in physical computing powers.

On choosing between the ℓ0 and the ℓ1 approach, both BPvP and HTT have provided valuable statistical and optimization comparisons and discussions.
To consolidate the contributions in both papers in a principled manner, here we outline what we believe are the three aspects that a data analyst must consider when deciding between ℓ0 and ℓ1 regularization: namely the application for which they will be employed, the optimization guarantees (computation speed and certificates of optimality) that are desired, and the statistical properties that the model might possess.

Yuansi Chen is Postdoctoral Fellow, Seminar for Statistics, ETH Zürich, Rämistrasse 101, CH-8092 Zürich, Switzerland (e-mail: chen@stat.math.ethz.ch). Armeen Taeb is Postdoctoral Fellow, Seminar for Statistics, ETH Zürich, Rämistrasse 101, CH-8092 Zürich, Switzerland (e-mail: armeen.taeb@stat.math.ethz.ch). Peter Bühlmann is Professor, Seminar for Statistics, ETH Zürich, Rämistrasse 101, CH-8092 Zürich, Switzerland (e-mail: buhlmann@stat.math.ethz.ch).
2. THREE ASPECTS OF THE PROBLEM

2.1 Application Aspect

Without a concrete data problem, the application aspect is perhaps not clearly distinguishable from a statistical data analysis viewpoint. Whether one would prefer ℓ0 or ℓ1 norm regularization is problem dependent. For example, for problems where the cardinality of model parameters is naturally constrained, a truly sparse approximation is reasonable and one may prefer ℓ0 regularization. While ℓ0 regularization is more natural from a modeling perspective, data analysts have often resorted to ℓ1 regularization as a convex surrogate to speed up computations. However, the ℓ1 surrogate can become problematic if the solution of the ℓ1 problem were interpreted as that of the exact ℓ0 problem. In particular, it is well known that ℓ1 tends to overshrink the fitted estimates (Zou, 2006). That is, the ℓ1 estimates are overly biased toward zero. This phenomenon can lead to poorly fitted models in practice. For example, ℓ1 procedures for deconvolving calcium imaging data to determine the location of neuron activity lead to incorrect estimates (Friedrich, Zhou and Paninski, 2017), which has motivated exact solvers for the ℓ0-penalized problem in this context (Jewell and Witten, 2018). In directed acyclic graph (DAG) estimation, ℓ0 regularization is clearly preferred over ℓ1 regularization. In particular, ℓ0 regularization preserves the important property that Markov-equivalent DAGs have the same penalized likelihood score, while this is not the case for ℓ1 regularization (van de Geer and Bühlmann, 2013).

The two examples above illustrate that many additional desiderata—other than computational complexity—may arise when deciding between ℓ0 and ℓ1 regularization. Depending on the application, a practitioner may consider: certificates of optimality (e.g., in mission critical applications), statistical prediction guarantees, feature selection guarantees, stability of the estimated model, distributional robustness and generalization of the theoretical understandings in linear models to more complex models. Our main objective in this discussion is to raise awareness of the above considerations when choosing between ℓ0 and ℓ1 regularization.

2.2 Optimization Aspect

Exact ℓ0 regularization has long been abandoned because of its computational hardness. One could optimize using branch-and-bound techniques for moderate dimensions (Gatu and Kontoghiorghes, 2006, Hofmann, Gatu and Kontoghiorghes, 2007, Bertsimas and Shioda, 2009) or, as most statisticians did, resort to some forward–backward heuristics. These approaches either provide no guarantee for finding the optimal solution or are not computationally tractable for large-scale problems. On the other hand, ℓ1 regularization is convex. When ℓ1 regularization is combined with a convex loss, the optimization program can be solved efficiently using standard subgradient-based methods. Gradient methods are ubiquitous and ready to use in today's deep learning research and popular computing packages such as TensorFlow (Abadi et al., 2015) or PyTorch (Paszke et al., 2019). These publicly available computing packages make ℓ1 regularization easily deployable in applications beyond linear models.

Of course, finding the global optimum may not be necessary for good statistical performance. The search for greedy algorithms for ℓ0 regularization and the study of the corresponding statistical performance has been and still is an active research field (see Tropp, 2004, Cai and Wang, 2011 and Zhang, 2011 mentioned in Section 1). However, in order to provide both computational and statistical guarantees for these greedy algorithms, often fairly restrictive coherence conditions on the design matrix have to be assumed.

The computational difficulty of exact ℓ0 regularization is not completely insurmountable. As pointed out by the recent work of Bertsimas, King and Mazumder (2016), with the advances in physical computing powers and in modern optimization theory on mixed-integer programming (MIO), data analysts now have a rigorous approach to employ ℓ0 regularization efficiently in regression problems that would have taken years to compute decades ago. However, as pointed out by BPvP and HTT, the mixed-integer programming formulation of the ℓ0 problem is not yet as fast as the Lasso, especially for low-SNR regimes and large problem sizes. In such scenarios, one often has to trade off between getting closer to the global minimum and terminating the program early under a time limit. As an example, HTT point out that for problem size n = 500, p = 100, the method originally introduced in Bertsimas, King and Mazumder (2016) still requires an hour to certify optimality. The addition of ℓ2 regularization by BPvP does improve computations in the low-SNR regime, in particular with their new convex relaxation SS for the ℓ0 + ℓ2 framework. As a convex relaxation, SS is a heuristic solution and is able to solve sparse linear regression of problem size n = 10,000, p = 100,000 in less than 20 seconds, putting it on par with the Lasso. Overall, the contribution of Bertsimas and coauthors in the series of related papers Bertsimas, King and Mazumder (2016), Bertsimas and King (2017), Bertsimas and Van Parys (2020) enables the possibility of ℓ0 regularization in high-dimensional data analysis, and inspires future research to further integrate ℓ0-based procedures.
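To fix ideas, a schematic form of the cardinality-constrained least-squares problem with an added ℓ2 (ridge) term, written as a mixed-integer program via binary indicator variables and a big-M constraint, is the following. This is our own illustrative sketch; the choice of the big-M constant M and the 1/(2γ) parametrization of the ridge term are assumptions and not necessarily the exact formulation implemented by BPvP:

\min_{\beta \in \mathbb{R}^p,\; z \in \{0,1\}^p} \; \frac{1}{2}\|Y - X\beta\|_2^2 + \frac{1}{2\gamma}\|\beta\|_2^2
\quad \text{subject to} \quad |\beta_j| \le M z_j \;\; (j = 1, \dots, p), \qquad \sum_{j=1}^p z_j \le k.

Here z_j = 1 allows variable j to enter the model, the constraint on the sum of the z_j enforces the ℓ0 (cardinality) constraint, and letting γ → ∞ removes the ridge term and recovers plain best subset selection.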
2.3 Statistical Aspect

The statistical aspect is rooted in the goal to do accurate or "optimal" information extraction in the context of uncertainty, aiming for stability and replicability. This inferential aspect may have very different aims, and depending on them and on the data generating processes, different methods might be preferred for different scenarios. Even more so, the statistical aspect might even embrace the idea that there are several "competing" methods and one would typically gain information by inspecting consensus or using aggregation.

Both the BPvP and HTT papers demonstrate a careful and responsible comparison of subset selection methods, and we complement this with a few additional empirical results in Section 3. With respect to statistical performance, HTT point out some key takeaways from their extensive empirical study: there is no clear winner, and depending on the size of the SNR, different methods perform favorably. In particular, they suggest that ℓ0-based best subset suffers both computationally and statistically in the low-SNR regime. To address some of the instability concerns raised by HTT, and inspired by the Elastic Net (Zou and Hastie, 2005), BPvP include an ℓ2 penalty in their MIO formulation. With this new estimator, they demonstrate good performance across the SNR spectrum. We applaud that both BPvP and HTT put the No Cherry Picking (NCP) guidelines (Bühlmann and van de Geer, 2018) into action.

2.3.1 Relaxed and adaptive Lasso: A compromise between ℓ0- and ℓ1-regularization. HTT find as an overall recommendation that the relaxed Lasso (Meinshausen, 2007), or a version of it, performs "overall best." In fact, when Nicolai Meinshausen came up independently with the idea around 2005, we learned that Hui Zou had invented the adaptive Lasso (Zou, 2006). Both proposals aim to reduce the bias of the Lasso and push the solution more towards ℓ0 penalization. So we wonder why HTT did not include the adaptive Lasso into their study as well (by using the plain Lasso as initial estimator, see also Bühlmann and van de Geer, 2011, Chapter 2.8). We would expect a similar behavior as for the relaxed Lasso.

The relaxed Lasso and adaptive Lasso above are "pushing" the convex ℓ1 regularization towards the nonconvex ℓ0 to benefit from both types of regularization. In the same spirit, the ℓ0 + ℓ2 approach from BPvP combines the ℓ0 regularization with the convex ℓ2 regularization to increase stability in the low-SNR regime. If one is willing to accept the conceptual connection between the ℓ0 + ℓ2 approach from BPvP and the relaxed Lasso, it is no longer surprising to see that the ℓ0 + ℓ2 approach from BPvP performs better than best subset and Lasso from HTT in many simulation settings.

2.3.2 Statistical theory. The beauty of statistical theory is to characterize mathematically under which assumptions a certain method exhibits performance guarantees such as minimax optimality, rate of statistical convergence, and consistency. Such guarantees are typically provided for prediction performance or feature selection performance of a procedure.

Regarding the ℓ0 and ℓ1 regularization in high-dimensional sparse linear models, some statistical guarantees are known and some are still unknown (at least to us). For prediction, that is, estimating the regression surface Xβ*—ℓ0 regularization is very powerful as it leads to minimax optimality under no assumption on the design matrix X (Barron, Birgé and Massart, 1999). For the Lasso, this is not true and the fast convergence rate requires conditions on the design such as the restricted eigenvalue condition (Bickel, Ritov and Tsybakov, 2009) or the compatibility condition (van de Geer and Bühlmann, 2009, Bühlmann and van de Geer, 2011, Bellec, 2018). On the other hand, for feature selection accuracy, that is, estimation of the parameter vector β*—one necessarily needs a condition on the fixed design matrix X, since β* is not identifiable in general if p > n. Specifically, since the least squares loss function is the quadratic form $\|Y - Xb\|_2^2/n \approx (\beta^* - b)^\top (X^\top X/n)(\beta^* - b) + \sigma_\varepsilon^2$, with β* denoting the true regression parameter and σ_ε² the error variance, we might necessarily need a restrictive eigenvalue/compatibility type condition to estimate β* (these are in fact the weakest assumptions known for accurate parameter estimation or feature selection consistency with the Lasso). Whether ℓ0 minimization has an advantage over the Lasso for parameter estimation (e.g., requiring weaker assumptions or yielding more accurate estimators) is unclear to us, and we believe investigating such relationships is an interesting direction for future research. On the fine scale, methods which improve upon the bias of the Lasso, such as the adaptive Lasso (Zou, 2006), the relaxed Lasso (Meinshausen, 2007) or thresholding after the Lasso, are indeed a bit better than the plain Lasso, under some assumptions (van de Geer, Bühlmann and Zhou, 2011). We wonder whether these Lasso variants also have similar advantages when introduced with ℓ0 regularization.

3. A FEW ADDITIONAL THOUGHTS

BPvP and HTT provide many empirical illustrations on prediction, parameter estimation, cross-validation or degrees of freedom. We complement this with empirical studies on the following points: (i) distributional robustness and (ii) feature selection stability. The first point (i) is very briefly mentioned by BPvP in Sections 2.1.1 and 2.3.1 but not further considered in their empirical results. Further, we highlight a few problem settings—with more complicated decision spaces than linear subset selection—where developing optimization tools for ℓ0 regularization may be fruitful.

Implementation details. In all our experiments, the Lasso is solved via the R package glmnet (Friedman, Hastie and Tibshirani, 2010) and best subset is solved via the R package bestsubset, with a maximum computing time limit of 30 minutes following HTT. The convex relaxation of the ℓ0 + ℓ2 approach, SS (introduced by BPvP), is solved via the Julia package SubsetSelection, with a maximum limit of 200 computing iterations following BPvP. While CIO is also an important method in BPvP, we did not include it here because we had a hard time executing the CIO code by BPvP. The difficulty is due to the version incompatibilities of installing both SS and CIO under the same Julia version in the current SparseRegression package provided by BPvP. We would like to encourage the authors to update the SparseRegression package to make their software contributions more accessible. The code to generate all the figures here is written in R with the integration of Julia code via the R package JuliaCall. Our code is released publicly in the GitHub repository STSDiscussion_SparseRegression (https://github.com/yuachen/STSDiscussion_SparseRegression).
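As a minimal sketch of the kind of simulation and tuning pipeline used below, restricted to the Lasso whose glmnet interface we are confident about: the particular sparse coefficient values and the SNR-to-noise conversion (σ² = Var(xᵀβ)/SNR for an identity-covariance design) are our own assumptions mirroring the HTT-style setup, and the bestsubset and SubsetSelection calls are not reproduced here.

  library(glmnet)

  set.seed(1)
  n <- 100; p <- 30; s <- 5; snr <- 2
  beta_star <- c(rep(1, s), rep(0, p - s))          # assumed sparse coefficient vector
  X  <- matrix(rnorm(n * p), n, p)                  # Gaussian design, identity covariance (rho = 0)
  Xv <- matrix(rnorm(n * p), n, p)                  # separate validation set of size n
  sigma <- sqrt(sum(beta_star^2) / snr)             # assumed SNR definition: Var(x' beta) / sigma^2 = SNR
  Y  <- drop(X  %*% beta_star + sigma * rnorm(n))
  Yv <- drop(Xv %*% beta_star + sigma * rnorm(n))

  fit <- glmnet(X, Y, alpha = 1, intercept = FALSE)       # Lasso path
  val_err <- colMeans((Yv - predict(fit, newx = Xv))^2)   # validation error for each lambda on the path
  lambda_val <- fit$lambda[which.min(val_err)]            # validation-based tuning used throughout Section 3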
3.1 Distributional Robustness

It is well known that the Lasso has a robustness property with respect to "measurement error," as the robust optimization problem (Xu, Caramanis and Mannor, 2009)

(3.1)    \operatorname*{argmin}_{b} \; \max_{\Delta \in U(\lambda)} \; \|Y - (X + \Delta) b\|_2,

with the perturbation set U(λ) = {Δ = (δ₁, …, δ_p) : ‖δ_j‖₂ ≤ λ ∀ j}, is equivalent to the square root Lasso

\operatorname*{argmin}_{b} \; \|Y - X b\|_2 + \lambda \|b\|_1.

We note that the square root Lasso and the Lasso have the same solution path when varying λ from 0 to ∞. Thus, the Lasso is robust under small covariate perturbations; "small" since the optimal choice of λ to guarantee the optimal statistical risk in the Lasso is typically of order $\sqrt{\log(p)/n}$.

More generally, Bertsimas and Copenhaver (2018) demonstrate that the robust optimization problem (3.1) with a norm h(·) and a perturbation set $U(\lambda) = \{\Delta : \max_{b \in \mathbb{R}^p} \|\Delta b\|_2 / h(b) \le \lambda\}$ is equivalent to $\operatorname*{argmin}_{b} \|Y - Xb\|_2 + \lambda h(b)$, matching the previous result of Xu, Caramanis and Mannor (2009) in the ℓ1 norm regularized regression setting. These results suggest a duality between norm-based regularization and robustness. However, as ℓ0 is not a norm, the connection with robustness is not immediately transparent. We believe that theoretically characterizing the type of perturbations that ℓ0 regularization is robust to would be an interesting and important future research direction.

Due to the lack of theoretical comparisons between the distributional robustness of the ℓ0 and ℓ1 regularization, in the following we investigate the distributional robustness of these two regularization techniques empirically. We evaluate the distributional robustness with the following DR metric, which takes a regression parameter or its estimate as input and outputs its distributional robustness:

(3.2)    \mathrm{DR} : \beta \mapsto \max_{\Delta \in U(\eta)} \|Y - (X + \Delta)\beta\|_2,

with (X, Y) generated independently and identically distributed with respect to the training data that yielded the estimate for β. DR computes the "worst-case" prediction performance of β when the covariates X have been perturbed inside the set U(η). Here, we take the perturbation set U(η) = {Δ = (δ₁, …, δ_p) : ‖δ_j‖₂ ≤ η ∀ j}, parametrized by the perturbation magnitude η > 0. To obtain a better understanding of DR, it helps to consider the distributional robustness difference (DRD)

(3.3)    \mathrm{DRD} : \beta \mapsto \max_{\Delta \in U(\eta)} \|Y - (X + \Delta)\beta\|_2 - \|Y - X\beta\|_2.

The metrics DR and DRD are related trivially by the relation DR(β) = DRD(β) + ‖Y − Xβ‖₂, for a fixed β. Further, due to the ℓ1 norm and robustness duality obtained in Xu, Caramanis and Mannor (2009), DRD(β) = η‖β‖₁. As a consequence, the DRD definition is independent of how the data (X, Y) are generated. We also deduce that DR(β) = η‖β‖₁ + ‖Y − Xβ‖₂. In other words, the distributional robustness of an estimator β (as measured by DR(β)) trades off between the ℓ1 norm of its solution and the prediction performance on unperturbed data.

3.1.1 Empirical illustration: DRD versus regularization parameter. We empirically explore the distributional robustness difference, DRD, as a function of the regularization parameter for both the Lasso and best subset (i.e., the ℓ0 regularization solver originally developed in Bertsimas, King and Mazumder, 2016). In this simulation, we consider the stylized setting where n = 100, p = 30, s = 5, SNR = 2.0, and the design matrix is sampled from a Gaussian distribution with identity covariance (i.e., correlation ρ = 0). We further fix the perturbation magnitude η = 1.0. Figure 1 shows the DRD of the Lasso and best subset estimates as a function of the regularization parameter λ for the Lasso and as a function of the number of features selected k for best subset. Due to the equality DRD(β) = η‖β‖₁, the behavior of DRD mimics the ℓ1 norm of the estimates in the regularization solution path. As expected, choosing a larger λ leads to a smaller DRD for the Lasso; we observe a similar behavior with best subset for smaller k.
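Because DRD(β) = η‖β‖₁ and DR(β) = η‖β‖₁ + ‖Y − Xβ‖₂, both metrics can be read off from a fitted coefficient path without solving the inner maximization over Δ. A minimal R sketch along a glmnet path, continuing the simulated objects X, Y and fit from the implementation sketch above (those object names are our own assumptions):

  eta <- 1.0
  B   <- as.matrix(coef(fit))[-1, , drop = FALSE]   # p x nlambda coefficients (drop the intercept row)
  drd <- eta * colSums(abs(B))                      # DRD(beta) = eta * ||beta||_1 for each lambda
  dr  <- drd + sqrt(colSums((Y - X %*% B)^2))       # DR(beta) = DRD(beta) + ||Y - X beta||_2
  plot(fit$lambda, drd, type = "l", log = "x",
       xlab = "lambda", ylab = "DRD along the Lasso path")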
FIG. 1. Left—Distributional robustness difference (DRD) of the Lasso as a function of the regularization parameter λ; right—DRD of best subset as a function of the regularization parameter k.

For both problems, we label two choices for the regularization parameter: 1. the choice that leads to the best prediction performance on a separate validation set of size n; 2. the choice that leads to the best prediction performance on the same validation set among all regularization values that yield support size less than or equal to s = 5. Choice 1 prioritizes the test prediction performance, while Choice 2 prioritizes selecting the correct number of features. The left plot of Figure 1 shows that for the Lasso, the two choices lead to very different values for λ, and as a result, different DRD. Theoretically, the Lasso can be used to achieve good prediction performance and good feature selection performance, but the choice of λ for each goal is different (Wainwright, 2009, Zhao and Yu, 2006). In particular, optimizing with respect to prediction performance requires a smaller value of λ (Choice 1) and optimizing for model selection accuracy requires a larger value of λ (Choice 2). The right plot of Figure 1 shows that the two choices for best subset are similar. Comparing the left and right plots in Figure 1, we observe that Choice 1 yields approximately the same DRD for the Lasso and best subset. Choice 2, however, chooses a large λ for the Lasso, leading to a slightly better DRD for the Lasso than for best subset.

3.1.2 Empirical illustration: Robustness versus signal-to-noise ratio and ambient dimension. In this experiment, we explore the distributional robustness of the Lasso, best subset, as well as the new method SS developed by BPvP. Following the simulation settings in HTT, we consider the following four data generation settings:

1. n = 100, p = 30, s = 5, SNR = 2.0, ρ = 0
2. n = 100, p = 30, s = 5, SNR = 0.1, ρ = 0
3. n = 50, p = 1000, s = 5, SNR = 20.0, ρ = 0
4. n = 50, p = 1000, s = 5, SNR = 1.0, ρ = 0,

and the design matrix is sampled from a Gaussian distribution with identity covariance (i.e., correlation ρ = 0). For the Lasso and best subset, we select their regularization parameters based on prediction performance on a separate validation set of size n. Since SS has two hyperparameter choices, we fix the ℓ2 penalty regularization (denoted as 1/γ by BPvP) to take one of the two values γ ∈ {1000, 0.01} and tune the ℓ0 regularization parameter based on validation performance. The two levels of the ℓ2 penalty lead to two versions of SS: SS1 (γ = 1000) with small ℓ2 penalty and SS2 (γ = 0.01) with large ℓ2 penalty.

For each method, after appropriately selecting the regularization parameters and computing the corresponding estimates, we evaluate the metric DR in equation (3.2) as a function of the perturbation magnitude η. Figure 2 compares the performance of the four methods. Given the relation DR(β) = η‖β‖₁ + ‖Y − Xβ‖₂, the Y-intercepts of the linear curves in Figure 2 represent the prediction performance with unperturbed data, and the slopes are precisely the ℓ1 norm of β.

FIG. 2. Distributional robustness (DR) of the Lasso, best subset, SS1 and SS2 for four ambient dimensions and SNR settings. For all problems, the data matrix is uncorrelated, that is, ρ = 0.
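Since DR(β̂) is affine in η, with intercept ‖Y − Xβ̂‖₂ and slope ‖β̂‖₁, the curves in Figure 2 can be traced for any fitted estimate without re-solving the inner maximization. A short R sketch, where beta_lasso and beta_bs are placeholder names (our assumptions, not objects defined earlier) for coefficient vectors returned by the Lasso and an ℓ0 solver:

  dr_curve <- function(beta_hat, X, Y, etas) {
    # DR(beta) = ||Y - X beta||_2 + eta * ||beta||_1, evaluated on a grid of eta values
    sqrt(sum((Y - X %*% beta_hat)^2)) + etas * sum(abs(beta_hat))
  }
  etas <- seq(0, 2, by = 0.1)
  matplot(etas, cbind(dr_curve(beta_lasso, X, Y, etas),
                      dr_curve(beta_bs,    X, Y, etas)),
          type = "l", lty = 1, xlab = "perturbation magnitude eta", ylab = "DR")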
Focusing on the high-SNR regimes in the left column of Figure 2, the Lasso, best subset and SS1 have comparable distributional robustness for small amounts of perturbation in both low and high dimensions. As a comparison, SS2 has larger prediction error with unperturbed data but has better distributional robustness (as measured by DR) if the perturbations are large enough; this behavior can be attributed to the large ℓ2 penalty in SS2.

Focusing on the low-SNR regimes in the right column of Figure 2, in the low dimensional setting 2, the Lasso is more robust than best subset. The robustness of the Lasso is largely due to the shrinkage property provided by ℓ1 regularization. In the low-SNR and high dimensional setting 4, best subset turns out to be more robust than the Lasso. We observe that in the low-SNR and high dimensional setting, best subset selects a very small number of features, enhancing its robustness. Once again, we observe that SS2 yields substantially more robust solutions as compared to the other three methods.

Our experiments, as well as the empirical studies by BPvP, suggest that ℓ0 + ℓ2 minimization can substantially improve robustness. It is worth noting that this comes at the cost of choosing an additional regularization parameter. In practice, searching over the two-dimensional grid of regularization parameters could make SS a less preferred choice for computational reasons. Nonetheless, the promising results of SS raise interesting questions about its statistical properties, as well as its optimization guarantees. On the statistical side, it is interesting for future research to understand theoretically when the nonconvex ℓ0 + ℓ2 regularization has similar prediction, model selection or robustness guarantees as the Lasso. On the optimization side, since SS only solves a relaxation of the ℓ0 + ℓ2 regularization optimization problem, it is interesting to understand when the SS solution is close to the global optimum.

3.2 Sampling Stability for Feature Selection

Next, we empirically study the stability of best subset and the Lasso as feature selection methods. We measure the feature selection stability by the probability of true and null variables being selected across multiple independent identically generated datasets. Ideally, we would like the probability of selecting any true variable to be close to one and any null variable close to zero. As with distributional robustness, the feature selection stability of the Lasso and best subset is strongly dependent on the choice of regularization parameter. We illustrate this phenomenon in the supplementary material Section A.1 (see Chen, Taeb and Bühlmann, 2020).

We consider the following stylized settings:

1. n = 100, p = 30, s = 5, SNR = 2.0, ρ = 0.35
2. n = 100, p = 30, s = 5, SNR = 0.1, ρ = 0.35

Figure 3 shows the feature stability of the Lasso and best subset where the regularization parameters are chosen based on prediction on a validation set. In the high-SNR regime, the feature selection stability of the ℓ0 approach is better than that of the Lasso. This is perhaps not very surprising as it is known that using validation error to select the regularization parameter for the Lasso yields overly complex models (Wainwright, 2009) (we consider in the supplementary material Section A.2 of Chen, Taeb and Bühlmann, 2020 the setting where λ for the Lasso is chosen so that the solution has s nonzeros). In the low-SNR regime (albeit still a bit larger than the extreme SNR = 0.05 considered by HTT), we see both methods struggle to tease apart true variables from null variables, although we would argue that the Lasso performs favorably here as the true variables appear with much larger probability.

FIG. 3. Feature stability of the Lasso and best subset across 100 independent datasets with problem parameters n = 100, p = 30, s = 5, ρ = 0.35 and SNR = {2, 0.1}. The regularization parameters are chosen based on prediction performance on a validation set.
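A minimal R sketch of how such selection frequencies can be estimated over repeated draws; simulate_one_dataset() and fit_and_select() are hypothetical helper functions standing in for the correlated-design data generation and the validation-tuned Lasso or best subset fits described above:

  p <- 30; s <- 5; n_rep <- 100
  sel_freq <- numeric(p)
  for (r in seq_len(n_rep)) {
    dat <- simulate_one_dataset(n = 100, p = p, s = s, snr = 2, rho = 0.35)  # hypothetical helper
    sel <- fit_and_select(dat$X, dat$Y, dat$Xval, dat$Yval)                  # hypothetical helper: indices of selected support
    sel_freq[sel] <- sel_freq[sel] + 1
  }
  sel_freq <- sel_freq / n_rep
  # ideally close to 1 for the true variables 1..s and close to 0 for the null variables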
3.3 More Complicated Decision Spaces

Much of the focus of BPvP and HTT has been on subset selection in linear sparse regression settings. Although BPvP consider logistic regression and hinge loss in their mixed integer framework as well, the theoretical and computational aspects of these approaches appear less understood as compared to the linear setting. In practice, ℓ1 norm regularization has been widely used beyond the linear setting for generalized linear models and survival models, but also for scenarios with nonconvex loss functions such as mixture regression or mixed effect models (cf. Hastie, Tibshirani and Friedman, 2009, Bühlmann and van de Geer, 2011, Hastie, Tibshirani and Wainwright, 2015). As such, we believe that deepening our understanding of ℓ0-optimization for problems involving nonquadratic loss functions (beyond generalized linear models) is an important direction for future research. Furthermore, we next outline a few problem instances where ℓ0-based approaches may provide a fresh and interesting perspective.

3.3.1 Plug-in to squared error loss. There are extensions of linear models in a plug-in sense, as indicated below, which can deal with important issues around hidden confounding, causality and distributional robustness in the presence of large perturbations. The trick with such plug-in methodology is as follows. We linearly transform the data to X̃ = FX and Ỹ = FY for a carefully chosen n × n matrix F. Subsequently, we fit a linear model of Ỹ versus X̃, typically with a regularization term such as the ℓ1 norm or now, due to recent contributions by BPvP, also ℓ0 regularization. We list here two choices of F:

1. F is a spectral transform which trims the singular values of X. If there is unobserved hidden linear confounding which is dense, affecting many components of X, then the regularized regression of Ỹ versus X̃ yields the deconfounded regression coefficient (Cevid, Bühlmann and Meinshausen, 2018).
2. F is a linear transformation involving projection matrices and a robustness tuning parameter. Then Anchor regression is simply regression of Ỹ versus X̃, and we may want to use ℓ0 or ℓ1 regularization (Rothenhäusler et al., 2018). The obtained regression coefficient has a causal interpretation and leads to distributional robustness under a class of large perturbations.

3.3.2 Fitting directed acyclic graphs. Fitting Gaussian structural equation models with acyclic directed graph (DAG) structure from observational data is a well-studied topic in structure learning. As mentioned in Section 2.1, one should always prefer the ℓ0 regularization principle as it respects the Markov equivalence property. The optimization of the ℓ0-regularized log-likelihood function with a DAG constraint is very difficult, since the DAG constraint induces a high degree of nonconvexity. The famous proposal of greedy equivalent search (GES) is the most used heuristic, but with proven crude consistency guarantees in low dimensions (Chickering, 2003). It would be wonderful to have more rigorous optimization tools which could address this highly nonconvex problem.

3.3.3 Extensions to low-rank matrix estimation. A common task in data analysis is to obtain a low-rank model from observed data. These problems often aim at solving the optimization problem:

(3.4)    \operatorname*{argmin}_{X \in \mathbb{R}^{n \times p}} f(X) \quad \text{subject to} \quad \operatorname{rank}(X) \le k,

for some differentiable function $f : \mathbb{R}^{n \times p} \to \mathbb{R}$ and $k \ll \min\{n, p\}$. The constraint rank(X) ≤ k can be viewed as ‖singular-values(X)‖₀ ≤ k, making equation (3.4) a matrix analog of the subset selection problem analyzed in BPvP and HTT. Due to the computational intractability of rank minimization, a large body of work has resorted to convex relaxation in order to obtain computationally feasible solutions by replacing the rank constraint rank(X) ≤ k with the nuclear norm penalty λ‖X‖_* in the objective (Fazel, 2002), yielding a semidefinite program for certain function classes f. Such relaxations, while having the advantage of convexity, are not scalable to large problem instances. As such, practitioners often resort to solving other nonconvex formulations of equation (3.4) such as the Burer–Monteiro approach (Burer and Monteiro, 2003) and projected gradient descent on the low-rank manifold (Jain, Tewari and Kar, 2014).
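To make the projected-gradient idea concrete: the projection onto the set {rank(B) ≤ k} is a truncated singular value decomposition. A minimal R sketch for the toy objective f(B) = ½‖A − B‖_F², where the objective, step size and iteration count are illustrative assumptions rather than any method from the papers under discussion:

  project_rank_k <- function(B, k) {
    sv <- svd(B)   # keep only the k leading singular triplets
    sv$u[, 1:k, drop = FALSE] %*% diag(sv$d[1:k], k, k) %*% t(sv$v[, 1:k, drop = FALSE])
  }

  set.seed(1)
  A <- matrix(rnorm(50 * 40), 50, 40)   # observed data matrix (toy example)
  B <- matrix(0, 50, 40); k <- 3; step <- 0.5
  for (it in 1:100) {
    # gradient of f(B) = 0.5 * ||A - B||_F^2 is (B - A); take a step, then project back onto rank <= k
    B <- project_rank_k(B - step * (B - A), k)
  }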
For a range of problem settings, these nonconvex techniques achieve appealing estimation accuracy (see Chen and Chen, 2018 for a summary), and are able to solve larger-dimensional problems than approaches based on convex relaxation. While these nonconvex approaches are commonly employed, they do not certify optimality. As such, nonlinear semidefinite optimization techniques have been developed to produce more accurate solutions for some specializations of equation (3.4); see, for example, Bertsimas, Copenhaver and Mazumder (2017) in the context of rank-constrained factor analysis. Beyond the setting analyzed in Bertsimas, Copenhaver and Mazumder (2017), we wonder whether there are conceptual advancements to the mixed integer framework proposed in BPvP to handle low-rank optimization problems (3.4) involving continuous nonconvex optimization. Developing this connection may enable a fresh and powerful approach to provide—in a computationally efficient manner—certifiably optimal solutions to equation (3.4).

4. CONCLUSIONS

It is great that practitioners can begin to integrate fast and exact ℓ0-based solvers into their data analysis pipelines. The methods developed by BPvP can benefit many data analysts who would prefer to focus on the application aspect of the problem rather than the optimization aspect. Aiming for the most parsimonious model fit is a very plausible principle which is easy to communicate in many applications. But this alone does not rule out the attractiveness of ℓ1 norm regularization and its versions: it builds in additional estimation shrinkage, which may be desirable in high noise settings, and sticking to convexity is yet another plausible principle. In this discussion, we tried to add a few additional thoughts to the excellent and insightful papers by BPvP and HTT, in the hope of encouraging new research in this direction.

ACKNOWLEDGMENTS

Y. Chen, A. Taeb and P. Bühlmann have received funding from the European Research Council under the Grant Agreement No. 786461 (CausalStats—ERC-2017-ADG). They all acknowledge scientific interaction and exchange at "ETH Foundations of Data Science."

SUPPLEMENTARY MATERIAL

Supplement to "A Look at Robustness and Stability of ℓ1- versus ℓ0-Regularization: Discussion of Papers by Bertsimas et al. and Hastie et al." (DOI: 10.1214/20-STS809SUPP; .pdf). The supplement contains an empirical analysis of the dependence of feature selection stability on the choice of regularization parameter for the Lasso and best subset. Further, in the context of feature selection stability, the effect of choosing λ for the Lasso to obtain a fixed cardinality is explored.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J. et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971) 267–281. MR0483125
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413. MR1679028 https://doi.org/10.1007/s004400050210
Bellec, P. (2018). The noise barrier and the large signal bias of the Lasso and other convex estimators. Preprint. Available at arXiv:1804.01230.
Bertsimas, D. and Copenhaver, M. S. (2018). Characterization of the equivalence of robustification and regularization in linear and matrix regression. European J. Oper. Res. 270 931–942. MR3814540 https://doi.org/10.1016/j.ejor.2017.03.051
Bertsimas, D., Copenhaver, M. S. and Mazumder, R. (2017). Certifiably optimal low rank factor analysis. J. Mach. Learn. Res. 18 Paper No. 29, 53. MR3634896
Bertsimas, D. and King, A. (2017). Logistic regression: From art to science. Statist. Sci. 32 367–384. MR3696001 https://doi.org/10.1214/16-STS602
Bertsimas, D., King, A. and Mazumder, R. (2016). Best subset selection via a modern optimization lens. Ann. Statist. 44 813–852. MR3476618 https://doi.org/10.1214/15-AOS1388
Bertsimas, D. and Shioda, R. (2009). Algorithm for cardinality-constrained quadratic optimization. Comput. Optim. Appl. 43 1–22. MR2501042 https://doi.org/10.1007/s10589-007-9126-9
Bertsimas, D. and Van Parys, B. (2020). Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. Ann. Statist. 48 300–323. MR4065163 https://doi.org/10.1214/18-AOS1804
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37 1705–1732. MR2533469 https://doi.org/10.1214/08-AOS620
Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37 373–384. MR1365720 https://doi.org/10.2307/1269730
Breiman, L. (1996a). Bagging predictors. Mach. Learn. 24 123–140.
Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. Ann. Statist. 24 2350–2383. MR1425957 https://doi.org/10.1214/aos/1032181158
Bühlmann, P. (2006). Boosting for high-dimensional linear models. Ann. Statist. 34 559–583. MR2281878 https://doi.org/10.1214/009053606000000092
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Heidelberg. MR2807761 https://doi.org/10.1007/978-3-642-20192-9
Bühlmann, P. and van de Geer, S. (2018). Statistics for big data: A perspective. Statist. Probab. Lett. 136 37–41. MR3806834 https://doi.org/10.1016/j.spl.2018.02.016
Burer, S. and Monteiro, R. D. C. (2003). A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95 329–357. MR1976484 https://doi.org/10.1007/s10107-002-0352-8
Cai, T. T. and Wang, L. (2011). Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Trans. Inf. Theory 57 4680–4688. MR2840484 https://doi.org/10.1109/TIT.2011.2146090
Cevid, D., Bühlmann, P. and Meinshausen, N. (2018). Spectral deconfounding and perturbed sparse linear models. Preprint. Available at arXiv:1811.05352.
Chen, Y. and Chen, Y. (2018). Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Process. Mag. 35 14–31.
Chen, S. and Donoho, D. (1994). Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers 1 41–44. IEEE, New York.
Chen, S. S., Donoho, D. L. and Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Rev. 43 129–159. MR1854649 https://doi.org/10.1137/S003614450037906X
Chen, Y., Taeb, A. and Bühlmann, P. (2020). Supplement to "A look at robustness and stability of ℓ1- versus ℓ0-regularization: Discussion of papers by Bertsimas et al. and Hastie et al." https://doi.org/10.1214/20-STS809SUPP
Chickering, D. M. (2003). Optimal structure identification with greedy search. J. Mach. Learn. Res. 3 507–554. MR1991085 https://doi.org/10.1162/153244303321897717
Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 797–829. MR2217606 https://doi.org/10.1002/cpa.20132
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499. MR2060166 https://doi.org/10.1214/009053604000000067
Fazel, M. (2002). Matrix rank minimization with applications. Ph.D. thesis, Stanford, CA.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 1–22.
Friedrich, J., Zhou, P. and Paninski, L. (2017). Fast online deconvolution of calcium imaging data. PLoS Comput. Biol. 13 1–26.
Fuchs, J.-J. (2004). On sparse representations in arbitrary redundant bases. IEEE Trans. Inf. Theory 50 1341–1344. MR2094894 https://doi.org/10.1109/TIT.2004.828141
Gatu, C. and Kontoghiorghes, E. J. (2006). Branch-and-bound algorithms for computing the best-subset regression models. J. Comput. Graph. Statist. 15 139–156. MR2269366 https://doi.org/10.1198/106186006X100290
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer Series in Statistics. Springer, New York. MR2722294 https://doi.org/10.1007/978-0-387-84858-7
Hastie, T., Tibshirani, R. and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Monographs on Statistics and Applied Probability 143. CRC Press, Boca Raton, FL. MR3616141
Hofmann, M., Gatu, C. and Kontoghiorghes, E. J. (2007). Efficient algorithms for computing the best subset regression models for large-scale problems. Comput. Statist. Data Anal. 52 16–29. MR2409961 https://doi.org/10.1016/j.csda.2007.03.017
Jain, P., Tewari, A. and Kar, P. (2014). On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems 685–693.
Jewell, S. and Witten, D. (2018). Exact spike train inference via ℓ0 optimization. Ann. Appl. Stat. 12 2457–2482. MR3875708 https://doi.org/10.1214/18-AOAS1162
Kotz, S. and Johnson, N. L., eds. (1992). Breakthroughs in Statistics. Vol. I: Foundations and Basic Theory. Springer Series in Statistics. Perspectives in Statistics. Springer, New York. MR1182794
Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 3397–3415.
Meinshausen, N. (2007). Relaxed Lasso. Comput. Statist. Data Anal. 52 374–393. MR2409990 https://doi.org/10.1016/j.csda.2006.12.019
Miller, A. J. (1990). Subset Selection in Regression. Monographs on Statistics and Applied Probability 40. CRC Press, London. MR1072361 https://doi.org/10.1007/978-1-4899-2939-6
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N. et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 8026–8037.
Rothenhäusler, D., Meinshausen, N., Bühlmann, P. and Peters, J. (2018). Anchor regression: Heterogeneous data meets causality. Preprint. Available at arXiv:1801.06229. To appear in J. Roy. Statist. Soc. Ser. B.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. MR0468014
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inf. Theory 50 2231–2242. MR2097044 https://doi.org/10.1109/TIT.2004.834793
van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392. MR2576316 https://doi.org/10.1214/09-EJS506
van de Geer, S. and Bühlmann, P. (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Ann. Statist. 41 536–567. MR3099113 https://doi.org/10.1214/13-AOS1085
van de Geer, S., Bühlmann, P. and Zhou, S. (2011). The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat. 5 688–749. MR2820636 https://doi.org/10.1214/11-EJS624
Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 55 2183–2202. MR2729873 https://doi.org/10.1109/TIT.2009.2016018
Xu, H., Caramanis, C. and Mannor, S. (2009). Robust regression and Lasso. In Advances in Neural Information Processing Systems 1801–1808.
Zhang, T. (2011). Sparse recovery with orthogonal matching pursuit under RIP. IEEE Trans. Inf. Theory 57 6215–6221. MR2857968 https://doi.org/10.1109/TIT.2011.2162263
Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449
Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469 https://doi.org/10.1198/016214506000000735
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320. MR2137327 https://doi.org/10.1111/j.1467-9868.2005.00503.x