Varying Coefficient Model via Adaptive Spline Fitting

arXiv:2201.10063v1 [stat.ME] 25 Jan 2022

Xufei Wang¹, Bo Jiang¹, Jun S. Liu²
¹ Two Sigma Investments Limited Partnership
² Department of Statistics, Harvard University

January 26, 2022

Abstract

The varying coefficient model has received wide attention from researchers as it is a powerful dimension reduction tool for non-parametric modeling. Most existing varying coefficient models fitted with polynomial splines assume equidistant knots and take the number of knots as the hyperparameter. However, imposing equidistant knots appears too rigid, and systematically determining the optimal number of knots is also a challenge. In this article, we address this challenge by utilizing polynomial splines with adaptively selected and predictor-specific knots to fit the coefficients in varying coefficient models. An efficient dynamic programming algorithm is proposed to find the optimal solution. Numerical results show that the new method achieves significantly smaller mean squared errors for the coefficients than the commonly used kernel-based method.

1 Introduction

The problem of accurately estimating the relationship between a response variable and multiple predictor variables is fundamental to statistics, machine learning, and many other scientific fields. Among parametric models, linear regression is simple and powerful, yet it is limited by its linearity assumption, which is often violated in real applications. In contrast, non-parametric models do not assume any specific form for the relationship between the response and the predictors; they offer flexibility in modeling nonlinear relationships and are powerful alternatives to linear models. However, non-parametric models usually need to impose local smoothness assumptions, typically realized through a class of kernels or spline basis functions, to avoid over-fitting. They are also plagued by the curse of dimensionality: for high-dimensional data sets they are both ineffective in capturing the true relationship and computationally expensive.

Serving as a bridge between linear and non-parametric models, the varying coefficient model (Hastie & Tibshirani, 1993) provides an attractive compromise between simplicity and flexibility.
In this class of models, the regression coefficients are not fixed constants but are allowed to depend on other conditioning variables. As a result, the varying coefficient model is more flexible because of the infinite dimensionality of the corresponding parameter space. Compared with standard non-parametric approaches, it is a powerful strategy for coping with the curse of dimensionality, while inheriting the simplicity and interpretability of linear models. A typical setup for the varying coefficient model is as follows. Given a response y and predictors X = (x_1, ..., x_p)^T, the model assumes that

y = \sum_{j=1}^{p} \beta_j(u) x_j + \varepsilon,

where u is the conditioning random variable, usually a scalar. The varying coefficient model has a broad range of applications, including longitudinal data (Huang et al., 2004; Tang & Cheng, 2012), functional data (Zhang & Wang, 2014; Hu et al., 2019), and spatial data (Wang & Sun, 2019; Finley & Banerjee, 2020). Moreover, varying coefficient models extend naturally to time series contexts (Huang & Shen, 2004; Lin et al., 2019).
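To make this setup concrete, the sketch below simulates data from a small varying coefficient model. The coefficient functions, sample size, and noise level are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 2
u = rng.uniform(0, 1, size=n)                 # scalar conditioning variable
X = rng.normal(size=(n, p))                   # predictors x_1, x_2
beta = np.column_stack([
    np.sin(2 * np.pi * u),                    # beta_1(u): hypothetical smooth curve
    1.0 + 0.5 * u,                            # beta_2(u): hypothetical linear trend
])
y = np.sum(beta * X, axis=1) + 0.1 * rng.normal(size=n)   # y = sum_j beta_j(u) x_j + eps
```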
There are three major approaches to estimating the coefficients β_j(u) (j = 1, ..., p). One widely acknowledged approach is the smoothing spline method proposed by Hastie & Tibshirani (1993), with a recent follow-up using P-splines (Astrid et al., 2009). Fan & Zhang (1999) and Fan & Zhang (2000) proposed a two-step kernel approach: the first step uses local cubic smoothing to model β_j(u), and the second step re-applies local cubic smoothing to the residuals. A recent follow-up of their work is the adaptive estimator of Chen et al. (2015). Approximating the coefficient functions with a basis expansion, e.g., polynomial B-splines, remains popular as it leads to simple estimation and inference with good theoretical properties. Compared with smoothing spline and kernel methods, the polynomial spline method with a finite number of knots strikes a balance between model flexibility and interpretability of the estimated coefficients. Huang et al. (2002), Huang & Shen (2004), and Huang et al. (2004) developed polynomial spline estimators of this kind. These works assume that the knots are equidistant and that the number of knots is chosen so that the bias terms become asymptotically negligible, which guarantees local asymptotic normality.

Most polynomial spline approaches optimize over a finite-dimensional class of functions, such as a space of polynomial B-splines with L equally spaced knots. If the true turning points of the coefficients are not equidistant yet we still assume equally spaced knots, then L must be large enough that the inter-knot distances are small enough to capture the resolution of the coefficients. In theory, it is not yet clear how to determine L systematically. In practice, L is usually chosen by a parameter search alongside other parameters, since one needs to compare a set of candidates before settling on a fixed number of knots. Too few knots may miss the high-frequency behavior of β_j(u), while too many may over-fit regions where the coefficients barely change. Variable selection for varying coefficient models, especially when the number of predictors exceeds the sample size, is also an interesting direction of study.

In this paper, we propose two adaptive algorithms based on first fitting piece-wise linear functions with automatically selected turning points for a univariate conditioning variable u. Compared with the traditional methods above, these algorithms automatically determine the optimal knot positions, modeled as the turning points of the true coefficients. We prove that, if the coefficients are piece-wise linear in u, the knots selected by our methods are almost surely the true change points. We further combine the knot selection algorithms with the adaptive group lasso for variable selection in high-dimensional problems, following a similar idea to Wei et al. (2011), who apply the adaptive group lasso to basis expansions of predictors. Our simulation studies demonstrate that the new adaptive method achieves smaller mean squared errors for estimating the coefficients than available methods and improves variable selection performance. We finally apply the method to study the association between environmental factors and COVID-19 infected cases, and observe time-varying effects of the environmental factors.

2 Adaptive spline fitting methods

2.1 Knot selection for polynomial splines

In varying coefficient models, each coefficient β_j(u) is a function of the conditioning variable u and can be estimated by fitting a polynomial spline in u. In this paper we assume that u is univariate. Let X_i = (x_{i,1}, ..., x_{i,p})^T ∈ R^p, u_i, and y_i denote the i-th observations of the predictor vector, the conditioning variable, and the response, respectively, for i = 1, ..., n. Suppose the knots, common to all coefficients, are located at d_1 < ... < d_L, and the corresponding B-splines of degree D are B_k(u) (k = 1, ..., D + L + 1). Each varying coefficient can be represented as β_j(u) = \sum_{k=1}^{D+L+1} c_{j,k} B_k(u), where the coefficients c_{j,k} are estimated by minimizing the sum of squared errors

\sum_{i=1}^{n} \Bigl\{ y_i - \sum_{j=1}^{p} x_{i,j} \sum_{k=1}^{D+L+1} c_{j,k} B_k(u_i) \Bigr\}^2.   (1)
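Given a set of knots, the minimization in (1) is an ordinary least squares problem in the coefficients c_{j,k}. The following is a minimal sketch of how it could be set up in Python; it assumes SciPy ≥ 1.8 for BSpline.design_matrix, and the function name and defaults are our own.

```python
import numpy as np
from scipy.interpolate import BSpline

def fit_vc_spline(X, u, y, interior_knots, degree=3):
    """Least squares fit of the varying coefficient spline model (1).

    Returns C of shape (p, L + degree + 1): row j holds the spline
    coefficients c_{j,k} of beta_j(u).
    """
    n, p = X.shape
    # Clamped knot vector: boundary knots repeated degree+1 times, so the
    # basis has L + degree + 1 functions, matching D + L + 1 in the paper.
    t = np.r_[np.repeat(u.min(), degree + 1),
              np.asarray(interior_knots, dtype=float),
              np.repeat(u.max(), degree + 1)]
    B = BSpline.design_matrix(u, t, degree).toarray()   # (n, L + degree + 1)
    # Block j of the design matrix holds x_{i,j} * B_k(u_i).
    Z = np.hstack([X[:, [j]] * B for j in range(p)])
    c, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return c.reshape(p, -1)
```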
In previous work, the knots for polynomial splines were typically chosen as equidistant quantiles of u and shared across all predictors. This is computationally straightforward, but such knots cannot properly reflect the varying smoothness between and within the coefficients. We propose an adaptive knot selection approach in which the knots can be interpreted as turning points of the coefficients. For knots d_1 < ... < d_L, define the segmentation scheme S = {s_1, ..., s_{L+1}} for the observed samples ordered by u, where s_k = {i | d_k < u_i ≤ d_{k+1}}, with d_0 = −∞ and d_{L+1} = ∞. If the true coefficient vector β(u) = (β_1(u), ..., β_p(u))^T ∈ R^p is a linear function of u within each segment, i.e., β(u) = a_s + b_s u for a_s, b_s ∈ R^p, then the observed response satisfies

y = a_s^{\mathrm{T}} X + b_s^{\mathrm{T}} (uX) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_s^2).   (2)

Thus, the coefficients can be estimated by maximizing the log-likelihood function, which is equivalent to minimizing the loss function

L(S) = \sum_{s \in S} |s| \log \hat{\sigma}_s^2,   (3)

where σ̂_s² is the residual variance from regressing y_i on (X_i, u_i X_i) for i ∈ s. Because any almost-everywhere continuous function can be approximated by piece-wise linear functions, we can employ the estimation framework in (2) and the derived loss (3). To avoid over-fitting when maximizing the log-likelihood over the parameters and segmentation schemes, we require each |s| to exceed a lower bound m_s = n^{\alpha} (0 < α < 1), and penalize the loss function with the number of segments:

L(S, \lambda_0) = \sum_{s \in S} |s| \log \hat{\sigma}_s^2 + \lambda_0 |S| \log(n),   (4)

where λ_0 > 0 is the penalty strength. The resulting knots correspond to the segmentation scheme S that minimizes the penalized loss function (4). When λ_0 is very large, this strategy tends to select no knots; as λ_0 approaches 0, it can select as many as n^{1−α} knots. We find the optimal λ_0 by minimizing the Bayesian information criterion (Schwarz, 1978) of the fitted model. For a given λ_0, suppose L(λ_0) knots are proposed and the fitted model is f̂(X, u). Then

\mathrm{BIC}(\lambda_0) = n \log \Bigl[ \frac{1}{n} \sum_{i=1}^{n} \{ y_i - \hat{f}(X_i, u_i) \}^2 \Bigr] + p \{ L(\lambda_0) + D + 1 \} \log(n).   (5)

The optimal λ_0 is selected by searching over a grid to minimize BIC(λ_0). We call this procedure the global adaptive knot selection strategy, as it assumes all predictors share the same set of knots; Section 2.4 discusses how to allow each predictor its own set of knots. Note that we use the piece-wise linear model (2) and loss function (4) only for knot selection; the varying coefficients are then fitted with B-splines built on the resulting knots by minimizing (1). In this way, the fitted varying coefficients are smooth functions whose smoothness is determined by the degree of the splines. We refer to this method as one-step adaptive spline fitting throughout the paper.
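As a sketch of how (3) and (4) are evaluated, the function below computes the penalized loss of a candidate set of knots by regressing y on (X, uX) within each segment. The function and variable names, and the numerical floor on the variance, are our own.

```python
import numpy as np

def penalized_loss(X, u, y, knots, lam0):
    """Penalized segmentation loss (4) for the given interior knots.

    Segments follow the paper's scheme s_k = {i : d_k < u_i <= d_{k+1}},
    with d_0 = -inf and d_{L+1} = +inf.
    """
    n = len(y)
    Z = np.hstack([X, u[:, None] * X])        # regress y on (X, uX) per segment
    edges = np.concatenate(([-np.inf], np.sort(knots), [np.inf]))
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        s = (u > lo) & (u <= hi)
        if not s.any():
            continue
        coef, *_ = np.linalg.lstsq(Z[s], y[s], rcond=None)
        sigma2 = np.mean((y[s] - Z[s] @ coef) ** 2)       # residual variance
        total += s.sum() * np.log(max(sigma2, 1e-12))     # |s| * log(sigma_hat^2)
    return total + lam0 * (len(knots) + 1) * np.log(n)    # + lambda_0 |S| log(n)
```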
2.2 Theoretical properties of the selected knots

The proposed method is invariant to the marginal distribution of u. Without loss of generality, we assume that u follows the uniform distribution on [0, 1]. According to Definition 1 below, a turning point, denoted c_k, is a location at which some coefficient β_j(u) (j = 1, ..., p) attains a local maximum or minimum. In Theorem 1, we show that the adaptive knot selection approach almost surely detects all the turning points of β(u).

Definition 1. We call 0 < c_1 < ... < c_K < 1 (K < ∞) the turning points of β(u) if for any c_{k−1} < u_1 < u_2 < c_k < u_3 < u_4 < c_{k+1} (k = 1, ..., K),

\{ \beta_j(u_1) - \beta_j(u_2) \} \{ \beta_j(u_3) - \beta_j(u_4) \} < 0

for some index j, where c_0 = 0 and c_{K+1} = 1.

Theorem 1. Suppose y = β(u)^T X + ε, u ∼ Unif(0, 1), ε ∼ N(0, σ²), where X is bounded and β(u) is a bounded continuous function. Moreover, assume X, u, ε are independent. Let 0 < c_1 < ... < c_K < 1 be the turning points of Definition 1 and d_1 < ... < d_L the selected knots with m_s = n^{\alpha} (0 < α < 1). Then for 0 < γ < 1 − α and λ_0 large enough,

\mathrm{pr}\Bigl( L \ge K, \; \max_{1 \le k \le K} \min_{1 \le l \le L} |d_l - c_k| < n^{-\gamma} \Bigr) \to 1, \qquad n \to \infty.

Details of the proof are in Section 3.1 of the supplementary material. Although the adaptive spline fitting method is motivated by a piece-wise linear model, Theorem 1 shows that, with probability approaching one, all turning points are detected accurately under general varying coefficient functions. The selected knots can be a superset of the true turning points, especially when λ_0 is small, so we tune λ_0 with the BIC (5), which penalizes the number of selected knots, to find the optimal set.

When the varying coefficient functions are piece-wise linear, each coefficient β_j(u) (j = 1, ..., p) can almost surely be written as

\beta_j(u) = a_{j,k} + b_{j,k} u, \qquad c_{j,k-1} < u \le c_{j,k}, \quad k = 1, \ldots, K_j + 1,

where c_{j,0} = 0, c_{j,K_j+1} = 1, and (a_{j,k} − a_{j,k+1})² + (b_{j,k} − b_{j,k+1})² > 0 (k = 1, ..., K_j) if K_j ≥ 1. When K_j = 0, the coefficient β_j(u) is a simple linear function of u; otherwise we call the c_{j,k} (k = 1, ..., K_j) the change points of β_j(u), because the linear relationship differs before and after these points. The union ∪_{j: K_j ≥ 1} {c_{j,1}, ..., c_{j,K_j}} then collects all the change points of the entire varying coefficient function. In Theorem 2, we show that the adaptive knot selection method almost surely discovers the change points of β(u) without false positive selections.

Theorem 2. Suppose (X, y, u) satisfy the same assumptions as in Theorem 1, and β(u) is a piece-wise linear function of u. Let 0 < c_1 < ... < c_K < 1 be the change points defined above and d_1 < ... < d_L the selected knots with m_s = n^{\alpha} (0 < α < 1). Then for 0 < γ < 1 − α and λ_0 large enough,

\mathrm{pr}\Bigl( L = K, \; \max_{1 \le k \le K} |d_k - c_k| < n^{-\gamma} \Bigr) \to 1, \qquad n \to \infty.

Details of the proof are in Section 3.2 of the supplementary material. The theorem shows that, if the varying coefficient function is piece-wise linear, the method discovers all the change points with probability tending to one.
2.3 Dynamic programming algorithm for adaptive knot selection

The brute-force search for the optimal knots has computational complexity O(2^n) and is impractical. We propose a dynamic programming approach, summarized below, whose computational complexity is of order O(n²). If we further restrict the knots to a predetermined set M, e.g., the m/√n (m = 1, ..., ⌊√n⌋ − 1) quantiles of the u_i's, the complexity can be further reduced to O(|M|²). Note that the algorithm in Section 2.4 of Wang et al. (2017) is a special case of ours with constant x. The algorithm proceeds in three steps:

Step 1 (Data preparation). Sort the data (X_i, u_i, y_i) (i = 1, ..., n) in increasing order of u_i. Without loss of generality, assume u_1 < ... < u_n.

Step 2 (Forward recursion for the minimum loss). Define m_s = ⌈n^{\alpha}⌉ (0 < α < 1) as the smallest segment size, and set λ = λ_0 log(n). Initialize two sequences (Loss_i, Prev_i) (i = 1, ..., n), with Loss_0 = 0. For i = m_s, ..., n, recursively fill in the tables with

\mathrm{Loss}_i = \min_{i' \in I_i} (\mathrm{Loss}_{i'-1} + l_{i':i} + \lambda), \qquad \mathrm{Prev}_i = \arg\min_{i' \in I_i} (\mathrm{Loss}_{i'-1} + l_{i':i} + \lambda),

where I_i = {1} ∪ {m_s + 1, ..., i − m_s + 1} and l_{i′:i} = (i − i′ + 1) log σ̂²_{i′:i}, with σ̂²_{i′:i} the residual variance of regressing y_k on (X_k, u_k X_k) (k = i′, ..., i).

Step 3 (Backward tracing of the knots). Let q = Prev_n. If q = 1, no knot is needed; otherwise, recursively add 0.5(u_{q−1} + u_q) as a new knot and update q to Prev_{q−1}, until q = 1.

When the algorithm is run over a grid of λ_0 values, Steps 2 and 3 are repeated for each λ_0, and the final model with the minimum BIC is returned.
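The recursion translates directly into code. The sketch below implements Steps 1–3 with candidate breakpoints at every sample; restricting candidates to the grid M would shrink the loops accordingly. The names, the default α, and the variance floor are our own assumptions, and we assume the data are already sorted by u.

```python
import numpy as np

def select_knots(X, u, y, lam0, alpha=0.5):
    """Dynamic-programming knot selection (Steps 1-3 of Section 2.3); a sketch.

    Assumes (X, u, y) are already sorted by u (Step 1). Table indices follow
    the paper's 1-based convention where convenient.
    """
    n, p = X.shape
    ms = int(np.ceil(n ** alpha))              # smallest admissible segment size
    lam = lam0 * np.log(n)                     # per-segment penalty
    Z = np.hstack([X, u[:, None] * X])         # regressors (X, uX)

    def seg_loss(a, b):                        # l_{a:b} on 0-based inclusive [a, b]
        coef, *_ = np.linalg.lstsq(Z[a:b + 1], y[a:b + 1], rcond=None)
        sigma2 = np.mean((y[a:b + 1] - Z[a:b + 1] @ coef) ** 2)
        return (b - a + 1) * np.log(max(sigma2, 1e-12))

    loss = np.full(n + 1, np.inf)
    loss[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for i in range(ms, n + 1):                 # Step 2: forward recursion
        starts = [1] + list(range(ms + 1, i - ms + 2))    # the set I_i (1-based)
        loss[i], prev[i] = min(
            (loss[s - 1] + seg_loss(s - 1, i - 1) + lam, s) for s in starts
        )
    knots, q = [], prev[n]                     # Step 3: backward tracing
    while q != 1:
        knots.append(0.5 * (u[q - 2] + u[q - 1]))   # midpoint of u_{q-1}, u_q
        q = prev[q - 1]                        # previous segment ends at q - 1
    return sorted(knots)
```

Here each segment regression is refit from scratch for clarity; a production version would update the segment regressions incrementally to keep the overall cost near O(n²).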
2.4 Predictor-specific adaptive knot selection

The global adaptive knot selection method of Section 2.1 assumes the same knot locations for all predictors, as in most of the polynomial spline literature. However, different coefficients may vary at different resolutions in u, so a separate set of knots for each predictor is preferable. We therefore propose a two-step adaptive spline fitting algorithm on top of the global knot selection. Suppose the fitted model from the one-step adaptive spline fitting is f̂(X, u) = Σ_{j=1}^{p} β̂_j(u) x_j. For predictor j, the knots can be updated by performing the same knot selection procedure between x_j and the partial residual that excludes it, that is,

r_i = y_i - \sum_{j' \ne j} \hat{\beta}_{j'}(u_i) x_{i,j'}, \qquad i = 1, \ldots, n.   (6)

If the BIC in (5) is smaller for the new polynomial spline model with the new knots, the knots and the fitted model are updated. These steps are repeated until the BIC can no longer be decreased. The following three steps summarize the algorithm:

Step 1 (One-step adaptive spline). Run the one-step adaptive spline fitting algorithm on (X_i, u_i, y_i) (i = 1, ..., n). Denote the fitted model by f̂(X, u) and compute the model BIC with (5).

Step 2 (Update knots with BIC). For j = 1, ..., p, compute the residual variable r_i by (6). Run the one-step adaptive spline fitting algorithm on the residual and predictor j, i.e., on (x_{i,j}, u_i, r_i) (i = 1, ..., n). Denote the new fitted model by f̂^j(X, u) and the corresponding BIC by BIC_j. Note that when each predictor is allowed its own number of knots, the term p{L(λ_0) + D + 1} log(n) in (5) should be replaced by {Σ_{j=1}^{p} L_j(λ_0) + p(D + 1)} log(n), where L_j(λ_0) is the number of knots for predictor j.

Step 3 (Recursive model update). If min_j BIC_j < BIC, let j* = argmin_j BIC_j, update the current model to f̂(X, u) = f̂^{j*}(X, u), and repeat Step 2. Otherwise, return the fitted model f̂(X, u).

An alternative way to model the heterogeneous smoothness across coefficients is to replace the initial model in Step 1 with f̂(X, u) = 0 and iterate the two steps above. However, starting from the one-step model is preferred because fitting the residual instead of the original response reduces the mean squared error more efficiently. Simulation studies in Section 3 show that predictor-specific knots can further reduce the mean squared error of the fitted coefficients, compared with the global knot selection approach.
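A single refinement pass of Step 2 can be sketched as follows, reusing the select_knots sketch above. Representing the current fit as a list of callables beta_hat is our own convention, not the paper's.

```python
def refine_predictor(j, X, u, y, beta_hat, lam0):
    """One pass of Step 2 for predictor j: form the partial residual (6)
    and reselect knots for beta_j alone; an illustrative sketch.

    beta_hat: list of callables, beta_hat[k](u) evaluating the current
    fitted coefficient curve of predictor k at the points u.
    """
    p = X.shape[1]
    r = y - sum(beta_hat[k](u) * X[:, k] for k in range(p) if k != j)
    new_knots = select_knots(X[:, [j]], u, r, lam0)
    # The caller refits the spline with these knots and keeps them only if
    # the model BIC (5), with predictor-specific knot counts, decreases.
    return new_knots
```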
2.5 Knot selection in sparse high-dimensional problems

When the number of predictors is large but only a few have non-zero varying coefficients, we perform variable selection over all the predictors. For predictor j, we run the knot selection procedure between x_j and the response, and construct the B-spline functions {B_{j,k}(u)} (k = 1, ..., L_j + D + 1), where L_j is the number of knots and D is the degree of the B-splines. We then apply the variable selection method for varying coefficient models proposed by Wei et al. (2011), which generalizes the group lasso (Yuan & Lin, 2006) and the adaptive lasso (Zou, 2006). Wei et al. (2011) show that the group lasso tends to over-select variables and recommend the adaptive group lasso for variable selection. In their original algorithm, the knots for each predictor are chosen as equidistant quantiles and are not predictor-specific. The following three steps summarize our proposed algorithm:

Step 1 (Select knots for each predictor). Run the one-step adaptive spline fitting algorithm between each predictor x_{i,j} and the response y_i. Denote the corresponding B-splines by B_{j,k}(u) (k = 1, ..., L_j + D + 1), where L_j is the number of knots for predictor j and D is the degree of the splines.

Step 2 (Group lasso for initial variable selection). Run the group lasso algorithm for the loss function

\frac{1}{n} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{i,j} \sum_{k=1}^{L_j+D+1} c_{j,k} B_{j,k}(u_i) \Bigr)^2 + \lambda_1 \sum_{j=1}^{p} (c_j^{\mathrm{T}} R_j c_j)^{1/2}, \qquad \lambda_1 > 0,

where c_j = (c_{j,1}, ..., c_{j,L_j+D+1})^T and R_j is the kernel matrix whose (k_1, k_2) element is E{B_{j,k_1}(u) B_{j,k_2}(u)}. Denote the fitted coefficients by c̃_{j,k}.

Step 3 (Adaptive group lasso for final variable selection). Run the adaptive group lasso algorithm for the updated loss function

\frac{1}{n} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{i,j} \sum_{k=1}^{L_j+D+1} c_{j,k} B_{j,k}(u_i) \Bigr)^2 + \lambda_2 \sum_{j=1}^{p} w_j (c_j^{\mathrm{T}} R_j c_j)^{1/2}, \qquad \lambda_2 > 0,

with weights w_j = ∞ if c̃_j^T R_j c̃_j = 0 and w_j = (c̃_j^T R_j c̃_j)^{−1/2} otherwise. Denote the fitted coefficients by ĉ_{j,k}; the selected variables are those with ĉ_j^T R_j ĉ_j ≠ 0.

In Steps 2 and 3, the parameters λ_1 and λ_2 are chosen by minimizing the BIC (5) of the fitted model, where the degrees of freedom are computed from the selected predictors only. Simulation studies in Section 3.2 show that, with the knot selection in Step 1, the algorithm almost always chooses the correct predictors given a reasonable number of samples.
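The group structure can be made concrete: with B_j the n × (L_j + D + 1) basis matrix for predictor j, R_j is estimated by B_j^T B_j / n, and the adaptive weights follow from the first-stage group lasso fit. A sketch follows; the group lasso solver itself is not shown (any proximal-gradient or blockwise-descent routine would do), and the names are ours.

```python
import numpy as np

def group_norms_and_weights(B_list, c_tilde):
    """Group norms (c_j' R_j c_j)^{1/2} and adaptive group-lasso weights w_j.

    B_list[j]: basis matrix for predictor j; c_tilde[j]: first-stage group
    lasso coefficients for predictor j.
    """
    norms, weights = [], []
    for Bj, cj in zip(B_list, c_tilde):
        Rj = Bj.T @ Bj / Bj.shape[0]           # empirical E[B_{j,k1}(u) B_{j,k2}(u)]
        nj = float(np.sqrt(cj @ Rj @ cj))
        norms.append(nj)
        weights.append(np.inf if nj == 0.0 else 1.0 / nj)   # w_j = inf drops group j
    return norms, weights
```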
3 Empirical studies

3.1 Simulation study for adaptive spline fitting

In this section, we compare the proposed one-step and two-step adaptive spline fitting approaches with a commonly used kernel method, implemented in the package tvReg (Casas & Fernandez-Casal, 2019), using the simulation examples from Tang & Cheng (2012). The model simulates longitudinal data, which are frequently encountered in biomedical applications. In this example, the conditioning random variable represents time, so for convenience we write t instead of u. We simulate n individuals, each with a set of scheduled observation times {0, 1, ..., 19}. A scheduled time is skipped with probability 0.6, in which case no observation is generated at that time. For each non-skipped scheduled time, the actual observation time is the scheduled time plus a random disturbance from Unif(0, 1). Consider the time-dependent predictors X(t) = (x_1(t), x_2(t), x_3(t), x_4(t))^T with

x_1(t) = 1, \quad x_2(t) \sim \mathrm{Bern}(0.6), \quad x_3(t) \sim \mathrm{Unif}(0.1t,\, 2 + 0.1t), \quad x_4(t) \mid x_3(t) \sim N\Bigl(0, \frac{1 + x_3(t)}{2 + x_3(t)}\Bigr).

The response for individual i at observed time t_{i,k} is y_i(t_{i,k}) = \sum_{j=1}^{4} \beta_j(t_{i,k}) x_{i,j}(t_{i,k}) + \varepsilon_i(t_{i,k}), with

\beta_1(t) = 1 + 3.5 \sin(t - 3), \qquad \beta_2(t) = 2 - 5\cos(0.75t - 0.25),
\beta_3(t) = 4 - 0.04(t - 12)^2, \qquad \beta_4(t) = 1 + 0.125t + 4.6(1 - 0.1t)^3.

The random errors ε_i(t_{i,k}) are independent of the predictors and are generated as the sum of two components,

\varepsilon_i(t_{i,k}) = v_i(t_{i,k}) + e_i(t_{i,k}),

where the e_i(t_{i,k}) are i.i.d. N(0, 4), and v_i(t_{i,k}) ∼ N(0, 4) with correlation structure

\mathrm{cor}\{ v_{i_1}(t_{i_1,k_1}), v_{i_2}(t_{i_2,k_2}) \} = I(i_1 = i_2) \exp(-|t_{i_1,k_1} - t_{i_2,k_2}|).

Random errors are thus positively correlated within the same individual and independent across individuals.

Figure 1 shows the true coefficients and the coefficients fitted by the two-step adaptive spline and kernel methods for an example with n = 200. We observe that the coefficients fitted by the two-step method are smoother than those fitted by the kernel method.

Figure 1: The true coefficients and fitted coefficients with n = 200. Dots: true β_j(t); solid lines: adaptive spline; dashed lines: kernel.

We compare the three methods by the mean squared errors of their estimated coefficients, i.e.,

\mathrm{MSE}_j = \frac{1}{N} \sum_{i=1}^{n} \sum_{k=1}^{n_i} \frac{ \{ \hat{\beta}_j(t_{i,k}) - \beta_j(t_{i,k}) \}^2 }{ \mathrm{range}(\beta_j)^2 },   (7)

where β_j(t) and β̂_j(t) are the true and estimated coefficients, respectively, n_i is the total number of observations for individual i, and N = Σ_{i=1}^{n} n_i. We run the simulation with n = 200 for 1000 repetitions. For the adaptive spline methods, we only consider knots at the m/√N (m = 1, ..., ⌊√N⌋) quantiles of the t_{i,k}.

Table 1: MSE_j (×10²) for the one-step and two-step adaptive spline fitting methods, compared with the kernel method.

            MSE_1        MSE_2        MSE_3        MSE_4
one-step    2.12 (1.12)  0.32 (0.12)  0.55 (0.27)  0.08 (0.04)
two-step    1.70 (1.02)  0.32 (0.12)  0.39 (0.24)  0.05 (0.03)
kernel      2.80 (0.92)  0.38 (0.11)  0.81 (0.26)  0.14 (0.04)

Table 1 shows the average MSE_j for the proposed one-step and two-step methods, in comparison with the kernel method. The two-step method has the smallest MSE_j for all four coefficients, significantly outperforming the other two methods. Interestingly, the one-step adaptive method selects on average 6.1 global knots for all predictors, whereas the two-step method selects fewer per predictor: on average 5.8 knots for x_1(t), 4.2 for x_2(t), 4.0 for x_3(t), and 2.0 for x_4(t). This is expected, since β_3(t) and β_4(t) are less volatile than β_1(t) and β_2(t). The two-step procedure thus achieves a better fit with a smaller number of knots than the other two methods.
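For reference, the error metric (7) is straightforward to compute once the fitted curves are evaluated at the observation times; a minimal sketch with our own names:

```python
import numpy as np

def normalized_mse(beta_hat_vals, beta_vals):
    """MSE_j of (7): mean squared error over all N pooled observation times,
    normalized by the squared range of the true coefficient."""
    beta_hat_vals = np.asarray(beta_hat_vals)
    beta_vals = np.asarray(beta_vals)
    rng_j = beta_vals.max() - beta_vals.min()
    return float(np.mean((beta_hat_vals - beta_vals) ** 2) / rng_j ** 2)
```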
3.2 Simulation study for variable selection

We use the simulation example from Wei et al. (2011) to compare the performance of our method with one using the adaptive group lasso and equidistant knots. As in the previous subsection, there are n individuals, each with scheduled observation times {0, 1, ..., 29} and a skipping probability of 0.6. For each non-skipped scheduled time, the actual observation time is the scheduled time plus a random disturbance from Unif(0, 1). We construct p = 500 time-dependent predictors as follows:

x_1(t) \sim \mathrm{Unif}(0.05 + 0.1t,\, 2.05 + 0.1t),
x_j(t) \mid x_1(t) \sim N\Bigl(0, \frac{1 + x_1(t)}{2 + x_1(t)}\Bigr), \quad j = 2, \ldots, 5,
x_6(t) \sim N\bigl(3 \exp\{(t + 0.5)/30\},\, 1\bigr),
x_j(t) \overset{\text{i.i.d.}}{\sim} N(0, 4), \quad j = 7, \ldots, 500.

Within the same individual, the predictors x_j(t) (j = 7, ..., 500) are correlated with cor{x_j(t), x_j(s)} = exp(−|t − s|). The response for individual i at observed time t_{i,k} is y_i(t_{i,k}) = \sum_{j=1}^{6} \beta_j(t_{i,k}) x_{i,j}(t_{i,k}) + \varepsilon_i(t_{i,k}). The time-varying coefficients β_j(t) (j = 1, ..., 6) are

\beta_1(t) = 15 + 20 \sin\{\pi(t + 0.5)/15\}, \qquad \beta_2(t) = 15 + 20 \cos\{\pi(t + 0.5)/15\},
\beta_3(t) = 2 - 3 \sin\{\pi(t - 24.5)/15\}, \qquad \beta_4(t) = 2 - 3 \cos\{\pi(t - 24.5)/15\},
\beta_5(t) = 6 - 0.2(t + 0.5)^2, \qquad \beta_6(t) = -4 + 5 \times 10^{-4} (19.5 - t)^3.

The random error ε_i(t_{i,k}) is independent of the predictors and follows the same distribution as in Section 3.1. We simulate cases with n = 50, 100, 200 and replicate each setting 200 times. Three metrics are considered: the average number of selected variables, the percentage of cases with no false negative, and the percentage of cases with no false positive or negative. A comparison of our method with the variable selection method using equidistant knots (Wei et al., 2011) is summarized in Table 2. Our method clearly outperforms the method without predictor-specific knot selection, and always selects the correct predictors when n is at least 100.

Table 2: Variable selection performance for adaptive group lasso with and without predictor-specific knot selection.

Equidistant knots
           # selected variables   % no false negative   % no false positive or negative
n = 50     7.04                   72                    68
n = 100    6.21                   87                    84
n = 200    6.13                   99                    93

Adaptively selected knots
           # selected variables   % no false negative   % no false positive or negative
n = 50     5.85                   88                    88
n = 100    6.00                   100                   100
n = 200    6.00                   100                   100

4 Application: a study of environmental factors and COVID-19

The dataset we investigate contains daily measurements of meteorological data and air quality data in 7 counties of the state of New York between March 1, 2020, and September 30, 2021. The meteorological data were obtained from the National Oceanic and Atmospheric Administration Regional Climate Centers, Northeast Regional Climate Center at Cornell University: http://www.nrcc.cornell.edu. The daily data are based on the average of the hourly measurements of several stations in each county
and are composed of records of five meteorological components: temperature in Fahrenheit, dew point in Fahrenheit, wind speed in miles per hour, precipitation in inches, and humidity in percentage. The air quality data were obtained from the Environmental Protection Agency: https://www.epa.gov. The data contain daily records of two major air quality components: fine particles with an aerodynamic diameter of 2.5 µm or less (PM2.5), in µg/m³, and ozone, in µg/m³. The objective of the study is to understand the association between the meteorological measurements, in conjunction with pollutant levels, and the number of infected cases of COVID-19, a contagious disease caused by severe acute respiratory syndrome coronavirus 2, and to examine whether the association varies over time. The daily infection records were retrieved from the official website of the New York State Department of Health: https://data.ny.gov. Figure 2 shows scatter plots of daily infected cases in New York County and the 7 environmental components over time. To remove the variation of recorded cases between weekdays and weekends, in the following study we consider the weekly averaged infected cases, i.e., the average over each day and the following 6 days. We also find that the temperature and dew point factors are highly correlated, so we remove the dew point factor when fitting the model.

We first take the logarithm of the weekly averaged infected cases, denoted y, and then fit a varying coefficient model with the following predictors: x_1 = 1 as the intercept, x_2 temperature, x_3 wind speed, x_4 precipitation, x_5 humidity, x_6 PM2.5, and x_7 ozone. Time t is the conditioner of the model. All predictors except the constant are normalized to make the varying coefficients β_j(t) comparable. The varying coefficient model has the form

y_i(t_{i,k}) = \sum_{j=1}^{7} \beta_j(t_{i,k}) x_{i,j}(t_{i,k}) + \varepsilon(t_{i,k}),   (8)

where t_{i,k} is the k-th record time for the i-th county, y_i(t_{i,k}) and x_{i,j}(t_{i,k}) are the corresponding records for county i at time t_{i,k}, and the i.i.d. ε(t_{i,k}) ∼ N(0, σ²) denote the error terms.
Figure 2: Environmental measurements and COVID-19 infected cases in New York County, NY.

We apply the proposed two-step adaptive spline fitting method to these data. Figure 3 shows the fitted β_j(t) for each predictor. The figures show a strong time effect in each coefficient function. For example, the intercept has several peaks, corresponding to the initial outbreak and the delta variant outbreak. Moreover, the coefficients change rapidly around March 2020, which may be explained by the fact that, at the beginning of the outbreak, tests were few and the number of infected cases was underestimated. The coefficient curves also show that the most important predictor is temperature. For most of the period, its coefficient is negative, indicating that high temperature is negatively associated with transmission of the virus; this agrees with the study by ?, which found that COVID-19 spreads more slowly at high temperatures. Beyond this negative association, our analysis also demonstrates the time-varying nature of the coefficient.

We note that environmental factors may not have an immediate effect on the number of recorded COVID-19 cases, since there is a 2-14 day incubation period before the onset of symptoms and it takes another 2-7 days before a test result is available. To study whether there are lagging effects between the predictor and response variables, we fit a varying coefficient model for each time lag τ, i.e., using data y_i(t_{i,k} + τ) and x_{i,j}(t_{i,k}) (j = 1, ..., 7) in model (8), similarly to Fan & Zhang (1999). Figure 4 shows the residual root mean squared error for each time lag; it is smallest at a lag of 4 days. Recall that y_i(t) represents the logarithm of the weekly average between days t and t + 6, so the predictors at day t are most predictive of infection counts between days t + 4 and t + 10, which is consistent with the incubation period and test duration. Figure 3 also shows the lagged coefficients with τ = 4 (dashed), revealing nearly identical shapes to the non-lagged coefficients except for the PM2.5 coefficient, which may indicate a large day-to-day variation of the PM2.5 measurements.
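The lag analysis can be sketched as a simple scan: for each τ, shift the response by τ days, refit model (8), and record the residual root mean squared error. Everything here (fit_model, the day-keyed response lookup) is an assumed interface of ours, not the paper's code.

```python
import numpy as np

def lag_scan(fit_model, X, t_days, y_by_day, lags=range(22)):
    """Residual RMSE of model (8) for each candidate lag tau; a sketch.

    fit_model(X, t, y) -> fitted values at the training points.
    y_by_day: dict mapping integer day -> log weekly-averaged cases.
    t_days: integer-day array aligned with the rows of X.
    """
    rmse = {}
    for tau in lags:
        y_lag = np.array([y_by_day.get(d + tau, np.nan) for d in t_days])
        keep = ~np.isnan(y_lag)               # drop days with no lagged response
        fitted = fit_model(X[keep], t_days[keep], y_lag[keep])
        rmse[tau] = float(np.sqrt(np.mean((y_lag[keep] - fitted) ** 2)))
    return min(rmse, key=rmse.get), rmse      # best lag and the full profile
```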
Figure 3: Fitted coefficients over time. Solid lines: fitted coefficients without lag; dashed lines: fitted coefficients with a 4-day lag.

5 Discussion

In this paper, we introduce two algorithms for fitting varying coefficient models with adaptive polynomial splines. The first algorithm selects knots globally using a recursive method, assuming the same set of knots for all the predictors, whereas the second allows each predictor to have its own set of knots and also uses a recursive method for individualized knot selection. Coefficients modeled by polynomial splines with a finite number of non-regularly positioned knots are both more flexible and more interpretable than standard ones with equidistant knot placements. Simulation studies show that both the one-step and two-step algorithms outperform the commonly used kernel method in terms of mean squared error, with the two-step algorithm achieving the best performance. A fast dynamic programming algorithm is introduced, with a computational complexity of at most O(n²), which drops to O(n) when the knots are chosen from the m/√n (m = 1, ..., ⌊√n⌋ − 1) quantiles of the conditioning variable u.

In this paper, we assume that the variable u is univariate. The proposed two-step approach generalizes easily to the case where each coefficient β_j(u) has its own univariate conditioning variable. However, generalizing the proposed method to multi-dimensional conditioners and to models with correlated errors remains a challenging problem.
Figure 4: Root mean squared error for the logarithm of weekly averaged infected cases, with lags τ = 0, ..., 21 days.

Acknowledgement

We are grateful to the National Oceanic and Atmospheric Administration Regional Climate Centers, Northeast Regional Climate Center at Cornell University for kindly sharing the meteorological data. We thank the Environmental Protection Agency and the New York State Department of Health for publishing the air quality data and daily infection records online. The views expressed herein are the authors' alone and are not necessarily the views of Two Sigma Investments, Limited Partnership, or any of its affiliates.

Implementation

We provide an R implementation of the varying coefficient model via adaptive spline fitting discussed in this paper. The R package is available at https://github.com/wangxf0106/vcmasf, which also provides detailed instructions on how to use the software.

References

Astrid, J., Philippe, L., Benoit, B., & F., V. (2009). Pharmacokinetic parameter estimation using adaptive Bayesian P-splines models. Pharmaceutical Statistics, 8.

Casas, I. & Fernandez-Casal, R. (2019). tvReg: Time-varying coefficient linear regression for single and multi-equations in R. SSRN.
Chen, Y., Wang, Q., & Yao, W. (2015). Adaptive estimation for varying coefficient models. Journal of Multivariate Analysis, 137, 17–31.

Fan, J. & Zhang, W. (1999). Statistical estimation in varying coefficient models. The Annals of Statistics, 27(5), 1491–1518.

Fan, J. & Zhang, W. (2000). Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scandinavian Journal of Statistics, 27(4), 715–731.

Finley, A. O. & Banerjee, S. (2020). Bayesian spatially varying coefficient models in the spBayes R package. Environmental Modelling & Software, 125, 104608.

Hastie, T. & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B (Methodological), 55(4), 757–796.

Hu, L., Huang, T., & You, J. (2019). Estimation and identification of a varying-coefficient additive model for locally stationary processes. Journal of the American Statistical Association, 114(527), 1191–1204.

Huang, J. Z. & Shen, H. (2004). Functional coefficient regression models for non-linear time series: A polynomial spline approach. Scandinavian Journal of Statistics, 31(4), 515–534.

Huang, J. Z., Wu, C. O., & Zhou, L. (2002). Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika, 89(1), 111–128.

Huang, J. Z., Wu, C. O., & Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 763–788.

Lin, H., Yang, B., Zhou, L., Yip, P. S., Chen, Y.-Y., & Liang, H. (2019). Global kernel estimator and test of varying-coefficient autoregressive model. Canadian Journal of Statistics, 47(3), 487–519.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.

Tang, Q. & Cheng, L. (2012). Componentwise B-spline estimation for varying coefficient models with longitudinal data. Statistical Papers, 53, 629–652.

Wang, W. & Sun, Y. (2019). Penalized local polynomial regression for spatial data. Biometrics, 75(4), 1179–1190.

Wang, X., Jiang, B., & Liu, J. (2017). Generalized R-squared for detecting dependence. Biometrika, 104(1), 129–139.

Wei, F., Huang, J., & Li, H. (2011). Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica, 21(4), 1515.
Yuan, M. & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.

Zhang, X. & Wang, J.-L. (2014). Varying-coefficient additive models for functional data. Biometrika, 102(1), 15–32.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.