GMM, Weak Instruments, and Weak Identification
James H. Stock
Kennedy School of Government, Harvard University, and the National Bureau of Economic Research

Jonathan Wright
Federal Reserve Board, Washington, D.C.

and

Motohiro Yogo*
Department of Economics, Harvard University

June 2002

ABSTRACT

Weak instruments arise when the instruments in linear IV regression are weakly correlated with the included endogenous variables. In nonlinear GMM, weak instruments correspond to weak identification of some or all of the unknown parameters. Weak identification leads to non-normal distributions, even in large samples, so that conventional IV or GMM inferences are misleading. Fortunately, various procedures are now available for detecting and handling weak instruments in the linear IV model and, to a lesser degree, in nonlinear GMM.

*Prepared for the Journal of Business and Economic Statistics symposium on GMM. This research was supported in part by NSF grant SBR-0214131.
1. Introduction

A subtle but important contribution of Hansen and Singleton's (1982) and Hansen's (1982) original work on generalized method of moments (GMM) estimation was to recast the requirements for instrument exogeneity. In the linear simultaneous equations framework then prevalent, instruments are exogenous if they are excluded from the equation of interest; in GMM, instruments are exogenous if they satisfy a conditional mean restriction that, in Hansen and Singleton's (1982) application, was implied directly by a tightly specified economic model. Of course, at a mathematical level, these two requirements are the same, but conceptually the starting point is different. The shift from debatable (in Sims' (1980) words, "incredible") exclusion restrictions to first order conditions derived from economic theory has proven to be a highly productive way to think about candidate instrumental variables in a wide variety of applications. Accordingly, careful consideration of instrument exogeneity is now a standard part of a well-done empirical analysis using GMM.

Instrument exogeneity, however, is only one of the criteria necessary for an instrument to be valid, and recently the other criterion – instrument relevance – has received increased attention from theoretical and applied researchers. It now appears that some, perhaps many, applications of GMM and instrumental variables (IV) regression have what is known as "weak instruments," that is, instruments that are only weakly correlated with the included endogenous variables. Unfortunately, weak instruments pose considerable challenges to inference using GMM and IV methods.
This paper surveys weak instruments and their counterpart in nonlinear GMM, weak identification. We have five main themes:

1. If instruments are weak, then the sampling distributions of GMM and IV statistics are in general non-normal, and standard GMM and IV point estimates, hypothesis tests, and confidence intervals are unreliable.

2. Weak instruments are commonplace in empirical research. This should not be a surprise. Finding exogenous instruments is hard work, and the features that make an instrument plausibly exogenous – for example, occurring sufficiently far in the past to satisfy a first order condition, or the as-if random coincidence that lies behind a quasi-experiment – can also work to make the instrument weak.

3. It is not useful to think of weak instruments as a "small sample" problem. Bound, Jaeger and Baker (1995) provide an empirical example, based on an important article by Angrist and Krueger (1991), in which weak instrument problems arise despite having 329,000 observations. In a formal mathematical sense, the strength of the instruments, as measured by the so-called concentration parameter, plays the role of the sample size in determining the quality of the usual normal approximation.

4. If you have weak instruments, you do not need to abandon your empirical research, but neither should you use conventional GMM or IV methods. Various tools are available for handling weak instruments in the linear IV
regression model, and although research in this area is far from complete, a judicious use of these tools can result in reliable inferences.

5. What to do about weak instruments – even how to recognize them – is a much more difficult problem in general nonlinear GMM than in linear IV regression, and much theoretical work remains. Still, it is possible to make some suggestions for empirical practice.

This survey emphasizes the linear IV regression model, simply because much more is known about weak instruments in this case. We begin in Section 2 with a primer on weak instruments in linear IV regression. With this as background, Section 3 discusses three important empirical applications that confront the challenge of weak instruments. Sections 4–6 discuss recent econometric approaches to weak instruments: their detection (Section 4); methods that are fully robust to weak instruments, at least in large samples (Section 5); and methods that are somewhat simpler to use and are partially robust to weak instruments (Section 6). Section 7 turns to weak identification in nonlinear GMM, its consequences, and methods for detecting and handling weak identification. Section 8 concludes.

As we see in the next section, many of the key ideas of weak instruments have been understood for decades and can be explained in the context of the simplest IV regression model. This said, most of the literature on solutions to the problem of weak instruments is quite recent, and this literature is expanding rapidly; we both fear and hope that much of the practical advice in this survey will soon be out of date.
2. A Primer on Weak Instruments in the Linear Regression Model

Many of the problems posed by weak instruments in the linear IV regression model are best explained in the context of the classical version of that model with fixed exogenous variables and i.i.d., normally distributed errors. This section therefore begins by using this model to show how weak instruments lead to TSLS having a non-normal sampling distribution, regardless of the sample size. The fixed-instrument, normal-error model, however, has strong assumptions that are empirically unrealistic. This section therefore concludes with a synopsis of asymptotic methods that weaken these strong assumptions yet attempt to retain the insights gained from the finite sample distribution theory.

2.1 The Linear Gaussian IV Regression Model with a Single Regressor

The linear IV regression model with a single endogenous regressor and no included exogenous variables is

y = Yβ + u,   (1)
Y = ZΠ + v,   (2)

where y and Y are T×1 vectors of observations on endogenous variables, Z is a T×K matrix of instruments, and u and v are T×1 vectors of disturbance terms. The instruments are assumed to be nonrandom (fixed). The errors (u_t, v_t) are assumed to be i.i.d. and normally distributed N(0, Σ), t = 1, …, T, where the elements of Σ are σ_u², σ_uv, and σ_v².
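Before interpreting (1) and (2), it may help to see the weak-instrument problem numerically. The following simulation sketch is not part of the original paper; it assumes Python with NumPy and a design loosely matching the Nelson and Startz example discussed later in this section (σ_u = σ_v = 1, ρ = .99). It repeatedly draws the TSLS estimator for several values of the concentration parameter µ² and summarizes how the sampling distribution tightens only as µ² grows.

```python
import numpy as np

def simulate_tsls(mu2, T=100, K=1, rho=0.99, beta=0.0, n_rep=5000, seed=0):
    """Monte Carlo draws of the TSLS estimator in the model (1)-(2).

    mu2 is the concentration parameter mu^2 = Pi'Z'Z Pi / sigma_v^2; the
    coefficient vector Pi is scaled so the design attains this value exactly
    (sigma_u = sigma_v = 1, corr(u, v) = rho).
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((T, K))           # fixed instruments across replications
    Pi = np.ones(K)
    Pi *= np.sqrt(mu2 / (Pi @ Z.T @ Z @ Pi))  # enforce Pi'Z'Z Pi = mu2
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)    # projection matrix P_Z
    draws = np.empty(n_rep)
    for r in range(n_rep):
        v = rng.standard_normal(T)
        u = rho * v + np.sqrt(1 - rho**2) * rng.standard_normal(T)
        Y = Z @ Pi + v
        y = Y * beta + u
        draws[r] = (Y @ PZ @ y) / (Y @ PZ @ Y)   # TSLS: Y'P_Z y / Y'P_Z Y
    return draws

for mu2 in (0.25, 10.0, 100.0):
    d = simulate_tsls(mu2)
    iqr = np.subtract(*np.percentile(d, [75, 25]))
    print(f"mu^2 = {mu2:6.2f}: median of beta_hat = {np.median(d):.3f}, IQR = {iqr:.3f}")
```

With µ² = .25 the draws are widely dispersed and centered well away from the true value of zero, while with µ² = 100 the distribution is tight around zero, which previews the role of the concentration parameter developed next.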
Equation (1) is the structural equation, and β is the scalar parameter of interest. Equation (2) relates the included endogenous variable to the instruments.

The concentration parameter. The concentration parameter, µ², is a unitless measure of the strength of the instruments, and is defined as

µ² = Π′Z′ZΠ/σ_v².   (3)

A useful interpretation of µ² is in terms of F, the F-statistic testing the hypothesis that Π = 0 in (2); because F is the F-statistic testing for nonzero coefficients on the instruments in the first stage of TSLS, it is commonly called the "first-stage F-statistic," terminology that is adopted here. Let F̃ be the infeasible counterpart of F, computed using the true value of σ_v²; then F̃ has a noncentral χ²_K/K distribution with noncentrality parameter µ²/K, and E(F̃) = µ²/K + 1. If the sample size is large, F and F̃ are close, so E(F) ≈ µ²/K + 1; that is, the expected value of the first-stage F-statistic is approximately 1 + µ²/K. Thus, larger values of µ²/K shift out the distribution of the first-stage F-statistic. Said differently, F – 1 can be thought of as the sample analog of µ²/K.

An expression for the TSLS estimator. The TSLS estimator of β is

β̂_TSLS = Y′P_Z y / Y′P_Z Y,   (4)
where P_Z = Z(Z′Z)⁻¹Z′. Rothenberg (1984) presents a useful expression for the TSLS estimator, which obtains after a few lines of algebra. First substitute the expression for Y in (2) into (4) to obtain

β̂_TSLS – β = (Π′Z′u + v′P_Z u) / (Π′Z′ZΠ + 2Π′Z′v + v′P_Z v).   (5)

Now define

z_u = Π′Z′u / [σ_u √(Π′Z′ZΠ)],   z_v = Π′Z′v / [σ_v √(Π′Z′ZΠ)],
S_uv = v′P_Z u / (σ_v σ_u),   S_vv = v′P_Z v / σ_v².

Under the assumptions of fixed instruments and normal errors, the distributions of z_u, z_v, S_uv, and S_vv do not depend on the sample size T: z_u and z_v are standard normal random variables with a correlation equal to ρ, the correlation between u and v, and S_uv and S_vv are quadratic forms of normal random variables with respect to the idempotent matrix P_Z. Substituting these definitions into (5), multiplying both sides by µ, and collecting terms yields

µ(β̂_TSLS – β) = (σ_u/σ_v) (z_u + S_uv/µ) / (1 + 2z_v/µ + S_vv/µ²).   (6)
As this expression makes clear, the terms z_v, S_uv, and S_vv result in the TSLS estimator having a non-normal distribution. If, however, the concentration parameter µ² is large, then the terms S_uv/µ, z_v/µ, and S_vv/µ² will all be small in a probabilistic sense, so that the dominant term in (6) will be z_u, which in turn yields the usual normal approximation to the distribution of β̂_TSLS. Formally, µ² plays the role in (6) that is usually played by the number of observations: as µ² gets large, the distribution of µ(β̂_TSLS – β) is increasingly well approximated by the N(0, σ_u²/σ_v²) distribution. For the normal approximation to the distribution of the TSLS estimator to be a good one, it is not enough to have many observations: the concentration parameter must be large.

Bias of the TSLS estimator in the unidentified case. When µ² = 0 (equivalently, when Π = 0), the instrument is not just weak but irrelevant, and the TSLS estimator is centered around the probability limit of the OLS estimator. To see this, use (5) and Π = 0 to obtain β̂_TSLS – β = v′P_Z u / v′P_Z v. It is useful to write u = E(u|v) + η where, because u and v are jointly normal, E(u|v) = (σ_uv/σ_v²)v and η is normally distributed. Moreover, σ_uv = σ_uY and σ_v² = σ_Y² because Π = 0. Thus u = (σ_uY/σ_Y²)v + η, so

β̂_TSLS – β = v′P_Z[(σ_uY/σ_Y²)v + η] / v′P_Z v = σ_uY/σ_Y² + v′P_Z η / v′P_Z v.   (7)

Because η and v are independently distributed and Z is fixed, E[(v′P_Z η / v′P_Z v)|v] = 0. Suppose that K ≥ 3, so that the first moment of the final ratio in (7) exists; then by the law of iterated expectations,
E(β̂_TSLS – β) = σ_uY/σ_Y².   (8)

The right hand side of (8) is the inconsistency of the OLS estimator when Y is correlated with the error term; that is, plim β̂_OLS = β + σ_uY/σ_Y². Thus, when µ² = 0, the expectation of the TSLS estimator is the probability limit of the OLS estimator.

The result in (8) applies to the limiting case of irrelevant instruments; with weak instruments, the TSLS estimator is biased towards the probability limit of the OLS estimator. Specifically, define the "relative bias" of TSLS to be the bias of TSLS relative to the inconsistency of OLS, that is, E(β̂_TSLS – β)/plim(β̂_OLS – β). Then the TSLS relative bias is approximately inversely proportional to 1 + µ²/K (this result holds whether or not the errors are normally distributed (Buse (1992))). Hence, the relative bias decreases monotonically in µ²/K.

Numerical examples. Figures 1a and 1b respectively present the densities of the TSLS estimator and its t-statistic for various values of the concentration parameter, when the true value of β is zero. The other parameter values mirror those in Nelson and Startz (1990a, 1990b), with σ_u = σ_v = 1, ρ = .99, and K = 1 (one instrument). For small values of µ², such as the value of .25 considered by Nelson and Startz, the distributions are strikingly non-normal, even bimodal. As µ² increases, the distributions eventually approach the usual normal limit.

Under the assumptions of fixed instruments and normal errors, the distributions in Figure 1 depend on the sample size only through the concentration parameter; for a
given value of the concentration parameter, the distributions do not depend on the sample size. Thus, these figures illustrate the point made more generally by (6) that the quality of the usual normal approximation depends on the size of the concentration parameter.

The Nelson–Startz results build on a large literature on the exact distribution of IV estimators under the assumptions of fixed exogenous variables and i.i.d. normal errors (e.g., the exact distribution of the TSLS estimator was obtained by Richardson (1968) and Sawa (1969) for the case of a single right hand side endogenous regressor). From a practical perspective, this literature has two drawbacks. First, the expressions for distributions in this literature, comprehensively reviewed by Phillips (1984), are among the most offputting in econometrics and pose substantial computational challenges. Second, because the assumptions of fixed instruments and normal errors are inappropriate in applications, it is unclear how to apply these results to the sorts of problems encountered in practice. To obtain more generally useful results, researchers have focused on asymptotic approximations, which we now briefly review.

2.2 Asymptotic Approximations

Asymptotic distributions can provide good approximations to exact sampling distributions. Conventionally, asymptotic limits are taken for a fixed model as the sample size gets large, but this is not the only approach, and for some problems it is not the best approach, in the sense that it does not necessarily provide the most useful approximating distribution. This is the case for the weak instruments problem: as is evident in Figure 1, the usual fixed-model large-sample normal approximations can be quite poor when the concentration parameter is small, even if the number of observations
is large. For this reason, various alternative asymptotic methods have been used to analyze IV statistics in the presence of weak instruments; three such methods are Edgeworth expansions, many-instrument asymptotics, and weak-instrument asymptotics. Because the concentration parameter plays a role in these distributions akin to the sample size, all these methods aim to improve the quality of the approximations when µ²/K is small but the number of observations is large.

Edgeworth expansions. An Edgeworth expansion is a representation of the distribution of the statistic of interest in powers of 1/√T; Edgeworth and related expansions of IV estimators are reviewed by Rothenberg (1984). As Rothenberg (1984) points out, in the fixed-instrument, normal-error model, an Edgeworth expansion in 1/√T with a fixed model is formally equivalent to an Edgeworth expansion in 1/µ. In this sense, Edgeworth expansions improve upon the conventional normal approximation when µ is small enough for the term in 1/µ² to matter but not so small that the terms in 1/µ³ and higher matter. Rothenberg (1984) suggests the Edgeworth approximation is "excellent" for µ² > 50 and "adequate" for µ² as small as 10, as long as the number of instruments is small (less than µ).

Many-instrument asymptotics. Although the problems of many instruments and weak instruments might at first seem different, they are in fact related: if there were many strong instruments, then the adjusted R² of the first-stage regression would be nearly one, so a small first-stage adjusted R² suggests that many of the instruments are weak. Bekker (1994) formalized this notion by developing asymptotic approximations for a sequence of models with fixed instruments and Gaussian errors, in which the number of instruments, K, is proportional to the sample size and µ²/K converges to a
constant, finite limit; similar approaches were taken by Anderson (1976), Kunitomo (1980), and Morimune (1983). The limiting distributions obtained from this approach generally are normal, and simulation evidence suggests that these approximations are good for moderate as well as large values of K, although they cannot capture the non-normality evident in the Nelson–Startz example in Figure 1. Distributions obtained using this approach generally depend on the error distribution (see Bekker and van der Ploeg (1999)), so some procedures justified using many-instrument asymptotics require adjustments if the errors are non-normal. Rate and consistency results are, however, more robust to non-normality (see Chao and Swanson (2002)).

Weak-instrument asymptotics. Like many-instrument asymptotics, weak-instrument asymptotics (Staiger and Stock (1997)) considers a sequence of models chosen to keep µ²/K constant; unlike many-instrument asymptotics, K is held fixed. Technically, the sequence of models considered is the same as is used to derive the local asymptotic power of the first-stage F test (a "Pitman drift" parameterization in which Π is in a 1/√T neighborhood of zero). Staiger and Stock (1997) show that, under general conditions on the errors and with random instruments, the representation in (6) can be reinterpreted as holding asymptotically in the sense that the sample size T tends to infinity while µ²/K is fixed. More generally, many of the results from the fixed-instrument, normal-error model apply to models with random instruments and nonnormal errors, with certain simplifications arising from the consistency of the estimator of σ_v².

2.3 Weak Instruments with Multiple Regressors

The linear IV model with multiple regressors is,
y = Yβ + Xγ + u,   (9)
Y = ZΠ + XΦ + V,   (10)

where Y is now a T×n matrix of observations on n endogenous variables, X is a T×J matrix of included exogenous variables, and V is a T×n matrix of disturbances with covariance matrix Σ_VV; as before, Z is a T×K matrix of instruments and u is T×1. The concentration parameter is now an n×n matrix. Expressed in terms of population moments, the concentration matrix is

Σ_VV^{–1/2}′ Π′ Σ_Z|X Π Σ_VV^{–1/2},

where Σ_Z|X = Σ_ZZ – Σ_ZX Σ_XX⁻¹ Σ_XZ. To avoid introducing new notation, we refer to the concentration parameter as µ² in both the scalar and matrix cases.

Exact distribution theory for IV statistics with multiple regressors is quite complicated, even with fixed exogenous variables and normal errors. Somewhat simpler expressions obtain using weak-instrument asymptotics (under which reduced form moments, such as Σ_VV and Σ_Z|X, are consistently estimable). The quality of the usual normal approximation is governed by the matrix µ²/K. Because the predicted values of Y from the first-stage regression can be highly correlated, for the usual normal approximations to be good it is not enough for a few elements of µ²/K to be large. Rather, it is necessary for the matrix µ²/K to be large in the sense that its smallest eigenvalue is large.

The notation for IV estimators with included exogenous regressors X (equations (9) and (10)) is cumbersome. Sections 4–6 therefore focus on the case of no included
exogenous variables. The discussion, however, extends directly to the case of included exogenous regressors, and the formulas generally extend by replacing y, Y, and Z by the residuals from their projection onto X and by modifying the degrees of freedom as needed.

3. Three Empirical Examples

There are many examples of weak instruments in empirical work. Here, we mention three from labor economics, finance, and macroeconomics.

3.1 Estimating the Returns to Education

In an influential paper, Angrist and Krueger (1991) proposed using the quarter of birth as an instrumental variable to circumvent ability bias in estimating the returns to education. The date of birth, they argued, should be uncorrelated with ability, so that quarter of birth is exogenous; because of mandatory schooling laws, quarter of birth should also be relevant. Using large samples from the U.S. census, they therefore estimated the returns to education by TSLS, using quarter of birth and its interactions with state and year of birth binary variables. Depending on the specification, they had as many as 178 instruments.

Surprisingly, despite the large number of observations (329,000 observations or more), in some of the Angrist–Krueger regressions the instruments are weak. This point was first made by Bound, Jaeger, and Baker (1995), who (among other things) provide Monte Carlo results showing that similar point estimates and standard errors obtain in
some specifications if each individual's true quarter of birth is replaced by a randomly generated quarter of birth. Because the results with the randomly generated quarter of birth must be spurious, this suggests that the results with the true quarter of birth are misleading. The source of these misleading inferences is weak instruments: in some specifications, the first-stage F-statistic is less than 2, suggesting that µ²/K might be one or less (recall that E(F) – 1 ≈ µ²/K). An important conclusion is that it is not helpful to think of weak instruments as a "finite sample" problem that can be ignored if you have sufficiently many observations.

3.2 The Log-Linearized Euler Equation in the CCAPM

The first empirical application of GMM was Hansen and Singleton's (1982) investigation of the consumption-based capital asset pricing model (CCAPM). In its log-linearized form, the first order condition of the CCAPM with constant relative risk aversion can be written

E[(r_{t+1} + α – γ⁻¹∆c_{t+1}) | Z_t] = 0,   (11)

where γ is the coefficient of risk aversion (here, also the inverse of the intertemporal elasticity of substitution), ∆c_{t+1} is the growth rate of consumption, r_t is the log gross return on some asset, α is a constant, and Z_t is a vector of variables in the information set at time t (Hansen and Singleton (1983); see Campbell (2001) for a textbook treatment). The coefficients of (11) can be estimated by GMM using Z_t as an instrument. One way to proceed is to use TSLS with r_{i,t+1} as the dependent variable; another is to
apply TSLS with ∆c_{t+1} as the dependent variable; a third is to use a method, such as LIML, that is invariant to the normalization. Under standard fixed-model asymptotics, these estimators are asymptotically equivalent, so it should not matter which method is used. But, as discussed in detail in Neely, Roy, and Whiteman (2001) and Yogo (2002), it matters greatly in practice, with point estimates of γ ranging from small (Hansen and Singleton (1982, 1983)) to very large (Hall (1988), Campbell and Mankiw (1989)).

Although one possible explanation for these disparate empirical findings is that the instruments are not exogenous – the restrictions in (11) fail to hold – another possibility is that the instruments are weak. Indeed, the analysis of Stock and Wright (2000), Neely, Roy, and Whiteman (2001), and Yogo (2002) suggests that weak instruments are part of the explanation for these seemingly contradictory results. It should not be a surprise that instruments are weak here: for an instrument to be strong, it must be a good predictor of either consumption growth or an asset return, depending on the normalization, but both are notoriously difficult to predict. In fact, the first-stage F-statistics in regressions based on (11) are frequently less than 5 (Yogo (2002)).

3.3 A Hybrid Phillips Curve

Forward-looking price setting behavior is a prominent feature of modern macroeconomics. The hybrid Phillips curve (e.g., Fuhrer and Moore (1995)) blends forward-looking and backward-looking behavior, so that prices are based in part on expected future prices and in part on past prices. According to a typical hybrid Phillips curve, inflation (π_t) follows,
π_t = β₁x_t + β₂E_tπ_{t+1} + β₃π_{t–1} + ω_t,   (12)

where x_t is a measure of demand pressures, such as the output gap, E_tπ_{t+1} is the expected inflation rate in time t+1 based on information available at time t, and ω_t is a disturbance. Because E_tπ_{t+1} is unobserved, the coefficients of (12) cannot be estimated by OLS, but they can be estimated by GMM based on the moment condition

E[(π_t – β₁x_t – β₂π_{t+1} – β₃π_{t–1}) | x_t, π_t, π_{t–1}, Z_t] = 0,

where Z_t is a vector of variables known at date t. Efforts to estimate (12) using the output gap have met with mixed success (e.g., Roberts (1997), Fuhrer (1997)), although Gali and Gertler (1999) report plausible parameter estimates when marginal cost, as measured by labor's share, is used as x_t.

Although the papers in this literature do not discuss the possibility of weak instruments, recent work by Ma (2001) and Mavroeidis (2001) suggests that weak instruments could be a problem here. Here, too, this should not be a surprise: to be strong, the instrument Z_t must have substantial marginal predictive content for π_{t+1}, given x_t, π_t, and π_{t–1}. However, the regression relating π_{t+1} to x_t, π_t, and π_{t–1} is the "old" backwards-looking Phillips curve which, despite its theoretical limitations, is one of the most reliable tools for forecasting inflation (e.g., Stock and Watson (1999)); for Z_t to be a strong instrument, it must improve substantially upon a backward-looking Phillips curve.

4. Detection of Weak Instruments
This section discusses two methods for detecting weak instruments, one based on the first-stage F-statistic, the other based on a statistic recently proposed by Hahn and Hausman (2002).

4.1 The First-stage F-statistic

Before discussing how to use the first-stage F-statistic to detect weak instruments, we need to say what, precisely, weak instruments are.

A definition of weak instruments. A practical approach to defining weak instruments is that instruments are weak if µ²/K is so small that inferences based on conventional normal approximating distributions are misleading. In this approach, the definition of weak instruments depends on the purpose to which the instruments are put, combined with the researcher's tolerance for departures from the usual standards of inference (bias, size of tests). For example, suppose you are using TSLS and you want its bias to be small. One measure of the bias of TSLS is its bias relative to the inconsistency of OLS, as defined in Section 2. Accordingly, one measure of whether a set of instruments is strong is whether µ²/K is sufficiently large that the TSLS relative bias is, say, no more than 10%; if not, then the instruments are deemed weak. Alternatively, if you are interested in hypothesis testing, then you could define instruments to be strong if µ²/K is large enough that a nominal 5% hypothesis test rejects no more than (say) 15% of the time; otherwise, the instruments are weak. These two definitions – one based on relative bias, one based on size – in general yield different sets of µ²/K; thus instruments might be weak if used for one application, but not if used for another.
Here, we consider the two definitions of weak instruments in the previous paragraph, that is, that the TSLS bias could exceed 10%, or that the nominal 5% test of β = β₀ based on the usual TSLS t-statistic has size that could exceed 15%. As shown by Stock and Yogo (2001), under weak-instrument asymptotics, each of these definitions implies a threshold value of µ²/K: if the actual value of µ²/K exceeds this threshold, then the instruments are strong (e.g., TSLS relative bias is less than 10%); otherwise the instruments are weak.

Ascertaining whether instruments are weak using the first-stage F-statistic. In the fixed-instrument/normal-error model, and also under weak-instrument asymptotics, the distribution of the first-stage F-statistic depends only on µ²/K and K, so that inference about µ²/K can be based on F. As Hall, Rudebusch, and Wilcox (1996) show in Monte Carlo simulations, however, simply using F to test the hypothesis of non-identification (Π = 0) is inadequate as a screen for problems caused by weak instruments. Instead, we follow Stock and Yogo (2001) and use F to test the null hypothesis that µ²/K is less than or equal to the weak-instrument threshold, against the alternative that it exceeds the threshold.

Table 1 summarizes weak-instrument threshold values of µ²/K and critical values for the first-stage F-statistic testing the null hypothesis that instruments are weak, for selected values of K. For example, under the TSLS relative bias definition of weak instruments, if K = 5 then the threshold value of µ²/K is 5.82 and the test that µ²/K ≤ 5.82 rejects in favor of the alternative that µ²/K > 5.82 if F ≥ 10.83. It is evident from Table 1 that one needs large values of the first-stage F-statistic, typically exceeding 10, for TSLS inference to be reliable.
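As a concrete illustration of this screen, the sketch below (not part of the original paper; it assumes Python with NumPy) computes the first-stage F-statistic for a single included endogenous regressor and compares it with an illustrative critical value, using the 10.83 quoted above for K = 5; in an application the critical value should be taken from Table 1 for the relevant K and reliability criterion.

```python
import numpy as np

def first_stage_F(Y, Z):
    """First-stage F-statistic for H0: Pi = 0 in Y = Z Pi + v (no intercept).

    Y is the (T,) included endogenous regressor, Z the (T, K) instrument matrix.
    """
    T, K = Z.shape
    Pi_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]
    fitted = Z @ Pi_hat
    resid = Y - fitted
    sigma2_v = resid @ resid / (T - K)
    # F = Pi_hat'(Z'Z)Pi_hat / (K * sigma2_v); recall E(F) is roughly 1 + mu^2/K.
    return (fitted @ fitted) / (K * sigma2_v)

# Illustrative use with simulated weakly identified data (K = 5).
rng = np.random.default_rng(1)
T, K = 1000, 5
Z = rng.standard_normal((T, K))
Pi = np.full(K, 0.05)                 # weak first-stage coefficients
Y = Z @ Pi + rng.standard_normal(T)

F = first_stage_F(Y, Z)
print(f"first-stage F = {F:.2f}")
# Illustrative threshold from Table 1 (relative-bias definition, K = 5): F >= 10.83.
print("instruments look weak" if F < 10.83 else "instruments pass the threshold")
```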
4.2 Extension of the First-stage F-statistic to n > 1

As discussed in Section 2.3, when there are multiple included endogenous regressors (n > 1) the concentration parameter µ² is a matrix. Because of possible multicollinearity among the predicted values of the endogenous regressors, in general the distribution of TSLS statistics depends on all the elements of µ². From a statistical perspective, when n > 1, the n first-stage F-statistics are not sufficient statistics for the concentration matrix, even with fixed regressors and normal errors (Shea (1997) discusses the pitfall of using the n first-stage F-statistics when n > 1). Instead, inference about µ² can be based on the n×n matrix analog of the first-stage F-statistic,

G_T = Σ̂_VV^{–1/2}′ Y′P_Z Y Σ̂_VV^{–1/2} / K,   (13)

where Σ̂_VV = Y′M_Z Y/(T – K) and M_Z = I – P_Z. Under weak-instrument asymptotics, E(G_T) → µ²/K + I_n, where I_n is the n×n identity matrix.

Cragg and Donald (1993) proposed using G_T to test for partial identification, specifically, testing the hypothesis that the matrix Π has rank L against the alternative that it has rank greater than L, where L < n. If the rank of Π is L, then L linear combinations of β are identified, and n – L are not; Choi and Phillips (1992) discuss this situation, which they refer to as partial identification. The Cragg–Donald test statistic is Σ_{i=1}^{n–L} λ_i, where λ_i is the ith-smallest eigenvalue of G_T. Under the null, this statistic has a
χ²_{(n–L)(K–L)}/[(n–L)(K–L)] limiting distribution. For L = 0 (so the null is complete non-identification), the statistic reduces to trace(G_T), the sum of the n individual first-stage F-statistics. If L = n – 1 (so the null is the smallest possible degree of underidentification), then the Cragg–Donald statistic is the smallest eigenvalue of G_T.

Although the Cragg–Donald statistic can be used to test for underidentification, from the perspective of IV inference mere instrument relevance is insufficient; rather, the instruments must be strong in the sense that µ²/K must be large. Rejecting the null hypothesis of partial identification does not ensure reliable IV inference. Accordingly, Stock and Yogo (2001) consider the problem of testing the null hypothesis that a set of instruments is weak, against the alternative that they are strong, where instruments are defined to be strong if conventional TSLS inference is reliable for any linear combination of the coefficients. By focusing on the worst-behaved linear combination, this approach is conservative but tractable, and they provide tables of critical values, similar to those in Table 1, based on the minimum eigenvalue of G_T.

4.3 A Specification Test of a Null of Strong Instruments

The methods discussed so far have been tests of the null of weak instruments. Hahn and Hausman (2002) reverse the null and alternative and propose a test of the null that the instruments are strong, against the alternative that they are weak. Their procedure is conceptually straightforward. Suppose that there is a single included endogenous regressor (n = 1) and that the instruments are strong, so that conventional normal approximations are valid. Then the normalization of the regression (the choice of dependent variable) should not matter asymptotically. Thus the TSLS estimator in the
forward regression of y on Y and the inverse of the TSLS estimator in the reverse regression of Y on y are asymptotically equivalent (to order o_p(T^{–1/2})) with strong instruments, but if instruments are weak, they are not. Accordingly, Hahn and Hausman (2002) developed a statistic comparing the estimators from the forward and reverse regressions (and the extension of this idea when n = 2). They suggest that if this statistic rejects the null hypothesis, then a researcher should conclude that his or her instruments are weak; otherwise he or she can treat the instruments as strong.

5. Fully Robust Inference with Weak Instruments in the Linear Model

This section discusses hypothesis tests and confidence sets for β that are fully robust to weak instruments, in the sense that these procedures have the correct size or coverage rates regardless of the value of µ² (including µ² = 0) when the sample size is large, specifically, under weak-instrument asymptotics. If n = 1, these methods also produce median-unbiased estimators of β as the limiting case of a 0% confidence interval. This discussion starts with testing, then concludes with the complementary problem of confidence sets. We focus on the case n = 1, but the methods can be generalized to joint inference about β when n > 1.

Several fully robust methods have been proposed, and Monte Carlo studies suggest that none appears to dominate the others. Moreira (2001) provides a theoretical explanation of this in the context of the fixed-instrument, normal-error model. In that model, there is no uniformly most powerful test of the hypothesis β = β₀, a result that also holds asymptotically under weak-instrument asymptotics. In this light, the various fully
robust procedures represent tradeoffs, with some working better than others, depending on the true parameter values.

5.1 Fully Robust Gaussian Tests

Moreira (2001) considers the system (1) and (2) with fixed instruments and normally distributed errors. The reduced-form equation for y is

y = ZΠβ + w.   (14)

Let Ω denote the covariance matrix of the reduced-form errors (w_t, v_t), and for now suppose that Ω is known. We are interested in testing the hypothesis β = β₀. Moreira (2001) shows that, under these assumptions, the statistics (S̄, T̄) are sufficient for β and Π, where

S̄ = (Z′Z)^{–1/2} Z′Ȳ b₀ / (b₀′Ω b₀)^{1/2}   and   T̄ = (Z′Z)^{–1/2} Z′Ȳ Ω⁻¹ a₀ / (a₀′Ω⁻¹ a₀)^{1/2},   (15)

where Ȳ = [y Y], b₀ = (1, –β₀)′ and a₀ = (β₀, 1)′. Thus, for the purpose of testing β = β₀, it suffices to consider test statistics that are functions of only S̄ and T̄, say g(S̄, T̄). Moreover, under the null hypothesis β = β₀, the distribution of T̄ depends on Π but the distribution of S̄ does not; thus, under the null hypothesis, T̄ is sufficient for Π. It follows that a test of β = β₀ based on g(S̄, T̄) is similar if its critical value is computed from the conditional distribution of g(S̄, T̄) given T̄. Moreira (2001) also derives an
infeasible power envelope for similar tests in the fixed-instrument, normal-error model, under the further assumption that Π is known. In practice, Π is not known, so feasible tests cannot achieve this power envelope and, when K > 1, there is no uniformly most powerful test of β = β₀.

In practice, Ω is unknown, so the statistics in (15) cannot be computed. However, under weak-instrument asymptotics, Ω can be estimated consistently. Accordingly, let Ŝ and T̂ denote S̄ and T̄ evaluated with Ω̂ = Ȳ′M_Z Ȳ/(T – K) replacing Ω. Because Moreira's (2001) family of feasible similar tests, based on statistics of the form g(Ŝ, T̂), is derived under the assumption of normality, we refer to them as Gaussian similar tests.

5.2 Three Gaussian Similar Tests

We now turn to three Gaussian similar tests: the Anderson-Rubin statistic, Kleibergen's (2001) test statistic, and Moreira's (2002) conditional likelihood ratio (LR) statistic.

The Anderson-Rubin Statistic. More than fifty years ago, Anderson and Rubin (1949) proposed testing the null hypothesis β = β₀ in (9) using the statistic

AR(β₀) = [(y – Yβ₀)′P_Z(y – Yβ₀)/K] / [(y – Yβ₀)′M_Z(y – Yβ₀)/(T – K)] = Ŝ′Ŝ/K.   (16)

One definition of the LIML estimator is that it minimizes AR(β).
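To illustrate how the AR statistic is used in practice, the following sketch (not part of the original paper; it assumes Python with NumPy and SciPy) evaluates AR(β₀) over a grid of candidate values and reports the 95% confidence set formed by the values that are not rejected against the χ²_K/K asymptotic critical value; the exact F_{K,T–K} critical value discussed next could be used instead.

```python
import numpy as np
from scipy.stats import chi2

def ar_stat(beta0, y, Y, Z):
    """Anderson-Rubin statistic AR(beta0) from equation (16)."""
    T, K = Z.shape
    u0 = y - Y * beta0                       # residual imposing H0: beta = beta0
    Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u0)
    num = (u0 @ Pu) / K                      # u0' P_Z u0 / K
    den = (u0 @ u0 - u0 @ Pu) / (T - K)      # u0' M_Z u0 / (T - K)
    return num / den

def ar_confidence_set(y, Y, Z, grid, level=0.95):
    """Grid-based confidence set: beta0 values at which AR fails to reject."""
    K = Z.shape[1]
    crit = chi2.ppf(level, df=K) / K         # weak-instrument-robust critical value
    return [b for b in grid if ar_stat(b, y, Y, Z) <= crit]

# Example with simulated data (single endogenous regressor, K = 3 instruments).
rng = np.random.default_rng(2)
T, K, beta = 200, 3, 1.0
Z = rng.standard_normal((T, K))
v = rng.standard_normal(T)
u = 0.8 * v + 0.6 * rng.standard_normal(T)
Y = Z @ np.full(K, 0.2) + v
y = Y * beta + u

cs = ar_confidence_set(y, Y, Z, grid=np.linspace(-5, 5, 1001))
print(f"AR 95% confidence set spans roughly [{min(cs):.2f}, {max(cs):.2f}]"
      if cs else "AR rejects every value on the grid")
```

With weak instruments the collected set can be wide, disjoint, or unbounded, which anticipates the discussion of robust confidence sets in Section 5.5.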
Anderson and Rubin (1949) showed that, if the errors are Gaussian and the instruments are fixed, then this test statistic has an exact F_{K,T–K} distribution under the null, regardless of the value of µ²/K. Under the more general conditions of weak-instrument asymptotics, AR(β₀) has a χ²_K/K limiting distribution under the null hypothesis, regardless of the value of µ²/K. Thus the AR statistic provides a fully robust test of the hypothesis β = β₀.

The AR statistic can reject either because β ≠ β₀ or because the instrument orthogonality conditions fail. In this sense, inference based on the AR statistic is different from inference based on conventional GMM standard errors, for which the maintained hypothesis is that the instruments are valid. For this reason, a variety of tests have been proposed that maintain the hypothesis that the instruments are valid.

Kleibergen's Statistic. Kleibergen (2001) proposed testing β = β₀ using the statistic

K(β₀) = (Ŝ′T̂)² / (T̂′T̂),   (17)

which, following Moreira (2001), we have written in terms of Ŝ and T̂. If K = 1, then K(β₀) = AR(β₀). Kleibergen shows that, under either conventional or weak-instrument asymptotics, K(β₀) has a χ²_1 null limiting distribution.

Moreira's Statistic. Moreira (2002) proposed to test β = β₀ using the conditional likelihood ratio test statistic,
M(β₀) = ½ { Ŝ′Ŝ – T̂′T̂ + [ (Ŝ′Ŝ + T̂′T̂)² – 4( (Ŝ′Ŝ)(T̂′T̂) – (Ŝ′T̂)² ) ]^{1/2} }.   (18)

The (weak-instrument) asymptotic conditional distribution of M(β₀) under the null, given T̂, is nonstandard and depends on β₀, and Moreira (2002) suggests computing the distribution by Monte Carlo simulation.

5.3 Power Comparisons

We now turn to a comparison of the weak-instrument asymptotic power of the Anderson-Rubin, Kleibergen, and Moreira tests. The asymptotic power functions of these tests depend on µ²/K, ρ (the correlation between u and v in equations (1) and (2)), and K, as well as the true value of β. We consider two values of µ²/K: µ²/K = 1 corresponds to very weak instruments (nearly unidentified), and µ²/K = 5 corresponds to moderately weak instruments. The two values of ρ considered correspond to moderate endogeneity (ρ = .5) and very strong endogeneity (ρ = .99).

Figure 2 presents weak-instrument asymptotic power functions for K = 5 instruments, so that the degree of overidentification is 4; the power depends on β – β₀ but not on β₀, so Figure 2 applies to general β₀. Moreira's (2001) infeasible asymptotic Gaussian power envelope is also reported as a basis for comparison. When µ²/K = 1 and ρ = .5, all tests have poor power for all values of the parameter space, a result that is reassuring given how weak the instruments are; moreover, all tests have power functions that are far from the infeasible power envelope. Another unusual feature of these tests is that the power function does not increase monotonically as β departs from β₀. The
relative performance of the three tests differs, depending on the true parameter values, and the power functions occasionally cross. Typically (but not always), the K(β0) and M(β0) tests outperform the AR(β0) test. Figure 3 presents the corresponding power functions for many instruments (K = 50). For µ2/K = 5, both the K(β0) and M(β0) tests are close to the power envelope for most values of β – β0 (except, oddly, when β
union of the TSLS (or LIML) 97.5% confidence intervals, conditional on the non-rejected values of µ²/K.

The Wang-Zivot (1998) Score Test. Wang and Zivot (1998) and Zivot, Startz and Nelson (1998) consider testing β = β₀ using modifications of conventional GMM test statistics, in which σ_u² is estimated under the null hypothesis. Using weak-instrument asymptotics, they show that although these statistics are not asymptotically pivotal, their null distributions are bounded by an F_{K,∞} distribution (the bound is tight if K = 1), which permits a valid, but conservative, test.

5.5 Robust Confidence Sets

By the duality between hypothesis tests and confidence sets, these tests can be used to construct fully robust confidence sets. For example, a fully robust 95% confidence set can be constructed as the set of β₀ for which the Anderson-Rubin (1949) statistic, AR(β₀), fails to reject at the 5% significance level. In general, this approach requires evaluating the test statistic for all points in the parameter space, although for some statistics the confidence interval can be obtained by solving a polynomial equation (a quadratic, in the case of the AR statistic).

Because the tests in this section are fully robust to weak instruments, the confidence sets constructed by inverting the tests are fully robust. As a general matter, when the instruments are weak, these sets can have infinite volume. For example, because the AR statistic is a ratio of quadratic forms, it can have a finite maximum, and when µ² = 0 any point in the parameter space will be contained in the AR confidence set with probability 95%. This does not imply that these methods
waste information, or are unnecessarily imprecise; on the contrary, they reflect the fact that, if instruments are weak, there simply is limited information to use to make inferences about β. This point is made formally by Dufour (1997), who shows that under weak-instrument asymptotics a confidence set for β must have infinite expected volume if it is to have nonzero coverage uniformly in the parameter space, as long as µ² is fixed and finite. This infinite expected volume condition is shared by confidence sets constructed using any of the fully robust methods of this section; for additional discussion see Zivot, Startz and Nelson (1998).

6. Partially Robust Inference with Weak Instruments

Although the fully robust tests discussed in the previous section always control size, they have some practical disadvantages; for example, some are difficult to compute. Moreover, for n > 1 they do not readily provide point estimates, and confidence intervals for individual elements of β must be obtained by conservative projection methods. For this reason, some researchers have investigated inference that is partially robust to weak instruments.

Recall from Section 4 that whether instruments are weak depends on the task to which they are put. Accordingly, one way to frame the investigation of partially robust methods is to push the "weak instrument threshold" as far as possible below that needed for TSLS to be reliable, yet not require that the method produce valid inference in the completely unidentified case.

6.1 k-class Estimators
The k-class estimator of β is β̂(k) = [Y′(I – kM_Z)Y]⁻¹[Y′(I – kM_Z)y]. This class includes TSLS (for which k = 1), LIML, and some alternatives that improve upon TSLS when instruments are weak.

LIML. LIML is a k-class estimator where k = k_LIML is the smallest root of the determinantal equation |Ȳ′Ȳ – k Ȳ′M_Z Ȳ| = 0, where Ȳ = [y Y]. The mean of the distribution of the LIML estimator does not exist, and its median is typically much closer to β than is the mean or median of TSLS. In the fixed-instrument, normal-error case the bias of TSLS increases with K, but the bias of LIML does not (Rothenberg (1984)). When the errors are symmetrically distributed and instruments are fixed, LIML is the best median-unbiased k-class estimator to second order (Rothenberg (1983)). Moreover, LIML is consistent under many-instrument asymptotics (Bekker (1994)), while TSLS is not. This bias reduction comes at a price, as the LIML estimator has fat tails. For example, LIML generally has a larger interquartile range than TSLS when instruments are weak (e.g., Hahn, Hausman and Kuersteiner (2001a)).

Fuller-k estimators. Fuller (1977) proposed an alternative k-class estimator which sets k = k_LIML – b/(T – K), where b is a positive constant. With fixed instruments and normal errors, the Fuller-k estimator with b = 1 is best unbiased to second order (Rothenberg (1984)). In Monte Carlo simulations, Hahn, Hausman, and Kuersteiner (2001a) report substantial reductions in bias and mean squared error, relative to TSLS and LIML, using Fuller-k estimators when instruments are weak.

Bias-adjusted TSLS. Donald and Newey (2001) consider a bias-adjusted TSLS estimator (BTSLS), which is a k-class estimator with k = T/(T – K + 2), modifying an estimator previously proposed by Nagar (1959). Rothenberg (1984) shows that BTSLS is
unbiased to second order in the fixed-instrument, normal-error model. Donald and Newey provide expressions for the second-order asymptotic mean square error (MSE) of BTSLS, TSLS and LIML, as a function of the number of instruments K. They propose choosing the number of instruments to minimize the second-order MSE. In Monte Carlo simulations, they find that selecting the number of instruments in this way generally improves performance. Chao and Swanson (2001) develop analogous expressions for the bias and mean squared error of TSLS under weak-instrument asymptotics, modified to allow the number of instruments to increase with the sample size; they report improvements in Monte Carlo simulations by incorporating bias adjustments.

6.3 The SSIV and JIVE Estimators

Angrist and Krueger (1992, 1995) proposed eliminating the bias of TSLS by splitting the sample into two independent subsamples, then running the first-stage regression on one subsample and the second-stage regression on the other. The SSIV estimator is given by an OLS regression of y^[1] on Z^[1]Π̂^[2], where the superscript denotes the subsample. Angrist and Krueger (1995) show that SSIV is biased towards zero, rather than towards the OLS probability limit.

Because SSIV does not use the sample symmetrically and appears to waste information, Angrist, Imbens and Krueger (1999) proposed a jackknife instrumental variables (JIVE) estimator. The JIVE estimator is β̂_JIVE = (Ŷ′Y)⁻¹Ŷ′y, where the ith row of Ŷ is Z_iΠ̂_{–i} and Π̂_{–i} is the estimator of Π computed using all but the ith observation. Angrist, Imbens and Krueger show that JIVE and TSLS are asymptotically equivalent under conventional fixed-model asymptotics. Calculations reveal that, under weak-
instrument asymptotics, JIVE is asymptotically equivalent to a k-class estimator with k = 1 + K/(T – K). Theoretical calculations (Chao and Swanson (2002)) and Monte Carlo simulations (Angrist, Imbens and Krueger (1999), and Blomquist and Dahlberg (1999)) indicate that JIVE improves upon TSLS when there are many instruments.

6.4 Comparisons

One way to assess the extent to which a proposed estimator or test is robust to weak instruments is to characterize the size of the weak-instrument region. When n = 1, this can be done by computing the value of µ²/K needed to ensure that inferences attain a desired degree of reliability. This was the approach taken in Table 1, and here we extend this approach to some of the estimators discussed in this section.

Figure 4 reports boundaries of asymptotic weak-instrument sets, as a function of K, for various estimators (for computational details, see Stock and Yogo (2001)). In Figure 4a, the weak-instrument set is defined to be the set of µ²/K such that the relative bias of the estimator (relative to the inconsistency of OLS) exceeds 10%; the values of µ²/K plotted are the largest values that meet this criterion. In Figure 4b, the weak-instrument set is defined so that a nominal 5% test of β = β₀, based on the relevant t-statistic, rejects more than 15% of the time (that is, has size exceeding 15%).

Two features of Figure 4 are noteworthy. First, LIML, BTSLS, JIVE, and the Fuller-k estimator have much smaller weak-instrument thresholds than TSLS; in this sense, these four estimators are more robust to weak instruments than TSLS. Second, these thresholds do not increase as a function of K, whereas the TSLS threshold increases in K, a reflection of its greater bias as the number of instruments increases.
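To make the k-class estimators of Section 6.1 concrete, the following sketch (not part of the original paper; it assumes Python with NumPy) evaluates β̂(k) = [Y′(I – kM_Z)Y]⁻¹[Y′(I – kM_Z)y] at k = 1 (TSLS), k = k_LIML, and k = k_LIML – 1/(T – K) (the Fuller-k estimator with b = 1) on simulated weak-instrument data.

```python
import numpy as np

def k_class(y, Y, Z, k):
    """k-class estimator beta_hat(k) = [Y'(I - k M_Z)Y]^(-1) [Y'(I - k M_Z)y]."""
    T = len(y)
    MZ = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    A = np.eye(T) - k * MZ
    return (Y @ A @ y) / (Y @ A @ Y)

def k_liml(y, Y, Z):
    """Smallest root of |Ybar'Ybar - k Ybar'M_Z Ybar| = 0, with Ybar = [y Y]."""
    T = len(y)
    Ybar = np.column_stack([y, Y])
    MZ = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    # Generalized eigenvalue problem: eigenvalues of (Ybar'M_Z Ybar)^(-1)(Ybar'Ybar).
    W = np.linalg.solve(Ybar.T @ MZ @ Ybar, Ybar.T @ Ybar)
    return np.min(np.real(np.linalg.eigvals(W)))

# Example on simulated weak-instrument data (true beta = 1).
rng = np.random.default_rng(3)
T, K, beta = 500, 10, 1.0
Z = rng.standard_normal((T, K))
v = rng.standard_normal(T)
u = 0.9 * v + np.sqrt(1 - 0.9**2) * rng.standard_normal(T)
Y = Z @ np.full(K, 0.05) + v
y = Y * beta + u

kl = k_liml(y, Y, Z)
print("TSLS     :", round(k_class(y, Y, Z, 1.0), 3))
print("LIML     :", round(k_class(y, Y, Z, kl), 3))
print("Fuller-1 :", round(k_class(y, Y, Z, kl - 1.0 / (T - K)), 3))
```

In designs like this one, TSLS tends to be pulled toward the OLS probability limit while LIML and Fuller-k are typically centered closer to the true coefficient, consistent with the thresholds plotted in Figure 4.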
7. GMM Inference in General Nonlinear Models

It has been recognized for some time that the usual large-sample normal approximations to GMM statistics in general nonlinear models can provide poor approximations to exact sampling distributions in problems of applied interest. For example, Hansen, Heaton and Yaron (1996) examine GMM estimators of various intertemporal asset pricing models using a Monte Carlo design calibrated to match U.S. data. They find that, in many cases, inferences based on the usual normal distributions are misleading (also see Tauchen (1986), Kocherlakota (1990), Ferson and Foerster (1994) and Smith (1999)).

The foregoing discussion of weak instruments in the linear model suggests that weak instruments could be one possible reason for the failure of the conventional normal approximations in nonlinear GMM. In the linearized CCAPM Euler equation (11), both the log gross asset return r_t and the growth rate of consumption ∆c_t are difficult to predict; thus, as argued by Stock and Wright (2000) and Neely, Roy, and Whiteman (2001), it stands to reason that estimation of the original nonlinear Euler equation by GMM also suffers from weak instruments. But making this intuition precise in the general nonlinear setting is difficult: the machinery discussed in the previous sections relies heavily on the linear regression model.

In this section, we begin by briefly discussing the problems posed by weak instruments in nonlinear GMM, and suggest that a better term in this context is weak identification. The general approaches to handling the problem of weak identification are the same in the nonlinear setting as the linear
setting: detection of weak identification, procedures that are fully robust to weak identification, and procedures that are partially robust. As shall be seen, the literature on this topic is quite incomplete.

7.1 Consequences of Weak Identification in Nonlinear GMM

In GMM estimation, the n×1 parameter vector θ is identified by the G conditional mean conditions E[h(Y_t, θ₀)|Z_t] = 0, where θ₀ is the true value of θ and Z_t is a K-vector of instruments. The GMM estimator is computed by minimizing

S_T(θ) = [T⁻¹ Σ_{t=1}^T φ_t(θ)]′ W(θ)⁻¹ [T⁻¹ Σ_{t=1}^T φ_t(θ)],   (19)

where φ_t(θ) = h(Y_t, θ) ⊗ Z_t is r×1, W(θ) is an r×r positive definite matrix, and r = GK. In the two-step GMM estimator, W(θ) initially is set to the identity matrix, yielding the estimator θ̂^(1), and the second step uses Ŵ(θ̂^(1)), where

Ŵ(θ) = T⁻¹ Σ_{t=1}^T [φ_t(θ) – φ̄(θ)][φ_t(θ) – φ̄(θ)]′   (20)

and φ̄(θ) = T⁻¹ Σ_{t=1}^T φ_t(θ). (Here, we assume that φ_t(θ) is serially uncorrelated; otherwise Ŵ(θ) is replaced by an estimator of the spectral density of φ_t(θ) at frequency zero.) The iterated GMM estimator continues this process, evaluating Ŵ(θ) at the
previous estimate of θ until convergence (see Ferson and Foerster (1994)). For additional details about GMM estimation, see Hayashi (2000).

To understand weak identification in nonlinear GMM, it is useful to return for a moment to the linear model, in which case we can write the moment condition as

E[(Y_{1t} – θ₀Y_{2t})Z_t] = 0,   (21)

where, in a slight shift in notation, Y_{1t} and Y_{2t} correspond to y_t and Y_t in equations (1) and (2) and the parameter vector is θ rather than β. The reason that the instruments serve to identify θ is that the orthogonality condition (21) holds at the true value θ₀, but it does not hold at other values of θ. If the instrument is irrelevant, so that Y_{2t} is uncorrelated with Z_t, then E[(Y_{1t} – θY_{2t})Z_t] = 0 for all values of θ, and (21) no longer identifies θ₀ uniquely. If the instruments are weak, then E[(Y_{1t} – θY_{2t})Z_t] is nearly zero for all values of θ, and in this sense θ can be said to be weakly identified. Said differently, weak instruments imply that the correlation between the model error term Y_{1t} – θY_{2t} and the instruments is nearly zero, even at false values of θ.

This intuition of weak instruments implying weak identification carries over to the nonlinear GMM setting: if the correlation between the model error term, h(Y_t, θ), and Z_t is low even for false values of θ, then θ is weakly identified.

Because there is no exact sampling theory for GMM estimators, a formal treatment of the implications of weak identification for GMM must entail asymptotics. As in the linear case, there are two approaches. One is to use stochastic expansions in orders of T^{–1/2}, an approach that has been pursued by Newey and Smith (2001). This
approach, however, seems likely to produce poor approximations when identification is very weak, as it does in the linear case. A second approach is to use an asymptotic nesting that lets T → ∞ but, loosely speaking, keeps the GMM version of the concentration parameter constant. This approach is developed in Stock and Wright (2000), who derive a stochastic process representation of the limiting objective function (the limit of S_T, where S_T is treated as a stochastic process indexed by θ) that holds formally in weakly identified, partially identified, and non-identified cases. Stock and Wright's numerical work suggests that weak identification can explain many of the Monte Carlo results in Tauchen (1986), Kocherlakota (1990) and Hansen, Heaton, and Yaron (1996).

7.2 Detecting Weak Identification

An implication of weak identification is that GMM estimators can exhibit a variety of pathologies. For example, two-step GMM point estimates and iterated GMM point estimates can be quite different and can yield quite different confidence sets. If identification is weak, GMM estimates can be sensitive to the addition of instruments or to changes in the sample. To the extent that these features are present in an empirical application, they could suggest the presence of weak identification.

The only formal test for weak identification that we are aware of in nonlinear GMM is that proposed by Wright (2001). In the conventional asymptotic theory of GMM, the identification condition requires that the gradient of φ_t(θ₀) has full column rank. Wright proposes a test of the hypothesis of a complete failure of this rank