GMM, Weak Instruments, and Weak Identification
James H. Stock
Kennedy School of Government, Harvard University, and the National Bureau of Economic Research

Jonathan Wright
Federal Reserve Board, Washington, D.C.

and

Motohiro Yogo*
Department of Economics, Harvard University

June 2002

ABSTRACT

Weak instruments arise when the instruments in linear IV regression are weakly correlated with the included endogenous variables. In nonlinear GMM, weak instruments correspond to weak identification of some or all of the unknown parameters. Weak identification leads to non-normal distributions, even in large samples, so that conventional IV or GMM inferences are misleading. Fortunately, various procedures are now available for detecting and handling weak instruments in the linear IV model and, to a lesser degree, in nonlinear GMM.

*Prepared for the Journal of Business and Economic Statistics symposium on GMM. This research was supported in part by NSF grant SBR-0214131.
1. Introduction

A subtle but important contribution of Hansen and Singleton's (1982) and Hansen's (1982) original work on generalized method of moments (GMM) estimation was to recast the requirements for instrument exogeneity. In the linear simultaneous equations framework then prevalent, instruments are exogenous if they are excluded from the equation of interest; in GMM, instruments are exogenous if they satisfy a conditional mean restriction that, in Hansen and Singleton's (1982) application, was implied directly by a tightly specified economic model. Of course, at a mathematical level, these two requirements are the same, but conceptually the starting point is different. The shift from debatable (in Sims' (1980) words, "incredible") exclusion restrictions to first order conditions derived from economic theory has proven to be a highly productive way to think about candidate instrumental variables in a wide variety of applications. Accordingly, careful consideration of instrument exogeneity is now a standard part of a well-done empirical analysis using GMM.

Instrument exogeneity, however, is only one of the criteria necessary for an instrument to be valid, and recently the other criterion – instrument relevance – has received increased attention from theoretical and applied researchers. It now appears that some, perhaps many, applications of GMM and instrumental variables (IV) regression have what is known as "weak instruments," that is, instruments that are only weakly correlated with the included endogenous variables. Unfortunately, weak instruments pose considerable challenges to inference using GMM and IV methods.
This paper surveys weak instruments and their counterpart in nonlinear GMM, weak identification. We have five main themes:

1. If instruments are weak, then the sampling distributions of GMM and IV statistics are in general non-normal, and standard GMM and IV point estimates, hypothesis tests, and confidence intervals are unreliable.

2. Weak instruments are commonplace in empirical research. This should not be a surprise. Finding exogenous instruments is hard work, and the features that make an instrument plausibly exogenous – for example, occurring sufficiently far in the past to satisfy a first order condition, or the as-if random coincidence that lies behind a quasi-experiment – can also work to make the instrument weak.

3. It is not useful to think of weak instruments as a "small sample" problem. Bound, Jaeger and Baker (1995) provide an empirical example, based on an important article by Angrist and Krueger (1991), in which weak instrument problems arise despite having 329,000 observations. In a formal mathematical sense, the strength of the instruments, as measured by the so-called concentration parameter, plays the role of the sample size in determining the quality of the usual normal approximation.

4. If you have weak instruments, you do not need to abandon your empirical research, but neither should you use conventional GMM or IV methods. Various tools are available for handling weak instruments in the linear IV
regression model, and although research in this area is far from complete, a judicious use of these tools can result in reliable inferences.

5. What to do about weak instruments – even how to recognize them – is a much more difficult problem in general nonlinear GMM than in linear IV regression, and much theoretical work remains. Still, it is possible to make some suggestions for empirical practice.

This survey emphasizes the linear IV regression model, simply because much more is known about weak instruments in this case. We begin in Section 2 with a primer on weak instruments in linear IV regression. With this as background, Section 3 discusses three important empirical applications that confront the challenge of weak instruments. Sections 4–6 discuss recent econometric approaches to weak instruments: their detection (Section 4); methods that are fully robust to weak instruments, at least in large samples (Section 5); and methods that are somewhat simpler to use and are partially robust to weak instruments (Section 6). Section 7 turns to weak identification in nonlinear GMM, its consequences, and methods for detecting and handling weak identification. Section 8 concludes.

As we see in the next section, many of the key ideas of weak instruments have been understood for decades and can be explained in the context of the simplest IV regression model. This said, most of the literature on solutions to the problem of weak instruments is quite recent, and this literature is expanding rapidly; we both fear and hope that much of the practical advice in this survey will soon be out of date.
2. A Primer on Weak Instruments in the Linear Regression Model

Many of the problems posed by weak instruments in the linear IV regression model are best explained in the context of the classical version of that model with fixed exogenous variables and i.i.d., normally distributed errors. This section therefore begins by using this model to show how weak instruments lead to TSLS having a non-normal sampling distribution, regardless of the sample size. The fixed-instrument, normal-error model, however, has strong assumptions that are empirically unrealistic. This section therefore concludes with a synopsis of asymptotic methods that weaken these strong assumptions yet attempt to retain the insights gained from the finite sample distribution theory.

2.1 The Linear Gaussian IV Regression Model with a Single Regressor

The linear IV regression model with a single endogenous regressor and no included exogenous variables is

y = Yβ + u,   (1)
Y = ZΠ + v,   (2)

where y and Y are T×1 vectors of observations on endogenous variables, Z is a T×K matrix of instruments, and u and v are T×1 vectors of disturbance terms. The instruments are assumed to be nonrandom (fixed). The errors (u_t, v_t) are assumed to be i.i.d. and normally distributed N(0, Σ), t = 1, …, T, where the elements of Σ are σ_u², σ_uv, and σ_v².
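Before interpreting (1) and (2), it may help to see the weak-instrument problem numerically. The following simulation sketch is not part of the original paper; it assumes Python with NumPy and a design loosely matching the Nelson and Startz example discussed later in this section (σ_u = σ_v = 1, ρ = .99). It repeatedly draws the TSLS estimator for several values of the concentration parameter µ² and summarizes how the sampling distribution tightens only as µ² grows.

```python
import numpy as np

def simulate_tsls(mu2, T=100, K=1, rho=0.99, beta=0.0, n_rep=5000, seed=0):
    """Monte Carlo draws of the TSLS estimator in the model (1)-(2).

    mu2 is the concentration parameter mu^2 = Pi'Z'Z Pi / sigma_v^2; the
    coefficient vector Pi is scaled so the design attains this value exactly
    (sigma_u = sigma_v = 1, corr(u, v) = rho).
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((T, K))           # fixed instruments across replications
    Pi = np.ones(K)
    Pi *= np.sqrt(mu2 / (Pi @ Z.T @ Z @ Pi))  # enforce Pi'Z'Z Pi = mu2
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)    # projection matrix P_Z
    draws = np.empty(n_rep)
    for r in range(n_rep):
        v = rng.standard_normal(T)
        u = rho * v + np.sqrt(1 - rho**2) * rng.standard_normal(T)
        Y = Z @ Pi + v
        y = Y * beta + u
        draws[r] = (Y @ PZ @ y) / (Y @ PZ @ Y)   # TSLS: Y'P_Z y / Y'P_Z Y
    return draws

for mu2 in (0.25, 10.0, 100.0):
    d = simulate_tsls(mu2)
    iqr = np.subtract(*np.percentile(d, [75, 25]))
    print(f"mu^2 = {mu2:6.2f}: median of beta_hat = {np.median(d):.3f}, IQR = {iqr:.3f}")
```

With µ² = .25 the draws are widely dispersed and centered well away from the true value of zero, while with µ² = 100 the distribution is tight around zero, which previews the role of the concentration parameter developed next.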
Equation (1) is the structural equation, and β is the scalar parameter of interest. Equation (2) relates the included endogenous variable to the instruments.

The concentration parameter. The concentration parameter, µ², is a unitless measure of the strength of the instruments, and is defined as

µ² = Π′Z′ZΠ/σ_v².   (3)

A useful interpretation of µ² is in terms of F, the F-statistic testing the hypothesis that Π = 0 in (2); because F is the F-statistic testing for nonzero coefficients on the instruments in the first stage of TSLS, it is commonly called the "first-stage F-statistic," terminology that is adopted here. Let F̃ be the infeasible counterpart of F, computed using the true value of σ_v²; then F̃ has a noncentral χ²_K/K distribution with noncentrality parameter µ²/K, and E(F̃) = µ²/K + 1. If the sample size is large, F and F̃ are close, so E(F) ≈ µ²/K + 1; that is, the expected value of the first-stage F-statistic is approximately 1 + µ²/K. Thus, larger values of µ²/K shift out the distribution of the first-stage F-statistic. Said differently, F – 1 can be thought of as the sample analog of µ²/K.

An expression for the TSLS estimator. The TSLS estimator of β is

β̂_TSLS = Y′P_Z y / Y′P_Z Y,   (4)
where P_Z = Z(Z′Z)⁻¹Z′. Rothenberg (1984) presents a useful expression for the TSLS estimator, which obtains after a few lines of algebra. First substitute the expression for Y in (2) into (4) to obtain

β̂_TSLS – β = (Π′Z′u + v′P_Z u) / (Π′Z′ZΠ + 2Π′Z′v + v′P_Z v).   (5)

Now define

z_u = Π′Z′u / [σ_u √(Π′Z′ZΠ)],   z_v = Π′Z′v / [σ_v √(Π′Z′ZΠ)],
S_uv = v′P_Z u / (σ_v σ_u),   S_vv = v′P_Z v / σ_v².

Under the assumptions of fixed instruments and normal errors, the distributions of z_u, z_v, S_uv, and S_vv do not depend on the sample size T: z_u and z_v are standard normal random variables with a correlation equal to ρ, the correlation between u and v, and S_uv and S_vv are quadratic forms of normal random variables with respect to the idempotent matrix P_Z. Substituting these definitions into (5), multiplying both sides by µ, and collecting terms yields

µ(β̂_TSLS – β) = (σ_u/σ_v) (z_u + S_uv/µ) / (1 + 2z_v/µ + S_vv/µ²).   (6)
As this expression makes clear, the terms z_v, S_uv, and S_vv result in the TSLS estimator having a non-normal distribution. If, however, the concentration parameter µ² is large, then the terms S_uv/µ, z_v/µ, and S_vv/µ² will all be small in a probabilistic sense, so that the dominant term in (6) will be z_u, which in turn yields the usual normal approximation to the distribution of β̂_TSLS. Formally, µ² plays the role in (6) that is usually played by the number of observations: as µ² gets large, the distribution of µ(β̂_TSLS – β) is increasingly well approximated by the N(0, σ_u²/σ_v²) distribution. For the normal approximation to the distribution of the TSLS estimator to be a good one, it is not enough to have many observations: the concentration parameter must be large.

Bias of the TSLS estimator in the unidentified case. When µ² = 0 (equivalently, when Π = 0), the instrument is not just weak but irrelevant, and the TSLS estimator is centered around the probability limit of the OLS estimator. To see this, use (5) and Π = 0 to obtain β̂_TSLS – β = v′P_Z u / v′P_Z v. It is useful to write u = E(u|v) + η where, because u and v are jointly normal, E(u|v) = (σ_uv/σ_v²)v and η is normally distributed. Moreover, σ_uv = σ_uY and σ_v² = σ_Y² because Π = 0. Thus u = (σ_uY/σ_Y²)v + η, so

β̂_TSLS – β = v′P_Z[(σ_uY/σ_Y²)v + η] / v′P_Z v = σ_uY/σ_Y² + v′P_Z η / v′P_Z v.   (7)

Because η and v are independently distributed and Z is fixed, E[(v′P_Z η / v′P_Z v)|v] = 0. Suppose that K ≥ 3, so that the first moment of the final ratio in (7) exists; then by the law of iterated expectations,
E(β̂_TSLS – β) = σ_uY/σ_Y².   (8)

The right hand side of (8) is the inconsistency of the OLS estimator when Y is correlated with the error term; that is, plim β̂_OLS = β + σ_uY/σ_Y². Thus, when µ² = 0, the expectation of the TSLS estimator is the probability limit of the OLS estimator.

The result in (8) applies to the limiting case of irrelevant instruments; with weak instruments, the TSLS estimator is biased towards the probability limit of the OLS estimator. Specifically, define the "relative bias" of TSLS to be the bias of TSLS relative to the inconsistency of OLS, that is, E(β̂_TSLS – β)/plim(β̂_OLS – β). Then the TSLS relative bias is approximately inversely proportional to 1 + µ²/K (this result holds whether or not the errors are normally distributed (Buse (1992))). Hence, the relative bias decreases monotonically in µ²/K.

Numerical examples. Figures 1a and 1b respectively present the densities of the TSLS estimator and its t-statistic for various values of the concentration parameter, when the true value of β is zero. The other parameter values mirror those in Nelson and Startz (1990a, 1990b), with σ_u = σ_v = 1, ρ = .99, and K = 1 (one instrument). For small values of µ², such as the value of .25 considered by Nelson and Startz, the distributions are strikingly non-normal, even bimodal. As µ² increases, the distributions eventually approach the usual normal limit.

Under the assumptions of fixed instruments and normal errors, the distributions in Figure 1 depend on the sample size only through the concentration parameter; for a
given value of the concentration parameter, the distributions do not depend on the sample size. Thus, these figures illustrate the point made more generally by (6) that the quality of the usual normal approximation depends on the size of the concentration parameter.

The Nelson–Startz results build on a large literature on the exact distribution of IV estimators under the assumptions of fixed exogenous variables and i.i.d. normal errors (e.g., the exact distribution of the TSLS estimator was obtained by Richardson (1968) and Sawa (1969) for the case of a single right hand side endogenous regressor). From a practical perspective, this literature has two drawbacks. First, the expressions for distributions in this literature, comprehensively reviewed by Phillips (1984), are among the most offputting in econometrics and pose substantial computational challenges. Second, because the assumptions of fixed instruments and normal errors are inappropriate in applications, it is unclear how to apply these results to the sorts of problems encountered in practice. To obtain more generally useful results, researchers have focused on asymptotic approximations, which we now briefly review.

2.2 Asymptotic Approximations

Asymptotic distributions can provide good approximations to exact sampling distributions. Conventionally, asymptotic limits are taken for a fixed model as the sample size gets large, but this is not the only approach, and for some problems it is not the best approach, in the sense that it does not necessarily provide the most useful approximating distribution. This is the case for the weak instruments problem: as is evident in Figure 1, the usual fixed-model large-sample normal approximations can be quite poor when the concentration parameter is small, even if the number of observations
is large. For this reason, various alternative asymptotic methods have been used to analyze IV statistics in the presence of weak instruments; three such methods are Edgeworth expansions, many-instrument asymptotics, and weak-instrument asymptotics. Because the concentration parameter plays a role in these distributions akin to the sample size, all these methods aim to improve the quality of the approximations when µ²/K is small but the number of observations is large.

Edgeworth expansions. An Edgeworth expansion is a representation of the distribution of the statistic of interest in powers of 1/√T; Edgeworth and related expansions of IV estimators are reviewed by Rothenberg (1984). As Rothenberg (1984) points out, in the fixed-instrument, normal-error model, an Edgeworth expansion in 1/√T with a fixed model is formally equivalent to an Edgeworth expansion in 1/µ. In this sense, Edgeworth expansions improve upon the conventional normal approximation when µ is small enough for the term in 1/µ² to matter but not so small that the terms in 1/µ³ and higher matter. Rothenberg (1984) suggests the Edgeworth approximation is "excellent" for µ² > 50 and "adequate" for µ² as small as 10, as long as the number of instruments is small (less than µ).

Many-instrument asymptotics. Although the problems of many instruments and weak instruments might at first seem different, they are in fact related: if there were many strong instruments, then the adjusted R² of the first-stage regression would be nearly one, so a small first-stage adjusted R² suggests that many of the instruments are weak. Bekker (1994) formalized this notion by developing asymptotic approximations for a sequence of models with fixed instruments and Gaussian errors, in which the number of instruments, K, is proportional to the sample size and µ²/K converges to a
constant, finite limit; similar approaches were taken by Anderson (1976), Kunitomo (1980), and Morimune (1983). The limiting distributions obtained from this approach generally are normal, and simulation evidence suggests that these approximations are good for moderate as well as large values of K, although they cannot capture the non-normality evident in the Nelson–Startz example in Figure 1. Distributions obtained using this approach generally depend on the error distribution (see Bekker and van der Ploeg (1999)), so some procedures justified using many-instrument asymptotics require adjustments if the errors are non-normal. Rate and consistency results are, however, more robust to non-normality (see Chao and Swanson (2002)).

Weak-instrument asymptotics. Like many-instrument asymptotics, weak-instrument asymptotics (Staiger and Stock (1997)) considers a sequence of models chosen to keep µ²/K constant; unlike many-instrument asymptotics, K is held fixed. Technically, the sequence of models considered is the same as is used to derive the local asymptotic power of the first-stage F test (a "Pitman drift" parameterization in which Π is in a 1/√T neighborhood of zero). Staiger and Stock (1997) show that, under general conditions on the errors and with random instruments, the representation in (6) can be reinterpreted as holding asymptotically in the sense that the sample size T tends to infinity while µ²/K is fixed. More generally, many of the results from the fixed-instrument, normal-error model apply to models with random instruments and nonnormal errors, with certain simplifications arising from the consistency of the estimator of σ_v².

2.3 Weak Instruments with Multiple Regressors

The linear IV model with multiple regressors is,
y = Yβ + Xγ + u,   (9)
Y = ZΠ + XΦ + V,   (10)

where Y is now a T×n matrix of observations on n endogenous variables, X is a T×J matrix of included exogenous variables, and V is a T×n matrix of disturbances with covariance matrix Σ_VV; as before, Z is a T×K matrix of instruments and u is T×1. The concentration parameter is now an n×n matrix. Expressed in terms of population moments, the concentration matrix is

Σ_VV^{–1/2}′ Π′ Σ_Z|X Π Σ_VV^{–1/2},

where Σ_Z|X = Σ_ZZ – Σ_ZX Σ_XX⁻¹ Σ_XZ. To avoid introducing new notation, we refer to the concentration parameter as µ² in both the scalar and matrix cases.

Exact distribution theory for IV statistics with multiple regressors is quite complicated, even with fixed exogenous variables and normal errors. Somewhat simpler expressions obtain using weak-instrument asymptotics (under which reduced form moments, such as Σ_VV and Σ_Z|X, are consistently estimable). The quality of the usual normal approximation is governed by the matrix µ²/K. Because the predicted values of Y from the first-stage regression can be highly correlated, for the usual normal approximations to be good it is not enough for a few elements of µ²/K to be large. Rather, it is necessary for the matrix µ²/K to be large in the sense that its smallest eigenvalue is large.

The notation for IV estimators with included exogenous regressors X (equations (9) and (10)) is cumbersome. Sections 4–6 therefore focus on the case of no included
exogenous variables. The discussion, however, extends directly to the case of included exogenous regressors, and the formulas generally extend by replacing y, Y, and Z by the residuals from their projection onto X and by modifying the degrees of freedom as needed.

3. Three Empirical Examples

There are many examples of weak instruments in empirical work. Here, we mention three from labor economics, finance, and macroeconomics.

3.1 Estimating the Returns to Education

In an influential paper, Angrist and Krueger (1991) proposed using the quarter of birth as an instrumental variable to circumvent ability bias in estimating the returns to education. The date of birth, they argued, should be uncorrelated with ability, so that quarter of birth is exogenous; because of mandatory schooling laws, quarter of birth should also be relevant. Using large samples from the U.S. census, they therefore estimated the returns to education by TSLS, using quarter of birth and its interactions with state and year of birth binary variables. Depending on the specification, they had as many as 178 instruments.

Surprisingly, despite the large number of observations (329,000 observations or more), in some of the Angrist–Krueger regressions the instruments are weak. This point was first made by Bound, Jaeger, and Baker (1995), who (among other things) provide Monte Carlo results showing that similar point estimates and standard errors obtain in
some specifications if each individual's true quarter of birth is replaced by a randomly generated quarter of birth. Because the results with the randomly generated quarter of birth must be spurious, this suggests that the results with the true quarter of birth are misleading. The source of these misleading inferences is weak instruments: in some specifications, the first-stage F-statistic is less than 2, suggesting that µ²/K might be one or less (recall that E(F) – 1 ≈ µ²/K). An important conclusion is that it is not helpful to think of weak instruments as a "finite sample" problem that can be ignored if you have sufficiently many observations.

3.2 The Log-Linearized Euler Equation in the CCAPM

The first empirical application of GMM was Hansen and Singleton's (1982) investigation of the consumption-based capital asset pricing model (CCAPM). In its log-linearized form, the first order condition of the CCAPM with constant relative risk aversion can be written

E[(r_{t+1} + α – γ⁻¹∆c_{t+1}) | Z_t] = 0,   (11)

where γ is the coefficient of risk aversion (here, also the inverse of the intertemporal elasticity of substitution), ∆c_{t+1} is the growth rate of consumption, r_t is the log gross return on some asset, α is a constant, and Z_t is a vector of variables in the information set at time t (Hansen and Singleton (1983); see Campbell (2001) for a textbook treatment). The coefficients of (11) can be estimated by GMM using Z_t as an instrument. One way to proceed is to use TSLS with r_{i,t+1} as the dependent variable; another is to
apply TSLS with ∆c_{t+1} as the dependent variable; a third is to use a method, such as LIML, that is invariant to the normalization. Under standard fixed-model asymptotics, these estimators are asymptotically equivalent, so it should not matter which method is used. But, as discussed in detail in Neely, Roy, and Whiteman (2001) and Yogo (2002), it matters greatly in practice, with point estimates of γ ranging from small (Hansen and Singleton (1982, 1983)) to very large (Hall (1988), Campbell and Mankiw (1989)).

Although one possible explanation for these disparate empirical findings is that the instruments are not exogenous – the restrictions in (11) fail to hold – another possibility is that the instruments are weak. Indeed, the analysis of Stock and Wright (2000), Neely, Roy, and Whiteman (2001), and Yogo (2002) suggests that weak instruments are part of the explanation for these seemingly contradictory results. It should not be a surprise that instruments are weak here: for an instrument to be strong, it must be a good predictor of either consumption growth or an asset return, depending on the normalization, but both are notoriously difficult to predict. In fact, the first-stage F-statistics in regressions based on (11) are frequently less than 5 (Yogo (2002)).

3.3 A Hybrid Phillips Curve

Forward-looking price setting behavior is a prominent feature of modern macroeconomics. The hybrid Phillips curve (e.g., Fuhrer and Moore (1995)) blends forward-looking and backward-looking behavior, so that prices are based in part on expected future prices and in part on past prices. According to a typical hybrid Phillips curve, inflation (π_t) follows,
π_t = β₁x_t + β₂E_tπ_{t+1} + β₃π_{t–1} + ω_t,   (12)

where x_t is a measure of demand pressures, such as the output gap, E_tπ_{t+1} is the expected inflation rate in time t+1 based on information available at time t, and ω_t is a disturbance. Because E_tπ_{t+1} is unobserved, the coefficients of (12) cannot be estimated by OLS, but they can be estimated by GMM based on the moment condition

E[(π_t – β₁x_t – β₂π_{t+1} – β₃π_{t–1}) | x_t, π_t, π_{t–1}, Z_t] = 0,

where Z_t is a vector of variables known at date t. Efforts to estimate (12) using the output gap have met with mixed success (e.g., Roberts (1997), Fuhrer (1997)), although Gali and Gertler (1999) report plausible parameter estimates when marginal cost, as measured by labor's share, is used as x_t.

Although the papers in this literature do not discuss the possibility of weak instruments, recent work by Ma (2001) and Mavroeidis (2001) suggests that weak instruments could be a problem here. Here, too, this should not be a surprise: to be strong, the instrument Z_t must have substantial marginal predictive content for π_{t+1}, given x_t, π_t, and π_{t–1}. However, the regression relating π_{t+1} to x_t, π_t, and π_{t–1} is the "old" backwards-looking Phillips curve which, despite its theoretical limitations, is one of the most reliable tools for forecasting inflation (e.g., Stock and Watson (1999)); for Z_t to be a strong instrument, it must improve substantially upon a backward-looking Phillips curve.

4. Detection of Weak Instruments
This section discusses two methods for detecting weak instruments, one based on the first-stage F-statistic, the other based on a statistic recently proposed by Hahn and Hausman (2002).

4.1 The First-stage F-statistic

Before discussing how to use the first-stage F-statistic to detect weak instruments, we need to say what, precisely, weak instruments are.

A definition of weak instruments. A practical approach to defining weak instruments is that instruments are weak if µ²/K is so small that inferences based on conventional normal approximating distributions are misleading. In this approach, the definition of weak instruments depends on the purpose to which the instruments are put, combined with the researcher's tolerance for departures from the usual standards of inference (bias, size of tests). For example, suppose you are using TSLS and you want its bias to be small. One measure of the bias of TSLS is its bias relative to the inconsistency of OLS, as defined in Section 2. Accordingly, one measure of whether a set of instruments is strong is whether µ²/K is sufficiently large that the TSLS relative bias is, say, no more than 10%; if not, then the instruments are deemed weak. Alternatively, if you are interested in hypothesis testing, then you could define instruments to be strong if µ²/K is large enough that a nominal 5% hypothesis test rejects no more than (say) 15% of the time; otherwise, the instruments are weak. These two definitions – one based on relative bias, one based on size – in general yield different sets of µ²/K; thus instruments might be weak if used for one application, but not if used for another.
Here, we consider the two definitions of weak instruments in the previous paragraph, that is, that the TSLS bias could exceed 10%, or that the nominal 5% test of β = β₀ based on the usual TSLS t-statistic has size that could exceed 15%. As shown by Stock and Yogo (2001), under weak-instrument asymptotics, each of these definitions implies a threshold value of µ²/K: if the actual value of µ²/K exceeds this threshold, then the instruments are strong (e.g., TSLS relative bias is less than 10%); otherwise the instruments are weak.

Ascertaining whether instruments are weak using the first-stage F-statistic. In the fixed-instrument/normal-error model, and also under weak-instrument asymptotics, the distribution of the first-stage F-statistic depends only on µ²/K and K, so that inference about µ²/K can be based on F. As Hall, Rudebusch, and Wilcox (1996) show in Monte Carlo simulations, however, simply using F to test the hypothesis of non-identification (Π = 0) is inadequate as a screen for problems caused by weak instruments. Instead, we follow Stock and Yogo (2001) and use F to test the null hypothesis that µ²/K is less than or equal to the weak-instrument threshold, against the alternative that it exceeds the threshold.

Table 1 summarizes weak-instrument threshold values of µ²/K and critical values for the first-stage F-statistic testing the null hypothesis that instruments are weak, for selected values of K. For example, under the TSLS relative bias definition of weak instruments, if K = 5 then the threshold value of µ²/K is 5.82 and the test that µ²/K ≤ 5.82 rejects in favor of the alternative that µ²/K > 5.82 if F ≥ 10.83. It is evident from Table 1 that one needs large values of the first-stage F-statistic, typically exceeding 10, for TSLS inference to be reliable.
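As a concrete illustration of this screen, the sketch below (not part of the original paper; it assumes Python with NumPy) computes the first-stage F-statistic for a single included endogenous regressor and compares it with an illustrative critical value, using the 10.83 quoted above for K = 5; in an application the critical value should be taken from Table 1 for the relevant K and reliability criterion.

```python
import numpy as np

def first_stage_F(Y, Z):
    """First-stage F-statistic for H0: Pi = 0 in Y = Z Pi + v (no intercept).

    Y is the (T,) included endogenous regressor, Z the (T, K) instrument matrix.
    """
    T, K = Z.shape
    Pi_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]
    fitted = Z @ Pi_hat
    resid = Y - fitted
    sigma2_v = resid @ resid / (T - K)
    # F = Pi_hat'(Z'Z)Pi_hat / (K * sigma2_v); recall E(F) is roughly 1 + mu^2/K.
    return (fitted @ fitted) / (K * sigma2_v)

# Illustrative use with simulated weakly identified data (K = 5).
rng = np.random.default_rng(1)
T, K = 1000, 5
Z = rng.standard_normal((T, K))
Pi = np.full(K, 0.05)                 # weak first-stage coefficients
Y = Z @ Pi + rng.standard_normal(T)

F = first_stage_F(Y, Z)
print(f"first-stage F = {F:.2f}")
# Illustrative threshold from Table 1 (relative-bias definition, K = 5): F >= 10.83.
print("instruments look weak" if F < 10.83 else "instruments pass the threshold")
```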
4.2 Extension of the First-stage F-statistic to n > 1

As discussed in Section 2.3, when there are multiple included endogenous regressors (n > 1) the concentration parameter µ² is a matrix. Because of possible multicollinearity among the predicted values of the endogenous regressors, in general the distribution of TSLS statistics depends on all the elements of µ². From a statistical perspective, when n > 1, the n first-stage F-statistics are not sufficient statistics for the concentration matrix, even with fixed regressors and normal errors (Shea (1997) discusses the pitfall of using the n first-stage F-statistics when n > 1). Instead, inference about µ² can be based on the n×n matrix analog of the first-stage F-statistic,

G_T = Σ̂_VV^{–1/2}′ Y′P_Z Y Σ̂_VV^{–1/2} / K,   (13)

where Σ̂_VV = Y′M_Z Y/(T – K) and M_Z = I – P_Z. Under weak-instrument asymptotics, E(G_T) → µ²/K + I_n, where I_n is the n×n identity matrix.

Cragg and Donald (1993) proposed using G_T to test for partial identification, specifically, testing the hypothesis that the matrix Π has rank L against the alternative that it has rank greater than L, where L < n. If the rank of Π is L, then L linear combinations of β are identified, and n – L are not; Choi and Phillips (1992) discuss this situation, which they refer to as partial identification. The Cragg–Donald test statistic is Σ_{i=1}^{n–L} λ_i, where λ_i is the ith-smallest eigenvalue of G_T. Under the null, this statistic has a
χ²_{(n–L)(K–L)}/[(n–L)(K–L)] limiting distribution. For L = 0 (so the null is complete non-identification), the statistic reduces to trace(G_T), the sum of the n individual first-stage F-statistics. If L = n – 1 (so the null is the smallest possible degree of underidentification), then the Cragg–Donald statistic is the smallest eigenvalue of G_T.

Although the Cragg–Donald statistic can be used to test for underidentification, from the perspective of IV inference mere instrument relevance is insufficient; rather, the instruments must be strong in the sense that µ²/K must be large. Rejecting the null hypothesis of partial identification does not ensure reliable IV inference. Accordingly, Stock and Yogo (2001) consider the problem of testing the null hypothesis that a set of instruments is weak, against the alternative that they are strong, where instruments are defined to be strong if conventional TSLS inference is reliable for any linear combination of the coefficients. By focusing on the worst-behaved linear combination, this approach is conservative but tractable, and they provide tables of critical values, similar to those in Table 1, based on the minimum eigenvalue of G_T.

4.3 A Specification Test of a Null of Strong Instruments

The methods discussed so far have been tests of the null of weak instruments. Hahn and Hausman (2002) reverse the null and alternative and propose a test of the null that the instruments are strong, against the alternative that they are weak. Their procedure is conceptually straightforward. Suppose that there is a single included endogenous regressor (n = 1) and that the instruments are strong, so that conventional normal approximations are valid. Then the normalization of the regression (the choice of dependent variable) should not matter asymptotically. Thus the TSLS estimator in the
forward regression of y on Y and the inverse of the TSLS estimator in the reverse regression of Y on y are asymptotically equivalent (to order o_p(T^{–1/2})) with strong instruments, but if instruments are weak, they are not. Accordingly, Hahn and Hausman (2002) developed a statistic comparing the estimators from the forward and reverse regressions (and the extension of this idea when n = 2). They suggest that if this statistic rejects the null hypothesis, then a researcher should conclude that his or her instruments are weak; otherwise he or she can treat the instruments as strong.

5. Fully Robust Inference with Weak Instruments in the Linear Model

This section discusses hypothesis tests and confidence sets for β that are fully robust to weak instruments, in the sense that these procedures have the correct size or coverage rates regardless of the value of µ² (including µ² = 0) when the sample size is large, specifically, under weak-instrument asymptotics. If n = 1, these methods also produce median-unbiased estimators of β as the limiting case of a 0% confidence interval. This discussion starts with testing, then concludes with the complementary problem of confidence sets. We focus on the case n = 1, but the methods can be generalized to joint inference about β when n > 1.

Several fully robust methods have been proposed, and Monte Carlo studies suggest that none appears to dominate the others. Moreira (2001) provides a theoretical explanation of this in the context of the fixed-instrument, normal-error model. In that model, there is no uniformly most powerful test of the hypothesis β = β₀, a result that also holds asymptotically under weak-instrument asymptotics. In this light, the various fully
robust procedures represent tradeoffs, with some working better than others, depending on the true parameter values.

5.1 Fully Robust Gaussian Tests

Moreira (2001) considers the system (1) and (2) with fixed instruments and normally distributed errors. The reduced-form equation for y is

y = ZΠβ + w.   (14)

Let Ω denote the covariance matrix of the reduced-form errors (w_t, v_t), and for now suppose that Ω is known. We are interested in testing the hypothesis β = β₀. Moreira (2001) shows that, under these assumptions, the statistics (S̄, T̄) are sufficient for β and Π, where

S̄ = (Z′Z)^{–1/2} Z′Ȳ b₀ / (b₀′Ω b₀)^{1/2}   and   T̄ = (Z′Z)^{–1/2} Z′Ȳ Ω⁻¹ a₀ / (a₀′Ω⁻¹ a₀)^{1/2},   (15)

where Ȳ = [y Y], b₀ = (1, –β₀)′ and a₀ = (β₀, 1)′. Thus, for the purpose of testing β = β₀, it suffices to consider test statistics that are functions of only S̄ and T̄, say g(S̄, T̄). Moreover, under the null hypothesis β = β₀, the distribution of T̄ depends on Π but the distribution of S̄ does not; thus, under the null hypothesis, T̄ is sufficient for Π. It follows that a test of β = β₀ based on g(S̄, T̄) is similar if its critical value is computed from the conditional distribution of g(S̄, T̄) given T̄. Moreira (2001) also derives an
infeasible power envelope for similar tests in the fixed-instrument, normal-error model, under the further assumption that Π is known. In practice, Π is not known, so feasible tests cannot achieve this power envelope and, when K > 1, there is no uniformly most powerful test of β = β₀.

In practice, Ω is unknown, so the statistics in (15) cannot be computed. However, under weak-instrument asymptotics, Ω can be estimated consistently. Accordingly, let Ŝ and T̂ denote S̄ and T̄ evaluated with Ω̂ = Ȳ′M_Z Ȳ/(T – K) replacing Ω. Because Moreira's (2001) family of feasible similar tests, based on statistics of the form g(Ŝ, T̂), is derived under the assumption of normality, we refer to them as Gaussian similar tests.

5.2 Three Gaussian Similar Tests

We now turn to three Gaussian similar tests: the Anderson-Rubin statistic, Kleibergen's (2001) test statistic, and Moreira's (2002) conditional likelihood ratio (LR) statistic.

The Anderson-Rubin Statistic. More than fifty years ago, Anderson and Rubin (1949) proposed testing the null hypothesis β = β₀ in (9) using the statistic

AR(β₀) = [(y – Yβ₀)′P_Z(y – Yβ₀)/K] / [(y – Yβ₀)′M_Z(y – Yβ₀)/(T – K)] = Ŝ′Ŝ/K.   (16)

One definition of the LIML estimator is that it minimizes AR(β).
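To illustrate how the AR statistic is used in practice, the following sketch (not part of the original paper; it assumes Python with NumPy and SciPy) evaluates AR(β₀) over a grid of candidate values and reports the 95% confidence set formed by the values that are not rejected against the χ²_K/K asymptotic critical value; the exact F_{K,T–K} critical value discussed next could be used instead.

```python
import numpy as np
from scipy.stats import chi2

def ar_stat(beta0, y, Y, Z):
    """Anderson-Rubin statistic AR(beta0) from equation (16)."""
    T, K = Z.shape
    u0 = y - Y * beta0                       # residual imposing H0: beta = beta0
    Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u0)
    num = (u0 @ Pu) / K                      # u0' P_Z u0 / K
    den = (u0 @ u0 - u0 @ Pu) / (T - K)      # u0' M_Z u0 / (T - K)
    return num / den

def ar_confidence_set(y, Y, Z, grid, level=0.95):
    """Grid-based confidence set: beta0 values at which AR fails to reject."""
    K = Z.shape[1]
    crit = chi2.ppf(level, df=K) / K         # weak-instrument-robust critical value
    return [b for b in grid if ar_stat(b, y, Y, Z) <= crit]

# Example with simulated data (single endogenous regressor, K = 3 instruments).
rng = np.random.default_rng(2)
T, K, beta = 200, 3, 1.0
Z = rng.standard_normal((T, K))
v = rng.standard_normal(T)
u = 0.8 * v + 0.6 * rng.standard_normal(T)
Y = Z @ np.full(K, 0.2) + v
y = Y * beta + u

cs = ar_confidence_set(y, Y, Z, grid=np.linspace(-5, 5, 1001))
print(f"AR 95% confidence set spans roughly [{min(cs):.2f}, {max(cs):.2f}]"
      if cs else "AR rejects every value on the grid")
```

With weak instruments the collected set can be wide, disjoint, or unbounded, which anticipates the discussion of robust confidence sets in Section 5.5.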
Anderson and Rubin (1949) showed that, if the errors are Gaussian and the instruments are fixed, then this test statistic has an exact F_{K,T–K} distribution under the null, regardless of the value of µ²/K. Under the more general conditions of weak-instrument asymptotics, AR(β₀) has a χ²_K/K limiting distribution under the null hypothesis, regardless of the value of µ²/K. Thus the AR statistic provides a fully robust test of the hypothesis β = β₀.

The AR statistic can reject either because β ≠ β₀ or because the instrument orthogonality conditions fail. In this sense, inference based on the AR statistic is different from inference based on conventional GMM standard errors, for which the maintained hypothesis is that the instruments are valid. For this reason, a variety of tests have been proposed that maintain the hypothesis that the instruments are valid.

Kleibergen's Statistic. Kleibergen (2001) proposed testing β = β₀ using the statistic

K(β₀) = (Ŝ′T̂)² / (T̂′T̂),   (17)

which, following Moreira (2001), we have written in terms of Ŝ and T̂. If K = 1, then K(β₀) = AR(β₀). Kleibergen shows that, under either conventional or weak-instrument asymptotics, K(β₀) has a χ²_1 null limiting distribution.

Moreira's Statistic. Moreira (2002) proposed to test β = β₀ using the conditional likelihood ratio test statistic,
M(β₀) = ½ { Ŝ′Ŝ – T̂′T̂ + [ (Ŝ′Ŝ + T̂′T̂)² – 4( (Ŝ′Ŝ)(T̂′T̂) – (Ŝ′T̂)² ) ]^{1/2} }.   (18)

The (weak-instrument) asymptotic conditional distribution of M(β₀) under the null, given T̂, is nonstandard and depends on β₀, and Moreira (2002) suggests computing the distribution by Monte Carlo simulation.

5.3 Power Comparisons

We now turn to a comparison of the weak-instrument asymptotic power of the Anderson-Rubin, Kleibergen, and Moreira tests. The asymptotic power functions of these tests depend on µ²/K, ρ (the correlation between u and v in equations (1) and (2)), and K, as well as the true value of β. We consider two values of µ²/K: µ²/K = 1 corresponds to very weak instruments (nearly unidentified), and µ²/K = 5 corresponds to moderately weak instruments. The two values of ρ considered correspond to moderate endogeneity (ρ = .5) and very strong endogeneity (ρ = .99).

Figure 2 presents weak-instrument asymptotic power functions for K = 5 instruments, so that the degree of overidentification is 4; the power depends on β – β₀ but not on β₀, so Figure 2 applies to general β₀. Moreira's (2001) infeasible asymptotic Gaussian power envelope is also reported as a basis for comparison. When µ²/K = 1 and ρ = .5, all tests have poor power for all values of the parameter space, a result that is reassuring given how weak the instruments are; moreover, all tests have power functions that are far from the infeasible power envelope. Another unusual feature of these tests is that the power function does not increase monotonically as β departs from β₀. The
relative performance of the three tests differs, depending on the true parameter values, and the power functions occasionally cross. Typically (but not always), the K(β0) and M(β0) tests outperform the AR(β0) test. Figure 3 presents the corresponding power functions for many instruments (K = 50). For µ2/K = 5, both the K(β0) and M(β0) tests are close to the power envelope for most values of β – β0 (except, oddly, when β
union of the TSLS (or LIML) 97.5% confidence intervals, conditional on the non-rejected values of µ²/K.

The Wang-Zivot (1998) Score Test. Wang and Zivot (1998) and Zivot, Startz and Nelson (1998) consider testing β = β₀ using modifications of conventional GMM test statistics, in which σ_u² is estimated under the null hypothesis. Using weak-instrument asymptotics, they show that although these statistics are not asymptotically pivotal, their null distributions are bounded by an F_{K,∞} distribution (the bound is tight if K = 1), which permits a valid, but conservative, test.

5.5 Robust Confidence Sets

By the duality between hypothesis tests and confidence sets, these tests can be used to construct fully robust confidence sets. For example, a fully robust 95% confidence set can be constructed as the set of β₀ for which the Anderson-Rubin (1949) statistic, AR(β₀), fails to reject at the 5% significance level. In general, this approach requires evaluating the test statistic for all points in the parameter space, although for some statistics the confidence interval can be obtained by solving a polynomial equation (a quadratic, in the case of the AR statistic).

Because the tests in this section are fully robust to weak instruments, the confidence sets constructed by inverting the tests are fully robust. As a general matter, when the instruments are weak, these sets can have infinite volume. For example, because the AR statistic is a ratio of quadratic forms, it can have a finite maximum, and when µ² = 0 any point in the parameter space will be contained in the AR confidence set with probability 95%. This does not imply that these methods
waste information, or are unnecessarily imprecise; on the contrary, they reflect the fact that, if instruments are weak, there simply is limited information to use to make inferences about β. This point is made formally by Dufour (1997), who shows that under weak-instrument asymptotics a confidence set for β must have infinite expected volume if it is to have nonzero coverage uniformly in the parameter space, as long as µ² is fixed and finite. This infinite expected volume condition is shared by confidence sets constructed using any of the fully robust methods of this section; for additional discussion see Zivot, Startz and Nelson (1998).

6. Partially Robust Inference with Weak Instruments

Although the fully robust tests discussed in the previous section always control size, they have some practical disadvantages; for example, some are difficult to compute. Moreover, for n > 1 they do not readily provide point estimates, and confidence intervals for individual elements of β must be obtained by conservative projection methods. For this reason, some researchers have investigated inference that is partially robust to weak instruments.

Recall from Section 4 that whether instruments are weak depends on the task to which they are put. Accordingly, one way to frame the investigation of partially robust methods is to push the "weak instrument threshold" as far as possible below that needed for TSLS to be reliable, yet not require that the method produce valid inference in the completely unidentified case.

6.1 k-class Estimators
The k-class estimator of β is β̂(k) = [Y′(I – kM_Z)Y]⁻¹[Y′(I – kM_Z)y]. This class includes TSLS (for which k = 1), LIML, and some alternatives that improve upon TSLS when instruments are weak.

LIML. LIML is a k-class estimator where k = k_LIML is the smallest root of the determinantal equation |Ȳ′Ȳ – k Ȳ′M_Z Ȳ| = 0, where Ȳ = [y Y]. The mean of the distribution of the LIML estimator does not exist, and its median is typically much closer to β than is the mean or median of TSLS. In the fixed-instrument, normal-error case the bias of TSLS increases with K, but the bias of LIML does not (Rothenberg (1984)). When the errors are symmetrically distributed and instruments are fixed, LIML is the best median-unbiased k-class estimator to second order (Rothenberg (1983)). Moreover, LIML is consistent under many-instrument asymptotics (Bekker (1994)), while TSLS is not. This bias reduction comes at a price, as the LIML estimator has fat tails. For example, LIML generally has a larger interquartile range than TSLS when instruments are weak (e.g., Hahn, Hausman and Kuersteiner (2001a)).

Fuller-k estimators. Fuller (1977) proposed an alternative k-class estimator which sets k = k_LIML – b/(T – K), where b is a positive constant. With fixed instruments and normal errors, the Fuller-k estimator with b = 1 is best unbiased to second order (Rothenberg (1984)). In Monte Carlo simulations, Hahn, Hausman, and Kuersteiner (2001a) report substantial reductions in bias and mean squared error, relative to TSLS and LIML, using Fuller-k estimators when instruments are weak.

Bias-adjusted TSLS. Donald and Newey (2001) consider a bias-adjusted TSLS estimator (BTSLS), which is a k-class estimator with k = T/(T – K + 2), modifying an estimator previously proposed by Nagar (1959). Rothenberg (1984) shows that BTSLS is
unbiased to second order in the fixed-instrument, normal-error model. Donald and Newey provide expressions for the second-order asymptotic mean square error (MSE) of BTSLS, TSLS and LIML, as a function of the number of instruments K. They propose choosing the number of instruments to minimize the second-order MSE. In Monte Carlo simulations, they find that selecting the number of instruments in this way generally improves performance. Chao and Swanson (2001) develop analogous expressions for the bias and mean squared error of TSLS under weak-instrument asymptotics, modified to allow the number of instruments to increase with the sample size; they report improvements in Monte Carlo simulations by incorporating bias adjustments.

6.3 The SSIV and JIVE Estimators

Angrist and Krueger (1992, 1995) proposed eliminating the bias of TSLS by splitting the sample into two independent subsamples, then running the first-stage regression on one subsample and the second-stage regression on the other. The SSIV estimator is given by an OLS regression of y^[1] on Z^[1]Π̂^[2], where the superscript denotes the subsample. Angrist and Krueger (1995) show that SSIV is biased towards zero, rather than towards the OLS probability limit.

Because SSIV does not use the sample symmetrically and appears to waste information, Angrist, Imbens and Krueger (1999) proposed a jackknife instrumental variables (JIVE) estimator. The JIVE estimator is β̂_JIVE = (Ŷ′Y)⁻¹Ŷ′y, where the ith row of Ŷ is Z_iΠ̂_{–i} and Π̂_{–i} is the estimator of Π computed using all but the ith observation. Angrist, Imbens and Krueger show that JIVE and TSLS are asymptotically equivalent under conventional fixed-model asymptotics. Calculations reveal that, under weak-
instrument asymptotics, JIVE is asymptotically equivalent to a k-class estimator with k = 1 + K/(T – K). Theoretical calculations (Chao and Swanson (2002)) and Monte Carlo simulations (Angrist, Imbens and Krueger (1999), and Blomquist and Dahlberg (1999)) indicate that JIVE improves upon TSLS when there are many instruments.

6.4 Comparisons

One way to assess the extent to which a proposed estimator or test is robust to weak instruments is to characterize the size of the weak-instrument region. When n = 1, this can be done by computing the value of µ²/K needed to ensure that inferences attain a desired degree of reliability. This was the approach taken in Table 1, and here we extend this approach to some of the estimators discussed in this section.

Figure 4 reports boundaries of asymptotic weak-instrument sets, as a function of K, for various estimators (for computational details, see Stock and Yogo (2001)). In Figure 4a, the weak-instrument set is defined to be the set of µ²/K such that the relative bias of the estimator (relative to the inconsistency of OLS) exceeds 10%; the values of µ²/K plotted are the largest values that meet this criterion. In Figure 4b, the weak-instrument set is defined so that a nominal 5% test of β = β₀, based on the relevant t-statistic, rejects more than 15% of the time (that is, has size exceeding 15%).

Two features of Figure 4 are noteworthy. First, LIML, BTSLS, JIVE, and the Fuller-k estimator have much smaller weak-instrument thresholds than TSLS; in this sense, these four estimators are more robust to weak instruments than TSLS. Second, these thresholds do not increase as a function of K, whereas the TSLS threshold increases in K, a reflection of its greater bias as the number of instruments increases.
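To make the k-class estimators of Section 6.1 concrete, the following sketch (not part of the original paper; it assumes Python with NumPy) evaluates β̂(k) = [Y′(I – kM_Z)Y]⁻¹[Y′(I – kM_Z)y] at k = 1 (TSLS), k = k_LIML, and k = k_LIML – 1/(T – K) (the Fuller-k estimator with b = 1) on simulated weak-instrument data.

```python
import numpy as np

def k_class(y, Y, Z, k):
    """k-class estimator beta_hat(k) = [Y'(I - k M_Z)Y]^(-1) [Y'(I - k M_Z)y]."""
    T = len(y)
    MZ = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    A = np.eye(T) - k * MZ
    return (Y @ A @ y) / (Y @ A @ Y)

def k_liml(y, Y, Z):
    """Smallest root of |Ybar'Ybar - k Ybar'M_Z Ybar| = 0, with Ybar = [y Y]."""
    T = len(y)
    Ybar = np.column_stack([y, Y])
    MZ = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    # Generalized eigenvalue problem: eigenvalues of (Ybar'M_Z Ybar)^(-1)(Ybar'Ybar).
    W = np.linalg.solve(Ybar.T @ MZ @ Ybar, Ybar.T @ Ybar)
    return np.min(np.real(np.linalg.eigvals(W)))

# Example on simulated weak-instrument data (true beta = 1).
rng = np.random.default_rng(3)
T, K, beta = 500, 10, 1.0
Z = rng.standard_normal((T, K))
v = rng.standard_normal(T)
u = 0.9 * v + np.sqrt(1 - 0.9**2) * rng.standard_normal(T)
Y = Z @ np.full(K, 0.05) + v
y = Y * beta + u

kl = k_liml(y, Y, Z)
print("TSLS     :", round(k_class(y, Y, Z, 1.0), 3))
print("LIML     :", round(k_class(y, Y, Z, kl), 3))
print("Fuller-1 :", round(k_class(y, Y, Z, kl - 1.0 / (T - K)), 3))
```

In designs like this one, TSLS tends to be pulled toward the OLS probability limit while LIML and Fuller-k are typically centered closer to the true coefficient, consistent with the thresholds plotted in Figure 4.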
7. GMM Inference in General Nonlinear Models

It has been recognized for some time that the usual large-sample normal approximations to GMM statistics in general nonlinear models can provide poor approximations to exact sampling distributions in problems of applied interest. For example, Hansen, Heaton and Yaron (1996) examine GMM estimators of various intertemporal asset pricing models using a Monte Carlo design calibrated to match U.S. data. They find that, in many cases, inferences based on the usual normal distributions are misleading (also see Tauchen (1986), Kocherlakota (1990), Ferson and Foerster (1994) and Smith (1999)).

The foregoing discussion of weak instruments in the linear model suggests that weak instruments could be one possible reason for the failure of the conventional normal approximations in nonlinear GMM. In the linearized CCAPM Euler equation (11), both the log gross asset return r_t and the growth rate of consumption ∆c_t are difficult to predict; thus, as argued by Stock and Wright (2000) and Neely, Roy, and Whiteman (2001), it stands to reason that estimation of the original nonlinear Euler equation by GMM also suffers from weak instruments. But making this intuition precise in the general nonlinear setting is difficult: the machinery discussed in the previous sections relies heavily on the linear regression model.

In this section, we begin by briefly discussing the problems posed by weak instruments in nonlinear GMM, and suggest that a better term in this context is weak identification. The general approaches to handling the problem of weak identification are the same in the nonlinear setting as the linear
setting: detection of weak identification, procedures that are fully robust to weak identification, and procedures that are partially robust. As shall be seen, the literature on this topic is quite incomplete.

7.1 Consequences of Weak Identification in Nonlinear GMM

In GMM estimation, the n×1 parameter vector θ is identified by the G conditional mean conditions E[h(Y_t, θ₀)|Z_t] = 0, where θ₀ is the true value of θ and Z_t is a K-vector of instruments. The GMM estimator is computed by minimizing

S_T(θ) = [T⁻¹ Σ_{t=1}^T φ_t(θ)]′ W(θ)⁻¹ [T⁻¹ Σ_{t=1}^T φ_t(θ)],   (19)

where φ_t(θ) = h(Y_t, θ) ⊗ Z_t is r×1, W(θ) is an r×r positive definite matrix, and r = GK. In the two-step GMM estimator, W(θ) initially is set to the identity matrix, yielding the estimator θ̂^(1), and the second step uses Ŵ(θ̂^(1)), where

Ŵ(θ) = T⁻¹ Σ_{t=1}^T [φ_t(θ) – φ̄(θ)][φ_t(θ) – φ̄(θ)]′   (20)

and φ̄(θ) = T⁻¹ Σ_{t=1}^T φ_t(θ). (Here, we assume that φ_t(θ) is serially uncorrelated; otherwise Ŵ(θ) is replaced by an estimator of the spectral density of φ_t(θ) at frequency zero.) The iterated GMM estimator continues this process, evaluating Ŵ(θ) at the
previous estimate of θ until convergence (see Ferson and Foerster (1994)). For additional details about GMM estimation, see Hayashi (2000).

To understand weak identification in nonlinear GMM, it is useful to return for a moment to the linear model, in which case we can write the moment condition as

E[(Y_{1t} – θ₀Y_{2t})Z_t] = 0,   (21)

where, in a slight shift in notation, Y_{1t} and Y_{2t} correspond to y_t and Y_t in equations (1) and (2) and the parameter vector is θ rather than β. The reason that the instruments serve to identify θ is that the orthogonality condition (21) holds at the true value θ₀, but it does not hold at other values of θ. If the instrument is irrelevant, so that Y_{2t} is uncorrelated with Z_t, then E[(Y_{1t} – θY_{2t})Z_t] = 0 for all values of θ, and (21) no longer identifies θ₀ uniquely. If the instruments are weak, then E[(Y_{1t} – θY_{2t})Z_t] is nearly zero for all values of θ, and in this sense θ can be said to be weakly identified. Said differently, weak instruments imply that the correlation between the model error term Y_{1t} – θY_{2t} and the instruments is nearly zero, even at false values of θ.

This intuition of weak instruments implying weak identification carries over to the nonlinear GMM setting: if the correlation between the model error term, h(Y_t, θ), and Z_t is low even for false values of θ, then θ is weakly identified.

Because there is no exact sampling theory for GMM estimators, a formal treatment of the implications of weak identification for GMM must entail asymptotics. As in the linear case, there are two approaches. One is to use stochastic expansions in orders of T^{–1/2}, an approach that has been pursued by Newey and Smith (2001). This
approach, however, seems likely to produce poor approximations when identification is very weak, as it does in the linear case. A second approach is to use an asymptotic nesting that lets T → ∞ but, loosely speaking, keeps the GMM version of the concentration parameter constant. This approach is developed in Stock and Wright (2000), who derive a stochastic process representation of the limiting objective function (the limit of S_T, where S_T is treated as a stochastic process indexed by θ) that holds formally in weakly identified, partially identified, and non-identified cases. Stock and Wright's numerical work suggests that weak identification can explain many of the Monte Carlo results in Tauchen (1986), Kocherlakota (1990) and Hansen, Heaton, and Yaron (1996).

7.2 Detecting Weak Identification

An implication of weak identification is that GMM estimators can exhibit a variety of pathologies. For example, two-step GMM point estimates and iterated GMM point estimates can be quite different and can yield quite different confidence sets. If identification is weak, GMM estimates can be sensitive to the addition of instruments or to changes in the sample. To the extent that these features are present in an empirical application, they could suggest the presence of weak identification.

The only formal test for weak identification that we are aware of in nonlinear GMM is that proposed by Wright (2001). In the conventional asymptotic theory of GMM, the identification condition requires that the gradient of φ_t(θ₀) has full column rank. Wright proposes a test of the hypothesis of a complete failure of this rank