Making the Most of Statistical Analyses: Improving Interpretation and Presentation

Gary King, Harvard University
Michael Tomz, Harvard University
Jason Wittenberg, Harvard University

Abstract: Social scientists rarely take full advantage of the information available in their statistical results. As a consequence, they miss opportunities to present quantities that are of greatest substantive interest for their research and express the appropriate degree of certainty about these quantities. In this article, we offer an approach, built on the technique of statistical simulation, to extract the currently overlooked information from any statistical method and to interpret and present it in a reader-friendly manner. Using this technique requires some expertise, which we try to provide herein, but its application should make the results of quantitative articles more informative and transparent. To illustrate our recommendations, we replicate the results of several published works, showing in each case how the authors' own conclusions can be expressed more sharply and informatively, and, without changing any data or statistical assumptions, how our approach reveals important new information about the research questions at hand. We also offer very easy-to-use software that implements our suggestions.

We show that social scientists often do not take full advantage of the information available in their statistical results and thus miss opportunities to present quantities that could shed the greatest light on their research questions. In this article we suggest an approach, built on the technique of statistical simulation, to extract the currently overlooked information and present it in a reader-friendly manner. More specifically, we show how to convert the raw results of any statistical procedure into expressions that (1) convey numerically precise estimates of the quantities of greatest substantive interest, (2) include reasonable measures of uncertainty about those estimates, and (3) require little specialized knowledge to understand.

The following simple statement satisfies our criteria: "Other things being equal, an additional year of education would increase your annual income by $1,500 on average, plus or minus about $500." Any smart high school student would understand that sentence, no matter how sophisticated the statistical model and powerful the computers used to produce it. The sentence is substantively informative because it conveys a key quantity of interest in terms the reader wants to know. At the same time, the sentence indicates how uncertain the researcher is about the estimated quantity of interest. Inferences are never certain, so any honest presentation of statistical results must include some qualifier, such as "plus or minus $500" in the present example.

Gary King is Professor of Government, Harvard University, Littauer Center North Yard, Cambridge, MA 02138 (king@harvard.edu). Michael Tomz is a Ph.D. candidate, Department of Government, Harvard University, Littauer Center North Yard, Cambridge, MA 02138 (tomz@fas.harvard.edu). Jason Wittenberg is a fellow of the Weatherhead Center for International Affairs, Cambridge Street, Cambridge, MA 02138 (jwittenb@latte.harvard.edu).

Our computer program, "CLARIFY: Software for Interpreting and Presenting Statistical Results," designed to implement the methods described in this article, is available at http://GKing.Harvard.Edu, and won the Okidata Best Research Software Award for 1999.
We thank Bruce Bueno de Mesquita, Jorge Domínguez, Geoff Garrett, Jay McCann, and Randy Siverson for their data; Jim Alt, Alison Alter, Steve Ansolabehere, Marc Busch, Nick Cox, Jorge Domínguez, Jeff Gill, Michael Meffert, Jonathan Nagler, Ken Scheve, Brian Silver, Richard Tucker, and Steve Voss for their very helpful comments; and the U.S. National Science Foundation (SBR-9729884), the Centers for Disease Control and Prevention (Division of Diabetes Translation), the National Institute on Aging, the World Health Organization, and the Global Forum for Health Research for research support.

American Journal of Political Science, Vol. 44, No. 2, April 2000, Pp. 341–355
©2000 by the Midwest Political Science Association
In contrast, bad interpretations are substantively ambiguous and filled with methodological jargon: "the coefficient on education was statistically significant at the 0.05 level." Descriptions like this are very common in social science, but students, public officials, and scholars should not need to understand phrases like "coefficient," "statistically significant," and "the 0.05 level" to learn from the research. Moreover, even statistically savvy readers should complain that the sentence does not convey the key quantity of interest: how much higher the starting salary would be if the student attended college for an extra year.

Our suggested approach can help researchers do better in three ways. First, and most importantly, it can extract new quantities of interest from standard statistical models, thereby enriching the substance of social science research. Second, our approach allows scholars to assess the uncertainty surrounding any quantity of interest, so it should improve the candor and realism of statistical discourse about politics. Finally, our method can convert raw statistical results into results that everyone, regardless of statistical training, can comprehend. The examples in this article should make all three benefits apparent.

Most of this article describes our simulation-based approach to interpreting and presenting statistical results. In many situations, what we do via simulation can also be done by direct mathematical analysis or other computationally intensive techniques, and we discuss these approaches as well. To assist researchers in implementing our suggestions, we have developed an easy-to-use, public domain software package called CLARIFY, which we describe in the appendix.

The Problem of Statistical Interpretation

We aim to interpret the raw results from any member of a very general class of statistical models, which we summarize with two equations:

Yi ~ f(θi, α),   θi = g(Xi, β).   (1)

The first equation describes the stochastic component of the statistical model: the probability density (or mass) function that generates the dependent variable Yi (i = 1, . . . , n) as a random draw from the probability density f(θi, α). Some characteristics of this function vary from one observation to the next, while others remain constant across all the i's. We represent the varying characteristics with the parameter vector θi and relegate nonvarying features to the ancillary parameter matrix α. The second equation gives the systematic component of the model; it indicates how θi changes across observations, depending on values of explanatory variables (typically including a constant) in the 1 × k vector Xi and effect parameters in the k × 1 vector β. The functional form g(⋅,⋅), sometimes called the link function, specifies how the explanatory variables and effect parameters get translated into θi.

One member of this general class is a linear-normal regression model, otherwise known as least-squares regression. To see this, let f(⋅,⋅) be the normal distribution N(⋅,⋅); set the main parameter vector to the scalar mean θi = E(Yi) = µi; and assume that the ancillary parameter matrix is the scalar homoskedastic variance α = V(Yi) = σ². Finally, set the systematic component to the linear form g(Xi, β) = Xiβ = β0 + Xi1β1 + Xi2β2 + ⋅⋅⋅. The result is familiar:

Yi ~ N(µi, σ²),   µi = Xiβ   (2)

Similarly, one could write a logit model by expressing the stochastic component as a Bernoulli distribution with main parameter πi = Pr(Yi = 1)—no ancillary parameter is necessary—and setting the systematic component to the logistic form:

Yi ~ Bernoulli(πi),   πi = 1 / (1 + e^(−Xiβ))   (3)

Equation 1 also includes as special cases nearly every other statistical model in the social sciences, including multiple-equation models in which Yi is a vector, as well as specifications for which the probability distribution, functional form, or matrix of explanatory variables is estimated rather than assumed to be known.

Having estimated the statistical model, many researchers stop after a cursory look at the signs and "statistical significance" of the effect parameters. This approach obviously fails to meet our criteria for meaningful statistical communication since, for many nonlinear models, β̂ and α̂ are difficult to interpret and only indirectly related to the substantive issues that motivated the research (Cain and Watts 1970; Blalock 1967). Instead of publishing the effect coefficients and ancillary parameters, researchers should calculate and present quantities of direct substantive interest.

Some researchers go a step farther by computing derivatives, fitted values, and first differences (Long 1997; King 1989, subsection 5.2), which do convey numerically precise estimates of interesting quantities and require little specialized knowledge to understand. Even these approaches are inadequate, however, because they ignore two forms of uncertainty. Estimation uncertainty arises
∞ from not knowing β and α perfectly, an unavoidable ∫−∞ yP (y )dy , which is not always the most pleasant of ex- consequence of having fewer than an infinite number of periences! Alternatively, we could approximate the mean observations. Researchers often acknowledge this un- through simulation by drawing many random numbers certainty by reporting standard errors or t-statistics, from P(y) and computing their average. If we were inter- but they overlook it when computing quantities of inter- ested in the theoretical variance of Y, we could calculate est. Since β̂ and α̂ are uncertain, any calculations— the sample variance of a large number of random draws, including derivatives, fitted values, and first differences— and if we wanted the probability that Y > 0.8, we could based on those parameter estimates must also be count the fraction of draws that exceeded 0.8. Likewise, uncertain, a fact that almost no scholars take into ac- we could find a 95-percent confidence interval for a func- count. A second form of variability, the fundamental un- tion of Y by drawing 1000 values of Y, computing the certainty represented by the stochastic component (the function for each draw, sorting the transformed draws distribution f ) in Equation 1, results from innumerable from lowest to highest and taking the 25th and 976th val- chance events such as weather or illness that may influ- ues. We could even approximate the entire distribution ence Y but are not included in X. Even if we knew the ex- of, say, Y , by plotting a histogram of the square roots of act values of the parameters (thereby eliminating esti- a large number of simulations of Y. mation uncertainty), fundamental uncertainty would Approximations can be computed to any desired de- prevent us from predicting Y without error. Our methods gree of precision by increasing the number of simula- for computing quantities of interest must account for tions (M), which is analagous to boosting the number of both types of uncertainty. observations in survey sampling. Assessing the precision of the approximation is simple: run the same procedure, with the same number of simulations, repeatedly. If the answer remains the same to within four decimal points Simulation-Based Approaches across the repetitions, that is how accurate the approxi- to Interpretation mation is. If more accuracy is needed, raise the number of simulations and try again. Nothing is lost by simu- We recommend statistical simulation as an easy method lation—except a bit of computer time—and much is of computing quantities of interest and their uncertain- gained in ease of use. ties. Simulation can also help researchers understand the entire statistical model, take full advantage of the param- eter estimates, and convey findings in a reader-friendly Simulating the Parameters manner (see Fair 1980; Tanner 1996; Stern 1997). We now explain how researchers can use simulation to compute quantities of interest and account for uncer- What Is Statistical Simulation? tainty. The first step involves simulating the main and ancillary parameters. Recall that the parameter estimates Statistical simulation uses the logic of survey sampling to β̂ and α̂ are never certain because our samples are finite. approximate complicated mathematical calculations. In To capture this estimation uncertainty, we draw many survey research, we learn about a population by taking a plausible sets of parameters from their posterior or sam- random sample from it. We use the sample to estimate a pling distribution. 
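To make the simulation logic described above concrete, here is a minimal illustration in Python with NumPy. It is only a sketch of the general idea, not the article's own software (CLARIFY, a Stata program), and the Beta(5, 2) distribution is an arbitrary stand-in for a distribution P(y) of interest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw M = 1000 values from an arbitrary distribution P(y);
# a Beta(5, 2) is used here purely for illustration.
M = 1000
y = rng.beta(5, 2, size=M)

# Approximate features of the distribution from the draws alone.
mean_approx = y.mean()          # approximates E(Y)
var_approx = y.var(ddof=1)      # approximates V(Y)
p_above = (y > 0.8).mean()      # approximates Pr(Y > 0.8)

# 95-percent interval: sort the draws and take the 25th and 976th values.
y_sorted = np.sort(y)
lower, upper = y_sorted[24], y_sorted[975]   # 0-based indexing

print(mean_approx, var_approx, p_above, (lower, upper))
```

Raising M makes each of these approximations more precise, exactly as described above.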
Some draws will be smaller or larger feature of the population, such as its mean or its vari- than β̂ and α̂ , reflecting our uncertainty about the exact ance, and our estimates become more precise as we in- value of the parameters, but all will be consistent with crease the sample size, n. Simulation follows a similar the data and statistical model. logic but teaches us about probability distributions, To simulate the parameters, we need the point esti- rather than populations. We learn about a distribution by mates and the variance-covariance matrix of the esti- simulating (drawing random numbers) from it and us- mates, which most statistical packages will report on re- ing the draws to approximate some feature of the distri- quest. We denote γ̂ as the vector produced by stacking β̂ bution. The approximation becomes more accurate as we on top of α̂ . More formally, γˆ = vec (βˆ , αˆ ) , where “vec” increase the number of draws, M. Thus, simulation en- stacks the unique elements of β̂ and α̂ in a column vec- ables us to approximate any feature of a probability dis- tor. Let Vˆ (γˆ ) designate the variance matrix associated tribution without resorting to advanced mathematics. with these estimates. The central limit theorem tells us For instance, we could compute the mean of a prob- that with a large enough sample and bounded variance, ability distribution P(y) by taking the integral E(Y) = we can randomly draw (simulate) the parameters from a
, , multivariate normal distribution with mean equal to γ̂ these is equivalent to the expected value ( Yˆ ) in a linear ˆ (γˆ ) .1 Using our notation, and variance equal to V regression, which we discuss in the following subsection. To simulate one predicted value, follow these steps: ( γ˜ ~ N γˆ , V ) ˆ (γˆ ) . (4) 1. Using the algorithm in the previous subsection, draw one value of the vector γ˜ = vec(β˜ , α ˜ ). Thus, we can obtain one simulation of γ by following 2. Decide which kind of predicted value you wish to these steps: compute, and on that basis choose one value for each 1. Estimate the model by running the usual software explanatory variable. Denote the vector of such val- program (which usually maximizes a likelihood func- ues Xc . tion), and record the point estimates γ̂ and variance 3. Taking the simulated effect coefficients from the top matrix Vˆ (γˆ ) . portion of γ̃ , compute θ˜ c = g (X c , β˜ ) , where g(⋅,⋅) is 2. Draw one value of the vector γ from the multivariate the systematic component of the statistical model. normal distribution in Equation 4. Denote the γ̃ = 4. Simulate the outcome variable Y˜c by taking a random vec(β˜ , α˜ ). draw from f (θ˜ c , α˜ ), the stochastic component of the statistical model. Repeat the second step, say, M = 1000 times to obtain 1000 draws of the main and ancillary parameters. Repeat this algorithm, say, M = 1000 times, to produce If we knew the elements of γ perfectly, the sets of 1000 predicted values, thereby approximating the entire draws would all be identical; the less information we have probability distribution of Yc . From these simulations about γ (due to larger elements in the variance matrix), the researcher can compute not only the average pre- the more the draws will differ from each other. The spe- dicted value but also measures of uncertainty around the cific pattern of variation summarizes all knowledge average. The predicted value will be expressed in the about the parameters that we can obtain from the statis- same metric as the dependent variable, so it should re- tical procedure. We still need to translate γ into substan- quire little specialized knowledge to understand. tively interesting quantities, but now that we have sum- marized all knowledge about γ we are well positioned to make the translation. In the next three subsections, we Expected Values describe algorithms for converting the simulated param- Depending on the issue being studied, the expected or eters into predicted values, expected values, and first mean value of the dependent variable may be more inter- differences. esting than a predicted value. The difference is subtle but important. A predicted value contains both fundamental Predicted Values and estimation uncertainty, whereas an expected value averages over the fundamental variability arising from Our task is to draw one value of Y conditional on one sheer randomness in the world, leaving only the estima- chosen value of each explanatory variable, which we tion uncertainty caused by not having an infinite num- represent with the vector Xc. Denote the simulated θ as ber of observations. Thus, predicted values have a larger θ̃c and the corresponding Y as Y˜c , a simulated predicted variance than expected values, even though the average value. Predicted values come in many varieties, depend- should be nearly the same in both cases.2 ing on the kind of X-values used. 
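The parameter-simulation step and one predicted value can be expressed in a few lines of code. The sketch below, in Python with NumPy, is illustrative only: the point estimates and variance matrix are invented placeholders for whatever an estimation routine would report, and the model is a simple logit so that the systematic and stochastic components are easy to see.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical output from an estimation routine (logit with a
# constant and one covariate): point estimates and variance matrix.
beta_hat = np.array([-1.0, 0.5])
V_hat = np.array([[0.04, -0.01],
                  [-0.01, 0.02]])

# Step 1: draw one plausible parameter vector from N(gamma_hat, V_hat).
beta_tilde = rng.multivariate_normal(beta_hat, V_hat)

# Step 2: choose values of the explanatory variables, X_c.
X_c = np.array([1.0, 3.0])      # constant = 1, covariate set to 3

# Step 3: systematic component, here the logistic form.
pi_tilde = 1.0 / (1.0 + np.exp(-X_c @ beta_tilde))

# Step 4: stochastic component -- one draw of the outcome itself.
y_tilde = rng.binomial(1, pi_tilde)   # one simulated predicted value

print(beta_tilde, pi_tilde, y_tilde)
```

Repeating the block inside a loop M = 1000 times would approximate the entire distribution of the predicted value, as described above.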
For instance, Xc may When choosing between these two quantities of in- correspond to the future (in which case Y˜c is a simulated terest, researchers should reflect on the importance of forecast), a real situation described by observed data fundamental uncertainty for the conclusions they are (such that Y˜c is a simulated predicted value), or a hypo- drawing. In certain applications, such as forecasting the thetical situation not necessarily in the future (making actual result of an election or predicting next month’s Y˜c a simulated counterfactual predicted value). None of foreign exchange rate, scholars and politicians—as well as investors—want to know not only the expected out- come, but also how far the outcome could deviate from 1This distributional statement is a shorthand summary of the expectation due to unmodeled random factors. Here, a Bayesian, likelihood, and Neyman-Pearson theories of statistical inference. The interpretive differences among these theories (such as whether θ or θ̂ is the random variable) are important but need not concern us here, as our approach can usually be employed 2In linear models, the average predicted value is identical to the ex- with any of these and most other theories of inference (see Barnett pected value. For nonlinear cases, the two can differ but are often 1982). close if the nonlinearity is not severe.
predicted value seems most appropriate. For other appli- For instance, to simulate a first difference for the first cations, the researcher may want to highlight the average explanatory variable, set the values for all explanatory effect of a particular explanatory variable, so an expected variables except the first at their means and fix the first value would be the best choice. one at its starting point. Denote this vector of starting We now offer an algorithm for creating one simula- values for the explanatory variables as Xs and run the ex- tion of an expected value: pected value algorithm once to generate E˜(Y s ) , the aver- 1. Following the procedure for simulating the param- age value of Y conditional on Xs. Next change the value of eters, draw one value of the vector γ˜ = vec (β˜ , α˜ ). the first explanatory variable to its ending point, leaving 2. Choose one value for each explanatory variable and the others at their means as before. Denote the new vec- denote the vector of values as Xc . tor as Xe and rerun the algorithm to get E˜(Ye ) , the mean 3. Taking the simulated effect coefficients from the top of Y conditional on Xe. The first difference is simply portion of γ̃ , compute θ˜ c = g (X c , β˜ ) , where g(⋅,⋅) is E˜(Ye ) − E˜(Y s ) . Repeat the first difference algorithm, say, the systematic component of the statistical model. M = 1000 times to approximate the distribution of first 4. Draw m values of the outcome variable Y˜c(k ) (k = differences. Average the simulated values to obtain a 1, . . . , m) from the stochastic component f (θ˜ c , α˜ ). point estimate, compute the standard deviation to obtain This step simulates fundamental uncertainty. a standard error, or sort the values to approximate a con- 5. Average over the fundamental uncertainty by calcu- fidence interval. lating the the mean of the m simulations to yield one We previously discussed expected values of Y, and until now this section has considered first differences simulated expected value E˜(Yc ) = ∑k =1Y˜c(k ) m . m based on only this type of expected value. Different ex- When m = 1, this algorithm reduces to the one for pre- pectations, such as Pr(Y = 3) in an ordered-probit model, dicted values. If m is a larger number, Step 4 accurately may also be of interest. For these cases, the expected portrays the fundamental variability, which Step 5 aver- value algorithm would need to be modified slightly. We ages away to produce an expected value. The larger the have made the necessary modifications in CLARIFY, the value of m, the more successful the algorithm will be in software package described in the appendix, which al- purging E˜(Yc ) of any fundamental uncertainty. lows researchers to calculate a wide variety of expected To generate 1000 simulations of the expected value, values and first differences, as well as predicted values repeat the entire algorithm M = 1000 times for some and other quantities of interest. fixed value of m. The resulting expected values will differ The algorithms in this article do not require new as- from each other due to estimation uncertainty, since each sumptions; rather, they rest on foundations that have be- expected value will correspond to a different γ̃ . These M come standard in the social sciences. 
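A compact sketch of the expected-value and first-difference algorithms, again for a hypothetical logit with invented estimates, is given below. Note that the same simulated parameter vector is used for both settings of the explanatory variables (step 1 is run once, steps 2–5 twice), and that for a logit the m inner draws could be skipped because E(Y) = π exactly; they are kept here only to mirror the general algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical logit estimates (constant + one covariate).
beta_hat = np.array([-1.0, 0.5])
V_hat = np.array([[0.04, -0.01],
                  [-0.01, 0.02]])

def ev(beta, X, m=100):
    """Steps 2-5: expected value of Y given X for one simulated beta,
    averaging over m draws of the fundamental (stochastic) variability."""
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    return rng.binomial(1, pi, size=m).mean()

X_start = np.array([1.0, 2.0])   # explanatory variables at their starting values
X_end = np.array([1.0, 3.0])     # same, with the first variable moved to its end point

M = 1000
fd = np.empty(M)
for j in range(M):
    beta_tilde = rng.multivariate_normal(beta_hat, V_hat)    # step 1, run once
    fd[j] = ev(beta_tilde, X_end) - ev(beta_tilde, X_start)  # steps 2-5, run twice

point = fd.mean()                       # point estimate of the first difference
se = fd.std(ddof=1)                     # standard error
ci = np.percentile(fd, [2.5, 97.5])     # 95-percent confidence interval
print(point, se, ci)
```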
In particular, we as- simulations will approximate the full probability distri- sume that the statistical model is identified and correctly bution of E(Yc), enabling the researcher to compute aver- specified (with the appropriate explanatory variables and ages, standard errors, confidence intervals, and almost functional form), which allows us to focus on interpret- anything else desired. ing and presenting the final results. We also assume that The algorithm works in all cases but involves some the central limit theorem holds sufficiently for the avail- approximation error, which we can reduce by setting able sample size, such that the sampling distribution of both m and M sufficiently high. For some statistical parameters (not the stochastic component) can be de- models, there is a shortcut that curtails both computa- scribed by a normal distribution.3 Although we focus on tion time and approximation error. Whenever E(Yc) = θc , asymptotic results, as do the vast majority of the applied the researcher can skip steps 4–5 of the expected value al- researchers using nonlinear models, simulation works gorithm, since steps 1–3 suffice to simulate one expected with finite sample distributions, which are preferable value. This shortcut is appropriate for the linear-normal when feasible. In short, our algorithms work whenever and logit models in Equations 2 and 3. the usual assumptions work. First Differences Alternative Approaches A first difference is the difference between two expected, In this section, we discuss several other techniques for rather than predicted, values. To simulate a first differ- generating quantities of interest and measuring the un- ence, researchers need only run steps 2–5 of the expected certainty around them. These approaches can be valuable value algorithm twice, using different settings for the ex- 3From a Bayesian perspective, we exclude unusual cases where a planatory variables. flat prior generates an improper posterior.
, , complements to simulation, because they provide impor- rameters are not of intrinsic interest, researchers must tant mathematical intuition or, in some cases, enable fi- convert them into quantities such as predicted values, ex- nite sample approximations. We briefly summarize some pected values, and first differences. The algorithms above of the leading computer-intensive and analytical alterna- show how to make the conversion and are therefore es- tives to simulation. sential supplements. Indeed, our software, CLARIFY, could easily be modified to interpret the parameters gen- Computer-intensive alternatives. Our version of simula- erated by these alternative approaches. tion is not the only computer-intensive technique for ob- taining quantities of interest and measures of uncer- Analytical approaches. The main analytical (mathemati- tainty. Fully Bayesian methods, using Markov-Chain cal) alternative to simulation is the delta method, which Monte Carlo techniques, are more powerful than our al- uses the tools of calculus to approximate nonlinear func- gorithms because they allow researchers to draw from tions of random variables (van der Vaart 1998). Suppose the exact finite-sample distribution, instead of relying on that we are interested in the mean and variance of θ = the central limit theorem to justify an asymptotic normal g(Xc , β), where g is a nonlinear function. Assuming that g approximation (Carlin and Louis 1996). Unfortunately is approximately linear in a range where β has high prob- these methods remain difficult to use, particularly since ability, then a Taylor-series expansion of g about β̂ is of- statisticians still disagree about appropriate criteria for ten reasonable. To the first order, θ ≈ g (βˆ ) + g ′(βˆ )(β − βˆ ) , determining when a Markov chain has converged in dis- where g ′(a ) = ∂g (a )/ ∂a . As a result, the maximum- tribution to the true posterior (Cowles and Carlin 1996; likelihood estimate of θ is approximately g (βˆ ) , and its Kass et al. 1998). Nonetheless, this field has shown re- variance is approximately g ′(βˆ )V (βˆ )g ′(βˆ )′. For example, markable progress over the last decade and is well worth in the exponential Poisson regression model (King 1989, monitoring by political scientists. Chapter 5), where Y is Poisson with mean λ = E (Y | X ) Another useful alternative is bootstrapping, a non- = e X iβ, suppose we wish to compute the expected num- parametric approach that relies on the logic of re- ber of events given X = X0. In this case, the maximum sampling to approximate the distribution of parameters likelihood estimate of the expected number of events is ˆ (Mooney and Duval 1993; Mooney 1996). In theory, the g (βˆ ) = e X 0β and its variance is (X 0e X 0β )Vˆ (βˆ )(X 0e X 0β )′ . sampling distribution of γ̂ can be viewed as a histogram Note that this maximum-likelihood estimate still does of an infinite number of γ̂’s, each estimated from a dif- not reflect the uncertainty in β̂ , as done automatically by ferent sample of size n from the same population. Boot- simulation and the other computationally intensive strapping mimicks this process by drawing many sub- methods. To incorporate this additional uncertainty re- samples (with replacement) from the original sample, quires another level of mathematical complexity in that ˆ estimating γ̂ for each subsample, and then constructing we must now approximate the integral ∫ e X 0βP (βˆ )dβˆ and a histogram of the various γ̂ ’s. Bootstrapping has many its variance. 
A detailed example is given by King and advantages. It does not require strong distributional as- Zeng (1999) in the case of logistic regression. sumptions, and Monte Carlo studies have demonstrated Despite its utility for increasing computing speed that it has superior small sample properties for some and revealing statistical intuition through mathematical problems. It also has the advantage of not requiring analysis, the delta method suffers from two shortcomings strong parametric distributional assumptions. Program- that simulation can help overcome. First, the method is ming a bootstrapped estimator is not difficult, although technically demanding, since it requires researchers to commercial software packages have not been quick to calculate derivatives and compute the moments of lin- adopt this method. The weakness of bootstrapping is earized functions.4 Thus, it is not surprising that most that it gives biased estimates for certain quantities of in- scholars do not use the delta method, even when they ap- terest, such as max(Y ). preciate the importance of reporting uncertainty. Sec- For both Bayesian methods and bootstrapping, all of ond, the Taylor series used in the delta method only ap- the methods of interpretation we discuss in this article proximates a nonlinear form. Although researchers can can be used directly. The only change is that instead of sometimes improve the approximation with additional drawing the parameters from the multivariate normal in terms in the Taylor series, this can be difficult, and find- Equation 4, we would use MCMC-based simulation or 4 When g is linear there is obviously no need for a linearizing ap- bootstrapping. Even our software, CLARIFY, could be used without additional programming. proximation; an exact analytical solution exists for the mean and variance of many quantities of interest that we described earlier. Like our method, MCMC and bootstrapping gener- Simulation produces the same answer, however, and requires less ate simulations of the parameters. In cases where the pa- mathematical proficiency.
ing estimates of the additional terms is often impossible. tion. Parameters have different logical statuses, such as In practice most researchers stop after expanding the se- the effect parameters β versus the ancillary parameters α, ries to the first or second order, which can compromise but our algorithms do not need to distinguish between the accuracy of the approximation. With simulation one the two: both are uncertain and should be simulated, can achieve an arbitrarily high degree of precision simply even if only one proves useful in later calculations. It may by increasing M and letting the computer run longer. be possible to accelerate our algorithms by excluding cer- Several general arguments weigh in favor of simula- tain parameters from the simulation stage, but for the tion. First, there is a simulation-based alternative to vast majority of applications these tricks are unnecessary nearly every analytical method of computing quantities and could lead to errors. Researchers usually will risk of interest and conducting statistical tests, but the reverse fewer mistakes by following, without deviation, our algo- is not true (Noreen 1989). Thus, simulation can provide rithm for simulating the parameters. accurate answers even when no analytical solutions exist. In some statistical models, the elements of γ are or- Second, simulation enjoys an important pedagogical ad- thogonal, so software packages provide separate variance vantage. Studies have shown that, no matter how well matrices for each set. When implementing the algorithm analytical methods are taught, students get the right an- for simulating the parameters, researchers may want to swer far more often via simulation (Simon, Atkinson, create a bloc diagonal matrix by placing the separately es- and Shevokas 1976). One scholar has even offered a timated variance matrices on the diagonal and inserting $5,000 reward for anyone who can demonstrate the su- zeros everywhere else. Obviously, if the subsets of γ truly periority of teaching analytical methods, but so far no are orthogonal, equivalent draws from the two sets can one has managed to earn the prize (Simon 1992). Of be made from independent multivariate normal distri- course, there are advantages to the insight the mathemat- butions, but it may be easier to work with a single sam- ics underlying the delta method can reveal and so, when pling distribution. feasible, we encourage researchers to learn both simula- Researchers should reparameterize the elements of γ tion and analytical methods. to increase the likelihood that the asymptotic multivari- ate normal approximation will hold in finite samples. In general, all parameters should be reparameterized unless they are already unbounded and logically symmetric, as a Tricks of the Trade Normal must be. For instance, a variance parameter like σ2 must be greater than zero, so it will pay to reparam- The algorithms in the previous section apply to all statis- eterize by using an expression like σ2 = eη. This allows re- tical models, but they can be made to work better by fol- searchers to estimate η, which is on the scale from –∞ to lowing a few tricks and avoiding some common mis- ∞, as one element of γ , which is assumed to be multi- understandings. variate normal. 
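As an illustration of the reparameterization trick, the following sketch (with invented estimates) treats η = ln σ² as one element of γ, draws the parameters jointly from the multivariate normal, and then transforms back to σ² = e^η, anticipating the extra step described next.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical estimates from a linear-normal model: two effect
# coefficients plus eta = ln(sigma^2), estimated on the unbounded scale.
gamma_hat = np.array([2.0, 0.8, -0.3])      # [beta0, beta1, eta]
V_hat = np.diag([0.25, 0.01, 0.02])         # illustrative variance matrix

# Draw the parameters jointly, then transform eta back to sigma^2.
gamma_tilde = rng.multivariate_normal(gamma_hat, V_hat, size=1000)
beta_tilde = gamma_tilde[:, :2]
sigma2_tilde = np.exp(gamma_tilde[:, 2])    # sigma^2 = e^eta > 0 by construction

print(sigma2_tilde.min() > 0)               # every simulated variance is positive
```

The same pattern applies to the Fisher's Z and logistic reparameterizations mentioned below: estimate on the unbounded scale, simulate, then map each draw back to the bounded scale.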
When making reparameterizations, of course, we add an extra step to the algorithm for simulat- Tricks for Simulating Parameters ing the parameters: after drawing γ from the multivariate normal, we reparameterize back to the original scale by Statistical programs usually report standard errors for computing σ˜ 2 = e η˜ .5 parameter estimates, but accurate simulation requires Several other reparameterizations may come in the full variance matrix Vˆ (γˆ ) . The diagonal elements of handy. A correlation parameter ρ, ranging from –1 to 1, Vˆ (γˆ ) contain the squared standard errors, while the off- can be reparameterized to η (reusing the same symbol) diagonal elements express the covariances between one on an unbounded scale with the inverse of Fisher’s Z parameter estimate and another in repeated draws from transformation: ρ = (e2η – 1)/(e2η + 1). Likewise, a the same probability distribution. Simulating each pa- parameter representing a probability π can be made rameter independently would be incorrect, because this procedure would miss the covariances among param- 5Reparameterization also makes likelihood-maximization algo- eters. Nearly all good statistical packages can report the rithms easier to use by avoiding problems caused by the optimiza- tion procedure choosing inadmissable parameter values (which of- full variance matrix, but most require researchers to re- ten result in the program terminating abnormally because of quest it explicitly by setting an option or a global. The attempts to divide by zero or logging negative numbers). Since software described in the appendix fetches the variance maximum likelihood estimates are invariant to reparameteri- matrix automatically. zation, the reparameterization has no effect except on the finite sample distribution around the point estimate. For example, esti- One common mistake is to exclude some parameters mating σ̂ 2 directly gives the same estimate as estimating η̂ and when drawing from the multivariate normal distribu- transforming to σ̂ 2 by using σˆ 2 = e ηˆ .
, , unbounded using the logistic transformation, π = speed the approximation or make it more accurate for a [1 + e–η]–1. These and other tricks should enhance the ef- fixed number of simulations, they should be incorpo- fectiveness of simulating the parameters. rated into computer programs for general use. In some cases, analytical computations are also possible and can Tricks for Simulating Quantities of Interest get speedier results. But our algorithms provide social scientists with all they need to understand the funda- When converting the simulated parameters into quanti- mental concepts: which quantities are being computed ties of interest, it is safest to simulate Y and use this as a and how the computation is, or could be, done. More- basis for obtaining other quantities. This rule is equiva- over, as long as M and m are large enough, these and all lent to incorporating all simulated parameters—and thus other correct algorithms will give identical answers. all information from the statistical model—into the cal- culations. Of course, some shortcuts do exist. We have al- ready mentioned that, in a logit model, one can obtain E˜(Y ) by stopping with π̃ , since drawing dichotomous Empirical Examples Y ’s and averaging would yield exactly π̃ . If one is not sure, though, it is helpful to continue until one has simu- To illustrate how our algorithms work in practice, we in- lations of the outcome variable. clude replications or extensions of five empirical works. If some function of Y, such as ln(Y), is used as the de- Instead of choosing the most egregious, we choose a pendent variable during the estimation stage, the re- large number of the best works, from our most presti- searcher can simulate ln(Y) and then apply the inverse gious journals and presses, written by some of our most function exp(ln(Y)) to reveal Y. We adopt this procedure distinguished authors. Within this group, we eliminated in the first example, below, where we estimate a log-log the many publications we were unable to replicate and regression model. This sequence of simulation and trans- then picked five to illustrate a diverse range of models formation is crucial, since the usual procedure of calcu- and interpretative issues. The procedures for model in- lating E(ln(Y)) = µ̂ without simulation and then expo- terpretation in all five were exemplary. If we all followed nentiating gives the wrong answer: exp(µˆ ) ≠ Yˆ . With their examples, reporting practices in the discipline simulation, both Y and E(Y) can be computed easily, re- would be greatly improved. For each article, we describe gardless of the scale that the researcher used during the the substantive problem posed and statistical model cho- estimation stage. sen; we also accept rather than evaluate their statistical Researchers should assess the precision of any simu- procedures, even though in some cases the methods lated quantity by repeating the entire algorithm and see- could be improved. We then detail how the authors inter- ing if anything of substantive importance changes. If preted their results and demonstrate how our procedures something does change, increase the number of simula- advance this state of the art. tions (M and, in the case of expected values, m) and try again. 
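The warning about logged dependent variables can be seen directly in a short simulation. With invented values for µ̂ = E(ln Y | Xc), its standard error, and σ, exponentiating each simulated ln Y and then averaging gives a different (and correct) answer than simply exponentiating the point estimate µ̂.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical log-scale estimates: mu_hat = E(ln Y | X_c), its
# standard error, and the fundamental standard deviation sigma.
mu_hat, se_mu, sigma = 4.0, 0.10, 0.50

M = 10000
mu_tilde = rng.normal(mu_hat, se_mu, size=M)     # estimation uncertainty
lnY = rng.normal(mu_tilde, sigma)                # fundamental uncertainty
Y = np.exp(lnY)                                  # back-transform each draw

print(np.exp(mu_hat))   # exponentiating the point estimate of E(ln Y): too small
print(Y.mean())         # averaging the back-transformed simulations: correct
# The two differ because exp(E(ln Y)) is not equal to E(exp(ln Y)) = E(Y).
```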
In certain instances—particularly when the re- Linear Regression searcher has misspecified a nonlinear statistical model— the number of simulations required to approximate an Following Tufte (1974), we estimated a log-log regression expected value accurately may be larger than normal. model of the size of government in the U.S. states. Our Numerical estimates should be reported to the correct dependent variable, Yi , was the natural log of the number level of precision, so for instance if repeated runs with of people (measured in 1000s) that the state government the same number of simulations produce an estimate employed on a full-time basis in 1990. Tufte was inter- that changes only in the fourth decimal point, then— ested (for a pedagogical example) in whether Yi would assuming this is sufficient for substantive purposes—the increase with state population; but consider another hy- number reported should be rounded to two or three pothesis that may be of more interest to political scien- decimal points. tists: the number of employees might depend on the pro- The simulation procedures given in this article can portion of Democrats in the state legislature, since be used to compute virtually all quantities that might be Democrats are reputed to favor bigger government than of interest for nearly all parametric statistical models that Republicans, even after adjusting for state population. scholars might wish to interpret. As such, they can be Thus, our two main explanatory variables were the log of considered canonical methods of simulation. Numerous state population Pi in 1000s and the logged proportion of other simulation algorithms are available, however, in the lower-house legislators who identified themselves as context of specific models. When these alternatives could Democrats Di.
We applied the predicted value algorithm to predict FIGURE 1 Probability of Voting by Age the number of government employees in a state with six million people and an 80 percent Democratic house. 1 ree First, we used the statistical software described in the ap- e deg pendix to estimate the log-linear model and simulate one l leg Probability of Voting .8 co set of values for the effect coefficients (β˜ ) and the ancil- eg r ee d lary parameter (σ˜ ) . Next, we set the main explanatory ol ho variables at Pc = ln(6000) and Dc = ln(0.8), so we could .6 sc gh construct Xc and compute θ˜ c = X c β˜ . We then drew one hi value of Y˜c from the normal distribution N (θ˜ c , σ 2 ) . Fi- .4 nally, we calculated exp(Y˜c ) to transform our simulated value into the actual number of government employees, .2 a quantity that seemed more understandable than its natural logarithm. By repeating this process M = 1000 18 24 30 36 42 48 54 60 66 72 78 84 90 95 times, we generated 1000 predicted values, which we Age of Respondent sorted from lowest to highest. The numbers in the 25th Vertical bars indicate 99-percent confidence intervals and the 976th positions represented the upper and lower bounds of a 95-percent confidence interval. Thus, we predicted with 95-percent confidence that the state gov- ernment would employ between 73,000 and 149,000 = {1, Ai , Ai2 , Ei , Ii , Ri}, where 1 is a constant and Ai2 is people. Our best guess was 106,000 full-time employees, the quadratic term. the average of the predicted values. In our logit model, the probability of voting in a We also calculated some expected values and first presidential election is E(Yi ) = πi , an intuitive quantity of differences and found that increasing Democratic con- interest. We estimated this probability, and the uncer- trol from half to two-thirds of the lower house tended to tainty surrounding it, for two different levels of educa- raise state government employment by 7,000 people on tion and across the entire range of age, while holding average. The 95-percent confidence interval around this other variables at their means. In each case, we repeated first difference ranged from 3,000 to 12,000 full-time em- the expected value algorithm M = 1000 times to approxi- ployees. Our result may be worth following up, since, to mate a 99-percent confidence interval around the prob- the best of our knowledge, researchers have not ad- ability of voting. The results appear in Figure 1, which il- dressed this relationship in the state-politics literature. lustrates the conclusions of Rosenstone and Hansen quite sharply: the probability of voting rises steadily to a Logit Models plateau between the ages of 45 and 65, and then tapers downward through the retirement years. The figure also The algorithms in the third section can also help re- reveals that uncertainty associated with the expected searchers interpret the results of a logit model. Our ex- value is greatest at the two extremes of age: the vertical ample draws on the work of Rosenstone and Hansen bars, which represent 99-percent confidence intervals, (1993), who sought to explain why some individuals are are longest when the respondent is very young or old.6 more likely than others to vote in U.S. presidential elec- tions. Following Rosenstone and Hanson, we pooled data from every National Election Study that was conducted A Time-Series Cross-Sectional Model during a presidential election year. 
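Before turning to the logit example, the predicted-value calculation just described for the log-log model can be sketched as follows. The coefficient and variance values are invented placeholders rather than the estimates used above, so the sketch illustrates the procedure, not the reported 73,000–149,000 interval.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented stand-ins for the log-log model's estimates: intercept,
# log population, log proportion Democratic, and eta = ln(sigma^2).
gamma_hat = np.array([0.5, 0.55, 0.4, -3.0])
V_hat = np.diag([0.09, 0.001, 0.04, 0.01])

X_c = np.array([1.0, np.log(6000), np.log(0.8)])   # 6 million people, 80% Democratic

M = 1000
pred = np.empty(M)
for j in range(M):
    g = rng.multivariate_normal(gamma_hat, V_hat)
    beta, sigma = g[:3], np.sqrt(np.exp(g[3]))
    theta = X_c @ beta                    # systematic component (log scale)
    lnY = rng.normal(theta, sigma)        # stochastic component
    pred[j] = np.exp(lnY)                 # back to thousands of employees

pred.sort()
print(pred.mean(), pred[24], pred[975])   # point prediction and 95-percent bounds
```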
Our dependent vari- We also used our algorithms to interpret the results of a able, Yi , was coded 1 if the respondent reported voting in time-series cross-sectional model. Conventional wisdom the presidential election and 0 otherwise. holds that the globalization of markets has compelled For expository purposes we focus on a few demo- governments to slash public spending, but a new book graphic variables that Rosenstone and Hanson empha- by Garrett (1998) offers evidence to the contrary. Where sized: Age (Ai) and Education (Ei) in years, Income (Ii) strong leftist parties and encompassing trade unions in 10,000s of dollars, and Race (coded Ri = 1 for whites coincide, Garrett argues, globalization leads to greater and 0 otherwise). We also include a quadratic term to test the hypothesis that turnout rises with age until the 6 Theconfidence intervals are quite narrow, because the large respondent nears retirement, when the tendency re- number of observations (N = 15,837) eliminated most of the esti- verses itself. Thus, our set of explanatory variables is Xi mation uncertainty.
, , government spending as a percentage of GDP, whereas TABLE 1 Garrett’s Counterfactual Effects on the opposite occurs in countries where the left and labor Government Spending (% of GDP) are weak. To support his argument, Garrett constructed a Trade Capital Mobility panel of economic and political variables, measured an- Low High Low High nually, for fourteen industrial democracies during the period 1966–1990. He then estimated a linear-normal re- Left-labor Low 43.1 41.9 42.8 42.3 gression model where the dependent variable, Yi , is gov- power High 43.5 44.2 43.1 44.5 ernment spending as a percentage of GDP for each coun- try-year in the data set. The three key explanatory variables were Capital mobility, Ci (higher values indicate Each entry is the expected level of government spending for given con- figurations left-labor power and trade or capital mobility, holding all other fewer government restrictions on cross-border financial variables constant at their means. flows), Trade, Ti (larger values mean more foreign trade as a percentage of GDP), and Left-labor power, Li (higher scores denote a stronger combination of leftist parties and labor unions).7 ernment spending where countries were highly inte- To interpret his results, Garrett computed grated into the international economy than in more closed contexts.” Finally, “where left-labor power was low, a series of counterfactual estimates of government government spending decreased if one moved from low spending under different constellations of domestic to high levels of market integration, but the converse was political conditions and integration into global mar- true at high levels of left-labor power” (1998, 83). kets. This was done by set ting all the other variables Garrett’s counterfactuals go far beyond the custom- in the regression equations equal to their mean lev- ary list of coefficients and t-tests, but our tools can help els and multiplying these means by their corre- us extract even more information from his model and sponding coefficients, and then by examining the data. For instance, simulation can reveal whether the dif- counterfactual impact of various combinations of ferences in values across the cells might have arisen by left-labor power and globalization. . . . (1998, 82) chance alone. To make this assessment, we reestimated the parameters in Garrett’s regression equation9 and In particular, Garrett distinguished between low and high drew 1000 sets of simulated coefficients from their poste- levels of Li , Ti , and Ci. For these variables, the 14th per- rior distribution, using the algorithm for simulating the centile in the dataset represented a low value, whereas the parameters. Then we fixed Lc and Tc at their 14th percen- 86th percentile represented a high one.8 tiles, held other variables at their means, and calculated The counterfactual estimates appear in Table 1, 1000 (counterfactual) expected values, one for each set of which Garrett used to draw three conclusions. First, simulated coefficients. Following the same procedure, we “government spending was always greater when left- produced counterfactuals for the other combinations of labor power was high than when it was low, irrespective Lc, Tc, and Cc represented by the cells of Table 1. Finally, of the level of market integration” (entries in the second we plotted “density estimates” (which are smooth ver- row in each table exceeded values in the first row). 
Sec- sions of histograms) of the counterfactuals; these appear ond, “the gap between the low and high left-labor power in Figure 2. One can think of each density estimate as a cells was larger in the high trade and capital mobility pile of simulations distributed over the values govern- cases than in the cells with low market integration,” im- ment spending. The taller the pile at any given level of plying that “partisan politics had more impact on gov- government spending, the more simulations took place near that point. Figure 2 shows that when globalization of trade or 7Garrett also focused on two interactions among the variables, C L i i capital mobility is low, leftist governments spend only and TiLi , and he included a battery of business cycle and demo- slightly more than rightist ones. More importantly, the graphic controls, as well as the lagged level of government spend- ing and dummy variables for countries and time. 8“So as not to exaggerate the substantive effects” of the relation- 9Our coefficients differed from those in Garrett (1998, 80–81) by ships he was studying, Garrett “relied on combinations of the 20th only 0.3 percent, on average. Standard errors diverged by 6.8 per- and 80th percentile scores” (1998, 82). Unfortunately, due to a mi- cent, on average, apparently due to discrepancies in the method of nor arithmetic error, the values he reports (1998, 84) correspond calculating panel-corrected standard errors (Franzese 1996). only to the 14th and 86th percentiles. To facilitate comparison with None of the differences made any substantive difference in the Garrett, we use the 14th and 86th percentiles in our simulations. conclusions.
FIGURE 2 Simulated Levels of Government Spending Low Exposure to Trade High Exposure to Trade 2.0 2.0 Density Estimate Density Estimate 1.5 1.5 1.0 1.0 0.5 0.5 0.0 0.0 40 42 44 46 40 42 44 46 Government Spending (% of GDP) Government Spending (% of GDP) Low Capital Mobility High Capital Mobility 2.0 2.0 Density Estimate Density Estimate 1.5 1.5 1.0 1.0 0.5 0.5 0.0 0.0 40 42 44 46 40 42 44 46 Government Spending (% of GDP) Government Spending (% of GDP) These panels contain density estimates (smooth versions of histograms) of expected govern- ment spending for countries where left-labor power is high (the solid curve) and low (the dotted curve). The panels, which add uncertainty estimates to the concepts in Table 1, demonstrate that left-labor power has a distinguishable effect only when exposure to trade or capital mobil- ity is high. density estimates overlap so thoroughly that it is difficult presidential candidates: Carlos Salinas (from the ruling to distinguish the two spending patterns with much con- PRI), Manuel Clouthier (representing the PAN, a right- fidence. (Another way to express one aspect of this point wing party), and Cuauhtémoc Cárdenas (head of a leftist is that the means of the two distributions are not statisti- coalition). The election was historically significant, be- cally distinguishable at conventional levels of signifi- cause for the first time all three presidential candidates cance.) In the era of globalization, by contrast, domestic appeared to be highly competitive. Domínguez and politics exerts a powerful effect on fiscal policy: leftist McCann used a multinomial logit model to explain why governments outspend rightist ones by more than two some voters favored one candidate over the others. The percent of GDP on average, a difference we can affirm following equations summarize the model, in which Yi with great certainty, since the density estimates for the and πi are 3 × 1 vectors: two regime-types are far apart. In summary, our simula- tions cause us to question Garrett’s claim that left-labor Yi ~ Multinomial(πi) governments always outspend the right, regardless of the e X iβ j level of market integration: although the tendency may πi = where j = 1, 2, 3 candidates. (5) ∑k =1 e X β 3 be correct, the results could have arisen from chance i k alone. The simulations do support Garrett’s claim that globalization has intensified the relationship between The effect parameters can vary across the candidates, so partisan politics and government spending. β1, β2 , and β3 are distinct vectors, each with k × 1 elements.10 Multinomial Logit Models The book focuses on individual voting behavior, as is traditional in survey research, but we used simulation How do citizens in a traditional one-party state vote when to examine the quantity of interest that motivated they get an opportunity to remove that party from office? Domínguez and McCann (1996) addressed this question 10Domínguez and McCann included thirty-one explanatory vari- by analyzing survey data from the 1988 Mexican presi- ables in their model. For a complete listing of the variables and dential election. In that election, voters chose among three question wording, see Domínguez and McCann (1996, 213–216).
, , Domínguez and McCann in the first place: the electoral FIGURE 3 Simulated Electoral Outcomes outcome itself. In particular, if every voter thought the Cardenas PRI was weakening, which candidate would have won the presidency? To answer this question, we coded each voter as thinking that the PRI was weakening and let other characteristics of the voter take on their true values. Then we used the predicted value algorithm to simulate the vote for each person in the sample and used . ................... the votes to run a mock election. We repeated this exer- . .......... ..... cise 100 times to generate 100 simulated election out- comes. For comparison, we also coded each voter as thinking the PRI was strengthening and simulated 100 Salinas Clouthier election outcomes conditional on those beliefs. Figure 3 displays our results. The figure is called a Coordinates in this ternary diagram are predicted fractions of the vote “ternary plot” (see Miller 1977; Katz and King 1999), and received by each of the three candidates. Each point is an election out- coordinates in the figure represent predicted fractions of come drawn randomly from a world in which all voters believe Salinas’ PRI party is strengthing (for the “o”’s in the bottom left) or weakening (for the vote received by each candidate under a different the “·”’s in the middle), with other variables held constant at their means. simulated election outcome. Roughly speaking, the closer a point appears to one of the vertices, the larger the frac- tion of the vote going to the candidate whose name ap- pears on the vertex. A point near the middle indicates that question by estimating a censored Weibull regression (a the simulated election was a dead heat. We also added form of duration model) on a dataset in which the de- “win lines” to the figure that divide the ternary diagram pendent variable, Yi , measures the number of years that into areas that indicate which candidate receives a plural- leader i remains in office following the onset of war. For ity and thus wins the simulated election (e.g., points that fully observed cases (the leader had left office at the time appear in the top third of the triangle are simulated elec- of the study), the model is tion outcomes where Cárdenas receives a plurality). In this figure, the o’s (all near the bottom left) are Yi ~ Weibull(µi ,σ) simulated outcomes in which everyone thought the PRI was strengthening, while the dots (all near the center) µ i ≡ E (Yi X i ) = (e X iβ )− σ Γ (1 + σ) (6) correspond to beliefs that the PRI was weakening. The figure shows that when the country believes the PRI is where σ is an ancilliary shape parameter and Γ is the strengthening, Salinas wins hands down; in fact, he wins gamma function, an interpolated factorial that works for every one of the simulated elections. If voters believe the continuous values of its argument. The model includes PRI is weakening, however, the 1988 election is a toss-up, four explanatory variables: the leader’s pre-war tenure in with each candidate having an equal chance of victory. years, an interaction between pre-war tenure and democ- This must be a sobering thought for those seeking to racy, the number of battle deaths per 10,000 inhabitants, end PRI dominance in Mexico. 
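The mock-election exercise can be sketched as follows. The data, coefficients, and variance matrix are all invented placeholders (and far smaller than the actual survey), so the code illustrates only the mechanics of simulating one ballot per respondent and tallying vote shares across repeated simulated elections.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented stand-ins: n voters, k covariates, and multinomial logit
# coefficients for three candidates (one candidate normalized to zero).
n, k = 500, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_hat = np.zeros((3, k))
beta_hat[1] = [0.2, 0.5, -0.3, 0.1]      # second candidate, relative to the first
beta_hat[2] = [-0.1, 0.3, 0.4, -0.2]     # third candidate, relative to the first
V_hat = np.eye(2 * k) * 0.01             # variance of the 2k free coefficients

M = 100
shares = np.empty((M, 3))
for j in range(M):
    b_free = rng.multivariate_normal(beta_hat[1:].ravel(), V_hat).reshape(2, k)
    b = np.vstack([np.zeros(k), b_free])
    eta = X @ b.T                                          # n x 3 linear predictors
    p = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
    votes = np.array([rng.choice(3, p=pi) for pi in p])    # one ballot per voter
    shares[j] = np.bincount(votes, minlength=3) / n        # one mock election

print(shares.mean(axis=0))   # average simulated vote shares across 100 elections
```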
Hope of defeating the and a dummy variable indicating whether the leader won PRI, even under these optimistic conditions, probably re- the war.12 The authors find that leaders who waged for- quires some kind of compromise between the two oppo- eign wars tended to lose their grip on power at home, but sition parties. The figure also supports the argument authoritarian leaders with a long pre-war tenure were that, despite much voter fraud, Salinas probably did win able to remain in office longer than others. the presidency in 1988. He may have won by a lesser mar- Bueno de Mesquita and Siverson discuss the mar- gin than reported, but the figure is strong evidence that ginal impact of their explanatory variables by computing he did indeed defeat a divided opposition.11 the “hazard rate” associated with each variable. Hazard rates are the traditional method of interpretation in the Censored Weibull Regression Models literature, but understanding them requires considerable statistical knowledge. Simulation can help us calculate How do wars affect the survival of political leaders? more intuitive quantities, such as the number of months Bueno de Mesquita and Siverson (1995) examine this that a leader could expect to remain in office following 11See Scheve and Tomz (1999) on simulation of counter factual predictions and Sterman and Wittenberg (1999) on simulation of predicted values, both in the context of binary logit models. 12 The first three variables are expressed in logs.
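A sketch of the expected-duration calculation for the censored Weibull model in Equation 6 appears below; the coefficients, variance matrix, and covariate values are invented placeholders, not Bueno de Mesquita and Siverson's estimates.

```python
import numpy as np
from math import gamma, log

rng = np.random.default_rng(8)

# Invented stand-ins for the Weibull estimates (constant, logged pre-war
# tenure, tenure-democracy interaction, logged battle deaths, victory
# dummy) plus eta = ln(sigma) on the unbounded scale.
gamma_hat = np.array([-1.5, 0.4, -0.2, -0.1, 0.6, -0.3])
V_hat = np.diag([0.09, 0.01, 0.01, 0.004, 0.04, 0.01])

# Invented covariate profile for a hypothetical leader.
X_c = np.array([1.0, log(8), 0.0, log(5), 0.0])

M = 1000
ev_years = np.empty(M)
for j in range(M):
    g = rng.multivariate_normal(gamma_hat, V_hat)
    beta, sigma = g[:5], np.exp(g[5])
    # Equation 6: E(Y | X) = (exp(X beta))^(-sigma) * Gamma(1 + sigma)
    ev_years[j] = np.exp(X_c @ beta) ** (-sigma) * gamma(1.0 + sigma)

print(ev_years.mean(), np.percentile(ev_years, [2.5, 97.5]))
```

Averaging the M simulations gives a point estimate of the expected time in office, and sorting them gives a confidence interval, a quantity far easier to communicate than a hazard rate.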