Journal of Economic Literature 2019, 57(3), 535–574
https://doi.org/10.1257/jel.20181020

Text as Data†

Matthew Gentzkow, Bryan Kelly, and Matt Taddy*

* Gentzkow: Stanford University. Kelly: Yale University and AQR Capital Management. Taddy: University of Chicago Booth School of Business.
† Go to https://doi.org/10.1257/jel.20181020 to visit the article page and view author disclosure statement(s).

An ever-increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications. (JEL C38, C55, L82, Z13)

1. Introduction

New technologies have made available vast quantities of digital text, recording an ever-increasing share of human interaction, communication, and culture. For social scientists, the information encoded in text is a rich complement to the more structured kinds of data traditionally used in research, and recent years have seen an explosion of empirical economics research using text as data.

To take just a few examples: In finance, text from financial news, social media, and company filings is used to predict asset price movements and study the causal impact of new information. In macroeconomics, text is used to forecast variation in inflation and unemployment, and estimate the effects of policy uncertainty. In media economics, text from news and social media is used to study the drivers and effects of political slant. In industrial organization and marketing, text from advertisements and product reviews is used to study the drivers of consumer decision making. In political economy, text from politicians' speeches is used to study the dynamics of political agendas and debate.

The most important way that text differs from the kinds of data often used in economics is that text is inherently high dimensional. Suppose that we have a sample of documents, each of which is w words long, and suppose that each word is drawn from a vocabulary of p possible words. Then the unique representation of these documents has dimension p^w. A sample of thirty-word Twitter messages that use only the one thousand most common words in the English language, for example, has roughly as many dimensions as there are atoms in the universe.

A consequence is that the statistical methods used to analyze text are closely related to those used to analyze high-dimensional data in other domains, such as machine learning and computational biology. Some methods, such as lasso and other penalized regressions, are applied to text more or less exactly as they are in other settings. Other methods, such as topic models and multinomial inverse regression, are close cousins of more general
methods adapted to the specific structure of text data.

In all of the cases we consider, the analysis can be summarized in three steps:

1. Represent raw text as a numerical array C;
2. Map C to predicted values V̂ of unknown outcomes V; and
3. Use V̂ in subsequent descriptive or causal analysis.

In the first step, the researcher must impose some preliminary restrictions to reduce the dimensionality of the data to a manageable level. Even the most cutting-edge high-dimensional techniques can make nothing of 1,000^30-dimensional raw Twitter data. In almost all the cases we discuss, the elements of C are counts of tokens: words, phrases, or other predefined features of text. This step may involve filtering out very common or uncommon words; dropping numbers, punctuation, or proper names; and restricting attention to a set of features such as words or phrases that are likely to be especially diagnostic. The mapping from raw text to C leverages prior information about the structure of language to reduce the dimensionality of the data prior to any statistical analysis.

The second step is where high-dimensional statistical methods are applied. In a classic example, the data is the text of emails, and the unknown variable of interest V is an indicator for whether the email is spam. The prediction V̂ determines whether or not to send the email to a spam filter. Another classic task is sentiment prediction (e.g., Pang, Lee, and Vaithyanathan 2002), where the unknown variable V is the true sentiment of a message (say positive or negative), and the prediction V̂ might be used to identify positive reviews or comments about a product. A third task is predicting the incidence of local flu outbreaks from Google searches, where the outcome V is the true incidence of flu.

In these examples, and in the vast majority of settings where text analysis has been applied, the ultimate goal is prediction rather than causal inference. The interpretation of the mapping from C to V̂ is not usually an object of interest. Why certain words appear more often in spam, or why certain searches are correlated with flu, is not important so long as they generate highly accurate predictions. For example, Scott and Varian (2014, 2015) use data from Google searches to produce high-frequency estimates of macroeconomic variables such as unemployment claims, retail sales, and consumer sentiment that are otherwise available only at lower frequencies from survey data. Groseclose and Milyo (2005) compare the text of news outlets to speeches of congresspeople in order to estimate the outlets' political slant. A large literature in finance following Antweiler and Frank (2004) and Tetlock (2007) uses text from the internet or the news to predict stock prices.

In many social science studies, however, the goal is to go further and, in the third step, use text to infer causal relationships or the parameters of structural economic models. Stephens-Davidowitz (2014) uses Google search data to estimate local areas' racial animus, then studies the causal effect of racial animus on votes for Barack Obama in the 2008 election. Gentzkow and Shapiro (2010) use congressional and news text to estimate each news outlet's political slant, then study the supply and demand forces that determine slant in equilibrium. Engelberg and Parsons (2011) measure local news coverage of earnings announcements, then use the relationship between coverage and trading by local investors to separate the causal effect of news from other sources of correlation between news and stock prices.
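The three steps above can be made concrete in a few lines of code. The sketch below uses an invented toy spam corpus; the bag-of-words representation, the gradient-descent logistic fit, and the decision rule are all minimal stand-ins for the real choices discussed in the rest of this survey.

```python
import math
from collections import Counter

# Step 1: represent raw text as a numerical array C (token counts).
# Documents and labels are invented toy data.
docs = ["win cash now", "free cash prize",
        "meeting agenda today", "agenda for today meeting"]
labels = [1, 1, 0, 0]  # 1 = spam

vocab = sorted({w for d in docs for w in d.split()})
C = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Step 2: map C to predictions V-hat, here via a logistic regression
# fit by plain gradient ascent (no penalty, for brevity).
beta = [0.0] * len(vocab)
for _ in range(500):
    for ci, vi in zip(C, labels):
        eta = sum(b * x for b, x in zip(beta, ci))
        p = 1.0 / (1.0 + math.exp(-eta))
        beta = [b + 0.1 * (vi - p) * x for b, x in zip(beta, ci)]

def predict(text):
    counts = Counter(text.split())
    eta = sum(beta[j] * counts[w] for j, w in enumerate(vocab))
    return 1 if eta > 0 else 0

# Step 3: use V-hat downstream -- here, routing mail to a spam folder.
print(predict("free cash"))       # classified as spam
print(predict("agenda meeting"))  # classified as not spam
```

In a real application the middle step would be replaced by one of the penalized or generative methods of section 3, but the division of labor between the three steps is the same.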
In this paper, we provide an overview of methods for analyzing text and a survey of current applications in economics and related social sciences. The methods discussion is forward looking, providing an overview of methods that are currently applied in economics as well as those that we expect to have high value in the future. Our discussion of applications is selective and necessarily omits many worthy papers. We highlight examples that illustrate particular methods and use text data to make important substantive contributions even if they do not apply methods close to the frontier.

A number of other excellent surveys have been written in related areas. See Evans and Aceves (2016) and Grimmer and Stewart (2013) for related surveys focused on text analysis in sociology and political science, respectively. For methodological surveys, Bishop (2006), Hastie, Tibshirani, and Friedman (2009), and Murphy (2012) cover contemporary statistics and machine learning in general, while Jurafsky and Martin (2009) overview methods from computational linguistics and natural language processing. The Spring 2014 issue of the Journal of Economic Perspectives contains a symposium on "big data," which surveys broader applications of high-dimensional statistical methods to economics.

In section 2 we discuss representing text data as a manageable (though still high-dimensional) numerical array C; in section 3 we discuss methods from data mining and machine learning for predicting V from C. Section 4 then provides a selective survey of text analysis applications in social science, and section 5 concludes.

2. Representing Text as Data

When humans read text, they do not see a vector of dummy variables, nor a sequence of unrelated tokens. They interpret words in light of other words, and extract meaning from the text as a whole. It might seem obvious that any attempt to distill text into meaningful data must similarly take account of complex grammatical structures and rich interactions among words.

The field of computational linguistics has made tremendous progress in this kind of interpretation. Most of us have mobile phones that are capable of complex speech recognition. Algorithms exist to efficiently parse grammatical structure, disambiguate different senses of words, distinguish key points from secondary asides, and so on.

Yet virtually all analysis of text in the social sciences, like much of the text analysis in machine learning more generally, ignores the lion's share of this complexity. Raw text consists of an ordered sequence of language elements: words, punctuation, and white space. To reduce this to a simpler representation suitable for statistical analysis, we typically make three kinds of simplifications: dividing the text into individual documents i, reducing the number of language elements we consider, and limiting the extent to which we encode dependence among elements within documents. The result is a mapping from raw text to a numerical array C. A row c_i of C is a numerical vector with each element indicating the presence or count of a particular language token in document i.

2.1 What Is a Document?

The first step in constructing C is to divide raw text into individual documents {i}. In many applications, this is governed by the level at which the attributes of interest V are defined. For spam detection, the outcome of interest is defined at the level of individual emails, so we want to divide text that way too. If V is daily stock price movements that we wish to predict from the prior day's news text, it might make sense to divide the news text by day as well.

In other cases, the natural way to define a document is not so clear. If we wish to
predict legislators' partisanship from their floor speeches (Gentzkow, Shapiro, and Taddy 2016), we could aggregate speech so a document is a speaker–day, a speaker–year, or all speech by a given speaker during the time she is in Congress. When we use methods that treat documents as independent (which is true most of the time), finer partitions will typically ease computation at the cost of limiting the dependence we are able to capture. Theoretical guidance for the right level of aggregation is often limited, so this is an important dimension along which to check the sensitivity of results.

2.2 Feature Selection

To reduce the number of features to something manageable, a common first step is to strip out elements of the raw text other than words. This might include punctuation, numbers, HTML tags, proper names, and so on. It is also common to remove a subset of words that are either very common or very rare. Very common words, often called "stop words," include articles ("the," "a"), conjunctions ("and," "or"), forms of the verb "to be," and so on. These words are important to the grammatical structure of sentences, but they typically convey relatively little meaning on their own. The frequency of "the" is probably not very diagnostic of whether an email is spam, for example. Common practice is to exclude stop words based on a predefined list.¹ Very rare words do convey meaning, but their added computational cost in expanding the set of features that must be considered often exceeds their diagnostic value. A common approach is to exclude all words that occur fewer than k times for some arbitrary small integer k.

An approach that excludes both common and rare words and has proved very useful in practice is filtering by "term frequency–inverse document frequency" (tf–idf). For a word or other feature j in document i, term frequency (tf_ij) is the count c_ij of occurrences of j in i. Inverse document frequency (idf_j) is the log of one over the share of documents containing j: log(n/d_j), where d_j = ∑_i 1[c_ij > 0] and n is the total number of documents. The object of interest tf–idf is the product tf_ij × idf_j. Very rare words will have low tf–idf scores because tf_ij will be low. Very common words that appear in most or all documents will have low tf–idf scores because idf_j will be low. (Note that this improves on simply excluding words that occur frequently because it will keep words that occur frequently in some documents but do not appear in others; these often provide useful information.) A common practice is to keep only the words within each document i with tf–idf scores above some rank or cutoff.

A final step that is commonly used to reduce the feature space is stemming: replacing words with their root such that, e.g., "economic," "economics," "economically" are all replaced by the stem "economic." The Porter stemmer (Porter 1980) is a standard stemming tool for English language text.

All of these cleaning steps reduce the number of unique language elements we must consider and thus the dimensionality of the data. This can provide a massive computational benefit, and it is also often key to getting more interpretable model fits (e.g., in topic modeling). However, each of these steps requires careful decisions about the elements likely to carry meaning in a particular application.² One researcher's stop words are another's subject of interest.

¹ There is no single stop word list that has become a standard. How aggressive one wants to be in filtering stop words depends on the application. The web page http://www.ranks.nl/stopwords shows several common stop word lists, including the one built into the database software SQL and the list claimed to have been used in early versions of Google search. (Modern Google search does not appear to filter any stop words.)
² Denny and Spirling (2018) discuss the sensitivity of unsupervised text analysis methods such as topic modeling to preprocessing steps.
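The tf–idf weighting defined above is simple to compute directly from token counts. A minimal sketch (the three tokenized documents are invented; log is the natural log, matching the definition in the text):

```python
import math
from collections import Counter

# Toy corpus: three tokenized documents (invented for illustration).
docs = [["price", "rise", "price"],
        ["price", "fall"],
        ["tax", "rise"]]
n = len(docs)

counts = [Counter(d) for d in docs]          # tf_ij = c_ij
df = Counter(w for c in counts for w in c)   # d_j: number of docs containing j

def tf_idf(i, j):
    # tf-idf_ij = c_ij * log(n / d_j)
    return counts[i][j] * math.log(n / df[j])

# "price" appears in two of three documents: low idf despite high counts.
print(round(tf_idf(0, "price"), 3))  # → 0.811
# "tax" appears in only one document: higher idf.
print(round(tf_idf(2, "tax"), 3))    # → 1.099
```

Filtering would then keep, within each document, only the tokens whose tf–idf exceeds a chosen rank or cutoff.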
Dropping numerals from political text means missing references to "the first 100 days" or "September 11." In online communication, even punctuation can no longer be stripped without potentially significant information loss :-(.

2.3 n-grams

Producing a tractable representation also requires that we limit dependence among language elements. A fairly mild step in this direction, for example, might be to parse documents into distinct sentences and encode features of these sentences while ignoring the order in which they occur. The most common methodologies go much further.

The simplest and most common way to represent a document is called bag-of-words. The order of words is ignored altogether, and c_i is a vector whose length is equal to the number of words in the vocabulary and whose elements c_ij are the number of times word j occurs in document i. Suppose that the text of document i is

Good night, good night!
Parting is such sweet sorrow.

After stemming, removing stop words, and removing punctuation, we might be left with "good night good night part sweet sorrow." The bag-of-words representation would then have c_ij = 2 for j ∈ {good, night}, c_ij = 1 for j ∈ {part, sweet, sorrow}, and c_ij = 0 for all other words in the vocabulary.

This scheme can be extended to encode a limited amount of dependence by counting unique phrases rather than unique words. A phrase of length n is referred to as an n-gram. For example, in our snippet above, the count of 2-grams (or "bigrams") would have c_ij = 2 for j = good.night, c_ij = 1 for j including night.good, night.part, part.sweet, and sweet.sorrow, and c_ij = 0 for all other possible 2-grams. The bag-of-words representation then corresponds to counts of 1-grams.

Counting n-grams of order n > 1 yields data that describe a limited amount of the dependence between words. Specifically, the n-gram counts are sufficient for estimation of an n-order homogeneous Markov model across words (i.e., the model that arises if we assume that word choice is only dependent upon the previous n words). This can lead to richer modeling. In analysis of partisan speech, for example, single words are often insufficient to capture the patterns of interest: "death tax" and "tax break" are phrases with strong partisan overtones that are not evident if we look at the single words "death," "tax," and "break" (see, e.g., Gentzkow and Shapiro 2010).

Unfortunately, the dimension of c_i increases exponentially quickly with the order n of the phrases tracked. The majority of text analyses consider n-grams up to two or three at most, and the ubiquity of these simple representations (in both machine learning and social science) reflects a belief that the return to richer n-gram modeling is usually small relative to the cost. Best practice in many cases is to begin analysis by focusing on single words. Given the accuracy obtained with words alone, one can then evaluate if it is worth the extra time to move on to 2-grams or 3-grams.

2.4 Richer Representations

While rarely used in the social science literature to date, there is a vast array of methods from computational linguistics that capture richer features of text and may have high return in certain applications. One basic step beyond the simple n-gram counting above is to use sentence syntax to inform the text tokens used to summarize a document. For example, Goldberg and Orwant (2013) describe syntactic n-grams where words are grouped together whenever their meaning depends upon each other, according to a model of language syntax.
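The 1-gram and 2-gram counts in the snippet above are mechanical to reproduce; a minimal sketch:

```python
from collections import Counter

# The preprocessed snippet from the example above.
tokens = "good night good night part sweet sorrow".split()

# Bag-of-words: counts of 1-grams, ignoring order entirely.
unigrams = Counter(tokens)
print(unigrams["good"], unigrams["sorrow"])  # → 2 1

# 2-grams: counts of adjacent word pairs.
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams[("good", "night")])  # → 2
print(bigrams[("night", "good")])  # → 1
```

The same `zip` construction with longer offsets yields 3-grams and beyond, at the exponentially growing dimensionality cost the text describes.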
An alternative approach is to move beyond treating documents as counts of language tokens, and to instead consider the ordered sequence of transitions between words. In this case, one would typically break the document into sentences, and treat each as a separate unit for analysis. A single sentence of length s (i.e., containing s words) is then represented as a binary p × s matrix S, where the nonzero elements of S indicate occurrence of the row-word in the column-position within the sentence, and p is the length of the vocabulary. Such representations lead to a massive increase in the dimensions of the data to be modeled, and analysis of this data tends to proceed through word embedding: the mapping of words to a location in ℝ^K for some K ≪ p, such that the sentences are then sequences of points in this K-dimensional space. This is discussed in detail in section 3.3.

2.5 Other Practical Considerations

It is worth mentioning two details that can cause practical social science applications of these methods to diverge a bit from the ideal case considered in the statistics literature. First, researchers sometimes receive data in a pre-aggregated form. In the analysis of Google searches, for example, one might observe the number of searches containing each possible keyword on each day, but not the raw text of the individual searches. This means documents must be similarly aggregated (to days, rather than individual searches), and it also means that the natural representation where c_ij is the number of occurrences of word j on day i is not available. This is probably not a significant limitation, as the missing information (how many times per search a word occurs conditional on occurring at least once) is unlikely to be essential, but it is useful to note when mapping practice to theory.

A more serious issue is that researchers sometimes do not have direct access to the raw text and must access it through some interface such as a search engine. For example, Gentzkow and Shapiro (2010) count the number of newspaper articles containing partisan phrases by entering the phrases into a search interface (e.g., for the database ProQuest) and counting the number of matches they return. Baker, Bloom, and Davis (2016) perform similar searches to count the number of articles mentioning terms related to policy uncertainty. Saiz and Simonsohn (2013) count the number of web pages mentioning combinations of city names and terms related to corruption by entering queries in a search engine. Even if one can automate the searches in these cases, it is usually not feasible to produce counts for very large feature sets (e.g., every two-word phrase in the English language), and so the initial feature selection step must be relatively aggressive. Relatedly, interacting through a search interface means that there is no simple way to retrieve objects like the set of all words occurring at least twenty times in the corpus of documents, or the inputs to computing tf–idf.

3. Statistical Methods

This section considers methods for mapping the document-token matrix C to predictions V̂ of an attribute V. In some cases, the observed data is partitioned into submatrices C^train and C^test, where the matrix C^train collects rows for which we have observations V^train of V and the matrix C^test collects rows for which V is unobserved. The dimension of C^train is n^train × p, and the dimension of V^train is n^train × k, where k is the number of attributes we wish to predict.

Attributes in V can include observable quantities such as the frequency of flu cases, the positive or negative rating of movie reviews, or the unemployment rate, about
which the documents are informative. There can also be latent attributes of interest, such as the topics being discussed in a congressional debate or in news articles.

Methods to connect counts c_i to attributes v_i can be roughly divided into four categories. The first, which we will call dictionary-based methods, does not involve statistical inference at all: these methods simply specify v̂_i = f(c_i) for some known function f(⋅). This is by far the most common method in the social science literature using text to date. In some cases, researchers define f(⋅) based on a prespecified dictionary of terms capturing particular categories of text. In Tetlock (2007), for example, c_i is a bag-of-words representation and the outcome of interest v_i is the latent "sentiment" of Wall Street Journal columns, defined along a number of dimensions such as "positive," "optimistic," and so on. The author defines the function f(⋅) using a dictionary called the General Inquirer, which provides lists of words associated with each of these sentiment categories.³ The elements of f(c_i) are defined to be the sum of the counts of words in each category. (As we discuss below, the main analysis then focuses on the first principal component of the resulting counts.) In Baker, Bloom, and Davis (2016), c_i is the count of articles in a given newspaper–month containing a set of prespecified terms such as "policy," "uncertainty," and "Federal Reserve," and the outcome of interest v_i is the degree of "policy uncertainty" in the economy. The authors define f(⋅) to be the raw count of the prespecified terms divided by the total number of articles in the newspaper–month, averaged across newspapers. We do not provide additional discussion of dictionary-based methods in this section, but we return to them in section 3.5 and in our discussion of applications in section 4.

The second and third groups of methods are distinguished by whether they begin from a model of p(v_i | c_i) or a model of p(c_i | v_i). In the former case, which we will call text regression methods, we directly estimate the conditional outcome distribution, usually via the conditional expectation E[v_i | c_i] of attributes v_i. This is intuitive: if we want to predict v_i from c_i, we would naturally regress the observed values of the former (V^train) on the corresponding values of the latter (C^train). Any generic regression technique can be applied, depending upon the nature of v_i. However, the high dimensionality of c_i, where p is often as large as or larger than n^train, requires use of regression techniques appropriate for such a setting, such as penalized linear or logistic regression.

In the latter case, we begin from a generative model of p(c_i | v_i). To see why this is intuitive, note that in many cases the underlying causal relationship runs from outcomes to language rather than the other way around. For example, Google searches about the flu do not cause flu cases to occur; rather, people with the flu are more likely to produce such searches. Congresspeople's ideology is not determined by their use of partisan language; rather, people who are more conservative or liberal to begin with are more likely to use such language. From an economic point of view, the correct "structural" model of language in these cases maps from v_i to c_i, and as in other cases familiar to economists, modeling the underlying causal relationships can provide powerful guidance to inference and make the estimated model more interpretable.

Generative models can be further divided by whether the attributes are observed or latent. In the first case of unsupervised methods, we do not observe the true value of v_i for any documents. The function relating c_i and v_i is unknown, but we are willing to impose sufficient structure on it to allow us to infer v_i from c_i. This class includes methods such as topic modeling and its variants (e.g., latent Dirichlet allocation, or LDA).

³ http://www.wjh.harvard.edu/~inquirer/.
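Before turning to the remaining categories, note how little machinery the first, dictionary-based, category requires: f(⋅) is just a known function of counts. The sketch below computes category scores in that style; the word lists are invented stand-ins, not the actual General Inquirer categories.

```python
from collections import Counter

# Hypothetical dictionary: invented word lists, not a real lexicon.
categories = {"positive": {"gain", "improve", "strong"},
              "negative": {"loss", "weak", "decline"}}

def f(tokens):
    # Dictionary-based f(c_i): sum of counts of words in each category.
    c = Counter(tokens)
    return {cat: sum(c[w] for w in words)
            for cat, words in categories.items()}

doc = "strong gain offset a small loss".split()
print(f(doc))  # → {'positive': 2, 'negative': 1}
```

Normalizing such sums by document length, or averaging them across documents, yields indices of the kind used by Tetlock (2007) and Baker, Bloom, and Davis (2016).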
In the second case of supervised methods, we observe training data V^train and we can fit our model, say f_θ(c_i; v_i) for a vector of parameters θ, to this training set. The fitted model f_θ̂ can then be inverted to predict v_i for documents in the test set and can also be used to interpret the structural relationship between attributes and text. Finally, in some cases, v_i includes both observed and latent attributes for a semi-supervised analysis.

Lastly, we discuss word embeddings, which provide a richer representation of the underlying text than the token counts that underlie other methods. They have seen limited application in economics to date, but their dramatic successes in deep learning and other machine learning domains suggest they are likely to have high value in the future.

We close in section 3.5 with some broad recommendations for practitioners.

3.1 Text Regression

Predicting an attribute v_i from counts c_i is a regression problem like any other, except that the high dimensionality of c_i makes ordinary least squares (OLS) and other standard techniques infeasible. The methods in this section are mainly applications of standard high-dimensional regression methods to text.

3.1.1 Penalized Linear Models

The most popular strategy for very high-dimensional regression in contemporary statistics and machine learning is the estimation of penalized linear models, particularly with L1 penalization. We recommend this strategy for most text regression applications: linear models are intuitive and interpretable, and fast, high-quality software is available for big sparse input matrices like our C. For simple text-regression tasks with input dimension on the same order as the sample size, penalized linear models typically perform close to the frontier in terms of out-of-sample prediction.

Linear models in the sense we mean here are those in which v_i depends on c_i only through a linear index η_i = α + x_i′β, where x_i is a known transformation of c_i. In many cases, we simply have E[v_i | x_i] = η_i. It is also possible that E[v_i | x_i] = f(η_i) for some known link function f(⋅), as in the case of logistic regression.

Common transformations are the identity x_i = c_i, normalization by document length x_i = c_i/m_i with m_i = ∑_j c_ij, or the positive indicator x_ij = 1[c_ij > 0]. The best choice is application specific, and may be driven by interpretability: does one wish to interpret β_j as the added effect of an extra count for token j (if so, use x_ij = c_ij) or as the effect of the presence of token j (if so, use x_ij = 1[c_ij > 0])? The identity is a reasonable default in many settings.

Write l(α, β) for an unregularized objective proportional to the negative log likelihood, −log p(v_i | x_i). For example, in Gaussian (linear) regression, l(α, β) = ∑_i (v_i − η_i)², and in binomial (logistic) regression, l(α, β) = −∑_i [η_i v_i − log(1 + e^{η_i})] for v_i ∈ {0, 1}. A penalized estimator is then the solution to

(1)  min_{α,β}  l(α, β) + nλ ∑_{j=1}^p κ_j(|β_j|),

where λ > 0 controls overall penalty magnitude and κ_j(⋅) are increasing "cost" functions that penalize deviations of the β_j from zero.

A few common cost functions are shown in figure 1. Those that have a non-differentiable spike at zero (lasso, elastic net, and log) lead to sparse estimators, with some coefficients set to exactly zero. The curvature of the penalty away from zero dictates the weight of shrinkage imposed on the nonzero coefficients: L2 costs increase with coefficient size; lasso's L1 penalty has zero curvature and imposes constant shrinkage, and as curvature
goes toward −∞ one approaches the L0 penalty of subset selection. The lasso's L1 penalty (Tibshirani 1996) is extremely popular: it yields sparse solutions with a number of desirable properties (e.g., Bickel, Ritov, and Tsybakov 2009; Wainwright 2009; Belloni, Chernozhukov, and Hansen 2013; Bühlmann and van de Geer 2011), and the number of nonzero estimated coefficients is an unbiased estimator of the regression degrees of freedom (which is useful in model selection; see Zou, Hastie, and Tibshirani 2007).⁴

Figure 1. Penalty cost functions. From left to right: L2 costs (ridge, Hoerl and Kennard 1970), L1 (lasso, Tibshirani 1996), the "elastic net" mixture of L1 and L2 (Zou and Hastie 2005), and the log penalty (Candès, Wakin, and Boyd 2008); the panels plot β², |β|, |β| + 0.1 × β², and log(1 + |β|) against β.

Focusing on L1 regularization, rewrite the penalized linear model objective as

(2)  min_{α,β}  l(α, β) + nλ ∑_{j=1}^p ω_j |β_j|.

A common strategy sets ω_j so that the penalty cost for each coefficient is scaled by the sample standard deviation of that covariate. In text analysis, where each covariate corresponds to some transformation of a specific text token, this type of weighting is referred to as "rare feature up-weighting" (e.g., Manning, Raghavan, and Schütze 2008) and is generally thought of as good practice: rare words are often most useful in differentiating between documents.⁵

Large λ leads to simple model estimates in the sense that most coefficients will be set at or close to zero, while as λ → 0 we approach maximum likelihood estimation (MLE). Since there is no way to define an optimal λ a priori, standard practice is to compute estimates for a large set of possible λ and then use some criterion to select the one that yields the best fit.

Several criteria are available to choose an optimal λ. One common approach is to leave out part of the training sample in estimation and then choose the λ that yields the best out-of-sample fit according to some criterion such as mean squared error. Rather than work with a single leave-out sample, researchers most often use K-fold cross-validation (CV).

⁴ Penalties with a bias that diminishes with coefficient size, such as the log penalty in figure 1 (Candès, Wakin, and Boyd 2008), the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), or the adaptive lasso of Zou (2006), have been promoted in the statistics literature as improving upon the lasso by providing consistent variable selection and estimation in a wider range of settings. These diminishing-bias penalties lead to increased computation costs (due to a non-convex loss), but there exist efficient approximation algorithms (see, e.g., Fan, Xue, and Zou 2014; Taddy 2017b).
⁵ This is the same principle that motivates "inverse-document frequency" weighting schemes, such as tf–idf.
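The sparsity mechanics behind the L1 objective in equation (2) can be seen in a minimal coordinate-descent solver with soft-thresholding (intercept omitted and ω_j = 1 for brevity; the two-feature toy data is invented):

```python
def soft_threshold(z, g):
    # The lasso coordinate update shrinks toward zero and truncates at it.
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso(X, y, lam, sweeps=100):
    # Coordinate descent for: min_b 0.5*sum_i (y_i - x_i'b)^2 + n*lam*sum_j |b_j|
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Partial residuals excluding feature j.
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, n * lam) / z
    return beta

# Invented toy data: only the first feature predicts y.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]
print(lasso(X, y, lam=0.5))  # first coefficient shrunk, second exactly zero
```

In practice one computes this solution along a grid of λ values and selects among them by the cross-validation or information criteria described in the text.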
544 Journal of Economic Literature, Vol. LVII (September 2019)

This splits the sample into K disjoint subsets, and then fits the full regularization path K times excluding each subset in turn. This yields K realizations of the mean squared error or other out-of-sample fit measure for each value of λ. Common rules are to select the value of λ that minimizes the average error across these realizations, or (more conservatively) to choose the largest λ with mean error no more than one standard error away from the minimum.

Analytic alternatives to cross-validation are Akaike's information criterion (AIC; Akaike 1973) and the Bayesian information criterion (BIC) of Schwarz (1978). In particular, Flynn, Hurvich, and Simonoff (2013) describe a bias-corrected AIC objective for high-dimensional problems that they call AICc. It is motivated as an approximate likelihood maximization subject to a degrees of freedom (df_λ) adjustment: AICc(λ) = 2 l(α_λ, β_λ) + 2 df_λ n / (n − df_λ − 1). Similarly, the BIC objective is BIC(λ) = l(α_λ, β_λ) + df_λ log n, and is motivated as an approximation to the Bayesian posterior marginal likelihood in Kass and Wasserman (1995). AICc and BIC selection choose λ to minimize their respective objectives. The BIC tends to choose simpler models than cross-validation or AICc. Zou, Hastie, and Tibshirani (2007) recommend BIC for lasso penalty selection whenever variable selection, rather than predictive performance, is the primary goal.

3.1.2 Dimension Reduction

Another common solution for taming high-dimensional prediction problems is to form a small number of linear combinations of predictors and to use these derived indices as variables in an otherwise standard predictive regression. Two classic dimension reduction techniques are principal components regression (PCR) and partial least squares (PLS).

Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors to be close to (or, for lasso, exactly) zero. This can produce suboptimal forecasts when predictors are highly correlated. A transparent illustration of this problem would be a case in which all of the predictors are equal to the forecast target plus an i.i.d. noise term. In this situation, choosing a subset of predictors via lasso penalty is inferior to taking a simple average of the predictors and using this as the sole predictor in a univariate regression. This predictor averaging, as opposed to predictor selection, is the essence of dimension reduction.

PCR consists of a two-step procedure. In the first step, principal components analysis (PCA) combines regressors into a small set of K linear combinations that best preserve the covariance structure among the predictors. This amounts to solving the problem

(3)  min_{Γ,B} trace[(C − ΓB′)(C − ΓB′)′],

subject to

rank(Γ) = rank(B) = K.

The count matrix C consists of n rows (one for each document) and p columns (one for each term). PCA seeks a low-rank representation ΓB′ that best approximates the text data C. This formulation has the character of a factor model. The n × K matrix Γ captures the prevalence of K common components, or "factors," in each document. The p × K matrix B describes the strength of association between each word and the factors. As we will see, this reduced-rank decomposition bears a close resemblance to other text analytic methods such as topic modeling and word embeddings.

In the second step, the K components are used in standard predictive regression. As an
example, Foster, Liberman, and Stine (2013) use PCR to build a hedonic real estate pricing model that takes textual content of property listings as an input.6 With text data, where the number of features tends to vastly exceed the observation count, regularized versions of PCA such as predictor thresholding (e.g., Bai and Ng 2008) and sparse PCA (Zou, Hastie, and Tibshirani 2006) help exclude the least informative features to improve predictive content of the dimension-reduced text.

6 See Stock and Watson (2002a, b) for development of the PCR estimator and an application to macroeconomic forecasting with a large set of numerical predictors.

A drawback of PCR is that it fails to incorporate the ultimate statistical objective—forecasting a particular set of attributes—in the dimensionality reduction step. PCA condenses text data into indices based on the covariation among the predictors. This happens prior to the forecasting step and without consideration of how predictors associate with the forecast target.

In contrast, PLS performs dimension reduction by directly exploiting covariation of predictors with the forecast target.7 Suppose we are interested in forecasting a scalar attribute v_i. PLS regression proceeds as follows. For each element j of the feature vector c_i, estimate the univariate covariance between v_i and c_ij. This covariance, denoted φ_j, reflects the attribute's "partial" sensitivity to each feature j. Next, form a single predictor by averaging all attributes into a single aggregate predictor v̂_i = ∑_j φ_j c_ij / ∑_j φ_j. This forecast places the highest weight on the strongest univariate predictors, and the least weight on the weakest. In this way, PLS performs its dimension reduction with the ultimate forecasting objective in mind.

7 See Kelly and Pruitt (2013, 2015) for the asymptotic theory of PLS regression and its application to forecasting risk premia in financial markets.

The description of v̂_i reflects the K = 1 case, i.e., when text is condensed into a single predictive index. To use additional predictive indices, both v_i and c_ij are orthogonalized with respect to v̂_i, the above procedure is repeated on the orthogonalized data set, and the resulting forecast is added to the original v̂_i. This is iterated until the desired number of PLS components K is reached. Like PCR, PLS components describe the prevalence of K common factors in each document. And also like PCR, PLS can be implemented with a variety of regularization schemes to aid its performance in the ultra-high-dimensional world of text. Section 4 discusses applications using PLS in text regression.

PCR and PLS share a number of common properties. In both cases, K is a user-controlled parameter which, in many social science applications, is selected ex ante by the researcher. But, like any hyperparameter, K can be tuned via cross-validation. And neither method is scale invariant—the forecasting model is sensitive to the distribution of predictor variances. It is therefore common to variance-standardize features before applying PCR or PLS.

3.1.3 Nonlinear Text Regression

Penalized linear models are the most widely applied text regression tools due to their simplicity, and because they may be viewed as a first-order approximation to potentially nonlinear and complex data generating processes (DGPs). In cases where a linear specification is too restrictive, there are several other machine learning tools that are well suited to represent nonlinear associations between text c_i and outcome attributes v_i. Here we briefly describe four such nonlinear regression methods—generalized linear models, support vector machines, regression trees, and deep learning—and provide references for readers interested in thorough treatments of each.

GLMs and SVMs.—One way to capture nonlinear associations between c_i and v_i is
with a generalized linear model (GLM). These expand the linear model to include nonlinear functions of c_i such as polynomials or interactions, while otherwise treating the problem with the penalized linear regression methods discussed above.

A related method used in the social science literature is the support vector machine, or SVM (Vapnik 1995). This is used for text classification problems (when V is categorical), the prototypical example being email spam filtering. A detailed discussion of SVMs is beyond the scope of this review, but from a high level, the SVM finds hyperplanes in a basis expansion of C that partition the observations into sets with equal response (i.e., so that v_i are all equal in each region).8

8 Hastie, Tibshirani, and Friedman (2009, chapter 12) and Murphy (2012, chapter 14) provide detailed overviews of GLMs and SVMs. Joachims (1998) and Tong and Koller (2001) (among others) study text applications of SVMs.

GLMs and SVMs both face the limitation that, without a priori assumptions for which basis transformations and interactions to include, they may overfit and require extensive tuning (Hastie, Tibshirani, and Friedman 2009; Murphy 2012). For example, multi-way interactions increase the parameterization combinatorially and can quickly overwhelm the penalization routine, and their performance suffers in the presence of many spurious "noise" inputs (Hastie, Tibshirani, and Friedman 2009).9

9 Another drawback of SVMs is that they cannot be easily connected to the estimation of a probabilistic model and the resulting fitted model can sometimes be difficult to interpret. Polson and Scott (2011) provide a pseudo-likelihood interpretation for a variant of the SVM objective. Our own experience has led us to lean away from SVMs for text analysis in favor of more easily interpretable models. Murphy (2012, chapter 14.6) attributes the popularity of SVMs in some application areas to an ignorance of alternatives.

Regression Trees.—Regression trees have become a popular nonlinear approach for incorporating multi-way predictor interactions into regression and classification problems. The logic of trees differs markedly from traditional regressions. A tree "grows" by sequentially sorting data observations into bins based on values of the predictor variables. This partitions the data set into rectangular regions, and forms predictions as the average value of the outcome variable within each partition (Breiman et al. 1984). This structure is an effective way to accommodate rich interactions and nonlinear dependencies.

Two extensions of the simple regression tree have been highly successful thanks to clever regularization approaches that minimize the need for tuning and avoid overfitting. Random forests (Breiman 2001) average predictions from many trees that have been randomly perturbed in a bootstrap step. Boosted trees (e.g., Friedman 2002) recursively combine predictions from many oversimplified trees.10

10 Hastie, Tibshirani, and Friedman (2009) provide an overview of these methods. In addition, see Wager, Hastie, and Efron (2014) and Wager and Athey (2018) for results on confidence intervals for random forests, and see Taddy et al. (2015) and Taddy et al. (2016) for an interpretation of random forests as a Bayesian posterior over potentially optimal trees.

The benefits of regression trees—nonlinearity and high-order interactions—are sometimes lessened in the presence of high-dimensional inputs. While we would generally recommend tree models, and especially random forests, they are often not worth the effort for simple text regression. Oftentimes, a more beneficial use of trees is in a final prediction step after some dimension reduction derived from the generative models in section 3.2.

Deep Learning.—There is a host of other machine learning techniques that have been applied to text regression. The most common techniques not mentioned thus far are neural networks, which typically allow the inputs to act on the response through one
or more layers of interacting nonlinear basis functions (e.g., see Bishop 1995). A main attraction of neural networks is their status as universal approximators, a theoretical result describing their ability to mimic general, smooth nonlinear associations.

In high-dimensional and very noisy settings, such as in text analysis, classical neural nets tend to suffer from the same issues referenced above: they often overfit and are difficult to tune. However, the recently popular "deep" versions of neural networks (with many layers, and fewer nodes per layer) incorporate a number of innovations that allow them to work better, faster, and with little tuning, even in difficult text analysis problems. Such deep neural nets (DNNs) are now the state-of-the-art solution for many machine learning tasks (LeCun, Bengio, and Hinton 2015).11 DNNs are now employed in many complex natural language processing tasks, such as translation (Sutskever, Vinyals, and Le 2014; Wu et al. 2016) and syntactic parsing (Chen and Manning 2014), as well as in exercises of relevance to social scientists—for example, Iyyer et al. (2014) infer political ideology from text using a DNN. They are frequently used in conjunction with richer text representations such as word embeddings, described more below.

11 Goodfellow, Bengio, and Courville (2016) provide a thorough textbook overview of these "deep learning" technologies, while Goldberg (2016) is an excellent primer on their use in natural language processing.

3.1.4 Bayesian Regression Methods

The penalized methods above can all be interpreted as posterior maximization under some prior. For example, ridge regression maximizes the posterior under independent Gaussian priors on each coefficient, while Park and Casella (2008) and Hans (2009) give Bayesian interpretations to the lasso. See also the horseshoe of Carvalho, Polson, and Scott (2010) and the double Pareto of Armagan, Dunson, and Lee (2013) for Bayesian analogues of diminishing-bias penalties like the log penalty on the right of figure 1.

For those looking to do a full Bayesian analysis for high-dimensional (e.g., text) regression, an especially appealing model is the spike-and-slab introduced in George and McCulloch (1993). This models the distribution over regression coefficients as a mixture between two densities centered at zero—one with very small variance (the spike) and another with large variance (the slab). This model allows one to compute posterior variable inclusion probabilities as, for each coefficient, the posterior probability that it came from the slab and not the spike component. Due to a need to integrate over the posterior distribution, e.g., via Markov chain Monte Carlo (MCMC), inference for spike-and-slab models is much more computationally intensive than fitting the penalized regressions of section 3.1.1. However, Yang, Wainwright, and Jordan (2016) argue that spike-and-slab estimates based on short MCMC samples can be useful in application, while Scott and Varian (2014) have engineered efficient implementations of the spike-and-slab model for big data applications. These procedures give a full accounting of parameter uncertainty, which we miss in a quick penalized regression.

3.2 Generative Language Models

Text regression treats the token counts as generic high-dimensional input variables, without any attempt to model structure that is specific to language data. In many settings it is useful to instead propose a generative model for the text tokens to learn about how the attributes influence word choice and account for various dependencies among words and among attributes. In this approach, the words in a document are viewed as the realization of a generative process defined through a probability model for p(c_i | v_i).
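As a brief aside before turning to generative models, the component-membership calculation behind the spike-and-slab inclusion probability of section 3.1.4 can be written in a few lines. The sketch below is a hypothetical single-point calculation under an assumed 50/50 mixture of two zero-mean normals, not the full MCMC inference the text describes:

```python
import math

def normal_pdf(b, sd):
    # Density of a zero-mean normal with standard deviation sd, at b.
    return math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Mixture prior: a narrow "spike" and a wide "slab", both centered at zero.
# The standard deviations and prior weight here are illustrative choices.
SPIKE_SD, SLAB_SD, PRIOR_SLAB = 0.01, 2.0, 0.5

def slab_probability(beta):
    # Bayes rule over the two mixture components for a given coefficient:
    # the probability that beta was drawn from the slab, not the spike.
    slab = PRIOR_SLAB * normal_pdf(beta, SLAB_SD)
    spike = (1 - PRIOR_SLAB) * normal_pdf(beta, SPIKE_SD)
    return slab / (slab + spike)
```

A coefficient near zero is assigned almost entirely to the spike (effectively excluded), while a large one is assigned to the slab (included). In a full analysis, inclusion probabilities average such calculations over posterior draws of the coefficients rather than evaluating them at a single point.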
3.2.1 Unsupervised Generative Models

In the unsupervised setting, we have no direct observations of the true attributes v_i. Our inference about these attributes must therefore depend entirely on strong assumptions that we are willing to impose on the structure of the model p(c_i | v_i). Examples in the broader literature include cases where the v_i are latent factors, clusters, or categories. In text analysis, the leading application has been the case in which the v_i are topics.

A typical generative model implies that each observation c_i is a conditionally independent draw from the vocabulary of possible tokens according to some document-specific token probability vector, say q_i = [q_i1 ⋯ q_ip]′. Conditioning on document length, m_i = ∑_j c_ij, this implies a multinomial distribution for the counts

(4)  c_i ∼ MN(q_i, m_i).

This multinomial model underlies the vast majority of contemporary generative models for text.

Under the basic model in (4), the function q_i = q(v_i) links attributes to the distribution of text counts. A leading example of this link function is the topic model specification of Blei, Ng, and Jordan (2003),12 where

(5)  q_i = v_i1 θ_1 + v_i2 θ_2 + ⋯ + v_ik θ_k = Θ v_i.

12 Standard least-squares factor models have long been employed in "latent semantic analysis" (LSA; Deerwester et al. 1990), which applies PCA (i.e., singular value decompositions) to token count transformations such as x_i = c_i / m_i or x_ij = c_ij log(d_j), where d_j = ∑_i 1[c_ij > 0]. Topic modeling and its precursor, probabilistic LSA, are generally seen as improving on such approaches by replacing arbitrary transformations with a plausible generative model.

Many readers will recognize the model in (5) as a factor model for the vector of normalized counts for each token in document i, c_i / m_i. Indeed, a topic model is simply a factor model for multinomial data. Each topic is a probability vector over possible tokens, denoted θ_l, l = 1, …, k (where θ_lj ≥ 0 and ∑_{j=1}^p θ_lj = 1). A topic can be thought of as a cluster of tokens that tend to appear in documents. The latent attribute vector v_i is referred to as the set of topic weights (formally, a distribution over topics, v_il ≥ 0 and ∑_{l=1}^k v_il = 1). Note that v_il describes the proportion of language in document i devoted to the lth topic. We can allow each document to have a mix of topics, or we can require that one v_il = 1 while the rest are zero, so that each document has a single topic.13

13 Topic modeling is alternatively labeled as "latent Dirichlet allocation" (LDA), which refers to the Bayesian model in Blei, Ng, and Jordan (2003) that treats each v_i and θ_l as generated from a Dirichlet-distributed prior. Another specification that is popular in political science (e.g., Quinn et al. 2010) keeps θ_l as Dirichlet-distributed but requires each document to have a single topic. This may be most appropriate for short documents, such as press releases or single speeches.

Since its introduction into text analysis, topic modeling has become hugely popular.14 (See Blei 2012 for a high-level overview.) The model has been especially useful in political science (e.g., Grimmer 2010), where researchers have been successful in attaching political issues and beliefs to the estimated latent topics.

14 The same model was independently introduced in genetics by Pritchard, Stephens, and Donnelly (2000) for factorizing gene expression as a function of latent populations; it has been similarly successful in that field. Latent Dirichlet allocation is also an extension of a related mixture modeling approach in the latent semantic analysis of Hofmann (1999).

Since the v_i are of course latent, estimation for topic models tends to make use of some alternating inference for V | Θ and Θ | V. One possibility is to employ a version of the expectation-maximization (EM) algorithm to either maximize the likelihood implied by
(4) and (5) or, after incorporating the usual Dirichlet priors on v_i and θ_l, to maximize the posterior; this is the approach taken in Taddy (2012; see this paper also for a review of topic estimation techniques). Alternatively, one can target the full posterior distribution p(Θ, V | c_i). Estimation, say for Θ, then proceeds by maximization of the estimated marginal posterior, say p(Θ | c_i).

Due to the size of the data sets and dimension of the models, posterior approximation for topic models usually uses some form of variational inference (Wainwright and Jordan 2008) that fits a tractable parametric family to be as close as possible (e.g., in Kullback–Leibler divergence) to the true posterior. This variational approach was used in the original Blei, Ng, and Jordan (2003) paper and in many applications since. Hoffman et al. (2013) present a stochastic variational inference algorithm that takes advantage of techniques for optimization on massive data; this algorithm is used in many contemporary topic modeling applications. Another approach, which is more computationally intensive but can yield more accurate posterior approximations, is the MCMC algorithm of Griffiths and Steyvers (2004). Alternatively, for quick estimation without uncertainty quantification, the posterior maximization algorithm of Taddy (2012) is a good option.

The choice of k, the number of topics, is often fairly arbitrary. Data-driven choices do exist: Taddy (2012) describes a model selection process for k that is based upon Bayes factors, Airoldi et al. (2010) provide a cross-validation (CV) scheme, while Teh et al. (2006) use Bayesian nonparametric techniques that view k as an unknown model parameter. In practice, however, it is very common to simply start with a number of topics on the order of ten, and then adjust the number of topics in whatever direction seems to improve interpretability. Whether this ad hoc procedure is problematic depends on the application. As we discuss below, in many applications of topic models to date, the goal is to provide an intuitive description of text, rather than inference on some underlying "true" parameters; in these cases, the ad hoc selection of the number of topics may be reasonable.

The basic topic model has been generalized and extended in a variety of ways. A prominent example is the dynamic topic model of Blei and Lafferty (2006), which considers documents that are indexed by date (e.g., publication date for academic articles) and allows the topics, say Θ_t, to evolve smoothly in time. Another example is the supervised topic model of Blei and McAuliffe (2007), which combines the standard topic model with an extra equation relating the weights v_i to some additional attribute y_i in p(y_i | v_i). This pushes the latent topics to be relevant to y_i as well as the text c_i. In these and many other extensions, the modifications are designed to incorporate available document metadata (in these examples, time and y_i respectively).

3.2.2 Supervised Generative Models

In supervised models, the attributes v_i are observed in a training set and thus may be directly harnessed to inform the model of text generation. Perhaps the most common supervised generative model is the so-called naive Bayes classifier (e.g., Murphy 2012), which treats counts for each token as independent with class-dependent means. For example, the observed attribute might be author identity for each document in the corpus, with the model specifying different mean token counts for each author.

In naive Bayes, v_i is a univariate categorical variable and the token count distribution is factorized as p(c_i | v_i) = ∏_j p_j(c_ij | v_i), thus "naively" specifying conditional independence between tokens j. This rules out the possibility that by choosing to say one token (say, "hello") we reduce the probability
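The naive Bayes factorization can be sketched end-to-end in a few lines. The example below is hypothetical (toy counts, two "authors" as the observed attribute) and uses multinomial class-conditional probabilities with add-one smoothing, a common practical choice rather than a specification prescribed by the text:

```python
import math

# Hypothetical training set: documents as token-count vectors, each with
# an observed class label (author) -- the supervised attribute v_i.
docs = [
    ([3, 0, 1], "author_a"),
    ([2, 1, 0], "author_a"),
    ([0, 3, 2], "author_b"),
    ([1, 2, 3], "author_b"),
]
p = 3  # vocabulary size

def fit(docs):
    # Class-conditional token probabilities with add-one (Laplace)
    # smoothing, plus class priors proportional to document counts.
    model = {}
    for label in {l for _, l in docs}:
        rows = [c for c, l in docs if l == label]
        totals = [sum(r[j] for r in rows) for j in range(p)]
        denom = sum(totals) + p
        model[label] = {
            "prior": len(rows) / len(docs),
            "probs": [(t + 1) / denom for t in totals],
        }
    return model

def classify(model, counts):
    # Conditional independence across tokens: the log likelihood of a
    # count vector is a sum of per-token contributions.
    def score(label):
        m = model[label]
        return math.log(m["prior"]) + sum(
            c * math.log(q) for c, q in zip(counts, m["probs"]))
    return max(model, key=score)
```

Here classify(fit(docs), [2, 0, 1]) picks "author_a", since the first token dominates that author's training counts; the independence assumption is exactly what makes the score a simple sum over tokens.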