Journal of Economic Literature 2019, 57(3), 535–574
https://doi.org/10.1257/jel.20181020

Text as Data†

Matthew Gentzkow, Bryan Kelly, and Matt Taddy*

* Gentzkow: Stanford University. Kelly: Yale University and AQR Capital Management. Taddy: University of Chicago Booth School of Business.
† Go to https://doi.org/10.1257/jel.20181020 to visit the article page and view author disclosure statement(s).

An ever-increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications. (JEL C38, C55, L82, Z13)

1. Introduction

New technologies have made available vast quantities of digital text, recording an ever-increasing share of human interaction, communication, and culture. For social scientists, the information encoded in text is a rich complement to the more structured kinds of data traditionally used in research, and recent years have seen an explosion of empirical economics research using text as data.

To take just a few examples: In finance, text from financial news, social media, and company filings is used to predict asset price movements and study the causal impact of new information. In macroeconomics, text is used to forecast variation in inflation and unemployment, and estimate the effects of policy uncertainty. In media economics, text from news and social media is used to study the drivers and effects of political slant. In industrial organization and marketing, text from advertisements and product reviews is used to study the drivers of consumer decision making. In political economy, text from politicians' speeches is used to study the dynamics of political agendas and debate.

The most important way that text differs from the kinds of data often used in economics is that text is inherently high dimensional. Suppose that we have a sample of documents, each of which is w words long, and suppose that each word is drawn from a vocabulary of p possible words. Then the unique representation of these documents has dimension p^w. A sample of thirty-word Twitter messages that use only the one thousand most common words in the English language, for example, has roughly as many dimensions as there are atoms in the universe.

A consequence is that the statistical methods used to analyze text are closely related to those used to analyze high-dimensional data in other domains, such as machine learning and computational biology. Some methods, such as lasso and other penalized regressions, are applied to text more or less exactly as they are in other settings. Other methods, such as topic models and multinomial inverse regression, are close cousins of more general
methods adapted to the specific structure of text data.

In all of the cases we consider, the analysis can be summarized in three steps:

1. Represent raw text as a numerical array C;
2. Map C to predicted values V̂ of unknown outcomes V; and
3. Use V̂ in subsequent descriptive or causal analysis.

In the first step, the researcher must impose some preliminary restrictions to reduce the dimensionality of the data to a manageable level. Even the most cutting-edge high-dimensional techniques can make nothing of 1,000^30-dimensional raw Twitter data. In almost all the cases we discuss, the elements of C are counts of tokens: words, phrases, or other predefined features of text. This step may involve filtering out very common or uncommon words; dropping numbers, punctuation, or proper names; and restricting attention to a set of features such as words or phrases that are likely to be especially diagnostic. The mapping from raw text to C leverages prior information about the structure of language to reduce the dimensionality of the data prior to any statistical analysis.

The second step is where high-dimensional statistical methods are applied. In a classic example, the data is the text of emails, and the unknown variable of interest V is an indicator for whether the email is spam. The prediction V̂ determines whether or not to send the email to a spam filter. Another classic task is sentiment prediction (e.g., Pang, Lee, and Vaithyanathan 2002), where the unknown variable V is the true sentiment of a message (say positive or negative), and the prediction V̂ might be used to identify positive reviews or comments about a product. A third task is predicting the incidence of local flu outbreaks from Google searches, where the outcome V is the true incidence of flu.

In these examples, and in the vast majority of settings where text analysis has been applied, the ultimate goal is prediction rather than causal inference. The interpretation of the mapping from C to V̂ is not usually an object of interest. Why certain words appear more often in spam, or why certain searches are correlated with flu, is not important so long as they generate highly accurate predictions. For example, Scott and Varian (2014, 2015) use data from Google searches to produce high-frequency estimates of macroeconomic variables such as unemployment claims, retail sales, and consumer sentiment that are otherwise available only at lower frequencies from survey data. Groseclose and Milyo (2005) compare the text of news outlets to speeches of congresspeople in order to estimate the outlets' political slant. A large literature in finance following Antweiler and Frank (2004) and Tetlock (2007) uses text from the internet or the news to predict stock prices.

In many social science studies, however, the goal is to go further and, in the third step, use text to infer causal relationships or the parameters of structural economic models. Stephens-Davidowitz (2014) uses Google search data to estimate local areas' racial animus, then studies the causal effect of racial animus on votes for Barack Obama in the 2008 election. Gentzkow and Shapiro (2010) use congressional and news text to estimate each news outlet's political slant, then study the supply and demand forces that determine slant in equilibrium. Engelberg and Parsons (2011) measure local news coverage of earnings announcements, then use the relationship between coverage and trading by local investors to separate the causal effect of news from other sources of correlation between news and stock prices.
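The three steps above can be made concrete in a few lines of code. The sketch below uses an invented toy spam corpus; the bag-of-words representation, the gradient-descent logistic fit, and the decision rule are all minimal stand-ins for the real choices discussed in the rest of this survey.

```python
import math
from collections import Counter

# Step 1: represent raw text as a numerical array C (token counts).
# Documents and labels are invented toy data.
docs = ["win cash now", "free cash prize",
        "meeting agenda today", "agenda for today meeting"]
labels = [1, 1, 0, 0]  # 1 = spam

vocab = sorted({w for d in docs for w in d.split()})
C = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Step 2: map C to predictions V-hat, here via a logistic regression
# fit by plain gradient ascent (no penalty, for brevity).
beta = [0.0] * len(vocab)
for _ in range(500):
    for ci, vi in zip(C, labels):
        eta = sum(b * x for b, x in zip(beta, ci))
        p = 1.0 / (1.0 + math.exp(-eta))
        beta = [b + 0.1 * (vi - p) * x for b, x in zip(beta, ci)]

def predict(text):
    counts = Counter(text.split())
    eta = sum(beta[j] * counts[w] for j, w in enumerate(vocab))
    return 1 if eta > 0 else 0

# Step 3: use V-hat downstream -- here, routing mail to a spam folder.
print(predict("free cash"))       # classified as spam
print(predict("agenda meeting"))  # classified as not spam
```

In a real application the middle step would be replaced by one of the penalized or generative methods of section 3, but the division of labor between the three steps is the same.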
In this paper, we provide an overview of methods for analyzing text and a survey of current applications in economics and related social sciences. The methods discussion is forward looking, providing an overview of methods that are currently applied in economics as well as those that we expect to have high value in the future. Our discussion of applications is selective and necessarily omits many worthy papers. We highlight examples that illustrate particular methods and use text data to make important substantive contributions even if they do not apply methods close to the frontier.

A number of other excellent surveys have been written in related areas. See Evans and Aceves (2016) and Grimmer and Stewart (2013) for related surveys focused on text analysis in sociology and political science, respectively. For methodological surveys, Bishop (2006), Hastie, Tibshirani, and Friedman (2009), and Murphy (2012) cover contemporary statistics and machine learning in general, while Jurafsky and Martin (2009) overview methods from computational linguistics and natural language processing. The Spring 2014 issue of the Journal of Economic Perspectives contains a symposium on "big data," which surveys broader applications of high-dimensional statistical methods to economics.

In section 2 we discuss representing text data as a manageable (though still high-dimensional) numerical array C; in section 3 we discuss methods from data mining and machine learning for predicting V from C. Section 4 then provides a selective survey of text analysis applications in social science, and section 5 concludes.

2. Representing Text as Data

When humans read text, they do not see a vector of dummy variables, nor a sequence of unrelated tokens. They interpret words in light of other words, and extract meaning from the text as a whole. It might seem obvious that any attempt to distill text into meaningful data must similarly take account of complex grammatical structures and rich interactions among words.

The field of computational linguistics has made tremendous progress in this kind of interpretation. Most of us have mobile phones that are capable of complex speech recognition. Algorithms exist to efficiently parse grammatical structure, disambiguate different senses of words, distinguish key points from secondary asides, and so on.

Yet virtually all analysis of text in the social sciences, like much of the text analysis in machine learning more generally, ignores the lion's share of this complexity. Raw text consists of an ordered sequence of language elements: words, punctuation, and white space. To reduce this to a simpler representation suitable for statistical analysis, we typically make three kinds of simplifications: dividing the text into individual documents i, reducing the number of language elements we consider, and limiting the extent to which we encode dependence among elements within documents. The result is a mapping from raw text to a numerical array C. A row c_i of C is a numerical vector with each element indicating the presence or count of a particular language token in document i.

2.1 What Is a Document?

The first step in constructing C is to divide raw text into individual documents {i}. In many applications, this is governed by the level at which the attributes of interest V are defined. For spam detection, the outcome of interest is defined at the level of individual emails, so we want to divide text that way too. If V is daily stock price movements that we wish to predict from the prior day's news text, it might make sense to divide the news text by day as well.

In other cases, the natural way to define a document is not so clear. If we wish to
predict legislators' partisanship from their floor speeches (Gentzkow, Shapiro, and Taddy 2016), we could aggregate speech so a document is a speaker–day, a speaker–year, or all speech by a given speaker during the time she is in Congress. When we use methods that treat documents as independent (which is true most of the time), finer partitions will typically ease computation at the cost of limiting the dependence we are able to capture. Theoretical guidance for the right level of aggregation is often limited, so this is an important dimension along which to check the sensitivity of results.

2.2 Feature Selection

To reduce the number of features to something manageable, a common first step is to strip out elements of the raw text other than words. This might include punctuation, numbers, HTML tags, proper names, and so on. It is also common to remove a subset of words that are either very common or very rare. Very common words, often called "stop words," include articles ("the," "a"), conjunctions ("and," "or"), forms of the verb "to be," and so on. These words are important to the grammatical structure of sentences, but they typically convey relatively little meaning on their own. The frequency of "the" is probably not very diagnostic of whether an email is spam, for example. Common practice is to exclude stop words based on a predefined list.¹ Very rare words do convey meaning, but their added computational cost in expanding the set of features that must be considered often exceeds their diagnostic value. A common approach is to exclude all words that occur fewer than k times for some arbitrary small integer k.

An approach that excludes both common and rare words and has proved very useful in practice is filtering by "term frequency–inverse document frequency" (tf–idf). For a word or other feature j in document i, term frequency (tf_ij) is the count c_ij of occurrences of j in i. Inverse document frequency (idf_j) is the log of one over the share of documents containing j: log(n/d_j), where d_j = ∑_i 1[c_ij > 0] and n is the total number of documents. The object of interest tf–idf is the product tf_ij × idf_j. Very rare words will have low tf–idf scores because tf_ij will be low. Very common words that appear in most or all documents will have low tf–idf scores because idf_j will be low. (Note that this improves on simply excluding words that occur frequently because it will keep words that occur frequently in some documents but do not appear in others; these often provide useful information.) A common practice is to keep only the words within each document i with tf–idf scores above some rank or cutoff.

A final step that is commonly used to reduce the feature space is stemming: replacing words with their root such that, e.g., "economic," "economics," "economically" are all replaced by the stem "economic." The Porter stemmer (Porter 1980) is a standard stemming tool for English language text.

All of these cleaning steps reduce the number of unique language elements we must consider and thus the dimensionality of the data. This can provide a massive computational benefit, and it is also often key to getting more interpretable model fits (e.g., in topic modeling). However, each of these steps requires careful decisions about the elements likely to carry meaning in a particular application.² One researcher's stop words are another's subject of interest.

¹ There is no single stop word list that has become a standard. How aggressive one wants to be in filtering stop words depends on the application. The web page http://www.ranks.nl/stopwords shows several common stop word lists, including the one built into the database software SQL and the list claimed to have been used in early versions of Google search. (Modern Google search does not appear to filter any stop words.)
² Denny and Spirling (2018) discuss the sensitivity of unsupervised text analysis methods such as topic modeling to preprocessing steps.
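The tf–idf weighting defined above is simple to compute directly from token counts. A minimal sketch (the three tokenized documents are invented; log is the natural log, matching the definition in the text):

```python
import math
from collections import Counter

# Toy corpus: three tokenized documents (invented for illustration).
docs = [["price", "rise", "price"],
        ["price", "fall"],
        ["tax", "rise"]]
n = len(docs)

counts = [Counter(d) for d in docs]          # tf_ij = c_ij
df = Counter(w for c in counts for w in c)   # d_j: number of docs containing j

def tf_idf(i, j):
    # tf-idf_ij = c_ij * log(n / d_j)
    return counts[i][j] * math.log(n / df[j])

# "price" appears in two of three documents: low idf despite high counts.
print(round(tf_idf(0, "price"), 3))  # → 0.811
# "tax" appears in only one document: higher idf.
print(round(tf_idf(2, "tax"), 3))    # → 1.099
```

Filtering would then keep, within each document, only the tokens whose tf–idf exceeds a chosen rank or cutoff.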
Dropping numerals from political text means missing references to "the first 100 days" or "September 11." In online communication, even punctuation can no longer be stripped without potentially significant information loss :-(.

2.3 n-grams

Producing a tractable representation also requires that we limit dependence among language elements. A fairly mild step in this direction, for example, might be to parse documents into distinct sentences and encode features of these sentences while ignoring the order in which they occur. The most common methodologies go much further.

The simplest and most common way to represent a document is called bag-of-words. The order of words is ignored altogether, and c_i is a vector whose length is equal to the number of words in the vocabulary and whose elements c_ij are the number of times word j occurs in document i. Suppose that the text of document i is

Good night, good night!
Parting is such sweet sorrow.

After stemming, removing stop words, and removing punctuation, we might be left with "good night good night part sweet sorrow." The bag-of-words representation would then have c_ij = 2 for j ∈ {good, night}, c_ij = 1 for j ∈ {part, sweet, sorrow}, and c_ij = 0 for all other words in the vocabulary.

This scheme can be extended to encode a limited amount of dependence by counting unique phrases rather than unique words. A phrase of length n is referred to as an n-gram. For example, in our snippet above, the count of 2-grams (or "bigrams") would have c_ij = 2 for j = good.night, c_ij = 1 for j including night.good, night.part, part.sweet, and sweet.sorrow, and c_ij = 0 for all other possible 2-grams. The bag-of-words representation then corresponds to counts of 1-grams.

Counting n-grams of order n > 1 yields data that describe a limited amount of the dependence between words. Specifically, the n-gram counts are sufficient for estimation of an n-order homogeneous Markov model across words (i.e., the model that arises if we assume that word choice is only dependent upon the previous n words). This can lead to richer modeling. In analysis of partisan speech, for example, single words are often insufficient to capture the patterns of interest: "death tax" and "tax break" are phrases with strong partisan overtones that are not evident if we look at the single words "death," "tax," and "break" (see, e.g., Gentzkow and Shapiro 2010).

Unfortunately, the dimension of c_i increases exponentially quickly with the order n of the phrases tracked. The majority of text analyses consider n-grams up to two or three at most, and the ubiquity of these simple representations (in both machine learning and social science) reflects a belief that the return to richer n-gram modeling is usually small relative to the cost. Best practice in many cases is to begin analysis by focusing on single words. Given the accuracy obtained with words alone, one can then evaluate if it is worth the extra time to move on to 2-grams or 3-grams.

2.4 Richer Representations

While rarely used in the social science literature to date, there is a vast array of methods from computational linguistics that capture richer features of text and may have high return in certain applications. One basic step beyond the simple n-gram counting above is to use sentence syntax to inform the text tokens used to summarize a document. For example, Goldberg and Orwant (2013) describe syntactic n-grams where words are grouped together whenever their meaning depends upon each other, according to a model of language syntax.
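The 1-gram and 2-gram counts in the snippet above are mechanical to reproduce; a minimal sketch:

```python
from collections import Counter

# The preprocessed snippet from the example above.
tokens = "good night good night part sweet sorrow".split()

# Bag-of-words: counts of 1-grams, ignoring order entirely.
unigrams = Counter(tokens)
print(unigrams["good"], unigrams["sorrow"])  # → 2 1

# 2-grams: counts of adjacent word pairs.
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams[("good", "night")])  # → 2
print(bigrams[("night", "good")])  # → 1
```

The same `zip` construction with longer offsets yields 3-grams and beyond, at the exponentially growing dimensionality cost the text describes.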
An alternative approach is to move beyond treating documents as counts of language tokens, and to instead consider the ordered sequence of transitions between words. In this case, one would typically break the document into sentences, and treat each as a separate unit for analysis. A single sentence of length s (i.e., containing s words) is then represented as a binary p × s matrix S, where the nonzero elements of S indicate occurrence of the row-word in the column-position within the sentence, and p is the length of the vocabulary. Such representations lead to a massive increase in the dimensions of the data to be modeled, and analysis of this data tends to proceed through word embedding: the mapping of words to a location in ℝ^K for some K ≪ p, such that the sentences are then sequences of points in this K-dimensional space. This is discussed in detail in section 3.3.

2.5 Other Practical Considerations

It is worth mentioning two details that can cause practical social science applications of these methods to diverge a bit from the ideal case considered in the statistics literature. First, researchers sometimes receive data in a pre-aggregated form. In the analysis of Google searches, for example, one might observe the number of searches containing each possible keyword on each day, but not the raw text of the individual searches. This means documents must be similarly aggregated (to days, rather than individual searches), and it also means that the natural representation where c_ij is the number of occurrences of word j on day i is not available. This is probably not a significant limitation, as the missing information (how many times per search a word occurs conditional on occurring at least once) is unlikely to be essential, but it is useful to note when mapping practice to theory.

A more serious issue is that researchers sometimes do not have direct access to the raw text and must access it through some interface such as a search engine. For example, Gentzkow and Shapiro (2010) count the number of newspaper articles containing partisan phrases by entering the phrases into a search interface (e.g., for the database ProQuest) and counting the number of matches they return. Baker, Bloom, and Davis (2016) perform similar searches to count the number of articles mentioning terms related to policy uncertainty. Saiz and Simonsohn (2013) count the number of web pages mentioning combinations of city names and terms related to corruption by entering queries in a search engine. Even if one can automate the searches in these cases, it is usually not feasible to produce counts for very large feature sets (e.g., every two-word phrase in the English language), and so the initial feature selection step must be relatively aggressive. Relatedly, interacting through a search interface means that there is no simple way to retrieve objects like the set of all words occurring at least twenty times in the corpus of documents, or the inputs to computing tf–idf.

3. Statistical Methods

This section considers methods for mapping the document-token matrix C to predictions V̂ of an attribute V. In some cases, the observed data is partitioned into submatrices C^train and C^test, where the matrix C^train collects rows for which we have observations V^train of V and the matrix C^test collects rows for which V is unobserved. The dimension of C^train is n^train × p, and the dimension of V^train is n^train × k, where k is the number of attributes we wish to predict.

Attributes in V can include observable quantities such as the frequency of flu cases, the positive or negative rating of movie reviews, or the unemployment rate, about
which the documents are informative. There can also be latent attributes of interest, such as the topics being discussed in a congressional debate or in news articles.

Methods to connect counts c_i to attributes v_i can be roughly divided into four categories. The first, which we will call dictionary-based methods, does not involve statistical inference at all: these methods simply specify v̂_i = f(c_i) for some known function f(⋅). This is by far the most common method in the social science literature using text to date. In some cases, researchers define f(⋅) based on a prespecified dictionary of terms capturing particular categories of text. In Tetlock (2007), for example, c_i is a bag-of-words representation and the outcome of interest v_i is the latent "sentiment" of Wall Street Journal columns, defined along a number of dimensions such as "positive," "optimistic," and so on. The author defines the function f(⋅) using a dictionary called the General Inquirer, which provides lists of words associated with each of these sentiment categories.³ The elements of f(c_i) are defined to be the sum of the counts of words in each category. (As we discuss below, the main analysis then focuses on the first principal component of the resulting counts.) In Baker, Bloom, and Davis (2016), c_i is the count of articles in a given newspaper–month containing a set of prespecified terms such as "policy," "uncertainty," and "Federal Reserve," and the outcome of interest v_i is the degree of "policy uncertainty" in the economy. The authors define f(⋅) to be the raw count of the prespecified terms divided by the total number of articles in the newspaper–month, averaged across newspapers. We do not provide additional discussion of dictionary-based methods in this section, but we return to them in section 3.5 and in our discussion of applications in section 4.

The second and third groups of methods are distinguished by whether they begin from a model of p(v_i | c_i) or a model of p(c_i | v_i). In the former case, which we will call text regression methods, we directly estimate the conditional outcome distribution, usually via the conditional expectation E[v_i | c_i] of attributes v_i. This is intuitive: if we want to predict v_i from c_i, we would naturally regress the observed values of the former (V^train) on the corresponding values of the latter (C^train). Any generic regression technique can be applied, depending upon the nature of v_i. However, the high dimensionality of c_i, where p is often as large as or larger than n^train, requires use of regression techniques appropriate for such a setting, such as penalized linear or logistic regression.

In the latter case, we begin from a generative model of p(c_i | v_i). To see why this is intuitive, note that in many cases the underlying causal relationship runs from outcomes to language rather than the other way around. For example, Google searches about the flu do not cause flu cases to occur; rather, people with the flu are more likely to produce such searches. Congresspeople's ideology is not determined by their use of partisan language; rather, people who are more conservative or liberal to begin with are more likely to use such language. From an economic point of view, the correct "structural" model of language in these cases maps from v_i to c_i, and as in other cases familiar to economists, modeling the underlying causal relationships can provide powerful guidance to inference and make the estimated model more interpretable.

Generative models can be further divided by whether the attributes are observed or latent. In the first case of unsupervised methods, we do not observe the true value of v_i for any documents. The function relating c_i and v_i is unknown, but we are willing to impose sufficient structure on it to allow us to infer v_i from c_i. This class includes methods such as topic modeling and its variants (e.g., latent Dirichlet allocation, or LDA).

³ http://www.wjh.harvard.edu/~inquirer/.
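Before turning to the remaining categories, note how little machinery the first, dictionary-based, category requires: f(⋅) is just a known function of counts. The sketch below computes category scores in that style; the word lists are invented stand-ins, not the actual General Inquirer categories.

```python
from collections import Counter

# Hypothetical dictionary: invented word lists, not a real lexicon.
categories = {"positive": {"gain", "improve", "strong"},
              "negative": {"loss", "weak", "decline"}}

def f(tokens):
    # Dictionary-based f(c_i): sum of counts of words in each category.
    c = Counter(tokens)
    return {cat: sum(c[w] for w in words)
            for cat, words in categories.items()}

doc = "strong gain offset a small loss".split()
print(f(doc))  # → {'positive': 2, 'negative': 1}
```

Normalizing such sums by document length, or averaging them across documents, yields indices of the kind used by Tetlock (2007) and Baker, Bloom, and Davis (2016).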
In the second case of supervised methods, we observe training data V^train and we can fit our model, say f_θ(c_i; v_i) for a vector of parameters θ, to this training set. The fitted model f_θ̂ can then be inverted to predict v_i for documents in the test set and can also be used to interpret the structural relationship between attributes and text. Finally, in some cases, v_i includes both observed and latent attributes for a semi-supervised analysis.

Lastly, we discuss word embeddings, which provide a richer representation of the underlying text than the token counts that underlie other methods. They have seen limited application in economics to date, but their dramatic successes in deep learning and other machine learning domains suggest they are likely to have high value in the future.

We close in section 3.5 with some broad recommendations for practitioners.

3.1 Text Regression

Predicting an attribute v_i from counts c_i is a regression problem like any other, except that the high dimensionality of c_i makes ordinary least squares (OLS) and other standard techniques infeasible. The methods in this section are mainly applications of standard high-dimensional regression methods to text.

3.1.1 Penalized Linear Models

The most popular strategy for very high-dimensional regression in contemporary statistics and machine learning is the estimation of penalized linear models, particularly with L1 penalization. We recommend this strategy for most text regression applications: linear models are intuitive and interpretable, and fast, high-quality software is available for big sparse input matrices like our C. For simple text-regression tasks with input dimension on the same order as the sample size, penalized linear models typically perform close to the frontier in terms of out-of-sample prediction.

Linear models in the sense we mean here are those in which v_i depends on c_i only through a linear index η_i = α + x_i′β, where x_i is a known transformation of c_i. In many cases, we simply have E[v_i | x_i] = η_i. It is also possible that E[v_i | x_i] = f(η_i) for some known link function f(⋅), as in the case of logistic regression.

Common transformations are the identity x_i = c_i, normalization by document length x_i = c_i/m_i with m_i = ∑_j c_ij, or the positive indicator x_ij = 1[c_ij > 0]. The best choice is application specific, and may be driven by interpretability: does one wish to interpret β_j as the added effect of an extra count for token j (if so, use x_ij = c_ij) or as the effect of the presence of token j (if so, use x_ij = 1[c_ij > 0])? The identity is a reasonable default in many settings.

Write l(α, β) for an unregularized objective proportional to the negative log likelihood, −log p(v_i | x_i). For example, in Gaussian (linear) regression, l(α, β) = ∑_i (v_i − η_i)², and in binomial (logistic) regression, l(α, β) = −∑_i [η_i v_i − log(1 + e^{η_i})] for v_i ∈ {0, 1}. A penalized estimator is then the solution to

(1)  min_{α,β}  l(α, β) + nλ ∑_{j=1}^p κ_j(|β_j|),

where λ > 0 controls overall penalty magnitude and κ_j(⋅) are increasing "cost" functions that penalize deviations of the β_j from zero.

A few common cost functions are shown in figure 1. Those that have a non-differentiable spike at zero (lasso, elastic net, and log) lead to sparse estimators, with some coefficients set to exactly zero. The curvature of the penalty away from zero dictates the weight of shrinkage imposed on the nonzero coefficients: L2 costs increase with coefficient size; lasso's L1 penalty has zero curvature and imposes constant shrinkage, and as curvature
goes toward −∞ one approaches the L0 penalty of subset selection. The lasso's L1 penalty (Tibshirani 1996) is extremely popular: it yields sparse solutions with a number of desirable properties (e.g., Bickel, Ritov, and Tsybakov 2009; Wainwright 2009; Belloni, Chernozhukov, and Hansen 2013; Bühlmann and van de Geer 2011), and the number of nonzero estimated coefficients is an unbiased estimator of the regression degrees of freedom (which is useful in model selection; see Zou, Hastie, and Tibshirani 2007).⁴

Figure 1. Penalty cost functions. From left to right: L2 costs (ridge, Hoerl and Kennard 1970), L1 (lasso, Tibshirani 1996), the "elastic net" mixture of L1 and L2 (Zou and Hastie 2005), and the log penalty (Candès, Wakin, and Boyd 2008); the panels plot β², |β|, |β| + 0.1 × β², and log(1 + |β|) against β.

Focusing on L1 regularization, rewrite the penalized linear model objective as

(2)  min_{α,β}  l(α, β) + nλ ∑_{j=1}^p ω_j |β_j|.

A common strategy sets ω_j so that the penalty cost for each coefficient is scaled by the sample standard deviation of that covariate. In text analysis, where each covariate corresponds to some transformation of a specific text token, this type of weighting is referred to as "rare feature up-weighting" (e.g., Manning, Raghavan, and Schütze 2008) and is generally thought of as good practice: rare words are often most useful in differentiating between documents.⁵

Large λ leads to simple model estimates in the sense that most coefficients will be set at or close to zero, while as λ → 0 we approach maximum likelihood estimation (MLE). Since there is no way to define an optimal λ a priori, standard practice is to compute estimates for a large set of possible λ and then use some criterion to select the one that yields the best fit.

Several criteria are available to choose an optimal λ. One common approach is to leave out part of the training sample in estimation and then choose the λ that yields the best out-of-sample fit according to some criterion such as mean squared error. Rather than work with a single leave-out sample, researchers most often use K-fold cross-validation (CV).

⁴ Penalties with a bias that diminishes with coefficient size, such as the log penalty in figure 1 (Candès, Wakin, and Boyd 2008), the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), or the adaptive lasso of Zou (2006), have been promoted in the statistics literature as improving upon the lasso by providing consistent variable selection and estimation in a wider range of settings. These diminishing-bias penalties lead to increased computation costs (due to a non-convex loss), but there exist efficient approximation algorithms (see, e.g., Fan, Xue, and Zou 2014; Taddy 2017b).
⁵ This is the same principle that motivates "inverse-document frequency" weighting schemes, such as tf–idf.
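The sparsity mechanics behind the L1 objective in equation (2) can be seen in a minimal coordinate-descent solver with soft-thresholding (intercept omitted and ω_j = 1 for brevity; the two-feature toy data is invented):

```python
def soft_threshold(z, g):
    # The lasso coordinate update shrinks toward zero and truncates at it.
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso(X, y, lam, sweeps=100):
    # Coordinate descent for: min_b 0.5*sum_i (y_i - x_i'b)^2 + n*lam*sum_j |b_j|
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Partial residuals excluding feature j.
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, n * lam) / z
    return beta

# Invented toy data: only the first feature predicts y.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]
print(lasso(X, y, lam=0.5))  # first coefficient shrunk, second exactly zero
```

In practice one computes this solution along a grid of λ values and selects among them by the cross-validation or information criteria described in the text.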
544 Journal of Economic Literature, Vol. LVII (September 2019)

This splits the sample into K disjoint subsets, and then fits the full regularization path K times excluding each subset in turn. This yields K realizations of the mean squared error or other out-of-sample fit measure for each value of λ. Common rules are to select the value of λ that minimizes the average error across these realizations, or (more conservatively) to choose the largest λ with mean error no more than one standard error away from the minimum.

Analytic alternatives to cross-validation are Akaike's information criterion (AIC; Akaike 1973) and the Bayesian information criterion (BIC) of Schwarz (1978). In particular, Flynn, Hurvich, and Simonoff (2013) describe a bias-corrected AIC objective for high-dimensional problems that they call AICc. It is motivated as an approximate likelihood maximization subject to a degrees of freedom (df_λ) adjustment: AICc(λ) = 2 l(α_λ, β_λ) + 2 df_λ n / (n − df_λ − 1). Similarly, the BIC objective is BIC(λ) = l(α_λ, β_λ) + df_λ log n, and is motivated as an approximation to the Bayesian posterior marginal likelihood in Kass and Wasserman (1995). AICc and BIC selection choose λ to minimize their respective objectives. The BIC tends to choose simpler models than cross-validation or AICc. Zou, Hastie, and Tibshirani (2007) recommend BIC for lasso penalty selection whenever variable selection, rather than predictive performance, is the primary goal.

3.1.2 Dimension Reduction

Another common solution for taming high-dimensional prediction problems is to form a small number of linear combinations of predictors and to use these derived indices as variables in an otherwise standard predictive regression. Two classic dimension reduction techniques are principal components regression (PCR) and partial least squares (PLS).

Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors to be close to (or, for lasso, exactly) zero. This can produce suboptimal forecasts when predictors are highly correlated. A transparent illustration of this problem would be a case in which all of the predictors are equal to the forecast target plus an i.i.d. noise term. In this situation, choosing a subset of predictors via lasso penalty is inferior to taking a simple average of the predictors and using this as the sole predictor in a univariate regression. This predictor averaging, as opposed to predictor selection, is the essence of dimension reduction.

PCR consists of a two-step procedure. In the first step, principal components analysis (PCA) combines regressors into a small set of K linear combinations that best preserve the covariance structure among the predictors. This amounts to solving the problem

(3)  min_{Γ,B} trace[(C − ΓB′)(C − ΓB′)′],

subject to

rank(Γ) = rank(B) = K.

The count matrix C consists of n rows (one for each document) and p columns (one for each term). PCA seeks a low-rank representation ΓB′ that best approximates the text data C. This formulation has the character of a factor model. The n × K matrix Γ captures the prevalence of K common components, or "factors," in each document. The p × K matrix B describes the strength of association between each word and the factors. As we will see, this reduced-rank decomposition bears a close resemblance to other text analytic methods such as topic modeling and word embeddings.

In the second step, the K components are used in standard predictive regression. As an
example, Foster, Liberman, and Stine (2013) use PCR to build a hedonic real estate pricing model that takes textual content of property listings as an input.6 With text data, where the number of features tends to vastly exceed the observation count, regularized versions of PCA such as predictor thresholding (e.g., Bai and Ng 2008) and sparse PCA (Zou, Hastie, and Tibshirani 2006) help exclude the least informative features to improve predictive content of the dimension-reduced text.

6 See Stock and Watson (2002a, b) for development of the PCR estimator and an application to macroeconomic forecasting with a large set of numerical predictors.

A drawback of PCR is that it fails to incorporate the ultimate statistical objective—forecasting a particular set of attributes—in the dimensionality reduction step. PCA condenses text data into indices based on the covariation among the predictors. This happens prior to the forecasting step and without consideration of how predictors associate with the forecast target.

In contrast, PLS performs dimension reduction by directly exploiting covariation of predictors with the forecast target.7 Suppose we are interested in forecasting a scalar attribute v_i. PLS regression proceeds as follows. For each element j of the feature vector c_i, estimate the univariate covariance between v_i and c_ij. This covariance, denoted φ_j, reflects the attribute's "partial" sensitivity to each feature j. Next, form a single predictor by averaging all attributes into a single aggregate predictor v̂_i = ∑_j φ_j c_ij / ∑_j φ_j. This forecast places the highest weight on the strongest univariate predictors, and the least weight on the weakest. In this way, PLS performs its dimension reduction with the ultimate forecasting objective in mind.

7 See Kelly and Pruitt (2013, 2015) for the asymptotic theory of PLS regression and its application to forecasting risk premia in financial markets.

The description of v̂_i reflects the K = 1 case, i.e., when text is condensed into a single predictive index. To use additional predictive indices, both v_i and c_ij are orthogonalized with respect to v̂_i, the above procedure is repeated on the orthogonalized data set, and the resulting forecast is added to the original v̂_i. This is iterated until the desired number of PLS components K is reached. Like PCR, PLS components describe the prevalence of K common factors in each document. And also like PCR, PLS can be implemented with a variety of regularization schemes to aid its performance in the ultra-high-dimensional world of text. Section 4 discusses applications using PLS in text regression.

PCR and PLS share a number of common properties. In both cases, K is a user-controlled parameter which, in many social science applications, is selected ex ante by the researcher. But, like any hyperparameter, K can be tuned via cross-validation. And neither method is scale invariant—the forecasting model is sensitive to the distribution of predictor variances. It is therefore common to variance-standardize features before applying PCR or PLS.

3.1.3 Nonlinear Text Regression

Penalized linear models are the most widely applied text regression tools due to their simplicity, and because they may be viewed as a first-order approximation to potentially nonlinear and complex data generating processes (DGPs). In cases where a linear specification is too restrictive, there are several other machine learning tools that are well suited to represent nonlinear associations between text c_i and outcome attributes v_i. Here we briefly describe four such nonlinear regression methods—generalized linear models, support vector machines, regression trees, and deep learning—and provide references for readers interested in thorough treatments of each.

GLMs and SVMs.—One way to capture nonlinear associations between c_i and v_i is
with a generalized linear model (GLM). These expand the linear model to include nonlinear functions of c_i such as polynomials or interactions, while otherwise treating the problem with the penalized linear regression methods discussed above.

A related method used in the social science literature is the support vector machine, or SVM (Vapnik 1995). This is used for text classification problems (when V is categorical), the prototypical example being email spam filtering. A detailed discussion of SVMs is beyond the scope of this review, but from a high level, the SVM finds hyperplanes in a basis expansion of C that partition the observations into sets with equal response (i.e., so that v_i are all equal in each region).8

8 Hastie, Tibshirani, and Friedman (2009, chapter 12) and Murphy (2012, chapter 14) provide detailed overviews of GLMs and SVMs. Joachims (1998) and Tong and Koller (2001) (among others) study text applications of SVMs.

GLMs and SVMs both face the limitation that, without a priori assumptions for which basis transformations and interactions to include, they may overfit and require extensive tuning (Hastie, Tibshirani, and Friedman 2009; Murphy 2012). For example, multi-way interactions increase the parameterization combinatorially and can quickly overwhelm the penalization routine, and their performance suffers in the presence of many spurious "noise" inputs (Hastie, Tibshirani, and Friedman 2009).9

9 Another drawback of SVMs is that they cannot be easily connected to the estimation of a probabilistic model and the resulting fitted model can sometimes be difficult to interpret. Polson and Scott (2011) provide a pseudo-likelihood interpretation for a variant of the SVM objective. Our own experience has led us to lean away from SVMs for text analysis in favor of more easily interpretable models. Murphy (2012, chapter 14.6) attributes the popularity of SVMs in some application areas to an ignorance of alternatives.

Regression Trees.—Regression trees have become a popular nonlinear approach for incorporating multi-way predictor interactions into regression and classification problems. The logic of trees differs markedly from traditional regressions. A tree "grows" by sequentially sorting data observations into bins based on values of the predictor variables. This partitions the data set into rectangular regions, and forms predictions as the average value of the outcome variable within each partition (Breiman et al. 1984). This structure is an effective way to accommodate rich interactions and nonlinear dependencies.

Two extensions of the simple regression tree have been highly successful thanks to clever regularization approaches that minimize the need for tuning and avoid overfitting. Random forests (Breiman 2001) average predictions from many trees that have been randomly perturbed in a bootstrap step. Boosted trees (e.g., Friedman 2002) recursively combine predictions from many oversimplified trees.10

10 Hastie, Tibshirani, and Friedman (2009) provide an overview of these methods. In addition, see Wager, Hastie, and Efron (2014) and Wager and Athey (2018) for results on confidence intervals for random forests, and see Taddy et al. (2015) and Taddy et al. (2016) for an interpretation of random forests as a Bayesian posterior over potentially optimal trees.

The benefits of regression trees—nonlinearity and high-order interactions—are sometimes lessened in the presence of high-dimensional inputs. While we would generally recommend tree models, and especially random forests, they are often not worth the effort for simple text regression. Oftentimes, a more beneficial use of trees is in a final prediction step after some dimension reduction derived from the generative models in section 3.2.

Deep Learning.—There is a host of other machine learning techniques that have been applied to text regression. The most common techniques not mentioned thus far are neural networks, which typically allow the inputs to act on the response through one
or more layers of interacting nonlinear basis functions (e.g., see Bishop 1995). A main attraction of neural networks is their status as universal approximators, a theoretical result describing their ability to mimic general, smooth nonlinear associations.

In high-dimensional and very noisy settings, such as in text analysis, classical neural nets tend to suffer from the same issues referenced above: they often overfit and are difficult to tune. However, the recently popular "deep" versions of neural networks (with many layers, and fewer nodes per layer) incorporate a number of innovations that allow them to work better, faster, and with little tuning, even in difficult text analysis problems. Such deep neural nets (DNNs) are now the state-of-the-art solution for many machine learning tasks (LeCun, Bengio, and Hinton 2015).11 DNNs are now employed in many complex natural language processing tasks, such as translation (Sutskever, Vinyals, and Le 2014; Wu et al. 2016) and syntactic parsing (Chen and Manning 2014), as well as in exercises of relevance to social scientists—for example, Iyyer et al. (2014) infer political ideology from text using a DNN. They are frequently used in conjunction with richer text representations such as word embeddings, described more below.

11 Goodfellow, Bengio, and Courville (2016) provide a thorough textbook overview of these "deep learning" technologies, while Goldberg (2016) is an excellent primer on their use in natural language processing.

3.1.4 Bayesian Regression Methods

The penalized methods above can all be interpreted as posterior maximization under some prior. For example, ridge regression maximizes the posterior under independent Gaussian priors on each coefficient, while Park and Casella (2008) and Hans (2009) give Bayesian interpretations to the lasso. See also the horseshoe of Carvalho, Polson, and Scott (2010) and the double Pareto of Armagan, Dunson, and Lee (2013) for Bayesian analogues of diminishing-bias penalties like the log penalty on the right of figure 1.

For those looking to do a full Bayesian analysis for high-dimensional (e.g., text) regression, an especially appealing model is the spike-and-slab introduced in George and McCulloch (1993). This models the distribution over regression coefficients as a mixture between two densities centered at zero—one with very small variance (the spike) and another with large variance (the slab). This model allows one to compute posterior variable inclusion probabilities as, for each coefficient, the posterior probability that it came from the slab and not the spike component. Due to a need to integrate over the posterior distribution, e.g., via Markov chain Monte Carlo (MCMC), inference for spike-and-slab models is much more computationally intensive than fitting the penalized regressions of section 3.1.1. However, Yang, Wainwright, and Jordan (2016) argue that spike-and-slab estimates based on short MCMC samples can be useful in application, while Scott and Varian (2014) have engineered efficient implementations of the spike-and-slab model for big data applications. These procedures give a full accounting of parameter uncertainty, which we miss in a quick penalized regression.

3.2 Generative Language Models

Text regression treats the token counts as generic high-dimensional input variables, without any attempt to model structure that is specific to language data. In many settings it is useful to instead propose a generative model for the text tokens to learn about how the attributes influence word choice and account for various dependencies among words and among attributes. In this approach, the words in a document are viewed as the realization of a generative process defined through a probability model for p(c_i | v_i).
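As a brief aside before turning to generative models, the component-membership calculation behind the spike-and-slab inclusion probability of section 3.1.4 can be written in a few lines. The sketch below is a hypothetical single-point calculation under an assumed 50/50 mixture of two zero-mean normals, not the full MCMC inference the text describes:

```python
import math

def normal_pdf(b, sd):
    # Density of a zero-mean normal with standard deviation sd, at b.
    return math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Mixture prior: a narrow "spike" and a wide "slab", both centered at zero.
# The standard deviations and prior weight here are illustrative choices.
SPIKE_SD, SLAB_SD, PRIOR_SLAB = 0.01, 2.0, 0.5

def slab_probability(beta):
    # Bayes rule over the two mixture components for a given coefficient:
    # the probability that beta was drawn from the slab, not the spike.
    slab = PRIOR_SLAB * normal_pdf(beta, SLAB_SD)
    spike = (1 - PRIOR_SLAB) * normal_pdf(beta, SPIKE_SD)
    return slab / (slab + spike)
```

A coefficient near zero is assigned almost entirely to the spike (effectively excluded), while a large one is assigned to the slab (included). In a full analysis, inclusion probabilities average such calculations over posterior draws of the coefficients rather than evaluating them at a single point.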
3.2.1 Unsupervised Generative Models

In the unsupervised setting, we have no direct observations of the true attributes v_i. Our inference about these attributes must therefore depend entirely on strong assumptions that we are willing to impose on the structure of the model p(c_i | v_i). Examples in the broader literature include cases where the v_i are latent factors, clusters, or categories. In text analysis, the leading application has been the case in which the v_i are topics.

A typical generative model implies that each observation c_i is a conditionally independent draw from the vocabulary of possible tokens according to some document-specific token probability vector, say q_i = [q_i1 ⋯ q_ip]′. Conditioning on document length, m_i = ∑_j c_ij, this implies a multinomial distribution for the counts

(4)  c_i ∼ MN(q_i, m_i).

This multinomial model underlies the vast majority of contemporary generative models for text.

Under the basic model in (4), the function q_i = q(v_i) links attributes to the distribution of text counts. A leading example of this link function is the topic model specification of Blei, Ng, and Jordan (2003),12 where

(5)  q_i = v_i1 θ_1 + v_i2 θ_2 + ⋯ + v_ik θ_k = Θ v_i.

12 Standard least-squares factor models have long been employed in "latent semantic analysis" (LSA; Deerwester et al. 1990), which applies PCA (i.e., singular value decompositions) to token count transformations such as x_i = c_i / m_i or x_ij = c_ij log(d_j), where d_j = ∑_i 1[c_ij > 0]. Topic modeling and its precursor, probabilistic LSA, are generally seen as improving on such approaches by replacing arbitrary transformations with a plausible generative model.

Many readers will recognize the model in (5) as a factor model for the vector of normalized counts for each token in document i, c_i / m_i. Indeed, a topic model is simply a factor model for multinomial data. Each topic is a probability vector over possible tokens, denoted θ_l, l = 1, …, k (where θ_lj ≥ 0 and ∑_{j=1}^p θ_lj = 1). A topic can be thought of as a cluster of tokens that tend to appear in documents. The latent attribute vector v_i is referred to as the set of topic weights (formally, a distribution over topics, v_il ≥ 0 and ∑_{l=1}^k v_il = 1). Note that v_il describes the proportion of language in document i devoted to the lth topic. We can allow each document to have a mix of topics, or we can require that one v_il = 1 while the rest are zero, so that each document has a single topic.13

13 Topic modeling is alternatively labeled as "latent Dirichlet allocation" (LDA), which refers to the Bayesian model in Blei, Ng, and Jordan (2003) that treats each v_i and θ_l as generated from a Dirichlet-distributed prior. Another specification that is popular in political science (e.g., Quinn et al. 2010) keeps θ_l as Dirichlet-distributed but requires each document to have a single topic. This may be most appropriate for short documents, such as press releases or single speeches.

Since its introduction into text analysis, topic modeling has become hugely popular.14 (See Blei 2012 for a high-level overview.) The model has been especially useful in political science (e.g., Grimmer 2010), where researchers have been successful in attaching political issues and beliefs to the estimated latent topics.

14 The same model was independently introduced in genetics by Pritchard, Stephens, and Donnelly (2000) for factorizing gene expression as a function of latent populations; it has been similarly successful in that field. Latent Dirichlet allocation is also an extension of a related mixture modeling approach in the latent semantic analysis of Hofmann (1999).

Since the v_i are of course latent, estimation for topic models tends to make use of some alternating inference for V | Θ and Θ | V. One possibility is to employ a version of the expectation-maximization (EM) algorithm to either maximize the likelihood implied by
(4) and (5) or, after incorporating the usual Dirichlet priors on v_i and θ_l, to maximize the posterior; this is the approach taken in Taddy (2012; see this paper also for a review of topic estimation techniques). Alternatively, one can target the full posterior distribution p(Θ, V | c_i). Estimation, say for Θ, then proceeds by maximization of the estimated marginal posterior, say p(Θ | c_i).

Due to the size of the data sets and dimension of the models, posterior approximation for topic models usually uses some form of variational inference (Wainwright and Jordan 2008) that fits a tractable parametric family to be as close as possible (e.g., in Kullback–Leibler divergence) to the true posterior. This variational approach was used in the original Blei, Ng, and Jordan (2003) paper and in many applications since. Hoffman et al. (2013) present a stochastic variational inference algorithm that takes advantage of techniques for optimization on massive data; this algorithm is used in many contemporary topic modeling applications. Another approach, which is more computationally intensive but can yield more accurate posterior approximations, is the MCMC algorithm of Griffiths and Steyvers (2004). Alternatively, for quick estimation without uncertainty quantification, the posterior maximization algorithm of Taddy (2012) is a good option.

The choice of k, the number of topics, is often fairly arbitrary. Data-driven choices do exist: Taddy (2012) describes a model selection process for k that is based upon Bayes factors, Airoldi et al. (2010) provide a cross-validation (CV) scheme, while Teh et al. (2006) use Bayesian nonparametric techniques that view k as an unknown model parameter. In practice, however, it is very common to simply start with a number of topics on the order of ten, and then adjust the number of topics in whatever direction seems to improve interpretability. Whether this ad hoc procedure is problematic depends on the application. As we discuss below, in many applications of topic models to date, the goal is to provide an intuitive description of text, rather than inference on some underlying "true" parameters; in these cases, the ad hoc selection of the number of topics may be reasonable.

The basic topic model has been generalized and extended in a variety of ways. A prominent example is the dynamic topic model of Blei and Lafferty (2006), which considers documents that are indexed by date (e.g., publication date for academic articles) and allows the topics, say Θ_t, to evolve smoothly in time. Another example is the supervised topic model of Blei and McAuliffe (2007), which combines the standard topic model with an extra equation relating the weights v_i to some additional attribute y_i in p(y_i | v_i). This pushes the latent topics to be relevant to y_i as well as the text c_i. In these and many other extensions, the modifications are designed to incorporate available document metadata (in these examples, time and y_i respectively).

3.2.2 Supervised Generative Models

In supervised models, the attributes v_i are observed in a training set and thus may be directly harnessed to inform the model of text generation. Perhaps the most common supervised generative model is the so-called naive Bayes classifier (e.g., Murphy 2012), which treats counts for each token as independent with class-dependent means. For example, the observed attribute might be author identity for each document in the corpus, with the model specifying different mean token counts for each author.

In naive Bayes, v_i is a univariate categorical variable and the token count distribution is factorized as p(c_i | v_i) = ∏_j p_j(c_ij | v_i), thus "naively" specifying conditional independence between tokens j. This rules out the possibility that by choosing to say one token (say, "hello") we reduce the probability
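The naive Bayes factorization can be sketched end-to-end in a few lines. The example below is hypothetical (toy counts, two "authors" as the observed attribute) and uses multinomial class-conditional probabilities with add-one smoothing, a common practical choice rather than a specification prescribed by the text:

```python
import math

# Hypothetical training set: documents as token-count vectors, each with
# an observed class label (author) -- the supervised attribute v_i.
docs = [
    ([3, 0, 1], "author_a"),
    ([2, 1, 0], "author_a"),
    ([0, 3, 2], "author_b"),
    ([1, 2, 3], "author_b"),
]
p = 3  # vocabulary size

def fit(docs):
    # Class-conditional token probabilities with add-one (Laplace)
    # smoothing, plus class priors proportional to document counts.
    model = {}
    for label in {l for _, l in docs}:
        rows = [c for c, l in docs if l == label]
        totals = [sum(r[j] for r in rows) for j in range(p)]
        denom = sum(totals) + p
        model[label] = {
            "prior": len(rows) / len(docs),
            "probs": [(t + 1) / denom for t in totals],
        }
    return model

def classify(model, counts):
    # Conditional independence across tokens: the log likelihood of a
    # count vector is a sum of per-token contributions.
    def score(label):
        m = model[label]
        return math.log(m["prior"]) + sum(
            c * math.log(q) for c, q in zip(counts, m["probs"]))
    return max(model, key=score)
```

Here classify(fit(docs), [2, 0, 1]) picks "author_a", since the first token dominates that author's training counts; the independence assumption is exactly what makes the score a simple sum over tokens.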