Online Latent Dirichlet Allocation with Infinite Vocabulary
Ke Zhai (zhaike@cs.umd.edu), Department of Computer Science, University of Maryland, College Park, MD USA
Jordan Boyd-Graber (jbg@umiacs.umd.edu), iSchool and UMIACS, University of Maryland, College Park, MD USA

Abstract

Topic models based on latent Dirichlet allocation (LDA) assume a predefined vocabulary. This is reasonable in batch settings but not reasonable for streaming and online settings. To address this lacuna, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings rather than from a finite Dirichlet. We develop inference using online variational inference and—to only consider a finite number of words for each topic—propose heuristics to dynamically order, expand, and contract the set of words we consider in our vocabulary. We show our model can successfully incorporate new words and that it performs better than topic models with finite vocabularies in evaluations of topic quality and classification performance.

1. Introduction

Latent Dirichlet allocation (LDA) is a probabilistic approach for exploring topics in document collections (Blei et al., 2003). Topic models offer a formalism for exposing a collection's themes and have been used to aid information retrieval (Wei & Croft, 2006), understand academic literature (Dietz et al., 2007), and discover political perspectives (Paul & Girju, 2010).

As hackneyed as the term "big data" has become, researchers and industry alike require algorithms that are scalable and efficient. Topic modeling is no different. A common scalability strategy is converting batch algorithms into streaming algorithms that only make one pass over the data.
In topic modeling, Hoffman et al. (2010) extended LDA to online settings. However, this and later online topic models (Wang et al., 2011; Mimno et al., 2012) make the same limiting assumption. The namesake topics, distributions over words that evince thematic coherence, are always modeled as a multinomial drawn from a finite Dirichlet distribution. This assumption precludes additional words being added over time.

Particularly for streaming algorithms, this is neither reasonable nor appealing. There are many reasons immutable vocabularies do not make sense: words are invented ("crowdsourcing"), words cross languages ("Gangnam"), or words common in one context become prominent elsewhere ("vuvuzelas" moving from music to sports in the 2010 World Cup). To be flexible, topic models must be able to capture the addition, invention, and increased prominence of new terms.

Allowing models to expand topics to include additional words requires changing the underlying statistical formalism. Instead of assuming that topics come from a finite Dirichlet distribution, we assume that they come from a Dirichlet process (Ferguson, 1973) with a base distribution over all possible words, of which there are an infinite number. Bayesian nonparametric tools like the Dirichlet process allow us to reason about distributions over infinite supports. We review both topic models and Bayesian nonparametrics in Section 2. In Section 3, we present the infinite vocabulary topic model, which uses Bayesian nonparametrics to go beyond fixed vocabularies.

In Section 4, we derive approximate inference for our model. Since emerging vocabulary is most important in non-batch settings, in Section 5, we extend inference to streaming settings. We compare the coherence and effectiveness of our infinite vocabulary topic model against models with fixed vocabulary in Section 6.

Figure 1 shows a topic evolving during inference. The algorithm processes documents in sets we call minibatches; after each minibatch, online variational inference updates our model's parameters. This shows that out-of-vocabulary words can enter topics and eventually become high probability words.
[Figure 1: ranked word lists for a single topic after minibatches 2, 3, 5, 8, 10, 16, 17, 39, 83, and 120; settings: number of topics K = 50, truncation level T = 20K, minibatch size S = 155, DP scale parameter αβ = 5000, reordering delay U = 20, learning inertia τ0 = 256, learning rate κ = 0.6.]

Figure 1. The evolution of a single "comic book" topic from the 20 newsgroups corpus. Each column is a ranked list of word probabilities after processing a minibatch (numbers preceding words are the exact rank). The box below the topics contains words introduced in a minibatch. For example, "hulk" first appeared in minibatch 10, was ranked at 9659 after minibatch 17, and became the second most important word by the final minibatch. Colors help show words' trajectories.

2. Background

Latent Dirichlet allocation (Blei et al., 2003) assumes a simple generative process. The K topics, drawn from a symmetric Dirichlet distribution, βk ∼ Dir(η), k ∈ {1, . . . , K}, generate a corpus of observed words:

1: for each document d in a corpus D do
2:   Choose a distribution θd over topics from a Dirichlet distribution θd ∼ Dir(αθ).
3:   for each of the n = 1, . . . , Nd word indexes do
4:     Choose a topic zn from the document's distribution over topics zn ∼ Mult(θd).
5:     Choose a word wn from the appropriate topic's distribution over words p(wn | βzn).

Implicit in this model is a finite number of words in the vocabulary because the support of the Dirichlet distribution Dir(η) is fixed. Moreover, it fixes a priori the words we can observe, a patently false assumption (Algeo, 1980).
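To make the generative story concrete, the sketch below samples a small synthetic corpus from the finite-vocabulary LDA process above; the corpus size, vocabulary size, and hyperparameter values are illustrative choices, not settings from the paper.

import numpy as np

rng = np.random.default_rng(0)

K, V = 5, 1000          # illustrative: number of topics and (finite) vocabulary size
eta, alpha_theta = 0.01, 1.0 / K

# Topics beta_k ~ Dir(eta): only possible because the vocabulary is finite and fixed.
beta = rng.dirichlet(np.full(V, eta), size=K)

def generate_document(n_words):
    """Sample one document from the LDA generative process."""
    theta = rng.dirichlet(np.full(K, alpha_theta))        # document-topic proportions
    z = rng.choice(K, size=n_words, p=theta)              # topic assignment per token
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # word drawn from the chosen topic
    return z, w

corpus = [generate_document(rng.integers(20, 80)) for _ in range(10)]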
2.1. Bayesian Nonparametrics

Bayesian nonparametrics is an appealing solution; it models arbitrary distributions with an unbounded and possibly countably infinite support. While Bayesian nonparametrics is a broad field, we focus on the Dirichlet process (DP, Ferguson 1973).

The Dirichlet process is a two-parameter distribution with scale parameter αβ and base distribution G0. A draw G from DP(αβ, G0) is modeled as

  b_1, . . . , b_i, . . . ∼ Beta(1, αβ),    ρ_1, . . . , ρ_i, . . . ∼ G0.

Individual draws from a Beta distribution are the foundation for the stick-breaking construction of the DP (Sethuraman, 1994). Each break point b_i models how much probability mass remains. These break points combine to form an infinite multinomial,

  β_i ≡ b_i Π_{j<i} (1 − b_j),    G ≡ Σ_i β_i δ_{ρ_i},   (1)

where the weights β_i give the probability of selecting atom ρ_i.
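A minimal sketch of the stick-breaking construction in Eqn. (1), truncated at a finite number of break points for illustration; the truncation level and scale parameter below are arbitrary example values.

import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha_beta, truncation):
    """Return truncated stick-breaking weights beta_1..beta_T (Eqn. 1)."""
    b = rng.beta(1.0, alpha_beta, size=truncation)              # break points b_i ~ Beta(1, alpha^beta)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - b)[:-1]])
    return b * remaining                                        # beta_i = b_i * prod_{j<i} (1 - b_j)

weights = stick_breaking_weights(alpha_beta=5.0, truncation=50)
# Each weight beta_i is paired with an atom rho_i drawn from the base distribution G_0;
# in the infinite vocabulary model the atoms are word strings.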
In the infinite vocabulary topic model of Section 3, each topic k draws its own break points and word atoms from the base distribution over strings, and its stick weights are set as β_{kt} = b_{kt} Π_{s<t} (1 − b_{ks}).
Variational inference maximizes the evidence lower bound (ELBO) L,

  log p(W) ≥ E_{q(Z)}[log p(W, Z)] − E_{q(Z)}[log q(Z)] = L.   (2)

Maximizing L is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true distribution and the variational distribution.

Unlike mean-field approaches (Blei et al., 2003), which assume q is a fully factorized distribution, we integrate out the word-level topic distribution vector θ: q(zd | η) is a single distribution over K^{Nd} possible topic configurations rather than a product of Nd multinomial distributions over K topics. Combined with a beta distribution q(b_{kt} | ν^1_{kt}, ν^2_{kt}) for the stick break points, the variational distribution q is

  q(Z) ≡ q(β, z) = Π_{d=1}^{D} q(zd | η) Π_{k=1}^{K} q(bk | ν^1_k, ν^2_k).   (3)

However, we cannot explicitly represent a distribution over all possible strings, so we truncate our variational stick-breaking distribution q(b | ν) to a finite set.

4.1. Truncation Ordered Set

Variational methods typically cope with the infinite dimensionality of nonparametric models by truncating the distribution to a finite subset of all possible atoms that nonparametric distributions consider (Blei & Jordan, 2005; Kurihara et al., 2006; Boyd-Graber & Blei, 2009). This is done by selecting a relatively large truncation index Tk and then stipulating that the variational distribution uses the rest of the available stick at that index, i.e., q(b_{Tk} = 1) ≡ 1. As a consequence, β is zero in expectation under q beyond that index.

However, directly applying such a technique is not feasible here, as truncation is not just a search over a truncation level; we must also choose which words occupy the truncated indices and in what order. Each topic has a unique truncation ordered set (TOS) Tk of limited size that maps every word type w to an integer t; thus t = Tk(w) is the index of the atom ρ_{kt} that corresponds to w. We defer how we choose this mapping until Section 4.3.
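The TOS can be pictured as a per-topic bidirectional mapping between word strings and stick indices. The sketch below is an illustrative data structure, not the authors' implementation; the class and method names are hypothetical.

class TruncationOrderedSet:
    """Per-topic mapping between word types and stick indices t = T_k(w)."""

    def __init__(self, words):
        self.words = list(words)                      # position i holds the word with index i + 1
        self.index = {w: i + 1 for i, w in enumerate(self.words)}

    def __len__(self):                                # current truncation level T_k
        return len(self.words)

    def lookup(self, word):
        """Return the stick index of `word`, or None if it is outside the TOS."""
        return self.index.get(word)

    def append(self, word):
        """Add an unseen word at the end of the TOS (index T_k + 1)."""
        self.words.append(word)
        self.index[word] = len(self.words)

    def reorder(self, ranking_score, keep=None):
        """Re-sort words by descending ranking score and optionally prune to `keep` terms."""
        self.words.sort(key=lambda w: -ranking_score(w))
        if keep is not None:
            self.words = self.words[:keep]
        self.index = {w: i + 1 for i, w in enumerate(self.words)}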
More pressing is how we compute the two variational distributions of interest. For q(z | η), we use local collapsed MCMC sampling (Mimno et al., 2012), and for q(b | ν) we use stochastic variational inference (Hoffman et al., 2010). We describe both in turn.

4.2. Stochastic Inference

Recall that the variational distribution q(zd | η) is a single distribution over the Nd vectors of length K. While this removes the tight coupling between θ and z that often complicates mean-field variational inference, it is no longer as simple to determine the variational distribution q(zd | η) that optimizes Eqn. (2). However, Mimno et al. (2012) showed that Gibbs sampling instantiations of z*_{dn} from the distribution conditioned on other topic assignments results in a sparse, efficient empirical estimate of the variational distribution. In our model, the conditional distribution of a topic assignment of a word with TOS index t = Tk(w_{dn}) is

  q(z_{dn} = k | z_{−dn}, t = Tk(w_{dn})) ∝ (Σ_{m≠n} I[z_{dm} = k] + α^θ_k) exp E_{q(ν)}[log β_{kt}].   (4)

We iteratively sample from this conditional distribution to obtain the empirical distribution φ_{dn} ≡ q̂(z_{dn}) for latent variable z_{dn}, which is fundamentally different from the mean-field approach (Blei et al., 2003).

There are two cases to consider for computing Eqn. (4)—whether a word w_{dn} is in the TOS for topic k or not. First, we look up the word's index t = Tk(w_{dn}). If the word is in the TOS, i.e., t ≤ Tk, the expectations are straightforward (Mimno et al., 2012):

  q(z_{dn} = k) ∝ (Σ_{m≠n} φ_{dmk} + α^θ_k) · exp{Ψ(ν^1_{kt}) + Σ_{s<t} Ψ(ν^2_{ks}) − Σ_{s≤t} Ψ(ν^1_{ks} + ν^2_{ks})}.   (5)

A word early in the TOS ordering then receives a higher probability than words with higher indices. If the word is not in the TOS, i.e., t > Tk, its probability comes from the stick mass remaining beyond the truncation:

  q(z_{dn} = k) ∝ (Σ_{m≠n} φ_{dmk} + α^θ_k) · exp{Σ_{s≤Tk} [Ψ(ν^2_{ks}) − Ψ(ν^1_{ks} + ν^2_{ks})]}.   (6)

This is different from finite vocabulary topic models that set the vocabulary a priori and ignore OOV words.
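The within-document sampler of Eqns. (4)–(6) can be sketched as follows. This is a simplified illustration under assumed data structures (per-topic TOS objects as in the earlier sketch, and per-topic arrays of stick parameters ν), not the authors' released code.

import numpy as np
from scipy.special import digamma

def estimate_phi(doc_words, tos, nu1, nu2, alpha_theta, n_sweeps=5, n_samples=10, rng=None):
    """Empirical estimate of q(z_dn) for one document via collapsed Gibbs sweeps.

    doc_words: list of word strings in the document.
    tos:       per-topic TOS objects with .lookup(word) returning a 1-based index or None.
    nu1, nu2:  per-topic arrays of stick parameters (length T_k each).
    """
    rng = rng or np.random.default_rng()
    K, N = len(tos), len(doc_words)

    # E_q[log beta_kt] for every in-TOS index, plus the mass beyond the truncation (Eqns. 5-6).
    log_beta, log_tail = [], []
    for k in range(K):
        d1, d2, d12 = digamma(nu1[k]), digamma(nu2[k]), digamma(nu1[k] + nu2[k])
        prev = np.concatenate([[0.0], np.cumsum(d2 - d12)[:-1]])   # sum over s < t
        log_beta.append(d1 - d12 + prev)
        log_tail.append(np.sum(d2 - d12))

    z = rng.integers(K, size=N)                        # random initialization
    counts = np.bincount(z, minlength=K).astype(float)
    phi = np.zeros((N, K))

    for sweep in range(n_sweeps + n_samples):
        for n, w in enumerate(doc_words):
            counts[z[n]] -= 1.0                        # condition on the other assignments
            scores = np.empty(K)
            for k in range(K):
                t = tos[k].lookup(w)
                expect = log_beta[k][t - 1] if t is not None else log_tail[k]
                scores[k] = np.log(counts[k] + alpha_theta) + expect
            p = np.exp(scores - scores.max())
            p /= p.sum()
            z[n] = rng.choice(K, p=p)
            counts[z[n]] += 1.0
            if sweep >= n_sweeps:                      # accumulate samples after burn-in
                phi[n, z[n]] += 1.0
    return phi / n_samples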
4.3. Refining the Truncation Ordered Set

In this section, we describe heuristics to update the TOS inspired by MCMC conditional equations, a common practice for updating truncations. One component of a good TOS is that more frequent words should come first in the ordering. This is reasonable because the stick-breaking prior induces a size-biased ordering of the clusters. This has previously been used for truncation optimization for Dirichlet process mixtures and admixtures (Kurihara et al., 2007).

Another component of a good TOS is that words consistent with the underlying base distribution should be ranked higher than those not consistent with the base distribution. This intuition is also consistent with the conditional sampling equations for MCMC inference (Müller & Quintana, 2004); the probability of creating a new table with dish ρ is proportional to αβ G0(ρ) in the Chinese restaurant process.

Thus, to update the TOS, we define the ranking score of word t in topic k as

  R(ρ_{kt}) = p(ρ_{kt} | G0) Σ_{d=1}^{D} Σ_{n=1}^{N_d} φ_{dnk} δ[ω_{dn} = ρ_{kt}],   (7)

sort all words by the scores within that topic, and then use those positions as the new TOS. In Section 5.1, we present online updates for the TOS.

5. Online Inference

Online variational inference seeks to optimize the ELBO L according to Eqn. (2) by stochastic gradient optimization. Because gradients estimated from a single observation are noisy, stochastic inference for topic models typically uses "minibatches" of S documents out of D total documents (Hoffman et al., 2010).

An approximation of the natural gradient of L with respect to ν is the product of the inverse Fisher information and its first derivative (Sato, 2001),

  Δν^1_{kt} = 1 + (D / |S|) Σ_{d∈S} Σ_{n=1}^{N_d} φ_{dnk} δ[ω_{dn} = ρ_{kt}] − ν^1_{kt},   (8)
  Δν^2_{kt} = αβ + (D / |S|) Σ_{d∈S} Σ_{n=1}^{N_d} φ_{dnk} δ[ω_{dn} > ρ_{kt}] − ν^2_{kt},

which leads to an update of ν,

  ν^1_{kt} = ν^1_{kt} + ε · Δν^1_{kt},    ν^2_{kt} = ν^2_{kt} + ε · Δν^2_{kt},   (9)

where ε_i = (τ0 + i)^{−κ} defines the step size of the algorithm in minibatch i. The learning rate κ controls how quickly new parameter estimates replace the old; κ ∈ (0.5, 1] is required for convergence. The learning inertia τ0 prevents premature convergence. We recover the batch setting if S = D and κ = 0.
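A sketch of the stochastic natural-gradient update in Eqns. (8)–(9), written against assumed per-topic arrays of stick parameters; variable names are illustrative.

import numpy as np

def update_sticks(nu1, nu2, phi_stats_eq, phi_stats_gt, D, S, i, alpha_beta, tau0, kappa):
    """One online update of the stick parameters for a single topic (Eqns. 8-9).

    phi_stats_eq[t]: minibatch sum of phi_dnk for words whose TOS index equals t.
    phi_stats_gt[t]: same sum for words whose TOS index is greater than t.
    """
    scale = D / float(S)                              # rescale minibatch counts to the corpus
    delta1 = 1.0 + scale * phi_stats_eq - nu1         # natural gradient for nu^1 (Eqn. 8)
    delta2 = alpha_beta + scale * phi_stats_gt - nu2  # natural gradient for nu^2
    eps = (tau0 + i) ** (-kappa)                      # step size for minibatch i
    return nu1 + eps * delta1, nu2 + eps * delta2     # Eqn. 9

With κ ∈ (0.5, 1] and a learning inertia such as τ0 = 256 (one of the settings reported in the paper), early minibatches move the parameters only gently, matching the convergence conditions stated above.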
5.1. Updating the Truncation Ordered Set

A nonparametric streaming model should allow the vocabulary to dynamically expand as new words appear (e.g., introducing "vuvuzelas" for the 2010 World Cup) and contract as needed to best model the data (e.g., removing "vuvuzelas" after the craze passes). We describe the three components of this process: expanding the truncation, refining the ordering of the TOS, and contracting the vocabulary.

Determining the TOS Ordering. This process depends on the ranking score of a word in topic k at minibatch i, R_{i,k}(ρ). Ideally, we would compute R from all data. However, only a single minibatch is accessible. We have a per-minibatch rank estimate

  r_{i,k}(ρ) = p(ρ | G0) · (D / |S_i|) Σ_{d∈S_i} Σ_{n=1}^{N_d} φ_{dnk} δ[ω_{dn} = ρ],

which we interpolate with our previous ranking

  R_{i,k}(ρ) = (1 − ε) · R_{i−1,k}(ρ) + ε · r_{i,k}(ρ).   (10)

We introduce an additional algorithm parameter, the reordering delay U. We found that reordering after every minibatch (U = 1) was not effective; we explore the role of reordering delay in Section 6. After U minibatches have been observed, we reorder the TOS for each topic according to the words' ranking score R in Eqn. (10); Tk(w) becomes the rank position of w according to the latest R_{i,k}.

Expanding the Vocabulary. Each minibatch contains words we have not seen before. When we see them, we must determine their relative rank position in the TOS, their rank scores, and their associated variational parameters. The latter two issues are relevant for online inference because both are computed via interpolations from previous values in Eqns. (10) and (9). For an unseen word ω, previous values are undefined. Thus, we set R_{i−1,k} for unobserved words to be 0, ν to be 1, and Tk(ω) to Tk + 1 (i.e., increase the truncation and append to the TOS).

Contracting the Vocabulary. To ensure tractability we must periodically prune the words in the TOS. When we reorder the TOS (after every U minibatches), we only keep the top T terms, where T is a user-defined integer. A word type ρ is removed from Tk if its index Tk(ρ) > T, and its previous information (e.g., rank and variational parameters) is discarded. In a later minibatch, if a previously discarded word reappears, it is treated as a new word.
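Putting the three maintenance steps together, the following is a hedged sketch of the per-topic bookkeeping between minibatches; it reuses the hypothetical TruncationOrderedSet class from the Section 4.1 sketch and assumes ranking scores and stick parameters are stored per topic.

import numpy as np

def maintain_tos(tos, rank, nu1, nu2, minibatch_rank, new_words, eps, T, reorder_now):
    """Update one topic's TOS between minibatches.

    rank:           dict word -> interpolated ranking score R_{i,k} (Eqn. 10).
    minibatch_rank: dict word -> per-minibatch estimate r_{i,k}.
    new_words:      words in the minibatch that are not yet in the TOS.
    """
    # Expand: append unseen words with neutral previous values (R = 0, nu = 1).
    for w in new_words:
        tos.append(w)
        rank.setdefault(w, 0.0)
        nu1 = np.append(nu1, 1.0)
        nu2 = np.append(nu2, 1.0)

    # Interpolate ranking scores (Eqn. 10).
    for w in set(rank) | set(minibatch_rank):
        rank[w] = (1.0 - eps) * rank.get(w, 0.0) + eps * minibatch_rank.get(w, 0.0)

    # Reorder and contract every U minibatches: keep only the top T words by ranking score.
    if reorder_now:
        old_order = list(tos.words)
        tos.reorder(lambda w: rank[w], keep=T)
        perm = [old_order.index(w) for w in tos.words]   # map old positions to new positions
        nu1, nu2 = nu1[perm], nu2[perm]
        kept = set(tos.words)
        for w in old_order:
            if w not in kept:
                rank.pop(w, None)                        # pruned words are forgotten entirely
    return tos, rank, nu1, nu2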
[Figure 2: PMI score over minibatches for infvoc under different αβ, with panels (a) de-news, S = 140; (b) de-news, S = 245; (c) 20 newsgroups, S = 155; (d) 20 newsgroups, S = 310, faceted by truncation level T and reordering delay U.]

Figure 2. PMI score on de-news (Figures 2(a) and 2(b), K = 10) and 20 newsgroups (Figures 2(c) and 2(d), K = 50) against different settings of DP scale parameter αβ, truncation level T, and reordering delay U, under learning rate κ = 0.8 and learning inertia τ0 = 64. Our model is more sensitive to αβ and less sensitive to T.

6. Experimental Evaluation

In this section, we evaluate the performance of our infinite vocabulary topic model (infvoc) on two corpora: de-news¹ and 20 newsgroups.² Both corpora were parsed by the same tokenizer and stemmer with a common English stopword list (Bird et al., 2009). First, we examine its sensitivity to both model parameters and online learning rates. Having chosen those parameters, we then compare our model with other topic models with fixed vocabularies.

¹ A collection of daily news items between 1996 and 2000 in English. It contains 9,756 documents, 1,175,526 word tokens, and 20,000 distinct word types. Available at homepages.inf.ed.ac.uk/pkoehn/publications/de-news.
² A collection of discussions in 20 different newsgroups. It contains 18,846 documents and 100,000 distinct word types. It is sorted by date into roughly 60% training and 40% testing data. Available at qwone.com/~jason/20Newsgroups.

[Figure 3: PMI score over minibatches for different decay factors κ and inertias τ0, with panels (a) de-news, S = 245 and K = 10; (b) 20 newsgroups, S = 155 and K = 50.]

Figure 3. PMI score on two datasets with reordering delay U = 20 against different settings of decay factor κ and τ0. A suitable choice of DP scale parameter αβ increases the performance significantly. Learning parameters κ and τ0 jointly define the step decay. Larger step sizes promote better topic evolution.

Evaluation Metric. Typical evaluation of topic models is based on held-out likelihood or perplexity. However, creating a strictly fair comparison of our model against existing topic model algorithms is difficult, as traditional topic model algorithms must discard words that have not previously been observed. Moreover, held-out likelihood is a flawed proxy for how topic models are used in the real world (Chang et al., 2009). Instead, we use two evaluation metrics: topic coherence and classification accuracy.

Pointwise mutual information (PMI), which correlates with human perceptions of topic coherence, measures how words fit together within a topic. Following Newman et al. (2009), we extract document co-occurrence statistics from Wikipedia and score a topic's coherence by averaging the pairwise PMI score (w.r.t. Wikipedia co-occurrence) of the topic's ten highest ranked words. Higher average PMI implies a more coherent topic.

Classification accuracy is the accuracy of a classifier learned from the topic distribution of training documents applied to test documents (the topic model sees both sets). A higher accuracy means the unsupervised topic model better captures the underlying structure of the corpus. To better simulate real-world situations, the 20 newsgroups test/train split is by date (test documents appeared after training documents).
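The coherence metric can be sketched as follows: average the pairwise PMI over a topic's ten highest ranked words, computed from document co-occurrence counts. The doc_count and joint_count lookups below are hypothetical stand-ins for the Wikipedia statistics used in the paper.

import itertools
import math

def topic_pmi(top_words, doc_count, joint_count, n_docs, smoothing=1.0):
    """Average pairwise PMI of a topic's top words given reference co-occurrence statistics.

    doc_count[w]:        number of reference documents containing w.
    joint_count[(w, v)]: number of reference documents containing both w and v
                         (keys assumed stored in the same order the pairs are generated).
    """
    scores = []
    for w, v in itertools.combinations(top_words, 2):
        p_w = doc_count.get(w, 0) / n_docs
        p_v = doc_count.get(v, 0) / n_docs
        p_wv = (joint_count.get((w, v), 0) + smoothing) / n_docs
        if p_w > 0 and p_v > 0:
            scores.append(math.log(p_wv / (p_w * p_v)))
    return sum(scores) / len(scores) if scores else 0.0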
Comparisons. We evaluate the performance of our model (infvoc) against three other models with fixed vocabularies: online variational Bayes LDA (fixvoc-vb, Hoffman et al. 2010), online hybrid LDA (fixvoc-hybrid, Mimno et al. 2012), and dynamic topic models (dtm, Blei & Lafferty 2006).
Including dynamic topic models is not a fair comparison, as its inference requires access to all of the documents in the dataset; unlike the other algorithms, it is not online.

[Figure 4: PMI score over minibatches for infvoc, fixvoc, and dtm, with panels (a) de-news, S = 245, K = 10, κ = 0.6 and τ0 = 64; (b) 20 newsgroups, S = 155, K = 50, κ = 0.8 and τ0 = 64.]

Figure 4. PMI score on two datasets against different models. Our model infvoc yields a better PMI score against fixvoc and dtm; gains are more marked in later minibatches as more and more proper names have been added to the topics. Because dtm is not an online algorithm, we do not have detailed per-minibatch coherence statistics and thus show topic coherence as a box plot per epoch.

Vocabulary. For fixed vocabulary models, we must decide on a vocabulary a priori. We consider two different vocabulary methods: use the first minibatch to define a vocabulary (null) or use a comprehensive dictionary³ (dict). We use the same dictionary to train infvoc's base distribution.

³ http://sil.org/linguistics/wordlists/english/

Experiment Configuration. For all models, we use the same symmetric document Dirichlet prior with αθ = 1/K, where K is the number of topics. Online models see exactly the same minibatches. For dtm, which is not an online algorithm but instead partitions its input into "epochs", we combine documents in ten consecutive minibatches into an epoch (longer epochs tended to have worse performance; this was the shortest epoch that had reasonable runtime).

For online hybrid approaches (infvoc and fixvoc-hybrid), we collect 10 samples empirically from the variational distribution in the E-step with 5 burn-in sweeps. For fixvoc-vb, we run 50 iterations for local parameter updates.

6.1. Sensitivity to Parameters

Figure 2 shows how the PMI score is affected by the DP scale parameter αβ, the truncation level T, and the reordering delay U. The relatively high values of αβ may be surprising to readers used to seeing a DP that instantiates dozens of atoms, but when vocabularies are in the tens of thousands, such scale parameters are necessary to support the long tail. Although we did not investigate such approaches, this suggests that more advanced nonparametric distributions (Teh, 2006) or explicitly optimizing αβ may be useful. Relatively large values of U suggest that accurate estimates of the rank order are important for maintaining coherent topics.

While infvoc is sensitive to parameters related to the vocabulary, once suitable values of those parameters are chosen, it is no more sensitive to learning-specific parameters than other online LDA algorithms (Figure 3), and values used for other online topic models also work well here.
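For concreteness, the configuration described above and in the Figure 1 settings box can be collected in one place. The values below are those reported in the paper for the 20 newsgroups run, shown as an illustrative settings dictionary rather than a canonical configuration format.

# Illustrative settings for the 20 newsgroups experiments reported in the paper.
INFVOC_20NEWS_SETTINGS = {
    "num_topics": 50,            # K
    "alpha_theta": 1.0 / 50,     # symmetric document Dirichlet prior, alpha_theta = 1/K
    "alpha_beta": 5000,          # DP scale parameter
    "truncation": 20000,         # truncation level T
    "minibatch_size": 155,       # S
    "reordering_delay": 20,      # U
    "tau0": 256,                 # learning inertia
    "kappa": 0.6,                # learning rate
    "gibbs_samples": 10,         # samples from the variational distribution per E-step
    "burn_in_sweeps": 5,
}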
6.2. Comparing Algorithms: Coherence

Now that we have some idea of how we should set parameters for infvoc, we compare it against other topic modeling techniques. We used grid search to select parameters for each of the models⁴ and plotted the topic coherence averaged over all topics in Figure 4. While infvoc initially holds its own against other models, it does better and better in later minibatches, since it has managed to gain a good estimate of the vocabulary and the topic distributions have stabilized. Most of the gains in topic coherence come from highly specific proper nouns which are missing from the vocabularies of the fixed-vocabulary topic models. This advantage holds even against dtm, which uses batch inference.

⁴ For the de-news dataset, we select (20 newsgroups parameters in parentheses) minibatch size S ∈ {140, 245} (S ∈ {155, 310}), DP scale parameter αβ ∈ {1k, 2k} (αβ ∈ {3k, 4k, 5k}), truncation size T ∈ {3k, 4k} (T ∈ {20k, 30k, 40k}), and reordering delay U ∈ {10, 20} for infvoc; and topic chain variable tcv ∈ {0.001, 0.005, 0.01, 0.05} for dtm.

6.3. Comparing Algorithms: Classification

For the classification comparison, we consider additional topic models. While we need the most probable topic strings for PMI calculations, classification experiments only need a document's topic vector. Thus, we consider hashed vocabulary schemes. The first, which we call dict-hashing, uses a dictionary for the known words and hashes any other words into the same set of integers. The second, full-hash, used in Vowpal Wabbit,⁵ hashes all words into a set of T integers.

⁵ hunch.net/~vw/
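A minimal sketch of the two hashing schemes: both map a token to an integer id in a fixed-size table; dict-hashing reserves ids for dictionary words and sends everything else through the hash, while full-hash hashes every token. The hash function and table sizes are illustrative; the paper's full-hash baseline uses Vowpal Wabbit's hashing, not this code.

import hashlib

def _bucket(word, num_buckets):
    """Stable hash of a word into [0, num_buckets)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def full_hash_id(word, table_size):
    """full-hash: every word type is hashed into one of `table_size` integers."""
    return _bucket(word, table_size)

def dict_hash_id(word, dictionary_index, num_oov_buckets):
    """dict-hashing: dictionary words keep their own id; OOV words share hashed ids."""
    if word in dictionary_index:
        return dictionary_index[word]
    return len(dictionary_index) + _bucket(word, num_oov_buckets)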
We train 50 topics for all models on the entire dataset and collect the document-level topic distribution for every article. We treat these statistics as features and train an SVM classifier on all training data using Weka (Hall et al., 2009) with default parameters. We then use the classifier to label testing documents with one of the 20 newsgroup labels. A higher accuracy means the model is better capturing the underlying content.
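The evaluation pipeline above (topic proportions as features, then a default-parameter SVM) can be sketched as follows. The paper uses Weka's SVM; here scikit-learn's LinearSVC stands in as an assumed equivalent, and train_theta / test_theta are hypothetical arrays of per-document topic proportions.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def classify_with_topic_features(train_theta, train_labels, test_theta, test_labels):
    """Train an SVM on document-topic proportions and report test accuracy."""
    clf = LinearSVC()                      # stand-in for Weka's default SVM
    clf.fit(np.asarray(train_theta), train_labels)
    predictions = clf.predict(np.asarray(test_theta))
    return accuracy_score(test_labels, predictions)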
model (settings) — accuracy %

S = 155, τ0 = 64, κ = 0.6:
  infvoc (αβ = 3k, T = 40k, U = 10)       52.683
  fixvoc vb-dict                          45.514
  fixvoc vb-null                          49.390
  fixvoc hybrid-dict                      46.720
  fixvoc hybrid-null                      50.474
  fixvoc vb dict-hash                     52.525
  fixvoc vb full-hash (T = 30k)           51.653
  fixvoc hybrid dict-hash                 50.948
  fixvoc hybrid full-hash (T = 30k)       50.948
  dtm-dict (tcv = 0.001)                  62.845

S = 310, τ0 = 64, κ = 0.6:
  infvoc (αβ = 3k, T = 40k, U = 20)       52.317
  fixvoc vb-dict                          44.701
  fixvoc vb-null                          51.815
  fixvoc hybrid-dict                      46.368
  fixvoc hybrid-null                      50.569
  fixvoc vb dict-hash                     48.130
  fixvoc vb full-hash (T = 30k)           47.276
  fixvoc hybrid dict-hash                 51.558
  fixvoc hybrid full-hash (T = 30k)       43.008
  dtm-dict (tcv = 0.001)                  64.186

Table 1. Classification accuracy based on 50 topic features extracted from 20 newsgroups data. Our model (infvoc) outperforms algorithms with a fixed or hashed vocabulary but not dtm, a batch algorithm that has access to all documents.

Our model infvoc captures better topic features than online LDA fixvoc (Table 1) under all settings.⁶ This suggests that in a streaming setting, infvoc can better categorize documents. However, the batch algorithm dtm, which has access to the entire dataset, performs better because it can use later documents to retrospectively improve its understanding of earlier ones. Unlike dtm, infvoc only sees early minibatches once and cannot revise its model when it is tested on later minibatches.

⁶ Parameters were chosen via cross-validation on a 30%/70% dev-test split from the following parameter settings: DP scale parameter αβ ∈ {2k, 3k, 4k} and reordering delay U ∈ {10, 20} (for infvoc only); truncation level T ∈ {20k, 30k, 40k} (for infvoc and fixvoc full-hash models); step decay factors τ0 ∈ {64, 256} and κ ∈ {0.6, 0.7, 0.8, 0.9, 1.0} (for all online models); and topic chain variable tcv ∈ {0.01, 0.05, 0.1, 0.5} (for dtm only).

6.4. Qualitative Example

Figure 1 shows the evolution of a topic in 20 newsgroups about comics as new vocabulary words enter from new minibatches. While topics improve over time (e.g., relevant words like "seri(es)", "issu(e)", and "forc(e)" are ranked higher), interesting words are added throughout training and become prominent after later minibatches are processed (e.g., "captain", "comicstrip", "mutant"). This is not the case for standard online LDA—these words are ignored and the model does not capture such information. In addition, only about 60% of the word types appeared in the SIL English dictionary. Even with a comprehensive English dictionary, online LDA could not capture all the word types in the corpus, especially named entities.

7. Conclusion and Future Work

We proposed an online topic model that, instead of assuming the vocabulary is known a priori, adds and sheds words over time. While our model is better able to create coherent topics, it does not outperform dynamic topic models (Blei & Lafferty, 2006; Wang et al., 2008) that explicitly model how topics change. It would be interesting to allow such models to—in addition to modeling the change of topics—also change the underlying dimensionality of the vocabulary.

In addition to explicitly modeling the change of topics over time, it is also possible to model additional structure within topics. Rather than a fixed, immutable base distribution, modeling each topic with a hierarchical character n-gram model would capture regularities in the corpus that would, for example, allow certain topics to favor different orthographies (e.g., a technology topic might prefer words that start with "i"). While some topic models have attempted to capture orthography for multilingual applications (Boyd-Graber & Blei, 2009), our approach is more robust, and incorporating our approach with models of transliteration (Knight & Graehl, 1997) might allow concepts expressed in one language to better capture concepts in another, further improving the ability of algorithms to capture the evolving themes and topics in large, streaming datasets.

Acknowledgments

The authors thank Chong Wang, Dave Blei, and Matt Hoffman for answering questions and sharing code. We thank Jimmy Lin and the anonymous reviewers for helpful suggestions. Research supported by NSF grant #1018625. Any opinions, conclusions, or recommendations are the authors' and not those of the sponsors.
References

Algeo, John. Where do all the new words come from? American Speech, 55(4):264–277, 1980.
Bird, Steven, Klein, Ewan, and Loper, Edward. Natural Language Processing with Python. O'Reilly Media, 2009.
Blei, David M. and Jordan, Michael I. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144, 2005.
Blei, David M. and Lafferty, John D. Dynamic topic models. In ICML, 2006.
Blei, David M., Ng, Andrew, and Jordan, Michael. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
Blunsom, Phil and Cohn, Trevor. A hierarchical Pitman-Yor process HMM for unsupervised part of speech induction. In ACL, 2011.
Boyd-Graber, Jordan and Blei, David M. Multilingual topic models for unaligned text. In UAI, 2009.
Chang, Jonathan, Boyd-Graber, Jordan, and Blei, David M. Connections between the lines: Augmenting social networks with text. In KDD, 2009.
Clark, Alexander. Combining distributional and morphological information for part of speech induction. 2003.
Cohen, Shay B., Blei, David M., and Smith, Noah A. Variational inference for adaptor grammars. In NAACL, 2010.
Dietz, Laura, Bickel, Steffen, and Scheffer, Tobias. Unsupervised prediction of citation influences. In ICML, 2007.
Ferguson, Thomas S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
Goldwater, Sharon and Griffiths, Thomas L. A fully Bayesian approach to unsupervised part-of-speech tagging. In ACL, 2007.
Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer, Bernhard, Reutemann, Peter, and Witten, Ian H. The WEKA data mining software: An update. SIGKDD Explorations, 11, 2009.
Hoffman, Matthew, Blei, David M., and Bach, Francis. Online learning for latent Dirichlet allocation. In NIPS, 2010.
Jelinek, F. and Mercer, R. Probability distribution estimation from sparse data. IBM Technical Disclosure Bulletin, 28:2591–2594, 1985.
Knight, Kevin and Graehl, Jonathan. Machine transliteration. In ACL, 1997.
Kurihara, Kenichi, Welling, Max, and Vlassis, Nikos. Accelerated variational Dirichlet process mixtures. In NIPS, 2006.
Kurihara, Kenichi, Welling, Max, and Teh, Yee Whye. Collapsed variational Dirichlet process mixture models. In IJCAI, 2007.
Mimno, David, Hoffman, Matthew, and Blei, David. Sparse stochastic inference for latent Dirichlet allocation. In ICML, 2012.
Müller, Peter and Quintana, Fernando A. Nonparametric Bayesian data analysis. Statistical Science, 19(1):95–110, 2004.
Neal, Radford M. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.
Newman, David, Karimi, Sarvnaz, and Cavedon, Lawrence. External evaluation of topic models. In ADCS, 2009.
Paul, Michael and Girju, Roxana. A two-dimensional topic-aspect model for discovering multi-faceted topics. 2010.
Sato, Masa-Aki. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, 2001.
Sethuraman, Jayaram. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
Teh, Yee Whye. A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL, 2006.
Wang, Chong and Blei, David. Variational inference for the nested Chinese restaurant process. In NIPS, 2009.
Wang, Chong and Blei, David M. Truncation-free online variational inference for Bayesian nonparametric models. In NIPS, 2012.
Wang, Chong, Blei, David M., and Heckerman, David. Continuous time dynamic topic models. In UAI, 2008.
Wang, Chong, Paisley, John, and Blei, David. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.
Wei, Xing and Croft, Bruce. LDA-based document models for ad-hoc retrieval. In SIGIR, 2006.
Weinberger, K. Q., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multi-task learning. In ICML, pp. 1113–1120, 2009.
Zhai, Ke, Boyd-Graber, Jordan, Asadi, Nima, and Alkhouja, Mohamad. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In WWW, 2012.