Modeling the Unigram Distribution
Modeling the Unigram Distribution

Irene Nikkarinen* (University of Cambridge, Yle), Tiago Pimentel* (University of Cambridge), Damián E. Blasi (Harvard University, MPI for Evolutionary Anthropology, HSE University), Ryan Cotterell (University of Cambridge, ETH Zürich)
irene.nikkarinen@gmail.com, tp472@cam.ac.uk, dblasi@fas.harvard.edu, ryan.cotterell@inf.ethz.ch
* Equal contribution

arXiv:2106.02289v1 [cs.CL] 4 Jun 2021

Abstract

The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach, being highly dependent on sample size, assigns zero probability to any out-of-vocabulary (oov) word form. As a result, it produces negatively biased probabilities for oov word forms and positively biased probabilities for in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution—claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show it produces much better estimates across a diverse set of 7 languages than the naïve use of neural character-level language models.

Figure 1: Word-level surprisal in Finnish under our two-stage model, two baseline LSTMs trained with either word type or token data, and another LSTM called the generator, trained on an interpolation of both. Lines depict rolling averages.

1 Introduction

Neural networks have yielded impressive gains in sentence-level language modeling across a typologically diverse set of languages (Mikolov et al., 2010; Kalchbrenner et al., 2016; Merity et al., 2018; Melis et al., 2018; Cotterell et al., 2018). Similarly, neural networks constitute the state of the art in modeling the distribution over a language's word types (Pimentel et al., 2020), outperforming non-neural generative models such as Futrell et al.'s (2017) with character-level models. This paper focuses on a less-researched task that is halfway between sentence-level language modeling and word type distributions: modeling the unigram distribution, the distribution over word tokens in a language, which captures both the probability of a word's form and its frequency in the language. In particular, as opposed to sentence-level modeling, the unigram distribution does not consider contextual information.

The unigram distribution is a central object in the science of language, from historical linguistics to psycholinguistics and beyond (Baayen et al., 2016; Diessel, 2017; Divjak, 2019). However, the majority of research on unigram distributions identifies this distribution with sample frequency. This approach results in poor estimates, as it assigns zero probability to out-of-vocabulary words.[1] Further, it is highly dependent on sample size (Baayen, 2002).

[1] In the Turkish Wikipedia, for example, considering a training set of 8 million and a test set of 1 million tokens, 27.4% of test types and 5.3% of test tokens are out-of-vocabulary. Note that, according to Heaps' law, similar behavior would be expected from corpora of any size (Herdan, 1960; Heaps, 1978).
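The sample-frequency approach criticized above amounts to a relative-frequency estimate. The sketch below is an illustration of its failure mode, not code from the paper's repository: any held-out form absent from the training sample receives probability zero, and therefore infinite surprisal.

```python
from collections import Counter

def sample_frequency_unigram(train_tokens):
    """Naive unigram estimate: each form's probability is its relative frequency."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    return lambda word: counts[word] / total  # Counter returns 0 for unseen forms

p = sample_frequency_unigram(["the", "cat", "sat", "on", "the", "mat"])
print(p("the"))  # 0.333...: positively biased toward in-corpus words
print(p("dog"))  # 0.0: any out-of-vocabulary form gets zero probability
```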
The core contribution of our work is motivating the unigram distribution as a worthwhile objective for scientific inquiry—one which is currently understudied in the field. With that in mind, we also present a neuralization of Goldwater et al.'s (2011) two-stage model.[2] The gist of this approach is to use two components to model the Zipfian distribution of word tokens in a language (Zipf, 1935) separately from its phono- or graphotactic distribution. The first component, termed the adaptor, is based on the Pitman–Yor process (PYP; Pitman and Yor, 1997), and has the ability to model the power-law behavior of word tokens. The second, termed the generator, leverages a character-level neural language model to capture structural patterns in written words, e.g. graphotactics and morphology.

[2] While Goldwater et al. (2011) acknowledge that their model could be used in various tasks of learning linguistic structure, they only present results in modeling morphology.

Critically, naïvely training a character-level neural model on either types (i.e. unique word forms) or tokens (i.e. word forms in their original frequencies) should lead to degenerate results. Models trained on natural corpora (i.e. token data) should excel at modeling the most common words of a language, but might poorly approximate the set of infrequent word forms which individuals dynamically produce (e.g. through compositional morphology). On the other hand, training models on the collection of unique word forms (i.e. type data) would give equal weight to typical and atypical productions, potentially leading to poor performance on the most frequent forms, which any individual would recognize as part of their language. In the two-stage model, as we will show, our generator is trained on a dataset interpolated between types and tokens—better modeling the nuance between frequent and infrequent word forms.

By testing our model on a set of languages with diverse morphological and phonological characteristics, we find that it is capable of modeling both frequent and infrequent words, thus producing a better estimate of the unigram distribution than a character-level LSTM. The empirical superiority of our two-stage model is shown in Fig. 1, where the surprisal (i.e. the negative log-probability, measured here in nats) of each token is plotted under four different models for Finnish. Our proposed two-stage model achieves a lower or similar surprisal to the baselines on tokens of all frequencies—with similar patterns arising in all analyzed languages.[3]

[3] As a final contribution of our work, the code used in this paper is available at https://github.com/irenenikk/modelling-unigram. We hope this will encourage future work in psycholinguistics to use the model to accurately investigate the effects of unigram probabilities in rare words.

2 The Unigram Distribution

The unigram distribution is a probability distribution over the possible word forms in a language's lexicon. This probability takes the frequency of a token into account, assigning larger probabilities to word forms which are more likely to be encountered in a language's utterances; it thus differs from word type distributions, such as the one in Pimentel et al. (2020). It is also not conditioned on a word's context, as it considers each word token as a stand-alone unit, in contrast to the task of language modeling, e.g. Mikolov et al. (2010).

2.1 Complex Vocabularies

The composition of spoken vocabularies is structured according to a host of factors. Stemming from articulatory biases, each language has a set of constraints on which sequences of speech sounds can be valid words in it; this is termed the phonotactics of the language. Languages also exhibit small but non-negligible biases in the regular match of forms and meanings (Dingemanse et al., 2015; Pimentel et al., 2019, 2021b). Additionally, expectations about morphology can constrain the production or processing of a given word as belonging to a particular word class (as shown, for instance, in Jabberwocky- and wug-type tasks; Berko 1958, Hall Maudslay and Cotterell 2021). While individuals often have strong intuitions about these patterns, their judgments are typically gradient rather than categorical (Hayes and Wilson, 2008; Gorman, 2013). The effective set of words that naturally occur in linguistic productions is known to be extremely diverse in its composition. Models deployed to explain and predict typical word forms in a given language might fail at capturing these corners of the space of possible forms. If the goal is to produce ecologically valid models that could approximate actual cognitive processes, these atypical forms should be efficiently learned in addition to the most typical productions.

2.2 Imbalanced Frequencies

Zipf (1935) popularized the observation that the frequency of a word in a corpus is inversely proportional to its rank, approximately following a power-law distribution. As such, a small subset of the most common word types dominates the corpus. These extremely frequent words tend to be short in length and exceptionally archaic, in the sense that they preserve traces of previous phonotactic and phonological profiles that might have ceased to be productive. This is particularly relevant when we consider scenarios where substantial portions of the vocabulary might have been borrowed from different sources over time. English is a textbook example: Williams (1986) reports that French, Latin, Germanic and Greek account for
29%, 29%, 26% and 6% of all words' origins in the vocabulary (plus a remaining 10% of diverse origin). The most frequent portion of the vocabulary best preserves the original West Germanic forms, consisting largely of articles, prepositions, pronouns, and auxiliaries. Further, irregular inflections tend to be more common in these highly frequent words (Ackerman and Malouf, 2013; Cotterell et al., 2019). This observation might invite one to omit frequency information from training data, i.e. to use types, in order to balance out the role of the most frequent words.

On the other side of the frequency scale, however, any natural language data would have plenty of low-frequency words that reflect the open boundaries of the vocabulary. These might include nonce words (blick), expressive transformations of other words (a loooooooooooong summer), specialized terms (onabotulinumtoxina), and names, among others. In addition, genuine orthographic misproductions (langague) will be present to some degree. Finally, acronyms (HTML) will be present in all frequency bands. These should be particularly problematic to model, since they do not necessarily follow the language's regular graphotactic patterns.

3 The Two-stage Model

Our model, following Goldwater et al. (2011), consists of two components. The first of these, termed the generator, is a model used to produce a set of i.i.d. word forms {ℓ_k}_{k=1}^K. The second component is termed the adaptor, and it assigns each instance in the training set to a cluster {z_n}_{n=1}^N. Under this model, each token in a dataset has a corresponding cluster z_n which defines the token's word form w_n = ℓ_{z_n}. We note that both word forms ℓ and clusters z are latent variables; only tokens w are observed during training.

Generator. The generator is a model which produces word forms; we use a character-level LSTM here (Hochreiter and Schmidhuber, 1997), as in:

{ℓ_k}_{k=1}^K ∼ p_φ(ℓ) = LSTM(ℓ)    (1)

These word forms ℓ_k are sampled i.i.d.—thus, the same word may be sampled more than once.

Adaptor. Each word form sampled from the generator corresponds to a cluster. The adaptor then assigns a frequency to each of the clusters according to a Pitman–Yor process, whose predictive rule assigns token n to a cluster with probability

p(z_n = k | z_{<n}) ∝ c_k − a,    for an existing cluster 1 ≤ k ≤ K    (2)
p(z_n = K + 1 | z_{<n}) ∝ K·a + b,    for a new cluster    (3)

where 0 ≤ a < 1 and b ≥ 0 are the PYP discount and strength parameters, c_k is the number of previous tokens already assigned to cluster k, and K is the current number of non-empty clusters.
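To make the generative story concrete, the sketch below samples a corpus from a two-stage model. It is an illustrative simplification rather than the paper's implementation: `sample_form` is a hypothetical stand-in for the character-level LSTM generator p_φ, and the adaptor follows the standard Pitman–Yor predictive rule given above.

```python
import random

def sample_two_stage_corpus(n_tokens, a, b, sample_form):
    """Draw n_tokens tokens from a two-stage (generator + PYP adaptor) model.

    `sample_form` stands in for the generator p_phi (a character-level LSTM
    in the paper); `a` (discount) and `b` (strength) are the PYP parameters.
    """
    cluster_labels = []   # word form attached to each cluster
    cluster_counts = []   # number of tokens assigned to each cluster
    tokens = []
    for _ in range(n_tokens):
        K = len(cluster_labels)
        # Unnormalised PYP predictive weights: existing clusters vs. a new one.
        weights = [c - a for c in cluster_counts] + [K * a + b]
        k = random.choices(range(K + 1), weights=weights)[0]
        if k == K:                        # open a new cluster
            cluster_labels.append(sample_form())
            cluster_counts.append(1)
        else:                             # reuse an existing cluster
            cluster_counts[k] += 1
        tokens.append(cluster_labels[k])
    return tokens, cluster_labels, cluster_counts

# Toy generator: random short strings over a small alphabet.
toy_form = lambda: "".join(random.choices("abcdefg", k=random.randint(2, 6)))
tokens, labels, counts = sample_two_stage_corpus(10_000, a=0.5, b=100, sample_form=toy_form)
```

Even with such a crude generator, the adaptor alone already yields the heavy-tailed, Zipf-like token frequencies that the paper attributes to the PYP.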
The resulting probability of a word form w under the adaptor depends on two counts: c_w, the number of tokens assigned to clusters labeled with w, and n_w, the number of distinct clusters to which it has been assigned:

c_w = Σ_{n=1}^N 1{w = ℓ_{z_n}}    (4)

n_w = Σ_{k=1}^K 1{w = ℓ_k}    (5)

In practice, the two-stage model acts as an interpolation between a smoothed 1-gram model, i.e. corpus frequencies, and an LSTM character model. Notably, this model learns per-word smoothing factors and its interpolation weight in an unsupervised manner, through the inference of the PYP parameters. The adaptor is fit using Gibbs sampling, and the generator is trained using a cross-entropy loss on the set of non-empty clusters produced by the adaptor. The generator is thus trained on a more balanced corpus in which the proportion of the most frequent words is reduced; this can be seen as an interpolation between a type and a token dataset.[7]

[7] This model's training is detailed in App. E. For a detailed description of the adaptor see Goldwater et al. (2011).

4 Experiments

Dataset. We use Wikipedia data and evaluate our model on the following languages: English, Finnish, Hebrew, Indonesian, Tamil, Turkish and Yoruba. These languages represent a typologically diverse set—with different levels of morphology, ranging from rich (e.g. Finnish) to poor (e.g. Yoruba), as well as distinct scripts and graphotactic patterns. In preprocessing, we first split the data into sentences and then into tokens using spaCy (Honnibal et al., 2020). We then sample 10^6 tokens as our training set for each language (except for Yoruba, for which we had less data; see App. F for more details). From these, we build two distinct datasets: a token dataset, which corresponds to the list of word forms with their corpus frequencies, and a type dataset, containing the set of unique word forms in the data.
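The preprocessing just described can be sketched as follows; this is an illustration assuming a spaCy 3.x blank pipeline with a rule-based sentencizer, not the paper's exact script.

```python
import spacy
from collections import Counter

def build_datasets(lines, lang="en", n_tokens=1_000_000):
    """Split raw text lines into sentences and tokens, then build the token
    dataset (forms with corpus frequencies) and the type dataset."""
    nlp = spacy.blank(lang)              # language-specific tokenizer only
    nlp.add_pipe("sentencizer")          # rule-based sentence splitting
    tokens = []
    for doc in nlp.pipe(lines):
        for sent in doc.sents:
            tokens.extend(tok.text for tok in sent if not tok.is_space)
        if len(tokens) >= n_tokens:
            break
    token_dataset = Counter(tokens[:n_tokens])   # word form -> frequency
    type_dataset = set(token_dataset)            # unique word forms only
    return token_dataset, type_dataset
```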
Evaluation. We measure the cross-entropy of our models on a held-out test set; this is the standard evaluation for language modeling. We approximate this cross-entropy using a sample mean estimate:

H(p) ≤ H(p, p_model) ≈ −(1/N) Σ_{n=1}^N log p_model(w_n)    (6)

where we assume instances w_n are sampled from the true unigram distribution p(w). Specifically, these token samples {w_n}_{n=1}^N take the form of the token dataset. The model with the lowest cross-entropy is the one that diverges the least from the true distribution.
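A minimal sketch of this evaluation, assuming a `model_logprob` callable (a hypothetical name, not from the paper's code) that returns log p_model(w) for any word form:

```python
import math

def unigram_cross_entropy(test_tokens, model_logprob):
    """Sample-mean estimate of H(p, p_model), eq. (6): average surprisal."""
    return -sum(model_logprob(w) for w in test_tokens) / len(test_tokens)

# With the sample-frequency model from the introduction, a single oov test
# token drives the estimate to infinity, which is why that baseline is unusable.
counts = {"the": 2, "cat": 1, "sat": 1}
total = sum(counts.values())
logprob = lambda w: math.log(counts[w] / total) if w in counts else -math.inf
print(unigram_cross_entropy(["the", "cat", "dog"], logprob))  # inf
```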
Baseline Models. As neural networks yield state-of-the-art performance in language modeling tasks, we expect them to also do well with the unigram distribution. In fact, pseudo-text generated by LSTM-based language models reproduces Zipf's law to some extent (Takahashi and Tanaka-Ishii, 2017; Meister and Cotterell, 2021). Thus, we view state-of-the-art LSTM models as a strong baseline. We train a character-level LSTM language model (Pimentel et al., 2020) to directly approximate the unigram distribution by training it on the token dataset—modeling these tokens at the character level. As a second baseline, we train an LSTM on the type dataset. However, we expect this model to be outperformed by the token one in the unigram distribution task, as information on word frequency is not available during its training. We do not use a word-level 1-gram model (i.e. the words' sample frequency) as a baseline here, since it results in an infinite cross-entropy for any test set containing oov words. We empirically compare four models: two-stage, generator, token, and type.

Table 1: Cross-entropy on the unigram distribution.

Language     Type   Token  Two-stage  Generator
English     12.50    9.04     8.34      11.86
Finnish     15.07   12.94    11.85      14.16
Hebrew      12.62   10.71    10.20      11.76
Indonesian  13.28   10.33     9.55      11.89
Tamil       14.24   12.66    11.76      13.29
Turkish     14.00   11.77    10.86      12.95
Yoruba      11.13   10.04     9.19       9.88

Modeling Tokens. Cross-entropy on the token test sets can be found in Tab. 1. These results show our two-stage model indeed creates a more accurate estimate of the unigram distribution, producing the smallest cross-entropy across all languages.

Frequent vs Infrequent Words. The weaknesses of the token and type models are evinced by Fig. 1. In line with our hypothesis, the token model achieves lower cross-entropy on the most common words, but fails to model the rare ones accurately. The cross-entropy achieved by the type model does not change as much with word frequency, but is higher than the one achieved by the token model for most of the vocabulary. We also see that the two-stage model performs well across all word frequencies. Indeed, this model appears to behave similarly to the token model on frequent words, but obtains a lower cross-entropy on the rare ones, where the role of the generator in the estimated probability is emphasized. We suspect this is the reason behind the two-stage model's success.

Table 2: Average surprisal for singleton types. Column % represents the ratio of singletons in the type test set.

Language     Type   Token  Two-stage  Generator    %
English     18.23   21.71    20.32      18.59     56%
Finnish     19.76   21.42    19.79      19.89     71%
Hebrew      15.81   17.76    17.10      16.30     56%
Indonesian  18.52   21.20    19.15      19.17     61%
Tamil       18.77   20.37    19.26      19.19     71%
Turkish     18.36   19.78    18.50      18.62     65%
Yoruba      15.44   17.69    15.34      16.28     67%

Table 3: Average surprisal for non-singleton types.

Language     Type   Token  Two-stage  Generator
English     14.50   15.94    13.24      14.25
Finnish     15.19   14.89    12.73      14.80
Hebrew      13.36   13.80    12.66      13.20
Indonesian  14.72   15.50    12.80      14.64
Tamil       14.65   14.77    12.95      14.33
Turkish     14.73   14.41    12.41      14.33
Yoruba      11.13   11.97    10.36      11.16

The Long Tail. Fig. 1 also demonstrates that the entropy estimate for rare words grows quickly and exhibits a large variance across models. This reflects the heterogeneous nature of the words that only appear a few times in a corpus. This part of the vocabulary is where the type model achieves the best results for all languages except Yoruba (see Tab. 2).[8] The fact that singletons (also known as hapax legomena), i.e. word forms which occur only once in the test set, form a large portion of the type dataset boosts the type model's performance on rare words. However, in the case of words appearing more than once (see Tab. 3), the two-stage model achieves the best results across languages. Furthermore, on these non-singleton words, the generator outperforms the type and token models in all languages except for Yoruba. This shows the utility of training the generator on an interpolation between types and tokens. In addition, we note that one may justifiably question whether properly modeling singletons is a desirable feature, since they are likely to contain unrepresentative word forms, such as typos, as discussed previously. Indeed, it appears that the two-stage model not only leads to tighter estimates of the unigram distribution, but also allows us to train a better graphotactics model, one capable of modeling both frequent word forms as well as new productions.

[8] We note that we used considerably less training data for Yoruba than for other languages.
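The per-band numbers in Tables 2 and 3 come from grouping test types by their test-set frequency; the sketch below is our own illustration of that protocol (the exact averaging details are an assumption), again using a hypothetical `model_logprob` callable whose log base determines whether results are in nats or bits.

```python
from collections import Counter

def surprisal_by_band(test_tokens, model_logprob):
    """Average surprisal of singleton vs. non-singleton *types* in the test set."""
    freqs = Counter(test_tokens)
    singletons = [w for w, c in freqs.items() if c == 1]
    non_singletons = [w for w, c in freqs.items() if c > 1]
    mean_surprisal = lambda forms: -sum(model_logprob(w) for w in forms) / len(forms)
    return {
        "singleton": mean_surprisal(singletons),
        "non_singleton": mean_surprisal(non_singletons),
        "singleton_ratio": len(singletons) / len(freqs),  # the % column of Tab. 2
    }
```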
Future Work. The results we present focus on analyzing the two-stage model. The generator, though, produces interesting results by itself, modeling non-singleton word forms better than the type and token models in most languages. This suggests that it might be better at modeling the graphotactics of a language than either of these baselines. Future work should explore whether this is indeed the case.

5 Conclusion

In this work, we motivate the unigram distribution as an important task for both the psycholinguistics and natural language processing communities, one that has received too little attention. We present a two-stage model for estimating this distribution—a neuralization of Goldwater et al.'s (2011)—which is motivated by the complex makeup of vocabularies: this model defines the probability of a token by combining the probability of its appearance in the training corpus with the probability of its form. We have shown, through a cross-entropy evaluation, that our model outperforms naïve solutions and is capable of accurately modeling both frequent and infrequent words.

Acknowledgements

Damián E. Blasi acknowledges funding from the Branco Weiss Fellowship, administered by ETH Zürich. Damián E. Blasi's research was also executed within the framework of the HSE University Basic Research Program and funded by the Russian Academic Excellence Project '5-100'.

Ethical Concerns

This paper highlights the importance of modeling the unigram distribution, and presents a model for the task. We do not foresee any reasons for ethical concerns, but we would like to note that the use of Wikipedia as a data source may introduce some bias into our experiments.
References

Farrell Ackerman and Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language, 89(3):429–464.

R. Harald Baayen. 2002. Word Frequency Distributions, volume 18. Springer Science & Business Media.

R. Harald Baayen, Petar Milin, and Michael Ramscar. 2016. Frequency in lexical processing. Aphasiology, 30(11):1174–1220.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305.

Jean Berko. 1958. The child's learning of English morphology. Word, 14(2-3):150–177.

Phil Blunsom, Trevor Cohn, Sharon Goldwater, and Mark Johnson. 2009. A note on the implementation of hierarchical Dirichlet processes. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 337–340, Suntec, Singapore. Association for Computational Linguistics.

Marc Brysbaert, Paweł Mandera, Samantha F. McCormick, and Emmanuel Keuleers. 2019. Word prevalence norms for 62,000 English lemmas. Behavior Research Methods, 51(2):467–479.

Ryan Cotterell, Christo Kirov, Mans Hulden, and Jason Eisner. 2019. On the complexity and typology of inflectional morphological systems. Transactions of the Association for Computational Linguistics, 7:327–342.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Holger Diessel. 2017. Usage-based linguistics. In Oxford Research Encyclopedia of Linguistics. Oxford University Press.

Mark Dingemanse, Damián E. Blasi, Gary Lupyan, Morten H. Christiansen, and Padraic Monaghan. 2015. Arbitrariness, iconicity, and systematicity in language. Trends in Cognitive Sciences, 19(10):603–615.

Dagmar Divjak. 2019. Frequency in Language: Memory, Attention and Learning. Cambridge University Press.

Richard Futrell, Adam Albright, Peter Graff, and Timothy O'Donnell. 2017. A generative model of phonotactics. Transactions of the Association for Computational Linguistics, 5:73–86.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12(68):2335–2382.

Kyle Gorman. 2013. Generative Phonotactics. University of Pennsylvania.

Rowan Hall Maudslay and Ryan Cotterell. 2021. Do syntactic probes probe syntax? Experiments with jabberwocky probing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 124–131, Online. Association for Computational Linguistics.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Harold Stanley Heaps. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press.

Gustav Herdan. 1960. Type–Token Mathematics: A Textbook of Mathematical Linguistics, volume 4. Mouton.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength natural language processing in Python.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alexander Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Clara Meister and Ryan Cotterell. 2021. Language model evaluation beyond perplexity. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In International Conference on Learning Representations.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, pages 1045–1048.

Radford M. Neal. 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, University of Toronto.

Tiago Pimentel, Arya D. McCarthy, Damián Blasi, Brian Roark, and Ryan Cotterell. 2019. Meaning to form: Measuring systematicity as information. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1751–1764, Florence, Italy. Association for Computational Linguistics.

Tiago Pimentel, Irene Nikkarinen, Kyle Mahowald, Ryan Cotterell, and Damián Blasi. 2021a. How (non-)optimal is the lexicon? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4426–4438, Online. Association for Computational Linguistics.

Tiago Pimentel, Brian Roark, and Ryan Cotterell. 2020. Phonotactic complexity and its trade-offs. Transactions of the Association for Computational Linguistics, 8:1–18.

Tiago Pimentel, Brian Roark, Søren Wichmann, Ryan Cotterell, and Damián Blasi. 2021b. Finding concept-specific biases in form–meaning associations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4416–4425, Online. Association for Computational Linguistics.

Jim Pitman and Marc Yor. 1997. The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855–900.

Shuntaro Takahashi and Kumiko Tanaka-Ishii. 2017. Do neural nets learn statistical laws behind natural language? PLOS ONE, 12(12):1–17.

Greg C.G. Wei and Martin A. Tanner. 1990. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704.

Joseph M. Williams. 1986. Origins of the English Language. Simon and Schuster.

George Kingsley Zipf. 1935. The Psycho-biology of Language: An Introduction to Dynamic Philology. Houghton Mifflin.

A Tuned hyperparameters for the two-stage model

Table 4: The optimized values of a and b for the analyzed languages.

Language      a        b
English      0.33     3,000
Finnish      0.36    90,000
Hebrew       0.40    55,000
Indonesian   0.48   180,000
Tamil        0.70    37,000
Turkish      0.33    95,000
Yoruba       0.08   156,000

B Hyperparameter Search

The same hyperparameters are used for both our baseline LSTMs and the generator. We use 3 layers, an embedding size of 128, a hidden size of 512, and a dropout probability of 0.33. Training the two-stage model takes a considerable amount of time (see Tab. 5), so we are not able to do exhaustive hyperparameter tuning. Random search (Bergstra and Bengio, 2012) is used to tune the values of a and b: we run five training procedures, considering the ranges a ∈ [0, 1) and b ∈ [100, 200,000). We tune the hyperparameters for each language by minimizing the model's cross-entropy on the development set, training on a subset of the training data with only 100,000 tokens. The optimal values of a and b are rounded to two decimal places and to the nearest thousand, respectively. Our two-stage model is trained for five iterations of expectation–maximization.
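A sketch of this search procedure, assuming a hypothetical `fit_and_score` callback (not from the paper's code) that trains the two-stage model with the given a and b on the 100,000-token subset and returns its development-set cross-entropy:

```python
import random

def tune_pyp_params(train_subset, dev_set, fit_and_score, n_trials=5, seed=0):
    """Random search (Bergstra & Bengio, 2012) over the PYP parameters a and b."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        a = rng.uniform(0.0, 1.0)          # discount, a in [0, 1)
        b = rng.uniform(100, 200_000)      # strength, b in [100, 200,000)
        score = fit_and_score(a, b, train_subset, dev_set)
        if best is None or score < best[0]:
            # Round as reported in App. A: a to two decimals, b to the thousands.
            best = (score, round(a, 2), round(b, -3))
    return best
```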
C Training time with the two-stage model for each language

Table 5: The training times for the two-stage model in each language. These times were obtained with a single NVIDIA Tesla P100 GPU.

Language     Minutes
English        164
Finnish        170
Hebrew         175
Indonesian     173
Tamil          166
Turkish        174
Yoruba          56
D The development set cross-entropies on the unigram distribution

Table 6: Development set cross-entropy for the baseline models as well as our two-stage model, evaluated on the unigram distribution.

Language     Type   Token  Two-stage  Generator
English     12.50    9.04     8.34       9.18
Finnish     15.08   12.94    11.88      13.05
Hebrew      12.62   10.71    10.20      10.78
Indonesian  13.27   10.30     9.53      10.42
Tamil       14.22   12.65    11.76      12.75
Turkish     13.99   11.74    10.83      12.92
Yoruba      11.10   10.00     9.13       9.83

E Inference

Unfortunately, there is no closed-form solution for inferring the parameters of our two-stage model. In order to obtain a sample of cluster assignments and train the generator to match their labels, we estimate the parameters of both the generator and the adaptor concurrently, freezing one's parameters while training the other. We use a regime corresponding to the Monte Carlo Expectation–Maximization (EM) algorithm to train the model (Wei and Tanner, 1990), which can be found in Algorithm 1. In the E-step, the function GibbsSampler returns the cluster assignments z and the dampened word dataset ℓ obtained via Gibbs sampling from the PYP. We then use this dampened dataset to train the generator in the M-step.

Algorithm 1: Training the two-stage model
1: for i in range(# Epochs) do
2:    // E-Step
3:    z, ℓ ∼ GibbsSampler(a, b, p_φ, {w_n}_{n=1}^N)
4:    // M-Step
5:    for t = 1 up to T do
6:       φ ← φ + η_t Σ_{k=1}^{|ℓ|} ∇_φ log p_φ(ℓ_k)
7:    end for
8: end for

E.1 Gibbs Sampler for Cluster Assignments

The Pitman–Yor process does not have a well-defined posterior probability. Nonetheless, we can use Gibbs sampling to obtain a sample from the posterior distribution over cluster assignments defined by the two-stage model.[9] We build our sampler after the morphological sampler presented by Goldwater et al. (2011). Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method which approximates the posterior of a multivariate distribution. It iteratively samples from the conditional distribution of one variable, given the values of the other dimensions (Neal, 1993). In the Gibbs sampler we use the conditional distribution defined in eq. (7) (presented in Fig. 2), where we know the word form w_n of token n—since it is observable in the corpus—and where the values of all other cluster assignments are fixed. Note that, according to eq. (7), we only assign word tokens to clusters with the same form or create a new cluster—and when a new one is created, its word form is set to w_n. As such, each cluster contains a single shared word form. For each adaptor training iteration, we run the Gibbs sampler for six epochs, and choose the cluster assignments that have the best performance on a development set. Furthermore, we persist the adaptor state across iterations, warm-starting the Gibbs sampler with the cluster assignments of the previous iteration.

[9] This is possible due to the exchangeability of the cluster assignments.
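Since eq. (7) and Fig. 2 are not reproduced in this transcription, the sketch below uses the standard collapsed Pitman–Yor Gibbs conditional, which matches the description above (reuse a non-empty cluster carrying the same form, or open a new one weighted by the generator probability); names such as `gen_prob` are ours, not from the paper's codebase.

```python
import random

def gibbs_sweep(tokens, clusters, labels, counts, a, b, gen_prob):
    """One Gibbs sweep over the cluster assignments z (sketch of App. E.1).

    tokens[n]   -- observed word form w_n
    clusters[n] -- current cluster id of token n
    labels[k]   -- word form attached to cluster k
    counts[k]   -- number of tokens currently assigned to cluster k
    gen_prob(w) -- generator probability p_phi(w) (a character-level LSTM in the paper)
    """
    for n, w in enumerate(tokens):
        counts[clusters[n]] -= 1                      # remove token n from its cluster
        # Candidates: non-empty clusters already labelled with the same form w ...
        cands = [k for k, lab in enumerate(labels) if lab == w and counts[k] > 0]
        weights = [counts[k] - a for k in cands]
        # ... plus a brand-new cluster, whose label would be w itself.
        K = sum(1 for c in counts if c > 0)           # number of non-empty clusters
        cands.append(None)
        weights.append((K * a + b) * gen_prob(w))
        choice = random.choices(cands, weights=weights)[0]
        if choice is None:                            # open a new cluster labelled w
            labels.append(w)
            counts.append(1)
            clusters[n] = len(labels) - 1
        else:
            counts[choice] += 1
            clusters[n] = choice
    return clusters, labels, counts
```

Each adaptor iteration runs several such sweeps and keeps the assignment with the best development-set score, as described above.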
E.2 Training the generator

In order to train the generator on word-form data with a more balanced frequency distribution, a new training set is dynamically created. In this dataset, each word form appears as many times as it has been assigned as a cluster label—the dataset denoted ℓ in Algorithm 1.[10] This regime is similar to using inverse-power transformed counts of the tokens in the corpus (Goldwater et al., 2011).

This new training set allows us to train the generator on an interpolation between a purely type-based and a purely token-based dataset; the interpolation can be controlled through the parameters a and b. Setting the values of a and b to zero will cause the model to favor existing clusters over creating new ones, resulting in every token with the same form being assigned to a single cluster. In this case, the generator parameters would be estimated using the equivalent of a type corpus. Similarly, when a approaches one, or in the limit of b → ∞, fewer tokens are assigned per cluster and the number of single-token clusters grows; this is effectively equivalent to training the generator on tokens. Consequently, non-extreme values of a and b yield a training set that lies between the type and token datasets.

[10] We hotstart the generator model by training it on a type-level dataset before the first adaptor training iteration.
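A minimal sketch of how that dampened training set can be assembled from the adaptor's state (continuing the toy data structures used in the Gibbs-sampler sketch above; an illustration, not the released implementation):

```python
def generator_training_set(labels, counts):
    """Dampened dataset ℓ for the generator (App. E.2): each word form appears
    once per non-empty cluster carrying it, i.e. n_w times rather than c_w times."""
    return [lab for lab, c in zip(labels, counts) if c > 0]

# With a = b = 0 every token of a form shares one cluster, so this collapses to
# the type dataset; as a -> 1 (or b -> infinity) singleton clusters dominate and
# it approaches the token dataset. Intermediate settings interpolate between the two.
```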