Three New Graphical Models for Statistical Language Modelling
Andriy Mnih (amnih@cs.toronto.edu)
Geoffrey Hinton (hinton@cs.toronto.edu)
Department of Computer Science, University of Toronto, Canada

(Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).)

Abstract

The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next word in a sequence given several preceding words by using distributed representations of those words. We show how real-valued distributed representations for words can be learned at the same time as learning a large set of stochastic binary hidden features that are used to predict the distributed representation of the next word from previous distributed representations. Adding connections from the previous states of the binary hidden features improves performance as does adding direct connections between the real-valued distributed representations. One of our models significantly outperforms the very best n-gram models.

1. Introduction

One of the main tasks of statistical language modelling is learning probability distributions of word sequences. This problem is usually reduced to learning the conditional distribution of the next word given a fixed number of preceding words, a task at which n-gram models have been very successful (Chen & Goodman, 1996). Density estimation for discrete distributions is inherently difficult because there is no simple way to do smoothing based on input similarity. Since all discrete values are equally similar (or dissimilar), assigning similar probabilities to similar inputs, which is typically done for continuous inputs, does not work in the discrete case. Representing discrete structures such as words using continuous-valued distributed representations and then assigning probability to these structures based on their representations automatically introduces smoothing into the density estimation problem, making the data sparsity problem less severe.

Recently, significant progress in statistical language modelling has been made by using models that rely on such distributed representations. Feed-forward neural networks that operate on real-valued vector representations of words have been both the most popular and most successful models of this type (Bengio et al., 2003; Morin & Bengio, 2005; Emami et al., 2003). A number of techniques have been proposed to address the main drawback of these models, their long training times (Bengio & Senécal, 2003; Schwenk & Gauvain, 2005; Morin & Bengio, 2005). Hierarchical alternatives to feed-forward networks, which are faster to train and use, have been considered (Morin & Bengio, 2005; Blitzer et al., 2005b), but they do not perform quite as well. Recently, a stochastic model with hidden variables has been proposed for language modelling (Blitzer et al., 2005a). Unlike the previous models, it uses distributed representations that consist of stochastic binary variables as opposed to real numbers. Unfortunately, this model does not scale well to large vocabulary sizes due to the difficulty of inference in the model.

In this paper, we describe three new probabilistic language models that use distributed word representations to define the distribution of the next word in a sequence given several preceding words. We start with an undirected graphical model that uses a large number of hidden binary variables to capture the desired conditional distribution. Then we augment it with temporal connections between hidden units to increase the number of preceding words taken into account without significantly increasing the number of model parameters. Finally, we investigate a model that predicts the distributed representation for the next word using a linear function of the distributed representations of the preceding words, without using any additional latent variables.
2. The Factored Restricted Boltzmann Machine Language Model

Our goal is to design a probabilistic model for word sequences that uses distributed representations for words and captures the dependencies between words in a sequence using stochastic hidden variables. The main choice to be made here is between directed and undirected interactions between the hidden variables and the visible variables representing words. Blitzer et al. (2005a) have proposed a model with directed interactions for this task. Unfortunately, training their model required exact inference, which is exponential in the number of hidden variables. As a result, only a very small number of hidden variables can be used, which greatly limits the expressive power of the model.

In order to be able to handle a large number of hidden variables, we use a Restricted Boltzmann Machine (RBM) that has undirected interactions between multinomial visible units and binary hidden units. While maximum likelihood learning in RBMs is intractable, RBMs can be trained efficiently using contrastive divergence learning (Hinton, 2002) and the learning rule is unaffected when binary units are replaced by multinomial ones.

Assuming that the words we are dealing with come from a finite dictionary of size N_w, we model the observed words as multinomial random variables that take on N_w values.

To define an RBM model for the probability distribution of the next word in a sentence given the word's context w_{1:n-1}, which is the previous n-1 words w_1, ..., w_{n-1}, we must first specify the energy function for a joint configuration of the visible and hidden units. The simplest choice is probably

    E_0(w_n, h; w_{1:n-1}) = - \sum_{i=1}^{n} v_i^T G_i h,    (1)

where v_i is a binary column vector of length N_w with a 1 in the w_i-th position and zeros everywhere else, and h is a column vector of length N_h containing the configuration of the hidden variables. Here matrix G_i specifies the interaction between the multinomial [1] visible unit v_i and the binary hidden units. For simplicity, we have ignored the bias terms, the ones that depend only on h or only on v_n. Unfortunately, this parameterization requires n N_w N_h free parameters, which can be unacceptably large for vocabularies of even moderate size. Another weakness of this model is that each word is associated with a different set of parameters for each of the n positions it can occupy.

[1] Technically, we do not have to assume any distribution for w_1, ..., w_{n-1} since we always condition on these random variables. As a result, the energy function can depend on them in an arbitrary, nonlinear, way.

Both of these drawbacks can be addressed by introducing distributed representations (i.e. feature vectors) for words. Generalization is made easier by sharing feature vectors across all sequence positions, and defining all of the interactions involving a word via its feature vector. This type of parameterization has been used in feed-forward neural networks for modelling symbolic relations (Hinton, 1986) and for statistical language modelling (Bengio et al., 2003).

We represent each word using a real-valued feature vector of length N_f and make the energy depend on the word only through its feature vector. Let R be an N_w x N_f matrix with row i being the feature vector for the ith word in the dictionary. Then the feature vector for word w_i is given by v_i^T R. Using this notation, we define the joint energy of a sequence of words w_1, ..., w_n along with the configuration of the hidden units h as

    E(w_n, h; w_{1:n-1}) = - ( \sum_{i=1}^{n} v_i^T R W_i ) h - b_h^T h - b_r^T R^T v_n - b_v^T v_n.    (2)

Here matrix W_i specifies the interaction between the vector of hidden variables and the feature vector for the visible variable v_i. The vector b_h contains biases for the hidden units, while the vectors b_v and b_r contain biases for words and word features respectively. [2] To simplify the notation we do not explicitly show the dependence of the energy functions and probability distributions on model parameters. In other words, we write P(w_n | w_{1:n-1}) instead of P(w_n | w_{1:n-1}, W_i, R, ...).

[2] The inclusion of per-word biases contradicts our philosophy of using only the feature vectors to define the energy function. However, since the number of these bias parameters is small compared to the total number of parameters in the model, generalization is not negatively affected. In fact the inclusion of per-word biases does not seem to affect generalization but does speed up the early stages of learning.

Defining these interactions on the N_f-dimensional feature vectors instead of directly on the N_w-dimensional visible variables leads to a much more compact parameterization of the model, since typically N_f is much smaller than N_w. Using the same feature matrix R for all visible variables forces it to capture position-invariant information about words as well as further reducing the number of model parameters. With 18,000 words, 1000 hidden units and a context of size 2 (n = 3), the use of 100-dimensional feature vectors reduces the number of parameters by a factor of 25, from 54 million to a mere 2.1 million. As can be seen from Eqs. 1 and 2, the feature-based parameterization constrains each of the visible-hidden interaction matrices G_i to be a product of two low-rank matrices R and W_i, while the original parameterization does not constrain G_i in any way.
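To make the factored parameterization concrete, here is a minimal NumPy sketch of the joint energy in Eq. 2. It is illustrative rather than the authors' code: the array names mirror the notation above (R is N_w x N_f, each W_i is N_f x N_h) and the toy dimensions at the bottom are arbitrary.

    import numpy as np

    def factored_energy(word_ids, h, R, W, b_h, b_r, b_v):
        """Joint energy E(w_n, h; w_{1:n-1}) of Eq. 2 for one word sequence.

        word_ids : the n word indices w_1, ..., w_n
        h        : binary hidden vector of length N_h
        R        : N_w x N_f word feature matrix;  W[i] : N_f x N_h per-position matrix
        b_h, b_r, b_v : hidden-unit, word-feature and per-word biases
        """
        # v_i^T R is just row w_i of R, so one-hot products become row lookups.
        interaction = sum(R[w] @ W[i] @ h for i, w in enumerate(word_ids))
        w_n = word_ids[-1]
        return -(interaction + b_h @ h + b_r @ R[w_n] + b_v[w_n])

    # Toy sizes only: N_w = 10 words, N_f = 4 features, N_h = 5 hidden units, n = 3.
    rng = np.random.default_rng(0)
    N_w, N_f, N_h, n = 10, 4, 5, 3
    R = 0.1 * rng.standard_normal((N_w, N_f))
    W = [0.1 * rng.standard_normal((N_f, N_h)) for _ in range(n)]
    energy = factored_energy([2, 7, 4], rng.integers(0, 2, N_h).astype(float),
                             R, W, np.zeros(N_h), np.zeros(N_f), np.zeros(N_w))

Because each G_i of Eq. 1 is replaced by the product R W_i, the number of per-position parameters grows with N_f rather than with N_w.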
[Figure 1. (a) The diagram for the Factored RBM and the Temporal Factored RBM; the dashed part is included only for the TFRBM. (b) The diagram for the log-bilinear model.]

The joint conditional distribution of the next word and the hidden configuration h is defined in terms of the energy function in Eq. 2 as

    P(w_n, h | w_{1:n-1}) = (1 / Z_c) exp(-E(w_n, h; w_{1:n-1})),    (3)

where Z_c = \sum_{w_n} \sum_h exp(-E(w_n, h; w_{1:n-1})) is a context-dependent normalization term. The conditional distribution of the next word given its context, which is the distribution we are ultimately interested in, can be obtained from the joint by marginalizing over the hidden variables:

    P(w_n | w_{1:n-1}) = (1 / Z_c) \sum_h exp(-E(w_n, h; w_{1:n-1})).    (4)

Thus, we obtain a conditional model, which does not try to model the distribution of w_{1:n-1} since we always condition on those variables.

2.1. Making Predictions

One attractive property of RBMs is that the probability of a configuration of visible units can be computed up to a multiplicative constant in time linear in the number of hidden units (Hinton, 2002). The normalizing constant, however, is usually infeasible to compute because it is a sum of an exponential number of terms (in the number of visible units). In the proposed language model, though, this computation is easy because normalization is performed only over v_n, resulting in a sum containing only N_w terms. Moreover, in some applications, such as speech recognition, we are interested in ratios of probabilities of words from a short list and as a result we do not have to compute the normalizing constant at all (Bengio et al., 2003).

The unnormalized probability of the next word can be efficiently computed using the formula

    P(w_n | w_{1:n-1}) \propto exp(b_r^T R^T v_n + b_v^T v_n) \prod_i (1 + exp(T_i)),    (5)

where T_i is the total input to hidden unit i when w_1, ..., w_n is the input to the model.

Since the normalizing constant for this distribution can be computed in time linear in the dictionary size, exact inference in this model has time complexity linear in the number of hidden variables and the dictionary size. This compares favourably with the complexity of exact inference in the latent variable model proposed in (Blitzer et al., 2005a), which is exponential in the number of hidden variables.
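To make the prediction step concrete, the following sketch (again an illustration under the same assumed shapes, not the authors' implementation) evaluates the unnormalized probability of Eq. 5 for every word in the dictionary and then normalizes, which is exactly the O(N_w N_h) computation described above.

    import numpy as np

    def next_word_distribution(context_ids, R, W, b_h, b_r, b_v):
        """Exact P(w_n | w_{1:n-1}) for the factored RBM via Eq. 5.

        context_ids holds the n-1 context word indices; shapes are as in Eq. 2.
        """
        # Hidden-unit input contributed by the clamped context words.
        context_input = b_h + sum(R[w] @ W[i] for i, w in enumerate(context_ids))
        # Row w of T is the total hidden input when the candidate next word is w.
        T = context_input + R @ W[len(context_ids)]
        # log of Eq. 5: visible-bias terms plus sum over hidden units of log(1 + exp(T_i)).
        log_unnorm = R @ b_r + b_v + np.logaddexp(0.0, T).sum(axis=1)
        log_unnorm -= log_unnorm.max()              # numerical stability before exponentiating
        p = np.exp(log_unnorm)
        return p / p.sum()

Marginalizing out each binary hidden unit contributes a factor (1 + exp(T_i)), which is where the log(1 + exp(.)) terms come from.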
2.2. Learning

The model can be trained on a dataset D of word sequences using maximum likelihood learning. The log-likelihood of the dataset (assuming IID sequences) simplifies to

    L(D) = \sum log P(w_n | w_{1:n-1}),    (6)

where P is defined by the model and the sum is over all word subsequences w_1, ..., w_n of length n in the dataset. L(D) can be maximized w.r.t. the model parameters using gradient ascent. The contributions made by a subsequence w_1, ..., w_n from D to the derivatives of L(D) are given by

    \partial/\partial R log P(w_n | w_{1:n-1}) = \langle \sum_{i=1}^{n} v_i h^T W_i^T + v_n b_r^T \rangle_D - \langle \sum_{i=1}^{n} v_i h^T W_i^T + v_n b_r^T \rangle_M,    (7)

    \partial/\partial W_i log P(w_n | w_{1:n-1}) = \langle R^T v_i h^T \rangle_D - \langle R^T v_i h^T \rangle_M,    (8)

where \langle \cdot \rangle_D and \langle \cdot \rangle_M denote expectations w.r.t. the distributions P(h | w_{1:n}) and P(v_n, h | w_{1:n-1}) respectively. The derivative of the log-likelihood of the dataset w.r.t. each parameter is then simply the sum of the contributions by all n-word subsequences in the dataset.

Computing these derivatives exactly can be computationally expensive because computing an expectation w.r.t. P(v_n, h | w_{1:n-1}) takes O(N_w N_h) time for each context. One alternative is to approximate the expectation using a Monte Carlo method by generating samples from P(v_n, h | w_{1:n-1}) and averaging the expression we are interested in over them. Unfortunately, generating an exact sample from P(v_n, h | w_{1:n-1}) is just as expensive as computing the original expectation. Instead of using exact sampling, we can generate samples from the distribution using a Markov Chain Monte Carlo method such as Gibbs sampling, which involves starting v_n and h in some random configuration and alternating between sampling v_n and h from their respective conditional distributions given by

    P(w_n | h, w_{1:n-1}) \propto exp((h^T W_n^T + b_r^T) R^T v_n + b_v^T v_n),    (9)

    P(h | w_{1:n}) \propto exp(( \sum_{i=1}^{n} v_i^T R W_i + b_h^T ) h).    (10)

However, a large number of alternating updates might have to be performed to obtain a single sample from the joint distribution.

Fortunately, there is an approximate learning procedure called Contrastive Divergence (CD) learning which is much more efficient. It is obtained by making two changes to the MCMC-based learning method. First, instead of starting v_n in a random configuration, we initialize v_n with the state corresponding to w_n. Second, instead of running the Markov chain to convergence, we perform three alternating updates (first h, then v_n, and then h again). While the resulting configuration (v_n, h) (called a "confabulation" or "reconstruction") is not a sample from P(v_n, h | w_{1:n-1}), it has been shown empirically that learning still works well when confabulations are used instead of samples from P(v_n, h | w_{1:n-1}) in the learning rules given above (Hinton, 2002).

In this paper, we train all our models that have hidden variables using CD learning. In some cases, we use a version of CD that, instead of sampling v_n from P(w_n | h, w_{1:n-1}) to obtain a binary vector with a single 1 in the w_n-th position, sets v_n to the vector of probabilities given by P(w_n | h, w_{1:n-1}). This can be viewed as using mean-field updates for the visible units and stochastic updates for the hidden units, which is common practice when training RBMs (Hinton, 2002). Using these mean-field updates instead of stochastic ones reduces the noise in the parameter derivatives, allowing larger learning rates to be used.
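The CD recipe above can be summarized in code. The sketch below performs the three alternating updates (h, then v_n, then h) for a single training case and returns the resulting gradient estimates of Eqs. 7 and 8. It is a hedged illustration of the procedure as described, not the authors' implementation, and it omits the bias gradients and the mini-batch averaging for brevity.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def cd_gradients(word_ids, R, W, b_h, b_r, b_v, rng):
        """CD-style gradient estimates for one n-word training case (Eqs. 7-8).

        The context words are clamped throughout; v_n starts at the observed next
        word, h is sampled, v_n is reconstructed with a mean-field update (Eq. 9),
        and h is updated once more.
        """
        n = len(word_ids)
        context, w_n = word_ids[:-1], word_ids[-1]

        # Feature vectors of the clamped context words (positions 1..n-1).
        feats = [R[w] for w in context]
        context_input = b_h + sum(f @ W[i] for i, f in enumerate(feats))

        # Positive phase: hidden probabilities given the observed next word.
        h_data = sigmoid(context_input + R[w_n] @ W[n - 1])
        h_sample = (rng.random(h_data.shape) < h_data).astype(float)

        # Mean-field reconstruction of v_n from Eq. 9 (a probability vector, not a sample).
        v_recon = softmax(R @ (W[n - 1] @ h_sample + b_r) + b_v)
        f_recon = v_recon @ R                 # expected feature vector of the reconstruction

        # Negative phase: hidden probabilities given the reconstruction.
        h_recon = sigmoid(context_input + f_recon @ W[n - 1])

        # Eq. 8: dW_i = <R^T v_i h^T>_D - <R^T v_i h^T>_M  (an N_f x N_h matrix per position).
        pos_feats = feats + [R[w_n]]
        neg_feats = feats + [f_recon]
        dW = [np.outer(p, h_data) - np.outer(q, h_recon) for p, q in zip(pos_feats, neg_feats)]

        # Eq. 7: dR collects the same statistics mapped back onto the rows of R.
        dR = np.zeros_like(R)
        for i, w in enumerate(context):
            dR[w] += h_data @ W[i].T - h_recon @ W[i].T
        dR[w_n] += h_data @ W[n - 1].T + b_r
        dR -= np.outer(v_recon, h_recon @ W[n - 1].T + b_r)
        return dR, dW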
3. The Temporal Factored RBM

The language model proposed above, like virtually all statistical language models, is based on the assumption that given a word's context, which is the n-1 words that immediately precede it, the word is conditionally independent of all other preceding words. This assumption, which is clearly false, is made in order to keep the number of model parameters relatively small. In n-grams, for example, the number of parameters is exponential in context size, which makes n-grams unsuitable for handling large contexts. While the dependence of the number of model parameters on context size is usually linear for models that use distributed representations for words, for larger context sizes this number might still be very large.

Ideally, a language model should be able to take advantage of indefinitely large contexts without needing a very large number of parameters. We propose a simple extension to the factored RBM language model to achieve that goal (at least in theory) following Sutskever and Hinton (2007). Suppose we want to predict word w_{t+n} from w_1, ..., w_{t+n-1} for some large t. We can apply a separate instance of our model (with the same parameters) to words w_τ, ..., w_{τ+n-1} for each τ in {1, ..., t}, obtaining a distributed representation of the τ-th n-tuple of words in the hidden state h^τ of the τ-th instance of the model.

In order to propagate context information forward through the sequence towards the word we want to predict, we introduce directed connections from h^τ to h^{τ+1} and compute the hidden state of model instance τ+1 using the inputs from the hidden state of model instance τ as well as its visible units. By introducing these dependencies between the hidden states of successive instances and specifying them using a shared parameter matrix A, we make the distribution of w_{t+n} under the model depend on all previous words in the sequence.

3.1. Making Predictions

Exact inference in the resulting model is intractable: it takes time exponential in the number of hidden variables (N_h) in the model being instantiated. However, since predicting the next word given its (near) infinite context is an online problem, we take the filtering approach to making this prediction, which requires storing only the last n-1 words in the sequence. In contrast, exact inference requires that the whole context be stored, which might be infeasible or undesirable.

Unfortunately, exact filtering is also intractable in this model. To get around this problem, we treat the hidden state h^τ as fixed at p^τ when inferring the distribution P(h^{τ+1} | w_{1:τ+n}), where p^τ_j = P(h^τ_j = 1 | w_{1:τ+n-1}). Then, given a sequence of words w_1, ..., w_{t+n-1}, we infer the posterior over the hidden states of the model instances using the following recursive procedure. The posterior for the first model instance is given by Eq. 10. Given the (factorial) posterior for model instance τ, the posterior for model instance τ+1 is computed as

    P(h^{τ+1} | w_{1:τ+n}) \propto exp(( \sum_{i=1}^{n} v_i^T R W_i + (b_h + A p^τ)^T ) h^{τ+1}).    (11)

Thus, computing the posterior for model instance τ+1 amounts to adding A p^τ to that model's vector of biases for the hidden units and performing inference in the resulting model using Eq. 10.

Finally, the predictive distribution over w_n is obtained by applying the procedure described in Section 2.1 to model instance t after shifting its biases appropriately.

3.2. Learning

Maximum likelihood learning in the temporal FRBM model is intractable because it requires performing exact inference. Since we would like to be able to train models with large numbers of hidden variables, we have to resort to an approximate algorithm.

Instead of performing exact inference we simply apply the filtering algorithm from the previous section to compute the approximate posterior for the model. For each model instance the algorithm produces a vector which, when added to that model's vector of hidden unit biases, makes the model posterior be the posterior produced by the filtering operation. Then we compute the parameter updates for each model instance separately using the usual CD learning rule and average them over all instances.

The temporal parameters are learned using the following rule applied to each training sequence separately:

    \Delta A \propto \sum_{τ=1}^{t-1} (p^{τ+1} - \hat{p}^{τ+1})^T p^τ.    (12)

Here \hat{p}^{τ+1}_i is the probability of hidden unit i being on in the confabulation produced by model instance τ+1. See (Sutskever & Hinton, 2007) for a more detailed description of learning in temporal RBMs.

4. A Log-Bilinear Language Model

An alternative to using stochastic binary variables for modelling the conditional distribution of the next word given several previous words is to directly parameterize the distribution and thus avoid introducing stochastic hidden variables altogether. As before, we start by specifying the energy function:

    E(w_n; w_{1:n-1}) = - ( \sum_{i=1}^{n-1} v_i^T R C_i ) R^T v_n - b_r^T R^T v_n - b_v^T v_n.    (13)

Here C_i specifies the interaction between the feature vector of w_i and the feature vector of w_n, while b_r and b_v specify the word biases as in Eq. 2. Just like the energy function for the factored RBM, this energy function defines a bilinear interaction. However, in the FRBM energy function the interaction is between the word feature vectors and the hidden variables, whereas in this model the interaction is between the feature vectors for the context words and the feature vector for the predicted word. Intuitively, the model predicts a feature vector for the next word by computing a linear function of the context word feature vectors. Then it assigns probabilities to all words in the vocabulary based on the similarity of their feature vectors to the predicted feature vector as measured by the dot product. The resulting predictive distribution is given by P(w_n | w_{1:n-1}) = (1 / Z_c) exp(-E(w_n; w_{1:n-1})), where Z_c = \sum_{w_n} exp(-E(w_n; w_{1:n-1})).

This model is similar to the energy-based model proposed in (Bengio et al., 2003). However, our model uses a bilinear energy function while their energy function is a one-hidden-layer neural network.

4.1. Learning

Training the above model is considerably simpler and faster than training the FRBM models because no stochastic hidden variables are involved. The gradients required for maximum likelihood learning are given by

    \partial/\partial C_i log P(w_n | w_{1:n-1}) = \langle R^T v_i v_n^T R \rangle_D - \langle R^T v_i v_n^T R \rangle_M,    (14)

    \partial/\partial R log P(w_n | w_{1:n-1}) = \langle \sum_{i=1}^{n-1} (v_n v_i^T R C_i + v_i v_n^T R C_i^T) + v_n b_r^T \rangle_D - \langle \sum_{i=1}^{n-1} (v_n v_i^T R C_i + v_i v_n^T R C_i^T) + v_n b_r^T \rangle_M.    (15)

Note that averaging over the model distribution in this case is the same as averaging over the predictive distribution over v_n. As a result, these gradients are faster to compute than the gradients for a FRBM.
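The intuition of predicting a feature vector and scoring words by dot products translates directly into code. The following is a minimal sketch of the log-bilinear predictive distribution of Eq. 13 under the notation above; it is illustrative only, not the authors' implementation.

    import numpy as np

    def log_bilinear_predict(context_ids, R, C, b_r, b_v):
        """P(w_n | w_{1:n-1}) for the log-bilinear model of Eq. 13.

        context_ids : indices of the n-1 context words
        R           : N_w x N_f word feature matrix
        C           : list of n-1 matrices, each N_f x N_f (one C_i per position)
        """
        # Predicted feature vector: a linear function of the context feature
        # vectors, plus the word-feature bias b_r.
        predicted = b_r + sum(R[w] @ C[i] for i, w in enumerate(context_ids))
        # Score every word by the dot product of its feature vector with the
        # prediction, add the per-word bias, and normalize (negative energy of Eq. 13).
        scores = R @ predicted + b_v
        scores -= scores.max()                  # numerical stability
        p = np.exp(scores)
        return p / p.sum()

Training then follows Eqs. 14 and 15, where the model-side expectation is simply an average under this predictive distribution, which is why no sampling is needed.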
5. Experimental Results

We evaluated our models using the Associated Press News (APNews) dataset consisting of a text stream of about 16 million words. The dataset has been preprocessed by replacing proper nouns and rare words with the special "proper noun" and "unknown word" symbols respectively, while keeping all the punctuation marks, resulting in 17964 unique words. A more detailed description of preprocessing can be found in (Bengio et al., 2003).

We performed our experiments in two stages. First, we compared the performance of our models to that of n-gram models on a smaller subset of the dataset to determine which type of our models showed the most promise. Then we performed a more thorough comparison between the models of that type and the n-gram models using the full dataset.

In the first experiment, we used a 10 million word training set, a 0.5 million word validation set, and a 0.5 million word test set. We trained one non-temporal and one temporal FRBM, as well as two log-bilinear models. All of our models used 100-dimensional feature vectors and both FRBM models had 1000 stochastic hidden units.

Table 1. Perplexity scores for the models trained on the 10M word training set. The mixture test score is the perplexity obtained by averaging the model's predictions with those of the Kneser-Ney 6-gram model. The first four models use 100-dimensional feature vectors. The FRBM models have 1000 stochastic hidden units. GTn and KNn refer to back-off n-grams with Good-Turing and modified Kneser-Ney discounting respectively.

    Model type       Context size    Model test score    Mixture test score
    FRBM             2               169.4               110.6
    Temporal FRBM    2               127.3               95.6
    Log-bilinear     2               132.9               102.2
    Log-bilinear     5               124.7               96.5
    Back-off GT3     2               135.3               -
    Back-off KN3     2               124.3               -
    Back-off GT6     5               124.4               -
    Back-off KN6     5               116.2               -

The models were trained using mini-batches of 1000 examples each. For the non-temporal models with n = 3, each training case consisted of a two-word context and a vector of probabilities specifying the distribution of the next word for this context. These probabilities were precomputed on the training set and stored in a sparse array (a short sketch of this precomputation is given below). During training of the non-temporal FRBM, v_n was initialized with the precomputed probability vector instead of the usual binary vector with a single 1 indicating the single "correct" next word for this instance of the context. The use of probability vectors as inputs, along with mean-field updates for the visible unit as described in Section 2.2, allowed us to use relatively high learning rates. Since precomputing and storing the probability vectors for all 5-word contexts turned out to be non-trivial, the model with a context size of 5 was trained directly on 6-word sequences.

All the parameters of the non-temporal models except for biases were initialized to small random values. Per-word bias parameters b_v were initialized based on word frequencies in the training set. All other bias parameters were initialized to zero.

The temporal FRBM model was initialized by copying the parameters from the trained FRBM and initializing the temporal parameters (A) to zero. During training, stochastic updates were used for the visible units and v_n was always initialized to a binary vector representing the actual next word in the sequence (as opposed to a distribution over words).

For all models we used weight decay of 10^-4 for word representations and 10^-5 for all other weights. No weight decay was applied to network biases. Weight decay values as well as other learning parameters were chosen using the validation set. Each model was trained until its performance on a subset of the validation set stopped improving. We did not observe overfitting in any of our models, which suggests that using models with more parameters might lead to improved performance.

We compared our models to n-gram models estimated using Good-Turing and modified Kneser-Ney discounting. Training and testing of the n-gram models was performed using programs from the SRI Language Modelling toolkit (Stolcke, 2002). To make the comparison fair, the n-gram models treated punctuation marks (including full stops) as if they were ordinary words, since that is how they were treated by the network models.
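As promised above, here is an illustrative sketch of how the per-context next-word distributions could be precomputed and stored sparsely. The plain-dictionary storage is an assumption made for clarity; the paper only states that a sparse array was used.

    from collections import Counter, defaultdict

    def precompute_next_word_distributions(corpus_ids, n):
        """Empirical P(next word | context) for every (n-1)-word context in a corpus.

        corpus_ids : the training text as a list of word indices
        Returns a dict mapping each context tuple to a sparse {word_id: probability}
        map, i.e. the probability vectors used as training targets for v_n.
        """
        counts = defaultdict(Counter)
        for j in range(len(corpus_ids) - n + 1):
            context = tuple(corpus_ids[j:j + n - 1])
            counts[context][corpus_ids[j + n - 1]] += 1
        return {ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
                for ctx, ctr in counts.items()}

    # Example: with n = 3, each training case is a two-word context plus the
    # sparse distribution of the word that follows it in the training set.
    dists = precompute_next_word_distributions([0, 1, 2, 0, 1, 3, 0, 1, 2], n=3)
    print(dists[(0, 1)])   # {2: 0.666..., 3: 0.333...}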
Models were compared based on their perplexity on the test set. Perplexity is a standard performance measure for probabilistic language models, which is computed as

    P = exp( - (1/N) \sum_{w_{1:n}} log P(w_n | w_{1:n-1}) ),    (16)

where the sum is over all subsequences of length n in the dataset, N is the number of such subsequences, and P(w_n | w_{1:n-1}) is the probability under the model of the nth word in the subsequence given the previous n-1 words. To make the comparison between models of different context size fair, given a test sequence of length L, we ignored the first C words and tested the models at predicting words C+1, ..., L in the sequence, where C was the largest context size among the models compared.

For each network model we also computed the perplexity for a mixture of that model with the best n-gram model (modified Kneser-Ney 6-gram). The predictive distribution for the mixture was obtained simply by averaging the predictive distributions produced by the network and the n-gram model (giving them equal weight). The resulting model perplexities are given in Table 1.

The results show that three of the four network models we tested are competitive with n-gram models. Only the non-temporal FRBM is significantly outperformed by all n-gram models. Adding temporal connections to it, however, to obtain a temporal FRBM, improves the model dramatically, as indicated by a 33% drop in perplexity, suggesting that the temporal connections do increase the effective context size of the model. The log-bilinear models perform quite well: their scores are on par with Good-Turing n-grams with the same context size.

Averaging the predictions of any network model with the predictions of the best n-gram model produced better predictions than any single model, which suggests that the network and n-gram models complement each other in at least some cases. The best results were obtained by averaging with the temporal network model, resulting in a 21% reduction in perplexity over the best n-gram model.
Since the log-bilinear models performed well and, compared to the FRBMs, were fast to train, we used only log-bilinear models in the second experiment. Similarly, we chose to use n-grams with Kneser-Ney discounting as they significantly outperformed the n-grams with Good-Turing discounting. For this experiment we used the full APNews dataset, which was split into a 14 million word training set, 1 million word validation set, and 1 million word test set.

Table 2. Perplexity scores for the models trained on the 14M word training set. The mixture test score is the perplexity obtained by averaging the model's predictions with those of the Kneser-Ney 5-gram model. The log-bilinear models use 100-dimensional feature vectors.

    Model type      Context size    Model test score    Mixture test score
    Log-bilinear    5               117.0               97.3
    Log-bilinear    10              107.8               92.1
    Back-off KN3    2               129.8               -
    Back-off KN5    4               123.2               -
    Back-off KN6    5               123.5               -
    Back-off KN9    8               124.6               -

We trained two log-bilinear models: one with a context of size 5, the other with a context of size 10. The training parameters were the same as in the first experiment with the exception of the learning rate, which was annealed to a lower value than before. [3] The results of the second experiment are summarized in Table 2. The perplexity scores clearly show that the log-bilinear models outperform the n-gram models. Even the smaller log-bilinear model, with a context of size 5, outperforms all the n-gram models. The table also shows that using n-grams with a larger context size does not necessarily lead to better results. In fact, the best performance on this dataset was achieved for the context size of 4. Log-bilinear models, on the other hand, do benefit from larger context sizes: increasing the context size from 5 to 10 decreases the model perplexity by 8%.

[3] This difference is one of the reasons for the log-bilinear model with a context of size 5 performing considerably better in the second experiment than in the first one. The increased training set size and the different test set are unlikely to be the only reasons because the n-gram models actually performed slightly better in the first experiment.

In our second experiment, we used the same training, validation, and test sets as in (Bengio et al., 2003), which means that our results are directly comparable to theirs. [4] In (Bengio et al., 2003), a neural language model with a context size of 5 is trained but its individual score on the test set is not reported. Instead, the score of 109 is reported for the mixture of a Kneser-Ney 5-gram model and the neural language model. That score is significantly worse than the score of 97.3 obtained by the corresponding mixture in our experiments. Moreover, our best single model has a score of 107.8, which means it performs at least as well as their mixture.

[4] Due to limitations of the SRILM toolkit in dealing with very long strings, instead of treating the dataset as a single string, we broke up each of the test, training, and validation sets into 1000 strings of equal length. This might explain why the n-gram scores we obtained are slightly different from the scores reported in (Bengio et al., 2003).

6. Discussion and Future Work

We have proposed three new probabilistic models for language, one of which achieves state-of-the-art performance on the APNews dataset. In two of our models the interaction between the feature vectors for the context words and the feature vector for the next word is controlled by binary hidden variables. In the third model the interaction is modelled directly using a bilinear interaction between the feature vectors. This third model appears to be preferable to the other two models because it is considerably faster to train and make predictions with. We plan to combine these two interaction types in a single model, which, due to its greater representational power, should perform better than any of our current models.

Unlike other language models, with the exception of the energy-based model in (Bengio et al., 2003), our models use distributed representations not only for context words, but for the word being predicted as well. This means that our models should be able to generalize well even for datasets with very large vocabularies. We intend to test this hypothesis by comparing our models to n-gram models on the APNews dataset without removing rare words first.

In this paper we have used only a single layer of hidden units, but now that we have shown how to use factored matrices to take care of the very large number of free parameters, it would be easy to make use of the greedy, multilayer learning algorithm for RBMs that was described in (Hinton et al., 2006).

Acknowledgements

We thank Yoshua Bengio for providing us with the APNews dataset. This research was funded by grants from NSERC, CIAR and CFI.

References

Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.

Bengio, Y., & Senécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. AISTATS'03.

Blitzer, J., Globerson, A., & Pereira, F. (2005a). Distributed latent variable models of lexical co-occurrences. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics.

Blitzer, J., Weinberger, K., Saul, L., & Pereira, F. (2005b). Hierarchical distributed representations for statistical language modeling. Advances in Neural Information Processing Systems 18. Cambridge, MA: MIT Press.

Chen, S. F., & Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics (pp. 310–318). San Francisco: Morgan Kaufmann Publishers.

Emami, A., Xu, P., & Jelinek, F. (2003). Using a connectionist model in a syntactical based language model. Proceedings of ICASSP (pp. 372–375).

Hinton, G. E. (1986). Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 1–12). Amherst, MA.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief networks. Neural Computation, 18, 1527–1554.

Morin, F., & Bengio, Y. (2005). Hierarchical probabilistic neural network language model. AISTATS'05 (pp. 246–252).

Schwenk, H., & Gauvain, J.-L. (2005). Training neural network language models on very large corpora. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (pp. 201–208). Vancouver, Canada.

Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. Proceedings of the International Conference on Spoken Language Processing (pp. 901–904). Denver.

Sutskever, I., & Hinton, G. E. (2007). Learning multilevel distributed representations for high-dimensional sequences. AISTATS'07.