Neural Language Generation: Formulation, Methods, and Evaluation
Neural Language Generation: Formulation, Methods, and Evaluation

Cristina Gârbacea (1), Qiaozhu Mei (1,2)
(1) Department of EECS, University of Michigan, Ann Arbor, MI, USA
(2) School of Information, University of Michigan, Ann Arbor, MI, USA
{garbacea, qmei}@umich.edu

arXiv:2007.15780v1 [cs.CL] 31 Jul 2020

Abstract

Recent advances in neural network-based generative modeling have reignited the hope of having computer systems capable of seamlessly conversing with humans and able to understand natural language. Neural architectures have been employed to generate text excerpts with various degrees of success, in a multitude of contexts and tasks that fulfil various user needs. Notably, high-capacity deep learning models trained on large-scale datasets demonstrate unparalleled abilities to learn patterns in the data even in the absence of explicit supervision signals, opening up a plethora of new possibilities for producing realistic and coherent texts. While the field of natural language generation is evolving rapidly, there are still many open challenges to address. In this survey we formally define and categorize the problem of natural language generation. We review particular application tasks that are instantiations of these general formulations, in which generating natural language is of practical importance. Next we include a comprehensive outline of methods and neural architectures employed for generating diverse texts. Nevertheless, there is no standard way to assess the quality of text produced by these generative models, which constitutes a serious bottleneck towards the progress of the field. To this end, we also review current approaches to evaluating natural language generation systems. We hope this survey will provide an informative overview of formulations, methods, and assessments of neural natural language generation.

1 Introduction

Recent successes in deep generative modeling and representation learning have led to significant advances in natural language generation (NLG), motivated by an increasing need to understand and derive meaning from language. The research field of text generation is fundamental in natural language processing and aims to produce realistic and plausible textual content that is indistinguishable from human-written text (Turing, 1950). Broadly speaking, the goal of predicting a syntactically and semantically correct sequence of consecutive words given some context is achieved in two steps: first estimating a distribution over sentences from a given corpus, and then sampling novel and realistic-looking sentences from the learnt distribution. Ideally, the generated sentences preserve the semantic and syntactic properties of real-world sentences and are different from the training examples used to estimate the model (Zhang et al., 2017b). Language generation is an inherently complex task which requires considerable linguistic and domain knowledge at multiple levels, including syntax, semantics, morphology, phonology, pragmatics, etc. Moreover, texts are generated to fulfill a communicative goal (Reiter, 2019), such as to provide support in decision making, summarize content, translate between languages, converse with humans, make specific texts more accessible, as well as to entertain users or encourage them to change their behaviour. Therefore generated texts should be tailored to their specific audience in terms of appropriateness of content and terminology used (Paris, 2015), as well as for fairness and transparency reasons (Mayfield et al., 2019).

For a long time natural language generation models have been rule-based or have relied on training shallow models on sparse, high-dimensional features. With the recent resurgence of neural networks, neural network-based models for text generation trained with dense vector representations have established unmatched performance and reignited the hope of having machines able to understand language and seamlessly converse with humans. Indeed, generating meaningful and coherent texts is pivotal to many natural language processing tasks. Nevertheless, designing neural networks that can generate coherent text and model long-term dependencies has long been a challenge for natural language generation due to the discrete nature of text data. Beyond that, the ability of neural network models to understand language and ground textual concepts beyond picking up on shallow patterns in the data still remains limited. Finally, evaluation of generative models for natural language is an equally active and challenging research area of significant importance in driving forward the progress of the field.

In this work we formally define the problem of neural text generation in particular contexts and present the diverse practical applications of text generation in Section 2. In Section 3 we include a comprehensive overview of deep learning methodologies and neural model architectures employed in the literature for neural network-based natural language generation. We review methods [...]

[...] generation presents rich practical opportunities.

2.1 Generic / Free-Text Generation

The problem of generic text generation aims to produce realistic text without placing any external user-defined constraints on the model output. Nevertheless, it does consider the intrinsic history of past words generated by the model as context. We formally define the problem of free-text generation below.

Given a discrete sequence of text tokens x = (x_1, x_2, ..., x_n) as input, where each x_i is drawn from a fixed set of symbols, the goal of language modeling is to learn the unconditional probability distribution p(x) of the sequence x. This distribution can be factorized using the chain rule of probability (Bengio et al., 2003) into a product of conditional probabilities:

p(x) = \prod_{i=1}^{n} p(x_i | x_{<i})    (1)
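To make the chain-rule factorization above concrete, the following minimal Python sketch (illustrative only, not taken from the surveyed work) estimates the conditional probabilities from a toy corpus with a count-based bigram model, scores a sequence via the chain rule, and samples a novel sequence token by token. The toy corpus, the add-one smoothing, and the truncation of the history x_{<i} to the previous token are simplifying assumptions; neural language models replace the count table with a learned network but keep the same factorization.

```python
import random
from collections import Counter, defaultdict

# Toy corpus standing in for "a given corpus"; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> the cat sat on the mat </s>",
    "<s> the dog sat on the rug </s>",
    "<s> the cat chased the dog </s>",
]

# Estimate p(x_i | x_{i-1}) with add-one smoothing (a crude stand-in for a neural LM).
bigram_counts = defaultdict(Counter)
vocab = set()
for line in corpus:
    tokens = line.split()
    vocab.update(tokens)
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

def cond_prob(cur, prev):
    counts = bigram_counts[prev]
    return (counts[cur] + 1) / (sum(counts.values()) + len(vocab))

def sequence_prob(tokens):
    """Chain rule: p(x) = prod_i p(x_i | x_{<i}), truncated here to a bigram history."""
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= cond_prob(cur, prev)
    return p

def sample(max_len=10):
    """Ancestral sampling: draw x_i from p(. | x_{i-1}) until </s> or max_len."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        words = sorted(vocab)
        weights = [cond_prob(w, tokens[-1]) for w in words]
        tokens.append(random.choices(words, weights=weights, k=1)[0])
    return tokens

print(sequence_prob("<s> the cat sat on the rug </s>".split()))
print(" ".join(sample()))
```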
may carry different semantics for different readers, 2.3 Constrained Text Generation therefore we want to clarify that in this survey the The problem of constrained text generation is fo- definition of conditional text generation considers cusing on generating coherent and logical texts as context only external attributes to the model and that cover a specific set of concepts (such as pre- not any model intrinsic attributes such as for exam- defined nouns, verbs, entities, phrases or sentence ple, the history of past generated words which is fragments) desired to be present in the output, and/ already included in the formulation of the generic or abide to user-defined rules which reflect the par- text generation problem in Section 2.1. ticular interests of the system user. Lexically con- Conditional language models are used to learn strained text generation (Hokamp and Liu, 2017) the distribution p(x|c) of the data x conditioned on places explicit constraints on independent attribute a specific attribute code c. Similar to the formula- controls and combines these with differentiable ap- tion of generic text generation, the distribution can proximation to produce discrete text samples. In still be decomposed using the chain rule of proba- the literature the distinction between conditional, bility as follows: controlled and constrained text generation is not clearly defined, and these terms are often used in- n terchangeably. In fact, the first work that proposed Y p(x|c) = p(xi |x
first constructed, followed by training a con- eration task is not as simple for machine learning ditional text generation model to capture their models (Lin et al., 2019). co-occurence and generate text which con- tains the constrained keywords. Nevertheless, 2.4 Natural Language Generation Tasks this approach does not guarantee that all de- In what follows we present natural language gener- sired keywords will be preserved during gen- ation tasks which are instances of generic, condi- eration; some of them may get lost and will tional and constrained text generation. All these not be found in the generated output, in par- applications demonstrate the practical value of ticular when there are constraints on simulta- generating coherent and meaningful texts, and that neously including multiple keywords. advances in natural language generation are of im- mediate applicability and practical importance in • Hard-constrained text generation: refers to many downstream tasks. the mandatory inclusion of certain keywords in the output sentences. The matching func- 2.4.1 Neural Machine Translation tion f is in this case a binary indicator, which The field of machine translation is focusing on rules out the possibility of generating infea- the automatic translation of textual content from sible sentences that do not meet the given one language into another language. The field constraints. Therefore, by placing hard con- has undergone major changes in recent years, straints on the generated output, all lexi- with end-to-end learning approaches for auto- cal constraints must be present in the gen- mated translation based on neural networks re- erated output. Unlike soft-constrained mod- placing conventional phrase-based statistical meth- els which are straightforward to design, the ods (Bahdanau et al., 2014), (Wu et al., 2016a). In problem of hard-constrained text generation contrast to statistical models which consist of sev- requires the design of complex dedicated neu- eral sub-components trained and tuned separately, ral network architectures. neural machine translation models build and train a single, large neural network end-to-end by feed- Constrained text generation is useful in many ing it as input textual content in the source lan- scenarios, such as incorporating in-domain ter- guage and retrieving its corresponding translation minology in machine translation (Post and Vilar, in the target language. Neural machine transla- 2018), avoiding generic and meaningless re- tion is a typical example of conditional text gen- sponses in dialogue systems (Mou et al., 2016), in- eration, where the condition encapsulated by the corporating ground-truth text fragments (such as conditional attribute code c is represented by the semantic attributes, object annotations) in image input sentence in the source language and the goal caption generation (Anderson et al., 2017). Typ- task is to generate its corresponding translation in ical attributes used to generate constrained nat- the target language. In addition, neural machine ural language are the tense and the length of translation is also an instance of constrained text the summaries in text summarization (Fan et al., generation given that it imposes the constraint to 2018a), the sentiment of the generated content in generate text in the target language. 
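As a concrete illustration of the soft versus hard constraint distinction discussed above, the sketch below filters and re-ranks hypothetical candidate outputs with a matching function f: the soft variant rewards partial keyword coverage, whereas the hard variant is a binary indicator that rules out any candidate missing a required keyword. The candidate strings, keywords, and the post-hoc re-ranking strategy are illustrative assumptions rather than a method proposed in the cited papers.

```python
def soft_match(candidate, keywords):
    """Soft constraint: fraction of desired keywords covered (higher is better)."""
    tokens = set(candidate.lower().split())
    return sum(k.lower() in tokens for k in keywords) / len(keywords)

def hard_match(candidate, keywords):
    """Hard constraint: binary indicator ruling out candidates missing any keyword."""
    tokens = set(candidate.lower().split())
    return all(k.lower() in tokens for k in keywords)

# Hypothetical candidate outputs from some generation model, plus the keywords
# (e.g. in-domain terminology) that the output is supposed to contain.
candidates = [
    "the patient received the prescribed medication",
    "the patient received treatment",
    "medication was prescribed to the patient yesterday",
]
keywords = ["patient", "medication"]

# Soft-constrained selection keeps the best-covering candidate even if imperfect;
# hard-constrained selection discards every infeasible candidate outright.
best_soft = max(candidates, key=lambda c: soft_match(c, keywords))
feasible = [c for c in candidates if hard_match(c, keywords)]

print("soft pick:     ", best_soft)
print("hard-feasible: ", feasible)
```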
Additional review generation (Mueller et al., 2017), language constraints can be placed on the inclusion in the complexity in text simplification or the style in target sentence of named entities already present text style transfer applications. In addition, con- in the source sentence. In what follows we for- strained text generation is used to overcome lim- mally define the problem of neural machine trans- itations of neural text generation models for dia- lation. logue such as genericness and repetitiveness of re- We denote with Vs the vocabulary of the source sponses (See et al., 2019), (Serban et al., 2016). language and with Vt the vocabulary of the target Nevertheless, generating text under specific lex- language, with |Vt | ≈ |Vs | and Vt ∩ Vs = φ. Let us ical constraints is challenging (Zhang et al., 2020). also denote with with Vs∗ and Vt∗ all possible sen- While for humans it is straightforward to gener- tences under Vs , respectively Vt . Given a source ate sentences that cover a given set of concepts or sentence X = (x1 , x2 , . . . , xl ), X ∈ Vs∗ , xi ∈ Vs , abide to pre-defined rules by making use of their where xi is the ith word in X, ∀i = 1, . . . , l, the commonsense reasoning ability, generative com- goal is to generate the distribution over the possi- monsense reasoning with a constrained text gen- ble output sentences Y = (y1 , y2 , . . . , yl′ ), Y ∈
Vt∗ , yj ∈ Vt , where yj is the j th word in Y , of the most salient pieces of information from the ′ ∀j = 1, . . . , l by factoring Y into a chain of condi- input document(s). tional probabilities with left-to-right causal struc- Text summarization is a conditional text gener- ture using a neural network with parameters θ: ation task where the condition is represented by the given document(s) to be summarized. Addi- ′ +1 lY tional control codes are used in remainder summa- p(Y |X; θ) = p(yt |y0:t−1 , x1:l ; θ) (5) rization offering flexibility to define which parts t=1 of the document(s) are of interest, for eg., remain- ing paragraphs the user has not read yet, or in Special sentence delimiters y0 () and source-specific summarization to condition sum- yl′ +1 () are commonly added to the vocab- maries on the source type of input documents, ulary to mark the beginning and end of target for eg., newspapers, books or news articles. Be- sentence Y . Typically in machine translation the sides being a conditional text generation task, text source and target vocabularies consist of the most summarization is also a typical example of con- frequent words used in a language (for eg., top strained text generation where the condition is set 15,000 words), while the remaining words are such that the length of the resulting summary is replaced with a special token. Every strictly less than the length of the original docu- source sentence X is usually mapped to exactly ment. Unlike machine translation where output one target sentence Y , and there is no sharing of length varies depending on the source content, in words between the source sentence X and the text summarization the length of the output is fixed target sentence Y . and pre-determined. Controlling the length of the Although neural network based approaches to generated summary allows to digest information machine translation have resulted in superior per- at different levels of granularity and define the formance compared to statistical models, they are level of detail desired accounting for particular computationally expensive both in training and in user needs and time budgets; for eg., a document translation inference time. The output of machine can be summarized into a headline, a single sen- translation models is evaluated by asking human tence or a multi-sentence paragraph. In addition, annotators to rate the generated translations on var- explicit constraints can be placed on specific con- ious dimensions of textual quality, or by compar- cepts desired for inclusion in the summary. Most isons with human-written reference texts using au- frequently, named entities are used as constraints tomated evaluation metrics. in text summarization to ensure the generated sum- 2.4.2 Text Summarization mary is specifically focused on topics and events describing them. In addition, in the particular case Text summarization is designed to facilitate a of extractive summarization, there is the additional quick grasp of the essence of an input document constraint that sentences need to be picked explic- by producing a condensed summary of its content. itly from the original document. In what follows This can be achieved in two ways, either by means we formally define the task of text summarization. of extractive summarization or through abstrac- tive/generative summarization. While extractive We consider the input consisting of a sequence summarization (Nallapati et al., 2017) methods of M words x = (x1 , x2 , . . . 
, xM ), xi ∈ VX , i = produce summaries by copy-pasting the relevant 1, . . . , M , where VX is a fixed vocabulary of size portions from the input document, abstractive sum- |VX |. Each word xi is represented as an indicator marization (Rush et al., 2015), (Nallapati et al., vector xi ∈ {0, 1}VX , sentences are represented 2016), (See et al., 2017) algorithms can generate as sequences of indicators and X denotes the set novel content that is not present in the input doc- of all possible inputs. A summarization model ument. Hybrid approaches combining extractive takes x as input and yields a shorter version of it in summarization techniques with a a neural abstrac- the form of output sequence y = (y1 , y2 , . . . , yN ), tive summary generation serve to identify salient with N < M and yj ∈ {0, 1}VY , ∀j = 1, . . . , N . information in a document and generate distilled Abstractive / Generative Summarization We de- Wikipedia articles (Liu et al., 2018b). Character- fine Y ⊂ ({0, 1}VY , . . . , {0, 1}VY ) as the set of all istics of a good summary include brevity, fluency, possible generated summaries of length N , with non-redundancy, coverage and logical entailment y ∈ Y. The summarization system is abstractive
if it tries to find the optimal sequence y ∗ , y ∗ ⊂ Y, task. The condition is represented by the input under the scoring function s : X × Y → R, which document for which the text compression system can be expressed as: needs to output a condensed version. The task is also constrained text generation given the system y ∗ = arg max s(x, y) (6) needs to produce a compressed version of the in- y∈Y put strictly shorter lengthwise. In addition, there Extractive Summarization As opposed to ab- can be further constraints specified when the text stractive approaches which generate novel sen- compression output is desired to be entity-centric. tences, extractive approaches transfer parts from We denote with Ci = {ci1 , ci2 , . . . , cil } the set the input document x to the output y: of possible compression spans and with yy,c a bi- nary variable which equals 1 if the cth token of the y∗ = arg max s(x, x[m1 ,...,mN ] ) (7) ith sentence sˆi in document D is deleted, we are in- m∈{1,...,M }N terested in modeling the probability p(yi,c |D, sˆi ). Abstractive summarization is notably more chal- Following the same definitions from section 2.4.2, lenging than extractive summarization, and al- we can formally define the optimal compressed lows to incorporate real-world knowledge, para- text sequence under scoring function s as: phrasing and generalization, all crucial compo- nents of high-quality summaries (See et al., 2017). In addition, abstractive summarization does not y∗ = arg max s(x, x[m1 ,...,mN ] ) (8) impose any hard constraints on the system out- m∈{1,...,M }N ,mi−1
such as children, people with low education, peo- tion between the source sentence and the target ple who have reading disorders or dyslexia, and sentence can be one-to-many or many-to-one, as non-native speakers of the language. In the lit- simplification involves splitting and merging op- erature text simplification has been addressed at erations (Surya et al., 2018). Furthermore, infre- multiple levels: i) lexical simplification (Devlin, quent words in the vocabulary cannot be simply 1999) is concerned with replacing complex words dropped out and replaced with an unknown to- or phrases with simpler alternatives; ii) syntactic ken as it is typically done in machine translation, simplification (Siddharthan, 2006) alters the syn- but they need to be simplified appropriately corre- tactic structure of the sentence; iii) semantic sim- sponding to their level of complexity (Wang et al., plification (Kandula et al., 2010), sometimes also 2016a). Lexical simplification and content re- known as explanation generation, paraphrases por- duction is simultaneously approached with neu- tions of the text into simpler and clearer variants. ral machine translation models in (Nisioi et al., More recently, end-to-end models for text simpli- 2017), (Sulem et al., 2018c). Nevertheless, text fication attempt to address all these steps at once. simplification presents particular challenges com- Text simplification is an instance of conditional pared to machine translation. First, simplifica- text generation given we are conditioning on the tions need to be adapted to particular user needs, input text to produce a simpler and more readable and ideally personalized to the educational back- version of a complex document, as well as an in- ground of the target audience (Bingel, 2018), stance of constrained text generation since there (Mayfield et al., 2019). Second, text simplifica- are constraints on generating simplified text that is tion has the potential to bridge the communica- shorter in length compared to the source document tion gap between specialists and laypersons in and with higher readability level. To this end, it is many scenarios. For example, in the medical do- mandatory to use words of lower complexity from main it can help improve the understandability a much simpler target vocabulary than the source of clinical records (Shardlow and Nawaz, 2019), vocabulary. We formally introduce the text simpli- address disabilities and inequity in educational fication task below. environments (Mayfield et al., 2019), and assist with providing accessible and timely information Let us denote with Vs the vocabulary of the to the affected population in crisis management source language and with Vt the vocabulary of the (Temnikova, 2012). target language, with |Vt | ≪ |Vs | and Vt ⊆ Vs . Let us also denote with with Vs∗ and Vt∗ all possible 2.4.4 Text Style Transfer sentences under Vs , respectively Vt . Given source Style transfer is a newly emerging task designed sentence X = (x1 , x2 , . . . , xl ), X ∈ Vs∗ , xi ∈ Vs , to preserve the information content of a source where xi is the ith word in X, ∀i = 1, . . . , l, the sentence while delivering it to meet desired pre- goal is to produce the simplified sentence Y = sentation constraints. To this end, it is important (y1 , y2 , . . . , yl′ ), Y ∈ Vt∗ , yj ∈ Vt , where yj is to disentangle the content itself from the style in ′ the j th word in Y , ∀j = 1, . . . 
, l by modeling the which it is presented and be able to manipulate the conditional probability p(Y |X). In the context of style so as to easily change it from one attribute neural text simplification, a neural network with into another attribute of different or opposite po- parameters θ is used to maximize the probability larity. This is often achieved without the need for p(Y |X; θ). parallel data for source and target styles, but ac- Next we highlight differences between machine counting for the constraint that the transferred sen- translation and text simplification. Unlike ma- tences should match in style example sentences chine translation where the output sentence Y from the target style. To this end, text style trans- does not share any common terms with the in- fer is an instance of constrained text generation. put sentence X, in text simplification some or In addition, it is also a typical scenario of con- all of the words in Y might remain identical ditional text generation where we are condition- with the words in X in cases when the terms in ing on the given source text. Style transfer has X are already simple. In addition, unlike ma- been originally used in computer vision applica- chine translation where the mapping between the tions for image-to-image translation (Gatys et al., source sentence and the target sentence is usu- 2016), (Liu and Tuzel, 2016), (Zhu et al., 2017), ally one-to-one, in text simplification the rela- and more recently has been used in natural natu-
ral language processing applications for machine Latent VAE representations are manipulated to translation, sentiment modification to change the generate textual output with specific attributes, for sentiment of a sentence from positive to negative eg. contemporary text written in Shakespeare style and vice versa, word substitution decipherment or improving the positivity sentiment of a sentence and word order recovery (Hu et al., 2017). (Mueller et al., 2017). Style-independent con- The problem of style transfer in language gener- tent representations are learnt via disentangled la- ation can be formally defined as follows. Given tent representations for generating sentences with (1) (2) (n) controllable style attributes (Shen et al., 2017), two datasets X1 = {x1 , x1 , . . . , x1 } and (1) (2) (n) X2 = {x2 , x2 , . . . , x2 } with the same con- (Hu et al., 2017). Language models are employed tent distribution but different unknown styles y1 as style discriminators to learn disentangled rep- and y2 , where the samples in dataset X1 are drawn resentations for unsupervised text style transfer from the distribution p(x1 |y1 ) and the samples tasks such as sentiment modification (Yang et al., in dataset X2 are drawn from the distribution 2018d). p(x2 |y2 ), the goal is to estimate the style trans- 2.4.5 Dialogue Systems fer functions between them p(x1 |x2 ; y1 , y2 ) and p(x2 |x1 ; y1 , y2 ). According to the formulation of A dialogue system, also known as a conversational the problem we can only observe the marginal dis- agent, is a computer system designed to converse tributions p(x1 |y1 ) and p(x2 |y2 ), and the goal is with humans using natural language. To be able to recover the joint distribution p(x1 , x2 |y1 , y2 ), to carry a meaningful conversation with a human which can be expressed as follows assuming the user, the system needs to first understand the mes- existence of latent content variable z generated sage of the user, represent it internally, decide how from distribution p(z): to respond to it and issue the target response using natural language surface utterances (Chen et al., 2017a). Dialogue generation is an instance of con- Z ditional text generation where the system response p(x1 , x2 |y1 , y2 ) = p(z)p(x1 |y1 , z)p(x2 |y2 , z)dz z is conditioned on the previous user utterance and (9) frequently on the overall conversational context. Given that x1 and x2 are independent from each Dialogue generation can also be an instance of other given z, the conditional distribution corre- constrained text generation when the conversation sponding to the style transfer function is defined: is carried on a topic which explicitly involves en- tities such as locations, persons, institutions, etc. Z From an application point of view, dialogue sys- p(x1 |x2 ; y1 , y2 ) = p(x1 , z|x2 ; y1 , y2 )dz tems can be categorized into (Keselj, 2009): Zz = p(x1 |y1 , z)p(x2 |y2 , z)dz • task-oriented dialogue agents: are designed z to have short conversations with a human user = Ez∼p(z|x2,y2 ) [p(x1 |y1 , z)] to help him/ her complete a particular task. (10) For example, dialogue agents embedded into digital assistants and home controllers assist Models proposed in the literature for style transfer with finding products, booking accommoda- rely on encoder-decoder models. Given encoder tions, provide travel directions, make restau- E : X × Y → Z with paramters θE which infers rant reservations and phone calls on behalf of the content z and style y for a given sentence x, their users. 
Therefore, task-oriented dialogue and generator G : Y × Z → X with parameters generation is an instance of both conditional θG which given content z and style y generates and constrained text generation. sentence x, the reconstruction loss can be defined as follows: • non-task oriented dialogue agents or chat- bots: are designed for carrying extended conversations with their users on a wide Lrec =Ex1∼X1 [− log pG (x1 |y1 , E(x1 , y1 ))]+ range of open domains. They are set up to Ex2∼X2 [− log pG (x2 |y2 , E(x2 , y2 ))] mimic human-to-human interaction and un- (11) structured human dialogues in an entertaining
way. Therefore, non-task oriented dialogue is trieved and ranked appropriately from knowledge an instance of conditional text generation. bases and textual documents (Kratzwald et al., 2019), answer generation aims to produce more We formally define the task of dialogue genera- natural answers by using neural models to gener- tion. Generative dialogue models take as input a ate the answer sentence. Question answering can dialogue context c and generate the next response be considered as both a conditional text generation x. The training data consists of a set of samples and constrained text generation task. A question of the form {cn , xn , dn } ∼ psource (c, x, d), where answering system needs to be conditioned on the d denotes the source domain. At testing time, the question that was asked, while simultaneously en- model is given the dialog context c and the target suring that concepts needed to answer the question domain, and must generate the correct response are found in the generated output. x. The goal of a generative dialogue model is A question answering system can be formally to learn the function F : C × D → X which defined as follows. Given a context paragraph performs well on unseen examples from the tar- C = {c1 , c2 , . . . , cn } consisting of n words get domain after seeing the training examples on from word vocabulary V and the query Q = the source domain. The source domain and the {q1 , q2 , . . . , qm } of m words in length, the goal of target domain can be identical; when they differ a question answering system is to either: i) output the problem is defined as zero-shot dialogue gen- a span S = {ci , ci+1 , . . . , ci+j }, ∀i = 1, . . . , n eration (Zhao and Eskenazi, 2018). The dialogue and ∀j = 0, . . . , n − i from the original context generation problem can be summarized as: paragraph C, or ii) generate a sequence of words A = {a1 , a2 , . . . , al }, ak ∈ V, ∀k = 1, . . . , l as Training data : {cn , xn , dn } ∼ psource (c, x, d) the output answer. Below we differentiate between Testing data : {c, x, d} ∼ ptarget (c, x, d) multiple types of question answering tasks: Goal : F : C × D → X • Factoid Question Answering: given a descrip- (12) tion of an entity (person, place or item) for- A common limitation of neural networks for di- mulated as a query and a text document, the alogue generation is that they tend to generate task is to identify the entity referenced in the safe, universally relevant responses that carry little given piece of text. This is an instance of meaning (Serban et al., 2016), (Li et al., 2016a), both conditional and constrained text gener- (Mou et al., 2016); for example universal replies ation, given conditioning on the input ques- such as “I don’t know” or “something” frequently tion and constraining the generation task to occur in the training set are likely to have high be entity-centric. Factoid question answering estimated probabilities at decoding time. Addi- methods combine word and phrase-level rep- tional factors that impact the conversational flow resentations across sentences to reason about in generative models of dialogue are identified as entities (Iyyer et al., 2014), (Yin et al., 2015). repetitions and contradictions of previous state- ments, failing to balance specificity with gener- • Reasoning-based Question Answering: given icness of the output, and not taking turns in ask- a collection of documents and a query, ing questions (See et al., 2019). 
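One remedy for the generic-response problem discussed above, in the spirit of the diversity-promoting objective of Li et al. (2016a), is to re-rank candidate responses by how much more likely they are given the context than unconditionally. The sketch below uses made-up log-probabilities purely to illustrate the scoring; the candidate set, the numbers, and the weight lambda are assumptions, not values from the surveyed work.

```python
# Hypothetical candidate responses with toy log-probabilities: log_p_cond is
# log p(response | context) under a conditional dialogue model, log_p_marg is
# log p(response) under an unconditional language model. The generic reply is
# likely under both models, which is exactly what the penalty below exploits.
candidates = {
    "i don't know":             {"log_p_cond": -2.0, "log_p_marg": -1.5},
    "the museum opens at nine": {"log_p_cond": -2.4, "log_p_marg": -6.0},
    "something":                {"log_p_cond": -2.2, "log_p_marg": -1.8},
}

def mmi_score(stats, lam=0.5):
    """MMI-style re-ranking: reward responses that are much more likely given
    the context than they are unconditionally (i.e. penalize generic replies)."""
    return stats["log_p_cond"] - lam * stats["log_p_marg"]

for response in sorted(candidates, key=lambda r: mmi_score(candidates[r]), reverse=True):
    print(f"{mmi_score(candidates[response]):6.2f}  {response}")
```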
Furthermore, it the task is to reason, gather, and synthe- is desirable for generated dialogues to incorporate size disjoint pieces of information spread explicit personality traits (Zheng et al., 2019) and within documents and across multiple docu- control the sentiment (Kong et al., 2019a) of the ments to generate an answer (De Cao et al., generated response to resemble human-to-human 2019). The task involves multi-step rea- conversations. soning and understanding of implicit rela- tions for which humans typically rely on 2.4.6 Question Answering their background commonsense knowledge Question answering systems are designed to find (Bauer et al., 2018). The task is conditional and integrate information from various sources to given that the system generates an answer provide responses to user questions (Fu and Feng, conditioned on the input question, and may 2018). While traditionally candidate answers con- be constrained when the information across sist of words, phrases or sentence snippets re- documents is focused on entities or specific
concepts that need to be incorporated in the and end of a sentence, as well as the un- generated answer. known token used for all words not present in the vocabulary V , and V ∗ denotes all • Visual Question Answering: given an image possible sentences over V . Given training set and a natural language question about the im- D = {(I, y ∗ )} containing m pairs of the form age, the goal is to provide an accurate natural (Ij , yj∗ ), ∀j = 1, . . . , m consisting of input im- language answer to the question posed about age Ij and its corresponding ground-truth caption the image (Antol et al., 2015). By its nature yj∗ = (yj∗1 , yj∗2 , . . . , yj∗M ), yj∗ ∈ V ∗ and yj∗k ∈ the task is conditional, and can be constraint V, ∀k = 1, . . . , M , we want to maximize the prob- when specific objects or entities in the image abilistic model p(y|I; θ) with respect to model pa- need to be included in the generated answer. rameters θ. Question answering systems that meet var- ious information needs are proposed in the literature, for eg., for answering mathemat- 2.4.8 Narrative Generation / Story Telling ical questions (Schubotz et al., 2018), med- ical information needs (Wiese et al., 2017), Neural narrative generation aims to produce co- (Bhandwaldar and Zadrozny, 2018), quiz bowl herent stories automatically and is regarded as an questions (Iyyer et al., 2014), cross-lingual and important step towards computational creativity multi-lingual questions (Loginova et al., 2018). (Gervás, 2009). Unlike machine translation which In practical applications of question answering, produces a complete transduction of an input sen- users are typically not only interested in learning tence which fully defines the target semantics, the exact answer word, but also in how this is story telling is a long-form open-ended text gen- related to other important background information eration task which simultaneously addresses two and to previously asked questions and answers separate challenges: the selection of appropriate (Fu and Feng, 2018). content (“what to say”) and the surface realization 2.4.7 Image / Video Captioning of the generation (“how to say it”)(Wiseman et al., 2017). In addition, the most difficult aspect of Image captioning is designed to generate captions neural story generation is producing a a coherent in the form of textual descriptions for an image. and fluent story which is much longer than the This involves the recognition of the important ob- short input specified by the user as the story ti- jects present in the image, as well as object prop- tle. To this end, many neural story generation erties and interactions between objects to be able models assume the existence of a high-level plot to generate syntactically and semantically correct (commonly specified as a one-sentence outline) natural language sentences (Hossain et al., 2019). which serves the role of a bridge between titles and In the literature the image captioning task has been stories (Chen et al., 2019a), (Fan et al., 2018b), framed from either a natural language generation (Xu et al., 2018b), (Drissi et al., 2018), (Yao et al., perspective (Kulkarni et al., 2013), (Chen et al., 2019). 
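Returning to the captioning objective defined above, the following sketch spells out the maximum-likelihood training criterion, the sum over positions k of log p(y*_k | y*_{<k}, I; theta), for a single (image, caption) pair. The next_word_probs stub is a hypothetical stand-in for a real encoder-decoder captioning model and is hard-coded so that the example runs end to end.

```python
import math

# Hypothetical interface: a captioning model that, given an image representation
# and the caption prefix, returns a probability distribution over the next word.
def next_word_probs(image, prefix):
    # A real model would encode `image` and condition a decoder on it together
    # with `prefix`; this stub ignores both and returns a fixed toy distribution.
    return {"a": 0.3, "dog": 0.25, "on": 0.15, "grass": 0.1, "<eos>": 0.2}

def caption_log_likelihood(image, caption):
    """MLE objective for one (image, caption) pair: sum_k log p(y*_k | y*_{<k}, I)."""
    log_p = 0.0
    for k, word in enumerate(caption):
        probs = next_word_probs(image, caption[:k])
        log_p += math.log(probs.get(word, 1e-8))  # tiny floor for out-of-vocabulary words
    return log_p

# Toy "dataset" D = {(I_j, y*_j)}; the image is just a placeholder identifier here.
dataset = [("image_0042", ["a", "dog", "on", "grass", "<eos>"])]
total = sum(caption_log_likelihood(img, cap) for img, cap in dataset)
print("training objective (to maximize):", total)
```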
Therefore, narrative generation is a con- 2017b) where each system produces a novel sen- strained text generation task since explicit con- tence, or from a ranking perspective where exist- straints are placed on which concepts to include ing captions are ranked and the top one is selected in the narrative so as to steer the generation in par- (Hodosh et al., 2013). Image/ video captioning is ticular topic directions. In addition, another con- a conditional text generation task where the cap- straint is that the output length needs to be strictly tion is conditioned on the input image or video. In greater than the input length. We formally define addition, it can be a constrained text generation the task of narrative generation below. task when specific concepts describing the input need to be present in the generated output. Assuming as input to the neural story generation Formally, the task of image/ video captioning system the title x = x1 , x2 , . . . , xI consisting of takes as input an image or video I and generates I words, the goal is to produce a comprehensible a sequence of words y = (y1 , y2 , . . . , yN ), y ∈ and logical story y = y1 , y2 , . . . , yJ of J words in V ∗ and yi ∈ V, ∀i = 1, . . . , N , where V de- length. Assuming the existence of a one sentence notes the vocabulary of output words and in- outline z = z1 , z2 , . . . , zK that contains K words cludes special tokens to mark the beginning for the entire story, the latent variable model for
neural story generation can be formally expressed: lines of sentences. The process is interactive and the author can keep modifying terms to reflect his X writing intent. Poetry generation is a constrained P (y|x; θ, γ) = P (z|x; θ)P (y|x, z; γ) (13) text generation problem since user defined con- z cepts need to be included in the generated poem. where P (z|x; θ) defines a planning model parame- At the same time, it can also be a conditional text terized by θ and P (y|x, z; γ) defines a generation generation problem given explicit conditioning on model parameterized by γ. the stylistic features of the poem. We define the The planning model P (z|x; θ) receives an input petry generation task below. the one sentence title z for the narrative and gener- Given as input a set of keywords that ates the narrative outline given the title: summarize an author’s writing intent K = {k1 , k2 , . . . , k|K| }, where each ki ∈ V, i = K Y 1, . . . , |K| is a keyword term from vocabulary V , P (z|x; θ) = P (zk |x, z
time between the word-level loss function opti- timization problem can therefore be expressed as: mized by MLE and humans focusing on whole se- X quences of poem lines and assessing fine-grained max log p(r|a) (18) criteria of the generated text such as fluency, coher- (a,r)∈D ence, meaningfulness and overall quality. These Generating long, well-structured and informa- human evaluation criteria are modeled and incor- tive reviews requires considerable effort when porated into the reward function of a mutual re- written by human users and is a similarly challeng- inforcement learning framework for poem genera- ing task to do automatically (Li et al., 2019a). tion (Yi et al., 2018). For a detailed overview of poetry generation we point the reader to (Oliveira, 2.4.11 Miscellaneous tasks related to natural 2017). language generation Handwriting synthesis aims to automatically gen- 2.4.10 Review Generation erate data that resembles natural handwriting and is a key component in the development of intelli- Product reviews allow users to express opinions gent systems that can provide personalized experi- for different aspects of products or services re- ences to humans (Zong and Zhu, 2014). The task ceived, and are popular on many online review of handwritten text generation is very much analo- websites such as Amazon, Yelp, Ebay, etc. These gous to sequence generation. Given as input a user online reviews encompass a wide variety of writ- defined sequence of words x = (x1 , x2 , . . . , xT ) ing styles and polarity strengths. The task of re- which can be either typed into the computer sys- view generation is similar in nature to sentiment tem or fed as an input image I to capture the user’s analysis and a lot of past work has focused on writing style, the goal of handwriting generation is identifying and extracting subjective content in re- to train a neural network model which can produce view data (Liu, 2015), (Zhao et al., 2016). Auto- a cursive handwritten version of the input text to matically generating reviews given contextual in- display under the form of output image O (Graves, formation focused on product attributes, ratings, 2013). Handwriting generation is a conditional sentiment, time and location is a meaningful con- generation task when the system is conditioning ditional text generation task. Common product at- on the input text. In addition, it is also a con- tributes used in the literature are the user ID, the strained text generation task since the task is con- product ID, the product rating or the user senti- strained on generating text in the user’s own writ- ment for the generated review (Dong et al., 2017), ing style. While advances in deep learning have (Tang et al., 2016). The task can also be con- given computers the ability to see and recognize strained text generation when topical and syntac- printed text from input images, generating cursive tic characteristics of natural languages are explic- handwriting is a considerably more challenging itly specified as constraints to incorporate in the problem (Alonso et al., 2019). Character bound- generation process. We formally define the review aries are not always well-defined, which makes it generation task below. hard to segment handwritten text into individual Given as input a set of product attributes a = pieces or characters. In addition, handwriting eval- (a1 , a2 , . . . , a|a| ) of fixed length |a|, the goal is to uation is ambiguous and not well defined given the generate a product review r = (y1 , y2 , . . . 
, y|r| ) of multitude of existent human handwriting style pro- variable length |r| by maximizing the conditional files (Mohammed et al., 2018). probability p(r|a): Other related tasks where natural language generation plays an important role are generat- |r| ing questions, arguments, counter-arguments and Y p(r|a) = p(yt |y
tasks illustrate the widespread importance of hav- Autoregressive (Fully-observed) generative ing robust models for natural language generation. models model the observed data directly without introducing dependencies on any new unobserved 3 Models local variables. Assuming all items in a sequence x = (x1 , x2 , . . . , xN ) are fully observed, the prob- Neural networks are used in a wide range of su- ability distribution p(x) of the data is modeled in pervised and unsupervised machine learning tasks an auto-regressive fashion using the chain rule of due to their ability to learn hierarchical representa- probability: tions from raw underlying features in the data and model complex high-dimensional distributions. A N Y wide range of model architectures based on neural p(x1 , x2 , . . . , xN ) = p(xi |x1 , x2 , . . . , xi−1 ) networks have been proposed for the task of nat- i=1 ural language generation in a wide variety of con- (19) texts and applications. In what follows we briefly Training autoregressive models is done by max- discuss the main categories of generative models imizing the data likelihood, allowing these mod- in the literature and continue with presenting spe- els to be evaluated quickly and exactly. Sampling cific models for neural language generation. from autoregressive models is exact, but it is ex- Deep generative models have received a lot of pensive since samples need to be generated in se- attention recently due to their ability to model quential order. Extracting representions from fully complex high-dimensional distributions. These observed models is challenging, but this is cur- models combine uncertainty estimates provided rently an active research topic. by probabilistic models with the flexibility and Latent variable generative models explain hid- scalability of deep neural networks to learn in an den causes by introducing an unobserved random unsupervised way the distribution from which data variable z for every observed data point. The data is drawn. Generative probabilistic models are use- likelihood p(x) is computed as follows: ful for two reasons: i) can perform density esti- Z mation and inference of latent variables, and ii) can sample efficiently from the probability density log p(x) = pθ (x|z)p(z)dz = Ep(z) [pθ (x|z)] represented by the input data and generate novel (20) content. Deep generative models can be classified Latent models present the advantage that sampling into either explicit or implicit density probabilistic is exact and cheap, while extracting latent features models. On the one hand, explicit density mod- from these models is straightforward. They are els provide an explicit parametric specification of evaluated using the lower bound of the log like- the data distribution and have tractable likelihood lihood. functions. On the other hand, implicit density mod- Implicit density models (among which the most els do not specify the underlying distribution of famous models are GANs) introduce a second dis- the data, but instead define a stochastic process criminative model able to distinguish model gen- which allows to simulate the data distribution af- erated samples from real samples in addition to ter training by drawing samples from it. Since the generative model. While sampling from these the data distribution is not explicitly specified, im- models is cheap, it is inexact. The evaluation plicit generative models do not have a tractable of these models is difficult or even impossible to likelihood function. 
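The evaluation difficulty noted above for latent variable models can be illustrated with a toy Gaussian model in which z is drawn from N(0, 1) and x given z from N(z, 0.5^2): the marginal likelihood p(x) = E_{p(z)}[p_theta(x|z)] must be estimated, for example by the naive Monte Carlo average below, which is why such models are typically evaluated with a lower bound on the log-likelihood instead. The Gaussian toy model is an illustrative assumption, chosen because its exact marginal is known and serves as a sanity check.

```python
import math
import random

# Toy latent variable model: z ~ N(0, 1), x | z ~ N(z, 0.5^2).
# For this Gaussian toy the marginal is N(0, 1 + 0.25), so the Monte Carlo
# estimate can be checked against the exact value.
def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def log_marginal_mc(x, num_samples=100_000):
    """log p(x) ~ log (1/K) sum_k p(x | z_k), with z_k drawn from the prior p(z).
    Sampling from the prior is cheap and exact; evaluating log p(x) is not."""
    total = 0.0
    for _ in range(num_samples):
        z = random.gauss(0.0, 1.0)
        total += normal_pdf(x, mean=z, std=0.5)
    return math.log(total / num_samples)

x = 0.7
print("Monte Carlo log p(x):", log_marginal_mc(x))
print("exact log p(x):      ", math.log(normal_pdf(x, 0.0, math.sqrt(1.25))))
```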
A mix of both explicit and carry, and extracting latent representations from implicit models have been used in the literature these models is very challenging. We summarize to generate textual content in a variety of settings. in Table 1 characteristics of the three categories of Among these, we enumerate explicit density mod- generative models discussed above. els with tractable density such as autoregressive In what follows we review models for neural models (Bahdanau et al., 2014), (Vaswani et al., language generation from most general to the most 2017), explicit density models with approxi- specific according to the problem definition cate- mate density like the Variational Autoencoder gorization presented in Section 2; for each model (Kingma and Welling, 2013), and implicit direct architecture we first list models for generic text density generative models such as Generative Ad- generation, then introduce models for conditional versarial Networks (Goodfellow et al., 2014). text generation, and finally outline models used
Table 1: Comparison of generative model frameworks. is the hidden bias vector and by is the output bias vector. H is the function that computes the hid- Model type Evaluation Sampling den layer representation. Gradients in an RNN Fully-observed Exact and Exact and are computed via backpropagation through time Cheap Expensive (Rumelhart et al., 1986), (Werbos, 1989). By def- Latent models Lower Bound Exact and inition, RNNs are inherently deep in time con- Cheap sidering that the hidden state at each timestep is Implicit models Hard or Inexact and computed as a function of all previous timesteps. Impossible Cheap While in theory RNNs can make use of informa- tion in arbitrarily long sequences, in practice they fail to consider context beyond the few previous for constrained text generation. We begin with re- timesteps due to the vanishing and exploding gra- current neural network models for text generation dients (Bengio et al., 1994) which cause gradient in Section 3.1, then present sequence-to-sequence descent to not be able to learn long-range tempo- models in Section 3.2, generative adversarial net- ral structure in a standard RNN. Moreover, RNN- works (GANs) in Section 3.4, variational autoen- based models contain millions of parameters and codes (VAEs) in Section 3.5 and pre-trained mod- have traditionally been very difficult to train, lim- els for text generation in Section 3.8. We also pro- iting their widespread use (Sutskever et al., 2011). vide a comprehensive overview of text generation Improvements in network architectures, optimiza- tasks associated with each model. tion techniques and parallel computation have re- 3.1 Recurrent Architectures sulted in recurrent models learning better at large- scale (Lipton et al., 2015). 3.1.1 Recurrent Models for Generic / Long Short Term Memory (LSTM) Free-Text Generation (Hochreiter and Schmidhuber, 1997) networks are Recurrent Neural Networks (RNNs) introduced to overcome the limitations posed by (Rumelhart et al., 1986), (Mikolov et al., 2010) vanishing gradients in RNNs and allow gradient are able to model long-term dependencies in descent to learn long-term temporal structure. The sequential data and have shown promising results LSTM architecture largely resembles the standard in a variety of natural language processing tasks, RNN architecture with one hidden layer, and from language modeling (Mikolov, 2012) to each hidden layer node is modified to include a speech recognition (Graves et al., 2013) and memory cell with a self-connected recurrent edge machine translation (Kalchbrenner and Blunsom, of fixed weight which stores information over 2013). An important property of RNNs is the long time periods. A memory cell ct consists of a ability of learning to map an input sequence of node with an internal hidden state ht and a series variable length into a fixed dimensional vector of gates, namely an input gate it which controls representation. how much each LSTM unit is updated, a forget At each timestep, the RNN receives an input, gate ft which controls the extent to which the updates its hidden state, and makes a prediction. previous memory cell is forgotten, and an output Given an input sequence x = (x1 , x2 , . . . , xT ), gate ot which controls the exposure of the internal a standard RNN computes the hidden vector se- memory state. The LSTM transition equations at quence h = (h1 , h2 , . . . , hT ) and the output vec- timestep t are: tor sequence y = (y1 , y2 , . . . , yT ), where each dat- apoint xt , ht , yt , ∀ t ∈ {1, . . . 
, T } is a real valued vector, in the following way: it = σ(W (i) xt + U (i) ht−1 + b(i) ) ft = σ(W (f ) xt + U (f ) ht−1 + b(f ) ) ht = H(Wxh xt + Whh ht−1 + bh ) ot = σ(W (o) xt + U (o) ht−1 + b(o) ) (21) (22) yt = Why ht + by ut = σ(W (u) xt + U (t) ht−1 + b(t) ) In Equation 21 terms W denote weight matrices, ct = it ⊙ ut + ft ⊙ ct−1 in particular Wxh is the input-hidden weight ma- ht = ot ⊙ tanh(ct ) trix and Whh is the hidden-hidden weight ma- trix. The b terms denote bias vectors, where bh In Equation 22, xt is the input at the current
timestep t, σ denotes the logistic sigmoid func- that accumulate and amplify quickly over the gen- tion and ⊙ denotes elementwise multiplication. U erated sequence, (Lamb et al., 2016). As a remedy, and W are learned weight matrices. LSTMs can Scheduled Sampling (Bengio et al., 2015) mixes represent information over multiple time steps by inputs from the ground-truth sequence with inputs adjusting the values of the gating variables for generated by the model at training time, gradually each vector element, therefore allowing the gra- adjusting the training process from fully guided dient to pass without vanishing or exploding. In (i.e. using the true previous token) to less guided both RNNs and LSTMs the data is modeled via (i.e. using mostly the generated token) based on a fully-observed directed graphical model, where curriculum learning (Bengio et al., 2009). While the distribution over a discrete output sequence the model generated distribution can still diverge y = (y1 , y2 , . . . , yT ) is decomposed into an or- from the ground truth distribution as the model dered product of conditional distributions over to- generates several consecutive tokens, possible so- kens: lutions are: i) make the self-generated sequences short, and ii) anneal the probability of using self- T Y generated vs. ground-truth samples to 0, accord- P (y1 , y2 , . . . , yT ) = P (y1 ) P (yt |y1 , . . . , yt−1 ) t=1 ing to some schedule. Still, models trained with (23) scheduled sampling are shown to memorize the Similar to LSTMs, Gated Recurrent Units distribution of symbols conditioned on their posi- (GRUs) (Cho et al., 2014) learn semantically and tion in the sequence instead of the actual prefix of syntactically meaningful representations of natu- preceding symbols (Huszár, 2015). ral language and have gating units to modulate the flow of information. Unlike LSTMs, GRU units Many extensions of vanilla RNN and LSTM do not have a separate memory cell and present a architectures are proposed in the literature simpler design with fewer gates. The activation hjt aiming to improve generalization and sample at timestep t linearly interpolates between the acti- quality (Yu et al., 2019). Bidirectional RNNs vation at the previous timestep htj−1 and the candi- (Schuster and Paliwal, 1997), (Berglund et al., date activation e hjt . The update gate ztj decides how 2015) augment unidirectional recurrent models much the current unit updates its content, while the by introducing a second hidden layer with con- reset gate rtj allows it to forget the previously com- nections flowing in opposite temporal order to puted state. The GRU update equations at each exploit both past and future information in a timestep t are: sequence. Multiplicative RNNs (Sutskever et al., 2011) allow flexible input-dependent transitions, however many complex transition functions hard hjt = (1 − ztj )hjt−1 + ztj e hjt to bypass. Gated feedback RNNs and LSTMs (Chung et al., 2014) rely on gated-feedback ztj = σ(Wz xt + Uz ht−1 )j (24) connections to enable the flow of control signals hjt = tanh(W xt + U (rt ⊙ ht−1 ))j e from the upper to lower recurrent layers in the net- rtj = σ(Wr xt + Ur ht−1 )j work. Similarly, depth gated LSTMs (Yao et al., 2015) introduce dependencies between lower Models with recurrent connections are trained and upper recurrent units by using a depth gate with teacher forcing (Williams and Zipser, 1989), which connects memory cells of adjacent layers. 
a strategy emerging from the maximum likelihood Stacked LSTMs stack multiple layers at each criterion designed to keep the recurrent model pre- time-step to increase the capacity of the network, dictions close to the ground-truth sequence. At while nested LSTMs (Moniz and Krueger, 2018) each training step the model generated token ŷt selectively access LSTM memory cells with inner is replaced with its ground-truth equivalent token memory. Convolutional LSTMs (Sainath et al., yt , while at inference time each token is generated 2015), (Xingjian et al., 2015) are designed for by the model itself (i.e. sampled from its condi- jointly modeling spatio-temporal sequences. Tree- tional distribution over the sequence given the pre- structured LSTMs (Zhu et al., 2015), (Tai et al., viously generated samples). The discrepancy be- 2015) extend the LSTM structure beyond a linear tween training and inference stages leads to expo- chain to tree-structured network topologies, and sure bias, causing errors in the model predictions are useful at semantic similarity and sentiment
classification tasks. Multiplicative LSTMs forms local operations such as insertion, deletion (Krause et al., 2016) combine vanilla LSTM and replacement in the sentence space for any ran- networks of fixed weights with multiplicative domly selected word in the sentence. RNNs to allow for flexible input-dependent Hard constraints on the generation of scientific weight matrices in the network architecture. Mul- paper titles are imposed by the use of a forward- tiplicative Integration (Wu et al., 2016b) RNNs backward recurrent language model which gener- achieve better performance than vanilla RNNs by ates both previous and future words in a sentence using the Hadamard product in the computational conditioned on a given topic word (Mou et al., additive building block of RNNs. Mogrifier 2015). While the topic word can occur at any LSTMs (Melis et al., 2019) capture interactions arbitrary position in the sentence, the approach between inputs and their context by mutually can only generate sentences constrained precisely gating the current input and the previous output of on one keyword. Multiple constraints are incor- the network. For a comprehensive review of RNN porated in sentences generated by a backward- and LSTM-based network architectures we point forward LSTM language model by lexically substi- the reader to (Yu et al., 2019). tuting constrained tokens with their closest match- 3.1.2 Recurrent Models for Conditional Text ing neighbour in the embedding space (Latif et al., Generation 2020). Guiding the conversation towards a des- ignated topic while integrating specific vocabu- A recurrent free-text generation model becomes a lary words is achieved by combining discourse- conditional recurrent text generation model when level rules with neural next keywords prediction the distribution over training sentences is condi- (Tang et al., 2019). A recurrent network based tioned on another modality. For example in ma- sequence classifier is used for extractive summa- chine translation the distribution is conditioned on rization in (Nallapati et al., 2017). Poetry genera- another language, in image caption generation the tion which obeys hard rhythmic, rhyme and topic condition is the input image, in video description constraints is proposed in (Ghazvininejad et al., generation we condition on the input video, while 2016). in speech recognition we condition on the input speech. Content and stylistic properties (such as senti- ment, topic, style and length) of generated movie 3.2 Sequence-to-Sequence Architectures reviews are controlled in a conditional LSTM language model by conditioning on context vec- Although the recurrent models presented in Sec- tors that reflect the presence of these proper- tion 3.1 present good performance whenever large ties (Ficler and Goldberg, 2017). Affective di- labeled training sets are available, they can only be alogue responses are generated by conditioning applied to problems whose inputs and targets are on affect categories in an LSTM language model encoded with vectors of fixed dimensionality. Se- (Ghosh et al., 2017). A RNN-based language quences represent a challenge for recurrent models model equipped with dynamic memory outper- since RNNs require the dimensionality of their in- forms more complex memory-based models for di- puts and outputs to be known and fixed beforehand. alogue generation (Mei et al., 2017). 
Participant In practice, there are many problems in which the roles and conversational topics are represented as sequence length is not known a-priori and it is nec- context vectors and incorporated into a LSTM- essary to map variable length sequences into fixed- based response generation model (Luan et al., dimensional vector representations. To this end, 2016). models that can map sequences to sequences are proposed. These models makes minimal assump- 3.1.3 Recurrent Models for Constrained Text tions on the sequence structure and learn to map an Generation input sequence into a vector of fixed dimensional- Metropolis-Hastings sampling (Miao et al., 2019) ity and then map that vector back into an output is proposed for both soft and hard constrained sen- sequence, therefore learning to decode the target tence generation from models based on recurrent sequence from the encoded vector representation neural networks. The method is based on Markov of the source sequence. We present these models Chain Monte Carlo (MCMC) sampling and per- in detail below.