Value-Agnostic Conversational Semantic Parsing

Emmanouil Antonios Platanios, Adam Pauls, Subhro Roy, Yuchen Zhang, Alex Kyte, Alan Guo, Sam Thomson, Jayant Krishnamurthy, Jason Wolfe, Jacob Andreas, Dan Klein
Microsoft Semantic Machines
sminfo@microsoft.com

Abstract

Conversational semantic parsers map user utterances to executable programs given dialogue histories composed of previous utterances, programs, and system responses. Existing parsers typically condition on rich representations of history that include the complete set of values and computations previously discussed. We propose a model that abstracts over values to focus prediction on type- and function-level context. This approach provides a compact encoding of dialogue histories and predicted programs, improving generalization and computational efficiency. Our model incorporates several other components, including an atomic span copy operation and structural enforcement of well-formedness constraints on predicted programs, that are particularly advantageous in the low-data regime. Trained on the SMCalFlow and TreeDST datasets, our model outperforms prior work by 7.3% and 10.6% respectively in terms of absolute accuracy. Trained on only a thousand examples from each dataset, it outperforms strong baselines by 12.4% and 6.4%. These results indicate that simple representations are key to effective generalization in conversational semantic parsing.

1 Introduction

Conversational semantic parsers, which translate natural language utterances into executable programs while incorporating conversational context, play an increasingly central role in systems for interactive data analysis (Yu et al., 2019), instruction following (Guu et al., 2017), and task-oriented dialogue (Zettlemoyer and Collins, 2009). An example of this task is shown in Figure 1. Typical models are based on an autoregressive sequence prediction approach, in which a detailed representation of the dialogue history is concatenated to the input sequence, and predictors condition on this sequence and all previously generated components of the output (Suhr et al., 2018). While this approach can capture arbitrary dependencies between inputs and outputs, it comes at the cost of sample- and computational inefficiency.

We propose a new "value-agnostic" approach to contextual semantic parsing driven by type-based representations of the dialogue history and function-based representations of the generated programs. Types and functions have long served as a foundation for formal reasoning about programs, but their use in neural semantic parsing has been limited, e.g., to constraining the hypothesis space (Krishnamurthy et al., 2017), guiding data augmentation (Jia and Liang, 2016), and coarsening in coarse-to-fine models (Dong and Lapata, 2018). We show that representing conversation histories and partial programs via the types and functions they contain enables fast, accurate, and sample-efficient contextual semantic parsing. We propose a neural encoder-decoder contextual semantic parsing model which, in contrast to prior work:

1. uses a compact yet informative representation of discourse context in the encoder that considers only the types of salient entities that were predicted by the model in previous turns or that appeared in the execution results of the predicted programs, and
2. conditions the decoder state on the sequence of function invocations so far, without conditioning on any concrete values passed as arguments to the functions.

Our model substantially improves upon the best published results on the SMCalFlow (Semantic Machines et al., 2020) and TreeDST (Cheng et al., 2020) conversational semantic parsing datasets, improving model performance by 7.3% and 10.6%, respectively, in terms of absolute accuracy. In further experiments aimed at quantifying sample efficiency, it improves accuracy by 12.4% and 6.4% respectively when trained on only a thousand examples from each dataset. Our model is also effective at non-contextual semantic parsing, matching state-of-the-art results on the JOBS, GEOQUERY, and ATIS datasets (Dong and Lapata, 2016). This is achieved while also reducing the test-time computational cost by a factor of 10 (from 80ms per utterance down to 8ms when running on the same machine; more details are provided in Appendix H), when compared to our fastest baseline, which makes it usable as part of a real-time conversational system.

One conclusion from these experiments is that most semantic parses have structures that depend only weakly on the values that appear in the dialogue history or in the programs themselves. Our experiments find that hiding values alone results in a 2.6% accuracy improvement in the low-data regime. By treating types and functions, rather than values, as the main ingredients in learned representations for semantic parsing, we improve model accuracy and sample efficiency across a diverse set of language understanding problems, while also significantly reducing computational costs.

[Figure 1: Illustration of the conversational semantic parsing problem that we focus on and the representations that we use. The previous turn user utterance ("Can you delete my event called holiday shopping?"), the previous program delete(find(Constraint[Event](subject = like("holiday shopping")))), and the previous agent utterance ("I can't find an event with that name.") are shown in blue on the top. The dialogue history representation extracted using our approach is shown on the top right: the set of salient types (String, Constraint[String], Constraint[Event], Event, Unit, EventNotFoundError) extracted from the dialogue history and used by the parser as a compact representation of the history to condition on; the last type comes from the program execution results. The current turn user utterance ("Oh, it's just called shopping. It may be at 2.") is shown in red on the bottom left, together with entities proposed from it (e.g., Month.May, Number(2)). The current utterance, the set of proposed entities, and the extracted dialogue history representation form the input to our parser. Given this input, the parser predicts the program shown on the bottom right, in its linearized representation:
[0] Constraint[Event]()
[1] Constraint[Any]()
[2] like(value = "shopping")
[3] Time(hour = 2, meridiem = PM)
[4] Constraint[Event](subject = [2], start = [3])
[5] revise(oldLoc = [0], rootLoc = [1], new = [4])
Each line is a function invocation whose argument values are constants, proposed entities, spans copied from the utterance, or references to previous invocations.]

2 Proposed Model

Our goal is to map natural language utterances to programs while incorporating context from dialogue histories (i.e., past utterances and their associated programs and execution results). We model a program as a sequence of function invocations, each consisting of a function and zero or more argument values, as illustrated at the lower right of Figure 1. The argument values can be either literal values or references to results of previous function invocations. The ability to reference previous elements of the sequence, sometimes called a target-side copy, allows us to construct programs that involve re-entrancies. Owing to this referential structure, a program can be equivalently represented as a directed acyclic graph (see e.g., Jones et al., 2012; Zhang et al., 2019).

We propose a Transformer-based (Vaswani et al., 2017) encoder-decoder model that predicts programs by generating function invocations sequentially, where each invocation can draw its arguments from an inventory of values (§2.5)—possibly copied from the utterance—and the results of previous function invocations in the current program. The encoder (§2.2) transforms a natural language utterance and a dialogue history to a continuous representation. Subsequently, the decoder (§2.3) uses this representation to define an autoregressive distribution over function invocation sequences and chooses a high-probability sequence by performing beam search. As our experiments (§3) will show, a naive encoding of the complete dialogue history and program results in poor model accuracy.
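To make this program representation concrete, the following is a minimal sketch (our own illustrative Python, not the paper's implementation; all class and field names are ours) of a program stored as a sequence of function invocations whose arguments are either literal values or references to earlier invocations, mirroring the linearized representation in Figure 1.

    from dataclasses import dataclass
    from typing import Dict, Union

    @dataclass(frozen=True)
    class Reference:
        """Points at the result of an earlier invocation in the same program."""
        index: int

    @dataclass
    class Invocation:
        """A single program 'line': a function applied to named argument values."""
        function: str
        arguments: Dict[str, Union[Reference, str, int]]

    # The program from Figure 1, written as a sequence of invocations.
    program = [
        Invocation("Constraint[Event]", {}),                           # [0]
        Invocation("Constraint[Any]", {}),                             # [1]
        Invocation("like", {"value": "shopping"}),                     # [2]
        Invocation("Time", {"hour": 2, "meridiem": "PM"}),             # [3]
        Invocation("Constraint[Event]",
                   {"subject": Reference(2), "start": Reference(3)}),  # [4]
        Invocation("revise",
                   {"oldLoc": Reference(0), "rootLoc": Reference(1),
                    "new": Reference(4)}),                             # [5]
    ]

Because arguments may reference earlier lines, the same sequence also induces the directed acyclic graph view of the program mentioned above.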
[Figure 2: Illustration of the revise meta-computation operator (§2.1) used in our program representations. This operator can remove the need to copy program fragments from the dialogue history. Previous program:
delete(find(Constraint[Event](subject = like("holiday shopping"), start = Time(hour = 2, meridiem = PM), end = Time(hour = 5, meridiem = PM))))
Current utterance: "It actually starts at 3pm."
Current program without revision (contains information that is not mentioned in the current utterance):
delete(find(Constraint[Event](subject = like("holiday shopping"), start = Time(hour = 3, meridiem = PM), end = Time(hour = 5, meridiem = PM))))
Current program with revision (only contains information that is mentioned in the current utterance):
revise(oldLoc = Constraint[Event](), rootLoc = RoleConstraint(start), new = Time(hour = 3, meridiem = PM))]

2.1 Preliminaries

Our approach assumes that programs have type annotations on all values and function calls, similar to the setting of Krishnamurthy et al. (2017). (This requirement can be trivially satisfied by assigning all expressions the same type, but in practice defining a set of type declarations for the datasets in our experiments was not difficult; refer to Appendix C for details.) Furthermore, we assume that program prediction is local in that it does not require program fragments to be copied from the dialogue history (but may still depend on history in other ways). Several formalisms, including the typed references of Zettlemoyer and Collins (2009) and the meta-computation operators of Semantic Machines et al. (2020), make it possible to produce local program annotations even for dialogues like the one depicted in Figure 2, which reuse past computations. We transformed the datasets in our experiments to use such meta-computation operators (see Appendix C).

We also optionally make use of entity proposers, similar to Krishnamurthy et al. (2017), which annotate spans from the current utterance with typed values. For example, the span "one" in "Change it to one" might be annotated with the value 1 of type Number. These values are scored by the decoder along with other values that it considers (§2.5) when predicting argument values for function invocations. Using entity proposers aims to help the model generalize better to previously unseen values that can be recognized in the utterance using hard-coded heuristics (e.g., regular expressions), auxiliary training data, or other runtime information (e.g., a contact list). In our experiments we make use of simple proposers that recognize numbers, months, holidays, and days of the week, but one could define proposers for arbitrary values (e.g., song titles). As described in §2.5, certain values can also be predicted directly without the use of an entity proposer.

2.2 Encoder

The encoder, shown in Figure 3, maps a natural language utterance to a continuous representation. Like many neural sequence-to-sequence models, we produce a contextualized token representation of the utterance, H_utt ∈ R^(U × h_enc), where U is the number of tokens and h_enc is the dimensionality of their embeddings. We use a Transformer encoder (Vaswani et al., 2017), optionally initialized using the BERT pretraining scheme (Devlin et al., 2019). Next, we need to encode the dialogue history and combine its representation with H_utt to produce history-contextualized utterance token embeddings.

Prior work has incorporated history information by linearizing it and treating it as part of the input utterance (Cheng et al., 2018; Semantic Machines et al., 2020; Aghajanyan et al., 2020). While flexible and easy to implement, this approach presents a number of challenges. In complex dialogues, history encodings can grow extremely long relative to the user utterance, which: (i) increases the risk of overfitting, (ii) increases computational costs (because attentions have to be computed over long sequences), and (iii) necessitates using small batch sizes during training, making optimization difficult.

Thanks to the predictive locality of our representations (§2.1), our decoder (§2.3) never needs to retrieve values or program fragments from the dialogue history. Instead, context enters into programs primarily when programs use referring expressions that point to past computations, or revision expressions that modify them. Even though this allows us to dramatically simplify the dialogue history representation, effective generation of referring expressions still requires knowing something about the past. For example, for the utterance "What's next?" the model needs to determine what "What" refers to. Perhaps more interestingly, the presence of dates in recent turns (or values that have dates, such as meetings) should make the decoder more eager to generate referring calls that retrieve dates from the dialogue history; especially so if other words in the current utterance hint that dates may be useful and yet date values cannot be constructed directly from the current utterance. Subsequent steps of the decoder which are triggered by these other words can produce functions that consume the referred dates.

We thus hypothesize that it suffices to strip the dialogue history down to its constituent types, hiding all other information. Specifically, we extract a set T of types that appear in the dialogue history up to m turns back, where m = 1 in our experiments. Our encoder then transforms H_utt into a sequence of history-contextualized embeddings H_enc by allowing each token to attend over T. This is motivated by the fact that, in many cases, dialogue history is important for determining the meaning of specific tokens in the utterance, rather than the whole utterance. Specifically, we learn embeddings T ∈ R^(|T| × h_type) for the extracted types, where h_type is the embedding size, and use the attention mechanism of Vaswani et al. (2017) to contextualize H_utt:

    H_enc = H_utt + MultiHeadAttention(Q = H_utt, K = T, V = T),    (1)

so that each utterance-contextualized token is further contextualized in (1) by adding to it a mixture of embeddings of elements in T, where the mixture coefficients depend only on that utterance-contextualized token. This encoder is illustrated in Figure 3. As we show in §3.1, using this mechanism performs better than the naive approach of appending a set-of-types vector to H_utt.

[Figure 3: Illustration of our encoder (§2.2), using the example of Figure 1. The utterance is processed by a Transformer-based (Vaswani et al., 2017) encoder and combined with information extracted from the set of dialogue history types using multi-head attention.]

2.3 Decoder: Programs

The decoder uses the history-contextualized representation H_enc of the current utterance to predict a distribution over the program π that corresponds to that utterance. Each successive "line" π_i of π invokes a function f_i on an argument value tuple (v_i1, v_i2, ..., v_iA_i), where A_i is the number of (formal) arguments of f_i. Applying f_i to this ordered tuple results in the invocation f_i(a_i1 = v_i1, a_i2 = v_i2, ...), where (a_i1, a_i2, ..., a_iA_i) name the formal arguments of f_i. Each predicted value v_ij can be the result of a previous function invocation, a constant value, a value copied from the current utterance, or a proposed entity (§2.1), as illustrated in the lower right corner of Figure 1. These different argument sources are described in §2.5. Formally, the decoder defines a distribution over programs π that factorizes over invocations,

    p(π | H_enc) = ∏_{i=1}^{P} p(π_i | f_{<i}, H_enc),    (2)

where P is the number of invocations and f_{<i} denotes the functions invoked in the preceding lines, and each invocation further factorizes into a function term and one term per argument,

    p(π_i | f_{<i}, H_enc) = p(f_i | f_{<i}, H_enc) ∏_{j=1}^{A_i} p(v_ij | f_{<=i}, H_enc).    (3)
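To illustrate how this factorization can be used at prediction time, here is a minimal greedy-decoding sketch (our own illustrative Python, with made-up helper names such as function_scores, argument_scores, formal_arguments, and a "<stop>" symbol; the actual model performs beam search as described in §2.6). The key point is that the program prefix passed forward between steps contains only the invoked functions, never their argument values.

    def greedy_decode(h_enc, model, max_steps=20):
        """Greedily predict a program as a sequence of function invocations.

        The decoder is value-agnostic: it conditions only on the *functions*
        invoked so far, and argument values are chosen per argument,
        independently of one another, once the function has been fixed.
        """
        functions = []   # f_<i: the only program information carried forward
        program = []     # full invocations, kept for output only
        for _ in range(max_steps):
            # p(f_i | f_<i, H_enc): score every function in the inventory.
            f_scores = model.function_scores(h_enc, functions)
            f_i = max(f_scores, key=f_scores.get)
            if f_i == "<stop>":
                break
            # p(v_ij | f_<=i, H_enc): each formal argument is filled independently.
            arguments = {}
            for arg in model.formal_arguments(f_i):
                v_scores = model.argument_scores(h_enc, functions, f_i, arg)
                arguments[arg] = max(v_scores, key=v_scores.get)
            program.append((f_i, arguments))
            functions.append(f_i)  # note: the arguments are *not* appended
        return program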
Note that each function choice depends only on the previously invoked functions (not their argument values or results) and argument values depend only on their calling function (not on one another or any of the previous argument values). This is illustrated in Figure 4. In addition to providing an important inductive bias, these independence assumptions allow our inference procedure to efficiently score all possible function invocations at step i, given the ones at previous steps, at once (i.e., function and argument value assignments together), resulting in an efficient search algorithm (§2.6). Note that there is also a corresponding disadvantage (as in many machine translation models) that a meaningful phrase in the utterance could be independently selected for multiple arguments, or not selected at all, but we did not find this to be a problem in practice.

[Figure 4: Illustration showing the way in which our decoder is value-agnostic; specifically, which part of the generated program prefix the decoder conditions on while generating programs (§2.3). Consider the following program representing the expression 1 + 2 + 3 + 4 + 5:
[0] +(1, 2)
[1] +([0], 3)
[2] +([1], 4)
[3] +([2], 5)
While generating the last invocation, the decoder only gets to condition on the following program prefix, in which argument values are masked out:
[0] +(Number, Number)
[1] +(Number, Number)
[2] +(Number, Number)]

2.4 Decoder: Functions

We embed each function using its signature, which consists of a name n, a result type r, type parameters τ_1, ..., τ_T, and formal arguments a_1, ..., a_A with types t_1, ..., t_A; an example is shown in Figure 5. We encode the function and argument names using the utterance encoder of §2.2 and learn embeddings for the types, to obtain (n, r), {τ_1, ..., τ_T}, and {(a_1, t_1), ..., (a_A, t_A)}. Then, we construct an embedding for each function as follows:

    a = Pool(a_1 + t_1, ..., a_A + t_A),    (4)
    f = n + Pool(τ_1, ..., τ_T) + a + r,    (5)

where "Pool" is the max-pooling operation, which is invariant to the arguments' order.

[Figure 5: Illustration of our function encoder (§2.4), using a simplified example function signature: Book[Flight](from: City, to: City): Booking[Flight], with name Book, type parameter Flight, arguments from and to of type City, and result type Booking[Flight]. Argument embeddings are pooled and combined with the function embedding.]

Our main motivation for this function embedding mechanism is the ability to take cues from the user utterance (e.g., due to a function being named similarly to a word appearing in the utterance). If the functions and their arguments have names that are semantically similar to corresponding utterance parts, then this approach enables zero-shot generalization. However, there is an additional potential benefit from parameter sharing due to the compositional structure of the embeddings (see e.g., Baroni, 2020).
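As a concrete illustration of Equations 4 and 5, the following sketch (our own Python; encode_name and type_embedding are assumed stand-ins for the utterance encoder and the learned type embeddings) builds a function embedding from a signature by max-pooling over its arguments and type parameters.

    import numpy as np

    def embed_function(signature, encode_name, type_embedding):
        """Compose a function embedding from its signature (Equations 4-5).

        signature: dict with keys "name", "type_params", "args" (list of
                   (arg_name, arg_type) pairs), and "result_type".
        encode_name: maps a name string to a vector (shares the utterance encoder).
        type_embedding: maps a type to a learned vector.
        """
        n = encode_name(signature["name"])
        r = type_embedding(signature["result_type"])
        taus = [type_embedding(t) for t in signature["type_params"]]
        # Argument embedding: max-pool over (name + type) vectors, so the result
        # does not depend on the order of the formal arguments (Eq. 4).
        arg_vectors = [encode_name(a) + type_embedding(t)
                       for a, t in signature["args"]]
        a = np.max(arg_vectors, axis=0) if arg_vectors else np.zeros_like(n)
        pooled_taus = np.max(taus, axis=0) if taus else np.zeros_like(n)
        # Function embedding: sum of name, pooled type parameters, pooled
        # arguments, and result type (Eq. 5).
        return n + pooled_taus + a + r

    # Example with the signature from Figure 5, using random stand-in encoders.
    dim, rng, table = 8, np.random.default_rng(0), {}
    lookup = lambda key: table.setdefault(key, rng.normal(size=dim))
    sig = {"name": "Book", "type_params": ["Flight"],
           "args": [("from", "City"), ("to", "City")],
           "result_type": "Booking[Flight]"}
    f = embed_function(sig, lookup, lookup)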
2.5 Decoder: Argument Values

This section describes the implementation of the argument predictor p(v_ij | f_{<=i}, H_enc) used in Equation 3. Each argument value can be drawn from one of four sources: a reference to a previous invocation's result, a constant, a span copied from the utterance, or a proposed entity. Because the same value may be derivable from multiple sources, we marginalize over them:

    p(v_ij | f_{<=i}, H_enc) = Σ_s p(v_ij, s | f_{<=i}, H_enc),    (6)

where s ranges over the possible sources of the value. The argument scoring mechanism considers the last-layer decoder state h^i_dec that was used to predict f_i via p(f_i | f_{<i}, H_enc). We specialize this decoder state to argument a_ij as follows:

    h^{i,a_ij}_dec = ĥ^i_dec ⊙ tanh(f_i + a_ij),    (7)

where ⊙ represents elementwise multiplication, f_i is the embedding of the current function f_i, a_ij is the encoding of argument a_ij as defined in §2.4, and ĥ^i_dec is a projection of h^i_dec to the necessary dimensionality. Intuitively, tanh(f_i + a_ij) acts as a gating function over the decoder state, deciding what is relevant when scoring values for argument a_ij. This argument-specific decoder state is then combined with a value embedding to produce a probability for each (sourced) value assignment:

    p(v, s | f_{<=i}, H_enc) ∝ exp( ṽ⊤ (h^{i,a}_dec + w^{kind(s)}_a) + b^{kind(s)}_a ),    (8)

where a is the argument name a_ij, kind(s) ∈ {REFERENCE, CONSTANT, COPY, ENTITY}, ṽ is the embedding of (v, s), which is described next, and w^k_a and b^k_a are model parameters that are specific to a and to the kind of the source s.

References. References are pointers to the return values of previous function invocations. If the source s for the proposed value v is the result of the k-th invocation (where k < i), we take its embedding ṽ to be a projection of h^k_dec that was used to predict that invocation's function and arguments.

Constants. Constants are values that are always proposed, so the decoder always has the option of generating them. If the source s for the proposed value v is a constant, we embed it by applying the utterance encoder on a string rendering of the value. The set of constants is automatically extracted from the training data (see Appendix B).

Copies. The span copy mechanism allows the decoder to copy contiguous spans directly from the utterance. Its goal is to produce a score for each of the U(U + 1)/2 possible utterance spans. Naively, this would result in a computational cost that is quadratic in the utterance length U, and so we instead chose a simple scoring model that avoids it. Similar to Stern et al. (2017) and Kuribayashi et al. (2019), we assume that the score for a span factorizes, and define the embedding of each span value as the concatenation of the contextual embeddings of the first and last tokens of the span, ṽ = [h^{k_start}_utt ; h^{k_end}_utt]. To compute the copy scores we also concatenate h^{i,a}_dec with itself in Equation 8.

Entities. Entities are treated the same way as copies, except that instead of scoring all spans of the input, we only score spans proposed by the entity proposers (§2.1), which provide a set of candidate entities that are each described by an utterance span and an associated value. The candidates are scored using an identical mechanism to the one used for scoring copies. This means that, for example, the string "sept" could be linked to the value Month.September even though the string representations do not match perfectly.

Type Checking. When scoring argument values for function f_i, we know the argument types, as they are specified in the function's signature. This enables us to use a type checking mechanism that allows the decoder to directly exclude values with mismatching types. For references, the value types can be obtained by looking up the result types of the corresponding function signatures. Additionally, the types are always pre-specified for constants and entities, and copies are only supported for a subset of types (e.g., String, PersonName; see Appendix B). The type checking mechanism sets p(v_ij | f_{<=i}, H_enc) = 0 for values whose types do not match the corresponding argument type.
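The following is a minimal sketch of the argument scorer described above (our own NumPy code with assumed shapes and parameter names; the real model is a trained network and handles span embeddings of doubled dimensionality): the decoder state is gated per argument (Eq. 7), every candidate value embedding is scored against it with source-kind-specific parameters (Eq. 8), and ill-typed candidates are masked out before normalization.

    import numpy as np

    def score_argument_values(h_dec_proj, f_emb, a_emb, candidates, arg_type,
                              w_kind, b_kind):
        """Score candidate (value, source) pairs for one formal argument.

        h_dec_proj: projected decoder state for the current invocation, shape (d,)
        f_emb, a_emb: function and argument embeddings, shape (d,)
        candidates: list of dicts with keys "v_emb" (shape (d,)), "kind", "type"
        arg_type: declared type of this argument (used for type checking)
        w_kind, b_kind: per-source-kind weight vectors (d,) and scalar biases
        """
        # Equation 7: gate the decoder state with tanh(f + a).
        h_arg = h_dec_proj * np.tanh(f_emb + a_emb)
        logits = []
        for c in candidates:
            if c["type"] != arg_type:
                logits.append(-np.inf)  # type checking: exclude mismatched types
                continue
            # Equation 8: dot product against the gated state plus
            # kind-specific weights and bias.
            logits.append(c["v_emb"] @ (h_arg + w_kind[c["kind"]])
                          + b_kind[c["kind"]])
        logits = np.array(logits)
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()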
Table 1: Test set exact match accuracy comparing our model to the best reported results for SMCalFlow (Seq2Seq model from the public leaderboard; Semantic Machines et al., 2020) and TreeDST (TED-PP model; Cheng et al., 2020). The evaluation on each dataset in prior work requires us to repeat some idiosyncrasies that we describe in Appendix D.

                         | SMCalFlow V1.1 | SMCalFlow V2.0 | TreeDST
    Best Reported Result | 66.5           | 68.2           | 62.2
    Our Model            | 73.8           | 75.3           | 72.8

Table 2(a): Baseline comparison. Exact match accuracy by number of training dialogues.

    Dataset                  | SMCalFlow            | TreeDST
    # Training Dialogues     | 1k   | 10k  | 33k    | 1k   | 10k  | 19k
    w/o BERT   Seq2Seq       | 36.8 | 69.8 | 74.5   | 28.2 | 47.9 | 50.3
               Seq2Tree      | 43.6 | 69.3 | 77.7   | 23.6 | 46.9 | 48.8
               Seq2Tree++    | 48.0 | 71.9 | 78.2   | 74.8 | 75.4 | 86.9
               Our Model     | 53.8 | 73.2 | 78.5   | 78.6 | 87.6 | 88.5
    w/ BERT    Seq2Seq       | 44.6 | 64.1 | 67.8   | 28.6 | 40.2 | 47.2
               Seq2Tree      | 50.8 | 74.6 | 78.6   | 30.9 | 50.6 | 51.6
               Our Model     | 63.2 | 77.2 | 80.4   | 81.2 | 87.1 | 88.3

Table 2(b): Ablations. Exact match accuracy by number of training dialogues.

    Dataset                  | SMCalFlow            | TreeDST
    # Training Dialogues     | 1k   | 10k  | 33k    | 1k   | 10k  | 19k
    Our Model                | 63.2 | 77.2 | 80.4   | 81.2 | 87.1 | 88.3
    Value Dependence         | 60.6 | 76.4 | 79.4   | 79.3 | 86.2 | 86.5
    No Name Embedder         | 62.8 | 76.7 | 80.3   | 81.1 | 87.0 | 88.1
    No Types                 | 62.4 | 76.5 | 79.9   | 80.6 | 87.1 | 88.3
    No Span Copy             | 60.2 | 76.2 | 79.8   | 79.0 | 86.7 | 87.4
    No Entity Proposers      | 59.6 | 76.4 | 79.8   | 80.5 | 86.9 | 88.2

Our independence assumptions also allow us to efficiently implement beam search over complete function invocations, by leveraging the fact that:

    max_{π_i} p(π_i) = max_{f_i} [ p(f_i) ∏_{j=1}^{A_i} max_{v_ij} p(v_ij | f_i) ],    (9)

where we have omitted the dependence on f_{<i} and H_enc.
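The identity in Equation 9 means that candidate invocations can be ranked without enumerating the cross product of all argument assignments: for a fixed function, each argument's best value can be chosen independently. A small sketch of this step (our own Python; p_function and p_value are assumed to return the factor probabilities of Equations 2-3) could look as follows.

    import math

    def best_invocations(functions, p_function, p_value, beam_size=5):
        """Return, for each candidate function, its highest-scoring invocation,
        keeping the top beam_size functions (Equation 9).

        functions: {function: {argument_name: candidate_values}}
        For a fixed function, every argument's best value is chosen
        independently, so no joint enumeration of assignments is needed.
        """
        scored = []
        for f, formal_args in functions.items():
            log_score = math.log(p_function(f))
            assignment = {}
            for arg, candidate_values in formal_args.items():
                best_v = max(candidate_values, key=lambda v: p_value(f, arg, v))
                assignment[arg] = best_v
                log_score += math.log(p_value(f, arg, best_v))
            scored.append((log_score, f, assignment))
        scored.sort(key=lambda x: x[0], reverse=True)
        return scored[:beam_size]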
[...] variables as any other literal.

– Seq2Tree++: An enhanced version of the model by Krishnamurthy et al. (2017) that predicts typed programs in a top-down fashion. Unlike Seq2Seq and Seq2Tree, this model can only produce well-formed and well-typed programs. It also makes use of the same entity proposers (§2.1) similar to our model, and it can atomically copy spans of up to 15 tokens by treating them as additional proposed entities. Furthermore, it uses the linear history encoder that is described in the next paragraph. Like our model, re-entrancies are represented as references to previous outputs in the predicted sequence.

We also implemented variants of Seq2Seq and Seq2Tree that use BERT-base (Devlin et al., 2019) as the encoder. (We found that BERT-base worked best for these baselines, but it was no better than the smaller BERT-medium when used with our model. Also, unfortunately, incorporating BERT in Seq2Tree++ turned out to be challenging due to the way that model was originally implemented.) Our results are shown in Table 2a. Our model outperforms all baselines on both datasets, showing particularly large gains in the low data regime, even when using BERT. Finally, we implemented the following ablations, with more details provided in Appendix G:

– Value Dependence: Introduces a unique function for each value in the training data (except for copies) and transforms the data so that values are always produced by calls to these functions, allowing the model to condition on them.
– No Name Embedder: Embeds functions and constants atomically instead of using the approach of §2.4 and the utterance encoder.
– No Types: Collapses all types to a single type, which effectively disables type checking (§2.5).
– No Span Copy: Breaks up span-level copies into token-level copies which are put together using a special concatenate function. Note that our model is value-agnostic and so this ablated model cannot condition on previously copied tokens when copying a span token-by-token.
– No Entity Proposers: Removes the entity proposers, meaning that previously entity-linked values have to be generated as constants.
– No History: Sets H_enc = H_utt (§2.2).
– Previous Turn: Replaces the type-based history encoding with the previous turn user and system utterances or linearized system actions.
– Linear Encoder: Replaces the history attention mechanism with a linear function over a multi-hot embedding of the history types.

The results, shown in Table 2b, indicate that all of our features play a role in improving accuracy. Perhaps most importantly though, the "value dependence" ablation shows that our function-based program representations are indeed important, and the "previous turn" ablation shows that our type-based program representations are also important. Furthermore, the impact of both these modeling decisions grows larger in the low data regime, as does the impact of the span copy mechanism.

3.2 Non-Conversational Semantic Parsing

Our main focus is on conversational semantic parsing, but we also ran experiments on non-conversational semantic parsing benchmarks to show that our model is a strong parser irrespective of context. Specifically, we manually annotated the JOBS, GEOQUERY, and ATIS datasets with typed declarations (Appendix C) and ran experiments comparing with multiple baseline and state-of-the-art methods. The results, shown in Table 3, indicate that our model meets or exceeds state-of-the-art performance in each case.

Table 3: Validation set exact match accuracy for single-turn semantic parsing datasets. Note that Aghajanyan et al. (2020) use BART (Lewis et al., 2020), a large pre-trained encoder. The best results for each dataset are shown in bold red and are underlined.

    Method                         | JOBS | GEO  | ATIS
    Zettlemoyer and Collins (2007) | —    | 86.1 | 84.6
    Wang et al. (2014)             | 90.7 | 90.4 | 91.3
    Zhao and Huang (2015)          | 85.0 | 88.9 | 84.2
    Saparov et al. (2017)          | 81.4 | 83.9 | —
    Neural methods:
    Dong and Lapata (2016)         | 90.0 | 87.1 | 84.6
    Rabinovich et al. (2017)       | 92.9 | 87.1 | 85.9
    Yin and Neubig (2018)          | —    | 88.2 | 86.2
    Dong and Lapata (2018)         | —    | 88.2 | 87.7
    Aghajanyan et al. (2020)       | —    | 89.3 | —
    Our Model                      | 91.4 | 91.4 | 90.2
    Our Model (no BERT)            | 91.4 | 90.0 | 91.3

4 Related Work

Our approach builds on top of a significant amount of prior work in neural semantic parsing and also context-dependent semantic parsing.

Neural Semantic Parsing. While there was a brief period of interest in using unstructured sequence models for semantic parsing (e.g., Andreas et al., 2013; Dong and Lapata, 2016), most research on semantic parsing has used tree- or graph-shaped decoders that exploit program structure. Most such approaches use this structure as a constraint while decoding, filling in function arguments one-at-a-time, in either a top-down fashion (e.g., Dong and Lapata, 2016; Krishnamurthy et al., 2017) or a bottom-up fashion (e.g., Misra and Artzi, 2016; Cheng et al., 2018). Both directions can suffer from exposure bias and search errors during decoding: in top-down when there is no way to realize an argument of a given type in the current context, and in bottom-up when there are no functions in the programming language that combine the predicted arguments. To this end, there has been some work on global search with guarantees for neural semantic parsers (e.g., Lee et al., 2016) but it is expensive and makes certain strong assumptions. In contrast to this prior work, we use program structure not just as a decoder constraint but as a source of independence assumptions: the decoder explicitly decouples some decisions from others, resulting in good inductive biases and fast decoding algorithms.

Perhaps closest to our work is that of Dong and Lapata (2018), which is also about decoupling decisions, but uses a dataset-specific notion of an abstracted program sketch along with different independence assumptions, and underperforms our model in comparable settings (§3.2). Also close are the models of Cheng et al. (2020) and Zhang et al. (2019). Our method differs in that our beam search uses larger steps that predict functions together with their arguments, rather than predicting the argument values serially in separate dependent steps. Similar to Zhang et al. (2019), we use a target-side copy mechanism for generating references to function invocation results. However, we extend this mechanism to also predict constants, copy spans from the user utterance, and link externally proposed entities. While our span copy mechanism is novel, it is inspired by prior attempts to copy spans instead of tokens (e.g., Singh et al., 2020). Finally, bottom-up models with similarities to ours include SMBOP (Rubin and Berant, 2020) and BUSTLE (Odena et al., 2020).

Context-Dependent Semantic Parsing. Prior work on conversational semantic parsing mainly focuses on the decoder, with few efforts on incorporating the dialogue history information in the encoder. Recent work on context-dependent semantic parsing (e.g., Suhr et al., 2018; Yu et al., 2019) conditions on explicit representations of user utterances and programs with a neural encoder. While this results in highly expressive models, it also increases the risk of overfitting. Contrary to this, Zettlemoyer and Collins (2009), Lee et al. (2014), and Semantic Machines et al. (2020) do not use context to resolve references at all. They instead predict context-independent logical forms that are resolved in a separate step. Our approach occupies a middle ground: when combined with local program representations, types, even without any value information, provide enough information to resolve context-dependent meanings that cannot be derived from isolated sentences. The specific mechanism we use to do this "infuses" contextual type information into input sentence representations, in a manner reminiscent of attention flow models from the QA literature (e.g., Seo et al., 2016).

5 Conclusion

We showed that abstracting away values while encoding the dialogue history and decoding programs significantly improves conversational semantic parsing accuracy. In summary, our goal in this work is to think about types in a new way. Similar to previous neural and non-neural methods, types are an important source of constraints on the behavior of the decoder. Here, for the first time, they are also the primary ingredient in the representation of both the parser actions and the dialogue history. Our approach, which is based on type-centric encodings of dialogue states and function-centric encodings of programs (§2), outperforms prior work by 7.3% and 10.6% on SMCalFlow and TreeDST, respectively (§3), while also being more computationally efficient than competing methods. Perhaps more importantly, it results in even more significant gains in the low-data regime. This indicates that choosing our representations carefully and making appropriate independence assumptions can result in increased accuracy and computational efficiency.

6 Acknowledgements

We thank the anonymous reviewers for their helpful comments, Jason Eisner for his detailed feedback and suggestions on an early draft of the paper, Abulhair Saparov for helpful conversations and pointers about semantic parsing baselines and prior work, and Theo Lanman for his help in scaling up some of our experiments.
References of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Armen Aghajanyan, Jean Maillard, Akshat Shrivas- pages 731–742, Melbourne, Australia. Association tava, Keith Diedrick, Michael Haeger, Haoran Li, for Computational Linguistics. Yashar Mehdad, Veselin Stoyanov, Anuj Kumar, Mike Lewis, and Sonal Gupta. 2020. Conversa- Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. tional Semantic Parsing. In Proceedings of the 2020 Li. 2016. Incorporating Copying Mechanism in Conference on Empirical Methods in Natural Lan- Sequence-to-Sequence Learning. In Proceedings guage Processing (EMNLP), pages 5026–5035. As- of the 54th Annual Meeting of the Association for sociation for Computational Linguistics. Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Jacob Andreas, Andreas Vlachos, and Stephen Clark. Computational Linguistics. 2013. Semantic Parsing as Machine Translation. In Proceedings of the 51st Annual Meeting of the As- Kelvin Guu, Panupong Pasupat, E. Liu, and Percy sociation for Computational Linguistics (Volume 2: Liang. 2017. From language to programs: Bridg- Short Papers), pages 47–52, Sofia, Bulgaria. Associ- ing reinforcement learning and maximum marginal ation for Computational Linguistics. likelihood. ArXiv, abs/1704.07926. Marco Baroni. 2020. Linguistic Generalization and Robin Jia and Percy Liang. 2016. Data recombination Compositionality in Modern Artificial Neural Net- for neural semantic parsing. In Proceedings of the works. Philosophical Transactions of the Royal So- 54th Annual Meeting of the Association for Compu- ciety B, 375(1791):20190307. tational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computa- Jianpeng Cheng, Devang Agrawal, Héctor tional Linguistics. Martı́nez Alonso, Shruti Bhargava, Joris Driesen, Federico Flego, Dain Kaplan, Dimitri Kartsaklis, Bevan Jones, Jacob Andreas, Daniel Bauer, Lin Li, Dhivya Piraviperumal, Jason D. Williams, Karl Moritz Hermann, and Kevin Knight. 2012. Hong Yu, Diarmuid Ó Séaghdha, and Anders Semantics-Based Machine Translation with Hyper- Johannsen. 2020. Conversational Semantic Parsing edge Replacement Grammars. In Proceedings of for Dialog State Tracking. In Proceedings of the COLING 2012, pages 1359–1376, Mumbai, India. 2020 Conference on Empirical Methods in Natural The COLING 2012 Organizing Committee. Language Processing (EMNLP), pages 8107–8117. Association for Computational Linguistics. Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. Jianpeng Cheng, Siva Reddy, Vijay Saraswat, and arXiv:1412.6980 [cs.LG]. ArXiv: 1412.6980. Mirella Lapata. 2018. Learning an Executable Neu- ral Semantic Parser. Computational Linguistics, Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senel- 45(1):59–94. Publisher: MIT Press. lart, and Alexander Rush. 2017. OpenNMT: Open- source toolkit for neural machine translation. In Luis Damas and Robin Milner. 1982. Principal Type- Proceedings of ACL 2017, System Demonstrations, Schemes for Functional Programs. In Proceedings pages 67–72, Vancouver, Canada. Association for of the 9th ACM SIGPLAN-SIGACT Symposium on Computational Linguistics. Principles of Programming Languages, POPL ’82, page 207–212, New York, NY, USA. Association for Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gard- Computing Machinery. ner. 2017. Neural Semantic Parsing with Type Con- straints for Semi-Structured Tables. 
In Proceed- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and ings of the 2017 Conference on Empirical Methods Kristina Toutanova. 2019. BERT: Pre-training of in Natural Language Processing, pages 1516–1526, Deep Bidirectional Transformers for Language Un- Copenhagen, Denmark. Association for Computa- derstanding. In Proceedings of the 2019 Conference tional Linguistics. of the North American Chapter of the Association for Computational Linguistics: Human Language Tatsuki Kuribayashi, Hiroki Ouchi, Naoya Inoue, Paul Technologies, Volume 1 (Long and Short Papers), Reisert, Toshinori Miyoshi, Jun Suzuki, and Ken- pages 4171–4186, Minneapolis, Minnesota. Associ- taro Inui. 2019. An Empirical Study of Span Rep- ation for Computational Linguistics. resentations in Argumentation Structure Parsing. In Proceedings of the 57th Annual Meeting of the Asso- Li Dong and Mirella Lapata. 2016. Language to ciation for Computational Linguistics, pages 4691– Logical Form with Neural Attention. In Proceed- 4698, Florence, Italy. Association for Computational ings of the 54th Annual Meeting of the Association Linguistics. for Computational Linguistics (Volume 1: Long Pa- pers), pages 33–43, Berlin, Germany. Association Kenton Lee, Yoav Artzi, Jesse Dodge, and Luke Zettle- for Computational Linguistics. moyer. 2014. Context-dependent Semantic Parsing for Time Expressions. In Proceedings of the 52nd Li Dong and Mirella Lapata. 2018. Coarse-to-Fine De- Annual Meeting of the Association for Computa- coding for Neural Semantic Parsing. In Proceedings tional Linguistics (Volume 1: Long Papers), pages
1437–1447, Baltimore, Maryland. Association for (CoNLL 2017), pages 248–259, Vancouver, Canada. Computational Linguistics. Association for Computational Linguistics. Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2016. Abigail See, Peter J. Liu, and Christopher D. Manning. Global Neural CCG Parsing with Optimality Guar- 2017. Get to the point: Summarization with pointer- antees. In Proceedings of the 2016 Conference on generator networks. In Proceedings of the 55th An- Empirical Methods in Natural Language Process- nual Meeting of the Association for Computational ing, pages 2366–2376, Austin, Texas. Association Linguistics (Volume 1: Long Papers), pages 1073– for Computational Linguistics. 1083, Vancouver, Canada. Association for Computa- tional Linguistics. Mike Lewis, Yinhan Liu, Naman Goyal, Mar- jan Ghazvininejad, Abdelrahman Mohamed, Omer Semantic Machines, Jacob Andreas, John Bufe, David Levy, Veselin Stoyanov, and Luke Zettlemoyer. Burkett, Charles Chen, Josh Clausman, Jean Craw- 2020. BART: Denoising Sequence-to-Sequence Pre- ford, Kate Crim, Jordan DeLoach, Leah Dorner, Ja- training for Natural Language Generation, Transla- son Eisner, Hao Fang, Alan Guo, David Hall, Kristin tion, and Comprehension. In Proceedings of the Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Sm- 58th Annual Meeting of the Association for Compu- riti Jha, Dan Klein, Jayant Krishnamurthy, Theo tational Linguistics, pages 7871–7880. Association Lanman, Percy Liang, Christopher H. Lin, Ilya for Computational Linguistics. Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Dipendra Kumar Misra and Yoav Artzi. 2016. Neural Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Shift-Reduce CCG Semantic Parsing. In Proceed- Ben Snyder, Stephon Striplin, Yu Su, Zachary ings of the 2016 Conference on Empirical Methods Tellman, Sam Thomson, Andrei Vorobev, Izabela in Natural Language Processing, pages 1775–1786, Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, Austin, Texas. Association for Computational Lin- and Alexander Zotov. 2020. Task-Oriented Dia- guistics. logue as Dataflow Synthesis. Transactions of the As- Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and sociation for Computational Linguistics, 8:556–571. Chin-Yew Lin. 2019. A Simple Recipe towards Re- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, ducing Hallucination in Neural Surface Realisation. and Hannaneh Hajishirzi. 2016. Bidirectional In Proceedings of the 57th Annual Meeting of the Attention Flow for Machine Comprehension. Association for Computational Linguistics, pages arXiv:1611.01603 [cs.CL]. ArXiv: 1611.01603. 2673–2679, Florence, Italy. Association for Compu- tational Linguistics. Abhinav Singh, Patrick Xia, Guanghui Qin, Mahsa Augustus Odena, Kensen Shi, David Bieber, Rishabh Yarmohammadi, and Benjamin Van Durme. 2020. Singh, and Charles Sutton. 2020. BUSTLE: Bottom- CopyNext: Explicit Span Copying and Alignment up Program Synthesis Through Learning-guided Ex- in Sequence to Sequence Models. In Proceedings ploration. arXiv:2007.14381 [cs, stat]. ArXiv: of the Fourth Workshop on Structured Prediction for 2007.14381. NLP, pages 11–16. Association for Computational Linguistics. Panupong Pasupat and Percy Liang. 2015. Composi- tional Semantic Parsing on Semi-Structured Tables. Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. In Proceedings of the 53rd Annual Meeting of the A Minimal Span-Based Neural Constituency Parser. 
Association for Computational Linguistics and the In Proceedings of the 55th Annual Meeting of the As- 7th International Joint Conference on Natural Lan- sociation for Computational Linguistics (Volume 1: guage Processing (Volume 1: Long Papers), pages Long Papers), pages 818–827, Vancouver, Canada. 1470–1480, Beijing, China. Association for Compu- Association for Computational Linguistics. tational Linguistics. Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Maxim Rabinovich, Mitchell Stern, and Dan Klein. Learning to Map Context-Dependent Sentences to 2017. Abstract Syntax Networks for Code Gener- Executable Formal Queries. In Proceedings of the ation and Semantic Parsing. In Proceedings of the 2018 Conference of the North American Chapter of 55th Annual Meeting of the Association for Com- the Association for Computational Linguistics: Hu- putational Linguistics (Volume 1: Long Papers), man Language Technologies, Volume 1 (Long Pa- pages 1139–1149, Vancouver, Canada. Association pers), pages 2238–2249, New Orleans, Louisiana. for Computational Linguistics. Association for Computational Linguistics. Ohad Rubin and Jonathan Berant. 2020. SmBoP: Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Semi-autoregressive Bottom-up Semantic Parsing. Toutanova. 2019. Well-Read Students Learn Better: arXiv:2010.12412 [cs]. ArXiv: 2010.12412. On the Importance of Pre-training Compact Models. arXiv:1908.08962 [cs.CL]. ArXiv: 1908.08962. Abulhair Saparov, Vijay Saraswat, and Tom Mitchell. 2017. Probabilistic Generative Grammar for Se- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob mantic Parsing. In Proceedings of the 21st Confer- Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz ence on Computational Natural Language Learning Kaiser, and Illia Polosukhin. 2017. Attention is All
you Need. In Advances in Neural Information Pro- cessing Systems, volume 30, pages 5998–6008. Cur- ran Associates, Inc. Adrienne Wang, Tom Kwiatkowski, and Luke Zettle- moyer. 2014. Morpho-syntactic Lexical Generaliza- tion for CCG Semantic Parsing. In Proceedings of the 2014 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), pages 1284– 1295, Doha, Qatar. Association for Computational Linguistics. Pengcheng Yin and Graham Neubig. 2018. TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation. In Proceed- ings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demon- strations, pages 7–12, Brussels, Belgium. Associa- tion for Computational Linguistics. Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vin- cent Zhang, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019. SParC: Cross-Domain Se- mantic Parsing in Context. In Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics, pages 4511–4523, Florence, Italy. Association for Computational Linguistics. Luke Zettlemoyer and Michael Collins. 2007. Online Learning of Relaxed CCG Grammars for Parsing to Logical Form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Lan- guage Processing and Computational Natural Lan- guage Learning (EMNLP-CoNLL), pages 678–687, Prague, Czech Republic. Association for Computa- tional Linguistics. Luke Zettlemoyer and Michael Collins. 2009. Learn- ing Context-Dependent Mappings from Sentences to Logical Form. In Proceedings of the Joint Confer- ence of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 976–984, Suntec, Singapore. Association for Computational Linguistics. Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019. AMR Parsing as Sequence-to- Graph Transduction. In Proceedings of the 57th An- nual Meeting of the Association for Computational Linguistics, pages 80–94, Florence, Italy. Associa- tion for Computational Linguistics. Kai Zhao and Liang Huang. 2015. Type-Driven Incre- mental Semantic Parsing with Polymorphism. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, pages 1416–1421, Denver, Colorado. Association for Computational Linguistics.
A Invocation Joint Normalization

Instead of the distribution in Equation 3, we can define a distribution over f_i and {v_ij}_{j=1}^{A_i} that factorizes in the same way but is also jointly normalized:

    p(π_i | f_{<i}, H_enc) ∝ p(f_i | f_{<i}, H_enc) ∏_{j=1}^{A_i} p(v_ij | f_{<=i}, H_enc),

where the normalization is performed over all candidate invocations jointly, rather than separately for the function factor and each argument factor.

[...] out to be relatively frequent in TreeDST (~6% of the examples), as we discuss in Appendix C.

Types that must be entity-linked: Types for which argument values can only be picked from the set of proposed entities (§2.1) and cannot be otherwise hallucinated from the model, or directly [...]
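Returning to the jointly normalized alternative of Appendix A, the following small sketch (our own Python; f_scores and v_scores are assumed to hold unnormalized log-scores) contrasts it with local normalization: the product of a function score and its argument scores is normalized over all candidate invocations at once. The enumeration of joint assignments is only for illustration.

    import itertools, math

    def joint_normalize(f_scores, v_scores):
        """Jointly normalize over whole invocations.

        f_scores: {function: unnormalized log-score}
        v_scores: {function: {argument: {value: unnormalized log-score}}}
        Returns {(function, argument_assignment): probability}, normalized over
        every candidate invocation jointly instead of factor by factor.
        """
        unnorm = {}
        for f, score in f_scores.items():
            args = sorted(v_scores[f])
            choices = [list(v_scores[f][a].items()) for a in args]
            for combo in itertools.product(*choices):
                total = score + sum(s for _, s in combo)
                assignment = tuple((a, v) for a, (v, _) in zip(args, combo))
                unnorm[(f, assignment)] = math.exp(total)
        z = sum(unnorm.values())
        return {k: v / z for k, v in unnorm.items()}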
[...] our checks for SMCalFlow, and 585 dialogues (3.0%) for TreeDST. When converting back to the original format, we tally an error for each discarded example, and select the most frequent version of any lossily collapsed annotation.

Our approach also provides two simple yet effective consistency checks for the training data: (i) running type inference using the provided type declarations to detect ill-typed examples, and (ii) using the constraints described in Appendix B to detect other forms of annotation errors. We found that these two checks together caught 68 potential annotation errors (<[...]

The meta-computation operators have the following signatures:

    def refer[T](): T

    def revise[T, R](
        root: Root[T],
        path: Path[R],
        revision: R => R,
    ): T

refer goes through the programs and system actions in the dialogue history, starting at the most recent turn, finds the first sub-program that evaluates to type T, and replaces its invocation with that sub-program. Similarly, revise finds the first program whose root matches the specified root, walks [...]
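As an illustration of the semantics just described for refer, here is a rough Python sketch (our own code, not the paper's implementation; the Invocation class from the earlier sketch and a result_type helper are assumed) that searches the dialogue history, most recent turn first, for the first sub-program evaluating to the requested type.

    def refer(history_programs, wanted_type, result_type):
        """Find the most recent sub-program whose result has the wanted type.

        history_programs: list of programs, ordered oldest-to-newest, where
            each program is a list of invocations (linearized representation).
        result_type: function mapping an invocation to the type of its result.
        Returns the matching invocation, or None if the type never occurred.
        (Preferring later sub-programs within a turn is a choice of this sketch.)
        """
        for program in reversed(history_programs):   # most recent turn first
            for invocation in reversed(program):
                if result_type(invocation) == wanted_type:
                    return invocation
        return None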
[...] actions (following Cheng et al. (2020)) when converting back to the original tree representation. We canonicalize the resulting trees by lexicographically sorting the children of each node.

For our baseline comparisons and ablations (shown in Tables 2a and 2b), we decided to ignore the refer-are-correct flag for SMCalFlow because it assumes that refer is handled by some other model and for these experiments we are only interested in evaluating program prediction. Also, for TreeDST we use the gold plans for the dialogue history in order to focus on the semantic parsing problem, as opposed to the dialogue state tracking problem. For the non-conversational semantic parsing datasets we replicated the evaluation approach of Dong and Lapata (2016), and so we also canonicalize the predicted programs.

E Model Hyperparameters

We use the same hyperparameters for all of our conversational semantic parsing experiments. For the encoder, we use either BERT-medium (Turc et al., 2019) or, for the non-BERT experiments, a non-pretrained 2-layer Transformer (Vaswani et al., 2017) with a hidden size of 128, 4 heads, and a fully connected layer size of 512. For the decoder we use a 2-layer Transformer with a hidden size of 128, 4 heads, and a fully connected layer size of 512, and set h_type to 128 and h_arg to 512. For the non-conversational semantic parsing experiments we use a hidden size of 32 throughout the model as the corresponding datasets are very small. We also use a dropout of 0.2 for all experiments.

For training, we use the Adam optimizer (Kingma and Ba, 2017), performing global gradient norm clipping with the maximum allowed norm set to 10. For batching, we bucket the training examples by utterance length and adapt the batch size so that the total number of tokens in each batch is 10,240. Finally, we average the log-likelihood function over each batch, instead of summing it.

Experiments with BERT. We use a pre-training phase for 2,000 training steps, where we freeze the parameters of the utterance encoder and only train the dialogue history encoder and the decoder. Then, we train the whole model for another 8,000 steps. This is because our model is not simply adding a linear layer on top of BERT, and so, unless initialized properly, we may end up losing some of the information contained in the pre-trained BERT model. During the pre-training phase, we linearly warm up the learning rate to 2 × 10^-3 during the first 1,000 steps. We then decay it exponentially by a factor of 0.999 every 10 steps. During the full training phase, we linearly warm up the learning rate to 1 × 10^-4 during the first 1,000 steps, and then decay it exponentially in the same fashion.

Experiments without BERT. We use a single training phase for 30,000 steps, where we linearly warm up the learning rate to 5 × 10^-3 during the first 1,000 steps, and then we decay it exponentially by a factor of 0.999 every 10 steps. We need a larger number of training steps in this case because none of the model components have been pre-trained. Also, the encoder is now much smaller, meaning that we can afford a higher learning rate.

Even though these hyperparameters may seem very specific, we emphasize that our model is robust to the choice of hyperparameters and this setup was chosen once and shared across all experiments.

F Baseline Models

Seq2Seq. This model predicts linearized, tokenized S-expressions using the OpenNMT implementation of a Transformer-based (Vaswani et al., 2017) pointer-generator network (See et al., 2017). For example, the following program:

    +(length("some string"), 1)

would correspond to the space-separated sequence:

    ( + ( length " some string " ) 1 )

In contrast to the model proposed in this paper, in this case tokens that belong to functions and values (i.e., that are outside of quotes) can also be copied directly from the utterance. Furthermore, there is no guarantee that this baseline will produce a well-formed program.

Seq2Tree. This model uses the same underlying implementation as our Seq2Seq baseline—also with no guarantee that it will produce a well-formed program—but it predicts a different sequence. For example, the following program:

    +(+(1, 2), 3)

would be predicted as the sequence:

    +(<arg>, 3) +(1, 2)

Each item in the sequence receives a unique embedding in the output vocabulary and so "+(1, 2)" and "+(<arg>, 3)" share no parameters. <arg> is a special placeholder symbol that represents a substitution point when converting the linearized sequence back to a tree. Furthermore, copies are not inlined into invocations, but broken out into token sequences. For example, the following program:

    +(length("some string"), 1)

would be predicted as the sequence:

    +(<arg>, 1) length(<arg>) " some string "

Seq2Tree++. This is a re-implementation of Krishnamurthy et al. (2017) with some differences: (i) our implementation's entity linking embeddings are computed over spans, including type information (as in the original paper) and a span embedding computed based on the LSTM hidden state at the start and end of each entity span, (ii) copies are treated as entities by proposing all spans up to length 15, and (iii) we use the linear dialogue history encoder described in §3.1.

G Ablations

The "value dependence" and the "no span copy" ablations are perhaps the most important in our experiments, and so we provide some more details about them in the following paragraphs.

Value Dependence. The goal of this ablation is to quantify the impact of the dependency structure we propose in Equation 3. To this end, we first convert all functions to a curried form, where each argument is provided as part of a separate function invocation. For example, the following invocation:

    [0] event(subject = s0, start = t0, end = t1)

is transformed to the following program fragment:

    [0] value(s0)
    [1] event_0(subject = [0])
    [2] value(t0)
    [3] event_1(curried = [1], start = [2])
    [4] value(t1)
    [5] event_2(curried = [3], end = [4])

When choosing a function, our decoder does not condition on the argument values of the previous invocations. In order to enable such conditioning without modifying the model implementation, we also transform the value function invocations whose underlying values are not copies, such that there exists a unique function for each unique value. This results in the following program:

    [0] value_s0(s0)
    [1] event_0(subject = [0])
    [2] value_t0(t0)
    [3] event_1(curried = [1], start = [2])
    [4] value_t1(t1)
    [5] event_2(curried = [3], end = [4])

Note that we keep the value_s0, value_t0, and value_t1 function arguments because they allow the model to marginalize over multiple possible value sources (§2.5). The reason we do not transform the value functions that correspond to copies is because we attempted doing that on top of the span copy ablation, but it performed poorly and we decided that it may be a misrepresentation. Overall, this ablation offers us a way to obtain a bottom-up parser that maintains most properties of the proposed model, except for its dependency structure.

No Span Copy. In order to ablate the proposed span copy mechanism we implemented a data transformation that replaces all copied values with references to the result of a copy function (for spans of length 1) or the result of a concatenate function called on the results of 2 or more calls to copy. For example, the function invocation:

    [0] event(subject = "water the plant")

is converted to:

    [0] copy("water")
    [1] copy("the")
    [2] concatenate([0], [1])
    [3] copy("plant")
    [4] concatenate([2], [3])
    [5] event(subject = [4])

When applied on its own and not combined with other ablations, the single-token copies are further inlined to produce the following program:

    [0] concatenate("water", "the")
    [1] concatenate([0], "plant")
    [2] event(subject = [1])

H Computational Efficiency

For comparing model performance we computed the average utterance processing time across all of the SMCalFlow validation set, using a single Nvidia V100 GPU. The fastest baseline required about 80ms per utterance, while our model only required about 8ms per utterance. This can be attributed to multiple reasons, such as the facts that: (i) our independence assumptions allow us to predict the argument value distributions in parallel, (ii) we avoid enumerating all possible utterance spans when computing the normalizing constant for the argument values, and (iii) we use ragged tensors to avoid unnecessary padding and computation.
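The no-span-copy transformation described above is mechanical enough to sketch directly. The following is our own illustrative Python (the Invocation and Reference classes from the earlier sketch are assumed, and for simplicity every string argument is treated as if it had been copied from the utterance, which is an assumption of this sketch rather than of the actual data transformation).

    def decompose_span_copies(program):
        """Rewrite string arguments as copy/concatenate invocations.

        A string of n tokens becomes n copy calls folded together with
        concatenate calls, and the argument is replaced by a reference to the
        final result. References to original invocations are re-indexed so the
        transformed program stays well-formed.
        """
        output, index_map = [], {}
        for old_index, inv in enumerate(program):
            new_args = {}
            for name, value in inv.arguments.items():
                if isinstance(value, Reference):
                    new_args[name] = Reference(index_map[value.index])
                elif isinstance(value, str):
                    tokens = value.split()
                    output.append(Invocation("copy", {"value": tokens[0]}))
                    acc = Reference(len(output) - 1)
                    for token in tokens[1:]:
                        output.append(Invocation("copy", {"value": token}))
                        tok = Reference(len(output) - 1)
                        output.append(Invocation("concatenate",
                                                 {"left": acc, "right": tok}))
                        acc = Reference(len(output) - 1)
                    new_args[name] = acc
                else:
                    new_args[name] = value
            output.append(Invocation(inv.function, new_args))
            index_map[old_index] = len(output) - 1
        return output

Applied to the event(subject = "water the plant") example above, this produces the same copy/concatenate sequence shown in Appendix G.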