Low-Resource Task-Oriented Semantic Parsing via Intrinsic Modeling
Low-Resource Task-Oriented Semantic Parsing via Intrinsic Modeling

Shrey Desai, Akshat Shrivastava, Alexander Zotov, Ahmed Aly
Facebook
{shreyd, akshats, azotov, ahhegazy}@fb.com

Abstract

Task-oriented semantic parsing models typically have high resource requirements: to support new ontologies (i.e., intents and slots), practitioners crowdsource thousands of samples for supervised fine-tuning. Partly, this is due to the structure of de facto copy-generate parsers; these models treat ontology labels as discrete entities, relying on parallel data to extrinsically derive their meaning. In our work, we instead exploit what we intrinsically know about ontology labels; for example, the fact that SL:TIME_ZONE has the categorical type "slot" and language-based span "time zone". Using this motivation, we build our approach with offline and online stages. During preprocessing, for each ontology label, we extract its intrinsic properties into a component, and insert each component into an inventory as a cache of sorts. During training, we fine-tune a seq2seq, pre-trained transformer to map utterances and inventories to frames, parse trees comprised of utterance and ontology tokens. Our formulation encourages the model to consider ontology labels as a union of their intrinsic properties, therefore substantially bootstrapping learning in low-resource settings. Experiments show our model is highly sample efficient: using a low-resource benchmark derived from TOPv2 (Chen et al., 2020), our inventory parser outperforms a copy-generate parser by +15 EM absolute (44% relative) when fine-tuning on 10 samples from an unseen domain.

1 Introduction

Task-oriented conversational assistants face an increasing demand to support a wide range of domains (e.g., reminders, messaging, weather) as a result of their emerging popularity (Chen et al., 2020; Ghoshal et al., 2020). For practitioners, enabling these capabilities first requires training semantic parsers which map utterances to frames executable by assistants (Gupta et al., 2018; Einolghozati et al., 2018; Pasupat et al., 2019; Aghajanyan et al., 2020; Li et al., 2020; Chen et al., 2020; Ghoshal et al., 2020). However, current methodology typically requires crowdsourcing thousands of samples for each domain, which can be both time-consuming and cost-ineffective at scale (Wang et al., 2015; Jia and Liang, 2016; Herzig and Berant, 2019; Chen et al., 2020). One step towards reducing these data requirements is improving the sample efficiency of current, de facto copy-generate parsers. However, even when leveraging pre-trained transformers, these models are often ill-equipped to handle low-resource settings, as they fundamentally lack inductive bias and cross-domain reusability.

In this work, we explore a task-oriented semantic parsing model which leverages the intrinsic properties of an ontology to improve generalization. To illustrate, consider the ontology label SL:TIME_ZONE: a slot representing the time zone in a user's query. Copy-generate models typically treat this label as a discrete entity, relying on parallel data to extrinsically learn its semantics. In contrast, our model exploits what we intrinsically know about this label, such as its categorical type (e.g., "slot") and language-based span (e.g., "time zone"). Guided by this principle, we extract the properties of each label in a domain's ontology, building a component with these properties and inserting each component into an inventory. By processing this domain-specific inventory through strong language models, we effectively synthesize an inductive bias useful in low-resource settings.

Concretely, we build our model on top of a seq2seq, pre-trained transformer, namely BART (Lewis et al., 2020), and fine-tune it to map utterances to frames, as depicted in Figure 1. Our model operates in two stages: (1) the encoder inputs a domain-specific utterance and inventory and (2) the decoder outputs a frame composed of utterance and ontology tokens.
Figure 1: Illustration of our low-resource, task-oriented semantic parser. Let x represent an utterance and d the utterance's domain, which consists of an ontology (list of intents and slots) ℓ^d_1, · · · , ℓ^d_m. We create an inventory I_d, where each component contains intrinsic properties (index, type, span) derived from its respective label. Then, we fine-tune a seq2seq transformer to input a linearized inventory I_d and utterance x and output a frame y. Here, frames are composed of utterance and ontology tokens; ontology tokens, specifically, are referenced naturally via encoder self-attention rather than an augmented decoder vocabulary, like with copy-generate mechanisms.

Here, instead of performing vocabulary augmentation, the standard way of "generating" ontology tokens in copy-generate parsers, we treat ontology tokens as pointers to inventory components. This is particularly useful in low-resource settings; our model is encouraged to represent ontology tokens as a union of their intrinsic properties, and as a result, does not require many labeled examples to achieve strong performance.

We view sample efficiency as one of the advantages of our approach. As such, we develop a comprehensive low-resource benchmark derived from TOPv2 (Chen et al., 2020), a task-oriented semantic parsing dataset spanning 8 domains. Using a leave-one-out setup on each domain, models are fine-tuned on a high-resource, source dataset (other domains; 100K+ samples), then fine-tuned and evaluated on a low-resource, target dataset (this domain; 10-250 samples). We also randomly sample target subsets to make the transfer task more difficult. In aggregate, our benchmark provides 32 experiments, each varying in domain and number of target samples.

Both coarse-grained and fine-grained experiments show our approach outperforms baselines by a wide margin. Overall, when averaging across all domains, our inventory model outperforms a copy-generate model by +15 EM absolute (44% relative) in the most challenging setting, where only 10 target samples are provided. Notably, our base inventory model (139M parameters) also outperforms a large copy-generate model (406M parameters) in most settings, suggesting usability in resource-constrained environments. We also show systematic improvements on a per-domain basis, even for challenging domains with high compositionality and ontology size, such as reminders and navigation. Finally, through error analysis, we show our model's predicted frames are largely precise and linguistically consistent; even when inaccurate, our frames do not require substantial modifications to achieve gold quality.

2 Background and Motivation

Task-oriented semantic parsers typically cast parsing as transduction, utilizing seq2seq transformers to map utterances to frames comprised of intents and slots (Aghajanyan et al., 2020; Li et al., 2020; Chen et al., 2020; Ghoshal et al., 2020). Because frames are a composition of utterance and ontology tokens, these models are often equipped with copy-generate mechanisms; at each timestep, the decoder either copies from the utterance or generates from the ontology (See et al., 2017; Aghajanyan et al., 2020; Chen et al., 2020; Ghoshal et al., 2020). These parsers are empirically effective in high-resource settings, achieving state-of-the-art performance on numerous benchmarks (Aghajanyan et al., 2020), but typically lack inductive bias in low-resource settings.
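To make this mechanism concrete, below is a minimal sketch of one copy-generate decoding step in the style of pointer-generator networks (See et al., 2017). The module names, shapes, and the way the two distributions are combined are illustrative assumptions, not the implementation of any parser cited above.

    import torch
    import torch.nn.functional as F

    def copy_generate_step(dec_state, enc_states, vocab_proj, copy_proj, gen_gate):
        """One decoding step of a schematic copy-generate parser.

        dec_state:  (hidden,)         current decoder state
        enc_states: (src_len, hidden) encoder states over the utterance
        vocab_proj: nn.Linear         scores the ontology-augmented vocabulary
        copy_proj:  nn.Linear         scores utterance positions for copying
        gen_gate:   nn.Linear         produces the generate-vs-copy switch
        """
        # "Generate": distribution over the decoder vocabulary, which has been
        # augmented with discrete ontology labels (e.g., SL:LOCATION).
        p_vocab = F.softmax(vocab_proj(dec_state), dim=-1)
        # "Copy": attention over utterance tokens, used to copy spans verbatim.
        p_copy = F.softmax(enc_states @ copy_proj(dec_state), dim=-1)
        # Soft switch deciding whether to generate or copy at this timestep.
        p_gen = torch.sigmoid(gen_gate(dec_state)).squeeze(-1)
        # In a full parser, the two weighted distributions are merged by
        # scattering copy probabilities onto the utterance's token ids.
        return p_gen * p_vocab, (1.0 - p_gen) * p_copy

The point relevant to this section is that the ontology labels live only in vocab_proj's output space, so their embeddings start from scratch in every new domain.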
To illustrate, consider a hypothetical domain adaptation scenario where a copy-generate parser adapts to the weather domain. In standard methodology, a practitioner augments the decoder's vocabulary with weather ontology labels, then fine-tunes the parser on weather samples. This subsequently trains the copy-generate mechanism to generate these labels as deemed appropriate. But, the efficacy of this process scales with the amount of training data as these ontology labels (though, more specifically, their embeddings) iteratively derive extrinsic meaning. Put another way, before training, there exists no correspondence between an ontology label (e.g., SL:LOCATION) and an utterance span (e.g., "Menlo Park"). Such an alignment is only established once the model has seen enough parallel data where the two co-occur.

In contrast, we focus on reducing data requirements by exploiting the intrinsic properties of ontology labels. These labels typically have several core elements, such as their label types and spans. For example, by teasing apart SL:LOCATION, we can see it is composed of the type "slot" and span "location". These properties, when pieced together and encoded by strong language models, provide an accurate representation of what the ontology label is. While these properties are learned empirically with sufficient training data, as we see with the copy-generate parser, our goal is to train high-quality parsers with as little data as possible by explicitly supplying this information. Therefore, a central question of our work is whether we can build a parser which leverages the intrinsic nature of an ontology space while retaining the flexibility of seq2seq modeling; we detail our approach in the next section.

3 Semantic Parsing via Label Inventories

Illustrated in Figure 1, we develop a seq2seq parser which uses inventories—that is, tables enumerating the intrinsic properties of ontology labels—to map utterances to frames. Inventories are domain-specific, and each component carries the intrinsic properties of a single label in the domain's ontology. On the source-side, a pre-trained encoder consumes both an utterance and inventory (corresponding to the utterance's domain). Then, on the target-side, a pre-trained decoder mimics a copy-generate mechanism by either selecting from the utterance or ontology. Instead of selecting ontology labels from an augmented vocabulary, as in copy-generate methods, our decoder naturally references these labels in the source-side inventory through self-attention. The sequence ultimately decoded during generation represents the frame.

As alluded to earlier, we focus on two intrinsic properties of ontology labels: types and spans. The type is particularly useful for enforcing syntactic structure; for example, the rule "slots cannot be nested in other slots" (Gupta et al., 2018) is challenging to meet unless a model can delineate between intents and slots. Furthermore, the span is effectively a natural language description, which provides a general overview of what the label aligns to. Though inventory components can incorporate other intrinsic properties, types and spans do not require manual curation and can be automatically sourced from existing annotations.

Despite the reformulation of the semantic parsing task with inventories, our approach inherits the flexibility and simplicity of copy-generate models (Aghajanyan et al., 2020; Li et al., 2020; Chen et al., 2020; Ghoshal et al., 2020); we also treat parsing as transduction, leverage pre-trained modules, and fine-tune with log loss. However, a key difference is our parser is entirely text-to-text and does not require extra parameters, which we show promotes reusability in low-resource settings. In the following sub-sections, we elaborate on our approach in more detail and comment on several design decisions.

3.1 Inventories

Task-oriented semantic parsing datasets typically have samples of the form (d, x, y) (Gupta et al., 2018; Li et al., 2020; Chen et al., 2020); that is, a domain d, utterance x, and frame y.
There exist many domains d_1, · · · , d_n, where each domain defines an ontology ℓ^d_1, · · · , ℓ^d_m, or list of intents and slots. For a given domain d, we define its inventory as a table I_d = [c^d_1, · · · , c^d_m] where each component c^d_i = (i, t, s) is a tuple storing the intrinsic properties of a corresponding label ℓ^d_i. Specifically, these components consist of: (1) an index i ∈ Z≥0 representing the label's position in the (sorted) ontology; (2) a type t ∈ {intent, slot} denoting whether the label is an intent or slot; and (3) a span s ∈ V* representing an ontology description, formally represented as a string from a vocabulary V. The index is a unique referent to each component and is largely used as an optimization trick during generation; we elaborate on this in the next section. In Table 1, we show an example inventory for the alarm domain.

Ontology Label      Index   Type     Span
IN:CREATE_ALARM     1       intent   create alarm
IN:UPDATE_ALARM     2       intent   update alarm
SL:ALARM_NAME       3       slot     alarm name
SL:DATE_TIME        4       slot     date time
SL:TIME_ZONE        5       slot     time zone

Table 1: Example inventory I_alarm (right half) with components c_1:5, each corresponding to ontology labels (left half) from the alarm domain.
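To make the construction concrete, here is a minimal Python sketch of deriving components from raw ontology labels such as those in Table 1. The prefix-to-type and name-to-span rules follow the paper's description (IN:/SL: gives the type; the lowercased, de-underscored name gives the span), while the helper names themselves are ours.

    from dataclasses import dataclass

    @dataclass
    class Component:
        index: int   # position in the (sorted) ontology
        type: str    # "intent" or "slot"
        span: str    # language-based description, e.g. "time zone"

    def build_inventory(ontology):
        """Extract (index, type, span) components from raw ontology labels.

        e.g. build_inventory(["SL:TIME_ZONE", "IN:CREATE_ALARM"]) yields
        [Component(1, "intent", "create alarm"), Component(2, "slot", "time zone")]
        """
        inventory = []
        for i, label in enumerate(sorted(ontology), start=1):
            prefix, name = label.split(":", 1)            # "SL", "TIME_ZONE"
            label_type = "intent" if prefix == "IN" else "slot"
            span = name.lower().replace("_", " ")         # "time zone"
            inventory.append(Component(index=i, type=label_type, span=span))
        return inventory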
3.2 Seq2Seq Model

Our model is built on top of a pre-trained, seq2seq transformer architecture (Vaswani et al., 2017) with vocabulary V.

Encoder. The input to the model is a concatenation of an utterance x ∈ V* and its domain d's inventory I_d ∈ V*. Following recent work in tabular understanding (Yin et al., 2020), we encode our tabular inventory I_d as a linearized string. As shown in Figure 1, for each component, the index is preceded by [ and the remaining elements are demarcated by |.

Decoder. The decoder outputs a token at each timestep, conditioning on the utterance, inventory, and previous timesteps; we fine-tune with the log loss:

L(θ) = − Σ_(x,y) Σ_t log P(y_t | x, I_d, y_<t; θ)
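Continuing the sketch above, the encoder input can be assembled as follows, reusing build_inventory from Section 3.1's sketch. The text states only that each index is preceded by "[" and that fields are demarcated by "|", so the closing bracket, spacing, and example utterance are our assumptions.

    def linearize(inventory):
        """Linearize an inventory table into a string for the encoder."""
        return " ".join(f"[ {c.index} | {c.type} | {c.span} ]" for c in inventory)

    def build_source(utterance, inventory):
        # Encoder input: the utterance concatenated with its domain's
        # linearized inventory (Figure 1).
        return utterance + " " + linearize(inventory)

    # For the alarm inventory in Table 1:
    #   build_source("set an alarm for 8am", inventory)
    #   -> "set an alarm for 8am [ 1 | intent | create alarm ] ... [ 5 | slot | time zone ]"
    # The decoder target is the frame itself, e.g.
    #   "[IN:CREATE_ALARM [SL:DATE_TIME 8am ] ]"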
4 TOPv2-DA

...for target fine-tuning, depending on the subset used. In aggregate, our benchmark provides 32 experiments (8 scenarios × 4 subsets), offering a rigorous evaluation of sample efficiency.

Creating Experiments. We use a leave-one-out algorithm to create source and target datasets. Given domains {d_1, · · · , d_n}, we create n scenarios where the ith scenario uses domains {d_j : d_j ≠ d_i} as the source dataset and domain {d_i} as the target dataset. For each target dataset, we also create m subsets using a random sampling algorithm, each with an increasing number of samples.

Algorithm 1 SPIS algorithm (Chen et al., 2020)
 1: Input: dataset D = {(d^(i), x^(i), y^(i))}, i = 1..n; subset cardinality k
 2: procedure SPIS(D, k)
 3:     Shuffle D using a fixed seed
 4:     S ← ∅  ▷ subset of dataset samples
 5:     C ← ∅  ▷ counter of ontology tokens
 6:     for (d^(i), x^(i), y^(i)) ∈ D do
 7:         for ontology token t ∈ y^(i) do
 8:             if C[t] < k then
 9:                 S ← S + (d^(i), x^(i), y^(i))
10:                 Store y^(i)'s ontology token counts in C
11:                 break
12:             end if
13:         end for
14:     end for
15: end procedure

For our random sampling algorithm, we use samples per intent and slot (SPIS), shown above, which ensures each ontology label (i.e., intent and slot) appears in at least k samples of the resulting subset (Chen et al., 2020). Unlike a traditional algorithm which selects exactly k samples, SPIS guarantees coverage over the entire ontology, but as a result, the number of samples per subset is typically much greater than k (Chen et al., 2020). Therefore, we use conservative values of k; for each scenario, we sample target subsets of 1, 2, 5, and 10 SPIS. Our most extreme setting of 1 SPIS is still 10× smaller than the equivalent setting in prior work (Chen et al., 2020; Ghoshal et al., 2020).

Domains      Source    Target (SPIS)
                       1    2    5    10
Alarm        104,167   13   25   56   107
Event        115,427   22   33   81   139
Messaging    114,579   23   44   89   158
Music        113,034   19   41   92   187
Navigation   103,599   33   63   141  273
Reminder     106,757   34   59   130  226
Timer        113,073   13   27   62   125
Weather      101,543   10   22   47   84

Table 2: TOPv2-DA benchmark training splits. Models are initially fine-tuned on a source dataset, then fine-tuned on an SPIS subset from a target dataset.
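A Python rendering of the leave-one-out split and Algorithm 1 might look as follows; the regex for pulling ontology tokens out of a frame, and all function names, are our assumptions.

    import random
    import re
    from collections import Counter

    def leave_one_out(datasets, target_domain):
        """The ith scenario: all other domains form the source dataset,
        and the held-out domain forms the target dataset."""
        source = [s for d, split in datasets.items()
                  if d != target_domain for s in split]
        return source, datasets[target_domain]

    def spis_subsample(dataset, k, seed=0):
        """Samples-per-intent-and-slot (SPIS) subsampling (Algorithm 1)."""
        data = list(dataset)
        random.Random(seed).shuffle(data)       # shuffle with a fixed seed
        subset, counts = [], Counter()          # S and C in Algorithm 1
        for domain, utterance, frame in data:
            # Ontology tokens in the frame y, e.g. IN:GET_WEATHER, SL:DATE_TIME.
            tokens = re.findall(r"(?:IN|SL):\w+", frame)
            # Keep the sample if any of its ontology tokens is still under-covered.
            if any(counts[t] < k for t in tokens):
                subset.append((domain, utterance, frame))
                counts.update(tokens)           # store y's ontology token counts
        return subset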
5 Experimental Setup

We seek to answer three questions in our experiments: (1) How sample efficient is our model when benchmarked on TOPv2-DA? (2) Does our model perform well on average or does it selectively work on particular domains? (3) How do the intrinsic properties of an inventory component (e.g., types and spans) contribute to performance?

Systems for Comparison. We chiefly experiment with CopyGen and Inventory, a classical copy-generate parser and our proposed inventory parser, as discussed in Sections 2 and 3, respectively. Though both models are built on top of off-the-shelf, seq2seq transformers, the copy-generate parser requires special tweaking; to prepare its "generate" component, we augment the decoder vocabulary with dataset-specific ontology tokens and initialize their embeddings randomly, as is standard practice (Aghajanyan et al., 2020; Chen et al., 2020; Li et al., 2020). In addition, both models are initialized with pre-trained weights. We use BART (Lewis et al., 2020), a seq2seq transformer pre-trained with a denoising objective for generation tasks. Specifically, we use the BART-Base (139M parameters; 12L, 768H, 16A) and BART-Large (406M parameters; 24L, 1024H, 16A) checkpoints.

We benchmark the sample efficiency of these models on TOPv2-DA. Following the methodology outlined in Section 4, for each scenario and subset experiment, each model undergoes two rounds of fine-tuning: it is initially fine-tuned on a high-resource, source dataset, then fine-tuned again on a low-resource, target dataset using the splits in Table 2. The resulting model is then evaluated on the target domain's TOPv2 test set; note that this set is not subsampled, for accurate evaluation. We report the exact match (EM) between the predicted and gold frame. To account for variance, we average EM across three runs, each with a different seed.

Hyperparameters. We use BART checkpoints from fairseq (Ott et al., 2019) and elect to use most hyperparameters out-of-the-box. However, during initial experimentation, we find the batch size, learning rate, and dropout settings to heavily impact performance, especially for target fine-tuning. For source fine-tuning, our models use a batch size of 16, dropout in [0, 0.5], and learning rate in [1e-5, 3e-5]; each model is fine-tuned on a single 32GB GPU given the size of the source datasets. For target fine-tuning, our models use a batch size in [1, 2, 4, 8], dropout in [0, 0.5], and learning rate in [1e-6, 3e-5]; each model is fine-tuned on a single 16GB GPU. Finally, across both source and target fine-tuning, we optimize models with Adam (Kingma and Ba, 2015).
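For reference, the search ranges above can be written down as grids, and exact match scored as strict string equality between predicted and gold frames. Reading EM as string equality is our interpretation of the metric, and the names here are illustrative.

    # Hyperparameter search ranges from Section 5; pairs are interval endpoints.
    SOURCE_FT = {"batch_size": 16, "dropout": (0.0, 0.5), "lr": (1e-5, 3e-5)}
    TARGET_FT = {"batch_size": [1, 2, 4, 8], "dropout": (0.0, 0.5), "lr": (1e-6, 3e-5)}

    def exact_match(predictions, references):
        """Exact match (EM): percentage of predicted frames identical to their
        gold frames; reported numbers average this over three seeded runs."""
        hits = sum(pred == gold for pred, gold in zip(predictions, references))
        return 100.0 * hits / len(references)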
SPIS               1        2        5        10
CopyGen-Base       27.93    39.12    46.23    52.51
CopyGen-Large      35.51    44.40    51.32    56.09
Inventory-Base     38.93    48.98    57.51    63.19
Inventory-Large    51.34    57.63    63.06    68.76

Table 3: Coarse-grained results on TOPv2-DA. Each model's EMs are averaged across 8 domains. Both Inventory-Base and Inventory-Large outperform CopyGen in 1, 2, 5, and 10 SPIS settings.

Figure 2: Δ EM between the base and large variants of Inventory and CopyGen on TOPv2-DA. Notably, Inventory makes the best use of large representations, with Δ = 12.41 EM at 1 SPIS.

Figure 3: EM versus % compositionality and # ontology labels, two important characteristics of task-oriented domains, at 1 SPIS. Horizontal lines show average EM, while vertical lines show average characteristics. Domains include alarm (alm), event (eve), messaging (msg), music (mus), navigation (nav), reminder (rem), timer (tim), and weather (wth).

6 Results and Discussion

6.1 TOPv2-DA Experiments

Table 3 presents the EM of CopyGen and Inventory on TOPv2-DA averaged across 8 domains. We also present more fine-grained results in Table 4, breaking down EM by domain. From these tables, we draw the following conclusions:

Inventory consistently outperforms CopyGen in 1, 2, 5, and 10 SPIS settings. On average, Inventory shows improvements across the board, improving upon CopyGen by at least +10 EM on each SPIS subset. Compared to CopyGen, Inventory is especially strong at 1 SPIS, demonstrating gains of +11 and +15 EM across the base and large variants, respectively. Furthermore, we see Inventory-Base outperforms CopyGen-Large, indicating our model's performance can be attributed to more than just the pre-trained weights and, as a result, carries more utility in compute-constrained environments.

However, provided that these constraints are not a concern, Inventory makes better use of larger representations. Figure 2 illustrates this by plotting the Δ EM between the base and large variants of both models. The delta is especially pronounced at 1 SPIS, where Inventory-Base → Inventory-Large yields +12 EM but CopyGen-Base → CopyGen-Large only yields +7 EM. Unlike CopyGen, which requires fine-tuning extra parameters in a target domain, Inventory seamlessly integrates stronger representations without modification to the underlying architecture. This is an advantage: we expect our model to iteratively improve in quality with the advent of new pre-trained transformers.

Inventory also yields strong results when inspecting each domain separately. TOPv2 domains typically have a wide range of characteristics, such as their compositionality or ontology size, so one factor we can investigate is how our model performs on a per-domain basis. Specifically, is our model generalizing across the board or overfitting to particular settings? Using the per-domain, large model, 1 SPIS results in Table 4, we analyze EM versus % compositionality (fraction of nested frames) and # ontology labels (count of intents and slots). Figure 3 plots these relationships. A key trend we notice is Inventory improves EM in general, though better performance is skewed towards domains with ~20% compositionality and 20-30 ontology labels. This can be partially explained by the fact that domains with these characteristics are more empirically dominant in TOPv2, as shown by the proximity of the dots to the vertical bars. Domains like reminder and navigation are more challenging given the size of their ontology space, but Inventory still outperforms CopyGen by a reasonable margin.
            Base Model (SPIS)                                Large Model (SPIS)
            1           2           5           10           1           2           5           10

Domain: Alarm
CopyGen     20.41±1.16  38.90±4.92  45.50±4.00  52.01±2.68   36.91±3.10  43.70±4.73  45.73±1.42  53.89±0.58
Inventory   62.13±0.42  65.26±0.94  71.81±1.36  75.27±2.16   67.25±0.86  72.11±0.96  71.82±2.83  78.15±2.04

Domain: Event
CopyGen     31.85±0.22  38.85±0.84  38.31±0.71  41.78±2.00   32.37±0.72  34.59±0.62  38.48±0.21  43.93±0.41
Inventory   46.57±1.64  54.31±0.53  58.87±5.03  68.42±2.06   64.77±1.41  55.84±4.20  67.70±0.38  71.21±1.08

Domain: Messaging
CopyGen     38.12±0.61  49.79±2.26  52.79±2.56  58.90±2.29   46.57±4.70  58.42±2.80  56.54±7.94  63.10±1.90
Inventory   46.54±4.27  57.43±2.48  63.72±5.82  70.14±3.98   60.36±3.38  66.68±3.45  74.69±2.17  78.04±2.46

Domain: Music
CopyGen     25.58±0.19  33.28±2.15  48.75±2.07  55.16±3.34   23.84±17.42 36.84±3.78  56.17±0.55  59.18±0.35
Inventory   23.00±0.65  39.65±4.48  53.59±0.45  52.18±2.49   38.68±1.14  52.75±1.71  58.23±1.54  59.73±1.73

Domain: Navigation
CopyGen     19.96±0.84  30.11±3.88  43.38±1.24  45.26±4.59   24.31±0.97  36.28±2.29  48.71±1.75  56.14±0.53
Inventory   21.16±6.59  29.08±0.47  42.59±3.48  53.97±0.56   28.74±2.11  47.47±2.42  49.98±2.13  64.08±0.88

Domain: Reminder
CopyGen     23.66±3.18  23.30±2.24  36.37±2.64  41.66±1.46   31.74±2.53  31.82±2.34  41.57±1.73  42.62±2.09
Inventory   28.58±2.28  38.21±2.81  48.88±1.18  52.04±5.76   40.72±2.22  41.95±2.25  53.57±0.73  58.24±2.75

Domain: Timer
CopyGen     16.62±1.50  40.80±21.47 54.79±2.55  63.26±0.85   32.64±0.33  59.94±1.73  59.80±3.10  66.27±2.54
Inventory   28.92±1.95  53.58±3.44  55.54±3.90  66.82±0.15   48.45±4.44  61.70±3.04  63.74±1.09  68.44±2.09

Domain: Weather
CopyGen     47.24±11.10 57.97±2.35  49.94±11.30 62.07±2.17   53.08±1.31  53.60±0.43  63.56±3.41  63.58±1.60
Inventory   54.53±1.94  54.31±2.87  65.09±1.71  66.66±3.40   61.77±1.71  62.52±1.85  64.73±2.20  72.14±1.76

Table 4: Fine-grained results on TOPv2-DA. For each domain, base model EM is shown in the left half and large model EM in the right half; ± denotes the standard deviation across three runs (rendered as subscripts in the original). Even on a per-domain basis, both Inventory-Base and Inventory-Large outperform CopyGen in most 1, 2, 5, and 10 SPIS settings.
Utterance: I need you to send a video message now
Index:          [IN:SEND_MESSAGE ]
+ Type, Span:   [IN:SEND_MESSAGE + [SL:TYPE_CONTENT video ] ]

Utterance: Did I get any messages Tuesday on Twitter
Index:          [IN:GET_MESSAGE - [SL:RECIPIENT i ] [SL:ORDINAL Tuesday ] - [SL:TAG_MESSAGE Twitter ] ]
+ Type, Span:   [IN:GET_MESSAGE + [SL:DATE_TIME Tuesday ] + [SL:RESOURCE Twitter ] ]

Utterance: Message Lacey and let her know I will be at the Boxer Rescue Fundraiser Saturday around 8
Index:          [IN:SEND_MESSAGE [SL:RECIPIENT Lacey ] [SL:CONTENT_EXACT I will be at the Boxer Rescue Fundraiser - ] [SL:GROUP Saturday around 8 ] ]
+ Type, Span:   [IN:SEND_MESSAGE [SL:RECIPIENT Lacey ] [SL:CONTENT_EXACT I will be at the Boxer Rescue Fundraiser Saturday around 8 ] ]

Table 5: Comparing index only and index + type + span parsers. In each row, we show the utterance and, for each model, its predicted frame; here, the index + type + span frames are always correct. For visualization of edit distance, we use + and - to indicate additions and deletions, respectively.

6.2 Inventory Ablation

Moving beyond benchmark performance, we now turn towards better understanding the driving factors behind our model's performance. From Section 3.1, recall each inventory component consists of an index, type, and span. The index is merely a unique identifier, while the type and span represent intrinsic properties of a label. Therefore, the goal of our ablation is to quantify the impact of adding types and spans to inventories. Because conducting ablation experiments on each domain is cost-prohibitive, we use the messaging domain as a case study given its samples strike a balance between compositionality and ontology size.

We experiment with three Inventory-Large models, where each model iteratively adds an element to its inventory components: (1) index only, (2) index and type, (3) index, type, and span. The results are shown in Table 6. Here, we see that while an index model performs poorly, adding types and spans improves performance across all subsets. At 1 SPIS, in particular, an index model improves by roughly +10 and +20 EM when types and spans are added, respectively. These results suggest that these intrinsic properties provide a useful inductive bias in the absence of copious training data.

Component    SPIS: 1     2       5       10
Index        36.78   44.13   60.63   61.90
+ Type       46.54   49.67   65.21   69.98
+ Span       60.36   66.68   74.69   78.04

Table 6: Inventory ablation experiment results. We benchmark the performance of three Inventory-Large models on the messaging domain, each adding an intrinsic property to their inventories. Our full model, where each component consists of an index, type, and span, outperforms baselines by a wide margin.

In Table 5, we contrast the predictions of the index only (1) and index + type + span (3) models more closely, specifically looking at 1 SPIS cases where the frame goes from being incorrect to correct.
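To make the three ablation conditions concrete, the following sketch extends the earlier linearization with flags for dropping types and spans; as before, the exact delimiters are our assumption.

    def linearize_ablated(inventory, use_type=True, use_span=True):
        """Inventory linearizations for the ablation in Table 6:
        index only, index + type, and index + type + span."""
        parts = []
        for c in inventory:
            fields = [str(c.index)]
            if use_type:
                fields.append(c.type)
            if use_span:
                fields.append(c.span)
            parts.append("[ " + " | ".join(fields) + " ]")
        return " ".join(parts)

    # index only:       "[ 1 ] [ 2 ] ..."
    # + type:           "[ 1 | intent ] [ 2 | intent ] ..."
    # + type and span:  "[ 1 | intent | create alarm ] ..."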
Domain: Alarm
Utterance: Delete my 6pm alarm
Inventory:  [IN:DELETE_ALARM [SL:DATE_TIME 6pm ] ]
Oracle:     [IN:DELETE_ALARM + [SL:ALARM_NAME [IN:GET_TIME [SL:DATE_TIME 6pm ] + ] ] ]

Domain: Event
Utterance: Fun activities in Letchworth next summer
Inventory:  [IN:GET_EVENT [SL:CATEGORY_EVENT - fun activities ] [SL:LOCATION Letchworth ] [SL:DATE_TIME next summer ] ]
Oracle:     [IN:GET_EVENT [SL:CATEGORY_EVENT activities ] [SL:LOCATION Letchworth ] [SL:DATE_TIME next summer ] ]

Domain: Messaging
Utterance: Message Candy to send me details for her baby shower
Inventory:  [IN:SEND_MESSAGE - [SL:SENDER Candy ] [SL:CONTENT_EXACT details for her baby shower ] ]
Oracle:     [IN:SEND_MESSAGE + [SL:RECIPIENT Candy ] [SL:CONTENT_EXACT + send me details for her baby shower ] ]

Domain: Navigation
Utterance: What is the distance between Myanmar and Thailand
Inventory:  [IN:GET_DISTANCE - [SL:UNIT_DISTANCE Myanmar ] - [SL:UNIT_DISTANCE Thailand ] ]
Oracle:     [IN:GET_DISTANCE + [SL:SOURCE Myanmar ] + [SL:DESTINATION Thailand ] ]

Domain: Reminder
Utterance: Remind me that I have lunch plans with Derek in two days at 1pm
Inventory:  [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO I have lunch plans ] [SL:ATTENDEE_EVENT Derek ] [SL:DATE_TIME in two days ] [SL:DATE_TIME at 1pm ] ]
Oracle:     [IN:CREATE_REMINDER [SL:PERSON_REMINDED me ] [SL:TODO + [IN:GET_TODO [SL:TODO lunch plans ] [SL:ATTENDEE Derek ] + ] ] [SL:DATE_TIME + in two days at 1pm ] ]

Domain: Timer
Utterance: Stop the timer
Inventory:  - [IN:DELETE_TIMER [SL:METHOD_TIMER timer ] ]
Oracle:     + [IN:PAUSE_TIMER [SL:METHOD_TIMER timer ] ]

Domain: Weather
Utterance: What is the pollen count for today in Florida
Inventory:  - [IN:GET_WEATHER [SL:WEATHER_ATTRIBUTE pollen ] [SL:DATE_TIME for today ] [SL:LOCATION Florida ] ]
Oracle:     + [IN:UNSUPPORTED_WEATHER [SL:WEATHER_ATTRIBUTE pollen + count ] [SL:DATE_TIME for today ] [SL:LOCATION Florida ] ]

Table 7: Error analysis of domain-specific inventory parsers. In each row, we show the utterance and compare our inventory model's predicted frames to an oracle model's gold frames. For visualization of edit distance, we use + and - to indicate additions and deletions, respectively.
We see a couple of cases where knowing about a label's intrinsic properties might help make the correct assessment during frame generation. The second example shows a scenario where our model labels "tuesday" as SL:DATE_TIME rather than SL:ORDINAL. This distinction is more or less obvious when contrasting the phrases "date time" and "ordinal", where the latter typically maps to numbers. In the third example, a more tricky scenario, our model correctly labels the entire subordinate clause as an exact content slot. While partitioning this clause and assigning slots to its constituents may yield a plausible frame, in this instance, there is not much correspondence between SL:GROUP and "saturday around 8".

7 Error Analysis

Thus far, we have demonstrated the efficacy of inventory parsers, but we have not yet conducted a thorough investigation of their errors. Though models may not achieve perfect EM in low-resource settings, they should ideally fail gracefully, making mistakes which roughly align with intuition. In this section, we assess this by combing through our model's cross-domain errors. Using Inventory-Large models fine-tuned in each domain's 1 SPIS setting, we first manually inspect 100 randomly sampled errors to build an understanding of the error distribution. Then, for each domain, we select one representative error, and present the predicted and gold frames in Table 7.

In most cases, the edit distance between the predicted and gold frames is quite low, indicating the frames our models produce are fundamentally good and do not require substantial modification. We do not see evidence of erratic behavior caused by autoregressive modeling, such as syntactically invalid frames or extraneous subword tokens in the output sequence. Instead, most errors are relatively benign; we can potentially resolve them with rule-based transformations or data augmentation, though these are outside the scope of our work. Below, we comment on specific observations:

Frame slots are largely correct and respect linguistic properties. One factor we investigate is whether our model copies over utterance spans correctly, which correspond to arguments in an API call. These spans typically lie on well-defined constituent boundaries (e.g., prepositional phrases), so we inspect the degree to which this is respected. Encouragingly, the vast majority of spans our model copies over are correct, and the cases which are incorrect consist of adding or dropping modifiers. For example, in the event example, our model adds the adjective "fun", and in the weather example, our model drops the noun "count". These cases are relatively insignificant; they are typically a result of annotation inconsistency and do not carry much weight in practice. However, a more serious error we see is failing to copy over larger spans. For example, in the reminder example, SL:DATE_TIME corresponds to both "in two days" and "at 1pm", but our model only copies over the latter.

Predicting compositional structures is challenging in low-resource settings. Our model also struggles with compositionality in low-resource settings. In both the alarm and reminder examples, our model does not correctly create nested structures, which reflect how slots ought to be handled during execution. Specifically, in the alarm example, because "6pm" is both a name and date/time, the gold frame suggests resolving the alarm in question before deleting it. Similarly, in the reminder example, we must first retrieve the "lunch plans" todo before including it as a component in the remainder of the frame. Both of these cases are tricky as they target prescriptive rather than descriptive behavior. Parsers often learn this type of compositionality in a data-driven fashion, but it remains an open question how to encourage this behavior given minimal supervision.

Ontology labels referring to "concepts" are also difficult. Another trend we notice is our model predicts concept-based ontology labels with low precision.
These labels require understanding a deeper concept which is not immediately clear from the surface description. A prominent example of this is the ontology label IN:UNSUPPORTED_WEATHER, used to tag unsupported weather intents. To use this label, a parser must understand the distinction between in-domain and out-of-domain intents, which is difficult to ascertain from inventories alone. Other examples of this phenomenon manifest in the messaging and navigation domains with the slot pairs (SL:SENDER, SL:RECIPIENT) and (SL:SOURCE, SL:DESTINATION), respectively. While these slots are easier to comprehend given their intrinsic properties, a parser must leverage contextual signals and jointly reason over their spans to predict them.
8 Related Work

Prior work improving the generalization of task-oriented semantic parsers can be categorized into two groups: (1) contextual model architectures and (2) fine-tuning and optimization. We compare and contrast our work along these two axes below.

Contextual Model Architectures. Bapna et al. (2017); Lee and Jha (2018); Shah et al. (2019) propose BiLSTMs which process both utterance and slot description embeddings, and optionally, entire examples, to generalize to unseen domains. Similar to our work, slot descriptions help contextualize what their respective labels align to. These descriptions can either be manually curated or automatically sourced. Our work has three key differences: (1) Inventories are more generalizable, specifying a format which encompasses multiple intrinsic properties of ontology labels, namely their types and spans. In contrast, prior work largely focuses on spans, and that too, only for slot labels. (2) Our model is interpretable: the decoder explicitly aligns inventory components and utterance spans during generation, which can aid debugging. However, slot description embeddings are used in an opaque manner; the mechanism through which BiLSTMs use them to tag slots is largely hidden. (3) We leverage a seq2seq framework which integrates inventories without modification to the underlying encoder and decoder. In contrast, prior work typically builds task-specific architectures consisting of a range of trainable components, which can complicate training.

Fine-tuning and Optimization. Recently, low-resource semantic parsing has seen a methodological shift with the advent of pre-trained transformers. Instead of developing new architectures, as discussed above, one thrust of research tackles domain adaptation via robust optimization. These methods are typically divided between source and target domain fine-tuning. Chen et al. (2020) use Reptile (Nichol et al., 2018), a meta-learning algorithm which explicitly optimizes for generalization during source fine-tuning. Similarly, Ghoshal et al. (2020) develop LORAS, a low-rank adaptive label smoothing algorithm which navigates structured output spaces, therefore improving target fine-tuning. Our work is largely orthogonal; we focus on redefining the inputs and outputs of a transformer-based parser, but do not subscribe to specific fine-tuning or optimization practices. Our experiments use MLE and Adam for simplicity, though future work can consider improving our source and target fine-tuning steps with better algorithms. However, one important caveat is that both Reptile and LORAS rely on strong representations (i.e., BART-Large) for maximum efficiency, and typically show marginal returns with weaker representations (i.e., BART-Base). In contrast, even when using standard practices, both the base and large variants of our model perform well, indicating our approach is more broadly applicable.

9 Conclusion

In this work, we present a seq2seq-based, task-oriented semantic parser based on inventories, tabular structures which capture the intrinsic properties of an ontology space, such as label types (e.g., "slot") and spans (e.g., "time zone"). Our approach is both simple and flexible: we leverage out-of-the-box, pre-trained transformers with no modification to the underlying architecture. We chiefly perform evaluations on TOPv2-DA, a benchmark consisting of 32 low-resource experiments across 8 domains. Experiments show our inventory parser outperforms classical copy-generate parsers by a wide margin, and ablations illustrate the importance of types and spans. Finally, we conclude with an error analysis, providing insight on the types of errors practitioners can expect when using our model in low-resource settings.

References

Armen Aghajanyan, Jean Maillard, Akshat Shrivastava, Keith Diedrick, Michael Haeger, Haoran Li, Yashar Mehdad, Veselin Stoyanov, Anuj Kumar, Mike Lewis, and Sonal Gupta. 2020. Conversational Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2017. Towards Zero-Shot Frame Semantic Parsing for Domain Scaling. In Proceedings of INTERSPEECH.

Xilun Chen, Asish Ghoshal, Yashar Mehdad, Luke Zettlemoyer, and Sonal Gupta. 2020. Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Arash Einolghozati, Panupong Pasupat, Sonal Gupta, Rushin Shah, Mrinal Mohit, Mike Lewis, and Luke Zettlemoyer. 2018. Improving Semantic Parsing for Task-Oriented Dialog. In Proceedings of the Conversational AI Workshop.
Asish Ghoshal, Xilun Chen, Sonal Gupta, Luke Zettlemoyer, and Yashar Mehdad. 2020. Learning Better Structured Representations using Low-Rank Adaptive Label Smoothing. In Proceedings of the International Conference on Learning Representations (ICLR).

Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic Parsing for Task Oriented Dialog using Hierarchical Representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jonathan Herzig and Jonathan Berant. 2019. Don't Paraphrase, Detect! Rapid and Effective Data Collection for Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Robin Jia and Percy Liang. 2016. Data Recombination for Neural Semantic Parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Sungjin Lee and Rahul Jha. 2018. Zero-Shot Adaptive Transfer for Conversational Language Understanding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. 2020. MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark. arXiv preprint arXiv:2008.09335.

Alex Nichol, Joshua Achiam, and John Schulman. 2018. On First-Order Meta-Learning Algorithms. arXiv preprint arXiv:1803.02999.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv preprint arXiv:1904.01038.

Panupong Pasupat, Sonal Gupta, Karishma Mandyam, Rushin Shah, Michael Lewis, and Luke Zettlemoyer. 2019. Span-based Hierarchical Semantic Parsing for Task-Oriented Dialog. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the Point: Summarization with Pointer-Generator Networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Darsh Shah, Raghav Gupta, Amir Fayazi, and Dilek Hakkani-Tur. 2019. Robust Zero-Shot Cross-Domain Slot Filling with Example Values. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS).

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a Semantic Parser Overnight. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).