CONTROL PREFIXES for Text Generation
Jordan Clive (Imperial College London), Kris Cao (DeepMind, London, UK), Marek Rei (Imperial College London)
{jordan.clive19,marek.rei}@imperial.ac.uk, kriscao@deepmind.com

arXiv:2110.08329v1 [cs.CL] 15 Oct 2021

Abstract

Prompt learning methods adapt pre-trained language models to downstream applications by using a task-specific prompt together with the input. Most of the current work on prompt learning in text generation relies on a shared dataset-level prompt for all examples in the dataset. We extend this approach and propose a dynamic method, CONTROL PREFIXES, which allows for the inclusion of conditional input-dependent information in each prompt. CONTROL PREFIXES is at the intersection of prompt learning and controlled generation, empowering the model to have finer-grained control during text generation. The method incorporates attribute-level learnable representations into different layers of a pre-trained transformer, allowing for the generated text to be guided in a particular direction. We provide a systematic evaluation of the technique and apply it to five datasets from the GEM benchmark for natural language generation (NLG). We present state-of-the-art results on several data-to-text datasets, including WebNLG.

1 Introduction

Recently, approaches in text generation have been dominated by adapting one large-scale, pre-trained language model (PLM) to various downstream tasks. Such adaptation is often performed via fine-tuning, which necessitates updating and storing all of the parameters, resulting in multiple new language models (LMs), one for each task. This poses a considerable deployment challenge as the scale of PLMs continues to climb from millions to billions of parameters. Moreover, full fine-tuning has been shown to be unnecessarily profligate through overwriting natural language understanding (NLU) that could otherwise be shared among tasks (Peters et al., 2019); it has also been shown that fine-tuned networks do not deviate substantially from the pre-trained one in parameter space (Aghajanyan et al., 2020; Radiya-Dixit and Wang, 2020), implying the existence of parameter efficient alternatives.

Many researchers have sought to alleviate these issues by using fixed-LM techniques, where all the parameters of the base LM always remain unchanged. An ever-growing subsection of these methods can be classed under prompt learning, where language models are adapted to downstream tasks with the aid of a prompt accompanying the input. A recent survey on prompt learning (Liu et al., 2021a), however, notes the dearth of research exploring dynamic prompts, which are input-dependent. This work considers dynamic prompts, and is inspired by how traditional controlled generation methods utilize controllable attributes to generate target sentences with desired qualities. Existing controlled generation techniques either aim to generate text with specific target qualities, independent of overall task performance, or are methods that have the benefit of updating not only the attribute-level parameters, but adjusting, at the same time, all the LM parameters.

We propose the dynamic prompting method CONTROL PREFIXES, which extends prefix-tuning. The prefix-tuning method integrates static task-specific prompts at every layer of a model, adding only 0.1–2% additional parameters to the base LM. With CONTROL PREFIXES we aim to preserve the fixed-LM property, while also allowing datapoint-specific attributes to act as guidance signals at the input level. This is done by employing modular control prefixes, which change alongside the input according to the guidance signal. Operating together with the static prompt parameters, these dynamic prompts can steer the frozen PLM to extend finer-grained control.

The chosen attributes can provide additional information about the input, for example the domain of a data-to-text tripleset, or it can specify some aspect of the desired output, such as the target length for text simplification.
We evaluate our method on an array of text generation tasks, leveraging additional input-level information specific to each dataset. Our results show that our fixed-LM architecture outperforms previous approaches, usually based on fine-tuning, according to the WebNLG (Gardent et al., 2017), DART (Radev et al., 2020) and E2E Clean (Dušek et al., 2019) data-to-text datasets. In addition, our method attains higher human-assessed performance than existing systems for summarization. This work establishes that the parameters learnt, corresponding to similar labels of a single attribute, share properties. We also demonstrate that zero-shot learning with CONTROL PREFIXES can be effective for conditioning on input-level information previously unseen during training.

2 Related Work

Prompt Learning  Prompt learning (Liu et al., 2021a; Sun et al., 2021; Schick and Schutze, 2021) is a nascent field, instigated by the arrival of GPT-3 (Brown et al., 2020), involving task-specific adaptation of large LMs via prepending an instruction. Several successive works (Logeswaran et al., 2020; Liu et al., 2021b; Lester et al., 2021) employ prompt-embedding tuning, which trains continuous embeddings prepended to the input embeddings. Li and Liang (2021) discovered that prefix-tuning was more effective than prompt-embedding tuning for text generation. In prefix-tuning, additional trainable key-value pairs, which are fixed across all examples, are used to augment the left context in every attention computation. Therefore, the prompt has constituents at every layer rather than being confined to steer the frozen LM only through the input as in embedding tuning.

Controlled generation  A complementary field to prompt learning is controlled generation, which aims to incorporate various types of guidance (e.g. length specifications (Kikuchi et al., 2016) or highlighted phrases (Grangier and Auli, 2018)) beyond the input text into the generation model. Johnson et al. (2016) successfully trained a multilingual translation model with control tokens to encode each language. Keskar et al. (2019) pre-trained a 1.63B parameter model, also alongside conditional control tokens, and demonstrated these learnt to govern style, content, and task-specific behaviour. However, these examples are undesirable in not being fixed-LM techniques—the whole underlying LM can adapt alongside the control tokens.

Alternatives exist, such as plug-and-play perturbations of the LM hidden states towards a target attribute (Nguyen et al., 2016; Dathathri et al., 2020). These methods are fixed-LM and are able to control target qualities such as sentiment and topic. However, they are slow at inference time due to requiring multiple passes for a single batch. The shift in conditional probability can also lead to text degeneration (Holtzman et al., 2019).

Dynamic prompts  There have been few works exploring dynamic prompts (Liu et al., 2021a; Tsimpoukelli et al., 2021), which are input-dependent. Perhaps most similar to our work is work by Yu et al. (2021), who use an attribute alignment function to form dynamic prompts. Unlike our work, the prompt does not have a static component and aims to generate text with specific target attributes, independent of task performance. With CONTROL PREFIXES, the intention is to also maximize task-specific performance, which is why we maintain a large static prompt component to specify the task itself.

Auxiliary scaffold tasks  Incorporating auxiliary scaffold tasks via multitask learning has been previously used for improving span-labeling and text classification (Swayamdipta et al., 2018; Cohan et al., 2019). Cachola et al. (2020) demonstrate that control tokens can be used to effectively incorporate scaffold tasks alongside the main task for BARTLARGE. Inspired by this form of data augmentation, we apply a similar procedure with CONTROL PREFIXES when training on DART, a dataset formed from an accumulation of heterogeneous sub-datasets.

3 CONTROL PREFIXES

3.1 Background

This work considers sequence-to-sequence tasks where the objective is to model the conditional probability P(Y | X), with X and Y representing the tokenized input and output sequences respectively. For example, in summarization, X is an article and Y is a short target summary.

To model P(Y | X), this paper adopts T5-large (Raffel et al., 2020) or BARTLARGE (Lewis et al., 2020) as the underlying pre-trained LM with parameters φ; and as we consider fixed-LM methods, φ always remains frozen. These models are Transformer encoder-decoder models where decoding proceeds auto-regressively. Let us denote d to represent the hidden state dimension and L the number of layers.
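The fixed-LM setup described here is straightforward to make concrete. Below is a minimal sketch (not the authors' released code), assuming the Hugging Face Transformers library that the paper later says it builds on (§4.2): the base seq2seq model supplies the frozen parameters φ, so only externally added prompt parameters would ever receive gradients.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load one of the underlying PLMs used in the paper (T5-large or BART-large).
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

# Freeze all base-LM parameters phi: they are never updated during training.
for param in model.parameters():
    param.requires_grad = False

# Any prefix parameters (theta) would be created outside the frozen model and
# injected into the attention computations; see the sketches after Section 3.3.
```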
[Figure 1 diagram: the left panel shows prefix-tuning, with a single general task prefix P (400k-8M params) applied to every example in a single-task batch; the right panel shows CONTROL PREFIXES, which adds control prefixes C_A, C_B, C_C (70k-400k params each) alongside the general task prefix, both operating on the same frozen pre-trained model (0.4B params).]
Figure 1: High-level diagram contrasting prefix-tuning and CONTROL PREFIXES in the single-task setup for a PLM such as BARTLARGE. The same single-task batch (examples 1, 2, 3, 4 and 5) is considered for both setups. Left: Prefix-tuning has one general prefix P for all examples. Right: CONTROL PREFIXES utilizes additional attribute information at the input-level, G, in i). This conditional information is used in ii) to dictate which control prefix (C_A, C_B, C_C) to use for a particular example in a batch. This takes advantage of prefix-tuning's capacity to include different prefixes in one forward pass.

We use (E, Dc, Dm) to denote the three classes of attention present in each layer: self-attention in the encoder (E), decoder cross-attention (Dc) and decoder masked-attention (Dm). For an attention computation in the l-th layer, the query, key and value matrices are denoted Q_l ∈ R^{N×d}, and K_l, V_l ∈ R^{M×d}, where N is the number of tokens in the series relating to queries, and M is the number of tokens in the series relating to keys and values.

3.2 Intuition

We believe having fixed PLM parameters that capture broad natural language understanding, shared task-specific parameters which specify the task itself, and attribute-level parameters which integrate input-level information has a range of benefits. The general task-specific parameters, which channel the frozen LM to carry out the overall task, can themselves adapt to modular control prefixes which change according to the guidance signal, for each input X. This demarcation of parameters enables fine-grained control to be extended to aid performance on a downstream task. CONTROL PREFIXES is, therefore, able to leverage input-level information while being a fixed-LM, parameter efficient¹ method. For this work, we only consider attributes as guidance signal which are made up of discrete labels.

3.3 Description

The idea is to have a general task prefix P_θ ("task-specific parameters"), as in prefix-tuning, which remains static, and to train at the same time C_θ ("attribute-level parameters"): a set of prefixes that change depending on the input. This requires attribute-level information or guidance, G, to indicate which control prefixes are to be used while processing the input X.² Let us consider the parallel corpus Z = {(X^j, Y^j, G^j)}_{j=1,...,N}, where G^j indicates all the conditional attribute-level information for the sample j. The goal is to optimize through gradient descent the final inference parameters, θ, whilst the underlying φ parameters of the pre-trained LM remain frozen:

  θ* = argmax_θ Σ_{j=1}^{N} log p(Y^j | X^j, G^j; P_θ, C_θ, φ).   (1)

General Prefix  For each attention class (E, Dc, Dm), a distinct prefix of key-value pairs is learnt, P = {P_1, ..., P_L}, where P_l ∈ R^{ρ×2d} for all l ∈ {1, ..., L}. P ∈ R^{ρ×2dL} and ρ is the prompt length, i.e. the number of additional key-value pairs in each attention computation. In prefix-tuning³, for an attention computation in the l-th layer, K_l and V_l are augmented to become

  K′_l = [P_{l,K}; K_l],   V′_l = [P_{l,V}; V_l]   (2)

where K′_l, V′_l ∈ R^{(ρ+M)×d}. The overall general prefix, parameterized by θ, is P_θ = {P^E, P^Dc, P^Dm}, where P_θ ∈ R^{ρ×6dL}.

¹ We use the term parameter efficient to denote methods adding [...]
² We discuss cases where G is not present in §6.2.
³ There has been confusion in recent work concerning different forms of prefix-tuning (Li and Liang, 2021). For details and observations of the benefits (previously unremarked upon) [...]
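Eq. (2) can be sketched in a few lines. This is an illustration only (single attention head, assumed tensor shapes), not the authors' implementation: the learned prefix key-value pairs for layer l are concatenated in front of the existing keys and values before the usual scaled dot-product attention.

```python
import torch
import torch.nn.functional as F

def prefix_attention(Q, K, V, P_K, P_V):
    """Attention where K and V are augmented with a learned prefix, as in Eq. (2).

    Q: (N, d) queries; K, V: (M, d); P_K, P_V: (rho, d) prefix key/value pairs.
    """
    K_aug = torch.cat([P_K, K], dim=0)           # (rho + M, d)
    V_aug = torch.cat([P_V, V], dim=0)           # (rho + M, d)
    d = Q.size(-1)
    scores = Q @ K_aug.T / d ** 0.5              # (N, rho + M)
    return F.softmax(scores, dim=-1) @ V_aug     # (N, d)

# Example with hypothetical sizes: rho = 10 prefix pairs, M = 20 context tokens.
rho, N, M, d = 10, 8, 20, 64
Q, K, V = torch.randn(N, d), torch.randn(M, d), torch.randn(M, d)
P_K = torch.randn(rho, d, requires_grad=True)    # trainable prefix keys
P_V = torch.randn(rho, d, requires_grad=True)    # trainable prefix values
out = prefix_attention(Q, K, V, P_K, P_V)        # (N, d)
```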
Control Prefixes  Let us consider one attribute with R possible labels⁴, such as the news domain of an article (e.g. sport, technology etc.), C_θ = {C_{θ,1}, ..., C_{θ,R}}, where C_{θ,r} ∈ R^{ρc×6dL} for all r ∈ {1, ..., R}. C_{θ,r} represents the control prefix learnt for the r-th attribute label, and the parameter ρc denotes the control prompt length for this particular attribute. Let A be a function which returns the corresponding control prefix for the attribute label indicated by G. In CONTROL PREFIXES, K_l and V_l are augmented to become

  K″_l = [A(G)_{l,K}; P_{l,K}; K_l],   V″_l = [A(G)_{l,V}; P_{l,V}; V_l]   (3)

where K″_l, V″_l ∈ R^{(ρc+ρ+M)×d}.

Shared Re-parameterization  As in Li and Liang (2021), optimization is stabilized by an increase in the trainable parameters. However, rather than one network, we use three distinct two-layered large feed-forward neural networks, one for each attention class, applied row-wise. For each attention class (E, Dc, Dm), P = MLP(P̃), where P̃ ∈ R^{ρ×d} is smaller than the matrix P ∈ R^{ρ×2dL}, and each MLP has an intermediate dimension k which we set to 800. The distinct MLPs and each P̃ are parameterized by training parameters θ̃; thus, θ is a function of θ̃ and |θ| < |θ̃|. Once training is complete, the final θ parameters can be saved for use at inference and the re-parameterization parameters dispensed with.

As described for the general prefix, P_θ, each control prefix, C_{θ,r}, comprises three constituents, one for each attention class: C_{θ,r} = {C_r^E, C_r^Dc, C_r^Dm}. The re-parameterization of C_{θ,r} occurs in exactly the same manner as P_θ, sharing the same MLP^E, MLP^Dc and MLP^Dm. When using a disjoint set of re-parameterizations for the control prefixes, learning becomes unstable and performance degrades.⁵

Recent work by Buhai et al. (2020) show that over-parameterization can smooth the optimization landscape. With this in mind, the three distinct re-parameterizations compel each prefix element to coordinate control for the particular attention class. For example, the rows of P^E and C_r^E lie in a vector space better coordinated for moderating the processing of the input sequence X than P^Dm and C_r^Dm. This is due to being formed from the shared mapping MLP^E.

⁴ It is easy to see how the procedure can be generalized to multiple attributes; we use up to four attributes and varying control prompt lengths.
⁵ This also results in a significant increase in the number of training parameters θ̃. In contrast, with the methodology outlined, each additional control prefix relates to only an additional dρc training parameters.
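The pieces of §3.3 can be combined in a short sketch. This is a simplified, single-attention-class illustration under assumed sizes (and an assumed Tanh activation, which the transcription does not specify), not the released implementation: a control prefix is looked up from the attribute label G, prepended in front of the general prefix as in Eq. (3), and both prefixes are produced row-wise by the same shared two-layer MLP re-parameterization P = MLP(P̃).

```python
import torch
import torch.nn as nn

d, L, rho, rho_c, k, R = 1024, 24, 10, 5, 800, 3   # assumed sizes; k = MLP width

# Shared re-parameterization for one attention class: maps a small P~ (rows in R^d)
# to key-value pairs for all L layers (rows in R^{2dL}), applied row-wise.
mlp = nn.Sequential(nn.Linear(d, k), nn.Tanh(), nn.Linear(k, 2 * d * L))

P_tilde = nn.Parameter(torch.randn(rho, d))        # general prefix, pre-MLP
C_tilde = nn.Parameter(torch.randn(R, rho_c, d))   # one control prefix per label

def prefixes_for(layer, label):
    """Return the [A(G); P] prefix keys and values for one layer (Eq. 3)."""
    P = mlp(P_tilde).view(rho, L, 2, d)             # general prefix, all layers
    C = mlp(C_tilde[label]).view(rho_c, L, 2, d)    # control prefix via the SAME MLP
    K_pref = torch.cat([C[:, layer, 0], P[:, layer, 0]], dim=0)  # (rho_c + rho, d)
    V_pref = torch.cat([C[:, layer, 1], P[:, layer, 1]], dim=0)
    return K_pref, V_pref

K_pref, V_pref = prefixes_for(layer=0, label=1)     # prepend to K_l, V_l as in Eq. (3)
```

Only P̃, C̃ and the shared MLP carry gradients; the frozen LM parameters φ are untouched, matching the training objective in Eq. (1).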
4 Experimental Setup

4.1 Datasets, Guidance and Metrics

Data-to-text  The objective of data-to-text generation is to produce fluent text from structured input, viz. a tripleset (a set of subject-predicate-objects). As in Li and Liang (2021), we elect to evaluate on the data-to-text datasets DART and WebNLG. However, we implement prefix-tuning for T5-large rather than GPT-2, as T5-large provides a much stronger baseline and enables comparison with state-of-the-art (SOTA) systems.⁶ Results are also reported on E2E Clean, a dataset solely focused on the restaurant domain. We use the official evaluation scripts and report a selection of BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and TER (Snover et al., 2006) for each dataset.⁷

⁶ BARTLARGE exhibits inferior performance to T5-large on data-to-text; for example, 9.7 BLEU points lower on WebNLG Unseen (Ribeiro et al., 2020).
⁷ Full results from the evaluation scripts, including machine-learned metrics, can be found in Appendix A.

WebNLG contains triplesets from DBpedia (Auer et al., 2007). The test set is divided into two partitions: Seen, which contains 10 DBpedia categories present in the training set, and Unseen, which covers 5 categories never seen during training.⁸ These categories, such as Airport or Food, are used as guidance signal in our experiments (indicated by A1 in Table 1); our approach for unseen categories is discussed in §6.2.

Providing the category explicitly as guidance with CONTROL PREFIXES may enable inductive biases relating to properties of triples belonging to a specific WebNLG category to be captured more effectively. This idea is encouraged by studies where there is a clear disparity in performance on different categories between different model types (Moryossef et al., 2019; Castro Ferreira et al., 2020).

⁸ Every training category label can be seen in Appendix B, where we visualize control prefixes corresponding to each training category.

DART is an open-domain, multi-source corpus with six sources: internal and external human annotation of both Wikipedia tables and WikiSQL, as well as the two existing datasets WebNLG and E2E Clean. Radev et al. (2020) revealed that fine-tuning T5-large on the WebNLG dataset with only the human annotated portion of DART achieves SOTA performance, whilst using the whole DART dataset is not as effective. Nevertheless, this inspired the idea of using the six DART sub-dataset sources as a controllable attribute—represented by A2 in Table 1—as a data augmentation strategy.

Simplification  We use WikiLarge (Zhang and Lapata, 2017) as training data and evaluate on the two benchmarks TurkCorpus (Xu et al., 2016) and ASSET (Alva-Manchego et al., 2020). Both benchmarks are composed of the same 2000 validation source and 359 test source sentences. Martin et al. (2020) introduced 'BARTLARGE with ACCESS', which is a fine-tuned BARTLARGE model trained alongside control tokens to condition on four simplification-specific attributes, such as the length compression ratio (the length of the target sequence relative to the source sequence). We use the same controllable attributes in this work to directly compare with Martin et al. (2020) (Table 2). The control ratios are discretized into bins of fixed width 0.05, capped to a maximum ratio of 2. At inference time, once the model has been trained with these oracle controls, the control ratios are set to desired values by tuning on the respective validation set.

We report the non-learned metrics SARI (Xu et al., 2016) and FKGL (Kincaid et al., 1975).⁹ Unlike previous studies, we also use the machine-learned Q&A metric QuestEval (Scialom et al., 2021) to assess our text simplification models.

⁹ We use the FKGL and latest version of SARI implemented in EASSE (Alva-Manchego et al., 2019), which is used in Martin et al. (2020).

Summarization  As in Li and Liang (2021), we report results on the XSum dataset (Narayan et al., 2018) using BARTLARGE. XSum comprises 226,711 British Broadcasting Corporation (BBC) articles coupled with their single-sentence summaries—where each sample corresponds to a unique URL. The URL contains information on whether the sub-directory is from the BBC Sport or BBC News page (A1 in Table 3), and further sub-directory information (A2 in Table 3, where A2 has 40 labels), for example ('sport', 'formula1') or ('news', 'science'). The motivation for using this as guidance is that different sub-directories are likely to share properties relating to how the information is presented; journalists are also usually confined to one domain. In addition to reporting the customary ROUGE scores (Lin, 2004), we submit our CONTROL PREFIXES model outputs to the GENIE external human evaluation framework (Khashabi et al., 2021)—where 300 instances are assessed across 5 intrinsic dimensions.

4.2 Architecture and Hyper-parameters

All implementations in this study are built on top of the Transformers library (Wolf et al., 2020). As T5 has relative position biases, we set these to zero in all layers for offsets where the key is part of a prefix. For BARTLARGE we adapt the original implementation (Li and Liang, 2021). For the data-to-text datasets, we follow Ribeiro et al. (2020) and linearize the triples, prepending the special tokens <H>, <R>, and <T> before the subject, predicate, and object of an individual triple.¹⁰ We also prepend "translate Graph to English: " to every input (Raffel et al., 2020).

¹⁰ The embeddings relating to these special tokens are the only embeddings we train, as our work is focused on fixed-LM methods.

The general prompt length and each control prompt length are architecture-specific parameters that we vary to try and maximize performance on the validation set. We use gradient accumulation across batches to maintain an effective batch size above 64, a linear learning rate scheduler for all models, and beam-search decoding. The hyper-parameters we consider are principally the learning rate and the optimizer: AdamW (Loshchilov and Hutter, 2017) or AdaFactor (Shazeer and Stern, 2018).¹¹ We chose the checkpoint with the highest validation set score using BLEU for data-to-text, SARI for simplification and ROUGE-2 for summarization. For all tasks, we train our models on single Tesla V100-SXM2-16GB machines, with mixed precision for BARTLARGE based models (fp16) and full precision for T5-large based models (fp32).

¹¹ Full details can be found in Appendix D.
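Two of the preprocessing steps in §4.1–4.2 are easy to make concrete. The sketch below is an illustration under the conventions described, not the paper's exact scripts: triple linearization with the <H>/<R>/<T> markers plus the T5 task prefix, and discretization of a simplification control ratio into fixed-width 0.05 bins capped at 2 (whether ratios are rounded or floored to a bin edge is an assumption here).

```python
def linearize_triples(triples):
    """Linearize (subject, predicate, object) triples as in Ribeiro et al. (2020)."""
    parts = [f"<H> {s} <R> {p} <T> {o}" for s, p, o in triples]
    return "translate Graph to English: " + " ".join(parts)

def ratio_bin(target_len, source_len, width=0.05, cap=2.0):
    """Discretize a control ratio (e.g. length compression) into a bin label."""
    ratio = min(target_len / source_len, cap)
    return round(ratio / width) * width

example = linearize_triples([("Aarhus_Airport", "cityServed", "Aarhus, Denmark")])
label = ratio_bin(target_len=12, source_len=20)   # -> 0.6 (up to float rounding)
```

The bin label would then index into the set of length-ratio control prefixes, exactly as the discrete attribute labels do for the WebNLG category or XSum sub-directory attributes.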
5 Results

5.1 Data-to-Text

For DART, both CONTROL PREFIXES (A2) and prefix-tuning attain higher performance (Table 1) than the current SOTA—which is T5-large fine-tuned (Radev et al., 2020)—by 1.29 and 0.54 BLEU points respectively. This indicates CONTROL PREFIXES can extend control of the frozen T5-large more effectively than prefix-tuning.

Model | DART: φ% | BLEU | METEOR | TER↓ | WebNLG: φ% | S | U | A | E2E Clean: φ% | BLEU | METEOR
T5-large fine-tuned | 100 | 50.66 | 40 | 43 | 100 | 64.89 | 54.01 | 59.95 | 100 | 38.74 | 37.4
SOTA | 100 | 50.66 | 40 | 43 | 100 | 65.82 | 56.01 | 61.44 | 100 | 43.6 | 39
Prefix-tuning | 1.0 | 51.20 | 40.62 | 43.13 | 1.0 | 66.95 | 55.39 | 61.73 | 1.0 | 43.66 | 39.0
CONTROL PREFIXES (A1) | - | - | - | - | 1.4 | 67.32 | 55.38 | 61.94 | - | - | -
+Data: DART | | | | | | | | | | |
Prefix-tuning | 1.0 | 51.20 | 40.62 | 43.13 | 1.0 | 67.05 | 55.37 | 61.78 | 1.0 | 43.04 | 38.7
CONTROL PREFIXES (A2) | 1.1 | 51.95 | 41.07 | 42.75 | 1.0 | 66.99 | 55.56 | 61.83 | 1.0 | 44.15 | 39.2
CONTROL PREFIXES (A1,A2) | - | - | - | - | 1.4 | 67.15 | 56.41 | 62.27 | - | - | -

Table 1: Data-to-text test set results reported on the respective official evaluation scripts. φ% denotes the % of additional parameters relative to the number of fixed-LM parameters required at inference time. T5-large fine-tuned results for WebNLG are from Ribeiro et al. (2020), and for E2E Clean are calculated from public model outputs (Gehrmann et al., 2021). Several of the baseline results were only reported to the significant figures shown. A1 signifies models trained with control prefixes for the WebNLG category attribute, and A2 with control prefixes for the DART sub-dataset source attribute. For WebNLG, S, U and A refer to BLEU scores for the Seen, Unseen and All portions of the dataset. The DART results are reported on the official evaluation script for v1.1.1, the same version as the official leaderboard. A CONTROL PREFIXES model attains state-of-the-art results for each dataset.¹²

The SOTA for WebNLG is a T5-large model fine-tuned on WebNLG and the human annotated portion of DART (Radev et al., 2020). CONTROL PREFIXES achieves a 0.83 higher BLEU overall, and 1.33 on the Seen categories, than this model. Notably, CONTROL PREFIXES (A1) outperforms CONTROL PREFIXES (A1,A2) on the Seen component of the dataset, but does not generalize as well to the unseen categories. We argue this illustrates the benefit of using both controllable attributes. The prefix-tuning model with additional DART data, like the SOTA, is trained on only the human annotated portion and yields a minor performance increase of 0.05 BLEU compared to prefix-tuning solely trained on WebNLG. We believe this indicates that for fine-tuning, training on a complementary type of additional data allows the PLM to maintain more NLU by not over-fitting a narrow distribution. Therefore the LM can generalize better. Whilst for prefix-tuning, much of this gain has already been realized by retaining the original frozen parameters.

The SOTA (Harkous et al., 2020) for E2E Clean consists of a fine-tuned GPT-2 with a semantic fidelity classifier trained on additional generated [...]

5.2 Simplification

[...] comparing our CONTROL PREFIXES to fine-tuned 'BARTLARGE with ACCESS', there is comparable performance in terms of SARI for ASSET, and better FKGL results. However, on TurkCorpus, CONTROL PREFIXES yields lower performance on average for SARI and FKGL. For text simplification, Martin et al. (2020) indicate the gains from using the controllable attributes, as assessed by SARI and FKGL, are mostly due to being able to calibrate the length ratio, with validation and test sets being drawn from the same distribution, as opposed to the WikiLarge training distribution. We highlight the Gold Reference score for TurkCorpus, which produces inferior results for SARI and FKGL compared to both guided models. The Gold Reference result is computed via a leave-one-out scenario where each reference is evaluated against all others, and then an average is taken.

5.3 Summarization

Our research is not solely focused on parameter efficiency, but more on the effectiveness of adapting an already parameter efficient, fixed-LM method (adding [...]
Model | φ% | ASSET: SARI | FKGL↓ | QuestEval | TurkCorpus: SARI | FKGL↓ | QuestEval
Gold Reference | - | 44.87 | 6.49 | 0.63* | 40.04 | 8.77 | 0.66*
BARTLARGE with ACCESS† | 100 | 43.63 | 6.25 | 0.64* | 42.62 | 6.98 | 0.66*
BARTLARGE fine-tuned | 100 | 39.91* | 7.73* | - | 39.55* | 7.73* | -
Prefix-tuning | 1.8 | 40.12 | 7.28 | - | 39.06 | 7.28 | -
CONTROL PREFIXES | 1.8 | 43.58 | 5.97 | 0.64 | 42.32 | 7.74 | 0.66

Table 2: Simplification results on ASSET and TurkCorpus test sets. † This model is from Martin et al. (2020), where the authors fine-tuned a BARTLARGE model alongside control tokens for the four attributes. The CONTROL PREFIXES model is trained with control prefixes for these same four attributes. Prefix-tuning and CONTROL PREFIXES use BARTLARGE as the fixed LM. The * denotes baseline results calculated in this study—the model outputs of Martin et al. (2020) are publicly available. The BARTLARGE with ACCESS and CONTROL PREFIXES results are the average test set results over 5 random seeds.

Model | φ% | Human overall | Human conciseness | Human fluency | Human no-hallucination | Human informativeness | R-1 | R-2 | R-L
BARTLARGE fine-tuned | 100 | 0.49 (+0.03/−0.04) | 0.50 (+0.03/−0.03) | 0.50 (+0.03/−0.03) | 0.52 (+0.03/−0.03) | 0.49 (+0.03/−0.03) | 45.14* | 22.27* | 37.25*
Prefix-tuning | 3.0 | - | - | - | - | - | 43.53 | 20.66 | 35.63
CONTROL PREFIXES (A1,A2) | 2.8 | 0.51 (+0.03/−0.03) | 0.53 (+0.02/−0.02) | 0.51 (+0.03/−0.03) | 0.53 (+0.03/−0.03) | 0.49 (+0.03/−0.03) | 43.81 | 20.84 | 35.81

Table 3: Summarization results on XSum.¹³ R-1, R-2 and R-L refer to ROUGE-1, ROUGE-2 and ROUGE-L. The human-assessed results are from the GENIE benchmark, where the 95% confidence intervals are computed with bootstrap re-sampling. Note the BARTLARGE fine-tuned results for the human-assessed dimensions are transcribed from Khashabi et al. (2021), whilst the automatic metric results, indicated by *, are from Lewis et al. (2020). Prefix-tuning and CONTROL PREFIXES (A1,A2) use BARTLARGE as the fixed LM. A1 refers to the BBC news/sport page attribute and A2 the further sub-directory attribute.

Table 3 shows that despite CONTROL PREFIXES underperforming fine-tuning according to automatic metrics, CONTROL PREFIXES attains higher human-assessed results. CONTROL PREFIXES also holds the highest overall human evaluation ranking on the GENIE platform (higher than T5-large and PEGASUSLARGE (Zhang et al., 2019) fine-tuned). This study is limited in not being able to compare human-assessment of prefix-tuning, which yields slightly lower ROUGE scores than CONTROL PREFIXES, as participants of GENIE are limited to one submission. The confidence intervals indicate that this result is not necessarily definitive—but it at least highlights the problems with evaluation for XSum, and that the quality of generations in this domain is not captured fully with ROUGE. A sample size of 300 is typically much larger than that where authors construct their own evaluation (Narayan et al. (2018) use 50 and Dou et al. (2020) use 100).

(The public GENIE leaderboard is available at https://leaderboard.allenai.org/genie-xsum/submissions/public.)

6 Analysis

6.1 Visualizing Control Prefixes

Fig. 2 displays t-SNE (Maaten and Hinton, 2008) visualizations of the length compression control prefixes learnt as part of our simplification CONTROL PREFIXES model.¹⁵ We plot only the decoder self-attention constituent of each control prefix (comprising multiple key-value pairs at each layer), as the length ratio directly concerns the target.¹⁶ The relationship learnt by the control prefixes is very manifest—aided by the near uniform distribution of length ratios in the WikiLarge training dataset from 0 to 1.1.

Fig. 2 establishes that for this simplistic attribute, different control prefixes corresponding to similar attribute labels (i.e., varying length ratios for the length attribute) share properties. Interestingly, the decoder cross-attention of the control prefix is not as manifest. We believe this is due to BARTLARGE being accustomed to the same cross-attention key-value pairs in each layer.

¹⁵ A perplexity of 5 is used for all plots.
¹⁶ Plots for the encoder and decoder cross-attention constituents can be found in Appendix E.
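A hedged sketch of the kind of visualization described here (assuming scikit-learn's TSNE and a dict of already-extracted decoder self-attention control-prefix tensors; the variable names and tensor layout are illustrative, not taken from the released code):

```python
import numpy as np
from sklearn.manifold import TSNE

# control_prefixes: {length_ratio_label: array of the decoder self-attention
# key-value pairs for that control prefix, e.g. shape (rho_c, 2*d*L)} -- assumed layout.
def tsne_points(control_prefixes, perplexity=5):
    labels = sorted(control_prefixes)
    X = np.stack([np.asarray(control_prefixes[lab]).reshape(-1) for lab in labels])
    coords = TSNE(n_components=2, perplexity=perplexity).fit_transform(X)
    return labels, coords   # one 2-D point per length-ratio control prefix
```

With roughly 23 length-ratio bins (0 to 1.1 in steps of 0.05), a perplexity of 5 as stated in footnote 15 is well below the number of samples, which is what t-SNE requires.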
Figure 2: t-SNE visualizations for the decoder self-attention constituent of the length compression control prefixes of the simplification model. Each circle represents a control prefix corresponding to each length ratio (bins of fixed width 0.05, from 0 to 1.1).

6.2 Zero-shot Learning

We argue that even for more complicated attributes, such as the WebNLG category attribute, if the attribute labels are similar, the respective control prefixes will similarly guide both the general, task-specific prefix parameters and the frozen LM parameters. Previous work has discussed the notion of task similarity (Achille et al., 2019) for prompt learning methods (Lester et al., 2021); however, we argue prefixes concerning different labels of one attribute are more likely to overlap in terms of learnable properties than different tasks or whole datasets.

In the case of WebNLG, although no examples of an unseen category are present during training, a textual label for the category exists.¹⁷ This gives us some prior on the properties of the unseen categories, which we show is enough to successfully zero-shot transfer with control prefixes. For each WebNLG model with the category attribute, we map each category's textual label [...] examples relating to the unseen category Athlete.¹⁹

Table 4 shows a comparison of using an out-of-vocabulary (OOV) control prefix for each example with an unseen category, and the zero-shot transfer method, for both WebNLG test sets.²⁰ The OOV control prefix is trained on a random 2% of the data for each accumulated batch. These results indicate that zero-shot transfer is more promising than a learned OOV representation. The result fundamentally depends on the WebNLG categories, and whether similar textual labels pertain to similar triplesets that CONTROL PREFIXES can utilize.

Unseen Component | # Examples | # Categories | BLEU
WebNLG | 891 | 5 |
  OOV Representation | | | 56.35
  Zero-shot | | | 56.41
WebNLG+ 2020 | 896 | 3 |
  OOV Representation | | | 50.02
  Zero-shot | | | 50.39

Table 4: A comparison of the performance on the Unseen portions of the WebNLG test sets, with i) a single OOV control prefix used for all samples from unseen categories, or ii) the zero-shot transfer approach outlined, utilizing the available textual labels.

7 Conclusion

We introduce CONTROL PREFIXES, a controlled generation technique which integrates a task-specific prompt alongside dynamic prompts to leverage additional input-level information. The method extends prefix-tuning, enabling the model to have finer-grained control over generated text, and assists in maximizing downstream task performance.

We demonstrate that CONTROL PREFIXES outperforms prefix-tuning, as well as existing approaches, on an array of natural language generation tasks. Our method attains state-of-the-art results on several data-to-text datasets, including WebNLG. This is despite learning [...]
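The transcription cuts off the paper's description of how an unseen category's textual label is mapped to a control prefix. Purely as an illustration of the kind of zero-shot transfer discussed (the references include GloVe (Pennington et al., 2014), so embedding similarity over labels is one plausible reading, but this mapping rule is an assumption, not the confirmed procedure):

```python
import numpy as np

def zero_shot_prefix(unseen_vec, seen_vecs, control_prefixes):
    """Reuse the control prefix of the most similar seen category label.

    unseen_vec: embedding of the unseen category's textual label (e.g. GloVe);
    seen_vecs: {category: embedding}; control_prefixes: {category: prefix params}.
    The nearest-label rule is an illustrative assumption, not the paper's exact method.
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    best = max(seen_vecs, key=lambda c: cos(unseen_vec, seen_vecs[c]))
    return control_prefixes[best]
```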
References

Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, and Pietro Perona. 2019. Task2Vec: Task embedding for meta-learning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6429–6438.

Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. 2020. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. CoRR, abs/2012.13255.

Fernando Alva-Manchego, Louis Martin, Carolina Scarton, and Lucia Specia. 2019. EASSE: Easier automatic sentence simplification evaluation. arXiv preprint arXiv:1908.04567.

Fernando Emilio Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. CoRR, abs/2005.00481.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, Berlin, Heidelberg. Springer Berlin Heidelberg.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS.

Rares-Darius Buhai, Yoni Halpern, Yoon Kim, Andrej Risteski, and David Sontag. 2020. Empirical study of the benefits of overparameterization in learning latent variable models.

Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S. Weld. 2020. TLDR: Extreme summarization of scientific documents. CoRR, abs/2004.15011.

Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego Moussallem, and Anastasia Shimorina. 2020. The 2020 bilingual, bi-directional WebNLG+ shared task: Overview and evaluation results (WebNLG+ 2020). In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), pages 55–76, Dublin, Ireland (Virtual). Association for Computational Linguistics.

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural scaffolds for citation intent classification in scientific publications. CoRR, abs/1904.01608.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.

Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2020. GSum: A general framework for guided neural abstractive summarization. CoRR, abs/2010.08014.

Ondřej Dušek, David M. Howcroft, and Verena Rieser. 2019. Semantic noise matters for neural natural language generation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 421–426, Tokyo, Japan. Association for Computational Linguistics.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Santiago de Compostela, Spain. Association for Computational Linguistics.

Sebastian Gehrmann, Tosin P. Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, et al. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. CoRR, abs/2102.01672.

David Grangier and Michael Auli. 2018. QuickEdit: Editing text & translations by crossing words out. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 272–282, New Orleans, Louisiana. Association for Computational Linguistics.

Hamza Harkous, Isabel Groves, and Amir Saffari. 2020. Have your text and use it too! End-to-end neural data-to-text generation with semantic fidelity. CoRR, abs/2004.06577.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. CoRR, abs/2106.09685.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.

N. Keskar, B. McCann, L. R. Varshney, Caiming Xiong, and R. Socher. 2019. CTRL: A conditional transformer language model for controllable generation. ArXiv, abs/1909.05858.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel S. Weld. 2021. GENIE: A leaderboard for human-in-the-loop evaluation of text generation. CoRR, abs/2101.06561.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338, Austin, Texas. Association for Computational Linguistics.

J. Peter Kincaid, Robert P. Fishburne Jr., Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231, Prague, Czech Republic. Association for Computational Linguistics.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. CoRR, abs/2104.08691.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. CoRR, abs/2103.10385.

Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543.

Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.

L. V. D. Maaten and Geoffrey E. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, and Benoît Sagot. 2020. Multilingual unsupervised sentence simplification. CoRR, abs/2005.00352.

Amit Moryossef, Ido Dagan, and Yoav Goldberg. 2019. Improving quality and efficiency in plan-based neural data-to-text generation.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. CoRR, abs/1808.08745.

Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. 2016. Plug & play generative networks: Conditional iterative generation of images in latent space. CoRR, abs/1612.00005.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, USA. Association for Computational Linguistics.

Nivranshu Pasricha, Mihael Arcan, and Paul Buitelaar. 2020. NUIG-DSI at the WebNLG+ challenge: Leveraging transfer learning for RDF-to-text generation. In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), pages 137–143, Dublin, Ireland (Virtual). Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.

Dragomir R. Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Nazneen Fatema Rajani, Xiangru Tang, et al. 2020. DART: Open-domain structured data record to text generation. CoRR, abs/2007.02871.

Evani Radiya-Dixit and Xin Wang. 2020. How fine can fine-tuning be? Learning efficient language models. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2435–2443, Online. PMLR.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2020. Investigating pretrained language models for graph-to-text generation. arXiv.

Timo Schick and H. Schutze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In EACL.

Thomas Scialom, Louis Martin, Jacopo Staiano, Éric Villemonte de la Clergerie, and Benoît Sagot. 2021. Rethinking automatic evaluation in sentence simplification. CoRR, abs/2104.07560.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and Ralph Weischedel. 2006. A study of translation error rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA 2006).

Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, et al. 2021. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. CoRR, abs/2107.02137.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In EMNLP.

Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. CoRR, abs/2106.13884.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.

Dian Yu, Kenji Sagae, and Zhou Yu. 2021. Attribute alignment: Controlling text generation from pre-trained language models. CoRR, abs/2103.11070.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. CoRR, abs/1912.08777.

Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. arXiv preprint arXiv:1703.10931.