Few-Shot Learning for Opinion Summarization

Arthur Bražinskas¹  Mirella Lapata¹  Ivan Titov¹,²
¹ ILCC, University of Edinburgh  ² ILLC, University of Amsterdam
abrazinskas@ed.ac.uk, {mlap,ititov}@inf.ed.ac.uk

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4119-4135, November 16-20, 2020. © 2020 Association for Computational Linguistics

Abstract

Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents, such as user reviews of a product. The task is practically important and has attracted a lot of attention. However, due to the high cost of summary production, datasets large enough for training supervised models are lacking. Instead, the task has been traditionally approached with extractive methods that learn to select text fragments in an unsupervised or weakly-supervised way. Recently, it has been shown that abstractive summaries, potentially more fluent and better at reflecting conflicting information, can also be produced in an unsupervised fashion. However, these models, not being exposed to actual summaries, fail to capture their essential properties. In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text with all expected properties, such as writing style, informativeness, fluency, and sentiment preservation. We start by training a conditional Transformer language model to generate a new product review given other available reviews of the product. The model is also conditioned on review properties that are directly related to summaries; the properties are derived from reviews with no manual effort. In the second stage, we fine-tune a plug-in module that learns to predict property values on a handful of summaries. This lets us switch the generator to the summarization mode. We show on Amazon and Yelp datasets that our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.

Gold: These shoes run true to size, do a good job supporting the arch of the foot and are well-suited for exercise. They're good looking, comfortable, and the sole feels soft and cushioned. Overall they are a nice, light-weight pair of shoes and come in a variety of stylish colors.

Ours: These running shoes are great! They fit true to size and are very comfortable to run around in. They are light weight and have great support. They run a little on the narrow side, so make sure to order a half size larger than normal.

Reviews: perfect fit for me ... supply the support that I need ... are flexible and comfortable ... || ... It is very comfortable ... I enjoy wearing them running ... || ... running shoes ... felt great right out of the box ... They run true to size ... || ... my feet and feel like a dream ... Totally light weight ... || ... shoes run small ... fit more true to size ... fit is great! ... supports my arch very well ... || ... They are lightweight ... usually wear a size women's 10 ... ordered a 10.5 and the fit is great!

Table 1: Example summaries produced by our system and an annotator; colors encode the alignment to the input reviews. The reviews are truncated and delimited with the symbol '||'.
1 Introduction

Summarization of user opinions expressed in online resources, such as blogs, reviews, social media, or internet forums, has drawn much attention due to its potential for various information access applications, such as creating digests, search, and report generation (Hu and Liu, 2004; Medhat et al., 2014; Angelidis and Lapata, 2018).

Although significant progress has been observed in supervised summarization in the non-subjective single-document context, such as news articles (Rush et al., 2015; Nallapati et al., 2016; Paulus et al., 2017; See et al., 2017; Liu et al., 2018), modern deep learning methods rely on large amounts of annotated data that are not readily available in the opinion-summarization domain and are expensive to produce. A key obstacle making data annotation expensive is that annotators need to consider multiple input texts when writing a summary, which is time-consuming. Moreover, annotation would have to be undertaken for multiple domains, as online reviews are inherently multi-domain (Blitzer et al., 2007) and summarization systems can be domain-sensitive (Isonuma et al., 2017). This suggests that it is unlikely that human-annotated corpora large enough for training deep models will be available.
Recently, a number of unsupervised abstractive multi-document models were introduced (e.g., CopyCat, Bražinskas et al. 2020, and MeanSum, Chu and Liu 2019) that are trained on large collections of unannotated product reviews.¹ However, unsurprisingly perhaps, since the models are not exposed to actual summaries, they are unable to learn their key characteristics. For instance, MeanSum (Chu and Liu, 2019) is prone to producing summaries that contain a significant amount of information that is unsupported by the reviews; CopyCat generates summaries that are better aligned with the reviews, yet they are limited in detail. Moreover, both systems are trained mostly on subjectively written reviews, and as a result, tend to generate summaries in the same writing style.

The main challenge in the absence of large annotated corpora lies in successful utilization of scarce annotated resources. Unlike recent approaches to language model adaptation for abstractive single-document summarization (Hoang et al., 2019; Raffel et al., 2019) that utilize hundreds of thousands of summaries, our two annotated datasets consist of only 60 and 100 annotated data points. It has also been observed that naive fine-tuning of multi-million parameter models on small corpora leads to rapid over-fitting and poor generalization (Vinyals et al., 2016; Finn et al., 2017). In this light, we propose a few-shot learning framework and demonstrate that even a tiny number of annotated instances is sufficient to bootstrap generation of formal summary text that is both informative and fluent (see Table 1). To the best of our knowledge, this work is the first few-shot learning approach applied to summarization.

¹ For simplicity, we use the term 'product' to refer to both Amazon products and Yelp businesses.
In our work, we observe that reviews in a large unannotated collection vary a lot; for example, they differ in style, the level of detail, or how much they diverge from other reviews of the product in terms of content and overall sentiment. We refer to individual review characteristics and their relations to other reviews as properties (Ficler and Goldberg, 2017). While reviews span a large range of property values, only a subset of them is appropriate for summaries. For example, summaries should be close to the product's reviews in content, avoid using first-person pronouns, and agree with the reviews in sentiment. Our approach starts with estimating a property-aware model on a large collection of reviews and then adapts the model using a few annotator-created summaries, effectively switching the generator to the summarization regime. As we demonstrate in our experiments, the summaries do not even have to come from the same domain.

More formally, we estimate a text model on a dataset of reviews; the generator is a Transformer conditional language model (CLM) that is trained with a 'leave-one-out' objective (Besag, 1975; Bražinskas et al., 2020) by attending to other reviews of the product. We define properties of unannotated data that are directly related to the end task of summarization. Those properties are easy to derive from reviews, and no extra annotation effort is required. The CLM is conditioned on these properties in training. The properties encode partial information about the target review that is being predicted. We capitalize on that by fine-tuning parts of the model jointly with a tiny plug-in network on a handful of human-written summaries. The plug-in network is trained to output property values that make the summaries likely under the trained CLM. The plug-in has less than half a percent of the original model's parameters, and is thus less prone to over-fitting on small datasets. Nevertheless, it can successfully learn to control the dynamics of a large CLM by providing property values that force generation of summaries. We shall refer to the model produced using this procedure as the Few-Shot Summarizer (FewSum).

We evaluate our model against both extractive and abstractive methods on Amazon and Yelp human-created summaries. Summaries generated by our model are substantially better than those produced by competing methods, as measured by automatic and human evaluation metrics on both datasets. Finally, we show that our approach allows for successful cross-domain adaptation. Our contributions can be summarized as follows:

• we introduce the first few-shot learning framework for abstractive opinion summarization;

• we demonstrate that the approach substantially outperforms extractive and abstractive models, both when measured with automatic metrics and in human evaluation;

• we release datasets with abstractive summaries for Amazon products and Yelp businesses.²

² Both the code and datasets are available at: https://github.com/abrazinskas/FewSum
models, both when measured with automatic q(ri , r−i ) as shown in Eq. 2. metrics and in human evaluation; M N 1 XX log Gθ (rij |Eθ (r−i j ), q(rij , r−i j )) (2) • we release datasets with abstractive sum- MN j=1 i=1 maries for Amazon products and Yelp busi- We refer to this partial information as properties nesses.2 (Ficler and Goldberg, 2017), which correspond to text characteristics of ri or relations between ri 2 Unsupervised Training and r−i . For example, one such property can be the ROUGE score (Lin, 2004) between ri and r−i , User reviews about an entity (e.g., a product) are which indicates the degree of overlap between ri naturally inter-dependent. For example, knowing and r−i . In Fig. 1, a high ROUGE value can signal that most reviews are negative about a product’s to the generator to attend the word vacuum in the battery life, it becomes more likely that the next source reviews instead of predicting it based on review will also be negative about it. To model language statistics. Intuitively, while the model ob- inter-dependencies, yet to avoid intractabilities as- serves a wide distribution of ROUGE scores during sociated with undirected graphical models (Koller training on reviews, during summarization in test and Friedman, 2009), we use the leave-one-out set- time we can achieve a high degree of input-output ting (Besag, 1975; Bražinskas et al., 2020). text overlap by setting the property to a high value. Specifically, we assume access to a large cor- We considered three types of properties. pus of user text reviews, which are arranged as M groups {r1:N }M j=1 , where r1:N are reviews about Content Coverage: ROUGE-1, ROUGE-2, and a particular product that are arranged as a tar- ROUGE-L between ri and r−i signals to Gθ how get review ri and N − 1 source reviews r−i = much to rely on syntactic information in r−i dur- {r1 , ..., ri−1 , ri+1 , ..., rN }. Our goal is to estimate ing prediction of ri . Writing Style: as a proxy for the conditional distribution ri |r−i by optimizing formal and informal writing style, we compute pro- the parameters θ as shown in Eq. 1. noun counts, and create a distribution over 3 points M N of view and an additional class for cases with no 1 XX θ∗ = arg max log pθ (rij |r−i j ) pronouns; see Appendix 9.7 for details. Rating and θ MN Length Deviations: for the former, we compute j=1 i=1 M N the difference between ri ’s rating and the average 1 XX = arg max log Gθ (rij |Eθ (r−i j )) r−i rating; in the latter case, we use the difference θ MN between ri ’s length and the average r−i length. j=1 i=1 (1) Our model has an encoder-generator Trans- former architecture (Vaswani et al., 2017), where the encoder Eθ produces contextual representations 2.1 Novelty Reduction of r−i that are attended by the generator Gθ , which in-turn is a conditional language model predict- While summary and review generation are techni- ing the target review ri , estimated using teacher- cally similar, there is an important difference that forcing (Williams and Zipser, 1989). An illustra- needs to be addressed. Reviews are often very tion is presented in Fig. 1. diverse, so when a review is predicted, the gen- The objective lets the model exploit common in- erator often needs to predict content that is not formation across reviews, such as rare brand names present in source reviews. On the other hand, when or aspect mentions. For example, in Fig. 
2.1 Novelty Reduction

While summary and review generation are technically similar, there is an important difference that needs to be addressed. Reviews are often very diverse, so when a review is predicted, the generator often needs to predict content that is not present in the source reviews. On the other hand, when a summary is predicted, its semantic content always matches the content of the source reviews. To address this discrepancy, in addition to using the ROUGE scores as explained previously, we introduce a novelty reduction technique, which is similar to label smoothing (Pereyra et al., 2017). Specifically, we add a regularization term L, scaled by λ, that is applied to the word distributions produced by the generator G_θ, as shown in Eq. 3:
\frac{1}{MN} \sum_{j=1}^{M} \sum_{i=1}^{N} \Big[ \log G_\theta(r_i^j \mid E_\theta(r_{-i}^j), q(r_i^j, r_{-i}^j)) - \lambda L(G_\theta(r_i^j \mid E_\theta(r_{-i}^j), q(r_i^j, r_{-i}^j))) \Big]    (3)

The term penalizes assigning probability mass to words that do not appear in r_−i, as shown in Eq. 4, and thus steers the model towards generating text that is more grounded in the content of r_−i:

L(G_\theta(r_i \mid r_{-i}, q(r_i, r_{-i}))) = \sum_{t=1}^{T} \sum_{w \notin V(r_{-i})} G_\theta(W_t = w \mid r_i^{1:t-1}, r_{-i}, q(r_i, r_{-i}))    (4)

Here, T is the length of r_i, and the inner sum is over all words that do not appear in the word vocabulary of r_−i. Intuitively, in Fig. 1, the penalty could reduce the probability of the word hoover being predicted, as it does not appear in the source reviews.
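The penalty in Eq. 4 is straightforward to compute from the generator's output distributions. Below is a minimal PyTorch sketch, assuming the generator yields a matrix of per-step word probabilities for the target review; the tensor layout and function name are our own illustration, not the released implementation.

```python
import torch

def novelty_penalty(word_probs: torch.Tensor, source_vocab_ids: torch.Tensor) -> torch.Tensor:
    """Eq. 4: probability mass assigned to words absent from the source reviews.

    word_probs:       (T, V) next-word distributions for each position of r_i.
    source_vocab_ids: 1-D tensor of vocabulary ids occurring in r_-i.
    """
    in_source = torch.zeros(word_probs.size(-1), dtype=torch.bool)
    in_source[source_vocab_ids] = True
    # Sum the probability of every out-of-source word over all decoding steps.
    return word_probs[:, ~in_source].sum()

# The regularized objective of Eq. 3 then becomes, per target review:
# loss = nll_loss + lambda_ * novelty_penalty(word_probs, source_vocab_ids)
```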
3 Summary Adaptation

Once the unsupervised model is trained on reviews, our task is to adapt it to generation of summaries. Here, we assume access to a small number of annotator-written summaries (s^k, r_1:N^k)_{k=1}^K, where s^k is a summary for the input reviews r_1:N^k. As we will show in Sec. 6.1, naive fine-tuning of the unsupervised model on a handful of annotated data points leads to poor generalization. Instead, we capitalize on the fact that the generator G_θ has observed a wide range of property values associated with r_i during the unsupervised training phase. Intuitively, some combinations of property values drive it towards generating text that has the qualities of a summary, while others towards text that reads like a review. However, we might not know in advance which values are necessary for generation of summaries. Furthermore, q(r_i, r_−i) cannot be applied at test time as it requires access to target texts. In the following section, we describe a solution that switches the generator to the summarization mode relying only on input reviews.

3.1 Plug-in Network

We start by introducing a parametrized plug-in network p_φ(r_−i) that yields the same types of properties as q(r_i, r_−i). From a practical perspective, the plug-in should be input-permutation invariant and allow for an arbitrary number of input reviews (Zaheer et al., 2017). Importantly, the trainable plug-in can have a marginal fraction of the main model's parameters, which makes it less prone to over-fitting when trained on small datasets. We initialize the parameters of p_φ(r_−i) by matching its output to q(r_i, r_−i) on the unannotated reviews. Specifically, we use a weighted combination of distances, shown for one group of reviews in Eq. 5:

\sum_{i=1}^{N} \sum_{l=1}^{L} \alpha_l D_l(p_\phi(r_{-i})_l, q(r_i, r_{-i})_l)    (5)

Here, D_l(p_φ(r_−i)_l, q(r_i, r_−i)_l) is a distance for property l, and α_l is an associated weight. Specifically, we used the L1 norm for Content Coverage and the Rating and Length Deviations, and the Kullback-Leibler divergence for Writing Style. For the plug-in network itself, we employed a multi-layer feed-forward network with multi-head attention modules over the encoded states of the source reviews at each layer, followed by a linear transformation predicting property values. Note that the encoder is shared with the main model.
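A minimal sketch of the initialization objective in Eq. 5 is given below, assuming the plug-in and oracle outputs are stored in dictionaries of tensors keyed by property name; the weights mirror those reported in Appendix 9.3, and all names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def plugin_init_loss(pred: dict, oracle: dict, weights: dict) -> torch.Tensor:
    """Eq. 5: weighted per-property distances between p_phi(r_-i) and q(r_i, r_-i).

    Scalar properties (ROUGE scores, rating/length deviations) use an L1 distance;
    the point-of-view distribution uses a KL divergence."""
    loss = torch.zeros(())
    for name in ("content_coverage", "rating_deviation", "length_deviation"):
        loss = loss + weights[name] * F.l1_loss(pred[name], oracle[name])
    loss = loss + weights["writing_style"] * F.kl_div(
        pred["writing_style"].log(), oracle["writing_style"], reduction="sum")
    return loss

# Per-property weights alpha_l (ROUGE, POV, rating and length deviations),
# following the values reported in Appendix 9.3.
weights = {"content_coverage": 0.5, "writing_style": 0.08,
           "rating_deviation": 1.0, "length_deviation": 0.1}
```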
3.2 Fine-Tuning

Unsurprisingly perhaps, the network p_φ, being initialized on unannotated reviews, inherits a strong bias towards outputting property values that result in the generation of reviews, which is not appropriate for generating summaries. Fortunately, due to the simplicity of the chosen properties, it is possible to fine-tune p_φ to match the output of q on the annotated data (s^k, r_1:N^k)_{k=1}^K using Eq. 5. An alternative is to optimize the plug-in to directly increase the likelihood of summaries under G_θ while keeping all other parameters fixed.³

³ We explored that option, and observed that it works similarly, yet leads to a slightly worse result.

As the generator is trained on unannotated reviews, it might not encounter a sufficient amount of text that is written as a summary and that highly overlaps in content with the input reviews. We address that by unfreezing the attention module of G_θ over the input reviews together with the plug-in p_φ, and by maximizing the likelihood of summaries:

\frac{1}{K} \sum_{k=1}^{K} \log G_\theta(s^k \mid E_\theta(r_{1:N}^k), p_\phi(r_{1:N}^k))    (6)

This allows the system to learn an interaction between G_θ and p_φ: for example, which property values are better associated with summaries and how G_θ should better respond to them.
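The adaptation step in Eq. 6 touches only a small set of parameters. The sketch below outlines the idea, assuming hypothetical `encode`, `log_prob`, and `cross_attention_parameters` accessors on the model object; it is a schematic training loop under those assumptions, not the authors' implementation.

```python
import torch

def finetune_on_summaries(model, plugin, batches, lr=1e-4, epochs=30):
    """Maximize Eq. 6 while updating only the generator's attention over the
    input reviews and the plug-in network; everything else stays frozen."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = list(model.cross_attention_parameters()) + list(plugin.parameters())
    for p in trainable:
        p.requires_grad = True
    optimizer = torch.optim.Adam(trainable, lr=lr)

    for _ in range(epochs):
        for reviews, summary in batches:
            enc_states = model.encode(reviews)                   # E_theta(r_1:N)
            props = plugin(enc_states)                           # p_phi(r_1:N)
            loss = -model.log_prob(summary, enc_states, props)   # -log G_theta(s | ...)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```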
4 Experimental Setup

4.1 Dataset

For training we used customer reviews from Amazon (He and McAuley, 2016) and Yelp.⁴ From the Amazon reviews we selected 4 categories: Electronics; Clothing, Shoes and Jewelry; Home and Kitchen; Health and Personal Care. We used a similar pre-processing schema as in Bražinskas et al. (2020); details are presented in Appendix 9.1. For training, we partitioned business/product reviews into groups of 9 reviews by sampling without replacement. Thus, for unsupervised training in Sec. 2, we conditioned on 8 reviews for each target review. The data statistics are shown in Table 2.

Dataset    Training             Validation
Yelp       38,913/1,016,347     4,324/113,886
Amazon     182,932/3,889,782    9,629/205,992

Table 2: Data statistics after pre-processing. The format in the cells is Businesses/Reviews and Products/Reviews for Yelp and Amazon, respectively.

We obtained 480 human-written summaries (180 for Amazon and 300 for Yelp), each written for 8 reviews, using Amazon Mechanical Turk (AMT). Each product/business received 3 summaries, and averaged ROUGE scores are reported in the following sections. We reserved approximately 1/3 of the summaries for testing and the rest for training and validation. The details are in Appendix 9.2.

⁴ https://www.yelp.com/dataset/challenge

4.2 Experimental Details

For the main model, we used the Transformer architecture (Vaswani et al., 2017) with trainable length embeddings and shared parameters between the encoder and generator (Raffel et al., 2019). Subwords were obtained with BPE (Sennrich et al., 2016) using 32000 merges. Subword embeddings were shared across the model as a form of regularization (Press and Wolf, 2017). For a fair comparison, we approximately matched the number of parameters to CopyCat (Bražinskas et al., 2020). We randomly initialized all parameters with Glorot initialization (Glorot and Bengio, 2010). For the plug-in network, we employed a multi-layer feed-forward network with multi-head attention modules over the encoded states of the source reviews; after the last layer, we performed a linear projection to compute property values. Parameter optimization was performed using Adam (Kingma and Ba, 2014), and beam search with n-gram blocking (Paulus et al., 2017) was applied to our model and CopyCat for summary generation. All experiments were conducted on 4 x GeForce RTX 2080 Ti GPUs.

4.3 Hyperparameters

Our parameter-shared encoder-generator model used an 8-head, 6-layer Transformer stack. Dropout in sub-layers and on subword embeddings was set to 0.1, and we used 1000-dimensional position-wise feed-forward networks. We set the subword and length embeddings to 390 and 10 dimensions, respectively, and both were concatenated to be used as input. For the plug-in network, we set the output dimension to 30 and the internal feed-forward hidden dimension to 20. We used a stack of 3 layers with 3-head attention modules at each layer, and applied 0.4 internal dropout and 0.15 attention dropout. Property values produced by the plug-in or the oracle were concatenated with subword and length embeddings and linearly projected before being passed to the generator. In total, our model had approximately 25M parameters, while the plug-in network had only 100K (i.e., less than 0.5% of the main model's parameters). In all experiments, hyperparameter tuning was performed based on the ROUGE-L score on the Yelp and Amazon validation sets.
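The settings in Sections 4.2-4.3 can be collected into a single configuration. The dictionary below mirrors the reported values; the key names are our own and do not correspond to the released configuration files.

```python
# Illustrative configuration mirroring Sections 4.2-4.3.
FEWSUM_CONFIG = {
    "encoder_generator": {
        "layers": 6, "attention_heads": 8,
        "subword_embedding_dim": 390, "length_embedding_dim": 10,
        "ffn_hidden_dim": 1000, "dropout": 0.1, "bpe_merges": 32000,
    },
    "plugin": {
        "layers": 3, "attention_heads": 3,
        "hidden_dim": 20, "output_dim": 30,
        "internal_dropout": 0.4, "attention_dropout": 0.15,
    },
}
```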
4.4 Baselines

LexRank (Erkan and Radev, 2004) is an unsupervised extractive graph-based algorithm selecting sentences based on graph centrality. Sentences represent nodes in a graph whose edges have weights denoting similarity computed with tf-idf.

MeanSum is an unsupervised abstractive summarization model (Chu and Liu, 2019) that treats a summary as a discrete latent state of an autoencoder. The model is trained in a multi-task fashion with two objectives, one for prediction of reviews and the other for summary-review alignment in the semantic space using cosine similarity.

CopyCat is the state-of-the-art unsupervised abstractive summarizer (Bražinskas et al., 2020) that uses continuous latent representations to model review groups and individual review semantics. It has an implicit mechanism for novelty reduction and uses a copy mechanism.

As is common in the summarization literature, we also employed a number of simple summarization baselines. First, the Clustroid review was computed for each group of reviews as follows: we took each review from a group and computed ROUGE-L with respect to all other reviews, and the review with the highest ROUGE score was selected as the clustroid review. Second, we sampled a Random review from each group to be used as the summary. Third, we constructed a summary by selecting the leading sentences (Lead) from each review of a group.

5 Evaluation Results

Automatic Evaluation. We report ROUGE F1 (Lin, 2004) based evaluation results on the Amazon and Yelp test sets in Tables 3 and 4, respectively. The results indicate that our model outperforms abstractive and extractive methods on both datasets. The results are also supported by qualitative improvements over other models; see examples in the Appendix.

           R1      R2      RL
FewSum     0.3356  0.0716  0.2149
Copycat    0.2785  0.0477  0.1886
MeanSum    0.2663  0.0489  0.1711
LexRank    0.2772  0.0506  0.1704
Clustroid  0.2716  0.0361  0.1677
Lead       0.2700  0.0492  0.1495
Random     0.2500  0.0382  0.1572

Table 3: ROUGE scores on the Amazon test set.

           R1      R2      RL
FewSum     0.3729  0.0992  0.2276
Copycat    0.2812  0.0589  0.1832
MeanSum    0.2750  0.0354  0.1609
LexRank    0.2696  0.0493  0.1613
Clustroid  0.2890  0.0490  0.1800
Lead       0.2620  0.0457  0.1432
Random     0.2148  0.0259  0.1387

Table 4: ROUGE scores on the Yelp test set.

Best-Worst Scaling. We performed human evaluation with Best-Worst scaling (Louviere and Woodworth, 1991; Louviere et al., 2015; Kiritchenko and Mohammad, 2016) on the Amazon and Yelp test sets using the AMT platform. We assigned multiple workers to each tuple containing summaries from CopyCat, our model, LexRank, and human annotators. The judgment criteria were the following: Fluency, Coherence, Non-redundancy, Informativeness, and Sentiment. Details are provided in Appendix 9.6.

For every criterion, a system's score is computed as the percentage of times it was selected as best, minus the percentage of times it was selected as worst (Orme, 2009). The scores range from -1 (unanimously worst) to +1 (unanimously best).

The results are presented in Tables 5 and 6 for Amazon and Yelp, respectively. On the Amazon data, they indicate that our model is preferred across the board over the baselines. CopyCat is preferred over LexRank in terms of fluency and non-redundancy, yet it shows worse results in terms of informativeness and overall sentiment preservation. In the same vein, on Yelp (Table 6) our model outperforms the other models. All pairwise differences between our model and other models are statistically significant at p < 0.05, using post-hoc HD Tukey tests. The only exception is non-redundancy on Yelp when comparing our model and CopyCat (where our model shows a slightly lower score).
          Fluency   Coherence  Non-Redundancy  Informativeness  Sentiment
FewSum    0.1000    0.1429     0.1250          0.2000           0.3061
Copycat   -0.1765   -0.5333    -0.2727         -0.7455          -0.7143
LexRank   -0.4848   -0.5161    -0.5862         -0.3488          -0.0909
Gold      0.5667    0.6364     0.6066          0.6944           0.4138

Table 5: Human evaluation results in terms of Best-Worst scaling on the Amazon test set.

          Fluency   Coherence  Non-Redundancy  Informativeness  Sentiment
FewSum    0.1636    0.1429     0.0000          0.3793           0.3725
Copycat   -0.2000   -0.0769    0.1053          -0.4386          -0.2857
LexRank   -0.5588   -0.5312    -0.6393         -0.6552          -0.4769
Gold      0.5278    0.3784     0.4795          0.6119           0.4118

Table 6: Human evaluation results in terms of Best-Worst scaling on the Yelp test set.

Content Support. As was observed by Falke et al. (2019), Tay et al. (2019), and Bražinskas et al. (2020), the ROUGE metric can be insensitive to hallucinated facts and entities. We therefore also investigated how well the generated text is supported by the input reviews. We split summaries generated by our model and CopyCat into sentences. Then, for each summary sentence, we hired 3 AMT workers to judge how well the content of the sentence is supported by the reviews. Three options were available. Full support: all the content is reflected in the reviews; Partial support: only some content is reflected in the reviews; No support: content is not reflected in the reviews.

          Full (%)  Partial (%)  No (%)
FewSum    43.09     34.14        22.76
Copycat   46.15     27.18        26.67

Table 7: Content support on the Amazon test set.

The results are presented in Table 7. Despite not using a copy mechanism, which is beneficial for fact preservation (Falke et al., 2019) and generation of more diverse and detailed summaries (see Appendix), we score on par with CopyCat.

6 Analysis
6.1 Alternative Adaptation Strategies

We further explored alternative ways of utilizing the annotated data points, based on the same split of the Amazon summaries as explained in Sec. 4.1. First, we trained a model in an unsupervised learning setting (USL) on the Amazon reviews with the leave-one-out objective in Eq. 1. In this setting, the model has exposure neither to summaries nor to the properties, as the oracle q(r_i, r_−i) is not used. Further, we considered two alternative settings in which the pre-trained unsupervised model can be adapted on the gold summaries. In the first setting, the model is fine-tuned by predicting summaries conditioned on input reviews (USL+F). In the second one, similar to Hoang et al. (2019), we performed adaptation in a multi-task learning (MTL) fashion: here, USL is further trained on a mixture of unannotated corpus review batches and gold summary batches, with a trainable embedding indicating the task.⁵ The results are presented in Table 8.

First, we observed that USL generates summaries that get the worst ROUGE scores. Additionally, the generated text tends to be informal and substantially shorter than an average summary; we discuss this further in Sec. 6.2. Second, when the model is fine-tuned on the gold summaries (USL+F), it noticeably improves the results, yet they are substantially worse than those of our proposed few-shot approach. This can be explained by the strong influence of the unannotated data stored in millions of parameters, which requires more annotated data points to overrule. Finally, we observed that MTL fails to decouple the tasks, indicated by only a slight improvement over USL.

⁵ We observed that a 1:1 review-summary proportion works best.
         R1      R2      RL
FewSum   0.3356  0.0716  0.2149
MTL      0.2403  0.0435  0.1627
USL+F    0.2823  0.0624  0.1964
USL      0.2145  0.0315  0.1523
Random   0.2500  0.0382  0.1572

Table 8: ROUGE scores on the Amazon test set for alternative summary adaptation strategies.

6.2 Influence of Unannotated Data

We further analyzed how plain fine-tuning on summaries differs from our approach in terms of capturing summary characteristics. For comparison, we used USL and USL+F, which are presented in Sec. 6.1. Additionally, we analyzed unannotated reviews from the Amazon training set. Specifically, we focused on text formality and the average word count difference (Len) from the gold summaries in the Amazon test set. As a proxy for the former, we computed the marginal distribution over points of view (POV), based on pronoun counts; an additional class (NoPr) was allocated to cases with no pronouns. The results are presented in Table 9.

First, we observed that the training reviews are largely informal (49.0% and 7.3% for 1st and 2nd POV, respectively). Unsurprisingly, the model trained only on the reviews (USL) transfers a similar trait to the summaries that it generates.⁶ On the contrary, the gold summaries are largely formal, indicated by a complete absence of 1st and a marginal amount of 2nd POV pronouns. Also, an average review is substantially shorter than an average gold summary, and consequently, the summaries generated by USL are also shorter. Example summaries are presented in Table 10.

Further, we investigated how well USL+F adapts to the summary characteristics by being actually fine-tuned on them. Indeed, we observed that USL+F starts to shift in the direction of the summaries by reducing 1st POV pronouns and increasing the average summary length. Nevertheless, the gap is still wide and would probably require more data to be bridged. Finally, we observed that our approach adapts much better to the desired characteristics by producing well-formed summary text that is also very close in length to the gold summaries.

⁶ As beam search, attempting to find the most likely candidate sequence, was utilized, as opposed to random sequence sampling, we observed that the generated sequences had no cases of 2nd POV pronouns and no cases of complete pronoun absence (NoPr).

         1st    2nd   3rd    NoPr   Len
Gold     0.0    1.7   60.0   38.3   0.0
FewSum   0.5    1.3   83.2   15.0   3.4
USL+F    29.7   0.0   45.3   25.0   -28.6
USL      56.7   0.0   43.3   0.0    -32.7
Reviews  49.0   7.3   35.6   8.1    -17.6

Table 9: Text characteristics of summaries generated by different models on the Amazon test set.

Gold: These shoes run true to size, do a good job supporting the arch of the foot and are well-suited for exercise. They're good looking, comfortable, and the sole feels soft and cushioned. Overall they are a nice, light-weight pair of shoes and come in a variety of stylish colors.

FewSum: These running shoes are great! They fit true to size and are very comfortable to run around in. They are light weight and have great support. They run a little on the narrow side, so make sure to order a half size larger than normal.

USL+F: This is my second pair of Reebok running shoes and they are the best running shoes I have ever owned. They are lightweight, comfortable, and provide great support for my feet.

USL: This is my second pair of Reebok running shoes and I love them. They are the most comfortable shoes I have ever worn.

Table 10: Example summaries produced by models with different adaptation approaches.
6.3 Cross-Domain

We hypothesized that on a small dataset, the model primarily learns coarse-grained features, such as common writing phrases and their correlations between input reviews and summaries, and that these could, in principle, be learned from remotely related domains. We investigated this by fine-tuning the model on summaries that are not in the target domain of the Amazon dataset.
Specifically, we matched the data-point count for 3 out of 4 domains of the training and validation sets to the in-domain Amazon data experiment presented in Sec. 5; the test set remained the same for each domain as in the in-domain experiment. Then, we fine-tuned the same model 5 times with different seeds per target domain. For comparison, we used the in-domain model from the Amazon experiments in Sec. 5. We computed the average ROUGE-L score per target domain, where the overall σ was 0.0137. The results are reported in Table 11.

Domain        In-domain   Cross-domain
Cloth         0.2188      0.2220
Electronics   0.2146      0.2136
Health        0.2121      0.1909
Home          0.2139      0.2250
Avg           0.2149      0.2129

Table 11: In-domain and cross-domain experiments on the Amazon dataset; ROUGE-L scores are reported.

The results indicate that the models perform on par on most of the domains, supporting the hypothesis. On the other hand, the in-domain model shows substantially better results on the health domain, which is expected, as, intuitively, this domain is the most different from the rest.

7 Related Work

Extractive weakly-supervised opinion summarization has been an active area of research. LexRank (Erkan and Radev, 2004) is an unsupervised extractive model. Opinosis (Ganesan et al., 2010) does not use any supervision and relies on POS tags and redundancies to generate short opinions. However, this approach is not well suited for the generation of coherent long summaries and, although it can recombine fragments of input text, it cannot generate novel words and phrases. Other earlier approaches (Gerani et al., 2014; Di Fabbrizio et al., 2014) relied on text planners and templates, which restrict the output text. A more recent extractive method of Angelidis and Lapata (2018) frames the problem as a pipeline of steps with different models for each step. Isonuma et al. (2019) introduce an unsupervised approach for single product review summarization, where they rely on latent discourse trees. The most related unsupervised approach to this work is our own work, CopyCat (Bražinskas et al., 2020). Unlike that work, we rely on a powerful generator to learn conditional spaces of text without hierarchical latent variables.
Finally, in contrast to MeanSum (Chu and Liu, 2019), our model relies on inductive biases without explicit modeling of summaries. A concurrent model, DenoiseSum (Amplayo and Lapata, 2020), uses a synthetically generated dataset of source reviews to train a generator to denoise and distill common information. Another parallel work, OpinionDigest (Suhara et al., 2020), considers controllable opinion aggregation and is a pipeline framework for abstractive summary generation. Our conditioning on text properties is similar in spirit to Ficler and Goldberg (2017), yet we rely on automatically derived properties that relate a target to its sources, and learn a separate module to generate their combinations. Moreover, their method has not been studied in the context of summarization.

8 Conclusions

In this work, we introduce the first, to our knowledge, few-shot framework for abstractive opinion summarization. We show that it can efficiently utilize even a handful of annotated review-summary pairs to train models that generate fluent, informative, and overall sentiment-reflecting summaries. We propose to exploit summary-related properties of unannotated reviews that are used for unsupervised training of a generator. We then train a tiny plug-in network that learns to switch the generator to the summarization regime. We demonstrate that our approach substantially outperforms competitive ones, both abstractive and extractive, in human and automatic evaluation. Finally, we show that it also allows for successful cross-domain adaptation.

Acknowledgments

We would like to thank members of the Edinburgh NLP group for discussion, and the anonymous reviewers for their valuable feedback. We gratefully acknowledge the support of the European Research Council (Titov: ERC StG BroadSem 678254; Lapata: ERC CoG TransModal 681760) and the Dutch National Science Foundation (NWO VIDI 639.022.518).
References

Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the Association for Computational Linguistics (ACL).

Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3675-3686.

Julian Besag. 1975. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 24(3):179-195.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440-447.

Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the Association for Computational Linguistics (ACL).

Eric Chu and Peter Liu. 2019. MeanSum: a neural model for unsupervised multi-document abstractive summarization. In Proceedings of the International Conference on Machine Learning (ICML), pages 1223-1232.

Hoa Trang Dang. 2005. Overview of DUC 2005. In Proceedings of the Document Understanding Conference, volume 2005, pages 1-12.

Giuseppe Di Fabbrizio, Amanda Stent, and Robert Gaizauskas. 2014. A hybrid approach to multi-document summarization of opinions in reviews. Pages 54-63.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457-479.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214-2220.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94-104, Copenhagen, Denmark.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126-1135.

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 340-348.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1602-1613.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507-517.

Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, and Yejin Choi. 2019. Efficient adaptation of pretrained transformers for abstractive summarization. arXiv preprint arXiv:1906.00138.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-177. ACM.

Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, and Ichiro Sakata. 2017. Extractive summarization using multi-task learning with document classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2101-2110.

Masaru Isonuma, Junichiro Mori, and Ichiro Sakata. 2019. Unsupervised neural single-document summarization of reviews via learning latent discourse structure and its ranking. In Proceedings of the Association for Computational Linguistics (ACL).

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Svetlana Kiritchenko and Saif M. Mohammad. 2016. Capturing reliable fine-grained sentiment associations by crowdsourcing and best-worst scaling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 811-817.

Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, post-conference workshop of ACL.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the International Conference on Learning Representations (ICLR).

Jordan J. Louviere, Terry N. Flynn, and Anthony Alfred John Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.

Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.

Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4):1093-1113.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280-290.

Bryan Orme. 2009. MaxDiff analysis: Simple counting, individual-level logit, and HB. Sequim, WA: Sawtooth Software.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 157-163.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379-389.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the Association for Computational Linguistics (ACL).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the Association for Computational Linguistics (ACL).

Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, and Wang-Chiew Tan. 2020. OpinionDigest: A simple framework for opinion summarization. In Proceedings of the Association for Computational Linguistics (ACL).

Wenyi Tay, Aditya Joshi, Xiuzhen Jenny Zhang, Sarvnaz Karimi, and Stephen Wan. 2019. Red-faced ROUGE: Examining the suitability of ROUGE for opinion summary evaluation. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, pages 52-60.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630-3638.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. 2017. Deep sets. In Advances in Neural Information Processing Systems, pages 3391-3401.
9 Appendices

9.1 Dataset Pre-Processing

We selected only Amazon products and Yelp businesses with a minimum of 10 reviews, and with minimum and maximum review lengths of 20 and 70 words, respectively. Also, popular products/businesses above the 90th percentile were removed. From each business/product we sampled 9 reviews without replacement to form groups of reviews.
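The filtering above is simple to express in code. The sketch below is an illustrative reimplementation under our own assumptions about the data layout (a mapping from product id to a list of review texts); it is not the released pre-processing script.

```python
import random

def preprocess(reviews_by_product, group_size=9, min_reviews=10,
               min_len=20, max_len=70, popularity_percentile=0.9):
    """Filter products/businesses and form review groups as described in 9.1."""
    # Keep only reviews within the allowed length range (in words).
    filtered = {pid: [r for r in revs if min_len <= len(r.split()) <= max_len]
                for pid, revs in reviews_by_product.items()}
    # Keep products with at least `min_reviews` remaining reviews.
    filtered = {pid: revs for pid, revs in filtered.items() if len(revs) >= min_reviews}
    # Drop overly popular products (above the 90th percentile by review count).
    counts = sorted(len(revs) for revs in filtered.values())
    if counts:
        cutoff = counts[int(popularity_percentile * (len(counts) - 1))]
        filtered = {pid: revs for pid, revs in filtered.items() if len(revs) <= cutoff}
    # Sample groups of 9 reviews without replacement (1 target + 8 sources).
    groups = []
    for pid, revs in filtered.items():
        random.shuffle(revs)
        for i in range(0, len(revs) - group_size + 1, group_size):
            groups.append((pid, revs[i:i + group_size]))
    return groups
```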
9.2 Evaluation Data Split

From the annotated Amazon dataset, we used 28, 12, and 20 products for training, validation, and testing, respectively. On Yelp, we used 30, 30, and 40 businesses for training, validation, and testing, respectively. Both the automatic and the human evaluation experiments were performed on the test sets.

9.3 Training Procedure

First, to speed up the training phase, we trained an unconditional language model for 13 epochs on the Amazon reviews with the learning rate (LR) set to 5 × 10⁻⁴. On Yelp we trained it for 27 epochs with the LR set to 7 × 10⁻⁴. The language model was used to initialize both the encoder and the generator of the main model.

Subsequently, we trained the model using Eq. 2 for 9 epochs on the Amazon reviews with a 6 × 10⁻⁵ LR, and on Yelp for 57 epochs with the LR set to 5 × 10⁻⁵. Additionally, we reduced novelty using Eq. 4 by training the model further for 1 epoch with a 10⁻⁵ LR and λ = 2 on Amazon; on Yelp we trained for 4 epochs with a 3 × 10⁻⁵ LR and λ = 2.5.

For the plug-in network's initialization, as explained in Sec. 3.1, we performed optimization by output matching with the oracle for 11 epochs on the unannotated Amazon reviews with a 1 × 10⁻⁵ LR; on Yelp we trained for 87 epochs with a 1 × 10⁻⁵ LR. Lastly, we fine-tuned the plug-in network on the human-written summaries by output matching with the oracle,⁷ on the Amazon data for 98 epochs with a 7 × 10⁻⁴ LR, and on Yelp for 62 epochs with a 7 × 10⁻⁵ LR. We set the weights to 0.1, 1.0, 0.08, and 0.5 for length deviation, rating deviation, POV, and ROUGE scores, respectively. We then fine-tuned the attention part of the model and the plug-in network jointly for 33 epochs with a 1 × 10⁻⁴ LR on the Amazon data, and for 23 epochs with a 1 × 10⁻⁴ LR on Yelp.

⁷ We set the rating deviation to 0, as summaries do not have associated human-annotated ratings.

9.4 Summary Annotation

For summary annotation, we reused 60 Amazon products from Bražinskas et al. (2020) and sampled 100 businesses from Yelp. We assigned 3 Mechanical Turk workers to each product/business, and instructed them to read the reviews and produce a summary text. We used the following instructions:

• The summary should reflect common user opinions expressed in the reviews. Try to preserve the common sentiment of the opinions and their details (e.g., what exactly the users like or dislike). For example, if most reviews are negative about the sound quality, then also write negatively about it.

• Please make the summary coherent and fluent in terms of sentence and information structure. Iterate over the written summary multiple times to improve it, and re-read the reviews whenever necessary.

• The summary should not look like a review; please write formally.

• Keep the length of the summary reasonably close to the average length of the reviews.

• Please try to write the summary using your own words instead of copying text directly from the reviews. Using the exact words from the reviews is allowed, but do not copy more than 5 consecutive words from a review.

9.5 Human Evaluation Setup

To perform the human evaluation experiments described in Sec. 5, we hired workers with a 98% approval rate, 1000+ HITs, location in the USA, UK, or Canada, and the maximum score on a qualification test that we had designed. The test asked whether the workers were native English speakers and verified that they correctly understood the instructions of both the best-worst scaling and the content support tasks.

9.6 Best-Worst Scaling Details

We performed human evaluation based on the Amazon and Yelp test sets using the AMT platform. We assigned workers to each tuple containing summaries from CopyCat, our model, LexRank, and human annotators. Due to dataset size differences, we assigned 5 and 3 workers to each tuple in the Amazon and Yelp test sets, respectively. We presented the associated reviews in a random order and asked the workers to judge summaries using Best-Worst scaling (BWS) (Louviere and Woodworth, 1991; Louviere et al., 2015), which is known to produce more reliable results than ranking scales (Kiritchenko and Mohammad, 2016). The judgment criteria are presented below, where non-redundancy and coherence were taken from Dang (2005). Fluency: the summary sentences should be grammatically correct, easy to read and understand; Coherence: the summary should be well structured and well organized; Non-redundancy: there should be no unnecessary repetition in the summary; Informativeness: how much useful information about the product does the summary provide?; Sentiment: how well does the sentiment of the summary agree with the overall sentiment of the original reviews?
9.7 Points of View

Summaries differ from reviews in terms of writing style. Specifically, reviews are predominantly written informally, populated by pronouns such as I and you. In contrast, summaries are desirable to be written formally. In this work, we observed that a surprisingly simple way to achieve that is to condition the generator on the distribution over pronoun classes of the target review. We computed pronoun counts and produced 4-class distributions: 1st, 2nd, and 3rd person POV, plus 'other' in case no pronouns are present. Consider the example sentences in Table 12. Here one can observe that sentences of different pronoun classes differ in the style of writing and often in the intention of the message: 1st POV sentences tend to provide clues about the personal experience of the user; 2nd POV sentences, on the other hand, commonly convey recommendations to a reader; 3rd POV and 'other' sentences often describe aspects and their associated opinions.

1st:
  I bought this as a gift for my husband.
  I've been using Drakkar Noir Balm for over twenty years.
  I purchased these for my son as a kind of a joke.

2nd:
  This is the best product you can buy!
  You get what you pay for.
  Please do yourself a favor and avoid this product.

3rd:
  This is his every work day scent.
  It's very hard to buy the balm separately.
  It smells like Drakkar, but it is hard to find.

No Pronouns:
  Very nice, not too overpowering.
  This product has no smell what ever.
  Nice to use for hardwood floors.

Table 12: Examples of review sentences that contain only pronouns belonging to a specific class.
Gold: Bennett Medical has poor customer service. Phone calls can take a long time to get answered and leaving voice mails tend to be fruitless. The products are overpriced and take a long time to be refilled. Using this medical supply company can be a hassle.

Ours: This medical supply is the worst medical supply company in the valley. The customer service is horrible, the staff is rude, the wait times are long, and the service reps are not helpful at all. Do not recommend this company to anyone.

Copycat: If I could give them 0 stars I would. The customer service is terrible, the staff is extremely rude and helpful. If you're looking for a new provider, this is the place to be. I will never use them again.

MeanSum: Service is horrible, especially the manager. I have a lot of kids but not this place. Two months later I was able to go in and get a new one to go in the next day. They would tell me that they would have to do a lot of our water to be there to get a new one. But this is the first time I have dealt with him and we will never go back again. Thanks for your hard work, and I will never go anywhere else.

Lexrank: Bennett Medical for Cpap supplies are horrible. Never enough staff to answer phone, so you'll need to leave messages. DON'T use this medical supply. If I could give Bennett Medical zero stars I would! Will be moving to another medical supply as soon as I can.

Review 1: Bennett Medical for Cpap supplies are horrible. We have waited for three weeks to refill supplies and we are still waiting. This company does not have good customer service, you can only leave messages, and they never call back. If I could give Bennett Medical zero stars I would!

Review 2: Teachers Health Trust, please look into the practice of the billing and filling of durable services. The mask cushions go for 45 to 50 days because of the lack of communication. The people in charge of billing are very argumentative and lack customer service. I will drop them after annual, because of my insurance obligations.

Review 3: Fantastic service from Jocelyn at the front desk, we had a really hard time getting the right paperwork together from Drs but she stuck with us and helped us every step of the way, even calling to keep us updated and to update info we might have for her. Thanks Jocelyn.

Review 4: I hardly ever write reviews, but I'd like to spare someone else from what I experienced. So a warning to the wise... If you like rude incompetent employees, almost an hour long wait for just picking up a phone order, and basically being treated like a second class citizen then look no further than Bennett Medical.

Review 5: DON'T use this medical supply. Never enough staff to answer phone, so you'll need to leave messages. No return phone calls. I am unable to get my CPAP supplies every quarter without hours of calling / waiting / calling. Poor customer service. Will be moving to another medical supply as soon as I can.

Review 6: Terrible experience. They have ridiculous price also bad customer services. You can get nebulizer machine around $50 at amazon, bennet medical charge you almost twice more expensive price. And breathing kit price was unbelievable too. Because of deduction, I had to pay them all out of my pocket whatever they charged. I don't recommand this medical company to anyone.

Review 7: Good luck getting a phone call back or someone to answer the phone without hanging up immediately. I have called over 20 times left 5 voicemails over the last 30 days, just to refill a mask perscription. This is an ongoing issue that is beyond frustrating. Not trying to destroy this businesses name just want the owners to implement some basic customer service skills.

Review 8: Always receive friendly customer service whenever we call or go into the location. My questions are always answered and I am very happy with the supplies we get from them. Great people providing a great service! Thank you for all you do!

Table 13: Example summaries produced by different systems on Yelp data.