Few-Shot Learning for Opinion Summarization

Arthur Bražinskas¹  Mirella Lapata¹  Ivan Titov¹,²
¹ ILCC, University of Edinburgh  ² ILLC, University of Amsterdam
abrazinskas@ed.ac.uk, {mlap,ititov}@inf.ed.ac.uk

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4119-4135, November 16-20, 2020. © 2020 Association for Computational Linguistics

Abstract

Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents, such as user reviews of a product. The task is practically important and has attracted a lot of attention. However, due to the high cost of summary production, datasets large enough for training supervised models are lacking. Instead, the task has been traditionally approached with extractive methods that learn to select text fragments in an unsupervised or weakly-supervised way. Recently, it has been shown that abstractive summaries, potentially more fluent and better at reflecting conflicting information, can also be produced in an unsupervised fashion. However, these models, not being exposed to actual summaries, fail to capture their essential properties. In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text with all expected properties, such as writing style, informativeness, fluency, and sentiment preservation. We start by training a conditional Transformer language model to generate a new product review given other available reviews of the product. The model is also conditioned on review properties that are directly related to summaries; the properties are derived from reviews with no manual effort. In the second stage, we fine-tune a plug-in module that learns to predict property values on a handful of summaries. This lets us switch the generator to the summarization mode. We show on Amazon and Yelp datasets that our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.

Gold: These shoes run true to size, do a good job supporting the arch of the foot and are well-suited for exercise. They're good looking, comfortable, and the sole feels soft and cushioned. Overall they are a nice, light-weight pair of shoes and come in a variety of stylish colors.

Ours: These running shoes are great! They fit true to size and are very comfortable to run around in. They are light weight and have great support. They run a little on the narrow side, so make sure to order a half size larger than normal.

Reviews: perfect fit for me ... supply the support that I need ... are flexible and comfortable ... || ... It is very comfortable ... I enjoy wearing them running ... || ... running shoes ... felt great right out of the box ... They run true to size ... || ... my feet and feel like a dream ... Totally light weight ... || ... shoes run small ... fit more true to size ... fit is great! ... supports my arch very well ... || ... They are lightweight ... usually wear a size women's 10 ... ordered a 10.5 and the fit is great!

Table 1: Example summaries produced by our system and an annotator; colors encode the alignment to the input reviews. The reviews are truncated and delimited with the symbol '||'.
1 Introduction

Summarization of user opinions expressed in online resources, such as blogs, reviews, social media, or internet forums, has drawn much attention due to its potential for various information access applications, such as creating digests, search, and report generation (Hu and Liu, 2004; Medhat et al., 2014; Angelidis and Lapata, 2018).

Although significant progress has been observed in supervised summarization in the non-subjective single-document context, such as news articles (Rush et al., 2015; Nallapati et al., 2016; Paulus et al., 2017; See et al., 2017; Liu et al., 2018), modern deep learning methods rely on large amounts of annotated data that are not readily available in the opinion-summarization domain and are expensive to produce. A key obstacle making data annotation expensive is that annotators need to consider multiple input texts when writing a summary, which is time-consuming. Moreover, annotation would have to be undertaken for multiple domains, as online reviews are inherently multi-domain (Blitzer et al., 2007) and summarization systems can be domain-sensitive (Isonuma et al., 2017). This suggests that it is unlikely that human-annotated corpora large enough for training deep models will be available.
Recently, a number of unsupervised abstractive multi-document models were introduced (e.g., CopyCat, Bražinskas et al. 2020, and MeanSum, Chu and Liu 2019) that are trained on large collections of unannotated product reviews.¹ However, unsurprisingly perhaps, since the models are not exposed to actual summaries, they are unable to learn their key characteristics. For instance, MeanSum (Chu and Liu, 2019) is prone to producing summaries that contain a significant amount of information that is unsupported by the reviews; CopyCat generates summaries that are better aligned with the reviews, yet they are limited in detail. Moreover, both systems are trained mostly on subjectively written reviews, and as a result, tend to generate summaries in the same writing style.

The main challenge in the absence of large annotated corpora lies in successful utilization of scarce annotated resources. Unlike recent approaches to language model adaptation for abstractive single-document summarization (Hoang et al., 2019; Raffel et al., 2019) that utilize hundreds of thousands of summaries, our two annotated datasets consist of only 60 and 100 annotated data points. It has also been observed that naive fine-tuning of multi-million parameter models on small corpora leads to rapid over-fitting and poor generalization (Vinyals et al., 2016; Finn et al., 2017). In this light, we propose a few-shot learning framework and demonstrate that even a tiny number of annotated instances is sufficient to bootstrap generation of formal summary text that is both informative and fluent (see Table 1). To the best of our knowledge, this work is the first few-shot learning approach applied to summarization.

¹ For simplicity, we use the term 'product' to refer to both Amazon products and Yelp businesses.
In our work, we observe that reviews in a large unannotated collection vary a lot; for example, they differ in style, the level of detail, or how much they diverge from other reviews of the product in terms of content and overall sentiment. We refer to individual review characteristics and their relations to other reviews as properties (Ficler and Goldberg, 2017). While reviews span a large range of property values, only a subset of them is appropriate for summaries. For example, summaries should be close to the product's reviews in content, avoid using first-person pronouns, and agree with the reviews in sentiment. Our approach starts with estimating a property-aware model on a large collection of reviews and then adapts the model using a few annotator-created summaries, effectively switching the generator to the summarization regime. As we demonstrate in our experiments, the summaries do not even have to come from the same domain.

More formally, we estimate a text model on a dataset of reviews; the generator is a Transformer conditional language model (CLM) that is trained with a 'leave-one-out' objective (Besag, 1975; Bražinskas et al., 2020) by attending to other reviews of the product. We define properties of unannotated data that are directly related to the end task of summarization. Those properties are easy to derive from reviews, and no extra annotation effort is required. The CLM is conditioned on these properties in training. The properties encode partial information about the target review that is being predicted. We capitalize on that by fine-tuning parts of the model jointly with a tiny plug-in network on a handful of human-written summaries. The plug-in network is trained to output property values that make the summaries likely under the trained CLM. The plug-in has less than half a percent of the original model's parameters, and is thus less prone to over-fitting on small datasets. Nevertheless, it can successfully learn to control the dynamics of a large CLM by providing property values that force generation of summaries. We shall refer to the model produced using this procedure as the Few-Shot Summarizer (FewSum).

We evaluate our model against both extractive and abstractive methods on Amazon and Yelp human-created summaries. Summaries generated by our model are substantially better than those produced by competing methods, as measured by automatic and human evaluation metrics on both datasets. Finally, we show that our approach allows for successful cross-domain adaptation. Our contributions can be summarized as follows:

• we introduce the first few-shot learning framework for abstractive opinion summarization;

• we demonstrate that the approach substantially outperforms extractive and abstractive models, both when measured with automatic metrics and in human evaluation;

• we release datasets with abstractive summaries for Amazon products and Yelp businesses.²

² Both the code and datasets are available at: https://github.com/abrazinskas/FewSum
models, both when measured with automatic q(ri , r−i ) as shown in Eq. 2. metrics and in human evaluation; M N 1 XX log Gθ (rij |Eθ (r−i j ), q(rij , r−i j )) (2) • we release datasets with abstractive sum- MN j=1 i=1 maries for Amazon products and Yelp busi- We refer to this partial information as properties nesses.2 (Ficler and Goldberg, 2017), which correspond to text characteristics of ri or relations between ri 2 Unsupervised Training and r−i . For example, one such property can be the ROUGE score (Lin, 2004) between ri and r−i , User reviews about an entity (e.g., a product) are which indicates the degree of overlap between ri naturally inter-dependent. For example, knowing and r−i . In Fig. 1, a high ROUGE value can signal that most reviews are negative about a product’s to the generator to attend the word vacuum in the battery life, it becomes more likely that the next source reviews instead of predicting it based on review will also be negative about it. To model language statistics. Intuitively, while the model ob- inter-dependencies, yet to avoid intractabilities as- serves a wide distribution of ROUGE scores during sociated with undirected graphical models (Koller training on reviews, during summarization in test and Friedman, 2009), we use the leave-one-out set- time we can achieve a high degree of input-output ting (Besag, 1975; Bražinskas et al., 2020). text overlap by setting the property to a high value. Specifically, we assume access to a large cor- We considered three types of properties. pus of user text reviews, which are arranged as M groups {r1:N }M j=1 , where r1:N are reviews about Content Coverage: ROUGE-1, ROUGE-2, and a particular product that are arranged as a tar- ROUGE-L between ri and r−i signals to Gθ how get review ri and N − 1 source reviews r−i = much to rely on syntactic information in r−i dur- {r1 , ..., ri−1 , ri+1 , ..., rN }. Our goal is to estimate ing prediction of ri . Writing Style: as a proxy for the conditional distribution ri |r−i by optimizing formal and informal writing style, we compute pro- the parameters θ as shown in Eq. 1. noun counts, and create a distribution over 3 points M N of view and an additional class for cases with no 1 XX θ∗ = arg max log pθ (rij |r−i j ) pronouns; see Appendix 9.7 for details. Rating and θ MN Length Deviations: for the former, we compute j=1 i=1 M N the difference between ri ’s rating and the average 1 XX = arg max log Gθ (rij |Eθ (r−i j )) r−i rating; in the latter case, we use the difference θ MN between ri ’s length and the average r−i length. j=1 i=1 (1) Our model has an encoder-generator Trans- former architecture (Vaswani et al., 2017), where the encoder Eθ produces contextual representations 2.1 Novelty Reduction of r−i that are attended by the generator Gθ , which in-turn is a conditional language model predict- While summary and review generation are techni- ing the target review ri , estimated using teacher- cally similar, there is an important difference that forcing (Williams and Zipser, 1989). An illustra- needs to be addressed. Reviews are often very tion is presented in Fig. 1. diverse, so when a review is predicted, the gen- The objective lets the model exploit common in- erator often needs to predict content that is not formation across reviews, such as rare brand names present in source reviews. On the other hand, when or aspect mentions. For example, in Fig. 
2.1 Novelty Reduction

While summary and review generation are technically similar, there is an important difference that needs to be addressed. Reviews are often very diverse, so when a review is predicted, the generator often needs to predict content that is not present in the source reviews. On the other hand, when a summary is predicted, its semantic content always matches the content of the source reviews. To address this discrepancy, in addition to using the ROUGE scores as explained previously, we introduce a novelty reduction technique, which is similar to label smoothing (Pereyra et al., 2017). Specifically, we add a regularization term L, scaled by λ, that is applied to the word distributions produced by the generator G_θ, as shown in Eq. 3:
\frac{1}{MN} \sum_{j=1}^{M} \sum_{i=1}^{N} \Big[ \log G_\theta(r_i^j \mid E_\theta(r_{-i}^j), q(r_i^j, r_{-i}^j)) - \lambda L(G_\theta(r_i^j \mid E_\theta(r_{-i}^j), q(r_i^j, r_{-i}^j))) \Big]    (3)

The term penalizes assigning probability mass to words that do not appear in r_−i, as shown in Eq. 4, and thus steers the model towards generating text that is more grounded in the content of r_−i:

L(G_\theta(r_i \mid r_{-i}, q(r_i, r_{-i}))) = \sum_{t=1}^{T} \sum_{w \notin V(r_{-i})} G_\theta(W_t = w \mid r_i^{1:t-1}, r_{-i}, q(r_i, r_{-i}))    (4)

Here, T is the length of r_i, and the inner sum is over all words that do not appear in the word vocabulary of r_−i. Intuitively, in Fig. 1, the penalty could reduce the probability of the word hoover being predicted, as it does not appear in the source reviews.
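The penalty in Eq. 4 is straightforward to compute from the generator's output distributions. Below is a minimal PyTorch sketch, assuming the generator yields a matrix of per-step word probabilities for the target review; the tensor layout and function name are our own illustration, not the released implementation.

```python
import torch

def novelty_penalty(word_probs: torch.Tensor, source_vocab_ids: torch.Tensor) -> torch.Tensor:
    """Eq. 4: probability mass assigned to words absent from the source reviews.

    word_probs:       (T, V) next-word distributions for each position of r_i.
    source_vocab_ids: 1-D tensor of vocabulary ids occurring in r_-i.
    """
    in_source = torch.zeros(word_probs.size(-1), dtype=torch.bool)
    in_source[source_vocab_ids] = True
    # Sum the probability of every out-of-source word over all decoding steps.
    return word_probs[:, ~in_source].sum()

# The regularized objective of Eq. 3 then becomes, per target review:
# loss = nll_loss + lambda_ * novelty_penalty(word_probs, source_vocab_ids)
```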
3 Summary Adaptation

Once the unsupervised model is trained on reviews, our task is to adapt it to generation of summaries. Here, we assume access to a small number of annotator-written summaries (s^k, r_1:N^k)_{k=1}^K, where s^k is a summary for the input reviews r_1:N^k. As we will show in Sec. 6.1, naive fine-tuning of the unsupervised model on a handful of annotated data points leads to poor generalization. Instead, we capitalize on the fact that the generator G_θ has observed a wide range of property values associated with r_i during the unsupervised training phase. Intuitively, some combinations of property values drive it towards generating text that has the qualities of a summary, while others towards text that reads like a review. However, we might not know in advance which values are necessary for generation of summaries. Furthermore, q(r_i, r_−i) cannot be applied at test time as it requires access to target texts. In the following section, we describe a solution that switches the generator to the summarization mode relying only on input reviews.

3.1 Plug-in Network

We start by introducing a parametrized plug-in network p_φ(r_−i) that yields the same types of properties as q(r_i, r_−i). From a practical perspective, the plug-in should be input-permutation invariant and allow for an arbitrary number of input reviews (Zaheer et al., 2017). Importantly, the trainable plug-in can have a marginal fraction of the main model's parameters, which makes it less prone to over-fitting when trained on small datasets. We initialize the parameters of p_φ(r_−i) by matching its output to q(r_i, r_−i) on the unannotated reviews. Specifically, we use a weighted combination of distances, shown for one group of reviews in Eq. 5:

\sum_{i=1}^{N} \sum_{l=1}^{L} \alpha_l D_l(p_\phi(r_{-i})_l, q(r_i, r_{-i})_l)    (5)

Here, D_l(p_φ(r_−i)_l, q(r_i, r_−i)_l) is a distance for property l, and α_l is an associated weight. Specifically, we used the L1 norm for Content Coverage and the Rating and Length Deviations, and the Kullback-Leibler divergence for Writing Style. For the plug-in network itself, we employed a multi-layer feed-forward network with multi-head attention modules over the encoded states of the source reviews at each layer, followed by a linear transformation predicting property values. Note that the encoder is shared with the main model.
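A minimal sketch of the initialization objective in Eq. 5 is given below, assuming the plug-in and oracle outputs are stored in dictionaries of tensors keyed by property name; the weights mirror those reported in Appendix 9.3, and all names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def plugin_init_loss(pred: dict, oracle: dict, weights: dict) -> torch.Tensor:
    """Eq. 5: weighted per-property distances between p_phi(r_-i) and q(r_i, r_-i).

    Scalar properties (ROUGE scores, rating/length deviations) use an L1 distance;
    the point-of-view distribution uses a KL divergence."""
    loss = torch.zeros(())
    for name in ("content_coverage", "rating_deviation", "length_deviation"):
        loss = loss + weights[name] * F.l1_loss(pred[name], oracle[name])
    loss = loss + weights["writing_style"] * F.kl_div(
        pred["writing_style"].log(), oracle["writing_style"], reduction="sum")
    return loss

# Per-property weights alpha_l (ROUGE, POV, rating and length deviations),
# following the values reported in Appendix 9.3.
weights = {"content_coverage": 0.5, "writing_style": 0.08,
           "rating_deviation": 1.0, "length_deviation": 0.1}
```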
3.2 Fine-Tuning

Unsurprisingly perhaps, the network p_φ, being initialized on unannotated reviews, inherits a strong bias towards outputting property values that result in the generation of reviews, which is not appropriate for generating summaries. Fortunately, due to the simplicity of the chosen properties, it is possible to fine-tune p_φ to match the output of q on the annotated data (s^k, r_1:N^k)_{k=1}^K using Eq. 5. An alternative is to optimize the plug-in to directly increase the likelihood of summaries under G_θ while keeping all other parameters fixed.³

³ We explored that option, and observed that it works similarly, yet leads to a slightly worse result.

As the generator is trained on unannotated reviews, it might not encounter a sufficient amount of text that is written as a summary and that highly overlaps in content with the input reviews. We address that by unfreezing the attention module of G_θ over the input reviews together with the plug-in p_φ, and by maximizing the likelihood of summaries:

\frac{1}{K} \sum_{k=1}^{K} \log G_\theta(s^k \mid E_\theta(r_{1:N}^k), p_\phi(r_{1:N}^k))    (6)

This allows the system to learn an interaction between G_θ and p_φ: for example, which property values are better associated with summaries and how G_θ should better respond to them.
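The adaptation step in Eq. 6 touches only a small set of parameters. The sketch below outlines the idea, assuming hypothetical `encode`, `log_prob`, and `cross_attention_parameters` accessors on the model object; it is a schematic training loop under those assumptions, not the authors' implementation.

```python
import torch

def finetune_on_summaries(model, plugin, batches, lr=1e-4, epochs=30):
    """Maximize Eq. 6 while updating only the generator's attention over the
    input reviews and the plug-in network; everything else stays frozen."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = list(model.cross_attention_parameters()) + list(plugin.parameters())
    for p in trainable:
        p.requires_grad = True
    optimizer = torch.optim.Adam(trainable, lr=lr)

    for _ in range(epochs):
        for reviews, summary in batches:
            enc_states = model.encode(reviews)                   # E_theta(r_1:N)
            props = plugin(enc_states)                           # p_phi(r_1:N)
            loss = -model.log_prob(summary, enc_states, props)   # -log G_theta(s | ...)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```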
4 Experimental Setup

4.1 Dataset

For training we used customer reviews from Amazon (He and McAuley, 2016) and Yelp.⁴ From the Amazon reviews we selected 4 categories: Electronics; Clothing, Shoes and Jewelry; Home and Kitchen; Health and Personal Care. We used a similar pre-processing schema as in Bražinskas et al. (2020); details are presented in Appendix 9.1. For training, we partitioned business/product reviews into groups of 9 reviews by sampling without replacement. Thus, for unsupervised training in Sec. 2, we conditioned on 8 reviews for each target review. The data statistics are shown in Table 2.

Dataset    Training             Validation
Yelp       38,913/1,016,347     4,324/113,886
Amazon     182,932/3,889,782    9,629/205,992

Table 2: Data statistics after pre-processing. The format in the cells is Businesses/Reviews and Products/Reviews for Yelp and Amazon, respectively.

We obtained 480 human-written summaries (180 for Amazon and 300 for Yelp), each written for 8 reviews, using Amazon Mechanical Turk (AMT). Each product/business received 3 summaries, and averaged ROUGE scores are reported in the following sections. We reserved approximately 1/3 of the summaries for testing and the rest for training and validation. The details are in Appendix 9.2.

⁴ https://www.yelp.com/dataset/challenge

4.2 Experimental Details

For the main model, we used the Transformer architecture (Vaswani et al., 2017) with trainable length embeddings and shared parameters between the encoder and generator (Raffel et al., 2019). Subwords were obtained with BPE (Sennrich et al., 2016) using 32000 merges. Subword embeddings were shared across the model as a form of regularization (Press and Wolf, 2017). For a fair comparison, we approximately matched the number of parameters to CopyCat (Bražinskas et al., 2020). We randomly initialized all parameters with Glorot initialization (Glorot and Bengio, 2010). For the plug-in network, we employed a multi-layer feed-forward network with multi-head attention modules over the encoded states of the source reviews; after the last layer, we performed a linear projection to compute property values. Parameter optimization was performed using Adam (Kingma and Ba, 2014), and beam search with n-gram blocking (Paulus et al., 2017) was applied to our model and CopyCat for summary generation. All experiments were conducted on 4 x GeForce RTX 2080 Ti GPUs.

4.3 Hyperparameters

Our parameter-shared encoder-generator model used an 8-head, 6-layer Transformer stack. Dropout in sub-layers and on subword embeddings was set to 0.1, and we used 1000-dimensional position-wise feed-forward networks. We set the subword and length embeddings to 390 and 10 dimensions, respectively, and both were concatenated to be used as input. For the plug-in network, we set the output dimension to 30 and the internal feed-forward hidden dimension to 20. We used a stack of 3 layers with 3-head attention modules at each layer, and applied 0.4 internal dropout and 0.15 attention dropout. Property values produced by the plug-in or the oracle were concatenated with subword and length embeddings and linearly projected before being passed to the generator. In total, our model had approximately 25M parameters, while the plug-in network had only 100K (i.e., less than 0.5% of the main model's parameters). In all experiments, hyperparameter tuning was performed based on the ROUGE-L score on the Yelp and Amazon validation sets.
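The settings in Sections 4.2-4.3 can be collected into a single configuration. The dictionary below mirrors the reported values; the key names are our own and do not correspond to the released configuration files.

```python
# Illustrative configuration mirroring Sections 4.2-4.3.
FEWSUM_CONFIG = {
    "encoder_generator": {
        "layers": 6, "attention_heads": 8,
        "subword_embedding_dim": 390, "length_embedding_dim": 10,
        "ffn_hidden_dim": 1000, "dropout": 0.1, "bpe_merges": 32000,
    },
    "plugin": {
        "layers": 3, "attention_heads": 3,
        "hidden_dim": 20, "output_dim": 30,
        "internal_dropout": 0.4, "attention_dropout": 0.15,
    },
}
```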
4.4 Baselines

LexRank (Erkan and Radev, 2004) is an unsupervised extractive graph-based algorithm selecting sentences based on graph centrality. Sentences represent nodes in a graph whose edges have weights denoting similarity computed with tf-idf.

MeanSum is an unsupervised abstractive summarization model (Chu and Liu, 2019) that treats a summary as a discrete latent state of an autoencoder. The model is trained in a multi-task fashion with two objectives, one for prediction of reviews and the other for summary-review alignment in the semantic space using cosine similarity.

CopyCat is the state-of-the-art unsupervised abstractive summarizer (Bražinskas et al., 2020) that uses continuous latent representations to model review groups and individual review semantics. It has an implicit mechanism for novelty reduction and uses a copy mechanism.

As is common in the summarization literature, we also employed a number of simple summarization baselines. First, the Clustroid review was computed for each group of reviews as follows: we took each review from a group and computed ROUGE-L with respect to all other reviews, and the review with the highest ROUGE score was selected as the clustroid review. Second, we sampled a Random review from each group to be used as the summary. Third, we constructed a summary by selecting the leading sentences (Lead) from each review of a group.

5 Evaluation Results

Automatic Evaluation. We report ROUGE F1 (Lin, 2004) based evaluation results on the Amazon and Yelp test sets in Tables 3 and 4, respectively. The results indicate that our model outperforms abstractive and extractive methods on both datasets. The results are also supported by qualitative improvements over other models; see examples in the Appendix.

           R1      R2      RL
FewSum     0.3356  0.0716  0.2149
Copycat    0.2785  0.0477  0.1886
MeanSum    0.2663  0.0489  0.1711
LexRank    0.2772  0.0506  0.1704
Clustroid  0.2716  0.0361  0.1677
Lead       0.2700  0.0492  0.1495
Random     0.2500  0.0382  0.1572

Table 3: ROUGE scores on the Amazon test set.

           R1      R2      RL
FewSum     0.3729  0.0992  0.2276
Copycat    0.2812  0.0589  0.1832
MeanSum    0.2750  0.0354  0.1609
LexRank    0.2696  0.0493  0.1613
Clustroid  0.2890  0.0490  0.1800
Lead       0.2620  0.0457  0.1432
Random     0.2148  0.0259  0.1387

Table 4: ROUGE scores on the Yelp test set.

Best-Worst Scaling. We performed human evaluation with Best-Worst scaling (Louviere and Woodworth, 1991; Louviere et al., 2015; Kiritchenko and Mohammad, 2016) on the Amazon and Yelp test sets using the AMT platform. We assigned multiple workers to each tuple containing summaries from CopyCat, our model, LexRank, and human annotators. The judgment criteria were the following: Fluency, Coherence, Non-redundancy, Informativeness, and Sentiment. Details are provided in Appendix 9.6.

For every criterion, a system's score is computed as the percentage of times it was selected as best, minus the percentage of times it was selected as worst (Orme, 2009). The scores range from -1 (unanimously worst) to +1 (unanimously best).

The results are presented in Tables 5 and 6 for Amazon and Yelp, respectively. On the Amazon data, they indicate that our model is preferred across the board over the baselines. CopyCat is preferred over LexRank in terms of fluency and non-redundancy, yet it shows worse results in terms of informativeness and overall sentiment preservation. In the same vein, on Yelp (Table 6) our model outperforms the other models. All pairwise differences between our model and other models are statistically significant at p < 0.05, using post-hoc HD Tukey tests. The only exception is non-redundancy on Yelp when comparing our model and CopyCat (where our model shows a slightly lower score).
          Fluency   Coherence  Non-Redundancy  Informativeness  Sentiment
FewSum    0.1000    0.1429     0.1250          0.2000           0.3061
Copycat   -0.1765   -0.5333    -0.2727         -0.7455          -0.7143
LexRank   -0.4848   -0.5161    -0.5862         -0.3488          -0.0909
Gold      0.5667    0.6364     0.6066          0.6944           0.4138

Table 5: Human evaluation results in terms of Best-Worst scaling on the Amazon test set.

          Fluency   Coherence  Non-Redundancy  Informativeness  Sentiment
FewSum    0.1636    0.1429     0.0000          0.3793           0.3725
Copycat   -0.2000   -0.0769    0.1053          -0.4386          -0.2857
LexRank   -0.5588   -0.5312    -0.6393         -0.6552          -0.4769
Gold      0.5278    0.3784     0.4795          0.6119           0.4118

Table 6: Human evaluation results in terms of Best-Worst scaling on the Yelp test set.

Content Support. As was observed by Falke et al. (2019), Tay et al. (2019), and Bražinskas et al. (2020), the ROUGE metric can be insensitive to hallucinated facts and entities. We therefore also investigated how well the generated text is supported by the input reviews. We split summaries generated by our model and CopyCat into sentences. Then, for each summary sentence, we hired 3 AMT workers to judge how well the content of the sentence is supported by the reviews. Three options were available. Full support: all the content is reflected in the reviews; Partial support: only some content is reflected in the reviews; No support: content is not reflected in the reviews.

          Full (%)  Partial (%)  No (%)
FewSum    43.09     34.14        22.76
Copycat   46.15     27.18        26.67

Table 7: Content support on the Amazon test set.

The results are presented in Table 7. Despite not using a copy mechanism, which is beneficial for fact preservation (Falke et al., 2019) and generation of more diverse and detailed summaries (see Appendix), we score on par with CopyCat.

6 Analysis
6.1 Alternative Adaptation Strategies

We further explored alternative ways of utilizing the annotated data points, based on the same split of the Amazon summaries as explained in Sec. 4.1. First, we trained a model in an unsupervised learning setting (USL) on the Amazon reviews with the leave-one-out objective in Eq. 1. In this setting, the model has exposure neither to summaries nor to the properties, as the oracle q(r_i, r_−i) is not used. Further, we considered two alternative settings in which the pre-trained unsupervised model can be adapted on the gold summaries. In the first setting, the model is fine-tuned by predicting summaries conditioned on input reviews (USL+F). In the second one, similar to Hoang et al. (2019), we performed adaptation in a multi-task learning (MTL) fashion: here, USL is further trained on a mixture of unannotated corpus review batches and gold summary batches, with a trainable embedding indicating the task.⁵ The results are presented in Table 8.

First, we observed that USL generates summaries that get the worst ROUGE scores. Additionally, the generated text tends to be informal and substantially shorter than an average summary; we discuss this further in Sec. 6.2. Second, when the model is fine-tuned on the gold summaries (USL+F), it noticeably improves the results, yet they are substantially worse than those of our proposed few-shot approach. This can be explained by the strong influence of the unannotated data stored in millions of parameters, which requires more annotated data points to overrule. Finally, we observed that MTL fails to decouple the tasks, indicated by only a slight improvement over USL.

⁵ We observed that a 1:1 review-summary proportion works best.
         R1      R2      RL
FewSum   0.3356  0.0716  0.2149
MTL      0.2403  0.0435  0.1627
USL+F    0.2823  0.0624  0.1964
USL      0.2145  0.0315  0.1523
Random   0.2500  0.0382  0.1572

Table 8: ROUGE scores on the Amazon test set for alternative summary adaptation strategies.

6.2 Influence of Unannotated Data

We further analyzed how plain fine-tuning on summaries differs from our approach in terms of capturing summary characteristics. For comparison, we used USL and USL+F, which are presented in Sec. 6.1. Additionally, we analyzed unannotated reviews from the Amazon training set. Specifically, we focused on text formality and the average word count difference (Len) from the gold summaries in the Amazon test set. As a proxy for the former, we computed the marginal distribution over points of view (POV), based on pronoun counts; an additional class (NoPr) was allocated to cases with no pronouns. The results are presented in Table 9.

First, we observed that the training reviews are largely informal (49.0% and 7.3% for 1st and 2nd POV, respectively). Unsurprisingly, the model trained only on the reviews (USL) transfers a similar trait to the summaries that it generates.⁶ On the contrary, the gold summaries are largely formal, indicated by a complete absence of 1st and a marginal amount of 2nd POV pronouns. Also, an average review is substantially shorter than an average gold summary, and consequently, the summaries generated by USL are also shorter. Example summaries are presented in Table 10.

Further, we investigated how well USL+F adapts to the summary characteristics by being actually fine-tuned on them. Indeed, we observed that USL+F starts to shift in the direction of the summaries by reducing 1st POV pronouns and increasing the average summary length. Nevertheless, the gap is still wide and would probably require more data to be bridged. Finally, we observed that our approach adapts much better to the desired characteristics by producing well-formed summary text that is also very close in length to the gold summaries.

⁶ As beam search, attempting to find the most likely candidate sequence, was utilized, as opposed to random sequence sampling, we observed that the generated sequences had no cases of 2nd POV pronouns and no cases of complete pronoun absence (NoPr).

         1st    2nd   3rd    NoPr   Len
Gold     0.0    1.7   60.0   38.3   0.0
FewSum   0.5    1.3   83.2   15.0   3.4
USL+F    29.7   0.0   45.3   25.0   -28.6
USL      56.7   0.0   43.3   0.0    -32.7
Reviews  49.0   7.3   35.6   8.1    -17.6

Table 9: Text characteristics of summaries generated by different models on the Amazon test set.

Gold: These shoes run true to size, do a good job supporting the arch of the foot and are well-suited for exercise. They're good looking, comfortable, and the sole feels soft and cushioned. Overall they are a nice, light-weight pair of shoes and come in a variety of stylish colors.

FewSum: These running shoes are great! They fit true to size and are very comfortable to run around in. They are light weight and have great support. They run a little on the narrow side, so make sure to order a half size larger than normal.

USL+F: This is my second pair of Reebok running shoes and they are the best running shoes I have ever owned. They are lightweight, comfortable, and provide great support for my feet.

USL: This is my second pair of Reebok running shoes and I love them. They are the most comfortable shoes I have ever worn.

Table 10: Example summaries produced by models with different adaptation approaches.
6.3 Cross-Domain

We hypothesized that on a small dataset, the model primarily learns coarse-grained features, such as common writing phrases and their correlations between input reviews and summaries, and that these could, in principle, be learned from remotely related domains. We investigated this by fine-tuning the model on summaries that are not in the target domain of the Amazon dataset.
Specifically, we matched the data-point count for 3 out of 4 domains of the training and validation sets to the in-domain Amazon data experiment presented in Sec. 5; the test set remained the same for each domain as in the in-domain experiment. Then, we fine-tuned the same model 5 times with different seeds per target domain. For comparison, we used the in-domain model from the Amazon experiments in Sec. 5. We computed the average ROUGE-L score per target domain, where the overall σ was 0.0137. The results are reported in Table 11.

Domain        In-domain   Cross-domain
Cloth         0.2188      0.2220
Electronics   0.2146      0.2136
Health        0.2121      0.1909
Home          0.2139      0.2250
Avg           0.2149      0.2129

Table 11: In-domain and cross-domain experiments on the Amazon dataset; ROUGE-L scores are reported.

The results indicate that the models perform on par on most of the domains, supporting the hypothesis. On the other hand, the in-domain model shows substantially better results on the health domain, which is expected, as, intuitively, this domain is the most different from the rest.

7 Related Work

Extractive weakly-supervised opinion summarization has been an active area of research. LexRank (Erkan and Radev, 2004) is an unsupervised extractive model. Opinosis (Ganesan et al., 2010) does not use any supervision and relies on POS tags and redundancies to generate short opinions. However, this approach is not well suited for the generation of coherent long summaries and, although it can recombine fragments of input text, it cannot generate novel words and phrases. Other earlier approaches (Gerani et al., 2014; Di Fabbrizio et al., 2014) relied on text planners and templates, which restrict the output text. A more recent extractive method of Angelidis and Lapata (2018) frames the problem as a pipeline of steps with different models for each step. Isonuma et al. (2019) introduce an unsupervised approach for single product review summarization, where they rely on latent discourse trees. The most related unsupervised approach to this work is our own work, CopyCat (Bražinskas et al., 2020). Unlike that work, we rely on a powerful generator to learn conditional spaces of text without hierarchical latent variables.
Finally, in contrast to MeanSum (Chu and Liu, 2019), our model relies on inductive biases without explicit modeling of summaries. A concurrent model, DenoiseSum (Amplayo and Lapata, 2020), uses a synthetically generated dataset of source reviews to train a generator to denoise and distill common information. Another parallel work, OpinionDigest (Suhara et al., 2020), considers controllable opinion aggregation and is a pipeline framework for abstractive summary generation. Our conditioning on text properties is similar in spirit to Ficler and Goldberg (2017), yet we rely on automatically derived properties that relate a target to its sources, and learn a separate module to generate their combinations. Moreover, their method has not been studied in the context of summarization.

8 Conclusions

In this work, we introduce the first, to our knowledge, few-shot framework for abstractive opinion summarization. We show that it can efficiently utilize even a handful of annotated review-summary pairs to train models that generate fluent, informative, and overall sentiment-reflecting summaries. We propose to exploit summary-related properties of unannotated reviews that are used for unsupervised training of a generator. We then train a tiny plug-in network that learns to switch the generator to the summarization regime. We demonstrate that our approach substantially outperforms competitive ones, both abstractive and extractive, in human and automatic evaluation. Finally, we show that it also allows for successful cross-domain adaptation.

Acknowledgments

We would like to thank members of the Edinburgh NLP group for discussion, and the anonymous reviewers for their valuable feedback. We gratefully acknowledge the support of the European Research Council (Titov: ERC StG BroadSem 678254; Lapata: ERC CoG TransModal 681760) and the Dutch National Science Foundation (NWO VIDI 639.022.518).
References

Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the Association for Computational Linguistics (ACL).

Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3675-3686.

Julian Besag. 1975. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 24(3):179-195.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440-447.

Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the Association for Computational Linguistics (ACL).

Eric Chu and Peter Liu. 2019. MeanSum: a neural model for unsupervised multi-document abstractive summarization. In Proceedings of the International Conference on Machine Learning (ICML), pages 1223-1232.

Hoa Trang Dang. 2005. Overview of DUC 2005. In Proceedings of the Document Understanding Conference, volume 2005, pages 1-12.

Giuseppe Di Fabbrizio, Amanda Stent, and Robert Gaizauskas. 2014. A hybrid approach to multi-document summarization of opinions in reviews. Pages 54-63.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457-479.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214-2220.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94-104, Copenhagen, Denmark.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126-1135.

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 340-348.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1602-1613.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507-517.

Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, and Yejin Choi. 2019. Efficient adaptation of pretrained transformers for abstractive summarization. arXiv preprint arXiv:1906.00138.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-177. ACM.

Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, and Ichiro Sakata. 2017. Extractive summarization using multi-task learning with document classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2101-2110.

Masaru Isonuma, Junichiro Mori, and Ichiro Sakata. 2019. Unsupervised neural single-document summarization of reviews via learning latent discourse structure and its ranking. In Proceedings of the Association for Computational Linguistics (ACL).

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Svetlana Kiritchenko and Saif M. Mohammad. 2016. Capturing reliable fine-grained sentiment associations by crowdsourcing and best-worst scaling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 811-817.

Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, post-conference workshop of ACL.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the International Conference on Learning Representations (ICLR).

Jordan J. Louviere, Terry N. Flynn, and Anthony Alfred John Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.

Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.

Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4):1093-1113.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280-290.

Bryan Orme. 2009. MaxDiff analysis: Simple counting, individual-level logit, and HB. Sequim, WA: Sawtooth Software.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 157-163.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379-389.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the Association for Computational Linguistics (ACL).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the Association for Computational Linguistics (ACL).

Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, and Wang-Chiew Tan. 2020. OpinionDigest: A simple framework for opinion summarization. In Proceedings of the Association for Computational Linguistics (ACL).

Wenyi Tay, Aditya Joshi, Xiuzhen Jenny Zhang, Sarvnaz Karimi, and Stephen Wan. 2019. Red-faced ROUGE: Examining the suitability of ROUGE for opinion summary evaluation. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, pages 52-60.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630-3638.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R. Salakhutdinov, and Alexander J. Smola. 2017. Deep sets. In Advances in Neural Information Processing Systems, pages 3391-3401.
9 Appendices

9.1 Dataset Pre-Processing

We selected only Amazon products and Yelp businesses with a minimum of 10 reviews, and with minimum and maximum review lengths of 20 and 70 words, respectively. Also, popular products/businesses above the 90th percentile were removed. From each business/product we sampled 9 reviews without replacement to form groups of reviews.
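The filtering above is simple to express in code. The sketch below is an illustrative reimplementation under our own assumptions about the data layout (a mapping from product id to a list of review texts); it is not the released pre-processing script.

```python
import random

def preprocess(reviews_by_product, group_size=9, min_reviews=10,
               min_len=20, max_len=70, popularity_percentile=0.9):
    """Filter products/businesses and form review groups as described in 9.1."""
    # Keep only reviews within the allowed length range (in words).
    filtered = {pid: [r for r in revs if min_len <= len(r.split()) <= max_len]
                for pid, revs in reviews_by_product.items()}
    # Keep products with at least `min_reviews` remaining reviews.
    filtered = {pid: revs for pid, revs in filtered.items() if len(revs) >= min_reviews}
    # Drop overly popular products (above the 90th percentile by review count).
    counts = sorted(len(revs) for revs in filtered.values())
    if counts:
        cutoff = counts[int(popularity_percentile * (len(counts) - 1))]
        filtered = {pid: revs for pid, revs in filtered.items() if len(revs) <= cutoff}
    # Sample groups of 9 reviews without replacement (1 target + 8 sources).
    groups = []
    for pid, revs in filtered.items():
        random.shuffle(revs)
        for i in range(0, len(revs) - group_size + 1, group_size):
            groups.append((pid, revs[i:i + group_size]))
    return groups
```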
9.2 Evaluation Data Split

From the annotated Amazon dataset, we used 28, 12, and 20 products for training, validation, and testing, respectively. On Yelp, we used 30, 30, and 40 businesses for training, validation, and testing, respectively. Both the automatic and the human evaluation experiments were performed on the test sets.

9.3 Training Procedure

First, to speed up the training phase, we trained an unconditional language model for 13 epochs on the Amazon reviews with the learning rate (LR) set to 5 × 10⁻⁴. On Yelp we trained it for 27 epochs with the LR set to 7 × 10⁻⁴. The language model was used to initialize both the encoder and the generator of the main model.

Subsequently, we trained the model using Eq. 2 for 9 epochs on the Amazon reviews with a 6 × 10⁻⁵ LR, and on Yelp for 57 epochs with the LR set to 5 × 10⁻⁵. Additionally, we reduced novelty using Eq. 4 by training the model further for 1 epoch with a 10⁻⁵ LR and λ = 2 on Amazon; on Yelp we trained for 4 epochs with a 3 × 10⁻⁵ LR and λ = 2.5.

For the plug-in network's initialization, as explained in Sec. 3.1, we performed optimization by output matching with the oracle for 11 epochs on the unannotated Amazon reviews with a 1 × 10⁻⁵ LR; on Yelp we trained for 87 epochs with a 1 × 10⁻⁵ LR. Lastly, we fine-tuned the plug-in network on the human-written summaries by output matching with the oracle,⁷ on the Amazon data for 98 epochs with a 7 × 10⁻⁴ LR, and on Yelp for 62 epochs with a 7 × 10⁻⁵ LR. We set the weights to 0.1, 1.0, 0.08, and 0.5 for length deviation, rating deviation, POV, and ROUGE scores, respectively. We then fine-tuned the attention part of the model and the plug-in network jointly for 33 epochs with a 1 × 10⁻⁴ LR on the Amazon data, and for 23 epochs with a 1 × 10⁻⁴ LR on Yelp.

⁷ We set the rating deviation to 0, as summaries do not have associated human-annotated ratings.

9.4 Summary Annotation

For summary annotation, we reused 60 Amazon products from Bražinskas et al. (2020) and sampled 100 businesses from Yelp. We assigned 3 Mechanical Turk workers to each product/business, and instructed them to read the reviews and produce a summary text. We used the following instructions:

• The summary should reflect common user opinions expressed in the reviews. Try to preserve the common sentiment of the opinions and their details (e.g., what exactly the users like or dislike). For example, if most reviews are negative about the sound quality, then also write negatively about it.

• Please make the summary coherent and fluent in terms of sentence and information structure. Iterate over the written summary multiple times to improve it, and re-read the reviews whenever necessary.

• The summary should not look like a review; please write formally.

• Keep the length of the summary reasonably close to the average length of the reviews.

• Please try to write the summary using your own words instead of copying text directly from the reviews. Using the exact words from the reviews is allowed, but do not copy more than 5 consecutive words from a review.

9.5 Human Evaluation Setup

To perform the human evaluation experiments described in Sec. 5, we hired workers with a 98% approval rate, 1000+ HITs, location in the USA, UK, or Canada, and the maximum score on a qualification test that we had designed. The test asked whether the workers were native English speakers and verified that they correctly understood the instructions of both the best-worst scaling and the content support tasks.

9.6 Best-Worst Scaling Details

We performed human evaluation based on the Amazon and Yelp test sets using the AMT platform. We assigned workers to each tuple containing summaries from CopyCat, our model, LexRank, and human annotators. Due to dataset size differences, we assigned 5 and 3 workers to each tuple in the Amazon and Yelp test sets, respectively. We presented the associated reviews in a random order and asked the workers to judge summaries using Best-Worst scaling (BWS) (Louviere and Woodworth, 1991; Louviere et al., 2015), which is known to produce more reliable results than ranking scales (Kiritchenko and Mohammad, 2016). The judgment criteria are presented below, where non-redundancy and coherence were taken from Dang (2005). Fluency: the summary sentences should be grammatically correct, easy to read and understand; Coherence: the summary should be well structured and well organized; Non-redundancy: there should be no unnecessary repetition in the summary; Informativeness: how much useful information about the product does the summary provide?; Sentiment: how well does the sentiment of the summary agree with the overall sentiment of the original reviews?
9.7 Points of View

Summaries differ from reviews in terms of writing style. Specifically, reviews are predominantly written informally, populated by pronouns such as I and you. In contrast, summaries are desirable to be written formally. In this work, we observed that a surprisingly simple way to achieve that is to condition the generator on the distribution over pronoun classes of the target review. We computed pronoun counts and produced 4-class distributions: 1st, 2nd, and 3rd person POV, plus 'other' in case no pronouns are present. Consider the example sentences in Table 12. Here one can observe that sentences of different pronoun classes differ in the style of writing and often in the intention of the message: 1st POV sentences tend to provide clues about the personal experience of the user; 2nd POV sentences, on the other hand, commonly convey recommendations to a reader; 3rd POV and 'other' sentences often describe aspects and their associated opinions.

1st:
  I bought this as a gift for my husband.
  I've been using Drakkar Noir Balm for over twenty years.
  I purchased these for my son as a kind of a joke.

2nd:
  This is the best product you can buy!
  You get what you pay for.
  Please do yourself a favor and avoid this product.

3rd:
  This is his every work day scent.
  It's very hard to buy the balm separately.
  It smells like Drakkar, but it is hard to find.

No Pronouns:
  Very nice, not too overpowering.
  This product has no smell what ever.
  Nice to use for hardwood floors.

Table 12: Examples of review sentences that contain only pronouns belonging to a specific class.
Gold: Bennett Medical has poor customer service. Phone calls can take a long time to get answered and leaving voice mails tend to be fruitless. The products are overpriced and take a long time to be refilled. Using this medical supply company can be a hassle.

Ours: This medical supply is the worst medical supply company in the valley. The customer service is horrible, the staff is rude, the wait times are long, and the service reps are not helpful at all. Do not recommend this company to anyone.

Copycat: If I could give them 0 stars I would. The customer service is terrible, the staff is extremely rude and helpful. If you're looking for a new provider, this is the place to be. I will never use them again.

MeanSum: Service is horrible, especially the manager. I have a lot of kids but not this place. Two months later I was able to go in and get a new one to go in the next day. They would tell me that they would have to do a lot of our water to be there to get a new one. But this is the first time I have dealt with him and we will never go back again. Thanks for your hard work, and I will never go anywhere else.

Lexrank: Bennett Medical for Cpap supplies are horrible. Never enough staff to answer phone, so you'll need to leave messages. DON'T use this medical supply. If I could give Bennett Medical zero stars I would! Will be moving to another medical supply as soon as I can.

Review 1: Bennett Medical for Cpap supplies are horrible. We have waited for three weeks to refill supplies and we are still waiting. This company does not have good customer service, you can only leave messages, and they never call back. If I could give Bennett Medical zero stars I would!

Review 2: Teachers Health Trust, please look into the practice of the billing and filling of durable services. The mask cushions go for 45 to 50 days because of the lack of communication. The people in charge of billing are very argumentative and lack customer service. I will drop them after annual, because of my insurance obligations.

Review 3: Fantastic service from Jocelyn at the front desk, we had a really hard time getting the right paperwork together from Drs but she stuck with us and helped us every step of the way, even calling to keep us updated and to update info we might have for her. Thanks Jocelyn.

Review 4: I hardly ever write reviews, but I'd like to spare someone else from what I experienced. So a warning to the wise... If you like rude incompetent employees, almost an hour long wait for just picking up a phone order, and basically being treated like a second class citizen then look no further than Bennett Medical.

Review 5: DON'T use this medical supply. Never enough staff to answer phone, so you'll need to leave messages. No return phone calls. I am unable to get my CPAP supplies every quarter without hours of calling / waiting / calling. Poor customer service. Will be moving to another medical supply as soon as I can.

Review 6: Terrible experience. They have ridiculous price also bad customer services. You can get nebulizer machine around $50 at amazon, bennet medical charge you almost twice more expensive price. And breathing kit price was unbelievable too. Because of deduction, I had to pay them all out of my pocket whatever they charged. I don't recommand this medical company to anyone.

Review 7: Good luck getting a phone call back or someone to answer the phone without hanging up immediately. I have called over 20 times left 5 voicemails over the last 30 days, just to refill a mask perscription. This is an ongoing issue that is beyond frustrating. Not trying to destroy this businesses name just want the owners to implement some basic customer service skills.

Review 8: Always receive friendly customer service whenever we call or go into the location. My questions are always answered and I am very happy with the supplies we get from them. Great people providing a great service! Thank you for all you do!

Table 13: Example summaries produced by different systems on Yelp data.