Finnish Paraphrase Corpus

Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Jenna Saarni, Maija Sevón, and Otto Tarkka
TurkuNLP group, Department of Computing, Faculty of Technology, University of Turku, Finland
jmnybl@utu.fi

Abstract

In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all paraphrase pairs in our corpus, 98% are manually classified to be paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility for high-quality paraphrase selection in terms of both cost and quality.

1 Introduction

The powerful language models that have recently become available in NLP have also resulted in a distinct shift towards more meaning-oriented tasks for model fine-tuning and evaluation. The most typical example is entailment detection, with the paraphrase task rising in interest recently. Paraphrases, texts that express the same meaning with differing words (Bhagat and Hovy, 2013), are — already by their very definition — a suitable target to induce and evaluate models' ability to represent meaning. Paraphrase detection and generation have numerous direct applications in NLP (Madnani and Dorr, 2010), among others in question answering (Soni and Roberts, 2019), plagiarism detection (Altheneyan and Menai, 2019), and machine translation (Mehdizadeh Seraj et al., 2015).

Research in paraphrase naturally depends on the availability of datasets for the task. We review these in more detail in Section 2; nevertheless, barring a few exceptions, paraphrase corpora are typically large and gathered automatically using one of several possible heuristics. Typically a comparatively small section of the corpus is manually classified to serve as a test set for method development. The heuristics used to gather and filter the corpora naturally introduce a bias which, as we will show later in this paper, manifests itself as a tendency towards short examples with a relatively high lexical overlap. Addressing this bias to the extent possible, and providing a corpus with longer, lexically more diverse paraphrases, is one of the motivations for our work. The other motivation is to cater for the needs of Finnish NLP and improve the availability of high-quality, manually annotated paraphrase data specifically for the Finnish language.

In this paper, we therefore aim for the following contributions. Firstly, we establish and test a fully manual procedure for paraphrase candidate selection with the aim of avoiding a selection bias towards short, lexically overlapping candidates. Secondly, we release the first fully manually annotated paraphrase corpus of Finnish, sufficiently large for model training. The number of manually annotated examples makes the released dataset one of the largest, if not the largest, manually annotated paraphrase corpus for any language. And thirdly, we report the experiences, tools, and baseline results on this new dataset, hopefully allowing other language NLP communities to assess the potential of developing a similar corpus for other languages.

2 Related Work

Statistics of the different paraphrase corpora most relevant to our work are summarized in Table 1.

Corpus        Data source                  Size autom.  Size manual  Labels
English
  MRPC        Online news                  —            5,801        0/1
  TUC         News tweets                  —            52K          0/1
  ParaSCI     Scientific papers            350K         —            1-5
  PARADE      Flashcards (computer sci.)   —            10K          0-3
  QQP         Quora                        404K         —            0/1
Finnish
  Opusparcus  OpenSubtitles                480K*        3,703        1-4
  TaPaCo      Tatoeba crowdsourcing        12K          —            —

Table 1: Summary of available paraphrase corpora of naturally occurring sentential paraphrases. The corpora sizes include the total amount of pairs in the corpus (i.e. also those labeled as non-paraphrases); thus the actual number of good paraphrases depends on the class distribution of each corpus. *The highest quality cutpoint estimated by the authors.

For English, the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) is extracted from an online news collection by applying heuristics to recognize candidate document pairs and candidate sentences from the documents. Paraphrase candidates are subsequently filtered using a classifier, before the final manual binary annotation (paraphrase or not). In the Twitter URL Corpus (TUC) (Lan et al., 2017), paraphrase candidates are identified by recognizing shared URLs in news-related tweets. All candidates are manually binary-labeled. ParaSCI (Dong et al., 2021) is created by collecting paraphrase candidates from ACL and arXiv papers using heuristics based on term definitions and citation information as well as sentence embedding similarity. The extracted candidates are automatically filtered, but no manually annotated data is available. PARADE (He et al., 2020) is created by collecting online user-generated flashcards for computer science related concepts. All definitions for the same term are first clustered, and paraphrase candidates are extracted only within a cluster to reduce noise in candidate selection. All extracted candidates are manually annotated using a scheme with four labels. Quora Question Pairs (QQP)¹ contains question headings from the forum with binary duplicate-or-not labels. The QQP dataset is larger than the other datasets; however, although it includes human-produced labels, the labeling is not originally designed for paraphrasing, and the dataset providers warn that the labeling is not guaranteed to be perfect.

¹ data.quora.com/First-Quora-Dataset-Release-Question-Pairs

Another common approach to automatic paraphrase identification is language pivoting using multilingual parallel datasets. Here sentence alignments are used to recognize whether two different surface realizations share an identical or near-identical translation, assuming that the identical translation likely implies a paraphrase. There are two multilingual paraphrase datasets automatically extracted using language pivoting, Opusparcus (Creutz, 2018) and TaPaCo (Scherrer, 2020), both including a Finnish subsection. Opusparcus consists of candidate paraphrases automatically extracted from the alternative translations of movie and TV show subtitles after automatic sentence alignment. While the candidate paraphrases are automatically extracted, a small subset of a few thousand paraphrase pairs for each language is manually annotated. TaPaCo contains candidate paraphrases automatically extracted from the Tatoeba dataset², which is a multilingual crowdsourced database of sentences and their translations. Like Opusparcus, TaPaCo is based on language pivoting, where all alternative translations for the same statement are collected. However, unlike most other corpora, the candidate paraphrases are grouped into 'sets' instead of pairs, and all sentences in a set are considered equivalent in meaning. TaPaCo does not include any manual validation.

² https://tatoeba.org/eng/

3 Text Selection

As discussed previously, we elect to rely on fully manual candidate extraction as a measure against any bias introduced through heuristic candidate selection methods. In order to obtain sufficiently many paraphrases for the person-months spent, the text sources need to be paraphrase-rich, i.e. have a high probability of naturally occurring paraphrases. Such text sources include, for example, news headings and articles reporting on the same news, alternative translations of the same source material, different student essays and exam answers for the same assignment, and related questions with their replies in discussion fora, where
one can assume different writers using distinct words to state similar meaning. For this first version of the corpus, we use two different text sources: alternative Finnish subtitles for the same movies or TV episodes, and headings from news articles discussing the same event in two different Finnish news sites.

3.1 Alternative Subtitles

OpenSubtitles³ distributes an extensive collection of user-generated subtitles for different movies and TV episodes. These subtitles are available in multiple languages, but surprisingly often the same movie or episode has versions in a single language, originating from different sources. This gives an opportunity to exploit the natural variation produced by independent translators: by comparing two different subtitles for a single movie or episode, there is a high likelihood of finding naturally occurring paraphrases.

³ http://www.opensubtitles.org

From the database dump of OpenSubtitles2018 obtained through OPUS (Tiedemann, 2012), we selected all movies and TV episodes with at least two Finnish subtitle versions. In case more versions are available, the two most lexically differing are selected for paraphrase extraction. We measure lexical similarity by TF-IDF weighted document vectors. Specifically, we create TF-IDF vectors with TfidfVectorizer from the sklearn package. We limit the number of features to 200K, apply sublinear scaling, use character 4-grams created out of text inside word boundaries, and otherwise use the default settings. To filter out subtitle pairs with a low density of interesting paraphrase candidates, pairs with too high or too low cosine similarity of TF-IDF vectors are discarded. High similarity usually reflects identical subtitles with minor formatting differences, while low similarity is typically caused by incorrect identifiers in the source data. The two selected subtitle versions are then roughly aligned using the timestamps, and divided into segments of 15 minutes. For every movie/episode, the annotators are assigned one random such segment, the two versions presented side-by-side in a custom tool, allowing for fast selection of paraphrase candidates.

In total, we were able to obtain at least one pair of aligned subtitle versions for 1,700 unique movies and TV series. While for each unique movie only one pair of aligned subtitles is selected for annotation, TV series comprise different episodes, dealing with the same plot and characters, and therefore overlapping in language. After an initial annotation period, we noticed a topic bias towards a limited number of TV series with a large number of episodes, and decided to limit the number of annotated episodes to 10 per TV series in all subsequent annotation. In total, close to 3,000 different movies/episodes are used for manual paraphrase candidate extraction, each including exactly one pair of aligned subtitles.

3.2 News Headings

We have downloaded news articles through open RSS feeds of different Finnish news sites during 2017–2021, resulting in a substantial collection of news from numerous complementary sources. For the present work, we narrow the data down to two sources: the Finnish Broadcasting Company (YLE) and Helsingin Sanomat (HS, English translation: Helsinki News). We align the news using a 7-day sliding window on time of publication, combined with cosine similarity of TF-IDF-weighted document vectors induced on the article body, obtaining article pairs likely reporting on the same event. The settings of the TF-IDF vectors are the same as in Section 3.1. We use the article headings as paraphrase candidates, striving to select maximally dissimilar headings of maximally similar articles as the most promising candidates for non-trivial paraphrases. In practice, we used a simple grid search and human judgement to establish the most promising region of article body and heading similarity values.
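The same TF-IDF setup is used both for picking the two most dissimilar subtitle versions and for the news alignment above. As a rough illustration of this filtering step, a minimal sketch in Python with scikit-learn is given below; the similarity cut-offs and helper names are illustrative assumptions, not values taken from the corpus tooling.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF settings as described in Section 3.1: at most 200K features,
# sublinear term-frequency scaling, character 4-grams inside word boundaries.
vectorizer = TfidfVectorizer(
    max_features=200_000,
    sublinear_tf=True,
    analyzer="char_wb",
    ngram_range=(4, 4),
)

def pair_similarities(docs_a, docs_b):
    """Cosine similarity of TF-IDF document vectors for aligned pairs
    (two subtitle versions of the same movie, or two article bodies)."""
    matrix = vectorizer.fit_transform(list(docs_a) + list(docs_b))
    n = len(docs_a)
    return [float(cosine_similarity(matrix[i], matrix[n + i])[0, 0]) for i in range(n)]

# Pairs that are nearly identical (formatting-only differences) or nearly
# disjoint (mis-identified material) are discarded; these cut-offs are
# assumed for illustration, not the values used for the corpus.
MIN_SIM, MAX_SIM = 0.3, 0.9

def keep_pair(similarity):
    return MIN_SIM <= similarity <= MAX_SIM

In practice the vectorizer would be fitted on the full document collection rather than per batch; the sketch only shows the shape of the computation.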
4 Paraphrase Annotation

The paraphrase annotation is comprised of multiple annotation steps, including candidate selection as described above, manual classification of candidates based on an annotation scheme, as well as the possibility of rewriting partial paraphrases into full paraphrases. Next, we will discuss the different paraphrase types represented in our annotation scheme, and afterwards the annotation workflow is discussed in more detail.
4.1 Annotation Scheme

Instead of a simple yes/no (equivalent or not equivalent) as in MRPC (Dolan and Brockett, 2005), or a 1–4 scale (bad, mostly bad, mostly good, and good) as in Opusparcus (Creutz, 2018), our annotation scheme is adapted to capture the level of paraphrasability in a more fine-grained fashion: each candidate pair receives a base label on a 1–4 scale (4 being a paraphrase in all contexts, 3 a paraphrase at least in its given context, 2 related but not a paraphrase, and 1 unrelated), and pairs with base label 4 may additionally carry the directional subsumption flags (> or <) as well as the flags s and i. Example statement pairs for the different labels are shown in Table 2.
Label  Statement 1                                Statement 2
4      Tyrmistyttävän lapsellista!                Pöyristyttävän kypsymätöntä!
4s     Olen työskennellyt lounaan ajan.           Tein töitä koko ruokiksen.
4i     Teitäpä onnisti.                           Oletpa onnekas.
4>     Tein lujasti töitä niiden rahojen eteen.   Paiskin kovasti töitä.

Table 2: Example statement pairs for the different labels (English translations are given in Appendix A).
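As a small illustration of how these labels compose, the full labels appearing later in Figures 1 and 2 and in Table 4 (for example 4is or 4>s) combine a base label with optional flags. The sketch below assumes this plain string representation and is not code from the corpus release.

# Decompose a full label such as "4", "4<" or "4is" into its base label and
# flag set; the string format is assumed from the label names used in
# Figures 1-2 and Table 4, not taken from the corpus tooling.
def parse_label(label: str):
    base = int(label[0])        # base label 1-4
    flags = set(label[1:])      # subset of {"<", ">", "s", "i"}, only used with base 4
    assert base in (1, 2, 3, 4) and flags <= {"<", ">", "s", "i"}
    assert base == 4 or not flags
    return base, flags

print(parse_label("4is"))  # -> (4, {'i', 's'})
print(parse_label("4>"))   # -> (4, {'>'})
print(parse_label("3"))    # -> (3, set())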
Figure 1: Label distribution in our corpus, excluding the 7,909 rewrites, which can be added on top of label 4.

Figure 2: Density of different labels in the training set conditioned on cosine similarity of the paraphrase pairs.

In Figure 2, the proportion of label 4 increases throughout the similarity range and dominates the sparsely populated range of similarities over 0.8.

5.1 Annotation Quality

After the initial annotator training phase, most of the annotation work is carried out as single annotation. In order to monitor annotation consistency, double annotation batches are assigned regularly. In double annotation, one annotator first extracts the candidate paraphrases from the aligned documents, but later on these candidates are assigned to two different annotators, who annotate the labels independently of each other. Next, these two individual annotations are merged and conflicting labels are resolved together with the whole annotation team in a meeting. These consensus annotations constitute a gold standard against which individual annotators can be measured.

A total of 1,175 examples are double annotated (2.5% of the data⁴). Most of these are annotated by exactly two annotators; however, some examples may include annotations from more than two annotators, and thus the total number of individual annotations for which a gold standard label exists is 2,513. We measure the agreement of individually annotated examples against the gold standard annotations in terms of accuracy, i.e. the proportion of individually annotated examples with a correctly assigned label.

⁴ During the initial annotator training double annotation was used extensively; this annotator training data is not included in the released corpus.

The overall accuracy is 68.7% when the base label (labels 1–4) as well as all additional flags are taken into consideration. When discarding the least common flags s and i and evaluating only base labels and directional subsumption flags, the overall accuracy is 72.9%. To compare the observed agreement to previous studies on paraphrase annotation, the Opusparcus annotation agreement is approximately 64% on the Finnish development set and 67% on the test set (calculated from the numbers in Table 4 and Table 5 in Creutz (2018)). Opusparcus uses an annotation scheme with four labels, similar to our base label scheme. In MRPC, the reported agreement score is 84% on a binary paraphrase-or-not scheme. While direct comparison is difficult due to the different annotation schemes and label distributions, these figures show that the observed agreement seems to be roughly within the same range as agreement numbers seen in related work.

In addition to agreement accuracy, we calculate two versions of Cohen's kappa, a metric for inter-annotator agreement that takes into account the possibility of agreement occurring by chance. First we measure the kappa agreement of all individual annotations against the gold standard, an approach typical in the paraphrase literature. This kappa is 0.62, indicating substantial agreement. Additionally, we measure Cohen's kappa between each pair of annotators. The weighted average kappa over all annotator pairs is 0.41, indicating moderate agreement. Both are measured on full labels. When evaluating only on base labels and directional subsumption flags, these kappa scores are 0.65 and 0.45, respectively.
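The agreement figures above follow standard definitions and can be recomputed with common tooling. The sketch below is a minimal illustration assuming the individual and consensus gold-standard annotations are available as aligned lists of full label strings, and uses scikit-learn's implementation of Cohen's kappa.

from sklearn.metrics import accuracy_score, cohen_kappa_score

def agreement(individual, gold):
    """Accuracy and Cohen's kappa of individual annotations against the
    consensus gold standard, both computed on full label strings."""
    return accuracy_score(gold, individual), cohen_kappa_score(gold, individual)

def strip_si(label):
    """Reduce a full label to its base label and directional subsumption flag."""
    return "".join(ch for ch in label if ch not in "si")

def agreement_base_and_subsumption(individual, gold):
    return agreement([strip_si(l) for l in individual],
                     [strip_si(l) for l in gold])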
Figure 3: Comparison of paraphrase length distributions in terms of tokens per paraphrase.

Figure 4: Comparison of paraphrase pair cosine similarity distributions.

5.2 Corpus Comparison

We compare the distribution of paraphrase lengths and lexical similarity with the two Finnish paraphrase candidate corpora, Opusparcus and TaPaCo, as the reference. Direct comparison is complicated by several factors. Firstly, both Opusparcus and TaPaCo consist primarily of automatically extracted paraphrase candidates, Opusparcus having only small manually curated development and test sections, and TaPaCo being fully uncurated. Secondly, the small manually annotated sections of Opusparcus are sampled to emphasize lexically dissimilar pairs, and are therefore not representative of the characteristics of the whole corpus. We therefore compare with the fully automatically extracted sections of both Opusparcus and TaPaCo. For our corpus, we discard the small proportion of examples with base labels 1 and 2, i.e. non-paraphrases. Another important factor to consider is that the proportion of false candidates in the automatically extracted sections of Opusparcus and TaPaCo is unknown, further decreasing comparability: the characteristics of false and true candidates may differ substantially, false candidates for example likely being on average more dissimilar in terms of lexical overlap than true candidates.

For each corpus, we sample 12,000 paraphrase pairs. For our corpus, we selected a random sample of true paraphrases (label 3 or higher) from the train section. For TaPaCo, the sample covers all paraphrase candidates from the corpus, however with the restriction of taking only one, random pair from each 'set' of paraphrases. For Opusparcus, which is sorted by a confidence score in descending order, the sample was selected to contain the 12K most confident paraphrase candidates.⁵

⁵ When we repeated the length analysis with a sample of the 480K most confident pairs, the length distribution and average length remained largely unchanged, while the similarity distribution became close to flat. Without manual annotation, it is hard to tell the reason for this behavior.

In Figure 3, the length distribution of paraphrases in terms of tokens is measured for the abovementioned samples. Although the majority of paraphrases are rather short in all three corpora, we see that our corpus includes a considerably higher proportion of longer paraphrases. The average number of tokens in our corpus is 8.3 tokens per paraphrase, while it is 5.6 in TaPaCo and 3.6 in Opusparcus candidates.

In Figure 4, the paraphrase pair cosine similarity distribution is measured using TF-IDF weighted character n-grams of length 2–4. While both TaPaCo and Opusparcus lean towards higher-similarity candidates, the distribution of our corpus is more balanced, including a considerably higher proportion of pairs with low lexical similarity.
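The statistics behind Figures 3 and 4 are straightforward to reproduce. The sketch below assumes each 12K sample is a list of (statement 1, statement 2) pairs, counts tokens by whitespace splitting, and scores each pair with TF-IDF weighted character 2-4-grams; whether word-boundary n-grams and sublinear scaling were used for this particular analysis is an assumption carried over from Section 3.1.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def token_lengths(pairs):
    """Number of whitespace tokens per statement, over both sides of each pair."""
    return [len(text.split()) for pair in pairs for text in pair]

def pair_cosine_similarities(pairs):
    """Cosine similarity of TF-IDF weighted character 2-4-grams for each pair."""
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), sublinear_tf=True)
    texts = [text for pair in pairs for text in pair]
    matrix = vectorizer.fit_transform(texts)
    return [float(cosine_similarity(matrix[2 * i], matrix[2 * i + 1])[0, 0])
            for i in range(len(pairs))]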
6 Paraphrase Classification Baseline

In order to establish a baseline classification performance on the new dataset, we train a classifier based on the FinBERT model (Virtanen et al., 2019). Each paraphrase pair of statements A and B is encoded as the sequence [CLS] A [SEP] B [SEP], where [CLS] and [SEP] are the special marker tokens of the BERT model. Subsequently, the output embeddings of the three special tokens are concatenated together with the averaged embeddings of the tokens in A and B. These five concatenated embeddings are then propagated into four decision layers: one for the base label 2/3/4, one for the subsumption flag (<, >, or none), and one each for the binary flags s and i. Since the flags only apply to base label 4, no gradients are applied to these layers for examples with base labels 2 and 3. We have also explored other BERT-based architectures, such as basing the classification on the [CLS] embedding only, as is customary, and having a single classification layer comprising all possible base label and flag combinations. These resulted in a consistent drop in prediction accuracy, and we did not pursue them any further.

Label    Prec   Rec    F-score   Support
2        50.9   31.2   38.7        93
3        57.7   31.9   41.1       990
4        66.2   78.2   71.7      2149
4<       52.8   53.5   53.2      1007
4>       52.6   56.1   54.3      1136
i        51.5   36.5   42.7       329
s        51.4   28.9   37.0       249
W. avg   52.9   54.0   52.2
Acc                    54.0

Table 4: Classification performance on the test set, when the base label and the flags are predicted separately. In the upper section, we merge the subsumption flags with the base class prediction, but leave i and s separated. The rows W. avg and Acc, on the other hand, refer to performance on the complete labels, comprising all allowed combinations of base label and flags. W. avg is the average of P/R/F values across the classes, weighted by class support. Acc is the accuracy.

The baseline results are listed in Table 4, showing that the per-class F-score ranges between 38–71%, strongly correlated with the number of examples available for each class. When interpreting the task as a pure multi-class classification, i.e. when counting all possible combinations of base label and flags as their own class, the accuracy is 54%, with the majority baseline being 34.3% and the annotators' accuracy 68.7%. The model thus positions itself roughly at the mid-point between the trivial majority baseline and human performance.
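A simplified sketch of the described pair classifier, written with PyTorch and the Hugging Face transformers library, is given below. The five concatenated embeddings and the four decision heads follow the description above; the checkpoint name, the way the [SEP] positions are passed in, and the omission of the training loop (where the flag losses are masked out for base labels 2 and 3) are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn
from transformers import AutoModel

class ParaphrasePairClassifier(nn.Module):
    """A and B are encoded as [CLS] A [SEP] B [SEP]; the embeddings of the
    three special tokens and the averaged embeddings of the A and B tokens
    are concatenated and fed to four decision heads."""

    def __init__(self, model_name="TurkuNLP/bert-base-finnish-cased-v1"):  # assumed FinBERT checkpoint
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        h = self.bert.config.hidden_size
        self.base_head = nn.Linear(5 * h, 3)    # base label 2 / 3 / 4
        self.subsum_head = nn.Linear(5 * h, 3)  # subsumption flag <, > or none
        self.s_head = nn.Linear(5 * h, 2)       # flag s present / absent
        self.i_head = nn.Linear(5 * h, 2)       # flag i present / absent

    def forward(self, input_ids, attention_mask, token_type_ids, sep_positions):
        # sep_positions: (batch, 2) tensor with the indices of the two [SEP] tokens.
        emb = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids).last_hidden_state
        idx = torch.arange(emb.size(0), device=emb.device)
        cls_emb = emb[:, 0]
        sep1 = emb[idx, sep_positions[:, 0]]
        sep2 = emb[idx, sep_positions[:, 1]]
        # Mark special tokens so they are excluded from the segment averages.
        special = torch.zeros_like(attention_mask)
        special[:, 0] = 1
        special[idx, sep_positions[:, 0]] = 1
        special[idx, sep_positions[:, 1]] = 1
        mask_a = (attention_mask * (1 - token_type_ids) * (1 - special)).unsqueeze(-1)
        mask_b = (attention_mask * token_type_ids * (1 - special)).unsqueeze(-1)
        mean_a = (emb * mask_a).sum(1) / mask_a.sum(1).clamp(min=1)
        mean_b = (emb * mask_b).sum(1) / mask_b.sum(1).clamp(min=1)
        feats = torch.cat([cls_emb, sep1, sep2, mean_a, mean_b], dim=-1)
        # During training, the flag heads receive no gradient for examples with
        # base label 2 or 3; that masking belongs in the (omitted) training loop.
        return (self.base_head(feats), self.subsum_head(feats),
                self.s_head(feats), self.i_head(feats))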
7 Discussion and Future Work

In this work, we set out to build a paraphrase corpus for Finnish that would be (a) in the size category allowing deep model fine-tuning and (b) manually gathered, maximizing the chance of finding more non-trivial, longer paraphrases than would be possible with the traditional automatic candidate extraction. The annotation so far took 14 person-months and resulted in a little over 50,000 manually classified paraphrases. We have demonstrated that, indeed, the corpus has longer, more lexically dissimilar paraphrases. Building such a corpus is therefore shown to be feasible, and presently it is likely the largest manually annotated paraphrase dataset for any language, naturally at the inevitably higher data collection cost. The manual selection is only feasible for texts rich in paraphrase, and the domains and genres covered by the corpus are necessarily restricted by this condition.

In our future work, we intend to extend the manually annotated corpus, ideally roughly doubling its present size. We expect the pursued data size will allow us to build sufficiently accurate models, both in terms of embedding and pair classification, to gather further candidates automatically at a level of accuracy sufficient to support downstream applications. We are also investigating further text sources, especially parallel translations outside of the present subtitle domain. The additional flags in our annotation scheme, as well as the nearly 10,000 rewrites, allow for interesting further investigations in their own right.

In the current study we concentrated on training a classifier for categorizing the paraphrases into fine-grained sub-categories, and only 2% of the paraphrases in the current release belonged to the related but not a paraphrase category (label 2), which can be seen as the negative class
in the more traditional paraphrase-or-not classification task. In order to better account for this traditional classification task, in future work, in addition to extending the number of positive examples, we will also look into methods for expanding the training section with negative examples. While extending the data with unrelated paraphrase candidates (label 1) can be considered a trivial task, as more or less any random sentence pair can be considered unrelated, the task of expanding the data with interesting related but not a paraphrase candidates (label 2) is an intriguing question. One option to consider in future work is active learning, where the confidence scores provided by the initial classifier could be used to collect difficult negatives.

8 Conclusions

In this paper we presented the first entirely manually annotated paraphrase corpus for Finnish, including 45,663 naturally occurring paraphrases gathered from alternative movie or TV episode subtitles and news headings. A further 7,909 hand-made rewrites are provided, turning context-dependent paraphrases into perfect paraphrases whenever possible. The total size of the released corpus is 53,572 paraphrase pairs, of which 98% are manually classified to be at least paraphrases in their given context, if not in all contexts.

Additionally, we evaluated the advantages and costs of manual paraphrase candidate selection from two 'parallel' but monolingual documents. We demonstrated the approach on alternative subtitles, showing the technique to be feasible for high-quality candidate selection, yielding a sufficient amount of paraphrase candidates for the given annotation effort. We have shown the candidates to be notably longer and less lexically overlapping than what automated candidate selection permits.

The corpus is available at github.com/TurkuNLP/Turku-paraphrase-corpus under the CC-BY-SA license.

Acknowledgments

We gratefully acknowledge the support of the European Language Grid, which funded the annotation work. Computational resources were provided by CSC — the Finnish IT Center for Science, and the research was supported by the Academy of Finland. We also thank Sampo Pyysalo for fruitful discussions and feedback throughout the project, and Jörg Tiedemann for his generous assistance with the OpenSubtitles data.

References

Alaa Saleh Altheneyan and Mohamed El Bachir Menai. 2019. Evaluation of state-of-the-art paraphrase identification and its application to automatic plagiarism detection. International Journal of Pattern Recognition and Artificial Intelligence, 34.

Rahul Bhagat and Eduard Hovy. 2013. Squibs: What is a paraphrase? Computational Linguistics, 39(3):463–472.

Mathias Creutz. 2018. Open Subtitles paraphrase corpus for six languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP 2005).

Qingxiu Dong, Xiaojun Wan, and Yue Cao. 2021. ParaSCI: A large scientific paraphrase dataset for paraphrase generation. arXiv preprint arXiv:2101.08382.

Yun He, Zhuoer Wang, Yin Zhang, Ruihong Huang, and James Caverlee. 2020. PARADE: A new dataset for paraphrase identification requiring computer science domain knowledge. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7572–7582.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1224–1234.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387.

Ramtin Mehdizadeh Seraj, Maryam Siahbani, and Anoop Sarkar. 2015. Improving statistical machine translation with a multilingual paraphrase database. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1379–1390.

Yves Scherrer. 2020. TaPaCo: A corpus of sentential paraphrases for 73 languages. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), pages 6868–6873.

Sarvesh Soni and Kirk Roberts. 2019. A paraphrase generation system for EHR question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 20–29.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 2214–2218.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.
A English Translation of Table 2

Label  Statement 1                                Statement 2
4      Shockingly childish!                       Astoundingly immature!
4s     I have worked for the duration of lunch.   I worked through the whole chowtime.
4i     You guys got lucky, didn't you.            Aren't you fortunate.
4>     I worked so hard for the money.            I put so much effort into work.