Extracting Lexically Divergent Paraphrases from Twitter
Wei Xu1, Alan Ritter2, Chris Callison-Burch1, William B. Dolan3 and Yangfeng Ji4
1 University of Pennsylvania, Philadelphia, PA, USA. {xwe, ccb}@cis.upenn.edu
2 The Ohio State University, Columbus, OH, USA. ritter.1492@osu.edu
3 Microsoft Research, Redmond, WA, USA. billdol@microsoft.com
4 Georgia Institute of Technology, Atlanta, GA, USA. jiyfeng@gatech.edu

Abstract

We present MULTIP (Multi-instance Learning Paraphrase Model), a new model suited to identifying paraphrases within the short messages on Twitter. We jointly model paraphrase relations between word and sentence pairs and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve performance competitive with a state-of-the-art method that combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from yet complement those found by previous methods; combining our model with previous work significantly outperforms the state-of-the-art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community.

1 Introduction

Paraphrases are alternative linguistic expressions of the same or similar meaning (Bhagat and Hovy, 2013). Twitter engages millions of users, who naturally talk about the same topics simultaneously and frequently convey similar meaning using diverse linguistic expressions. The unique characteristics of this user-generated text present new challenges and opportunities for paraphrase research (Xu et al., 2013b; Wang et al., 2013). For many applications, like automatic summarization, first story detection (Petrović et al., 2012) and search (Zanzotto et al., 2011), it is crucial to resolve redundancy in tweets (e.g. oscar nom'd doc ↔ Oscar-nominated documentary).

In this paper, we investigate the task of determining whether two tweets are paraphrases. Previous work has exploited a pair of shared named entities to locate semantically equivalent patterns from related news articles (Shinyama et al., 2002; Sekine, 2005; Zhang and Weld, 2013). But short sentences in Twitter do not often mention two named entities (Ritter et al., 2012) and require nontrivial generalization from named entities to other words. For example, consider the following two sentences about the basketball player Brook Lopez from Twitter:

◦ That boy Brook Lopez with a deep 3
◦ brook lopez hit a 3 and i missed it

Although these sentences do not have many words in common, the identical word "3" is a strong indicator that the two sentences are paraphrases.

We therefore propose a novel joint word-sentence approach, incorporating a multi-instance learning assumption (Dietterich et al., 1997) that two sentences under the same topic (we highlight topics in bold) are paraphrases if they contain at least one word pair (we call it an anchor and highlight it with underscores; the words in the anchor pair need not be identical) that is indicative of sentential paraphrase. This at-least-one-anchor assumption might be ineffective for long or randomly paired sentences, but holds up better for short sentences that are temporally and topically related on Twitter. Moreover, our model design (see Figure 1) allows exploitation of arbitrary features and linguistic resources, such as part-of-speech features and a normalization lexicon, to discriminatively determine whether word pairs are paraphrastic anchors.
Figure 1: (a) a plate representation of the MULTIP model; (b) an example instantiation of MULTIP for the pair of sentences "Manti bout to be the next Junior Seau" and "Teo is the little new Junior Seau", in which the new American football player Manti Te'o was being compared to the famous former player Junior Seau. Only 4 of the total 6 × 5 word pairs, z1 through z30, are shown here.

Our graphical model is a major departure from popular surface- or latent-similarity methods (Wan et al., 2006; Guo and Diab, 2012; Ji and Eisenstein, 2013, and others). Our approach to extracting paraphrases from Twitter is general and can be combined with various topic detection solutions. As a demonstration, we use Twitter's own trending topic service1 to collect data and conduct experiments. While having a principled and extensible design, our model alone achieves performance on par with a state-of-the-art ensemble approach that involves both latent semantic modeling and supervised classification. The proposed model also captures radically different paraphrases from previous approaches; a combined system shows significant improvement over the state-of-the-art.

This paper makes the following contributions:

1) We present a novel latent variable model for paraphrase identification that specifically accommodates the very short context and divergent wording in Twitter data. We experimentally compare several representative approaches and show that our proposed method yields state-of-the-art results and identifies paraphrases that are complementary to previous methods.

2) We develop an efficient crowdsourcing method and construct a Twitter Paraphrase Corpus of about 18,000 sentence pairs, as a first common testbed for the development and comparison of paraphrase identification and semantic similarity systems. We make this dataset available to the research community.2

1 More information about Twitter's trends: https://support.twitter.com/articles/101125-faqs-about-twitter-s-trends
2 The dataset and code are made available at the SemEval-2015 shared task page http://alt.qcri.org/semeval2015/task1/ and https://github.com/cocoxu/twitterparaphrase/

2 Joint Word-Sentence Paraphrase Model

We present a new latent variable model that jointly captures paraphrase relations between sentence pairs and word pairs. It is very different from previous approaches in that its primary design goal and motivation is targeted towards short, lexically diverse text on the social web.

2.1 At-least-one-anchor Assumption

Much previous work on paraphrase identification has been developed and evaluated on a specific benchmark dataset, the Microsoft Research Paraphrase Corpus (Dolan et al., 2004), which is derived from news articles.
News (Dolan and Brockett, 2005):
◦ Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.
◦ With the scandal hanging over Stewart's company, revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.
◦ The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.
◦ American intelligence leading up to the war on Iraq will be criticized by a powerful US Congressional committee due to report soon, officials said today.

Twitter (This Work):
◦ Can Klay Thompson wake up
◦ Cmon Klay need u to get it going
◦ Ezekiel Ansah wearing 3D glasses wout the lens
◦ Wait Ezekiel ansah is wearing 3d movie glasses with the lenses knocked out
◦ Marriage equality law passed in Rhode Island
◦ Congrats to Rhode Island becoming the 10th state to enact marriage equality

Table 1: Representative examples from paraphrase corpora. The average sentence length is 11.9 words in Twitter vs. 18.6 in the news corpus.

Twitter data is very different, as shown in Table 1. We observe that among tweets posted around the same time about the same topic (e.g. a named entity), sentential paraphrases are short and can often be "anchored" by lexical paraphrases. This intuition leads to the at-least-one-anchor assumption stated in the introduction.

The anchor could be a word the two sentences share in common. It could also be a pair of different words, for example the word pair "next ↔ new" in two tweets comparing the new player Manti Te'o to the famous former American football player Junior Seau:

◦ Manti bout to be the next Junior Seau
◦ Teo is the little new Junior Seau

Note, further, that not every word pair of similar meaning indicates a sentence-level paraphrase. For example, the word "3", shared by the two sentences below about the movie "Iron Man", refers to the 3rd sequel of the movie and is not a paraphrastic anchor:

◦ Iron Man 3 was brilliant fun
◦ Iron Man 3 tonight see what this is like

Therefore, we use a discriminative model at the word level, incorporating various features such as part-of-speech features, to determine how probable it is that a word pair is a paraphrase anchor.

2.2 Multi-instance Learning Paraphrase Model (MULTIP)

The at-least-one-anchor assumption naturally leads to a multi-instance learning problem (Dietterich et al., 1997), where the learner only observes labels on bags of instances (i.e. sentence-level paraphrases in this case) instead of labels on each individual instance (i.e. word pair).

We formally define an undirected graphical model of multi-instance learning for paraphrase identification, MULTIP. Figure 1 shows the proposed model in plate form and gives an example instantiation. The model has two layers, which allows joint reasoning between sentence-level and word-level components.

For each pair of sentences si = (si1, si2), there is an aggregate binary variable yi that represents whether they are paraphrases, and which is observed in the labeled training data. Let W(sik) be the set of words in the sentence sik, excluding the topic names. For each word pair wj = (wj1, wj2) ∈ W(si1) × W(si2), there exists a latent variable zj which denotes whether the word pair is a paraphrase anchor. In total there are m = |W(si1)| × |W(si2)| word pairs, and thus zi = z1, z2, ..., zj, ..., zm. Our at-least-one-anchor assumption is realized by a deterministic-or function; that is, if there exists at least one j such that zj = 1, then the sentence pair is a paraphrase.
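Before formalizing the model, here is a minimal sketch (ours, not the authors' released code) of this deterministic-or aggregation; `is_anchor` stands in for the word-level classifier defined next.

```python
from itertools import product

def sentence_is_paraphrase(words1, words2, is_anchor):
    """Deterministic-or aggregation: a sentence pair is labeled a
    paraphrase iff at least one of its word pairs is an anchor."""
    return any(is_anchor(w1, w2) for w1, w2 in product(words1, words2))

# Toy illustration with an overly simple anchor test (identical words):
s1 = "that boy brook lopez with a deep 3".split()
s2 = "brook lopez hit a 3 and i missed it".split()
print(sentence_is_paraphrase(s1, s2, lambda a, b: a == b))  # True
```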
Our conditional paraphrase identification model is defined as follows:

P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta) = \prod_{j=1}^{m} \phi(z_j, w_j; \theta) \times \sigma(\mathbf{z}_i, y_i) = \prod_{j=1}^{m} \exp(\theta \cdot \mathbf{f}(z_j, w_j)) \times \sigma(\mathbf{z}_i, y_i)   (1)

where f(zj, wj) is a vector of features extracted for the word pair wj, θ is the parameter vector, and σ is the factor that corresponds to the deterministic-or constraint:

\sigma(\mathbf{z}_i, y_i) = \begin{cases} 1 & \text{if } y_i = \text{true} \wedge \exists j : z_j = 1 \\ 1 & \text{if } y_i = \text{false} \wedge \forall j : z_j = 0 \\ 0 & \text{otherwise} \end{cases}   (2)

2.3 Learning

To learn the parameters of the word-level paraphrase anchor classifier, θ, we maximize the likelihood over the sentence-level annotations in our paraphrase corpus:

\theta^{*} = \arg\max_{\theta} P(\mathbf{y} \mid \mathbf{w}; \theta) = \arg\max_{\theta} \prod_{i} \sum_{\mathbf{z}_i} P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta)   (3)

An iterative gradient-ascent approach is used to estimate θ using perceptron-style additive updates (Collins, 2002; Liang et al., 2006; Zettlemoyer and Collins, 2007; Hoffmann et al., 2011). We define an update based on the gradient of the conditional log likelihood using a Viterbi approximation, as follows:

\frac{\partial}{\partial \theta} \log P(\mathbf{y} \mid \mathbf{w}; \theta) = E_{P(\mathbf{z} \mid \mathbf{w}, \mathbf{y}; \theta)}\Big[\sum_i \mathbf{f}(\mathbf{z}_i, \mathbf{w}_i)\Big] - E_{P(\mathbf{z}, \mathbf{y} \mid \mathbf{w}; \theta)}\Big[\sum_i \mathbf{f}(\mathbf{z}_i, \mathbf{w}_i)\Big] \approx \sum_i \mathbf{f}(\mathbf{z}^{*}_i, \mathbf{w}_i) - \sum_i \mathbf{f}(\mathbf{z}'_i, \mathbf{w}_i)   (4)

where we define the feature sum for each sentence pair as f(zi, wi) = Σj f(zj, wj) over all word pairs. The two expectations above are approximated by solving two simple inference problems as maximizations:

\mathbf{z}^{*} = \arg\max_{\mathbf{z}} P(\mathbf{z} \mid \mathbf{w}, \mathbf{y}; \theta), \qquad \mathbf{y}', \mathbf{z}' = \arg\max_{\mathbf{y}, \mathbf{z}} P(\mathbf{z}, \mathbf{y} \mid \mathbf{w}; \theta)   (5)

Input: a training set {(si, yi) | i = 1...n}, where i is an index corresponding to a particular sentence pair si, and yi is the training label.
 1: initialize parameter vector θ ← 0
 2: for i ← 1 to n do
 3:   extract all possible word pairs wi = w1, w2, ..., wm and their features from the sentence pair si
 4: end for
 5: for l ← 1 to maximum iterations do
 6:   for i ← 1 to n do
 7:     (y'i, z'i) ← arg max_{yi, zi} P(zi, yi | wi; θ)
 8:     if y'i ≠ yi then
 9:       z*i ← arg max_{zi} P(zi | wi, yi; θ)
10:       θ ← θ + f(z*i, wi) − f(z'i, wi)
11:     end if
12:   end for
13: end for
14: return model parameters θ

Figure 2: MULTIP Learning Algorithm

Computing both z' and z* is rather straightforward under the structure of our model and can be solved in time linear in the number of word pairs. The dependencies between z and y are defined as deterministic-or factors σ(zi, yi), which when satisfied do not affect the overall probability of the solution. Each sentence pair is independent conditioned on the parameters. For z', it is sufficient to independently compute the most likely assignment z'j for each word pair, ignoring the deterministic dependencies; y'i is then set by aggregating all z'i through the deterministic-or operation. Similarly, we can find the exact solution for z*, the most likely assignment that respects the sentence-level training label y. For a positive training instance, we simply find its highest scored word pair wτ under the word-level classifier, then set z*τ = 1 and z*j = arg max_{x ∈ {0,1}} φ(x, wj; θ) for all j ≠ τ; for a negative example, we set z*i = 0. The time complexity of both inferences for one sentence pair is O(|W(s)|²), where |W(s)|² is the number of word pairs.

In practice, we use online learning instead of optimizing the full objective. The detailed learning algorithm is presented in Figure 2. Following Hoffmann et al. (2011), we use 50 iterations in the experiments.
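The algorithm in Figure 2 translates almost line for line into code. The sketch below is our rendering under one simplifying assumption: features fire only for z = 1, so the anchor score of a word pair is θ·f(w) and z' assigns z = 1 exactly when that score is positive; `extract_features` stands in for the feature functions of Section 2.4.

```python
from collections import defaultdict

def dot(theta, feats):
    # Sparse dot product between the weight vector and a feature dict.
    return sum(theta[k] * v for k, v in feats.items())

def train_multip(pairs, extract_features, iters=50):
    """Perceptron-style training of MULTIP, following Figure 2.

    pairs: list of ((words1, words2), y) with boolean sentence labels y.
    Under our assumption that features fire only on word pairs assigned
    z = 1, the update is the difference of feature sums between z* and z'.
    """
    theta = defaultdict(float)
    # Lines 2-4: extract features for every word pair of every instance.
    data = [([extract_features(w1, w2) for w1 in ws1 for w2 in ws2], y)
            for (ws1, ws2), y in pairs]
    for _ in range(iters):                      # line 5
        for feats, y in data:                   # line 6
            if not feats:
                continue
            scores = [dot(theta, f) for f in feats]
            # Line 7: z' ignores the deterministic-or; y' aggregates it.
            z_prime = [s > 0 for s in scores]
            y_prime = any(z_prime)
            if y_prime != y:                    # line 8
                # Line 9: z* is the best assignment consistent with y.
                if y:   # force the single best-scoring pair to be an anchor
                    best = max(range(len(scores)), key=scores.__getitem__)
                    z_star = [s > 0 or j == best
                              for j, s in enumerate(scores)]
                else:   # negative instance: no anchors at all
                    z_star = [False] * len(scores)
                # Line 10: theta += f(z*, w) - f(z', w).
                for f, zs, zp in zip(feats, z_star, z_prime):
                    for k, v in f.items():
                        theta[k] += (zs - zp) * v
    return theta                                # line 14
```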
2.4 Feature Design

At the word level, our discriminative model allows use of arbitrary features, similar to those in monolingual word alignment models (MacCartney et al., 2008; Thadani and McKeown, 2011; Yao et al., 2013a,b). But unlike discriminative monolingual word alignment, we only use sentence-level training labels instead of word-level alignment annotation. For every word pair, we extract the following features:

String Features that indicate whether the two words, their stemmed forms and their normalized forms are the same, similar or dissimilar. We used the Morpha stemmer (Minnen et al., 2001),3 Jaro-Winkler string similarity (Winkler, 1999) and the Twitter normalization lexicon by Han et al. (2012).

3 https://github.com/knowitall/morpha

POS Features that are based on the part-of-speech tags of the two words in the pair, specifying whether the two words have the same or different POS tags and what the specific tags are. We use the Twitter part-of-speech tagger developed by Derczynski et al. (2013). We add new fine-grained tags for variations of eight words: "a", "be", "do", "have", "get", "go", "follow" and "please". For example, we use a tag HA for the words "have", "has" and "had".

Topical Features that relate to the strength of a word's association to the topic. This feature identifies the popular words in each topic, e.g. "3" in tweets about a basketball game, "RIP" in tweets about a celebrity's death. We use the G² log-likelihood-ratio statistic, which has been frequently used in NLP as a measure of word association (Dunning, 1993; Moore, 2004). The significance scores are computed for each trend on an average of about 1500 sentences and converted to binary features for every word pair, indicating whether the two words are both significant or not.

Our topical features are novel and were not used in previous work. Following Riedel et al. (2010) and Hoffmann et al. (2011), we also incorporate conjunction features into our system for better accuracy, namely Word+POS, Word+Topical and Word+POS+Topical features.
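A rough sketch of this word-pair feature extraction (ours, with simplified stand-ins: difflib's ratio in place of Jaro-Winkler similarity, and a precomputed set of topically significant words in place of the per-trend G² test):

```python
from difflib import SequenceMatcher

def word_pair_features(w1, w2, pos1, pos2, significant):
    """Features for one word pair (a simplified stand-in for Section 2.4).

    significant: set of words judged topically associated with the trend
    (the paper uses a G^2 log-likelihood-ratio test over ~1500 sentences).
    """
    feats = {}
    sim = SequenceMatcher(None, w1.lower(), w2.lower()).ratio()
    string = ("same" if w1.lower() == w2.lower()
              else "similar" if sim >= 0.8 else "dissimilar")
    feats["string=" + string] = 1.0
    pos = ("pos_same=" + pos1 if pos1 == pos2
           else "pos=%s_%s" % (pos1, pos2))
    feats[pos] = 1.0
    topical = ("both_sig" if (w1 in significant and w2 in significant)
               else "not_both_sig")
    feats["topic=" + topical] = 1.0
    # Conjunctions, following Riedel et al. (2010) / Hoffmann et al. (2011):
    feats["string=%s&%s" % (string, pos)] = 1.0
    feats["string=%s&topic=%s" % (string, topical)] = 1.0
    feats["string=%s&%s&topic=%s" % (string, pos, topical)] = 1.0
    return feats
```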
3 Experiments

3.1 Data

It is nontrivial to gather a gold-standard dataset of naturally occurring paraphrases and non-paraphrases efficiently from Twitter, since this requires pairwise comparison of tweets and faces a very large search space. To make this annotation task tractable, we design a novel and efficient crowdsourcing method using Amazon Mechanical Turk. Our entire data collection process is detailed in Section 4, with several experiments that demonstrate annotation quality and efficiency.

In total, we constructed a Twitter Paraphrase Corpus of 18,762 sentence pairs and 19,946 unique sentences. The training and development set consists of 17,790 sentence pairs posted between April 24th and May 3rd, 2014 from 500+ trending topics (excluding hashtags). Our paraphrase model and data collection approach is general and can be combined with various Twitter topic detection solutions (Diao et al., 2012; Ritter et al., 2012). As a demonstration, we use Twitter's own trends service since it is easily available. Twitter trending topics are determined by an unpublished algorithm, which finds words, phrases and hashtags that have had a sharp increase in popularity, as opposed to overall volume. We use case-insensitive exact matching to locate topic names in the sentences.

Each sentence pair was annotated by 5 different crowdsourcing workers. For the test set, we obtained both crowdsourced and expert labels on 972 sentence pairs from 20 randomly sampled Twitter trending topics between May 13th and June 10th. Our dataset is more realistic and balanced, containing 79% non-paraphrases vs. 34% in the benchmark Microsoft Research Paraphrase Corpus of news data. As noted by Das and Smith (2009), the lack of natural non-paraphrases in the MSR corpus creates a bias towards certain models.

3.2 Baselines

We use four baselines to compare with our proposed approach on the sentential paraphrase identification task. For the first baseline, we choose a supervised logistic regression (LR) baseline used by Das and Smith (2009). It uses simple n-gram (also in stemmed form) overlap features but shows very competitive performance on the MSR corpus.
The second baseline is a state-of-the-art unsupervised method, Weighted Textual Matrix Factorization (WTMF),4 which is specially developed for short sentences by modeling the semantic space of both words that are present in and absent from the sentences (Guo and Diab, 2012). The original model was learned from WordNet (Fellbaum, 2010), OntoNotes (Hovy et al., 2006), Wiktionary, and the Brown corpus (Francis and Kucera, 1979). We enhance the model with 1.6 million sentences from Twitter as suggested by Guo et al. (2013).

4 The source code and data for WTMF are available at: http://www.cs.columbia.edu/~weiwei/code.html

Ji and Eisenstein (2013) presented a state-of-the-art ensemble system, which we call LEXDISCRIM.5 It directly combines both discriminatively-tuned latent features and surface lexical features in an SVM classifier. Specifically, the latent representation of a pair of sentences v1 and v2 is converted into a feature vector [v1 + v2, |v1 − v2|] by concatenating the element-wise sum v1 + v2 and absolute difference |v1 − v2|.

5 The parsing feature was removed because it was not helpful on our Twitter dataset.

We also introduce a new baseline, LEXLATENT, which is a simplified version of LEXDISCRIM that is easy to reproduce. It uses the same method to combine latent features and surface features, but instead combines the open-sourced WTMF latent space model and the logistic regression model from above. It achieves performance similar to LEXDISCRIM on our dataset (Table 2).
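The [v1 + v2, |v1 − v2|] construction used by LEXDISCRIM and LEXLATENT is compact enough to state in code; this sketch (ours, for illustration) assumes the latent sentence vectors come from some model such as WTMF and uses NumPy:

```python
import numpy as np

def pair_features(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """Concatenate the element-wise sum and absolute difference of two
    latent sentence vectors, i.e. [v1 + v2, |v1 - v2|]."""
    return np.concatenate([v1 + v2, np.abs(v1 - v2)])

# e.g. with 100-dimensional latent vectors:
v1, v2 = np.random.rand(100), np.random.rand(100)
assert pair_features(v1, v2).shape == (200,)
```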
3.3 System Performance

For the evaluation of the different systems, we compute precision-recall curves and report the highest F1 measure of any point on the curve, on the test dataset of 972 sentence pairs against the expert labels. Table 2 shows the performance of the different systems. Our proposed MULTIP, a principled latent variable model alone, achieves results competitive with the state-of-the-art system that combines discriminative training and latent semantics.

Method                                 F1     Precision  Recall
Random                                 0.294  0.208      0.500
WTMF (Guo and Diab, 2012)*             0.583  0.525      0.655
LR (Das and Smith, 2009)**             0.630  0.629      0.632
LEXLATENT                              0.641  0.663      0.621
LEXDISCRIM (Ji and Eisenstein, 2013)   0.645  0.664      0.628
MULTIP                                 0.724  0.722      0.726
Human Upperbound                       0.823  0.752      0.908

Table 2: Performance of different paraphrase identification approaches on Twitter data. *An enhanced version that uses an additional 1.6 million sentences from Twitter. **Reimplementation of a strong baseline used by Das and Smith (2009).

In Table 2, we also show the agreement levels of labels derived from 5 non-expert annotations on Mechanical Turk, which can be considered an upper bound for the automatic paraphrase recognition task performed on this dataset. The annotation quality of our corpus is surprisingly good given the fact that the definition of paraphrase is rather inexact (Bhagat and Hovy, 2013); the inter-rater agreement between expert annotators on news data is only 0.83 as reported by Dolan et al. (2004).

To assess the impact of different features on the model's performance, we conduct feature ablation experiments, removing one group of features at a time. The results are shown in Table 3. Both string and POS features are essential for system performance, while topical features are helpful but not as crucial.

                     F1     Prec   Recall
MULTIP               0.724  0.722  0.726
- String features    0.509  0.448  0.589
- POS features       0.496  0.350  0.851
- Topical features   0.715  0.694  0.737

Table 3: Feature ablation by removing each individual feature group from the full set.

Figure 3: Precision and recall curves. Our MULTIP model alone achieves competitive performance with the LEXLATENT system that combines a latent space model and a feature-based supervised classifier. The two approaches have complementary strengths, and achieve significant improvement when combined (MULTIP-PE).

Figure 3 presents precision-recall curves and shows the sensitivity and specificity of each model in comparison. In the first half of the curve (recall < 0.5), the MULTIP model makes bolder and less accurate decisions than LEXLATENT. However, the curve for the MULTIP model is flatter and shows consistently better precision in the second half (recall > 0.5), as well as a higher maximum F1 score. This result reflects the design concept of MULTIP, which is intended to pick up sentential paraphrases with more divergent wordings aggressively. LEXLATENT, as a combined system, considers sentence features in both surface and latent space and is more conservative. Table 4 further illustrates this difference with some example system outputs.

YES  ◦ The new Ciroc flavor has arrived
     ◦ Ciroc got a new flavor comin out
     MULTIP rank=12, LEXLATENT rank=266
YES  ◦ Roberto Mancini gets the boot from Man City
     ◦ Roberto Mancini has been sacked by Manchester City with the Blues saying
     MULTIP rank=64, LEXLATENT rank=452
YES  ◦ I want to watch the purge tonight
     ◦ I want to go see The Purge who wants to come with
     MULTIP rank=136, LEXLATENT rank=11
NO   ◦ Somebody took the Marlins to 20 innings
     ◦ Anyone who stayed 20 innings for the marlins
     MULTIP rank=8, LEXLATENT rank=54
NO   ◦ WORLD OF JENKS IS ON AT 11
     ◦ World of Jenks is my favorite show on tv
     MULTIP rank=167, LEXLATENT rank=9

Table 4: Example system outputs; rank is the position in the list of all candidate paraphrase pairs in the test set ordered by model score. MULTIP discovers lexically divergent paraphrases while LEXLATENT prefers more overall sentence similarity. Underline marks the word pair(s) with the highest estimated probability as paraphrastic anchor(s) for each sentence pair.
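The evaluation metric used in Table 2, the highest F1 at any point on the precision-recall curve, can be computed as follows; this is a minimal sketch (ours), given model scores and gold labels:

```python
def max_f1(scores, gold):
    """Highest F1 over all thresholds on the precision-recall curve.

    scores: model confidences; gold: binary labels (True = paraphrase).
    Sweeps thresholds by sorting, counting true positives incrementally.
    Assumes gold contains at least one positive label.
    """
    total_pos = sum(gold)
    ranked = sorted(zip(scores, gold), reverse=True)
    tp, best = 0, 0.0
    for i, (_, label) in enumerate(ranked, start=1):
        tp += label
        precision, recall = tp / i, tp / total_pos
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```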
3.4 Product of Experts (MULTIP-PE)

Our MULTIP model and previous similarity-based approaches have complementary strengths, so we experiment with combining MULTIP (Pm) and LEXLATENT (Pl) through a product of experts (Hinton, 2002):

P(y \mid s_1, s_2) = \frac{P_m(y \mid s_1, s_2) \times P_l(y \mid s_1, s_2)}{\sum_{y} P_m(y \mid s_1, s_2) \times P_l(y \mid s_1, s_2)}   (6)

The resulting system, MULTIP-PE, provides consistently better precision and recall than the LEXLATENT model, as shown on the right in Figure 3. The MULTIP-PE system outperforms LEXLATENT significantly according to a paired t-test with p less than 0.05. Our proposed MULTIP takes advantage of Twitter's specific properties and provides complementary information to previous approaches. Previously, Das and Smith (2009) also used a product of experts, to combine a lexical and a syntax-based model.
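For the binary paraphrase decision, Equation 6 reduces to renormalizing the product of the two models' output distributions; a minimal sketch (ours), assuming each expert outputs a calibrated probability of the positive class:

```python
def product_of_experts(p_multip, p_lexlatent):
    """Combine two models' paraphrase probabilities (Equation 6).

    Each argument is P(y = paraphrase | s1, s2) under one model;
    returns the renormalized product over y in {paraphrase, not}.
    """
    joint_pos = p_multip * p_lexlatent
    joint_neg = (1 - p_multip) * (1 - p_lexlatent)
    return joint_pos / (joint_pos + joint_neg)

# Two mildly confident experts reinforce each other:
print(product_of_experts(0.7, 0.7))  # ~0.845
```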
4 Constructing Twitter Paraphrase Corpus

We now turn to describing our data collection and annotation methodology. Our goal is to construct a high quality dataset that contains representative examples of paraphrases and non-paraphrases in Twitter. Since Twitter users are free to talk about anything regarding any topic, a random pair of sentences about the same topic has a low chance (less than 8%) of expressing the same meaning. This causes two problems: a) it is expensive to obtain paraphrases via manual annotation; b) non-expert annotators tend to loosen the criteria and are more likely to make false positive errors. To address these challenges, we design a simple annotation task and introduce two selection mechanisms to select sentences which are more likely to be paraphrases, while preserving diversity and representativeness.

4.1 Raw Data from Twitter

We crawl Twitter's trending topics and their associated tweets using the public APIs.6 According to Twitter, trends are determined by an algorithm which identifies topics that are immediately popular, rather than those that have been popular for longer periods of time or which trend on a daily basis. We tokenize and split each tweet into sentences.7

6 More information about Twitter's APIs: https://dev.twitter.com/docs/api/1.1/overview
7 We use the toolkit developed by O'Connor et al. (2010): https://github.com/brendano/tweetmotif

4.2 Task Design on Mechanical Turk

We show the annotator an original sentence, then ask them to pick sentences with the same meaning from 10 candidate sentences. The original and candidate sentences are randomly sampled from the same topic. For each such 1 vs. 10 question, we obtain binary judgements from 5 different annotators, paying each annotator $0.02 per question. On average, each question takes one annotator about 30 to 45 seconds to answer.

4.3 Annotation Quality

We remove problematic annotators by checking their Cohen's Kappa agreement (Artstein and Poesio, 2008) with other annotators.
We also compute inter-annotator agreement with an expert annotator on 971 sentence pairs. In the expert annotation, we adopt a 5-point Likert scale to measure the degree of semantic similarity between sentences, which is defined by Agirre et al. (2012) as follows:

5: Completely equivalent, as they mean the same thing;
4: Mostly equivalent, but some unimportant details differ;
3: Roughly equivalent, but some important information differs/is missing;
2: Not equivalent, but share some details;
1: Not equivalent, but are on the same topic;
0: On different topics.

Although the two scales of expert and crowdsourced annotation are defined differently, their Pearson correlation coefficient reaches 0.735 (two-tailed significance 0.001). Figure 4 shows a heat-map representing the detailed overlap between the two annotations. It suggests that the graded similarity annotation task can be reduced to a binary choice in a crowdsourcing setup.

Figure 4: A heat-map showing the overlap between expert and crowdsourced annotation. The intensity along the diagonal indicates good reliability of crowdsourcing workers for this particular task; the shift above the diagonal reflects the difference between the two annotation schemas. For crowdsourcing (turk), the numbers indicate how many annotators out of 5 picked the sentence pair as paraphrases; 0,1 are considered non-paraphrases; 3,4,5 are paraphrases. For the expert annotation, 0,1,2 are non-paraphrases; 4,5 are paraphrases. Medium-scored cases are discarded in training and testing in our experiments.

Figure 5: The proportion of paraphrases (percentage of positive judgements from annotators) varies greatly across different trending topics. Automatic filtering in Section 4.4 roughly doubles the paraphrase yield.

4.4 Automatic Summarization Inspired Sentence Filtering

We filter the sentences within each topic to select more probable paraphrases for annotation. Our method is inspired by a typical problem in extractive summarization: the salient sentences are likely redundant (paraphrases) and need to be removed in the output summaries. We employ the scoring method used in SumBasic (Nenkova and Vanderwende, 2005; Vanderwende et al., 2007), a simple but powerful summarization system, to find salient sentences. For each topic, we compute the probability of each word P(wi) by simply dividing its frequency by the total number of all words in all sentences. Each sentence s is scored as the average of the probabilities of the words in it, i.e.

\text{Salience}(s) = \sum_{w_i \in s} \frac{P(w_i)}{|\{w_i \mid w_i \in s\}|}   (7)

We then rank the sentences and pick the original sentence randomly from the top 10% of salient sentences, and the candidate sentences from the top 50%, to present to the annotators.

In a trial experiment over 20 topics, the filtering technique doubled the yield of paraphrases, from 152 to 329 out of 2000 sentence pairs, compared to naive random sampling (Figure 5 and Figure 6). We also use PINC (Chen and Dolan, 2011) to measure the quality of the paraphrases collected (Figure 7). PINC was designed to measure n-gram dissimilarity between two sentences, and in essence it is the inverse of BLEU. In general, the cases with high PINC scores include more complex and interesting rephrasings.
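A sketch of the SumBasic-style filter, under our reading of Equation 7 in which P(wi) is a word's relative frequency within the topic and the denominator counts the distinct word types in s (our helper, not the authors' code):

```python
from collections import Counter

def salience_ranking(sentences):
    """Rank one topic's sentences by average word probability (Eq. 7).

    sentences: list of tokenized sentences (lists of word strings).
    P(w) is the word's relative frequency over all the topic's words;
    a sentence's score averages P(w) over its distinct words.
    """
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())

    def salience(s):
        types = set(s)
        if not types:
            return 0.0
        return sum(counts[w] / total for w in types) / len(types)

    return sorted(sentences, key=salience, reverse=True)

# Original sentences are then sampled from the top 10% of the ranking,
# candidate sentences from the top 50%.
```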
Figure 6: Numbers of paraphrases collected by different methods. The annotation efficiency (3,4,5 judgements are regarded as paraphrases) is significantly improved by the sentence filtering and the Multi-Armed Bandits (MAB) based topic selection.

Figure 7: PINC scores of paraphrases collected. The higher the PINC, the more significant the rewording. Our proposed annotation strategy quadruples the paraphrase yield while not greatly reducing diversity as measured by PINC.

4.5 Topic Selection using a Multi-Armed Bandits (MAB) Algorithm

Another approach to increasing the paraphrase yield is to choose more appropriate topics. This is particularly important because the number of paraphrases varies greatly from topic to topic, and with it the chance of encountering paraphrases during annotation (Figure 5). We treat this topic selection problem as a variation of the Multi-Armed Bandit (MAB) problem (Robbins, 1985) and adapt a greedy algorithm, the bounded ε-first algorithm of Tran-Thanh et al. (2012), to accelerate our corpus construction.

Our strategy consists of two phases. In the first exploration phase, we dedicate a fraction ε of the total budget B to explore randomly chosen arms of each slot machine (trending topic on Twitter), each m times. In the second exploitation phase, we sort all topics according to their estimated proportion of paraphrases, and sequentially annotate the ⌈(1−ε)B/(l−m)⌉ arms that have the highest estimated reward, until reaching the maximum of l = 10 annotations for any topic, to ensure data diversity.

We tune the parameter m to be 1 and ε to be between 0.35 and 0.55 through simulation experiments, by artificially duplicating a small amount of real annotation data. We then applied this MAB algorithm in the real world: we explored 500 random topics and then exploited 100 of them. The yield of paraphrases rises to 688 out of 2000 sentence pairs by using MAB and sentence filtering, a 4-fold increase compared to using random selection only (Figure 6).
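A simplified sketch of the two-phase strategy described above (our reconstruction, not Tran-Thanh et al.'s reference code; it folds the ⌈(1−ε)B/(l−m)⌉ arm count into a simple budget check, and `annotate` is a hypothetical stand-in for one unit of crowdsourced annotation):

```python
import random

def bounded_epsilon_first(topics, annotate, budget, eps=0.45, m=1, l=10):
    """Two-phase greedy topic selection (after Tran-Thanh et al., 2012).

    topics: the bandit arms (Twitter trending topics).
    annotate(topic): performs one unit of annotation and returns the
    number of paraphrases found (a hypothetical crowdsourcing call).
    """
    counts = {t: 0 for t in topics}   # annotations spent per topic
    found = {t: 0 for t in topics}    # paraphrases found per topic
    # Exploration: spend eps * budget on randomly chosen arms, m pulls each.
    n_explore = min(len(topics), int(eps * budget) // m)
    explored = random.sample(list(topics), n_explore)
    for t in explored:
        for _ in range(m):
            found[t] += annotate(t)
            counts[t] += 1
    # Exploitation: annotate the highest-yield arms, at most l pulls each,
    # until the remaining budget is spent.
    remaining = budget - sum(counts.values())
    by_yield = sorted(explored, key=lambda t: found[t] / counts[t],
                      reverse=True)
    for t in by_yield:
        while counts[t] < l and remaining > 0:
            found[t] += annotate(t)
            counts[t] += 1
            remaining -= 1
    return counts, found
```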
5 Related Work

Automatic Paraphrase Identification has been widely studied (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010). The ACL Wiki gives an excellent summary of various techniques.8 Many recent high-performance approaches use system combination (Das and Smith, 2009; Madnani et al., 2012; Ji and Eisenstein, 2013). For example, Madnani et al. (2012) combine multiple sophisticated machine translation metrics using a meta-classifier. An earlier attempt on Twitter data is that of Xu et al. (2013b), who limited the search space to only the tweets that explicitly mention the same date and the same named entity; however, a considerable number of mislabeled pairs remain in their data.9 Zanzotto et al. (2011) also experimented with SVM tree kernel methods on Twitter data.

8 http://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)
9 The data released by Xu et al. (2013b) is at: https://github.com/cocoxu/twitterparaphrase/

Departing from the previous work, we propose a latent variable model to jointly infer the correspondence between words and sentences. It is related to discriminative monolingual word alignment (MacCartney et al., 2008; Thadani and McKeown, 2011; Yao et al., 2013a,b), but differs in that the paraphrase task requires additional sentence alignment modeling with no word alignment data. Our approach is also inspired by Fung and Cheung's (2004a; 2004b) work on bootstrapping bilingual parallel sentence and word translations from comparable corpora.

Multiple Instance Learning (Dietterich et al., 1997) has been used by different research groups in the field of information extraction (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Ritter et al., 2013; Xu et al., 2013a). The idea is to leverage structured data as weak supervision for tasks such as relation extraction. This is done, for example, by making the assumption that at least one sentence in the corpus which mentions a pair of entities (e1, e2) participating in a relation (r) expresses the proposition r(e1, e2).

Crowdsourcing Paraphrase Acquisition: Buzek et al. (2010) and Denkowski et al. (2010) focused specifically on collecting paraphrases of text to be translated, to improve machine translation quality. Chen and Dolan (2011) gathered a large-scale paraphrase corpus by asking Mechanical Turk workers to caption the action in short video segments. Similarly, Burrows et al. (2012) asked crowdsourcing workers to rewrite selected excerpts from books. Ling et al. (2014) crowdsourced bilingual parallel text using Twitter as the source of data.

In contrast, we design a simple crowdsourcing task requiring only binary judgements on sentences collected from Twitter. There are several advantages as compared to existing work: a) the corpus covers a very diverse range of topics and linguistic expressions, especially colloquial language, which is different from and thus complements previous paraphrase corpora; b) the paraphrase corpus collected contains a representative proportion of both negative and positive instances, while the lack of good negative examples was an issue in previous research (Das and Smith, 2009); c) this method is scalable and sustainable due to the simplicity of the task and the real-time, virtually unlimited text supply from Twitter.

6 Conclusions

This paper introduced MULTIP, a joint word-sentence model to learn paraphrases from temporally and topically grouped messages in Twitter. While simple and principled, our model achieves performance competitive with a state-of-the-art ensemble system combining latent semantic representations and surface similarity. By combining our method with previous work as a product of experts, we outperform the state-of-the-art. Our latent-variable approach is capable of learning word-level paraphrase anchors given only sentence annotations. Because our graphical model is modular and extensible (for example, it should be possible to replace the deterministic-or with other aggregators), we are optimistic this work might provide a path towards weakly supervised word alignment models using only sentence-level annotations.

In addition, we presented a novel and efficient annotation methodology which was used to crowdsource a unique corpus of paraphrases harvested from Twitter. We make this resource available to the research community.

Acknowledgments

The authors would like to thank editor Sharon Goldwater and three anonymous reviewers for their thoughtful comments, which substantially improved this paper. We also thank Ralph Grishman, Sameer Singh, Yoav Artzi, Mark Yatskar, Chris Quirk, Ani Nenkova and Mitch Marcus for their feedback.

This material is based in part on research sponsored by the NSF under grant IIS-1430651, DARPA under agreement number FA8750-13-2-0017 (the DEFT program) and through a Google Faculty Research Award to Chris Callison-Burch. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government. Yangfeng Ji is supported by a Google Faculty Research Award awarded to Jacob Eisenstein.
References

Agirre, E., Diab, M., Cer, D., and Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM).

Androutsopoulos, I. and Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38.

Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4).

Bhagat, R. and Hovy, E. (2013). What is a paraphrase? Computational Linguistics, 39(3).

Burrows, S., Potthast, M., and Stein, B. (2012). Paraphrase acquisition via crowdsourcing and machine learning. ACM Transactions on Intelligent Systems and Technology (TIST).

Buzek, O., Resnik, P., and Bederson, B. B. (2010). Error driven paraphrase annotation using Mechanical Turk. In Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Chen, D. L. and Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Das, D. and Smith, N. A. (2009). Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP).

Denkowski, M., Al-Haj, H., and Lavie, A. (2010). Turker-assisted paraphrasing for English-Arabic machine translation. In Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Derczynski, L., Ritter, A., Clark, S., and Bontcheva, K. (2013). Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of Recent Advances in Natural Language Processing (RANLP).

Diao, Q., Jiang, J., Zhu, F., and Lim, E.-P. (2012). Finding bursty topics from microblogs. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Dietterich, T. G., Lathrop, R. H., and Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1).

Dolan, B., Quirk, C., and Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics (COLING).

Dolan, W. and Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd International Workshop on Paraphrasing.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1).

Fellbaum, C. (2010). WordNet. In Theory and Applications of Ontology: Computer Applications. Springer.

Francis, W. N. and Kucera, H. (1979). Brown corpus manual. Brown University.

Fung, P. and Cheung, P. (2004a). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Fung, P. and Cheung, P. (2004b). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of the International Conference on Computational Linguistics (COLING).

Guo, W. and Diab, M. (2012). Modeling sentences in the latent space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Guo, W., Li, H., Ji, H., and Diab, M. (2013). Linking tweets to news: A framework to enrich short text data in social media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Han, B., Cook, P., and Baldwin, T. (2012). Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8).

Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L. S., and Weld, D. S. (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL).

Ji, Y. and Eisenstein, J. (2013). Discriminative improvements to distributional sentence similarity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Liang, P., Bouchard-Côté, A., Klein, D., and Taskar, B. (2006). An end-to-end discriminative approach to machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL).

Ling, W., Marujo, L., Dyer, C., Black, A. W., and Trancoso, I. (2014). Crowdsourcing high-quality parallel data extraction from Twitter. In Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT).

MacCartney, B., Galley, M., and Manning, C. (2008). A phrase-based alignment model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Madnani, N. and Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3).

Madnani, N., Tetreault, J., and Chodorow, M. (2012). Re-examining machine translation metrics for paraphrase identification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Minnen, G., Carroll, J., and Pearce, D. (2001). Applied morphological processing of English. Natural Language Engineering, 7(3).

Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Nenkova, A. and Vanderwende, L. (2005). The impact of frequency on summarization. Technical Report MSR-TR-2005-101, Microsoft Research.

O'Connor, B., Krieger, M., and Ahn, D. (2010). TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM).

Petrović, S., Osborne, M., and Lavrenko, V. (2012). Using paraphrases for improving first story detection in news and Twitter. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Riedel, S., Yao, L., and McCallum, A. (2010). Modeling relations and their mentions without labeled text. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).

Ritter, A., Mausam, Etzioni, O., and Clark, S. (2012). Open domain event extraction from Twitter. In Proceedings of the 18th International Conference on Knowledge Discovery and Data Mining (SIGKDD).

Ritter, A., Zettlemoyer, L., Mausam, and Etzioni, O. (2013). Modeling missing data in distant supervision for information extraction. Transactions of the Association for Computational Linguistics (TACL).

Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers. Springer.

Sekine, S. (2005). Automatic paraphrase discovery based on context and keywords between NE pairs. In Proceedings of the 3rd International Workshop on Paraphrasing.

Shinyama, Y., Sekine, S., and Sudo, K. (2002). Automatic paraphrase acquisition from news articles. In Proceedings of the 2nd International Conference on Human Language Technology Research (HLT).

Surdeanu, M., Tibshirani, J., Nallapati, R., and Manning, C. D. (2012). Multi-instance multi-label learning for relation extraction. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Thadani, K. and McKeown, K. (2011). Optimal and syntactically-informed decoding for monolingual phrase-based alignment. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT).

Tran-Thanh, L., Stein, S., Rogers, A., and Jennings, N. R. (2012). Efficient crowdsourcing of unknown experts using multi-armed bandits. In Proceedings of the European Conference on Artificial Intelligence (ECAI).

Vanderwende, L., Suzuki, H., Brockett, C., and Nenkova, A. (2007). Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing & Management, 43.

Wan, S., Dras, M., Dale, R., and Paris, C. (2006). Using dependency-based features to take the "para-farce" out of paraphrase. In Proceedings of the Australasian Language Technology Workshop.

Wang, L., Dyer, C., Black, A. W., and Trancoso, I. (2013). Paraphrasing 4 microblog normalization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Winkler, W. E. (1999). The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau.

Xu, W., Hoffmann, R., Zhao, L., and Grishman, R. (2013a). Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Xu, W., Ritter, A., and Grishman, R. (2013b). Gathering and generating paraphrases from Twitter with application to normalization. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora (BUCC).

Yao, X., Van Durme, B., Callison-Burch, C., and Clark, P. (2013a). A lightweight and high performance monolingual word aligner. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Yao, X., Van Durme, B., and Clark, P. (2013b). Semi-Markov phrase-based monolingual alignment. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Zanzotto, F. M., Pennacchiotti, M., and Tsioutsiouliklis, K. (2011). Linguistic redundancy in Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Zettlemoyer, L. S. and Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Zhang, C. and Weld, D. S. (2013). Harvesting parallel news streams to generate paraphrases of event relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).