Robust Argument Unit Recognition and Classification
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Robust Argument Unit Recognition and Classification Dietrich Trautmann† , Johannes Daxenberger‡ , Christian Stab‡ , Hinrich Schütze† , Iryna Gurevych‡ † Center for Information and Language Processing (CIS), LMU Munich, Germany ‡ Ubiquitous Knowledge Processing Lab (UKP-TUDA), TU Darmstadt, Germany dietrich@trautmann.me; inquiries@cislmu.org http://www.ukp.tu-darmstadt.de Abstract Topic: Death Penalty It does not deter crime and Argument mining is generally performed on CON arXiv:1904.09688v1 [cs.CL] 22 Apr 2019 the sentence-level – it is assumed that an en- it is extremely expensive to administer . CON tire sentence (not parts of it) corresponds to an argument. In this paper, we introduce the Topic: Gun Control new task of Argument unit Recognition and Yes , guns can be used for protection CON Classification (ARC). In ARC, an argument is but laws are meant to protect us , too . generally a part of a sentence – a more real- PRO istic assumption since several different argu- ments can occur in one sentence and longer Figure 1: Examples of sentences with two arguments sentences often contain a mix of argumenta- as well as with annotated spans and stances. tive and non-argumentative parts. Recogniz- ing and classifying the spans that correspond to arguments makes ARC harder than previ- ously defined argument mining tasks. We re- (Stab et al., 2018b; Shnarch et al., 2018). While lease ARC-8, a new benchmark for evaluating discourse-level argument mining aims to parse the ARC task. We show that token-level an- argumentative structures in a fine-grained man- notations for argument units can be gathered ner within single documents (thus, mostly in sin- using scalable methods. ARC-8 contains 25% gle domains or applications), topic-dependent ar- more arguments than a dataset annotated on gument retrieval focusses on argumentative con- the sentence-level would. We cast ARC as a structs such as claims or evidences with regard to sequence labeling task, develop a number of a given topic that can be found in very different methods for ARC sequence tagging and es- tablish the state of the art for ARC-8. A fo- types of discourse. Argument retrieval typically cus of our work is robustness: both robust- frames the argumentative unit (argument, claim, ness against errors in sentence identification evidence etc.) on the level of sentence, i.e., it (which are frequent for noisy text) and ro- seeks to detect sentences that are relevant support- bustness against divergence in training and test ing (PRO) or opposing (CON) arguments as in the data. examples given in Fig. 1. 1 Introduction In this work, we challenge the assumption that arguments should be detected on the sentence- Argument mining (Peldszus and Stede, 2013) has level. This is partly justified by the difficulty of gained substantial attention from researchers in “unitizing”, i.e., of segmenting a sentence into the NLP community, mostly due to its complex- meaningful units for argumentation tasks (Stab ity as a task requiring sophisticated reasoning, et al., 2018b; Miller et al., 2019). We show that re- but also due to the availability of high-quality framing the argument retrieval task as Argument resources. Those resources include discourse- unit Recognition and Classification (ARC), i.e., level closed-domain datasets for political, educa- as recognition and classification of spans within tional or legal applications (Walker et al., 2012; a sentence on the token-level is feasible, not just Stab and Gurevych, 2014; Wyner et al., 2010), as in terms of the reliability of recognizing argumen- well as open-domain datasets for topic-dependent tative spans, but also in terms of the scalability of argument retrieval from heterogeneous sources generating training data.
Framing argument retrieval as in ARC, i.e., on the latter: we model arguments as self-contained the token-level, has several advantages: pieces of information which can be verified as rel- evant arguments for a given topic with no or mini- • It prevents merging otherwise separate argu- mal surrounding context. ments into a single argument (e.g., for the topic death penalty in Fig. 1). As one of the main contributions of this work, we show how to create training data for token- • It can handle two-sided argumentation ade- level argument mining with the help of crowd- quately (e.g., for the topic gun control in Fig. sourcing. Stab et al. (2018b) and Shnarch et al. 1). (2018) annotated topic-dependent arguments on the sentence-level using crowdsourcing. Fleiss κ • It can be framed as a sequence labeling task, agreement scores reported were 0.45 in Shnarch which is a common scenario for many NLP et al. (2018) for crowd workers and 0.72 in Stab applications with many available architec- et al. (2018b) for experts. Miller et al. (2019) tures for experimentation (Eger et al., 2017). present a multi-step approach to crowdsource To address the feasibility of ARC, we will ad- more complex argument structures in customer re- dress the following questions. First, we discuss views. Like us, they annotate arguments on the how to select suitable data for annotating argu- token-level – however, they annotate argument ments on the token-level. Second, we analyze components from the discourse-level perspective. whether the annotation of arguments on the token- Their inter-annotator agreement (αu roughly be- level can be reliably conducted with trained ex- tween 0.4 and 0.5) is low, demonstrating the dif- perts, as well as with untrained workers in a ficulty of this task. In this work, to capture ar- crowdsourcing setup. Third, we test a few basic gument spans more precisely, we test the validity as well as state-of-the-art sequence labeling meth- of arguments using a slot filling approach. Reis- ods on ARC. ert et al. (2018) also use argument templates, i.e., A focus of our work is robustness. (i) The as- slots to determine arguments. sumption that arguments correspond to complete Close to the spirit of this work, Ajjour et al. sentences makes argument mining brittle – when (2017) compare various argumentative unit seg- the assumption is not true, then sentence-level ar- mentation approaches on the token-level across gument mining makes mistakes. In addition, sen- three corpora. They use a feature-based approach tence identification is error-prone for noisy text and various architectures for segmentation and (e.g., text crawled from the web), resulting in find that BiLSTMs work best on average. How- noisy non-sentence units being equated with argu- ever, as opposed to this work, they study argu- ments. (ii) The properties of argument topics vary mentation on the discourse level, i.e., they do not considerably from topic to topic. An ARC method consider topic-dependency and only account for trained on one topic will not necessarily perform arguments and non-arguments (no argumentative well on another. We set ARC-8 up to make it types or relations like PRO and CON). Eger et al. easy to test the robustness of argument mining (2017) model discourse-level argument segmen- by including a cross-domain split and demonstrate tation, identification (claim, premises and major that cross-domain generalization is challenging for claims) and relation extraction as sequence tag- ARC-8. ging, dependency parsing and entity-relation ex- traction. For a dataset of student essays (Stab and 2 Related Work Gurevych, 2014), they find that sequence tagging Our work follows the established line of work on and an entity-relation extraction approach (Miwa argument mining in the NLP community, which and Bansal, 2016) work best. In particular, for the can loosely be divided into approaches detect- unit segmentation task (vanilla BIO), they find that ing and classifying arguments on the discourse state-of-the-art sequence tagging approaches can level (Palau and Moens, 2009; Stab and Gurevych, perform as well or even better than human experts. 2014; Eger et al., 2017) and ones focusing on Stab and Gurevych (2017) propose a CRF-based topic-dependent argument retrieval (Levy et al., approach with manually defined features for the 2014; Wachsmuth et al., 2017; Hua and Wang, unit segmentation task on student essays (Stab and 2017; Stab et al., 2018b). Our work is in line with Gurevych, 2014) and also achieve performance
close to human experts. the topic. Each document was checked for its cor- responding WARC file at the Common Crawl In- 3 Corpus Creation dex.4 We then downloaded and parsed the orig- Collecting annotations on the token-level is chal- inal HTML document for the next steps of our lenging. First, the unit of annotation needs to be pipeline; this ensures reproducibility. Following clearly defined. This is straightforward for tasks this, we used justext5 to remove HTML boiler- with short spans (sequences of words) such as plate. The resulting document was segmented into named entities, but much harder for longer spans separate sentences as well as within a sentence – as in the case of argument units. Second, labels into single tokens using spacy.6 We only con- from multiple annotators need to be merged into a sider sentences with number of tokens in the range single gold standard.1 This is also more difficult [3, 45]. for longer sequences because simple majority vot- 3.3 Sentence Sampling ing over individual words will likely create invalid The sentences were pre-classified with a sentence- (e.g., disrupted or grammatically incorrect) spans. level argument mining model following (Stab To address these challenges, we carefully de- et al., 2018b) and available via the ArgumenText signed selection of sources, sampling and anno- Classify API.7 The API returns for each sentence tation of input for ARC-8, our novel argument (i) an argument confidence score arg score in unit dataset. We first describe how we pro- [0.0, 1.0) (we discard sentences with arg score < cessed and retrieved data from a large webcrawl. 0.5), (ii) the stance on the sentence-level (PRO Next, we outline the sentence sampling process or CON) and (iii) the stance confidence score that accounts for a balanced selection of both stance score. This information was used together (non-)argument types and source documents. Fi- with the doc score to rank sentences for a selec- nally, we describe how we crowdsource annota- tion in the following crowd annotation process. tions of argument units within sentences in a scal- First, all three scores (for documents, arguments able way. and stance confidence) were normalized in the 3.1 Data Source range of available sentences and secondly summed up to create a rank for each sentence (see Eq. 1) We used the February 2016 Common Crawl with di , ai and si being the ranks fo document, ar- archive,2 which was indexed with Elasticsearch3 gument and stance confidence scores, respectively. following the description in (Stab et al., 2018a). For the sake of comparability, we adopt Stab et al. ranki = di + ai + si (1) (2018b)’s eight topics (cf. Table 1). The topics are general enough to have good coverage in Common The ranked sentences were divided by topic and Crawl. They are also of a controversial nature and the pre-classified stance on the sentence-level and hence a potentially good choice for argument min- ordered by rank (where a lower rank indicates a ing with an expected broad set of supporting and better candidate). We then went down the ranked opposing arguments. list selected each sentence with a probability of p = 0.5 until the target size of n = 500 per 3.2 Retrieval Pipeline stance and topic was reached; otherwise we did For document retrieval, we queried the indexed additional passes through the list. Table 1 gives data for Stab et al. (2018b)’s topics and collected data set creation statistics. the first 500 results per topic ordered by their document score (doc score) from Elasticsearch; 3.4 Crowd Annotations a higher doc score indicates higher relevance for The goal of this work was to come up with a 1 scalable approach to annotate argument units on One could also learn from “soft” labels, i.e., a distribu- the token-level. Given that arguments need to be tion created from the votes of multiple annotators. However, annotated with regard to a specific topic, large this does not solve the problem that some annotators deliver low quality work and their votes should be outscored by a 4 (hopefully) higher-quality majority of annotators. http://index.commoncrawl.org/ 2 CC-MAIN-2016-07 http://commoncrawl.org/2016/02/ 5 february-2016-crawl-archive-now-available/ http://corpus.tools/wiki/Justext 3 6 https://www.elastic.co/products/ https://spacy.io/ 7 elasticsearch https://api.argumentsearch.com/en/doc
# topic #docs #text #sentences #candidates #final #arg-sent. #arg-segm. #non-arg T1 abortion 491 454 39,083 3,282 1,000 424 472 (+11.32%) 576 T2 cloning 495 252 30,504 2,594 1,000 353 400 (+13.31%) 647 T3 marijuana legalization 490 472 45,644 6,351 1,000 630 759 (+20.48%) 370 T4 minimum wage 494 479 43,128 8,290 1,000 630 760 (+20.63%) 370 T5 nuclear energy 491 470 43,576 5,056 1,000 623 726 (+16.53%) 377 T6 death penalty 491 484 32,253 6,079 1,000 598 711 (+18.90%) 402 T7 gun control 497 479 38,443 4,576 1,000 529 624 (+17.96%) 471 T8 school uniforms 495 475 40,937 3,526 1,000 713 891 (+24.96%) 287 total 3,944 3,565 314,568 39,754 8,000 4,500 5,343 (+18.73%) 3,500 Table 1: Number of documents and sentences in the selection process and the final corpus size; arg-sent. is the number of argumentative sentences; arg-segm. is the information about argumentative segments; the percentage value is comparing the number of argumentative sentences with the number of argumentative segments amounts of (cross-topic) training data need to be satisfying agreement (αunom = 0.51, average over created. As has been shown by previous work on topics), one reason being inconsistency in select- topic-dependent argument mining (Shnarch et al., ing argument spans (median length of arguments 2018; Stab et al., 2018b), crowdsourcing can be ranged from nine to 16 words among the three used to obtain reliable annotations for argument experts). In a second round, we therefore de- mining datasets. However, as outlined above, cided to restrict the spans that could be selected token-level annotation significantly increases the by applying a slot filling approach that enforces difficulty of the annotation task, so it was unclear valid argument spans that match a template. We whether agreement among untrained crowd work- use the template: “< T OP IC > should be sup- ers would be sufficiently high. ported/opposed, because < argument span >”. We use the αu agreement measure Krippendorff The guidelines specify that the resulting sentence et al. (2016) in this work. It is designed for anno- had to be a grammatically sound statement. Al- tation tasks that involve unitizing textual continua though this choice unsurprisingly increased the – i.e., segmenting continuous text into meaningful length of spans and reduced the total number of subunits – and measuring chance-corrected agree- arguments selected, it increased consistency of ment in those tasks. It is also a good fit for ar- spans substantially (min/max. median length was gument spans within a sentence: typically these now between 15 and 17). Furthermore, the agree- spans are long and the context is a single sentence ment between the three experts rose to αunom = that may contain any type of argument and any 0.61 (average over topics). Compared to other number of arguments. Krippendorff et al. (2016) studies on token-level argument mining (Eckle- define a family of α-reliability coefficients that Kohler et al., 2015; Li et al., 2017; Stab and improve upon several weaknesses of previous α Gurevych, 2014), this score is in an acceptable measures. From these, we chose the αunom coef- range and we deem it sufficient to proceed with ficient, which also takes into account agreement crowdsourcing. on “blanks” (non-arguments in our case). The ra- In our crowdsourcing setup, workers could se- tionale behind this was that ignoring agreement on lect one or multiple spans, where each span’s per- sentences without any argument spans would over- missible length is between one token and the en- proportionally penalize disagreement in sentences tire sentence. Workers had to either choose at that contain arguments while ignoring agreement least one argument span and its stance (support- in sentences without arguments. ing/opposing), or select that the sentence did not To determine agreement, we initially carried out contain a valid argument and instead solve a sim- an in-house expert study with three graduate em- ple math problem. We introduced further qual- ployees (who were trained on the task beforehand) ity control measures in the form of a qualification and randomly sampled 160 sentences (10 per topic test and periodic attention checks.8 On an initial and stance) from the overall data. In the first round, we did not impose any restrictions on the 8 Workers had to be located in the US, CA, AU, NZ or span of words to be selected, other than that the GB, with an acceptance rate of 95% or higher. Payment was $0.42 per HIT, corresponding to US federal minimum wage selected span should be the shortest self-contained ($7.25/hour). The annotators in the expert study were salaried span that forms an argument. This resulted in un- research staff.
batch of 160 sentences, we collected votes from 3.6 Dataset Statistics nine workers. To determine the optimal number of The resulting data set, ARC-8,9 consists of 8000 workers for the final study, we did majority voting annotated sentences with 3500 (43.75%) being on the token-level (ties broken as non-arguments) non-argumentative. The 4500 argumentative sen- for both the expert study and workers from the tences are divided into 1951 (43.36%) single pro initial crowd study. We artificially reduced the argument sentences, 1799 (39.98%) single con- number of workers (1-9) and calculated percent- tra argument sentences and the remaining 750 age overlap averaged across all worker combina- (16.67%) sentences are many possible combina- tions (for worker numbers lower than 9). Whereas tions of supporting (PRO) and opposing (CON) the overlap was highest with 80.2% at nine votes, arguments with up to five single argument seg- it only dropped to 79.5% for five votes (and de- ments in a sentence. Thus, the token-level an- creased more significantly for fewer votes). We notation leads to a higher (+18.73%) total count deemed five votes to be an acceptable compromise of arguments of 5343, compared to 4500 with a between quality and cost. The agreement with ex- sentence-level approach. If we propagate the la- perts on the five-worker-setup is αunom = 0.71, bel of a sentence to all its tokens, then 100% of which is substantial (Landis and Koch, 1977). tokens of argumentative sentences are argumenta- The final gold standard labels on the 8000 tive. This ratio drops to 69.94% in our token-level sampled sentences was determined using a vari- setup, reducing the amount of non-argumentative ant of Bayesian Classifier Combination (Kim and tokens otherwise incorrectly selected as argumen- Ghahramani, 2012), referred to as IBCC in Simp- tative in a sentence. son and Gurevych (2018)’s modular framework for Bayesian aggregation of sequence labels. This 4 Methods method has been shown to yield results superior to majority voting or MACE (Hovy et al., 2013). We model ARC as a sequence labeling task. The input is a topic t and a sentence S = w1 ...wn . The goal is to select 0 ≤ k ≤ n spans of words 3.5 Dataset Splits each of which corresponding to an argument unit A = wj ...wm , 1 ≤ j, j ≤ m, m ≤ n. Following We create two different dataset splits. (i) An in- Stab et al. (2018b), we distinguish between PRO domain split. This lets us evaluate how models and CON (t should be supported/opposed, because perform on known vocabulary and data distribu- A) arguments. To measure the difficulty of the task tions. (ii) A cross-domain split. This lets us evalu- of ARC, we estimate the performance of simple ate how well a model generalizes for unseen topics baselines as well as current models in NLP, that and distributions different from the training set. In achieve state-of-the-art results on other sequence the cross-domain setup, we defined topics T1-T5 labeling data sets (Devlin et al., 2018). to be in the train set, topic T6 in the development set and topics T7 and T8 in the test set. For the 4.1 1-class Baselines in-domain setup, we excluded topics T7 and T8 (cross-domain test set), and used the first 70% of The 1-class baseline labels the data set completely the topics T1-T6 for train, the next 10% for dev (i.e., for each S the entire sequence w1 ...wn ) with und the remaining 20% for test. The samples from one of the three labels PRO, CON and NON. the in-domain test set were also excluded in the cross-domain train and development sets. As a re- 4.2 Sentence-Level Baselines sult, there are 4000 samples in train, 800 in dev As the sentence-level baseline, we used labels pro- and 2000 in test for the cross-domain split; and duced by the previously mentioned ArgumenText 4200 samples in train, 600 in dev and 1200 in test Classify API from Stab et al. (2018a). Since it is for the in-domain split. We work with two differ- a sentence-level classifier, we also projected the ent splits so as to guarantee that train/dev sets (in- sentence-level prediction on all of the tokens in a domain or cross-domain) do not overlap with test sequence to enable token-level evaluation. sets (in-domain or cross-domain). The assignment of sentences to the two splits is released as part of 9 We will make the dataset available at www.ukp. ARC-8. tu-darmstadt.de/data
4.3 BERT type of label, the sentence is labeled with it. Other- Furthermore, we used the BERT10 base (cased) wise, if there is the NON label with only one other model (Devlin et al., 2018) as a recent state-of- label (PRO or CON), then the NON label is omit- the-art model which achieved impressive results ted and the sentence is labeled with the remaining on many tasks including sequence labeling. For label. In other cases, a majority vote determines this model we considered two scenarios. First, we the final sentence label, or, in the case of ties, the kept the parameters as they are and used the model NON label is assigned. as a feature extractor (considered frozen, tagged 5.3 Sequence Labeling with Different Tagsets as ). Second, we fine-tuned (tagged as ) the parameters for the ARC task and the correspond- In the sequence labeling experiments with the new ing different tags. ARC-8 data set, we investigate the performance of BERT (cf. Section 4.3). The base scenario is 5 Experiments with three labels PRO, CON and NON (TAGS=3), but we also use two extended label sets. In one of In total, we run three different experiments on them, we extended the PRO and CON labels with the ARC-8 dataset with the previously intro- BI tags (TAGS=5), with B being the beginning of duced models, which we will describe in this sec- a segment and I a within-segment token, resulting tion. Additionally we experimented with different in the tags: B-PRO, I-PRO, B-CON, I-CON and tagsets for the ARC task. All experiments were NON. The other extension is with BIES tags, were conducted on a single GPU with 11 GB memory. we add E for end of a segment and S for single unit segments (TAGS=9), resulting in the following tag 5.1 1-class Baselines set: B-PRO, I-PRO, E-PRO, S-PRO, B-CON, I- For the simple baselines, we applied 1-class se- CON, E-CON, S-CON and NON. quence tagging on the corresponding development and test sets for the in-domain and cross-domain 5.4 Adding Topic Information setups. This allowed us to estimate the expected The methods described so far do not use topic in- lower bounds for more complex models. formation. We also tests methods for ARC that make use of topic information. In the first sce- 5.2 Token- vs. Sentence-Level nario, we just add the topic information to the la- To further investigate the performance of a token- bels, resulting in 25 TAGS (2 span information (B level model vs. a sentence-level model, we run and I) × 2 stance information (PRO and CON) × four different training procedures and evaluate the 6 topics (in-domain, topics T1-T6), and the NON results on both token- and sentence-level. We first label; for example B-PRO-CLONING). In the sce- train models on the token-level (sequence label- nario “TAGS=25++”, in addition to the TAGS=25 ing) and also evaluate on the token-level. Second, setup, we add the topic at the beginning of a se- we train a model on the sentence-level (as a text quence. Additionally, in the TAGS=25++ sce- classification task) and project the predictions to nario, we add all sentences of the other topics as all tokens of the sentence, which we then com- negative examples to the training set, with all la- pare to the token-level labels of the gold standard. bels set to NON. For example, a sentence with Third, we train models on the token-level and ag- PRO tokens for the topic CLONING, was added gregate a sentence-level score from the predicted as is (argumentative, for CLONING) and as non- scores, which we evaluate against an aggregated argumentative for the other five topics. Since all sentence-level gold-standard. Finally, in the last the topics need to be known beforehand, this is of this type of experiments, we train a model on done only on the in-domain datasets. This last sentence-level and compared it against the aggre- experiment is to investigate whether the model is gated sentence-level gold-standard. In the latter able to learn the topic-dependency of argument two cases, we aggregate on sentence-level as fol- units. lows: for each sentence, all occurrences of possi- ble types of label are counted. If there is only one 6 Evaluation 10 https://github.com/huggingface/ In this section we evaluate the results and analyze pytorch-pretrained-BERT the errors from the models in the different ARC
Domain In-Domain Cross-Domain Train Token Sentence Token Sentence Token Sentence Token Sentence EVAL Token Sentence Token Sentence SET Dev Test Dev Test Dev Test Dev Test Dev Test Dev Test Dev Test Dev Test Model TAGS Baseline (PRO) 3 10.46 10.91 10.46 10.91 14.10 14.10 14.10 14.10 6.46 12.36 6.46 12.36 9.87 16.43 9.87 16.43 Baseline (CON) 3 10.24 10.53 10.24 10.53 13.48 14.07 13.48 14.07 15.78 11.55 15.78 11.55 20.13 15.03 20.13 15.03 Baseline (NON) 3 25.83 25.43 25.83 25.43 21.57 21.13 21.57 21.13 24.54 24.01 24.54 24.01 18.83 18.43 18.83 18.43 ArgumenText 3 25.10 23.46 25.10 23.46 32.72 29.86 32.72 29.86 19.87 24.81 19.87 24.81 26.56 31.26 26.56 31.26 BERT 3 55.60 52.93 49.95 49.24 62.93 59.99 49.97 50.10 38.91 40.86 37.47 34.60 43.98 49.50 38.56 34.37 BERT 5 55.38 52.23 - - 61.93 60.20 - - 38.49 40.73 - 43.45 48.71 - - - BERT 9 54.50 51.37 - - 61.16 60.09 - - 37.86 39.96 - 42.82 48.54 - - - BERT 3 68.95 63.35 64.83 63.78 72.51 65.49 64.92 64.26 53.66 52.28 46.47 52.19 55.54 51.21 46.56 51.68 BERT 5 68.34 64.67 - - 70.21 65.80 - - 53.32 52.52 - - 53.07 51.98 - - BERT 9 67.58 64.98 - - 67.19 64.27 - - 53.50 54.96 - - 52.45 51.90 - - BERT 25 71.18 63.23 - - 72.91 64.66 - - - - - - - - - - BERT 25++ 66.58 64.19 - - 65.72 64.21 - - - - - - - - - - Table 2: F1 scores for all methods; training was done with the corresponding TAGS in the table, while evalutation was always on three labels (PRO, CON, NON) with aggregation if necessary; the missing values (-) were not possible or applicable experiment setups and hence omitted. Model TAGS Train time (sec./it.) Token- vs. Sentence-Level The four experi- BERT (base, cased) 3 (T) 3 (S) 37 18 ments on token- and sentence-level for both in- BERT (base, cased) 3 (T) 3 (S) 39 29 and cross-domain setups (Table 2) work signifi- cantly better with a fine-tuned BERT model for Table 3: BERT average runtimes with training on the ARC task, which is a similar discovery as in token-level (T) and training on sentence-level (S), with 32 sentences per batch on a single GPU with 11 GB Peters et al. (2019) for many other NLP tasks. memory. Furthermore, training on token-level leads always to better results, which was one of our motiva- tions and objectives for this task and the dataset. experiments. All reported results are macro F1 For an evaluation on token-level, a model trained scores, except otherwise stated. For the computa- on token-level with TAGS=9 works best, while tion of the scores we used a function from scikit- TAGS=5 work best for an evaluation on sentence- learn11 where we concatenated all the true values level. However, the average runtime per iteration and the predictions over all sentences per set. (Table 3) is for sentence-level models on average between 25% and 50% faster compared to token- 6.1 Results level models. We present the results in the following manner: Sequence Labeling Across Domains The re- Table 2 shows experiments across domains and sults for the evaluation on three labels are in for different tagsets in the training step, always Table 2 and the best F1 scores for in-domain evaluating on three labels (PRO, CON and NON). on token- (64.26) and sentence-level (65.80) are In Table 3 we compare the runtimes of the token- higher than the corresponding scores for cross- and sentence-level training. Finally, we show the domain (54.96 and 51.98, respectively). This val- results of the evaluation on the same tags as we idates our assumption that the ARC problem de- used for the training in Table 4 and Table 5. pends on the topic at hand and that cross-topic (cross-domain) transfer is more difficult to learn. For the results in Table 2 we see that the base- Sequence Labeling with Different Tagsets The line for the NON label (most frequent label) and results in Table 4 are from evaluations of models the model “ArgumenText” (ArgumenText Classify that were trained on the corresponding TAGS 3, 5 API) are clearly worse than all BERT-based mod- and 9, and work again better for in-domain and a els. This shows that we are definitely improving fine-tuned model (63.35, 54.23 and 36.01, respec- upon the pipeline that we used to select the data. tively). Results for larger tagsets are clearly lower 11 which is to be expected from the increased com- https://scikit-learn.org/stable/ modules/generated/sklearn.metrics. plexity of the task and the low number of training precision_recall_fscore_support.html examples for some of the tags.
Model TAGS Dev (In-Domain) Test (In-Domain) Dev (Cross-Domain) Test (Cross-Domain) BERT (base, cased) 3 5 9 55.60 34.25 18.93 52.93 32.45 17.92 38.91 23.20 12.82 40.86 24.88 13.76 BERT (base, cased) 3 5 9 68.95 58.32 39.66 63.35 54.23 36.01 53.66 41.81 28.35 52.28 42.65 30.34 Table 4: Sequence labeling with BERT for 3, 5 and 9 labels. Model TAGS Dev Test cross-domain shorter than the true segments. Re- (In-Domain) (In-Domain) garding the count of segments, there are 297 more BERT (base, cased) 25 51.38 41.73 segments in the predicted labels for in-domain BERT (base, cased) 25++ 45.50 42.83 and 372 more segments in the predicted labels for Table 5: BERT Experiments with added topic informa- cross-domain, than there are in the gold-standard. tion; for 25, the topic information is only in the labels; for 25++, the topic information is in the labels, neg- ative examples are added and the topic information is Stance The complete misclassification of the provided at the beginning of a sequence. stance occured for the best token-level model (TAGS=9) in 7.67% of the test sentences in- domain and in 16.50% of the test sentences cross- Adding Topic Information Adding the topic in- domain. A frequent error is that apparently stance- formation in the labels or before a sequence gen- specific words are assigned a label that is not con- erally does not help when evaluating on three tags sistent with the overall segment stance. (results for 25 and 25++ TAGS in Table 2). So we suggest to use more complex models that can improve the results when the topic information is Topic We looked for errors where the topic- provided. The results in Table 5 show that addi- independent tag was correct (e.g., B-CON, begin- tional information about the topic and from the ning of a con argument), but the topic was incor- negative examples (42.83) are helping to train the rect. This type of error occurred only four times model. So the model is able to learn the topic rel- on the testset for TAGS=25++ on some of the to- evance of a sentence for the six topics in the in- kens, but never for a full sequence. The model domain sets. misclassified for example the actual topic nuclear energy as the topic abortion, or the actual topic 6.2 Error Analysis death penalty was confused for the topic minimun We classified errors in three ways: (i) the span is wage. Reasons for this could be some topic spe- not correctly recognized, (ii) the stance is not cor- cific vocabulary that the model learned, but none rectly classified, or (iii) the topic is not correctly of them are actually words one would assign to the classified. misclassified topics. Span The errors by the models for the span can be divided into two more cases: (a) the beginning 7 Conclusion and/or end of a segment is incorrectly recognized, and/or (b) the segment is broken into several seg- We introduced a new task, argument unit recog- ments or merged into fewer segments, such that to- nition and classification (ARC), and release the kens inside or outside an actual argument unit are benchmark ARC-8 for this task. We demonstrated misclassified as non-argumentative. Therefore, we that ARC-8 has good quality in terms of annotator used the predictions by the best token-level model agreement: the required annotations can be crowd- with TAGS=9 in both in-domain and cross-domain sourced using specific data selection and filtering settings, and analyzed the average length of seg- methods as well as a slot filling approach. We cast ments as well as the total count of segments for the ARC as a sequence labeling task and established a true and predicted labels. For the average length state of the art for ARC-8, using baseline as well of segments (in tokens), we got 17.66 for true and as advanced methods for sequence labeling. In the 13.73 for predicted labels in-domain and 16.35 for future, we plan to find better models for this task, true and 13.14 for predicted labels cross-domain, especially models with the ability to better incor- showing that predicted segments are on average porate the topic information in the learning pro- four tokens in-domain and on average three token cess.
Acknowledgments K. Krippendorff, Y. Mathet, S. Bouvry, and A. Widlöcher. 2016. On the reliability of uni- We gratefully acknowledge support by Deutsche tizing textual continua: Further developments. Forschungsgemeinschaft (DFG) (SPP-1999 Quality & Quantity, 50(6):2347–2364. Robust Argumentation Machines (RATIO), J. Richard Landis and Gary G. Koch. 1977. The mea- SCHU2246/13), as well as by the German Federal surement of observer agreement for categorical data. Ministry of Education and Research (BMBF) Biometrics, 33(1):159–174. under the promotional reference 03VP02540 Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud (ArgumenText). Aharoni, and Noam Slonim. 2014. Context depen- dent claim detection. In Proceedings of COLING 2014, the 25th International Conference on Compu- tational Linguistics: Technical Papers, pages 1489– References 1500, Dublin, Ireland. Dublin City University and Yamen Ajjour, Wei-Fan Chen, Johannes Kiesel, Hen- Association for Computational Linguistics. ning Wachsmuth, and Benno Stein. 2017. Unit seg- Mengxue Li, Shiqiang Geng, Yang Gao, Shuhua Peng, mentation of argumentative texts. In Proceedings of Haijing Liu, and Hao Wang. 2017. Crowdsourcing the 4th Workshop on Argument Mining, pages 118– argumentation structures in Chinese hotel reviews. 128, Copenhagen, Denmark. Association for Com- In Proceedings of the 2017 IEEE International Con- putational Linguistics. ference on Systems, Man, and Cybernetics, pages 87–92. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep Tristan Miller, Maria Sukhareva, and Iryna Gurevych. bidirectional transformers for language understand- 2019. A streamlined method for sourcing discourse- ing. arXiv preprint arXiv:1810.04805. level argumentation annotations from the crowd. In Proceedings of the 2019 Conference of the North Judith Eckle-Kohler, Roland Kluge, and Iryna American Chapter of the Association for Computa- Gurevych. 2015. On the role of discourse markers tional Linguistics. for discriminating claims and premises in argumen- tative discourse. In Proceedings of the 2015 Con- Makoto Miwa and Mohit Bansal. 2016. End-to-end re- ference on Empirical Methods in Natural Language lation extraction using lstms on sequences and tree Processing, pages 2236–2242, Lisbon, Portugal. As- structures. In Proceedings of the 54th Annual Meet- sociation for Computational Linguistics. ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116, Berlin, Germany. Association for Computational Linguis- Steffen Eger, Johannes Daxenberger, and Iryna tics. Gurevych. 2017. Neural end-to-end learning for computational argumentation mining. In Proceed- Raquel Mochales Palau and Marie-Francine Moens. ings of the 55th Annual Meeting of the Association 2009. Argumentation mining: The detection, classi- for Computational Linguistics (ACL 2017), volume fication and structure of arguments in text. In Pro- Volume 1: Long Papers, pages 11–22. Association ceedings of the 12th International Conference on Ar- for Computational Linguistics. tificial Intelligence and Law, ICAIL ’09, pages 98– 107, New York, NY, USA. ACM. Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust Andreas Peldszus and Manfred Stede. 2013. From ar- with mace. In Proceedings of the 2013 Conference gument diagrams to argumentation mining in texts: of the North American Chapter of the Association A survey. Int. J. Cogn. Inform. Nat. Intell., 7(1):1– for Computational Linguistics: Human Language 31. Technologies, pages 1120–1130. Matthew Peters, Sebastian Ruder, and Noah A Smith. Xinyu Hua and Lu Wang. 2017. Understanding and 2019. To tune or not to tune? adapting pretrained detecting supporting arguments of diverse types. In representations to diverse tasks. arXiv preprint Proceedings of the 55th Annual Meeting of the As- arXiv:1903.05987. sociation for Computational Linguistics (Volume 2: Paul Reisert, Naoya Inoue, Tatsuki Kuribayashi, and Short Papers), pages 203–208, Vancouver, Canada. Kentaro Inui. 2018. Feasible annotation scheme Association for Computational Linguistics. for capturing policy argument reasoning using argu- ment templates. In Proceedings of the 5th Workshop Hyun-Chul Kim and Zoubin Ghahramani. 2012. on Argument Mining, pages 79–89. Association for Bayesian classifier combination. In Proceedings of Computational Linguistics. the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceed- Eyal Shnarch, Carlos Alzate, Lena Dankin, Mar- ings of Machine Learning Research, pages 619–627, tin Gleize, Yufang Hou, Leshem Choshen, Ranit La Palma, Canary Islands. PMLR. Aharonov, and Noam Slonim. 2018. Will it blend?
blending weak and strong labeled data in a neu- ral network for argumentation mining. In Proceed- ings of the 56th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 2: Short Papers), pages 599–605. Association for Computa- tional Linguistics. Edwin Simpson and Iryna Gurevych. 2018. Bayesian ensembles of crowds and deep learners for sequence tagging. CoRR, abs/1811.00780. Christian Stab, Johannes Daxenberger, Chris Stahlhut, Tristan Miller, Benjamin Schiller, Christopher Tauchmann, Steffen Eger, and Iryna Gurevych. 2018a. Argumentext: Searching for arguments in heterogeneous sources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demon- strations, pages 21–25. Christian Stab and Iryna Gurevych. 2014. Annotat- ing argument components and relations in persua- sive essays. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1501–1510. Dublin City University and Association for Computational Linguistics. Christian Stab and Iryna Gurevych. 2017. Parsing ar- gumentation structures in persuasive essays. Com- putational Linguistics, 43(3):619–659. Christian Stab, Tristan Miller, Benjamin Schiller, Pranav Rai, and Iryna Gurevych. 2018b. Cross- topic argument mining from heterogeneous sources. In Proceedings of the 2018 Conference on Empiri- cal Methods in Natural Language Processing, pages 3664–3674. Association for Computational Linguis- tics. Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, pages 49–59, Copenhagen, Denmark. Association for Computational Linguistics. Marilyn Walker, Jean Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012. A corpus for re- search on deliberation and debate. In Proceed- ings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 812–817, Istanbul, Turkey. European Lan- guage Resources Association (ELRA). Adam Wyner, Raquel Mochales-Palau, Marie-Francine Moens, and David Milward. 2010. Approaches to text mining arguments from legal cases. In Enrico Francesconi, Simonetta Montemagni, Wim Peters, and Daniela Tiscornia, editors, Semantic Processing of Legal Texts: Where the Language of Law Meets the Law of Language, pages 60–79. Springer Berlin Heidelberg, Berlin, Heidelberg.
You can also read