I Wish I Would Have Loved This One, But I Didn’t – A Multilingual Dataset for Counterfactual Detection in Product Reviews

James O’Neill‡∗ (James.O-Neill@liverpool.ac.uk)
Polina Rozenshtein†,∗ (prrozens@amazon.co.jp)
Ryuichi Kiryo† (kiryor@amazon.co.jp)
Motoko Kubota† (kubmotok@amazon.co.jp)
Danushka Bollegala†,‡ (danubol@amazon.com)

Amazon†, University of Liverpool‡
∗ The two first authors contributed equally

arXiv:2104.06893v2 [cs.CL] 15 Sep 2021

Abstract

Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese languages. The dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced due to cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged to learn accurate CFD models. Applying machine translation on English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.

1 Introduction

Counterfactual statements are an essential tool of human thinking and are often found in natural languages. Counterfactual statements may be identified as statements of the form – If p was true, then q would be true (i.e. assertions whose antecedent (p) and consequent (q) are known or assumed to be false) (Milmed, 1957). In other words, a counterfactual statement describes an event that may not, did not, or cannot take place, and the subsequent consequence(s) or alternative(s) did not take place. For example, consider the counterfactual statement – I would have been content with purchasing this iPhone, if it came with a warranty! Counterfactual statements can be broken into two parts: a statement about the event (if it came with a warranty), also referred to as the antecedent, and the consequence of the event (I would have been content with purchasing this iPhone), referred to as the consequent. Counterfactual statements are ubiquitous in natural language and have been well-studied in fields such as philosophy (Lewis, 2013), psychology (Markman et al., 2007; Roese, 1997), linguistics (Ippolito, 2013), logic (Milmed, 1957; Quine, 1982), and causal inference (Höfler, 2005).

Accurate detection of counterfactual statements is beneficial to numerous applications in natural language processing (NLP) such as in medicine (e.g., clinical letters), law (e.g., court proceedings), sentiment analysis, and information retrieval. For example, in information retrieval, counterfactual detection (CFD) can potentially help to remove irrelevant results to a given query. Revisiting our previous example, we should not return the iPhone in question for a user who is searching for iPhone with warranty because that iPhone does not come with a warranty. A simple bag-of-words retrieval model that does not detect counterfactuals would return the iPhone in question because all the tokens in the query (i.e. iPhone, with, warranty) occur in the review sentence. Detecting counterfactuals can also be a precursor to capturing causal inferences (Wood-Doughty et al., 2018) and interactions, which have been shown to be effective in fields such as health sciences (Höfler, 2005). Janocko et al. (2016) and Son et al. (2017) studied CFD in social media for automatic psychological assessment of large populations.

CFD is often modelled as a binary classification task (Son et al., 2017; Yang et al., 2020a). A manually annotated sentence-level counterfactual dataset was introduced in SemEval-2020 (Yang et al., 2020a) to facilitate further research into this important problem. However, successful developments of classification methods require extensive high quality labelled datasets. To the best of our knowledge, currently there are only two labelled datasets for counterfactuals: (a) the pioneering small dataset of tweets (Son et al., 2017) and (b) a recent larger corpus covering the finance, politics, and healthcare domains (Yang et al., 2020a). However, these datasets are limited to the English language.

In this paper, we contribute to this emerging line of work by annotating a novel CFD dataset for a new domain (i.e. product reviews), covering languages in addition to English, such as Japanese and German, ensuring a balanced representation of counterfactuals and the high quality of the labelling. Following prior work, we model counterfactual statement detection as a binary classification problem, where given a sentence extracted from a product review, we predict whether it expresses a counterfactual or a non-counterfactual statement. Specifically, we annotate sentences selected from Amazon product reviews, where the annotators provided sentence-level annotations as to whether a sentence is counterfactual with respect to the product being discussed. We then represent sentences using different encoders and train CFD models using different classification algorithms.

The percentage of sentences that contain a counterfactual statement in a random sample of sentences has been reported to be as low as 1-2% (Son et al., 2017). Therefore, all prior works annotating CFD datasets have used clue phrases such as I wished to select candidate sentences that are likely to be true counterfactuals, which are then subsequently annotated by human annotators (Yang et al., 2020a). However, this selection process can potentially introduce a selection bias towards the clue phrases used.

To the best of our knowledge, while the data selection bias is a recognised problem in other NLP tasks (e.g., Larson et al. (2020)), this selection bias on CFD classifiers has not been studied previously. Therefore, we train counterfactual classifiers with and without masking the clue phrases used for candidate sentence selection. Furthermore, we experiment with enriching the dataset with sentences that do not contain clue phrases but are semantically similar to the ones that contain clue phrases. Interestingly, our experimental results reveal that, compared to lexicalised CFD models such as those using bag-of-words representations, CFD models trained using contextualised masked language models such as BERT (Devlin et al., 2019) are robust against the selection bias. Our contributions in this paper are as follows:

First-ever Multilingual Counterfactual Dataset: We introduce the first-ever multilingual CFD dataset containing manually labelled product review sentences covering English, German, and Japanese languages.¹ As already mentioned above, counterfactual statements are naturally infrequent. We ensure that the positive (i.e. counterfactual) class is represented by at least 10% of samples for each language. Distinguishing between counterfactual and non-counterfactual statements is a fairly complex task even for humans. Unlike previous works, which relied on crowdsourcing, we employ professional linguists to produce a high quality annotation. We follow the definition of counterfactuals used by Yang et al. (2020a) to ensure that our dataset is compatible with the SemEval-2020 CFD dataset (SemEval). We experimentally verify that by merging our dataset with the SemEval CFD dataset, we can further improve the accuracies of counterfactual classifiers. Moreover, applying machine translation on the English CFD dataset to produce multilingual CFD datasets results in poor CFD models, indicating the language-specificity of the problem, which requires careful manual annotations.

¹ https://github.com/amazon-research/amazon-multilingual-counterfactual-dataset

Accurate CFD Models: Using the annotated dataset we train multiple classifiers using (a) lexicalised word-order insensitive bag-of-words representations as well as (b) contextualised sentence embeddings. We find that there is a clear advantage to using contextualised embeddings over non-contextualised embeddings, indicating that counterfactuals are indeed context-sensitive.

2 Related Work

Counterfactuals have been studied in various contexts such as for problem solving (Markman et al., 2007), explainable machine learning (Byrne, 2019), advertisement placement (Joachims and Swaminathan, 2016) and algorithmic fairness (Kusner et al., 2017). Kaushik et al. (2020) proposed an annotation scheme whereby the original data is augmented in a counterfactual manner to overcome spurious associations that a classifier heavily relies upon, thus failing to perform well on test data distributions that are not identical. Unlike Kaushik et al. (2020) and closely related work by Gardner et al. (2020), we are interested in identifying existing counterfactuals and filtering these statements to improve search performance.

A CFD task was presented in the SemEval-2020 Challenge (Yang et al., 2020b). The provided dataset contains counterfactual statements from news articles. However, the dataset does not cover counterfactuals in e-commerce product reviews, which is our focus in this paper. One of the earliest CFD datasets was annotated by Son et al. (2017) and covers counterfactual statements extracted from social media. Both datasets are labelled for binary classification by crowdsourcing and contain only sentences in English. We will compare our dataset to these previous works in § 3.4. To summarise, our dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high quality annotations.

A range of CFD methods was recently proposed in response to the SemEval-2020 challenge (Yang et al., 2020b). Most of the high performing methods (Ding et al., 2020; Fajcik et al., 2020; Lu et al., 2020; Ojha et al., 2020; Yabloko, 2020) use state-of-the-art pretrained language models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Radford et al., 2019; Yang et al., 2019). Traditional ML methods, such as SVM and random forests, were also used but with less success (Ojha et al., 2020).

To achieve the best prediction quality, ensemble strategies are employed. The top performing systems use an ensemble of transformers (Ding et al., 2020; Fajcik et al., 2020; Lu et al., 2020), while others include Convolutional Neural Networks (CNNs) with Global Vectors (GloVe; Pennington et al., 2014) embeddings (Ojha et al., 2020). Various structures are used on top of transformers. For example, Lu et al. (2020) and Ojha et al. (2020) use a CNN as the top layer, while Bai and Zhou (2020) use Bi-GRUs and Bi-LSTMs. Some other proposed methods use additional modules, such as constituency and dependency parsers, in the lower layers of the architecture (Yabloko, 2020).

CFD datasets tend to be highly imbalanced because counterfactual statements are less frequent in natural language texts. Prior work has used techniques such as pseudo-labelling (Ding et al., 2020) and multi-sample dropout (Chen et al., 2020) to address the data imbalance and overfitting problems.

3 Dataset Curation

We adopt the definition of a counterfactual statement proposed by Janocko et al. (2016), who define it as a statement which looks at how a hypothetical change in past experience could have affected the outcome of that experience. Their definition is based on the linguistic structures of 6 types of counterfactuals, as follows.

Conjunctive Normal: The antecedent is followed by the consequent. The antecedent consists of a conditional conjunction followed by a past tense subjunctive verb or past modal verb. The consequent contains a past or present tense modal verb. (Example: If everyone got along, it would be more enjoyable.)

Conjunctive Converse: The consequent is followed by the antecedent. The consequent consists of a modal verb and past or present tense verb. The antecedent consists of a conditional conjunction followed by a past tense subjunctive verb or past tense modal. (Example: I would be stronger, if I had lifted weights.)

Modal Normal: The antecedent is followed by the consequent. The antecedent consists of a modal verb and past participle verb. The consequent consists of a past/present tense modal verb. (Example: We should have gone bowling, that would have been better.)

Wish/Should Implied: The antecedent is present, the consequent is implied. The antecedent is the independent clause following ‘wish’ or ‘should’. The consequent is implied and can be paraphrased as “would be better off”. (Examples: I wish I had been richer. I should have revised my rehearsal lines.)

Verb Inversion: No specific order of the antecedent and consequent. The antecedent uses the subjunctive mood by inverting the verbs ‘had’ and ‘were’ to create a hypothetical conditional statement along with a past tense verb. The consequent consists of a modal verb and past or present tense verb. (Example: Had I listened to your advice, I may have got the job.)

Modal Propositional, Would/Could Have: The consequent is followed by the antecedent. The antecedent consists of a past/present modal verb. The consequent consists of a prepositional phrase (only certain types). (Examples: I would have been better off not reading this. I would have been happier without John.)

Note that, while Yang et al. (2020a) explicitly
mention only 5 types of counterfactual and Son et al. (2017) work with 7 types, their definitions and clue words used for data collection effectively cover the same 6 types defined by Janocko et al. (2016). We worked with professional linguists to extend these counterfactual definitions for the German and Japanese languages. While the extension of the definition from English to German is relatively straightforward, the extension to the syntactically and orthographically different structure of Japanese sentences was challenging (Jacobsen, 2011) and required re-writing the annotation guidelines, including additional examples. The annotation guidelines are included in the dataset release.

3.1 Data Collection

The main step of data collection in the previous works (Son et al., 2017; Yang et al., 2020a) is filtering the data using a pre-compiled list of clue words/phrases. Because the exact list of clue phrases used by Janocko et al. (2016) was not publicly available, we created a new list of clue phrases following the definitions of counterfactual types. In addition, we compiled similar clue phrase lists for German and Japanese languages. Yang et al. (2020a) applied a more complex procedure, where they match Part of Speech (PoS)-tagged sentences against lexico-syntactic patterns. In our work, we do not consider PoS-based patterns, which are difficult to generalise across languages.

We use the Amazon Customer Reviews Dataset,² which contains over 130 million customer reviews collected and released by Amazon to the research community. To create an annotated dataset, we select reviews in different categories as detailed in the Supplementary. Next, we sample candidate sentences for annotation in two iterations.

² https://s3.amazonaws.com/amazon-reviews-pds/readme.html

In the first iteration, we consider reviews written by customers with a verified purchase (i.e., the customer has bought the product about which he or she is writing the review). Given that counterfactual statements are infrequent, all prior works (Son et al., 2017; Yang et al., 2020a) have used clue phrase lists for selecting data for human annotation. Following this practice, we select sentences that contain exactly one clue phrase from our pre-compiled clue phrase lists for each language. We remove sentences that are exceedingly long (more than 512 tokens) or short (less than 10 tokens). Shorter sentences might not contain sufficient information for a human annotator to decide whether they are counterfactual statements, whereas longer sentences are likely to contain various other information besides counterfactuals.

The above-mentioned first iteration might produce a biased dataset in the sense that all sentences contain counterfactual clues from the predefined lists. There are two possible drawbacks in this selection method. First, the manually compiled clue phrase lists might not cover all the different ways in which we can express a counterfactual in a particular language. Therefore, the sentences selected using the clue phrase lists might have coverage issues. Second, a counterfactual classification model might assign high confidence scores for some high precision clue phrases (e.g., “wish” for English). Such a classifier is likely to perform poorly on test data that do not use clue phrases for expressing counterfactuality. Conversely, adding sentences with no clue words to the dataset might result in a greater bias: those additional sentences are likely to be negative examples, and thus the discriminatory power of the clue phrases can get amplified. Later in our experiments, we empirically evaluate the effect of selection bias due to the reliance on clue phrases.

To address the selection bias, in addition to the sentences selected in the first iteration, we conduct a second iteration where we select sentences that do not contain counterfactual clues from our lists. For this purpose, we create sentence embeddings for each sentence selected in the first iteration. We use a pretrained multilingual BERT model.³ We then use k-means clustering to cluster these sentences into k = 100 clusters. We assume each cluster represents some aspect of a product and is represented by its centroid. Next, we pick sentences that do not contain the clue phrases, compute their sentence embeddings, and measure the similarity to each of the centroids. For each centroid we select the top n most similar sentences for manual annotation. We set n such that we obtain a number of sentences approximately equal to the number of sentences that contain clue phrases selected in the first iteration. All selected sentences are manually annotated for counterfactuality as described in § 3.2.

³ https://huggingface.co/bert-base-multilingual-uncased
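The two sampling iterations of § 3.1 can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, clue matching is simplified to case-insensitive substring search, token counts use whitespace splitting, and cosine similarity stands in for the unspecified similarity measure. The centroids are assumed to come from a separate k-means run (k = 100 in the paper) over sentence embeddings of the first-iteration sentences, and the embeddings themselves are assumed to be precomputed.

```python
import math


def count_clues(sentence, clue_phrases):
    """Count how many distinct clue phrases occur in the sentence.

    Assumption: case-insensitive substring matching; the paper does not
    specify how clue phrases are matched.
    """
    text = sentence.lower()
    return sum(1 for clue in clue_phrases if clue.lower() in text)


def first_iteration(sentences, clue_phrases, min_tokens=10, max_tokens=512):
    """Keep sentences with exactly one clue phrase and 10 to 512 tokens."""
    kept = []
    for sentence in sentences:
        n_tokens = len(sentence.split())  # assumption: whitespace tokens
        if min_tokens <= n_tokens <= max_tokens and count_clues(sentence, clue_phrases) == 1:
            kept.append(sentence)
    return kept


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def second_iteration(centroids, candidates, n):
    """Select, for each centroid, the n most similar clue-free sentences.

    `candidates` is a list of (sentence, embedding) pairs for sentences that
    contain no clue phrase. A sentence picked for several centroids is
    returned only once.
    """
    selected, seen = [], set()
    for centroid in centroids:
        ranked = sorted(candidates, key=lambda item: cosine(centroid, item[1]), reverse=True)
        for sentence, _ in ranked[:n]:
            if sentence not in seen:
                seen.add(sentence)
                selected.append(sentence)
    return selected
```

In practice n would be tuned, as the text describes, until the second iteration yields roughly as many sentences as the first. Because duplicates across centroids are dropped here, the returned list can be shorter than n times the number of centroids.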
3.2 Annotation

The annotators were provided guidelines with definitions, extensive examples and counterexamples. Briefly, counterfactual statements were identified if they belong to any of the counterfactual types described in § 3. If any part of a sentence contains a counterfactual, then we consider the entire sentence to be a counterfactual. This annotation process increases the number of counterfactual examples and the coverage across the counterfactual types in the dataset, thereby reducing the class imbalance. We require that at least 90% of the sentences have the agreement of 2 professional linguists (2 out of 2 agreement); for the remaining cases (at most 10%), a third linguist resolved the disagreement (2 out of 3 agreement).

3.3 Dataset Statistics

The basic dataset statistics can be found in Table 1. We present two versions of the English dataset: EN contains only sentences filtered by the clue words, and EN-ext is a superset of EN enriched by sentences with no clue words as described in § 3.1. In the clue-based dataset EN, about one fifth of the examples are positive, while in its extended version about one tenth are counterfactuals. Only 76 out of 4977 added sentences were labelled as positive. The DE dataset contains 69.1% and the JP dataset contains 9.5% of counterfactuals.

Dataset   Positive   Negative   Total   CF %
EN             954       4069    5023   18.9
EN-ext        1030       8970   10000   10.0
DE            4840       2160    7000   69.1
JP             667       6333    7000    9.5

Table 1: Dataset statistics: the number of positive (counterfactual) and negative (non-counterfactual) examples, total sizes of the datasets, percentage of counterfactual (CF) examples.

The summary of clue phrase distributions in positive and negative classes is shown in Table 2. Interestingly, English and German lists have approximately the same number of clues, but the precision for German clues is much higher, resulting in more counterfactual statements being extracted using those clue phrases. On the contrary, the Japanese list has the largest number of clues, yet results in the lowest precision. The specification of counterfactual clue phrases for Japanese is a linguistically hard problem because the meaning of the clues is highly context dependent. The large number of Japanese clue phrases is due to the orthographic variations present in Japanese, where the same phrase can be written using kanji, hiragana, katakana characters or a mixture of them. Because we were able to select sufficiently large datasets for German and Japanese using the clue phrases, we did not consider the second iteration step described in § 3.1 for those languages.

Dataset    N     fP      fN      fdata
EN         29    100.0   100.0   100.0
EN-ext     29     92.6    45.3    50.2
DE         27    100.0   100.0   100.0
JP         70    100.0   100.0   100.0

Table 2: Clue phrases summary for the datasets: N is the total number of clue phrases in each clue phrase list. fP and fN are the percentages of examples containing clue phrases in the counterfactual and non-counterfactual classes, respectively. fdata is the percentage of sentences containing a clue phrase in a dataset.

3.4 Comparison with Existing Datasets

We compare the multilingual counterfactual dataset we create against existing datasets in Table 3. Our dataset is well-aligned with the two other existing datasets in the sense that we use the same definition of a counterfactual, keep a similar percentage of positive examples, and use similar keywords for dataset construction. These properties ensure that our dataset of product reviews can be used on its own, as well as organically combined with the existing datasets from other domains. A distinctive feature of our dataset is its coverage of a novel domain, e-commerce reviews, which is not covered by any of the existing counterfactual datasets. Furthermore, our dataset is available for three languages: English, German, and Japanese. This is the first counterfactual dataset not limited to the English language. Unlike previous works, which relied on crowdsourcing, we employ professional linguists to produce the lists of clue words and supervise the annotation. This ensures the high quality of the labelling.

4 Evaluations

We conduct a series of experiments to systematically evaluate several important factors related to counterfactuality such as (a) selection bias due to clue phrases (§ 4.1), (b) effect of merging multiple counterfactual datasets (§ 4.2), (c) use of machine
translation (MT) to translate counterfactual statements (§ 4.3), and (d) effect of different sentence encoders and classifiers for training CFD models (§ 4.4).

Dataset               Language                      Size                          CF %
Son et al. (2017)     English                       1637 (2137)                   10.1 (31.2)
Yang et al. (2020a)   English                       20000                         11.0
This work             English / German / Japanese   10000 (5023) / 7000 / 7000    10.0 (18.9) / 69.1 / 9.5

Dataset               CF definition           Domain                                Construction                           Annotation
Son et al. (2017)     Janocko et al. (2016)   Twitter                               keywords filtering                     mixed: manual (unknown), automatic pattern matching
Yang et al. (2020a)   Janocko et al. (2016)   News: finance, politics, healthcare   keywords filtering, pattern matching   manual (crowdsourcing, strong agreement)
This work             Janocko et al. (2016)   Amazon Reviews                        keywords filtering                     manual (curated by linguists)

Table 3: Dataset comparisons. The numbers in parentheses for Son et al. (2017) correspond to the union of manually and automatically labelled datasets. The numbers in parentheses for this work correspond to the clue-based English dataset EN.

For evaluations in (a), (b), and (c), we fine-tune a widely used multilingual transformer model, multilingual BERT (mBERT) (Devlin et al., 2019), to train a CFD model. The model is pretrained for the tasks of masked language modelling and next sentence prediction for 104 languages⁴ and is used with the default parameter settings. The model is implemented using the Transformers library.⁵ We fine-tune a linear layer on top of these pretrained language models for the CFD task using the training process as described next.⁶

⁴ https://huggingface.co/bert-base-multilingual-uncased
⁵ https://github.com/huggingface/transformers
⁶ See Supplementary for the details on fine-tuning.

We use an 80%-20% train-test data split and tune hyperparameters via 5-fold cross-validation. Hyperparameters in the already pretrained transformer models are kept fixed. F1, Matthews Correlation Coefficient (MCC; Boughorbel et al., 2017), and accuracy are used as evaluation metrics. MCC (∈ [−1, 1]) accounts for class imbalance and incorporates all correlations within the confusion matrix (Chicco and Jurman, 2020). Accuracy may be misleading in highly imbalanced datasets because a simple classification of all instances to the majority class has a high accuracy. However, for consistency with prior work, we report all three evaluation metrics in this paper. All the reported results are averaged over at least 3 independently trained models initialised with the same hyperparameter values. For tokenisation, unless the tokeniser is pre-specified for the model, we use word_tokenize from nltk.tokenize.punkt⁷ for English and German languages, and MeCab⁸ as the morphological analyser for Japanese.

⁷ https://www.nltk.org/api/nltk.tokenize.html
⁸ https://pypi.org/project/mecab-python3/

4.1 Selection Bias due to Clue Phrases

To evaluate the effectiveness of clue phrases for selecting sentences for human annotation and any selection bias due to this process, we fine-tune mBERT with and without masking the clue phrases. Classification performance values are shown in Table 4. Overall, we see that no mask (training without masking) returns slightly better performance than mask (training with masking); however, the differences are not statistically significant. This is reassuring because it shows that the sentence embeddings produced by mBERT generalise well beyond the clue phrases used to select sentences for manual annotation. On the other hand, if a CFD model had simply memorised the clue phrases and was classifying based on the occurrences of the clue phrases in a sentence, we would expect a drop in classification performance in the no mask setting due to overfitting to the clue phrases that are not observed in the test data. Indeed for EN, where all sentences contain clue phrases, we see a slight drop in all evaluation measures for no mask relative to mask, which we believe is due to this overfitting effect. The performance on JP is the lowest among all languages compared. This could be attributed to the tokenisation issues and lack of Japanese coverage in mBERT. Many counterfactual clues in Japanese are parts of verb/adjective inflections, which can get split/removed during the tokenisation.

Table 5 shows recall (R) and precision (P) on
masked (subscript m) and non-masked (subscript nm) settings. In all datasets the recall is higher than precision for both masked and non-masked versions due to dataset imbalance with an underrepresented positive class. The number of positive examples misclassified under masked and non-masked settings is typically very small. We see that the CFD model trained on EN-ext has a higher recall, but lower precision than the one on EN. Most of the added examples in EN-ext are negatives, which makes it hard to maintain a high precision.

Dataset   Mask      F1     MCC    Acc
EN        mask      0.92   0.76   0.92
          no mask   0.89   0.73   0.89
EN-ext    mask      0.93   0.69   0.93
          no mask   0.94   0.74   0.94
DE        mask      0.86   0.68   0.86
          no mask   0.90   0.79   0.90
JP        mask      0.86   0.48   0.84
          no mask   0.85   0.49   0.82

Table 4: F1, MCC and Accuracy (Acc) for mBERT CFD models trained with and without masking the clue phrases.

Metric   EN     EN-ext   DE     JP
Rnm      0.93   0.94     0.92   0.85
Pnm      0.71   0.59     0.94   0.30
Rm       0.87   0.79     0.86   0.88
Pm       0.68   0.66     0.93   0.37

Table 5: Precision and Recall for mBERT trained with (m) and without (nm) masking the clue phrases.

4.2 Cross-Dataset Adaptation

To study the compatibility of our dataset with existing datasets, we train a CFD model on one dataset and test the trained model on a different dataset. Prior work on domain adaptation (Ben-David et al., 2009) has shown that the classification accuracy of such a cross-domain classifier is upper-bounded by the similarity between the train and test datasets. Further, we merge our EN-ext dataset with the SemEval dataset (Yang et al., 2020a) to create a dataset denoted by Comb. Specifically, we separately pool the counterfactual and non-counterfactual instances in each dataset to create Comb. As can be seen from Table 6, the models trained on EN and EN-ext perform poorly on SemEval, while the model trained on SemEval has relatively high values of F1, MCC, and Accuracy on EN and EN-ext. This implies that the product reviews we use cover a narrow subdomain compared to the domains in SemEval. Interestingly, the CFD model trained on Comb reports the best performance across all measures, indicating that our dataset is compatible with SemEval and can be used in conjunction with existing datasets to train better CFD models.

Train     Test      F1     MCC    Acc
EN        EN        0.89   0.73   0.89
          EN-ext    0.96   0.85   0.96
          SemEval   0.65   0.28   0.59
          Comb      0.68   0.31   0.62
EN-ext    EN        0.92   0.80   0.92
          EN-ext    0.94   0.74   0.94
          SemEval   0.50   0.19   0.42
          Comb      0.49   0.19   0.42
SemEval   EN        0.82   0.56   0.80
          EN-ext    0.86   0.48   0.83
          SemEval   0.93   0.71   0.92
          Comb      0.96   0.84   0.96
Comb      EN        0.95   0.86   0.95
          EN-ext    0.94   0.72   0.94
          SemEval   0.93   0.70   0.92
          Comb      0.96   0.84   0.96

Table 6: Classification quality of mBERT, combining datasets for training and evaluation.

4.3 Cross-Lingual Transfer via MT

Considering the costs involved in manually annotating counterfactual statements for each language, a frugal alternative would be to train a model for English and then apply it to test sentences in a target language of interest, which are translated into English using a machine translation (MT) system. To evaluate this possibility, we first translate the German and Japanese CFD datasets into English (denoted respectively by DE-EN and JP-EN) using Amazon MT.⁹ Next, we train separate English CFD models using the EN, EN-ext and SemEval datasets, and apply those models on DE-EN and JP-EN.

⁹ https://aws.amazon.com/translate/

As shown in Table 7, the MCC values for the MT-based CFD models are significantly lower than those for the corresponding in-language baseline, which is trained using the target language data. Therefore, simply applying MT on test data is not an alternative to annotating counterfactual datasets from scratch for a novel target language. This result shows the importance of developing counterfactual datasets for languages other than English, which has not been done prior to this work. Moreover,
Train      Test      mBERT F1   mBERT MCC   mBERT Acc
EN         DE-EN     0.65       0.41        0.64
EN-ext     DE-EN     0.73       0.49        0.72
SemEval    DE-EN     0.58       0.35        0.58
DE         DE        0.90       0.79        0.90
EN         JP-EN     0.80       0.26        0.78
EN-ext     JP-EN     0.80       0.28        0.76
SemEval    JP-EN     0.86       0.22        0.86
JP         JP        0.85       0.49        0.82

Table 7: Classification quality of English translations.

the performance for German, which belongs to the same Germanic language group as English, is better than for Japanese. The model trained on SemEval performs the worst on the DE-EN dataset and has the lowest MCC on JP-EN. This experimental result indicates the importance of introducing new languages to the counterfactual dataset family.

4.4 Sentence Encoders and Classifiers

We evaluate the effect of the sentence encoding and binary classification methods on the performance of CFD using multiple settings.

Bag-of-N-grams (BoN): We represent a sentence using tf-idf weighted unigrams and bigrams and ignore n-grams with a frequency less than 2 or more than 95% of the frequency distribution. Next, Principal Component Analysis (PCA; Wold et al., 1987) is used to create 600-dimensional sentence embeddings.

Word Embeddings (WE): We average the 300-dimensional fastText embeddings trained on Common Crawl and Wikipedia¹⁰ for the words in a sentence to create its sentence embedding. We note that meta-embedding methods (Bollegala and Bao, 2018; Bollegala et al., 2018) have been proposed to combine multiple word embeddings to further improve their accuracy. However, their consideration for CFD is beyond the scope of the current work.

BoN and WE representations are used to train binary CFD models using different classification methods: a Support Vector Machine (SVM; Cortes and Vapnik, 1995) with a radial basis function kernel, an ID3 Decision Tree (DT; Breiman et al., 1984), and a Random Forest (RF; Breiman, 2001) with 20 trees.

Pretrained Language Models: Along with mBERT, we fine-tune a linear layer for the CFD task on top of the following two pretrained transformer models: the XLM model (Conneau and Lample, 2019)¹¹ and the base XLM-RoBERTa model (Conneau et al., 2020).¹² Both models were trained on the masked language modelling task for 100 languages.

¹⁰ https://fasttext.cc/docs/en/crawl-vectors.html
¹¹ https://huggingface.co/xlm-mlm-100-1280
¹² https://huggingface.co/xlm-roberta-base

Method         Mask      EN     EN-ext   DE     JP
mBERT          mask      0.76   0.69     0.68   0.48
               no mask   0.73   0.74     0.79   0.49
XLM-RoBERTa    mask      0.75   0.68     0.59   0.42
               no mask   0.79   0.76     0.80   0.38
XLM-w/o-Emb    mask      0.71   0.64     0.67   0.47
               no mask   0.76   0.70     0.79   0.47
SVM (BoN)      mask      0.50   0.44     0.47   0.58
               no mask   0.74   0.70     0.76   0.58
DT (BoN)       mask      0.36   0.28     0.37   0.43
               no mask   0.64   0.58     0.70   0.48
RF (BoN)       mask      0.16   0.11     0.20   0.14
               no mask   0.40   0.34     0.60   0.11
SVM (WE)       mask      0.42   0.32     0.40   0.49
               no mask   0.56   0.49     0.67   0.49
DT (WE)        mask      0.23   0.25     0.28   0.42
               no mask   0.37   0.37     0.56   0.40
RF (WE)        mask      0.20   0.08     0.17   0.16
               no mask   0.26   0.14     0.39   0.14

Table 8: MCC for the different CFD Models.

Results: Here we extend our experiment with clue word masking. For the transformer-based models we mask the clue words in the same way as for mBERT. For the traditional ML methods we remove the clue words from the sentences before tokenization.

The results with and without masking are reported in Table 8 (F1 and Accuracy are reported in the Supplementary). First, we note that masking decreases the performance of the classifiers in most settings. Transformer-based classifiers are the least affected by masking: they are able to learn semantic dependencies from the remaining text. This also suggests that transformers are the least affected by the data-selection bias, as they rely less on the clue words. Traditional ML methods with BoN features are affected by masking the most: they seem to use clue words for discrimination. Interestingly, for these methods the performance drops by a similar amount for the clue-based EN and the enriched EN-ext datasets. This could indicate that in both cases the classifier relies on the clue words.

Overall, transformer-based models (especially XLM-RoBERTa) perform the best across all datasets except for JP. For JP the best performance is obtained by an SVM model with BoN features. This could indicate that for Japanese, a language-specific tokenisation works better for the lexicalised (BoN) models than the language-independent subtokenisation methods such as Byte Pair Encoding (BPE; Sennrich et al., 2016) that are used when training contextualised transformer-based sentence encoders. The former preserves more information than the latter at the expense of a sparser and larger feature space (Bollegala et al., 2020). Transformer-based masked language models, on the other hand, require subtokenisation as they must use a smaller vocabulary to make the token prediction task efficient (Yang et al., 2018; Li et al., 2019).

In general, unlike the simpler word embedding and bag-of-words approaches, large pretrained contextualized embeddings maintain high test performance according to the reported evaluation metrics. We note that these also converged after a few epochs using a relatively small number of labelled instances, based on the model with the best 5-fold validation accuracy. Hence, contextualized embeddings can identify various context-dependent counterfactuals from a diverse range of reviews using a small number of mini-batch gradient updates of a single linear layer. Among the different sentence embedding methods compared, the best performance is reported by XLM-RoBERTa.

Between the two baselines, we see that using word embeddings to represent the sentences does not offer clear benefits for traditional ML methods and BoN features are sufficient. However, embedding-based methods generally suffer a smaller performance drop when clues are masked. This suggests that embeddings provide a more general and robust representation of counterfactuals in the semantic space than BoN features.

5 Conclusion

We annotated a multilingual counterfactual dataset using Amazon product reviews for English, German and Japanese. Experimental results show that our English dataset is compatible with the previously proposed SemEval-2020 Task 5 dataset. Moreover, the CFD models trained using our dataset are relatively robust against selection bias due to clue phrases. Simply applying MT on test data results in poor cross-lingual classification performance, indicating the need for language-specific CFD datasets.

6 Ethical Considerations

In this work, we annotated a multilingual dataset covering counterfactual statements. Moreover, we train CFD models using different sentence representation methods and binary classification algorithms. In this section, we discuss the ethical considerations related to these contributions.

With regard to the dataset being released, all sentences that are included in the dataset were selected from a publicly available Amazon product review dataset. In particular, we do not collect or release any additional product reviews as part of this paper. Moreover, we have manually verified that the sentences in our dataset do not contain any customer sensitive information. However, product reviews do often contain subjective opinions, which can sometimes be socially biased. We do not filter out any such biases.

We use two pretrained sentence encoders, mBERT and XLM-RoBERTa, when training the CFD models. It has been reported that pretrained masked language models encode unfair social biases such as gender, racial and religious biases (Bommasani et al., 2020). Although we have ourselves evaluated the mBERT and XLM-RoBERTa based CFD models that we use in our experiments, we suspect any social biases encoded in these pretrained masked language models could propagate into the CFD models that we train. In particular, these social biases could be further amplified during the CFD model training process if the counterfactual statements in the training data also contain such biases. Debiasing masked language models is an active research field (Kaneko and Bollegala, 2021) and we plan to evaluate the social biases in CFD models in our future work.

References

Yang Bai and Xiaobing Zhou. 2020. Byteam at semeval-2020 task 5: Detecting counterfactual statements with bert and ensembles. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 640–644.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2009. A theory of learning from different domains. Machine Learning, 79:151–175.
Danushka Bollegala and Cong Bao. 2018. Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Danushka Bollegala, Kohei Hayashi, and Ken-ichi Kawarabayashi. 2018. Think globally, embed locally — locally linear meta-embedding of words. In Proc. of IJCAI-EACI, pages 3970–3976.

Danushka Bollegala, Ryuichi Kiryo, Kosuke Tsujino, and Haruki Yukawa. 2020. Language-independent tokenisation rivals language-specific tokenisation for word similarity prediction. In Proc. of LREC.

Rishi Bommasani, Kelly Davis, and Claire Cardie. 2020. Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4758–4781, Online. Association for Computational Linguistics.

Sabri Boughorbel, Fethi Jarray, and Mohammed El-Anbari. 2017. Optimal classifier for imbalanced data using matthews correlation coefficient metric. PloS one, 12(6):e0177678.

Leo Breiman. 2001. Random forests. Machine learning, 45(1):5–32.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and regression trees. CRC press.

Ruth MJ Byrne. 2019. Counterfactuals in explainable artificial intelligence (xai): evidence from human reasoning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6276–6282.

Weilong Chen, Yan Zhuang, Peng Wang, Feng Hong, Yan Wang, and Yanru Zhang. 2020. Ferryman at semeval-2020 task 5: Optimized bert for detecting counterfactuals. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 653–657.

Davide Chicco and Giuseppe Jurman. 2020. The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC genomics, 21(1):6.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning, 20(3):273–297.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Xiao Ding, Dingkui Hao, Yuewei Zhang, Kuo Liao, Zhongyang Li, Bing Qin, and Ting Liu. 2020. Hitscir at semeval-2020 task 5: Training pre-trained language model with pseudo-labeling data for counterfactuals detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 354–360.

Martin Fajcik, Josef Jon, Martin Docekal, and Pavel Smrz. 2020. BUT-FIT at SemEval-2020 task 5: Automatic detection of counterfactual statements with deep pre-trained language representation models. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 437–444, Barcelona (online). International Committee for Computational Linguistics.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.

Corrado Gini. 1912. Variabilità e mutabilità (variability and mutability). Tipografia di Paolo Cuppini, Bologna, Italy, page 156.

M Höfler. 2005. Causal inference based on counterfactuals. BMC medical research methodology, 5(1):28.

Michela Ippolito. 2013. Counterfactuals and conditional questions under discussion. In Semantics and Linguistic Theory, volume 23, pages 194–211.

Wesley M. Jacobsen. 2011. The interrelationship of time and realis in japanese – in search of the semantic roots of hypothetical meaning. NINJAL project review, 1(5).

Anthony Janocko, Allegra Larche, Joseph Raso, and Kevin Zembroski. 2016. Counterfactuals in the language of social media: A natural language processing project in conjunction with the world well being project. Technical report, University of Pennsylvania.

Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual evaluation and learning for search, recommendation and ad placement. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 1199–1201.

Masahiro Kaneko and Danushka Bollegala. 2021. Debiasing pre-trained contextualised embeddings. In Proc. of the 16th European Chapter of the Association for Computational Linguistics (EACL).

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.

Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.

Stefan Larson, Anthony Zheng, Anish Mahendran, Rishi Tekriwal, Adrian Cheung, Eric Guldan, Kevin Leach, and Jonathan K Kummerfeld. 2020. Iterative feature mining for constraint-based data collection to increase data diversity and model robustness. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8097–8106.

David Lewis. 2013. Counterfactuals. John Wiley & Sons.

Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Efficient contextual representation learning with continuous outputs. Transactions of the Association for Computational Linguistics, 7:611–624.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Yaojie Lu, Annan Li, Hongyu Lin, Xianpei Han, and Le Sun. 2020. Iscas at semeval-2020 task 5: Pre-trained transformers for counterfactual statement modeling. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 658–663.

Keith D Markman, Matthew J Lindberg, Laura J Kray, and Adam D Galinsky. 2007. Implications of counterfactual structure for creative generation and analytical problem solving. Personality and Social Psychology Bulletin, 33(3):312–324.

Bella K Milmed. 1957. Counterfactual statements and logical modality. Mind, 66(264):453–470.

Anirudh Anil Ojha, Rohin Garg, Shashank Gupta, and Ashutosh Modi. 2020. Iitk-rsa at semeval-2020 task 5: Detecting counterfactuals. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 458–467.

Jeffery Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: global vectors for word representation. In Proc. of EMNLP, pages 1532–1543.

John Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74.

Willard Van Orman Quine. 1982. Methods of logic. Harvard University Press.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Neal J Roese. 1997. Counterfactual thinking. Psychological bulletin, 121(1):133.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Youngseo Son, Anneke Buffone, Joe Raso, Allegra Larche, Anthony Janocko, Kevin Zembroski, H Andrew Schwartz, and Lyle Ungar. 2017. Recognizing counterfactual thinking in social media texts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 654–658.

Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52.

Zach Wood-Doughty, Ilya Shpitser, and Mark Dredze. 2018. Challenges of using text classifiers for causal inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2018, page 4586. NIH Public Access.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Len Yabloko. 2020. Ethan at semeval-2020 task 5: Modelling causal reasoning in language using neuro-symbolic cloud computing. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 645–652.

Xiaoyu Yang, Stephen Obadinma, Huasha Zhao, Qiong Zhang, Stan Matwin, and Xiaodan Zhu. 2020a. SemEval-2020 task 5: Counterfactual recognition. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 322–335, Barcelona (online). International Committee for Computational Linguistics.

Xiaoyu Yang, Stephen Obadinma, Huasha Zhao, Qiong Zhang, Stan Matwin, and Xiaodan Zhu. 2020b. SemEval-2020 Task 5: Counterfactual Recognition.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764.

Supplementary Materials

A Fine-tuned multilingual BERT for counterfactual classification

Given that we select mBERT (Devlin et al., 2019) as the main classification method in the paper, we describe how the original BERT architecture is adapted and fine-tuned for CF classification.

Consider a dataset $D = \{(X_i, y_i)\}_{i=1}^{m}$ for $D \in \mathcal{D}$ and a sample $s := (X, y)$ where the sentence $X := (x_1, \ldots, x_n)$ with $n$ being the number of words $x \in X$. We can represent a word as an input embedding $x_w \in \mathbb{R}^{d}$, which has a corresponding target vector $y$. In the pre-trained transformer models we use, $X_i$ is represented by 3 types of embeddings: word embeddings ($X_w \in \mathbb{R}^{n \times d}$), segment embeddings ($X_s \in \mathbb{R}^{n \times d}$) and position embeddings ($X_p \in \mathbb{R}^{n \times d}$), where $d$ is the dimensionality of each embedding matrix. The self-attention block in a transformer mainly consists of three sets of parameters: the query parameters $Q \in \mathbb{R}^{d \times l}$, the key parameters $K \in \mathbb{R}^{d \times l}$ and the value parameters $V \in \mathbb{R}^{d \times o}$. For 12 attention heads (as in BERT-base), we express the forward pass as follows:

\[ \overrightarrow{X} = X_w + X_s + X_p \tag{1} \]

\[ \overrightarrow{Z} := \bigoplus_{i=1}^{12} \mathrm{softmax}\big( \overrightarrow{X} Q_{(i)} K_{(i)}^{T} \overrightarrow{X}^{T} \big) \overrightarrow{X} V_{(i)} \tag{2} \]

\[ \overrightarrow{Z} = \mathrm{Feedforward}(\mathrm{LayerNorm}(\overrightarrow{Z} + \overrightarrow{X})) \tag{3} \]

\[ \overleftarrow{Z} = \mathrm{Feedforward}(\mathrm{LayerNorm}(\overleftarrow{Z} + \overleftarrow{X})) \tag{4} \]

The last hidden representations of both directions are then concatenated, $Z' := \overleftarrow{Z} \oplus \overrightarrow{Z}$, and projected using a final linear layer $W \in \mathbb{R}^{d}$ followed by a sigmoid function $\sigma(\cdot)$ to produce a probability estimate $\hat{y}$, as shown in (5). As in the original BERT paper, WordPiece embeddings (Wu et al., 2016) are used with a vocabulary size of 30,000. Words from (step-3) that are used for filtering the sentences are masked using a [PAD] token to ensure the model does not simply learn to correctly classify some samples based on the association of these tokens with counterfactuals. A linear layer is then fine-tuned on top of the hidden state $h_{X,[CLS]}$ emitted for the [CLS] token. This fine-tunable linear layer is then used to predict whether the sentence is counterfactual or not, as shown in Equation 5, where $B \subset D$ is a mini-batch and $L_{ce}$ is the cross-entropy loss.

\[ L_{ce} := \frac{1}{|B|} \sum_{(X,y) \in B} y \log \sigma(h_{X,[CLS]} \cdot W) \tag{5} \]

Configurations: For the mBERT counterfactual model we use BERT-base, which uses 12 Transformer blocks and 12 self-attention heads with a hidden size of 768. The default size of 512 is used for the sentence length and the sentence representation is taken as the final hidden state of the first [CLS] token. This model is already pre-trained and we fine-tune a linear layer $W$ on top of BERT, whose output is fed through a sigmoid function $\sigma$ as $p(c|h) = \sigma(Wh)$, where $c$ is the binary class label, and we maximize the log-probability of correctly predicting the ground truth label.

B Matthews Correlation Coefficient

Unlike metrics such as F1, MCC accounts for class imbalance and incorporates all correlations within the confusion matrix (Chicco and Jurman, 2020). For MCC, the range is [-1, 1] where 1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.

\[ \mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}} \tag{6} \]

C Extended version of Table 8

We report F1, MCC, and accuracy in Table 9.

D Examples of Incorrect Predictions

Table 10 shows examples of misclassifications given by transformer models. The second column indicates which of the remaining transformer models misclassified each review, where B=mBERT, XR=XLM-RoBERTa, X=XLM without embedding.

E Hardware Used

All transformer, RNN and CNN models were trained using an NVIDIA GeForce GTX 1070 GPU, which has 8GB of GDDR5 memory.

F Model Configuration and Hyperparameter Settings

BERT-base uses 12 Transformer blocks and 12 self-attention heads with a hidden size of 768. The default size of 512 is used for the sentence length and the sentence representation is taken as the final hidden state of the first [CLS] token. A fine-tuned linear layer $W$ is used on top of BERT-base, whose output is fed through a sigmoid function $\sigma$ as $p(c|h) = \sigma(Wh)$, where $c$ is used to calibrate the class probability estimate, and we maximize the log-probability of correctly predicting the ground truth label.

Table 11 shows the pretrained model configurations that were already predefined before our experiments. The number of (Num.) hidden groups is the number of groups for the hidden layers, where parameters in the same group are shared. The intermediate size is the dimensionality of the feed-forward layers of the Transformer encoder. The ‘Max Position Embeddings’ is the maximum sequence length that the model can deal with.

We now detail the hyperparameter settings for transformer models and the baselines. We note that all hyperparameter settings were selected using a manual search over development data.

F.1 Transformer Model Hyperparameters

We did not change the original hyperparameter settings that were used for the original pre-training of each transformer model. The hyperparameter settings for these pretrained models can be found in the class arguments python documentation in each configuration python file in https://github.com/huggingface/transformers/blob/master/src/transformers/ (e.g. configuration .py) and are also summarized in Table 11.

For fine-tuning transformer models, we manually tested different combinations of a subset of hyperparameters including the learning rates $\{50^{-4}, 10^{-5}, 50^{-5}\}$, batch sizes $\{16, 32, 128\}$, warmup proportion $\{0, 0.1\}$ and which is a hyperparameter in the adaptive moment estimation (Adam) optimizer. Please refer to the huggingface documentation at https://github.com/huggingface/transformers for further details on each specific model, e.g. at https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py, and also for the details of the architecture of the BertForSequenceClassification pytorch class that is used for our sentence classification, and likewise for the remaining models.

Fine-tuning all language models with a sentence classifier took less than two and a half hours for all models. For example, for the largest transformer model we used, BERT, the estimated average runtime for a full epoch with batch size 16 (of 2,682 training samples) is 184.13 seconds. In the worst case, if the model does not converge early and all 50 training epochs are carried out, training lasts for about 2 hours and 30 minutes.

F.2 Baseline Hyperparameters

SVM Classifier: A radial basis function was used as the nonlinear kernel, tested with $\ell_2$ regularization term settings of $C = \{0.01, 0.1, 1\}$, while the kernel coefficient $\gamma$ is autotuned by the scikit-learn python package and class weights are set inversely proportional to the number of samples in each class. To calibrate probability estimates for AUC scores, we use Platt’s scaling (Platt et al., 1999).

Decision Tree and Random Forest Classifiers: We use 20 decision tree classifiers with no restriction on tree depth, and the minimum number of samples required to split an internal node is set to 2. The criterion for splitting nodes is the Gini impurity (Gini, 1912).
