Re-Evaluating GermEval17 Using German Pre-Trained Language Models

Matthias Aßenmacher¹, Alessandra Corvonato¹, Christian Heumann¹
¹ Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany
{matthias,chris},

arXiv:2102.12330v1 [cs.CL] 24 Feb 2021

Abstract

The lack of a commonly used benchmark data set (collection) such as (Super-)GLUE (Wang et al., 2018, 2019) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP research. It concentrates a large part of the research on English, neglecting the uncertainty when transferring conclusions found for the English language to other languages. We evaluate the performance of the German and multilingual BERT-based models currently available via the huggingface transformers library on the four tasks of the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017; Schmitt et al., 2018; Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for similar tasks and similar models (pre-BERT vs. BERT-based) for the English language in order to draw tentative conclusions about whether the observed improvements are transferable to German or potentially other related languages.

1 Introduction

Sentiment Analysis is used to transform reviews into helpful information on how a product or service of a company is perceived among the customers. Until recently, Sentiment Analysis was mainly conducted using traditional machine learning and recurrent neural networks, like LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Cho et al., 2014). Those models have been practically replaced with language models relying on (parts of) the Transformer architecture, a novel framework proposed by Vaswani et al. (2017). Devlin et al. (2019) developed a Transformer-encoder-based language model called BERT (Bidirectional Encoder Representations from Transformers), achieving state-of-the-art (SOTA) performance on several benchmark tasks - mainly for the English language - and becoming a milestone in the field of NLP.

Up to now, only a few researchers have focused on sentiment-related problems for German reviews. The first shared task on German Sentiment Analysis, which provides a large annotated data set for training and evaluation, is the GermEval17 Shared Task (Wojatzki et al., 2017). The participating teams mostly analyzed the data using standard machine learning techniques such as SVMs, CRFs, or LSTMs. In contrast to 2017, different pre-trained BERT models are available today for a variety of languages, including German. We re-analyzed the complete GermEval17 Task using seven pre-trained BERT models suitable for German provided by the huggingface transformers library (Wolf et al., 2020). We will evaluate which one of the models is best suited for the different GermEval17 subtasks by comparing their performance values. Furthermore, we compare our findings on whether (and how much) BERT-based models are able to improve the pre-BERT SOTA in German Sentiment Analysis with the SOTA developments for English Sentiment Analysis by the example of SemEval-2014 (Pontiki et al., 2014).

We will first give an overview of the GermEval17 tasks (cf. Sec. 2) and of related work (cf. Sec. 3). Second, we present the data and the models (cf. Sec. 4), while Section 5 presents the results of our re-evaluation. Sections 6 and 7 conclude our work by stating our main findings and drawing parallels to the English language.
2 The GermEval17 Task(s)

The GermEval17 Shared Task (Wojatzki et al., 2017) is a task on analyzing aspect-based sentiments in customer reviews about "Deutsche Bahn" (DB) - the German public train company. The main data was crawled from various social media platforms such as Twitter, Facebook and Q&A websites from May 2015 to June 2016. The documents were manually annotated and split into a training (train), a development (dev) and a synchronic (test_syn) test set. A diachronic test set (test_dia) was collected the same way from November 2016 to January 2017 in order to test for robustness. The task comprises four subtasks representing a complete classification pipeline. Subtask A is a binary Relevance Classification task which aims at identifying whether the feedback refers to DB. Subtask B aims at classifying the Document-level Polarity ("negative", "positive" and "neutral"). In Subtask C, the model has to identify all the aspect categories with associated sentiment polarities in a relevant document. This multi-label classification task was divided into Subtask C1 (Aspect-only) and Subtask C2 (Aspect+Sentiment). For this purpose, the organizers defined 20 different aspect categories, e.g. Allgemein (General), Sonstige Unregelmäßigkeiten (Other irregularities). Finally, Subtask D refers to the Opinion Target Extraction (OTE), i.e. a sequence labeling task extracting the linguistic phrase used to express an opinion. We differentiate between exact match (Subtask D1) and overlapping match, tolerating errors of +/− one token (Subtask D2).

3 Related Work

Already before BERT, many researchers focused on (English) Sentiment Analysis (Behdenna et al., 2018). The most common architectures were traditional machine learning classifiers and recurrent neural networks (RNNs). SemEval14 (Task 4; Pontiki et al., 2014) was the first workshop to introduce Aspect-based Sentiment Analysis (ABSA), which was expanded within SemEval15 Task 12 (Pontiki et al., 2015) and SemEval16 Task 5 (Pontiki et al., 2016). Here, restaurant and laptop reviews were examined on different granularities. The best model at SemEval16 was an SVM/CRF architecture using GloVe embeddings (Pennington et al., 2014). However, many works recently focused on re-evaluating the SemEval Sentiment Analysis task using BERT-based language models (Hoang et al., 2019; Xu et al., 2019; Sun et al., 2019; Li et al., 2019; Karimi et al., 2020; Tao and Fang, 2020).

In comparison, little research deals with German Sentiment Analysis. For instance, Barriere and Balahur (2020) trained a multilingual BERT model for German Document-level Sentiment Analysis on the SB-10k data set (Cieliebak et al., 2017). Regarding the GermEval17 Subtask B, Guhr et al. (2020) considered both FastText (Bojanowski et al., 2017) and BERT, achieving notable improvements. Biesialska et al. (2020) made use of ensemble models. One is an ensemble of ELMo (Peters et al., 2018), GloVe and a bi-attentive classification network (BCN; McCann et al., 2017), achieving a score of 0.782, and the other one consists of ELMo and a Transformer-based Sentiment Analysis model (TSA), reaching a score of 0.789 for the synchronic test data set. Moreover, Attia et al. (2018) trained a convolutional neural network (CNN), achieving a score of 0.7545 on the synchronic test data set. Schmitt et al. (2018) advanced the SOTA for Subtask C. They experimented with biLSTMs and CNNs to carry out end-to-end Aspect-based Sentiment Analysis. The highest score was achieved using an end-to-end CNN architecture with FastText embeddings, scoring 0.523 and 0.557 on the synchronic and diachronic test data set for Subtask C1, respectively, and 0.423 and 0.465 for Subtask C2.

4 Materials and Methods

Data The GermEval17 data is freely available in .xml- and .tsv-format¹. Each data split (train, validation, test) in .tsv-format contains the following variables:

• document id (URL)
• document text
• relevance label (true, false)
• document-level sentiment label (negative, neutral, positive)
• aspects with respective polarities (e.g. Ticketkauf#Haupt:negative)

¹ The data sets (in both formats) can be obtained from
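The aspect variable listed above can be unpacked into the label sets needed for Subtasks C1 and C2. The following is a minimal sketch, not the authors' preprocessing: it assumes that several annotations in one field are separated by whitespace, that category names contain no whitespace inside the field, and that each annotation has the form Category#Subcategory:polarity as in the example Ticketkauf#Haupt:negative. The function name parse_aspects is hypothetical.

```python
def parse_aspects(field):
    """Split an aspect field into Subtask C1 and C2 label sets.

    Assumes whitespace-separated entries shaped like
    'Category#Subcategory:polarity' (format inferred from the example
    'Ticketkauf#Haupt:negative'); this is an illustrative helper only.
    """
    c1_labels, c2_labels = set(), set()
    for entry in field.split():
        # Separate the polarity suffix from the aspect part.
        aspect, _, polarity = entry.rpartition(":")
        if not aspect:
            continue  # skip entries without a polarity suffix
        # Keep only the main category in front of '#'.
        category = aspect.split("#", 1)[0]
        c1_labels.add(category)                  # Aspect-only label
        c2_labels.add(f"{category}:{polarity}")  # Aspect+Sentiment label
    return c1_labels, c2_labels


c1, c2 = parse_aspects("Ticketkauf#Haupt:negative Ticketkauf#Haupt:negative")
print(sorted(c1))  # ['Ticketkauf']
print(sorted(c2))  # ['Ticketkauf:negative']
```

Using sets means that a category mentioned several times in one document is counted only once, which is one simple way to obtain document-level labels.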
For documents which are annotated as irrelevant, the sentiment label is set to neutral and no aspects are available. Visibly, the .tsv-formatted data does not contain the target expressions or their associated sequence positions. Consequently, Subtask D can only be conducted using the data in .xml-format, which additionally holds the information on the starting and ending sequence positions of the target phrases.

The data set comprises ∼26k documents in total, including the diachronic test set with around 1.8k examples. Further, the main data was randomly split by the organizers into a train data set for training, a development data set for evaluation and a synchronic test data set. Table 1 displays the number of documents for each split.

train     dev      test_syn   test_dia
19,432    2,369    2,566      1,842

Table 1: Number of documents per split of the data set.

While roughly 74% of the documents form the train set, the development split and the synchronic test split contain around 9% and around 10%, respectively. The remaining 7% of the data belong to the diachronic set (cf. Tab. 1). Table 2 shows the relevance distribution per data split. This unveils a pretty skewed distribution of the labels since the relevant documents represent the clear majority with over 80% in each split.

Relevance   train     dev      test_syn   test_dia
true        16,201    1,931    2,095      1,547
false        3,231      438      471        295

Table 2: Relevance distribution for Subtask A.

The distribution of the sentiments is depicted in Table 3, which shows that between 65% and 69% (per split) belong to the neutral class, 25–31% to the negative and only 4–6% to the positive class.

Sentiment   train     dev      test_syn   test_dia
negative     5,045      589      780        497
neutral     13,208    1,632    1,681      1,237
positive     1,179      148      105        108

Table 3: Sentiment distribution for Subtask B.

Table 4 holds the distribution of the 20 different aspect categories assigned to the documents². It shows the number of documents containing certain categories without differentiating between how often a category appears within a given document. The relative distribution of the aspect categories is similar between the splits. On average, there are ∼1.12 different aspects per document. Again, the label distribution is heavily skewed, with Allgemein (General) clearly representing the majority class, as it is present in 75.8% of the documents with aspects. The second most frequent category is Zugfahrt (Train ride), appearing in around 13.8% of the documents. This strong imbalance in the aspect categories leads to an almost Zipfian distribution (Wojatzki et al., 2017).

Category                        train     dev      test_syn   test_dia
Allgemein                       11,454    1,391    1,398      1,024
Zugfahrt                         1,687      177      241        184
Sonstige Unregelmäßigkeiten      1,277      139      224        164
Atmosphäre                         990      128      148         53
Ticketkauf                         540       64       95         48
Service und Kundenbetreuung        447       42       63         27
Sicherheit                         405       59       84         42
Informationen                      306       28       58         35
Connectivity                       250       22       36         73
Auslastung und Platzangebot        231       25       35         20
DB App und Website                 175       20       28         18
Komfort und Ausstattung            125       18       24         11
Barrierefreiheit                    53       14        9          2
Image                               42        6        0          3
Toiletten                           41        5        7          4
Gastronomisches Angebot             38        2        3          3
Reisen mit Kindern                  35        3        7          2
Design                              29        3        4          2
Gepäck                              12        2        2          6
QR-Code                              0        1        1          0
total                           18,137    2,149    2,467      1,721
# documents with aspects        16,200    1,930    2,095      1,547
∅ different aspects/document      1.12     1.11     1.18       1.11

Table 4: Aspect category distribution for Subtask C. Multiple mentions of the same aspect category in a document are only considered once.

Pre-trained architectures Since we were interested in examining the performance differences between pre-BERT architectures and current SOTA models, we took pre-trained German BERT-based architectures available via the huggingface transformers module (Wolf et al., 2020). All the models we selected for examination are displayed in Table 5.

model variant                        corpus                                   properties
bert-base-german-cased               12GB of German text                      L=12, H=768, A=12, 110M parameters
bert-base-german-dbmdz-cased         16GB of German text (dbmdz)              L=12, H=768, A=12, 110M parameters
bert-base-german-dbmdz-uncased       16GB of German text (dbmdz)              L=12, H=768, A=12, 110M parameters
bert-base-multilingual-cased         Largest Wikipedias (top 104 languages)   L=12, H=768, A=12, 179M parameters
bert-base-multilingual-uncased       Largest Wikipedias (top 102 languages)   L=12, H=768, A=12, 168M parameters
distilbert-base-german-cased         16GB of German text (dbmdz)              L=6, H=768, A=12, 66M parameters
distilbert-base-multilingual-cased   Largest Wikipedias (top 104 languages)   L=6, H=768, A=12, 134M parameters

Table 5: Pre-trained models provided by huggingface transformers (version 4.0.1) suitable for German. For all available models, see:

We include three German (Distil)BERT models by DBMDZ³ and one by Deepset.ai⁴. The latter one is pre-trained using German Wikipedia (6GB raw text files), the Open Legal Data dump (2.4GB; Ostendorff et al., 2020) and news articles (3.6GB). DBMDZ combine Wikipedia, EU Bookshop (Skadiņš et al., 2014), Open Subtitles (Lison and Tiedemann, 2016), CommonCrawl (Ortiz Suárez et al., 2019), ParaCrawl (Esplà-Gomis et al., 2019) and News Crawl (Haddow, 2018) to a corpus with a total size of 16GB with ∼2,350M tokens. Besides this, we use the three multilingual models included in the transformers module. This amounts to five BERT and two DistilBERT models, two of which are "uncased" (i.e. lower-cased) while the other five models are "cased" ones.

5 Results

For the re-evaluation, we used the latest data provided in .xml-format. Duplicates were not removed, in order to make our results as comparable as possible. We tokenized the documents and fixed single spelling mistakes in the labels⁵. For Subtask D, the BIO-tags were added based on the provided sequence positions, i.e. one entity corresponds to at least one token tag starting with B- for "Beginning" and continuing with I- for "Inner". If a token does not belong to any entity, the tag O for "Outer" is assigned. For instance, the sequence "fährt nicht" (engl. "does not run") consists of two tokens and would receive the entity Zugfahrt:negative and the token tags [B-Zugfahrt:negative, I-Zugfahrt:negative] if it refers to a DB train which is not running.

The models were fine-tuned on one Tesla V100 PCIe 16GB GPU using Python 3.8.7. Moreover, the transformers module (version 4.0.1) and torch (version 1.7.1) were used⁶. The considered values for the hyperparameters for fine-tuning follow the recommendations of Devlin et al. (2019)⁷:

• Batch size ∈ {16, 32},
• Adam learning rate ∈ {5e-5, 3e-5, 2e-5},
• # epochs ∈ {2, 3, 4}.

Finally, all the language models were fine-tuned with a learning rate of 5e-5 and four epochs. The maximum sequence length was set to 256 and a batch size of 32 was chosen.

Other models Eight teams officially participated in GermEval17, of which five teams analyzed Subtask A, all of them evaluated Subtask B, and two teams submitted results for both Subtask C and D. Since GermEval17 was conducted back in 2017, several other authors have also focused on analyzing parts of the subtasks. Table 6 shows which authors employed which kinds of models to solve which task. We added the system by Ruppert et al. (2017) to the models from 2017 even if they did not officially participate because the system was constructed and released by the organizers. They also tackled all the subtasks. The performance values of the models are added to our tables when suitable.

² Multiple annotations per document are possible; for a detailed category description see
³ MDZ Digital Library team at the Bavarian State Library. Visit for details and for their repository on pre-trained BERT models.
⁴ Visit for details.
⁵ "positve" in train set was replaced with "positive", " negative" in test_dia was replaced with "negative".
⁶ Source code is available on GitHub: germeval2017. The results are fully reproducible for Subtasks A, B and C. For Subtask D, the results should also be reproducible but for some reason they are not, and we could not yet fix the issue. However, the micro F1 scores only fluctuate between +/-0.01.
⁷ Due to memory limitations, not every hyperparameter combination was applicable.
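The BIO-tagging step described for Subtask D can be illustrated with a short sketch. It is not the authors' code: it assumes whitespace tokenization and character-level spans with an exclusive end offset, and the function name bio_tags is hypothetical. The example sentence extends the "fährt nicht" case from the text by one preceding token.

```python
import re


def bio_tags(text, spans):
    """Assign BIO tags to whitespace tokens from character-level spans.

    `spans` holds (start, end, label) triples with an exclusive end offset.
    Whitespace tokenization is a simplifying assumption for illustration.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default: token lies outside every entity
        for start, end, label in spans:
            # A token belongs to an entity if it overlaps the span.
            if match.start() < end and match.end() > start:
                # The first token of the span gets B-, later ones get I-.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = bio_tags("Zug fährt nicht", [(4, 15, "Zugfahrt:negative")])
print(list(zip(tokens, tags)))
# [('Zug', 'O'), ('fährt', 'B-Zugfahrt:negative'), ('nicht', 'I-Zugfahrt:negative')]
```

With a subword tokenizer such as the one used by BERT, the same idea applies, but the tags would have to be aligned to subword pieces instead of whitespace tokens.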
Model                                                       A   B   C1  C2  D1  D2
Models 2017 (Wojatzki et al., 2017; Ruppert et al., 2017)   X   X   X   X   X   X
Our BERT models                                             X   X   X   X   X   X
CNN (Attia et al., 2018)                                    –   X   –   –   –   –
CNN+FastText (Schmitt et al., 2018)                         –   –   X   X   –   –
ELMo+GloVe+BCN (Biesialska et al., 2020)                    –   X   –   –   –   –
ELMo+TSA (Biesialska et al., 2020)                          –   X   –   –   –   –
FastText (Guhr et al., 2020)                                –   X   –   –   –   –
bert-base-german-cased (Guhr et al., 2020)                  –   X   –   –   –   –

Table 6: Different subtasks which the models were (not) evaluated on.

Subtask A The Relevance Classification is a binary document classification task with classes true and false. Table 7 displays the micro F1 score obtained by each language model on each test set (best result per data set in bold).

Language model                           test_syn   test_dia
Best model 2017 (Sayyed et al., 2017)    0.903      0.906
bert-base-german-cased                   0.950      0.939
bert-base-german-dbmdz-cased             0.951      0.946
bert-base-german-dbmdz-uncased           0.957      0.948
bert-base-multilingual-cased             0.942      0.933
bert-base-multilingual-uncased           0.944      0.939
distilbert-base-german-cased             0.944      0.939
distilbert-base-multilingual-cased       0.941      0.932

Table 7: F1 scores for Subtask A on synchronic and diachronic test sets.

All the models outperform the best result achieved in 2017 for both test data sets. For the synchronic test set, the previous best result is surpassed by 3.8–5.4 percentage points. For the diachronic test set, the absolute difference to the best contender of 2017 varies between 2.6 and 4.2 percentage points. With a micro F1 score of 0.957 and 0.948, respectively, the best scoring pre-trained language model is the uncased German BERT-BASE variant by dbmdz, followed by its cased version. All the pre-trained models perform slightly better on the synchronic test data than on the diachronic data. Attia et al. (2018), Schmitt et al. (2018), Biesialska et al. (2020) and Guhr et al. (2020) did not evaluate their models on this task.

Subtask B Subtask B refers to the Document-level Polarity, which is a multi-class classification task with three classes. Table 8 shows the performances on the two test sets:

Language model                                                                       test_syn   test_dia
Best models 2017 (test_syn: Ruppert et al., 2017; test_dia: Sayyed et al., 2017)     0.767      0.750
bert-base-german-cased                                                               0.798      0.793
bert-base-german-dbmdz-cased                                                         0.799      0.785
bert-base-german-dbmdz-uncased                                                       0.807      0.800
bert-base-multilingual-cased                                                         0.790      0.780
bert-base-multilingual-uncased                                                       0.784      0.766
distilbert-base-german-cased                                                         0.798      0.776
distilbert-base-multilingual-cased                                                   0.777      0.770
CNN (Attia et al., 2018)                                                             0.755      –
ELMo+GloVe+BCN (Biesialska et al., 2020)                                             0.782      –
ELMo+TSA (Biesialska et al., 2020)                                                   0.789      –
FastText (Guhr et al., 2020)                                                         0.698†     –
bert-base-german-cased (Guhr et al., 2020)                                           0.789†     –

Table 8: Micro-averaged F1 scores for Subtask B on synchronic and diachronic test sets. † Guhr et al. (2020) created their own (balanced & unbalanced) data splits, which limits comparability. We compare to the performance on the unbalanced data since it more likely resembles the original data splits.

All models outperform the best model from 2017 by 1.0–4.0 percentage points for the synchronic, and by 1.6–5.0 percentage points for the diachronic test set. On the synchronic test set, the uncased German BERT-BASE model by dbmdz performs best with a score of 0.807, followed by its cased variant with 0.799. For the diachronic test set, the uncased German BERT-BASE model exceeds the other models with a score of 0.800, followed by the cased German BERT-BASE model reaching a score of 0.793. The three multilingual models perform generally worse than the German models on this task. Besides this, all the models perform slightly better on the synchronic data set than on the diachronic one. The FastText-based model (Guhr et al., 2020) does not even come close to the baseline from 2017, while the ELMo-based models (Biesialska et al., 2020) are pretty competitive. Interestingly, two of the multilingual models are even outperformed by these ELMo-based models.

Subtask C Subtask C is split into Aspect-only (Subtask C1) and Aspect+Sentiment Classification (Subtask C2), each being a multi-label classification task⁸. As the organizers provide 20 aspect categories, Subtask C1 includes 20 labels, whereas Subtask C2 comprises 60 labels since each aspect category can be combined with each of the three sentiments. Consistent with Lee et al. (2017) and Mishra et al. (2017), we do not account for multiple mentions of the same label in one document.

⁸ This leads to a change of activation functions in the final layer from softmax to sigmoid + binary cross entropy loss.

The results for Subtask C1 are shown in Table 9: All pre-trained German BERTs clearly surpass the best performance from 2017 as well as the results reported by Schmitt et al. (2018), who are the only ones of the other authors to evaluate their
models on this task. Regarding the synchronic test set, the absolute improvement ranges between 16.9 and 22.4 percentage points, while for the diachronic test data, the models outperform the previous results by 17.8–23.5 percentage points. The best model is again the uncased German BERT-BASE model by dbmdz, reaching scores of 0.761 and 0.791, respectively, followed by the two cased German BERT-BASE models. One more time, the multilingual models exhibit the poorest performances amongst the evaluated models.

Language model                            test_syn   test_dia
Best model 2017 (Ruppert et al., 2017)    0.537      0.556
bert-base-german-cased                    0.756      0.762
bert-base-german-dbmdz-cased              0.756      0.781
bert-base-german-dbmdz-uncased            0.761      0.791
bert-base-multilingual-cased              0.706      0.734
bert-base-multilingual-uncased            0.723      0.752
distilbert-base-german-cased              0.738      0.768
distilbert-base-multilingual-cased        0.716      0.744
CNN+FastText (Schmitt et al., 2018)       0.523      0.557

Table 9: Micro-averaged F1 scores for Subtask C1 (Aspect-only) on synchronic and diachronic test sets. A more detailed overview of the per-class performances can be found in Table 15 in Appendix A.

Next, Table 10 shows the results for Subtask C2:

Language model                            test_syn   test_dia
Best model 2017 (Ruppert et al., 2017)    0.396      0.424
bert-base-german-cased                    0.634      0.663
bert-base-german-dbmdz-cased              0.628      0.663
bert-base-german-dbmdz-uncased            0.655      0.689
bert-base-multilingual-cased              0.571      0.634
bert-base-multilingual-uncased            0.553      0.631
distilbert-base-german-cased              0.629      0.663
distilbert-base-multilingual-cased        0.589      0.642
CNN+FastText (Schmitt et al., 2018)       0.423      0.465

Table 10: Micro-averaged F1 scores for Subtask C2 (Aspect+Sentiment) on synchronic and diachronic test sets. A more detailed overview of the per-class performances can be found in Table 16 in Appendix A.

Here, the pre-trained models surpass the best model from 2017 by 15.7–25.9 percentage points and 20.7–26.5 percentage points, respectively, for the synchronic and diachronic test sets. Again, the best model is the uncased German BERT-BASE dbmdz model reaching a score of 0.655 and 0.689, respectively. The CNN models (Schmitt et al., 2018) are also outperformed. For both Subtask C1 and C2, all the displayed models perform better on the diachronic than on the synchronic test data.

Subtask D Subtask D refers to the Opinion Target Extraction (OTE) and is thus a token-level classification task. As this is a rather difficult task, Wojatzki et al. (2017) distinguish between exact (Subtask D1) and overlapping match (Subtask D2), tolerating a deviation of +/− one token. Here, "entities" are identified by their BIO-tags. It is noteworthy that there are fewer entities here than for Subtask C since document-level aspects or sentiments could not always be assigned to a certain sequence in the document. As a result, there are fewer documents at disposal for this task, namely 9,193. The remaining data has 1.86 opinions per document on average. The majority class is now Sonstige Unregelmäßigkeiten:negative with around 15.4% of the true entities (16,650 in total), leading to more balanced data than in Subtask C.

              Language model                            test_syn   test_dia
              Best model 2017 (Ruppert et al., 2017)    0.229      0.301
without CRF   bert-base-german-cased                    0.460      0.455
              bert-base-german-dbmdz-cased              0.480      0.466
              bert-base-german-dbmdz-uncased            0.492      0.501
              bert-base-multilingual-cased              0.447      0.457
              bert-base-multilingual-uncased            0.429      0.404
              distilbert-base-german-cased              0.347      0.357
              distilbert-base-multilingual-cased        0.430      0.419
with CRF      bert-base-german-cased                    0.446      0.443
              bert-base-german-dbmdz-cased              0.466      0.444
              bert-base-german-dbmdz-uncased            0.515      0.518
              bert-base-multilingual-cased              0.472      0.466
              bert-base-multilingual-uncased            0.477      0.452
              distilbert-base-german-cased              0.424      0.403
              distilbert-base-multilingual-cased        0.436      0.418

Table 11: Entity-level micro-averaged F1 scores for Subtask D1 (exact match) on synchronic and diachronic test sets. A more detailed overview of the per-class performances can be found in Table 17 in Appendix B.

In Table 11, we compare the pre-trained models using an "ordinary" softmax layer with the same models using a CRF layer for Subtask D1. The best performing model is the uncased German BERT-BASE model by dbmdz with CRF layer on both test sets, with a score of 0.515 and 0.518, respectively. Overall, the results from 2017 are outperformed by 11.8–28.6 percentage points on the synchronic test set and 5.6–21.7 percentage points on the diachronic test set.

For the overlapping match (cf. Tab. 12), the best systems from 2017 are outperformed by 4.9–17.5 percentage points on the synchronic and by 4.2–16.8 percentage points on the diachronic test set. Again, the uncased German BERT-BASE model
by dbmdz with CRF layer performs best with a micro F1 score of 0.523 on the synchronic and 0.533 on the diachronic set. To our knowledge, there were no other models to compare our performance values with, besides the results from 2017.

              Language model                                                                  test_syn   test_dia
              Best models 2017 (test_syn: Lee et al., 2017; test_dia: Ruppert et al., 2017)   0.348      0.365
without CRF   bert-base-german-cased                                                          0.471      0.474
              bert-base-german-dbmdz-cased                                                    0.491      0.488
              bert-base-german-dbmdz-uncased                                                  0.501      0.518
              bert-base-multilingual-cased                                                    0.457      0.473
              bert-base-multilingual-uncased                                                  0.435      0.417
              distilbert-base-german-cased                                                    0.397      0.407
              distilbert-base-multilingual-cased                                              0.433      0.429
with CRF      bert-base-german-cased                                                          0.455      0.457
              bert-base-german-dbmdz-cased                                                    0.476      0.469
              bert-base-german-dbmdz-uncased                                                  0.523      0.533
              bert-base-multilingual-cased                                                    0.476      0.474
              bert-base-multilingual-uncased                                                  0.484      0.464
              distilbert-base-german-cased                                                    0.433      0.423
              distilbert-base-multilingual-cased                                              0.442      0.427

Table 12: Entity-level micro-averaged F1 scores for Subtask D2 (overlapping match) on synchronic and diachronic test sets. A more detailed overview of the per-class performances can be found in Table 18 in Appendix B.

6 Discussion

The most widely adopted data sets for English Aspect-based Sentiment Analysis originate from the SemEval Shared Tasks (Pontiki et al., 2014, 2015, 2016). When looking at public leaderboards, e.g., Subtask SB2 (aspect term polarity) from SemEval-2014 is the task which attracts most of the researchers. A comparison of pre-BERT and BERT-based methods reveals no big "jump" in the performance values, but rather a steady increase over time (cf. Tab. 13).

             Language model                                     Laptops   Restaurants
pre-BERT     Best model SemEval-2014 (Pontiki et al., 2014)     0.7048    0.8095
             MemNet (Tang et al., 2016)                         0.7221    0.8095
             HAPN (Li et al., 2018)                             0.7727    0.8223
BERT-based   BERT-SPC (Song et al., 2019)                       0.7899    0.8446
             BERT-ADA (Rietzler et al., 2020)                   0.8023    0.8789
             LCF-ATEPC (Yang et al., 2019)                      0.8229    0.9018

Table 13: Development of the SOTA Accuracy for the aspect term polarity task (SemEval-2014; Pontiki et al., 2014). Selected models were picked from aspect-based-sentiment-analysis-on-semeval.

More related to Subtasks C1 and C2 from GermEval17, but unfortunately also less used, are the subtasks SB3 (aspect category extraction) and SB4 (aspect category polarity) from SemEval-2014. Since the data sets (Restaurants and Laptops) have been further developed for SemEval-2015 and SemEval-2016, subtasks SB3 and SB4 are revisited under the names Slot 1 and Slot 3 for the in-domain ABSA in SemEval-2015. Slot 2 from SemEval-2015 aims at OTE and thus corresponds to Subtask D from GermEval17. For SemEval-2016 the same task names as in 2015 were used, subdivided into Subtask 1 (sentence-level ABSA) and Subtask 2 (text-level ABSA). Table 14 shows the performance of the best model from 2014 as well as performance values for subsequent (pre-BERT and BERT-based) architectures for subtasks SB3 and SB4.

                                                                Restaurants
             Language model                                     SB3       SB4
pre-BERT     Best model SemEval-2014 (Pontiki et al., 2014)     0.8857    0.8292
             ATAE-LSTM (Wang et al., 2016)                      –         0.840
BERT-based   BERT-pair (Sun et al., 2019)                       0.9218    0.899
             CG-BERT (Wu and Ong, 2020)                         0.9162†   0.901†
             QACG-BERT (Wu and Ong, 2020)                       0.9264    0.904†

Table 14: Development of the SOTA F1 score (SB3) and Accuracy (SB4) for the aspect category extraction/polarity task (SemEval-2014; Pontiki et al., 2014). † Wu and Ong (2020) additionally used auxiliary sentences here.

In contrast to what we observed for subtask SB2, in this case, the performance increase on SB4 caused by the introduction of BERT seems to be rather striking. While the ATAE-LSTM (Wang et al., 2016) only slightly increased the performance compared to 2014, the BERT-based models led to a jump of more than 6 percentage points.

Nevertheless, the comparability to our experiments on the GermEval17 data is somewhat limited. Subtask SB4 only exhibits five aspect categories (as opposed to 20 categories for GermEval17) which leads to an easier classification problem and is reflected in the already pretty
high scores of the 2014 baselines. Another issue is that while for improving the SOTA on the SemEval-2014 tasks (partly) highly specialized (T)ABSA architectures were used, we "only" applied standard pre-trained German BERT models without any task-specific modifications or extensions. This leaves room for further improvements on this task on German data which should be an objective for future research.

7 Conclusion

As expected, all the pre-trained language models clearly outperform all the models from 2017, proving once more the power of transfer learning. Throughout the presented analyses, the models always achieve similar results between the synchronic and the diachronic test sets, indicating temporal robustness for the models. Nonetheless, the diachronic data was collected only half a year after the main data. It would be interesting to see whether the trained models would return similar predictions on data collected a couple of years later.

The uncased German BERT-BASE model by dbmdz achieves the best results across all subtasks. Since Rönnqvist et al. (2019) showed that monolingual BERT models often outperform the multilingual models for a variety of tasks, one might have already suspected that a monolingual German BERT performs best across the performed tasks. It may not seem evident at first that an uncased language model ends up as the best performing model since, e.g. in Sentiment Analysis, capitalized letters might be an indicator for polarity. In addition, since nouns and beginnings of sentences always start with a capital letter in German, one might assume that lower-casing the whole text changes the meaning of some words and thus confuses the language model. Nevertheless, the GermEval17 documents are very noisy since they were retrieved from social media. That means that the data contains many misspellings, grammar and expression mistakes, dialect, and colloquial language. For this reason, some participating teams in 2017 already pursued an elaborate pre-processing of the text data in order to eliminate some noise (Hövelmann and Friedrich, 2017; Sayyed et al., 2017; Sidarenka, 2017). Among other things, Hövelmann and Friedrich (2017) transformed the text to lower-case and replaced, for example, "S-Bahn" and "S Bahn" with "sbahn". We suppose that in this case, lower-casing the texts improves the data quality by eliminating some of the noise and acts as a sort of regularization. As a result, the uncased models potentially generalize better than the cased models. The findings from Mayhew et al. (2019), who compare cased and uncased pre-trained models on social media data for NER, corroborate this hypothesis.

References

Attia, M., Samih, Y., Elkahky, A., and Kallmeyer, L. (2018). Multilingual multi-class sentiment classification using convolutional neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Barriere, V. and Balahur, A. (2020). Improving sentiment analysis over non-English tweets using multilingual transformers and automatic translation for data-augmentation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 266–271, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Behdenna, S., Barigou, F., and Belalem, G. (2018). Document level sentiment analysis: A survey. EAI Endorsed Transactions on Context-aware Systems and Applications, 4:154339.

Biesialska, K., Biesialska, M., and Rybinski, H. (2020). Sentiment analysis with contextual embeddings and self-attention. arXiv preprint arXiv:2003.05574.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Cieliebak, M., Deriu, J. M., Egger, D., and Uzdilli, F. (2017). A Twitter corpus and benchmark resources for German sentiment analysis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 45–51, Valencia, Spain. Association for Computational Linguistics.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Esplà-Gomis, M., Forcada, M., Ramírez-Sánchez, G., and Hoang, H. T. (2019). ParaCrawl: Web-scale parallel corpora for the languages of the EU. In MT-Summit.

Guhr, O., Schumann, A.-K., Bahrmann, F., and Böhme, H.-J. (2020). Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 1627–1632, Marseille, France.

Haddow, B. (2018). News Crawl Corpus.

Hoang, M., Bihorac, O. A., and Rouces, J. (2019). Aspect-based sentiment analysis using BERT. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 187–196, Turku, Finland. Linköping University Electronic Press.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.

Mayhew, S., Tsygankova, T., and Roth, D. (2019). ner and pos when nothing is capitalized. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6256–6261, Hong Kong, China. Association for Computational Linguistics.

McCann, B., Bradbury, J., Xiong, C., and Socher, R. (2017). Learned in translation: Contextualized word vectors. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30, pages 6294–6305. Curran Associates, Inc.

Mishra, P., Mujadia, V., and Lanka, S. (2017). GermEval 2017: Sequence based Models for Customer Feedback Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures.
In Bański, P., Barbaresi, A., Biber, H., Breiteneder, E., Clematide, S., Kupietz, M., Lüngen, H., and Il- Hövelmann, L. and Friedrich, C. M. (2017). Fast- iadi, C., editors, 7th Workshop on the Challenges text and Gradient Boosted Trees at GermEval-2017 in the Management of Large Corpora (CMLC- Tasks on Relevance Classification and Document- 7), Cardiff, United Kingdom. Leibniz-Institut für level Polarity. In Proceedings of the GermEval 2017 Deutsche Sprache. – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany. Ostendorff, M., Blume, T., and Ostendorff, S. (2020). Towards an Open Platform for Legal Information. In Karimi, A., Rossi, L., and Prati, A. (2020). Adversar- Proceedings of the ACM/IEEE Joint Conference on ial training for aspect-based sentiment analysis with Digital Libraries in 2020, JCDL ’20, pages 385—- bert. 388, New York, NY, USA. Association for Comput- ing Machinery. Lee, J.-U., Eger, S., Daxenberger, J., and Gurevych, I. (2017). UKP TU-DA at GermEval 2017: Deep Pennington, J., Socher, R., and Manning, C. (2014). Learning for Aspect Based Sentiment Detection. In Glove: Global vectors for word representation. In Proceedings of the GermEval 2017 – Shared Task on Proceedings of the 2014 conference on empirical Aspect-based Sentiment in Social Media Customer methods in natural language processing (EMNLP), Feedback, Berlin, Germany. pages 1532–1543. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Li, L., Liu, Y., and Zhou, A. (2018). Hierarchical atten- Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep tion based position-aware network for aspect-level contextualized word representations. arXiv preprint sentiment analysis. In Proceedings of the 22nd Con- arXiv:1802.05365. ference on Computational Natural Language Learn- ing, pages 181–189, Brussels, Belgium. Association Pontiki, M., Galanis, D., Papageorgiou, H., Androut- for Computational Linguistics. 
sopoulos, I., Manandhar, S., AL-Smadi, M., Al- Ayyoub, M., Zhao, Y., Qin, B., de clercq, O., Hoste, Li, X., Bing, L., Zhang, W., and Lam, W. (2019). V., Apidianaki, M., Tannier, X., Loukachevitch, N., Exploiting BERT for end-to-end aspect-based senti- Kotelnikov, E., Bel, N., Zafra, S. M., and Eryiğit, G. ment analysis. In Proceedings of the 5th Workshop (2016). Semeval-2016 task 5: Aspect based senti- on Noisy User-generated Text (W-NUT 2019), pages ment analysis. In Proceedings of the 10th Interna- 34–41, Hong Kong, China. Association for Compu- tional Workshop on Semantic Evaluation (SemEval- tational Linguistics. 2016), pages 19–30. Lison, P. and Tiedemann, J. (2016). OpenSubti- Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, tles2016: Extracting Large Parallel Corpora from S., and Androutsopoulos, I. (2015). SemEval-2015 Movie and TV Subtitles. In Proceedings of the 10th task 12: Aspect based sentiment analysis. In Pro- International Conference on Language Resources ceedings of the 9th International Workshop on Se- and Evaluation (LREC 2016). mantic Evaluation (SemEval 2015), pages 486–495,
Denver, Colorado. Association for Computational Sun, C., Huang, L., and Qiu, X. (2019). Utilizing Linguistics. BERT for aspect-based sentiment analysis via con- structing auxiliary sentence. In Proceedings of the Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, 2019 Conference of the North American Chapter of H., Androutsopoulos, I., and Manandhar, S. (2014). the Association for Computational Linguistics: Hu- SemEval-2014 task 4: Aspect based sentiment anal- man Language Technologies, Volume 1 (Long and ysis. In Proceedings of the 8th International Work- Short Papers), pages 380–385, Minneapolis, Min- shop on Semantic Evaluation (SemEval 2014), pages nesota. Association for Computational Linguistics. 27–35, Dublin, Ireland. Association for Computa- tional Linguistics. Tang, D., Qin, B., and Liu, T. (2016). Aspect level sentiment classification with deep memory network. Rietzler, A., Stabinger, S., Opitz, P., and Engl, S. arXiv preprint arXiv:1605.08900. (2020). Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect- Tao, J. and Fang, X. (2020). Toward multi-label sen- target sentiment classification. In Proceedings of timent analysis: a transfer learning based approach. the 12th Language Resources and Evaluation Con- Journal of Big Data, 7:1. ference, pages 4933–4941, Marseille, France. Euro- pean Language Resources Association. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, Rönnqvist, S., Kanerva, J., Salakoski, T., and Ginter, F. I. (2017). Attention Is All You Need. In 31st Con- (2019). Is Multilingual BERT Fluent in Language ference on Neural Information Processing Systems Generation? In Proceedings of the First NLPL (NIPS 2017), Long Beach, California, USA. Workshop on Deep Learning for Natural Language Processing, pages 29–36, Turku, Finland. Linköping Wang, A., Pruksachatkun, Y., Nangia, N., Singh, University Electronic Press. 
A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2019). Superglue: A stickier benchmark for Ruppert, E., Kumar, A., and Biemann, C. (2017). general-purpose language understanding systems. LT-ABSA: An Extensible Open-Source System for In Advances in neural information processing sys- Document-Level and Aspect-Based Sentiment Anal- tems, pages 3266–3280. ysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Customer Feedback, Berlin, Germany. Bowman, S. R. (2018). Glue: A multi-task bench- mark and analysis platform for natural language un- Sayyed, Z. A., Dakota, D., and Kübler, S. (2017). IDS- derstanding. arXiv preprint arXiv:1804.07461. IUCL: Investigating Feature Selection and Oversam- pling for GermEval 2017. In Proceedings of the Wang, Y., Huang, M., Zhu, X., and Zhao, L. (2016). GermEval 2017 – Shared Task on Aspect-based Sen- Attention-based lstm for aspect-level sentiment clas- timent in Social Media Customer Feedback, Berlin, sification. In Proceedings of the 2016 conference on Germany. empirical methods in natural language processing, pages 606–615. Schmitt, M., Steinheber, S., Schreiber, K., and Roth, B. (2018). Joint aspect and polarity classification Wojatzki, M., Ruppert, E., Holschneider, S., Zesch, for aspect-based sentiment analysis with end-to-end T., and Biemann, C. (2017). GermEval 2017: neural networks. In Proceedings of the 2018 Con- Shared Task on Aspect-based Sentiment in Social ference on Empirical Methods in Natural Language Media Customer Feedback. In Proceedings of the Processing, pages 1109–1114, Brussels, Belgium. GermEval 2017 – Shared Task on Aspect-based Sen- Association for Computational Linguistics. timent in Social Media Customer Feedback, pages 1–12, Berlin, Germany. Sidarenka, U. (2017). 
PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand- Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, Crafted SVM and Deep Bidirectional LSTM Net- C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtow- work. In Proceedings of the GermEval 2017 – icz, M., Davison, J., Shleifer, S., von Platen, P., Ma, Shared Task on Aspect-based Sentiment in Social C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gug- Media Customer Feedback, Berlin, Germany. ger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2020). Transformers: State-of-the-Art Natural Lan- Skadiņš, R., Tiedemann, J., Rozis, R., and Deksne, guage Processing. In Proceedings of the 2020 Con- D. (2014). Billions of Parallel Words for Free: ference on Empirical Methods in Natural Language Building and Using the EU Bookshop Corpus. In Processing: System Demonstrations, pages 38–45, Proceedings of the 9th International Conference on Online. Association for Computational Linguistics. Language Resources and Evaluation (LREC 2014), pages 1850–1855, Reykjavik, Iceland. European Wu, Z. and Ong, D. C. (2020). Context-guided bert Language Resources Association (ELRA). for targeted aspect-based sentiment analysis. arXiv preprint arXiv:2010.07523. Song, Y., Wang, J., Jiang, T., Liu, Z., and Rao, Y. (2019). Attentional encoder network for tar- Xu, H., Liu, B., Shu, L., and Yu, P. S. (2019). Bert geted sentiment classification. arXiv preprint post-training for review reading comprehension and arXiv:1902.09314. aspect-based sentiment analysis.
Yang, H., Zeng, B., Yang, J., Song, Y., and Xu, R. (2019). A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction. arXiv preprint arXiv:1912.07976.

Appendix

A Detailed results (per category) for Subtask C

Because of the high number of classes and their skewed distribution, it is worth taking a more detailed look at the model performance for this subtask at the category level. Table 15 shows the performance of the uncased German BERT-BASE model by dbmdz per test set for Subtask C1. The support indicates the number of appearances, which are also displayed in Table 4 in this case. Seven categories are summarized in Rest because they have an F1 score of 0 for both test sets, i.e. the model is not able to correctly identify any of these seven aspects appearing in the test data. The table is sorted by the score on the synchronic test set.

Aspect Category                         testsyn Score   testsyn Support   testdia Score   testdia Support
Allgemein                               0.854           1,398             0.877           1,024
Sonstige Unregelmäßigkeiten             0.782           224               0.785           164
Connectivity                            0.750           36                0.838           73
Zugfahrt                                0.678           241               0.687           184
Auslastung und Platzangebot             0.645           35                0.667           20
Sicherheit                              0.602           84                0.639           42
Atmosphäre                              0.600           148               0.532           53
Barrierefreiheit                        0.500           9                 0               2
Ticketkauf                              0.481           95                0.506           48
Service und Kundenbetreuung             0.476           63                0.417           27
DB App und Website                      0.455           28                0.563           18
Informationen                           0.329           58                0.464           35
Komfort und Ausstattung                 0.286           24                0               11
Rest                                    0               24                0               20

Table 15: Micro-averaged F1 scores and support by aspect category (Subtask C1). Seven categories are summarized in Rest, each with a score of 0.

The F1 scores for Allgemein (General), Sonstige Unregelmäßigkeiten (Other irregularities) and Connectivity are the highest. 13 categories, mostly the same for the two test sets, show a positive F1 score on at least one of the two test sets. The model was not able to learn how to correctly identify the categories subsumed under Rest.

Subtask C2 exhibits a similar distribution of the true labels, with the Aspect+Sentiment category Allgemein:neutral as majority class. Over 50% of the true labels belong to this class. Table 16 shows that only 12 out of 60 labels can be detected by the model.

Aspect+Sentiment Category               testsyn Score   testsyn Support   testdia Score   testdia Support
Allgemein:neutral                       0.804           1,108             0.832           913
Sonstige Unregelmäßigkeiten:negative    0.782           221               0.793           159
Zugfahrt:negative                       0.645           197               0.725           149
Sicherheit:negative                     0.640           78                0.585           39
Allgemein:negative                      0.582           258               0.333           80
Atmosphäre:negative                     0.569           126               0.447           39
Connectivity:negative                   0.400           20                0.291           46
Ticketkauf:negative                     0.364           42                0.298           34
Auslastung und Platzangebot:negative    0.350           31                0.211           17
Allgemein:positive                      0.214           41                0.690           33
Zugfahrt:positive                       0.154           34                0               34
Service und Kundenbetreuung:negative    0.146           36                0.174           21
Rest                                    0               343               0               180

Table 16: Micro-averaged F1 scores and support by Aspect+Sentiment category (Subtask C2). 48 categories are summarized in Rest, each with a score of 0.

All the aspect categories displayed in Table 16 are also visible in Table 15, and most of them have negative sentiment. Allgemein:neutral and Sonstige Unregelmäßigkeiten:negative show the highest scores. Again, we assume that 48 categories could not be identified here due to data sparsity. Even with this in mind, the model achieves a relatively high overall performance for both Subtask C1 and C2 (cf. Tab. 9 and Tab. 10). This is mainly owed to the high score of the majority classes Allgemein and Allgemein:neutral, respectively, because the micro F1 score puts a lot of weight on majority classes. It would be interesting to know whether the classification of the rare categories can be improved by balancing the data. We experimented with removing general categories such as Allgemein, Allgemein:neutral or documents with sentiment neutral since these are usually less interesting for a company. We observe a large drop in the overall F1 score, which we attribute to the absence of the strong majority class and the resulting data loss. Indeed, the classification for some single categories could be improved, but the rare categories could still not be identified by the model.
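The weight that the micro F1 score puts on majority classes can be illustrated with a small sketch. The code below is not the evaluation script of the shared task; it is a minimal multi-label implementation written for illustration, and the toy labels and counts are invented rather than taken from GermEval17. It contrasts the micro-averaged score with per-label scores and with an unweighted macro average (the macro average is added here only for comparison).

```python
from collections import Counter


def per_label_counts(gold, pred):
    """Count true positives, false positives and false negatives per label.

    gold and pred are lists of label sets, one set per document.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in g & p:
            tp[label] += 1
        for label in p - g:
            fp[label] += 1
        for label in g - p:
            fn[label] += 1
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_and_macro_f1(gold, pred):
    tp, fp, fn = per_label_counts(gold, pred)
    labels = set(tp) | set(fp) | set(fn)
    per_label = {l: f1(tp[l], fp[l], fn[l]) for l in labels}
    # Micro: pool the counts of all labels before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average the per-label scores, every label counts equally.
    macro = sum(per_label.values()) / len(per_label) if per_label else 0.0
    return per_label, micro, macro


# Toy data (hypothetical): 8 of 10 documents belong to the majority class,
# and the model predicts the majority class for every document.
gold = [{"Allgemein"}] * 8 + [{"Ticketkauf"}, {"Komfort"}]
pred = [{"Allgemein"}] * 10

per_label, micro, macro = micro_and_macro_f1(gold, pred)
print(per_label, micro, macro)
```

In this toy setting the two rare labels are never identified, yet the pooled (micro) score stays high because the majority class contributes most of the counts, which mirrors the pattern described above.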
B Detailed results (per category) for Subtask D

As for Subtask C, the results for the best model are investigated in more detail. Table 17 gives the detailed classification report for the uncased German BERT-BASE model with CRF layer on Subtask D1. Only entities that were correctly detected at least once are displayed. The table is sorted by the score on the synchronic test set. The classification report for Subtask D2 is displayed analogously in Table 18.

Category                                testsyn Score   testsyn Support   testdia Score   testdia Support
Zugfahrt:negative                       0.702           622               0.729           495
Sonstige Unregelmäßigkeiten:negative    0.681           693               0.581           484
Sicherheit:negative                     0.604           337               0.457           122
Connectivity:negative                   0.598           56                0.620           109
Barrierefreiheit:negative               0.595           14                0               3
Auslastung und Platzangebot:negative    0.579           66                0.447           31
Connectivity:positive                   0.571           26                0.555           60
Allgemein:negative                      0.545           807               0.343           139
Atmosphäre:negative                     0.500           403               0.337           164
Ticketkauf:negative                     0.383           96                0.583           74
Ticketkauf:positive                     0.368           59                0               13
Komfort und Ausstattung:negative        0.357           24                0               16
Atmosphäre:neutral                      0.348           40                0.111           14
Service und Kundenbetreuung:negative    0.323           74                0.286           31
Informationen:negative                  0.301           68                0.505           46
Zugfahrt:positive                       0.276           62                0.343           83
DB App und Website:negative             0.232           39                0.375           33
DB App und Website:neutral              0.188           23                0               11
Sonstige Unregelmäßigkeiten:neutral     0.179           13                0.222           2
Allgemein:positive                      0.157           86                0.586           92
Service und Kundenbetreuung:positive    0.115           23                0               5
Atmosphäre:positive                     0.105           26                0               15
Ticketkauf:neutral                      0.040           144               0.222           25
Connectivity:neutral                    0               11                0.211           15
Toiletten:negative                      0               15                0.160           23
Rest                                    0               355               0               115

Table 17: Micro-averaged F1 scores and support by Aspect+Sentiment entity with exact match (Subtask D1). 35 categories are summarized in Rest, each with a score of 0.

For Subtask D1, the model returns a positive score for 25 entity categories on at least one of the two test sets. The category Zugfahrt:negative can be classified best on both test sets, followed by Sonstige Unregelmäßigkeiten:negative and Sicherheit:negative for the synchronic test set and by Connectivity:negative and Allgemein:positive for the diachronic set. Visibly, the scores differ more between the two test sets here than in the classification report of the previous task.

Category                                testsyn Score   testsyn Support   testdia Score   testdia Support
Zugfahrt:negative                       0.708           622               0.739           495
Sonstige Unregelmäßigkeiten:negative    0.697           693               0.617           484
Sicherheit:negative                     0.607           337               0.475           122
Connectivity:negative                   0.598           56                0.620           109
Barrierefreiheit:negative               0.595           14                0               3
Auslastung und Platzangebot:negative    0.579           66                0.447           31
Connectivity:positive                   0.571           26                0.555           60
Allgemein:negative                      0.561           807               0.363           139
Atmosphäre:negative                     0.505           403               0.358           164
Ticketkauf:negative                     0.383           96                0.583           74
Ticketkauf:positive                     0.368           59                0               13
Komfort und Ausstattung:negative        0.357           24                0               16
Atmosphäre:neutral                      0.348           40                0.111           14
Service und Kundenbetreuung:negative    0.323           74                0.286           31
Informationen:negative                  0.301           68                0.505           46
Zugfahrt:positive                       0.276           62                0.343           83
DB App und Website:negative             0.261           39                0.406           33
DB App und Website:neutral              0.188           23                0               11
Sonstige Unregelmäßigkeiten:neutral     0.179           13                0.222           2
Allgemein:positive                      0.157           86                0.586           92
Service und Kundenbetreuung:positive    0.115           23                0               5
Atmosphäre:positive                     0.105           26                0               15
Ticketkauf:neutral                      0.040           144               0.222           25
Connectivity:neutral                    0               11                0.211           15
Toiletten:negative                      0               15                0.160           23
Rest                                    0               355               0               112

Table 18: Micro-averaged F1 scores and support by Aspect+Sentiment entity with overlapping match (Subtask D2). 35 categories are summarized in Rest, each with a score of 0.

The report for the overlapping match (cf. Tab. 18) shows slightly better results on some categories than for the exact match. The third-best score on the diachronic test data is now Sonstige Unregelmäßigkeiten:negative. Besides this, the top three categories per test set remain the same.

Apart from the fact that this is a different kind of task than before, one can notice that even though the overall micro F1 scores are lower for Subtask D than for Subtask C, the model manages to successfully identify a larger variety of categories, i.e. it achieves a positive score for more categories. This is probably because the data for Subtask D is more balanced than for Subtask C2, resulting in a lower overall score and mostly higher scores per category.
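The difference between exact and overlapping match can be made concrete with a short sketch. It assumes that an entity is given as a (start, end, label) tuple with half-open character offsets, that an exact match requires identical offsets and label, and that an overlapping match requires the same label and any shared character. These definitions, the helper names, and the toy entities are assumptions made for illustration; the official GermEval17 evaluation may define the matching differently.

```python
def spans_overlap(a, b):
    """True if half-open spans a=(start, end) and b=(start, end) share a character."""
    return a[0] < b[1] and b[0] < a[1]


def match_f1(gold, pred, overlapping=False):
    """F1 over (start, end, label) entities of one document.

    Each gold entity can be matched by at most one predicted entity.
    """
    def is_match(g, p):
        if g[2] != p[2]:
            return False
        if overlapping:
            return spans_overlap(g[:2], p[:2])
        return g[:2] == p[:2]

    matched_gold = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i not in matched_gold and is_match(g, p):
                matched_gold.add(i)
                tp += 1
                break
    fp = len(pred) - tp
    fn = len(gold) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy document (hypothetical offsets): the second prediction has the right
# label but starts two characters late.
gold = [(0, 10, "Zugfahrt:negative"), (20, 28, "Ticketkauf:negative")]
pred = [(0, 10, "Zugfahrt:negative"), (22, 28, "Ticketkauf:negative")]

exact = match_f1(gold, pred)
overlap = match_f1(gold, pred, overlapping=True)
print(exact, overlap)
```

Under these assumptions a prediction with slightly shifted boundaries is counted as an error in the exact setting but as a hit in the overlapping setting, which is consistent with the overlapping scores never being lower than the exact ones in the two reports above.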
