SkoltechNLP at SemEval-2021 Task 5: Leveraging Sentence-level Pre-training for Toxic Span Detection

David Dale‡, Igor Markov‡,†, Varvara Logacheva‡, Olga Kozlova†, Nikita Semenov†, and Alexander Panchenko‡
‡ Skolkovo Institute of Science and Technology, Moscow, Russia
† Mobile TeleSystems (MTS), Moscow, Russia
{d.dale,igor.markov,v.logacheva,a.panchenko}@skoltech.ru
{oskozlo9,nikita.semenov}@mts.ru

Abstract

This work describes the participation of the Skoltech NLP group team (Sk) in the Toxic Spans Detection task at SemEval-2021. The goal of the task is to identify the most toxic fragments of a given sentence, which is a binary sequence tagging problem. We show that fine-tuning a RoBERTa model for this problem is a strong baseline. This baseline can be further improved by pre-training the RoBERTa model on a large dataset labeled for toxicity at the sentence level. While our solution scored among the top 20% of participating models, it is only 2 points below the best result. This suggests the viability of our approach.

1 Introduction

Toxic and offensive content is a major concern for many platforms on the Internet. Therefore, the task of toxicity detection has attracted much attention in the NLP community (Wulczyn et al., 2017; Hosseini et al., 2017; Dixon et al., 2018). Until recently, the majority of research on toxicity focused on classifying entire user messages as toxic or safe. However, the surge of work on text detoxification, i.e., editing a text to keep its content and remove toxicity (Nogueira dos Santos et al., 2018; Tran et al., 2020), suggests that localizing toxicity within a sentence is also useful. If we know which words of a sentence are toxic, it is easier to "fix" the sentence by removing them or replacing them with non-toxic synonyms. Mathew et al. (2020) ask human labelers to annotate spans as rationales for classifying a comment as hateful, offensive, or normal. They show that using such spans when training a toxicity classifier improves its accuracy and explainability and reduces unintended bias towards toxicity targets.

This year SemEval hosts the first competition on toxic spans detection, namely SemEval-2021 Task 5 (Pavlopoulos et al., 2021).[1] It provides training, development, and test data for English. As far as we know, it is the first attempt to explicitly formulate toxicity detection as sequence labeling instead of sentence classification.

Multiple NLP tasks have recently benefited from transfer learning — the transfer of probability distributions learned on one task to a model solving a different task. The most common example of transfer learning is the use of embeddings and language models pre-trained on unlabeled data (e.g. ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and its variations, T5 (Raffel et al., 2020), etc.) for other tasks (e.g. He et al. (2020) and Wang et al. (2020), inter alia, use pre-trained BERT models to perform tasks from the GLUE benchmark (Wang et al., 2018)).

Word-level toxicity classification can be formulated as a sequence labeling task, which also actively uses the pre-trained models mentioned above. BERT encodes versatile information about words and their context, which allows it to be used successfully for sequence labeling tasks of different levels: part-of-speech tagging and syntactic parsing (Koto et al., 2020), named entity recognition (Hakala and Pyysalo, 2019), semantic role labeling (He et al., 2019), and detection of machine translation errors (Moura et al., 2020).

This diversity of applications suggests that word-level toxicity detection can also benefit from pre-trained models. Besides that, toxicity itself has been successfully tackled with BERT-based models: research on sentence-level toxicity has extensively used BERT and other pre-trained models, and both language-specific and multilingual BERT models have been used to fine-tune toxicity classifiers (Leite et al., 2020; Ozler et al., 2020). This shows that BERT carries information about toxicity.

[1] https://competitions.codalab.org/competitions/25623
Thus, we follow this line of work. Namely, we fine-tune a RoBERTa model (Liu et al., 2019) to perform a sequence labeling task. Besides that, we train a sentence classification model on the Jigsaw dataset of toxic comments and use the information from this model to detect toxicity at the subsentential level. This helps us overcome the insufficient data size. This is in line with previous work, which has shown that sentence-level labels can be used in combination with token labels (Rei and Søgaard, 2019) or can completely substitute them (Rei and Søgaard, 2018; Schmaltz, 2019).

In our experiments, we test the hypothesis that sentence-level toxicity labels can help a sequence labeler that recognizes toxic spans in text. We suggest three ways of incorporating these data: as a corpus for pre-training, for pseudo-labeling, and for joint training of sentence-level and token-level toxicity detection models. Our experiments show that the last method yields the best result. Moreover, we show that using sentence-level labels can dramatically improve toxic span prediction when the dataset with token-level labels is small.

The contributions of this work are the following:

• We successfully use a dataset labeled for toxicity at the sentence level for token-level toxicity labeling,
• We propose a model for joint sentence- and token-level toxicity detection,
• We analyze the performance of our models, showing their limitations, and reveal ambiguities in the data.

2 The task

The training data of the task comprises 7,940 English comments with character-level annotations of toxic spans. The labeling was performed manually by crowd workers. The spans labeled as toxic often contain rude words: "Because he's a moron and a bigot. It's not any more complicated than that." (toxic spans are underlined). Other toxic spans consist of words that become toxic in context: "Section 160 should also be amended to include sexual acts with animals not involving penetration". Borders of some toxic spans fall in the middle of a word; we treat such cases as markup errors.

As a development set, we use the trial dataset of 690 texts provided by the task organizers. We evaluate our final models on the hidden test set of the task, which consists of 2,000 texts.

3 Pre-training for toxic span detection

Here we give the motivation behind our models and describe their architecture and training setup.

3.1 Motivation

Our intuition is that toxicity is often lexically based, i.e., there are certain words that are considered offensive and make the whole sentence toxic. In this case, we expect that as we add extra data to our toxic span dataset, the vocabulary of toxic words in it will at some point saturate and stop growing. However, Figure 1 shows that the size of the toxic vocabulary depends linearly on the dataset size, which suggests that the dataset is too small for the task: the model will often need to label unseen words. To mitigate the lack of data, we leverage an additional dataset with toxicity information, namely the Jigsaw toxic comments dataset,[2] which features 140,000 user utterances labeled as toxic or safe.

[Figure 1: Size of the toxic vocabulary as a function of the corpus size. Axes: size of the training set (x, 0 to 6,000 comments) vs. size of the toxic vocabulary (y, 0 to 5,000 words).]

[2] https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
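As a rough illustration of the analysis behind Figure 1, the sketch below counts distinct toxic words over growing prefixes of the training data. This is not the authors' code; the file name and the column layout (a "text" column and a "spans" column holding toxic character offsets) are assumptions about the task data format.

```python
# A rough sketch (not the authors' code) of the vocabulary-saturation check behind
# Figure 1. The file name and column layout are assumptions about the task data.
import ast
import pandas as pd

def toxic_words(text: str, offsets: list) -> set:
    """Lowercased words that overlap the annotated toxic character offsets."""
    mask = set(offsets)
    covered = "".join(ch if i in mask else " " for i, ch in enumerate(text))
    return {w.lower() for w in covered.split()}

df = pd.read_csv("tsd_train.csv")                   # hypothetical file name
df["spans"] = df["spans"].apply(ast.literal_eval)   # offsets are stored as strings

vocab, growth = set(), []
for n, (_, row) in enumerate(df.iterrows(), start=1):
    vocab |= toxic_words(row["text"], row["spans"])
    growth.append((n, len(vocab)))                  # (training set size, vocabulary size)

print(growth[::1000])   # near-linear growth means many toxic words remain unseen
```

If the toxic vocabulary were saturating, the printed sizes would level off; near-linear growth indicates that a tagger trained on this data alone will keep encountering unseen toxic words.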
3.2 Transfer learning for spans detection

Our base model is RoBERTa. We fine-tune the roberta-base model on the toxic spans training set for the sequence labeling task (this model is further denoted as the RoBERTa tagger). This model already gives promising results.
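A minimal illustration of how such a token-level fine-tuning setup can be assembled with the transformers library is given below. It is a sketch rather than the authors' implementation: the example character offsets are invented, and the optimization loop is omitted.

```python
# A minimal illustration (not the authors' exact code) of fine-tuning roberta-base
# for binary token classification; the character offsets below are invented.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained("roberta-base", num_labels=2)

text = "Because he's a moron and a bigot."
enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")

# A token is labeled toxic (1) if at least one of its characters lies in a toxic span.
toxic_chars = set(range(15, 20)) | set(range(27, 32))   # "moron" and "bigot" (illustrative)
labels = torch.tensor([[int(any(c in toxic_chars for c in range(s, e)))
                        for s, e in enc["offset_mapping"][0].tolist()]])

out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"], labels=labels)
out.loss.backward()   # a standard optimizer step over batches of such examples follows
```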
We further improve the tagger by providing it with an additional training signal from the Jigsaw toxic comments dataset. We propose three ways of incorporating these data.

The first way is pseudo-labeling (Lee et al., 2013): we apply the RoBERTa tagger to predict toxic spans in the Jigsaw dataset and use these predictions to further train the model.

Another option is to fine-tune RoBERTa on the Jigsaw data directly. We suggest two scenarios. The first is to fine-tune the model on the Jigsaw dataset for the sentence classification task and then on the toxic spans dataset — this model is referred to as the RoBERTa classifier + tagger. In this case, the model has different output layers for the two tasks, and all other layers are shared.

Finally, we propose a novel architecture for joint token- and sentence-level classification, where the score ŷ for a sentence x = {x1, x2, ..., xn} is computed from the average of the word-level scores ŷi:

    ŷ = σ( α + β · (1/n) · Σ_{i=1..n} ŷi^γ ),

where α, β, and γ are trainable parameters, and σ is the logistic function. This model does not need to be trained on data with token-level labels but can extract token-level toxicity information from sentence labels. We fine-tune this model on both the Jigsaw and the toxic spans datasets. The model is referred to as the tagging classifier.
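The following PyTorch sketch shows one possible reading of this joint head: the tagger produces per-token toxicity probabilities, and the sentence score is computed by the formula above with a mask for padding. The class name, the padding handling, and the parameter initialization are assumptions, not the authors' implementation.

```python
# One possible reading of the joint head (a sketch, not the authors' implementation):
# sentence score = sigma(alpha + beta * mean_i(y_i ** gamma)) over per-token scores y_i.
import torch
import torch.nn as nn

class JointToxicityHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, token_scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # token_scores: (batch, seq_len) probabilities in [0, 1]; mask: 1 for real tokens.
        powered = token_scores.clamp(min=1e-6) ** self.gamma
        mean = (powered * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return torch.sigmoid(self.alpha + self.beta * mean)

# Sentence-level (Jigsaw) labels can then train the underlying tagger through this head
# with binary cross-entropy, without any token-level annotation.
head = JointToxicityHead()
sentence_scores = head(torch.rand(2, 5), torch.ones(2, 5))
print(sentence_scores.shape)   # torch.Size([2])
```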
3.3 Working with spans

To reformulate toxic spans detection as a token classification problem, we label a token as toxic if at least one of its characters is toxic. When projecting the predicted token-level labels back to the character level, we try two strategies:

1. Consider a token toxic if its toxicity score is higher than the threshold, without forcing the labels of tokens within a word to agree with each other.

2. Consider a word toxic if the aggregated toxicity score of all its tokens is higher than the threshold. We try four different aggregation functions: min, max, mean, and the simplified naive Bayes formula (see the sketch after this list)

       x̃ = Π_i x_i / ( Π_i x_i + Π_i (1 − x_i) ).

In both methods, we label a space character as toxic only if the characters both to the left and to the right of it are toxic.
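Below is a small sketch of these word-level aggregation functions applied to the token scores of a single word; the function names and the default threshold value are illustrative assumptions.

```python
# A small sketch of the word-level aggregation functions from this section; names and
# the default threshold are illustrative assumptions, not the authors' code.
from math import prod

def naive_bayes(scores):
    """Simplified naive Bayes combination: prod(x) / (prod(x) + prod(1 - x))."""
    p, q = prod(scores), prod(1 - x for x in scores)
    return p / (p + q) if (p + q) > 0 else 0.0

AGGREGATORS = {
    "min": min,
    "max": max,
    "mean": lambda xs: sum(xs) / len(xs),
    "naive_bayes": naive_bayes,
}

def word_is_toxic(token_scores, method="mean", threshold=0.5):
    """Label a word as toxic if the aggregated score of its tokens exceeds the threshold."""
    return AGGREGATORS[method](token_scores) > threshold

print(word_is_toxic([0.9, 0.4], method="naive_bayes"))   # 0.36 / (0.36 + 0.06) ≈ 0.86 -> True
```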
4 Baselines

In this section, we present a set of common baseline approaches used for sequence tagging, such as CRF and LSTM with pre-trained word embeddings. We implement them in order to analyze the performance of our methods in the context of other techniques.

Word-based LogReg. This is a vocabulary-based method: we label words as toxic if they appear in our toxic vocabulary. The vocabulary is created as follows. We create a set of toxic and safe phrases, where toxic phrases are the toxic spans from our data and safe phrases are sentences from our data with the toxic spans removed. We then train a binary logistic regression classifier to separate toxic and safe phrases using words as features. A by-product of this classifier is a list of weights for all words in the data. We consider words with weights greater than a threshold as toxic (see the sketch at the end of this section).

Attention-based LogReg. Another approach to representing words is to take their attention weights from a RoBERTa-based sentence-level toxicity classifier (we train it on the Jigsaw dataset). We assemble the attention weights from all RoBERTa heads and layers into a single vector of dimension 144. These vectors are used as features in a logistic regression classifier. This approach is motivated by the fact that a RoBERTa model trained to recognize toxicity puts more emphasis on certain words associated with sentence-level toxicity. Surprisingly, this model underperforms the logistic regression classifier that uses words as features.

Conditional Random Fields. We suggest that the toxicity level of a word can be context-dependent, so we also experiment with sequence labeling models. We try a Conditional Random Fields (CRF) (Lafferty et al., 2001) model. It uses the following features: the word itself, the word's part of speech, whether the word is a digit, and whether it consists of uppercase letters. Each word is represented with these features of the current, previous, and next words. The model performs closely to the attention-based classifier.

Sequence labeling with LSTM. Finally, we experiment with the LSTM architecture (Hochreiter and Schmidhuber, 1997). We implement a Bi-LSTM network and also train an LSTM tagger from the AllenNLP library.[3] We do not use any pre-trained embeddings in the Bi-LSTM model and use two versions of the AllenNLP LSTM: one without pre-trained embeddings and one with GloVe embeddings (Pennington et al., 2014).

[3] https://github.com/allenai/allennlp
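As an illustration of the word-based LogReg baseline, the sketch below derives a toxic vocabulary from logistic regression weights. The toy phrases, the vectorizer settings, and the weight threshold are assumptions made for the example.

```python
# An illustration of the word-based LogReg baseline (not the authors' code): fit a
# bag-of-words classifier on toxic spans vs. sentences with the spans removed, then
# keep high-weight words as a toxic vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

toxic_phrases = ["moron", "pathetic troll"]                      # toxic spans (toy examples)
safe_phrases = ["Because he's a and a .", "See a shrink you ."]  # sentences with spans removed

texts = toxic_phrases + safe_phrases
labels = [1] * len(toxic_phrases) + [0] * len(safe_phrases)

vectorizer = CountVectorizer(lowercase=True)
clf = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

weight_threshold = 0.0                                           # assumed value
toxic_vocabulary = {word for word, weight
                    in zip(vectorizer.get_feature_names_out(), clf.coef_[0])
                    if weight > weight_threshold}

def label_words(sentence):
    """Vocabulary lookup: a word is toxic if it appears in the learned toxic vocabulary."""
    return [(w, w.lower() in toxic_vocabulary) for w in sentence.split()]

print(label_words("What a pathetic troll"))
```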
5 Evaluation

5.1 Experimental Setting

For each transfer learning model, we use two-stage fine-tuning. We first train only the output layers of the models with a learning rate of 10^-3, and then the whole models with a learning rate of 10^-5. In both cases, we use linear learning rate warm-up for 3,000 steps. We use the AdamW optimizer (Loshchilov and Hutter, 2019), a batch size of 8, and early stopping to determine the number of training steps. We use the transformers library for training.[4]

For all the models which use the additional data from Jigsaw, we apply this scheme twice. The pseudo-labeling model is first fine-tuned on the original toxic spans dataset and then on the self-labeled Jigsaw dataset, whereas the RoBERTa classifier + tagger and the tagging classifier are first fine-tuned on the Jigsaw dataset (as classifiers) and then on the toxic spans dataset (as taggers).

[4] https://huggingface.co/transformers
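A hedged sketch of this two-stage schedule is shown below. The total step counts are placeholders, and the data loading, batching, and early-stopping logic are omitted.

```python
# A hedged sketch of the two-stage schedule: output layer only at lr 1e-3, then the
# whole model at lr 1e-5, each with a 3,000-step linear warm-up, AdamW, batches of 8.
from torch.optim import AdamW
from transformers import AutoModelForTokenClassification, get_linear_schedule_with_warmup

model = AutoModelForTokenClassification.from_pretrained("roberta-base", num_labels=2)

def make_stage(params, lr, total_steps, warmup_steps=3000):
    optimizer = AdamW(params, lr=lr)
    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    return optimizer, scheduler

# Stage 1: freeze the encoder and train only the classification head.
for p in model.roberta.parameters():
    p.requires_grad = False
opt1, sched1 = make_stage(model.classifier.parameters(), lr=1e-3, total_steps=10_000)

# Stage 2: unfreeze everything and fine-tune the whole model.
for p in model.roberta.parameters():
    p.requires_grad = True
opt2, sched2 = make_stage(model.parameters(), lr=1e-5, total_steps=10_000)

# Each stage iterates over batches of 8: loss.backward(); optimizer.step();
# scheduler.step(); optimizer.zero_grad(); with early stopping on the development set.
```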
5.2 Results

The scores of our models and the competing systems are shown in Table 1. Our best submitted system (the tagging classifier) had an F1-score of 0.681 on the test set, while the best team in the whole task got 0.708. This brings our team into the top 20% of the leaderboard. The pseudo-labeling approach was only marginally worse, scoring 0.674. Simply fine-tuning RoBERTa on the tagging problem alone scored 0.668. On the other hand, none of our baselines could approach this result. Our best baseline is the word-based LogReg classifier. Apparently, the other baselines fail to learn even the toxic vocabulary because their word representations are not informative enough.

Model                              F1 score
Top-5 participants
  HITSZ-HLT                        0.708
  S-NLP                            0.707
  hitmi&t                          0.698
  L                                0.698
  YNU-HPCC                         0.696
Our models
  tagging classifier               0.683
  pseudo-labeling                  0.682
  RoBERTa tagger                   0.678
  RoBERTa classifier + tagger      0.670
Our baselines
  Word-based LogReg                0.556
  LSTM, basic embeddings           0.538
  Bi-LSTM, basic embeddings        0.530
  Attention-based LogReg           0.524
  CRF                              0.523
  LSTM, GloVe embeddings           0.497

Table 1: Performance of our models (baselines and RoBERTa-based models) and their comparison with the 5 best-performing participants. Models within each section are sorted from best to worst.

While our best-performing model is only the 18th best out of 92 participating systems, the results of the top systems are fairly close to ours (the difference is less than 2.5%). The variation of deep learning models often falls within this margin (Reimers and Gurevych, 2017). For our models, the sample standard deviation of the F1-score is about 0.9%, so the differences between their performance are likely to be statistically insignificant.

An important hyper-parameter of the models is the probability threshold. It is usually fine-tuned on the development set. However, the development set provided for the task is too small: the threshold fine-tuned on it performs even worse than the standard threshold of 0.5. Thus, during the evaluation period of the competition, we tried submitting models with different threshold values. While this is not a completely fair practice, because we indirectly used the test set for tuning a model parameter, we suspect that many teams were overfitting to the test set in a similar way. We suggest that, in order to make the evaluation fair, the results of the models on the final test set should not be available before the end of the competition, even in an indirect way (i.e., in the form of a team ranking without scores, as was done in this competition).

The best results which we report here were achieved with the threshold of 0.6 (see Table 1). We compare them with the results of the same models with the default threshold of 0.5 in Table 3; the latter scores are lower by up to 1%.

                                   Threshold
Model                              0.5      0.6
tagging classifier                 0.681    0.683
pseudo-labeling                    0.674    0.682
RoBERTa tagger                     0.668    0.678
RoBERTa classifier + tagger        0.664    0.670

Table 3: F1-scores of models with different probability thresholds.

Another hyper-parameter of our models is the method of converting token-level labels to character-level labels. We compare different methods on the development set (see Table 2). Surprisingly, predicting a label for each token individually, with no aggregation, works better than assigning labels to whole words. This might happen because attempts to decode words consistently lead to the propagation of wrongly predicted labels. Following this observation, we use the no-aggregation strategy for all models.

Aggregation type                   F1 score
No aggregation                     0.685
Aggregation of word-level scores
  max token score                  0.673
  min token score                  0.641
  average token scores             0.670
  naive Bayes                      0.653

Table 2: Scores of the tagging classifier model with different token aggregation methods (computed on the development set).

5.3 Efficiency of pre-training

To understand the effect of using additional sentence-labeled data, we compare the performance of the RoBERTa tagger (a model which uses only the toxic span dataset) and the tagging classifier (a model which uses sentence-labeled Jigsaw data in addition to the toxic span dataset) trained on subsets of the data of different sizes. We would like to see whether the usefulness of additional sentence-labeled data decreases as we get more data with token-level labels.

Figure 2 plots the F1-scores of the two models trained on datasets of sizes between 10 and 7,940 sentences. It shows that when the training set size is between 10 and 1,000, pre-training with sentence-level annotations gives a considerable boost in performance. However, the effect of this pre-training becomes insignificant after the size of the data with word-level labeling reaches around 3,000. Thus, this pre-training strategy is efficient only in cases when the amount of data with word-level labeling is very small.

[Figure 2: Learning curves for two transfer learning models, with (tagging classifier) and without (RoBERTa tagger) additional sentence-level data. Axes: training set size (x, 10 to 10,000, log scale) vs. F1 score on the test set (y, 0.45 to 0.65).]

5.4 Error analysis

We analyze the errors of our best submitted system (the tagging classifier) by comparing its predictions with the ground truth labels released after the end of the competition.

The vocabulary of false negative spans (527 unique tokens) is more diverse than that of false positives (275 unique tokens), while the numbers of false positives and false negatives in the test set are comparable (860 vs. 813 tokens). This may indicate that the model is cautious and prefers to highlight only the hypotheses in which it has high confidence, while human annotators are more creative in their analysis. We give some examples of correct and incorrect labelings by our model in Table 4.
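A possible way to collect the false positive and false negative word lists discussed below is to compare predicted and gold character offsets, as in the sketch here; the example data are toy values, not taken from the test set.

```python
# A sketch of collecting false positive / false negative word lists by comparing
# predicted and gold character offsets; the example below uses toy data only.
from collections import Counter

def covered_words(text, offsets):
    mask = set(offsets)
    covered = "".join(ch if i in mask else " " for i, ch in enumerate(text))
    return [w.lower() for w in covered.split()]

fp_counter, fn_counter = Counter(), Counter()

examples = [  # (text, gold character offsets, predicted character offsets)
    ("you stupid liar", set(range(4, 15)), set(range(0, 10))),
]
for text, gold, pred in examples:
    fp_counter.update(covered_words(text, pred - gold))   # predicted but not annotated
    fn_counter.update(covered_words(text, gold - pred))   # annotated but not predicted

print(fp_counter.most_common(20))   # e.g. [('you', 1)]
print(fn_counter.most_common(20))   # e.g. [('liar', 1)]
```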
It happens becomes insignificant after the size of the data with because annotators sometimes label the whole text 931
Correct labeling
  See a shrink you pathetic troll.
  They're not patriots. They're vandals, thieves, and bullies.
  Trudeau and Morneau are fiscally and economically inept and incompetent.
Incorrect labeling
  That's right. They are not normal. And I am starting from the premise that they are ABNORMAL.
  Proceed wth the typical racist, bigot, sexist rubbish. Thanks!
  ADN is endorsing, without officially endorsing. Bunch of cowards!!!
  Rabidly anti-Canadian troll.

Table 4: Examples of ground truth (underlined) and predicted (in bold) toxic spans.

Top 20 false positive words: stupid, ass, idiot, liar, ignorant, garbage, moron, loser, dumb, fools, idiots, troll, fool, crap, pathetic, damn, stupidity, fuck, ridiculous, clown.
Top 20 false negative words: and, of, the, have, are, loser, a, crap, ignorant, all, racist, chemical, you, that, in, not, is, bunch, to, dumb.

Table 5: The most common false positive and false negative words.

In general, the performance on this task might be limited to about 0.7 F1-score (the quality of the best-performing model) by the ambiguity of the annotations. In future work, it

6 Conclusions

We present a number of models for the detection of toxic spans within toxic sentences. All models are RoBERTa language models fine-tuned on data with character-level labeling of toxic spans. In addition to that, we perform fine-tuning on an additional dataset with sentence-level toxicity labels, which yields an improvement. However, our analysis shows that the effect of such pre-training is marginal when the main dataset size exceeds 1,000 samples; substantial improvement is observed only for small dataset sizes. Nevertheless, the models we propose can be useful in extremely low-resource scenarios.

Our model performs closely to the winning systems. We suggest that the differences between the 20 top models might be attributed to the variation of deep learning models and to overfitting the test set. In addition to that, the error analysis shows that some errors of our model might be due to inconsistencies in the test data.

We release the code required to reproduce our experiments online.[5]

[5] https://github.com/skoltech-nlp/toxic-span-detection

Acknowledgements

This work was conducted in the framework of the joint Skoltech-MTS laboratory.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.

Kai Hakala and Sampo Pyysalo. 2019. Biomedical named entity recognition with multilingual BERT. In Proceedings of the 5th Workshop on BioNLP Open Shared Tasks, pages 56–61, Hong Kong, China. Association for Computational Linguistics.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. CoRR, abs/2006.03654.

Shexia He, Zuchao Li, and Hai Zhao. 2019. Syntax-aware multilingual semantic role labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5350–5359, Hong Kong, China. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google's Perspective API built for detecting toxic comments. CoRR, abs/1702.08138.

Fajri Koto, Afshin Rahimi, Jey Han Lau, and Timothy Baldwin. 2020. IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. In Proceedings of the 28th International Conference on Computational Linguistics, pages 757–770, Barcelona, Spain (Online). International Committee on Computational Linguistics.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML 2001).

Dong-Hyun Lee et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3.

João Augusto Leite, Diego Silva, Kalina Bontcheva, and Carolina Scarton. 2020. Toxic language detection in social media for Brazilian Portuguese: New dataset and multilingual analysis. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 914–924, Suzhou, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net.

Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. 2020. HateXplain: A benchmark dataset for explainable hate speech detection. CoRR, abs/2012.10289.

João Moura, Miguel Vera, Daan van Stigt, Fabio Kepler, and André F. T. Martins. 2020. IST-Unbabel participation in the WMT20 quality estimation shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 1029–1036, Online. Association for Computational Linguistics.

Kadir Bulut Ozler, Kate Kenski, Steve Rains, Yotam Shmargad, Kevin Coe, and Steven Bethard. 2020. Fine-tuning for multi-domain and multi-label uncivil language detection. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 28–33, Online. Association for Computational Linguistics.

John Pavlopoulos, Léo Laugier, Jeffrey Sorensen, and Ion Androutsopoulos. 2021. SemEval-2021 Task 5: Toxic spans detection (to appear). In Proceedings of the 15th International Workshop on Semantic Evaluation.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Marek Rei and Anders Søgaard. 2018. Zero-shot sequence labeling: Transferring knowledge from sentences to tokens. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 293–302, New Orleans, Louisiana. Association for Computational Linguistics.

Marek Rei and Anders Søgaard. 2019. Jointly learning to label sentences and tokens. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6916–6923.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348, Copenhagen, Denmark. Association for Computational Linguistics.

Cicero Nogueira dos Santos, Igor Melnyk, and Inkit Padhi. 2018. Fighting offensive language on social media with unsupervised text style transfer. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 189–194, Melbourne, Australia. Association for Computational Linguistics.

Allen Schmaltz. 2019. Detecting local insights from global labels: Supervised & zero-shot sequence labeling via a convolutional decomposition. arXiv preprint arXiv:1906.01154.

Minh Tran, Yipeng Zhang, and Mohammad Soleymani. 2020. Towards a friendly online community: An unsupervised style transfer framework for profanity redaction. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2107–2114, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, and Luo Si. 2020. StructBERT: Incorporating language structures into pre-training for deep language understanding. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, pages 1391–1399.