Neural Supervised Domain Adaptation by Augmenting Pre-trained Models with Random Units
Sara Meftah*, Nasredine Semmar*, Youssef Tamaazousti*, Hassane Essafi*, Fatiha Sadat+
* CEA-List, Université Paris-Saclay, F-91120, Palaiseau, France
+ UQÀM, Montréal, Canada
{firstname.lastname}@cea.fr, sadat.fatiha@uqam.ca

arXiv:2106.04935v1 [cs.CL] 9 Jun 2021

Abstract

Neural Transfer Learning (TL) is becoming ubiquitous in Natural Language Processing (NLP), thanks to its high performance on many tasks, especially in low-resourced scenarios. Notably, TL is widely used for neural domain adaptation to transfer valuable knowledge from high-resource to low-resource domains. In the standard fine-tuning scheme of TL, a model is initially pre-trained on a source domain and subsequently fine-tuned on a target domain and, therefore, source and target domains are trained using the same architecture. In this paper, we show through interpretation methods that such a scheme, despite its efficiency, suffers from a main limitation. Indeed, although capable of adapting to new domains, pre-trained neurons struggle with learning certain patterns that are specific to the target domain. Moreover, we shed light on the hidden negative transfer occurring despite the high relatedness between source and target domains, which may mitigate the final gain brought by transfer learning. To address these problems, we propose to augment the pre-trained model with normalised, weighted and randomly initialised units that foster a better adaptation while maintaining the valuable source knowledge. We show that our approach exhibits significant improvements over the standard fine-tuning scheme for neural domain adaptation from the news domain to the social media domain on four NLP tasks: part-of-speech tagging, chunking, named entity recognition and morphosyntactic tagging.1

1 Under review

1 Introduction

NLP aims to produce resources and tools to understand texts coming from standard languages and their linguistic varieties, such as dialects or user-generated content in social media platforms. This diversity is a challenge for developing high-level tools that are capable of understanding and generating all forms of human languages. Furthermore, in spite of the tremendous empirical results achieved by NLP models based on Neural Networks (NNs), these models are in most cases based on a supervised learning paradigm, i.e. trained from scratch on large amounts of labelled examples. Nevertheless, such a training scheme is not fully optimal. Indeed, NLP neural models with high performance often require huge volumes of manually annotated data to produce powerful results and prevent overfitting. However, manual data annotation is time-consuming. Besides, language changes over the years (Eisenstein, 2019). Thus, most language varieties are under-resourced (Baumann and Pierrehumbert, 2014; Duong, 2017).

Particularly, in spite of the valuable advantage of social media content analysis for a variety of applications (e.g. advertisement, health, or security), this large domain is still poor in terms of annotated data. Furthermore, it has been shown that models intended for news fail to work efficiently on Tweets (Owoputi et al., 2013). This is mainly due to the conversational nature of the text, the lack of conventional orthography, the noise, linguistic errors, spelling inconsistencies, informal abbreviations and the idiosyncratic style of these texts (Horsmann, 2018).

One of the best approaches to address this issue is Transfer Learning (TL); an approach that allows handling the problem of the lack of annotated data, whereby relevant knowledge previously learned in a source problem is leveraged to help in solving a new target problem (Pan et al., 2010). In the context of artificial NNs, TL relies on a model learned on a source-task with sufficient data, further adapted to the target-task of interest. TL has been shown to be powerful for NLP and outperforms the standard supervised learning from scratch paradigm, because it takes benefit from the pre-learned knowledge.
Particularly, the standard fine-tuning (SFT) scheme upper-case letter. Thus the pre-trained units fail to of sequential transfer learning has been shown to discard this pattern which is not always respected be efficient for supervised domain adaptation from in user-generated-content in social media. As a the source news domain to the target social media consequence of this phenomenon, specific patterns domain (Gui et al., 2017; Meftah et al., 2018b,a; to the target-dataset (e.g. “wanna” or “gonna”) are März et al., 2019; Zhao et al., 2017; Lin and Lu, difficult to learn by pre-trained units. This phe- 2018). nomenon is non-desirable, since such specific units In this work we first propose a series of anal- are essential, especially for target-specific classes ysis to spot the limits of the standard fine-tuning (Zhou et al., 2018b; Lakretz et al., 2019). adaptation scheme of sequential transfer learning. Stemming from our analysis, we propose a We start by taking a step towards identifying and new method to overcome the above-mentioned analysing the hidden negative transfer when trans- drawbacks of the standard fine-tuning scheme of ferring from the news domain to the social me- transfer learning. Precisely, we propose a hybrid dia domain. Negative transfer (Rosenstein et al., method that takes benefit from both worlds, random 2005; Wang et al., 2019) occurs when the knowl- initialisation and transfer learning, without their edge learnt in the source domain hampers the learn- drawbacks. It consists in augmenting the source- ing of new knowledge from the target domain. network (set of pre-trained units) with randomly Particularly, when the source and target domains initialised units (that are by design non-biased) and are dissimilar, transfer learning may fail and hurt jointly learn them. We call our method PretRand the performance, leading to a worse performance (Pretrained and Random units). PretRand consists compared to the standard supervised training from of three main ideas: scratch. In this work, we rather perceive the gain brought by the standard fine-tuning scheme com- 1. Augmenting the source-network (set of pre- pared to random initialisation2 as a combination trained units) with a random branch composed of a positive transfer and a hidden negative trans- of randomly initialised units, and jointly learn fer. We define positive transfer as the percentage of them. predictions that were wrongly predicted by random 2. Normalising the outputs of both branches to initialisation, but using transfer learning changed to balance their different behaviours and thus the correct ones. The negative transfer represents forcing the network to consider both. the percentage of predictions that were tagged cor- rectly by random initialisation, but using transfer 3. Applying learnable attention weights on both learning gives incorrect predictions. Hence, the branches predictors to let the network learn final gain brought by transfer learning would be the which of random or pre-trained one is better difference between positive and negative transfer. for every class. We show that despite the final positive gain brought by transfer learning from the high-resource news Our experiments on 4 NLP tasks: Part-of-Speech domain to the low-resource social media domain, tagging (POS), Chunking (CK), Named Entity the hidden negative transfer may mitigate the final Recognition (NER) and Morphosyntactic Tagging gain. 
(MST) show that PretRand enhances considerably Then we perform an interpretive analysis of indi- the performance compared to the standard fine- vidual pre-trained neurons behaviours in different tuning adaptation scheme.4 settings. We find that some of pretrained neurons The remainder of this paper is organised as fol- are biased by what they have learnt in the source- lows. Section 2 presents the background related dataset. For instance, we observe a unit3 firing to our work: transfer learning and interpretation on proper nouns (e.g.“George” and “Washington”) methods for NLP. Section 3 presents the base neu- before fine-tuning and on words with capitalised ral architecture used for sequence labelling in NLP. first-letter whether the word is a proper noun or Section 4 describes our proposed methods to anal- not (e.g. “Man” and “Father”) during fine-tuning. yse the standard fine-tuning scheme of sequential Indeed, in news, only proper nouns start with an transfer learning. Section 5 describes our proposed 2 approach PretRand. Section 6 reports the datasets Random initialisation means training from scratch on target data (in-domain data). 4 This paper is an extension of our previous work (Meftah 3 We use “unit” and “neuron” interchangeably. et al., 2019).
and the experimental setup. Section 7 reports the models designed for specific high-resourced experimental results of our proposed methods and source setting(s) (language, language variety, is divided into two sub-sections: Sub-section 7.1 domain, task, etc) to work in a target low-resourced reports the empirical analysis of the standard fine- setting(s). It includes two categories. First, tuning scheme, highlighting its drawbacks. Sub- unsupervised domain adaptation assumes that section 7.2 presents the experimental results of our labelled examples in the source domain are proposed approach PretRand, showing the effec- sufficiently available, but for the target domain, tiveness of PretRand on different tasks and datasets only unlabelled examples are available. Second, and the impact of incorporating contextualised rep- in supervised domain adaptation setting, a small resentations. Finally, section 8 wraps up by dis- number of labelled target examples are assumed to cussing our findings and future research directions. be available. 2 Background Pretraining In the pretraining stage of STL, a crucial key for Since our work involves two research topics: Se- the success of transfer is the ruling about the pre- quential Transfer Learning (STL) and Interpreta- trained task and domain. For universal represen- tion methods, we discuss in the following sub- tations, the pre-trained task is expected to encode sections the state-of-the-art of each topic with a po- useful features for a wide number of target tasks sitioning of our contributions regarding each one. and domains. In comparison, for domain adapta- 2.1 Sequential Transfer Learning tion, the pre-trained task is expected to be most suitable for the target task in mind. We classify In STL, training is performed in two stages, sequen- pretraining methods into four main categories: un- tially: pretraining on the source task, followed supervised, supervised, multi-task and adversarial by an adaptation on the downstream target tasks pretraining: (Ruder, 2019). The purpose behind using STL techniques for NLP can be divided into two main • Unsupervised pretraining uses raw unlabelled research areas, universal representations and do- data for pretraining. Particularly, it has been main adaptation. successfully used in a wide range of semi- Universal representations aim to build neural nal works to learn universal representations. features (e.g. words embeddings and sentence em- Language modelling task has been partic- beddings) that are transferable and beneficial to a ularly used thanks to its ability to capture wide range of downstream NLP tasks and domains. general-purpose features of language.5 For Indeed, the probabilistic language model proposed instance, TagLM (Peters et al., 2017) is a pre- by Bengio et al. (2003) was the genesis of what trained model based on a bidirectional lan- we call words embedding in NLP, while Word2Vec guage model (biLM), also used to generate (Mikolov et al., 2013) was its outbreak and a start- ELMo (Embeddings from Language Models) ing point for a surge of works on learning words em- representations (Peters et al., 2018). With beddings: e.g. FastText (Bojanowski et al., 2017) the recent emergence of the “Transformers” enriches Word2Vec with subword information. 
Re- architectures (Vaswani et al., 2017), many cently, universal representations re-emerged with works propose pretrained models based on contextualised representations, handling a major these architectures (Devlin et al., 2019; Yang drawback of traditional words embedding. Indeed, et al., 2019; Raffel et al., 2019). Unsuper- these last learn a single context-independent repre- vised pretraining has also been used to im- sentation for each word thus ignoring words poly- prove sequence to sequence learning. We can semy. Therefore, contextualised words representa- cite the work of Ramachandran et al. (2017) tions aim to learn context-dependent word embed- who proposed to improve the performance of dings, i.e. considering the entire sequence as input an encoder-decoder neural machine transla- to produce each word’s embedding. tion model by initialising both encoder and While universal representations seek to be decoder parameters with pretrained weights propitious for any downstream task, domain 5 Note that language modelling is also considered as a adaptation is designed for particular target tasks. self-supervised task since, in fact, labels are automatically Domain adaptation consists in adapting NLP generated from raw data.
from two language models. initialised layers are added on top of pretrained ones. Three main adaptation schemes are used in • Supervised pretraining has been particularly sequential transfer learning: Feature Extraction, used for cross-lingual transfer (e.g. machine Fine-Tuning and the recent Residual Adapters. translation (Zoph and Knight, 2016)), cross- In a Feature Extraction scheme, the pretrained task transfer from POS tagging to words seg- layers’ weights are frozen during adaptation, while mentation task (Yang et al., 2017) and cross- in Fine-Tuning scheme weights are tuned. Accord- domain transfer for biomedical texts for ques- ingly, the former is computationally inexpensive tion answering by Wiese et al. (2017) and while the last allows better adaptation to target do- for NER by Giorgi and Bader (2018). Cross- mains peculiarities. In general, fine-tuning pre- domain transfer has also been used to transfer trained models begets better results, except in cases from news to social media texts for POS tag- wherein the target domain’s annotations are sparse ging (Meftah et al., 2017; März et al., 2019) or noisy (Dhingra et al., 2017; Mou et al., 2016). and sentiment analysis (Zhao et al., 2017). Su- Peters et al. (2019) found that for contextualised pervised pretraining has been also used ef- representations, both adaptation schemes are com- fectively for universal representations learn- petitive, but the appropriate adaptation scheme to ing, e.g. neural machine translation (McCann pick depends on the similarity between the source et al., 2017), language inference (Conneau and target problems. Recently, Residual Adapters et al., 2017) and discourse relations (Nie et al., were proposed by Houlsby et al. (2019) to adapt 2017). pretrained models based on Transformers archi- • Multi-task pretraining has been successfully tecture, aiming to keep Fine-Tuning scheme’s ad- applied to learn general universal sentence vantages while reducing the number of parame- representations by a simultaneous pretrain- ters to update during the adaptation stage. This ing on a set of supervised and unsuper- is achieved by adding adapters (intermediate lay- vised tasks (Subramanian et al., 2018; Cer ers with a small number of parameters) on top of et al., 2018). Subramanian et al. (2018), each pretrained layer. Thus, pretrained layers are for instance, proposed to learn universal sen- frozen, and only adapters are updated during train- tences representations by a joint pretraining ing. Therefore, Residual Adapters performance is on skip-thoughts, machine translation, con- near to Fine-tuning while being computationally stituency parsing, and natural language infer- cheaper (Pfeiffer et al., 2020b,a,c). ence. For domain adaptation, we have per- Our work formed in (Meftah et al., 2020) a multi-task Our work falls under supervised domain adaptation pretraining for supervised domain adaptation research area. Specifically, cross-domain adapta- from the news domain to the social media do- tion from the news domain to the social media do- main. main. The fine-tuning adaptation scheme has been • Adversarial pretraining is particularly used successfully applied on domain adaptation from for domain adaptation when some annotated the news domain to the social media domain (e.g. examples from the target domain are avail- adversarial pretraining (Gui et al., 2017) and super- able. Adversarial training (Ganin et al., 2016) vised pretraining (Meftah et al., 2018a)). 
In this is used as a pretraining step followed by an research, we highlight the aforementioned draw- adaptation step on the target dataset. Adver- backs (biased pre-trained units and the hidden neg- sarial pretraining demonstrated its effective- ative transfer) of the standard fine-tuning adapta- ness in several NLP tasks, e.g. cross-lingual tion scheme. Then, we propose a new adaptation sentiment analysis (Chen et al., 2018). Also, scheme (PretRand) to handle these problems. Fur- it has been used to learn cross-lingual words thermore, while ELMo contextualised words repre- embeddings (Lample et al., 2018). sentations efficiency has been proven for different tasks and datasets (Peters et al., 2019; Fecht et al., Adaptation 2019; Schumacher and Dredze, 2019), here we in- During the adaptation stage of STL, one or more vestigate their impact when used, simultaneously, layers from the pretrained model are transferred to with a sequential transfer learning scheme for su- the downstream task, and one or more randomly pervised domain adaptation.
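The adaptation schemes discussed in this section (feature extraction, fine-tuning, residual adapters) differ mainly in which parameters are updated during adaptation. As a minimal illustration (not taken from the paper), the PyTorch sketch below contrasts feature extraction with fine-tuning; the attribute names feature_extractor and classifier are assumptions made for the sake of the example.

```python
import torch

def configure_adaptation(model, scheme="fine-tuning", lr=1.5e-2):
    """Return an SGD optimiser implementing one adaptation scheme.

    'feature-extraction': pre-trained layers are frozen and only the new
    classifier is updated; 'fine-tuning': all weights are updated.
    """
    if scheme == "feature-extraction":
        for p in model.feature_extractor.parameters():
            p.requires_grad = False           # freeze pre-trained layers
        params = model.classifier.parameters()
    elif scheme == "fine-tuning":
        params = model.parameters()           # every weight is tuned
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return torch.optim.SGD((p for p in params if p.requires_grad),
                           lr=lr, momentum=0.9)
```

Residual adapters follow the same freezing logic, except that small adapter layers inserted on top of each frozen pre-trained layer are the only trainable parameters.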
2.2 Interpretation methods for NLP computer vision (Coates and Ng, 2011; Girshick et al., 2014; Zhou et al., 2015), and more recently Recently, a rising interest is devoted to peek inside in NLP, wherein units activations are visualised black-box neural NLP models to interpret their in heatmaps. For instance, Karpathy et al. (2016) internal representations and their functioning. A visualised character-level Long Short-Term Mem- variety of methods were proposed in the literature, ory (LSTM) cells learned in language modelling here we only discuss those that are most related to and found multiple interpretable units that track our research. long-distance dependencies, such as line lengths and quotes; Radford et al. (2017) visualised a unit Probing tasks is a common approach for NLP which performs sentiment analysis in a language models analysis used to investigate which model based on Recurrent Neural Networks linguistic properties are encoded in the latent (RNNs); Bau et al. (2019) visualised neurons representations of the neural model (Shi et al., specialised on tense, gender, number, etc. in 2016). Concretely, given a neural model M NMT models; and Kádár et al. (2017) proposed trained on a particular NLP task, whether it is top-k-contexts approach to identify sentences, an unsupervised (e.g. language modelling (LM)) thus linguistic patterns, sparking the highest acti- or supervised (e.g. Neural Machine Translation vation values of each unit in an RNNs-based model. (NMT)), a shallow classifier is trained on top of the frozen M on a corpus annotated with the linguistic properties of interest. The aim is to Neural representations correlation analysis: examine whether M’s hidden representations Cross-network and cross-layers correlation is encode the property of interest. For instance, Shi a significant approach to gain insights on how et al. (2016) found that different levels of syntactic internal representations may vary across networks, information are learned by NMT encoder’s layers. network-depth and training time. Suitable Adi et al. (2016) investigated what information approaches are based on Correlation Canonical (between sentence length, words order and Analysis (CCA) (Hotelling, 1992; Uurtio et al., word-content) is captured by different sentence 2018), such as Singular Vector Canonical Correla- embedding learning methods. Conneau et al. tion Analysis (Raghu et al., 2017) and Projected (2018) proposed 10 probing tasks annotated with Weighted Canonical Correlation Analysis (Morcos fine-grained linguistic properties and compared et al., 2018), that were successfully used in NLP different approaches for sentence embeddings. Zhu neural models analysis. For instance, it was used et al. (2018) inspected which semantic properties by Bau et al. (2019) to calculate cross-networks (e.g. negation, synonymy, etc.) are encoded correlation for ranking important neurons in NMT by different sentence embeddings approaches. and LM. Saphra and Lopez (2019) applied it to Furthermore, the emergence of contextualised probe the evolution of syntactic, semantic, and words representations have triggered a surge of topic representations cross-time and cross-layers. works on probing what these representations are Raghu et al. (2019) compared the internal rep- learning (Liu et al., 2019a; Clark et al., 2019). resentations of models trained from scratch vs This approach, however, suffers from two main models initialised with pre-trained weights. CCA flaws. 
First, probing tasks examine properties based methods aim to calculate similarity between captured by the model at a coarse-grained level, neural representations at the coarse-grained level. i.e. layers representations, and thereby, will not In contrast, correlation analysis at the fine-grained identify features captured by individual neurons. level, i.e. between individual neurons, has also Second, probing tasks will not identify linguistic been explored in the literature. Initially, Li et al. properties that do not appear in the annotated (2015) used Pearson’s correlation to examine to probing datasets (Zhou et al., 2018a). which extent each individual unit is correlated to another unit, either within the same network or between different networks. The same correlation Individual units stimulus: Inspired by works on metric was used by Bau et al. (2019) to determine receptive fields of biological neurons (Hubel and important neurons in NMT and LM tasks. Wiesel, 1965), much work has been devoted for interpreting and visualising individual hidden units stimulus-features in neural networks. Initially, in Our Work:
In this work, we propose two approaches (§4.2) to highlight the bias effect in the standard fine-tuning scheme of transfer learning in NLP; the first method is based on individual units stimulus and the second on neural representations correlation analysis. To the best of our knowledge, we are the first to harness these interpretation methods to analyse individual units behaviour in a transfer learning scheme. Furthermore, the most analysed tasks in the literature are Natural Language Inference, NMT and LM (Belinkov and Glass, 2019); here we target tasks that are under-explored in visualisation works, such as POS, MST, CK and NER.

3 Base Neural Sequence Labelling Model

Given an input sentence S of n successive tokens S = [w_1, ..., w_n], the goal of sequence labelling is to predict the label c_t ∈ C of every w_t, with C being the tag-set. We use a commonly used end-to-end neural sequence labelling model (Ma and Hovy, 2016; Plank et al., 2016; Yang et al., 2018), which is composed of three components (illustrated in Figure 1). First, the Word Representation Extractor (WRE), denoted Υ, computes a vector representation x_t for each token w_t. Second, this representation is fed into a Feature Extractor (FE) based on a bidirectional Long Short-Term Memory (biLSTM) network (Graves et al., 2013), denoted Φ. It produces a hidden representation, h_t, that is fed into a Classifier (Cl): a fully-connected layer (FCL), denoted Ψ. Formally, given w_t, the logits are obtained using the following equation: ŷ_t = (Ψ ◦ Φ ◦ Υ)(w_t).6

6 For simplicity, we define ŷ_t only as a function of w_t. In reality, the prediction ŷ_t for the word w_t is also a function of the remaining words in the sentence and the model's parameters, in addition to w_t.

In the standard supervised training scheme, the three modules are jointly trained from scratch by minimising the Softmax Cross-Entropy (SCE) loss using the Stochastic Gradient Descent (SGD) algorithm. Let us consider a training set of M annotated sentences, where each sentence i is composed of m_i tokens. Given a training word (w_{i,t}, y_{i,t}) from the training sentence i, where y_{i,t} is the gold standard label for the word w_{i,t}, the cross-entropy loss for this example is calculated as follows:

L_(i,t) = − y_{i,t} × log(ŷ_{i,t}) .    (1)

Thus, during the training of the sequence labelling model on the M annotated sentences, the model's loss is defined as follows:

L = Σ_{i=1}^{M} Σ_{t=1}^{m_i} L_(i,t) .    (2)

Figure 1: Illustrative scheme of the base neural model for sequence labelling tasks.
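To make the three components concrete, the following is a minimal PyTorch sketch of such a sequence labeller trained from scratch with the SCE loss and SGD (Eqs. 1-2). It is an illustration rather than the authors' code: the character-level part of the WRE is omitted, and the vocabulary size, tag-set size and tensor shapes are toy assumptions.

```python
import torch
import torch.nn as nn

class SequenceLabeller(nn.Module):
    """WRE (word embeddings only), biLSTM feature extractor (Phi)
    and fully-connected classifier (Psi)."""

    def __init__(self, vocab_size, tagset_size, emb_dim=300, hidden_dim=200):
        super().__init__()
        self.wre = nn.Embedding(vocab_size, emb_dim)            # Upsilon
        self.fe = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                          bidirectional=True)                   # Phi
        self.cl = nn.Linear(2 * hidden_dim, tagset_size)        # Psi

    def forward(self, word_ids):          # word_ids: (batch, seq_len)
        x = self.wre(word_ids)            # x_t = Upsilon(w_t)
        h, _ = self.fe(x)                 # h_t = Phi(x_t)
        return self.cl(h)                 # logits ŷ_t = Psi(h_t)

model = SequenceLabeller(vocab_size=20000, tagset_size=17)
criterion = nn.CrossEntropyLoss()         # softmax cross-entropy; averages over
                                          # tokens, whereas Eq. 2 writes a sum
optimizer = torch.optim.SGD(model.parameters(), lr=1.5e-2, momentum=0.9)

words = torch.randint(0, 20000, (16, 30))  # one toy mini-batch of 16 sentences
gold = torch.randint(0, 17, (16, 30))
logits = model(words)
loss = criterion(logits.view(-1, 17), gold.view(-1))
loss.backward()
optimizer.step()
```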
4 Analysis of the Standard Fine-Tuning Scheme

The standard fine-tuning scheme consists in transferring a part of the learned weights from a source model to initialise the target model, which is further fine-tuned on the target task with a small number of training examples from the target domain. Given a source neural network M_s with a set of parameters θ_s split into two sets, θ_s = (θ_s1, θ_s2), and a target network M_t with a set of parameters θ_t split into two sets, θ_t = (θ_t1, θ_t2), the standard fine-tuning scheme of transfer learning includes three simple yet effective steps:

1. We train the source model on annotated data from the source domain on a source dataset.
2. We transfer the first set of parameters from the source network M_s to the target network M_t: θ_t1 = θ_s1, whereas the second set θ_t2 of parameters is randomly initialised.
3. Then, the target model is further fine-tuned on the small target data-set.

Source and target datasets may have different tag-sets, even within the same NLP task. Hence, transferring the parameters of the classifier (Ψ) may not be feasible in all cases. Therefore, in our experiments, the WRE's layers (Υ) and the FE's layers (Φ) are initialised with the source model's weights and Ψ is randomly initialised. Then, the three modules are further jointly trained on the target-dataset by minimising a SCE loss using the SGD algorithm.

4.1 The Hidden Negative Transfer

It has been shown in many works in the literature (Rosenstein et al., 2005; Ge et al., 2014; Ruder, 2019; Gui et al., 2018; Cao et al., 2018; Chen et al., 2019; Wang et al., 2019; O'Neill, 2019) that, when the source and target domains are less related (e.g. languages from different families), sequential transfer learning may lead to a negative effect on the performance, instead of improving it. This phenomenon is referred to as negative transfer. Precisely, negative transfer is considered to occur when transfer learning is harmful to the target task/dataset, i.e. the performance when using a transfer learning algorithm is lower than that of a solely supervised training on in-target data (Torrey and Shavlik, 2010).

In NLP, the negative transfer phenomenon has only seldom been studied. We can cite the recent work of Kocmi (2020), who evaluated negative transfer in transfer learning for neural machine translation when the transfer is performed between different language-pairs. They found that: 1) the distributions mismatch between source and target language-pairs does not beget a negative transfer; 2) the transfer may have a negative impact when the source language-pair is less-resourced compared to the target one, in terms of annotated examples.

Our experiments in (Meftah et al., 2018a,b) have shown that transfer learning techniques from the news domain to the social media domain using the standard fine-tuning scheme boost the tagging performance. Hence, following the above definition, transfer learning from news to social media does not beget a negative transfer. Contrariwise, in this work, we instead consider the hidden negative transfer, i.e. the percentage of predictions that were correctly tagged by random initialisation, but for which transfer learning gives wrong predictions.

Let us consider the gain G_i brought by the standard fine-tuning scheme (SFT) of transfer learning compared to the random initialisation for a dataset i. G_i is defined as the difference between positive transfer PT_i and negative transfer NT_i:

G_i = PT_i − NT_i ,    (3)

where positive transfer PT_i represents the percentage of tokens that were wrongly predicted by random initialisation, but that the SFT changed to the correct ones, and negative transfer NT_i represents the percentage of words that were tagged correctly by random initialisation, but for which the SFT gives wrong predictions. PT_i and NT_i are defined as follows:

PT_i = N_i^corrected / N_i ,    (4)

NT_i = N_i^falsified / N_i ,    (5)

where N_i is the total number of tokens in the validation-set, N_i^corrected is the number of tokens from the validation-set that were wrongly tagged by the model trained from scratch but are correctly predicted by the SFT scheme, and N_i^falsified is the number of tokens from the validation-set that were correctly tagged by the model trained from scratch but are wrongly predicted by the SFT scheme.
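As an illustration of Eqs. 3-5, the short Python sketch below computes PT_i, NT_i and the gain G_i from three aligned token-level tag sequences; the function and variable names are ours, not the paper's.

```python
def transfer_gain(gold, pred_scratch, pred_sft):
    """Positive/negative transfer (Eqs. 3-5) from token-level predictions.

    gold, pred_scratch, pred_sft: aligned tag lists over the whole
    validation-set (pred_scratch = model trained from scratch,
    pred_sft = standard fine-tuning)."""
    n = len(gold)
    corrected = sum(1 for g, s, f in zip(gold, pred_scratch, pred_sft)
                    if s != g and f == g)   # wrong from scratch, fixed by SFT
    falsified = sum(1 for g, s, f in zip(gold, pred_scratch, pred_sft)
                    if s == g and f != g)   # right from scratch, broken by SFT
    pt, nt = corrected / n, falsified / n
    return pt, nt, pt - nt                  # PT_i, NT_i, G_i

# Toy example: PT = 2/5, NT = 1/5, so the final gain G = 1/5.
gold = ["noun", "verb", "propn", "adj", "intj"]
scratch = ["noun", "noun", "noun", "adj", "intj"]
sft = ["noun", "verb", "propn", "adj", "verb"]
print(transfer_gain(gold, scratch, sft))
```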
4.2 Interpretation of Pretrained Neurons

Here, we propose to perform a set of analysis techniques to gain some insights into how the inner pretrained representations are updated during fine-tuning on social media datasets when using the standard fine-tuning scheme of transfer learning. For this, we propose to analyse the feature extractor's (Φ) activations. Precisely, we attempt to visualise biased neurons, i.e. pre-trained neurons that do not change much from their initial state.

Let us consider a validation-set of N words. The feature extractor Φ generates a matrix h ∈ M_{N,H}(R) of activations over all the words of the validation-set, where M_{f,g}(R) is the space of f × g matrices over R and H is the size of the hidden representation (number of neurons). Each element h_{i,j} of the matrix represents the activation of the neuron j on the word w_i.

Given two models, the first before fine-tuning and the second after fine-tuning, we obtain two matrices, h^before ∈ M_{N,H}(R) and h^after ∈ M_{N,H}(R), which give the activations of Φ over all the validation-set's words before and after fine-tuning, respectively. We aim to visualise and quantify the change of the representations generated by the model from the initial state, h^before (before fine-tuning), to the final state, h^after (after fine-tuning). For this purpose, we perform two experiments:

1. Quantifying the change of pretrained individual neurons (§4.2.1);
2. Visualising the evolution of pretrained neurons stimulus during fine-tuning (§4.2.2).

4.2.1 Quantifying the Change of Individual Pretrained Neurons

In order to quantify the change of the knowledge encoded in pretrained neurons after fine-tuning, we propose to calculate the similarity (correlation) between neurons activations before and after fine-tuning, when using the SFT adaptation scheme. Precisely, we calculate the correlation coefficient between each neuron's activations on the target-domain validation-set before starting fine-tuning and at the end of fine-tuning.

Following the above formulation and as illustrated in Figure 2, from the h^before and h^after matrices we extract two vectors, h^before_{.j} ∈ R^N and h^after_{.j} ∈ R^N, representing respectively the activations of a unit j over all the validation-set's words before and after fine-tuning. Next, we generate an asymmetric correlation matrix C ∈ M_{H,H}(R), where each element c_{jt} represents the Pearson's correlation between the activation vector of unit j after fine-tuning (h^after_{.j}) and the activation vector of unit t before fine-tuning (h^before_{.t}), computed as follows:

c_{jt} = E[(h^after_{.j} − μ_j^after)(h^before_{.t} − μ_t^before)] / (σ_j^after σ_t^before) .    (6)

Here μ_j^before and σ_j^before represent, respectively, the mean and the standard deviation of unit j's activations over the validation-set. Clearly, we are interested in the matrix diagonal, where c_{jj} represents the charge of each unit j from Φ, i.e. the correlation between each unit's activations after fine-tuning and its activations before fine-tuning.

4.2.2 Visualising the Evolution of Pretrained Neurons Stimulus during Fine-tuning

Here, we perform units visualisation at the individual level to gain insights on how the patterns encoded by individual units progress during fine-tuning when using the SFT scheme. To do this, we generate the top-k activated words for each unit, i.e. the words in the validation-set that fire the said unit the most, positively and negatively (since LSTMs generate positive and negative activations). In (Kádár et al., 2017), the top-k activated contexts were plotted at the end of training (the best model), which shows what each unit is specialised on, but it does not give insights about how the said unit is evolving and changing during training. Thus, taking into account only the final state of training does not reveal the whole picture. Here, we instead propose to generate and plot the top-k words activating each unit throughout the adaptation stage. We follow two main steps (as illustrated in Figure 3):

1. We represent each unit j from Φ with a matrix A^(j) ∈ M_{N,D}(R) of the said unit's activations on all the validation-set at different training epochs, where D is the number of epochs and N is the number of words in the validation-set. Thus, each element a^(j)_{y,z} represents the activation of the unit j on the word w_y at the epoch z.

2. We carry out a sorting of each column of the matrix (each column represents an epoch) and pick the highest k words (for the top-k words firing the unit positively) and the lowest k words (for the top-k words firing the unit negatively), leading to two matrices, A^(j)_{best+} ∈ M_{D,k}(R) and A^(j)_{best−} ∈ M_{D,k}(R), the first for the top-k words activating positively the unit j at each training epoch, and the last for the top-k words activating negatively the unit j at each training epoch.
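A small NumPy sketch of the charge computation of Eq. 6, under the assumption that the activations are stored as (N words × H units) arrays: standardising each unit and taking a scaled matrix product yields the full correlation matrix C, whose diagonal is the charge.

```python
import numpy as np

def charge(h_before, h_after):
    """Pearson correlations (Eq. 6) between units after fine-tuning (rows)
    and units before fine-tuning (columns).

    h_before, h_after: arrays of shape (N, H) holding the feature
    extractor's activations on the target-domain validation-set.
    Returns the full H x H matrix C and its diagonal (the 'charge')."""
    a = (h_after - h_after.mean(0)) / h_after.std(0)    # standardise each unit
    b = (h_before - h_before.mean(0)) / h_before.std(0)
    C = a.T @ b / h_before.shape[0]                     # c_jt = E[...] / (sigma_j sigma_t)
    return C, np.diag(C)

# Toy example with H = 4 units that change only mildly during fine-tuning.
rng = np.random.default_rng(0)
h_before = rng.normal(size=(1000, 4))
h_after = 0.9 * h_before + 0.1 * rng.normal(size=(1000, 4))
C, diag = charge(h_before, h_after)
print(diag)   # values close to 1 indicate highly "charged" (biased) units
```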
Figure 2: Illustrative scheme of the computation of the charge of unit j, i.e. the Pearson correlation between unit j's activation vector after fine-tuning and its activation vector before fine-tuning.

Figure 3: Illustrative scheme of the computation of the top-k words activating unit j, positively (A^(j)_{best+}) and negatively (A^(j)_{best−}), during fine-tuning epochs. h_{epoch_z} stands for Φ's outputs at epoch number z.
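The top-k extraction of §4.2.2 amounts to a column-wise sort of each unit's (N × D) activation matrix; one possible sketch, with illustrative names, is given below.

```python
import numpy as np

def topk_words_per_epoch(A_j, words, k=10):
    """Top-k words firing unit j at every epoch (the matrices
    A^(j)_best+ and A^(j)_best- of Section 4.2.2).

    A_j:   array of shape (N words, D epochs) with the activations of unit j.
    words: list of the N validation-set tokens, aligned with the rows of A_j."""
    order = np.argsort(A_j, axis=0)          # column-wise (per-epoch) sort
    best_pos = [[words[i] for i in order[-k:, d][::-1]]   # highest activations
                for d in range(A_j.shape[1])]
    best_neg = [[words[i] for i in order[:k, d]]          # lowest activations
                for d in range(A_j.shape[1])]
    return best_pos, best_neg                # one list of k words per epoch
```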
5 Joint Learning of Pretrained and Random Units: PretRand

We found from our analysis (in section 7.1) of pre-trained neurons behaviours that the standard fine-tuning scheme suffers from a main limitation. Indeed, some pre-trained neurons are still biased by what they have learned from the source domain despite the fine-tuning on the target domain. We thus propose a new adaptation scheme, PretRand, to take benefit from both worlds: the pre-learned knowledge in the pretrained neurons and the target-specific features easily learnt by random neurons. PretRand, illustrated in Figure 4, consists of three steps:

1. Augmenting the pre-trained branch with a random one to facilitate the learning of new target-specific patterns (§5.1);
2. Normalising both branches to balance their behaviours during fine-tuning (§5.2);
3. Applying learnable weights on both branches to let the network learn which of the random or pre-trained one is better for every class (§5.3).

5.1 Adding the Random Branch

We expect that augmenting the pretrained model with new randomly initialised neurons allows a better adaptation during fine-tuning. Thus, in the adaptation stage, we augment the pre-trained model with a random branch consisting of additional random units (as illustrated in the scheme "a" of Figure 4). Several works have shown that deep (top) layers are more task-specific than shallow (low) ones, while shallow layers learn generic features that are easily transferable between tasks (Peters et al., 2018; Mou et al., 2016). In addition, word embeddings (shallow layers) contain the majority of parameters. Based on these factors, we choose to expand only the top layers as a trade-off between performance and number of parameters (model complexity). In terms of the expanded layers, we add an extra biLSTM layer of k units in the FE (Φ_r, where r stands for random) and a new fully-connected layer of C units (called Ψ_r). With this choice, we increase the complexity of the model by only 1.02× compared to the base one (the standard fine-tuning scheme).

Concretely, for every w_i, two prediction vectors are computed: ŷ_i^p from the pre-trained branch and ŷ_i^r from the random one. Specifically, the pre-trained branch predicts class-probabilities following:

ŷ_i^p = (Ψ_p ◦ Φ_p)(x_i) ,    (7)

with x_i = Υ(w_i). Likewise, the additional random branch predicts class-probabilities following:

ŷ_i^r = (Ψ_r ◦ Φ_r)(x_i) .    (8)

To get the final predictions, we simply apply an element-wise sum between the outputs of the pre-trained branch and the random branch:

ŷ_i = ŷ_i^p ⊕ ŷ_i^r .    (9)

As in the classical scheme, the SCE loss is minimised, but here both branches are trained jointly.

5.2 Independent Normalisation

Our first implementation of adding the random branch was less effective than expected. The main explanation is that the pre-trained units were dominating the random units, which means that the weights, as well as the gradients and outputs, of the pre-trained units absorb those of the random units. As illustrated in the left plot of Figure 5, the absorption phenomenon stays true even at the end of the training process; we observe that the random units' weights are closer to zero. This absorption property handicaps the random units in firing on the words of the target dataset.7

7 The same problem was stated in some computer-vision works (Liu et al., 2015; Wang et al., 2017; Tamaazousti et al., 2017).

To alleviate this absorption phenomenon and push the random units to be more competitive, we normalise the outputs of both branches (ŷ_i^p and ŷ_i^r) using the l2-norm, as illustrated in the scheme "b" of Figure 4. The normalisation of a vector x is computed using the following formula:

N_2(x) = [ x_i / ||x||_2 ]_{i=1}^{i=|x|} .    (10)

Thanks to this normalisation, the absorption phenomenon was solved and the random branch starts to be more effective (see the right distribution of Figure 5).

Furthermore, we have observed that despite the normalisation, the performance of the pre-trained classifiers is still much better than that of the randomly initialised ones. Thus, to make them more competitive, we propose to start with optimising only the randomly initialised units while freezing the pre-trained ones; then, we launch the joint training. We call this technique random++.
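The following PyTorch module is one possible sketch of the PretRand head on top of shared word representations x_i: a pre-trained branch, an added random branch, independent l2-normalisation of their prediction vectors, and the learnable class-wise weighting vectors u and v described in §5.3 below (Eqs. 7-11). It assumes the pre-trained branch is passed in as a (Φ_p, Ψ_p) pair loaded from the source model; this is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretRandHead(nn.Module):
    """Pre-trained branch + randomly initialised branch, merged by a
    normalised, weighted element-wise sum."""

    def __init__(self, emb_dim, hidden_dim, n_classes, pretrained_branch):
        super().__init__()
        self.phi_p, self.psi_p = pretrained_branch       # (Phi_p, Psi_p) from the source model
        self.phi_r = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                             bidirectional=True)         # Phi_r: added random units
        self.psi_r = nn.Linear(2 * hidden_dim, n_classes) # Psi_r
        self.u = nn.Parameter(torch.ones(n_classes))     # class-wise weights, one per branch,
        self.v = nn.Parameter(torch.ones(n_classes))     # both initialised with 1-values

    def forward(self, x):                                # x: (batch, seq, emb_dim) = Upsilon(w)
        y_p = self.psi_p(self.phi_p(x)[0])               # Eq. 7
        y_r = self.psi_r(self.phi_r(x)[0])               # Eq. 8
        y_p = F.normalize(y_p, p=2, dim=-1)              # independent l2-normalisation (Eq. 10)
        y_r = F.normalize(y_r, p=2, dim=-1)
        return self.u * y_p + self.v * y_r               # weighted element-wise sum (Eq. 11)
```

Training then proceeds exactly as in the standard scheme, minimising the SCE loss on the merged predictions, optionally with the random++ warm-up in which only the random branch is updated first.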
Figure 4: Illustrative scheme of the three ideas composing our proposed adaptation method, PretRand. a) We augment the pre-trained branch (grey branch) with a randomly initialised one (green branch) and jointly adapt them with pre-trained ones (grey branch). An element-wise sum is further applied to merge the two branches. b) Before merging, we balance the different behaviours of pre-trained and random units, using an independent normalisation (N). c) Finally we let the network learn which of pre-trained or random neurons are more suited for every class, by performing an element-wise product of the FC layers with learnable weighting vectors (u and v initialised with 1-values).
Figure 5: The distributions of the learnt weight-values for the randomly initialised (green) and pre-trained (grey) fully-connected layers after their joint training. Left: without normalisation, right: with normalisation.

5.3 Attention Learnable Weighting Vectors

Heretofore, the pre-trained and random branches participate equally in every class' predictions, i.e. we do not weight the dimensions of ŷ_i^p and ŷ_i^r before merging them with an element-wise summation. Nevertheless, random classifiers may be more efficient for specific classes compared to pre-trained ones and vice-versa. In other terms, we do not know which of the two branches (random or pre-trained) is better for making a suitable decision for each class. For instance, if the random branch is more efficient for predicting a particular class c_j, it would be better to give more attention to its outputs concerning the class c_j compared to the pretrained branch.

Therefore, instead of simply performing an element-wise sum between the random and pre-trained predictions, we first weight ŷ_i^p with a learnable weighting vector u ∈ R^C and ŷ_i^r with a learnable weighting vector v ∈ R^C, where C is the tagset size (number of classes). The element u_j from the vector u represents the random branch's attention weight for the class c_j, and the element v_j from the vector v represents the pretrained branch's attention weight for the class c_j. Then, we compute a Hadamard product with their associated normalised predictions (see the scheme "c" of Figure 4). Both vectors u and v are initialised with 1-values and are fine-tuned by back-propagation. Formally, the final predictions are computed as follows:

ŷ_i = u ⊙ N_2(ŷ_i^p) ⊕ v ⊙ N_2(ŷ_i^r) .    (11)

6 Experimental Settings

6.1 Datasets

We conduct experiments on supervised domain adaptation from the news domain (formal texts) to the social media domain (noisy texts) for English Part-Of-Speech tagging (POS), Chunking (CK) and Named Entity Recognition (NER). In addition, we experiment on Morpho-syntactic Tagging (MST) of three South-Slavic languages: Slovene, Croatian and Serbian. For the POS task, we use the WSJ part of the Penn-Tree-Bank (PTB) (Marcus et al., 1993) news dataset for the source news domain and TPoS (Ritter et al., 2011), ArK (Owoputi et al., 2013) and TweeBank (Liu et al., 2018) for the target social media domain. For the CK task, we use the CONLL2000 (Tjong Kim Sang and Buchholz, 2000) dataset for the news source domain and TChunk (Ritter et al., 2011) for the target domain. For the NER task, we use the CONLL2003 dataset (Tjong Kim Sang and De Meulder, 2003) for the source news domain and the WNUT-17 dataset (Derczynski et al., 2017) for the social media target domain. For MST, we use the MTT shared-task (Zampieri et al., 2018) benchmark containing two types of datasets, social media and news, for three South-Slavic languages: Slovene (sl), Croatian (hr) and Serbian (sr). Statistics of all the datasets are summarised in Table 1.

6.2 Evaluation Metrics

We evaluate our models using metrics that are commonly used by the community. Specifically, accuracy (acc.) for POS, MST and CK, and entity-level F1 for NER.

Comparison criteria: A common approach to compare the performance between different approaches across different datasets and tasks is to take the average of each approach across all tasks and datasets. However, as it has been discussed in many research papers (Subramanian et al., 2018; Rebuffi et al., 2017; Tamaazousti, 2018), when tasks are not evaluated using the same metrics or results across datasets are not of the same order of magnitude, the simple average does not allow a "coherent aggregation". For this, we use the average Normalized Relative Gain (aNRG) proposed by Tamaazousti et al. (2019), where a score aNRG_i for each approach i is calculated compared to a reference approach (baseline) as follows:
aNRG_i = (1/L) Σ_{j=1}^{L} (s_j^i − s_j^ref) / (s_j^max − s_j^ref) ,    (12)

with s_j^i being the score of the approach i on the dataset j, s_j^ref being the score of the reference approach on the dataset j, and s_j^max being the best achieved score across all approaches on the dataset j.
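For illustration, aNRG can be computed with a few lines of Python; the scores in the usage example are made-up numbers, not results from the paper.

```python
def anrg(scores, ref, best):
    """average Normalized Relative Gain (Eq. 12) of one approach.

    scores: per-dataset scores s_j^i of the approach being evaluated,
    ref:    scores s_j^ref of the reference approach (baseline),
    best:   best scores s_j^max over all compared approaches
            (assumed strictly greater than the reference scores)."""
    gains = [(s - r) / (m - r) for s, r, m in zip(scores, ref, best)]
    return sum(gains) / len(gains)

# Toy usage with three datasets: the closer the result is to 1, the more of
# the gap between the baseline and the best score the approach closes.
print(anrg(scores=[91.0, 89.5, 42.0], ref=[88.5, 86.8, 36.8], best=[92.0, 90.0, 43.0]))
```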
Table 1: Statistics of the used datasets. Top: datasets of the source domain. Bottom: datasets of the target domain.

Task | #Classes | Source | Eval. metric | #Tokens (train - val - test)
POS: POS Tagging | 36 | WSJ | Top-1 Acc. | 912,344 - 131,768 - 129,654
CK: Chunking | 22 | CONLL-2000 | Top-1 Acc. | 211,727 - n/a - 47,377
NER: Named Entity Recognition | 4 | CONLL-2003 | Top-1 Exact-match F1 | 203,621 - 51,362 - 46,435
MST: Morpho-syntactic Tagging | 1304 | Slovene-news | Top-1 Acc. | 439k - 58k - 88k
MST: Morpho-syntactic Tagging | 772 | Croatian-news | Top-1 Acc. | 379k - 50k - 75k
MST: Morpho-syntactic Tagging | 557 | Serbian-news | Top-1 Acc. | 59k - 11k - 16k
POS: POS Tagging | 40 | TPoS | Top-1 Acc. | 10,500 - 2,300 - 2,900
POS: POS Tagging | 25 | ArK | Top-1 Acc. | 26,500 - n/a - 7,700
POS: POS Tagging | 17 | TweeBank | Top-1 Acc. | 24,753 - 11,742 - 19,112
CK: Chunking | 18 | TChunk | Top-1 Acc. | 10,652 - 2,242 - 2,291
NER: Named Entity Recognition | 6 | WNUT-17 | Top-1 Exact-match F1 | 62,729 - 15,734 - 23,394
MST: Morpho-syntactic Tagging | 1102 | Slovene-sm | Top-1 Acc. | 37,756 - 7,056 - 19,296
MST: Morpho-syntactic Tagging | 654 | Croatian-sm | Top-1 Acc. | 45,609 - 8,886 - 21,412
MST: Morpho-syntactic Tagging | 589 | Serbian-sm | Top-1 Acc. | 45,708 - 9,581 - 23,327

6.3 Implementation Details

All our models are implemented with the PyTorch library (Paszke et al., 2017). We use the following Hyper-Parameters (HP):

WRE's HP: In the standard word-level embeddings, tokens are lower-cased, while the character-level component still retains access to the capitalisation information. We set the randomly initialised character embedding dimension at 50 and the dimension of the hidden states of the character-level biLSTM at 100, and used 300-dimensional word-level embeddings. The latter were pre-loaded from publicly available GloVe vectors pre-trained on 42 billion words from a web crawl and containing 1.9M words (Pennington et al., 2014) for English experiments, and pre-loaded from publicly available FastText (Bojanowski et al., 2017) vectors pre-trained on Common Crawl for South-Slavic languages.8 These embeddings are also updated during training. For experiments with contextual word embeddings (§7.2.3), we used ELMo (Embeddings from Language Models) embeddings (Peters et al., 2018). For English, we use the small official ELMo model pre-trained on the 1 billion word benchmark (13.6M parameters).9 Regarding South-Slavic languages, ELMo pre-trained models are not available except for Croatian (Che et al., 2018).10 Note that, in all experiments, contextual embeddings are frozen during training.

8 https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
9 https://allennlp.org/elmo
10 https://github.com/HIT-SCIR/ELMoForManyLangs

FE's HP: We use a single biLSTM layer (token-level feature extractor) and set the number of units to 200.

PretRand's random branch HP: We experiment our approach with k = 200 added random units.

Global HP: In all experiments, training (pretraining and fine-tuning) is performed using SGD with momentum and early stopping, mini-batches of 16 sentences and a learning rate of 1.5 × 10−2.

7 Experimental Results

This section reports all our experimental results and analysis. First we analyse the standard fine-tuning scheme of transfer learning (§7.1). Then we assess the performance of our proposed approach, PretRand (§7.2).

7.1 Analysis of the Standard Fine-tuning Scheme

We report in Table 2 the results of the reference supervised training scheme from scratch, followed by the results of the standard fine-tuning scheme, which outperforms the reference. Precisely, transfer learning exhibits an improvement of ∼+3% acc. for TPoS, ∼+1.2% acc. for ArK, ∼+1.6% acc. for TChunk and ∼+4.5% F1 for WNUT.

In the following we provide the results of our analysis of the standard fine-tuning scheme:

1. Analysis of the hidden negative transfer (§7.1.1).
POS (Acc.) CK (Acc.) NER (F1) Dataset TPoS ARK Tweebank TChunk WNUT Method dev test test dev test dev test test From scratch 88.52 86.82 90.89 91.61 91.66 87.76 85.83 36.75 Standard Fine-tuning 90.95 89.79 92.09 93.04 93.29 90.71 89.21 41.25 Table 2: The main results of our proposed approach, transferring pretrained models, on social media datasets (Acc (%) for POS and CK and F1 (%) for NER). The best score for each dataset is highlighted in bold. 2. Quantifying the change of individual pre- show the percentage of positive transfer and red trained neurons after fine-tuning (§7.1.2). bars give the percentage of negative transfer. We observe that even though the standard fine-tuning 3. Visualising the evolution of pretrained neu- approach is effective since the resulting positive rons stimulus during fine-tuning (§7.1.3). transfer is higher than the negative transfer in all cases, this last mitigates the final gain brought by 7.1.1 Analysis of the Hidden Negative the standard fine-tuning. For instance, for TChunk Transfer dataset, standard fine-tuning corrected ∼4.7% of To investigate the hidden negative transfer in the predictions but falsified ∼1.7%, which reduces the standard fine-tuning scheme of transfer learning, final gain to ∼3%.11 we propose the following experiments. First, we show that the final gain brought by the standard Qualitative Examples of Negative Transfer fine-tuning can be separated into two categories: We report in Table 3 concrete examples of words positive transfer and negative transfer. Second, whose predictions were falsified when using the we provide some qualitative examples of negative standard fine-tuning scheme compared to standard transfer. supervised training scheme. Among mistakes we have observed: Quantifying Positive Transfer & Negative Transfer • Tokens with an upper-cased first letter: In news (formal English), only proper nouns start with an upper-case letter inside sentences. Consequently, when using transfer learning, the pre-trained units fail to slough this pattern which is not always respected in social me- dia. Hence, we found that most of the tokens with an upper-cased first letter are mistakenly predicted as proper nouns (PROPN) in POS, e.g. Award, Charity, Night, etc. and as entities in NER, e.g. Father, Hey, etc., which is con- Figure 6: The percentage of negative transfer and pos- sistent with the findings of Seah et al. (2012): itive transfer brought by the standard fine-tuning adap- negative transfer is mainly due to conditional tation scheme compared to supervised training from distribution differences between source and scratch scheme. target domains. We recall that we define positive transfer as the • Contractions are frequently used in social percentage of tokens that were wrongly predicted media to shorten a set of words. For instance, by random initialisation (supervised training from in TPoS dataset, we found that “’s” is in most scratch), but the standard fine-tuning changed to cases predicted as a “possessive ending (pos)” the correct ones, while negative transfer represents instead of “Verb, 3rd person singular present the percentage of words that were tagged correctly (vbz)”. Indeed, in formal English, “’s” is used by random initialisation, but using standard fine- in most cases to express the possessive form, tuning gives wrong predictions. Figure 6 shows 11 the results on English social media datasets, first Here we calculate positive and negative transfer at the token-level. 
Thus, the gain shown in Figure 6 for WNUT tagged with the classic supervised training scheme dataset does not correspond to the one in Table 2, since the F1 and then using the standard fine-tuning. Blue bars metric is calculated only on named-entities.
DataSet TPoS Award ’s its? Mum wont? id? Exactly nn vbz prp nn MD prp uh nnp pos prp$ uh VBP nn rb ArK Charity I’M? 2pac× 2× Titans? wth× nvr× noun L pnoun P Z ! R pnoun E $ $ N P V TweeBank amazin• Night Angry stangs #Trump awsome• bout• adj noun adj propn propn adj adp noun propn propn noun X intj verb TChunk luv× **ROCKSTAR**THURSDAY ONLY Just wyd× id? b-vp b-np i-np b-advp b-np b-np i-intj O b-np b-np b-intj i-np Wnut Hey Father &× IMO× UN Glasgow Supreme O O O O O b-location b-person b-person b-person i-group b-group b-group b-group b-corporation nn=N=noun=common noun / nnp=pnoun=propn=proper noun / vbz=Verb, 3rd person singular present / pos=possessive ending / prp=personal pronoun / prp$=possessive pronoun / md=modal / VBP=Verb, non-3rd person singular present / uh=!=intj=interjection / rb=R=adverb / L=nominal + verbal or verbal + nominal / E=emoticon / $=numerical / P=pre- or postposition, or subordinating conjunction / Z=proper noun + possessive ending / V=verb / adj=adjective / adp=adposition Table 3: Examples of falsified predictions by standard fine-tuning scheme when transferring from news domain to social media domain. Line 1: Some words from the validation-set of each data-set. Line 2: Correct labels predicted by the classic supervised setting (Random-200). Line 3: Wrong labels predicted by SFT setting. Mistake type: for words with first capital letter, • for misspelling, ? for contractions, × for abbreviations. ArK dataset Tchunk dataset Wnut dataset Figure 7: Correlation results between Φ units’ activations before fine-tuning (columns) and after fine-tuning (rows). Brighter colours indicate higher correlation. e.g. “company’s decision”, but rarely in con- dataset; and luv (love) and wyd (what you tractions that are frequently used in social me- doing?) in TChunk dataset. dia, e.g. “How’s it going with you?”. Simi- larly, “wont” is a frequent contraction for “will • Misspellings: Likewise, we found that not”, e.g. “i wont get bday money lool”, pre- the standard fine-tuning scheme often gives dicted as “verb” instead of “modal (MD)”12 wrong predictions for misspelt words, e.g. aw- by the SFT scheme. The same for “id”, which some, bout, amazin. stands for “I would”. 7.1.2 Quantifying the change of individual • Abbreviations are frequently used in social pretrained neurons media to shorten the way a word is standardly To visualise the bias phenomenon occurring when written. We found that the standard fine- using the standard fine-tuning scheme, we quan- tuning scheme stumbles on abbreviations pre- tify the charge of individual neurons. Precisely, dictions, e.g. 2pac (Tupac), 2 (to), ur (your), we plot the asymmetric correlation matrix C (The wth (what the hell) and nvr (never) in ArK method described in §4.2.1) between the Φ layer’s 12 A modal is an auxiliary verb expressing: ability (can), units before and after fine-tuning for each social obligation (have), etc. media dataset (ArK for POS, TChunk for CK and
WNUT-17 for NER). From the resulting correla- Unit-196: ArK dataset tion matrices illustrated in Figure 7, we can ob- serve the diagonal representing the charge of each unit, with most of the units having a high charge (light colour), alluding the fact that every unit after fine-tuning is highly correlated with itself before fine-tuning. Hypothesising that high correlation in the diagonal entails high bias, the results of this experiment confirm our initial motivation that pre- Unit-64: ArK dataset trained units are highly biased to what they have learnt in the source-dataset, making them limited to learn some patterns that are specific to the target- dataset. Our remarks were confirmed recently in the recent work of Merchant et al. (2020) who also found that fine-tuning is a “conservative process”. 7.1.3 Visualising the Evolution of Pretrained Neurons Stimulus during Fine-tuning Figure 8: Individual units activations before and Here, we give concrete visualisations of the evo- during fine-tuning from ArK POS dataset. For lution of pretrained neurons stimulus during fine- each unit we show Top-10 words activating the said tuning when transferring from the news domain to unit. The first column: top-10 words from the source the social media domain. Following the method validation-set (WSJ) before fine-tuning, Column 0: top- 10 words from the target validation-set (ArK) before described in section 4.2.2, we plot the matrices of fine-tuning. Columns 5 to 20: top-10 words from the top-10 words activating each neuron j, positively target validation-set during fine-tuning epochs. (j) (j) (Abest+ ) or negatively (Abest− ). The results are plotted in Figure 8 for ArK (POS) dataset and Fig- ure 9 for TweeBank dataset (POS). Rows represent – Unit-64 is sensitive to plural proper the top-10 words from the target dataset activat- nouns on news-domain before fine- ing each unit, and columns represent fine-tuning tuning, e.g. Koreans and Europeans, epochs; before fine-tuning in column 0 (at this stage and also on ArK during fine-tuning, e.g. the model is only trained on the source-dataset), Titans and Patriots. However, in ArK and during fine-tuning (columns 5 to 20). Addi- dataset, “Z” is a special class for “proper tionally, to get an idea about each unit’s stimulus noun + possessive ending”, e.g. Jay’s on source dataset, we also show, in the first column mum, and in some cases the apostrophe is (Final-WSJ), top-10 words from the source dataset omitted, e.g. Fergusons house for Fergu- activating the same unit before fine-tuning. In the son’s house, which thus may bring ambi- following, we describe the information encoded by guity with plural proper nouns in formal each provided neuron.13 English. Consequently, unit-64, initially sensitive to plural proper nouns, is also • Ark - POS: (Figure 8) firing on words from the class “Z”, e.g. – Unit-196 is sensitive to contractions con- Timbers (Timber’s). taining an apostrophe regardless of the • Tweebank - POS: (Figure 9) contraction’s class. However, unlike news, in social media and particularly – Unit-37 is sensitive before and during ArK dataset, apostrophes are used in dif- fine-tuning on plural nouns, such as gaz- ferent cases. For instance i’m, i’ll and ers and feminists. However, it is also it’s belong to the class “L” that stands firing on the word slangs because of for “nominal + verbal or verbal + nom- the s ending, which is in fact a proper inal”, while the contractions can’t and noun. 
This might explain the wrong pre- don’t belong to the class “Verb”. diction for the word slangs (noun instead 13 Here we only select some interesting neurons. However of proper noun) given by the standard we also found many neurons that are not interpretable. fine-tuning scheme (Table 3).