Neural Supervised Domain Adaptation by Augmenting Pre-trained Models with Random Units
Sara Meftah*, Nasredine Semmar*, Youssef Tamaazousti*, Hassane Essafi*, Fatiha Sadat+
* CEA-List, Université Paris-Saclay, F-91120, Palaiseau, France
+ UQÀM, Montréal, Canada
{firstname.lastname}@cea.fr, sadat.fatiha@uqam.ca

arXiv:2106.04935v1 [cs.CL] 9 Jun 2021

Abstract

Neural Transfer Learning (TL) is becoming ubiquitous in Natural Language Processing (NLP), thanks to its high performance on many tasks, especially in low-resourced scenarios. Notably, TL is widely used for neural domain adaptation to transfer valuable knowledge from high-resource to low-resource domains. In the standard fine-tuning scheme of TL, a model is initially pre-trained on a source domain and subsequently fine-tuned on a target domain and, therefore, source and target domains are trained using the same architecture. In this paper, we show through interpretation methods that such a scheme, despite its efficiency, suffers from a main limitation. Indeed, although capable of adapting to new domains, pre-trained neurons struggle with learning certain patterns that are specific to the target domain. Moreover, we shed light on the hidden negative transfer occurring despite the high relatedness between source and target domains, which may mitigate the final gain brought by transfer learning. To address these problems, we propose to augment the pre-trained model with normalised, weighted and randomly initialised units that foster a better adaptation while maintaining the valuable source knowledge. We show that our approach exhibits significant improvements over the standard fine-tuning scheme for neural domain adaptation from the news domain to the social media domain on four NLP tasks: part-of-speech tagging, chunking, named entity recognition and morphosyntactic tagging.1

1 Under review

1 Introduction

NLP aims to produce resources and tools to understand texts coming from standard languages and their linguistic varieties, such as dialects or user-generated content in social media platforms. This diversity is a challenge for developing high-level tools that are capable of understanding and generating all forms of human languages. Furthermore, in spite of the tremendous empirical results achieved by NLP models based on Neural Networks (NNs), these models are in most cases based on a supervised learning paradigm, i.e. trained from scratch on large amounts of labelled examples. Nevertheless, such a training scheme is not fully optimal. Indeed, NLP neural models with high performance often require huge volumes of manually annotated data to produce powerful results and prevent overfitting. However, manual data annotation is time-consuming. Besides, language changes over the years (Eisenstein, 2019). Thus, most language varieties are under-resourced (Baumann and Pierrehumbert, 2014; Duong, 2017).

Particularly, in spite of the valuable advantage of social media content analysis for a variety of applications (e.g. advertisement, health, or security), this large domain is still poor in terms of annotated data. Furthermore, it has been shown that models intended for news fail to work efficiently on Tweets (Owoputi et al., 2013). This is mainly due to the conversational nature of the text, the lack of conventional orthography, the noise, linguistic errors, spelling inconsistencies, informal abbreviations and the idiosyncratic style of these texts (Horsmann, 2018).

One of the best approaches to address this issue is Transfer Learning (TL); an approach that allows handling the problem of the lack of annotated data, whereby relevant knowledge previously learned in a source problem is leveraged to help in solving a new target problem (Pan et al., 2010). In the context of artificial NNs, TL relies on a model learned on a source-task with sufficient data, further adapted to the target-task of interest. TL has been shown to be powerful for NLP and outperforms the standard supervised learning from scratch paradigm, because it takes benefit from the pre-learned knowledge.
Particularly, the standard fine-tuning (SFT) scheme upper-case letter. Thus the pre-trained units fail to of sequential transfer learning has been shown to discard this pattern which is not always respected be efficient for supervised domain adaptation from in user-generated-content in social media. As a the source news domain to the target social media consequence of this phenomenon, specific patterns domain (Gui et al., 2017; Meftah et al., 2018b,a; to the target-dataset (e.g. “wanna” or “gonna”) are März et al., 2019; Zhao et al., 2017; Lin and Lu, difficult to learn by pre-trained units. This phe- 2018). nomenon is non-desirable, since such specific units In this work we first propose a series of anal- are essential, especially for target-specific classes ysis to spot the limits of the standard fine-tuning (Zhou et al., 2018b; Lakretz et al., 2019). adaptation scheme of sequential transfer learning. Stemming from our analysis, we propose a We start by taking a step towards identifying and new method to overcome the above-mentioned analysing the hidden negative transfer when trans- drawbacks of the standard fine-tuning scheme of ferring from the news domain to the social me- transfer learning. Precisely, we propose a hybrid dia domain. Negative transfer (Rosenstein et al., method that takes benefit from both worlds, random 2005; Wang et al., 2019) occurs when the knowl- initialisation and transfer learning, without their edge learnt in the source domain hampers the learn- drawbacks. It consists in augmenting the source- ing of new knowledge from the target domain. network (set of pre-trained units) with randomly Particularly, when the source and target domains initialised units (that are by design non-biased) and are dissimilar, transfer learning may fail and hurt jointly learn them. We call our method PretRand the performance, leading to a worse performance (Pretrained and Random units). PretRand consists compared to the standard supervised training from of three main ideas: scratch. In this work, we rather perceive the gain brought by the standard fine-tuning scheme com- 1. Augmenting the source-network (set of pre- pared to random initialisation2 as a combination trained units) with a random branch composed of a positive transfer and a hidden negative trans- of randomly initialised units, and jointly learn fer. We define positive transfer as the percentage of them. predictions that were wrongly predicted by random 2. Normalising the outputs of both branches to initialisation, but using transfer learning changed to balance their different behaviours and thus the correct ones. The negative transfer represents forcing the network to consider both. the percentage of predictions that were tagged cor- rectly by random initialisation, but using transfer 3. Applying learnable attention weights on both learning gives incorrect predictions. Hence, the branches predictors to let the network learn final gain brought by transfer learning would be the which of random or pre-trained one is better difference between positive and negative transfer. for every class. We show that despite the final positive gain brought by transfer learning from the high-resource news Our experiments on 4 NLP tasks: Part-of-Speech domain to the low-resource social media domain, tagging (POS), Chunking (CK), Named Entity the hidden negative transfer may mitigate the final Recognition (NER) and Morphosyntactic Tagging gain. 
(MST) show that PretRand enhances considerably Then we perform an interpretive analysis of indi- the performance compared to the standard fine- vidual pre-trained neurons behaviours in different tuning adaptation scheme.4 settings. We find that some of pretrained neurons The remainder of this paper is organised as fol- are biased by what they have learnt in the source- lows. Section 2 presents the background related dataset. For instance, we observe a unit3 firing to our work: transfer learning and interpretation on proper nouns (e.g.“George” and “Washington”) methods for NLP. Section 3 presents the base neu- before fine-tuning and on words with capitalised ral architecture used for sequence labelling in NLP. first-letter whether the word is a proper noun or Section 4 describes our proposed methods to anal- not (e.g. “Man” and “Father”) during fine-tuning. yse the standard fine-tuning scheme of sequential Indeed, in news, only proper nouns start with an transfer learning. Section 5 describes our proposed 2 approach PretRand. Section 6 reports the datasets Random initialisation means training from scratch on target data (in-domain data). 4 This paper is an extension of our previous work (Meftah 3 We use “unit” and “neuron” interchangeably. et al., 2019).
and the experimental setup. Section 7 reports the models designed for specific high-resourced experimental results of our proposed methods and source setting(s) (language, language variety, is divided into two sub-sections: Sub-section 7.1 domain, task, etc) to work in a target low-resourced reports the empirical analysis of the standard fine- setting(s). It includes two categories. First, tuning scheme, highlighting its drawbacks. Sub- unsupervised domain adaptation assumes that section 7.2 presents the experimental results of our labelled examples in the source domain are proposed approach PretRand, showing the effec- sufficiently available, but for the target domain, tiveness of PretRand on different tasks and datasets only unlabelled examples are available. Second, and the impact of incorporating contextualised rep- in supervised domain adaptation setting, a small resentations. Finally, section 8 wraps up by dis- number of labelled target examples are assumed to cussing our findings and future research directions. be available. 2 Background Pretraining In the pretraining stage of STL, a crucial key for Since our work involves two research topics: Se- the success of transfer is the ruling about the pre- quential Transfer Learning (STL) and Interpreta- trained task and domain. For universal represen- tion methods, we discuss in the following sub- tations, the pre-trained task is expected to encode sections the state-of-the-art of each topic with a po- useful features for a wide number of target tasks sitioning of our contributions regarding each one. and domains. In comparison, for domain adapta- 2.1 Sequential Transfer Learning tion, the pre-trained task is expected to be most suitable for the target task in mind. We classify In STL, training is performed in two stages, sequen- pretraining methods into four main categories: un- tially: pretraining on the source task, followed supervised, supervised, multi-task and adversarial by an adaptation on the downstream target tasks pretraining: (Ruder, 2019). The purpose behind using STL techniques for NLP can be divided into two main • Unsupervised pretraining uses raw unlabelled research areas, universal representations and do- data for pretraining. Particularly, it has been main adaptation. successfully used in a wide range of semi- Universal representations aim to build neural nal works to learn universal representations. features (e.g. words embeddings and sentence em- Language modelling task has been partic- beddings) that are transferable and beneficial to a ularly used thanks to its ability to capture wide range of downstream NLP tasks and domains. general-purpose features of language.5 For Indeed, the probabilistic language model proposed instance, TagLM (Peters et al., 2017) is a pre- by Bengio et al. (2003) was the genesis of what trained model based on a bidirectional lan- we call words embedding in NLP, while Word2Vec guage model (biLM), also used to generate (Mikolov et al., 2013) was its outbreak and a start- ELMo (Embeddings from Language Models) ing point for a surge of works on learning words em- representations (Peters et al., 2018). With beddings: e.g. FastText (Bojanowski et al., 2017) the recent emergence of the “Transformers” enriches Word2Vec with subword information. 
Re- architectures (Vaswani et al., 2017), many cently, universal representations re-emerged with works propose pretrained models based on contextualised representations, handling a major these architectures (Devlin et al., 2019; Yang drawback of traditional words embedding. Indeed, et al., 2019; Raffel et al., 2019). Unsuper- these last learn a single context-independent repre- vised pretraining has also been used to im- sentation for each word thus ignoring words poly- prove sequence to sequence learning. We can semy. Therefore, contextualised words representa- cite the work of Ramachandran et al. (2017) tions aim to learn context-dependent word embed- who proposed to improve the performance of dings, i.e. considering the entire sequence as input an encoder-decoder neural machine transla- to produce each word’s embedding. tion model by initialising both encoder and While universal representations seek to be decoder parameters with pretrained weights propitious for any downstream task, domain 5 Note that language modelling is also considered as a adaptation is designed for particular target tasks. self-supervised task since, in fact, labels are automatically Domain adaptation consists in adapting NLP generated from raw data.
from two language models. initialised layers are added on top of pretrained ones. Three main adaptation schemes are used in • Supervised pretraining has been particularly sequential transfer learning: Feature Extraction, used for cross-lingual transfer (e.g. machine Fine-Tuning and the recent Residual Adapters. translation (Zoph and Knight, 2016)), cross- In a Feature Extraction scheme, the pretrained task transfer from POS tagging to words seg- layers’ weights are frozen during adaptation, while mentation task (Yang et al., 2017) and cross- in Fine-Tuning scheme weights are tuned. Accord- domain transfer for biomedical texts for ques- ingly, the former is computationally inexpensive tion answering by Wiese et al. (2017) and while the last allows better adaptation to target do- for NER by Giorgi and Bader (2018). Cross- mains peculiarities. In general, fine-tuning pre- domain transfer has also been used to transfer trained models begets better results, except in cases from news to social media texts for POS tag- wherein the target domain’s annotations are sparse ging (Meftah et al., 2017; März et al., 2019) or noisy (Dhingra et al., 2017; Mou et al., 2016). and sentiment analysis (Zhao et al., 2017). Su- Peters et al. (2019) found that for contextualised pervised pretraining has been also used ef- representations, both adaptation schemes are com- fectively for universal representations learn- petitive, but the appropriate adaptation scheme to ing, e.g. neural machine translation (McCann pick depends on the similarity between the source et al., 2017), language inference (Conneau and target problems. Recently, Residual Adapters et al., 2017) and discourse relations (Nie et al., were proposed by Houlsby et al. (2019) to adapt 2017). pretrained models based on Transformers archi- • Multi-task pretraining has been successfully tecture, aiming to keep Fine-Tuning scheme’s ad- applied to learn general universal sentence vantages while reducing the number of parame- representations by a simultaneous pretrain- ters to update during the adaptation stage. This ing on a set of supervised and unsuper- is achieved by adding adapters (intermediate lay- vised tasks (Subramanian et al., 2018; Cer ers with a small number of parameters) on top of et al., 2018). Subramanian et al. (2018), each pretrained layer. Thus, pretrained layers are for instance, proposed to learn universal sen- frozen, and only adapters are updated during train- tences representations by a joint pretraining ing. Therefore, Residual Adapters performance is on skip-thoughts, machine translation, con- near to Fine-tuning while being computationally stituency parsing, and natural language infer- cheaper (Pfeiffer et al., 2020b,a,c). ence. For domain adaptation, we have per- Our work formed in (Meftah et al., 2020) a multi-task Our work falls under supervised domain adaptation pretraining for supervised domain adaptation research area. Specifically, cross-domain adapta- from the news domain to the social media do- tion from the news domain to the social media do- main. main. The fine-tuning adaptation scheme has been • Adversarial pretraining is particularly used successfully applied on domain adaptation from for domain adaptation when some annotated the news domain to the social media domain (e.g. examples from the target domain are avail- adversarial pretraining (Gui et al., 2017) and super- able. Adversarial training (Ganin et al., 2016) vised pretraining (Meftah et al., 2018a)). 
In this is used as a pretraining step followed by an research, we highlight the aforementioned draw- adaptation step on the target dataset. Adver- backs (biased pre-trained units and the hidden neg- sarial pretraining demonstrated its effective- ative transfer) of the standard fine-tuning adapta- ness in several NLP tasks, e.g. cross-lingual tion scheme. Then, we propose a new adaptation sentiment analysis (Chen et al., 2018). Also, scheme (PretRand) to handle these problems. Fur- it has been used to learn cross-lingual words thermore, while ELMo contextualised words repre- embeddings (Lample et al., 2018). sentations efficiency has been proven for different tasks and datasets (Peters et al., 2019; Fecht et al., Adaptation 2019; Schumacher and Dredze, 2019), here we in- During the adaptation stage of STL, one or more vestigate their impact when used, simultaneously, layers from the pretrained model are transferred to with a sequential transfer learning scheme for su- the downstream task, and one or more randomly pervised domain adaptation.
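The adaptation schemes discussed in this section (feature extraction, fine-tuning, residual adapters) differ mainly in which parameters are updated during adaptation. As a minimal illustration (not taken from the paper), the PyTorch sketch below contrasts feature extraction with fine-tuning; the attribute names feature_extractor and classifier are assumptions made for the sake of the example.

```python
import torch

def configure_adaptation(model, scheme="fine-tuning", lr=1.5e-2):
    """Return an SGD optimiser implementing one adaptation scheme.

    'feature-extraction': pre-trained layers are frozen and only the new
    classifier is updated; 'fine-tuning': all weights are updated.
    """
    if scheme == "feature-extraction":
        for p in model.feature_extractor.parameters():
            p.requires_grad = False           # freeze pre-trained layers
        params = model.classifier.parameters()
    elif scheme == "fine-tuning":
        params = model.parameters()           # every weight is tuned
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return torch.optim.SGD((p for p in params if p.requires_grad),
                           lr=lr, momentum=0.9)
```

Residual adapters follow the same freezing logic, except that small adapter layers inserted on top of each frozen pre-trained layer are the only trainable parameters.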
2.2 Interpretation methods for NLP computer vision (Coates and Ng, 2011; Girshick et al., 2014; Zhou et al., 2015), and more recently Recently, a rising interest is devoted to peek inside in NLP, wherein units activations are visualised black-box neural NLP models to interpret their in heatmaps. For instance, Karpathy et al. (2016) internal representations and their functioning. A visualised character-level Long Short-Term Mem- variety of methods were proposed in the literature, ory (LSTM) cells learned in language modelling here we only discuss those that are most related to and found multiple interpretable units that track our research. long-distance dependencies, such as line lengths and quotes; Radford et al. (2017) visualised a unit Probing tasks is a common approach for NLP which performs sentiment analysis in a language models analysis used to investigate which model based on Recurrent Neural Networks linguistic properties are encoded in the latent (RNNs); Bau et al. (2019) visualised neurons representations of the neural model (Shi et al., specialised on tense, gender, number, etc. in 2016). Concretely, given a neural model M NMT models; and Kádár et al. (2017) proposed trained on a particular NLP task, whether it is top-k-contexts approach to identify sentences, an unsupervised (e.g. language modelling (LM)) thus linguistic patterns, sparking the highest acti- or supervised (e.g. Neural Machine Translation vation values of each unit in an RNNs-based model. (NMT)), a shallow classifier is trained on top of the frozen M on a corpus annotated with the linguistic properties of interest. The aim is to Neural representations correlation analysis: examine whether M’s hidden representations Cross-network and cross-layers correlation is encode the property of interest. For instance, Shi a significant approach to gain insights on how et al. (2016) found that different levels of syntactic internal representations may vary across networks, information are learned by NMT encoder’s layers. network-depth and training time. Suitable Adi et al. (2016) investigated what information approaches are based on Correlation Canonical (between sentence length, words order and Analysis (CCA) (Hotelling, 1992; Uurtio et al., word-content) is captured by different sentence 2018), such as Singular Vector Canonical Correla- embedding learning methods. Conneau et al. tion Analysis (Raghu et al., 2017) and Projected (2018) proposed 10 probing tasks annotated with Weighted Canonical Correlation Analysis (Morcos fine-grained linguistic properties and compared et al., 2018), that were successfully used in NLP different approaches for sentence embeddings. Zhu neural models analysis. For instance, it was used et al. (2018) inspected which semantic properties by Bau et al. (2019) to calculate cross-networks (e.g. negation, synonymy, etc.) are encoded correlation for ranking important neurons in NMT by different sentence embeddings approaches. and LM. Saphra and Lopez (2019) applied it to Furthermore, the emergence of contextualised probe the evolution of syntactic, semantic, and words representations have triggered a surge of topic representations cross-time and cross-layers. works on probing what these representations are Raghu et al. (2019) compared the internal rep- learning (Liu et al., 2019a; Clark et al., 2019). resentations of models trained from scratch vs This approach, however, suffers from two main models initialised with pre-trained weights. CCA flaws. 
First, probing tasks examine properties based methods aim to calculate similarity between captured by the model at a coarse-grained level, neural representations at the coarse-grained level. i.e. layers representations, and thereby, will not In contrast, correlation analysis at the fine-grained identify features captured by individual neurons. level, i.e. between individual neurons, has also Second, probing tasks will not identify linguistic been explored in the literature. Initially, Li et al. properties that do not appear in the annotated (2015) used Pearson’s correlation to examine to probing datasets (Zhou et al., 2018a). which extent each individual unit is correlated to another unit, either within the same network or between different networks. The same correlation Individual units stimulus: Inspired by works on metric was used by Bau et al. (2019) to determine receptive fields of biological neurons (Hubel and important neurons in NMT and LM tasks. Wiesel, 1965), much work has been devoted for interpreting and visualising individual hidden units stimulus-features in neural networks. Initially, in Our Work:
In this work, we propose two approaches (§4.2) to highlight the bias effect in the standard fine-tuning scheme of transfer learning in NLP; the first method is based on individual units stimulus and the second on neural representations correlation analysis. To the best of our knowledge, we are the first to harness these interpretation methods to analyse individual units behaviour in a transfer learning scheme. Furthermore, the most analysed tasks in the literature are Natural Language Inference, NMT and LM (Belinkov and Glass, 2019); here we target tasks that are under-explored in visualisation works, such as POS, MST, CK and NER.

3 Base Neural Sequence Labelling Model

Given an input sentence S of n successive tokens S = [w_1, ..., w_n], the goal of sequence labelling is to predict the label c_t ∈ C of every w_t, with C being the tag-set. We use a commonly used end-to-end neural sequence labelling model (Ma and Hovy, 2016; Plank et al., 2016; Yang et al., 2018), which is composed of three components (illustrated in Figure 1). First, the Word Representation Extractor (WRE), denoted Υ, computes a vector representation x_t for each token w_t. Second, this representation is fed into a Feature Extractor (FE) based on a bidirectional Long Short-Term Memory (biLSTM) network (Graves et al., 2013), denoted Φ. It produces a hidden representation, h_t, that is fed into a Classifier (Cl): a fully-connected layer (FCL), denoted Ψ. Formally, given w_t, the logits are obtained using the following equation: ŷ_t = (Ψ ◦ Φ ◦ Υ)(w_t).6

6 For simplicity, we define ŷ_t only as a function of w_t. In reality, the prediction ŷ_t for the word w_t is also a function of the remaining words in the sentence and the model's parameters, in addition to w_t.

In the standard supervised training scheme, the three modules are jointly trained from scratch by minimising the Softmax Cross-Entropy (SCE) loss using the Stochastic Gradient Descent (SGD) algorithm. Let us consider a training set of M annotated sentences, where each sentence i is composed of m_i tokens. Given a training word (w_{i,t}, y_{i,t}) from the training sentence i, where y_{i,t} is the gold standard label for the word w_{i,t}, the cross-entropy loss for this example is calculated as follows:

L_(i,t) = − y_{i,t} × log(ŷ_{i,t}) .    (1)

Thus, during the training of the sequence labelling model on the M annotated sentences, the model's loss is defined as follows:

L = Σ_{i=1}^{M} Σ_{t=1}^{m_i} L_(i,t) .    (2)

Figure 1: Illustrative scheme of the base neural model for sequence labelling tasks.
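To make the three components concrete, the following is a minimal PyTorch sketch of such a sequence labeller trained from scratch with the SCE loss and SGD (Eqs. 1-2). It is an illustration rather than the authors' code: the character-level part of the WRE is omitted, and the vocabulary size, tag-set size and tensor shapes are toy assumptions.

```python
import torch
import torch.nn as nn

class SequenceLabeller(nn.Module):
    """WRE (word embeddings only), biLSTM feature extractor (Phi)
    and fully-connected classifier (Psi)."""

    def __init__(self, vocab_size, tagset_size, emb_dim=300, hidden_dim=200):
        super().__init__()
        self.wre = nn.Embedding(vocab_size, emb_dim)            # Upsilon
        self.fe = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                          bidirectional=True)                   # Phi
        self.cl = nn.Linear(2 * hidden_dim, tagset_size)        # Psi

    def forward(self, word_ids):          # word_ids: (batch, seq_len)
        x = self.wre(word_ids)            # x_t = Upsilon(w_t)
        h, _ = self.fe(x)                 # h_t = Phi(x_t)
        return self.cl(h)                 # logits ŷ_t = Psi(h_t)

model = SequenceLabeller(vocab_size=20000, tagset_size=17)
criterion = nn.CrossEntropyLoss()         # softmax cross-entropy; averages over
                                          # tokens, whereas Eq. 2 writes a sum
optimizer = torch.optim.SGD(model.parameters(), lr=1.5e-2, momentum=0.9)

words = torch.randint(0, 20000, (16, 30))  # one toy mini-batch of 16 sentences
gold = torch.randint(0, 17, (16, 30))
logits = model(words)
loss = criterion(logits.view(-1, 17), gold.view(-1))
loss.backward()
optimizer.step()
```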
4 Analysis of the Standard Fine-Tuning Scheme

The standard fine-tuning scheme consists in transferring a part of the learned weights from a source model to initialise the target model, which is further fine-tuned on the target task with a small number of training examples from the target domain. Given a source neural network M_s with a set of parameters θ_s split into two sets, θ_s = (θ_s1, θ_s2), and a target network M_t with a set of parameters θ_t split into two sets, θ_t = (θ_t1, θ_t2), the standard fine-tuning scheme of transfer learning includes three simple yet effective steps:

1. We train the source model on annotated data from the source domain on a source dataset.
2. We transfer the first set of parameters from the source network M_s to the target network M_t: θ_t1 = θ_s1, whereas the second set θ_t2 of parameters is randomly initialised.
3. Then, the target model is further fine-tuned on the small target data-set.

Source and target datasets may have different tag-sets, even within the same NLP task. Hence, transferring the parameters of the classifier (Ψ) may not be feasible in all cases. Therefore, in our experiments, the WRE's layers (Υ) and the FE's layers (Φ) are initialised with the source model's weights and Ψ is randomly initialised. Then, the three modules are further jointly trained on the target-dataset by minimising a SCE loss using the SGD algorithm.

4.1 The Hidden Negative Transfer

It has been shown in many works in the literature (Rosenstein et al., 2005; Ge et al., 2014; Ruder, 2019; Gui et al., 2018; Cao et al., 2018; Chen et al., 2019; Wang et al., 2019; O'Neill, 2019) that, when the source and target domains are less related (e.g. languages from different families), sequential transfer learning may lead to a negative effect on the performance, instead of improving it. This phenomenon is referred to as negative transfer. Precisely, negative transfer is considered to occur when transfer learning is harmful to the target task/dataset, i.e. the performance when using a transfer learning algorithm is lower than that of a solely supervised training on in-target data (Torrey and Shavlik, 2010).

In NLP, the negative transfer phenomenon has only seldom been studied. We can cite the recent work of Kocmi (2020), who evaluated negative transfer in transfer learning for neural machine translation when the transfer is performed between different language-pairs. They found that: 1) the distributions mismatch between source and target language-pairs does not beget a negative transfer; 2) the transfer may have a negative impact when the source language-pair is less-resourced compared to the target one, in terms of annotated examples.

Our experiments in (Meftah et al., 2018a,b) have shown that transfer learning techniques from the news domain to the social media domain using the standard fine-tuning scheme boost the tagging performance. Hence, following the above definition, transfer learning from news to social media does not beget a negative transfer. Contrariwise, in this work, we instead consider the hidden negative transfer, i.e. the percentage of predictions that were correctly tagged by random initialisation, but for which transfer learning gives wrong predictions.

Let us consider the gain G_i brought by the standard fine-tuning scheme (SFT) of transfer learning compared to the random initialisation for a dataset i. G_i is defined as the difference between positive transfer PT_i and negative transfer NT_i:

G_i = PT_i − NT_i ,    (3)

where positive transfer PT_i represents the percentage of tokens that were wrongly predicted by random initialisation, but that the SFT changed to the correct ones, and negative transfer NT_i represents the percentage of words that were tagged correctly by random initialisation, but for which the SFT gives wrong predictions. PT_i and NT_i are defined as follows:

PT_i = N_i^corrected / N_i ,    (4)

NT_i = N_i^falsified / N_i ,    (5)

where N_i is the total number of tokens in the validation-set, N_i^corrected is the number of tokens from the validation-set that were wrongly tagged by the model trained from scratch but are correctly predicted by the SFT scheme, and N_i^falsified is the number of tokens from the validation-set that were correctly tagged by the model trained from scratch but are wrongly predicted by the SFT scheme.
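As an illustration of Eqs. 3-5, the short Python sketch below computes PT_i, NT_i and the gain G_i from three aligned token-level tag sequences; the function and variable names are ours, not the paper's.

```python
def transfer_gain(gold, pred_scratch, pred_sft):
    """Positive/negative transfer (Eqs. 3-5) from token-level predictions.

    gold, pred_scratch, pred_sft: aligned tag lists over the whole
    validation-set (pred_scratch = model trained from scratch,
    pred_sft = standard fine-tuning)."""
    n = len(gold)
    corrected = sum(1 for g, s, f in zip(gold, pred_scratch, pred_sft)
                    if s != g and f == g)   # wrong from scratch, fixed by SFT
    falsified = sum(1 for g, s, f in zip(gold, pred_scratch, pred_sft)
                    if s == g and f != g)   # right from scratch, broken by SFT
    pt, nt = corrected / n, falsified / n
    return pt, nt, pt - nt                  # PT_i, NT_i, G_i

# Toy example: PT = 2/5, NT = 1/5, so the final gain G = 1/5.
gold = ["noun", "verb", "propn", "adj", "intj"]
scratch = ["noun", "noun", "noun", "adj", "intj"]
sft = ["noun", "verb", "propn", "adj", "verb"]
print(transfer_gain(gold, scratch, sft))
```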
4.2 Interpretation of Pretrained Neurons

Here, we propose to perform a set of analysis techniques to gain some insights into how the inner pretrained representations are updated during fine-tuning on social media datasets when using the standard fine-tuning scheme of transfer learning. For this, we propose to analyse the feature extractor's (Φ) activations. Precisely, we attempt to visualise biased neurons, i.e. pre-trained neurons that do not change much from their initial state.

Let us consider a validation-set of N words. The feature extractor Φ generates a matrix h ∈ M_{N,H}(R) of activations over all the words of the validation-set, where M_{f,g}(R) is the space of f × g matrices over R and H is the size of the hidden representation (number of neurons). Each element h_{i,j} of the matrix represents the activation of the neuron j on the word w_i.

Given two models, the first before fine-tuning and the second after fine-tuning, we obtain two matrices, h^before ∈ M_{N,H}(R) and h^after ∈ M_{N,H}(R), which give the activations of Φ over all the validation-set's words before and after fine-tuning, respectively. We aim to visualise and quantify the change of the representations generated by the model from the initial state, h^before (before fine-tuning), to the final state, h^after (after fine-tuning). For this purpose, we perform two experiments:

1. Quantifying the change of pretrained individual neurons (§4.2.1);
2. Visualising the evolution of pretrained neurons stimulus during fine-tuning (§4.2.2).

4.2.1 Quantifying the Change of Individual Pretrained Neurons

In order to quantify the change of the knowledge encoded in pretrained neurons after fine-tuning, we propose to calculate the similarity (correlation) between neurons activations before and after fine-tuning, when using the SFT adaptation scheme. Precisely, we calculate the correlation coefficient between each neuron's activations on the target-domain validation-set before starting fine-tuning and at the end of fine-tuning.

Following the above formulation and as illustrated in Figure 2, from the h^before and h^after matrices we extract two vectors, h^before_{.j} ∈ R^N and h^after_{.j} ∈ R^N, representing respectively the activations of a unit j over all the validation-set's words before and after fine-tuning. Next, we generate an asymmetric correlation matrix C ∈ M_{H,H}(R), where each element c_{jt} represents the Pearson's correlation between the activation vector of unit j after fine-tuning (h^after_{.j}) and the activation vector of unit t before fine-tuning (h^before_{.t}), computed as follows:

c_{jt} = E[(h^after_{.j} − μ_j^after)(h^before_{.t} − μ_t^before)] / (σ_j^after σ_t^before) .    (6)

Here μ_j^before and σ_j^before represent, respectively, the mean and the standard deviation of unit j's activations over the validation-set. Clearly, we are interested in the matrix diagonal, where c_{jj} represents the charge of each unit j from Φ, i.e. the correlation between each unit's activations after fine-tuning and its activations before fine-tuning.

4.2.2 Visualising the Evolution of Pretrained Neurons Stimulus during Fine-tuning

Here, we perform units visualisation at the individual level to gain insights on how the patterns encoded by individual units progress during fine-tuning when using the SFT scheme. To do this, we generate the top-k activated words for each unit, i.e. the words in the validation-set that fire the said unit the most, positively and negatively (since LSTMs generate positive and negative activations). In (Kádár et al., 2017), the top-k activated contexts were plotted at the end of training (the best model), which shows what each unit is specialised on, but it does not give insights about how the said unit is evolving and changing during training. Thus, taking into account only the final state of training does not reveal the whole picture. Here, we instead propose to generate and plot the top-k words activating each unit throughout the adaptation stage. We follow two main steps (as illustrated in Figure 3):

1. We represent each unit j from Φ with a matrix A^(j) ∈ M_{N,D}(R) of the said unit's activations on all the validation-set at different training epochs, where D is the number of epochs and N is the number of words in the validation-set. Thus, each element a^(j)_{y,z} represents the activation of the unit j on the word w_y at the epoch z.

2. We carry out a sorting of each column of the matrix (each column represents an epoch) and pick the highest k words (for the top-k words firing the unit positively) and the lowest k words (for the top-k words firing the unit negatively), leading to two matrices, A^(j)_{best+} ∈ M_{D,k}(R) and A^(j)_{best−} ∈ M_{D,k}(R), the first for the top-k words activating positively the unit j at each training epoch, and the last for the top-k words activating negatively the unit j at each training epoch.
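A small NumPy sketch of the charge computation of Eq. 6, under the assumption that the activations are stored as (N words × H units) arrays: standardising each unit and taking a scaled matrix product yields the full correlation matrix C, whose diagonal is the charge.

```python
import numpy as np

def charge(h_before, h_after):
    """Pearson correlations (Eq. 6) between units after fine-tuning (rows)
    and units before fine-tuning (columns).

    h_before, h_after: arrays of shape (N, H) holding the feature
    extractor's activations on the target-domain validation-set.
    Returns the full H x H matrix C and its diagonal (the 'charge')."""
    a = (h_after - h_after.mean(0)) / h_after.std(0)    # standardise each unit
    b = (h_before - h_before.mean(0)) / h_before.std(0)
    C = a.T @ b / h_before.shape[0]                     # c_jt = E[...] / (sigma_j sigma_t)
    return C, np.diag(C)

# Toy example with H = 4 units that change only mildly during fine-tuning.
rng = np.random.default_rng(0)
h_before = rng.normal(size=(1000, 4))
h_after = 0.9 * h_before + 0.1 * rng.normal(size=(1000, 4))
C, diag = charge(h_before, h_after)
print(diag)   # values close to 1 indicate highly "charged" (biased) units
```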
Figure 2: Illustrative scheme of the computation of the charge of unit j, i.e. the Pearson correlation between unit j's activation vector after fine-tuning and its activation vector before fine-tuning.

Figure 3: Illustrative scheme of the computation of the top-k words activating unit j, positively (A^(j)_{best+}) and negatively (A^(j)_{best−}), during fine-tuning epochs. h_{epoch_z} stands for Φ's outputs at epoch number z.
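The top-k extraction of §4.2.2 amounts to a column-wise sort of each unit's (N × D) activation matrix; one possible sketch, with illustrative names, is given below.

```python
import numpy as np

def topk_words_per_epoch(A_j, words, k=10):
    """Top-k words firing unit j at every epoch (the matrices
    A^(j)_best+ and A^(j)_best- of Section 4.2.2).

    A_j:   array of shape (N words, D epochs) with the activations of unit j.
    words: list of the N validation-set tokens, aligned with the rows of A_j."""
    order = np.argsort(A_j, axis=0)          # column-wise (per-epoch) sort
    best_pos = [[words[i] for i in order[-k:, d][::-1]]   # highest activations
                for d in range(A_j.shape[1])]
    best_neg = [[words[i] for i in order[:k, d]]          # lowest activations
                for d in range(A_j.shape[1])]
    return best_pos, best_neg                # one list of k words per epoch
```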
5 Joint Learning of Pretrained and Random Units: PretRand

We found from our analysis (in section 7.1) of pre-trained neurons behaviours that the standard fine-tuning scheme suffers from a main limitation. Indeed, some pre-trained neurons are still biased by what they have learned from the source domain despite the fine-tuning on the target domain. We thus propose a new adaptation scheme, PretRand, to take benefit from both worlds: the pre-learned knowledge in the pretrained neurons and the target-specific features easily learnt by random neurons. PretRand, illustrated in Figure 4, consists of three steps:

1. Augmenting the pre-trained branch with a random one to facilitate the learning of new target-specific patterns (§5.1);
2. Normalising both branches to balance their behaviours during fine-tuning (§5.2);
3. Applying learnable weights on both branches to let the network learn which of the random or pre-trained one is better for every class (§5.3).

5.1 Adding the Random Branch

We expect that augmenting the pretrained model with new randomly initialised neurons allows a better adaptation during fine-tuning. Thus, in the adaptation stage, we augment the pre-trained model with a random branch consisting of additional random units (as illustrated in the scheme "a" of Figure 4). Several works have shown that deep (top) layers are more task-specific than shallow (low) ones, while shallow layers learn generic features that are easily transferable between tasks (Peters et al., 2018; Mou et al., 2016). In addition, word embeddings (shallow layers) contain the majority of parameters. Based on these factors, we choose to expand only the top layers as a trade-off between performance and number of parameters (model complexity). In terms of the expanded layers, we add an extra biLSTM layer of k units in the FE (Φ_r, where r stands for random) and a new fully-connected layer of C units (called Ψ_r). With this choice, we increase the complexity of the model by only 1.02× compared to the base one (the standard fine-tuning scheme).

Concretely, for every w_i, two prediction vectors are computed: ŷ_i^p from the pre-trained branch and ŷ_i^r from the random one. Specifically, the pre-trained branch predicts class-probabilities following:

ŷ_i^p = (Ψ_p ◦ Φ_p)(x_i) ,    (7)

with x_i = Υ(w_i). Likewise, the additional random branch predicts class-probabilities following:

ŷ_i^r = (Ψ_r ◦ Φ_r)(x_i) .    (8)

To get the final predictions, we simply apply an element-wise sum between the outputs of the pre-trained branch and the random branch:

ŷ_i = ŷ_i^p ⊕ ŷ_i^r .    (9)

As in the classical scheme, the SCE loss is minimised, but here both branches are trained jointly.

5.2 Independent Normalisation

Our first implementation of adding the random branch was less effective than expected. The main explanation is that the pre-trained units were dominating the random units, which means that the weights, as well as the gradients and outputs, of the pre-trained units absorb those of the random units. As illustrated in the left plot of Figure 5, the absorption phenomenon stays true even at the end of the training process; we observe that the random units' weights are closer to zero. This absorption property handicaps the random units in firing on the words of the target dataset.7

7 The same problem was stated in some computer-vision works (Liu et al., 2015; Wang et al., 2017; Tamaazousti et al., 2017).

To alleviate this absorption phenomenon and push the random units to be more competitive, we normalise the outputs of both branches (ŷ_i^p and ŷ_i^r) using the l2-norm, as illustrated in the scheme "b" of Figure 4. The normalisation of a vector x is computed using the following formula:

N_2(x) = [ x_i / ||x||_2 ]_{i=1}^{i=|x|} .    (10)

Thanks to this normalisation, the absorption phenomenon was solved and the random branch starts to be more effective (see the right distribution of Figure 5).

Furthermore, we have observed that despite the normalisation, the performance of the pre-trained classifiers is still much better than that of the randomly initialised ones. Thus, to make them more competitive, we propose to start with optimising only the randomly initialised units while freezing the pre-trained ones; then, we launch the joint training. We call this technique random++.
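The following PyTorch module is one possible sketch of the PretRand head on top of shared word representations x_i: a pre-trained branch, an added random branch, independent l2-normalisation of their prediction vectors, and the learnable class-wise weighting vectors u and v described in §5.3 below (Eqs. 7-11). It assumes the pre-trained branch is passed in as a (Φ_p, Ψ_p) pair loaded from the source model; this is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretRandHead(nn.Module):
    """Pre-trained branch + randomly initialised branch, merged by a
    normalised, weighted element-wise sum."""

    def __init__(self, emb_dim, hidden_dim, n_classes, pretrained_branch):
        super().__init__()
        self.phi_p, self.psi_p = pretrained_branch       # (Phi_p, Psi_p) from the source model
        self.phi_r = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                             bidirectional=True)         # Phi_r: added random units
        self.psi_r = nn.Linear(2 * hidden_dim, n_classes) # Psi_r
        self.u = nn.Parameter(torch.ones(n_classes))     # class-wise weights, one per branch,
        self.v = nn.Parameter(torch.ones(n_classes))     # both initialised with 1-values

    def forward(self, x):                                # x: (batch, seq, emb_dim) = Upsilon(w)
        y_p = self.psi_p(self.phi_p(x)[0])               # Eq. 7
        y_r = self.psi_r(self.phi_r(x)[0])               # Eq. 8
        y_p = F.normalize(y_p, p=2, dim=-1)              # independent l2-normalisation (Eq. 10)
        y_r = F.normalize(y_r, p=2, dim=-1)
        return self.u * y_p + self.v * y_r               # weighted element-wise sum (Eq. 11)
```

Training then proceeds exactly as in the standard scheme, minimising the SCE loss on the merged predictions, optionally with the random++ warm-up in which only the random branch is updated first.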
Figure 4: Illustrative scheme of the three ideas composing our proposed adaptation method, PretRand. a) We augment the pre-trained branch (grey branch) with a randomly initialised one (green branch) and jointly adapt them with pre-trained ones (grey branch). An element-wise sum is further applied to merge the two branches. b) Before merging, we balance the different behaviours of pre-trained and random units, using an independent normalisation (N). c) Finally we let the network learn which of pre-trained or random neurons are more suited for every class, by performing an element-wise product of the FC layers with learnable weighting vectors (u and v initialised with 1-values).
Figure 5: The distributions of the learnt weight-values for the randomly initialised (green) and pre-trained (grey) fully-connected layers after their joint training. Left: without normalisation, right: with normalisation.

5.3 Attention Learnable Weighting Vectors

Heretofore, the pre-trained and random branches participate equally in every class' predictions, i.e. we do not weight the dimensions of ŷ_i^p and ŷ_i^r before merging them with an element-wise summation. Nevertheless, random classifiers may be more efficient for specific classes compared to pre-trained ones and vice-versa. In other terms, we do not know which of the two branches (random or pre-trained) is better for making a suitable decision for each class. For instance, if the random branch is more efficient for predicting a particular class c_j, it would be better to give more attention to its outputs concerning the class c_j compared to the pretrained branch.

Therefore, instead of simply performing an element-wise sum between the random and pre-trained predictions, we first weight ŷ_i^p with a learnable weighting vector u ∈ R^C and ŷ_i^r with a learnable weighting vector v ∈ R^C, where C is the tagset size (number of classes). The element u_j from the vector u represents the random branch's attention weight for the class c_j, and the element v_j from the vector v represents the pretrained branch's attention weight for the class c_j. Then, we compute a Hadamard product with their associated normalised predictions (see the scheme "c" of Figure 4). Both vectors u and v are initialised with 1-values and are fine-tuned by back-propagation. Formally, the final predictions are computed as follows:

ŷ_i = u ⊙ N_2(ŷ_i^p) ⊕ v ⊙ N_2(ŷ_i^r) .    (11)

6 Experimental Settings

6.1 Datasets

We conduct experiments on supervised domain adaptation from the news domain (formal texts) to the social media domain (noisy texts) for English Part-Of-Speech tagging (POS), Chunking (CK) and Named Entity Recognition (NER). In addition, we experiment on Morpho-syntactic Tagging (MST) of three South-Slavic languages: Slovene, Croatian and Serbian. For the POS task, we use the WSJ part of the Penn-Tree-Bank (PTB) (Marcus et al., 1993) news dataset for the source news domain and TPoS (Ritter et al., 2011), ArK (Owoputi et al., 2013) and TweeBank (Liu et al., 2018) for the target social media domain. For the CK task, we use the CONLL2000 (Tjong Kim Sang and Buchholz, 2000) dataset for the news source domain and TChunk (Ritter et al., 2011) for the target domain. For the NER task, we use the CONLL2003 dataset (Tjong Kim Sang and De Meulder, 2003) for the source news domain and the WNUT-17 dataset (Derczynski et al., 2017) for the social media target domain. For MST, we use the MTT shared-task (Zampieri et al., 2018) benchmark containing two types of datasets, social media and news, for three South-Slavic languages: Slovene (sl), Croatian (hr) and Serbian (sr). Statistics of all the datasets are summarised in Table 1.

6.2 Evaluation Metrics

We evaluate our models using metrics that are commonly used by the community. Specifically, accuracy (acc.) for POS, MST and CK, and entity-level F1 for NER.

Comparison criteria: A common approach to compare the performance between different approaches across different datasets and tasks is to take the average of each approach across all tasks and datasets. However, as it has been discussed in many research papers (Subramanian et al., 2018; Rebuffi et al., 2017; Tamaazousti, 2018), when tasks are not evaluated using the same metrics or results across datasets are not of the same order of magnitude, the simple average does not allow a "coherent aggregation". For this, we use the average Normalized Relative Gain (aNRG) proposed by Tamaazousti et al. (2019), where a score aNRG_i for each approach i is calculated compared to a reference approach (baseline) as follows:
aNRG_i = (1/L) Σ_{j=1}^{L} (s_j^i − s_j^ref) / (s_j^max − s_j^ref) ,    (12)

with s_j^i being the score of the approach i on the dataset j, s_j^ref being the score of the reference approach on the dataset j, and s_j^max being the best achieved score across all approaches on the dataset j.
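For illustration, aNRG can be computed with a few lines of Python; the scores in the usage example are made-up numbers, not results from the paper.

```python
def anrg(scores, ref, best):
    """average Normalized Relative Gain (Eq. 12) of one approach.

    scores: per-dataset scores s_j^i of the approach being evaluated,
    ref:    scores s_j^ref of the reference approach (baseline),
    best:   best scores s_j^max over all compared approaches
            (assumed strictly greater than the reference scores)."""
    gains = [(s - r) / (m - r) for s, r, m in zip(scores, ref, best)]
    return sum(gains) / len(gains)

# Toy usage with three datasets: the closer the result is to 1, the more of
# the gap between the baseline and the best score the approach closes.
print(anrg(scores=[91.0, 89.5, 42.0], ref=[88.5, 86.8, 36.8], best=[92.0, 90.0, 43.0]))
```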
Table 1: Statistics of the used datasets. Top: datasets of the source domain. Bottom: datasets of the target domain.

Task | #Classes | Source | Eval. metric | #Tokens (train - val - test)
POS: POS Tagging | 36 | WSJ | Top-1 Acc. | 912,344 - 131,768 - 129,654
CK: Chunking | 22 | CONLL-2000 | Top-1 Acc. | 211,727 - n/a - 47,377
NER: Named Entity Recognition | 4 | CONLL-2003 | Top-1 Exact-match F1 | 203,621 - 51,362 - 46,435
MST: Morpho-syntactic Tagging | 1304 | Slovene-news | Top-1 Acc. | 439k - 58k - 88k
MST: Morpho-syntactic Tagging | 772 | Croatian-news | Top-1 Acc. | 379k - 50k - 75k
MST: Morpho-syntactic Tagging | 557 | Serbian-news | Top-1 Acc. | 59k - 11k - 16k
POS: POS Tagging | 40 | TPoS | Top-1 Acc. | 10,500 - 2,300 - 2,900
POS: POS Tagging | 25 | ArK | Top-1 Acc. | 26,500 - n/a - 7,700
POS: POS Tagging | 17 | TweeBank | Top-1 Acc. | 24,753 - 11,742 - 19,112
CK: Chunking | 18 | TChunk | Top-1 Acc. | 10,652 - 2,242 - 2,291
NER: Named Entity Recognition | 6 | WNUT-17 | Top-1 Exact-match F1 | 62,729 - 15,734 - 23,394
MST: Morpho-syntactic Tagging | 1102 | Slovene-sm | Top-1 Acc. | 37,756 - 7,056 - 19,296
MST: Morpho-syntactic Tagging | 654 | Croatian-sm | Top-1 Acc. | 45,609 - 8,886 - 21,412
MST: Morpho-syntactic Tagging | 589 | Serbian-sm | Top-1 Acc. | 45,708 - 9,581 - 23,327

6.3 Implementation Details

All our models are implemented with the PyTorch library (Paszke et al., 2017). We use the following Hyper-Parameters (HP):

WRE's HP: In the standard word-level embeddings, tokens are lower-cased, while the character-level component still retains access to the capitalisation information. We set the randomly initialised character embedding dimension at 50 and the dimension of the hidden states of the character-level biLSTM at 100, and used 300-dimensional word-level embeddings. The latter were pre-loaded from publicly available GloVe vectors pre-trained on 42 billion words from a web crawl and containing 1.9M words (Pennington et al., 2014) for English experiments, and pre-loaded from publicly available FastText (Bojanowski et al., 2017) vectors pre-trained on Common Crawl for South-Slavic languages.8 These embeddings are also updated during training. For experiments with contextual word embeddings (§7.2.3), we used ELMo (Embeddings from Language Models) embeddings (Peters et al., 2018). For English, we use the small official ELMo model pre-trained on the 1 billion word benchmark (13.6M parameters).9 Regarding South-Slavic languages, ELMo pre-trained models are not available except for Croatian (Che et al., 2018).10 Note that, in all experiments, contextual embeddings are frozen during training.

8 https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
9 https://allennlp.org/elmo
10 https://github.com/HIT-SCIR/ELMoForManyLangs

FE's HP: We use a single biLSTM layer (token-level feature extractor) and set the number of units to 200.

PretRand's random branch HP: We experiment our approach with k = 200 added random units.

Global HP: In all experiments, training (pretraining and fine-tuning) is performed using SGD with momentum and early stopping, mini-batches of 16 sentences and a learning rate of 1.5 × 10−2.

7 Experimental Results

This section reports all our experimental results and analysis. First we analyse the standard fine-tuning scheme of transfer learning (§7.1). Then we assess the performance of our proposed approach, PretRand (§7.2).

7.1 Analysis of the Standard Fine-tuning Scheme

We report in Table 2 the results of the reference supervised training scheme from scratch, followed by the results of the standard fine-tuning scheme, which outperforms the reference. Precisely, transfer learning exhibits an improvement of ∼+3% acc. for TPoS, ∼+1.2% acc. for ArK, ∼+1.6% acc. for TChunk and ∼+4.5% F1 for WNUT.

In the following we provide the results of our analysis of the standard fine-tuning scheme:

1. Analysis of the hidden negative transfer (§7.1.1).
POS (Acc.) CK (Acc.) NER (F1) Dataset TPoS ARK Tweebank TChunk WNUT Method dev test test dev test dev test test From scratch 88.52 86.82 90.89 91.61 91.66 87.76 85.83 36.75 Standard Fine-tuning 90.95 89.79 92.09 93.04 93.29 90.71 89.21 41.25 Table 2: The main results of our proposed approach, transferring pretrained models, on social media datasets (Acc (%) for POS and CK and F1 (%) for NER). The best score for each dataset is highlighted in bold. 2. Quantifying the change of individual pre- show the percentage of positive transfer and red trained neurons after fine-tuning (§7.1.2). bars give the percentage of negative transfer. We observe that even though the standard fine-tuning 3. Visualising the evolution of pretrained neu- approach is effective since the resulting positive rons stimulus during fine-tuning (§7.1.3). transfer is higher than the negative transfer in all cases, this last mitigates the final gain brought by 7.1.1 Analysis of the Hidden Negative the standard fine-tuning. For instance, for TChunk Transfer dataset, standard fine-tuning corrected ∼4.7% of To investigate the hidden negative transfer in the predictions but falsified ∼1.7%, which reduces the standard fine-tuning scheme of transfer learning, final gain to ∼3%.11 we propose the following experiments. First, we show that the final gain brought by the standard Qualitative Examples of Negative Transfer fine-tuning can be separated into two categories: We report in Table 3 concrete examples of words positive transfer and negative transfer. Second, whose predictions were falsified when using the we provide some qualitative examples of negative standard fine-tuning scheme compared to standard transfer. supervised training scheme. Among mistakes we have observed: Quantifying Positive Transfer & Negative Transfer • Tokens with an upper-cased first letter: In news (formal English), only proper nouns start with an upper-case letter inside sentences. Consequently, when using transfer learning, the pre-trained units fail to slough this pattern which is not always respected in social me- dia. Hence, we found that most of the tokens with an upper-cased first letter are mistakenly predicted as proper nouns (PROPN) in POS, e.g. Award, Charity, Night, etc. and as entities in NER, e.g. Father, Hey, etc., which is con- Figure 6: The percentage of negative transfer and pos- sistent with the findings of Seah et al. (2012): itive transfer brought by the standard fine-tuning adap- negative transfer is mainly due to conditional tation scheme compared to supervised training from distribution differences between source and scratch scheme. target domains. We recall that we define positive transfer as the • Contractions are frequently used in social percentage of tokens that were wrongly predicted media to shorten a set of words. For instance, by random initialisation (supervised training from in TPoS dataset, we found that “’s” is in most scratch), but the standard fine-tuning changed to cases predicted as a “possessive ending (pos)” the correct ones, while negative transfer represents instead of “Verb, 3rd person singular present the percentage of words that were tagged correctly (vbz)”. Indeed, in formal English, “’s” is used by random initialisation, but using standard fine- in most cases to express the possessive form, tuning gives wrong predictions. Figure 6 shows 11 the results on English social media datasets, first Here we calculate positive and negative transfer at the token-level. 
Thus, the gain shown in Figure 6 for WNUT tagged with the classic supervised training scheme dataset does not correspond to the one in Table 2, since the F1 and then using the standard fine-tuning. Blue bars metric is calculated only on named-entities.
DataSet TPoS Award ’s its? Mum wont? id? Exactly nn vbz prp nn MD prp uh nnp pos prp$ uh VBP nn rb ArK Charity I’M? 2pac× 2× Titans? wth× nvr× noun L pnoun P Z ! R pnoun E $ $ N P V TweeBank amazin• Night Angry stangs #Trump awsome• bout• adj noun adj propn propn adj adp noun propn propn noun X intj verb TChunk luv× **ROCKSTAR**THURSDAY ONLY Just wyd× id? b-vp b-np i-np b-advp b-np b-np i-intj O b-np b-np b-intj i-np Wnut Hey Father &× IMO× UN Glasgow Supreme O O O O O b-location b-person b-person b-person i-group b-group b-group b-group b-corporation nn=N=noun=common noun / nnp=pnoun=propn=proper noun / vbz=Verb, 3rd person singular present / pos=possessive ending / prp=personal pronoun / prp$=possessive pronoun / md=modal / VBP=Verb, non-3rd person singular present / uh=!=intj=interjection / rb=R=adverb / L=nominal + verbal or verbal + nominal / E=emoticon / $=numerical / P=pre- or postposition, or subordinating conjunction / Z=proper noun + possessive ending / V=verb / adj=adjective / adp=adposition Table 3: Examples of falsified predictions by standard fine-tuning scheme when transferring from news domain to social media domain. Line 1: Some words from the validation-set of each data-set. Line 2: Correct labels predicted by the classic supervised setting (Random-200). Line 3: Wrong labels predicted by SFT setting. Mistake type: for words with first capital letter, • for misspelling, ? for contractions, × for abbreviations. ArK dataset Tchunk dataset Wnut dataset Figure 7: Correlation results between Φ units’ activations before fine-tuning (columns) and after fine-tuning (rows). Brighter colours indicate higher correlation. e.g. “company’s decision”, but rarely in con- dataset; and luv (love) and wyd (what you tractions that are frequently used in social me- doing?) in TChunk dataset. dia, e.g. “How’s it going with you?”. Simi- larly, “wont” is a frequent contraction for “will • Misspellings: Likewise, we found that not”, e.g. “i wont get bday money lool”, pre- the standard fine-tuning scheme often gives dicted as “verb” instead of “modal (MD)”12 wrong predictions for misspelt words, e.g. aw- by the SFT scheme. The same for “id”, which some, bout, amazin. stands for “I would”. 7.1.2 Quantifying the change of individual • Abbreviations are frequently used in social pretrained neurons media to shorten the way a word is standardly To visualise the bias phenomenon occurring when written. We found that the standard fine- using the standard fine-tuning scheme, we quan- tuning scheme stumbles on abbreviations pre- tify the charge of individual neurons. Precisely, dictions, e.g. 2pac (Tupac), 2 (to), ur (your), we plot the asymmetric correlation matrix C (The wth (what the hell) and nvr (never) in ArK method described in §4.2.1) between the Φ layer’s 12 A modal is an auxiliary verb expressing: ability (can), units before and after fine-tuning for each social obligation (have), etc. media dataset (ArK for POS, TChunk for CK and
WNUT-17 for NER). From the resulting correla- Unit-196: ArK dataset tion matrices illustrated in Figure 7, we can ob- serve the diagonal representing the charge of each unit, with most of the units having a high charge (light colour), alluding the fact that every unit after fine-tuning is highly correlated with itself before fine-tuning. Hypothesising that high correlation in the diagonal entails high bias, the results of this experiment confirm our initial motivation that pre- Unit-64: ArK dataset trained units are highly biased to what they have learnt in the source-dataset, making them limited to learn some patterns that are specific to the target- dataset. Our remarks were confirmed recently in the recent work of Merchant et al. (2020) who also found that fine-tuning is a “conservative process”. 7.1.3 Visualising the Evolution of Pretrained Neurons Stimulus during Fine-tuning Figure 8: Individual units activations before and Here, we give concrete visualisations of the evo- during fine-tuning from ArK POS dataset. For lution of pretrained neurons stimulus during fine- each unit we show Top-10 words activating the said tuning when transferring from the news domain to unit. The first column: top-10 words from the source the social media domain. Following the method validation-set (WSJ) before fine-tuning, Column 0: top- 10 words from the target validation-set (ArK) before described in section 4.2.2, we plot the matrices of fine-tuning. Columns 5 to 20: top-10 words from the top-10 words activating each neuron j, positively target validation-set during fine-tuning epochs. (j) (j) (Abest+ ) or negatively (Abest− ). The results are plotted in Figure 8 for ArK (POS) dataset and Fig- ure 9 for TweeBank dataset (POS). Rows represent – Unit-64 is sensitive to plural proper the top-10 words from the target dataset activat- nouns on news-domain before fine- ing each unit, and columns represent fine-tuning tuning, e.g. Koreans and Europeans, epochs; before fine-tuning in column 0 (at this stage and also on ArK during fine-tuning, e.g. the model is only trained on the source-dataset), Titans and Patriots. However, in ArK and during fine-tuning (columns 5 to 20). Addi- dataset, “Z” is a special class for “proper tionally, to get an idea about each unit’s stimulus noun + possessive ending”, e.g. Jay’s on source dataset, we also show, in the first column mum, and in some cases the apostrophe is (Final-WSJ), top-10 words from the source dataset omitted, e.g. Fergusons house for Fergu- activating the same unit before fine-tuning. In the son’s house, which thus may bring ambi- following, we describe the information encoded by guity with plural proper nouns in formal each provided neuron.13 English. Consequently, unit-64, initially sensitive to plural proper nouns, is also • Ark - POS: (Figure 8) firing on words from the class “Z”, e.g. – Unit-196 is sensitive to contractions con- Timbers (Timber’s). taining an apostrophe regardless of the • Tweebank - POS: (Figure 9) contraction’s class. However, unlike news, in social media and particularly – Unit-37 is sensitive before and during ArK dataset, apostrophes are used in dif- fine-tuning on plural nouns, such as gaz- ferent cases. For instance i’m, i’ll and ers and feminists. However, it is also it’s belong to the class “L” that stands firing on the word slangs because of for “nominal + verbal or verbal + nom- the s ending, which is in fact a proper inal”, while the contractions can’t and noun. 
This might explain the wrong pre- don’t belong to the class “Verb”. diction for the word slangs (noun instead 13 Here we only select some interesting neurons. However of proper noun) given by the standard we also found many neurons that are not interpretable. fine-tuning scheme (Table 3).