Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality


           Adithya V Ganesan, Matthew Matero, Aravind Reddy Ravula,
                       Huy Vu, and H. Andrew Schwartz
              Department of Computer Science, Stony Brook University
   {avirinchipur, mmatero, aravula, hvu, has}@cs.stonybrook.edu

Abstract

In human-level NLP tasks, such as predicting mental health, personality, or demographics, the number of observations is often smaller than the standard 768+ hidden state sizes of each layer within modern transformer-based language models, limiting the ability to effectively leverage transformers. Here, we provide a systematic study on the role of dimension reduction methods (principal components analysis, factorization techniques, or multi-layer auto-encoders) as well as the dimensionality of embedding vectors and sample sizes as a function of predictive performance. We first find that fine-tuning large models with a limited amount of data poses a significant difficulty which can be overcome with a pre-trained dimension reduction regime. RoBERTa consistently achieves top performance in human-level tasks, with PCA giving benefit over other reduction methods in better handling users that write longer texts. Finally, we observe that a majority of the tasks achieve results comparable to the best performance with just 1/12 of the embedding dimensions.

1 Introduction

Transformer based language models (LMs) have quickly become the foundation for accurately approaching many tasks in natural language processing (Vaswani et al., 2017; Devlin et al., 2019). Their success owes to their ability to capture both syntactic and semantic information (Tenney et al., 2019), modeled over large, deep attention-based networks (transformers) with hidden state sizes on the order of 1000 over tens of layers (Liu et al., 2019; Gururangan et al., 2020). In total, such models typically have from hundreds of millions (Devlin et al., 2019) to a few billion parameters (Raffel et al., 2020). However, the size of such models presents a challenge for tasks involving small numbers of observations, such as for the growing number of tasks focused on human-level NLP.

Human-level NLP tasks, rooted in computational social science, focus on making predictions about people from their language use patterns. Some of the more common tasks include age and gender prediction (Sap et al., 2014; Morgan-Lopez et al., 2017), personality (Park et al., 2015; Lynn et al., 2020), and mental health prediction (Coppersmith et al., 2014; Guntuku et al., 2017; Lynn et al., 2018). Such tasks present an interesting challenge for the NLP community to model the people behind the language rather than the language itself, and the social scientific community has begun to see success of such approaches as an alternative or supplement to standard psychological assessment techniques like questionnaires (Kern et al., 2016; Eichstaedt et al., 2018). Generally, such work is helping to embed NLP in a greater social and human context (Hovy and Spruit, 2016; Lynn et al., 2019).

Despite the simultaneous growth of both (1) the use of transformers and (2) human-level NLP, the effective merging of transformers with human-level tasks has received little attention. In a recent human-level shared task on mental health, most participants did not utilize transformers (Zirikly et al., 2019). A central challenge for their utilization in such scenarios is that the number of training examples (i.e., sample size) is often only hundreds, while the parameters of such deep models number in the hundreds of millions. For example, recent human-level NLP shared tasks focused on mental health have had N = 947 (Milne et al., 2016), N = 9,146 (Lynn et al., 2018) and N = 993 (Zirikly et al., 2019) training examples. Such sizes all but rule out the increasingly popular approach of fine-tuning transformers, whereby all of their millions of parameters are allowed to be updated toward the specific task one is trying to achieve (Devlin et al., 2019; Mayfield and Black, 2020). Recent research not only highlights the difficulty of fine-tuning with few samples (Jiang et al., 2020) but also shows that it becomes unreliable even with thousands
of training examples (Mosbach et al., 2020).

On the other hand, the common transformer-based approach of deriving contextual embeddings from the top layers of a pre-trained model (Devlin et al., 2019; Clark et al., 2019) still leaves one with approximately as many embedding dimensions as training examples. In fact, in one of the few successful cases of using transformers for a human-level task, further dimensionality reduction was used to avoid overfit (Matero et al., 2019), but an empirical understanding of the application of transformers for human-level tasks (which models are best, and the relationship between embedding dimensions, sample size, and accuracy) has yet to be established.

In this work, we empirically explore strategies to effectively utilize transformer-based LMs for relatively small sample-size human-level tasks. We provide the first systematic comparison of the most widely used transformer models for demographic, personality, and mental health prediction tasks. Then, we consider the role of dimension reduction to address the challenge of applying such models on small sample sizes, yielding a suggested minimum number of dimensions necessary given a sample size for each of demographic, personality, and mental health tasks [1]. While it is suspected that transformer LMs contain more dimensions than necessary for document- or word-level NLP (Li and Eisner, 2019; Bao and Qiao, 2019), this represents the first study on transformer dimensionality for human-level tasks.

[1] Dimension reduction techniques can also be pre-trained, leveraging larger sets of unlabeled data.

2 Related Work

Recently, NLP has taken to human-level predictive tasks using increasingly sophisticated techniques. The most common approaches use n-grams and LDA (Blei et al., 2003) to model a person's language and behaviors (Resnik et al., 2013; Kern et al., 2016). Other approaches utilize word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and, more recently, contextual word representations (Ambalavanan et al., 2019).

Our work is inspired by one of the top performing systems at a recent mental health prediction shared task (Zirikly et al., 2019) that utilized transformer-based contextualized word embeddings fed through a non-negative matrix factorization to reduce dimensionality (Matero et al., 2019). While the approach seems reasonable for addressing the dimensionality challenge in using transformers, many critical questions remain unanswered: (a) Which type of transformer model is best? (b) Would fine-tuning have worked instead? and (c) Does such an approach generalize to other human-level tasks? Most of the time, one does not have the luxury of a shared task for the problem at hand to determine the best approach. Here, we look across many human-level tasks, some of which have the luxury of relatively large sample sizes (in the thousands) from which to establish upper bounds, and ultimately draw generalizable information on how to approach a human-level task given its domain (demographic, personality, mental health) and sample size.

Our work also falls in line with a rising trend in AI and NLP to quantify the number of dimensions necessary. While this has not been considered for human-level tasks, it has been explored in other domains. The post-processing algorithm for static word embeddings (Mu and Viswanath, 2018), motivated by the power-law distribution of maximum explained variance and the domination of the mean vector, turned out to be very effective in making these embeddings more discriminative. Analysis of contextual embedding models (Ethayarajh, 2019) suggests that static embeddings contribute less than 5% to the explained variance; the contribution of the mean vector starts dominating when contextual embedding models are used for human-level tasks. This is an effect of averaging the message embeddings to form user representations in human-level tasks, and it further motivates the need to process these contextual embeddings into more discriminative features.

Lastly, our work weighs into the discussion of which type of model is best at producing effective contextual embeddings. A majority of the models fall under two broad categories based on how they are pre-trained: auto-encoder (AE) and auto-regressive (AR) models. We compare the performance of AE- and AR-style LMs by comparing two widely used models from each category with a comparable number of parameters. From experiments involving BERT, RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019) and GPT-2 (Radford et al., 2019), we find that AE-based models perform better than AR-style models (with comparable model sizes), and
RoBERTa is the best choice amongst these four widely used models.

3 Data & Tasks

We evaluate approaches over 7 human-level tasks spanning demographics, mental health, and personality prediction. The 3 datasets used for these tasks are described below.

FB-Demogs. (age, gen, ope, ext) One of our goals was to leverage one of the largest human-level datasets in order to evaluate over subsamples of different sizes. For this, we used the Facebook demographic and personality dataset of Kosinski et al. (2013). The data was collected from approximately 71k consenting participants who shared Facebook posts along with demographic and personality scores from Jan-2009 through Oct-2011. The users in this sample had written at least 1,000 words and had selected English as their primary language. Age (age) was self-reported and limited to those 65 years or younger (data beyond this age becomes very sparse), as in Sap et al. (2014). Gender (gen) was only provided as a limited single binary, male-female classification.

Personality was derived from the Big 5 personality trait questionnaires, including both extraversion (ext, one's tendency to be energized by social interaction) and openness (ope, one's tendency to be open to new ideas) (Schwartz et al., 2013). Disattenuated Pearson correlation [2] (rdis) was used to measure the performance of these two personality prediction tasks.

CLPsych-2018. (bsag, gen2) The CLPsych 2018 shared task (Lynn et al., 2018) consisted of sub-tasks aimed at early prediction of mental health scores (depression, anxiety and BSAG [3] score) based on language. The data for this shared task (Power and Elliott, 2005) comprised English essays written by 11 year old students along with their gender (gen2) and income classes. There were 9,217 students' essays for training and 1,000 for testing. The average word count in an essay was less than 200. Each essay was annotated with the student's psychological health measure, BSAG (when 11 years old), and distress scores at ages 23, 33, 42 and 50. This task used disattenuated Pearson correlation as the metric (rdis).

[2] Disattenuated Pearson correlation helps account for the error of the measurement instrument (Kosinski et al., 2013; Murphy and Davidshofer, 1988). Following Lynn et al. (2020), we use reliabilities rxx = 0.70 and ryy = 0.77.
[3] Bristol Social Adjustment Guide (Ghodsian, 1977) scores contain twelve sub-scales that measure different aspects of childhood behavior.

CLPsych-2019. (sui) This 2019 shared task (Zirikly et al., 2019) comprised 3 sub-tasks for predicting the suicide risk level of Reddit users. This included a history of user posts on r/SuicideWatch (SW), a subreddit dedicated to those wanting to seek outside help for processing their current state of emotions. Their posts on other subreddits (NonSuicideWatch) were also collected. The users were annotated with one of 4 risk levels (none, low, moderate and severe risk) based on their history of posts. In total this task spans 496 users in training and 125 in testing. We focused on Task A, predicting suicide risk of a user by evaluating their (English) posts across SW, measured via macro-F1.

              FB-Demogs      CLPsych 2018    CLPsych 2019
              (Sap et al.)   (Lynn et al.)   (Zirikly et al.)
    Npt        56,764          9,217             496
    Nmax       10,000          9,217             496
    Nte         5,000          1,000             125

Table 1: Summary of the datasets. Npt is the number of users available for pre-training the dimension reduction model; Nmax is the maximum number of users available for task training. For CLPsych 2018 and CLPsych 2019, this is the same sample as the pre-training data; for Facebook, a disjoint set of 10k users was available for task training. Nte is the number of test users; this is always a disjoint set of users from the pre-training and task training samples.

4 Methods

Here we discuss how we utilized representations from transformers, our approaches to dimensionality reduction, and our technique for robust evaluation using bootstrapped sampling.

4.1 Transformer Representations

The second-to-last layer representation of all the messages was averaged to produce a 768-dimensional feature for each user [4]. These user representations are reduced to lower dimensions as described in the following paragraphs. The message representation from a layer was attained by averaging the token embeddings of that layer.

[4] The second-to-last layer was chosen owing to its consistent performance in capturing semantic and syntactic structures (Jawahar et al., 2019).
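As an illustration of this user-representation step, the following sketch (ours, not the authors' released code; the model name, truncation settings, and the absence of batching are assumptions made for brevity) uses the Hugging Face transformers API to average second-to-last-layer token embeddings within each message and then average the message vectors for a user:

    # Illustrative sketch: build a 768-d user vector by averaging second-to-last
    # layer token embeddings per message, then averaging over the user's messages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
    model.eval()

    def user_embedding(messages):
        """messages: list of strings written by one user."""
        msg_vecs = []
        with torch.no_grad():
            for text in messages:
                enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
                out = model(**enc)
                # hidden_states holds the embedding output plus every encoder layer;
                # index -2 is the second-to-last encoder layer.
                tokens = out.hidden_states[-2].squeeze(0)    # (seq_len, 768)
                msg_vecs.append(tokens.mean(dim=0))          # average over tokens
        return torch.stack(msg_vecs).mean(dim=0)             # average over messages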
To consider a variety of transformer LM architectures, we explored two popular auto-encoder models (BERT and RoBERTa) and two auto-regressive models (XLNet and GPT-2).

For fine-tuning evaluations, we used the transformer-based model that performs best across the majority of our task suite. Transformers are typically trained on single messages, or pairs of messages, at a time. Since we are tuning towards a human-level task, we label each user's messages with their human-level attribute and treat it as a standard document-level task (Morales et al., 2019). Since we are interested in relative differences in performance, we limit each user to at most 20 messages (approximately the median number of messages), randomly sampled, to save compute time for the fine-tuning experiments.

Algorithm 1: Dimension Reduction and Evaluation
Notation: hD: hidden size; f(.): function to train dimension reduction; θ: linear model; g(.,.): logistic loss for classification and L2 loss for regression; η: learning rate; T: number of iterations (100).
Data: Dpt ∈ R^(Npt x hD): pre-training embeddings; Dmax ∈ R^(Nmax x hD): task training embeddings; Dte ∈ R^(Nte x hD): test embeddings; Ymax: outcomes for the train set; Yte: outcomes for the test set.
 1: W ← f(Dpt)
 2: D̄max ← Dmax W
 3: D̄te ← Dte W
 4: for i = 1, ..., 10 do
 5:   θ_i^(0) ← 0
 6:   Sample (D̄ta, Yta) from (D̄max, Ymax)
 7:   for j = 1, ..., T do
 8:     θ_i^(j) ← θ_i^(j-1) − η ∇g(D̄ta, Yta)
 9:   end for
10:   Ŷte_i ← D̄te θ_i^(T)
11: end for
12: Evaluate(Ŷte, Yte)

4.2 Dimension Reduction

We explore singular value decomposition-based methods such as principal components analysis (PCA) (Halko et al., 2011), non-negative matrix factorization (NMF) (Févotte and Idier, 2011) and factor analysis (FA), as well as a deep learning approach: multi-layer non-linear auto-encoders (NLAE) (Hinton and Salakhutdinov, 2006). We also considered the post-processing algorithm (PPA) for word embeddings [5] (Mu and Viswanath, 2018), which has shown effectiveness with PCA at the word level (Raunak et al., 2019). Importantly, besides transformer LMs being pre-trained, so too can dimension reduction. Therefore, we distinguish: (1) learning the transformation from higher to lower dimensions (preferably on a large data sample from the same domain) and (2) applying the learned transformation (on the task's train/test set). For the first step, we used a separate set of 56k unlabeled users in the case of FB-demog [6]. For CLPsych-2018 and -2019 (where separate data from the exact domains was not readily available), we used the task training data to train the dimension reduction. Since variance explained in factor analysis typically follows a power law, these methods transformed the 768 original embedding dimensions down to k, in powers of 2: 16, 32, 64, 128, 256 or 512.

[5] The 'D' value was set to ⌊number of dimensions / 100⌋.
[6] These pre-trained dimension reduction models are made available.

4.3 Bootstrapped Sampling & Training

We systematically evaluate the role of training sample size (Nta) versus embedding dimensions (k) for human-level prediction tasks. The approach is described in Algorithm 1. Varying Nta, the task-specific train data (after dimension reduction) is sampled randomly (with replacement) to get ten training samples with Nta users each. Small Nta values simulate a low-data regime and were used to understand its relationship with the least number of dimensions required to perform best (Nta vs. k). Bootstrapped sampling was done to arrive at a conservative estimate of performance. Each of the bootstrapped samples was used to train either an L2-penalized (ridge) regression model or logistic regression, for the regression and classification tasks respectively. The performance on the test set using models from each bootstrapped training sample was recorded in order to derive a mean and standard error for each Nta and k for each task.

To summarize results over the many tasks and possible k and Nta values in a useful fashion, we propose a 'first k to peak' (fkp) metric. For each Nta, this is the first observed k value for which the mean score is within the 95% confidence interval of the peak performance. This quantifies the minimum number of dimensions required for peak performance.
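As a concrete illustration of Algorithm 1, the sketch below (ours; the scikit-learn estimators, the default ridge penalty, and the use of Pearson r as the score are illustrative choices for a regression task, not the paper's exact configuration) pre-trains a PCA transformation on the pre-training embeddings, applies it to the task train and test embeddings, and reports a mean and standard error over ten bootstrapped training samples:

    # Sketch of Algorithm 1 with PCA as f(.) and ridge regression as the task model.
    # D_pt, D_max, D_te are (N, 768) numpy arrays of user embeddings; y_max and
    # y_te are the corresponding continuous outcomes for a regression task.
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge

    def bootstrap_eval(D_pt, D_max, y_max, D_te, y_te, k=128, n_ta=500, n_boot=10, seed=0):
        reducer = PCA(n_components=k).fit(D_pt)        # step 1: learn W on pre-training data
        X_max, X_te = reducer.transform(D_max), reducer.transform(D_te)
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_boot):                        # ten bootstrapped training samples
            idx = rng.choice(len(X_max), size=n_ta, replace=True)
            model = Ridge().fit(X_max[idx], y_max[idx])
            scores.append(pearsonr(y_te, model.predict(X_te))[0])
        scores = np.array(scores)
        return scores.mean(), scores.std(ddof=1) / np.sqrt(n_boot)

For classification tasks, the same loop would swap in a logistic regression model and macro-F1 as the score.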
                              demographics                personality        mental health
  Nta   type   name        age      gen     gen2        ext      ope        bsag     sui
                           (r)      (F1)    (F1)        (rdis)   (rdis)     (rdis)   (F1)
  100   AE     BERT        0.533    0.703   0.761       0.163    0.184      0.424    0.360
  100   AE     RoBERTa     0.589    0.712   0.761       0.123    0.203      0.455    0.363
  100   AR     XLNet       0.582    0.582   0.744       0.130    0.203      0.441    0.315
  100   AR     GPT-2       0.517    0.584   0.624       0.082    0.157      0.397    0.349
  500   AE     BERT        0.686    0.810   0.837       0.278    0.354      0.484    0.466
  500   AE     RoBERTa     0.700    0.802   0.852       0.283    0.361      0.490    0.432
  500   AR     XLNet       0.697    0.796   0.821       0.261    0.336      0.508    0.439
  500   AR     GPT-2       0.646    0.756   0.762       0.211    0.280      0.481    0.397

Table 2: Comparison of the most commonly used auto-encoder (AE) and auto-regressive (AR) language models after reducing the 768 dimensions to 128 using NMF, trained on 100 and 500 samples (Nta) per task. Classification tasks (gen, gen2 and sui) were scored using macro-F1 (F1); the remaining regression tasks were scored using Pearson r (r) or disattenuated Pearson r (rdis). AE models predominantly perform the best. RoBERTa and BERT show consistent performance, with the former performing the best in most tasks. The LMs in the table were base models (approx. 110M parameters).

5 Results

5.1 Best LM for Human-Level Tasks

We start by comparing transformer LMs, replicating the setup of one of the state-of-the-art systems for the CLPsych-2019 task, in which embeddings were reduced from BERT-base to approximately 100 dimensions using NMF (Matero et al., 2019). Specifically, we used 128 dimensions (to stick with the powers of 2 that we use throughout this work) as we explore the other LMs over multiple tasks (we will explore other dimensions next), and otherwise use the bootstrapped evaluation described in the methods.

Table 2 shows the comparison of the four transformer LMs when varying the sample size (Nta) between two low data regimes: 100 and 500 [7]. RoBERTa and BERT were the best performing models in almost all the tasks, suggesting auto-encoder based LMs are better than auto-regressive models for these human-level tasks. Further, RoBERTa performed better than BERT in the majority of cases. Since the numbers of model parameters are comparable, this may be attributable to RoBERTa's increased pre-training corpus, which is inclusive of more human discourse and larger vocabularies in comparison to BERT.

[7] The performance of all transformer embeddings without any dimension reduction, along with smaller sized models, can be found in appendix section D.3.

    Nta    Method         Age     Gen
    100    Fine-tuned     0.54    0.54
    100    Pre-trained    0.56    0.63
    500    Fine-tuned     0.64    0.60
    500    Pre-trained    0.66    0.74

Table 3: Comparison of task-specific fine-tuning of RoBERTa (top 2 layers) and pre-trained RoBERTa embeddings (second-to-last layer) for the age and gender prediction tasks. Results are averaged across 5 trials, randomly sampling users equal to Nta from the Facebook data and reducing messages to a maximum of 20 per user.

5.2 Fine-Tuning Best LM

We next evaluate fine-tuning in these low data situations [8]. Utilizing RoBERTa, the best performing transformer from the previous experiments, we perform fine-tuning across the age and gender tasks. Following Sun et al. (2019) and Mosbach et al. (2020), we freeze layers 0-9 and fine-tune layers 10 and 11. Even these top 2 layers alone of RoBERTa still result in a model that is updating tens of millions of parameters while being tuned to a dataset of hundreds of users and at most 10,000 messages.

In Table 3, results for age and gender are shown for both sample sizes of 100 and 500. For age, the average prediction across all of a user's messages was used as the user's prediction, and for gender the mode was used.

[8] As we are focused on readily available models, we consider substantial changes to the architecture or training as outside the scope of this systematic evaluation of existing techniques.
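The layer-freezing setup described above can be sketched as follows (an illustrative reconstruction, not the authors' training script; the optimizer, learning rate, and head configuration are assumptions):

    # Sketch: freeze RoBERTa layers 0-9 and update only layers 10-11 plus the task head.
    import torch
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=1)

    for param in model.roberta.parameters():
        param.requires_grad = False                  # freeze embeddings and all encoder layers
    for layer in model.roberta.encoder.layer[10:]:   # unfreeze the top two encoder layers
        for param in layer.parameters():
            param.requires_grad = True

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=2e-5
    )

Even with everything below layer 10 frozen, the trainable parameters here number in the tens of millions, which is the mismatch with sample size discussed above.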
  Npt    Nta   Reduction    age      gen     gen2     ext      ope      bsag     sui
                            (r)      (F1)    (F1)     (rdis)   (rdis)   (rdis)   (F1)
  56k*   100   PCA          0.650    0.747   0.777    0.189    0.248    0.466    0.392
  56k*   100   PCA-PPA      0.517    0.715   0.729    0.173    0.176    0.183    0.358
  56k*   100   FA           0.534    0.722   0.729    0.171    0.183    0.210    0.360
  56k*   100   NMF          0.589    0.712   0.761    0.123    0.203    0.455    0.363
  56k*   100   NLAE         0.654    0.744   0.782    0.188    0.263    0.447    0.367
  56k*   500   PCA          0.729    0.821   0.856    0.346    0.384    0.514    0.416
  56k*   500   PCA-PPA      0.707    0.814   0.849    0.317    0.349    0.337    0.415
  56k*   500   FA           0.713    0.819   0.849    0.322    0.361    0.400    0.415
  56k*   500   NMF          0.700    0.802   0.852    0.283    0.361    0.490    0.432
  56k*   500   NLAE         0.725    0.820   0.843    0.340    0.394    0.485    0.409
  500    100   PCA          0.644    0.749   0.788    0.186    0.248    0.412    0.392
  500    100   NLAE         0.634    0.743   0.744    0.159    0.230    0.433    0.367
  500    500   PCA          0.726    0.819   0.850    0.344    0.382    0.509    0.416
  500    500   NLAE         0.715    0.798   0.811    0.312    0.360    0.490    0.409

Table 4: Comparison of different dimension reduction techniques applied to RoBERTa embeddings (penultimate layer) reduced down to 128 dimensions, with Nta = 100 and 500. The number of user samples for pre-training the dimension reduction model, Npt, was 56k (*) except for gen2 and bsag (which had 9k users) and sui (which had 496 users). PCA performs the best overall, and NLAE performs as well as PCA consistently. With a uniform pre-training size (Npt = 500), PCA performs better than NLAE.
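For reference, the PCA-PPA rows above apply the post-processing algorithm (PPA) of Mu and Viswanath (2018), which subtracts the mean vector and removes projections onto the top-D principal directions. A minimal sketch (ours), assuming D = ⌊d/100⌋ as in footnote 5, is:

    # Minimal sketch of PPA: subtract the mean vector, then remove the projections
    # of each embedding onto the top-D principal directions.
    import numpy as np
    from sklearn.decomposition import PCA

    def ppa(X, D=None):
        """X: (n_users, dim) embedding matrix."""
        if D is None:
            D = X.shape[1] // 100                               # footnote 5: D = floor(dim / 100)
        X = X - X.mean(axis=0)
        components = PCA(n_components=D).fit(X).components_    # shape (D, dim)
        return X - (X @ components.T) @ components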

Overall, we find that fine-tuning offers lower performance with increased overhead for both train time and modeling complexity (hyperparameter tuning, layer selection, etc.).

We did robustness checks for hyper-parameters to offer more confidence that this result was not simply due to the fastidious nature of fine-tuning. The process is described in Appendix B, including an extensive exploration of hyper-parameters, which never resulted in improvements over the pre-trained setup. We are left to conclude that fine-tuning over such small user samples, at least with current typical techniques, is not able to produce results on par with using transformers to produce pre-trained embeddings.

5.3 Best Reduction Technique for Human-Level Tasks

We evaluated the reduction techniques in the low data regime by comparing their performance on the downstream tasks across 100 and 500 training samples (Nta). As described in the methods, techniques including PCA, NMF and FA, along with NLAE, were applied to reduce the 768-dimensional RoBERTa embeddings to 128 features. The results in Table 4 show that PCA and NLAE perform most consistently, with PCA having the best scores in the majority of tasks. NLAE's performance appears dependent on the amount of data available during the pre-training. This is evident from the results in Table 4 where Npt was set to a uniform value and tested for all the tasks with Nta set to 100 and 500. Thus, PCA appears more reliable, showing more generalization from low samples.

5.4 Performance by Sample Size and Dimensions

Now that we have found that (1) RoBERTa generally performed best, (2) pre-training worked better than fine-tuning, and (3) PCA was most consistently best for dimension reduction (often doing better than the full dimensions), we can systematically evaluate model performance as a function of training sample size (Nta) and number of dimensions (k) over tasks spanning demographics, personality, and mental health. We exponentially increase k from 16 to 512, recognizing that variance explained decreases exponentially with dimension (Mu and Viswanath, 2018). The performance is also compared with that of using the RoBERTa embeddings without any reduction.

Figure 1 compares the scores at reduced dimensions for age, ext, ope and bsag. These charts depict experiments on a typical low data regime (Nta ≤ 1000).
[Figure 1: one panel per task, plotting score against k for varying Nta; 'Dimension reduced' curves show mean ± standard error and the 'All dimensions' baseline shows the mean.]

Figure 1: Comparison of performance for all regression tasks (age, ext, ope and bsag) over varying Nta and k. Results vary by task, but predominantly, performance at k = 64 is better than the performance without any reduction. The reduced features almost always perform better than, or as well as, the original embeddings.

Lower-dimensional representations performed comparably to the peak performance with just 1/3 of the features while covering the most tasks, and with just 1/12 of the features for the majority of tasks. Charts exploring other ranges of Nta values and the remaining tasks can be found in appendix D.1.

5.5 Least Number of Dimensions Required

Lastly, we devise an experiment motivated by answering the question of how many dimensions are necessary to achieve top results, given a limited sample size. Specifically, we define 'first k to peak' (fkp) as the least-valued k that produces an accuracy equivalent to the peak performance. A 95% confidence interval was computed for the best score (peak) for each task and each Nta based on bootstrapped resamples, and fkp was the least number of dimensions where this threshold was passed.

               demographics   personality   mental health
    Nta         (3 tasks)      (2 tasks)      (2 tasks)
     50             16             16             16
    100            128             16             22
    200            512             32             45
    500            768             64             64
   1000            768             90             64

Table 5: First k to peak (fkp) for each set of tasks: the least value of k that performed statistically equivalent (p > .05) to the best performing setup (peak). The integer shown is the exponential median over the set of tasks. This table summarizes comprehensive testing, and we suggest its results, fkp, can be used as a recommendation for the number of dimensions to use given a task domain and training set size.
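The fkp computation can be sketched as follows (ours, for illustration; the 95% interval here is a normal approximation over the bootstrapped scores, which is one reasonable reading of the description above):

    # Sketch: 'first k to peak' -- the smallest k whose mean score falls within the
    # 95% confidence interval of the best (peak) mean score for a given N_ta.
    import numpy as np

    def first_k_to_peak(scores_by_k):
        """scores_by_k: dict mapping k -> array of bootstrapped test scores."""
        ks = sorted(scores_by_k)
        means = {k: np.mean(scores_by_k[k]) for k in ks}
        peak_k = max(ks, key=lambda k: means[k])
        peak_scores = scores_by_k[peak_k]
        half_width = 1.96 * np.std(peak_scores, ddof=1) / np.sqrt(len(peak_scores))
        return next(k for k in ks if means[k] >= means[peak_k] - half_width)

Table 5 then reports, for each task set, the exponential median of these per-task fkp values.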
Our goal is that such results can provide a systematic guide for making such modeling decisions in future human-level NLP tasks, where such an experiment (which relies on resampling over larger amounts of training data) is typically not feasible. Table 5 shows the fkp over all of the training sample sizes (Nta). The exponential median (med) in the table is calculated as med = 2^(Median(log(x))).

The fkp results suggest that more available training samples yield the ability to leverage more dimensions, but the degree to which depends on the task. In fact, utilizing all the embedding dimensions was only effective for the demographic prediction tasks. The other two task sets benefited from reduction, often with only 1/12 to 1/6 of the original second-to-last transformer layer dimensions.

[Figure 2: two panels, 'Age Prediction' and 'Ext Prediction', plotting absolute error against log(Avg 1grams Per Msg) for NMF and PCA.]

Figure 2: Comparison of the absolute error of NMF and PCA with the average number of 1-grams per message. While both models appear to perform very similarly when texts are short or of average length, PCA is better at handling longer texts; the errors diverge as the length of the texts increases.
6 Error Analysis

Here, we seek to better understand why using pre-trained models worked better than fine-tuning, and the differences between using PCA and NMF components in the low sample setting (Nta = 500).

    Association    LIWC variables
    Positive       Informal, Netspeak, Negemo, Swear, Anger
    Negative       Affiliation, Social, We, They, Family, Function,
                   Drives, Prep, Focuspast, Quant

Table 6: Top LIWC variables having negative and positive correlations with the difference in the absolute error of the pre-trained model and the fine-tuned model for age prediction (Benjamini-Hochberg FDR p < .05). This suggests that the fine-tuned model has less error than the pre-trained model when the language is informal and consists of more affect words.

Pre-trained vs Fine-tuned. We looked at categories of language from LIWC (Tausczik and Pennebaker, 2010) correlated with the difference in the absolute error of the pre-trained and fine-tuned models in age prediction. Table 6 suggests that the pre-trained model is better at handling users whose language conforms to formal rules, while fine-tuning helps in learning better representations of affect words and captures informal language well. Furthermore, these LIWC variables are also known to be associated with age (Schwartz et al., 2013). Additional analysis comparing these two models is available in appendix E.1.

PCA vs NMF. Figure 2 suggests that PCA is better at handling longer text sequences than NMF (> 55 one-grams on average) when trained with less data. This choice would not make much difference for Tweet-like short texts, but the errors diverge rapidly for longer samples. We also see that PCA is better at capturing information from these longer texts that has higher predictive power in downstream tasks. This is discussed in appendix E.2, along with other interesting findings involving the comparison of PCA and the pre-trained model in E.3.

7 Discussion

Ethical Consideration. We used existing datasets that were either collected with participant consent (FB and CLPsych 2018) or public data with identifiers removed and collected in a non-intrusive manner (CLPsych 2019). All procedures were reviewed and approved by both our institutional review board as well as the IRB of the creators of the data set.

Our work can be seen as part of the growing body of interdisciplinary research intended to understand human attributes associated with language, aiming towards applications that can improve human life, such as producing better mental health assessments that could ultimately save lives. However, at this stage, our models are not intended to be used in practice for mental health care nor for labeling individuals publicly with mental health, personality, or demographic scores. Even when the point comes where such models are ready for testing in clinical settings, this should only be done with oversight from professionals in mental health care to establish the failure modes and their rates (e.g., false positives leading to incorrect treatment or false negatives leading to missed care; increased inaccuracies due to evolving language; disparities
in failure modes by demographics). Malicious use possibilities for which this work is not intended include targeting advertising to individuals using language-based psychology scores, which could present harmful content to those suffering from mental health conditions.

We intend that the results of our empirical study be used to inform fellow researchers in computational linguistics and psychology on how to better utilize contextual embeddings towards the goal of improving psychological and mental health assessments. Mental health conditions, such as depression, are widespread, and many suffering from such conditions are under-served, with only 13-49% receiving minimally adequate treatment (Kessler et al., 2003; Wang et al., 2005). Marginalized populations, such as those with low income or minorities, are especially under-served (Saraceno et al., 2007). Such populations are well represented in social media (Center, 2021), and with this technology developed largely over social media and predominantly using self-reported labels from users (i.e., rather than annotator-perceived labels that sometimes introduce bias (Sap et al., 2019; Flekova et al., 2016)), we do not expect that marginalized populations are more likely to hit failure modes. Still, tests for error disparities (Shah et al., 2020) should be carried out in conjunction with clinical researchers before this technology is deployed. We believe this technology offers the potential to broaden the coverage of mental health care to such populations where resources are currently limited.

Future assessments built on the learnings of this work, and in conjunction with clinical mental health researchers, could help the under-served by both better classifying one's condition as well as identifying an ideal treatment. Any applications to human subjects should consider the ethical implications, undergo human subjects review, and the predictions made by the model should not be shared with the individuals without consulting the experts.

Limitations. Each dataset brings its own unique selection biases across groups of people, which is one reason we tested across many datasets covering a variety of human demographics. Most notably, the FB dataset is skewed young and is geographically focused on residents within the United States. The CLPsych 2018 dataset is a representative sample of citizens of the United Kingdom, all born in the same week, and the CLPsych-2019 dataset was further limited primarily to those posting in a suicide-related forum (Zirikly et al., 2019). Further, tokenization techniques can also impact language model performance (Bostrom and Durrett, 2020). To avoid oversimplification of complex human attributes, in line with psychological research (Haslam et al., 2012), all outcomes were kept in their most dimensional form: e.g., personality scores were kept as real values rather than divided into bins, and the CLPsych-2019 risk levels were kept at 4 levels to yield gradation in assessments, as justified by Zirikly et al. (2019).

8 Conclusion

We provide the first empirical evaluation of the effectiveness of contextual embeddings as a function of dimensionality and sample size for human-level prediction tasks. Multiple human-level tasks, along with many of the most popular language model techniques, were systematically evaluated in conjunction with dimension reduction techniques to derive optimal setups for the low sample regimes characteristic of many human-level tasks.

We first show that fine-tuning transformer LMs in low-data scenarios yields worse performance than pre-trained models. We then show that reducing dimensions of contextual embeddings can improve performance, and while past work used non-negative matrix factorization (Matero et al., 2019), we note that PCA gives the most reliable improvement. Auto-encoder based transformer language models gave better performance, on average, than their auto-regressive contemporaries of comparable sizes. We find optimized versions of BERT, specifically RoBERTa, to yield the best results.

Finally, we find that many human-level tasks can be achieved with a fraction, often 1/6 or 1/12, of the total transformer hidden-state size without sacrificing significant accuracy. Generally, using fewer dimensions also reduces variance in model performance, in line with traditional bias-variance trade-offs, and thus increases the chance of generalizing to new populations. Further, it can aid in explainability, especially when considering that these dimension reduction models can be pre-trained and standardized, and thus compared across problem sets and studies.
References

Ashwin Karthik Ambalavanan, Pranjali Dileep Jagtap, Soumya Adhya, and Murthy Devarakonda. 2019. Using contextual representations for suicide risk assessment from Internet forums. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 172-176, Minneapolis, Minnesota. Association for Computational Linguistics.

Xingce Bao and Qianqian Qiao. 2019. Transfer learning from pre-trained BERT for pronoun resolution. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 82-88, Florence, Italy. Association for Computational Linguistics.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022.

Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617-4624, Online. Association for Computational Linguistics.

Pew Research Center. 2021. Social media fact sheet.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.

Glen Coppersmith, Mark Dredze, and Craig Harman. 2014. Quantifying mental health signals in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 51-60, Baltimore, Maryland, USA. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

J. Eichstaedt, R. Smith, R. Merchant, Lyle H. Ungar, Patrick Crutchley, Daniel Preotiuc-Pietro, D. Asch, and H. A. Schwartz. 2018. Facebook language predicts depression in medical records. Proceedings of the National Academy of Sciences of the United States of America, 115:11203-11208.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55-65, Hong Kong, China. Association for Computational Linguistics.

Cédric Févotte and Jérôme Idier. 2011. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9):2421-2456.

Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. 2016. Analyzing biases in human perception of user age and gender from text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 843-854, Berlin, Germany. Association for Computational Linguistics.

M Ghodsian. 1977. Children's behaviour and the BSAG: Some theoretical and statistical considerations. British Journal of Social and Clinical Psychology, 16(1):23-28.

Sharath Chandra Guntuku, David B Yaden, Margaret L Kern, Lyle H Ungar, and Johannes C Eichstaedt. 2017. Detecting depression and mental illness on social media: An integrative review. Current Opinion in Behavioral Sciences, 18:43-49.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342-8360, Online. Association for Computational Linguistics.

Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288.

Nick Haslam, Elise Holland, and Peter Kuppens. 2012. Categories versus dimensions in personality and psychopathology: A quantitative review of taxometric research. Psychological Medicine, 42(5):903.

Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591-598.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651-3657, Florence, Italy. Association for Computational Linguistics.
Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190, Online. Association for Computational Linguistics.

Margaret L Kern, Johannes C Eichstaedt, H Andrew Schwartz, Lukasz Dziurzynski, Lyle H Ungar, David J Stillwell, Michal Kosinski, Stephanie M Ramones, and Martin EP Seligman. 2014. The online social self: An open vocabulary approach to personality. Assessment, 21(2):158–169.

Margaret L Kern, Gregory Park, Johannes C Eichstaedt, H Andrew Schwartz, Maarten Sap, Laura K Smith, and Lyle H Ungar. 2016. Gaining insights from social media language: Methodologies and challenges. Psychological Methods, 21(4):507.

Ronald C Kessler, Patricia Berglund, Olga Demler, Robert Jin, Doreen Koretz, Kathleen R Merikangas, A John Rush, Ellen E Walters, and Philip S Wang. 2003. The epidemiology of major depressive disorder: Results from the National Comorbidity Survey Replication (NCS-R). JAMA, 289(23):3095–3105.

M. Kosinski, D. Stillwell, and T. Graepel. 2013. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110:5802–5805.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Xiang Lisa Li and Jason Eisner. 2019. Specializing word embeddings (for parsing) by information bottleneck. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2744–2754, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Veronica Lynn, Niranjan Balasubramanian, and H. Andrew Schwartz. 2020. Hierarchical modeling for user personality prediction: The role of message-level attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5306–5316, Online. Association for Computational Linguistics.

Veronica Lynn, Salvatore Giorgi, Niranjan Balasubramanian, and H Andrew Schwartz. 2019. Tweet classification without the tweet: An empirical examination of user versus document attributes. In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, pages 18–28.

Veronica Lynn, Alissa Goodman, Kate Niederhoffer, Kate Loveys, Philip Resnik, and H. Andrew Schwartz. 2018. CLPsych 2018 shared task: Predicting current and future psychological health from childhood essays. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 37–46, New Orleans, LA. Association for Computational Linguistics.

Matthew Matero, Akash Idnani, Youngseo Son, Salvatore Giorgi, Huy Vu, Mohammad Zamani, Parth Limbachiya, Sharath Chandra Guntuku, and H. Andrew Schwartz. 2019. Suicide risk assessment with multi-level dual-context language and BERT. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 39–44, Minneapolis, Minnesota. Association for Computational Linguistics.

Elijah Mayfield and Alan W Black. 2020. Should you fine-tune BERT for automated essay scoring? In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 151–162, Seattle, WA, USA → Online. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

David N. Milne, Glen Pink, Ben Hachey, and Rafael A. Calvo. 2016. CLPsych 2016 shared task: Triaging content in online peer-support forums. In Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, pages 118–127, San Diego, CA, USA. Association for Computational Linguistics.

Michelle Morales, Prajjalita Dey, Thomas Theisen, Danny Belitz, and Natalia Chernova. 2019. An investigation of deep learning systems for suicide risk assessment. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 177–181, Minneapolis, Minnesota. Association for Computational Linguistics.

Antonio A Morgan-Lopez, Annice E Kim, Robert F Chew, and Paul Ruddle. 2017. Predicting age groups of Twitter users based on language and metadata features. PLoS ONE, 12(8):e0183537.

Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884.
Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.

K. Murphy and C. Davidshofer. 1988. Psychological testing: Principles and applications.

Gregory Park, H. A. Schwartz, J. Eichstaedt, M. Kern, M. Kosinski, D. Stillwell, Lyle H. Ungar, and M. Seligman. 2015. Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6):934–52.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Chris Power and Jane Elliott. 2005. Cohort profile: 1958 British birth cohort (National Child Development Study). International Journal of Epidemiology, 35(1):34–41.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Vikas Raunak, Vivek Gupta, and Florian Metze. 2019. Effective dimensionality reduction for word embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 235–243, Florence, Italy. Association for Computational Linguistics.

Philip Resnik, Anderson Garron, and Rebecca Resnik. 2013. Using topic modeling to improve prediction of neuroticism and depression in college students. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1348–1353, Seattle, Washington, USA. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Maarten Sap, Greg Park, Johannes C Eichstaedt, Margaret L Kern, David J Stillwell, Michal Kosinski, Lyle H Ungar, and H Andrew Schwartz. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Benedetto Saraceno, Mark van Ommeren, Rajaie Batniji, Alex Cohen, Oye Gureje, John Mahoney, Devi Sridhar, and Chris Underhill. 2007. Barriers to improvement of mental health services in low-income and middle-income countries. The Lancet, 370(9593):1164–1174.

H Andrew Schwartz, Johannes C Eichstaedt, Margaret L Kern, Lukasz Dziurzynski, Stephanie M Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin EP Seligman, et al. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE, 8(9):e73791.

H. Andrew Schwartz, Salvatore Giorgi, Maarten Sap, Patrick Crutchley, Lyle Ungar, and Johannes Eichstaedt. 2017. DLATK: Differential language analysis ToolKit. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 55–60, Copenhagen, Denmark. Association for Computational Linguistics.

Deven Santosh Shah, H. Andrew Schwartz, and Dirk Hovy. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.

Y. Tausczik and J. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29:24–54.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Philip S Wang, Michael Lane, Mark Olfson, Harold A Pincus, Kenneth B Wells, and Ronald C Kessler. 2005. Twelve-month use of mental health services in the United States: Results from the National Comorbidity Survey Replication. Archives of General Psychiatry, 62(6):629–640.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Ayah Zirikly, Philip Resnik, Özlem Uzuner, and Kristy Hollingshead. 2019. CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 24–33, Minneapolis, Minnesota. Association for Computational Linguistics.
                                                           tion model that transforms the embeddings down to
Appendices

A   Experimental Setup

Implementation. All the experiments were implemented using Python, DLATK (Schwartz et al., 2017), HuggingFace Transformers (Wolf et al., 2019), and PyTorch (Paszke et al., 2019). The environments were instantiated with a seed value of 42, except for fine-tuning, which used 1337. Code to reproduce all results is available on our GitHub page: github.com/adithya8/ContextualEmbeddingDR/
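As a rough sketch (not the released code itself), instantiating an environment with the seeds described above might look like the following; the set_seed helper is our own naming:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42):
    """Fix the random seeds; 42 was used for most experiments, 1337 for fine-tuning."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


set_seed(42)
```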
Infrastructure. The deep learning models, such as the stacked-transformers and the NLAE, were run on a single GPU with the batch size given by:

    batch_size = floor( (GPU_memory − model_size) / ((floating_precision / 8) × δ) )

    δ = trainable_params                        for fine-tuning
    δ = layers × hidden_size × max_tokens       for embedding retrieval

where GPU memory and model size (the space occupied by the model) are in bytes, trainable_params is the number of trainable parameters during fine-tuning, layers is the number of layers of embeddings required, hidden_size is the number of dimensions in the hidden state, and max_tokens is the maximum number of tokens (after tokenization) in any batch. We carried out the experiments with 1 NVIDIA Titan Xp GPU, which has around 12 GB of memory. All the other methods were run on CPU.
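As an illustration of this heuristic (a sketch with our own naming, not the authors' code), the largest feasible batch size could be estimated as:

```python
def max_batch_size(gpu_mem_bytes: int, model_size_bytes: int, delta: float,
                   floating_precision: int = 32) -> int:
    """Estimate the largest batch that fits in GPU memory, per the formula above.

    delta is the number of trainable parameters when fine-tuning, or
    layers * hidden_size * max_tokens when only retrieving embeddings.
    """
    bytes_per_value = floating_precision / 8
    return int((gpu_mem_bytes - model_size_bytes) // (bytes_per_value * delta))


# Hypothetical example: retrieving 4 layers of 768-dimensional hidden states with at
# most 512 tokens per instance, for a ~500 MB model on a 12 GB Titan Xp.
print(max_batch_size(12 * 1024**3, 500 * 1024**2, delta=4 * 768 * 512))
```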
Figure A1: Depiction of the dimension reduction method. Transformer embeddings of the domain data (the Npt users' embeddings; see footnote 9) are used to pre-train a dimension reduction model that transforms the embeddings down to k dimensions. This learned reduction model is then applied to the task's train and test embeddings. The reduced train features (Nmax users) are bootstrap sampled to produce 10 sets of Nta users each for training task-specific models. All 10 task-specific models are evaluated on the reduced test features (Nte users) during task evaluation, and the mean and standard deviation of the task-specific metric are collected.

9 Generation of user embeddings is explained in detail under Methods.
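A minimal sketch of this pre-train-then-reduce pipeline, assuming scikit-learn's PCA as the reduction model and a ridge regressor standing in for the task-specific model (both choices are ours, for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error


def evaluate_with_reduction(domain_emb, train_emb, y_train, test_emb, y_test,
                            k=128, n_ta=100, n_samples=10, seed=42):
    # Pre-train the reduction model on the N_pt domain users' embeddings.
    reducer = PCA(n_components=k).fit(domain_emb)
    train_red = reducer.transform(train_emb)   # (N_max, k)
    test_red = reducer.transform(test_emb)     # (N_te, k)

    # Draw 10 samples of N_ta users, fit a task model on each, evaluate on the test users.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_samples):
        idx = rng.choice(len(train_red), size=n_ta, replace=False)
        task_model = Ridge().fit(train_red[idx], y_train[idx])
        scores.append(mean_absolute_error(y_test, task_model.predict(test_red)))
    return np.mean(scores), np.std(scores)
```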
B   Model Details

NLAE architecture. The model architecture for the non-linear auto-encoders (NLAE) in Table 4 was a twin network taking inputs of 768 dimensions, reducing them to 128 dimensions through 2 layers, and reconstructing the original 768-dimensional representation with 2 further layers. This architecture was chosen to balance the constraints of enabling non-linear associations while keeping the total number of parameters low, given the low-sample-size setting. The formal definition of the model is:

    x_comp   = f(W2ᵀ f(W1ᵀ x + b1) + b2)
    x_decomp = W′1ᵀ f(W′2ᵀ x_comp + b′2) + b′1

    x, x_decomp ∈ R^768,  x_comp ∈ R^128
    W1 ∈ R^(768×448),  W2 ∈ R^(448×128),  W′1 ∈ R^(448×768),  W′2 ∈ R^(128×448)
    b1 ∈ R^448,  b2 ∈ R^128,  b′2 ∈ R^448,  b′1 ∈ R^768
    f(a) = max(a, 0)  for all a ∈ R

where the primed weights and biases belong to the decoder.
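In PyTorch, a module consistent with these dimensions could be sketched as follows (our illustration, not necessarily the authors' exact implementation):

```python
import torch.nn as nn


class NLAE(nn.Module):
    """Twin auto-encoder: a 768 -> 448 -> 128 encoder mirrored by a 128 -> 448 -> 768 decoder."""

    def __init__(self, in_dim: int = 768, mid_dim: int = 448, k: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, mid_dim), nn.ReLU(),   # f(W1^T x + b1)
            nn.Linear(mid_dim, k), nn.ReLU(),        # x_comp
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, mid_dim), nn.ReLU(),        # f(W'2^T x_comp + b'2)
            nn.Linear(mid_dim, in_dim),              # x_decomp
        )

    def forward(self, x):
        x_comp = self.encoder(x)
        return self.decoder(x_comp), x_comp
```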
NLAE Training. The data for the domain pre-training of the dimension reduction was split into two sets for the NLAE alone: a training set and a validation set. 90% of the domain data was randomly sampled for training the NLAE, and the remaining 10% of the pre-training data was used to validate hyper-parameters after every epoch. The model was trained with the objective of minimising the mean squared reconstruction loss, and training continued until the validation loss increased over 3 consecutive epochs. AdamW was the optimizer, with the learning rate set to 0.001. This took around 30–40 epochs, depending upon the dataset.
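A training loop matching this description might look like the following sketch (reusing the NLAE class above; the batch size and epoch cap are illustrative assumptions):

```python
import copy

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset, random_split


def train_nlae(model, domain_emb, lr=1e-3, max_epochs=100, patience=3, batch_size=256):
    """Minimise reconstruction MSE; stop once validation loss rises for 3 consecutive epochs."""
    data = TensorDataset(torch.as_tensor(domain_emb, dtype=torch.float32))
    n_val = max(len(data) // 10, 1)                      # 90/10 train/validation split
    train_set, val_set = random_split(data, [len(data) - n_val, n_val])
    train_dl = DataLoader(train_set, batch_size, shuffle=True)
    val_dl = DataLoader(val_set, batch_size)

    opt, loss_fn = optim.AdamW(model.parameters(), lr=lr), nn.MSELoss()
    best_state, bad_epochs, prev_val = copy.deepcopy(model.state_dict()), 0, float("inf")
    for _ in range(max_epochs):
        model.train()
        for (x,) in train_dl:
            opt.zero_grad()
            recon, _ = model(x)
            loss_fn(recon, x).backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x)[0], x).item() for (x,) in val_dl) / max(len(val_dl), 1)
        if val >= prev_val:
            bad_epochs += 1
            if bad_epochs >= patience:                   # rose for 3 consecutive epochs
                break
        else:
            bad_epochs, best_state = 0, copy.deepcopy(model.state_dict())
        prev_val = val

    model.load_state_dict(best_state)
    return model
```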
Fine-tuning. In our fine-tuning configuration, we freeze all but the top 2 layers of the best LM to prevent over-fitting and vanishing gradients at the lower layers (Sun et al., 2019; Mosbach et al., 2020). We also apply early stopping, with the patience varied between 3 and 6 depending upon the task. Other hyperparameters for this experiment include L2-regularization (in the form of weight decay on the AdamW optimizer, set to 1), dropout set to 0.3, batch size set to 10, and the learning rate initialized to 5e-5. The number of epochs was set to a maximum of 15, which early stopping limited to between 5 and 10 depending on the task and the early stopping patience.
which was limited by early stopping between 5-10          longer sequences upon training it on the tasks’ data.
depending on the task and early stopping patience. This is evident from the graphs in figure A4.
   We arrived at these hyperparameter values after
an extensive search. The weight decay param was           E.2 PCA vs NMF.
searched in [100, 0.01], dropout within [0.1, 0.5], From figure A5, we can see that LIWC variables
and learning rate between [5e-4, 5e-5].                   like ARTICLE, INSIGHT, PERCEPT (perceptual
                                                       4528
C   Data

Due to human subjects privacy constraints, most of the data cannot be publicly distributed, but it is available from the original data owners via requests for research purposes (e.g., the CLPsych-2018 and CLPsych-2019 shared tasks).

D   Additional Results

D.1   Results on higher Nta

Figure A2 shows that reduction still helps in the majority of tasks at higher Nta. As expected, performance starts to plateau at higher Nta values, and this is visibly consistent across most tasks. With the exception of age and gender prediction using the Facebook data, all the other tasks benefit from reduction.

D.2   Results on classification tasks

Figure A3 compares the performance of the reduced dimensions in the low-sample-size scenario (Nta ≤ 1000) on classification tasks. Except for a few Nta values in gender prediction using the Facebook data, all the other tasks benefit from reduction in achieving the best performance.

D.3   LM comparison for no reduction & smaller models

Table A1 compares the performance of the language models without applying any dimension reduction to the embeddings, and Table A2 compares the best transformer models against smaller models (and the distilled version) after reducing the second-to-last layer representation to 128 dimensions.

D.4   Least dimensions required: Higher Nta

The 'fkp' plateaus as the number of training samples grows, as seen in Table A3.

E   Additional Analysis

E.1   Pre-trained vs Fine-Tuned models

We also find that the fine-tuned model does not perform better than the pre-trained model for users with typical message lengths, but it is better at handling longer sequences once trained on the task's data. This is evident from the graphs in Figure A4.

E.2   PCA vs NMF

From Figure A5, we can see that LIWC variables like ARTICLE, INSIGHT, PERCEPT (perceptual