Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality
Adithya V Ganesan, Matthew Matero, Aravind Reddy Ravula, Huy Vu, and H. Andrew Schwartz
Department of Computer Science, Stony Brook University
{avirinchipur, mmatero, aravula, hvu, has}@cs.stonybrook.edu

Abstract

In human-level NLP tasks, such as predicting mental health, personality, or demographics, the number of observations is often smaller than the standard 768+ hidden state sizes of each layer within modern transformer-based language models, limiting the ability to effectively leverage transformers. Here, we provide a systematic study on the role of dimension reduction methods (principal components analysis, factorization techniques, or multi-layer auto-encoders) as well as the dimensionality of embedding vectors and sample sizes as a function of predictive performance. We first find that fine-tuning large models with a limited amount of data poses a significant difficulty which can be overcome with a pre-trained dimension reduction regime. RoBERTa consistently achieves top performance in human-level tasks, with PCA giving benefit over other reduction methods in better handling users that write longer texts. Finally, we observe that a majority of the tasks achieve results comparable to the best performance with just 1/12 of the embedding dimensions.

1 Introduction

Transformer based language models (LMs) have quickly become the foundation for accurately approaching many tasks in natural language processing (Vaswani et al., 2017; Devlin et al., 2019). Owing to their success is their ability to capture both syntactic and semantic information (Tenney et al., 2019), modeled over large, deep attention-based networks (transformers) with hidden state sizes on the order of 1000 over 10s of layers (Liu et al., 2019; Gururangan et al., 2020). In total, such models typically have from hundreds of millions (Devlin et al., 2019) to a few billion parameters (Raffel et al., 2020). However, the size of such models presents a challenge for tasks involving small numbers of observations, such as the growing number of tasks focused on human-level NLP.

Human-level NLP tasks, rooted in computational social science, focus on making predictions about people from their language use patterns. Some of the more common tasks include age and gender prediction (Sap et al., 2014; Morgan-Lopez et al., 2017), personality (Park et al., 2015; Lynn et al., 2020), and mental health prediction (Coppersmith et al., 2014; Guntuku et al., 2017; Lynn et al., 2018). Such tasks present an interesting challenge for the NLP community to model the people behind the language rather than the language itself, and the social scientific community has begun to see success of such approaches as an alternative or supplement to standard psychological assessment techniques like questionnaires (Kern et al., 2016; Eichstaedt et al., 2018). Generally, such work is helping to embed NLP in a greater social and human context (Hovy and Spruit, 2016; Lynn et al., 2019).

Despite the simultaneous growth of both (1) the use of transformers and (2) human-level NLP, the effective merging of transformers with human-level tasks has received little attention. In a recent human-level shared task on mental health, most participants did not utilize transformers (Zirikly et al., 2019). A central challenge for their utilization in such scenarios is that the number of training examples (i.e. sample size) is often only hundreds, while the parameters for such deep models are in the hundreds of millions. For example, recent human-level NLP shared tasks focused on mental health have had N = 947 (Milne et al., 2016), N = 9,146 (Lynn et al., 2018) and N = 993 (Zirikly et al., 2019) training examples. Such sizes all but rule out the increasingly popular approach of fine-tuning transformers, whereby all of their millions of parameters are allowed to be updated toward the specific task one is trying to achieve (Devlin et al., 2019; Mayfield and Black, 2020). Recent research not only highlights the difficulty of fine-tuning with few samples (Jiang et al., 2020), but also that it becomes unreliable even with thousands of training examples (Mosbach et al., 2020).
On the other hand, the common transformer-based approach of deriving contextual embeddings from the top layers of a pre-trained model (Devlin et al., 2019; Clark et al., 2019) still leaves one with approximately as many embedding dimensions as training examples. In fact, in one of the few successful cases of using transformers for a human-level task, further dimensionality reduction was used to avoid overfit (Matero et al., 2019), but an empirical understanding of the application of transformers for human-level tasks — which models are best and the relationship between embedding dimensions, sample size, and accuracy — has yet to be established.

In this work, we empirically explore strategies to effectively utilize transformer-based LMs for relatively small sample-size human-level tasks. We provide the first systematic comparison of the most widely used transformer models for demographic, personality, and mental health prediction tasks. Then, we consider the role of dimension reduction to address the challenge of applying such models on small sample sizes, yielding a suggested minimum number of dimensions necessary given a sample size for each of demographic, personality, and mental health tasks¹. While it is suspected that transformer LMs contain more dimensions than necessary for document- or word-level NLP (Li and Eisner, 2019; Bao and Qiao, 2019), this represents the first study on transformer dimensionality for human-level tasks.

¹ Dimension reduction techniques can also be pre-trained, leveraging larger sets of unlabeled data.

2 Related Work

Recently, NLP has taken to human-level predictive tasks using increasingly sophisticated techniques. The most common approaches use n-grams and LDA (Blei et al., 2003) to model a person's language and behaviors (Resnik et al., 2013; Kern et al., 2016). Other approaches utilize word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and, more recently, contextual word representations (Ambalavanan et al., 2019).

Our work is inspired by one of the top performing systems at a recent mental health prediction shared task (Zirikly et al., 2019), which utilized transformer-based contextualized word embeddings fed through a non-negative matrix factorization to reduce dimensionality (Matero et al., 2019). While the approach seems reasonable for addressing the dimensionality challenge in using transformers, many critical questions remain unanswered: (a) Which type of transformer model is best? (b) Would fine-tuning have worked instead? and (c) Does such an approach generalize to other human-level tasks? Most of the time, one does not have the luxury of a shared task for the problem at hand to determine the best approach. Here, we look across many human-level tasks, some of which offer the luxury of relatively large sample sizes (in the thousands) from which to establish upper bounds, and ultimately draw generalizable information on how to approach a human-level task given its domain (demographic, personality, mental health) and sample size.

Our work also falls in line with a rising trend in AI and NLP to quantify the number of dimensions necessary. While this has not been considered for human-level tasks, it has been explored in other domains. The post-processing algorithm for static word embeddings (Mu and Viswanath, 2018), motivated by the power-law distribution of maximum explained variance and the domination of the mean vector, turned out to be very effective in making these embeddings more discriminative. An analysis of contextual embedding models (Ethayarajh, 2019) suggests that static embeddings contribute less than 5% of the explained variance, while the contribution of the mean vector starts dominating when contextual embedding models are used for human-level tasks. This is an effect of averaging the message embeddings to form user representations in human-level tasks, and it further motivates the need to process these contextual embeddings into more discriminative features.

Lastly, our work weighs into the discussion on which type of model is best at producing effective contextual embeddings. A majority of the models fall under two broad categories based on how they are pre-trained: auto-encoders (AE) and auto-regressive (AR) models. We compare AE and AR style LMs by comparing the performance of two widely used models from each category with a comparable number of parameters. From the experiments involving BERT, RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019) and GPT-2 (Radford et al., 2019), we find that AE based models perform better than AR style models (with comparable model sizes), and RoBERTa is the best choice amongst these four widely used models.
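The post-processing algorithm (PPA) of Mu and Viswanath (2018) referenced above also reappears later as the PCA-PPA variant in the reduction comparison (Table 4). The following is a minimal Python/NumPy sketch of that algorithm for reference; it is an illustration rather than the original implementation, the function and variable names are ours, and the choice of removing floor(dimensions/100) components follows the footnote in Section 4.2.

```python
import numpy as np
from sklearn.decomposition import PCA

def all_but_the_top(X, d_remove):
    """Post-processing of Mu and Viswanath (2018): subtract the mean vector,
    then remove the projections onto the top `d_remove` principal components."""
    X_centered = X - X.mean(axis=0, keepdims=True)
    pca = PCA(n_components=d_remove).fit(X_centered)
    U = pca.components_                       # (d_remove, hidden_dim)
    return X_centered - X_centered @ U.T @ U  # remove dominant directions

# e.g., for 768-dimensional embeddings, D = 768 // 100 = 7 components removed
# X_ppa = all_but_the_top(user_embeddings, d_remove=768 // 100)
```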
3 Data & Tasks

We evaluate approaches over 7 human-level tasks spanning demographics, mental health, and personality prediction. The 3 datasets used for these tasks are described below.

FB-Demogs. (age, gen, ope, ext) One of our goals was to leverage one of the largest human-level datasets in order to evaluate over subsamples of varying sizes. For this, we used the Facebook demographic and personality dataset of Kosinski et al. (2013). The data was collected from approximately 71k consenting participants who shared Facebook posts along with demographic and personality scores from Jan-2009 through Oct-2011. The users in this sample had written at least 1,000 words and had selected English as their primary language. Age (age) was self-reported and limited to those 65 years or younger (data beyond this age becomes very sparse), as in (Sap et al., 2014). Gender (gen) was only provided as a limited single binary, male-female classification.

Personality was derived from the Big 5 personality traits questionnaires, including both extraversion (ext - one's tendency to be energized by social interaction) and openness (ope - one's tendency to be open to new ideas) (Schwartz et al., 2013). Disattenuated Pearson correlation² (rdis) was used to measure the performance of these two personality prediction tasks.

CLPsych-2018. (bsag, gen2) The CLPsych 2018 shared task (Lynn et al., 2018) consisted of sub-tasks aimed at early prediction of mental health scores (depression, anxiety and BSAG³ score) based on language. The data for this shared task (Power and Elliott, 2005) comprised English essays written by 11 year old students along with their gender (gen2) and income classes. There were 9,217 students' essays for training and 1,000 for testing. The average word count in an essay was less than 200. Each essay was annotated with the student's psychological health measure, BSAG (when 11 years old), and distress scores at ages 23, 33, 42 and 50. This task used a disattenuated Pearson correlation as the metric (rdis).

CLPsych-2019. (sui) This 2019 shared task (Zirikly et al., 2019) comprised 3 sub-tasks for predicting the suicide risk level of Reddit users. This included a history of user posts on r/SuicideWatch (SW), a subreddit dedicated to those wanting to seek outside help for processing their current state of emotions. Their posts on other subreddits (NonSuicideWatch) were also collected. The users were annotated with one of 4 risk levels: none, low, moderate and severe risk, based on their history of posts. In total this task spans 496 users in training and 125 in testing. We focused on Task A, predicting the suicide risk of a user by evaluating their (English) posts across SW, measured via macro-F1.

Table 1: Summary of the datasets. Npt is the number of users available for pre-training the dimension reduction model; Nmax is the maximum number of users available for task training. For CLPsych 2018 and CLPsych 2019, this is the same sample as the pre-training data; for Facebook, a disjoint set of 10k users was available for task training. Nte is the number of test users, always a disjoint set of users from the pre-training and task training samples.

          FB-Demogs       CLPsych 2018     CLPsych 2019
          (Sap et al.)    (Lynn et al.)    (Zirikly et al.)
  Npt     56,764          9,217            496
  Nmax    10,000          9,217            496
  Nte     5,000           1,000            125

² Disattenuated Pearson correlation helps account for the error of the measurement instrument (Kosinski et al., 2013; Murphy and Davidshofer, 1988). Following (Lynn et al., 2020), we use reliabilities rxx = 0.70 and ryy = 0.77.
³ Bristol Social Adjustment Guide (Ghodsian, 1977) scores contain twelve sub-scales that measure different aspects of childhood behavior.

4 Methods

Here we discuss how we utilized representations from transformers, our approaches to dimensionality reduction, and our technique for robust evaluation using bootstrapped sampling.

4.1 Transformer Representations

The second to last layer representation of all the messages was averaged to produce a 768 dimensional feature for each user⁴. These user representations are reduced to lower dimensions as described in the following paragraphs. The message representation from a layer was attained by averaging the token embeddings of that layer. To consider a variety of transformer LM architectures, we explored two popular auto-encoder (BERT and RoBERTa) and two auto-regressive (XLNet and GPT-2) transformer-based models.

⁴ The second to last layer was chosen owing to its consistent performance in capturing semantic and syntactic structures (Jawahar et al., 2019).
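A minimal sketch of how such user representations can be derived with the HuggingFace Transformers library is shown below. The helper names, the truncation length, and single-message batching are illustrative choices, not part of the original implementation (which used DLATK; see Appendix A).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

def message_embedding(text):
    """Average the second-to-last layer's token states for one message."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc)
    # hidden_states = (embedding layer, layer 1, ..., layer 12); [-2] = layer 11
    token_states = out.hidden_states[-2].squeeze(0)   # (num_tokens, 768)
    return token_states.mean(dim=0)                   # (768,)

def user_embedding(messages):
    """Average the message embeddings to get one 768-d vector per user."""
    return torch.stack([message_embedding(m) for m in messages]).mean(dim=0)
```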
For fine-tuning evaluations, we used the transformer based model that performs best across the majority of our task suite. Transformers are typically trained on single messages, or pairs of messages, at a time. Since we are tuning towards a human-level task, we label each user's messages with their human-level attribute and treat it as a standard document-level task (Morales et al., 2019). Since we are interested in relative differences in performance, we limit each user to at most 20 messages (approximately the median number of messages), randomly sampled, to save compute time for the fine-tuning experiments.

4.2 Dimension Reduction

We explore singular value decomposition-based methods such as principal components analysis (PCA) (Halko et al., 2011), non-negative matrix factorization (NMF) (Févotte and Idier, 2011) and factor analysis (FA), as well as a deep learning approach: multi-layer non-linear auto-encoders (NLAE) (Hinton and Salakhutdinov, 2006). We also considered the post-processing algorithm (PPA) of word embeddings⁵ (Mu and Viswanath, 2018) that has shown effectiveness with PCA at the word level (Raunak et al., 2019). Importantly, besides transformer LMs being pre-trained, so too can dimension reduction. Therefore, we distinguish: (1) learning the transformation from higher dimensions to lower dimensions (preferably on a large data sample from the same domain) and (2) applying the learned transformation (on the task's train/test set). For the first step, we used a separate set of 56k unlabeled users in the case of FB-demog⁶. For CLPsych-2018 and -2019 (where separate data from the exact domains was not readily available), we used the task training data to train the dimension reduction. Since variance explained in factor analysis typically follows a power law, these methods transformed the 768 original embedding dimensions down to k, in powers of 2: 16, 32, 64, 128, 256 or 512.

⁵ The 'D' value was set to floor((number of dimensions)/100).
⁶ These pre-trained dimension reduction models are made available.

4.3 Bootstrapped Sampling & Training

We systematically evaluate the role of training sample size (Nta) versus embedding dimensions (k) for human-level prediction tasks. The approach is described in Algorithm 1. Varying Nta, the task-specific train data (after dimension reduction) is sampled randomly (with replacement) to get ten training samples with Nta users each. Small Nta values simulate a low-data regime and were used to understand its relationship with the least number of dimensions required to perform the best (Nta vs k). Bootstrapped sampling was done to arrive at a conservative estimate of performance. Each of the bootstrapped samples was used to train either an L2 penalized (ridge) regression model or a logistic regression model, for the regression and classification tasks respectively. The performance on the test set using models from each bootstrapped training sample was recorded in order to derive a mean and standard error for each Nta and k for each task.

Algorithm 1: Dimension Reduction and Evaluation
Notation: hD: hidden size; f(·): function to train dimension reduction; θ: linear model; g(·,·): logistic loss for classification and L2 loss for regression; η: learning rate; T: number of iterations (100).
Data: Dpt ∈ R^(Npt×hD): pre-training embeddings; Dmax ∈ R^(Nmax×hD): task training embeddings; Dte ∈ R^(Nte×hD): test embeddings; Ymax: outcomes for the train set; Yte: outcomes for the test set.
 1: W ← f(Dpt)
 2: D̄max ← Dmax W
 3: D̄te ← Dte W
 4: for i = 1, ..., 10 do
 5:   θi^(0) ← 0
 6:   Sample (D̄ta, Yta) from (D̄max, Ymax)
 7:   for j = 1, ..., T do
 8:     θi^(j) ← θi^(j−1) − η ∇g(D̄ta, Yta)
 9:   end for
10:   Ŷte_i ← D̄te θi^(T)
11: end for
12: Evaluate(Ŷte, Yte)

To summarize results over the many tasks and possible k and Nta values in a useful fashion, we propose a 'first k to peak' (fkp) metric. For each Nta, this is the first observed k value for which the mean score is within the 95% confidence interval of the peak performance. This quantifies the minimum number of dimensions required for peak performance.
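The following is a compact sketch of Algorithm 1 using scikit-learn, with PCA standing in for the generic reduction function f(·) and scikit-learn's closed-form ridge / logistic solvers in place of the explicit gradient loop in lines 7–9. The function names, default arguments, and metric bookkeeping are ours; treat this as an illustration of the evaluation protocol rather than the released code.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import f1_score

def disattenuated_r(r, rxx=0.70, ryy=0.77):
    # Correction for attenuation using the reliabilities from footnote 2.
    return r / np.sqrt(rxx * ryy)

def bootstrap_eval(D_pt, D_max, y_max, D_te, y_te,
                   k=128, n_ta=100, n_boot=10,
                   classification=False, seed=42):
    rng = np.random.default_rng(seed)
    reducer = PCA(n_components=k).fit(D_pt)      # line 1: W <- f(D_pt)
    Z_max = reducer.transform(D_max)             # line 2
    Z_te = reducer.transform(D_te)               # line 3
    scores = []
    for _ in range(n_boot):                      # lines 4-11: ten bootstraps
        idx = rng.choice(len(Z_max), size=n_ta, replace=True)
        if classification:
            clf = LogisticRegression(max_iter=1000).fit(Z_max[idx], y_max[idx])
            scores.append(f1_score(y_te, clf.predict(Z_te), average="macro"))
        else:
            reg = Ridge().fit(Z_max[idx], y_max[idx])
            r, _ = pearsonr(reg.predict(Z_te), y_te)
            scores.append(disattenuated_r(r))
    # line 12: mean and standard error over the bootstrapped models
    return float(np.mean(scores)), float(np.std(scores) / np.sqrt(n_boot))
```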
Table 2: Comparison of the most commonly used auto-encoder (AE) and auto-regressive (AR) language models after reducing the 768 dimensions to 128 using NMF and training on 100 and 500 samples (Nta) for each task. Nta is the number of samples used for training each task. Classification tasks (gen, gen2 and sui) were scored using macro-F1 (F1); the remaining regression tasks were scored using Pearson r (r) / disattenuated Pearson r (rdis). AE models predominantly perform the best. RoBERTa and BERT show consistent performance, with the former performing best in most tasks. The LMs in the table are base models (approx. 110M parameters).

                        demographics               personality          mental health
  Nta  type  LM         age(r)  gen(F1)  gen2(F1)  ext(rdis)  ope(rdis) bsag(rdis)  sui(F1)
  100  AE    BERT       0.533   0.703    0.761     0.163      0.184     0.424       0.360
       AE    RoBERTa    0.589   0.712    0.761     0.123      0.203     0.455       0.363
       AR    XLNet      0.582   0.582    0.744     0.130      0.203     0.441       0.315
       AR    GPT-2      0.517   0.584    0.624     0.082      0.157     0.397       0.349
  500  AE    BERT       0.686   0.810    0.837     0.278      0.354     0.484       0.466
       AE    RoBERTa    0.700   0.802    0.852     0.283      0.361     0.490       0.432
       AR    XLNet      0.697   0.796    0.821     0.261      0.336     0.508       0.439
       AR    GPT-2      0.646   0.756    0.762     0.211      0.280     0.481       0.397

5 Results

5.1 Best LM for Human-Level Tasks

We start by comparing transformer LMs, replicating the setup of one of the state-of-the-art systems for the CLPsych-2019 task, in which embeddings were reduced from BERT-base to approximately 100 dimensions using NMF (Matero et al., 2019). Specifically, we used 128 dimensions (to stick with the powers of 2 that we use throughout this work) as we explore the other LMs over multiple tasks (we will explore other dimensions next), and otherwise use the bootstrapped evaluation described in the methods.

Table 2 shows the comparison of the four transformer LMs when varying the sample size (Nta) between two low data regimes: 100 and 500⁷. RoBERTa and BERT were the best performing models in almost all the tasks, suggesting auto-encoder based LMs are better than auto-regressive models for these human-level tasks. Further, RoBERTa performed better than BERT in the majority of cases. Since the number of model parameters is comparable, this may be attributable to RoBERTa's increased pre-training corpus, which is inclusive of more human discourse and larger vocabularies in comparison to BERT.

⁷ The performance of all transformer embeddings without any dimension reduction, along with smaller sized models, can be found in appendix section D.3.

5.2 Fine-Tuning Best LM

We next evaluate fine-tuning in these low data situations⁸. Utilizing RoBERTa, the best performing transformer from the previous experiments, we perform fine-tuning across the age and gender tasks. Following (Sun et al., 2019; Mosbach et al., 2020), we freeze layers 0-9 and fine-tune layers 10 and 11. Even these top 2 layers of RoBERTa alone still result in a model that is updating tens of millions of parameters while being tuned to a dataset of hundreds of users and at most 10,000 messages.

⁸ As we are focused on readily available models, we consider substantial changes to the architecture or training as outside the scope of this systematic evaluation of existing techniques.

Table 3: Comparison of task-specific fine-tuning of RoBERTa (top 2 layers) and pre-trained RoBERTa embeddings (second to last layer) for the age and gender prediction tasks. Results are averaged across 5 trials, randomly sampling users equal to Nta from the Facebook data and reducing messages to a maximum of 20 per user.

  Nta   Method        Age    Gen
  100   Fine-tuned    0.54   0.54
        Pre-trained   0.56   0.63
  500   Fine-tuned    0.64   0.60
        Pre-trained   0.66   0.74

In Table 3, results for age and gender are shown for both sample sizes of 100 and 500. For age, the average prediction across all of a user's messages was used as the user's prediction, and for gender the mode was used. Overall, we find that fine-tuning offers lower performance with increased overhead for both train time and modeling complexity (hyperparameter tuning, layer selection, etc.).
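A minimal sketch of the frozen-layer setup described above is shown below, using HuggingFace Transformers. The parameter-name matching and the single-output head (num_labels=1 for a regression-style age head) are illustrative assumptions; the training loop, the hyperparameters reported in Appendix B (dropout 0.3, batch size 10, learning rate 5e-5, early stopping), and the message-to-user aggregation are omitted.

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical regression-style head for age prediction.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1)

# Freeze everything except encoder layers 10 and 11 and the task head.
TRAINABLE_TAGS = ("encoder.layer.10.", "encoder.layer.11.", "classifier")
for name, param in model.named_parameters():
    param.requires_grad = any(tag in name for tag in TRAINABLE_TAGS)

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_trainable:,}")  # still in the tens of millions
```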
We did robustness checks over hyper-parameters to offer more confidence that this result was not simply due to the fastidious nature of fine-tuning. The process is described in Appendix B, including an extensive exploration of hyper-parameters, which never resulted in improvements over the pre-trained setup. We are left to conclude that fine-tuning over such small user samples, at least with current typical techniques, is not able to produce results on par with using transformers to produce pre-trained embeddings.

Table 4: Comparison of different dimension reduction techniques for RoBERTa embeddings (penultimate layer) reduced down to 128 dimensions, with Nta = 100 and 500. The number of user samples for pre-training the dimension reduction model, Npt, was 56k except for gen2 and bsag (which had 9k users) and sui (which had 496 users). PCA performs the best overall and NLAE performs as well as PCA consistently. With a uniform pre-training size (Npt = 500), PCA performs better than NLAE.

                          demographics               personality          mental health
  Npt   Nta  Reduction    age(r)  gen(F1)  gen2(F1)  ext(rdis)  ope(rdis) bsag(rdis)  sui(F1)
  56k*  100  PCA          0.650   0.747    0.777     0.189      0.248     0.466       0.392
             PCA-PPA      0.517   0.715    0.729     0.173      0.176     0.183       0.358
             FA           0.534   0.722    0.729     0.171      0.183     0.210       0.360
             NMF          0.589   0.712    0.761     0.123      0.203     0.455       0.363
             NLAE         0.654   0.744    0.782     0.188      0.263     0.447       0.367
        500  PCA          0.729   0.821    0.856     0.346      0.384     0.514       0.416
             PCA-PPA      0.707   0.814    0.849     0.317      0.349     0.337       0.415
             FA           0.713   0.819    0.849     0.322      0.361     0.400       0.415
             NMF          0.700   0.802    0.852     0.283      0.361     0.490       0.432
             NLAE         0.725   0.820    0.843     0.340      0.394     0.485       0.409
  500   100  PCA          0.644   0.749    0.788     0.186      0.248     0.412       0.392
             NLAE         0.634   0.743    0.744     0.159      0.230     0.433       0.367
        500  PCA          0.726   0.819    0.850     0.344      0.382     0.509       0.416
             NLAE         0.715   0.798    0.811     0.312      0.360     0.490       0.409

5.3 Best Reduction Technique for Human-Level Tasks

We evaluated the reduction techniques in the low data regime by comparing their performance on the downstream tasks across 100 and 500 training samples (Nta). As described in the methods, techniques including PCA, NMF and FA, along with NLAE, were applied to reduce the 768 dimensional RoBERTa embeddings to 128 features. The results in Table 4 show that PCA and NLAE perform most consistently, with PCA having the best scores in the majority of tasks. NLAE's performance appears dependent on the amount of data available during pre-training. This is evident from the results in Table 4 where Npt was set to a uniform value and tested for all the tasks with Nta set to 100 and 500. Thus, PCA appears more reliable, showing more generalization for low samples.

5.4 Performance by Sample Size and Dimensions

Now that we have found that (1) RoBERTa generally performed best, (2) pre-training worked better than fine-tuning, and (3) PCA was most consistently best for dimension reduction (often doing better than the full dimensions), we can systematically evaluate model performance as a function of training sample size (Nta) and number of dimensions (k) over tasks spanning demographics, personality, and mental health. We exponentially increase k from 16 to 512, recognizing that variance explained decreases exponentially with dimension (Mu and Viswanath, 2018). The performance is also compared with that of using the RoBERTa embeddings without any reduction.
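In code, this evaluation amounts to a grid sweep over (Nta, k). The sketch below reuses the hypothetical bootstrap_eval() helper from the Section 4.3 sketch and assumes the same embedding matrices and outcome vectors; it is only meant to show the shape of the experiment. The no-reduction baseline would pass the raw 768-dimensional embeddings through unchanged.

```python
# D_pt, D_max, y_max, D_te, y_te: embeddings and outcomes as in Algorithm 1.
K_VALUES = [16, 32, 64, 128, 256, 512]
NTA_VALUES = [50, 100, 200, 500, 1000]

results = {}
for n_ta in NTA_VALUES:
    for k in K_VALUES:
        # (mean score, standard error) for each (Nta, k) cell
        results[(n_ta, k)] = bootstrap_eval(D_pt, D_max, y_max, D_te, y_te,
                                            k=k, n_ta=n_ta)
```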
Figure 1: Comparison of performance for all regression tasks (age, ext, ope and bsag) over varying Nta and k. Curves show dimension-reduced features (mean ± standard error) against all 768 dimensions (mean). Results vary by task, but predominantly, performance at k = 64 is better than the performance without any reduction. The reduced features almost always perform better than or as well as the original embeddings.

Lower dimensional representations performed comparably to the peak performance with just 1/3 of the features while covering the most tasks, and with just 1/12 of the features for the majority of tasks. Charts exploring other ranges of Nta values and the remaining tasks can be found in appendix D.1.

5.5 Least Number of Dimensions Required

Lastly, we devise an experiment motivated by answering the question of how many dimensions are necessary to achieve top results given a limited sample size. Specifically, we define 'first k to peak' (fkp) as the least valued k that produces an accuracy equivalent to the peak performance. A 95% confidence interval was computed for the best score (peak) for each task and each Nta based on bootstrapped resamples, and fkp was the least number of dimensions where this threshold was passed.

Table 5: First k to peak (fkp) for each set of tasks: the least value of k that performed statistically equivalent (p > .05) to the best performing setup (peak). The integer shown is the exponential median of the set of tasks. This table summarizes comprehensive testing and we suggest its results, fkp, can be used as a recommendation for the number of dimensions to use given a task domain and training set size.

  Nta    demographics   personality   mental health
         (3 tasks)      (2 tasks)     (2 tasks)
  50     16             16            16
  100    128            16            22
  200    512            32            45
  500    768            64            64
  1000   768            90            64

Our goal is that such results can provide a systematic guide for making such modeling decisions in future human-level NLP tasks, where such an experiment (which relies on resampling over larger amounts of training data) is typically not feasible. Table 5 shows the fkp over all of the training sample sizes (Nta). The exponential median (med) in the table is calculated as med = 2^(Median(log2(x))). The fkp results suggest that more available training samples yield the ability to leverage more dimensions, but the degree to which depends on the task. In fact, utilizing all the embedding dimensions was only effective for the demographic prediction tasks. The other two task sets benefited from reduction, often with only 1/12 to 1/6 of the original second to last transformer layer dimensions.
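As a concrete illustration, the fkp selection and the exponential median can be computed as sketched below. The exact confidence-interval construction is not spelled out beyond "95% CI from bootstrapped resamples", so the normal-approximation interval here is an assumption, as are the function names.

```python
import numpy as np

def first_k_to_peak(scores_by_k, z=1.96):
    """scores_by_k maps k -> (mean score, standard error) for one task and Nta.
    Returns the smallest k whose mean lies within the 95% CI of the peak."""
    peak_mean, peak_se = max(scores_by_k.values(), key=lambda ms: ms[0])
    threshold = peak_mean - z * peak_se           # lower bound of the peak's CI
    return min(k for k, (mean, _) in scores_by_k.items() if mean >= threshold)

def exponential_median(fkp_values):
    """Exponential median used to summarize fkp across a set of tasks."""
    return 2 ** np.median(np.log2(fkp_values))

# e.g., exponential_median([16, 32]) ~ 22, matching the style of Table 5.
```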
6 Error Analysis

Here, we seek to better understand why using pre-trained models worked better than fine-tuning, and the differences between using PCA and NMF components in the low sample setting (Nta = 500).

Pre-trained vs Fine-tuned. We looked at categories of language from LIWC (Tausczik and Pennebaker, 2010) correlated with the difference in the absolute error of the pre-trained and fine-tuned models in age prediction. Table 6 suggests that the pre-trained model is better at handling users whose language conforms to formal rules, while fine-tuning helps in learning better representations of affect words and captures informal language well. Furthermore, these LIWC variables are also known to be associated with age (Schwartz et al., 2013). Additional analysis comparing these two models is available in appendix E.1.

Table 6: Top LIWC variables having negative and positive correlations with the difference in the absolute error of the pre-trained model and the fine-tuned model for age prediction (Benjamini-Hochberg FDR p < .05). This suggests that the fine-tuned model has lower error than the pre-trained model when the language is informal and consists of more affect words.

  Association   LIWC variables
  Positive      Informal, Netspeak, Negemo, Swear, Anger
  Negative      Affiliation, Social, We, They, Family, Function, Drives, Prep, Focuspast, Quant

PCA vs NMF. Figure 2 suggests that PCA is better at handling longer text sequences than NMF (> 55 one-grams on average) when trained with less data. This choice wouldn't make much difference when used for Tweet-like short texts, but the errors diverge rapidly for longer samples. We also see that PCA is better at capturing information from these texts that has higher predictive power in downstream tasks. This is discussed in appendix E.2, along with other interesting findings involving the comparison of PCA and the pre-trained model in E.3.

Figure 2: Comparison of the absolute error of NMF and PCA with the average number of 1-grams per message, for age and ext prediction (y-axis: absolute error; x-axis: log of average 1-grams per message). While both models appear to perform very similarly when the texts are short or average sized, PCA is better at handling longer texts; the errors diverge as the length of the texts increases.

7 Discussion

Ethical Consideration. We used existing datasets that were either collected with participant consent (FB and CLPsych 2018) or public data with identifiers removed and collected in a non-intrusive manner (CLPsych 2019). All procedures were reviewed and approved by both our institutional review board as well as the IRB of the creators of the data set.

Our work can be seen as part of the growing body of interdisciplinary research intended to understand human attributes associated with language, aiming towards applications that can improve human life, such as producing better mental health assessments that could ultimately save lives. However, at this stage, our models are not intended to be used in practice for mental health care nor for labeling individuals publicly with mental health, personality, or demographic scores. Even when the point comes where such models are ready for testing in clinical settings, this should only be done with oversight from professionals in mental health care to establish the failure modes and their rates (e.g. false-positives leading to incorrect treatment or false-negatives leading to missed care; increased inaccuracies due to evolving language; disparities in failure modes by demographics).
Malicious use possibilities for which this work is not intended include targeting advertising to individuals using language-based psychology scores, which could present harmful content to those suffering from mental health conditions.

We intend that the results of our empirical study are used to inform fellow researchers in computational linguistics and psychology on how to better utilize contextual embeddings towards the goal of improving psychological and mental health assessments. Mental health conditions, such as depression, are widespread, and many suffering from such conditions are under-served, with only 13–49% receiving minimally adequate treatment (Kessler et al., 2003; Wang et al., 2005). Marginalized populations, such as those with low income or minorities, are especially under-served (Saraceno et al., 2007). Such populations are well represented in social media (Center, 2021), and with this technology developed largely over social media and predominantly using self-reported labels from users (i.e., rather than annotator-perceived labels that sometimes introduce bias (Sap et al., 2019; Flekova et al., 2016)), we do not expect that marginalized populations are more likely to hit failure modes. Still, tests for error disparities (Shah et al., 2020) should be carried out in conjunction with clinical researchers before this technology is deployed. We believe this technology offers the potential to broaden the coverage of mental health care to such populations where resources are currently limited.

Future assessments built on the learnings of this work, and in conjunction with clinical mental health researchers, could help the under-served by both better classifying one's condition as well as identifying an ideal treatment. Any applications to human subjects should consider the ethical implications, undergo human subjects review, and the predictions made by the model should not be shared with the individuals without consulting the experts.

Limitations. Each dataset brings its own unique selection biases across groups of people, which is one reason we tested across many datasets covering a variety of human demographics. Most notably, the FB dataset is skewed young and is geographically focused on residents within the United States. The CLPsych 2018 dataset is a representative sample of citizens of the United Kingdom, all born in the same week, and the CLPsych-2019 dataset was further limited primarily to those posting in a suicide-related forum (Zirikly et al., 2019). Further, tokenization techniques can also impact language model performance (Bostrom and Durrett, 2020). To avoid oversimplification of complex human attributes, in line with psychological research (Haslam et al., 2012), all outcomes were kept in their most dimensional form – e.g. personality scores were kept as real values rather than divided into bins, and the CLPsych-2019 risk levels were kept at 4 levels to yield gradation in assessments, as justified by Zirikly et al., 2019.

8 Conclusion

We provide the first empirical evaluation of the effectiveness of contextual embeddings as a function of dimensionality and sample size for human-level prediction tasks. Multiple human-level tasks, along with many of the most popular language model techniques, were systematically evaluated in conjunction with dimension reduction techniques to derive optimal setups for the low sample regimes characteristic of many human-level tasks.

We first show that fine-tuning transformer LMs in low-data scenarios yields worse performance than using pre-trained models. We then show that reducing the dimensions of contextual embeddings can improve performance, and while past work used non-negative matrix factorization (Matero et al., 2019), we note that PCA gives the most reliable improvement. Auto-encoder based transformer language models gave better performance, on average, than their auto-regressive contemporaries of comparable sizes. We find optimized versions of BERT, specifically RoBERTa, to yield the best results.

Finally, we find that many human-level tasks can be achieved with a fraction, often 1/16th or 1/12th, of the total transformer hidden-state size without sacrificing significant accuracy. Generally, using fewer dimensions also reduces variance in model performance, in line with traditional bias-variance trade-offs, and thus increases the chance of generalizing to new populations. Further, it can aid in explainability, especially when considering that these dimension reduction models can be pre-trained and standardized, and thus compared across problem sets and studies.
References 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 55–65, Ashwin Karthik Ambalavanan, Pranjali Dileep Jagtap, Hong Kong, China. Association for Computational Soumya Adhya, and Murthy Devarakonda. 2019. Linguistics. Using contextual representations for suicide risk as- sessment from Internet forums. In Proceedings of Cédric Févotte and Jérôme Idier. 2011. Algorithms the Sixth Workshop on Computational Linguistics for nonnegative matrix factorization with the β- and Clinical Psychology, pages 172–176, Minneapo- divergence. Neural computation, 23(9):2421–2456. lis, Minnesota. Association for Computational Lin- guistics. Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. 2016. An- Xingce Bao and Qianqian Qiao. 2019. Transfer learn- alyzing biases in human perception of user age and ing from pre-trained BERT for pronoun resolution. gender from text. In Proceedings of the 54th An- In Proceedings of the First Workshop on Gender nual Meeting of the Association for Computational Bias in Natural Language Processing, pages 82–88, Linguistics (Volume 1: Long Papers), pages 843– Florence, Italy. Association for Computational Lin- 854, Berlin, Germany. Association for Computa- guistics. tional Linguistics. David M Blei, Andrew Y Ng, and Michael I Jordan. M Ghodsian. 1977. Children’s behaviour and the 2003. Latent dirichlet allocation. Journal of ma- bsag: some theoretical and statistical considerations. chine Learning research, 3(Jan):993–1022. British Journal of Social and Clinical Psychology, Kaj Bostrom and Greg Durrett. 2020. Byte pair encod- 16(1):23–28. ing is suboptimal for language model pretraining. In Sharath Chandra Guntuku, David B Yaden, Margaret L Findings of the Association for Computational Lin- Kern, Lyle H Ungar, and Johannes C Eichstaedt. guistics: EMNLP 2020, pages 4617–4624, Online. 2017. Detecting depression and mental illness on Association for Computational Linguistics. social media: an integrative review. Current Opin- Pew Research Center. 2021. Social media fact sheet. ion in Behavioral Sciences, 18:43–49. Christopher Clark, Kenton Lee, Ming-Wei Chang, Suchin Gururangan, Ana Marasović, Swabha Tom Kwiatkowski, Michael Collins, and Kristina Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Toutanova. 2019. Boolq: Exploring the surprising and Noah A. Smith. 2020. Don’t stop pretraining: difficulty of natural yes/no questions. arXiv preprint Adapt language models to domains and tasks. In arXiv:1905.10044. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages Glen Coppersmith, Mark Dredze, and Craig Harman. 8342–8360, Online. Association for Computational 2014. Quantifying mental health signals in Twit- Linguistics. ter. In Proceedings of the Workshop on Computa- tional Linguistics and Clinical Psychology: From Nathan Halko, Per-Gunnar Martinsson, and Joel A Linguistic Signal to Clinical Reality, pages 51–60, Tropp. 2011. Finding structure with random- Baltimore, Maryland, USA. Association for Compu- ness: Probabilistic algorithms for constructing ap- tational Linguistics. proximate matrix decompositions. SIAM review, 53(2):217–288. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Nick Haslam, Elise Holland, and Peter Kuppens. 2012. deep bidirectional transformers for language under- Categories versus dimensions in personality and psy- standing. 
In Proceedings of the 2019 Conference chopathology: a quantitative review of taxometric of the North American Chapter of the Association research. Psychological medicine, 42(5):903. for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. pages 4171–4186, Minneapolis, Minnesota. Associ- Reducing the dimensionality of data with neural net- ation for Computational Linguistics. works. science, 313(5786):504–507. J. Eichstaedt, R. Smith, R. Merchant, Lyle H. Ungar, Dirk Hovy and Shannon L Spruit. 2016. The social Patrick Crutchley, Daniel Preotiuc-Pietro, D. Asch, impact of natural language processing. In Proceed- and H. A. Schwartz. 2018. Facebook language pre- ings of the 54th Annual Meeting of the Association dicts depression in medical records. Proceedings for Computational Linguistics (Volume 2: Short Pa- of the National Academy of Sciences of the United pers), pages 591–598. States of America, 115:11203 – 11208. Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. Kawin Ethayarajh. 2019. How contextual are contex- 2019. What does BERT learn about the structure tualized word representations? comparing the geom- of language? In Proceedings of the 57th Annual etry of BERT, ELMo, and GPT-2 embeddings. In Meeting of the Association for Computational Lin- Proceedings of the 2019 Conference on Empirical guistics, pages 3651–3657, Florence, Italy. Associa- Methods in Natural Language Processing and the tion for Computational Linguistics. 4524
Haoming Jiang, Pengcheng He, Weizhu Chen, Xi- sification without the tweet: An empirical exami- aodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. nation of user versus document attributes. In Pro- SMART: Robust and efficient fine-tuning for pre- ceedings of the Third Workshop on Natural Lan- trained natural language models through principled guage Processing and Computational Social Sci- regularized optimization. In Proceedings of the 58th ence, pages 18–28. Annual Meeting of the Association for Computa- tional Linguistics, pages 2177–2190, Online. Asso- Veronica Lynn, Alissa Goodman, Kate Niederhof- ciation for Computational Linguistics. fer, Kate Loveys, Philip Resnik, and H. Andrew Schwartz. 2018. CLPsych 2018 shared task: Pre- Margaret L Kern, Johannes C Eichstaedt, H An- dicting current and future psychological health from drew Schwartz, Lukasz Dziurzynski, Lyle H Ungar, childhood essays. In Proceedings of the Fifth Work- David J Stillwell, Michal Kosinski, Stephanie M Ra- shop on Computational Linguistics and Clinical Psy- mones, and Martin EP Seligman. 2014. The online chology: From Keyboard to Clinic, pages 37–46, social self: An open vocabulary approach to person- New Orleans, LA. Association for Computational ality. Assessment, 21(2):158–169. Linguistics. Margaret L Kern, Gregory Park, Johannes C Eichstaedt, Matthew Matero, Akash Idnani, Youngseo Son, Sal- H Andrew Schwartz, Maarten Sap, Laura K Smith, vatore Giorgi, Huy Vu, Mohammad Zamani, Parth and Lyle H Ungar. 2016. Gaining insights from so- Limbachiya, Sharath Chandra Guntuku, and H. An- cial media language: Methodologies and challenges. drew Schwartz. 2019. Suicide risk assessment with Psychological methods, 21(4):507. multi-level dual-context language and BERT. In Proceedings of the Sixth Workshop on Computa- Ronald C Kessler, Patricia Berglund, Olga Demler, tional Linguistics and Clinical Psychology, pages Robert Jin, Doreen Koretz, Kathleen R Merikangas, 39–44, Minneapolis, Minnesota. Association for A John Rush, Ellen E Walters, and Philip S Wang. Computational Linguistics. 2003. The epidemiology of major depressive dis- order: results from the national comorbidity survey Elijah Mayfield and Alan W Black. 2020. Should you replication (ncs-r). Jama, 289(23):3095–3105. fine-tune BERT for automated essay scoring? In Proceedings of the Fifteenth Workshop on Innova- M. Kosinski, D. Stillwell, and T. Graepel. 2013. Pri- tive Use of NLP for Building Educational Applica- vate traits and attributes are predictable from digital tions, pages 151–162, Seattle, WA, USA → Online. records of human behavior. Proceedings of the Na- Association for Computational Linguistics. tional Academy of Sciences, 110:5802 – 5805. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- Kevin Gimpel, Piyush Sharma, and Radu Soricut. rado, and Jeff Dean. 2013. Distributed representa- 2019. Albert: A lite bert for self-supervised learning tions of words and phrases and their compositional- of language representations. In International Con- ity. In Advances in neural information processing ference on Learning Representations. systems, pages 3111–3119. Xiang Lisa Li and Jason Eisner. 2019. Specializing David N. Milne, Glen Pink, Ben Hachey, and Rafael A. word embeddings (for parsing) by information bot- Calvo. 2016. CLPsych 2016 shared task: Triaging tleneck. In Proceedings of the 2019 Conference on content in online peer-support forums. 
In Proceed- Empirical Methods in Natural Language Processing ings of the Third Workshop on Computational Lin- and the 9th International Joint Conference on Natu- guistics and Clinical Psychology, pages 118–127, ral Language Processing (EMNLP-IJCNLP), pages San Diego, CA, USA. Association for Computa- 2744–2754, Hong Kong, China. Association for tional Linguistics. Computational Linguistics. Michelle Morales, Prajjalita Dey, Thomas Theisen, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Danny Belitz, and Natalia Chernova. 2019. An in- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, vestigation of deep learning systems for suicide risk Luke Zettlemoyer, and Veselin Stoyanov. 2019. assessment. In Proceedings of the Sixth Workshop Roberta: A robustly optimized bert pretraining ap- on Computational Linguistics and Clinical Psychol- proach. arXiv preprint arXiv:1907.11692. ogy, pages 177–181, Minneapolis, Minnesota. Asso- ciation for Computational Linguistics. Veronica Lynn, Niranjan Balasubramanian, and H. An- drew Schwartz. 2020. Hierarchical modeling for Antonio A Morgan-Lopez, Annice E Kim, Robert F user personality prediction: The role of message- Chew, and Paul Ruddle. 2017. Predicting age level attention. In Proceedings of the 58th Annual groups of twitter users based on language and meta- Meeting of the Association for Computational Lin- data features. PloS one, 12(8):e0183537. guistics, pages 5306–5316, Online. Association for Computational Linguistics. Marius Mosbach, Maksym Andriushchenko, and Diet- rich Klakow. 2020. On the stability of fine-tuning Veronica Lynn, Salvatore Giorgi, Niranjan Balasubra- bert: Misconceptions, explanations, and strong base- manian, and H Andrew Schwartz. 2019. Tweet clas- lines. arXiv preprint arXiv:2006.04884. 4525
Jiaqi Mu and Pramod Viswanath. 2018. All-but-the- Victor Sanh, Lysandre Debut, Julien Chaumond, and top: Simple and effective postprocessing for word Thomas Wolf. 2019. Distilbert, a distilled version representations. In International Conference on of bert: smaller, faster, cheaper and lighter. arXiv Learning Representations. preprint arXiv:1910.01108. K. Murphy and C. Davidshofer. 1988. Psychological Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, testing: Principles and applications. and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the Gregory Park, H. A. Schwartz, J. Eichstaedt, M. Kern, 57th Annual Meeting of the Association for Com- M. Kosinski, D. Stillwell, Lyle H. Ungar, and putational Linguistics, pages 1668–1678, Florence, M. Seligman. 2015. Automatic personality assess- Italy. Association for Computational Linguistics. ment through social media language. Journal of per- sonality and social psychology, 108 6:934–52. Maarten Sap, Greg Park, Johannes C Eichstaedt, Mar- Adam Paszke, Sam Gross, Francisco Massa, Adam garet L Kern, David J Stillwell, Michal Kosinski, Lerer, James Bradbury, Gregory Chanan, Trevor Lyle H Ungar, and H Andrew Schwartz. 2014. De- Killeen, Zeming Lin, Natalia Gimelshein, Luca veloping age and gender predictive lexica over social Antiga, Alban Desmaison, Andreas Kopf, Edward media. In Proceedings of the 2014 Conference on Yang, Zachary DeVito, Martin Raison, Alykhan Te- Empirical Methods in Natural Language Processing jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, (EMNLP). Junjie Bai, and Soumith Chintala. 2019. Py- torch: An imperative style, high-performance deep Benedetto Saraceno, Mark van Ommeren, Rajaie Bat- learning library. In H. Wallach, H. Larochelle, niji, Alex Cohen, Oye Gureje, John Mahoney, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Gar- Devi Sridhar, and Chris Underhill. 2007. Barriers nett, editors, Advances in Neural Information Pro- to improvement of mental health services in low- cessing Systems 32, pages 8024–8035. Curran Asso- income and middle-income countries. The Lancet, ciates, Inc. 370(9593):1164–1174. Jeffrey Pennington, Richard Socher, and Christopher H Andrew Schwartz, Johannes C Eichstaedt, Mar- Manning. 2014. GloVe: Global vectors for word garet L Kern, Lukasz Dziurzynski, Stephanie M Ra- representation. In Proceedings of the 2014 Confer- mones, Megha Agrawal, Achal Shah, Michal Kosin- ence on Empirical Methods in Natural Language ski, David Stillwell, Martin EP Seligman, et al. 2013. Processing (EMNLP), pages 1532–1543, Doha, Personality, gender, and age in the language of social Qatar. Association for Computational Linguistics. media: The open-vocabulary approach. PloS one, 8(9):e73791. Chris Power and Jane Elliott. 2005. Cohort profile: 1958 British birth cohort (National Child Develop- H. Andrew Schwartz, Salvatore Giorgi, Maarten Sap, ment Study). International Journal of Epidemiol- Patrick Crutchley, Lyle Ungar, and Johannes Eich- ogy, 35(1):34–41. staedt. 2017. DLATK: Differential language anal- ysis ToolKit. In Proceedings of the 2017 Confer- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, ence on Empirical Methods in Natural Language Dario Amodei, and Ilya Sutskever. 2019. Language Processing: System Demonstrations, pages 55–60, models are unsupervised multitask learners. OpenAI Copenhagen, Denmark. Association for Computa- Blog, 1(8):9. tional Linguistics. 
Colin Raffel, Noam Shazeer, Adam Roberts, Kather- ine Lee, Sharan Narang, Michael Matena, Yanqi Deven Santosh Shah, H. Andrew Schwartz, and Dirk Zhou, Wei Li, and Peter J. Liu. 2020. Exploring Hovy. 2020. Predictive biases in natural language the limits of transfer learning with a unified text-to- processing models: A conceptual framework and text transformer. Journal of Machine Learning Re- overview. In Proceedings of the 58th Annual Meet- search, 21(140):1–67. ing of the Association for Computational Linguistics, pages 5248–5264, Online. Association for Computa- Vikas Raunak, Vivek Gupta, and Florian Metze. 2019. tional Linguistics. Effective dimensionality reduction for word em- beddings. In Proceedings of the 4th Workshop Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. on Representation Learning for NLP (RepL4NLP- 2019. How to fine-tune bert for text classification? 2019), pages 235–243, Florence, Italy. Association In China National Conference on Chinese Computa- for Computational Linguistics. tional Linguistics, pages 194–206. Springer. Philip Resnik, Anderson Garron, and Rebecca Resnik. Y. Tausczik and J. Pennebaker. 2010. The psychologi- 2013. Using topic modeling to improve prediction cal meaning of words: Liwc and computerized text of neuroticism and depression in college students. analysis methods. Journal of Language and Social In Proceedings of the 2013 Conference on Empiri- Psychology, 29:24 – 54. cal Methods in Natural Language Processing, pages 1348–1353, Seattle, Washington, USA. Association Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. for Computational Linguistics. BERT rediscovers the classical NLP pipeline. In 4526
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Philip S Wang, Michael Lane, Mark Olfson, Harold A Pincus, Kenneth B Wells, and Ronald C Kessler. 2005. Twelve-month use of mental health services in the United States: results from the national comorbidity survey replication. Archives of General Psychiatry, 62(6):629–640.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Ayah Zirikly, Philip Resnik, Özlem Uzuner, and Kristy Hollingshead. 2019. CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 24–33, Minneapolis, Minnesota. Association for Computational Linguistics.

Appendices

A Experimental Setup

Implementation. All the experiments were implemented using Python, DLATK (Schwartz et al., 2017), HuggingFace Transformers (Wolf et al., 2019), and PyTorch (Paszke et al., 2019). The environments were instantiated with a seed value of 42, except for fine-tuning, which used 1337. Code to reproduce all results is available on our github page: github.com/adithya8/ContextualEmbeddingDR/

Infrastructure. The deep learning models such as stacked transformers and NLAE were run on a single GPU with batch size given by:

batchsize = (GPU memory − model size) / ((floating precision / 8) × δ)

δ = trainable_params                         for fine-tuning
δ = layers × hidden_size × max_tokens        for embedding retrieval

where GPU memory and model size (the space occupied by the model) are in bytes, trainable_params corresponds to the number of trainable parameters during fine-tuning, layers corresponds to the number of layers of embeddings required, hidden_size is the number of dimensions in the hidden state, and max_tokens is the maximum number of tokens (after tokenization) in any batch. We carried out the experiments with 1 NVIDIA Titan Xp GPU, which has around 12 GB of memory. All the other methods were implemented on CPU.

Figure A1: Depiction of the dimension reduction method. Transformer embeddings of domain data (Npt users' embeddings⁹) are used to pre-train a dimension reduction model that transforms the embeddings down to k dimensions. This step is followed by applying the learned reduction model on the task's train and test data embeddings. The reduced train features (Nmax users) are then bootstrap sampled to produce 10 sets of Nta users each for training task-specific models. All these 10 task-specific models are evaluated on the reduced test features consisting of Nte users during task evaluation. The mean and standard deviation of the task-specific metric are collected.

⁹ Generation of user embeddings is explained in detail under methods.
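To make the batch-size formula above concrete, a small helper is sketched below; the specific byte counts in the usage example are illustrative placeholders, not measurements reported in the paper.

```python
def max_batch_size(gpu_memory_bytes, model_size_bytes, delta,
                   floating_precision_bits=32):
    """Batch size bound from the Infrastructure formula above."""
    return int((gpu_memory_bytes - model_size_bytes)
               // ((floating_precision_bits / 8) * delta))

# delta for embedding retrieval: layers * hidden_size * max_tokens
delta_embed = 1 * 768 * 512
# Illustrative numbers: a 12 GB GPU and a ~500 MB model, fp32 activations.
print(max_batch_size(12 * 1024**3, 500 * 1024**2, delta_embed))
```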
B Model Details

NLAE architecture. The model architecture for the non-linear auto-encoders in Table 4 was a twin network taking inputs of 768 dimensions, reducing them to 128 dimensions through 2 layers, and reconstructing the original 768 dimensional representation with 2 layers. This architecture was chosen balancing the constraints of enabling non-linear associations while keeping the total parameters low given the low sample size context. The formal definition of the model (primed weights and biases belong to the decoder) is:

x_comp   = f(W2^T f(W1^T x + b1) + b2)
x_decomp = W1'^T f(W2'^T x_comp + b2') + b1'

x, x_decomp ∈ R^768, x_comp ∈ R^128
W1 ∈ R^(768×448), W2 ∈ R^(448×128), b1 ∈ R^448, b2 ∈ R^128
W1' ∈ R^(448×768), W2' ∈ R^(128×448), b2' ∈ R^448, b1' ∈ R^768
f(a) = max(a, 0), ∀a ∈ R

NLAE Training. The data for domain pre-training of the dimension reduction was split into 2 sets for NLAE alone: training and validation sets. 90% of the domain data was randomly sampled for training the NLAE, and the remaining 10% of the pre-training data was used to validate hyper-parameters after every epoch. The model was trained to minimise the reconstruction mean squared loss over multiple epochs, until the validation loss increased over 3 consecutive epochs. AdamW was the optimizer used, with the learning rate set to 0.001. This took around 30-40 epochs depending upon the dataset.

Fine-tuning. In our fine-tuning configuration, we freeze all but the top 2 layers of the best LM to prevent over-fitting and vanishing gradients at the lower layers (Sun et al., 2019; Mosbach et al., 2020). We also apply early stopping (varying the patience between 3 and 6 depending upon the task). Other hyperparameters for this experiment include L2-regularization (in the form of weight decay on the AdamW optimizer, set to 1), dropout set to 0.3, batch size set to 10, learning rate initialized to 5e-5, and the number of epochs set to a maximum of 15, which was limited by early stopping to between 5-10 depending on the task and the early stopping patience. We arrived at these hyperparameter values after an extensive search. The weight decay parameter was searched in [100, 0.01], dropout within [0.1, 0.5], and learning rate between [5e-4, 5e-5].

C Data

Due to human subjects privacy constraints, most data are not able to be publicly distributed, but they are available from the original data owners via requests for research purposes (e.g. the CLPsych-2018 and CLPsych-2019 shared tasks).

D Additional Results

D.1 Results on higher Nta

We can see from Figure A2 that reduction still helps in the majority of tasks at higher Nta. As expected, the performance starts to plateau at higher Nta values, and this is visibly consistent across most tasks. With the exception of age and gender prediction using the Facebook data, all the other tasks benefit from reduction.

D.2 Results on classification tasks

Figure A3 compares the performance of reduced dimensions in the low sample size scenario (Nta ≤ 1000) for classification tasks. Except for a few Nta values in gender prediction using the Facebook data, all the other tasks benefit from reduction in achieving the best performance.

D.3 LM comparison for no reduction & smaller models

Table A1 compares the performance of the language models without applying any dimension reduction to the embeddings, and the performance of the best transformer models is also compared with smaller models (and distil versions) after reducing the second to last layer representation to 128 dimensions in Table A2.

D.4 Least dimensions required: higher Nta

The fkp plateaus as the number of training samples grows, as seen in Table A3.

E Additional Analysis

E.1 Pre-trained vs Fine-Tuned models

We also find that the fine-tuned model doesn't perform better than the pre-trained model for users with typical message lengths, but is better at handling longer sequences upon training it on the tasks' data. This is evident from the graphs in Figure A4.

E.2 PCA vs NMF

From Figure A5, we can see that LIWC variables like ARTICLE, INSIGHT, PERCEPT (perceptual