Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers Reveals Distinctive yet Consistent Individual Styles
Jian Zhu (Department of Linguistics, University of Michigan, lingjzhu@umich.edu) and David Jurgens (School of Information, University of Michigan, jurgens@umich.edu)

arXiv:2109.03158v3 [cs.CL] 10 Sep 2021

Abstract

An individual's variation in writing style is often a function of both social and personal attributes. While structured social variation has been extensively studied (e.g., gender-based variation), far less is known about how to characterize individual styles due to their idiosyncratic nature. We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features. The neural model achieves strong performance at authorship identification on short texts, and, through an analogy-based probing task, we show that the learned representations exhibit surprising regularities that encode both qualitative and quantitative shifts of idiolectal styles. Through text perturbation, we quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, we provide a description of idiolects through measuring inter- and intra-author variation, showing that variation in idiolects is often distinctive yet consistent.

1 Introduction

Linguistic identities manifest through ubiquitous language variation. The notion that language functions as a stylistic resource for the construction and performance of social identity rests upon two theoretical constructs: sociolect and idiolect (Grant and MacLeod, 2018). The term "sociolect" refers to socially structured variation at the group level, whereas "idiolect" denotes language variation associated with individuals (Wardhaugh, 2011; Grant and MacLeod, 2018). Variationist sociolinguistics emphasizes the systematic variation of sociolects along dimensions such as gender, ethnicity, and socioeconomic stratification (Labov, 1972). While a central concept in sociolinguistics, idiolect has received far more research attention in forensic linguistics (Wright, 2018; Grant and MacLeod, 2018).

Although idiolects have played a central role in stylometry and forensic linguistics, which seek to quantify and characterize individual textual features to separate authors (Grant, 2012; Coulthard et al., 2016; Neal et al., 2017), the theory of idiolect remains comparatively underdeveloped (Grant and MacLeod, 2018). An in-depth understanding of the nature and the variation of idiolect not only sheds light on the theoretical discussion of language variation but also aids practical and forensic applications of linguistic science.

Here, we characterize the idiolectal variation of linguistic styles through a computational analysis of large-scale short online texts. Specifically, we ask the following questions: 1) to what extent can we extract distinct styles from short texts, even for unseen authors; 2) what are the core stylistic dimensions along which individuals vary; and 3) to what extent are idiolects consistent and distinctive?

First, using deep metric learning, we show that idiolect is, in fact, systematic and can be quantified separately from sociolect, and we introduce a new set of probing tasks for testing the relative contributions of different linguistic variations to idiolectal styles. Second, we find that the learned representations for idiolect also encode some stylistic dimensions with surprising regularity (see Figure 1), analogous to the linguistic regularities found in word embeddings. Third, with our proposed metrics for style distinctiveness and consistency, we show that individuals vary considerably in their internal consistency and distinctiveness of idiolect, which has implications for the limits of authorship recognition and the practice of forensic linguistics. For replication, we make our code available at https://github.com/lingjzhu/idiolect.

2 Idiolectal variation

Theoretical questions. "Idiolect" remains a fundamental yet elusive construct in sociolinguistics (Wright, 2018). The term has been abstractly defined as the totality of the possible utterances one could say (Bloch, 1948; Turell, 2010; Wright, 2018).
[Figure 1: Compositionality in stylistic embeddings encoded by SRoBERTa, projected onto the first two principal components. [Left] Lower-casing all letters shifts all original texts in the same direction (blue dots → orange dots; e.g., "I love Cantonese BBQ!" → "i love cantonese bbq!"). [Right] The magnitude of movement in one direction is proportionate to the number of null-subject sentences in the text (blue dots → orange dots → green crosses; e.g., "I went out. I bought durians." → "Went out. I bought durians." → "Went out. Bought durians.").]

Idiolect as a combination of one's cognitive capacity and sociolinguistic experiences (Grant and MacLeod, 2018) raises many interesting linguistic questions. First, how are idiolects composed? Forensic studies often focus on a few linguistic features as capturing a person's idiolect (Coulthard, 2004; Barlow, 2013; Wright, 2013), but few have explicitly offered an explanation as to why particular word sequences were useful or not (Wright, 2017). Frameworks for analyzing individual variation have been proposed (Grant, 2012; Grant and MacLeod, 2020), but the contributions to idiolectal variation at different linguistic levels are seldom explicitly measured.

Sociolinguists have also pursued the relationship between idiolect and sociolect. However, the perceived idiosyncrasies of idiolect often render it second to sociolect as an object of study (Labov, 1989). Multiple scholars have suggested that idiolects are the building blocks of various sociolects (Eckert, 2012; Barlow, 2013; Wright, 2018). While some studies have probed the relations between the language of individuals and the group (Johnstone and Bean, 1997; Schilling-Estes, 1998; Meyerhoff and Walker, 2007), it remains less clear to what extent idiolect is composed of sociolects or has unique elements of its own (see Barlow, 2013, for a review). We test this hypothesized relationship later in Appendix F and do find preliminary quantitative evidence that idiolect representations retain some information about sociolects.

The other theoretical question, relevant to cognitive science as well as forensic linguistics, is to what extent an individual's idiolect is distinctive and consistent against a background population (Grant, 2012; Grant and MacLeod, 2018). Studies in forensic linguistics (Johnson and Wright, 2014; Wright, 2013, 2017) provide confirmatory answers to both questions, yet these studies only focus on a specific set of features for a small group of authors. At the other extreme, the high performance of applying machine learning to authorship verification and attribution (Kestemont et al., 2018, 2019, 2020) stems from placing more emphasis on separating authors (distinctiveness) than on consistency. An empirical investigation of these two concepts in a relatively large population remains to be conducted; here we aim to quantify these two linguistic constructs in large-scale textual datasets.

Stylometry and stylistic similarities. Traditional stylometry often relies on painstaking manual analysis of texts for a closed set of authors (Holmes, 1998). Surface linguistic features, especially function words or character n-grams, have been found to be effective in authorship analysis (Kestemont, 2014; Neal et al., 2017). Despite the overwhelming success of deep learning, such traditional features remain effective in authorship analysis (Kestemont et al., 2018, 2019, 2020). Yet the wide application of machine learning and deep learning in recent years has greatly advanced the state-of-the-art performance in authorship verification (Boenninghoff et al., 2019a,b; Weerasinghe and Greenstadt, 2020). Recent PAN Authorship Verification shared tasks suggest that characterizing individual styles in long texts can be solved with almost perfect accuracy (Kestemont et al., 2020); as a result, stylometric studies have increasingly focused on short texts in social media or online communications for authorship profiling or verification (Brocardo et al., 2013; Vosoughi et al., 2015; Boenninghoff et al., 2019b). Our study points to a promising application of sociolinguistics for authorship detection in social media, as idiolectal variation is evident and consistent in texts as short as 100 tokens.
3 Learning representations of idiolects

In forensic linguistics, textual similarity was traditionally quantified as the proportion of shared vocabulary and the number and length of shared phrases or character n-grams (Coulthard, 2004). More sophisticated statistical methods for textual comparison have been developed over the years (Neal et al., 2017; Kestemont et al., 2020).

To learn representations of idiolectal style, we propose using a proxy task of authorship verification: given two input texts, a model must determine whether they were written by the same author or not (Neal et al., 2017). The identification is performed by scoring the two texts under comparison with a linguistic similarity measure; if this measure exceeds a certain threshold, the two texts are judged to be written by the same author.

Task definition. Given a collection of text pairs by multiple authors $X = \{(x_1^{p_1}, x_2^{p_1}), \ldots, (x_{n-1}^{p_{t-1}}, x_n^{p_t})\}$ from domains $P = \{p_1, p_2, \ldots, p_t\}$ and labels $Y = \{y_1, y_2, \ldots, y_n\}$, we aim to identify a function $f_\theta$ that can determine whether two text samples $x_i$ and $x_j$ were written by the same author ($y = 1$) or different authors ($y = 0$).

Stylometric similarity learning. Our model for extracting stylistic embeddings from input texts is the same as the Sentence RoBERTa or BERT network (SBERT/SRoBERTa) (Reimers and Gurevych, 2019). For a text pair $(x_i, x_j)$, the Siamese model $f_\theta$ maps both text samples into embedding vectors $(z_i, z_j)$ in the latent space, such that $z_i = f_\theta(x_i)$ and $z_j = f_\theta(x_j)$. Rather than using the [cls] token as the representation, we use attention pooling to merge the last hidden states $[h_0, \ldots, h_k]$ into a single embedding vector that represents textual style:

$$h_o = \mathrm{AttentionPool}([h_0, \ldots, h_k]), \qquad z = W_1 \cdot \sigma(W_2 \cdot h_o + b_2) + b_1 \qquad (1)$$

where $W_1$ and $W_2$ are learnable parameters and $\sigma(\cdot)$ is the ReLU activation function. Stylometric similarity between the text pair is then measured by a distance function $d(z_i, z_j)$; here we mainly consider cosine similarity.

The underlying models are RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019). Specifically, we used roberta-base or bert-base-uncased as the encoder. (The performance of bert-base-cased was almost identical to that of bert-base-uncased, so it was not included in the main results.)
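For concreteness, the following is a minimal sketch of such an encoder in PyTorch with HuggingFace Transformers. The exact parameterization of the attention pooling is not specified above, so the single-learned-query form here (and the projection dimensions) are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class StyleEncoder(nn.Module):
    """Siamese-style encoder: transformer + attention pooling + projection (Eq. 1).

    The pooling parameterization (one learned query scoring each token state)
    is an assumption; the paper only names "attention pooling".
    """
    def __init__(self, model_name="roberta-base", dim=768):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.query = nn.Linear(dim, 1)   # scores each token's hidden state
        self.W2 = nn.Linear(dim, dim)    # inner projection of Eq. 1 (with b2)
        self.W1 = nn.Linear(dim, dim)    # outer projection of Eq. 1 (with b1)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        scores = self.query(h).squeeze(-1)                  # (batch, seq)
        scores = scores.masked_fill(attention_mask == 0, -1e9)
        alpha = scores.softmax(dim=-1).unsqueeze(-1)        # attention weights
        h_o = (alpha * h).sum(dim=1)                        # attention-pooled state
        return self.W1(torch.relu(self.W2(h_o)))            # style embedding z

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = StyleEncoder()
batch = tokenizer(["I love Cantonese BBQ!", "Went out. Bought durians."],
                  return_tensors="pt", padding=True, truncation=True, max_length=100)
z = encoder(batch["input_ids"], batch["attention_mask"])    # (2, 768) embeddings
```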
Loss function. The classic max-margin loss for deep metric learning was shown to be effective in previous work on stylometry (Boenninghoff et al., 2019a,b). Inspired by Kim et al. (2020), we used a continuous approximation of the max-margin loss to learn the stylometric distance between users; the additional hyperparameter in this loss allows fine-grained control of the penalty magnitude for hard samples.

The loss is an adaptation of the proxy-anchor loss proposed by Kim et al. (2020). To minimize the distance between same-author pairs and maximize the distance between different-author pairs, the model was trained with a contrastive loss with pre-defined margins $\{\tau_s, \tau_d\}$ over the set of positive samples $P^+$ and the set of negative samples $P^-$:

$$\mathcal{L}^{(s)} = \frac{1}{|P^{+}|}\,\mathrm{Softplus}\Big(\underset{i,j \in P^{+}}{\mathrm{LogSumExp}}\ \alpha \cdot \big[d(z_i, z_j) - \tau_s\big]\Big) \qquad (2)$$

$$\mathcal{L}^{(d)} = \frac{1}{|P^{-}|}\,\mathrm{Softplus}\Big(\underset{i,j \in P^{-}}{\mathrm{LogSumExp}}\ \alpha \cdot \big[\tau_d - d(z_i, z_j)\big]\Big) \qquad (3)$$

$$\mathcal{L} = \mathcal{L}^{(s)} + \mathcal{L}^{(d)} \qquad (4)$$

where $\mathrm{Softplus}(z) = \log(1 + e^z)$ is a continuous approximation of the max function and $\alpha$ is a scaling factor that scales the penalty of out-of-margin samples. Out-of-margin samples are exponentially weighted through the Log-Sum-Exp operation, so that hard examples receive exponentially larger weights and the model is pushed to focus on them during training.
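A compact sketch of this loss follows. It treats d as a cosine distance (1 minus cosine similarity), which is an assumption about the sign convention, and it applies one LogSumExp over all positive (or negative) pairs in a batch; the released code in the repository above is the authoritative version.

```python
import torch
import torch.nn.functional as F

def stylometric_loss(z_a, z_b, same_author, tau_s=0.6, tau_d=0.4, alpha=30.0):
    """Continuous max-margin loss of Eqs. (2)-(4) over a batch of embedded pairs.

    z_a, z_b: (batch, dim) embeddings of the paired texts.
    same_author: boolean tensor splitting the batch into P+ and P- pairs.
    """
    d = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1)   # cosine distance per pair
    pos, neg = d[same_author], d[~same_author]
    # P+: softly penalize same-author pairs whose distance exceeds tau_s (Eq. 2);
    # LogSumExp weights out-of-margin (hard) pairs exponentially.
    loss_s = F.softplus(torch.logsumexp(alpha * (pos - tau_s), dim=0))
    # P-: softly penalize different-author pairs closer than tau_d (Eq. 3).
    loss_d = F.softplus(torch.logsumexp(alpha * (tau_d - neg), dim=0))
    return loss_s + loss_d                            # Eq. 4
```

At inference time, a pair would then be judged same-author when its distance falls below the margin midpoint, matching the thresholding rule described in the next paragraph.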
During inference, we compare the textual distance $d(x_1, x_2)$ with the threshold $\tau_t$, the average of the two margins, $\tau_t = \frac{\tau_s + \tau_d}{2}$. We set $\{\tau_s = 0.6, \tau_d = 0.4\}$ and $\alpha = 30$. Details of the hyperparameter search can be found in Appendix A.

Baseline methods. We also compare our models against several baseline methods for authorship verification. 1) GLAD: Groningen Lightweight Authorship Detection (Hürlimann et al., 2015) is a binary linear classifier using multiple handcrafted linguistic features. 2) FVD: the Feature Vector Difference method (Weerasinghe and Greenstadt, 2020) is a deep learning method for authorship verification that uses the absolute difference between two traditional stylometric feature vectors. 3) AdHominem: an LSTM-based Siamese network for authorship verification in social media domains (Boenninghoff et al., 2019a). 4) BERTConcat/RoBERTaConcat: BERT/RoBERTa fine-tuned on the concatenation of the two texts under comparison, with authorship determined by binary classification on the [cls] token (Ordoñez et al., 2020). Implementation details are provided in Appendix B.

4 Data

Amazon reviews. Our dataset was extracted from the release of the full Amazon review dataset up to 2018 (Ni et al., 2019). We filtered out reviews shorter than 50 words to ensure sufficient text to reveal stylistic variation, and we only retained users with reviews in at least two product domains (e.g., Electronics and Books) and at least five reviews in each domain. After text cleaning, the dataset contained 128,945 users. We partitioned 40%, 10%, and 50% of users into training, development, and test sets, with 51,398, 12,849, and 64,248 unique users in each set respectively. As one of our goals is to analyze stylistic variation, we reserved the majority of the data (50% of users) for model evaluation and subsequent linguistic analysis. The maximum length of all samples was limited to 100 tokens.
We partitioned linguistic features. Siamese architecture demon- 40%, 10%, and 50% of users into training, devel- strates its usefulness in the authorship verification opment, and test sets. There were 51398, 12849, task, as AdHominem (Boenninghoff et al., 2019a) and 64248 unique users in each set respectively. As and SRoBERTa perform better than the pre-trained one of our goals is to analyzed stylistic variation, transformer RoBERTaConcat . These results confirm we reserved the majority of the data (50% of users) that models recognize authors’ stylistic variation. for model evaluation and for subsequent linguistic Our error analysis also shows that identifying the analysis. The maximum length of all samples was same author across domains or different authors limited to 100 tokens. in the same domain poses a greater challenge to these models in general, though different model Negative sampling For each user, we randomly choices may exhibit various inductive biases (see sampled six pairs of texts written by the same user Appendix C). As SRoBERTa is shown to be the
Reddit posts. To test generalizability, we additionally constructed a second dataset from the online community Reddit and ran a subset of experiments on it. The top 200 subreddits were extracted via the ConvoKit package (Chang et al., 2020). Only users who had posted texts longer than 50 words in more than 10 subreddits were selected, resulting in 55,368 unique users. We partitioned 60%, 10%, and 30% of users into training, development, and test sets respectively. Each user's idiolect is represented by 10 posts from 10 different subreddits. The binary labels were then generated by randomly sampling SA and different-author (DA) pairs, using the same negative sampling procedure as for the Amazon dataset.

5 Results on proxy task

To test whether our model does recover stylistic features, we first test its performance on the proxy task of authorship verification, contextualizing the performance with other models specifically designed for that task. Performance is evaluated by accuracy, F1, and the area-under-the-curve score (AUC) (Kestemont et al., 2020). Table 1 suggests that all models are able to recover at least some distinctive aspects of individual style, even in these short texts. Deep learning-based methods generally achieve better verification accuracy than GLAD (Hürlimann et al., 2015) and FVDNN (Weerasinghe and Greenstadt, 2020), the models based on traditional linguistic features. The Siamese architecture demonstrates its usefulness in the authorship verification task, as AdHominem (Boenninghoff et al., 2019a) and SRoBERTa perform better than the pre-trained transformer RoBERTaConcat. These results confirm that the models recognize authors' stylistic variation. Our error analysis also shows that identifying the same author across domains, or different authors in the same domain, poses a greater challenge to these models in general, though different model choices may exhibit different inductive biases (see Appendix C). As SRoBERTa is shown to be the most effective architecture for this task, we use these models to examine idiolectal variation for the rest of the study.

Model            Accuracy   F1      AUC
Amazon reviews
  Random         50%        0.5     0.5
  GLAD           67.1%      0.667   0.738
  FVDNN          65.1%      0.671   0.714
  AdHominem      73.3%      0.781   0.811
  BERTConcat     71.3%      0.664   0.802
  SBERT          76.7%      0.768   0.850
  RoBERTaConcat  73.9%      0.686   0.838
  SRoBERTa       82.9%      0.831   0.909
Reddit posts
  BERTConcat     65.0%      0.600   0.721
  SBERT          66.3%      0.669   0.727
  RoBERTaConcat  71.0%      0.695   0.794
  SRoBERTa       73.0%      0.737   0.812

Table 1: Results of authorship verification on the Amazon (top) and Reddit (bottom) test samples.

6 Linguistic Analysis

In this section, we seek to quantify idiolectal variation at different linguistic levels.

6.1 Stylometric features

Idiolects vary at the lexical, syntactic, and discourse levels, yet it remains unclear which type of variation contributes most to idiolectal variation.

Ordering. Hierarchical models of linguistic identities hold that authorial identities are reflected at all linguistic levels (Herring, 2004; Grant, 2012; Grant and MacLeod, 2020), yet the relative importance of these elements is seldom empirically explored. To understand the contributions of lexical distribution, syntactic ordering, and discourse coherence, we test the contributions of different linguistic features to authorship verification by perturbing the input texts.
To force the model to use only lexical information, we randomly permute all tokens in our data, removing information about syntactic ordering (lexical model). The organization of discourse might also provide cues to idiolectal style; to test this, we preserve the word order within each sentence but permute sentences within the text to disrupt discourse information (lexico-syntactic model). We then ran our experiments on these datasets using the same set of hyperparameters to compare model performance on the perturbed inputs.

Content and function words. The use of function words has long been recognized as an important stylometric feature: a small set of function words are disproportionately frequent, relatively stable across content, and seemingly less under authors' conscious control (Kestemont, 2014). Yet few studies have empirically compared the relative contributions of function words and content words. To test this, we masked out all content words in the original texts with the mask token recognized by the transformer models. For comparison, we also created masked texts containing only content words. Punctuation and the relative positions between words were retained, as this allows the model to maximally exploit the spatial layout of content/function words. (A sketch of these perturbations follows the results below.)

Results. While the importance of lexical information in authorship analysis has been emphasized, it has been suggested that lexical information alone is insufficient in forensic linguistics (Grant and MacLeod, 2020). Our results in Table 2 suggest that, even with only lexical information, model performance is only about 4% lower than that of models with access to all information. Syntactic and discourse ordering do contribute to author identities, yet the contributions are relatively minor. In forensic linguistics, it is commonly the case that only fragmentary texts are available (Grant, 2012), and our findings suggest that even without broader discourse information, it is still possible to estimate author identity with good confidence. The weak contribution of discourse coherence to authorship analysis highlights that the high-level organization of texts is only somewhat consistent within authors, something that has been mentioned but rarely tested in forensic linguistics (Grant and MacLeod, 2020).

From Table 3, it is apparent that, even with half of the words masked out, the transformed texts still contain an abundance of reliable stylometric cues to individual writers, such that the overall accuracy is not significantly lower than that of models with full texts. While the importance of function words in authorship analysis has been emphasized (Kestemont, 2014), content words seem to convey slightly more idiolectal cues despite the topical variation. Both SBERT and SRoBERTa achieve similar performance on the Amazon and Reddit data, yet SRoBERTa better exploits the individual variation in content words. These results strongly suggest that there are unique individual styles that are stable across topics, and our additional probing also reveals that topic information is significantly reduced in the learned embeddings (see Appendix F).
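The three perturbations can be sketched as follows. Tokenization with NLTK and the coarse POS-based definition of content words are our assumptions, as the text does not specify these implementation details.

```python
import random
import nltk  # requires: nltk.download("punkt"), averaged_perceptron_tagger, universal_tagset

def lexical_perturb(text, rng):
    """Shuffle all tokens: destroys syntax and discourse, keeps only the lexicon."""
    tokens = text.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

def lexico_syntactic_perturb(text, rng):
    """Shuffle sentences: keeps within-sentence syntax, destroys discourse order."""
    sents = nltk.sent_tokenize(text)
    rng.shuffle(sents)
    return " ".join(sents)

def mask_content_words(text, mask_token="<mask>"):
    """Keep function words, punctuation, and word positions; mask everything else.

    Identifying content words by coarse POS tag is an assumption about the
    procedure; <mask> is RoBERTa's mask token (BERT uses [MASK]).
    """
    content_tags = {"NOUN", "VERB", "ADJ", "ADV"}
    tagged = nltk.pos_tag(nltk.word_tokenize(text), tagset="universal")
    return " ".join(mask_token if tag in content_tags else tok
                    for tok, tag in tagged)

rng = random.Random(0)
print(lexical_perturb("I went out. I bought durians.", rng))
print(lexico_syntactic_perturb("I went out. I bought durians.", rng))
```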
                           Amazon                              Reddit
                           Test (Permuted)  Test (Original)    Test (Permuted)  Test (Original)
Lexical          SBERT     0.716/0.798      0.736/0.794        0.639/0.668      0.657/0.667
                 SRoBERTa  0.781/0.856      0.751/0.850        0.654/0.734      0.686/0.714
Lexico-syntactic SBERT     0.767/0.851      0.770/0.852        0.681/0.722      0.685/0.724
                 SRoBERTa  0.820/0.895      0.822/0.896        0.730/0.798      0.732/0.800

Table 2: Results (F1/AUC) on permuted examples show that the models are largely insensitive to syntactic and ordering variation; instead, idiolect is mostly captured through lexical variation.

                    Amazon        Reddit
Function  SBERT     0.775/0.856   0.680/0.744
          SRoBERTa  0.786/0.858   0.683/0.755
Content   SBERT     0.735/0.812   0.641/0.689
          SRoBERTa  0.795/0.870   0.708/0.768

Table 3: Results (F1/AUC) show that function or content words alone are reliable authorial cues. For SRoBERTa, content words seem to convey slightly more idiolectal cues despite topical variation.

6.2 Analysis of tokenization methods

We hypothesized that the large performance gap between BERT and RoBERTa (~5%) could be caused by the discrepancy in tokenization methods. The BERT tokenizer is learned after preprocessing the texts with heuristic rules (Devlin et al., 2019), whereas the BPE tokenizer for RoBERTa is learned without any additional preprocessing or tokenization of the input (Liu et al., 2019).

Method. To verify this, we trained several lightweight Siamese LSTM models from scratch that differed only in tokenization method: 1) a word-based tokenizer with the vocabulary size set to either 30k or 50k to match the sizes of the BPE encodings; 2) the pre-trained WordPiece tokenizers for bert-base-uncased and bert-base-cased; 3) the pre-trained tokenizer for roberta-base. Implementation details are given in Appendix D.

Tokenization    Accuracy  F1     AUC
Word-30k        67.8%     0.675  0.745
Word-50k        67.8%     0.679  0.744
BERT-uncased    67.1%     0.680  0.737
BERT-cased      67.2%     0.677  0.737
RoBERTa         73.1%     0.734  0.804

Table 4: The effect of tokenization methods on model performance on the Amazon reviews. The pre-trained RoBERTa BPE tokenizer encodes more textual variation than the rest.

Results. As shown in Table 4, the RoBERTa tokenizer outperforms the other tokenizers by a significant margin, even though it has a similar number of parameters to Word-50k. Interestingly, the pre-trained BERT tokenizer is not superior to the word-based tokenizer, despite better handling of out-of-vocabulary (OOV) tokens. For the word-based tokenizers, increasing the vocabulary from 30k to 50k does not bring any improvement, indicating that many tokens were unused during evaluation. These results strongly suggest that choosing an expressive tokenizer, such as the RoBERTa BPE tokenizer trained directly on raw texts, can effectively encode more stylistic features. For example, for the word cat, the RoBERTa tokenizer gives different encodings for its common variants such as Cat, CAT, [space]CAT or caaats, whereas these are all treated as OOV tokens by the word-based tokenizations. While the BERT tokenizer handles most variants, it fails to encode formatting variation such as CAT, [space]CAT and [space][space]CAT. Such nuances in formatting are an essential dimension of idiolectal variation in the forensic analysis of electronic communications (Grant, 2012; Grant and MacLeod, 2020).
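This contrast can be inspected directly with the pre-trained tokenizers; the snippet below simply prints how each tokenizer segments formatting variants of one word. The uncased WordPiece tokenizer collapses casing and leading whitespace, while the BPE tokenizer preserves them as distinct pieces.

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
roberta = AutoTokenizer.from_pretrained("roberta-base")

# Formatting variants of the same word, including leading spaces and elongation.
for variant in ["cat", "Cat", "CAT", " CAT", "  CAT", "caaats"]:
    print(repr(variant), bert.tokenize(variant), roberta.tokenize(variant))
```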
7 Characterizing idiolectal styles

In this section, we turn our attention to distinctiveness and consistency in writing styles, both of which are key theoretical assumptions in forensic linguistics (Grant, 2012).

Distinctiveness. We examine inter-author variation through inter-author distinctiveness by constructing a graph that connects users with similar style embeddings. For each user in the test set, we randomly sampled one text sample and extracted its embedding through the Siamese models. We then created the pairwise similarity matrix $M$ by computing the similarity between each text pair. $M$ is pruned by removing entries below a threshold $\tau_{cutoff}$, the same threshold $\tau_t$ that is used to determine SA or DA pairs. The pruned matrix $\hat{M}$ is treated as the graph adjacency matrix from which a network $G$ is constructed. Each user's distinctiveness is then

$$S_i = 1 - \frac{\sum_{j}^{N} \mathbb{I}[\,j \in V_i\,]}{N} \qquad (5)$$

where $V_i$ is the set of neighbors of node $i$ in $G$, $N$ is the total node count, and $\mathbb{I}[\cdot]$ is the indicator function; $\sum_{j}^{N} \mathbb{I}[j \in V_i]$ is the degree centrality of node $i$. We found that features from the unweighted graph are perfectly correlated with those from the weighted graph, so the unweighted graph is kept for computational efficiency. The scores were averaged over 5 runs. The intuition is that, since authors are connected to similar authors, the more neighbors an author has, the less distinctive their style is. A distinctiveness of 0.6 implies that this author is different from 60% of the authors in the dataset.
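A sketch of this measure, operating on one embedding per user; the default cutoff value here (the margin midpoint in cosine-similarity space) is our assumption.

```python
import numpy as np

def distinctiveness(embeddings, tau_cutoff=0.5):
    """Degree-based distinctiveness (Eq. 5) from one embedding per user.

    Connects users whose cosine similarity exceeds tau_cutoff, then scores
    each user by 1 minus their normalized degree in the resulting graph.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    M = z @ z.T                        # pairwise cosine similarity
    np.fill_diagonal(M, 0.0)           # no self-loops
    A = (M > tau_cutoff).astype(int)   # pruned, unweighted adjacency matrix
    n = A.shape[0]
    return 1.0 - A.sum(axis=1) / n     # Eq. 5: 1 - degree / N
```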
Consistency. We also measured intra-author consistency in style through concentration in the latent space. Concentration can be quantified by the conicity, a measure of the averaged vector alignment to mean (ATM) (Chandrahas et al., 2018):

$$\mathrm{Conicity}(\mathcal{V}) = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \mathrm{ATM}(v, \mathcal{V}) \qquad (6)$$

$$\mathrm{ATM}(v, \mathcal{V}) = \mathrm{cosine}\Big(v,\ \frac{1}{|\mathcal{V}|} \sum_{x \in \mathcal{V}} x\Big) \qquad (7)$$

where $v \in \mathcal{V}$ is a latent vector in the set of vectors $\mathcal{V}$. The ATM measures the cosine similarity between $v$ and the centroid of $\mathcal{V}$, whereas the conicity indicates the overall clusteredness of the vectors in $\mathcal{V}$ around the centroid. If all texts written by the same user are highly aligned around their centroid, with a conicity close to 1, that user is highly consistent in writing style.
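Conicity reduces to a few lines of NumPy; this sketch assumes one user's text embeddings arrive as rows of a matrix.

```python
import numpy as np

def conicity(vectors):
    """Conicity (Eqs. 6-7): mean cosine between each vector and the centroid.

    A value near 1 means the user's texts cluster tightly in style space,
    i.e., a highly consistent writing style.
    """
    centroid = vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return float((v @ centroid).mean())   # average ATM over the user's texts
```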
Analysis. The distributions of style distinctiveness and consistency both conform to a normal distribution (Figure 2), yet no meaningful correlation exists between the two measures (Amazon: Spearman's ρ = 0.078; Reddit: Spearman's ρ = 0.11). In general, users are highly consistent in their writing styles even in such a large population, with an average consistency of 0.8, much higher than that of random samples (~0.4). Users are also quite distinctive from one another: on average, a user's style is different from 80% of the users in the population pool. Yet individuals do differ in their degrees of distinctiveness and consistency, which may be taken into consideration in forensic linguistics, because inconsistency or indistinctiveness may weaken the linguistic evidence to be analyzed.

[Figure 2: The joint distribution of distinctiveness and consistency on the Amazon reviews as computed with SRoBERTa embeddings. Both follow normal distributions, yet there is no meaningful correlation between them, suggesting that these two dimensions may vary independently of each other.]

Most distinctive: "its ok seems like a reprint i mean its not horrible but i was expecting a lil better qaulity but if i wore to do it again yes i would still buy this poster its not blurry or anything but if you have a good eye it seems a lil like a reprint"

Least distinctive: "Nice, thinner style plates that are well suited for building Lego projects. They hold Lego pieces securely and match up perfectly. Also, as a big PLUS for this company you get amazing customer service."

Table 5: Sample Amazon review excerpts with the most and the least distinctive style as predicted by SRoBERTa.

In Table 5, the least distinctive text is characterized by plain language, proper formatting, and typical content, reflecting the unmarked style of stereotypical Amazon reviews; yet even this review is still quite distinctive, as it differs from 60% of the total reviews. The most distinctive review exhibits multiple deviations from the norms of the genre: the style is unconventional, with uncapitalized letters, run-on sentences, typos, a lack of periods, and colloquial alternative spellings such as "haft", "lil" and "wore", all of which make the review highly marked. For style consistency, the most consistent writers incline towards similar formatting, emojis, and narrative perspectives across reviews, whereas the least consistent users tend to shift across registers and perspectives in their writing (see Appendix E for additional samples).

We also tested how different groups of authors affect verification performance. To avoid circular validation resulting from repeatedly using the same training data, we retrained the model on the repartitioned test data and evaluated it on the development set. The original test set was repartitioned into three disjoint chunks of equal size, each containing authors solely from the top, middle, or bottom 33% in terms of distinctiveness or consistency.
Results in Table 6 suggest that, while most models performed similarly, models trained on inconsistent or indistinctive authors significantly underperformed. This result may have implications for comparative authorship analysis, in that it is desirable to control the number of inconsistent or indistinctive authors in the dataset.

             Range     Amazon        Reddit
Consistent   Random    0.806/0.875   0.677/0.761
             High      0.801/0.879   0.680/0.767
             Moderate  0.805/0.874   0.672/0.757
             Low       0.797/0.867   0.661/0.716
Distinctive  Random    0.808/0.876   0.695/0.754
             High      0.803/0.874   0.709/0.750
             Moderate  0.808/0.880   0.663/0.731
             Low       0.793/0.861   0.611/0.680

Table 6: Performance (F1/AUC) on different partitions of the data. Models trained on inconsistent or indistinctive authors significantly underperformed.

8 Compositionality of styles

Finally, we sought to understand how stylistic variations are encoded. At least for certain stylistic features, there is additive stylistic compositionality in the latent space onto which the texts are projected (Figure 1).
Method. In light of the word analogy task for word embeddings (Mikolov et al., 2013; Linzen, 2016), we designed a series of linguistic stimuli that vary systematically in style to probe the structure of the stylistic embeddings. For each stylistic dimension, we created $n$ text embedding pairs $P = [(p_r^1, p_m^1), \ldots, (p_r^n, p_m^n)]$, where $p_r^i$ is the embedding of a randomly sampled text and $p_m^i$ is the embedding of a modified version of that text that differs from it in only one stylistic aspect. For samples $i$ and $j$ from $P$, we quantified

$$S_{ij} = \mathrm{cosine}(p_r^i - p_m^i,\ p_r^j - p_m^j) \qquad (8)$$

As in the word analogy task, if a stylistic dimension is encoded in only one direction, we should expect $S_{ij}$ to be close to 1.

For a target qualitative style shift, we randomly sampled 1000 texts and modified each text to approximate the target stylistic dimension. For example, if null subject is the target feature, we remove the subject "I", turning "I recommend crispy pork!" into "Recommend crispy pork!". We then compute $S_{ij}$ for each pair, totaling 499,500 possible comparisons. Here we selected 10 stylistic markers of textual formality for evaluation (MacLeod and Grant, 2012; Biber and Conrad, 2019); these textual modifications cover insertion, deletion, and replacement operations.
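A sketch of this probe, assuming the original and modified embeddings arrive as aligned NumPy arrays:

```python
import numpy as np

def offset_alignment(p_r, p_m):
    """Mean pairwise cosine between style-shift offset vectors (Eq. 8).

    p_r, p_m: (n, dim) embeddings of original texts and their modified
    counterparts; returns S_ij averaged over all n*(n-1)/2 pairs i < j.
    """
    offsets = p_r - p_m
    offsets /= np.linalg.norm(offsets, axis=1, keepdims=True)
    S = offsets @ offsets.T
    iu = np.triu_indices(len(offsets), k=1)   # all unordered comparisons
    return float(S[iu].mean())
```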
For quantitative style shifts, we measure $S_k$ between samples as well as the difference in offset length. We compare two embedding pairs $(p_r^k, p_s^k)$ and $(p_r^k, p_l^k)$, where both $p_s^k$ and $p_l^k$ differ from $p_r^k$ in only one stylistic dimension, but $p_l^k$ is further along that dimension than $p_s^k$. For instance, compared to the original review $p_r^k$, $p_s^k$ contains five more "!!!" tokens whereas $p_l^k$ contains ten more. Here we surveyed four stylistic markers of formality:

$$S_k = \mathrm{cosine}(p_l^k - p_r^k,\ p_s^k - p_r^k) \qquad (9)$$

$$\Delta\mathrm{norm}_k = \lVert p_l^k - p_r^k \rVert_2 - \lVert p_s^k - p_r^k \rVert_2$$

For each stylistic shift, we collected 2000 samples, each containing at least 8 markers of that style. We then modified the original review $p_r^k$ toward the target style by incrementally transforming the keywords into the target keywords by 50% ($p_s^k$) and 100% ($p_l^k$). If these styles are highly organized, we should expect $S_k$ to be close to 1, indicating that changes along the same dimension point in the same direction. We also expect quantitative changes to be reflected in the length difference of the style vectors, in both magnitude ($|\Delta\mathrm{norm}_k|$) and sign ($\Delta\mathrm{norm}_k$ being positive or negative).

Results. Table 7 shows that both models outperform the random baseline ($S_{ij}$ computed on the same samples after randomly replacing some words). Like word embeddings, stylistic embeddings exhibit a linear additive relationship between various stylistic attributes. In Figure 1, converting all letters to lower case causes the textual representations to move collectively in approximately the same direction. Despite such regularities, the offset vectors for style shifts were not perfectly aligned in all instances, which may be attributed to variation across texts.

Style shift            Amazon  Reddit
Random                 0.032   0.002
and → &                0.698   0.484
. → space              0.389   0.352
. → !!!!               0.569   0.679
lower-cased            0.396   0.660
upper-cased            0.463   0.470
I → ∅                  0.420   0.398
going to → gonna       0.306   0.390
want to → wanna        0.478   0.505
-ing → -in'            0.522   0.410
w/o elongation → w/    0.484   0.343

Table 7: S_ij for qualitative style shifts, averaged across all comparisons. The offset vectors for style shifts are highly aligned.

For quantitative changes, SRoBERTa encodes the same type of change in the same direction on both the Amazon and Reddit data, as the vector offsets are highly aligned with each other (Table 8). Moreover, greater degrees of style shift relative to the original text translate into a larger magnitude of change along that direction (∆Norm in Table 8). In Figure 1, after removing either the first three instances or all occurrences of "I" from the original text, the resulting representations shift in the same direction but differ in magnitude. Such changes cannot be explained by random variation, suggesting that both models learn to encode fine-grained stylistic dimensions in the latent space through the proxy task.

              Amazon            Reddit
Style shift   S_k     ∆Norm     S_k     ∆Norm
Random        0.686   0.059     0.692   0.059
and → &       0.957   0.176     0.922   0.124
. → !!!!      0.964   0.085     0.953   0.072
I → ∅         0.860   0.257     0.831   0.273
-ing → -in'   0.933   0.103     0.928   0.097

Table 8: S_k and ∆Norm for quantitative style shifts. The direction and the magnitude of change encode the type and the degree of style shift. ∆Norm is positive, suggesting that more manipulations result in a longer offset vector along that direction.

While we only examined several stylistic markers, the learned style representations also exhibit regularities for other lexical manipulations, as long as the manipulation is systematic and regular across samples. One explanation is that the model is systematically tracking fine-grained variation in lexical statistics. Yet the proposed model must also encode more abstract linguistic features, because it outperformed GLAD (Hürlimann et al., 2015) and FVDNN (Weerasinghe and Greenstadt, 2020), which also track bag-of-words or bag-of-n-gram features. Previous research on word embeddings attributes performance on the analogy task to co-occurrence patterns (Pennington et al., 2014; Levy et al., 2015; Ethayarajh et al., 2019). The fact that these variations are encoded systematically, beyond random variation and at such a fine grain, indicates that they are stylistic dimensions along which individual choices vary frequently and regularly.
9 Conclusions

The relatively unconstrained nature of online genres tolerates a much wider range of stylistic variation than conventional genres (Hovy et al., 2015). Online genres are often marked by unconventional spellings, heavy use of colloquial language, extensive deviations in formatting, and the relaxation of grammatical rules, providing rich linguistic resources with which to construct and perform one's identity. Our analysis of idiolects in online registers has highlighted that idiolectal variation permeates all linguistic levels, present both in surface lexico-syntactic features and in high-level discourse organization. Traditional sociolinguistic research often regards idiolects as idiosyncratic and unstable, not as regular as sociolects (Labov, 1989; Barlow, 2018); here, we show that idiolectal variation is not only highly distinctive but also consistent, even in a relatively large population. Our findings suggest that individuals may differ considerably in their degrees of consistency and distinctiveness across multiple text samples, which sheds light on theoretical discussions and practical applications in forensic linguistics. Our findings also have implications for sociolinguistics, as we have shown an effective method to discover, understand, and exploit sociolinguistic variation.
10 Ethical considerations

While this study is theoretically driven, we are aware that there may be ethical concerns with the models surveyed in this study. While computational stylometry can be applied to forensic investigations (Grant and MacLeod, 2020), authorship verification in online social networks, if put to malicious use, may weaken the anonymity of some users, leading to potential privacy issues. Our results also show that caution should be taken when deploying these models in forensic scenarios, as different models or tokenizers might show different inductive biases that may bias them towards certain types of users. Another potential bias is that we only selected a small group of the most productive writers from the pool (less than 20% of all data), and this sample might not necessarily represent all populations. We urge that caution be exercised when using these models in real-life settings.

We still consider that the benefits of our study outweigh the potential dangers. Deep learning-based stylometry has been an active research area in recent years. While many studies focus on improving performance, we provide insights into how some of these models make decisions based on certain linguistic features and expose some of the models' inductive biases. This interpretability analysis could be used to guide the proper use of these methods in decision-making processes. The analysis could also be useful in developing adversarial techniques that guard against the malicious use of such technologies.

All of our experiments were performed on public data, in accordance with the terms of service, and all authors were anonymized. In addition, the term "Siamese" may be considered offensive when it refers to some groups of people. Our use of the term follows the naming of a mathematical model in the machine learning literature; we use the word here to refer to a neural network model and make no reference to particular population groups.

Acknowledgements

We thank Professor Patrice Beddor and Jiaxin Pei at the University of Michigan and Zuoyu Tian at Indiana University Bloomington for their helpful discussions. We are also grateful for the comments from the anonymous reviewers, which helped improve the paper greatly. This material is based upon work supported by the National Science Foundation under Grant No. 1850221.

References

Michael Barlow. 2013. Individual differences and usage-based grammar. International Journal of Corpus Linguistics, 18(4):443–478.

Michael Barlow. 2018. The individual and the group from a corpus perspective. In The Corpus Linguistic Discourse: In Honour of Wolfgang Teubert, pages 163–184. Amsterdam.

Angelo Basile, Albert Gatt, and Malvina Nissim. 2019. You write like you eat: Stylistic variation as a predictor of social stratification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2583–2593, Florence, Italy. Association for Computational Linguistics.

Billal Belainine, Fatiha Sadat, Mounir Boukadoum, and Hakim Lounis. 2020. Towards a multi-dataset for complex emotions learning based on deep neural networks. In Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources, pages 50–58, Marseille, France. European Language Resources Association.

Douglas Biber and Susan Conrad. 2019. Register, Genre, and Style. Cambridge University Press.

Bernard Bloch. 1948. A set of postulates for phonemic analysis. Language, 24(1):3–46.

Benedikt Boenninghoff, Steffen Hessler, Dorothea Kolossa, and Robert M. Nickel. 2019a. Explainable authorship verification in social media via attention-based similarity learning. In 2019 IEEE International Conference on Big Data, pages 36–45. IEEE.

Benedikt Boenninghoff, Robert M. Nickel, Steffen Zeiler, and Dorothea Kolossa. 2019b. Similarity learning for authorship verification in social media. In ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2457–2461. IEEE.

Marcelo Luiz Brocardo, Issa Traore, Sherif Saad, and Isaac Woungang. 2013. Authorship verification for short messages using stylometry. In 2013 International Conference on Computer, Information and Telecommunication Systems (CITS), pages 1–6. IEEE.

Chandrahas, Aditya Sharma, and Partha Talukdar. 2018. Towards understanding the geometry of knowledge graph embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 122–131, Melbourne, Australia. Association for Computational Linguistics.

Jonathan P. Chang, Caleb Chiam, Liye Fu, Andrew Wang, Justine Zhang, and Cristian Danescu-Niculescu-Mizil. 2020. ConvoKit: A toolkit for the analysis of conversations. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 57–60, 1st virtual meeting. Association for Computational Linguistics.
Malcolm Coulthard. 2004. Author identification, idiolect, and linguistic uniqueness. Applied Linguistics, 25(4):431–447.

Malcolm Coulthard, Alison Johnson, and David Wright. 2016. An Introduction to Forensic Linguistics: Language in Evidence. Routledge.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Penelope Eckert. 2012. Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41:87–100.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Understanding undesirable word embedding associations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1696–1705, Florence, Italy. Association for Computational Linguistics.

Lucie Flekova, Daniel Preoţiuc-Pietro, and Lyle Ungar. 2016. Exploring stylistic variation with age and income on Twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 313–319, Berlin, Germany. Association for Computational Linguistics.

Tim Grant. 2012. Txt 4n6: Method, consistency, and distinctiveness in the analysis of SMS text messages. Journal of Law and Policy, 21:467.

Tim Grant and Nicci MacLeod. 2020. Language and Online Identities: The Undercover Policing of Internet Sexual Crime. Cambridge University Press.

Tim Grant and Nicola MacLeod. 2018. Resources and constraints in linguistic identity performance – a theory of authorship. Language and Law/Linguagem e Direito, 5(1):80–96.

Susan C. Herring. 2004. Computer-Mediated Discourse Analysis: An Approach to Researching Online Behavior. Learning in Doing, pages 338–376. Cambridge University Press, New York, NY, US.

David I. Holmes. 1998. The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3):111–117.

Dirk Hovy, Anders Johannsen, and Anders Søgaard. 2015. User review sites as a resource for large-scale sociolinguistic studies. In Proceedings of the 24th International Conference on World Wide Web, pages 452–461.

Manuela Hürlimann, Benno Weck, Esther van den Berg, Simon Suster, and Malvina Nissim. 2015. GLAD: Groningen Lightweight Authorship Detection. In CLEF (Working Notes).

Alison Johnson and David Wright. 2014. Identifying idiolect in forensic authorship attribution: An n-gram textbite approach. Language and Law, 1(1).

Barbara Johnstone and Judith Mattson Bean. 1997. Self-expression and linguistic variation. Language in Society, 26(2):221–246.

Mike Kestemont. 2014. Function words in authorship attribution: From black magic to theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL), pages 59–66, Gothenburg, Sweden. Association for Computational Linguistics.

Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Martin Potthast, and Benno Stein. 2020. Overview of the cross-domain authorship verification task at PAN 2020. In CLEF.

Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Walter Daelemans, Martin Potthast, and Benno Stein. 2019. Overview of the cross-domain authorship attribution task at PAN 2019. In CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org.

Mike Kestemont, Michael Tschuggnall, Efstathios Stamatatos, Walter Daelemans, Günther Specht, Benno Stein, and Martin Potthast. 2018. Overview of the author identification task at PAN-2018: Cross-domain authorship attribution and style change detection. In Working Notes Papers of the CLEF 2018 Evaluation Labs, volume 2125 of CEUR Workshop Proceedings. CEUR-WS.org.

Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. 2020. Proxy anchor loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3238–3247.

William Labov. 1972. Sociolinguistic Patterns. University of Pennsylvania Press.

William Labov. 1989. Exact description of the speech community: Short a in Philadelphia. In Language Change and Variation, pages 1–57.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18, Berlin, Germany. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Nicci MacLeod and Tim Grant. 2012. Whose tweet? Authorship analysis of micro-blogs and other short-form messages. In Proceedings of the International Association of Forensic Linguists' Tenth Biennial Conference.

Miriam Meyerhoff and James A. Walker. 2007. The persistence of variation in individual grammars: Copula absence in 'urban sojourners' and their stay-at-home peers, Bequia (St Vincent and the Grenadines). Journal of Sociolinguistics, 11(3):346–366.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.

Tempestt Neal, Kalaivani Sundararajan, Aneez Fatima, Yiming Yan, Yingfei Xiang, and Damon Woodard. 2017. Surveying stylometry techniques and applications. ACM Computing Surveys (CSUR), 50(6):1–36.

Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, Hong Kong, China. Association for Computational Linguistics.

Juanita Ordoñez, Rafael Rivera Soto, and Barry Y. Chen. 2020. Will longformers PAN out for authorship verification. Working Notes of CLEF.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Natalie Schilling-Estes. 1998. Investigating "self-conscious" speech: The performance register in Ocracoke English. Language in Society, 27(1):53–83.

M. Teresa Turell. 2010. The use of textual, grammatical and sociolinguistic evidence in forensic text comparison. International Journal of Speech, Language & the Law, 17(2).

Soroush Vosoughi, Helen Zhou, and Deb Roy. 2015. Digital stylometry: Linking profiles across social networks. In International Conference on Social Informatics, pages 164–177. Springer.

Ronald Wardhaugh. 2011. An Introduction to Sociolinguistics, volume 28. John Wiley & Sons.

Janith Weerasinghe and Rachel Greenstadt. 2020. Feature Vector Difference based neural network and logistic regression models for authorship verification: Notebook for PAN at CLEF 2020. In CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org.

David Wright. 2013. Stylistic variation within genre conventions in the Enron email corpus: Developing a text-sensitive methodology for authorship research. International Journal of Speech, Language & the Law, 20(1).

David Wright. 2017. Using word n-grams to identify authors and idiolects: A corpus approach to a forensic linguistic problem. International Journal of Corpus Linguistics, 22(2):212–241.

David Wright. 2018. Idiolect. Oxford Bibliographies in Linguistics.
A Hyperparameter tuning

In this section, we report the results of our hyperparameter tuning process. Table 9 reports some additional results obtained during tuning. Changing the masking probability or the margins for the contrastive loss has an impact on the final accuracy. This search was manual and not exhaustive; we used the best parameters in the main text.

For the actual implementation, we used an effective batch size of 256. The default optimizer was Adam with a learning rate of 1e-5. All models were trained on a single Nvidia V100 GPU with 16GB of memory. The models were set to train for 5 epochs, but we applied early stopping when the validation accuracy stopped increasing. Each epoch took about 2 hours to complete. For each model, we limited the maximum length of text samples to 100 tokens, though the actual definition of a token depended on the tokenizer used.

Model                         Accuracy  F1     AUC
τs = 0.7, τd = 0.3, α = 30
  SRoBERTa - Amazon           0.744     0.676  0.901
  SRoBERTa - Reddit           0.723     0.713  0.804
τs = 0.8, τd = 0.2, α = 30
  SRoBERTa - Amazon           0.809     0.814  0.889
  SRoBERTa - Reddit           0.708     0.722  0.784
τs = 0.6, τd = 0.4, α = 10
  SRoBERTa - Amazon           0.822     0.826  0.903
  SRoBERTa - Reddit           0.730     0.734  0.810
τs = 0.6, τd = 0.4, α = 5
  SRoBERTa - Amazon           0.819     0.825  0.901
  SRoBERTa - Reddit           0.720     0.723  0.798

Table 9: Results of authorship verification on the development/test sets.

B Baseline methods

We ran several baseline models for comparison. When setting up these models, we tried to make minimal changes to the original implementations. Details of our changes are provided below.

GLAD. We used the original code for GLAD (https://github.com/pan-webis-de/huerlimann15). The linguistic features were extracted using the combo4 options in the code, which cover 23 linguistic features. While a support vector machine (SVM) was used as the classifier in the original paper (Hürlimann et al., 2015), we found that the SVM did not scale to the size of our data. Instead, we ran a logistic regression model on the features, as this allowed us to interpret the feature importance. The performance of logistic regression was very close to that of the Random Forest classifier in their code.

FVD. The model was trained with the code released by the original authors (https://github.com/pan-webis-de/weerasinghe20) (Weerasinghe and Greenstadt, 2020). We kept the original feature extraction methods and model architecture. The input to the neural network was a 2314-dimensional feature vector, computed by taking the absolute difference between the linguistic feature vectors of the two authors under comparison. The two-layer fully connected neural network was trained for 100 epochs, and the model with the best validation accuracy was kept.

AdHominem. We used the original implementation provided by the author (https://github.com/boenninghoff/AdHominem). While this implementation is slightly different from that described in Boenninghoff et al. (2019a), no modification was made to the code other than adapting it to work on our own data; the same pre-processing method, model architecture, and parameters were kept. The evaluation code was not used, however, as it ignores uncertain samples, which is a standard practice in PAN 2020 (Kestemont et al., 2020). The model was trained for 5 epochs, and we only kept the model with the best validation results.

BERTConcat/RoBERTaConcat. We mostly followed the description by Ordoñez et al. (2020). While Ordoñez et al. (2020) used a variant of RoBERTa known as the Longformer, we re-implemented the model using the original pre-trained RoBERTa so that the model could be directly compared to the Siamese version. Since the Longformer is highly similar to RoBERTa and BERT, we do not expect a significant performance gap between them.

Evaluation. To ensure consistency, all evaluation metrics were computed with the functions in scikit-learn: accuracy_score for accuracy, f1_score for F1, and roc_auc_score for AUC.
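For example, with toy gold labels and hypothetical model similarity scores, the three metrics are computed as:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0]                   # gold same-author labels (toy values)
y_score = [0.9, 0.2, 0.6, 0.4, 0.3]        # model similarity scores (toy values)
y_pred = [int(s > 0.5) for s in y_score]   # decisions thresholded at the margin midpoint

print(accuracy_score(y_true, y_pred),
      f1_score(y_true, y_pred),
      roc_auc_score(y_true, y_score))      # AUC uses the raw scores, not decisions
```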
C Error analysis

We also analyzed the error distributions across different conditions, shown in Table 10. Given a pair of texts, we categorized it into four conditions: same-author (SA) or different-author (DA), and same-domain (SD) or different-domain (DD). Unsurprisingly, most methods still struggle with SA-DD and DA-SD pairs, suggesting that domain-specific/topic information partially interferes with the extraction of writing styles. The cases of RoBERTaConcat and BERTConcat are particularly interesting, as both models consistently performed worse on SA pairs but outperformed the rest of the models on DA pairs. Cosine distance-based models seem to better balance the trade-off across conditions. This shows that model architectures also exhibit inductive biases of their own, which may make them more or less effective in certain conditions.

Model           SA-SD   SA-DD   DA-SD   DA-DD
GLAD            73.7%   58.4%   62.9%   73.6%
FVDNN           75.0%   67.5%   55.7%   62.3%
AdHominem       81.2%   73.1%   63.6%   75.4%
BERTConcat      61.8%   51.6%   85.3%   86.5%
SBERT           84.1%   77.2%   67.7%   78.0%
RoBERTaConcat   64.4%   49.6%   89.2%   92.5%
SRoBERTa        88.4%   82.5%   74.8%   83.2%

Table 10: Accuracy by domain and authorship. All experiments were run on the Amazon reviews.

D Comparing tokenization methods

Tokenization methods. For word-based tokenization, we used the word_tokenize function in NLTK. Either the 30k or the 50k most frequent lexical tokens were kept as the vocabulary for training the LSTM model, plus a padding token and an OOV token. For the subword tokenizers, we directly used the pre-trained tokenizers for BERT and RoBERTa, accessed through HuggingFace's Transformers.

Model specification. The underlying model is an LSTM-based Siamese network, sketched below. The model consists of two bidirectional LSTM layers with 300 hidden states per direction. The last hidden states of the last layer in the forward and backward directions are concatenated as the representation of the whole input text, which is then passed to a two-layer fully connected network with 300 hidden states in each layer. The similarity between the two paired texts is computed with the cosine distance function. The hyperparameters for the loss function were τd = 0.4 and τs = 0.6. No pre-trained word embedding weights were used; all weights were trained from scratch.
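A sketch of this Siamese LSTM follows; the embedding dimension is an assumption, as the appendix does not state it.

```python
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    """Lightweight Siamese encoder used for the tokenizer comparison.

    Two BiLSTM layers (300 units per direction); the final forward and
    backward states are concatenated and passed through a two-layer MLP.
    The embedding dimension (300) is an assumption.
    """
    def __init__(self, vocab_size, emb_dim=300, hidden=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, 300), nn.ReLU(),
                                 nn.Linear(300, 300))

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        # h_n: (num_layers * 2, batch, hidden); take the last layer's
        # forward (h_n[-2]) and backward (h_n[-1]) states.
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.mlp(h)

def similarity(model, ids_a, ids_b):
    # Cosine similarity between the two branches of the Siamese network.
    return torch.cosine_similarity(model(ids_a), model(ids_b), dim=-1)
```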
Training details. The training, development, and test data were the same as in the main experiments. The model was optimized with the Adam optimizer with a learning rate of 0.001. Gradient clipping was applied to stabilize training, with the maximum gradient norm set to 1. The model was trained on an 11GB RTX 2080Ti with an effective batch size of 256 for 10 epochs. The average training time for each model was about 3 hours.

E Distinctiveness and consistency

Here we also show the joint distribution of distinctiveness and consistency given by SBERT in Figure 3. The shape of the distribution is a bivariate normal distribution, and the two metrics are not correlated.

The overall distributions of distinctiveness and consistency computed with the different models are given in Figure 4. The distribution of style distinctiveness conforms to a normal distribution regardless of the model, though the distribution is more peaked for better models. The distribution of style consistency also conforms to a normal distribution, and the distributions predicted by different models are highly similar.

Correlations across models. We used Spearman correlation coefficients to assess to what extent different models assign similar rankings of distinctiveness and consistency. Results are presented in Table 11: the consistency scores given by the models are moderately correlated, yet the correlations for distinctiveness are generally weak.

F Additional analysis: characterizing sociolects

Language varies at both individual and collective levels (Eckert, 2012). In this section, diagnostic classification is employed to probe to what extent collective language variations are retained in the stylistic embeddings.