Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers Reveals Distinctive yet Consistent Individual Styles


Jian Zhu
Department of Linguistics
University of Michigan
lingjzhu@umich.edu

David Jurgens
School of Information
University of Michigan
jurgens@umich.edu

arXiv:2109.03158v3 [cs.CL] 10 Sep 2021

Abstract

An individual's variation in writing style is often a function of both social and personal attributes. While structured social variation has been extensively studied, e.g., gender-based variation, far less is known about how to characterize individual styles due to their idiosyncratic nature. We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features. The neural model achieves strong performance at authorship identification on short texts and, through an analogy-based probing task, shows that the learned representations exhibit surprising regularities that encode qualitative and quantitative shifts of idiolectal styles. Through text perturbation, we quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, we provide a description of idiolects through measuring inter- and intra-author variation, showing that variation in idiolects is often distinctive yet consistent.

1 Introduction

Linguistic identities manifest through ubiquitous language variation. The notion that language functions as a stylistic resource for the construction and performance of social identity rests upon two theoretical constructs: sociolect and idiolect (Grant and MacLeod, 2018). The term 'sociolect' refers to socially structured variation at the group level, whereas 'idiolect' denotes language variation associated with individuals (Wardhaugh, 2011; Grant and MacLeod, 2018). Variationist sociolinguistics emphasizes the systematic variation of sociolects such as gender, ethnicity, and socioeconomic stratification (Labov, 1972). While a central concept in sociolinguistics, idiolect has received far more research attention in forensic linguistics (Wright, 2018; Grant and MacLeod, 2018).

Although idiolects have played a central role in stylometry and forensic linguistics, which seek to quantify and characterize individual textual features to separate authors (Grant, 2012; Coulthard et al., 2016; Neal et al., 2017), the theory of idiolect remains comparatively underdeveloped (Grant and MacLeod, 2018). An in-depth understanding of the nature and variation of idiolect not only sheds light on the theoretical discussion of language variation but also aids practical and forensic applications of linguistic science.

Here, we characterize the idiolectal variation of linguistic styles through a computational analysis of large-scale short online texts. Specifically, we ask the following questions: 1) to what extent can we extract distinct styles from short texts, even for unseen authors; 2) what are the core stylistic dimensions along which individuals vary; and 3) to what extent are idiolects consistent and distinctive?

First, by using deep metric learning, we show that idiolect is, in fact, systematic and can be quantified separately from sociolect, and we introduce a new set of probing tasks for testing the relative contributions of different linguistic variations to idiolectal styles. Second, we find that the learned representations for idiolect also encode some stylistic dimensions with surprising regularity (see Figure 1), analogous to the linguistic regularities found in word embeddings. Third, with our proposed metrics for style distinctiveness and consistency, we show that individuals vary considerably in their internal consistency and distinctiveness in idiolect, which has implications for the limits of authorship recognition and the practice of forensic linguistics. For replication, we make our code available at https://github.com/lingjzhu/idiolect.

2 Idiolectal variation

Theoretical questions "Idiolect" remains a fundamental yet elusive construct in sociolinguistics (Wright, 2018). The term has been abstractly defined as the totality of the possible utterances one could say (Bloch, 1948; Turell, 2010; Wright, 2018).
[Figure 1: two scatter plots of stylistic embeddings projected into the first two principal components; legends: "Original / Lower cased" (left) and "Original / Null subject (fewer) / Null subject (more)" (right).]

Figure 1: Compositionality in stylistic embeddings encoded by SRoBERTa, projected into the first two principal components. [Left] Lower-casing all letters shifts all original texts in the same direction (blue dots → orange dots; e.g., "I love Cantonese BBQ!" → "i love cantonese bbq!"). [Right] The magnitude of movement in one direction is proportionate to the number of null-subject sentences in texts (blue dots → orange dots → green crosses; e.g., "I went out. I bought durians." → "Went out. I bought durians." → "Went out. Bought durians.").
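The shift regularity in Figure 1 suggests a simple probe: embed each text before and after a perturbation, and check whether the difference vectors all point the same way. A toy sketch with random vectors standing in for SRoBERTa embeddings (`shift_consistency` and all data here are illustrative, not from the paper):

```python
import numpy as np

def shift_consistency(original, perturbed):
    """Mean pairwise cosine similarity between the difference vectors
    (perturbed - original) across texts; values near 1 mean the
    perturbation shifts all embeddings in the same direction."""
    diffs = perturbed - original
    diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    sims = diffs @ diffs.T
    n = len(diffs)
    # average over off-diagonal entries only
    return (sims.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)
orig = rng.normal(size=(50, 8))
direction = rng.normal(size=8)                        # one shared style shift
coherent = orig + direction + 0.1 * rng.normal(size=(50, 8))
random_shift = orig + rng.normal(size=(50, 8))        # unrelated per-text shifts
print(shift_consistency(orig, coherent) > shift_consistency(orig, random_shift))  # → True
```

A value near 1 for a perturbation like lower-casing would indicate that it moves every author's embedding in a shared direction, as in the left panel of Figure 1.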

Idiolect as a combination of one's cognitive capacity and sociolinguistic experiences (Grant and MacLeod, 2018) raises many interesting linguistic questions. First, how are idiolects composed? Forensic studies often focus on a few linguistic features as capturing a person's idiolect (Coulthard, 2004; Barlow, 2013; Wright, 2013), but few have explicitly offered an explanation as to why particular word sequences were useful or not (Wright, 2017). Frameworks for analyzing individual variation have been proposed (Grant, 2012; Grant and MacLeod, 2020), but the contributions to idiolectal variation at different linguistic levels are seldom explicitly measured.

Sociolinguists have pursued the relationship between idiolect and sociolect. However, the perceived idiosyncrasies of idiolect often render it second to sociolect as an object of study (Labov, 1989). Multiple scholars have suggested that idiolects are the building blocks of various sociolects (Eckert, 2012; Barlow, 2013; Wright, 2018). While some studies have probed the relations between the language of individuals and that of the group (Johnstone and Bean, 1997; Schilling-Estes, 1998; Meyerhoff and Walker, 2007), it remains less clear to what extent idiolect is composed of sociolects or has unique elements of its own (see Barlow, 2013, for a review). We test this hypothesized relationship in Appendix F and find preliminary quantitative evidence that idiolect representations encode some information about sociolects.

The other theoretical question, relevant to cognitive science as well as forensic linguistics, is to what extent an individual's idiolect is distinctive and consistent against a background population (Grant, 2012; Grant and MacLeod, 2018). Studies in forensic linguistics (Johnson and Wright, 2014; Wright, 2013, 2017) provide confirmatory answers to both questions, yet these studies only focus on a specific set of features for a small group of authors. At the other extreme, the high performance of machine learning applied to authorship verification and attribution (Kestemont et al., 2018, 2019, 2020) stems from placing more emphasis on separating authors (distinctiveness) than on consistency. An empirical investigation of these two concepts in a relatively large population remains to be conducted. Here we aim to quantify these two linguistic constructs in large-scale textual datasets.

Stylometry and stylistic similarities Traditional stylometry often relies on painstaking manual analysis of texts for a closed set of authors (Holmes, 1998). Surface linguistic features, especially function words or character n-grams, have been found to be effective in authorship analysis (Kestemont, 2014; Neal et al., 2017). Despite the overwhelming success of deep learning, traditional features are still effective in authorship analysis (Kestemont et al., 2018, 2019, 2020). Yet the wide application of machine learning and deep learning in recent years has greatly advanced the state-of-the-art performance in authorship verification (Boenninghoff et al., 2019a,b; Weerasinghe and Greenstadt, 2020). Recent PAN Authorship Verification shared tasks suggest that characterizing individual styles in long texts can be solved with almost perfect accuracy (Kestemont et al., 2020); as a result, stylometric studies have increasingly focused on short texts in social media or online communications for authorship profiling or verification (Brocardo et al., 2013; Vosoughi et al., 2015; Boenninghoff et al., 2019b). Our study points to a promising application of sociolinguistics for authorship detection in social media, as idiolect variation is evident and consistent in texts as short as 100 tokens.

3 Learning representations of idiolects

In forensic linguistics, textual similarity was traditionally quantified as the proportion of shared vocabulary and the number and length of shared phrases or characters (n-grams) (Coulthard, 2004). More sophisticated statistical methods for textual comparison have been developed over the years (Neal et al., 2017; Kestemont et al., 2020).

To learn representations of idiolectal style, we propose using a proxy task of authorship verification, where, given two input texts, a model must determine whether they were written by the same author (Neal et al., 2017). The identification is performed by scoring the two texts under comparison with a linguistic similarity measure; if their similarity exceeds a certain threshold, the two texts are judged to be written by the same author.

Task definition Given a collection of text pairs by multiple authors X = {(x1^{p1}, x2^{p1}), ..., (x_{n-1}^{p_{t-1}}, x_n^{p_t})} from domains P = {p1, p2, ..., pt} and labels Y = {y1, y2, ..., yn}, we aim to identify a function fθ that can determine whether two text samples xi and xj are written by the same author (y = 1) or different authors (y = 0).

Stylometric similarity learning Our model for extracting stylistic embeddings from input texts is the same as the Sentence RoBERTa or BERT network (SBERT/SRoBERTa) (Reimers and Gurevych, 2019). For a text pair (xi, xj), the Siamese model fθ maps both text samples into embedding vectors (zi, zj) in the latent space such that zi = fθ(xi) and zj = fθ(xj). Rather than using the [cls] token as the representation, we use attention pooling to merge the last hidden states [h0, ..., hk] into a single embedding vector that represents textual style:

    ho = AttentionPool([h0, ..., hk])
    z  = W1 · σ(W2 · ho + b2) + b1                                              (1)

where W1 and W2 are learnable parameters and σ(·) is the ReLU activation function. Stylometric similarity between the text pair is then measured by a distance function d(zi, zj); here we mainly consider the cosine similarity.

The underlying models are RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019). Specifically, we used roberta-base or bert-base-uncased as the encoder.[1]

[1] The performance of bert-base-cased was almost identical to bert-base-uncased, so it was not included in the main results.

Loss function The classic max-margin loss for deep metric learning was shown to be effective in previous work on stylometry (Boenninghoff et al., 2019a,b). Inspired by Kim et al. (2020), we used a continuous approximation of the max-margin loss to learn the stylometric distance between users; an additional hyperparameter in this loss allows fine-grained control of the penalty magnitude for hard samples.

The loss is an adaptation of the proxy-anchor loss proposed by Kim et al. (2020). To minimize the distance between same-author pairs and maximize the distance between different-author pairs, the model was trained with a contrastive loss with pre-defined margins {τs, τd} over the set of positive samples P+ and negative samples P−:

    L(s) = (1/|P+|) · Σ_{i,j ∈ P+} Softplus( LogSumExp( α · [d(zi, zj) − τs] ) )   (2)

    L(d) = (1/|P−|) · Σ_{i,j ∈ P−} Softplus( LogSumExp( α · [τd − d(zi, zj)] ) )   (3)

    L = L(s) + L(d)                                                                 (4)

where Softplus(z) = log(1 + e^z) is a continuous approximation of the max function and α is a scaling factor that scales the penalty of the out-of-margin samples. The out-of-margin samples are exponentially weighted through the Log-Sum-Exp operation, such that hard examples are assigned exponentially growing weights, prompting the model to focus more on hard samples.
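As a concrete illustration, Eqs. (2)-(4) can be sketched in a few lines of NumPy. This is a simplified reading, not the authors' implementation: it treats 1 − cosine similarity as the distance d(zi, zj), groups all positive (and all negative) pairs into a single LogSumExp term, and assumes both pair types occur in the batch.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def contrastive_loss(z, same_author, tau_s=0.6, tau_d=0.4, alpha=30.0):
    """Sketch of Eqs. (2)-(4): penalize same-author pairs whose distance
    exceeds tau_s and different-author pairs whose distance falls below
    tau_d, with hard pairs up-weighted via LogSumExp."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    d = 1.0 - z @ z.T                      # assumption: d = 1 - cosine sim
    iu = np.triu_indices(len(z), k=1)      # each unordered pair once
    pos, neg = d[iu][same_author[iu]], d[iu][~same_author[iu]]
    loss_s = softplus(logsumexp(alpha * (pos - tau_s))) / len(pos)
    loss_d = softplus(logsumexp(alpha * (tau_d - neg))) / len(neg)
    return loss_s + loss_d

# Toy batch: texts 0-1 by one author, 2-3 by another.
labels = np.array([0, 0, 1, 1])
same = labels[:, None] == labels[None, :]
z_good = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]])
print(contrastive_loss(z_good, same))  # near zero: both margins satisfied
```

Mislabeling the pairs (so that far-apart texts count as same-author) drives the loss up sharply, which is the intended margin behavior.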
During inference, we compare the textual distance d(x1, x2) with the threshold τt, the average of the two margins: τt = (τs + τd)/2. We set {τs = 0.6, τd = 0.4} and α = 30. Details about the hyperparameter search can be found in Appendix A.

Baseline methods We also compare our models against several baseline methods in authorship verification: 1) GLAD. Groningen Lightweight Authorship Detection (GLAD) (Hürlimann et al., 2015) is a binary linear classifier using multiple handcrafted linguistic features. 2) FVD. The Feature Vector Difference (FVD) method (Weerasinghe and Greenstadt, 2020) is a deep learning method for authorship verification using the absolute difference between two traditional stylometric feature vectors. 3) AdHominem. AdHominem is an LSTM-based Siamese network for authorship verification in social media domains (Boenninghoff et al., 2019a). 4) BERTConcat/RoBERTaConcat. The model is fine-tuned BERT/RoBERTa with the two texts under comparison concatenated; authorship was determined by performing the binary classification task on the [cls] token (Ordoñez et al., 2020). Implementation details are provided in Appendix B.

4 Data

Amazon reviews Our dataset was extracted from the release of the full Amazon review dataset up to 2018 (Ni et al., 2019). We filtered out reviews that were shorter than 50 words to ensure sufficient text to reveal stylistic variation, and only retained users that have reviews in at least two product domains (e.g., Electronics and Books) and at least five reviews in each domain. After text cleaning, the dataset contained 128,945 users. We partitioned 40%, 10%, and 50% of users into training, development, and test sets, yielding 51,398, 12,849, and 64,248 unique users respectively. As one of our goals is to analyze stylistic variation, we reserved the majority of the data (50% of users) for model evaluation and for subsequent linguistic analysis. The maximum length of all samples was limited to 100 tokens.

Negative sampling For each user, we randomly sampled six pairs of texts written by the same user as positive samples of same-authorship (SA). For negative samples, we randomly sampled six texts from the rest of the data and paired them with the original text. To improve generalizability across domains, we enforced a sampling scheme in which half of the positive/negative samples were matched in domain while the other half were cross-domain.

Reddit posts To test generalizability, we additionally constructed a second dataset from the online community Reddit and ran a subset of experiments on it. The top 200 subreddits were extracted via the Convokit package (Chang et al., 2020). Only users who had posted texts longer than 50 words in more than 10 subreddits were selected, resulting in 55,368 unique users. We partitioned 60%, 10%, and 30% of users into training, development, and test sets respectively. Each user's idiolect is represented by 10 posts from 10 different subreddits. The binary labels were then generated by randomly sampling from SA and different-authorship (DA) pairs, using the same negative sampling procedure used in the creation of the Amazon dataset.

5 Results on proxy task

To test whether our model does recover stylistic features, we first test its performance on the proxy task of authorship verification, contextualizing the performance with other models specifically designed for that task. The performance of authorship verification is evaluated by accuracy, F1, and the area-under-the-curve score (AUC) (Kestemont et al., 2020). Table 1 suggests that all models are able to recover at least some distinctive aspects of individual styles even in these short texts. Deep learning-based methods generally achieve better verification accuracy than GLAD (Hürlimann et al., 2015) and FVDNN (Weerasinghe and Greenstadt, 2020), models based on traditional linguistic features. The Siamese architecture demonstrates its usefulness in the authorship verification task, as AdHominem (Boenninghoff et al., 2019a) and SRoBERTa perform better than the pre-trained transformer RoBERTaConcat. These results confirm that models recognize authors' stylistic variation. Our error analysis also shows that identifying the same author across domains or different authors in the same domain poses a greater challenge to these models in general, though different model choices may exhibit various inductive biases (see Appendix C). As SRoBERTa is shown to be the most effective architecture in this task, we will use these models to examine idiolectal variation for the rest of the study.
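For reference, the thresholded decision rule and the three reported metrics can be sketched as follows; the similarity scores below are toy values, not model outputs, and ties in the AUC rank computation are ignored for brevity:

```python
import numpy as np

def verify(sim, tau_t=0.5):
    """Same-author decision: similarity above tau_t = (tau_s + tau_d) / 2."""
    return (np.asarray(sim, float) > tau_t).astype(int)

def accuracy_f1_auc(scores, labels):
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pred = verify(scores)
    acc = (pred == labels).mean()
    tp = ((pred == 1) & (labels == 1)).sum()
    prec = tp / max(pred.sum(), 1)
    rec = tp / max(labels.sum(), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-12)
    # AUC via the rank-sum (Mann-Whitney) identity; ties ignored
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    auc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return float(acc), float(f1), float(auc)

print(accuracy_f1_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → (1.0, 1.0, 1.0)
```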
                                                          Content and function words. The use of func-
    Model              Accuracy        F1     AUC         tion words has long been recognized as an impor-
    Amazon reviews                                        tant stylometric feature. A small set of function
    Random                50%         0.5      0.5        words are disproportionately frequent, relatively
    GLAD                 67.1%       0.667    0.738       stable across content, and seem less under authors’
    FVDNN                65.1%       0.671    0.714       conscious control (Kestemont, 2014). Yet few stud-
    AdHominem            73.3%       0.781    0.811       ies have empirically compared the relative contri-
    BERTConcat           71.3%       0.664    0.802       butions between function words and content words.
    SBERT                76.7%       0.768    0.850       To test this, we masked out all content words in
    RoBERTaConcat        73.9%       0.686    0.838       the original texts with a masked token ,
    SRoBERTa             82.9%       0.831    0.909       which was recognized by the transformer models.
    Reddit posts                                          For comparison, we also created masked texts with
    BERTConcat           65.0%       0.600    0.721       only content words. Punctuation and relative posi-
    SBERT                66.3%       0.669    0.727       tions between words were retained as this allows
    RoBERTaConcat        71.0%       0.695    0.794       the model to maximally exploit the spatial layout
    SRoBERTa             73.0%       0.737    0.812       of content/function words.
Table 1: Results of authorship verification on the Amazon (top) and Reddit (bottom) test samples.
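The pair construction described in Section 4 (six SA and six DA pairs per user) can be sketched as follows; the function name is illustrative, and the in-domain/cross-domain balancing is omitted for brevity:

```python
import random

def build_pairs(user_texts, other_texts, n=6, seed=0):
    """For one user, draw n same-author (SA) pairs from their own texts
    and n different-author (DA) pairs against texts sampled from the
    rest of the corpus, labeled 1 and 0 respectively."""
    rng = random.Random(seed)
    sa = [tuple(rng.sample(user_texts, 2)) + (1,) for _ in range(n)]
    da = [(rng.choice(user_texts), rng.choice(other_texts), 0)
          for _ in range(n)]
    return sa + da

user = ["review one", "review two", "review three"]
others = ["someone else's review", "another stranger's post"]
print(len(build_pairs(user, others)))  # → 12
```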
                                                          tion in authorship analysis has been emphasized,
                                                          it is suggested that only using lexical information
                                                          is insufficient in forensic linguistics (Grant and
6 Linguistic Analysis

In this section, we seek to quantify idiolectal variation at different linguistic levels.

6.1 Stylometric features

Idiolects vary at lexical, syntactic, and discourse levels, yet it remains unclear which type of variation contributes most to idiolectal variation.

Ordering Hierarchical models of linguistic identities hold that authorial identities are reflected at all linguistic levels (Herring, 2004; Grant, 2012; Grant and MacLeod, 2020), yet the relative importance of these elements is seldom empirically explored. To understand the contributions of lexical distribution, syntactic ordering, and discourse coherence, we test the contributions of different linguistic features to authorship verification by perturbing the input texts. To force the model to use only lexical information, we randomly permute all tokens in our data, removing information about syntactic ordering (lexical model). The organization of discourse might also provide cues to idiolectal style. To test this, we preserve the word order within a sentence but permute sentences within the text to disrupt discourse information (lexico-syntactic model). We then ran our experiments on these datasets using the same set of hyperparameters to compare model performance on these perturbed inputs.

Content and function words The use of function words has long been recognized as an important stylometric feature: a small set of function words are disproportionately frequent, relatively stable across content, and seem less under authors' conscious control (Kestemont, 2014). Yet few studies have empirically compared the relative contributions of function words and content words. To test this, we masked out all content words in the original texts with a mask token recognized by the transformer models. For comparison, we also created masked texts with only content words. Punctuation and relative positions between words were retained, as this allows the model to maximally exploit the spatial layout of content/function words.

Results While the importance of lexical information in authorship analysis has been emphasized, it has been suggested that using only lexical information is insufficient in forensic linguistics (Grant and MacLeod, 2020). Our results in Table 2 suggest that, even with only lexical information, model performance is only about 4% lower than models with access to all information. Syntactic and discourse ordering do contribute to author identities, yet the contributions are relatively minor. In forensic linguistics it is commonly the case that only fragmentary texts are available (Grant, 2012), and our findings suggest that even without broader discourse information it is still possible to estimate author identity with good confidence. The weak contribution of discourse coherence to authorship analysis highlights that the high-level organization of texts is only somewhat consistent within authors, which has been mentioned but rarely tested in forensic linguistics (Grant and MacLeod, 2020).

From Table 3, it is apparent that, even with half of the words masked out, the transformed texts still contain an abundance of reliable stylometric cues to individual writers, such that the overall accuracy is not significantly lower than models with full texts. While the importance of function words in authorship analysis has been emphasized (Kestemont, 2014), content words seem to convey slightly more idiolectal cues despite the topical variation. Both SBERT and SRoBERTa achieve similar performance on Amazon and Reddit data, yet SRoBERTa better exploits the individual variation in content words.
                               Amazon                                Reddit
                               Test (Permuted)   Test (Original)     Test (Permuted)   Test (Original)
  Lexical           SBERT      0.716 / 0.798     0.736 / 0.794       0.639 / 0.668     0.657 / 0.667
                    SRoBERTa   0.781 / 0.856     0.751 / 0.850       0.654 / 0.734     0.686 / 0.714
  Lexico-syntactic  SBERT      0.767 / 0.851     0.770 / 0.852       0.681 / 0.722     0.685 / 0.724
                    SRoBERTa   0.820 / 0.895     0.822 / 0.896       0.730 / 0.798     0.732 / 0.800

Table 2: Results (F1/AUC) on permuted examples show that models are largely insensitive to syntactic and ordering variation, and, instead, idiolect is mostly captured through lexical variation.
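The two perturbation settings behind Table 2 amount to token- and sentence-level shuffles; a minimal sketch (whitespace tokenization and period-based sentence splitting are simplifications of the actual preprocessing):

```python
import random

def lexical_perturb(text, seed=0):
    """Destroy syntax and discourse: shuffle all tokens."""
    tokens = text.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)

def lexico_syntactic_perturb(text, seed=0):
    """Keep within-sentence word order, destroy discourse: shuffle sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."

text = "I went out. I bought durians. They were ripe."
print(lexical_perturb(text))
print(lexico_syntactic_perturb(text))
```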

                     Amazon          Reddit
Function   SBERT     0.775 / 0.856   0.680 / 0.744
           SRoBERTa  0.786 / 0.858   0.683 / 0.755
Content    SBERT     0.735 / 0.812   0.641 / 0.689
           SRoBERTa  0.795 / 0.870   0.708 / 0.768

Table 3: Results (F1/AUC) show that function or content words alone are reliable authorial cues. For SRoBERTa, content words seem to convey slightly more idiolectal cues despite topical variation.

Tokenization    Accuracy   F1      AUC
Word-30k        67.8%      0.675   0.745
Word-50k        67.8%      0.679   0.744
BERT-uncased    67.1%      0.680   0.737
BERT-cased      67.2%      0.677   0.737
RoBERTa         73.1%      0.734   0.804

Table 4: The effect of tokenization methods on the model performance with respect to Amazon reviews. The pre-trained RoBERTa BPE tokenizer encodes more textual variations than the rest.
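Why a byte-level BPE vocabulary preserves more idiolectal signal can be seen with a toy contrast; the mini word vocabulary and the one-piece-per-token simplification below are ours, not the behavior of the real tokenizers:

```python
def word_tokenize(text, vocab):
    # A closed word vocabulary collapses every unseen variant to <unk>.
    return [tok if tok in vocab else "<unk>" for tok in text.split()]

def byte_bpe_tokenize(text):
    # Byte-level BPE has no <unk>: any string decomposes into learned
    # merges or raw bytes, so case and spacing variants stay distinct.
    # (Toy version: each space-separated token becomes one piece, with
    # a leading space marked as 'G-dot' ("Ġ") as in the RoBERTa vocabulary.)
    pieces = []
    for i, tok in enumerate(text.split(" ")):
        pieces.append(("Ġ" if i > 0 else "") + tok)
    return pieces

vocab = {"the", "cat"}                      # hypothetical word-level vocabulary
print(word_tokenize("the CAT", vocab))      # ['the', '<unk>']  - variant lost
print(byte_bpe_tokenize("the CAT"))         # ['the', 'ĠCAT']   - variant kept
```

In the toy word-level setting the case variant is erased before the model ever sees it, whereas the byte-level encoding keeps it recoverable, which mirrors the Word-30k/50k versus RoBERTa gap in Table 4.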

in content words. These results strongly suggest that there are unique individual styles that are stable across topics, and our additional probing also reveals that topic information is significantly reduced in the learned embeddings (see Appendix F).

6.2   Analysis of tokenization methods

We hypothesized that the large performance gap between BERT and RoBERTa (~5%) could be caused by the discrepancy in their tokenization methods. The BERT tokenizer is learned after preprocessing the texts with heuristic rules (Devlin et al., 2019), whereas the BPE tokenizer for RoBERTa is learned without any additional preprocessing or tokenization of the input (Liu et al., 2019).

Method   To verify this, we trained several lightweight Siamese LSTM models from scratch that differed only in their tokenization methods: 1) a word-based tokenizer with the vocabulary size set to either 30k or 50k to match the sizes of the BPE encodings; 2) the pre-trained wordpiece tokenizers for bert-base-uncased and bert-base-cased; 3) the pre-trained tokenizer for roberta-base. Implementation details are attached in Appendix D.

Results   As shown in Table 4, the RoBERTa tokenizer outperforms the other tokenizers by a significant margin, even though it has a similar number of parameters to Word-50k. Interestingly, the pre-trained BERT tokenizer is not superior to the word-based tokenizer, despite its better handling of out-of-vocabulary (OOV) tokens. For word-based tokenizers, increasing the vocabulary from 30k to 50k does not bring any improvement, indicating that many tokens were unused during evaluation. The results strongly suggest that choosing an expressive tokenizer, such as the RoBERTa BPE tokenizer trained directly on raw text, can effectively encode more stylistic features. For example, for the word cat, the RoBERTa tokenizer gives different encodings for common variants such as Cat, CAT, [space]CAT or caaats, but these are all treated as OOV tokens in word-based tokenization. While the BERT tokenizer handles most variants, it fails to encode formatting variation such as CAT, [space]CAT and [space][space]CAT. Such nuances in formatting are an essential dimension of idiolectal variation in forensic analysis of electronic communications (Grant, 2012; Grant and MacLeod, 2020).

7   Characterizing idiolectal styles

In this section, we turn our attention to distinctiveness and consistency in writing styles, both of which are key theoretical assumptions in forensic linguistics (Grant, 2012).

Distinctiveness   We examine inter-author variation through inter-author distinctiveness by constructing a graph that connects users with similar style embeddings, described next. For each user in the test set, we randomly sampled one text sample and extracted its embedding through the Siamese models. Then we created the pairwise similarity matrix M by computing the similarity between each text pair. Then M is pruned by re-
Most distinctive: "its ok seems like a reprint i mean its not horrible but i was expecting a lil better qaulity but if i wore to do it again yes i would still buy this poster its not blurry or anything but if you have a good eye it seems a lil like a reprint"

Least distinctive: "Nice, thinner style plates that are well suited for building Lego projects. They hold Lego pieces securely and match up perfectly. Also, as a big PLUS for this company you get amazing customer service."

Table 5: Sample Amazon review excerpts with the most and the least distinctive style as predicted by SRoBERTa.

moved entries below a threshold τ_cutoff, the same threshold τ_t that is used to determine SA or DA pairs. The pruned matrix M̂ is treated as the graph adjacency matrix from which a network G is constructed.

    S_i = 1 − (Σ_j I[j ∈ V_i]) / N    (5)

where V_i is the set of neighbors of node i in G, N the total node count, and I[·] the indicator function. Σ_j I[j ∈ V_i] is the degree centrality of node i. We found that features from the unweighted graph are perfectly correlated with the ones from the weighted graph. The unweighted graph is kept for computational efficiency. The scores were averaged over 5 runs. The intuition is that, since authors are connected to similar authors, the more neighbors an author has, the less distinctive their style is. A distinctiveness of 0.6 implies that this author is different from 60% of authors in the dataset.

Consistency   We also measured the intra-author consistency in styles through concentration in the latent space. The concentration can be quantified by the conicity, a measure of the averaged vector alignment to mean (ATM), as in the following equations (Chandrahas et al., 2018):

    Conicity(V) = (1/|V|) Σ_{v∈V} ATM(v, V)    (6)

    ATM(v, V) = cosine(v, (1/|V|) Σ_{x∈V} x)    (7)

where v ∈ V is a latent vector in the set of vectors V. The ATM measures the cosine similarity between v and the centroid of V, whereas the conicity indicates the overall clusteredness of the vectors in V around the centroid. If all texts written by the same user are highly aligned around their centroid, with a conicity close to 1, this suggests that this user is highly consistent in writing style.

Analysis   The distributions of style distinctiveness and consistency both conform to a normal distribution (Figure 2), yet no meaningful correlation exists between these two measures (Amazon: Spearman's ρ=0.078; Reddit: Spearman's ρ=0.11). In general, users are highly consistent in their writing styles even in such a large population, with an average of 0.8, much higher than that for random samples (~0.4). Users are also quite distinctive from one another, as on average a user's style is different from 80% of users in the population pool. Yet individuals do differ in their degrees of distinctiveness and consistency, which may be taken into consideration in forensic linguistics, because inconsistency or indistinctiveness may weaken the linguistic evidence to be analyzed.

In Table 5, the least distinctive text is characterized by plain language, proper formatting, and typical content, which reflects the unmarked style of stereotypical Amazon reviews. Yet this review itself is still quite distinctive, as it differs from 60% of the total reviews. The most distinctive review exhibits multiple deviations from the norm of this genre. The style is unconventional, with uncapitalized letters, run-on sentences, typos, the lack of periods, and the use of colloquial alternative spellings such as "haft", "lil" and "wore", all of which make this review highly marked. For style consistency, the most consistent writers incline towards using similar formatting, emojis, and narrative perspectives across reviews, whereas the least consistent users tend to shift across registers and perspectives in their writing (see Appendix E for additional samples).

We tested how various authors affect the verification performance. To avoid circular validation resulting from repeatedly using the same training data, we retrained the models on the repartitioned test data and tested them using the development set. The original test set was repartitioned into three disjoint chunks of equal size, each chunk containing authors solely from either the top, middle or bottom 33% in terms of distinctiveness or con-
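Both measures are simple to compute from per-user style embeddings. A minimal sketch of Eq. (5) and Eqs. (6)-(7) follows; the function and variable names are ours, and the embeddings are plain Python lists standing in for the Siamese model's outputs:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def distinctiveness(embeddings, tau):
    """Eq. 5: S_i = 1 - deg(i)/N on the unweighted graph that links
    samples whose cosine similarity exceeds the threshold tau."""
    n = len(embeddings)
    scores = []
    for i, u in enumerate(embeddings):
        degree = sum(
            1 for j, v in enumerate(embeddings)
            if j != i and cosine(u, v) > tau
        )
        scores.append(1.0 - degree / n)
    return scores

def conicity(vectors):
    """Eqs. 6-7: mean cosine alignment (ATM) of each vector to the
    centroid of the set; close to 1 means tightly clustered."""
    dim = len(vectors[0])
    centroid = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)
```

A user whose texts align tightly with their centroid has conicity near 1 (high consistency), while an author connected to few similar authors has distinctiveness near 1.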
[Figure 2 omitted: a scatter plot of intra-author consistency (y-axis, 0.6-1.0) against inter-author distinctiveness (x-axis, 0.8-1.0).]

Figure 2: The joint distribution of distinctiveness and consistency on the Amazon reviews as computed with SRoBERTa embeddings. Both follow normal distributions yet there is no meaningful correlation between them, suggesting that these two dimensions may vary independently of each other.

              Range      Amazon          Reddit
Consistent    Random     0.806 / 0.875   0.677 / 0.761
              High       0.801 / 0.879   0.680 / 0.767
              Moderate   0.805 / 0.874   0.672 / 0.757
              Low        0.797 / 0.867   0.661 / 0.716
Distinctive   Random     0.808 / 0.876   0.695 / 0.754
              High       0.803 / 0.874   0.709 / 0.750
              Moderate   0.808 / 0.880   0.663 / 0.731
              Low        0.793 / 0.861   0.611 / 0.680

Table 6: Performance (F1/AUC) on different partitions of data. Models trained on inconsistent or indistinctive authors significantly underperformed.

Style Shift            Amazon   Reddit
Random                 0.032    0.002
and → &                0.698    0.484
. → space              0.389    0.352
. → !!!!               0.569    0.396
lower-cased            0.679    0.660
upper-cased            0.463    0.470
I → ∅                  0.420    0.306
going to → gonna       0.398    0.390
want to → wanna        0.478    0.505
-ing → -in'            0.522    0.484
w/o elongation → w/    0.410    0.343

Table 7: Sij for qualitative style shifts, averaged across all comparisons. The offset vectors for style shifts are highly aligned.

                 Amazon             Reddit
Style shift      Sk      ΔNorm      Sk      ΔNorm
Random           0.686   0.059      0.692   0.059
and → &          0.957   0.176      0.922   0.124
. → !!!!         0.964   0.085      0.953   0.072
I → ∅            0.860   0.257      0.831   0.273
-ing → -in'      0.933   0.103      0.928   0.097

Table 8: Sk and ΔNorm for quantitative style shifts. The direction and the magnitude of change encode the type and the degree of style shift. ΔNorm is positive, suggesting that more manipulations result in a longer offset vector along that direction.

sistency. Results in Table 6 suggest that, while most models performed similarly, models trained on inconsistent or indistinctive authors significantly underperformed. This result may have implications for comparative authorship analysis, in that it is desirable to control the number of inconsistent or indistinctive authors in the dataset.

8   Compositionality of styles

Finally, we sought to understand how stylistic variations are encoded. At least for certain stylistic features, there is additive stylistic compositionality in the latent space onto which the texts are projected (Figure 1).

Method   In light of the word analogy task for word embeddings (Mikolov et al., 2013; Linzen, 2016), we designed a series of linguistic stimuli that vary systematically in styles to probe the structure of the stylistic embeddings. For each stylistic dimension, we created n text embedding pairs P = [(p_r^1, p_m^1), . . . , (p_r^n, p_m^n)], where p_r^i is the embedding of a randomly sampled text and p_m^i is the embedding of the modified version of p_r^i, so that it differs from p_r^i in only one stylistic aspect. For samples i and j from P, we quantified

    Sij = cosine(p_r^i − p_m^i, p_r^j − p_m^j)    (8)

Like the word analogy task, if this stylistic dimension is encoded in only one direction, we should expect Sij to be close to 1.

For a target qualitative style shift, we randomly sampled 1000 texts and modified each text to approximate the target stylistic dimension. For example, if the null subject is the target feature, we remove the subject "I" from "I recommend crispy pork!" to get "Recommend crispy pork!". Then we compute Sij for each pair, totaling 499500 possible comparisons. Here we selected 10 stylistic markers of textual formality for evaluation (MacLeod and Grant, 2012; Biber and Conrad, 2019), and these textual modifications cover insertion, deletion, and replacement operations.
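The Sij probe of Eq. (8) can be sketched directly; the embeddings below are plain Python lists standing in for the model's outputs, and the helper names are ours:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def offset_alignment(pairs):
    """Eq. 8: average cosine between the offset vectors (p_r - p_m)
    over all pair combinations of (original, modified) embeddings.
    A value near 1 means the style shift is encoded as one direction."""
    offsets = [[a - b for a, b in zip(r, m)] for r, m in pairs]
    scores = [
        cosine(offsets[i], offsets[j])
        for i in range(len(offsets))
        for j in range(i + 1, len(offsets))
    ]
    return sum(scores) / len(scores)
```

With toy embeddings in which every "modified" vector is its "original" shifted along one shared direction, the score is exactly 1; real embeddings, as in Table 7, fall below that but clearly above the random baseline.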
For quantitative style shifts, we measure Sk between samples as well as the difference in length with the following equations. We compare two embedding pairs (p_r^k, p_s^k) and (p_r^k, p_l^k), where both p_s^k and p_l^k differ from p_r^k in only one stylistic dimension, but p_l^k is further along that dimension than p_s^k. For instance, compared to the original review p_r^k, p_s^k contains five more "!!!" whereas p_l^k contains ten more such tokens. Here we surveyed four stylistic markers of formality.

    Sk = cosine(p_l^k − p_r^k, p_s^k − p_r^k)
    Δnorm_k = ||p_l^k − p_r^k||_2 − ||p_s^k − p_r^k||_2    (9)

For each stylistic shift, we collected 2000 samples, each containing at least 8 markers of that style. Then we modified the original review p_r^k to the target style by incrementally transforming the keywords into the target keywords by 50% (p_s^k) and 100% (p_l^k). If these styles are highly organized, we should expect Sk to be close to 1, suggesting that changes in the same dimension point in the same direction. Yet we also expect that quantitative changes should be reflected in a significant length difference (magnitude of Δnorm_k) and in the direction of the difference (Δnorm_k being positive or negative) of the style vectors.

Results   Table 7 suggests that both models outperform the random baseline, i.e., Sij computed over the same samples with some words randomly replaced. Like word embeddings, stylistic embeddings also exhibit a linear additive relationship between various stylistic attributes. In Figure 1, converting all letters to lower case causes textual representations to move collectively in approximately the same direction. Despite such regularities, the offset vectors for style shifts were not perfectly aligned in all instances, which may be attributed to the variations across texts.

For quantitative changes, SRoBERTa on both Amazon and Reddit data encodes the same type of change in the same direction, as the vector offsets are highly aligned to each other (Table 8). Yet greater degrees of style shift relative to the original text translate to a larger magnitude of change along that direction (Δnorm_k in Table 8). In Figure 1, after removing either the first three instances or all the occurrences of "I" from the original text, the resulting representations both shift in the same direction but differ in magnitude. Such changes cannot be explained by random variation, suggesting that both models learn to encode fine-grained stylistic dimensions in the latent space through the proxy task.

While we only examine several stylistic markers, we are aware that the learned style representations also exhibit regularities for other lexical manipulations, as long as the manipulation is systematic and regular across samples. An explanation is that the model is systematically tracking fine-grained variations in lexical statistics. Yet the proposed model must also encode more abstract linguistic features, because it outperformed GLAD (Hürlimann et al., 2015) and FVDNN (Weerasinghe and Greenstadt, 2020), which also track bag-of-words or bag-of-n-gram features. Previous research in word embeddings attributes the performance on the analogy task to occurrence patterns (Pennington et al., 2014; Levy et al., 2015; Ethayarajh et al., 2019). The fact that these variations are encoded systematically, beyond random variation and in such a fine-grained manner, indicates that they are stylistic dimensions along which individual choices vary frequently and regularly.

9   Conclusions

The relatively unconstrained nature of online genres tolerates a much wider range of stylistic variation than conventional genres (Hovy et al., 2015). Online genres are often marked by unconventional spellings, heavy use of colloquial language, extensive deviations in formatting, and the relaxation of grammatical rules, providing rich linguistic resources to construct and perform one's identity. Our analysis of idiolects in online registers has highlighted that idiolectal variations permeate all linguistic levels, present in both surface lexico-syntactic features and high-level discourse organization. Traditional sociolinguistic research often regards idiolects as idiosyncratic, unstable, and not as regular as sociolects (Labov, 1989; Barlow, 2018); here, we show that idiolectal variation is not only highly distinctive but also consistent, even in a relatively large population. Our findings suggest that individuals may differ considerably in their degrees of consistency and distinctiveness across multiple text samples, which sheds light on the theoretical discussions and practical applications in forensic linguistics. Our findings also have implications for sociolinguistics, as we have shown an effective method to discover, understand and exploit sociolinguistic variation.
10   Ethical considerations

While this study is theoretically driven, we are aware that there might be some ethical concerns for the models surveyed in this study. While computational stylometry can be applied to forensic investigations (Grant and MacLeod, 2020), authorship verification in online social networks, if put to malicious use, may weaken the anonymity of some users, leading to potential privacy issues. Our results also show that caution should be taken when deploying these models in forensic scenarios, as different models or tokenizers might show different inductive biases that may bias them towards certain types of users. Another potential bias is that we only selected a small group of the most productive writers from the pool (less than 20% of all data), and this sample might not necessarily represent all populations. We urge that caution should be exercised when using these models in real-life settings.

We still consider that the benefits of our study outweigh the potential dangers. Deep learning-based stylometry has been an active research area in recent years. While many studies focus on improving performance, we provide insights into how some of these models make decisions based on certain linguistic features, and we expose some of the models' inductive biases. This interpretability analysis could be used to guide the proper use of these methods in decision-making processes. The analysis could also be useful in developing adversarial techniques that guard against the malicious use of such technologies.

All of our experiments were performed on public data, in accordance with the terms of service. All authors were anonymized. In addition, the term "Siamese" may be considered offensive when it refers to some groups of people. The use of this term follows the research naming of a mathematical model in the machine learning literature. We use the word here to refer to a neural network model and make no reference to particular population groups.

Acknowledgements

We thank Professor Patrice Beddor and Jiaxin Pei at the University of Michigan and Zuoyu Tian at Indiana University Bloomington for their helpful discussions. We are also grateful for comments from anonymous reviewers, which helped improve the paper greatly. This material is based upon work supported by the National Science Foundation under Grant No. 1850221.

References

Michael Barlow. 2013. Individual differences and usage-based grammar. International Journal of Corpus Linguistics, 18(4):443–478.

Michael Barlow. 2018. The individual and the group from a corpus perspective. The Corpus Linguistic Discourse: In Honour of Wolfgang Teubert, Amsterdam, pages 163–184.

Angelo Basile, Albert Gatt, and Malvina Nissim. 2019. You write like you eat: Stylistic variation as a predictor of social stratification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2583–2593, Florence, Italy. Association for Computational Linguistics.

Billal Belainine, Fatiha Sadat, Mounir Boukadoum, and Hakim Lounis. 2020. Towards a multi-dataset for complex emotions learning based on deep neural networks. In Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources, pages 50–58, Marseille, France. European Language Resources Association.

Douglas Biber and Susan Conrad. 2019. Register, Genre, and Style. Cambridge University Press.

Bernard Bloch. 1948. A set of postulates for phonemic analysis. Language, 24(1):3–46.

Benedikt Boenninghoff, Steffen Hessler, Dorothea Kolossa, and Robert M. Nickel. 2019a. Explainable authorship verification in social media via attention-based similarity learning. In 2019 IEEE International Conference on Big Data, pages 36–45. IEEE.

Benedikt Boenninghoff, Robert M. Nickel, Steffen Zeiler, and Dorothea Kolossa. 2019b. Similarity learning for authorship verification in social media. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2457–2461. IEEE.

Marcelo Luiz Brocardo, Issa Traore, Sherif Saad, and Isaac Woungang. 2013. Authorship verification for short messages using stylometry. In 2013 International Conference on Computer, Information and Telecommunication Systems (CITS), pages 1–6. IEEE.

Chandrahas, Aditya Sharma, and Partha Talukdar. 2018. Towards understanding the geometry of knowledge graph embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 122–131, Melbourne, Australia. Association for Computational Linguistics.

Jonathan P. Chang, Caleb Chiam, Liye Fu, Andrew Wang, Justine Zhang, and Cristian Danescu-Niculescu-Mizil. 2020. ConvoKit: A toolkit for the analysis of conversations. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 57–60, 1st virtual meeting. Association for Computational Linguistics.
Malcolm Coulthard. 2004. Author identification, idiolect, and linguistic uniqueness. Applied Linguistics, 25(4):431–447.

Malcolm Coulthard, Alison Johnson, and David Wright. 2016. An introduction to forensic linguistics: Language in evidence. Routledge.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Penelope Eckert. 2012. Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41:87–100.

Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Understanding undesirable word embedding associations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1696–1705, Florence, Italy. Association for Computational Linguistics.

Lucie Flekova, Daniel Preoţiuc-Pietro, and Lyle Ungar. 2016. Exploring stylistic variation with age and income on Twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 313–319, Berlin, Germany. Association for Computational Linguistics.

Manuela Hürlimann, Benno Weck, Esther van den Berg, Simon Suster, and Malvina Nissim. 2015. GLAD: Groningen lightweight authorship detection. In CLEF (Working Notes).

Alison Johnson and David Wright. 2014. Identifying idiolect in forensic authorship attribution: an n-gram textbite approach. Language and Law, 1(1).

Barbara Johnstone and Judith Mattson Bean. 1997. Self-expression and linguistic variation. Language in Society, 26(2):221–246.

Mike Kestemont. 2014. Function words in authorship attribution. From black magic to theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL), pages 59–66, Gothenburg, Sweden. Association for Computational Linguistics.

Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Martin Potthast, and Benno Stein. 2020. Overview of the cross-domain authorship verification task at PAN 2020. In CLEF.

Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Walter Daelemans, Martin Potthast, and Benno Stein. 2019. Overview of the Cross-domain Authorship Attribution Task at PAN 2019. In CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org.

Mike Kestemont, Michael Tschuggnall, Efstathios Stamatatos, Walter Daelemans, Günther Specht, Benno Stein, and Martin Potthast. 2018. Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change
Tim Grant. 2012. Txt 4n6: method, consistency, and          Detection. In Working Notes Papers of the CLEF
  distinctiveness in the analysis of sms text messages.     2018 Evaluation Labs, volume 2125 of CEUR Work-
  Journal of Law and Policy, 21:467.                        shop Proceedings. CEUR-WS.org.

Tim Grant and Nicci MacLeod. 2020. Language and           Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha
  Online Identities: The Undercover Policing of Inter-      Kwak. 2020. Proxy anchor loss for deep metric
  net Sexual Crime. Cambridge University Press.             learning. In Proceedings of the IEEE/CVF Confer-
                                                            ence on Computer Vision and Pattern Recognition,
Tim Grant and Nicola MacLeod. 2018. Resources and           pages 3238–3247.
  constraints in linguistic identity performance–a the-
  ory of authorship. Language and Law/Linguagem e         William Labov. 1972. Sociolinguistic patterns. 4. Uni-
  Direito, 5(1):80–96.                                      versity of Pennsylvania Press.

Susan C. Herring. 2004. Computer-Mediated Dis-            William Labov. 1989. Exact description of the speech
  course Analysis: An Approach to Researching On-           community: Short a in philadelphia. Language
  line Behavior. Learning in doing, pages 338–376.          Change and Variation, pages 1–57.
  Cambridge University Press, New York, NY, US.
                                                          Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Im-
David I Holmes. 1998. The evolution of stylometry          proving distributional similarity with lessons learned
  in humanities scholarship. Literary and linguistic       from word embeddings. Transactions of the Associ-
  computing, 13(3):111–117.                                ation for Computational Linguistics, 3:211–225.

Dirk Hovy, Anders Johannsen, and Anders Søgaard.          Tal Linzen. 2016. Issues in evaluating semantic spaces
  2015. User review sites as a resource for large-scale     using word analogies. In Proceedings of the 1st
  sociolinguistic studies. In Proceedings of the 24th       Workshop on Evaluating Vector-Space Representa-
  international conference on World Wide Web, pages         tions for NLP, pages 13–18, Berlin, Germany. As-
  452–461.                                                  sociation for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Nicci MacLeod and Tim Grant. 2012. Whose tweet? Authorship analysis of micro-blogs and other short-form messages. In Proceedings of The International Association of Forensic Linguists' Tenth Biennial Conference.

Miriam Meyerhoff and James A. Walker. 2007. The persistence of variation in individual grammars: Copula absence in 'urban sojourners' and their stay-at-home peers, Bequia (St Vincent and the Grenadines). Journal of Sociolinguistics, 11(3):346–366.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.

Tempestt Neal, Kalaivani Sundararajan, Aneez Fatima, Yiming Yan, Yingfei Xiang, and Damon Woodard. 2017. Surveying stylometry techniques and applications. ACM Computing Surveys (CSUR), 50(6):1–36.

Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, Hong Kong, China. Association for Computational Linguistics.

Juanita Ordoñez, Rafael Rivera Soto, and Barry Y. Chen. 2020. Will Longformers PAN out for authorship verification? Working Notes of CLEF.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Natalie Schilling-Estes. 1998. Investigating "self-conscious" speech: The performance register in Ocracoke English. Language in Society, 27(1):53–83.

M. Teresa Turell. 2010. The use of textual, grammatical and sociolinguistic evidence in forensic text comparison. International Journal of Speech, Language & the Law, 17(2).

Soroush Vosoughi, Helen Zhou, and Deb Roy. 2015. Digital stylometry: Linking profiles across social networks. In International Conference on Social Informatics, pages 164–177. Springer.

Ronald Wardhaugh. 2011. An introduction to sociolinguistics, volume 28. John Wiley & Sons.

Janith Weerasinghe and Rachel Greenstadt. 2020. Feature vector difference based neural network and logistic regression models for authorship verification: Notebook for PAN at CLEF 2020. In CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org.

David Wright. 2013. Stylistic variation within genre conventions in the Enron email corpus: Developing a text-sensitive methodology for authorship research. International Journal of Speech, Language & the Law, 20(1).

David Wright. 2017. Using word n-grams to identify authors and idiolects: A corpus approach to a forensic linguistic problem. International Journal of Corpus Linguistics, 22(2):212–241.

David Wright. 2018. Idiolect. Oxford Bibliographies in Linguistics.
A    Hyperparameter tuning

In this section, we report the results of our hyperparameter tuning process. Table 9 reports some additional results obtained during tuning. Changing the masking probability or the margins for the contrastive loss has an impact on the final accuracy. This search was manual and not exhaustive; we used the best parameters in the main text.

For the actual implementation, we used an effective batch size of 256. The default optimizer was Adam with a learning rate of 1e-5. All models were trained on a single Nvidia V100 GPU with 16GB of memory. The models were set to train for 5 epochs, but we applied early stopping when the validation accuracy stopped increasing. Each epoch took about 2 hours to complete. For each model, we limited the maximum length of text samples to 100 tokens, though the actual definition of a token depended on the tokenizer used.

 Model                        Accuracy    F1      AUC
 τs = 0.7, τd = 0.3, α = 30
 SRoBERTa - Amazon            0.744       0.676   0.901
 SRoBERTa - Reddit            0.723       0.713   0.804
 τs = 0.8, τd = 0.2, α = 30
 SRoBERTa - Amazon            0.809       0.814   0.889
 SRoBERTa - Reddit            0.708       0.722   0.784
 τs = 0.6, τd = 0.4, α = 10
 SRoBERTa - Amazon            0.822       0.826   0.903
 SRoBERTa - Reddit            0.730       0.734   0.810
 τs = 0.6, τd = 0.4, α = 5
 SRoBERTa - Amazon            0.819       0.825   0.901
 SRoBERTa - Reddit            0.720       0.723   0.798

Table 9: Results of authorship verification on the development/test sets

B   Baseline methods

We ran several baseline models for comparison. When setting up these models, we tried to make minimal changes to the original implementations. Details of our changes are provided below.

GLAD We used the original code for GLAD (https://github.com/pan-webis-de/huerlimann15). The linguistic features were extracted using the combo4 options in the code, which cover 23 linguistic features. While a support vector machine (SVM) was used as the classifier in the original paper (Hürlimann et al., 2015), we found that SVMs did not scale to the size of our data. Instead, we ran a logistic regression model on the features, as this allowed us to interpret the feature importance. The performance of logistic regression was very close to that of the Random Forest classifier in their code.

FVD The model was trained with the code released by the original authors (https://github.com/pan-webis-de/weerasinghe20) (Weerasinghe and Greenstadt, 2020). We kept the original feature extraction methods and model architecture. The input to the neural network was a 2314-dimensional feature vector, computed by taking the absolute difference between the linguistic feature vectors of the two authors under comparison. The two-layered fully connected neural network was trained for 100 epochs and the model with the best validation accuracy was kept.

AdHominem We used the original implementation (https://github.com/boenninghoff/AdHominem) provided by the author. While this implementation differs slightly from that described in Boenninghoff et al. (2019a), no modification was made to the code other than adapting it to work on our own data. The same pre-processing method, model architecture, and parameters were kept. However, the evaluation code was not used, since it ignores uncertain samples, which is standard practice at PAN 2020 (Kestemont et al., 2020). The model was trained for 5 epochs and we kept only the model with the best validation results.

BERTConcat /RoBERTaConcat We mostly followed the description by Ordoñez et al. (2020). While Ordoñez et al. (2020) used a variant of RoBERTa known as the Longformer (Belainine et al., 2020), we re-implemented the model using the original pre-trained RoBERTa, so that the model can be directly compared to the Siamese version. Since the Longformer is highly similar to RoBERTa and BERT, we do not expect a significant performance gap between them.

Evaluation To ensure consistency, all evaluation metrics were computed with scikit-learn functions: accuracy_score for accuracy,
f1_score for F1, and roc_auc_score for AUC.

C    Error analysis

We also analyzed the error distributions across different conditions, shown in Table 10. Given a pair of texts, we categorized it along two dimensions, same-author (SA) vs. different-author (DA) and same-domain (SD) vs. different-domain (DD), yielding four categories. Unsurprisingly, most methods still struggle with SA-DD and DA-SD pairs, suggesting that domain-specific/topic information partially interferes with the extraction of writing styles. The cases of RoBERTaConcat and BERTConcat are particularly interesting, as both models consistently performed worse on SA pairs but outperformed the rest of the models on DA pairs. Cosine distance-based models seem to better balance the trade-off across conditions. This shows that model architectures also exhibit inductive biases of their own, which may make them more or less effective in certain conditions.

 Model                          Accuracy
                 SA-SD    SA-DD    DA-SD    DA-DD
 GLAD            73.7%    58.4%    62.9%    73.6%
 FVDNN           75.0%    67.5%    55.7%    62.3%
 AdHominem       81.2%    73.1%    63.6%    75.4%
 BERTConcat      61.8%    51.6%    85.3%    86.5%
 SBERT           84.1%    77.2%    67.7%    78.0%
 RoBERTaConcat   64.4%    49.6%    89.2%    92.5%
 SRoBERTa        88.4%    82.5%    74.8%    83.2%

Table 10: Accuracy by domain and authorship. All experiments were run on the Amazon reviews.

D    Comparing tokenization methods

Tokenization methods For word-based tokenization, we used the word_tokenize function in NLTK. Either the 30k or the 50k most frequent lexical tokens were kept as the vocabulary for training the LSTM model, plus a padding token and an OOV token. For subword tokenization, we directly used the pre-trained BPE tokenizers for BERT and RoBERTa, accessed through HuggingFace's Transformers.

Model specification The underlying model is an LSTM-based Siamese network. It consists of two bidirectional LSTM layers with 300 hidden states for each direction. The last hidden states of the final layer in the forward and backward directions were concatenated as the representation of the whole input text, which was then passed to a two-layer fully connected network with 300 hidden states in each layer. The similarity between the two paired texts was computed with the cosine distance function. The hyperparameters for the loss function were τd = 0.4 and τs = 0.6. No pre-trained word embedding weights were used; all weights were trained from scratch.

Training details The training, development, and test data were the same as those in the main experiments. The model was optimized with the Adam optimizer with a learning rate of 0.001. Gradient clipping was applied to stabilize training, with the maximum gradient norm set to 1. The model was trained on an 11GB RTX 2080Ti with an effective batch size of 256 for 10 epochs. The average training time for each model was about 3 hours.

E   Distinctiveness and consistency

Here we also show the joint distribution of distinctiveness and consistency given by SBERT in Figure 3. The distribution is approximately bivariate normal, and the two metrics are not correlated.

The overall distributions of distinctiveness and consistency computed using different models are given in Figure 4. The distribution of style distinctiveness conforms to a normal distribution regardless of the model (Figure 4), though the distribution is more peaked for better models. The distribution of style consistency also conforms to a normal distribution, and the distributions predicted by different models are highly similar.

Correlations across models We used Spearman correlation coefficients to assess to what extent different models assign similar rankings of distinctiveness and consistency. Results are presented in Table 11. The consistency scores given by all models are moderately correlated, yet the correlations for distinctiveness are generally weak.

F   Additional analysis: characterizing
    sociolects

Language varies at both individual and collective levels (Eckert, 2012). In this section, diagnostic classification is employed to probe to what extent collective language variation is retained in the stylistic embeddings.
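The diagnostic-classification setup can be sketched as follows. This is a minimal illustration, not our actual analysis: in the real probe the inputs would be the stylistic embeddings produced by a trained encoder together with real group-level labels, whereas here both are synthetically generated and all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1000 "authors" with 64-dimensional style
# embeddings and a binary group (sociolect) label that is weakly
# encoded in the first embedding dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                 # style embeddings
y = (X[:, 0] + rng.normal(size=1000)) > 0       # noisy group label

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Diagnostic classification: if a simple linear probe recovers the
# group attribute well above the 50% chance level, that collective
# information is retained in the embeddings.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, probe.predict(X_te))
print(f"probe accuracy: {acc:.3f}")
```

The probe's held-out accuracy, compared against the chance level, is the quantity reported by this kind of analysis; a near-chance score would indicate the attribute is not linearly recoverable from the embeddings.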
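For completeness, the scikit-learn evaluation calls listed in Appendix B can be illustrated on a toy verification run; the gold labels, predictions, and scores below are invented purely for the example.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy verification outputs: 1 = same author, 0 = different author.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # thresholded predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # model confidence scores

print(accuracy_score(y_true, y_pred))   # 0.75
print(f1_score(y_true, y_pred))         # 0.75
print(roc_auc_score(y_true, y_score))   # 0.9375
```

Note that roc_auc_score takes the model's continuous scores rather than the thresholded predictions, since AUC measures ranking quality.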