DIRECTPROBE: Studying Representations without Classifiers
Yichu Zhou and Vivek Srikumar
School of Computing, University of Utah
flyaway@cs.utah.edu, svivek@cs.utah.edu

Abstract

Understanding how linguistic structure is encoded in contextualized embeddings could help explain their impressive performance across NLP. Existing approaches for probing them usually call for training classifiers and use the accuracy, mutual information, or complexity as a proxy for the representation's goodness. In this work, we argue that doing so can be unreliable because different representations may need different classifiers. We develop a heuristic, DIRECTPROBE, that directly studies the geometry of a representation by building upon the notion of a version space for a task. Experiments with several linguistic tasks and contextualized embeddings show that, even without training classifiers, DIRECTPROBE can shed light on how an embedding space represents labels, and also anticipate classifier performance for the representation.

1 Introduction

Distributed representations of words (e.g., Peters et al., 2018; Devlin et al., 2019) have propelled the state-of-the-art across NLP to new heights. Recently, there has been much interest in probing these opaque representations to understand the information they bear (e.g., Kovaleva et al., 2019; Conneau et al., 2018; Jawahar et al., 2019). The most commonly used strategy calls for training classifiers on them to predict linguistic properties such as syntax, or cognitive skills like numeracy (e.g., Kassner and Schütze, 2020; Perone et al., 2018; Yaghoobzadeh et al., 2019; Krasnowska-Kieraś and Wróblewska, 2019; Wallace et al., 2019; Pruksachatkun et al., 2020). Using these classifiers, criteria such as accuracy or model complexity are used to evaluate the representation quality for the task (e.g., Goodwin et al., 2020; Pimentel et al., 2020a; Michael et al., 2020).

Such classifier-based probes are undoubtedly useful for estimating a representation's quality for a task. However, their ability to reveal the information in a representation is occluded by numerous factors, such as the choice of the optimizer and the initialization used to train the classifiers. For example, in our experiments using the task of preposition supersense prediction (Schneider et al., 2018), we found that the accuracies across different training runs of the same classifier can vary by as much as ~8%! (Detailed results can be found in Appendix F.)

Indeed, the very choice of a classifier influences our estimate of the quality of a representation. For example, one representation may achieve the best classification accuracy with a linear model, whereas another may demand a multi-layer perceptron for its non-linear decision boundaries. Of course, enumerating every possible classifier for a task is untenable. A common compromise involves using linear classifiers to probe representations (Alain and Bengio, 2017; Kulmizev et al., 2020), but doing so may mischaracterize representations that need non-linear separators. Some work recognizes this problem (Hewitt and Liang, 2019) and proposes to report probing results for at least logistic regression and a multi-layer perceptron (Eger et al., 2019), or to compare the learning curves between multiple controls (Talmor et al., 2020). However, the success of these methods still depends on the choice of classifiers.

In this paper, we pose the question: Can we evaluate the quality of a representation for an NLP task directly, without relying on classifiers as a proxy? Our approach is driven by a characterization of not one, but all decision boundaries in a representation that are consistent with a training set for a task. This set of consistent (or approximately consistent) classifiers constitutes the version space for the task (Mitchell, 1982), and includes both simple (e.g., linear) and complex (e.g., non-linear) classifiers for the task. However, perfectly characterizing the version space for a problem presents computational challenges.
To develop an approximation, we note that any decision boundary partitions the underlying feature space into contiguous regions associated with labels. We present a heuristic approach called DIRECTPROBE, which builds upon hierarchical clustering to identify such regions for a given task and embedding.

The resulting partitions allow us to directly probe the embeddings via their geometric properties. For example, distances between these regions correlate with the difficulty of learning with the representation: larger distances between regions of different labels indicate that there are more consistent separators between them, and imply easier learning and better generalization of classifiers. Further, by assigning test points to their closest partitions, we get a parameter-free classifier as a side effect, which can help benchmark representations without committing to a specific family of classifiers (e.g., linear) as probes.

Our experiments study five different NLP tasks that involve syntactic and semantic phenomena. We show that our approach allows us to ascertain, without training a classifier, (a) if a representation admits a linear separator for a dataset, (b) how different layers of BERT differ in their representations for a task, (c) which labels for a task are more confusable, (d) the expected performance of the best classifier for the task, and (e) the impact of fine-tuning.

In summary, the contributions of this work are:

1. We point out that training classifiers as probes is not reliable, and instead, we should directly analyze the structure of a representation space.

2. We formalize the problem of evaluating representations via the notion of version spaces and introduce DIRECTPROBE, a heuristic method that approximates it directly without training classifiers.

3. Via experiments, we show that our approach can help identify how good a given representation will be for a prediction task.[1]

[1] DIRECTPROBE can be downloaded from https://github.com/utahnlp/DirectProbe.

2 Representations and Learning

In this section, we will first briefly review the relationship between representations and model learning. Then, we will introduce the notion of ε-version spaces to characterize representation quality.

2.1 Two Sub-tasks of a Learning Problem

The problem of training a predictor for a task can be divided into two sub-problems: (a) representing data, and (b) learning a predictor (Bengio et al., 2013). The former involves transforming input objects x—words, pairs of words, sentences, etc.—into a representation E(x) ∈ R^n that provides features for the latter. Model learning builds a classifier h over E(x) to make a prediction, denoted by the function composition h(E(x)).

[Figure 1: An illustration of the total work in a learning problem, visualized as the distance traveled from the input bar to the output bar. Different representations Ei and different classifiers hi provide different effort.]

Figure 1 illustrates the two sub-tasks and their roles in figuratively "transporting" an input x towards a prediction with a probability of error below a small ε. For the representation E1, the best classifier h1 falls short of this error requirement. The representation E4 does not need an additional classifier because it is identical to the label. The representations E2 and E3 both admit classifiers h2 and h3 that meet the error threshold. Further, note that E3 leaves less work for the classifier than E2, suggesting that it is a better representation as far as this task is concerned.

This illustration gives us the guiding principle for this work[2]:

    The quality of a representation E for a task is a function of both the performance and the complexity of the best classifier h over that representation.

[2] In addition to the performance and complexity of the best classifier, other aspects such as sample complexity, the stability of learning, etc. are also important. We do not consider them in this work: these aspects are related to optimization and learnability, and more closely tied to classifier-based probes.

Two observations follow from the above discussion. First, we cannot enumerate every possible classifier to find the best one. Other recent work, such as that of Xia et al. (2020), makes a similar point.
Instead, we need to resort to an approximation to evaluate representations. Second, trivially, the best representation for a task is identical to an accurate classifier; in the illustration in Figure 1, this is represented by E4. However, such a representation is over-specialized to one task. In contrast, learned representations like BERT promise task-independent representations that support accurate classifiers.

2.2 ε-Version Spaces

Given a classification task, we seek to disentangle the evaluation of a representation E from the classifiers h that are trained over it. To do so, the first step is to characterize all classifiers supported by a representation.

Classifiers are trained to find a hypothesis (i.e., a classifier) that is consistent with a training set. A representation E admits a set of such hypotheses, and a learner chooses one of them. Consider the top-left example of Figure 2. There are many classifiers that separate the two classes; the figure shows two linear (h1 and h2) and one non-linear (h3) example. Given a set H of classifiers of interest, the subset of classifiers that are consistent with a given dataset represents the version space with respect to H (Mitchell, 1982). To account for errors or noise in data, we define an ε-version space: the set of hypotheses that can achieve less than ε error on a given dataset.

Let us formalize this definition. Suppose H represents the whole hypothesis space consisting of all possible classifiers h of interest. The ε-version space Vε(H, E, D) expressed by a representation E for a labeled dataset D is defined as:

    Vε(H, E, D) ≜ {h ∈ H | err(h, E, D) ≤ ε}    (1)

where err represents training error.

Note that the ε-version space Vε(H, E, D) is only a set of functions and does not involve any learning. However, understanding a representation requires examining its ε-version space—a larger one would allow for easier learning.

[Figure 2: Using piecewise linear functions to mimic decision boundaries in two different cases. The solid lines on the left are the real decision boundaries for a binary classification problem. The dashed lines in the middle are the piecewise linear functions used to mimic these decision boundaries. The gray area is the region that a separator must cross. The connected points on the right represent the convex regions that the piecewise linear separators lead to.]

In previous work, the quality of a representation E for a task represented by a dataset D is measured via properties of a specific h ∈ Vε(H, E, D), typically a linear probe. Commonly measured properties include generalization error (Kim et al., 2019), minimum description length of labels (Voita and Titov, 2020), and complexity (Whitney et al., 2020). Instead, we seek to directly evaluate the ε-version space of a representation for a task, without committing to a restricted set of probe classifiers.

3 Approximating an ε-Version Space

Although the notion of an ε-version space is well defined, finding it for a given representation and task is impossible because it involves enumerating all possible classifiers. In this section, we describe a heuristic to approximate the ε-version space. We call this approach DIRECTPROBE.

3.1 Piecewise Linear Approximation

Each classifier in Vε(H, E, D) is a decision boundary in the representation space that separates examples with different labels (see Figure 2, left). The decisions of a classifier can be mimicked by a set of piecewise linear functions. Figure 2 (middle) shows two examples. At the top is the simple case with linearly separable points. At the bottom is a more complicated case with a circular separator; the set of piecewise linear functions that matches its decisions needs at least three lines.

The ideal piecewise linear separator partitions training points into groups, each of which contains points with exactly one label. These groups can be seen as defining convex regions in the embedding space (see Figure 2, left). Any classifier in Vε(H, E, D) must cross the regions between the groups with different labels; these are the regions that separate labels from each other, as shown in the gray areas in Figure 2 (middle). Inspired by this, we posit that these regions between groups with different labels, and indeed the partitions themselves, offer insight into Vε(H, E, D).
3.2 Partitioning Training Data

Although finding the set of all decision boundaries remains hard, finding the regions between the convex groups that these piecewise linear functions split the data into is less so. Grouping data points in this fashion is related to a well-studied problem, namely clustering, and several recent works have looked at clustering of contextualized representations (Reimers et al., 2019; Aharoni and Goldberg, 2020; Gupta et al., 2020). In this work, we have a new clustering problem with the following criteria: (i) All points in a group have the same label; we need to ensure we are mimicking the decision boundaries. (ii) There are no overlaps between the convex hulls of the groups; if the convex hulls of two groups do not overlap, there must exist a line that separates them, as guaranteed by the hyperplane separation theorem. (iii) The total number of groups is minimized; otherwise, a trivial solution is for each data point to become a group by itself.

Note that the criteria do not require all points of one label to be grouped together. For example, in Figure 2 (bottom right), points of the circle (i.e., ◦) class are in three subgroups.

Next, let us see a heuristic for partitioning the training set into clusters based on these criteria.

3.3 A Heuristic for Clustering

To find clusters as described in §3.2, we define a simple bottom-up heuristic clustering strategy that forms the basis of DIRECTPROBE (Algorithm 1). In the beginning, each example xi with label yi in a training set D is a cluster Ci by itself. In each iteration, we select the closest pair of clusters with the same label and merge them (lines 4, 5). If the convex hull of the new cluster does not overlap with all other clusters[3], we keep this new cluster (line 9). Otherwise, we flag this pair (line 7) and choose the next closest pair of clusters. We repeat these steps till no more clusters can be merged.

Algorithm 1 DIRECTPROBE: Bottom-up Clustering
Input: A dataset D with labeled examples (xi, yi).
Output: A set of clusters of points C.
 1: Initialize Ci as the cluster for each (xi, yi) ∈ D
 2: C = {C0, C1, ...}, B = ∅
 3: repeat
 4:    Select the closest pair (Ci, Cj) ∈ C that has the same label and is not flagged in B.
 5:    S ← Ci ∪ Cj
 6:    if convex hull of S overlaps with other elements of C then
 7:        B ← B ∪ {(Ci, Cj)}
 8:    else
 9:        Update C by removing Ci, Cj and adding S
10:    end if
11: until no more pairs can be merged
12: return C

We define the distance between clusters Ci and Cj as the Euclidean distance between their centroids. Although Algorithm 1 does not guarantee the minimization criterion in §3.2 since it is a greedy heuristic, we will see in our experiments that, in practice, it works well.

[3] Note that "all other clusters" here means the other clusters with different labels. There is no need to prohibit overlap between clusters with the same label since they might be merged in later iterations.
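To make Algorithm 1 concrete, the following is a minimal Python sketch of the merge loop under our reading of the pseudocode; it is not the released implementation, and `hulls_overlap` is a placeholder for the LP-based test described next.

```python
import numpy as np

def direct_probe(X, y, hulls_overlap):
    """Bottom-up clustering (Algorithm 1). X: (n, d) embeddings, y: labels.
    hulls_overlap(A, B) is the LP-based overlap test described below."""
    clusters = [{"points": X[i:i + 1], "label": y[i]} for i in range(len(X))]
    flagged = set()  # the black list B of pairs that cannot be merged
    while True:
        # Closest same-label pair (centroid distance) that is not flagged.
        best, best_d = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if a["label"] != b["label"] or (id(a), id(b)) in flagged:
                    continue
                d = np.linalg.norm(a["points"].mean(0) - b["points"].mean(0))
                if d < best_d:
                    best, best_d = (i, j), d
        if best is None:
            return clusters  # no more pairs can be merged
        i, j = best
        merged = {"points": np.vstack([clusters[i]["points"], clusters[j]["points"]]),
                  "label": clusters[i]["label"]}
        others = [c["points"] for k, c in enumerate(clusters)
                  if k not in (i, j) and c["label"] != merged["label"]]
        if any(hulls_overlap(merged["points"], o) for o in others):
            flagged.add((id(clusters[i]), id(clusters[j])))  # flag this pair
        else:
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
```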
Overlaps Between Convex Hulls A key point of Algorithm 1 is checking if the convex hulls of two sets of points overlap (line 6). Suppose we have two sets C = {x1^C, ..., xn^C} and C′ = {x1^C′, ..., xm^C′}. We can restate this as the problem of checking if there is some vector w and scalar b such that

    ∀ xi^C ∈ C:    wᵀE(xi^C) + b ≥ 1
    ∀ xj^C′ ∈ C′:  wᵀE(xj^C′) + b ≤ −1    (2)

where E is the representation under investigation. We can state this problem as a linear program that checks for the feasibility of the system of inequalities. If the LP problem is feasible, there must exist a separator between the two sets of points, and they do not overlap. In our implementation, we use the Gurobi optimizer (Gurobi Optimization, 2020) to solve the linear programs.[4]

[4] Algorithm 1 can be made faster by avoiding unnecessary calls to the solver. Appendix A gives a detailed description of these techniques, which are also incorporated in the code release.
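As a sketch of this feasibility check, the system in Eq. (2) can also be posed to an off-the-shelf LP solver; the version below uses scipy instead of Gurobi, which is our substitution, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def hulls_overlap(C, C_prime):
    """True iff the convex hulls of two point sets (n x d, m x d) overlap.
    A feasible LP means a separating hyperplane (w, b) exists, so no overlap."""
    d = C.shape[1]
    # Variables z = (w, b). Encode Eq. (2) as A_ub @ z <= b_ub:
    #   -(w.x + b) <= -1 for x in C,  and  w.x' + b <= -1 for x' in C'.
    A_ub = np.vstack([np.hstack([-C, -np.ones((len(C), 1))]),
                      np.hstack([C_prime, np.ones((len(C_prime), 1))])])
    b_ub = -np.ones(len(A_ub))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return not res.success  # infeasible system => the hulls intersect

# Two well-separated Gaussian blobs: the check should report no overlap.
rng = np.random.default_rng(0)
print(hulls_overlap(rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))))  # False
```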
3.4 Noise: Controlling ε

Clusters with only a small number of points can be treated as noise. Geometrically, a point in the neighborhood of other points with different labels can be thought of as noise: other clusters cannot merge with it because of the no-overlap constraint. As a result, such clusters will have only a few points (one or two in practice). If we want a zero error rate on the training data, we can keep these noise points; if we allow a small error rate ε, then we can remove these noise clusters. In our experiments, for simplicity, we keep all clusters.

4 Representations, Tasks and Classifiers

Before looking at the analysis offered by the partitions obtained via DIRECTPROBE in §5, let us first enumerate the English NLP tasks and representations we will encounter.

4.1 Representations

Our main experiments focus on BERTbase,cased, and we also show additional analysis on other contextual representations: ELMo (Peters et al., 2018)[5], BERTlarge,cased (Devlin et al., 2019), RoBERTabase and RoBERTalarge (Liu et al., 2019b). We refer the reader to Appendix C for further details about these embeddings.

We use the average of subword embeddings as the token vector for the representations that use subwords. We use the original implementation of ELMo, and the HuggingFace library (Wolf et al., 2020) for the others.

[5] For simplicity, we use the equally weighted three layers of ELMo in our experiments.
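For instance, a token vector for a word split into several subwords could be obtained as below. This sketch uses the HuggingFace fast-tokenizer word alignment, which is one plausible realization of the averaging described above, not necessarily the paper's exact pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

def token_vectors(words):
    """Embed a pre-tokenized sentence and average the subword vectors of each word."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (num_subwords, 768)
    word_ids = enc.word_ids()                               # subword -> word index
    return torch.stack([
        hidden[[k for k, w in enumerate(word_ids) if w == i]].mean(dim=0)
        for i in range(len(words))
    ])                                                      # (num_words, 768)

vectors = token_vectors(["The", "embeddings", "are", "averaged"])
```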
4.2 Tasks

We conduct our experiments on five NLP tasks that cover the varied usages of word representations (token-based, span-based, and token pairs) and include both syntactic and semantic prediction problems. Appendix D has more details about the tasks to help with replication.

Preposition supersense disambiguation represents a pair of tasks, involving classifying a preposition's semantic role (SS-role) and semantic function (SS-func). Following previous work (Liu et al., 2019a), we only use the single-token prepositions in the Streusle v4.2 corpus (Schneider et al., 2018).

Part-of-speech tagging (POS) is a token-level prediction task. We use the English portion of the parallel universal dependencies treebank (ud-pud, Nivre et al., 2016).

Semantic relation (SR) is the task of predicting the semantic relation between pairs of nominals. We use the dataset of SemEval-2010 task 8 (Hendrickx et al., 2010). To represent the pair of nominals, we concatenate their embeddings. Some nominals can be spans instead of individual tokens, and we represent them via the average embedding of the tokens in the span.

Dependency relation (DEP) is the task of predicting the syntactic dependency relation between a token whead and its modifier wmod. We use the universal dependency annotation of the English web treebank (Bies et al., 2012). As with semantic relations, to represent the pair of tokens, we concatenate their embeddings.

4.3 Classifier Accuracy

The key starting point of this work is that restricting ourselves to linear probes may be insufficient. To validate the results of our analysis, we evaluate a large collection of classifiers—from simple linear classifiers to two-layer neural networks—for each task. For each one, we choose the best hyperparameters using cross-validation. From these classifiers, we find the best test accuracy for each task and representation. All classifiers are trained with the scikit-learn library (Pedregosa et al., 2011). To reduce the impact of randomness, we trained each classifier 10 times with different initializations and report their average accuracy. Appendix E summarizes the best classifiers we found and their performance.

5 Experiments and Analysis

DIRECTPROBE helps partition an embedding space for a task, and thus characterize its ε-version space. Here, we will see that these clusters do indeed characterize various linguistic properties of the representations we consider.

5.1 Number of Clusters

The number of clusters is an indicator of the linear separability of a representation for a task. The best scenario is when the number of clusters equals the number of labels. In this case, examples with the same label are placed close enough by the representation to form a cluster that is separable from other clusters. A simple linear multi-class classifier can fit well in this scenario. In contrast, if the number of clusters is more than the number of labels, then some labels are distributed across multiple clusters (as in Figure 2, bottom). There must be a non-linear decision boundary. Consequently, this scenario calls for a more complex classifier, e.g., a multi-layer neural network.
In other words, using the clusters, and without training a classifier, we can answer the question: can a linear classifier fit the training set for a task with a given representation?

To validate our predictions, we use the training accuracy of a linear SVM (Chang and Lin, 2011) classifier. If a linear SVM can perfectly fit (100% accuracy) a training set, then there exist linear decision boundaries that separate the labels. Table 1 shows the linearity experiments on the POS task, which has 17 labels in total. All representations except RoBERTalarge have 17 clusters, suggesting a linearly separable space, which is confirmed by the SVM accuracy. We conjecture that this may be the reason why linear models usually work for BERT-family models. Of course, linear separability does not mean the task is easy or that the best classifier is a linear one. We found that, while most representations we considered are linearly separable for most of our tasks, the best classifier is not always linear. We refer the reader to Appendix E for the full results.

Table 1: Linearity experiments on the POS tagging task. Our POS tagging task has 17 labels in total. Both the linear SVM and the number of clusters suggest that RoBERTalarge is non-linear while the others are all linear, which means the best classifier for RoBERTalarge is not a linear model. More details can be found in Appendix E.

    Embedding          Linear SVM Training Accuracy    #Clusters
    BERTbase,cased     100                             17
    BERTlarge,cased    100                             17
    RoBERTabase        100                             17
    RoBERTalarge       99.97                           1487
    ELMo               100                             17

5.2 Distances between Clusters

As we mentioned in §3.1, a learning process seeks to find a decision boundary that separates clusters with different labels. Intuitively, a larger gap between them would make it easier for a learner to find a suitable hypothesis h that generalizes better.

We use the distance between the convex hulls of clusters as an indicator of the size of these gaps. We note that the problem of computing the distance between the convex hulls of clusters is equivalent to finding the maximum-margin separator between them. To find the distance between two clusters, we train a linear SVM (Chang and Lin, 2011) that separates them and compute its margin. The distance we seek is twice the margin. For a given representation, we are interested in the minimum distance across all pairs of clusters with different labels.
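A sketch of this computation: for separable clusters, the gap between the two hulls can be read off the learned weight vector of a maximum-margin separator as 2/||w||. The large-C setting below is our approximation of a hard-margin SVM, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import LinearSVC

def cluster_distance(A, B):
    """Distance between the convex hulls of two separable point sets, recovered
    from the maximum-margin separator: the gap is twice the margin, 2 / ||w||."""
    X = np.vstack([A, B])
    y = np.array([0] * len(A) + [1] * len(B))
    svm = LinearSVC(C=1e4, max_iter=200000).fit(X, y)  # large C ~ hard margin
    return 2.0 / np.linalg.norm(svm.coef_)

def min_distance(clusters):
    """The quantity of interest: the minimum over all pairs with different labels."""
    return min(cluster_distance(a["points"], b["points"])
               for i, a in enumerate(clusters) for b in clusters[i + 1:]
               if a["label"] != b["label"])
```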
5.2.1 Minimum Distance of Different Layers

Higher layers usually have larger ε-version spaces. Different layers of BERT play different roles when encoding linguistic information (Tenney et al., 2019). To investigate the geometry of different layers of BERT, we apply DIRECTPROBE to each layer of BERTbase,cased for all five tasks. Then, we compute the minimum distances among all pairs of clusters with different labels. By comparing the minimum distances of different layers, we answer the question: how do different layers of BERT differ in their representations for a task?

[Figure 3: Here we juxtapose the minimum distances between clusters and the best classifier accuracy for all 12 layers. The horizontal axis is the layer index of BERTbase,cased; the left vertical axis is the best classifier accuracy and the right vertical axis is the minimum distance between all pairs of clusters.]

Figure 3 shows the results on all tasks. In each subplot, the horizontal axis is the layer index. For each layer, the blue circles (left vertical axis) show the best classifier accuracy, and the red triangles (right vertical axis) show the minimum distance described above. We observe that both the best classifier accuracy and the minimum distance show similar trends across different layers: first increasing, then decreasing. This shows that the minimum distance correlates with the best performance for an embedding space, though it is not a simple linear relation. Another interesting observation is the decreasing performance and minimum distance of the higher layers, which is also corroborated by Ethayarajh (2019) and Liu et al. (2019a).
5.2.2 Impact of Fine-tuning

Fine-tuning expands the ε-version space. Past work (Peters et al., 2019; Arase and Tsujii, 2019; Merchant et al., 2020) has shown that fine-tuning pre-trained models on a specific task improves performance, and fine-tuning is now the de facto procedure for using contextualized embeddings. In this experiment, we try to understand why fine-tuning can improve performance. Without training classifiers, we answer the question: What changes in the embedding space after fine-tuning?

We conduct the experiments described in §5.2.1 on the last layer of BERTbase,cased before and after fine-tuning for all tasks. Table 2 shows the results. We see that after fine-tuning, both the best classifier accuracy and the minimum distance show a big boost. This means that fine-tuning pushes the clusters away from each other in the representation space, which results in a larger ε-version space. As we discussed in §5.2, a larger ε-version space admits more good classifiers and allows for better generalization.

Table 2: The best performance and the minimum distances between all pairs of clusters of the last layer of BERTbase,cased before and after fine-tuning.

    Task                   Min Distance    Best Acc
    SS-role  original      0.778           77.51
             fine-tuned    4.231           81.62
    SS-func  original      0.333           86.13
             fine-tuned    2.686           88.4
    POS      original      0.301           93.59
             fine-tuned    0.7696          95.95
    SR       original      0.421           86.85
             fine-tuned    4.734           90.03
    DEP      original      0.345           91.52
             fine-tuned    1.075           94.82

5.2.3 Label Confusion

Small distances between clusters can confuse a classifier. By comparing the distances between clusters, we can answer the question: Which labels for a task are more confusable?

We compute the distances between all the pairs of labels based on the last layer of BERTbase,cased.[6] Based on an even split of the distances, we partition all label pairs into three bins: small, medium, and large. For each task, we use the predictions of the best classifier to compute the number of misclassified label pairs for each bin. For example, if the clusters associated with the part-of-speech tags ADV and ADJ are close to each other, and the best classifier misclassified ADV as ADJ, we put this error pair into the small-distance bin. The distribution of all errors is shown in Table 3. This table shows that a large majority of the misclassified labels are concentrated in the small-distance bin. For example, in the supersense role task (SS-role), 97.17% of the errors happened in the small-distance bin. The number of label pairs in each bin is shown in the parentheses. Table 3 shows that small distances between clusters indeed confuse a classifier, and we can detect this without training classifiers.

[6] For all tasks, the BERTbase,cased space (last layer) is linearly separable. So, the number of label pairs equals the number of cluster pairs.

Table 3: Error distribution based on different distance bins. The number of label pairs in each bin is shown in the parentheses.

    Task       Small Distance    Medium Distance    Large Distance
    SS-role    97.17% (555)      2.83% (392)        0% (88)
    SS-func    96.88% (324)      3.12% (401)        0% (55)
    POS        99.19% (102)      0.81% (18)         0% (16)
    SR         93.20% (20)       6.80% (20)         0% (5)
    DEP        99.97% (928)      0.03% (103)        0% (50)
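A sketch of this tally, with hypothetical inputs: `pair_distance` maps each label pair to the distance between its clusters, and `pair_errors` counts the best classifier's confusions for that pair; both would come from the steps above, and the equal-width binning is our reading of "an even split of the distances".

```python
import numpy as np

def error_distribution(pair_distance, pair_errors):
    """Split label pairs into three equal-width distance bins (small, medium,
    large) and report what fraction of all errors lands in each bin."""
    dists = np.array(list(pair_distance.values()))
    edges = np.linspace(dists.min(), dists.max(), 4)
    errors_per_bin = np.zeros(3)
    for pair, n_errors in pair_errors.items():
        b = np.searchsorted(edges, pair_distance[pair], side="right") - 1
        errors_per_bin[min(b, 2)] += n_errors
    return errors_per_bin / errors_per_bin.sum()
```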
5.3 By-product: A Parameter-free Classifier

We can predict the expected performance of the best classifier. Any h ∈ Vε(H, E, D) is a predictor for the task D on the representation E. As a by-product of the clusters from DIRECTPROBE, we can define a predictor. The prediction strategy is simple: for a test example, we assign it to its closest cluster.[7] Indeed, if the label of the cluster is the true label of the test point, then we know that there exists some classifier that allows this example to be correctly labeled. We can verify the label of every test point and compute the aggregate accuracy to serve as an indicator of the generalization ability of the representation at hand. We call this accuracy the intra-accuracy. In other words, without training classifiers, we can answer the question: given a representation, what is the expected performance of the best classifier for a task?

[7] To find the distance between the convex hull of a cluster and a test point, we find a max-margin separating hyperplane by training a linear SVM that separates the point from the cluster. The distance is twice the distance between the hyperplane and the test point.

[Figure 4: Comparison between the best classifier accuracy, intra-accuracy, and 1-kNN accuracy. The x-axis shows different representation models. BB: BERTbase,cased, BL: BERTlarge,cased, RB: RoBERTabase, RL: RoBERTalarge, E: ELMo. The Pearson correlation coefficient between best classifier accuracy and intra-accuracy is shown in the parentheses alongside each task title.]

Figure 4 compares the best classifier accuracy and the intra-accuracy of the last layer of different embeddings. Because our assignment strategy is similar to nearest-neighbor classification (1-kNN), which assigns the unlabelled test point to its closest labeled point, the figure also compares against the 1-kNN accuracy.

First, we observe that intra-accuracy always outperforms the simple 1-kNN classifier, showing that DIRECTPROBE can use more information from the representation space. Second, we see that the intra-accuracy is close to the best accuracy for some tasks (the supersense tasks and POS tagging). Moreover, all the Pearson correlation coefficients between best accuracy and intra-accuracy (shown in the parentheses alongside each task title) suggest a high linear correlation between best classifier accuracy and intra-accuracy. That is, the intra-accuracy can be a good predictor of the best classifier accuracy for a representation. From this, we argue that intra-accuracy can be interpreted as a benchmark accuracy of a given representation without actually training classifiers.
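A sketch of the footnote-7 distance and the resulting parameter-free predictor; as in the earlier sketches, the hard margin is approximated with a large C, and the cluster format follows those sketches.

```python
import numpy as np
from sklearn.svm import LinearSVC

def point_to_cluster_distance(x, points):
    """Footnote 7: separate the single test point from the cluster with a
    max-margin hyperplane; the distance is twice the point-to-plane distance."""
    X = np.vstack([points, x.reshape(1, -1)])
    y = np.array([0] * len(points) + [1])
    svm = LinearSVC(C=1e4, max_iter=200000).fit(X, y)
    w, b = svm.coef_.ravel(), svm.intercept_[0]
    return 2.0 * abs(w @ x + b) / np.linalg.norm(w)

def predict(x, clusters):
    """Assign the test point to its closest cluster; intra-accuracy is the
    fraction of test points whose closest cluster carries the true label."""
    dists = [point_to_cluster_distance(x, c["points"]) for c in clusters]
    return clusters[int(np.argmin(dists))]["label"]
```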
5.4 Case Study: Identifying Difficult Examples

The distances between a test point and all the clusters from the training set can be used not only to predict the label but also to identify difficult examples as per a given representation. Doing so could lead to re-annotation of the data, and perhaps to cleaner data or improved embeddings. Using the supersense role task, we show a randomly chosen example of a mismatch between the annotated label and the BERTbase,cased neighborhood:

    . . . our new mobile number is . . .

The data labels the word our as GESTALT, while the embedding places it in the neighborhood of POSSESSOR. The annotation guidelines for these labels (Schneider et al., 2017) note that GESTALT is a supercategory of POSSESSOR. The latter is specifically used to identify cases where the possessed item is alienable and has monetary value. From this definition, we see that though the annotated label is GESTALT, it could arguably also be a POSSESSOR if phone numbers are construed as alienable possessions that have monetary value. Importantly, it is unclear whether BERTbase,cased makes this distinction. Other examples we examined required similarly nuanced analysis. This example shows that DIRECTPROBE can be used to identify examples in datasets that are potentially mislabeled, or at least, require further discussion.

6 Related Work and Discussion

In addition to the classifier-based probes described in the rest of the paper, a complementary line of work focuses on probing representations using a behavior-based methodology. Controlled test sets (Şenel et al., 2018; Jastrzebski et al., 2017) are designed and errors are analyzed to reverse-engineer what information can be encoded by the model (e.g., Marvin and Linzen, 2018; Ravichander et al., 2021; Wu et al., 2020). Another line of work probes the space by "opening up" the representation space or the model (e.g., Michel et al., 2019; Voita et al., 2019). There are some efforts to inspect the space from a geometric perspective (e.g., Ethayarajh, 2019; Mimno and Thompson, 2017). Our work extends this line of work to connect the geometric structure of an embedding space with classifier performance without actually training a classifier.

Recent work (Pimentel et al., 2020b; Voita and Titov, 2020; Zhu and Rudzicz, 2020) probes representations from an information-theoretic perspective. These efforts still need a probability distribution p(y|x) from a trained classifier. In §5.3, we use clusters to predict labels. In the same vein, the conditional probability p(y|x) can be obtained by treating the negative distances between the test point x and all clusters as predicted scores and normalizing via softmax. Our formalization can fit into the information-theoretic analysis and yet avoid training a classifier.
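Concretely, such a distribution could be formed as below, reusing point_to_cluster_distance from the §5.3 sketch; the temperature-free softmax is our choice, not a detail specified in the paper.

```python
import numpy as np

def label_distribution(x, clusters):
    """p(y|x) from negative cluster distances via a (stabilized) softmax.
    Assumes one cluster per label, as with a linearly separable last layer."""
    scores = np.array([-point_to_cluster_distance(x, c["points"]) for c in clusters])
    exps = np.exp(scores - scores.max())
    probs = exps / exps.sum()
    return {c["label"]: p for c, p in zip(clusters, probs)}
```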
Our analysis and experiments open new directions for further research:

Novel pre-training targets: The analysis presented here informs us that a larger distance between clusters can improve classifiers. This could guide loss function design when pre-training representations.

Quality of a representation: In this paper, we focus on the accuracy of a representation. We could seek to measure other properties (e.g., complexity) or proxies for them. These analytical approaches can be applied to the ε-version space to further analyze the quality of the representation space.

Theory of representations: Learning theory, e.g., VC-theory (Vapnik, 2013), describes the learnability of classifiers; representation learning lacks such theoretical analysis. The ideas explored in this work (ε-version spaces, distances between clusters being critical) could serve as a foundation for an analogous theory of representations.

7 Conclusion

In this work, we ask the question: what makes a representation good for a task? We answer it by developing DIRECTPROBE, a heuristic approach that builds upon hierarchical clustering to approximate the ε-version space. Via experiments with several contextualized embeddings and linguistic tasks, we showed that DIRECTPROBE can help us understand the geometry of the embedding space and ascertain when a representation can successfully be employed for a task.

Acknowledgments

We thank the members of the Utah NLP group and Nathan Schneider for discussions and valuable insights, and reviewers for their helpful feedback. We also thank the support of NSF grants #1801446 (SATC) and #1822877 (Cyberlearning).

References

Roee Aharoni and Yoav Goldberg. 2020. Unsupervised domain clusters in pretrained language models. In Proceedings of ACL 2020, pages 7747–7763.

Guillaume Alain and Yoshua Bengio. 2017. Understanding intermediate layers using linear classifier probes. In International Conference on Learning Representations.

Yuki Arase and Jun'ichi Tsujii. 2019. Transfer fine-tuning: A BERT case study. In Proceedings of EMNLP-IJCNLP 2019, pages 5393–5404.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798.

Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English web treebank. Linguistic Data Consortium, Philadelphia, PA.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of ACL 2018, pages 2126–2136.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Steffen Eger, Andreas Rücklé, and Iryna Gurevych. 2019. Pitfalls in the evaluation of sentence embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 55–60.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of EMNLP-IJCNLP 2019, pages 55–65.

Emily Goodwin, Koustuv Sinha, and Timothy J. O'Donnell. 2020. Probing linguistic systematicity. In Proceedings of ACL 2020, pages 1958–1969.

Vikram Gupta, Haoyue Shi, Kevin Gimpel, and Mrinmaya Sachan. 2020. Clustering contextualized representations of text for unsupervised syntax induction. arXiv preprint arXiv:2010.12784.

Gurobi Optimization, LLC. 2020. Gurobi optimizer reference manual.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of EMNLP-IJCNLP 2019, pages 2733–2743.

Stanisław Jastrzebski, Damian Leśniak, and Wojciech Marian Czarnecki. 2017. How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks. arXiv preprint arXiv:1702.02170.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL 2019, pages 3651–3657.

Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of ACL 2020, pages 7811–7818.

Najoung Kim, Roma Patel, Adam Poliak, Patrick Xia, Alex Wang, Tom McCoy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, Samuel R. Bowman, and Ellie Pavlick. 2019. Probing what different NLP tasks teach machines about function word comprehension. In Proceedings of *SEM 2019, pages 235–249.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR 2015.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of EMNLP-IJCNLP 2019, pages 4365–4374.

Katarzyna Krasnowska-Kieraś and Alina Wróblewska. 2019. Empirical linguistic study of sentence embeddings. In Proceedings of ACL 2019, pages 5729–5739.

Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of ACL 2020, pages 4077–4091.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT 2019, pages 1073–1094.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of EMNLP 2018, pages 1192–1202.

Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, and Ian Tenney. 2020. What happens to BERT embeddings during fine-tuning? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 33–44.

Julian Michael, Jan A. Botha, and Ian Tenney. 2020. Asking without telling: Exploring latent ontologies in contextual representations. In Proceedings of EMNLP 2020, pages 6792–6812.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 14014–14024.

David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of EMNLP 2017, pages 2873–2878.

Tom M. Mitchell. 1982. Generalization as search. Artificial Intelligence, 18(2):203–226.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of LREC 2016, pages 1659–1666.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Christian S. Perone, Roberto Silveira, and Thomas S. Paula. 2018. Evaluation of sentence embeddings in downstream and linguistic probing tasks. arXiv preprint arXiv:1806.06259.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pages 2227–2237.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14.

Tiago Pimentel, Naomi Saphra, Adina Williams, and Ryan Cotterell. 2020a. Pareto probing: Trading off accuracy for complexity. In Proceedings of EMNLP 2020, pages 3138–3153.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020b. Information-theoretic probing for linguistic structure. In Proceedings of ACL 2020, pages 4609–4622.

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of ACL 2020, pages 5231–5247.

Abhilasha Ravichander, Yonatan Belinkov, and Eduard H. Hovy. 2021. Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceedings of EACL 2021, pages 3363–3377.

Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and clustering of arguments with contextualized word embeddings. In Proceedings of ACL 2019, pages 567–578.

Nathan Schneider, Jena D. Hwang, Archna Bhatia, Vivek Srikumar, Na-Rae Han, Tim O'Gorman, Sarah R. Moeller, Omri Abend, Adi Shalev, Austin Blodgett, et al. 2017. Adposition and case supersenses v2.5: Guidelines for English. arXiv preprint arXiv:1704.02134.

Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Jakob Prange, Austin Blodgett, Sarah R. Moeller, Aviram Stern, Adi Bitan, and Omri Abend. 2018. Comprehensive supersense disambiguation of English prepositions and possessives. In Proceedings of ACL 2018, pages 185–196.

Lütfi Kerem Şenel, Ihsan Utlu, Veysel Yücesoy, Aykut Koc, and Tolga Cukur. 2018. Semantic structure and interpretability of word embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1769–1779.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics—on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL 2019, pages 4593–4601.

Vladimir Vapnik. 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of ACL 2019, pages 5797–5808.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of EMNLP 2020, pages 183–196.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of EMNLP-IJCNLP 2019, pages 5307–5315.

William F. Whitney, Min Jae Song, David Brandfonbrener, Jaan Altosaar, and Kyunghyun Cho. 2020. Evaluating representations by the complexity of learning low-loss predictors. arXiv preprint arXiv:2009.07368.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP 2020: System Demonstrations, pages 38–45.

Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of ACL 2020, pages 4166–4176.

Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, and Graham Neubig. 2020. Predicting performance for natural language processing tasks. In Proceedings of ACL 2020, pages 8625–8646.

Yadollah Yaghoobzadeh, Katharina Kann, T. J. Hazen, Eneko Agirre, and Hinrich Schütze. 2019. Probing for semantic classes: Diagnosing the meaning content of word embeddings. In Proceedings of ACL 2019, pages 5740–5753.

Zining Zhu and Frank Rudzicz. 2020. An information theoretic view on selecting linguistic probes. In Proceedings of EMNLP 2020, pages 9251–9262.
A DIRECTPROBE in Practice

In practice, we apply several techniques to speed up the probing process. Firstly, we add two caching strategies:

Black list: Suppose we know two clusters A and B cannot be merged; then A′ and B′ cannot be merged either if A ⊆ A′ and B ⊆ B′.

White list: Suppose we know two clusters A and B can be merged; then A′ and B′ can be merged too if A′ ⊆ A and B′ ⊆ B.

These two observations allow us to cache previous decisions when checking whether two clusters overlap. Applying these caching strategies helps us avoid unnecessary overlap checks, which are time-consuming. (A sketch of these rules appears after Algorithm 2.)

Secondly, instead of merging from the start to the end and checking at every step, we keep merging directly to the end without any overlap checking. After we arrive at the end, we check backwards to see if there is an overlap. If the final clusters have overlaps, we find the step of the first error, correct it, and continue merging. Merging to the end can also help us avoid plenty of unnecessary checks because, in most representations, the first error usually happens only in the final few merges. Algorithm 2 shows the whole algorithm.

Algorithm 2 DIRECTPROBE in Practice
Input: A dataset D with labeled examples (xi, yi).
Output: A set of clusters of points C.
1: Initialize Ci as the cluster for each (xi, yi) ∈ D
2: C = {C0, C1, ...}
3: Keep merging the closest pairs that have the same label in C without overlap checking.
4: if there are overlaps in C then
5:    Find the step k at which the first overlap happens.
6:    Rebuild C from step k − 1
7:    Keep merging the closest pairs that have the same label in C with overlap checking.
8: end if
9: return C
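A sketch of the two caching rules above, with clusters represented as frozensets of training-point indices; `lp_check` stands for the solver call, and the symmetric orderings of each pair are left out for brevity.

```python
def cached_overlap(A, B, black, white, lp_check):
    """A, B: frozensets of point indices. Reuse earlier merge decisions:
    black list: if (a, b) failed and a <= A, b <= B, then (A, B) fails too;
    white list: if (a, b) succeeded and A <= a, B <= b, then (A, B) succeeds."""
    if any(a <= A and b <= B for a, b in black):
        return True                      # hulls overlap; cannot merge
    if any(A <= a and B <= b for a, b in white):
        return False                     # known separable; safe to merge
    overlap = lp_check(A, B)             # fall back to the LP solver
    (black if overlap else white).add((A, B))
    return overlap
```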
Table 4 shows the runtime comparison between DIRECTPROBE and training classifiers.

Table 4: The comparison of running time between DIRECTPROBE and training classifiers using the setting described in Appendix B. The time is computed on the BERTbase,cased space. DIRECTPROBE and the classifiers run on the same machine.

    Task       DirectProbe               Classifier
               Clustering   Predict      CV           Train-Test
    SS-role    1 min        7 min        1.5 hours    1 hour
    SS-func    1.5 min      5.5 min      1 hour       1 hour
    POS        14 min       43 min       3.5 hours    2 hours
    SR         10 min       22 min       50 min       1 hour
    DEP        3 hours      3 hours      20 hours     11 hours

B Classifier Training Details

We train three different kinds of classifiers in order to find the best one: logistic regression, one-layer neural networks with hidden layer sizes in {32, 64, 128, 256}, and two-layer neural networks with hidden layer sizes in {32, 64, 128, 256} × {32, 64, 128, 256}. All neural networks use ReLU as the activation function and are optimized by Adam (Kingma and Ba, 2015). The cross-validation process chooses the weight of each regularizer of each classifier; the weight ranges from 10^-7 to 10^0. We set the maximum number of iterations to 1000. After choosing the best hyperparameters, each specific classifier is trained 10 times with different initializations.

C Summary of Representations

Table 5 summarizes all the representations in our experiments.

Table 5: Statistics of the five representations in our experiments.

    Representations     #Parameters    Dimensions
    BERTbase,cased      110M           768
    BERTlarge,cased     340M           1024
    RoBERTabase         125M           768
    RoBERTalarge        355M           1024
    ELMo                93.6M          1024

D Summary of Tasks

In this work, we conduct experiments on five tasks, which are designed to cover different usages of representations. Table 6 summarizes these tasks.

E Classifier Results vs. DIRECTPROBE Results

Table 8 summarizes the results of the best classifier for each representation and task. The meanings of the columns are as follows:

• Embedding: The name of the representation.

• Best classification: The accuracy of the best classifier among 21 different classifiers. Each classifier is run 10 times with different starting points, and the final accuracy is the average accuracy of these 10 runs.

• Intra-accuracy: The prediction accuracy obtained by assigning test examples to their closest clusters.

• Type: The type of the best classifier, e.g., (256,) means a one-layer neural network with hidden size 256, and (256, 128) means a two-layer neural network with hidden sizes 256 and 128, respectively.

• Parameters: The number of parameters of the best classifier.

• Linear SVM: The training accuracy of a linear SVM classifier.

• #Clusters: The number of final clusters after probing.

F Detailed Classification Results

Table 7 shows the detailed classification results on the supersense-role task using the same settings described in Appendix B. In this table, we record the difference between the minimum and maximum test accuracy of the 10 runs of each classifier. The maximum difference for each representation is highlighted by underlines, and the best performance of each representation is highlighted in bold. From this table, we observe: (i) the difference between different runs of the same classifier can be as large as 4-7%, which cannot be ignored; and (ii) different representations require different model architectures to achieve their best performance.