CoLES: Contrastive Learning for Event Sequences with Self-Supervision
Under review as a conference paper at ICLR 2021

Anonymous authors
Paper under double-blind review

ABSTRACT

We address the problem of self-supervised learning on discrete event sequences generated by real-world users. Self-supervised learning condenses complex information from the raw data into low-dimensional fixed-length vector representations that can easily be applied in various downstream machine learning tasks. In this paper, we propose a new method, CoLES, which adapts contrastive learning, previously used in the audio and computer vision domains, to discrete event sequences in a self-supervised setting. Unlike most previous studies, we theoretically justify, under mild conditions, that the augmentation method underlying CoLES provides representative samples of discrete event sequences. We evaluated CoLES on several public datasets and show that CoLES representations consistently outperform other methods on different downstream tasks.

1 INTRODUCTION

A promising and rapidly growing approach known as self-supervised learning¹ is the main choice for pre-training in situations where the amount of labeled data for the target task of interest is limited. Most of the research in the area of self-supervised learning has focused on the core machine learning domains, including NLP (e.g., ELMO (Peters et al., 2018), BERT (Devlin et al., 2019)), speech (e.g., CPC (van den Oord et al., 2018)) and computer vision (Doersch et al., 2015; van den Oord et al., 2018). However, there has been very little research on self-supervised learning in the domain of discrete event sequences, including user behavior sequences (Ni et al., 2018) such as credit card transactions at banks, phone calls and messages at telecoms, purchase histories at retailers, and click-stream data of online services. Produced in many business applications, such data is a major key to the growth of modern companies. A user behavior sequence is attributed to a person and captures regular and routine actions of a certain type. The analysis of these sequences constitutes an important sub-field of machine learning (Laxman et al., 2008; Wiese and Omlin, 2009; Zhang et al., 2017; Bigon et al., 2019).

The NLP, audio and computer vision domains are similar in the sense that data of this type is "continuous": a short term in NLP can be accurately reconstructed from its context (like a pixel from its neighboring pixels). This fact underlies popular NLP approaches to self-supervision such as BERT's Cloze task (Devlin et al., 2019) and approaches to self-supervision in audio and computer vision, like CPC (van den Oord et al., 2018). In contrast, for many types of event sequence data, a single token cannot be determined from its nearby tokens, because the mutual information between a token and its context is small. For this reason, most state-of-the-art self-supervised methods are not applicable to event sequence data.

In this paper, we propose the COntrastive Learning for Event Sequences (CoLES) method that learns low-dimensional representations of discrete event sequences. It is based on a novel, theoretically grounded data augmentation strategy, which adapts the ideas of contrastive learning (Xing et al., 2002; Hadsell et al., 2006) to the discrete event sequences domain in a self-supervised setting. The aim of contrastive learning is to place semantically similar objects (positive pairs of images, video, audio, etc.) closer to each other, while keeping dissimilar ones (negative pairs) further away.
Positive pairs are obtained for training either explicitly, e.g., through a manual labeling process, or implicitly, using different data augmentation strategies (Falcon and Cho, 2020). We treat the explicit cases as a supervised approach and the implicit cases as a self-supervised one.

¹ See, e.g., the keynote by Yann LeCun at ICLR-20: https://www.iclr.cc/virtual_2020/speaker_7.html
In most applications, where each person is represented by a single sequence of events, there are no explicit positive pairs, and thus only self-supervised approaches are applicable. Our CoLES method is self-supervised and based on the observation that event sequences usually exhibit periodicity and repeatability of their events. We propose and theoretically justify a new augmentation algorithm, which generates sub-sequences of an observed event sequence and uses them as different high-dimensional views of the same (sequence) object for contrastive learning.

Representations produced by the CoLES model can be used directly as a fixed vector of features in a supervised downstream task (e.g., a classification task), similarly to (Mikolov et al., 2013; Song et al., 2017; Zhai et al., 2019). Alternatively, the trained CoLES model can be fine-tuned (Devlin et al., 2019) for the specific downstream task. We applied CoLES to several user behavior sequence datasets with different downstream classification tasks. When used directly as feature vectors, CoLES representations achieve strong performance comparable to hand-crafted features produced by data scientists. We demonstrate that fine-tuned CoLES representations consistently outperform methods based on other representations by a significant margin. We provide the full source code for all the experiments described in the paper².

This paper makes the following contributions: (1) We present the CoLES method that adapts contrastive learning in the self-supervised setting to the discrete event sequence domain. (2) We propose a novel, theoretically grounded augmentation method for discrete event sequences. (3) We demonstrate that CoLES consistently outperforms previously introduced supervised, self-supervised and semi-supervised learning baselines adapted to the event sequence domain.

We also conducted a pilot study on event sequence data of a large European bank. We tested CoLES against the baselines and achieved superior performance on downstream tasks, which produced significant financial gains, measured in hundreds of millions of dollars yearly.

The rest of the paper is organized as follows. In the next section, we discuss related studies on self-supervised and contrastive learning. In Section 3 we introduce our new method, CoLES, for discrete event sequences. In Section 4 we demonstrate that CoLES outperforms several strong baselines, including previously proposed contrastive learning methods adapted to event sequence datasets. Section 5 is dedicated to the discussion of our results and conclusions.

2 RELATED WORK

Contrastive learning has been successfully applied to constructing low-dimensional representations (embeddings) of various objects, such as images (Chopra et al., 2005; Schroff et al., 2015), texts (Reimers and Gurevych, 2019), and audio recordings (Wan et al., 2018). The aim of these studies is to identify an object based on its sample (Schroff et al., 2015; Hu et al., 2014; Wan et al., 2018). Therefore, their training datasets explicitly contain several independent samples per particular object, which form positive pairs as a critical component for learning. These supervised approaches are not applicable to our setting. For situations where positive pairs are not available or their amount is limited, augmentation techniques were proposed in the computer vision domain.
One of the first frameworks with augmentation was proposed by Dosovitskiy et al. (2014). In that work, surrogate classes for model training were introduced using augmentations of the same image. Several recent works (Bachman et al., 2019; He et al., 2019; Chen et al., 2020) extended this idea by applying contrastive learning methods; they are nicely summarised by Falcon and Cho (2020). Although the augmentation techniques proposed in these studies perform well empirically, we note that no theoretical background for the different augmentation approaches has been proposed so far.

Contrastive Predictive Coding (CPC) is a self-supervised learning approach proposed for non-discrete sequential data (van den Oord et al., 2018). CPC extracts meaningful representations by predicting latent representations of future observations of the input sequence with autoregressive models. CPC representations demonstrated strong performance in four distinct domains: audio, computer vision, natural language and reinforcement learning. We adapted the CPC-based approach to the domain of discrete event sequences and compared it with our CoLES approach (see Section 4.2).

² https://github.com/***/*** (the link was anonymized for double-blind peer review)
Independently of our study, several papers have appeared in the past few months on self-supervision for user behavior sequences in the recommender systems domain. Zhou et al. (2020a) proposed a CPC-like approach for self-supervised learning on user click history. Ma et al. (2020) used an auxiliary self-supervised loss on click sequences. Zhou et al. (2020b) proposed the "Cloze" task from BERT (Devlin et al., 2019) for self-supervision on purchase sequences. Finally, Yao et al. (2020) adapted a SimCLR-like approach to text-based tasks and tabular data. Although there has been significant progress on contrastive learning, augmentation for contrastive learning has no theoretical grounding and is understudied in the domain of discrete event sequences.

3 PROBLEM FORMULATION AND OVERVIEW OF THE COLES METHOD

3.1 PROBLEM FORMULATION

While the method proposed in this paper could be studied in different domains, here we focus on discrete sequences of events. Assume there are some entities e, and the life activity of each entity e is observed as a sequence of events x_e := {x_e(t)}_{t=1}^{T_e}. Entities could be people, organizations or other abstractions. Events x_e(t) may have any nature and structure (e.g., transactions of a client, click logs of a user), and their components may contain numerical, categorical, and textual fields (see the dataset descriptions in Section 4).

According to the theoretical framework of contrastive learning proposed by Saunshi et al. (2019), each entity e is a latent class associated with a distribution P_e over its possible samples (event sequences). However, unlike the problem setting of Saunshi et al. (2019), we have no positive pairs, i.e. pairs of event sequences representing the same entity e. Instead, we have only one sequence x_e per entity e. Formally, each entity e is associated with a latent stochastic process {X_e(t)}_{t=1}^{T_e}, and we observe only one realisation {x_e(t)}_{t=1}^{T_e} generated by the process {X_e(t)}.

Our goal is to learn an encoder M that maps event sequences into a feature space R^d in such a way that the obtained embedding c_e = M({x_e}) ∈ R^d of the sequence {x_e(t)}_{t=1}^{T_e} encodes essential properties of e and disregards any randomness and noise contained in the sequence. That is, embeddings M({x_1}) and M({x_2}) should be close to each other if x_1 and x_2 are sequences generated by the same process {X_e(t)}, and they should be further away when generated by distinct processes. The quality of the representations can be examined on downstream tasks in two ways: (1) c_e can be used as a feature vector for a task-specific model, and (2) the encoder M can also be (jointly) fine-tuned (Yosinski et al., 2014).

3.2 SAMPLING OF SURROGATE SEQUENCES AS AN AUGMENTATION PROCEDURE

Since we have no access to the latent processes {X_e(t)}, we need to use augmentation. Most augmentation techniques proposed earlier for continuous domains (such as image jitter, color jitter or random gray scale in computer vision, see Falcon and Cho (2020)) are not applicable to discrete events. A possible approach to augmentation is generating sub-sequences of the same event sequence {x_e(t)}. The idea proposed below resembles the bootstrap method (Efron and Tibshirani, 1994), which makes it possible to generate several bootstrap samples using only one sample of independent datapoints from a latent distribution.
However, our setting is different: we have no independent observations, so we must rely on different data assumptions. The key property of event sequences that represent life activity is the periodicity and repeatability of their events (see Figure 4 in Appendix D for empirical observations of these properties in the considered datasets). This motivates the random slices sampling method applied in CoLES, presented in Algorithm 1. Each sub-sequence is generated from the initial sequence as a connected segment ("slice") of it using the following three steps. First, the length of the slice is chosen uniformly from the possible values. Second, its starting position is chosen uniformly from all possible values. Third, sub-sequences that are too short (and, optionally, too long) are discarded. It could seem that the mean length of the obtained sub-sequences is less than the mean length of sequences in the dataset. However, we show in the next section that the distribution of sub-sequences is close to the initial distribution under some realistic assumptions. The overview of the CoLES method is presented in Figure 1.
[Figure 1: General framework. Sub-sequences of the same user's event sequence are passed through the CoLES encoder and the distance between their embedding vectors is minimized, while the distance between embeddings of sub-sequences coming from different users' sequences is maximized.]

Algorithm 1: Random slices sub-sequence generation strategy
    hyperparameters: m, M: minimal and maximal possible length of a sub-sequence; k: number of trials.
    input: a sequence S of length T.
    output: a set 𝒮 of sub-sequences of S.
    for i ← 1 to k do
        generate a random integer T_i uniformly from [1, T];
        if T_i ∈ [m, M] then
            generate a random integer s uniformly from [0, T − T_i − 1];
            add S_i := S[s : s + T_i − 1] to 𝒮;
    end

3.3 THEORETICAL ANALYSIS

Assume that the process {X_e(t)}_{t=1}^{T_e} is a segment of a latent process {X̂_e(t)}_{t=1}^{∞}, which generates the sequence of all events in the potentially infinite life of entity e. That is, we assume that X_e(t) = X̂_e(t + s_e) for some random starting point s_e ∈ {0, 1, ...} and horizon T_e. Thus, in our data, we observe the segment [s_e + 1, s_e + T_e] of the life of e. We also make the following assumptions:

1. The process {X̂_e(t)}_{t=1}^{∞} is cyclostationary (in the strict sense) (Gardner et al., 2006) with some period T̂.
2. The starting point s_e is independent, and the distribution of (s_e mod T̂) is uniform over [0, T̂ − 1].
3. The horizon T_e is independent and follows a power-law distribution on [m, ∞].

These assumptions correspond to a scenario where persons become clients of a service at a random moment, stay for some random time span, and their behaviour obeys some periodicity.

Theorem 1. If the sequences {x_e(t)} in the dataset are generated from latent processes {X̂_e(t)} as described above with a lower bound m on the length of a sequence {x_e(t)}, then the sub-sequences obtained by Algorithm 1 from {x_e(t)} follow the same distribution as {x_e(t)} up to a slight alteration of the distribution of the length T_e. Namely, if T_e follows a power law with an exponent α < −1, then the density function of the length T'_e of a sub-sequence satisfies
\[
\left(\frac{m - 1/2}{m}\right)^{-\alpha} p(T_e = k) \;\le\; p(T'_e = k) \;\le\; \left(\frac{k}{k - 1/2}\right)^{-\alpha} p(T_e = k) \quad \text{for any } k \in [m, \infty]. \tag{1}
\]

This theorem means that a sub-sequence obtained by Algorithm 1 is a representative sample of entity e and follows its latent distribution P_e; for instance, with α = −1.5 and m = 25, both factors in Equation 1 deviate from 1 by at most about 3%, so the length distribution is almost unchanged. Combining this result with the generalization guarantees proved for the setting with explicitly observed positive pairs (Saunshi et al., 2019), we obtain a theoretical background for our implicit self-supervised setting. See Appendix C for the proof of Theorem 1.
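For illustration, here is a minimal Python sketch of the random slices strategy of Algorithm 1. The function and variable names are our own and this is not the paper's released implementation; note that the sketch allows a slice to end at the last event, while the pseudocode above draws the start from [0, T − T_i − 1].

```python
import random

def random_slices(seq, m, M, k):
    """Sample up to k connected sub-sequences ("slices") of seq (Algorithm 1 sketch).

    m, M: minimal and maximal allowed slice length; k: number of sampling trials.
    """
    T = len(seq)
    slices = []
    for _ in range(k):
        length = random.randint(1, T)          # 1) slice length drawn uniformly from [1, T]
        if not (m <= length <= M):             # 2) discard slices that are too short or too long
            continue
        start = random.randint(0, T - length)  # 3) starting position drawn uniformly over valid offsets
        slices.append(seq[start:start + length])
    return slices

# Example: up to 5 slices of a toy sequence of 20 events.
print(random_slices(list(range(20)), m=3, M=10, k=5))
```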
3.4 MODEL TRAINING

Batch generation. The following procedure creates a batch during CoLES training. N initial sequences are randomly taken and K sub-sequences are produced for each of them. Pairs of sub-sequences of the same sequence are used as positive samples, and pairs from different sequences are used as negative ones. We consider several baseline empirical strategies for sub-sequence generation to compare with Algorithm 1. The simplest strategy is random sampling without replacement. Another strategy is to produce sub-sequences by randomly splitting the initial sequence into several connected segments without intersection between them (see Appendix A).

Contrastive loss. We consider the classical variant of the contrastive loss proposed by Hadsell et al. (2006):
\[
L = (1 - Y)\,\tfrac{1}{2}\,(D_W^i)^2 + Y\,\tfrac{1}{2}\,\{\max(0,\, m - D_W^i)\}^2,
\]
where D_W^i is a distance function between the embeddings of the i-th labeled sample pair, and Y is a binary variable that equals 0 for a positive pair and 1 for a negative one (following Hadsell et al., 2006). As proposed in (Hadsell et al., 2006), we use the Euclidean distance:
\[
D_W = D(A, B) = \sqrt{\textstyle\sum_i (A_i - B_i)^2}.
\]

Pair distance calculation. In order to select negative samples, we need to compute the pairwise distances between all possible pairs of embedding vectors in a batch. To make this procedure more computationally efficient, we normalize the embedding vectors, i.e. project them onto a hyper-sphere of unit radius (see Appendix B).

Negative sampling is a way to address the following challenge of the contrastive learning approach: using all pairs of samples can be inefficient; for example, some negative pairs are already distant enough, so these pairs are not valuable for training (Simo-Serra et al., 2015; Schroff et al., 2015). Hence, only a part of the possible negative pairs in a batch are used during loss calculation. We compared the most popular choices of negative sampling applied to CoLES; see Section 4.2 for details.

3.5 ENCODER ARCHITECTURE

To embed a sequence of events into a fixed-size vector, we use an encoder network that consists of two conceptual parts: the event encoder and the sequence encoder subnetworks. The event encoder e takes the set of attributes of each single event x_t and outputs its representation in the latent space R^d: z_t = e(x_t). The event encoder consists of several embedding layers and batch normalization layers. Each categorical attribute is encoded by its corresponding embedding layer. Batch normalization is applied to the numerical attributes of events. The outputs of all embedding and batch normalization layers are concatenated to produce the latent representation z_t.

The sequence encoder s takes the latent representations of the sequence of events, z_{1:T} = z_1, z_2, ..., z_T, and outputs the representation of the whole sequence c_t at time step t: c_t = s(z_{1:t}). Several approaches can be used to encode a sequence (Cho et al., 2014; Vaswani et al., 2017). In our experiments we use a recurrent network (RNN), similarly to (Sutskever et al., 2014). The output produced for the last event is used to represent the whole sequence of events; in the case of an RNN, the last output h_T is the representation of the sequence.

To summarise, the CoLES method consists of three major ingredients: the event sequence encoder, the positive and negative pair generation strategy, and the loss function for contrastive learning.
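For concreteness, the following PyTorch-style sketch puts these ingredients together for a single batch: embeddings are projected onto the unit hyper-sphere, pairwise Euclidean distances are obtained from one matrix product (see Appendix B), in-batch positives and negatives are formed from sub-sequence-to-sequence ids, and the contrastive loss above is applied. All names are our own illustration rather than the paper's released code, the margin value is arbitrary, and plain random negatives are used instead of hard negative mining.

```python
import torch
import torch.nn.functional as F

def coles_contrastive_loss(embeddings, seq_ids, margin=0.5):
    """Contrastive loss of Hadsell et al. (2006) over one CoLES batch.

    embeddings: (B, d) sub-sequence embeddings produced by the encoder;
    seq_ids:    (B,) id of the original sequence each sub-sequence was sliced from.
    """
    z = F.normalize(embeddings, dim=1)                   # project onto the unit hyper-sphere
    # Euclidean distance on the sphere: ||a - b|| = sqrt(2 - 2 a.b); one matmul gives all pairs.
    dist = torch.sqrt(torch.clamp(2.0 - 2.0 * z @ z.T, min=1e-12))

    same = seq_ids.unsqueeze(0) == seq_ids.unsqueeze(1)
    off_diag = ~torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = same & off_diag                                # pairs cut from the same sequence
    neg = ~same                                          # pairs cut from different sequences

    pos_loss = 0.5 * dist[pos].pow(2)                                 # pull positives together
    neg_loss = 0.5 * torch.clamp(margin - dist[neg], min=0).pow(2)    # push negatives apart up to the margin
    return torch.cat([pos_loss, neg_loss]).mean()
```

In the actual method, only a subset of the negative pairs would enter the loss, selected by a negative sampling strategy such as hard negative mining, which Section 4.2 reports to work best.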
4 EXPERIMENTS

We compare our method with existing baselines on several publicly available datasets from various data science competitions. We chose datasets with sufficient amounts of discrete events per user.

Age group prediction competition³. The dataset of 44M anonymized credit card transactions representing 50k persons was used to predict the age group of a person. The label is known for 30k persons; the other 20k are unlabeled. The group ratio is balanced in the dataset. Each transaction includes the date, type, and amount being charged.

³ https://ods.ai/competitions/sberbank-sirius-lesson
Churn prediction competition⁴. The dataset of 1M anonymized card transactions representing 10k clients was used to predict churn probability. Each transaction is characterized by date, type, amount and Merchant Category Code. 5k clients have labels; the other 5.2k clients do not. The target is binary and almost balanced, with proportions 0.55 and 0.45.

Assessment prediction competition⁵. The task is to predict in-game assessment results based on the history of children's gameplay data. The target is one of 4 grades, with proportions 0.50, 0.24, 0.14, 0.12. The dataset consists of 12M gameplay events combined into 330k gameplays representing 18k children. 17.7k gameplays are labeled; the remaining 312k gameplays are not. Each gameplay event is characterized by a timestamp, event code, the incremental counter of events within a game session, time since the start of the game session, etc.

Retail purchase history age group prediction⁶. The task is to predict the age group of a client based on their retail purchase history. The group ratio is balanced in the dataset. Only labeled data is used. The dataset consists of 45.8M retail purchases representing 400k clients. Each purchase is characterized by time, product level, segment, amount, value, and loyalty program points received.

As we can see in Figure 3 (Appendix D), these datasets satisfy the power-law assumption on the sequence length distribution made in Theorem 1. Also, as shown in Figure 4 (Appendix D), the datasets satisfy the periodicity and repeatability assumption.

Dataset split. For each dataset, we set apart 10% of the persons from the labeled part of the data as the test set used for evaluating the different models. The remaining 90% of the labeled data and the unlabeled data constitute our training set. For all methods, a random search with 5-fold cross-validation over the training set is used for hyper-parameter selection. The hyper-parameters with the best out-of-fold performance are then chosen. For training the semi-supervised/self-supervised techniques (including CoLES), we used all transactions of the training sets, including unlabeled data. The unlabeled parts of the datasets were ignored while training supervised models.

Performance. Neural network training was performed on a single Tesla P-100 GPU card. For CoLES training, a single batch is processed in 142 milliseconds. For example, in the age group prediction dataset a single training batch contains 64 unique persons with 5 sub-sequences per person, i.e. 320 training sub-sequences in total; the mean number of transactions in a sub-sequence is 90, hence each batch contains about 28,800 transactions.

Hyperparameters. Unless explicitly specified otherwise, we use the contrastive loss and the random slices pair generation strategy for CoLES in our experiments (see Section 4.2 for motivation). The final set of hyper-parameters used for CoLES is shown in Appendix E, Table 5.

4.1 BASELINES

LightGBM. We consider the Gradient Boosting Machine (GBM) method (Friedman, 2001) on hand-crafted features. GBM can be considered a strong baseline for tabular data with heterogeneous features (Wu et al., 2009; Vorobev et al., 2019; Zhang and Haghani, 2015; Niu et al., 2019). A GBM-based model requires a large number of hand-crafted aggregate features produced from the raw transactional data.
An example of an aggregate feature is the average spending amount in some category of merchants, such as hotels, over the entire transaction history. We used the LightGBM (Ke et al., 2017) implementation of the GBM algorithm with nearly 1,000 hand-crafted features for the application. The details of producing hand-crafted features can be found in Appendix E.1.

Self-supervised baselines.

NSP. We consider a simple baseline inspired by the next sentence prediction task used in BERT (Devlin et al., 2019). Specifically, we generate two sub-sequences A and B such that 50% of the time B is a sub-sequence from the same sequence as A and follows it (positive pair), and 50% of the time it is a random sub-sequence taken from another sequence (negative pair).

⁴ https://boosters.pro/championship/rosbank1/
⁵ https://www.kaggle.com/c/data-science-bowl-2019
⁶ https://ods.ai/competitions/x5-retailhero-uplift-modeling
Table 1: Comparison of batch generation strategies

Dataset               | Random samples | Random disjoint samples | Random slices
Age group (Accuracy)  | 0.613 ± 0.006  | 0.619 ± 0.011           | 0.639 ± 0.006
Churn (AUROC)         | 0.820 ± 0.014  | 0.819 ± 0.011           | 0.823 ± 0.017
Assessment (Accuracy) | 0.563 ± 0.004  | 0.563 ± 0.004           | 0.618 ± 0.009
Retail (Accuracy)     | 0.523 ± 0.001  | 0.505 ± 0.002           | 0.542 ± 0.002

The 5-fold cross-validation metric is shown with its 95% confidence interval.

SOP. Another simple baseline is the sequence order prediction task from ALBERT (Lan et al., 2020). It uses two consecutive sub-sequences as a positive pair, and the same two sub-sequences with their order swapped as a negative pair.

RTD. We also adapt the replaced token detection approach from ELECTRA (Clark et al., 2020) to event sequences as a baseline. We replace 15% of the events in a sequence with random events taken from other sequences and train a model to predict whether each event was replaced or not.

CPC. As the last self-supervised baseline, we selected the recently proposed Contrastive Predictive Coding (CPC) (van den Oord et al., 2018), a self-supervised learning method that showed excellent performance on sequential data in such traditional domains as audio, computer vision, reinforcement learning and recommender systems (Zhou et al., 2020a).

Supervised learning. In addition to the aforementioned baselines, we compare our method with a supervised learning approach where the encoder network e (see Section 3.5) and a classification sub-network h are jointly trained on the downstream task target, i.e. the classification sub-network takes the encoder output and produces a prediction: ŷ = h(e(x)). A one-layer neural net with softmax activation is used as h. Note that no pre-training is used in this case.

Note that all neural network baselines use the same encoder architecture as CoLES.

4.2 RESULTS

Features of CoLES. To evaluate the proposed method of sub-sequence generation, we compared it with the two alternative strategies described in Section 3.4. The results are presented in Table 1. The proposed random slices sub-sequence generation strategy significantly outperforms the alternative strategies, which confirms the theoretical results (see Section 3.3). Also note that the random samples strategy is similar to the augmentation strategy proposed by Yao et al. (2020), and the random disjoint samples strategy is similar to the sub-sequence generation proposed by Ma et al. (2020).

We evaluated several possible loss functions and found that the contrastive loss, which can be considered the basic variant of contrastive learning loss, performs on par with or better than other losses on the downstream tasks (see Appendix F.1, Table 7). This means that improvements obtained by more recent losses in object recognition tasks do not necessarily lead to gains in other downstream tasks. We also compared popular negative sampling strategies (distance-weighted sampling (Manmatha et al., 2017) and hard negative mining (Schroff et al., 2015)) with the random negative sampling strategy. The results are shown in Appendix F.1, Table 8. We found that hard negative mining leads to a significant increase in quality on downstream tasks in comparison to random negative sampling.

Comparison with baselines. We compared CoLES with the baselines described in Section 4.1 in two scenarios.
First, we compared the embeddings produced by the CoLES encoder with other types of embeddings and with manually created aggregates by using them as input features of a downstream task model. The downstream task model is trained with LightGBM (Ke et al., 2017) independently of the sequence encoder. As Table 2 demonstrates, our method generates sequence embeddings that achieve strong performance compared to manually crafted features when used on the downstream tasks. In particular, Table 2 shows that even unsupervised CoLES embeddings perform on par with, and sometimes even better than, hand-crafted features. Also note that CoLES embeddings outperform the embeddings produced by the other self-supervised baselines on each dataset.
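As an illustration of this first scenario, a minimal sketch that feeds frozen per-person embeddings into a LightGBM classifier is given below; the file names and the train/test split are invented for the example and are not part of the paper's pipeline.

```python
import numpy as np
import lightgbm as lgb

# Assumed to be precomputed with the trained CoLES encoder: one embedding row per person.
emb = np.load("coles_embeddings.npy")   # shape (n_persons, d); hypothetical file
y = np.load("labels.npy")               # downstream labels; hypothetical file

rng = np.random.default_rng(0)
test_mask = rng.random(len(y)) < 0.1    # simple 90/10 split, for illustration only

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(emb[~test_mask], y[~test_mask])            # downstream model trained independently of the encoder
print("test accuracy:", clf.score(emb[test_mask], y[test_mask]))
```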
[Figure 2: Model quality for different dataset sizes. Panels: (a) Age group, (b) Assessment; x-axis: share of labeled datapoints, y-axis: accuracy; curves: LightGBM on hand-crafted features, CPC fine-tuning, CoLES fine-tuning, supervised learning. The rightmost point corresponds to all labels and the supervised setup.]

Table 2: Accuracy on the downstream tasks: metric increase against the baseline

Method                      | Age group (Accuracy) | Churn (AUROC)  | Assessment (Accuracy) | Retail (Accuracy)
LightGBM: Designed features | 0.631 ± 0.004        | 0.825 ± 0.005  | 0.602 ± 0.006         | 0.547 ± 0.001
SOP embeddings              | −21.9% ± 0.6%        | −5.3% ± 0.8%   | −4.1% ± 1.0%          | −22.8% ± 0.2%
NSP embeddings              | −1.5% ± 0.9%         | +0.6% ± 0.7%   | −3.5% ± 1.1%          | −22.3% ± 0.4%
RTD embeddings              | +0.1% ± 0.6%         | −2.9% ± 0.8%   | −3.6% ± 1.1%          | −5.0% ± 0.3%
CPC embeddings              | −5.9% ± 0.6%         | −2.9% ± 0.6%   | −2.3% ± 0.9%          | −4.0% ± 0.3%
CoLES embeddings            | +1.1% ± 1.2%         | +2.2% ± 0.6%   | −0.1% ± 0.9%          | −1.4% ± 0.2%
Supervised learning         | 0.628 ± 0.005        | 0.817 ± 0.012  | 0.602 ± 0.006         | 0.542 ± 0.001
RTD fine-tuning             | +1.2% ± 1.2%         | +0.3% ± 1.3%   | −2.7% ± 1.0%          | +0.5% ± 0.4%
CPC fine-tuning             | −2.1% ± 1.6%         | −0.9% ± 1.4%   | +0.7% ± 1.1%          | +1.2% ± 0.3%
CoLES fine-tuning           | +2.5% ± 1.0%         | +1.1% ± 1.3%   | +2.2% ± 1.1%          | +1.9% ± 0.2%

The test set quality metric is shown with its 95% confidence interval.

In the second scenario, we fine-tune the pre-trained models for specific downstream tasks. The models are pre-trained using CoLES and the other self-supervised learning approaches, and are then additionally trained on the labeled data for the specific task in the same way as the neural net for supervised learning (see Section 4.1). A neural net without pre-training is also added to the comparison. As Table 2 shows, the fine-tuned representations obtained by our method achieve superior performance on all the considered datasets, outperforming all other methods by statistically significant margins.

Semi-supervised setup. To evaluate our method in the case of a restricted amount of labeled data, we performed a series of experiments where only a fraction of the available labels is used to train the downstream task model. As in the supervised setup, we compare the proposed method with LightGBM over hand-crafted features, CPC, and supervised learning without pre-training (see Section 4.1). The results of this comparison are presented in Figure 2. Note that the difference in performance between CoLES and supervised-only methods increases as we decrease the number of available labels. Also note that CoLES consistently outperforms CPC for different volumes of labeled data.
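For reference, here is a minimal PyTorch-style sketch of the fine-tuning setup used in the second scenario and in the semi-supervised experiments: a pre-trained sequence encoder with a one-layer softmax head, trained jointly on the labeled downstream data. The class and argument names are illustrative assumptions; `pretrained_encoder` stands for any encoder pre-trained with CoLES.

```python
import torch
import torch.nn as nn

class FineTunedClassifier(nn.Module):
    """Pre-trained sequence encoder plus a one-layer classification head h."""

    def __init__(self, pretrained_encoder, emb_dim, n_classes):
        super().__init__()
        self.encoder = pretrained_encoder   # weights initialized from CoLES pre-training
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, batch):
        c = self.encoder(batch)             # (B, emb_dim) sequence embeddings
        return self.head(c)                 # class logits; softmax is applied inside the loss

# Joint fine-tuning of encoder and head on the labeled data (sketch):
# model = FineTunedClassifier(pretrained_encoder, emb_dim=800, n_classes=4)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = nn.CrossEntropyLoss()(model(batch), labels); loss.backward(); opt.step()
```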
Business applications. In addition to the described experiments on public datasets, we have performed extensive testing of our method on private data in a large European bank. We observed a significant increase in model performance (+2-10% AUROC) after adding CoLES embeddings to the existing models in many downstream tasks, including credit scoring, marketing campaign targeting, product recommendation cold start, fraud detection and legal entity connection prediction.

5 CONCLUSIONS

In this paper, we present Contrastive Learning for Event Sequences (CoLES), a novel self-supervised method for building embeddings of discrete event sequences. In particular, the CoLES method can be effectively used for pre-training neural networks in semi-supervised settings. It can also be used to produce embeddings of complex event sequences that can be effectively used in various downstream tasks.

We also empirically demonstrate that our approach achieves strong performance on several downstream tasks and consistently outperforms both classical machine learning baselines on hand-crafted features and other previously introduced self-supervised and semi-supervised learning baselines adapted to the event sequence domain. In the semi-supervised setting, where the amount of labeled data is limited, our method demonstrates even stronger results: the less labeled data is available, the larger the performance margin between CoLES and supervised-only methods. The method is especially well suited to event sequence data, which is extensively used by the core businesses of many large companies, including financial institutions, internet companies, retail and telecom.

REFERENCES

Bachman, P., R. D. Hjelm, and W. Buchwalter. 2019. Learning representations by maximizing mutual information across views. ArXiv, abs/1906.00910.

Bigon, L., G. Cassani, C. Greco, L. Lacasa, M. Pavoni, A. Polonioli, and J. Tagliabue. 2019. Prediction is very hard, especially about conversion. Predicting user purchases from clickstream data in fashion e-commerce. ArXiv, abs/1907.00400.

Chen, T., S. Kornblith, M. Norouzi, and G. E. Hinton. 2020. A simple framework for contrastive learning of visual representations. ArXiv, abs/2002.05709.

Cho, K., B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. ArXiv, abs/1406.1078.

Chopra, S., R. Hadsell, and Y. LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 1:539-546.

Clark, K., M.-T. Luong, Q. V. Le, and C. D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. ArXiv, abs/2003.10555.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.

Doersch, C., A. Gupta, and A. A. Efros. 2015. Unsupervised visual representation learning by context prediction. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1422-1430.
Dosovitskiy, A., J. T. Springenberg, M. A. Riedmiller, and T. Brox. 2014. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS.

Efron, B. and R. J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.

Falcon, W. and K. Cho. 2020. A framework for contrastive self-supervised learning and designing a new approach. ArXiv, abs/2009.00104.

Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics.

Gardner, W. A., A. Napolitano, and L. Paura. 2006. Cyclostationarity: Half a century of research. Signal Processing, 86(4):639-697.

Hadsell, R., S. Chopra, and Y. LeCun. 2006. Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2:1735-1742.

He, K., H. Fan, Y. Wu, S. Xie, and R. B. Girshick. 2019. Momentum contrast for unsupervised visual representation learning. ArXiv, abs/1911.05722.

Hoffer, E. and N. Ailon. 2015. Deep metric learning using triplet network. In SIMBAD.

Hu, J., J. Lu, and Y.-P. Tan. 2014. Discriminative deep metric learning for face verification in the wild. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875-1882.

Kaya, M. and H. Şakir Bilge. 2019. Deep metric learning: A survey. Symmetry, 11:1066.

Ke, G., Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In NIPS.

Lan, Z., M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Laxman, S., V. Tankasali, and R. W. White. 2008. Stream prediction using a generative model based on frequent episodes in event sequences. In KDD.

Ma, J., C. Zhou, H. Yang, P. Cui, X. Wang, and W. Zhu. 2020. Disentangled self-supervision in sequential recommenders. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Manmatha, R., C.-Y. Wu, A. J. Smola, and P. Krähenbühl. 2017. Sampling matters in deep embedding learning. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2859-2867.

Mikolov, T., K. Chen, G. S. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Ni, Y., D. Ou, S. Liu, X. Li, W. Ou, A. Zeng, and L. Si. 2018. Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Niu, X., L. Wang, and X. Yang. 2019. A comparison study of credit card fraud detection: Supervised versus unsupervised. ArXiv, abs/1904.10604.
Peters, M. E., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. ArXiv, abs/1802.05365.

Reimers, N. and I. Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP/IJCNLP.

Saunshi, N., O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar. 2019. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pp. 5628-5637.

Schroff, F., D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815-823.

Simo-Serra, E., E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. 2015. Discriminative learning of deep convolutional feature point descriptors. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 118-126.

Song, Y., Y. Li, B. Wu, C.-Y. Chen, X. Zhang, and H. Adam. 2017. Learning unified embedding for apparel recognition. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2243-2246.

Sutskever, I., O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. ArXiv, abs/1409.3215.

Ustinova, E. and V. S. Lempitsky. 2016. Learning deep embeddings with histogram loss. In NIPS.

van den Oord, A., Y. Li, and O. Vinyals. 2018. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748.

van der Maaten, L. and G. E. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research.

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. ArXiv, abs/1706.03762.

Vorobev, A., A. Ustimenko, G. Gusev, and P. Serdyukov. 2019. Learning to select for a predefined ranking. In ICML.

Wan, L., Q. Wang, A. Papir, and I. Lopez-Moreno. 2018. Generalized end-to-end loss for speaker verification. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879-4883.

Wiese, B. and C. W. Omlin. 2009. Credit card transactions, fraud detection, and machine learning: Modelling time with LSTM recurrent neural networks. In Innovations in Neural Information Paradigms and Applications.

Wu, Q., C. J. C. Burges, K. M. Svore, and J. Gao. 2009. Adapting boosting for information retrieval measures. Information Retrieval, 13:254-270.

Xing, E. P., A. Y. Ng, M. I. Jordan, and S. J. Russell. 2002. Distance metric learning with application to clustering with side-information. In NIPS.

Yao, T., X. Yi, D. Cheng, F. Yu, A. Menon, L. Hong, E. H. Chi, S. Tjoa, J. Kang, and E. Ettinger. 2020. Self-supervised learning for deep models in recommendations. ArXiv, abs/2007.12865.

Yi, D., Z. Lei, and S. Z. Li. 2014. Deep metric learning for practical person re-identification. ArXiv, abs/1407.4979.

Yosinski, J., J. Clune, Y. Bengio, and H. Lipson. 2014. How transferable are features in deep neural networks? ArXiv, abs/1411.1792.
Zhai, A., H.-Y. Wu, E. Tzeng, D. H. Park, and C. Rosenberg. 2019. Learning a unified embedding for visual search at Pinterest. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Zhang, Y. and A. Haghani. 2015. A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies, 58:308-324.

Zhang, Y., D. Wang, Y. Chen, H. Shang, and Q. Tian. 2017. Credit risk assessment based on long short-term memory model. In ICIC.

Zhou, C., J. Ma, J. Zhang, J. Zhou, and H. Yang. 2020a. Contrastive learning for debiased candidate generation in large-scale recommender systems. ArXiv.

Zhou, K., H. Wang, W. X. Zhao, Y. Zhu, S. Wang, F. Zhang, Z. Wang, and J. Wen. 2020b. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. ArXiv, abs/2008.07873.

A BATCH GENERATION

We consider several empirical strategies for sub-sequence generation to compare with the random slices algorithm described in the section "Sampling of surrogate sequences" of the main paper. The simplest strategy is random sampling without replacement: to generate k sub-sequences, the following procedure is repeated k times: take a random number of elements from the sequence without replacement. Another strategy is to produce sub-sequences by randomly splitting the initial sequence into several connected segments without intersection between them (Algorithm 2).

Algorithm 2: Disjoint sub-sequences generation strategy
    hyperparameters: k: number of sub-sequences to be produced.
    input: a sequence S of length l.
    output: S_1, ..., S_k: sub-sequences of S.
    generate a vector inds of length l with random integers from [1, k];
    for i ← 1 to k do
        S_i = S[inds == i]
    end

B PERFORMANCE OPTIMIZATIONS

Here we describe several performance optimizations that can be used for model training and inference.

Pair distance calculation. In order to select negative samples, we need to compute the pairwise distance between all possible pairs of embedding vectors of a batch. To make this procedure more computationally efficient, we normalize the embedding vectors, i.e. project them onto a hyper-sphere of unit radius. Since
\[
D(A, B) = \sqrt{\textstyle\sum_i (A_i - B_i)^2} = \sqrt{\textstyle\sum_i A_i^2 + \sum_i B_i^2 - 2\sum_i A_i B_i}
\]
and ||A|| = ||B|| = 1, to compute the Euclidean distance we only need to compute sqrt(2 − 2(A · B)). To compute the dot products between all pairs in a batch, we just multiply the matrix of all embedding vectors of the batch by its transpose, which is a highly optimized computational procedure in most modern deep learning frameworks. Hence, the computational complexity of negative pair selection is O(n²h), where h is the size of the output embeddings and n is the size of the batch.

Embedding update calculation. An encoder based on an RNN-type architecture like GRU (Cho et al., 2014) allows calculating the embedding c_{t+k} by updating the embedding c_t with the new events instead of recomputing it from the whole sequence of past events: c_{t+k} = rnn(c_t, z_{t+1:t+k}).
We use this optimization to reduce inference time when updating already existing person embeddings with new events that occurred after the embeddings were calculated. This is possible due to the recurrent nature of RNN-like networks.

C PROOF OF THEOREM 1

In this section, we provide the proof of Theorem 1 (Section 3.3), which justifies the random slices sub-sequence generation strategy proposed in the paper.

Proof. First, we state the following straightforward lemma:

Lemma 1. Let a stochastic process {Y(t)}_{t=1}^{∞} be a shift of another stochastic process {Ŷ(t)}_{t=1}^{∞} by an independent random time s, i.e. Y(t) = Ŷ(t + s) with integer s ≥ 0. If the process {Ŷ(t)}_{t=1}^{∞} is cyclostationary with period T̂ and (s mod T̂) is uniform over [0, T̂ − 1], then the process {Y(t)}_{t=1}^{∞} is stationary.

Lemma 1 implies that the process {X_e(t)}_{t=1}^{T_e} is stationary, and all its segments {X_e(t)}_{t=s'+1}^{s'+T'_e} of a given length T'_e define the same distribution over sequences as its starting segment {X_e(t)}_{t=1}^{T'_e} does. Furthermore, integrating over s', we conclude that the conditional distribution of a sub-sequence obtained via the random slices generation strategy, given its length T'_e, follows the process {X_e(t)}_{t=1}^{T'_e}.

To finish the proof, it remains to prove Equation 1. Assume P(T_e = k) ∝ k^α for k ∈ [m, ∞]. By the law of total probability, we have P(T'_e = k_0) = Σ_k P(T_e = k) P(T'_e = k_0 | T_e = k), that is,
\[
P(T'_e = k_0) = C \sum_{k=k_0}^{\infty} k^{\alpha - 1},
\]
where C is the normalization constant. To estimate the sum of the series, notice that
\[
\int_{k_0 - 1/2}^{\infty} x^{\alpha-1}\,dx \;>\; \sum_{k=k_0}^{\infty} k^{\alpha-1} \;>\; \int_{k_0}^{\infty} x^{\alpha-1}\,dx, \tag{2}
\]
where the former inequality follows from the fact that \(\int_{k-1/2}^{k+1/2} x^{\alpha-1}\,dx > k^{\alpha-1}\), since the function f(x) = x^{α−1} is convex. After integration, we rewrite Equation 2 as follows:
\[
\frac{-1}{\alpha}\Bigl(k_0 - \tfrac{1}{2}\Bigr)^{\alpha} \;>\; \sum_{k=k_0}^{\infty} k^{\alpha-1} \;>\; \frac{-1}{\alpha}\,k_0^{\alpha}. \tag{3}
\]
Using these inequalities, we obtain the upper bound for P(T'_e = k_0) in the following way:
\[
P(T'_e = k_0) = \sum_{k=k_0}^{\infty} k^{\alpha-1} \Big/ \sum_{l=m}^{\infty}\sum_{k=l}^{\infty} k^{\alpha-1}
\;<\; \frac{-1}{\alpha}\Bigl(k_0 - \tfrac{1}{2}\Bigr)^{\alpha} \Big/ \frac{-1}{\alpha}\sum_{l=m}^{\infty} l^{\alpha}
\;=\; \Bigl(\frac{k_0}{k_0 - 1/2}\Bigr)^{-\alpha} k_0^{\alpha} \Big/ \sum_{l=m}^{\infty} l^{\alpha}
\;=\; \Bigl(\frac{k_0}{k_0 - 1/2}\Bigr)^{-\alpha} P(T_e = k_0).
\]
Finally, the lower bound for P(T'_e = k_0) can be obtained using Equation 3 as follows:
\[
P(T'_e = k_0) = \sum_{k=k_0}^{\infty} k^{\alpha-1} \Big/ \sum_{l=m}^{\infty}\sum_{k=l}^{\infty} k^{\alpha-1}
\;>\; k_0^{\alpha} \Big/ \sum_{l=m}^{\infty} \Bigl(l - \tfrac{1}{2}\Bigr)^{\alpha}
\;>\; \Bigl(\frac{m - 1/2}{m}\Bigr)^{-\alpha} k_0^{\alpha} \Big/ \sum_{l=m}^{\infty} l^{\alpha}
\;=\; \Bigl(\frac{m - 1/2}{m}\Bigr)^{-\alpha} P(T_e = k_0).
\]
The latter inequality in these calculations follows from the fact that \(\bigl(\tfrac{l - 1/2}{l}\bigr)^{\alpha} < \bigl(\tfrac{m - 1/2}{m}\bigr)^{\alpha}\) for l > m.
Table 3: Data structure for a single credit card

Date   | Time  | Amount | Currency | Country | Merchant Type
Jun 21 | 16:40 | 230    | EUR      | France  | Restaurant
Jun 21 | 20:15 | 5      | USD      | US      | Transportation
Jun 22 | 09:30 | 40     | USD      | US      | Household Appliance

Table 4: Click-stream structure for a single user

Time  | Date   | Domain           | Referrer Domain
17:40 | Jun 21 | amazon.com       | google.com
17:41 | Jun 21 | amazon.com       | amazon.com
17:45 | Jun 21 | en.wikipedia.org | google.com

D DATASETS

We designed the method specifically for user behavior sequences (Ni et al., 2018). These sequences consist of discrete events per person in continuous time, for example, behavior on websites, credit card transactions, etc. In the case of credit card transactions, each transaction has a set of attributes, either categorical or numerical, including the timestamp of the transaction. An example of a sequence of three transactions with their attributes is presented in Table 3. The merchant type field represents the category of a merchant, such as "airline", "hotel", "restaurant", etc. Another example of user behavior data is a click-stream: the log of internet page visits. An example of the click-stream log of a single user is presented in Table 4.

In our research we chose several publicly available datasets from data science competitions.

1. Age group prediction competition⁷: the task is to predict the age group of a person. The group ratio is balanced in the dataset. The dataset consists of 44M anonymized transactions representing 50k persons; the target is labeled for only 30k of them (27M out of 44M transactions), and for the other 20k persons (17M out of 44M transactions) the label is unknown. Each transaction includes date, type (for example, grocery store, clothes, gas station, children's goods, etc.) and amount. We use all available 44M transactions for contrastive learning, setting aside 10% for the test part of the dataset and 5% for contrastive learning validation.

2. Churn prediction competition⁸: the dataset of 1M anonymized card transactions representing 10k clients was used to predict churn probability. 5k clients have labels (0.49M out of 1M transactions); the other 5.2k clients do not (0.52M out of 1M transactions). The target is binary and almost balanced, with proportions 0.55 and 0.45. Transactions of the same type and month are grouped and represented as a single pseudo-transaction whose amount is the sum of the grouped transactions. Each transaction is characterized by date, type, amount and Merchant Category Code.

3. Assessment prediction competition⁹: the task is to predict the results of the in-game assessment based on the history of children's gameplay data. The target is one of 4 grades, with proportions 0.50, 0.24, 0.14, 0.12. The dataset consists of 12M gameplay events combined into 330k gameplays representing 18k children. 17.7k gameplays (0.9M out of 12M gameplay events) are labeled; the remaining 312k gameplays (11.6M out of 12M gameplay events) are not. Each gameplay event is characterized by a timestamp, event code, the incremental counter of events within a game session, time since the start of the game session, etc.

⁷ https://onti.ai-academy.ru/competition
⁸ https://boosters.pro/championship/rosbank1/
⁹ https://www.kaggle.com/c/data-science-bowl-2019
[Figure 3: Event sequence length distribution. Frequency of events per client for the Age group, Churn, Assessment and Retail datasets; x-axis: events per client (log scale), y-axis: frequency.]

4. Retail purchase history age group prediction¹⁰: the task is to predict the age group of a client based on their retail purchase history. The group ratio is balanced in the dataset. The dataset consists of 45.8M retail purchases representing 400k clients. Only labeled data is used. Each purchase is characterized by time, product level, segment, amount, value, and loyalty points received.

To check that the sequence lengths of the considered datasets follow a power-law distribution, we measured their length distributions. As Figure 3 shows, the event sequence length distribution is close to a power-law distribution for every considered dataset.

To check that the considered datasets follow the repeatability and periodicity assumption made in Section 3.2 and used for the theoretical analysis in Section 3.3, we performed the following experiment. We measure the KL-divergence between two kinds of samples: (1) between random sub-samples of the same sequence, generated using a modified version of Algorithm 1 in which overlapping events are dropped, and (2) between random sub-samples taken from different sequences. The results are shown in Figure 4. As Figure 4 shows, the KL-divergence between sub-sequences of the same sequence of events is relatively small compared to the typical KL-divergence between sub-samples of different sequences of events. This observation supports our repeatability and periodicity assumption. Note that the additional plot (e) is provided as an example of data without any repeatable structure.

E EXPERIMENT SETUP

For all methods, a random search with 5-fold cross-validation over the training set is used for hyper-parameter selection. The hyper-parameters with the best out-of-fold performance on the training set are then chosen. The final set of hyper-parameters used for CoLES is shown in Table 5. The number of sub-sequences generated for each sequence was always 5 for each dataset.

E.1 HAND-CRAFTED FEATURES

Here we describe the details of producing hand-crafted features. All attributes of each transaction are either numerical (e.g. amount) or categorical (e.g. merchant type (MCC code), transaction type, etc.). For the numerical type of attribute we apply aggregation functions, such as 'sum', 'mean', 'std', 'min', 'max', over all transactions per user.

¹⁰ https://ods.ai/competitions/x5-retailhero-uplift-modeling
[Figure 4: Periodicity and repeatability of the data. The KL-divergence between event types of two random sub-sequences from the same sequence ("same client sample") is compared with the KL-divergence between sub-sequences of different sequences ("random client sample"). Panels: (a) Age group, (b) Churn, (c) Assessment, (d) Retail, (e) Texts; x-axis: KL-divergence, y-axis: count.]
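The periodicity check summarized in Figure 4 can be illustrated with the following sketch, which computes a smoothed KL-divergence between the event-type histograms of two sub-sequences. The function name, the smoothing constant and the toy data are our own illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def event_type_kl(events_p, events_q, n_types, eps=1e-3):
    """Smoothed KL(P || Q) between event-type histograms of two sub-sequences."""
    p = np.bincount(events_p, minlength=n_types) + eps   # additive smoothing avoids division by zero
    q = np.bincount(events_q, minlength=n_types) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy example: two halves of one client's history vs. a slice of another client's history.
rng = np.random.default_rng(0)
client_a = rng.integers(0, 20, size=400)   # event-type codes of client A
client_b = rng.integers(0, 20, size=400)   # event-type codes of client B
print(event_type_kl(client_a[:200], client_a[200:], n_types=20))  # same client
print(event_type_kl(client_a[:200], client_b[:200], n_types=20))  # different clients
```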
Table 5: Hyper-parameters for CoLES training

Dataset    | Output size | Learning rate | N samples in batch | N epochs | Min seq length | Max seq length | Encoder
Age group  | 800         | 0.001         | 64                 | 150      | 25             | 200            | GRU
Churn      | 1024        | 0.004         | 128                | 60       | 15             | 150            | LSTM
Assessment | 100         | 0.002         | 256                | 100      | 100            | 500            | GRU
Retail     | 800         | 0.002         | 256                | 30       | 30             | 180            | GRU

Table 6: Comparison of encoder types

Dataset               | LSTM          | GRU           | Transformer
Age group (Accuracy)  | 0.621 ± 0.008 | 0.638 ± 0.007 | 0.622 ± 0.006
Churn (AUROC)         | 0.823 ± 0.017 | 0.812 ± 0.010 | 0.780 ± 0.012
Assessment (Accuracy) | 0.620 ± 0.007 | 0.618 ± 0.009 | 0.542 ± 0.007
Retail (Accuracy)     | 0.535 ± 0.003 | 0.542 ± 0.002 | 0.499 ± 0.002

The 5-fold cross-validation metric is shown with its 95% confidence interval.

For example, if we apply 'sum' to the numerical field 'amount', we obtain the feature 'sum of all transaction amounts per user'. For the categorical type of attribute we apply aggregation functions in a slightly different way: for each unique value of a categorical attribute, we apply aggregation functions such as 'count', 'mean', 'std' to a numerical attribute of all transactions per user. For example, if we apply 'mean' to the numerical attribute 'amount' grouped by the categorical attribute 'MCC code', we obtain the feature 'mean amount of all transactions for each MCC code per user'. For example, for the age prediction task we have one categorical attribute (small group) with 200 unique values; combining it with amount we can produce 200 × 3 features ('group0 x amount x count', 'group1 x amount x count', ..., 'group199 x amount x count', 'group0 x amount x mean', ...). In total, we use approximately 605 features for this task. Note that hand-crafted features contain information about a user's spending profile but omit information about the temporal order of transactions.

F RESULTS

F.1 DESIGN CHOICES OBSERVATIONS

We consider several contrastive learning losses that showed promising performance on different datasets (Kaya and Şakir Bilge, 2019), along with some classical variants: contrastive loss (Hadsell et al., 2006), binomial deviance loss (Yi et al., 2014), triplet loss (Hoffer and Ailon, 2015), histogram loss (Ustinova and Lempitsky, 2016), and margin loss (Manmatha et al., 2017). The results of the comparison are shown in Table 7. It is interesting to observe that even the contrastive loss, which can be considered the basic variant of contrastive learning loss, yields strong results on the downstream tasks (see Table 7). Our hypothesis is that an increase in model performance on the contrastive learning task does not always lead to an increase in performance on downstream tasks.

As shown in Table 6, different choices of encoder architecture show comparable performance on the downstream tasks.

F.2 EMBEDDING SIZE

Figure 5 shows that performance on the downstream task increases with the dimensionality of the embedding. After the best quality is achieved, a further increase in the dimensionality of the embedding dramatically reduces quality. These results can be interpreted in terms of the bias-variance trade-off. When the embedding dimensionality is too small, too much information is discarded (high bias). On the other hand, when the embedding dimensionality is too large, too much noise is added (high variance). Note that increasing the embedding size also linearly increases the training time and the GPU memory consumption.