Understanding Chat Messages for Sticker Recommendation in Hike Messenger

Abhiskek Laddha, Mohamed Hanoosh, Debdoot Mukherjee, Parth Patwa
Hike Messenger
{abhiskekl, moh.hanoosh, debdoot, parthp}@hike.in

arXiv:1902.02704v3 [cs.CL] 19 Jul 2019

ABSTRACT

Stickers are popularly used in messaging apps such as Hike to visually express a nuanced range of thoughts and utterances and to convey exaggerated emotions. However, discovering the right sticker from a large and ever expanding pool of stickers while chatting can be cumbersome. In this paper, we describe a system for recommending stickers in real time as the user is typing, based on the context of the conversation. We decompose the sticker recommendation problem into two steps. First, we predict the message that the user is likely to send in the chat. Second, we substitute the predicted message with an appropriate sticker. The majority of Hike's messages are text that is transliterated from users' native languages to the Roman script. This leads to numerous orthographic variations of the same message and complicates message prediction. To address this issue, we learn dense representations of chat messages and use them to cluster messages that have the same meaning. In the subsequent steps, we predict the message cluster instead of the message. Our model employs a character level convolutional network to capture the shared intent of orthographic variants of chat messages. We validate our approach on two tasks using manually labelled data. We also propose a novel hybrid message prediction model, which can run with low latency on low end phones that have severe computational limitations.

Figure 1: Sticker Recommendation UI on Hike and a high level flow of our two step sticker recommendation system.

1 INTRODUCTION

In messaging apps such as Facebook Messenger, WhatsApp, Line and Hike, new modalities are extensively used to visually express thoughts and emotions (e.g., emojis, GIFs and stickers). Emojis are used in conjunction with text to convey emotions in a message [7]. Unlike emojis, stickers provide a graphic alternative to text messages. Hike stickers are composed of an artwork (e.g., cartoonized characters and objects) and stylized text (see Figure 1). They help to convey rich expressions along with the message. Hundreds of thousands of stickers are available for free download or purchase on the sticker stores of popular messaging apps.

Once a user downloads a sticker pack, it gets added to a palette, which can be accessed from the chat input box. However, discovering the right sticker while chatting can be cumbersome because it is not easy to think of the best sticker that can substitute your utterance. Even if you do recall a sticker, browsing the palette to find it can slow you down. Apps like Hike and Line offer type-ahead sticker suggestions while you type a message (as shown in Figure 1) in order to alleviate this problem.

The goal of type-ahead sticker recommendation is to help users discover a suitable sticker which can be shared in lieu of the message that they want to send. The latency of generating such recommendations should be in the tens of milliseconds in order to avoid any perceivable delay during typing. This is possible only if the system runs end-to-end on the mobile device without any network calls. Furthermore, a large fraction of Hike users use low end mobile phones, so we need a solution which is efficient both in terms of CPU load and memory requirements.

Prior to this work, sticker recommendation (SR) on the Hike app was based on string matching of the typed text against tags that were manually assigned to each sticker. Simple string matching algorithms fail to capture the importance of phrases while scoring candidates; e.g., for the input text "so", the algorithm would rank the phrase "so" higher than "sorry", while the latter could have been the more useful recommendation. The recall of a string matching based approach is limited by how exhaustively the stickers are tagged. Tags of a sticker are supposed to contain all the phrases that the sticker can possibly substitute, but there are many ways of expressing the same message in a chat. For instance, people often skip vowels in words as they type; e.g., "where are you" → "whr r u", "ok" → "k". This is further exacerbated when people transliterate messages from their native language to the Roman script; e.g., the phrase "acchha" (Hindi for "good") is written in many variants: "accha", "acha", "achha" etc. Another reason for the proliferation of such variants is that certain words are pronounced differently in different regions. We observe 343 orthographic variants of "kya kar raha hai" (Hindi for "What are you doing") in our dataset. Hence, it is hard for any person to capture all variants of an utterance as tags.

Decomposing Sticker Recommendation: One could potentially set up sticker recommendation with the help of a supervised model, which directly learns the most relevant stickers for a given context defined by the previous message in the chat and the text typed in the chat input box.
However, due to updates in the set of available stickers and a massive skew in historical usage toward a handful of popular stickers, it becomes difficult to collect reliable, unbiased data to train such an end-to-end model. Moreover, an end-to-end model would need to be retrained frequently to support new stickers, and the updated model would have to be synced to all devices. Such regular updates of the model, which is possibly >10 MB in size, would be prohibitively expensive in terms of data costs. Thus, instead of creating an end-to-end model, we decompose the SR task into two steps. First, we predict the message that a user is likely to send based on the chat context and what the user may have typed in the message input box. This model does not require frequent updates because the distribution of chat messages does not change much with time. Second, we recommend stickers by mapping the predicted message to stickers that can substitute it. We make use of the manual tags added to stickers to bootstrap the mapping of messages to stickers. This step uses a lightweight message-to-sticker mapping, which is less than 1 MB in size. We update the mapping frequently based on the relevance feedback observed on recommendations, and it incorporates new stickers as they are launched.

For message prediction, we train a classification model with the help of historical chat data. Due to the aforementioned problem of numerous variations in messages, we cluster different messages that have the same meaning and use the clusters as classes in the prediction model. This helps us drastically reduce the number of classes in the classifier while covering the most frequent message intents of our corpus. It lowers the classifier's complexity as well as its size.

In this paper, we focus on how to efficiently set up the message prediction task. In doing so, we specifically deal with two sets of challenges:

(1) Orthographic variations of chat messages: As described above, a large fraction of chat messages are simply variants of each other. We investigate different architectures to learn embeddings for chat messages such that messages with the same intent take on similar representations. We empirically show that the use of a charCNN [13] with a transformer [23] for encoding chat messages can be highly effective at capturing the semantics of chat words and phrases. Next, we perform fine grained clustering on the representations of the frequent chat messages with the help of the HDBSCAN algorithm. These clusters are used as classes for our message prediction model. Though we construct the message clusters to reduce the complexity of the message prediction task, we observe that they are generic enough to be used in other applications such as conversational response generation, intent classification etc. [22, 24].

(2) Computational limitations on low end smart-phones: Most of our users run Hike on low end mobile devices with severe limitations on memory and compute power. Running inference with a neural network model for message prediction proves to be challenging on such devices [8]. The size of a neural network model trained for message prediction exceeds the memory limitation even after quantization. We present a novel hybrid model, which can run efficiently on low end devices without significantly compromising accuracy. Our system is a combination of a neural network based model that processes the chat context and runs on the server, and a Trie [20] based model that processes the typed text input on the client. The first component is not limited by memory and CPU. Trie search is an efficient mechanism to retrieve message groups based on the typed text and can be executed for each character typed. Hence, the system is able to easily meet the latency constraints with this setup. Scores from these two components are combined to produce the final scores for message prediction.

In summary, our contributions include:
• We present a novel application of type-ahead Sticker Recommendation within a messaging application. We decompose the recommendation task into two steps: message prediction and sticker substitution.
• We describe an approach to cluster chat messages that have the same meaning. We evaluate different methods to create dense vector representations for chat messages, which preserve message semantics.
• We propose a novel hybrid message prediction model, which can run with low latency on low end phones that have severe computational limitations. We show that the hybrid model has comparable performance with a neural network based message prediction model.

The rest of the paper is organized as follows. In Section 2, we explore methods to learn effective representations for chat messages and describe the creation of message clusters, which serve as classes in the message prediction task. In Section 3, we discuss different message prediction models, including our hybrid model. In Section 4, we briefly describe our approach for mapping stickers to message clusters. In Section 5, we demonstrate the efficacy of the proposed message embeddings in identifying synonymous chat messages and also evaluate the performance of a quantized neural network and the hybrid model on our message prediction task. Finally, we describe how we have deployed the system on Hike and discuss related literature before concluding the paper.

2 CHAT MESSAGE CLUSTERING

As mentioned earlier, we cluster frequent messages in our chat corpus and use the clusters as classes in the message prediction model. To cover a large fraction of the messages in our chat corpus with a limited number of clusters, we need to group all messages with the same intent into a single cluster. This has to be done without compromising the semantics of each cluster. In order to cluster messages effectively, it is critical to obtain dense vector representations of messages that capture their meaning.

In this section, we first describe an encoder which is used to convert chat phrases into dense vectors [19]. Then, we explain the network architecture which is used to train the message embeddings such that we learn similar representations for messages that have the same meaning.
2.1 Encoder

The architecture of the encoder is shown in Figure 2. The input to the encoder is a message m_i, consisting of a sequence of N words {w_{i,1}, w_{i,2}, ..., w_{i,N}}, and the output is a dense vector e_i ∈ R^(d×1). To represent a word, we use an embedding composed of two parts: a character based embedding aggregated from character representations, and a word level embedding. To generate the character based embedding, we use a character CNN [13] that leverages sub-word information to learn similar representations for orthographic variants of the same word. Let d_c be the dimension of the character embedding. For a word w_t we have a sequence of l characters [c_{t,1}, c_{t,2}, ..., c_{t,l}]. The character representation of w_t is obtained by stacking the character level embeddings in a matrix C_t ∈ R^(l×d_c) and applying a narrow convolution with a filter H ∈ R^(k×d_c) of width k with ReLU non-linearity, followed by a max pooling layer. A width-k filter can be viewed as capturing k-gram features of a word. We have multiple filters for a particular width k, whose outputs are concatenated to obtain the character level embedding e^char_{w_t} for word w_t. The character level representation of the word is concatenated with the word-level embedding e^word_{w_t} to get the final word representation e_{w_t}. Let d_w be the dimension of the word embedding. So, the representation of a word w_t becomes e_{w_t} ∈ R^((d_w+d_c)×1).

Figure 2: Illustration of the encoder, which comprises a character based CNN layer for each word and a GRU/Transformer layer at the word level.

To capture the sequential properties of a message, we explore two architectures.

2.1.1 GRU. The Gated Recurrent Unit (GRU) [4] is an improved version of the RNN designed to mitigate the vanishing gradient problem. At each step t it takes the output of the previous step (t−1) and the current word representation to generate the representation of the message up to time step t. The final step representation of the GRU is taken to be the message embedding e_i.

2.1.2 Transformer. [23] propose an architecture based solely on attention that removes the recurrence step of the RNN. It uses multi-headed attention to capture various relationships among the words, and it adds a positional embedding to the word embedding to encode the sequence information present in the message. Similar to [26], we use only the encoder part of the transformer to create our message embeddings. Unlike the GRU, which gives the final representation at step t, the transformer gives a representation of each word after attending to the different words in the sentence using a series of attention layers. We compute the message embedding e_i by averaging the word vectors in the final layer of the transformer.

Figure 3: Model architecture for learning message embeddings. Input-Response Chat (IRC) maximizes the similarity of the transformed next message embedding (e′_{i+1}) with the current message embedding (e_i). Both encoders share parameters.

2.2 Model Architecture

Akin to [26], we use the model architecture shown in Fig. 3 to train the encoders described in Section 2.1. We use a dot-product scoring model [9, 26] to train the sentence level representation from input-response pairs in an unsupervised manner. It is a discriminative approach in which the network is optimized to score the gold reply message higher than other reply messages. The input to our model is a tuple (m_i, m_{i+1}) extracted from a conversation between two users. Messages m_i and m_{i+1} are encoded using the encoder described in Section 2.1 and represented as e_i and e_{i+1} respectively. Note that the parameters of both encoders are tied, i.e., they share weights. Hence, both e_i and e_{i+1} represent encodings of messages in the same space. We transform the embedding e_{i+1} from the input space to the reply space by applying two fully connected layers to obtain the final response embedding e′_{i+1}. Finally, the dot product of the input message embedding e_i and the response embedding e′_{i+1} is used to score the replies.

We train the model using batches of randomly shuffled tuples. Within a batch, each m_{i+1} serves as the correct response to its corresponding input m_i, and all other instances are treated as negative replies. Considering all other in-batch instances as negatives is computationally efficient because we don't have to explicitly encode a specific negative example for each instance.
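To make the two core pieces concrete, the sketch below shows a character CNN word encoder and the in-batch dot-product loss in PyTorch. It is a minimal illustration rather than our production code: the dimensions and filter widths are placeholder values, and the word-level GRU/transformer and the concatenation with the word embedding are elided (noted in the comments).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCNNWordEncoder(nn.Module):
    """Character CNN: embed each character, convolve over the character
    sequence with several filter widths, max-pool, and concatenate."""
    def __init__(self, n_chars, d_c=16, widths=(2, 3, 4), n_filters=32):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_c, padding_idx=0)
        self.convs = nn.ModuleList(nn.Conv1d(d_c, n_filters, k) for k in widths)

    def forward(self, chars):
        # chars: (batch, n_words, max_chars) integer char ids, 0 = padding
        b, n, l = chars.shape
        x = self.char_emb(chars.view(b * n, l)).transpose(1, 2)  # (b*n, d_c, l)
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        # This is e^char; the paper concatenates it with a word embedding
        # and feeds the sequence to a GRU/transformer (omitted here).
        return torch.cat(feats, dim=1).view(b, n, -1)

def in_batch_loss(e_i, e_next_proj):
    """Dot-product scoring with in-batch negatives: each message embedding
    e_i should score its own transformed reply e'_{i+1} highest, so row j
    of the score matrix should peak at column j."""
    scores = e_i @ e_next_proj.t()                      # (batch, batch)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```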
2.3 Message Clustering

We cluster frequent messages with the help of the embeddings learned above. Our goal is to have a single cluster for all the orthographic variations and acronyms of a message. We observe that the number of variants of a phrase increases with the ubiquity of that phrase. For example, "good morning" has ∼300 variants, while relatively less frequent phrases tend to have significantly fewer variants. This poses a challenge when applying a standard density based clustering algorithm such as DBSCAN, because it is difficult to decide on a single threshold for drawing cluster boundaries. To handle this uncertainty, we choose HDBSCAN [18] to cluster chat messages. This algorithm builds a hierarchy of clusters and handles the variable density of clusters in a time efficient manner. After building the hierarchy, it condenses the tree, extracts the useful clusters and assigns noise labels to the points that do not fit into any cluster. Further, it doesn't require parameters such as the number of clusters or the distance below which a pair of points is considered neighbours. The only necessary parameter is the minimum number of points required to form a cluster. All the clusters, including the noise points, are taken to be the different classes in our message prediction step.
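A minimal sketch of this clustering step using the hdbscan library [18]. The embedding file name, the min_cluster_size value, and the L2 normalization (to approximate an angular distance with Euclidean distance) are illustrative assumptions, not the production settings.

```python
import numpy as np
import hdbscan

# msg_embeddings: (n_messages, d) matrix from the trained encoder.
msg_embeddings = np.load("message_embeddings.npy")  # hypothetical file

# L2-normalize so Euclidean distance behaves like an angular distance.
normed = msg_embeddings / np.linalg.norm(msg_embeddings, axis=1, keepdims=True)

# min_cluster_size is the only required parameter, as noted above.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(normed)

# labels[i] is the cluster id of message i; -1 marks noise points,
# which we also keep as classes for message prediction.
```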
Figure 4: Hybrid architecture for message cluster prediction. 1) The server predicts the next message cluster using a neural network (NN) based classification model and sends the result to the mobile device. 2) For each character typed, the client predicts the message cluster from the typed text using a trie. 3) A combiner aggregates scores from both models to obtain a final score for each message cluster.

3 MESSAGE CLUSTER PREDICTION

In this section, we describe our approach to predicting the message (its cluster, to be precise) that a user is going to send. In chat, the next message to be sent heavily depends on the conversational context. In this paper, we use the previous message received from the other party in the chat as the only context signal. Upon receiving a message, we predict a message that is a likely response. In case the user starts typing, we update our message prediction after every key press, taking into account the text entered so far. The response time of our message prediction model needs to be of the order of tens of milliseconds so that the user's typing experience is not adversely affected and the sticker recommendations change in real time as the user types in the message input box.

We pose message prediction as a classification problem where we train a model to score each message cluster by the likelihood that the user sends the next message from that cluster. A neural network (NN) based classification model that accepts the last received message and the typed text as inputs is a possible solution, but running inference with such a model on mobile devices proves to be a challenge [8]. The size of the NN model that needs to be shipped explodes since we have a large input vocabulary and a large number of output classes. The embedding layer that transforms the one-hot input to a dense vector usually has a size of the order of tens of megabytes, which exceeds the memory limits that we set for our application. To overcome this issue, we evaluate two orthogonal approaches. One is to apply a quantization scheme to reduce the model size, and the other is to build a hybrid model, composed of a neural network component and a trie-based search. In this section, we detail these two approaches.

3.1 Quantized Message Prediction Model

We train a neural network for the message prediction task where the input is the last received message and the typed text, and the output is the message cluster from which the user sends the message. The top G frequent message clusters that were prepared in Section 2.3 are chosen as classes for this model. We encode the last received message and the typed text into a dense vector, which is fed as input features to a classification layer: a fully connected layer with sigmoid non-linearity and G output neurons. Details of the training procedure are described in Section 5.4.

An active area of research is reducing the model size and inference times of neural network models with minimal accuracy loss so that they can run efficiently on mobile devices [8, 10, 11, 21]. One approach is to quantize the floating point representations of the weights and activations of the neural network from 32 bits to a lower number of bits. We use the quantization scheme detailed in [11], which converts both weights and activations to 8-bit integers and uses a few 32-bit integers to represent biases. This reduces the model size by a factor of 4. When we quantize the weights of the network after training the model with full precision floats, the inference times reduce significantly. We employ quantization-aware training (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/README.md) to reduce the effect of quantization on model accuracy. Under this scheme, we use quantized weights and activations to compute the loss function of the network during training as well. This ensures parity between training and inference. While performing back propagation, we use full precision floats because minor adjustments to the parameters are necessary to effectively train the model. After weight pruning, we obtain a quantized model of size ∼9 MB.
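The affine 8-bit quantization of [11] can be summarized in a few lines. The sketch below, with an arbitrary example tensor, shows the quantize/dequantize round trip that quantization-aware training simulates in the forward pass; it is an illustration of the scheme, not an extract of the deployed code.

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map float values to uint8 via a scale and zero point, following the
    affine scheme of [11]. Assumes x is not a constant tensor."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(4, 4).astype(np.float32)   # toy weight tensor
q, s, z = quantize_affine(w)
w_hat = dequantize_affine(q, s, z)             # w_hat ≈ w, error ≤ ~s/2
```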
3.2 Hybrid Message Prediction Model

We build a hybrid message prediction model, where a resource intensive component runs on the server and its output is combined with that of a lightweight on-device model to obtain message predictions under tight resource constraints on memory and CPU. Algorithm 1 outlines the score computation of the hybrid model. Figure 4 demonstrates the high level flow with an example. The three components of our hybrid model are detailed below.

Algorithm 1 Message Cluster Prediction
Input: Last message received prev, typed text typ
1: For each message cluster g ∈ G, compute P_reply(G = g | prev) with the reply NN model.
2: Send the message clusters with scores higher than a threshold, P_reply > t_reply, to the client.
3: Using the text typ as input to the trie model, fetch the set of phrases S that start with typ, and the set of phrases S_m ⊆ S that start with typ and also meet the minimum prefix length constraint.
4: Compute trie scores for all message clusters mapped to phrases in S_m as

   P_trie(G = g | typ) = ( Σ_{ph ∈ S_g ∩ S_m} freq(ph) ) / ( Σ_{ph ∈ S} freq(ph) )   (1)

   where S_g is the set of all phrases present in message cluster g.
5: For message clusters obtained from the reply model or the trie model, combine the scores from both models as:

   Q(G = g | prev, typ) = w_0 + w_t · P_trie(G = g | typ) + P_reply(G = g | prev) · ( w_p0 · e^(−λ·n_c) + w_p1 · P_trie(G = g | typ) )   (2)

Output: Message cluster scores Q

3.2.1 Reply Model: Similar to the prediction model described in Section 3.1, we build a NN model which takes the last received message as input and scores each message cluster by its likelihood of being a reply to the input message. Given the last message prev, the model computes the reply probability P_reply(G = g | prev) for each message cluster g in the set of message clusters G.

When a message gets routed to its recipient, the reply model is queried on the server and the response predictions are sent to the client along with the message. Since this inference runs on the server, we can afford to use large NN models without adversely affecting the user experience. Furthermore, we pre-compute and cache the replies for frequent messages to improve the average response time of reply prediction. The message clusters that have a reply probability P_reply(G = g | prev) above a threshold t_reply are sent to the clients.

3.2.2 Typed Model: The trie [20] is an efficient data structure for prefix searches. We retrieve the relevant message clusters for a given typed text by querying a trie that stores, as key-value pairs, the chat phrases and their corresponding message clusters prepared from our corpus as described in Section 2. In order to reduce false positives, we assign a minimum prefix length to each phrase in the trie, so that a matching record is retrieved only if the typed text meets this minimum length or its message cluster is present in the recommendations available from the reply model. For example, suppose the phrase "hi" has a minimum prefix length of 2; then the typed text "h" will not retrieve the record corresponding to "hi", even though it matches. But if the message cluster of "hi" is present in the predictions of the reply model, then "hi" is retrieved even though the length of the prefix is below the required minimum.

To score the retrieved message clusters, we add the frequency of the phrases in our chat corpus as additional information in the trie. So, records of the form <phrase, (message cluster id, frequency, min prefix length)> are stored as key-value pairs in the trie. The score of a retrieved message cluster is the ratio of the aggregated frequency of the retrieved records from that message cluster to the sum of frequencies of all matching entries, irrespective of whether those records survive the minimum prefix filter. This is shown in Equation 1. For example, suppose the typed text is the string "go", the sum of frequencies of all phrases starting with "go" is 10000, and there are three different phrases in the message cluster for "good morning" that start with "go" (e.g., "good morning", "good mrng", "good mong") with a combined frequency of 1500; then the score of the "good morning" cluster is 1500/10000 = 0.15.

For building the trie, we pick the top frequent phrases, and the frequency of a phrase ph in our chat corpus after some normalization is set as freq(ph). The length of the smallest prefix pref of phrase ph that gives P(ph | pref) ≥ θ is kept as the minimum prefix length of the phrase ph.

At Hike, we created a trie with 34k phrases and around 7500 message clusters. In serialized form, the size of the trie was around 700KB, which is small enough to ship to client devices. Another advantage of a trie is that it is interpretable and makes it easy for us to incorporate new phrases, even if they are not observed in our historical data.

3.2.3 Combiner: After receiving the relevant message clusters and corresponding scores from the reply model and the trie model, the final score for a message cluster, Q(G = g | prev, typ), is computed from P_reply(G = g | prev), P_trie(G = g | typ) and the length n_c of the typed string, as given in Equation 2. The weighting of terms is designed in such a way that as the user types more characters, the contribution of the reply model vanishes unless the message cluster is also predicted by the trie. The weights were chosen manually after some trial and error. A threshold t_combined is applied on the final score to filter out recommendations with low scores.
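The sketch below illustrates Equations 1 and 2 end to end. The trie is replaced with a flat dictionary scan, and the records, weights, and λ are placeholder values, so it shows the scoring logic rather than the production data structure.

```python
import math
from collections import defaultdict

# phrase -> (cluster_id, corpus_frequency, min_prefix_length); toy data.
records = {
    "good morning": ("gm", 900, 4),
    "good mrng":    ("gm", 400, 4),
    "good mong":    ("gm", 200, 4),
    "gotcha":       ("ok", 100, 3),
}

def p_trie(typ, reply_clusters):
    """Equation 1: cluster frequency mass among phrases matching `typ`."""
    matches = {ph: r for ph, r in records.items() if ph.startswith(typ)}
    total = sum(freq for _, freq, _ in matches.values())
    if total == 0:
        return {}
    scores = defaultdict(float)
    for ph, (cid, freq, min_len) in matches.items():
        # A record passes if the prefix is long enough, or if the reply
        # model already proposed its cluster (the "hi" example above).
        if len(typ) >= min_len or cid in reply_clusters:
            scores[cid] += freq / total
    return scores

def combine(p_reply, typ, w0=0.0, wt=1.0, wp0=1.0, wp1=1.0, lam=0.5):
    """Equation 2: trie score plus a reply score whose weight decays with
    the number of typed characters n_c unless the trie agrees."""
    pt = p_trie(typ, set(p_reply))
    n_c = len(typ)
    return {g: w0 + wt * pt.get(g, 0.0)
               + p_reply.get(g, 0.0) * (wp0 * math.exp(-lam * n_c)
                                        + wp1 * pt.get(g, 0.0))
            for g in set(p_reply) | set(pt)}

print(combine({"gm": 0.6}, "go"))  # {'gm': ≈1.72} with these toy weights
```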
4 MESSAGE CLUSTER TO STICKERS MAPPING

As mentioned in Section 1, we make use of a message cluster to sticker mapping to suggest suitable stickers for the predicted message clusters. When a sticker is created, it is tagged with conversational meta-data in order to map the message clusters to stickers. We compute the similarity between the tag phrases of a sticker and the phrases present in each message cluster, after converting them into vectors using the encoder described in Section 2. Compared to the historical approach of suggesting a sticker when the user's typed input matches one of its tag phrases, the current system is able to suggest stickers even if different variations of the tag phrase are typed by the user, since many variations of a message are already captured in the message clusters. We regularly refresh the message cluster to sticker mapping by taking into account the relevance feedback observed on our recommendations. Since generation of the message to sticker mapping is not the main focus of this paper, we skip the details of these algorithms.
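As the paper deliberately skips the details of the mapping algorithm, the sketch below is only one plausible way such a mapping could be bootstrapped from tag and cluster-phrase embeddings; the max-over-pairs cosine similarity, the 0.8 threshold, and the `encode` function are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def map_clusters_to_stickers(sticker_tags, cluster_phrases, encode, thresh=0.8):
    """Attach a sticker to every cluster whose phrases come close enough
    to one of the sticker's tag phrases in embedding space.
    encode(text) -> vector is assumed to be the Section 2 encoder."""
    mapping = {}
    for cluster_id, phrases in cluster_phrases.items():
        for sticker_id, tags in sticker_tags.items():
            best = max(cosine(encode(t), encode(p))
                       for t in tags for p in phrases)
            if best >= thresh:
                mapping.setdefault(cluster_id, []).append(sticker_id)
    return mapping
```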
5 EXPERIMENTS

In this section, we first describe the dataset used for training the Sticker Recommendation (SR) system. Next, we quantitatively evaluate different message embeddings on two manually curated datasets. We show qualitative results after clustering to demonstrate the effectiveness of the message embeddings. Then, we present a comparison of the NN model and the hybrid model for message cluster prediction. Lastly, we discuss the deployed system.

5.1 Dataset and preprocessing

We collected conversations of anonymous users over a period of 5 months. This dataset is pre-processed to extract useful conversations from the chat corpus. The first step is tokenization, which includes accurate detection of emoticons and stickers, and reducing more than 3 consecutive repetitions of the same character to 2 characters. We also add a special symbol at the start and the end of every conversation. After pre-processing the data, we create tuples of the current and the next message for training the message embedding models. Since stickers are mostly used to convey short messages, we filter out all tuples in which at least one message has more than 5 words. Both our input and output vocabularies consist of the top 50k frequent words in our corpus. The dataset is randomly split into training and validation sets. Messages shorter than 5 words are padded up to the pre-defined maximum length of 5. Similarly, we pad the characters of a word if it is shorter than 10 characters.

5.2 Message Embedding Evaluation

In this subsection, we evaluate the embeddings generated by the different architectures on two manually labeled datasets. First, we describe the baselines, training and hyper-parameter tuning.

5.2.1 Message embedding models. We compare the embeddings obtained from the architecture of Section 2.2 and its variants, which are as follows:
(1) Trans+CharCNN - the encoder shown in Figure 2 with a transformer.
(2) Trans - uses only the word-level embedding e^word_m as input to the transformer.
(3) GRU+CharCNN - the encoder shown in Figure 2 with a GRU.
(4) GRU - uses only the word-level embedding e^word_m as input to the GRU.

5.2.2 Training and Hyper-parameters: We train our GRU based models using the Adam optimizer with a learning rate of 0.0001 and step exponential decay every 10k steps. We apply gradient clipping (with a value of 5) and dropout at the GRU and fully connected layers to reduce overfitting. For the GRU+CharCNN model, we use filters of widths 1, 2, 3, 4 with 50, 50, 75, 75 filters respectively in the convolution layers. The GRU models use an embedding size of 300. Transformer models are trained using the RMSprop optimizer with a constant learning rate of 0.0001. They use 2 attention layers and 8 attention heads. Within each attention layer, the feed forward network has input and output size 256 and a 512 unit inner layer. To avoid overfitting, we apply attention dropout and embedding dropout of 0.1. The convolution layers of the Trans+CharCNN model have the same number of filters and filter widths as the GRU+CharCNN model.

5.2.3 Phrase Similarity Task. We created a dataset consisting of phrase pairs labelled as similar or not-similar. For example, ("ghr me hi", "room me h") and ("majak kr rha hu", "mjak kr rha hu") are labelled as similar, while ("network nhi tha", "washroom gyi thi") and ("it's normal", "its me") are labelled as not-similar. Candidate similar examples are sampled from the top 50k frequent phrases using 2 methods: a) pairs close to each other in terms of minimum edit distance; b) pairs having words with similar word embeddings. Candidate negative examples are sampled randomly. These pairs were annotated by 3 annotators. Pairs with disagreement among the annotators were not considered during evaluation. The final dataset has 3341 similar pairs and 2437 non-similar pairs.

To evaluate the models, we calculate the angular similarity [3] between the embeddings of the phrases in a pair.
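Angular similarity, following the form used in [3], rescales the angle between two vectors into [0, 1]. A small sketch, assuming non-zero embedding vectors:

```python
import numpy as np

def angular_similarity(u, v):
    """1 - (angle between u and v) / pi, as in [3]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
```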
The ROC curves of the models are shown in Figure 5. We observe that the transformer yields a significant improvement over the GRU, owing to better capture of phrase semantics through self-attention. The Trans+CharCNN model performs the best, with an ≈6% absolute improvement in area under the curve over the GRU baseline.

Figure 5: Performance of the message embedding models on the phrase similarity task.

5.2.4 Chat Slang Task: We manually labelled a set of words which are common slangs or variants used in conversation and mapped them to their correct forms. We considered two types of errors. 1) Spelling variants (spell) - in this class we considered words which can be converted to their correct form either by replacing characters that are nearby on the keyboard ('hsn' → 'han') or by adding vowels ('phn' → 'phone', 'yr' → 'yaar').
These errors occur either due to unintentional mistakes or to shorten the word length and increase typing speed. 2) Synonym words (syn) - we considered words which have the same meaning and are phonetically similar but cannot be converted into their correct form by spelling correction, e.g., 'plzz' → 'please', 'n8' → 'night'. Most users are familiar with these forms of word transformation and use them widely in conversation. In total, we labelled 213 spelling variants and 78 synonym words² and mapped them to their correct word forms.

To evaluate the message embeddings, we retrieve the top 10 nearest neighbours among the top frequent phrases for each query in the labelled dataset and use two evaluation metrics. 1) Precision@10 (P@10) counts the number of correct phrases in the top 10 (retrieved phrases that are semantically similar to the query). 2) Recall@10 (R@10) measures whether the labelled phrase for the query is present in the top 10 or not. Two human judges performed the annotation, and disagreements were resolved by consensus among the judges.

Table 1 shows the performance of the different encoder variants used to learn message embeddings. The transformer and the GRU have comparable recall on the spell data because it contains single words; hence the transformer does not have much of an advantage over the GRU. CharCNN significantly improves the recall on synonym word variants for both architectures. CharCNN is able to capture certain chat characteristics that arise from similar sounding sub-words; e.g., 'night' and 'n8', 'see' and 'c', 'what' and 'wt' are used interchangeably during conversation. CharCNN learns such chat nuances and performs well in matching those words to semantically similar words.

Models         | Spell P@10 | Spell R@10 | Syn P@10 | Syn R@10 | Overall P@10 | Overall R@10
---------------|------------|------------|----------|----------|--------------|-------------
Trans+CharCNN  | 73.3       | 67.1       | 89.2     | 51.3     | 77.6         | 62.9
Trans          | 70.8       | 67.6       | 86.7     | 47.4     | 75.0         | 62.2
GRU+CharCNN    | 71.6       | 66.7       | 86.6     | 48.7     | 75.6         | 61.9
GRU            | 70.7       | 67.6       | 86.2     | 44.9     | 74.8         | 61.5

Table 1: Evaluation of different architectures of the message embedding model on the chat slang task.

² We plan to release the labelled data and baseline results.
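A sketch of the two retrieval metrics, under the assumption that a set of gold similar phrases is available per query; `retrieve_top10` is a hypothetical stand-in for nearest-neighbour search over the embeddings.

```python
def precision_recall_at_10(queries, gold, retrieve_top10):
    """gold[q] = set of phrases judged similar to q (incl. its correct form).
    retrieve_top10(q) = 10 nearest phrases by embedding similarity."""
    p_sum, r_sum = 0.0, 0.0
    for q in queries:
        top10 = retrieve_top10(q)
        p_sum += sum(ph in gold[q] for ph in top10) / 10   # per-query P@10
        r_sum += any(ph in gold[q] for ph in top10)        # per-query R@10
    return p_sum / len(queries), r_sum / len(queries)
```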
5.3 Message Clustering Evaluation

Figure 6 shows a qualitative evaluation of the clusters obtained from the HDBSCAN algorithm. We select the top 100 clusters based on the number of phrases present in each cluster. We project the phrase embeddings learned by our model onto 2 dimensions using t-SNE [17] for visualization. Figure 6 shows some randomly picked phrases from the clusters; phrases from the same cluster are shown in the same color. We observe that the various spelling variants and synonyms of a phrase are grouped into a single cluster. For example, the 'good night' cluster includes phrases like 'good nighy', 'good nyt', 'good nght', 'gud nyr', 'gud nite'. Our clustering algorithm is able to capture fine-grained semantics of phrases. For instance, 'i love u', 'i love you', 'i luv u' belong to one cluster while 'love u too', 'love u 2', 'luvv uhh 2' belong to a different cluster, which helps us show accurate sticker recommendations for both clusters. If a user has messaged 'i love u', then showing stickers from the 'i love u too' cluster makes for more relevant replies than showing 'i love u' stickers. Our message embedding is also able to place phrases like 'i love u baby' and 'i luv u shona' in a cluster distinct from the 'i love u' cluster. This helps us deliver more precise sticker recommendations, since we have different stickers in our database for phrases like 'i love u' and 'i love u baby'.

5.4 Message Cluster Prediction Evaluation

Our aim is to show the right stickers with minimal effort from the user. So, we compare our hybrid model and the quantized model for SR mainly on the following metrics: 1) the number of characters a user needs to type before the correct message cluster appears in the top 3 positions (# of characters to be typed); 2) the number of times the model shows wrong messages before predicting the correct message cluster in the first 3 positions (# of times inaccurate predictions shown); and 3) the fraction of messages that can be retrieved by the model in the first 3 positions with some prefix of the message (fraction of msgs retrieved). The lower the first two metrics, the better the model, while the fraction of messages retrieved should be as high as possible.

For training the NN message prediction model, we collected pairs of current and next messages from the complete conversation data. Our training data had around 10M such pairs. We randomly sampled 38k pairs for testing. We curate the training data by treating every prefix of the next message as typed text and the message cluster of the next message as the class label. We restrict the next message clusters to the top 7500 message clusters (clusters are picked on the basis of the summed frequency of their phrases). The selected clusters cover the 34k most frequent reply messages in our dataset. For the hybrid model, we train two models. The first predicts the next message cluster based on the current message; it is trained directly from current and next message pairs, with the next message mapped to its corresponding message cluster. The second is a trie model over the typed message; we prepare the phrase frequencies and minimum prefixes of phrases as described in Sec. 3.2.

The results of the various models are shown in Table 2. The quantized NN model performs better in terms of # of characters to be typed. This is expected because the NN model learns to use both inputs simultaneously, whereas the combiner (the last function in the hybrid model) is a simple weighted linear combination of just three features. The hybrid model performs slightly better in terms of # of times inaccurate recommendations are shown and fraction of messages retrieved. Compared to the quantized model, the hybrid model needs slightly more than one additional character to be typed, on average, to get the required predictions. It has to be noted that the quantized NN is ∼9.2 MB in size, which is not practically feasible in our use case; the on-device footprint of the hybrid model is below 1 MB. Another important point is that a trie based model can guarantee retrieval of every message cluster given some prefix of the message. If the message is not shadowed by another longer but more frequent message, the message class can feature in the top position itself, with some prefix of the message or the full message as input. A pure NN based model cannot make such guarantees. This is reflected in the third metric in Table 2: more than 2% of the time, the intended message class is not retrievable by the NN. Given that this SR interface is one of the most heavily used interfaces for sticker discovery, if a sticker is not retrieved through this recommendation, the user might perceive it as the app not supporting the message class or not having such stickers. So keeping the retrieval of message clusters close to 100% is critical for our SR models.

Method              | # of characters to be typed | # of times inaccurate predictions shown | Fraction of msgs retrieved
--------------------|-----------------------------|-----------------------------------------|---------------------------
Quantized NN (100d) | 1.58                        | 1.47                                    | 0.978
Hybrid              | 2.84                        | 1.22                                    | 0.991

Table 2: Performance of different message prediction models.
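A sketch of how the first metric could be computed for one test pair; `predict_top3` is a hypothetical stand-in for either model, and stepping through every prefix in order is an assumption consistent with the data curation described above.

```python
def chars_to_type(prev_msg, next_msg, true_cluster, predict_top3):
    """Return how many characters of next_msg must be typed before
    true_cluster appears in the model's top-3 predictions."""
    for n in range(len(next_msg) + 1):   # n = 0 means nothing typed yet
        typed = next_msg[:n]
        if true_cluster in predict_top3(prev_msg, typed):
            return n
    return None                          # cluster never retrieved
```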
Figure 6: HDBSCAN clusters of top frequent phrases using the message embeddings produced by our GRU+CharCNN model. The points represent chat message vectors projected into 2D using t-SNE [17].

5.5 Real World Deployment

The proposed system has been deployed in production on Hike for more than 6 months. A/B tests comparing this system with a previous implementation showed an 8% improvement in the fraction of messaging users who use stickers. Multiple regional languages are spoken in India, so we segregate our data based on geography and train separate models (both message embedding and message prediction models) for each state. This ensures that the distribution of frequent chat phrases does not get skewed towards one majority language spoken across India. We obtain the primary language of a user from the language preferences they explicitly set, or infer it from their sticker usage.

6 RELATED WORK

To the best of our knowledge, there is no prior art that studies the problem of type-ahead sticker suggestions. However, there are a few closely related research threads, which we describe here.

The use of emojis is widespread in social media. [7] classify whether an emoji used in a tweet emphasizes something or adds new information. [1, 2] predict which of the top-20 emojis are likely to be used in an Instagram post based on the post's text and image. Unlike emojis, which are mostly used in conjunction with text, stickers are independent messages that substitute for text. Thus, in order to make effective sticker suggestions for a given conversational context, we need to predict the likely utterance and not just the emotion. Since possible utterances are far more numerous than emotions, our problem is more nuanced than emoji prediction.

There exists a large body of research on conversational response generation. [22, 24] design end-to-end models leveraging a hierarchical RNN to encode the input utterances and another RNN to decode possible responses. This approach suffers from the limitation that it often generates generic, uninformative responses (e.g., "i don't know"). [28] describe a model that explicitly optimizes for improving the diversity of responses. [25] propose a retrieval based approach with the help of a DNN based ranker that combines multiple evidences around queries, contexts, candidate postings and replies. Smart Reply [12] is a system that suggests short replies to e-mails; it selects high quality responses with diversity to increase the options the user can choose from. Akin to our system, Smart Reply also generates clusters of responses. They apply semi-supervised learning to expand the set of responses, starting from a small number of manually labeled responses for each semantic intent. In contrast, we follow an unsupervised approach to discover message clusters, since the set of all intents that may correspond to stickers wasn't readily available. A unique aspect of our system is that we update the message prediction by incorporating whatever the user has typed so far; we need to do this in order to deliver type-ahead sticker suggestions.

There is a parallel research thread on learning effective representations for sentences that capture sentence semantics [5, 6, 14–16, 26]. Skip Thought [14] is an encoder-decoder model that learns to generate the context sentences of a given sentence. It involves sequential generation of the words of the target sentences, which limits the target vocabulary and increases training time. Quick Thought [16] circumvents this problem by replacing the generative objective with a discriminative approximation, where the model attempts to classify the embedding of the correct target sentence given a set of sentence candidates. Recently, BERT [6] proposed a model which predicts bidirectional context to learn sentence representations using the transformer [23]. [26] propose the Input-Response model that we have evaluated in this paper. However, unlike these works, we apply a CharCNN [13, 27] in our encoder to deal with the problem of learning similar representations for phrases that are orthographic variants of each other.

7 CONCLUSION

In this paper, we present a novel system for contextual, type-ahead sticker recommendation within a chat application. We decompose the recommendation task into two steps. First, we predict the next message that a user is likely to send based on the last message received in the chat. As the user types the message, we continuously update our prediction by taking into account the character sequence typed in the chat input box. Second, we substitute the predicted message with relevant stickers. We discuss how numerous orthographic variations of the same utterance exist in Hike's chat corpus, which mostly contains messages in transliterated form. We describe a clustering solution that can identify such variants with the help of a highly effective message embedding, which learns similar representations for semantically similar messages. Message clustering reduces the complexity of the classifier used in message prediction: by predicting one of 7500 classes (message clusters), we are able to cover all intents expressed in 34k frequent messages. Furthermore, it allows us to collect relevance feedback from related message contexts to improve our mapping of messages to stickers. For message prediction on low-end mobile phones, we use a hybrid model that combines a neural network on the server with a memory efficient trie search on the device for effective sticker recommendation. We show experimentally that the hybrid model is able to predict a higher fraction of overall message clusters compared to a quantized neural network. This model also helps in better sticker discovery for rare message clusters.

REFERENCES
[1] Francesco Barbieri, Miguel Ballesteros, Francesco Ronzano, and Horacio Saggion. 2018. Multimodal emoji prediction. arXiv preprint arXiv:1803.02392 (2018).
[2] Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. 2017. Are Emojis Predictable? arXiv preprint arXiv:1702.07285 (2017).
[3] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder. CoRR abs/1803.11175 (2018). http://arxiv.org/abs/1803.11175
[4] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[5] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364 (2017).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[7] Giulia Donato and Patrizia Paggio. 2017. Investigating redundancy in emoji use: Study on a twitter based corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 118–126.
[8] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. 2016. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168 (2016).
[9] Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652 (2017).
[10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[11] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2704–2713.
[12] Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 955–964.
[13] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-Aware Neural Language Models. In AAAI. 2741–2749.
[14] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3294–3302.
[15] Quoc V Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML, Vol. 14. 1188–1196.
[16] Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893 (2018).
[17] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[18] Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. The Journal of Open Source Software 2, 11 (2017), 205.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[20] Donald R Morrison. 1968. PATRICIA—practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM (JACM) 15, 4 (1968), 514–534.
[21] Sujith Ravi. 2017. ProjectionNet: Learning efficient on-device deep networks using neural projections. arXiv preprint arXiv:1708.00630 (2017).
[22] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI, Vol. 16. 3776–3784.
[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[24] Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, and Wei-Ying Ma. 2017. Hierarchical recurrent attention network for response generation. arXiv preprint arXiv:1701.07149 (2017).
[25] Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 55–64.
[26] Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning Semantic Textual Similarity from Conversations. arXiv preprint arXiv:1804.07754 (2018).
[27] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems. 649–657.
[28] Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. 2018. Generating informative and diverse conversational responses via adversarial information maximization. arXiv preprint arXiv:1809.05972 (2018).