Understanding Chat Messages for Sticker Recommendation in Hike Messenger

Abhiskek Laddha, Mohamed Hanoosh, Debdoot Mukherjee, Parth Patwa
Hike Messenger
{abhiskekl, moh.hanoosh, debdoot, parthp}@hike.in

arXiv:1902.02704v3 [cs.CL] 19 Jul 2019

ABSTRACT

Stickers are popularly used in messaging apps such as Hike to visually express a nuanced range of thoughts and utterances and to convey exaggerated emotions. However, discovering the right sticker from a large and ever expanding pool of stickers while chatting can be cumbersome. In this paper, we describe a system for recommending stickers in real time as the user is typing, based on the context of the conversation. We decompose the sticker recommendation problem into two steps. First, we predict the message that the user is likely to send in the chat. Second, we substitute the predicted message with an appropriate sticker. The majority of Hike's messages are text that is transliterated from users' native languages to the Roman script. This leads to numerous orthographic variations of the same message and complicates message prediction. To address this issue, we learn dense representations of chat messages and use them to cluster messages that have the same meaning. In the subsequent steps, we predict the message cluster instead of the message. Our model employs a character level convolutional network to capture the shared intent of orthographic variants of chat messages. We validate our approach on two tasks using manually labelled data. We also propose a novel hybrid message prediction model, which can run with low latency on low end phones that have severe computational limitations.

Figure 1: Sticker Recommendation UI on Hike and a high level flow of our two step sticker recommendation system.

1 INTRODUCTION

In messaging apps such as Facebook Messenger, WhatsApp, Line and Hike, new modalities are extensively used to visually express thoughts and emotions (e.g., emojis, GIFs and stickers). Emojis are used in conjunction with text to convey emotions in a message [7]. Unlike emojis, stickers provide a graphic alternative to text messages. Hike stickers are composed of an artwork (e.g., cartoonized characters and objects) and stylized text (see Figure 1). They help to convey rich expressions along with the message. Hundreds of thousands of stickers are available for free download or purchase on the sticker stores of popular messaging apps.

Once a user downloads a sticker pack, it gets added to a palette, which can be accessed from the chat input box. However, discovering the right sticker while chatting can be cumbersome because it is not easy to think of the best sticker that can substitute your utterance. Even if you do recall a sticker, browsing the palette to find it can slow you down. Apps like Hike and Line offer type-ahead sticker suggestions while you type a message (as shown in Figure 1) in order to alleviate this problem.

The goal of type-ahead sticker recommendation is to help users discover a suitable sticker which can be shared in lieu of the message that they want to send. The latency of generating such recommendations should be in the tens of milliseconds in order to avoid any perceivable delay during typing. This is possible only if the system runs end-to-end on the mobile device without any network calls. Furthermore, a large fraction of Hike users use low end mobile phones, so we need a solution which is efficient both in terms of CPU load and memory requirements.

Prior to this work, sticker recommendation (SR) on the Hike app was based on string matching of the typed text against tags that were manually assigned to each sticker. Simple string matching algorithms fail to capture the importance of phrases while scoring candidates; e.g., for the input text "so", the algorithm would rank the phrase "so" higher than "sorry", while the latter could have been the more useful recommendation. The recall of a string matching based approach is limited by how exhaustively the stickers are tagged. Tags of a sticker are supposed to contain all the phrases that the sticker can possibly substitute, but there are many ways of expressing the same message in a chat. For instance, people often skip vowels in words as they type; e.g., "where are you" → "whr r u", "ok" → "k". This is further exacerbated when people transliterate messages from their native language to the Roman script; e.g., the phrase "acchha" (Hindi for "good") is written in many variants: "accha", "acha", "achha" etc. Another reason for the proliferation of such variants is that certain words are pronounced differently in different regions. We observe 343 orthographic variants of "kya kar raha hai" (Hindi for "What are you doing") in our dataset. Hence, it is hard for any person to capture all variants of an utterance as tags.

Decomposing Sticker Recommendation: One could potentially set up sticker recommendation with the help of a supervised model, which directly learns the most relevant stickers for a given context defined by the previous message in the chat and the text typed in the chat input box.
However, due to updates in the set of available stickers and a massive skew in historical usage toward a handful of popular stickers, it becomes difficult to collect reliable, unbiased data to train such an end-to-end model. Moreover, an end-to-end model would need to be retrained frequently to support new stickers, and the updated model would have to be synced to all devices. Such regular updates of the model, which is possibly >10 MB in size, would be prohibitively expensive in terms of data costs. Thus, instead of creating an end-to-end model, we decompose the SR task into two steps. First, we predict the message that a user is likely to send based on the chat context and what the user may have typed in the message input box. This model does not require frequent updates because the distribution of chat messages does not change much with time. Second, we recommend stickers by mapping the predicted message to stickers that can substitute it. We make use of the manual tags added to stickers to bootstrap the mapping of messages to stickers. This step uses a lightweight message-to-sticker mapping, which is less than 1 MB in size. We update the mapping frequently based on the relevance feedback observed on recommendations, and it incorporates new stickers as they are launched.

For message prediction, we train a classification model with the help of historical chat data. Due to the aforementioned problem of numerous variations in messages, we cluster different messages that have the same meaning and use the clusters as classes in the prediction model. This helps us drastically reduce the number of classes in the classifier while covering the most frequent message intents of our corpus. It lowers the classifier's complexity as well as its size.

In this paper, we focus on how to efficiently set up the message prediction task. In doing so, we specifically deal with two sets of challenges:

(1) Orthographic variations of chat messages: As described above, a large fraction of chat messages are simply variants of each other. We investigate different architectures to learn embeddings for chat messages such that messages with the same intent take on similar representations. We empirically show that the use of a charCNN [13] with a transformer [23] for encoding chat messages can be highly effective at capturing the semantics of chat words and phrases. Next, we perform fine grained clustering on the representations of the frequent chat messages with the help of the HDBSCAN algorithm. These clusters are used as classes for our message prediction model. Though we construct the message clusters to reduce the complexity of the message prediction task, we observe that they are generic enough to be used in other applications such as conversational response generation, intent classification etc. [22, 24].

(2) Computational limitations on low end smart-phones: Most of our users run Hike on low end mobile devices with severe limitations on memory and compute power. Running inference with a neural network model for message prediction proves to be challenging on such devices [8]. The size of a neural network model trained for message prediction exceeds the memory limitation even after quantization. We present a novel hybrid model, which can run efficiently on low end devices without significantly compromising accuracy. Our system is a combination of a neural network based model that processes the chat context and runs on the server, and a Trie [20] based model that processes the typed text input on the client. The first component is not limited by memory and CPU. Trie search is an efficient mechanism to retrieve message groups based on the typed text and can be executed for each character typed. Hence, the system is able to easily meet the latency constraints with this setup. Scores from these two components are combined to produce the final scores for message prediction.

In summary, our contributions include:
• We present a novel application of type-ahead Sticker Recommendation within a messaging application. We decompose the recommendation task into two steps: message prediction and sticker substitution.
• We describe an approach to cluster chat messages that have the same meaning. We evaluate different methods to create dense vector representations for chat messages, which preserve message semantics.
• We propose a novel hybrid message prediction model, which can run with low latency on low end phones that have severe computational limitations. We show that the hybrid model has comparable performance with a neural network based message prediction model.

The rest of the paper is organized as follows. In Section 2, we explore methods to learn effective representations for chat messages and describe the creation of message clusters, which serve as classes in the message prediction task. In Section 3, we discuss different message prediction models, including our hybrid model. In Section 4, we briefly describe our approach for mapping stickers to message clusters. In Section 5, we demonstrate the efficacy of the proposed message embeddings in identifying synonymous chat messages and also evaluate the performance of a quantized neural network and the hybrid model on our message prediction task. Finally, we describe how we have deployed the system on Hike and discuss related literature before concluding the paper.

2 CHAT MESSAGE CLUSTERING

As mentioned earlier, we cluster frequent messages in our chat corpus and use the clusters as classes in the message prediction model. To cover a large fraction of the messages in our chat corpus with a limited number of clusters, we need to group all messages with the same intent into a single cluster. This has to be done without compromising the semantics of each cluster. In order to cluster messages effectively, it is critical to obtain dense vector representations of messages that capture their meaning.

In this section, we first describe an encoder which is used to convert chat phrases into dense vectors [19]. Then, we explain the network architecture which is used to train the message embeddings such that we learn similar representations for messages that have the same meaning.
2.1 Encoder

The architecture of the encoder is shown in Figure 2. The input to the encoder is a message m_i, consisting of a sequence of N words {w_{i,1}, w_{i,2}, ..., w_{i,N}}, and the output is a dense vector e_i ∈ R^(d×1). To represent a word, we use an embedding composed of two parts: a character based embedding aggregated from character representations, and a word level embedding. To generate the character based embedding, we use a character CNN [13] that leverages sub-word information to learn similar representations for orthographic variants of the same word. Let d_c be the dimension of the character embedding. For a word w_t we have a sequence of l characters [c_{t,1}, c_{t,2}, ..., c_{t,l}]. The character representation of w_t is obtained by stacking the character level embeddings in a matrix C_t ∈ R^(l×d_c) and applying a narrow convolution with a filter H ∈ R^(k×d_c) of width k with ReLU non-linearity, followed by a max pooling layer. A width-k filter can be viewed as capturing k-gram features of a word. We have multiple filters for a particular width k, whose outputs are concatenated to obtain the character level embedding e^char_{w_t} for word w_t. The character level representation of the word is concatenated with the word-level embedding e^word_{w_t} to get the final word representation e_{w_t}. Let d_w be the dimension of the word embedding. So, the representation of a word w_t becomes e_{w_t} ∈ R^((d_w+d_c)×1).

Figure 2: Illustration of the encoder, which comprises a character based CNN layer for each word and a GRU/Transformer layer at the word level.

To capture the sequential properties of a message, we explore two architectures.

2.1.1 GRU. The Gated Recurrent Unit (GRU) [4] is an improved version of the RNN designed to mitigate the vanishing gradient problem. At each step t it takes the output of the previous step (t−1) and the current word representation to generate the representation of the message up to time step t. The final step representation of the GRU is taken to be the message embedding e_i.

2.1.2 Transformer. [23] propose an architecture based solely on attention that removes the recurrence step of the RNN. It uses multi-headed attention to capture various relationships among the words, and it adds a positional embedding to the word embedding to encode the sequence information present in the message. Similar to [26], we use only the encoder part of the transformer to create our message embeddings. Unlike the GRU, which gives the final representation at step t, the transformer gives a representation of each word after attending to the different words in the sentence using a series of attention layers. We compute the message embedding e_i by averaging the word vectors in the final layer of the transformer.

Figure 3: Model architecture for learning message embeddings. Input-Response Chat (IRC) maximizes the similarity of the transformed next message embedding (e′_{i+1}) with the current message embedding (e_i). Both encoders share parameters.

2.2 Model Architecture

Akin to [26], we use the model architecture shown in Fig. 3 to train the encoders described in Section 2.1. We use a dot-product scoring model [9, 26] to train the sentence level representation from input-response pairs in an unsupervised manner. It is a discriminative approach in which the network is optimized to score the gold reply message higher than other reply messages. The input to our model is a tuple (m_i, m_{i+1}) extracted from a conversation between two users. Messages m_i and m_{i+1} are encoded using the encoder described in Section 2.1 and represented as e_i and e_{i+1} respectively. Note that the parameters of both encoders are tied, i.e., they share weights. Hence, both e_i and e_{i+1} represent encodings of messages in the same space. We transform the embedding e_{i+1} from the input space to the reply space by applying two fully connected layers to obtain the final response embedding e′_{i+1}. Finally, the dot product of the input message embedding e_i and the response embedding e′_{i+1} is used to score the replies.

We train the model using batches of randomly shuffled tuples. Within a batch, each m_{i+1} serves as the correct response to its corresponding input m_i, and all other instances are treated as negative replies. Considering all other in-batch instances as negatives is computationally efficient because we don't have to explicitly encode a specific negative example for each instance.
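To make the two core pieces concrete, the sketch below shows a character CNN word encoder and the in-batch dot-product loss in PyTorch. It is a minimal illustration rather than our production code: the dimensions and filter widths are placeholder values, and the word-level GRU/transformer and the concatenation with the word embedding are elided (noted in the comments).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCNNWordEncoder(nn.Module):
    """Character CNN: embed each character, convolve over the character
    sequence with several filter widths, max-pool, and concatenate."""
    def __init__(self, n_chars, d_c=16, widths=(2, 3, 4), n_filters=32):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_c, padding_idx=0)
        self.convs = nn.ModuleList(nn.Conv1d(d_c, n_filters, k) for k in widths)

    def forward(self, chars):
        # chars: (batch, n_words, max_chars) integer char ids, 0 = padding
        b, n, l = chars.shape
        x = self.char_emb(chars.view(b * n, l)).transpose(1, 2)  # (b*n, d_c, l)
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        # This is e^char; the paper concatenates it with a word embedding
        # and feeds the sequence to a GRU/transformer (omitted here).
        return torch.cat(feats, dim=1).view(b, n, -1)

def in_batch_loss(e_i, e_next_proj):
    """Dot-product scoring with in-batch negatives: each message embedding
    e_i should score its own transformed reply e'_{i+1} highest, so row j
    of the score matrix should peak at column j."""
    scores = e_i @ e_next_proj.t()                      # (batch, batch)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```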
2.3 Message Clustering

We cluster frequent messages with the help of the embeddings learned above. Our goal is to have a single cluster for all the orthographic variations and acronyms of a message. We observe that the number of variants of a phrase increases with the ubiquity of that phrase. For example, "good morning" has ∼300 variants, while relatively less frequent phrases tend to have significantly fewer variants. This poses a challenge when applying a standard density based clustering algorithm such as DBSCAN, because it is difficult to decide on a single threshold for drawing cluster boundaries. To handle this uncertainty, we choose HDBSCAN [18] to cluster chat messages. This algorithm builds a hierarchy of clusters and handles the variable density of clusters in a time efficient manner. After building the hierarchy, it condenses the tree, extracts the useful clusters and assigns noise labels to the points that do not fit into any cluster. Further, it doesn't require parameters such as the number of clusters or the distance below which a pair of points is considered neighbours. The only necessary parameter is the minimum number of points required to form a cluster. All the clusters, including the noise points, are taken to be the different classes in our message prediction step.
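A minimal sketch of this clustering step using the hdbscan library [18]. The embedding file name, the min_cluster_size value, and the L2 normalization (to approximate an angular distance with Euclidean distance) are illustrative assumptions, not the production settings.

```python
import numpy as np
import hdbscan

# msg_embeddings: (n_messages, d) matrix from the trained encoder.
msg_embeddings = np.load("message_embeddings.npy")  # hypothetical file

# L2-normalize so Euclidean distance behaves like an angular distance.
normed = msg_embeddings / np.linalg.norm(msg_embeddings, axis=1, keepdims=True)

# min_cluster_size is the only required parameter, as noted above.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(normed)

# labels[i] is the cluster id of message i; -1 marks noise points,
# which we also keep as classes for message prediction.
```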
Figure 4: Hybrid architecture for message cluster prediction. 1) The server predicts the next message cluster using a neural network (NN) based classification model and sends the result to the mobile device. 2) For each character typed, the client predicts the message cluster from the typed text using a trie. 3) A combiner aggregates scores from both models to obtain a final score for each message cluster.

3 MESSAGE CLUSTER PREDICTION

In this section, we describe our approach to predicting the message (its cluster, to be precise) that a user is going to send. In chat, the next message to be sent heavily depends on the conversational context. In this paper, we use the previous message received from the other party in the chat as the only context signal. Upon receiving a message, we predict a message that is a likely response. In case the user starts typing, we update our message prediction after every key press, taking into account the text entered so far. The response time of our message prediction model needs to be of the order of tens of milliseconds so that the user's typing experience is not adversely affected and the sticker recommendations change in real time as the user types in the message input box.

We pose message prediction as a classification problem where we train a model to score each message cluster by the likelihood that the user sends the next message from that cluster. A neural network (NN) based classification model that accepts the last received message and the typed text as inputs is a possible solution, but running inference with such a model on mobile devices proves to be a challenge [8]. The size of the NN model that needs to be shipped explodes since we have a large input vocabulary and a large number of output classes. The embedding layer that transforms the one-hot input to a dense vector usually has a size of the order of tens of megabytes, which exceeds the memory limits that we set for our application. To overcome this issue, we evaluate two orthogonal approaches. One is to apply a quantization scheme to reduce the model size, and the other is to build a hybrid model, composed of a neural network component and a trie-based search. In this section, we detail these two approaches.

3.1 Quantized Message Prediction Model

We train a neural network for the message prediction task where the input is the last received message and the typed text, and the output is the message cluster from which the user sends the message. The top G frequent message clusters that were prepared in Section 2.3 are chosen as classes for this model. We encode the last received message and the typed text into a dense vector, which is fed as input features to a classification layer: a fully connected layer with sigmoid non-linearity and G output neurons. Details of the training procedure are described in Section 5.4.

An active area of research is reducing the model size and inference times of neural network models with minimal accuracy loss so that they can run efficiently on mobile devices [8, 10, 11, 21]. One approach is to quantize the floating point representations of the weights and activations of the neural network from 32 bits to a lower number of bits. We use the quantization scheme detailed in [11], which converts both weights and activations to 8-bit integers and uses a few 32-bit integers to represent biases. This reduces the model size by a factor of 4. When we quantize the weights of the network after training the model with full precision floats, the inference times reduce significantly. We employ quantization-aware training (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/README.md) to reduce the effect of quantization on model accuracy. Under this scheme, we use quantized weights and activations to compute the loss function of the network during training as well. This ensures parity between training and inference. While performing back propagation, we use full precision floats because minor adjustments to the parameters are necessary to effectively train the model. After weight pruning, we obtain a quantized model of size ∼9 MB.
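The affine 8-bit quantization of [11] can be summarized in a few lines. The sketch below, with an arbitrary example tensor, shows the quantize/dequantize round trip that quantization-aware training simulates in the forward pass; it is an illustration of the scheme, not an extract of the deployed code.

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map float values to uint8 via a scale and zero point, following the
    affine scheme of [11]. Assumes x is not a constant tensor."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(4, 4).astype(np.float32)   # toy weight tensor
q, s, z = quantize_affine(w)
w_hat = dequantize_affine(q, s, z)             # w_hat ≈ w, error ≤ ~s/2
```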
3.2 Hybrid Message Prediction Model

We build a hybrid message prediction model, where a resource intensive component runs on the server and its output is combined with that of a lightweight on-device model to obtain message predictions under tight resource constraints on memory and CPU. Algorithm 1 outlines the score computation of the hybrid model. Figure 4 demonstrates the high level flow with an example. The three components of our hybrid model are detailed below.

Algorithm 1 Message Cluster Prediction
Input: Last message received prev, typed text typ
1: For each message cluster g ∈ G, compute P_reply(G = g | prev) with the reply NN model.
2: Send the message clusters with scores higher than a threshold, P_reply > t_reply, to the client.
3: Using the text typ as input to the trie model, fetch the set of phrases S that start with typ, and the set of phrases S_m ⊆ S that start with typ and also meet the minimum prefix length constraint.
4: Compute trie scores for all message clusters mapped to phrases in S_m as

   P_trie(G = g | typ) = ( Σ_{ph ∈ S_g ∩ S_m} freq(ph) ) / ( Σ_{ph ∈ S} freq(ph) )   (1)

   where S_g is the set of all phrases present in message cluster g.
5: For message clusters obtained from the reply model or the trie model, combine the scores from both models as:

   Q(G = g | prev, typ) = w_0 + w_t · P_trie(G = g | typ) + P_reply(G = g | prev) · ( w_p0 · e^(−λ·n_c) + w_p1 · P_trie(G = g | typ) )   (2)

Output: Message cluster scores Q

3.2.1 Reply Model: Similar to the prediction model described in Section 3.1, we build a NN model which takes the last received message as input and scores each message cluster by its likelihood of being a reply to the input message. Given the last message prev, the model computes the reply probability P_reply(G = g | prev) for each message cluster g in the set of message clusters G.

When a message gets routed to its recipient, the reply model is queried on the server and the response predictions are sent to the client along with the message. Since this inference runs on the server, we can afford to use large NN models without adversely affecting the user experience. Furthermore, we pre-compute and cache the replies for frequent messages to improve the average response time of reply prediction. The message clusters that have a reply probability P_reply(G = g | prev) above a threshold t_reply are sent to the clients.

3.2.2 Typed Model: The trie [20] is an efficient data structure for prefix searches. We retrieve the relevant message clusters for a given typed text by querying a trie that stores, as key-value pairs, the chat phrases and their corresponding message clusters prepared from our corpus as described in Section 2. In order to reduce false positives, we assign a minimum prefix length to each phrase in the trie, so that a matching record is retrieved only if the typed text meets this minimum length or its message cluster is present in the recommendations available from the reply model. For example, suppose the phrase "hi" has a minimum prefix length of 2; then the typed text "h" will not retrieve the record corresponding to "hi", even though it matches. But if the message cluster of "hi" is present in the predictions of the reply model, then "hi" is retrieved even though the length of the prefix is below the required minimum.

To score the retrieved message clusters, we add the frequency of the phrases in our chat corpus as additional information in the trie. So, records of the form <phrase, (message cluster id, frequency, min prefix length)> are stored as key-value pairs in the trie. The score of a retrieved message cluster is the ratio of the aggregated frequency of the retrieved records from that message cluster to the sum of frequencies of all matching entries, irrespective of whether those records survive the minimum prefix filter. This is shown in Equation 1. For example, suppose the typed text is the string "go", the sum of frequencies of all phrases starting with "go" is 10000, and there are three different phrases in the message cluster for "good morning" that start with "go" (e.g., "good morning", "good mrng", "good mong") with a combined frequency of 1500; then the score of the "good morning" cluster is 1500/10000 = 0.15.

For building the trie, we pick the top frequent phrases, and the frequency of a phrase ph in our chat corpus after some normalization is set as freq(ph). The length of the smallest prefix pref of phrase ph that gives P(ph | pref) ≥ θ is kept as the minimum prefix length of the phrase ph.

At Hike, we created a trie with 34k phrases and around 7500 message clusters. In serialized form, the size of the trie was around 700KB, which is small enough to ship to client devices. Another advantage of a trie is that it is interpretable and makes it easy for us to incorporate new phrases, even if they are not observed in our historical data.

3.2.3 Combiner: After receiving the relevant message clusters and corresponding scores from the reply model and the trie model, the final score for a message cluster, Q(G = g | prev, typ), is computed from P_reply(G = g | prev), P_trie(G = g | typ) and the length n_c of the typed string, as given in Equation 2. The weighting of terms is designed in such a way that as the user types more characters, the contribution of the reply model vanishes unless the message cluster is also predicted by the trie. The weights were chosen manually after some trial and error. A threshold t_combined is applied on the final score to filter out recommendations with low scores.
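The sketch below illustrates Equations 1 and 2 end to end. The trie is replaced with a flat dictionary scan, and the records, weights, and λ are placeholder values, so it shows the scoring logic rather than the production data structure.

```python
import math
from collections import defaultdict

# phrase -> (cluster_id, corpus_frequency, min_prefix_length); toy data.
records = {
    "good morning": ("gm", 900, 4),
    "good mrng":    ("gm", 400, 4),
    "good mong":    ("gm", 200, 4),
    "gotcha":       ("ok", 100, 3),
}

def p_trie(typ, reply_clusters):
    """Equation 1: cluster frequency mass among phrases matching `typ`."""
    matches = {ph: r for ph, r in records.items() if ph.startswith(typ)}
    total = sum(freq for _, freq, _ in matches.values())
    if total == 0:
        return {}
    scores = defaultdict(float)
    for ph, (cid, freq, min_len) in matches.items():
        # A record passes if the prefix is long enough, or if the reply
        # model already proposed its cluster (the "hi" example above).
        if len(typ) >= min_len or cid in reply_clusters:
            scores[cid] += freq / total
    return scores

def combine(p_reply, typ, w0=0.0, wt=1.0, wp0=1.0, wp1=1.0, lam=0.5):
    """Equation 2: trie score plus a reply score whose weight decays with
    the number of typed characters n_c unless the trie agrees."""
    pt = p_trie(typ, set(p_reply))
    n_c = len(typ)
    return {g: w0 + wt * pt.get(g, 0.0)
               + p_reply.get(g, 0.0) * (wp0 * math.exp(-lam * n_c)
                                        + wp1 * pt.get(g, 0.0))
            for g in set(p_reply) | set(pt)}

print(combine({"gm": 0.6}, "go"))  # {'gm': ≈1.72} with these toy weights
```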
4 MESSAGE CLUSTER TO STICKERS MAPPING

As mentioned in Section 1, we make use of a message cluster to sticker mapping to suggest suitable stickers for the predicted message clusters. When a sticker is created, it is tagged with conversational meta-data in order to map the message clusters to stickers. We compute the similarity between the tag phrases of a sticker and the phrases present in each message cluster, after converting them into vectors using the encoder described in Section 2. Compared to the historical approach of suggesting a sticker when the user's typed input matches one of its tag phrases, the current system is able to suggest stickers even if different variations of the tag phrase are typed by the user, since many variations of a message are already captured in the message clusters. We regularly refresh the message cluster to sticker mapping by taking into account the relevance feedback observed on our recommendations. Since generation of the message to sticker mapping is not the main focus of this paper, we skip the details of these algorithms.
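As the paper deliberately skips the details of the mapping algorithm, the sketch below is only one plausible way such a mapping could be bootstrapped from tag and cluster-phrase embeddings; the max-over-pairs cosine similarity, the 0.8 threshold, and the `encode` function are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def map_clusters_to_stickers(sticker_tags, cluster_phrases, encode, thresh=0.8):
    """Attach a sticker to every cluster whose phrases come close enough
    to one of the sticker's tag phrases in embedding space.
    encode(text) -> vector is assumed to be the Section 2 encoder."""
    mapping = {}
    for cluster_id, phrases in cluster_phrases.items():
        for sticker_id, tags in sticker_tags.items():
            best = max(cosine(encode(t), encode(p))
                       for t in tags for p in phrases)
            if best >= thresh:
                mapping.setdefault(cluster_id, []).append(sticker_id)
    return mapping
```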
5 EXPERIMENTS

In this section, we first describe the dataset used for training the Sticker Recommendation (SR) system. Next, we quantitatively evaluate different message embeddings on two manually curated datasets. We show qualitative results after clustering to demonstrate the effectiveness of the message embeddings. Then, we present a comparison of the NN model and the hybrid model for message cluster prediction. Lastly, we discuss the deployed system.

5.1 Dataset and preprocessing

We collected conversations of anonymous users over a period of 5 months. This dataset is pre-processed to extract useful conversations from the chat corpus. The first step is tokenization, which includes accurate detection of emoticons and stickers, and reducing more than 3 consecutive repetitions of the same character to 2 characters. We also add a special symbol at the start and the end of every conversation. After pre-processing the data, we create tuples of the current and the next message for training the message embedding models. Since stickers are mostly used to convey short messages, we filter out all tuples in which at least one message has more than 5 words. Both our input and output vocabularies consist of the top 50k frequent words in our corpus. The dataset is randomly split into training and validation sets. Messages shorter than 5 words are padded up to the pre-defined maximum length of 5. Similarly, we pad the characters of a word if it is shorter than 10 characters.

5.2 Message Embedding Evaluation

In this subsection, we evaluate the embeddings generated by the different architectures on two manually labeled datasets. First, we describe the baselines, training and hyper-parameter tuning.

5.2.1 Message embedding models. We compare the embeddings obtained from the architecture of Section 2.2 and its variants, which are as follows:
(1) Trans+CharCNN - the encoder shown in Figure 2 with a transformer.
(2) Trans - uses only the word-level embedding e^word_m as input to the transformer.
(3) GRU+CharCNN - the encoder shown in Figure 2 with a GRU.
(4) GRU - uses only the word-level embedding e^word_m as input to the GRU.

5.2.2 Training and Hyper-parameters: We train our GRU based models using the Adam optimizer with a learning rate of 0.0001 and step exponential decay every 10k steps. We apply gradient clipping (with a value of 5) and dropout at the GRU and fully connected layers to reduce overfitting. For the GRU+CharCNN model, we use filters of widths 1, 2, 3, 4 with 50, 50, 75, 75 filters respectively in the convolution layers. The GRU models use an embedding size of 300. Transformer models are trained using the RMSprop optimizer with a constant learning rate of 0.0001. They use 2 attention layers and 8 attention heads. Within each attention layer, the feed forward network has input and output size 256 and a 512 unit inner layer. To avoid overfitting, we apply attention dropout and embedding dropout of 0.1. The convolution layers of the Trans+CharCNN model have the same number of filters and filter widths as the GRU+CharCNN model.

5.2.3 Phrase Similarity Task. We created a dataset consisting of phrase pairs labelled as similar or not-similar. For example, ("ghr me hi", "room me h") and ("majak kr rha hu", "mjak kr rha hu") are labelled as similar, while ("network nhi tha", "washroom gyi thi") and ("it's normal", "its me") are labelled as not-similar. Candidate similar examples are sampled from the top 50k frequent phrases using 2 methods: a) pairs close to each other in terms of minimum edit distance; b) pairs having words with similar word embeddings. Candidate negative examples are sampled randomly. These pairs were annotated by 3 annotators. Pairs with disagreement among the annotators were not considered during evaluation. The final dataset has 3341 similar pairs and 2437 non-similar pairs.

To evaluate the models, we calculate the angular similarity [3] between the embeddings of the phrases in a pair.
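Angular similarity, following the form used in [3], rescales the angle between two vectors into [0, 1]. A small sketch, assuming non-zero embedding vectors:

```python
import numpy as np

def angular_similarity(u, v):
    """1 - (angle between u and v) / pi, as in [3]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
```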
The ROC curves of the models are shown in Figure 5. We observe that the transformer yields a significant improvement over the GRU, owing to better capture of phrase semantics through self-attention. The Trans+CharCNN model performs the best, with an ≈6% absolute improvement in area under the curve over the GRU baseline.

Figure 5: Performance of the message embedding models on the phrase similarity task.

5.2.4 Chat Slang Task: We manually labelled a set of words which are common slangs or variants used in conversation and mapped them to their correct forms. We considered two types of errors. 1) Spelling variants (spell) - in this class we considered words which can be converted to their correct form either by replacing characters that are nearby on the keyboard ('hsn' → 'han') or by adding vowels ('phn' → 'phone', 'yr' → 'yaar').
These errors occur either due to unintentional mistakes or to shorten the word length and increase typing speed. 2) Synonym words (syn) - we considered words which have the same meaning and are phonetically similar but cannot be converted into their correct form by spelling correction, e.g., 'plzz' → 'please', 'n8' → 'night'. Most users are familiar with these forms of word transformation and use them widely in conversation. In total, we labelled 213 spelling variants and 78 synonym words² and mapped them to their correct word forms.

To evaluate the message embeddings, we retrieve the top 10 nearest neighbours among the top frequent phrases for each query in the labelled dataset and use two evaluation metrics. 1) Precision@10 (P@10) counts the number of correct phrases in the top 10 (retrieved phrases that are semantically similar to the query). 2) Recall@10 (R@10) measures whether the labelled phrase for the query is present in the top 10 or not. Two human judges performed the annotation, and disagreements were resolved by consensus among the judges.

Table 1 shows the performance of the different encoder variants used to learn message embeddings. The transformer and the GRU have comparable recall on the spell data because it contains single words; hence the transformer does not have much of an advantage over the GRU. CharCNN significantly improves the recall on synonym word variants for both architectures. CharCNN is able to capture certain chat characteristics that arise from similar sounding sub-words; e.g., 'night' and 'n8', 'see' and 'c', 'what' and 'wt' are used interchangeably during conversation. CharCNN learns such chat nuances and performs well in matching those words to semantically similar words.

Models         | Spell P@10 | Spell R@10 | Syn P@10 | Syn R@10 | Overall P@10 | Overall R@10
---------------|------------|------------|----------|----------|--------------|-------------
Trans+CharCNN  | 73.3       | 67.1       | 89.2     | 51.3     | 77.6         | 62.9
Trans          | 70.8       | 67.6       | 86.7     | 47.4     | 75.0         | 62.2
GRU+CharCNN    | 71.6       | 66.7       | 86.6     | 48.7     | 75.6         | 61.9
GRU            | 70.7       | 67.6       | 86.2     | 44.9     | 74.8         | 61.5

Table 1: Evaluation of different architectures of the message embedding model on the chat slang task.

² We plan to release the labelled data and baseline results.
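A sketch of the two retrieval metrics, under the assumption that a set of gold similar phrases is available per query; `retrieve_top10` is a hypothetical stand-in for nearest-neighbour search over the embeddings.

```python
def precision_recall_at_10(queries, gold, retrieve_top10):
    """gold[q] = set of phrases judged similar to q (incl. its correct form).
    retrieve_top10(q) = 10 nearest phrases by embedding similarity."""
    p_sum, r_sum = 0.0, 0.0
    for q in queries:
        top10 = retrieve_top10(q)
        p_sum += sum(ph in gold[q] for ph in top10) / 10   # per-query P@10
        r_sum += any(ph in gold[q] for ph in top10)        # per-query R@10
    return p_sum / len(queries), r_sum / len(queries)
```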
5.3 Message Clustering Evaluation

Figure 6 shows a qualitative evaluation of the clusters obtained from the HDBSCAN algorithm. We select the top 100 clusters based on the number of phrases present in each cluster. We project the phrase embeddings learned by our model onto 2 dimensions using t-SNE [17] for visualization. Figure 6 shows some randomly picked phrases from the clusters; phrases from the same cluster are shown in the same color. We observe that the various spelling variants and synonyms of a phrase are grouped into a single cluster. For example, the 'good night' cluster includes phrases like 'good nighy', 'good nyt', 'good nght', 'gud nyr', 'gud nite'. Our clustering algorithm is able to capture fine-grained semantics of phrases. For instance, 'i love u', 'i love you', 'i luv u' belong to one cluster while 'love u too', 'love u 2', 'luvv uhh 2' belong to a different cluster, which helps us show accurate sticker recommendations for both clusters. If a user has messaged 'i love u', then showing stickers from the 'i love u too' cluster makes for more relevant replies than showing 'i love u' stickers. Our message embedding is also able to place phrases like 'i love u baby' and 'i luv u shona' in a cluster distinct from the 'i love u' cluster. This helps us deliver more precise sticker recommendations, since we have different stickers in our database for phrases like 'i love u' and 'i love u baby'.

5.4 Message Cluster Prediction Evaluation

Our aim is to show the right stickers with minimal effort from the user. So, we compare our hybrid model and the quantized model for SR mainly on the following metrics: 1) the number of characters a user needs to type before the correct message cluster appears in the top 3 positions (# of characters to be typed); 2) the number of times the model shows wrong messages before predicting the correct message cluster in the first 3 positions (# of times inaccurate predictions shown); and 3) the fraction of messages that can be retrieved by the model in the first 3 positions with some prefix of the message (fraction of msgs retrieved). The lower the first two metrics, the better the model, while the fraction of messages retrieved should be as high as possible.

For training the NN message prediction model, we collected pairs of current and next messages from the complete conversation data. Our training data had around 10M such pairs. We randomly sampled 38k pairs for testing. We curate the training data by treating every prefix of the next message as typed text and the message cluster of the next message as the class label. We restrict the next message clusters to the top 7500 message clusters (clusters are picked on the basis of the summed frequency of their phrases). The selected clusters cover the 34k most frequent reply messages in our dataset. For the hybrid model, we train two models. The first predicts the next message cluster based on the current message; it is trained directly from current and next message pairs, with the next message mapped to its corresponding message cluster. The second is a trie model over the typed message; we prepare the phrase frequencies and minimum prefixes of phrases as described in Sec. 3.2.

The results of the various models are shown in Table 2. The quantized NN model performs better in terms of # of characters to be typed. This is expected because the NN model learns to use both inputs simultaneously, whereas the combiner (the last function in the hybrid model) is a simple weighted linear combination of just three features. The hybrid model performs slightly better in terms of # of times inaccurate recommendations are shown and fraction of messages retrieved. Compared to the quantized model, the hybrid model needs slightly more than one additional character to be typed, on average, to get the required predictions. It has to be noted that the quantized NN is ∼9.2 MB in size, which is not practically feasible in our use case; the on-device footprint of the hybrid model is below 1 MB. Another important point is that a trie based model can guarantee retrieval of every message cluster given some prefix of the message. If the message is not shadowed by another longer but more frequent message, the message class can feature in the top position itself, with some prefix of the message or the full message as input. A pure NN based model cannot make such guarantees. This is reflected in the third metric in Table 2: more than 2% of the time, the intended message class is not retrievable by the NN. Given that this SR interface is one of the most heavily used interfaces for sticker discovery, if a sticker is not retrieved through this recommendation, the user might perceive it as the app not supporting the message class or not having such stickers. So keeping the retrieval of message clusters close to 100% is critical for our SR models.

Method              | # of characters to be typed | # of times inaccurate predictions shown | Fraction of msgs retrieved
--------------------|-----------------------------|-----------------------------------------|---------------------------
Quantized NN (100d) | 1.58                        | 1.47                                    | 0.978
Hybrid              | 2.84                        | 1.22                                    | 0.991

Table 2: Performance of different message prediction models.
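A sketch of how the first metric could be computed for one test pair; `predict_top3` is a hypothetical stand-in for either model, and stepping through every prefix in order is an assumption consistent with the data curation described above.

```python
def chars_to_type(prev_msg, next_msg, true_cluster, predict_top3):
    """Return how many characters of next_msg must be typed before
    true_cluster appears in the model's top-3 predictions."""
    for n in range(len(next_msg) + 1):   # n = 0 means nothing typed yet
        typed = next_msg[:n]
        if true_cluster in predict_top3(prev_msg, typed):
            return n
    return None                          # cluster never retrieved
```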
Figure 6: HDBSCAN clusters of top frequent phrases using the message embeddings produced by our GRU+CharCNN model. The points represent chat message vectors projected into 2D using t-SNE [17].

5.5 Real World Deployment

The proposed system has been deployed in production on Hike for more than 6 months. A/B tests comparing this system with a previous implementation showed an 8% improvement in the fraction of messaging users who use stickers. Multiple regional languages are spoken in India, so we segregate our data based on geography and train separate models (both message embedding and message prediction models) for each state. This ensures that the distribution of frequent chat phrases does not get skewed towards one majority language spoken across India. We obtain the primary language of a user from the language preferences they explicitly set, or infer it from their sticker usage.

6 RELATED WORK

To the best of our knowledge, there is no prior art that studies the problem of type-ahead sticker suggestions. However, there are a few closely related research threads, which we describe here.

The use of emojis is widespread in social media. [7] classify whether an emoji used in a tweet emphasizes something or adds new information. [1, 2] predict which of the top-20 emojis are likely to be used in an Instagram post based on the post's text and image. Unlike emojis, which are mostly used in conjunction with text, stickers are independent messages that substitute for text. Thus, in order to make effective sticker suggestions for a given conversational context, we need to predict the likely utterance and not just the emotion. Since possible utterances are far more numerous than emotions, our problem is more nuanced than emoji prediction.

There exists a large body of research on conversational response generation. [22, 24] design end-to-end models leveraging a hierarchical RNN to encode the input utterances and another RNN to decode possible responses. This approach suffers from the limitation that it often generates generic, uninformative responses (e.g., "i don't know"). [28] describe a model that explicitly optimizes for improving the diversity of responses. [25] propose a retrieval based approach with the help of a DNN based ranker that combines multiple evidences around queries, contexts, candidate postings and replies. Smart Reply [12] is a system that suggests short replies to e-mails; it selects high quality responses with diversity to increase the options the user can choose from. Akin to our system, Smart Reply also generates clusters of responses. They apply semi-supervised learning to expand the set of responses, starting from a small number of manually labeled responses for each semantic intent. In contrast, we follow an unsupervised approach to discover message clusters, since the set of all intents that may correspond to stickers wasn't readily available. A unique aspect of our system is that we update the message prediction by incorporating whatever the user has typed so far; we need to do this in order to deliver type-ahead sticker suggestions.

There is a parallel research thread on learning effective representations for sentences that capture sentence semantics [5, 6, 14–16, 26]. Skip Thought [14] is an encoder-decoder model that learns to generate the context sentences of a given sentence. It involves sequential generation of the words of the target sentences, which limits the target vocabulary and increases training time. Quick Thought [16] circumvents this problem by replacing the generative objective with a discriminative approximation, where the model attempts to classify the embedding of the correct target sentence given a set of sentence candidates. Recently, BERT [6] proposed a model which predicts bidirectional context to learn sentence representations using the transformer [23]. [26] propose the Input-Response model that we have evaluated in this paper. However, unlike these works, we apply a CharCNN [13, 27] in our encoder to deal with the problem of learning similar representations for phrases that are orthographic variants of each other.

7 CONCLUSION

In this paper, we present a novel system for contextual, type-ahead sticker recommendation within a chat application. We decompose the recommendation task into two steps. First, we predict the next message that a user is likely to send based on the last message received in the chat. As the user types the message, we continuously update our prediction by taking into account the character sequence typed in the chat input box. Second, we substitute the predicted message with relevant stickers. We discuss how numerous orthographic variations of the same utterance exist in Hike's chat corpus, which mostly contains messages in transliterated form. We describe a clustering solution that can identify such variants with the help of a highly effective message embedding, which learns similar representations for semantically similar messages. Message clustering reduces the complexity of the classifier used in message prediction: by predicting one of 7500 classes (message clusters), we are able to cover all intents expressed in 34k frequent messages. Furthermore, it allows us to collect relevance feedback from related message contexts to improve our mapping of messages to stickers. For message prediction on low-end mobile phones, we use a hybrid model that combines a neural network on the server with a memory efficient trie search on the device for effective sticker recommendation. We show experimentally that the hybrid model is able to predict a higher fraction of overall message clusters compared to a quantized neural network. This model also helps in better sticker discovery for rare message clusters.

REFERENCES
[1] Francesco Barbieri, Miguel Ballesteros, Francesco Ronzano, and Horacio Saggion. 2018. Multimodal emoji prediction. arXiv preprint arXiv:1803.02392 (2018).
[2] Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. 2017. Are Emojis Predictable? arXiv preprint arXiv:1702.07285 (2017).
[3] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder. CoRR abs/1803.11175 (2018). http://arxiv.org/abs/1803.11175
[4] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[5] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364 (2017).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[7] Giulia Donato and Patrizia Paggio. 2017. Investigating redundancy in emoji use: Study on a twitter based corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 118–126.
[8] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. 2016. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168 (2016).
[9] Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652 (2017).
[10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[11] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2704–2713.
[12] Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 955–964.
[13] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-Aware Neural Language Models. In AAAI. 2741–2749.
[14] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3294–3302.
[15] Quoc V Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML, Vol. 14. 1188–1196.
[16] Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893 (2018).
[17] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[18] Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. The Journal of Open Source Software 2, 11 (2017), 205.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[20] Donald R Morrison. 1968. PATRICIA—practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM (JACM) 15, 4 (1968), 514–534.
[21] Sujith Ravi. 2017. ProjectionNet: Learning efficient on-device deep networks using neural projections. arXiv preprint arXiv:1708.00630 (2017).
[22] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI, Vol. 16. 3776–3784.
[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[24] Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, and Wei-Ying Ma. 2017. Hierarchical recurrent attention network for response generation. arXiv preprint arXiv:1701.07149 (2017).
[25] Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 55–64.
[26] Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning Semantic Textual Similarity from Conversations. arXiv preprint arXiv:1804.07754 (2018).
[27] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems. 649–657.
[28] Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. 2018. Generating informative and diverse conversational responses via adversarial information maximization. arXiv preprint arXiv:1809.05972 (2018).