Context-Aware Drive-thru Recommendation Service at Fast Food Restaurants
Luyang Wang (Burger King Corporation), Kai Huang (Intel Corporation), Jiao Wang (Intel Corporation), Shengsheng Huang (Intel Corporation), Jason (Jinquan) Dai (Intel Corporation), Yue Zhuang (Brown University)

Contact: lwang1@whopper.com; {kai.k.huang, jiao.wang, shengsheng.huang, jason.dai}@intel.com; yue zhuang1@brown.edu

arXiv:2010.06197v1 [cs.IR] 13 Oct 2020

Abstract

Drive-thru is a popular sales channel in the fast food industry where consumers can make food purchases without leaving their cars. Drive-thru recommendation systems allow restaurants to display food recommendations on the digital menu board as guests are making their orders. Popular recommendation models in eCommerce scenarios rely on user attributes (such as user profiles or purchase history) to generate recommendations, while such information is hard to obtain in the drive-thru use case. Thus, in this paper, we propose a new recommendation model, Transformer Cross Transformer (TxT), which exploits the guest order behavior and contextual features (such as location, time, and weather) using Transformer encoders for drive-thru recommendations. Empirical results show that our TxT model achieves superior results in Burger King's drive-thru production environment compared with existing recommendation solutions. In addition, we implement a unified system to run end-to-end big data analytics and deep learning workloads on the same cluster. We find that in practice, maintaining a single big data cluster for the entire pipeline is more efficient and cost-saving. Our recommendation system is not only beneficial for drive-thru scenarios but can also be generalized to other customer interaction channels.

1 Introduction

Modern restaurant drive-thru service is a type of take-out service that allows customers to purchase food without leaving their cars. The process starts with customers browsing the items on the outdoor digital menu board and talking to the cashier inside the restaurant to place their order. Drive-thru recommendation systems have recently drawn the attention of large players in the quick-service restaurant industry. At Burger King, we take the initiative to develop a state-of-the-art context-aware recommendation system to improve our guest experience in the drive-thru.

Drive-thru, being an offline shopping channel, has much fewer data points that can be leveraged to generate proper recommendations. For example, there is no easy way to reliably identify guests and retrieve their profiles in the drive-thru environment. The lack of a user identifier makes popular recommendation algorithms such as Alternating Least Squares (ALS) described in [16] and Neural Collaborative Filtering (NCF) [10] not applicable for drive-thru, as they require user profiles as model inputs. Session-based recommendation systems are able to learn from behavior sequence data to generate recommendations, so they are less dependent on the user identifier and its related information. However, in our drive-thru use case, relying only on order sequence data for prediction is simply not enough. Fast food purchase preference can drastically change given different context data like location, time, and current weather conditions. A proper neural network architecture that leverages both order sequence data and multiple context features is key to a successful recommendation system in the drive-thru scenario.

To tackle the aforementioned challenges, we propose a new deep learning recommendation model called Transformer Cross Transformer (TxT) that exploits the sequence of each order as well as the context information to infer a customer's preference at the moment. Sequential signals underlying the guest's current order are important to reflect behavioral patterns. Using apple pie as an example, an apple pie shown early in an order sequence suggests a lightweight snack-driven order, while an apple pie ordered after multiple large sandwiches could imply a large multi-party order.
Contextual data such as time and weather could also be useful, since guests may have particular preferences according to the current situation. For example, it is natural that people tend to have cold drinks when the temperature increases. Furthermore, we incorporate other context information such as the store location so that our recommendation can prioritize popular food items being sold in particular regions.

In addition to designing a suitable algorithm for our drive-thru scenario, setting up a complete pipeline to efficiently train the model on our enormous guest transaction data is also important for the whole system.
We do have big data platforms to store and process the big data, but how to seamlessly integrate deep learning training into the existing big data system is a nontrivial task. A common practice is to allocate a separate system dedicated to distributed training only, which introduces extra data transfer time and system maintenance efforts. Instead, we propose an integrated system allowing big data analytics and distributed training on exactly the same cluster where the data is stored, which is efficient and easy to scale and maintain.

In summary, the main contributions of this paper are as follows:

• We propose a Transformer Cross Transformer model (TxT) for the drive-thru scenario. The key advantage of our model is that we apply Transformer encoders [27] to capture both the user order behavior sequence and complicated context features, and combine both Transformers through latent cross [3] to generate recommendations.

• We design a unified system architecture based on Apache Spark [29] and Ray [20] to perform distributed deep learning training on exactly the same cluster where our big data is stored and processed. Our system fully utilizes the existing big data clusters, avoids expensive data transfer, eliminates separate workflows and systems, and thus improves end-to-end efficiency in the production environment.

• We have successfully deployed our recommendation system at Burger King's restaurants to serve our drive-thru customers. We share our key learnings in production and have open sourced our implementation in [7]. Our model concepts and system design could be easily extended to other customer interaction channels.

2 Related Work

In this section, we give a brief overview of the past research work on recommendation that we gain insights from. Our work extends state-of-the-art models of session-based recommendation and context-aware recommendation to make the topology best fit our drive-thru scenario.

2.1 Session-based Recommendation. A straightforward approach in session-based recommendation is item-based collaborative filtering [23], which uses an item-to-item similarity matrix to recommend the items most similar to the one that the user has clicked most recently in the session. Another approach is mainly based on Markov chains (e.g. Markov Decision Processes [25]), which capture sequential patterns in user behavior to predict a user's next action given the previous actions. In recent years, RNN based neural networks including Long Short-Term Memory (LSTM) [13] and Gated Recurrent Unit (GRU) [6] have been adopted and yield better recommendation results. For example, GRU4Rec [12] utilizes session-parallel mini-batch training and a ranking-based loss function for recommendation. Later, other RNN based algorithms including Improved GRU4Rec [11] and user-based GRU [8] were explored for session-based recommendation as well.

Attention mechanisms have become another powerful approach to modeling sequential signals in a user session, and several studies apply attention to session-based recommendation. For example, NARM [17] incorporates attention into GRU to capture the behavior and purpose of users in a session. Instead of incorporating attention as a component into other models, Transformer [27] networks, which only use multi-head self-attention to model sequential data, achieve state-of-the-art performance (e.g. SASRec [14]). BST [4] uses a Transformer layer to learn item representations in a behavior sequence and concatenates the result with other feature embeddings for the final prediction.
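To make the item-based collaborative filtering approach mentioned at the start of this subsection concrete, the following is a minimal, self-contained sketch with toy data; it illustrates the general technique only and is not the Contextual ItemCF variant evaluated in Section 5.

    # Toy sketch of item-based collaborative filtering: build an item-to-item
    # cosine similarity matrix from session/item co-occurrences and recommend
    # the items most similar to the most recently added item.
    import numpy as np

    # Rows = sessions, columns = items (1 if the item appears in the session).
    interactions = np.array([
        [1, 1, 0, 0],   # session 1: burger, fries
        [1, 0, 1, 0],   # session 2: burger, soda
        [0, 1, 1, 1],   # session 3: fries, soda, apple pie
    ], dtype=float)

    norms = np.linalg.norm(interactions, axis=0, keepdims=True)
    similarity = (interactions.T @ interactions) / (norms.T @ norms + 1e-9)
    np.fill_diagonal(similarity, 0.0)     # never recommend the item itself

    last_item = 0                         # most recently added item (burger)
    recommended = np.argsort(-similarity[last_item])[:3]
    print(recommended)                    # indices of the most similar items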
2.2 Context-Aware Recommendation. Traditional recommendation systems deal with applications that only include users and items. But in many scenarios, user and item attributes alone may not be sufficient, and it is also important to take contextual information into consideration to recommend items to users under certain circumstances. This is even more important for fast food recommendation, since the food ordering habits of users may change significantly given the current location, time and weather conditions.

More recently, many works have started on context-aware recommender systems (CARS) [2] to incorporate available context features into the recommendation process as explicit additional categories of data. In Wide & Deep Learning [10], context features can be treated as categorical features and fed into the hidden layers of a linear model. [26] applies general context features to the RNN layer. The most commonly used approaches concatenate contextual features to other inputs. In the meantime, Latent Cross [3] performs an element-wise product between the context embedding and the hidden states from the RNN to improve the performance of context incorporation. But that paper does not focus on how to effectively learn the feature representations from multiple complex context features. A major difference of our proposed model in this paper is that we apply a separate Transformer block to capture the complicated interactions of different context features.
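To make the contrast between plain concatenation and the element-wise cross described above concrete, here is a small illustrative sketch; the dimensions and embedding values are assumptions for illustration, not code from [3].

    # Two common ways to inject a context embedding into a sequence model's
    # hidden state: concatenation vs. an element-wise (multiplicative) cross.
    import numpy as np

    rng = np.random.default_rng(0)
    hidden = rng.normal(size=8)     # hidden state from an RNN/Transformer step
    context = rng.normal(size=8)    # learned context embedding (e.g. weather)

    concatenated = np.concatenate([hidden, context])   # concatenation, dim 16
    crossed = hidden * context                          # element-wise cross, dim 8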
3 Model Description

In this section, we describe our Transformer Cross Transformer model (TxT) in detail. We begin with defining the drive-thru recommendation task, followed by illustrating the structure of our TxT model.

3.1 Problem Description. We define the drive-thru recommendation task as follows. Given an order sequence event S_n(e) = {f_1, f_2, ..., f_n}, where f_i is the i-th food item a guest orders, and a context feature collection C(e) = {c_1, c_2, ..., c_m}, where c_i is one context feature when the order event occurs (e.g. location, time and weather), we shall learn a function F,

    f_{n+1} = F(S_n(e), C(e))

which predicts the food item the guest is most likely to order next, i.e. f_{n+1}.

To effectively deal with the drive-thru recommendation task, we propose the Transformer Cross Transformer model (TxT), which uses a Sequence Transformer to encode guest order behavior, a Context Transformer to encode context features, and then uses an element-wise product to combine them and produce the final output, as shown in Figure 1.

Figure 1: The model structure of TxT.

3.2 Sequence Transformer. We construct a Sequence Transformer to learn the sequence embedding vector of each item in the guest order basket, as shown in the lower left part of Figure 1. To ensure that the item position information can be considered in its original add-to-cart sequence, we perform positional embedding on input items in addition to the item feature embedding. Both embedding outputs are added together and fed into a multi-head self-attention network [27].

In the multi-head self-attention network, we project the item embeddings into a set of queries Q and key-value pairs (K, V) in h spaces, and then calculate the multi-head self-attention output in parallel:

    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    head_i = Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimension of K, h is the number of heads and W^O is the weight matrix of the heads.

A padding mask is applied, where attention scores at padded item positions are replaced with large negative values so that after the softmax, the attention weights at these padded positions are close to 0. We then add point-wise feed-forward networks on top of the multi-head self-attention outputs.

To extract the vector representation of the whole guest order basket from the hidden vectors of each item, we need to apply a pooling function here. Mean pooling is a common pooling strategy which performs an element-wise mean calculation across all product vectors, so that the output basket embedding is represented by all the products contained in the product sequence. Max pooling is another popular pooling strategy which picks the maximum value at each position across all product vectors, so that the output basket embedding is represented by a small number of key products and their salient features only. To combine the advantages of both mean pooling and max pooling, we follow mean-max pooling [30] to apply mean pooling and max pooling separately against the output of the Transformer block and concatenate both pooling outputs to form the guest order basket vector. The final output of the Sequence Transformer can be formalized as:

    α = sum(z_i) / n;  β = max(z_i)
    output = Concat(α, β)

where Z = {z_i} is the output of the Transformer encoder layer.
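The following is a minimal NumPy sketch of the two building blocks above, masked scaled dot-product attention over the item sequence and mean-max pooling of the encoder outputs. It is an illustration under assumed shapes, not the production MXNet implementation from [7].

    import numpy as np

    def masked_attention(Q, K, V, pad_mask):
        # Q, K, V: (seq_len, d_k); pad_mask: (seq_len,) boolean, True at padded positions.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        scores[:, pad_mask] = -1e9                        # mask out attention to padded items
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ V

    def mean_max_pool(Z, pad_mask):
        # Z: (seq_len, d) Transformer encoder outputs; padded positions are ignored.
        valid = Z[~pad_mask]
        alpha = valid.mean(axis=0)                        # mean pooling
        beta = valid.max(axis=0)                          # max pooling
        return np.concatenate([alpha, beta])              # order basket vector of size 2d

In the actual model these operations run per attention head on the item-plus-positional embeddings and are followed by the point-wise feed-forward layer described above.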
3.3 Context Transformer. Context information is a significant factor in drive-thru recommendation scenarios. A common way to incorporate contextual features is to directly concatenate context features with other inputs, but it is less meaningful to simply concatenate non-sequence features with sequence features. In [3], the authors proposed an element-wise sum when dealing with multiple context features before crossing them with the sequence output. However, a sum can only represent how the context features aggregately contribute to the output; it cannot represent the individual contribution of each context feature. For example, the restaurant location may sometimes have a stronger impact on guest food ordering compared with other context features due to regional food offerings. In this case, a simple sum of all the context features cannot emphasize the importance of location.

Therefore, we use a Context Transformer to encode the contextual information, as shown in the bottom right part of Figure 1. Using the Transformer's multi-head self-attention [27], we can capture not only the individual effect of each context feature, but also the internal relationships and interactions across different context features. Attention outputs of different heads focus on the importance of different context features, which enables our model to consider the combined influence of various context features at the same time. In Figure 2, the color depth of the boxes indicates the weights of the context features for each head, and the darker ones are treated as more important features. From Figure 2 we can see that in Case 1 Head 2 focuses on Temperature, while in Case 2 Head 2 focuses more on Weather.

Figure 2: Heat map for context features.

In particular, we assemble multiple context embeddings into a context vector and apply multi-head self-attention to it. Unlike the Sequence Transformer, we do not add positional encoding to the input of the multi-head self-attention, since the order of context features in the context vector is fixed in our model. In production we find that one Transformer layer is sufficient to extract our context features well. Similar to the Sequence Transformer, we also apply mean-max pooling [30] on the Transformer output to generate the overall representation of the contextual features.
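Continuing the sketch from Section 3.2 (and reusing its masked_attention and mean_max_pool helpers), a Context Transformer input could be assembled as follows. The embedding size of 100 mirrors Table 2, but the feature names and embedding values are illustrative assumptions.

    import numpy as np

    d = 100                                   # embedding size, as in Table 2
    rng = np.random.default_rng(0)
    # One learned embedding per context feature (location, order time, weather, temperature).
    location, order_time, weather, temperature = rng.normal(size=(4, d))

    context = np.stack([location, order_time, weather, temperature])   # (4, d), fixed order
    no_pad = np.zeros(len(context), dtype=bool)        # every context slot is present
    encoded = masked_attention(context, context, context, no_pad)      # no positional encoding
    context_vector = mean_max_pool(encoded, no_pad)    # (2d,) overall contextual representation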
3.4 Transformer Cross Transformer. To jointly train the Sequence Transformer and the Context Transformer, we adopt the idea proposed in [3] and perform an element-wise product between these two outputs. Through this cross-Transformer training, we are able to optimize all the parameters, such as the item embeddings, the contextual feature embeddings and their interactions, at the same time. Finally, we apply LeakyReLU as the activation function, followed by a softmax layer to predict the probabilities of each candidate item.
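A hedged sketch of this final cross step, continuing the earlier NumPy illustrations; the output weight matrix W, the candidate item count and the LeakyReLU slope are assumptions, not values from the paper.

    import numpy as np

    def txt_head(order_vector, context_vector, W):
        # order_vector, context_vector: (2d,) pooled outputs of the two Transformers.
        crossed = order_vector * context_vector                    # element-wise (latent) cross
        hidden = np.where(crossed > 0, crossed, 0.01 * crossed)    # LeakyReLU (0.01 slope assumed)
        logits = hidden @ W                                        # W: (2d, num_candidate_items)
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()                                 # softmax over candidate items

Training then minimizes the cross entropy between these probabilities and the held-out last item of each order, which matches the training loss curves discussed in Section 5.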
4 Distributed Deep Learning Architecture

4.1 Motivations. We have a huge amount of user transaction records stored on our big data cluster, which is impossible to train on a single node. So we aim to build an end-to-end training pipeline containing a distributed data cleaning and ETL process, which we implement using Apache Spark [29], and a distributed training process to train our TxT model.

Conventional approaches to building such a pipeline would normally set up two separate clusters, one dedicated to big data processing and the other dedicated to deep learning training (e.g., a GPU cluster). Unfortunately, this not only introduces a lot of overhead for data transfer, but also requires extra effort to manage separate workflows and systems in production. In addition, while popular deep learning frameworks [1, 21, 5] and Horovod [24] from Uber provide support for data-parallel distributed training (using either a parameter server architecture [18] or MPI [9] based AllReduce), it can be very complex to properly set them up in production. For instance, these methods usually require all relevant Python packages to be pre-installed on every node and the master node to have SSH permission to all the other nodes, which is probably infeasible for the production environment and also inconvenient for cluster management.

To address these challenges, we propose and implement a unified system (i.e. RayOnSpark) using Spark and Ray [20], which runs the end-to-end data processing and deep learning training pipeline on the same big data cluster. This unified system architecture has greatly improved the end-to-end efficiency of our production applications, and it is open sourced as a part of the project [7].

Figure 3: RayOnSpark architecture overview.

4.2 RayOnSpark. On a shared Hadoop YARN [28] cluster, we first use Spark [29] to process the data; then we launch a Ray [20] cluster alongside Spark using RayOnSpark to run distributed deep learning training directly on top of the YARN cluster.

We design RayOnSpark for users to run distributed Ray applications on big data clusters. Figure 3 illustrates the architecture of RayOnSpark. In the Spark implementation, a Spark program runs on the driver node and creates a SparkContext object, which is responsible for launching multiple Spark executors on the YARN cluster to run Spark jobs. In RayOnSpark, the Spark driver program can also create a RayContext object, which will automatically launch Ray processes alongside each Spark executor; it will also create a RayManager inside each Spark executor to manage the Ray processes (e.g., automatically shutting down the processes when the program exits). As a result, one can directly write Ray code inside the Spark program. The processed Spark in-memory DataFrames or Resilient Distributed Datasets (RDDs) [29] can be directly fed into the Ray cluster through the Plasma object store used by Ray for distributed training.

We choose Apache MXNet [5] as our deep learning framework. On top of RayOnSpark, we implement a lightweight shim layer to automatically deploy distributed MXNet training on the underlying YARN cluster. Similar to RaySGD [19], each MXNet worker (or parameter server) runs as a Ray actor [20], and they directly communicate with each other via the distributed key-value store provided natively by MXNet. In this way, users are relieved from managing the complex steps of distributed training through the scikit-learn [22] style APIs provided by the shim layer. In addition, the shim layer also provides just-in-time Python package deployment across the cluster, so that the user no longer needs to install the runtime environment for each application on each node beforehand, and the cluster environment remains clean after the job finishes. The same methodology can be easily applied to other deep learning frameworks such as TensorFlow [1] and PyTorch [21].
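To illustrate the worker-as-Ray-actor pattern described above, here is a conceptual sketch. It is not the shim-layer code from [7]; the class name and training body are placeholders where a real worker would build the MXNet model and synchronize parameters through MXNet's distributed key-value store.

    import ray

    ray.init(ignore_reinit_error=True)   # in RayOnSpark this is handled by RayContext

    @ray.remote
    class TrainingWorker:
        def __init__(self, partition):
            # In the real pipeline the partition comes from Ray's Plasma object store.
            self.partition = partition

        def train_epoch(self):
            # Placeholder for a framework-specific training loop (e.g. MXNet).
            return {"records_seen": len(self.partition)}

    workers = [TrainingWorker.remote(part) for part in (["order-1"], ["order-2", "order-3"])]
    print(ray.get([w.train_epoch.remote() for w in workers]))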
4.3 Model Training Pipeline. With RayOnSpark, we simplify our end-to-end model training process by combining the Spark ETL job and the distributed MXNet training job into one unified pipeline. Our model training process starts by launching Spark tasks to extract our restaurant transaction data stored on distributed file systems. We then perform data cleaning and preprocessing steps on the YARN cluster. Once the Spark job has completed, the Spark driver program launches Ray on the same YARN cluster and dynamically distributes the MXNet packages across the cluster. RayOnSpark helps to set up the distributed deep learning environment and launches MXNet processes alongside the Spark executors; every MXNet worker takes its partition of the preprocessed data from Ray's Plasma object store on its local node and starts model training. The trained model is then saved with a version tag for online serving.

Listing 1 shows the code segments of our distributed MXNet training pipeline. As described in Section 4.2, we first initiate the SparkContext on YARN and use Spark to do the data processing, followed by initiating the RayContext to launch Ray on the same cluster. We implement a scikit-learn [22] style Estimator for MXNet which takes the MXNet model, loss, metrics and training config (e.g., the number of workers and parameter servers) as inputs. The MXNet Estimator performs the distributed MXNet training on Ray by directly fitting Spark DataFrames or RDDs [29] to the given model and loss. In our implementation, only several extra lines of code are needed to scale the training pipeline from a single node to production clusters.
    sc = init_spark_on_yarn(...)
    # Use SparkContext to load data as a Spark DataFrame or RDD, and do data processing.

    ray_ctx = RayContext(sc)
    ray_ctx.init()

    mxnet_estimator = Estimator(model, loss, metrics, config)
    mxnet_estimator.fit(train_rdd, val_rdd, epochs, batch_size)

Listing 1: Code sample for the data processing and training pipeline with RayOnSpark.

This unified software architecture allows big data processing and deep learning training workloads to run on the same big data cluster. Consequently, we only need to maintain one big data cluster for the entire AI pipeline, with no extra data transfer across different clusters. This achieves full utilization of the cluster resources and significantly improves the end-to-end performance of the whole system.

5 Experiment Results

In this section, we experimentally evaluate the TxT model through both offline evaluation and online A/B testing in Burger King's production environment.

5.1 Setup.

Dataset. The dataset is constructed from Burger King's customer drive-thru transaction records of the past year. The historical data of the first 11 months is used as training data and the last month is used for validation. An add-to-cart event sequence is formed for each order from the transactions, and the last item in the sequence is used for prediction. We select weather descriptions, order time, current temperature and a number of restaurant attributes including identifier and location as context features.

Baselines. To evaluate the effectiveness of the TxT model, we select several of its variations as baseline models for comparison, i.e. RNN [12], RNN Latent Cross [3], and a variation of the ItemCF model [23] with multiplicative context features (Contextual ItemCF). Table 1 summarizes whether these models consider the order sequence and context features.

Table 1: Model variations for comparison.

    Model               Sequence Input   Context Inputs
    RNN                 X
    RNN Latent Cross    X                X
    Contextual ItemCF                    X
    TxT                 X                X

Evaluation metrics. For offline measurement, we use Top1 and Top3 accuracy to evaluate the performance of different models, since for drive-thru, three food recommendations are shown simultaneously to the guest on the digital menu board, with the first recommendation highlighted. For online measurement, we conducted online A/B testing in our production environment and use conversion gain (observed recommendation conversion uplift over the control group) and add-on sales gain (observed add-on sales uplift over the control group) to evaluate the model performance.

Environment and Configurations. Our model is implemented with Python 3.6 and the MKL-DNN optimized version of Apache MXNet (i.e. mxnet-mkl) 1.6. Based on our experimental results shown in Figure 4, we choose Adam [15] as the optimizer for all models. The training configurations of TxT are shown in Table 2. Other baseline models use compatible settings corresponding to the TxT configurations.

Figure 4: Training loss of TxT with different optimizers.
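As a concrete reading of the offline metrics defined above, the sketch below computes Top-k accuracy from model scores; the array shapes and variable names are assumptions for illustration.

    import numpy as np

    def top_k_accuracy(scores, true_items, k):
        # scores: (num_orders, num_items) predicted item probabilities;
        # true_items: (num_orders,) index of the actual next item in each order.
        top_k = np.argsort(-scores, axis=1)[:, :k]
        hits = (top_k == true_items[:, None]).any(axis=1)
        return hits.mean()

    # top_k_accuracy(predictions, labels, 1) and top_k_accuracy(predictions, labels, 3)
    # correspond to the Top1 and Top3 accuracy columns reported in Table 3.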
5.2 Comparative Analysis. We conducted an offline comparison of model training using the above evaluation metrics and configurations. We also conducted online A/B testing of different recommendation systems for 8 weeks. We share our results in this subsection.
Table 2: Training configurations of TxT.

                          Sequence Transformer   Context Transformer
    Embedding size        100                    100
    Head number           4                      2
    With Mask             True                   False
    Transformer layers    1                      1
    Sequence length       5
    Batch size            512
    Epochs                1
    Learning rate         0.001

Table 3: Comparison of offline training results.

    Model               Top1 Accuracy   Top3 Accuracy
    RNN                 29.98%          46.24%
    Contextual ItemCF   32.18%          48.37%
    RNN Latent Cross    33.10%          49.98%
    TxT                 34.52%          52.37%

Figure 5: Training loss of model variations.

As shown in Figure 5 and Table 3, when only sequence information (using RNN) or only context information (using Contextual ItemCF) is considered, we observe significantly higher cross entropy loss during training and lower validation accuracy compared with models that take both into consideration (RNN Latent Cross and TxT). Compared with RNN Latent Cross, TxT is able to further improve the Top1 and Top3 accuracy by around 1.4% and 2.4% respectively.

For online measurement, due to time constraints and business requirements, we divided the online comparison into two phases. In the first phase, we measured TxT against an existing blackbox recommendation system provided by an outside vendor that we had previously been using. We randomly divided our pilot restaurants into two groups and ran both systems simultaneously for 8 weeks. As shown in Table 4, the TxT model improved the recommendation conversion rate for drive-thru by +79%. Production observation also shows that TxT is able to react to location and weather changes very well. In the second phase, we conducted online A/B testing of model variations in our drive-thru environment for 4 weeks. For the control group, 50% of the drive-thru transactions were randomly selected to use the RNN Latent Cross model, while the remaining transactions used the TxT model. It turned out that TxT was able to improve the conversion rate by 7.5% with a 4.7% add-on sales gain over the control group.

Table 4: Comparison of online A/B testing results.

    Model                                 Conversion Rate Gain   Add-on Sales Gain
    Phase 1
      Blackbox vendor's system (control)  -                      -
      TxT                                 +79%                   +14%
    Phase 2
      RNN Latent Cross (control)          -                      -
      TxT                                 +7.5%                  +4.7%
    Additional A/B Testing on Mobile APP
      Rule-based system (control)         -                      -
      Google Recommendation AI            +164%                  +64%
      TxT                                 +264%                  +137%

To further test TxT's performance, we also ran our recommendation system on Burger King's mobile application for 4 weeks, side by side with Google Recommendation AI, a state-of-the-art recommendation service provided by Google Cloud Platform. For the control group, we randomly selected 20% of the users and presented them with a previous recommendation system based on simple rules. As shown in Table 4, TxT improved recommendation conversion at the checkout page by 264% and add-on sales by 137% when compared with the control group. This also amounts to a +100% conversion gain and a +73% add-on sales gain when compared with the test groups running the Google Recommendation AI service.
6 Experience

We have deployed the end-to-end context-aware recommendation system in selected Burger King drive-thru restaurants for 10 months, and we have also deployed our TxT model on Burger King's mobile application for product recommendation at check-out. In this section, we share some key learnings and insights we have gained while building and deploying our recommendation system.

In drive-thru or other offline customer interaction channels where limited user data can be obtained, the guest order sequence and different context features can be exploited for recommendation. Transformer encoder and mean-max pooling architectures can extract the feature representations of these inputs well. Using our TxT model, we have seen a 79% improvement in real-world conversion rate for drive-thru compared with the existing vendor's recommendation solution. In addition, our TxT model has also proven to be effective when applied to online settings. The preliminary results show that TxT outperforms other test groups using out-of-the-box solutions such as Google Recommendation AI on both recommendation conversion rate and add-on sales.

In the real production environment, building a unified end-to-end system for the entire pipeline is critical to saving time, cost and effort. Previously we used a separate GPU cluster for model training, and nearly 20% of the total time was spent on data transfer to the GPU cluster. After adopting RayOnSpark described in Section 4.2, we find it far more productive to combine big data processing and deep learning tasks in a single end-to-end pipeline. Using the implementation with high-level APIs in project [7], engineers who already have Spark experience can quickly learn how to inject AI workloads into the existing clusters in a distributed fashion.

Our drive-thru recommendation system needs to be able to provide recommendations with sub-millisecond latency to all Burger King restaurants across different locations. As network conditions vary from restaurant to restaurant, we design our system in such a way that it can be deployed either to the cloud or on-premise to ensure high availability for our drive-thru guests. Figure 6 shows how we manage our models using a Docker registry; both the cloud and on-premise systems can pull the latest model images from the Docker registry to run recommendation inference. This setup, combined with the TxT model's location awareness, eliminates the need to manage different models across multiple restaurant regions. Self-contained model servers can also be deployed to any environment with Docker installed.

Figure 6: Model Serving Architecture.
7 Conclusion

In this paper, we introduced the context-aware drive-thru recommendation system implemented at Burger King. Our Transformer Cross Transformer model (TxT) has been proven to be effective in the drive-thru setting for modeling user order behavior and complex context features. Experimental results show that TxT outperforms other existing recommendation solutions in Burger King's production environment. We further implemented a unified system based on Spark and Ray to easily run data processing and deep learning training tasks on the same big data cluster. Our solution has been open sourced in [7], and we shared our practical experience of building such an end-to-end system, which could be easily applied to other customer interaction channels.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.

[2] G. Adomavicius and A. Tuzhilin, Context-aware recommender systems, in Recommender Systems Handbook, Springer, 2011, pp. 217–253.

[3] A. Beutel, P. Covington, S. Jain, C. Xu, J. Li, V. Gatto, and E. H. Chi, Latent Cross: Making use of context in recurrent recommender systems, in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 46–54.

[4] Q. Chen, H. Zhao, W. Li, P. Huang, and W. Ou, Behavior sequence transformer for e-commerce recommendation in Alibaba, in Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, 2019, pp. 1–4.

[5] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint arXiv:1512.01274, 2015.

[6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078, 2014.

[7] J. Dai, Analytics Zoo, https://github.com/intel-analytics/analytics-zoo, 2018.
[8] T. Donkers, B. Loepp, and J. Ziegler, Sequential user-based recurrent neural network recommendations, in Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 152–160.

[9] R. L. Graham, G. M. Shipman, B. W. Barrett, R. H. Castain, G. Bosilca, and A. Lumsdaine, Open MPI: A high-performance, heterogeneous MPI, in 2006 IEEE International Conference on Cluster Computing, IEEE, 2006, pp. 1–9.

[10] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, Neural collaborative filtering, in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 173–182.

[11] B. Hidasi and A. Karatzoglou, Recurrent neural networks with top-k gains for session-based recommendations, in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018.

[12] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, Session-based recommendations with recurrent neural networks, CoRR, abs/1511.06939, 2016.

[13] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 9 (1997), pp. 1735–1780.

[14] W.-C. Kang and J. McAuley, Self-attentive sequential recommendation, in 2018 IEEE International Conference on Data Mining (ICDM), IEEE, 2018, pp. 197–206.

[15] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.

[16] Y. Koren, R. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, Computer, 42 (2009), pp. 30–37.

[17] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma, Neural attentive session-based recommendation, in Proceedings of the 2017 ACM Conference on Information and Knowledge Management, 2017, pp. 1419–1428.

[18] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, Scaling distributed machine learning with the parameter server, in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 583–598.

[19] R. Liaw, Faster and cheaper PyTorch with RaySGD, https://medium.com/distributed-computing-with-ray/faster-and-cheaper-pytorch-with-raysgd-a5a44d4fd220, 2020.

[20] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al., Ray: A distributed framework for emerging AI applications, in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 561–577.

[21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic differentiation in PyTorch, NIPS 2017 Autodiff Workshop, 2017.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 12 (2011), pp. 2825–2830.

[23] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, Item-based collaborative filtering recommendation algorithms, in Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 285–295.

[24] A. Sergeev and M. Del Balso, Horovod: Fast and easy distributed deep learning in TensorFlow, arXiv preprint arXiv:1802.05799, 2018.

[25] G. Shani, D. Heckerman, and R. I. Brafman, An MDP-based recommender system, Journal of Machine Learning Research, 6 (2005), pp. 1265–1295.

[26] B. Twardowski, Modelling contextual information in session-aware recommender systems with neural networks, in Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 273–276.

[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[28] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al., Apache Hadoop YARN: Yet another resource negotiator, in Proceedings of the 4th Annual Symposium on Cloud Computing, 2013, pp. 1–16.

[29] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012, pp. 15–28.

[30] M. Zhang, Y. Wu, W. Li, and W. Li, Learning universal sentence representations with mean-max attention autoencoder, arXiv preprint arXiv:1809.06590, 2018.