Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism
Hao Wang 1,*   Doyen Sahoo 3,*   Chenghao Liu 2   Ke Shu 2   Palakorn Achananuparp 2   Ee-peng Lim 2   Steven C. H. Hoi 2,3
1 Nanyang Technological University   2 Singapore Management University   3 Salesforce Research Asia
hao005@e.ntu.edu.sg   dsahoo@salesforce.com   {chliu, keshu, palakorna, eplim, chhoi}@smu.edu.sg
* Work done at Singapore Management University.

arXiv:2003.03955v1 [cs.CV] 9 Mar 2020

Abstract

Cross-modal food retrieval is an important task for the analysis of food-related information, such as food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, so that precise matching can be realized. Compared with existing cross-modal retrieval approaches, two major challenges in this specific problem are: 1) the large intra-class variance across cross-modal food data; and 2) the difficulty of obtaining discriminative recipe representations. To address these problems, we propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities by aligning output semantic probabilities. In addition, we exploit a self-attention mechanism to improve the embedding of recipes. We evaluate the performance of the proposed method on the large-scale Recipe1M dataset, and the results show that it outperforms the state-of-the-art.

1. Introduction

In recent years, cross-modal retrieval has drawn much attention with the rapid growth of multimodal data. In this paper we address the problem of cross-modal food retrieval based on a large-scale heterogeneous food dataset [17], i.e., taking cooking recipes (ingredients & cooking instructions) as the query to retrieve food images, and vice versa.

There have been many works [17, 2, 20] on cross-modal food retrieval. Despite those efforts, cross-modal food retrieval remains challenging mainly for two reasons: 1) the large intra-class variance across food data pairs; and 2) the difficulty of obtaining discriminative recipe representations.

In cross-modal food data, a given recipe corresponds to many food images, cooked by different chefs. Besides, the images from different recipes can look very similar because they have similar ingredients. Hence, the data representations of the same food can be different, while different foods may have similar data representations. This leads to large intra-class variance but small inter-class variance in food data. Existing studies [2, 5, 4] only address the small inter-class variance problem by utilizing a triplet loss to measure the similarities among cross-modal data. Specifically, the objective of the triplet loss is to make the inter-class feature distance larger than the intra-class feature distance by a predefined margin [6]. Therefore, cross-modal instances from the same class may form a loose cluster with a large average intra-class distance. As a consequence, this eventually results in less-than-optimal ranking, i.e., irrelevant images are closer to the queried recipe than relevant images. See Figure 1 for an example.

Besides, many recipes share common ingredients across different foods. For instance, fruit salad has ingredients such as apple, orange and sugar, where apple and orange are the main ingredients, while sugar is an ingredient of many other foods. If the embeddings of ingredients are treated equally during training, the features learned by the model may not be discriminative enough. Cooking instructions crawled from cooking websites also tend to be noisy; some instructions turn out to be irrelevant to cooking, e.g., "Enjoy!", which convey no information for cooking instruction features but degrade the performance of the cross-modal retrieval task. In order to find the attended ingredients, Chen et al. [4] apply a two-layer deep attention mechanism, which learns joint features by locating the visual food regions that correspond to ingredients. However, this method relies on high-quality food images and substantially increases the computational complexity.

To resolve those issues, we propose a novel unified framework, the Semantic-Consistent and Attention-based Network (SCAN), to improve cross-modal food retrieval performance. The pipeline of the framework is shown in Figure 2. To reduce the intra-class variance, we introduce a semantic consistency loss, which imposes a Kullback-Leibler (KL) divergence to minimize the distance between the output semantic probabilities of paired images and recipes. In order to obtain discriminative recipe representations, we combine a self-attention mechanism [19] with LSTMs to find the key ingredients and cooking instructions of each recipe. Without requiring food images or adding extra layers, we can learn more discriminative recipe embeddings than those trained with a plain LSTM.
Figure 1. Recipe-to-image retrieval ranking results. Taking an apple salad recipe (ingredients and cooking instructions) as the query, we show the retrieval results based on Euclidean distance (the numbers indicated in the figure) for three food images with large intra-class variance and small inter-class variance, i.e., the apple salad images look quite different from each other, while the chicken salad image is more similar to an apple salad image. We rank the retrieved results using (i) the vanilla triplet loss and (ii) our proposed SCAN model. The vanilla triplet loss outputs a wrong ranking order, while SCAN provides a more precise ranking.

Our work makes two major contributions as follows:

• We introduce a semantic consistency loss to the cross-modal food retrieval task, and the results show that it can align cross-modal matching pairs and reduce the intra-class variance of food data representations.

• We integrate a self-attention mechanism with LSTM and learn discriminative recipe features without requiring the food images. This is useful for discriminating samples with similar recipes.

We perform an extensive experimental analysis on Recipe1M, which is the largest publicly available cross-modal food dataset. We find that our proposed cross-modal food retrieval approach SCAN outperforms state-of-the-art methods. Finally, we show some visualizations of the retrieved results.

2. Related Work

Our work is closely related to cross-modal retrieval. Canonical Correlation Analysis (CCA) [12] is a representative work in cross-modal data representation learning; it utilizes global alignment to map semantically similar data from different modalities close to each other in the common space. Many recent works [18, 7] utilize deep architectures for cross-modal retrieval, which have the advantage of capturing complex non-linear cross-modal correlations. With the advent of generative adversarial networks (GANs) [9], which are helpful in modeling data distributions, adversarial training methods [20, 8] are frequently used for modality fusion.

For cross-modal food retrieval, [3, 16] are early works. [3] uses a vanilla attention mechanism with extra learnable layers, and only tests the model on a small-scale dataset. [16] utilizes a multi-modal Deep Boltzmann Machine for recipe-image retrieval.

[4, 5] both integrate attention mechanisms into cross-modal retrieval. [4] introduces a stacked attention network (SAN) to learn a joint space from images and recipes for cross-modal retrieval. Subsequently, [5] makes full use of the ingredient, cooking instruction and title (category label) information of Recipe1M, and concatenates the three types of features to construct the recipe embeddings. Compared with the self-attention mechanism we adopt in our model, these two works increase the computational complexity by depending on the food images or adding extra learnable layers and weights.

In order to better regularize the learning of the shared representation space, some researchers incorporate semantic labels into the joint training. [17] develops a hybrid neural network architecture with a cosine embedding loss for retrieval learning and an auxiliary cross-entropy loss for classification, so that a joint common space for image and recipe embeddings can be learned for cross-modal retrieval.
[2] is an extended version of [17], providing a double-triplet strategy to express both the retrieval loss and the classification loss. [20] preserves the semantic information of the learned cross-modal features with generative approaches. Here, we include a KL loss in the training and thereby preserve the semantic information across image-recipe pairs. We also adopt a self-attention mechanism for discriminative feature representations and achieve better performance.

3. Proposed Methods

In this section, we introduce our proposed model, where we utilize food image-recipe paired data to learn cross-modal embeddings, as shown in Figure 2.

Figure 2. Our proposed framework for the cross-modal retrieval task. We have two branches to encode food images and recipes respectively. One embedding function E_I is designed to extract food image representations I, where a CNN is used. The other embedding function E_R is composed of two LSTMs with self-attention, designed to obtain discriminative recipe representations R. I and R are fed into the retrieval loss (triplet loss) L_Ret for cross-modal retrieval learning. We add another FC transformation on I and R, with the output dimensionality equal to the number of food categories, to obtain the semantic probabilities p^img and p^rec, on which we apply the semantic consistency loss L_SC to correlate food image and recipe data.

3.1. Overview

We formulate the proposed cross-modal food retrieval model with three networks, i.e., one convolutional neural network (CNN) for food image embeddings, and two LSTMs to encode ingredients and cooking instructions respectively. The food image representation I is obtained from the output of the CNN directly, while the recipe representation R comes from the concatenation of the ingredient features f_ingredient and the instruction features f_instruction. Specifically, to obtain discriminative ingredient and instruction embeddings, we integrate the self-attention mechanism [19] into the LSTM embedding. The triplet loss is used as the main loss function L_Ret to map cross-modal data to the common space, and the semantic consistency loss L_SC is utilized to align cross-modal matching pairs for the retrieval task, reducing the intra-class variance of food data. The overall objective function of the proposed SCAN is:

    L = L_{Ret} + \lambda L_{SC}.    (1)

3.2. Recipe Embedding

We use two LSTMs to get the ingredient and instruction representations f_ingredient and f_instruction, concatenate them, and pass the result through a fully-connected layer to give a 1024-dimensional feature vector as the recipe representation R.

3.2.1 Ingredient Representation Learning

Instead of word-level word2vec representations, ingredient-level word2vec representations are used in the ingredient embedding. To be specific, ground ginger is regarded as a single word vector, instead of two separate word vectors for ground and ginger.

We integrate the self-attention mechanism with the LSTM output to construct recipe embeddings. The purpose of applying the self-attention model lies in assigning higher weights to the main ingredients of different food items, making the attended ingredients contribute more to the ingredient embedding while reducing the effect of common ingredients.

Given an ingredient input {z_1, z_2, ..., z_n}, we first encode it with pretrained embeddings from the word2vec algorithm to obtain the ingredient representations Z_t. Then {Z_1, Z_2, ..., Z_n} are fed into a one-layer bidirectional LSTM as a sequence, step by step. For each step t, the recurrent network takes the ingredient vector Z_t and the output of the previous step h_{t-1} as input, and produces the current step output h_t by a non-linear transformation, as follows:

    h_t = \tanh(W Z_t + U h_{t-1} + b).    (2)

The bidirectional LSTM consists of a forward hidden state \overrightarrow{h_t}, which processes ingredients from Z_1 to Z_n, and a backward hidden state \overleftarrow{h_t}, which processes ingredients from Z_n to Z_1. We obtain the representation h_t of each ingredient z_t by concatenating \overrightarrow{h_t} and \overleftarrow{h_t}, i.e., h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}], so that the representation of the ingredient list of each food item is H = {h_1, h_2, ..., h_n}.

We further measure the importance of the ingredients in a recipe with the self-attention mechanism studied in the Transformer [19], where the input comes from queries Q and keys K of dimension d_k, and values V of dimension d_v (the definitions of Q, K and V can be found in [19]). We compute the attention output as:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V.    (3)

Different from earlier attention-based methods [5], we use a self-attention mechanism where all of the keys, values and queries come from the same ingredient representation H. Therefore, the computational complexity is reduced, since it is not necessary to add extra layers to train attention weights. The ingredient attention output H_attn can be formulated as:

    H_{attn} = \mathrm{Attention}(H, H, H) = \mathrm{softmax}\left(\frac{H H^T}{\sqrt{d_h}}\right) H,    (4)

where d_h is the dimension of H. In order to enable unimpeded information flow for the recipe embedding, skip connections are used in the attention model. Layer normalization [1] is also used, since it is effective in stabilizing the hidden state dynamics in recurrent networks. The final ingredient representation f_ingredient is generated from the summation of H and H_attn, which can be defined as:

    f_{ingredient} = \mathrm{LayerNorm}(H_{attn} + H).    (5)
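To make Eqs. (2)-(5) concrete, here is a minimal PyTorch-style sketch of the ingredient encoder (an illustration under assumptions, not the authors' released implementation). The vocabulary size, embedding dimension and hidden size are placeholder values, and details the text does not specify, such as masking of padded ingredients and how the per-ingredient outputs are finally pooled into a single f_ingredient vector, are omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class IngredientEncoder(nn.Module):
    """Bidirectional LSTM over ingredient-level word2vec embeddings (Eq. 2),
    followed by parameter-free self-attention with Q = K = V = H (Eqs. 3-4)
    and a skip connection with layer normalization (Eq. 5).

    Assumed sizes (not from the paper): emb_dim=300, hidden=512, so each
    concatenated step h_t = [forward h_t, backward h_t] has d_h = 1024 dims.
    """
    def __init__(self, vocab_size=30000, emb_dim=300, hidden=512, w2v_weights=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        if w2v_weights is not None:                 # pretrained ingredient-level word2vec
            self.embed.weight.data.copy_(w2v_weights)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)
        self.scale = math.sqrt(2 * hidden)          # sqrt(d_h)

    def forward(self, ingredient_ids):              # (batch, n) ingredient token ids
        Z = self.embed(ingredient_ids)              # (batch, n, emb_dim)
        H, _ = self.lstm(Z)                         # (batch, n, d_h): H = {h_1, ..., h_n}
        scores = torch.matmul(H, H.transpose(1, 2)) / self.scale   # H H^T / sqrt(d_h)
        attn = F.softmax(scores, dim=-1)
        H_attn = torch.matmul(attn, H)              # Attention(H, H, H), Eq. (4)
        return self.norm(H_attn + H)                # per-ingredient f_ingredient, Eq. (5)
```

Note that no extra attention weights are trained in this block, which is the computational advantage the paper emphasizes over the attention models in [4, 5].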
3.2.2 Instruction Representation Learning

Considering that cooking instructions are composed of a sequence of variable-form and lengthy sentences, we compute the instruction embedding with a two-stage LSTM model. For the first stage, we apply the same approach as [17] to obtain the representation of each instruction sentence, which uses skip-instructions [17] with the technique of skip-thoughts [14].

The next stage is similar to the ingredient representation learning. We feed the pre-computed fixed-length instruction sentence representations into the LSTM model to generate the hidden representation of each cooking instruction sentence. Based on that, we can obtain the self-attention representation. The final instruction feature f_instruction is generated by applying layer normalization to the sum of the previous two representations, as formulated in the last section. By doing so, we are able to find the key sentences in the cooking instructions. Some visualizations of attended ingredients and cooking instructions can be found in Section 4.7.

3.3. Image Embedding

We use ResNet-50 [10] pretrained on ImageNet to encode food images. We empirically remove the last layer, which is used for ImageNet classification, and add an extra FC layer to obtain the final food image features I. The dimension of the features I is 1024.

3.4. Cross-modal Food Retrieval Learning

The triplet loss is utilized for retrieval learning; the objective function is:

    L_{Ret} = \sum_{V} \left[ d(I_a, R_p) - d(I_a, R_n) + \alpha \right]_+ + \sum_{R} \left[ d(R_a, I_p) - d(R_a, I_n) + \alpha \right]_+,    (6)

where d(·) is the Euclidean distance, the subscripts a, p and n refer to anchor, positive and negative samples respectively, and α is the margin. To improve the effectiveness of training, we adopt the BatchHard idea proposed in [11]. Specifically, in a mini-batch, given an anchor sample, we simply select the most distant positive instance and the closest negative instance to construct the triplet.
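As a hedged sketch of Eq. (6) with BatchHard mining [11] (not the authors' code): the margin value below is an assumption, and `labels` is a hypothetical tensor marking which image-recipe items count as positives of one another (e.g., by pair identity or by food category; the paper does not spell out the mining granularity).

```python
import torch
import torch.nn.functional as F

def batchhard_triplet_loss(img_emb, rec_emb, labels, margin=0.3):
    """Bidirectional triplet loss of Eq. (6) with BatchHard mining [11].

    img_emb, rec_emb: (B, D) embeddings of paired food images and recipes.
    labels:           (B,)   items sharing a label are treated as positives.
    margin:           the margin alpha (0.3 is an assumed value).
    """
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)            # (B, B) positive pairs

    def hard_mined(anchor, gallery):
        dist = torch.cdist(anchor, gallery)                          # Euclidean distances
        d_pos = (dist * pos_mask.float()).max(dim=1).values          # hardest (farthest) positive
        d_neg = dist.masked_fill(pos_mask, float('inf')).min(dim=1).values  # hardest (closest) negative
        return F.relu(d_pos - d_neg + margin).mean()

    # image-as-anchor term plus recipe-as-anchor term
    return hard_mined(img_emb, rec_emb) + hard_mined(rec_emb, img_emb)
```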
3.5. Semantic Consistency

Given a pair of food image and recipe representations I, R, we first transform I and R into I' and R' for classification with an extra FC layer. The dimension of the output is the same as the number of categories N. The probabilities of food category i can be computed by a softmax activation as:

    p_i^{img} = \frac{\exp(I'_i)}{\sum_{j=1}^{N} \exp(I'_j)},    (7)

    p_i^{rec} = \frac{\exp(R'_i)}{\sum_{j=1}^{N} \exp(R'_j)},    (8)

where N represents the total number of food categories, and it is also the dimension of I' and R'. Given i ∈ {1, 2, ..., N}, with the probabilities p_i^{img} and p_i^{rec} for each food category, the predicted labels l^{img} and l^{rec} can be obtained. We formulate the classification (cross-entropy) loss as L_cls(p^{img}, p^{rec}, c^{img}, c^{rec}), where c^{img} and c^{rec} are the ground-truth class labels for the food image and the recipe respectively.

In the previous work [17], L_cls(p^{img}, p^{rec}, c^{img}, c^{rec}) consists of L_cls^{img} and L_cls^{rec}, which are treated as two independent classifiers, focusing on regularizing the embeddings from food images and recipes separately. However, food image and recipe embeddings come from heterogeneous modalities, so the output probabilities of each category can be significantly different. As a result, the distance between intra-class features remains large. To improve image-recipe matching and make the probabilities predicted by the different classifiers consistent, we minimize the Kullback-Leibler (KL) divergence between the probabilities p^{img} and p^{rec} of paired cross-modal data, which can be formulated as:

    L_{KL}(p^{img} \| p^{rec}) = \sum_{i=1}^{N} p_i^{img} \log \frac{p_i^{img}}{p_i^{rec}},    (9)

    L_{KL}(p^{rec} \| p^{img}) = \sum_{i=1}^{N} p_i^{rec} \log \frac{p_i^{rec}}{p_i^{img}}.    (10)

The overall semantic consistency loss L_SC is defined as:

    L_{SC} = \frac{1}{2}\left[ \left( L_{cls}^{img} + L_{KL}(p^{rec} \| p^{img}) \right) + \left( L_{cls}^{rec} + L_{KL}(p^{img} \| p^{rec}) \right) \right].    (11)
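Eqs. (7)-(11) translate almost directly into code. The sketch below (assumed PyTorch, not the authors' implementation) computes L_SC from the classification logits of a paired mini-batch; the overall objective of Eq. (1) would then be formed as, e.g., `loss = l_ret + 0.05 * l_sc`, using the λ reported in Section 4.3.

```python
import torch.nn.functional as F

def semantic_consistency_loss(img_logits, rec_logits, img_labels, rec_labels):
    """L_SC of Eq. (11): per-modality cross-entropy (L_cls) plus the symmetric
    KL terms of Eqs. (9)-(10) between the paired softmax outputs, averaged."""
    p_img = F.softmax(img_logits, dim=1)                    # p^img, Eq. (7)
    p_rec = F.softmax(rec_logits, dim=1)                    # p^rec, Eq. (8)
    log_p_img = F.log_softmax(img_logits, dim=1)
    log_p_rec = F.log_softmax(rec_logits, dim=1)

    ce_img = F.cross_entropy(img_logits, img_labels)        # L_cls^img
    ce_rec = F.cross_entropy(rec_logits, rec_labels)        # L_cls^rec

    # F.kl_div(log_q, p) computes KL(p || q)
    kl_img_rec = F.kl_div(log_p_rec, p_img, reduction='batchmean')  # KL(p^img || p^rec), Eq. (9)
    kl_rec_img = F.kl_div(log_p_img, p_rec, reduction='batchmean')  # KL(p^rec || p^img), Eq. (10)

    return 0.5 * ((ce_img + kl_rec_img) + (ce_rec + kl_img_rec))
```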
4. Experiments

4.1. Dataset

We conduct extensive experiments to evaluate the performance of our proposed methods on the Recipe1M dataset [17], the largest cooking dataset with recipe and food image pairs available to the public. Recipe1M was scraped from over 24 popular cooking websites, and it not only contains image-recipe paired labels but also provides semantic category labels, extracted from the food titles on the websites, for more than half of the food data. The category labels provide semantic information for the cross-modal retrieval task, making the dataset fit our proposed method well. The paired labels and category labels construct hierarchical relationships among the food: one food category (e.g., fruit salads) may contain hundreds of different food pairs, since there are a number of recipes for different fruit salads.

We perform the cross-modal food retrieval task based on food data pairs, i.e., when we take a recipe as the query, the ground truth is the food image from its food data pair, and vice versa. We use the original Recipe1M data split [17], containing 238,999 image-recipe pairs for training, and 51,119 and 51,303 pairs for validation and test, respectively. In total, the dataset has 1,047 categories.

4.2. Evaluation Protocol

We evaluate our proposed model with the same metrics used in prior works [17, 5, 2, 20]. To be specific, the median retrieval rank (MedR) and recall at top K (R@K) are used. MedR measures the median rank position at which the true positives are returned; therefore, higher performance comes with a lower MedR score. Given a food image, R@K calculates the fraction of times that the correct recipe is found within the top-K retrieved candidates, and vice versa. Different from MedR, the performance is directly proportional to the R@K score. In the test phase, we first sample 10 different subsets of 1,000 pairs (1k setup) and 10 different subsets of 10,000 pairs (10k setup), the same setting as in [17]. We then consider each item from the food image modality in a subset as a query, and rank samples from the recipe modality according to the L2 distance between the embedding of the image and that of the recipe, which serves as image-to-recipe retrieval, and vice versa for recipe-to-image retrieval.
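For concreteness, here is a small sketch (not the authors' evaluation script) of how MedR and R@K can be computed on one sampled subset, assuming that row i of the two embedding matrices forms a matching image-recipe pair and that ranks are 1-based. Under the protocol above this would be repeated for the 10 sampled subsets and for both retrieval directions.

```python
import numpy as np

def retrieval_metrics(query_emb, gallery_emb, ks=(1, 5, 10)):
    """MedR and R@K for one subset (e.g. 1,000 pairs): query i's ground
    truth is gallery item i, and ranking uses the L2 distance."""
    # squared-distance expansion avoids building an (N, N, D) tensor
    q2 = (query_emb ** 2).sum(axis=1, keepdims=True)        # (N, 1)
    g2 = (gallery_emb ** 2).sum(axis=1)[None, :]            # (1, N)
    dist = np.sqrt(np.maximum(q2 + g2 - 2.0 * query_emb @ gallery_emb.T, 0.0))

    order = np.argsort(dist, axis=1)                        # gallery indices, closest first
    n = dist.shape[0]
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])

    medr = float(np.median(ranks))                          # median rank of the true match
    recalls = {k: float((ranks <= k).mean()) for k in ks}   # R@K
    return medr, recalls
```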
4.3. Implementation Details

We set the trade-off parameter λ in Eq. (1) based on empirical observations, where we tried a range of values and evaluated the performance on the validation set; we set λ to 0.05. The model was trained using the Adam optimizer [13] with a batch size of 64 in all our experiments. The initial learning rate is set to 0.0001, and the learning rate is decayed by a factor of 0.1 at the 30th epoch. Note that we update the two sub-networks, i.e., the image encoder E_I and the recipe encoder E_R, alternately. It only takes 40 epochs to reach the best performance with our proposed method, while [17] requires 220 epochs to converge. Our training records can be viewed in Figure 3. We run our experiments on a single Tesla V100 GPU, and training takes about 16 hours.

Table 1. Main Results. Evaluation of the performance of our proposed method compared against the baselines. The models are evaluated on the basis of MedR (lower is better) and R@K (higher is better).

Test Set   Method         Image-to-Recipe Retrieval         Recipe-to-Image Retrieval
                          medR↓   R@1↑   R@5↑   R@10↑       medR↓   R@1↑   R@5↑   R@10↑
1k         CCA [12]       15.7    14.0   32.0   43.0        24.8     9.0   24.0   35.0
           SAN [4]        16.1    12.5   31.1   42.3         -       -      -      -
           JE [17]         5.2    24.0   51.0   65.0         5.1    25.0   52.0   65.0
           AM [5]          4.6    25.6   53.7   66.9         4.6    25.7   53.9   67.1
           AdaMine [2]     1.0    39.8   69.0   77.4         1.0    40.2   68.1   78.7
           ACME [20]       1.0    51.8   80.2   87.5         1.0    52.8   80.2   87.6
           SCAN (Ours)     1.0    54.0   81.7   88.8         1.0    54.9   81.9   89.0
10k        JE [17]        41.9     -      -      -          39.2     -      -      -
           AM [5]         39.8     7.2   19.2   27.6        38.1     7.0   19.4   27.8
           AdaMine [2]    13.2    14.9   35.3   45.2        12.2    14.8   34.6   46.1
           ACME [20]       6.7    22.9   46.8   57.9         6.0    24.4   47.9   59.0
           SCAN (Ours)     5.9    23.7   49.3   60.6         5.1    25.3   50.6   61.6

Figure 3. The training records (Recall@1 over training epochs) of our proposed model SCAN and of each component of SCAN (TL, TL+SA, TL+cls, TL+SC).

4.4. Baselines

We compare the performance of our proposed methods with several state-of-the-art baselines, and the results are shown in Table 1.

CCA [12]: Canonical Correlation Analysis (CCA) is one of the most widely-used classic models for learning a common embedding from different feature spaces. CCA learns two linear projections for mapping text and image features to a common space that maximizes their feature correlation.

SAN [4]: The Stacked Attention Network (SAN) considers ingredients only (and ignores recipe instructions), and learns the feature space between ingredient and image features via a two-layer deep attention mechanism.

JE [17]: They use a pairwise cosine embedding loss to find a joint embedding (JE) between the different modalities. To impose regularization, they add classifiers to the cross-modal embeddings which predict the category of the given food data.

AM [5]: An attention mechanism (AM) over the recipe is adopted in [5], applied to different parts of a recipe (title, ingredients and instructions). They use an extra transformation matrix and context vector in the attention model.

AdaMine [2]: A double triplet loss is used, where the triplet loss is applied to both the joint embedding learning and the auxiliary classification task of categorizing the embedding into an appropriate category. They also integrate an adaptive learning schema (AdaMine) into the training phase, which performs adaptive mining of significant triplets.

ACME [20]: Adversarial training methods are utilized in ACME for modality alignment, to make the feature distributions from the different modalities similar. In order to further preserve the semantic information in the cross-modal food data representation, Wang et al. introduce a translation consistency component.

In summary, our proposed model SCAN is lightweight and effective, and it outperforms all earlier methods by a margin, as shown in Table 1.

Table 2. Ablation Studies. Evaluation of the benefits of different components of the SCAN framework. The models are evaluated on the basis of MedR (lower is better) and R@K (higher is better).

L_Ret           Component   medR↓   R@1↑   R@5↑   R@10↑
Triplet Loss    TL           2.0    47.5   76.2   85.1
                TL+SA        1.1    52.5   81.1   88.4
                TL+cls       1.7    48.5   78.0   85.5
                TL+SC        1.1    51.9   80.3   88.0
                SCAN         1.0    54.0   81.7   88.8

4.5. Ablation Studies

Extensive ablation studies are conducted to evaluate the effectiveness of each component of our proposed model.
Table 2 illustrates the contributions of the self-attention model (SA), the semantic consistency loss (SC) and their combination to improving image-to-recipe retrieval performance. TL serves as a baseline for SA; it adopts the BatchHard [11] training strategy. We then add SA and SC incrementally, and significant improvements can be found for both components. To be specific, integrating SA into TL improves the performance of image-to-recipe retrieval by more than 4% in R@1, illustrating the effectiveness of the self-attention mechanism in learning discriminative recipe representations. The model trained with the triplet loss and the classification loss (cls) used in [17] is another baseline, for SC. Our proposed semantic consistency loss improves the performance in R@1 and R@10 by more than 2%, which suggests that reducing the intra-class variance can be helpful in the cross-modal retrieval task.

We show the training records in Figure 3. In the left figure, we can see that for the first 20 epochs the performance gap between TL and TL+SA grows larger, while the performance of TL+cls and TL+SC stays similar, as shown in the middle figure. But for the last 20 training epochs, the performance of TL+SC improves significantly, which indicates that for those hard samples whose intra-class variance can hardly be reduced by TL+cls, TL+SC contributes further to the alignment of paired cross-modal data.

In conclusion, we observe that each of the proposed components improves the cross-modal embedding model, and the combination of those components yields better performance overall.

Figure 4. Recipe-to-image retrieval results on the Recipe1M dataset. We give an original recipe query from dessert, and then remove the ingredients strawberries and walnuts separately to construct new recipe queries. We show the results retrieved by SCAN and by different components of our proposed model.

4.6. Recipe-to-Image Retrieval Results

We show three recipe-to-image retrieval results in Figure 4. In the top row, we select a dessert recipe query from the Recipe1M dataset, which has ground truth among the retrieved food images. Images with the green box are the correctly retrieved ones, which come from the results retrieved by SCAN and TL+SA. We can also see that the model trained only with the semantic consistency loss (TL+SC) returns a reasonable result as well, which is relevant to the recipe query.

In the middle and bottom rows, we remove some ingredients and the corresponding cooking instruction sentences from the recipe, and then construct new recipe embeddings for recipe-to-image retrieval. In the bottom row, where we remove the walnuts, we can see that none of the retrieved images contain walnuts. However, only the image retrieved by our proposed SCAN reflects the richest recipe information. For instance, the image from SCAN still shows the frozen whipped topping, while the images from TL+SC and TL+SA have no frozen whipped topping.

The recipe-to-image retrieval results might indicate an interesting way to satisfy users' need to find corresponding food images for their customized recipes.
Figure 5. Visualizations of image-to-recipe retrieval. We show the retrieved recipes of the given food images, along with the attended ingredients and cooking instruction sentences.

4.7. Image-to-Recipe Retrieval Results & Effect of Self-Attention

In this section, we show some image-to-recipe retrieval results in Figure 5 and then focus on analyzing the effect of our self-attention model. Given images of cheese cake, meat loaf and salad, we show the recipes retrieved by SCAN, which are all correct. We visualize the attended ingredients and instructions of the retrieved recipes with a yellow background, where we choose the ingredients and cooking instruction sentences with the top-2 attention weights as the attended ones. We can see that some frequently used ingredients like water, milk, salt, etc. are not attended with high weights, since they are not visible and are shared by many kinds of food, and thus cannot provide enough discriminative information for cross-modal food retrieval. This is an intuitive explanation of the effectiveness of our self-attention model.

Another advantage of using the self-attention mechanism is that image quality does not affect the attended outputs. The top two rows of food images, cheese cake and meat loaf, do not possess good image quality, yet our self-attention model still outputs reasonable attended results. This suggests that our proposed attention model has a good capability to capture informative and reasonable parts for the recipe embedding.

4.8. Effect of Semantic Consistency

In order to have a concrete understanding of the ability of our proposed semantic consistency loss to reduce the mean intra-class feature distance (intra-class variance) between paired food image and recipe representations, we show the difference in intra-class feature distance for cross-modal data trained with and without the semantic consistency loss, i.e., SCAN and TL, in Figure 6. In the test set, we select the recipe and food image data of chocolate chip, which in total has 425 pairs. We obtain the food data representations from the models trained with the two different methods, then we compute the Euclidean distance between paired cross-modal data to obtain the mean intra-class feature distance. We adopt t-SNE [15] for dimensionality reduction to visualize the food data.
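As a hedged sketch of the analysis described above (not the authors' code), the mean intra-class distance and a 2-D t-SNE projection of the paired embeddings could be computed as follows; scikit-learn is assumed to be available, and the perplexity value is an arbitrary choice rather than a setting from the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

def intra_class_analysis(img_emb, rec_emb):
    """img_emb, rec_emb: (N, D) paired embeddings of one category
    (e.g. the 425 chocolate-chip pairs). Returns the mean Euclidean
    distance between matching pairs and a 2-D t-SNE projection of all
    2N points for visualization."""
    pair_dist = np.linalg.norm(img_emb - rec_emb, axis=1)   # distance per matching pair
    mean_intra = float(pair_dist.mean())                    # mean intra-class feature distance

    all_points = np.vstack([img_emb, rec_emb])              # first N rows: images, last N: recipes
    coords_2d = TSNE(n_components=2, perplexity=30, init='pca',
                     random_state=0).fit_transform(all_points)
    return mean_intra, coords_2d
```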
It can be observed that the cross-modal food data trained with the semantic consistency loss (SCAN) has smaller intra-class variance than that trained without it (TL). This means that the semantic consistency loss is able to correlate paired cross-modal data representations effectively by reducing the intra-class feature distance, and our experimental results suggest its efficacy.

Figure 6. The difference in intra-class feature distance for cross-modal paired data trained with and without the semantic consistency loss. The food data is selected from the same category, chocolate chip. SCAN obtains a closer image-recipe feature distance than TL (mean intra-class feature distance: 0.82 for SCAN vs. 1.02 for TL). (Best viewed in color.)

5. Conclusion

In conclusion, we propose SCAN, a lightweight and effective training framework for cross-modal food retrieval. It introduces a novel semantic consistency loss and employs a self-attention mechanism to learn the joint embedding between food images and recipes for the first time. To be specific, we apply the semantic consistency loss on cross-modal food data pairs to reduce the intra-class variance, and utilize the self-attention mechanism to find the important parts of recipes in order to construct discriminative recipe representations. SCAN is easy to implement and can be extended to other general cross-modal datasets. We have conducted extensive experiments and ablation studies, and we achieve state-of-the-art results on the Recipe1M dataset.
References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, and Matthieu Cord. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In ACM SIGIR, 2018.
[3] Jingjing Chen and Chong-Wah Ngo. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the 2016 ACM Multimedia Conference, pages 32–41. ACM, 2016.
[4] Jingjing Chen, Lei Pang, and Chong-Wah Ngo. Cross-modal recipe retrieval: How to cook this dish? In International Conference on Multimedia Modeling, pages 588–600. Springer, 2017.
[5] Jing-Jing Chen, Chong-Wah Ngo, Fu-Li Feng, and Tat-Seng Chua. Deep understanding of cooking procedure for cross-modal recipe retrieval. In 2018 ACM Multimedia Conference, pages 1020–1028. ACM, 2018.
[6] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2016.
[7] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 7–16. ACM, 2014.
[8] Kamran Ghasedi Dizaji, Feng Zheng, Najmeh Sadoughi, Yanhua Yang, Cheng Deng, and Heng Huang. Unsupervised deep generative adversarial hashing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3664–3673, 2018.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[12] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
[15] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[16] Weiqing Min, Shuqiang Jiang, Jitao Sang, Huayang Wang, Xinda Liu, and Luis Herranz. Being a supercook: Joint food attributes and multimodal content modeling for recipe retrieval and exploration. IEEE Transactions on Multimedia, 19(5):1100–1113, 2017.
[17] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[18] Nitish Srivastava and Ruslan R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[20] Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, and Steven C. H. Hoi. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11572–11581, 2019.