Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism
Hao Wang 1,*   Doyen Sahoo 3,*   Chenghao Liu 2   Ke Shu 2   Palakorn Achananuparp 2   Ee-peng Lim 2   Steven C. H. Hoi 2,3
1 Nanyang Technological University   2 Singapore Management University   3 Salesforce Research Asia
hao005@e.ntu.edu.sg   dsahoo@salesforce.com   {chliu, keshu, palakorna, eplim, chhoi}@smu.edu.sg
* Work done at Singapore Management University.

arXiv:2003.03955v1 [cs.CV] 9 Mar 2020

Abstract

Cross-modal food retrieval is an important task for the analysis of food-related information, such as food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, so that precise matching can be realized. Compared with existing cross-modal retrieval approaches, two major challenges in this specific problem are: 1) the large intra-class variance across cross-modal food data; and 2) the difficulty of obtaining discriminative recipe representations. To address these problems, we propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities by aligning output semantic probabilities. In addition, we exploit a self-attention mechanism to improve the embedding of recipes. We evaluate the performance of the proposed method on the large-scale Recipe1M dataset, and the results show that it outperforms the state-of-the-art.

1. Introduction

In recent years, cross-modal retrieval has drawn much attention with the rapid growth of multimodal data. In this paper we address the problem of cross-modal food retrieval based on a large-scale heterogeneous food dataset [17], i.e., taking cooking recipes (ingredients & cooking instructions) as the query to retrieve food images, and vice versa.

There have been many works [17, 2, 20] on cross-modal food retrieval. Despite those efforts, cross-modal food retrieval remains challenging mainly for two reasons: 1) the large intra-class variance across food data pairs; and 2) the difficulty of obtaining discriminative recipe representations.

In cross-modal food data, a given recipe corresponds to many food images, cooked by different chefs. Besides, the images from different recipes can look very similar because they have similar ingredients. Hence, the data representations of the same food can be different, while different foods may have similar data representations. This leads to large intra-class variance but small inter-class variance in food data. Existing studies [2, 5, 4] only address the small inter-class variance problem by utilizing a triplet loss to measure the similarities among cross-modal data. Specifically, the objective of the triplet loss is to make the inter-class feature distance larger than the intra-class feature distance by a predefined margin [6]. Therefore, cross-modal instances from the same class may form a loose cluster with a large average intra-class distance. As a consequence, this eventually results in less-than-optimal ranking, i.e., irrelevant images are closer to the queried recipe than relevant images. See Figure 1 for an example.

Besides, many recipes share common ingredients across different foods. For instance, fruit salad has ingredients such as apple, orange and sugar, where apple and orange are the main ingredients, while sugar is an ingredient of many other foods. If the embeddings of ingredients are treated equally during training, the features learned by the model may not be discriminative enough. Cooking instructions crawled from cooking websites also tend to be noisy; some instructions turn out to be irrelevant to cooking, e.g., "Enjoy!", which convey no information for cooking instruction features but degrade the performance of the cross-modal retrieval task. In order to find the attended ingredients, Chen et al. [4] apply a two-layer deep attention mechanism, which learns joint features by locating the visual food regions that correspond to ingredients. However, this method relies on high-quality food images and substantially increases the computational complexity.

To resolve those issues, we propose a novel unified framework, the Semantic-Consistent and Attention-based Network (SCAN), to improve cross-modal food retrieval performance. The pipeline of the framework is shown in Figure 2. To reduce the intra-class variance, we introduce a semantic consistency loss, which imposes a Kullback-Leibler (KL) divergence to minimize the distance between the output semantic probabilities of paired images and recipes. In order to obtain discriminative recipe representations, we combine a self-attention mechanism [19] with LSTMs to find the key ingredients and cooking instructions of each recipe. Without requiring food images or adding extra layers, we can learn more discriminative recipe embeddings than those trained with a plain LSTM.
Figure 1. Recipe-to-image retrieval ranking results. Taking an apple salad recipe (ingredients and cooking instructions) as the query, we show the retrieval results based on Euclidean distance (the numbers indicated in the figure) for three food images with large intra-class variance and small inter-class variance, i.e., the apple salad images look quite different from each other, while the chicken salad image is more similar to an apple salad image. We rank the retrieved results using (i) the vanilla triplet loss and (ii) our proposed SCAN model. The vanilla triplet loss outputs a wrong ranking order, while SCAN provides a more precise ranking.

Our work makes two major contributions as follows:

• We introduce a semantic consistency loss to the cross-modal food retrieval task, and the results show that it can align cross-modal matching pairs and reduce the intra-class variance of food data representations.

• We integrate a self-attention mechanism with LSTM and learn discriminative recipe features without requiring the food images. This is useful for discriminating samples with similar recipes.

We perform an extensive experimental analysis on Recipe1M, which is the largest publicly available cross-modal food dataset. We find that our proposed cross-modal food retrieval approach SCAN outperforms state-of-the-art methods. Finally, we show some visualizations of the retrieved results.

2. Related Work

Our work is closely related to cross-modal retrieval. Canonical Correlation Analysis (CCA) [12] is a representative work in cross-modal data representation learning; it utilizes global alignment to map semantically similar data from different modalities close to each other in the common space. Many recent works [18, 7] utilize deep architectures for cross-modal retrieval, which have the advantage of capturing complex non-linear cross-modal correlations. With the advent of generative adversarial networks (GANs) [9], which are helpful in modeling data distributions, adversarial training methods [20, 8] are frequently used for modality fusion.

For cross-modal food retrieval, [3, 16] are early works. [3] uses a vanilla attention mechanism with extra learnable layers, and only tests the model on a small-scale dataset. [16] utilizes a multi-modal Deep Boltzmann Machine for recipe-image retrieval.

[4, 5] both integrate attention mechanisms into cross-modal retrieval. [4] introduces a stacked attention network (SAN) to learn a joint space from images and recipes for cross-modal retrieval. Subsequently, [5] makes full use of the ingredient, cooking instruction and title (category label) information of Recipe1M, and concatenates the three types of features to construct the recipe embeddings. Compared with the self-attention mechanism we adopt in our model, these two works increase the computational complexity by depending on the food images or adding extra learnable layers and weights.

In order to better regularize the learning of the shared representation space, some researchers incorporate semantic labels into the joint training. [17] develops a hybrid neural network architecture with a cosine embedding loss for retrieval learning and an auxiliary cross-entropy loss for classification, so that a joint common space for image and recipe embeddings can be learned for cross-modal retrieval.
[2] is an extended version of [17], providing a double-triplet strategy to express both the retrieval loss and the classification loss. [20] preserves the semantic information of the learned cross-modal features with generative approaches. Here, we include a KL loss in the training and thereby preserve the semantic information across image-recipe pairs. We also adopt a self-attention mechanism for discriminative feature representations and achieve better performance.

3. Proposed Methods

In this section, we introduce our proposed model, where we utilize food image-recipe paired data to learn cross-modal embeddings, as shown in Figure 2.

Figure 2. Our proposed framework for the cross-modal retrieval task. We have two branches to encode food images and recipes respectively. One embedding function E_I is designed to extract food image representations I, where a CNN is used. The other embedding function E_R is composed of two LSTMs with self-attention, designed to obtain discriminative recipe representations R. I and R are fed into the retrieval loss (triplet loss) L_Ret for cross-modal retrieval learning. We add another FC transformation on I and R, with the output dimensionality equal to the number of food categories, to obtain the semantic probabilities p^img and p^rec, on which we apply the semantic consistency loss L_SC to correlate food image and recipe data.

3.1. Overview

We formulate the proposed cross-modal food retrieval model with three networks, i.e., one convolutional neural network (CNN) for food image embeddings, and two LSTMs to encode ingredients and cooking instructions respectively. The food image representation I is obtained from the output of the CNN directly, while the recipe representation R comes from the concatenation of the ingredient features f_ingredient and the instruction features f_instruction. Specifically, to obtain discriminative ingredient and instruction embeddings, we integrate the self-attention mechanism [19] into the LSTM embedding. The triplet loss is used as the main loss function L_Ret to map cross-modal data to the common space, and the semantic consistency loss L_SC is utilized to align cross-modal matching pairs for the retrieval task, reducing the intra-class variance of food data. The overall objective function of the proposed SCAN is:

    L = L_{Ret} + \lambda L_{SC}.    (1)

3.2. Recipe Embedding

We use two LSTMs to get the ingredient and instruction representations f_ingredient and f_instruction, concatenate them, and pass the result through a fully-connected layer to give a 1024-dimensional feature vector as the recipe representation R.

3.2.1 Ingredient Representation Learning

Instead of word-level word2vec representations, ingredient-level word2vec representations are used in the ingredient embedding. To be specific, ground ginger is regarded as a single word vector, instead of two separate word vectors for ground and ginger.

We integrate the self-attention mechanism with the LSTM output to construct recipe embeddings. The purpose of applying the self-attention model lies in assigning higher weights to the main ingredients of different food items, making the attended ingredients contribute more to the ingredient embedding while reducing the effect of common ingredients.

Given an ingredient input {z_1, z_2, ..., z_n}, we first encode it with pretrained embeddings from the word2vec algorithm to obtain the ingredient representations Z_t. Then {Z_1, Z_2, ..., Z_n} are fed into a one-layer bidirectional LSTM as a sequence, step by step. For each step t, the recurrent network takes the ingredient vector Z_t and the output of the previous step h_{t-1} as input, and produces the current step output h_t by a non-linear transformation, as follows:

    h_t = \tanh(W Z_t + U h_{t-1} + b).    (2)

The bidirectional LSTM consists of a forward hidden state \overrightarrow{h_t}, which processes ingredients from Z_1 to Z_n, and a backward hidden state \overleftarrow{h_t}, which processes ingredients from Z_n to Z_1. We obtain the representation h_t of each ingredient z_t by concatenating \overrightarrow{h_t} and \overleftarrow{h_t}, i.e., h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}], so that the representation of the ingredient list of each food item is H = {h_1, h_2, ..., h_n}.

We further measure the importance of the ingredients in a recipe with the self-attention mechanism studied in the Transformer [19], where the input comes from queries Q and keys K of dimension d_k, and values V of dimension d_v (the definitions of Q, K and V can be found in [19]). We compute the attention output as:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V.    (3)

Different from earlier attention-based methods [5], we use a self-attention mechanism where all of the keys, values and queries come from the same ingredient representation H. Therefore, the computational complexity is reduced, since it is not necessary to add extra layers to train attention weights. The ingredient attention output H_attn can be formulated as:

    H_{attn} = \mathrm{Attention}(H, H, H) = \mathrm{softmax}\left(\frac{H H^T}{\sqrt{d_h}}\right) H,    (4)

where d_h is the dimension of H. In order to enable unimpeded information flow for the recipe embedding, skip connections are used in the attention model. Layer normalization [1] is also used, since it is effective in stabilizing the hidden state dynamics in recurrent networks. The final ingredient representation f_ingredient is generated from the summation of H and H_attn, which can be defined as:

    f_{ingredient} = \mathrm{LayerNorm}(H_{attn} + H).    (5)
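To make Eqs. (2)-(5) concrete, here is a minimal PyTorch-style sketch of the ingredient encoder (an illustration under assumptions, not the authors' released implementation). The vocabulary size, embedding dimension and hidden size are placeholder values, and details the text does not specify, such as masking of padded ingredients and how the per-ingredient outputs are finally pooled into a single f_ingredient vector, are omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class IngredientEncoder(nn.Module):
    """Bidirectional LSTM over ingredient-level word2vec embeddings (Eq. 2),
    followed by parameter-free self-attention with Q = K = V = H (Eqs. 3-4)
    and a skip connection with layer normalization (Eq. 5).

    Assumed sizes (not from the paper): emb_dim=300, hidden=512, so each
    concatenated step h_t = [forward h_t, backward h_t] has d_h = 1024 dims.
    """
    def __init__(self, vocab_size=30000, emb_dim=300, hidden=512, w2v_weights=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        if w2v_weights is not None:                 # pretrained ingredient-level word2vec
            self.embed.weight.data.copy_(w2v_weights)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)
        self.scale = math.sqrt(2 * hidden)          # sqrt(d_h)

    def forward(self, ingredient_ids):              # (batch, n) ingredient token ids
        Z = self.embed(ingredient_ids)              # (batch, n, emb_dim)
        H, _ = self.lstm(Z)                         # (batch, n, d_h): H = {h_1, ..., h_n}
        scores = torch.matmul(H, H.transpose(1, 2)) / self.scale   # H H^T / sqrt(d_h)
        attn = F.softmax(scores, dim=-1)
        H_attn = torch.matmul(attn, H)              # Attention(H, H, H), Eq. (4)
        return self.norm(H_attn + H)                # per-ingredient f_ingredient, Eq. (5)
```

Note that no extra attention weights are trained in this block, which is the computational advantage the paper emphasizes over the attention models in [4, 5].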
3.2.2 Instruction Representation Learning

Considering that cooking instructions are composed of a sequence of variable-form and lengthy sentences, we compute the instruction embedding with a two-stage LSTM model. For the first stage, we apply the same approach as [17] to obtain the representation of each instruction sentence, which uses skip-instructions [17] with the technique of skip-thoughts [14].

The next stage is similar to the ingredient representation learning. We feed the pre-computed fixed-length instruction sentence representations into the LSTM model to generate the hidden representation of each cooking instruction sentence. Based on that, we can obtain the self-attention representation. The final instruction feature f_instruction is generated by applying layer normalization to the sum of the previous two representations, as formulated in the last section. By doing so, we are able to find the key sentences in the cooking instructions. Some visualizations of attended ingredients and cooking instructions can be found in Section 4.7.

3.3. Image Embedding

We use ResNet-50 [10] pretrained on ImageNet to encode food images. We empirically remove the last layer, which is used for ImageNet classification, and add an extra FC layer to obtain the final food image features I. The dimension of the features I is 1024.

3.4. Cross-modal Food Retrieval Learning

The triplet loss is utilized for retrieval learning; the objective function is:

    L_{Ret} = \sum_{V} \left[ d(I_a, R_p) - d(I_a, R_n) + \alpha \right]_+ + \sum_{R} \left[ d(R_a, I_p) - d(R_a, I_n) + \alpha \right]_+,    (6)

where d(·) is the Euclidean distance, the subscripts a, p and n refer to anchor, positive and negative samples respectively, and α is the margin. To improve the effectiveness of training, we adopt the BatchHard idea proposed in [11]. Specifically, in a mini-batch, given an anchor sample, we simply select the most distant positive instance and the closest negative instance to construct the triplet.
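As a hedged sketch of Eq. (6) with BatchHard mining [11] (not the authors' code): the margin value below is an assumption, and `labels` is a hypothetical tensor marking which image-recipe items count as positives of one another (e.g., by pair identity or by food category; the paper does not spell out the mining granularity).

```python
import torch
import torch.nn.functional as F

def batchhard_triplet_loss(img_emb, rec_emb, labels, margin=0.3):
    """Bidirectional triplet loss of Eq. (6) with BatchHard mining [11].

    img_emb, rec_emb: (B, D) embeddings of paired food images and recipes.
    labels:           (B,)   items sharing a label are treated as positives.
    margin:           the margin alpha (0.3 is an assumed value).
    """
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)            # (B, B) positive pairs

    def hard_mined(anchor, gallery):
        dist = torch.cdist(anchor, gallery)                          # Euclidean distances
        d_pos = (dist * pos_mask.float()).max(dim=1).values          # hardest (farthest) positive
        d_neg = dist.masked_fill(pos_mask, float('inf')).min(dim=1).values  # hardest (closest) negative
        return F.relu(d_pos - d_neg + margin).mean()

    # image-as-anchor term plus recipe-as-anchor term
    return hard_mined(img_emb, rec_emb) + hard_mined(rec_emb, img_emb)
```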
3.5. Semantic Consistency

Given a pair of food image and recipe representations I, R, we first transform I and R into I' and R' for classification with an extra FC layer. The dimension of the output is the same as the number of categories N. The probabilities of food category i can be computed by a softmax activation as:

    p_i^{img} = \frac{\exp(I'_i)}{\sum_{j=1}^{N} \exp(I'_j)},    (7)

    p_i^{rec} = \frac{\exp(R'_i)}{\sum_{j=1}^{N} \exp(R'_j)},    (8)

where N represents the total number of food categories, and it is also the dimension of I' and R'. Given i ∈ {1, 2, ..., N}, with the probabilities p_i^{img} and p_i^{rec} for each food category, the predicted labels l^{img} and l^{rec} can be obtained. We formulate the classification (cross-entropy) loss as L_cls(p^{img}, p^{rec}, c^{img}, c^{rec}), where c^{img} and c^{rec} are the ground-truth class labels for the food image and the recipe respectively.

In the previous work [17], L_cls(p^{img}, p^{rec}, c^{img}, c^{rec}) consists of L_cls^{img} and L_cls^{rec}, which are treated as two independent classifiers, focusing on regularizing the embeddings from food images and recipes separately. However, food image and recipe embeddings come from heterogeneous modalities, so the output probabilities of each category can be significantly different. As a result, the distance between intra-class features remains large. To improve image-recipe matching and make the probabilities predicted by the different classifiers consistent, we minimize the Kullback-Leibler (KL) divergence between the probabilities p^{img} and p^{rec} of paired cross-modal data, which can be formulated as:

    L_{KL}(p^{img} \| p^{rec}) = \sum_{i=1}^{N} p_i^{img} \log \frac{p_i^{img}}{p_i^{rec}},    (9)

    L_{KL}(p^{rec} \| p^{img}) = \sum_{i=1}^{N} p_i^{rec} \log \frac{p_i^{rec}}{p_i^{img}}.    (10)

The overall semantic consistency loss L_SC is defined as:

    L_{SC} = \frac{1}{2}\left[ \left( L_{cls}^{img} + L_{KL}(p^{rec} \| p^{img}) \right) + \left( L_{cls}^{rec} + L_{KL}(p^{img} \| p^{rec}) \right) \right].    (11)
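Eqs. (7)-(11) translate almost directly into code. The sketch below (assumed PyTorch, not the authors' implementation) computes L_SC from the classification logits of a paired mini-batch; the overall objective of Eq. (1) would then be formed as, e.g., `loss = l_ret + 0.05 * l_sc`, using the λ reported in Section 4.3.

```python
import torch.nn.functional as F

def semantic_consistency_loss(img_logits, rec_logits, img_labels, rec_labels):
    """L_SC of Eq. (11): per-modality cross-entropy (L_cls) plus the symmetric
    KL terms of Eqs. (9)-(10) between the paired softmax outputs, averaged."""
    p_img = F.softmax(img_logits, dim=1)                    # p^img, Eq. (7)
    p_rec = F.softmax(rec_logits, dim=1)                    # p^rec, Eq. (8)
    log_p_img = F.log_softmax(img_logits, dim=1)
    log_p_rec = F.log_softmax(rec_logits, dim=1)

    ce_img = F.cross_entropy(img_logits, img_labels)        # L_cls^img
    ce_rec = F.cross_entropy(rec_logits, rec_labels)        # L_cls^rec

    # F.kl_div(log_q, p) computes KL(p || q)
    kl_img_rec = F.kl_div(log_p_rec, p_img, reduction='batchmean')  # KL(p^img || p^rec), Eq. (9)
    kl_rec_img = F.kl_div(log_p_img, p_rec, reduction='batchmean')  # KL(p^rec || p^img), Eq. (10)

    return 0.5 * ((ce_img + kl_rec_img) + (ce_rec + kl_img_rec))
```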
4. Experiments

4.1. Dataset

We conduct extensive experiments to evaluate the performance of our proposed methods on the Recipe1M dataset [17], the largest cooking dataset with recipe and food image pairs available to the public. Recipe1M was scraped from over 24 popular cooking websites, and it not only contains image-recipe paired labels but also provides semantic category labels, extracted from the food titles on the websites, for more than half of the food data. The category labels provide semantic information for the cross-modal retrieval task, making the dataset fit our proposed method well. The paired labels and category labels construct hierarchical relationships among the food: one food category (e.g., fruit salads) may contain hundreds of different food pairs, since there are a number of recipes for different fruit salads.

We perform the cross-modal food retrieval task based on food data pairs, i.e., when we take a recipe as the query, the ground truth is the food image from its food data pair, and vice versa. We use the original Recipe1M data split [17], containing 238,999 image-recipe pairs for training, and 51,119 and 51,303 pairs for validation and test, respectively. In total, the dataset has 1,047 categories.

4.2. Evaluation Protocol

We evaluate our proposed model with the same metrics used in prior works [17, 5, 2, 20]. To be specific, the median retrieval rank (MedR) and recall at top K (R@K) are used. MedR measures the median rank position at which the true positives are returned; therefore, higher performance comes with a lower MedR score. Given a food image, R@K calculates the fraction of times that the correct recipe is found within the top-K retrieved candidates, and vice versa. Different from MedR, the performance is directly proportional to the R@K score. In the test phase, we first sample 10 different subsets of 1,000 pairs (1k setup) and 10 different subsets of 10,000 pairs (10k setup), the same setting as in [17]. We then consider each item from the food image modality in a subset as a query, and rank samples from the recipe modality according to the L2 distance between the embedding of the image and that of the recipe, which serves as image-to-recipe retrieval, and vice versa for recipe-to-image retrieval.
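For concreteness, here is a small sketch (not the authors' evaluation script) of how MedR and R@K can be computed on one sampled subset, assuming that row i of the two embedding matrices forms a matching image-recipe pair and that ranks are 1-based. Under the protocol above this would be repeated for the 10 sampled subsets and for both retrieval directions.

```python
import numpy as np

def retrieval_metrics(query_emb, gallery_emb, ks=(1, 5, 10)):
    """MedR and R@K for one subset (e.g. 1,000 pairs): query i's ground
    truth is gallery item i, and ranking uses the L2 distance."""
    # squared-distance expansion avoids building an (N, N, D) tensor
    q2 = (query_emb ** 2).sum(axis=1, keepdims=True)        # (N, 1)
    g2 = (gallery_emb ** 2).sum(axis=1)[None, :]            # (1, N)
    dist = np.sqrt(np.maximum(q2 + g2 - 2.0 * query_emb @ gallery_emb.T, 0.0))

    order = np.argsort(dist, axis=1)                        # gallery indices, closest first
    n = dist.shape[0]
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])

    medr = float(np.median(ranks))                          # median rank of the true match
    recalls = {k: float((ranks <= k).mean()) for k in ks}   # R@K
    return medr, recalls
```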
4.3. Implementation Details

We set the trade-off parameter λ in Eq. (1) based on empirical observations, where we tried a range of values and evaluated the performance on the validation set; we set λ to 0.05. The model was trained using the Adam optimizer [13] with a batch size of 64 in all our experiments. The initial learning rate is set to 0.0001, and the learning rate is decayed by a factor of 0.1 at the 30th epoch. Note that we update the two sub-networks, i.e., the image encoder E_I and the recipe encoder E_R, alternately. It only takes 40 epochs to reach the best performance with our proposed method, while [17] requires 220 epochs to converge. Our training records can be viewed in Figure 3. We run our experiments on a single Tesla V100 GPU, and training takes about 16 hours.

Table 1. Main Results. Evaluation of the performance of our proposed method compared against the baselines. The models are evaluated on the basis of MedR (lower is better) and R@K (higher is better).

Test Set   Method         Image-to-Recipe Retrieval         Recipe-to-Image Retrieval
                          medR↓   R@1↑   R@5↑   R@10↑       medR↓   R@1↑   R@5↑   R@10↑
1k         CCA [12]       15.7    14.0   32.0   43.0        24.8     9.0   24.0   35.0
           SAN [4]        16.1    12.5   31.1   42.3         -       -      -      -
           JE [17]         5.2    24.0   51.0   65.0         5.1    25.0   52.0   65.0
           AM [5]          4.6    25.6   53.7   66.9         4.6    25.7   53.9   67.1
           AdaMine [2]     1.0    39.8   69.0   77.4         1.0    40.2   68.1   78.7
           ACME [20]       1.0    51.8   80.2   87.5         1.0    52.8   80.2   87.6
           SCAN (Ours)     1.0    54.0   81.7   88.8         1.0    54.9   81.9   89.0
10k        JE [17]        41.9     -      -      -          39.2     -      -      -
           AM [5]         39.8     7.2   19.2   27.6        38.1     7.0   19.4   27.8
           AdaMine [2]    13.2    14.9   35.3   45.2        12.2    14.8   34.6   46.1
           ACME [20]       6.7    22.9   46.8   57.9         6.0    24.4   47.9   59.0
           SCAN (Ours)     5.9    23.7   49.3   60.6         5.1    25.3   50.6   61.6

Figure 3. The training records (Recall@1 over training epochs) of our proposed model SCAN and of each component of SCAN (TL, TL+SA, TL+cls, TL+SC).

4.4. Baselines

We compare the performance of our proposed methods with several state-of-the-art baselines, and the results are shown in Table 1.

CCA [12]: Canonical Correlation Analysis (CCA) is one of the most widely-used classic models for learning a common embedding from different feature spaces. CCA learns two linear projections for mapping text and image features to a common space that maximizes their feature correlation.

SAN [4]: The Stacked Attention Network (SAN) considers ingredients only (and ignores recipe instructions), and learns the feature space between ingredient and image features via a two-layer deep attention mechanism.

JE [17]: They use a pairwise cosine embedding loss to find a joint embedding (JE) between the different modalities. To impose regularization, they add classifiers to the cross-modal embeddings which predict the category of the given food data.

AM [5]: An attention mechanism (AM) over the recipe is adopted in [5], applied to different parts of a recipe (title, ingredients and instructions). They use an extra transformation matrix and context vector in the attention model.

AdaMine [2]: A double triplet loss is used, where the triplet loss is applied to both the joint embedding learning and the auxiliary classification task of categorizing the embedding into an appropriate category. They also integrate an adaptive learning schema (AdaMine) into the training phase, which performs adaptive mining of significant triplets.

ACME [20]: Adversarial training methods are utilized in ACME for modality alignment, to make the feature distributions from the different modalities similar. In order to further preserve the semantic information in the cross-modal food data representation, Wang et al. introduce a translation consistency component.

In summary, our proposed model SCAN is lightweight and effective, and it outperforms all earlier methods by a margin, as shown in Table 1.

Table 2. Ablation Studies. Evaluation of the benefits of different components of the SCAN framework. The models are evaluated on the basis of MedR (lower is better) and R@K (higher is better).

L_Ret           Component   medR↓   R@1↑   R@5↑   R@10↑
Triplet Loss    TL           2.0    47.5   76.2   85.1
                TL+SA        1.1    52.5   81.1   88.4
                TL+cls       1.7    48.5   78.0   85.5
                TL+SC        1.1    51.9   80.3   88.0
                SCAN         1.0    54.0   81.7   88.8

4.5. Ablation Studies

Extensive ablation studies are conducted to evaluate the effectiveness of each component of our proposed model.
Table 2 illustrates the contributions of the self-attention model (SA), the semantic consistency loss (SC) and their combination to improving image-to-recipe retrieval performance. TL serves as a baseline for SA; it adopts the BatchHard [11] training strategy. We then add SA and SC incrementally, and significant improvements can be found for both components. To be specific, integrating SA into TL improves the performance of image-to-recipe retrieval by more than 4% in R@1, illustrating the effectiveness of the self-attention mechanism in learning discriminative recipe representations. The model trained with the triplet loss and the classification loss (cls) used in [17] is another baseline, for SC. Our proposed semantic consistency loss improves the performance in R@1 and R@10 by more than 2%, which suggests that reducing the intra-class variance can be helpful in the cross-modal retrieval task.

We show the training records in Figure 3. In the left figure, we can see that for the first 20 epochs the performance gap between TL and TL+SA grows larger, while the performance of TL+cls and TL+SC stays similar, as shown in the middle figure. But for the last 20 training epochs, the performance of TL+SC improves significantly, which indicates that for those hard samples whose intra-class variance can hardly be reduced by TL+cls, TL+SC contributes further to the alignment of paired cross-modal data.

In conclusion, we observe that each of the proposed components improves the cross-modal embedding model, and the combination of those components yields better performance overall.

Figure 4. Recipe-to-image retrieval results on the Recipe1M dataset. We give an original recipe query from dessert, and then remove the ingredients strawberries and walnuts separately to construct new recipe queries. We show the results retrieved by SCAN and by different components of our proposed model.

4.6. Recipe-to-Image Retrieval Results

We show three recipe-to-image retrieval results in Figure 4. In the top row, we select a dessert recipe query from the Recipe1M dataset, which has ground truth among the retrieved food images. Images with the green box are the correctly retrieved ones, which come from the results retrieved by SCAN and TL+SA. We can also see that the model trained only with the semantic consistency loss (TL+SC) returns a reasonable result as well, which is relevant to the recipe query.

In the middle and bottom rows, we remove some ingredients and the corresponding cooking instruction sentences from the recipe, and then construct new recipe embeddings for recipe-to-image retrieval. In the bottom row, where we remove the walnuts, we can see that none of the retrieved images contain walnuts. However, only the image retrieved by our proposed SCAN reflects the richest recipe information. For instance, the image from SCAN still shows the frozen whipped topping, while the images from TL+SC and TL+SA have no frozen whipped topping.

The recipe-to-image retrieval results might indicate an interesting way to satisfy users' need to find corresponding food images for their customized recipes.
Figure 5. Visualizations of image-to-recipe retrieval. We show the retrieved recipes of the given food images, along with the attended ingredients and cooking instruction sentences.

4.7. Image-to-Recipe Retrieval Results & Effect of Self-Attention

In this section, we show some image-to-recipe retrieval results in Figure 5 and then focus on analyzing the effect of our self-attention model. Given images of cheese cake, meat loaf and salad, we show the recipes retrieved by SCAN, which are all correct. We visualize the attended ingredients and instructions of the retrieved recipes with a yellow background, where we choose the ingredients and cooking instruction sentences with the top-2 attention weights as the attended ones. We can see that some frequently used ingredients like water, milk, salt, etc. are not attended with high weights, since they are not visible and are shared by many kinds of food, and thus cannot provide enough discriminative information for cross-modal food retrieval. This is an intuitive explanation of the effectiveness of our self-attention model.

Another advantage of using the self-attention mechanism is that image quality does not affect the attended outputs. The top two rows of food images, cheese cake and meat loaf, do not possess good image quality, yet our self-attention model still outputs reasonable attended results. This suggests that our proposed attention model has a good capability to capture informative and reasonable parts for the recipe embedding.

4.8. Effect of Semantic Consistency

In order to have a concrete understanding of the ability of our proposed semantic consistency loss to reduce the mean intra-class feature distance (intra-class variance) between paired food image and recipe representations, we show the difference in intra-class feature distance for cross-modal data trained with and without the semantic consistency loss, i.e., SCAN and TL, in Figure 6. In the test set, we select the recipe and food image data of chocolate chip, which in total has 425 pairs. We obtain the food data representations from the models trained with the two different methods, then we compute the Euclidean distance between paired cross-modal data to obtain the mean intra-class feature distance. We adopt t-SNE [15] for dimensionality reduction to visualize the food data.
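As a hedged sketch of the analysis described above (not the authors' code), the mean intra-class distance and a 2-D t-SNE projection of the paired embeddings could be computed as follows; scikit-learn is assumed to be available, and the perplexity value is an arbitrary choice rather than a setting from the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

def intra_class_analysis(img_emb, rec_emb):
    """img_emb, rec_emb: (N, D) paired embeddings of one category
    (e.g. the 425 chocolate-chip pairs). Returns the mean Euclidean
    distance between matching pairs and a 2-D t-SNE projection of all
    2N points for visualization."""
    pair_dist = np.linalg.norm(img_emb - rec_emb, axis=1)   # distance per matching pair
    mean_intra = float(pair_dist.mean())                    # mean intra-class feature distance

    all_points = np.vstack([img_emb, rec_emb])              # first N rows: images, last N: recipes
    coords_2d = TSNE(n_components=2, perplexity=30, init='pca',
                     random_state=0).fit_transform(all_points)
    return mean_intra, coords_2d
```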
It can be observed that the cross-modal food data trained with the semantic consistency loss (SCAN) has smaller intra-class variance than that trained without it (TL). This means that the semantic consistency loss is able to correlate paired cross-modal data representations effectively by reducing the intra-class feature distance, and our experimental results suggest its efficacy.

Figure 6. The difference in intra-class feature distance for cross-modal paired data trained with and without the semantic consistency loss. The food data is selected from the same category, chocolate chip. SCAN obtains a closer image-recipe feature distance than TL (mean intra-class feature distance: 0.82 for SCAN vs. 1.02 for TL). (Best viewed in color.)

5. Conclusion

In conclusion, we propose SCAN, a lightweight and effective training framework for cross-modal food retrieval. It introduces a novel semantic consistency loss and employs a self-attention mechanism to learn the joint embedding between food images and recipes for the first time. To be specific, we apply the semantic consistency loss on cross-modal food data pairs to reduce the intra-class variance, and utilize the self-attention mechanism to find the important parts of recipes in order to construct discriminative recipe representations. SCAN is easy to implement and can be extended to other general cross-modal datasets. We have conducted extensive experiments and ablation studies, and we achieve state-of-the-art results on the Recipe1M dataset.
References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, and Matthieu Cord. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In ACM SIGIR, 2018.
[3] Jingjing Chen and Chong-Wah Ngo. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the 2016 ACM Multimedia Conference, pages 32–41. ACM, 2016.
[4] Jingjing Chen, Lei Pang, and Chong-Wah Ngo. Cross-modal recipe retrieval: How to cook this dish? In International Conference on Multimedia Modeling, pages 588–600. Springer, 2017.
[5] Jing-Jing Chen, Chong-Wah Ngo, Fu-Li Feng, and Tat-Seng Chua. Deep understanding of cooking procedure for cross-modal recipe retrieval. In 2018 ACM Multimedia Conference, pages 1020–1028. ACM, 2018.
[6] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2016.
[7] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 7–16. ACM, 2014.
[8] Kamran Ghasedi Dizaji, Feng Zheng, Najmeh Sadoughi, Yanhua Yang, Cheng Deng, and Heng Huang. Unsupervised deep generative adversarial hashing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3664–3673, 2018.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[12] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.
[15] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[16] Weiqing Min, Shuqiang Jiang, Jitao Sang, Huayang Wang, Xinda Liu, and Luis Herranz. Being a supercook: Joint food attributes and multimodal content modeling for recipe retrieval and exploration. IEEE Transactions on Multimedia, 19(5):1100–1113, 2017.
[17] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[18] Nitish Srivastava and Ruslan R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[20] Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, and Steven C. H. Hoi. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11572–11581, 2019.