RP2K: A Large-Scale Retail Product Dataset for Fine-Grained Image Classification

Jingtian Peng1  Chang Xiao2  Xun Wei1  Yifan Li1
1 Pinlan Data Technology Co., Ltd.  2 Columbia University

arXiv:2006.12634v5 [cs.CV] 18 Jan 2021

Abstract

We introduce RP2K, a new large-scale retail product dataset for fine-grained image classification. Unlike previous datasets focusing on relatively few products, we collect more than 500,000 images of retail products on shelves belonging to 2000 different products. Our dataset aims to advance research in retail object recognition, which has massive applications such as automatic shelf auditing and image-based product information retrieval. Our dataset enjoys the following properties: (1) It is by far the largest-scale dataset in terms of product categories. (2) All images are captured manually in physical retail stores under natural lighting, matching the scenario of real applications. (3) We provide rich annotations for each object, including sizes, shapes and flavors/scents. We believe our dataset could benefit both computer vision research and the retail industry. Our dataset is publicly available at https://www.pinlandata.com/rp2k_dataset.

1 Introduction

Retailing is a vital commercial activity that plays an essential role in our daily life. The traditional retail industry requires tremendous human labor across its entire supply chain, from product distribution to inventory counting. In response to the rise of online shopping, traditional markets are also quickly adopting AI-related technology at the physical store level to replace tedious human labor with computer algorithms and robots. Many promising applications of AI to retailing are expected to be widely realized in the near term. For instance, by using computer vision techniques, retailers can audit the placement of products or track in-store sales activities from shelf images, while customers can obtain product information by taking pictures of the shelf. To implement these applications, one of the core problems is to apply object recognition techniques to recognize the products on shelves.

Despite recent advances in computer vision, recognizing retail products on shelves, which is a basic task underlying many important retailing events, is still considerably challenging from the computer vision perspective. First, the number of product categories in a supermarket can be huge. According to Goldman et al. [13], a typical supermarket can carry thousands of different products. Second, different products may have similar appearances. For example, products from the same brand with different sizes or flavors usually look alike, but they should be recognized as separate products (see Fig. 1-a,b). Third, the camera angle of the product to be recognized and the lighting condition may vary a lot (see Fig. 1-c), so a reliable recognition algorithm should be able to handle such scene complexities. To overcome these challenges, several retail product datasets have been proposed in the past decades [42, 13, 28, 10, 15, 33]. However, most previous datasets either focus on a relatively small scale (fewer than 200 categories) or collect data in a controlled environment (e.g., a lab with sufficient light). Therefore, we propose the RP2K dataset, which contains images of 2000 different products (stock keeping units, or SKUs) from real retail stores, to bridge the gap between research and real-life applications.
Figure 1: Sample images from our dataset. Precise retail product recognition on shelves is considered highly challenging because (a) products from the same line may come in different sizes; they usually have similar appearances but different prices, and the image size does not reflect the real size of the products; (b) the manufacturer usually makes multiple flavors for one product line, but their appearances differ only subtly on the labels; (c) product images may be captured at different camera angles depending on their placement on the shelves, and the image can also be stretched due to camera distortion.

It is worth noting that a full retail product recognition pipeline usually consists of two separate parts: an object detector that locates the potential objects of interest, and a classifier that recognizes the object from a cropped image containing it. Since the object detection task has been addressed by many previous datasets [13, 42, 10], our dataset mainly focuses on recognizing objects on retail shelves, namely, given a shelf image and bounding boxes of potential products on the shelf, identifying the SKU ID for each bounding box (we also release an auxiliary shelf object detection dataset; see Appendix C for details). We highlight the contributions of our dataset as follows:

Large scale. Our dataset is large scale in terms of both the number of images and the number of categories. We collect images of 2000 different SKUs, each containing 250 images on average, rendering a dataset with more than 500k images. We also include the 10k original on-shelf images, with an average resolution of 3024 x 4032.

Realistic retail environment. Unlike many previous datasets whose images were taken in a laboratory environment or scraped from the web, our images are all manually captured in real retail stores. The collection of our dataset matches the settings of realistic applications.

Hierarchical labels. Besides the individual SKU ID of each image, we also provide multiple levels of hierarchical labels. The 2000 SKUs can be categorized into 6 meta-categories based on their product types, or another 7 meta-categories based on their shapes. In addition, we provide detailed attributes of each SKU, including the brand, flavor/type, and size, which allows users to evaluate their algorithms at a customized fine-grained level. We discuss this further in Sec. 3.

2 Related work

2.1 Fine-grained image classification

Fine-grained image classification is an extensively-studied field in computer vision, which refers to the task of recognizing images from multiple subordinate categories of a meta-category. Those subordinate categories usually share similar visual features, so additional attention is required to deal with them. To benchmark the performance of fine-grained image classification algorithms,
many datasets have been proposed in the past few years, including datasets for distinguishing different animal species [43, 20, 3], car models [23] and clothes [26]. Compared to traditional image classification, fine-grained image classification is considered more challenging due to the similar appearance of different subcategories. Furthermore, models trained on datasets with images captured in a lab-controlled environment may not perform well on natural images [11], so additional domain adaptation techniques are required to resolve this gap. Retail product recognition lies in the field of fine-grained image classification, as products from different brands or with different flavors can look very similar even to human eyes.

Figure 2: Compared to other datasets, our dataset contains a considerably larger number of categories, while maintaining a decent number of images.

2.2 Retail product dataset

Retail product detection/recognition has been widely studied due to its basic role in the automatic retailing industry. Even before deep learning dominated computer vision, researchers had investigated recognition algorithms on various retail product datasets for decades. For example, the SOIL-47 [4] dataset contains only 47 categories and 987 images, acquired under different geometries of light sources. Grozi-120 [28] is proposed for recognizing groceries in physical retail stores, and contains 120 grocery product categories and 11,870 images. Similarly, the supermarket product dataset [33] focuses on fruit and vegetable classification, containing 15 categories and 2,633 images. There are many other similar datasets [12, 41, 18], and we refer the reader to the survey [34] for a comprehensive list of retail product datasets. Those early datasets usually have fewer than 20,000 images and may not work well with today's data-demanding deep learning models. Thus, here we briefly review some of the recently-released datasets that are most relevant to ours.

RPC dataset [42] is a large-scale dataset proposed for automatic checkout. It contains 200 categories and 83,739 images. Each image includes a different number of products, from 3 to 20, depending on its clutter mode. Since bounding boxes and labels are provided for each object in each image, it can also provide more than 400k single-object images for the object recognition task. However, since the images were captured under controlled lighting against a clean background, this dataset may not be able to reflect the real-life scenario of product recognition on shelves.

Take Goods from Shelves (TGFS) [15] is an object detection/recognition dataset for automatic checkout. Unlike the RPC dataset captured in a laboratory environment, the TGFS dataset uses images collected from real self-service vending machines. Thus, the images are closer to a checkout system in a natural environment. Yet, it contains only 30K images from 24 fine-grained and 3 coarse classes, and the resolution of each image is 480 x 640, which limits the usage of this dataset.

SKU-110K dataset [13] is so far the largest retail image dataset in terms of the number of images. It contains more than 1M object images from 11,762 images of store shelves. Since the main focus of the dataset is retail object detection in densely packed scenes, it only provides the bounding boxes of each object in a scene, without further annotating the category of each bounding box.
Thus, this dataset cannot be used for object recognition purposes.

MVTEC [10] is an instance-aware semantic segmentation dataset for retail products. It provides 21,000 images of 60 object categories with pixelwise labels. It can also serve as an additional
grocery-relevant component to other semantic segmentation datasets [6, 8, 25]. Similar to the RPC dataset, MVTEC is also captured in a laboratory environment with controlled camera settings. The scale of this dataset, in terms of both images and categories, is also relatively small. Thus, it may not be suitable for the task of object recognition in real stores.

Figure 3: Pipeline of our data collection process. Our photo collectors were first distributed over 500 different retail stores and collected over 10k high-resolution shelf images. Then we use a pre-trained detection model to extract bounding boxes of potential objects of interest. After that, our human annotators discard the incorrect bounding boxes, including heavily occluded images and images that are not valid retail products. The remaining images are then annotated by the annotators.

3 Our dataset

In this section, we provide the details of our RP2K dataset.

Organization. Our dataset contains two components: the original shelf images and the individual object images cropped from the shelf images. The shelf images are labeled with the shelf type, store ID, and a list of bounding boxes of objects of interest. For each image cropped from its bounding box, we provide rich annotations including the SKU ID, product name, brand, product type, shape, size, flavor/scent and the bounding box reference to its corresponding shelf image. Fig. 5 demonstrates some sample attributes of the object images. Note that some attributes may not be applicable to particular products. We also provide a meta-category label for each object image, in two different ways. One is categorized by product type, which reflects the placement of the products, i.e., products of the same type are usually placed on the same or nearby shelves. We include 6 meta-categories by product type: dairy, liquor, beer, cosmetics, non-alcoholic drinks and seasoning. The other categorization is by product shape. We include 7 shapes (bottle, can, box, bag, jar, handled bottle and pack), which cover all shapes that appear in our dataset. These 7 shapes are also used in training our pre-annotation detector. Sample images for the different meta-categories are shown in Fig. 4. Besides these two meta-categorization methods, our rich labels provide an option for users to evaluate their algorithms at a customized fine-grained level.

Data collection. The shelf images were collected in 10 cities from 500 different general stores to increase data diversity. The data collectors were instructed to make sure the shelf is positioned at the center of the image, and each image should contain only one shelf. Each shelf image is at least 3000 by 3000 pixels. The image collectors used different (smartphone) camera models to capture the
photos under natural in-store lighting environments, mimicking realistic application scenarios. We also make sure each individual object in the collected images is at least 80 by 80 pixels and clear enough to be recognized by human eyes. After being collected from multiple stores, the shelf images were sent to a group of annotators to label the products appearing in them. To reduce the human workload, we first run a pre-annotation step using a pre-trained object detector to generate bounding boxes of potential objects. We use RetinaNet [24] as the base structure of the object detector and train it on our auxiliary object detection dataset; see Appendix C for details. The annotators then manually remove all unwanted bounding boxes (e.g., overlapping or occluded boxes) before assigning labels and attributes to each bounding box image. The data collection pipeline is summarized in Fig. 3.

Figure 4: Sample images from our dataset with different meta-categorizations. (top) Categorized by product types: dairy, liquor, beer, cosmetics, drinks, seasoning. (bottom) Categorized by product shapes: bottle, can, box, bag, jar, handled bottle, pack.

Statistics. We collect 14,368 high-resolution shelf images in total, with 37.1 objects per image on average, resulting in a dataset with 533,663 images of individual objects. Each individual object image represents a product from a total of 2000 SKUs. We split the train/test set with a ratio of 0.85/0.15. The detailed statistics for the different meta-categories are given in Table 1.

Auxiliary detection dataset. To encourage the development of full pipeline solutions, from on-shelf product detection to recognition, we also provide our auxiliary object detection dataset. We include 95,800 bounding boxes from 1400 shelf images with the 7 different shapes described in Fig. 4 to train our object detector. We refer the reader to Appendix C for a detailed description of the detection dataset. We want to emphasize that the object detection task for retail shelves has been targeted by many other datasets [13]; thus we do not position our dataset as an object detection dataset. Here we provide the detection dataset for completeness' sake.

Table 1: Statistics of our dataset.

                 Meta Category         SKUs    Train Imgs.  Test Imgs.  Total Imgs.  Imgs./SKU
  By Product     Dairy                  325       117,597      20,753      138,350      425.7
                 Liquor                 167        17,394       3,069       20,463      122.5
                 Beer                   238        57,875      10,213       68,088      236.4
                 Cosmetics              257        17,714       3,126       20,840       81.1
                 Non-Alcohol Drinks     354        46,147       8,143       54,290      153.4
                 Seasoning              609       196,887      34,745      231,632      380.3
  By Shape       Bottle               1,063       257,254      45,398      302,652      284.7
                 Can                    267        69,063      12,187       81,250      304.3
                 Box                    201        42,193       7,446       49,639      247.0
                 Bag                    109         9,814       1,732       11,546      105.9
                 Jar                     46         9,123       1,610       10,733      233.3
                 Handled Bottle         251        56,925      10,045       66,970      266.8
                 Pack                    63         9,219       1,627       10,846      172.2
  Total                               2,000       453,614      80,049      533,663      266.8
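To make the annotation structure described above concrete, the following is a hypothetical sketch of how a single object record could be represented and grouped by meta-category. The field names and values are illustrative only (the example values mirror a sample shown in Fig. 5); they are not the dataset's actual file schema.

```python
# Hypothetical per-object annotation record; field names are illustrative only.
from collections import defaultdict

example_record = {
    "sku_id": 746,
    "product_name": "Starbucks Latte 250mL",       # illustrative name
    "brand": "Starbucks",
    "product_type": "Drinks",                      # one of the 6 product-type meta-categories
    "shape": "Can",                                 # one of the 7 shape meta-categories
    "size": "250mL",
    "flavor": "Latte",
    "shelf_image_id": "store0123_shelf04",          # hypothetical reference to the shelf image
    "bbox": [412, 1088, 596, 1423],                 # hypothetical [x1, y1, x2, y2] in the shelf image
}

def group_by(records, key):
    """Group annotation records by any attribute, e.g. product_type or shape."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return groups

# e.g. group_by(all_records, "shape")["Can"] would collect every can-shaped object image.
```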
Table 2: Classification results for different meta-categories. "# images" indicates the total number of images available in the corresponding meta-category. MobileNetV1 shows better performance compared to InceptionV3. However, neither of them achieves accuracy high enough for deploying real applications such as product placement auditing.

                                  MobileNetV1           InceptionV3
  Meta-Category        # images   Top-1     Top-5       Top-1     Top-5
  Dairy                 138,350   91.05%    98.69%      84.17%    97.15%
  Liquor                 20,463   86.59%    97.15%      73.07%    92.89%
  Beer                    68,088   93.30%    98.54%      85.86%    96.52%
  Cosmetics               20,840   78.88%    95.95%      75.80%    93.83%
  Non-Alcohol Drinks      54,290   86.80%    96.51%      76.74%    93.29%
  Seasoning              231,632   88.08%    97.61%      75.46%    92.99%
  All                    533,663   87.05%    96.94%      54.94%    81.69%

4 Benchmarking our dataset

In this section, we provide our evaluations benchmarking the RP2K dataset. We demonstrate our results on the classification task. Yet, we would like to emphasize that our dataset is not limited to this task; we provide a further discussion of other potential use cases in Sec. 5. We report the experimental results by meta-category; specifically, we train independent models and evaluate the performance for each meta-category. We also test using a single model to predict all 2000 SKUs.

We use MobileNet [17] and InceptionV3 [39] as our base network structures to perform the classification. Two fully connected layers are attached after the base structure, each followed by a BatchNorm and a ReLU activation. Another fully connected layer, with output dimension equal to the number of classes, is set as the final layer of the whole network. For both MobileNet and InceptionV3, we use the ImageNet pre-trained model available in the TensorFlow [1] library, freeze the feature extraction layers, and optimize only the fully connected layers for 20 epochs with learning rate 0.1. We then unfreeze the feature extraction layers and train the entire network for another 5 epochs with learning rate 0.01. We use the Adam optimizer [21] for all experiments. All models are trained using the TensorFlow framework on a single NVIDIA Tesla V100 GPU. We report both top-1 and top-5 accuracies in our evaluation. As shown in Table 2, MobileNetV1 outperforms InceptionV3 on all 6 meta-categories, as well as when using all images. However, higher accuracy is required to implement commercial applications such as checking whether the products on a shelf are placed correctly.

Training from scratch and data augmentation. Since images in our dataset are captured under diverse viewing angles and lighting conditions, using data augmentation during training would presumably increase classification performance. In addition, we are also interested in comparing the performance of using a pre-trained model versus training from scratch.

Figure 5: Sample data with attribute information. Examples include SKU 746 (Drinks, Can, Starbucks, 250mL, Latte), SKU 748 (Drinks, Can, Starbucks, 250mL, Mocha), SKU 450 (Cosmetics, Handled Bottle, OMO, 1.8L) and SKU 1049 (Dairy, Box, Yili, 1L, Original).
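As a reference for the training protocol described above, below is a minimal TensorFlow/Keras sketch of the two-stage fine-tuning scheme (frozen backbone for 20 epochs at learning rate 0.1, then full fine-tuning for 5 epochs at learning rate 0.01). The widths of the two hidden fully connected layers and the input pipeline are assumptions; they are not specified in the text.

```python
# Minimal sketch of the two-stage fine-tuning protocol; hyperparameters follow the text,
# hidden-layer widths and the data pipeline (train_ds) are assumptions.
import tensorflow as tf

NUM_CLASSES = 2000           # all SKUs; use the per-meta-category count otherwise
IMG_SIZE = (224, 224)

def build_classifier(num_classes=NUM_CLASSES):
    base = tf.keras.applications.MobileNet(
        input_shape=IMG_SIZE + (3,), include_top=False,
        weights="imagenet", pooling="avg")
    x = base.output
    for units in (1024, 512):                       # widths are assumed, not given in the paper
        x = tf.keras.layers.Dense(units)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(base.input, out), base

model, base = build_classifier()

# Stage 1: freeze the feature extractor and train only the new fully connected head.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=20)   # train_ds is a hypothetical tf.data pipeline of cropped objects

# Stage 2: unfreeze everything and fine-tune the whole network at a lower learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=5)
```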
Table 3: Evaluation of different training protocols. We find that adding data augmentation does not necessarily increase recognition performance. We hypothesize that, because our dataset already includes lighting variance and camera distortion, further augmenting the data does not make it closer to real images.

  Model        Full      Pre-trained   Full + Aug.   Pre-trained + Aug.
  Top-1 Acc.   85.19%    87.05%        76.30%        86.84%
  Top-5 Acc.   96.08%    96.94%        91.57%        97.60%

Thus, we further evaluate MobileNetV1 using 4 different training schemes: training from scratch, pre-training, training from scratch with data augmentation, and pre-training with data augmentation. For all training schemes we use the Adam optimizer. The training-from-scratch scheme refers to randomly initializing the entire network and training it for a fixed schedule: we first train the network with learning rate 0.1 for 30 epochs, then decay the learning rate to 0.01 and train for another 10 epochs. The momentum is set to 0.9 throughout. To mimic the real-scenario environment, we use the following data augmentation scheme in training (a minimal implementation sketch is given at the end of this section):

• Adding a constant border with a width randomly chosen between 0 and 30px.
• Cropping randomly by up to 10px.
• Applying a randomly-parameterized perspective transformation.
• Darkening/brightening the image randomly by up to 20%.

The results of evaluating the 4 training protocols are shown in Table 3. Besides reporting classification accuracy for each meta-category, we also plot the relationship between the number of instance images and the classification accuracy of each individual instance. To analyze this relationship, we first sort all 2000 products by instance count, then group every 10 adjacent products and calculate the average top-1 classification accuracy for each group. As shown in Fig. 6, the prediction accuracy decreases with the number of instances available in the dataset, and the variance of the accuracy increases. This suggests that future research should focus on improving the prediction accuracy on the "long tail", since the total accuracy can be high simply because instances with abundant training data usually have high prediction performance and contribute more to the aggregate statistics, while the long-tail part may still have low accuracy.

Figure 6: The long-tail problem in fine-grained recognition. As the number of available images decreases, the recognition accuracy tends to decrease.

As mentioned in Sec. 3, we also collect an object detection dataset for training the pre-annotator. To give a reference for the real retail store application scenario, we evaluate the detection performance on our auxiliary object detection dataset. Please refer to Appendix C for detailed results and discussions.
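The data augmentation scheme listed above can be sketched as follows using OpenCV and NumPy. Only the four operations themselves come from the text; the exact parameters (e.g., the perspective jitter magnitude and border color) are assumptions.

```python
# Rough sketch of the four augmentation operations described in Sec. 4.
import random
import numpy as np
import cv2

def augment(img: np.ndarray) -> np.ndarray:
    # 1. Add a constant border with a width randomly chosen in [0, 30] px.
    b = random.randint(0, 30)
    img = cv2.copyMakeBorder(img, b, b, b, b, cv2.BORDER_CONSTANT, value=(0, 0, 0))

    # 2. Crop randomly by up to 10 px on each side.
    h, w = img.shape[:2]
    top, bottom = random.randint(0, 10), random.randint(0, 10)
    left, right = random.randint(0, 10), random.randint(0, 10)
    img = img[top:h - bottom, left:w - right]

    # 3. Apply a randomly-parameterized perspective transformation.
    h, w = img.shape[:2]
    jitter = 0.05  # assumed corner-jitter magnitude (fraction of image size)
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + np.float32(np.random.uniform(-jitter, jitter, (4, 2)) * [w, h])
    M = cv2.getPerspectiveTransform(src, dst)
    img = cv2.warpPerspective(img, M, (w, h))

    # 4. Darken/brighten the image randomly by up to 20%.
    factor = 1.0 + random.uniform(-0.2, 0.2)
    img = np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)
    return img
```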
5 Other potential research problems

Besides the classification and detection tasks, there are many other potential research problems on our dataset. We list a few of them in this section.

Adversarial attacks and defenses. Adversarial attacks refer to adding deliberately crafted, imperceptible noise to natural images, aiming to entirely mislead the network's decision [14]. Adversarial attacks pose serious threats to numerous machine learning applications, from autonomous driving to face recognition authorization. Motivated by building robust neural network models, a series of adversarial attacks and defenses have been proposed, advancing the understanding of adversarial examples [5, 36, 45, 27, 40]. However, most existing defense methods are evaluated on standard benchmarks such as MNIST, CIFAR-10 or ImageNet. In those datasets, the visual difference between images from different classes is often large. In contrast, due to its fine-grained characteristics, two images from different classes in our dataset can hold very similar visual features. Besides, the total number of categories in our dataset (2000) is also considerably higher than in traditional benchmarks such as CIFAR-10 (10) or ImageNet (200). These two factors could pose a much more challenging problem for the adversarial defense task. Thus, it would be worthwhile to develop and evaluate new adversarial defense algorithms on our dataset.

Generative models on structured images. Image synthesis has achieved remarkable progress in recent years with the emergence of various generative models [31, 2, 44, 29, 19, 22]. State-of-the-art generative models are capable of generating realistic high-resolution images of many distinct objects and scenes. Despite the success in generating natural images, generating images with structured layouts remains challenging, as pointed out by many recent works [30, 16, 38]. Our original shelf image dataset could be a practical dataset for evaluating generative models on structured image synthesis. The bounding boxes combined with SKU labels in our dataset provide ground-truth semantic layout information. Once robust generative models are developed, they could also pave the way for generating more data for shelf object detection and recognition.

Few-shot learning. The ability to learn from a few examples remains a challenge for modern machine learning systems, and this problem has received significant attention from the machine learning community [9, 32, 37, 7, 35]. According to Fig. 6, the long-tail characteristic of our dataset provides more than 100 classes with fewer than 30 instance images. Thus, our dataset could also serve for evaluating few-shot learning algorithms. Moreover, the large number of categories residing in our dataset enables a broader range of choices for evaluating such algorithms. Beyond its research value, few-shot learning is also important in the retail industry, since some products may have only a few instances placed in a retail store.
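As an illustration of how the long-tail classes could be used for few-shot evaluation, the sketch below samples N-way K-shot episodes from classes with fewer than 30 images. The samples_by_class mapping is a hypothetical input (SKU ID to image paths); RP2K does not prescribe this evaluation protocol.

```python
# Hedged sketch of N-way K-shot episode sampling over RP2K's long-tail classes.
import random

def build_episode(samples_by_class, n_way=5, k_shot=5, n_query=10):
    """samples_by_class: dict mapping SKU id -> list of image paths (hypothetical input)."""
    # Restrict to rare classes, e.g. those with fewer than 30 images (the long tail),
    # while keeping enough images to fill both support and query sets.
    rare = {c: paths for c, paths in samples_by_class.items()
            if k_shot + n_query <= len(paths) < 30}
    classes = random.sample(sorted(rare), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        paths = random.sample(rare[c], k_shot + n_query)
        support += [(p, label) for p in paths[:k_shot]]   # k_shot labeled examples per class
        query += [(p, label) for p in paths[k_shot:]]     # held-out examples to classify
    return support, query
```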
6 Conclusion

We introduce a new retail recognition dataset, RP2K. Our dataset is inspired by the task of retail product recognition on store shelves, which has tremendous applications in the AI-powered retail industry: image-based product retrieval, empty-shelf detection, and sales activity tracking, to name a few. As a fine-grained classification dataset, ours has by far the largest number of categories, while maintaining a decent number of images. In addition to the pre-defined meta-categories, we also provide rich attribute information that allows users to adjust the fine-grained level for their evaluations. To bridge the gap between research and real-life applications, our data are all collected in natural retail store environments. Our experiments show that there is still much room for improvement for current models to establish a robust recognition system. Besides object recognition, our dataset could also be used for other important computer vision tasks such as few-shot learning and generative modeling. We believe our dataset can further the progress already occurring in the retail industry.
References

[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016), pp. 265–283.
[2] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning (2017), pp. 214–223.
[3] Berg, T., Liu, J., Woo Lee, S., Alexander, M. L., Jacobs, D. W., and Belhumeur, P. N. Birdsnap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 2011–2018.
[4] Burianek, J., Ahmadyfard, A., and Kittler, J. SOIL-47, the Surrey Object Image Library. Centre for Vision, Speech and Signal Processing, University of Surrey (2002). http://www.ee.surrey.ac.uk/CVSSP/demos/colour/soil47
[5] Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (2019), pp. 1310–1320.
[6] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 3213–3223.
[7] Duan, Y., Andrychowicz, M., Stadie, B., Ho, O. J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. One-shot imitation learning. In Advances in Neural Information Processing Systems (2017), pp. 1087–1098.
[8] Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision 111, 1 (2015), 98–136.
[9] Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (2017), JMLR.org, pp. 1126–1135.
[10] Follmann, P., Bottger, T., Hartinger, P., Konig, R., and Ulrich, M. MVTec D2S: Densely segmented supermarket dataset. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 569–585.
[11] Gebru, T., Hoffman, J., and Fei-Fei, L. Fine-grained recognition in the wild: A multi-task domain adaptation approach. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 1349–1358.
[12] George, M., and Floerkemeier, C. Recognizing products: A per-exemplar multi-label image classification approach. In European Conference on Computer Vision (2014), Springer, pp. 440–455.
[13] Goldman, E., Herzig, R., Eisenschtat, A., Goldberger, J., and Hassner, T. Precise detection in densely packed scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 5227–5236.
[14] Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
[15] Hao, Y., Fu, Y., and Jiang, Y.-G. Take goods from shelves: A dataset for class-incremental object detection. In Proceedings of the 2019 International Conference on Multimedia Retrieval (2019), pp. 271–278.
[16] Hinz, T., Heinrich, S., and Wermter, S.
Generating multiple objects at spatially distinct locations. In International Conference on Learning Representations (2019).
[17] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[18] Karlinsky, L., Shtok, J., Tzur, Y., and Tzadok, A. Fine-grained recognition of thousands of object categories with single-example training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4113–4122.
[19] Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 4401–4410.
[20] Khosla, A., Jayadevaprakash, N., Yao, B., and Fei-Fei, L. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO, June 2011).
[21] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[22] Kotovenko, D., Sanakoyeu, A., Lang, S., and Ommer, B. Content and style disentanglement for artistic style transfer. In Proceedings of the IEEE International Conference on Computer Vision (2019), pp. 4422–4431.
[23] Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13) (Sydney, Australia, 2013).
[24] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2980–2988.
[25] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (2014), Springer, pp. 740–755.
[26] Liu, Z., Luo, P., Qiu, S., Wang, X., and Tang, X. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
[27] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (2018).
[28] Merler, M., Galleguillos, C., and Belongie, S. Recognizing groceries in situ using in vitro training data. In 2007 IEEE Conference on Computer Vision and Pattern Recognition (2007), IEEE, pp. 1–8.
[29] Mustafa, M., Bard, D., Bhimji, W., Lukić, Z., Al-Rfou, R., and Kratochvil, J. M. CosmoGAN: Creating high-fidelity weak lensing convergence maps using generative adversarial networks. Computational Astrophysics and Cosmology 6, 1 (2019), 1–13.
[30] Qi, M., Wang, Y., Qin, J., and Li, A. KE-GAN: Knowledge embedded generative adversarial networks for semi-supervised scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 5237–5246.
[31] Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
[32] Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., and Zemel, R. S. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018).
[33] Rocha, A., Hauagge, D. C., Wainer, J., and Goldenstein, S. Automatic fruit and vegetable classification from images. Computers and Electronics in Agriculture 70, 1 (2010), 96–104.
[34] Santra, B., and Mukherjee, D. P. A comprehensive survey on computer vision based approaches for automatic identification of products in retail store. Image and Vision Computing 86 (2019), 45–63.
[35] Satorras, V. G., and Estrach, J. B.
Few-shot learning with graph neural networks. In International Conference on Learning Representations (2018).
[36] Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! In Advances in Neural Information Processing Systems (2019), pp. 3353–3364.
[37] Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4077–4087.
[38] Sun, W., and Wu, T. Image synthesis from reconfigurable layout and style. In The IEEE International Conference on Computer Vision (ICCV) (October 2019).
[39] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2818–2826.
[40] Tramer, F., Carlini, N., Brendel, W., and Madry, A. On adaptive attacks to adversarial example defenses. arXiv preprint arXiv:2002.08347 (2020).
[41] Varol, G., and Kuzu, R. S. Toward retail product recognition on grocery shelves. In Sixth International Conference on Graphic and Image Processing (ICGIP 2014) (2015), vol. 9443, International Society for Optics and Photonics, p. 944309.
[42] Wei, X.-S., Cui, Q., Yang, L., Wang, P., and Liu, L. RPC: A large-scale retail product checkout dataset. arXiv preprint arXiv:1901.07249 (2019).
[43] Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Tech. Rep. CNS-TR-2010-001, California Institute of Technology, 2010.
[44] Xiao, C., Zhong, P., and Zheng, C. BourGAN: Generative networks with metric embeddings. In Advances in Neural Information Processing Systems (2018), pp. 2269–2280.
[45] Xiao, C., Zhong, P., and Zheng, C. Resisting adversarial attacks by k-winners-take-all. arXiv preprint arXiv:1905.10510 (2019).
A Additional Samples of Shelf Images

Figure 7: Additional shelf images. (a) Beer. (b) Seasoning. (c) Cosmetics. (d) Liquor.
Figure 8: More additional shelf images. (a) Dairy. (b) Seasoning. (c) Non-Alcoholic Drink. (d) Cosmetics.
B Additional Samples of Object Images

Figure 9: Additional images with different SKUs. Shown are SKU 54 (Dairy, Box, 1L, Mengniu, Original), SKU 126 (Drink, Can, 250mL, Meco, Peach), SKU 127 (Drink, Can, 250mL, Meco, Lime), SKU 360 (Cosmetics, Can, 250mL, TJOY, Oil Control) and SKU 361 (Cosmetics, Can, 250mL, TJOY, Moisture).
Figure 10: More images with different SKUs. Shown are SKU 589 (Beer, Pack, 12x250mL, TUBORG, Original), SKU 646 (Drink, Bottle, 100mL, evian, Original), SKU 1003 (Seasoning, Bottle, 1L, Weidamei, Fish), SKU 1345 (Seasoning, Jar, 348g, Tantanxiang, Original) and SKU 1724 (Seasoning, Bag, 200g, Haidilao, Pork).

C Auxiliary Detection Dataset

Here we provide the evaluation on the auxiliary detection dataset. We split the 95,800 bounding boxes into 80,000 training boxes and 15,800 test boxes. We use the standard RetinaNet [24] to evaluate the detection performance. We report the Average Precision (AP) for each shape, as well as the mean Average Precision (mAP) over the entire test set, at different IoU thresholds; the results are summarized in Table 4.
Table 4: Performance of object detection. AP(x) indicates AP with IoU threshold x.

  Shape             AP(0.5)   AP(0.55)   AP(0.6)   AP(0.65)   AP(0.7)   AP(0.75)   AP(0.8)
  Box               0.3485    0.3475     0.3440    0.3372     0.3195    0.3003     0.2670
  Can               0.6886    0.6868     0.6837    0.6811     0.6766    0.6650     0.6314
  Bottle            0.7525    0.7509     0.7487    0.7448     0.7365    0.7137     0.6545
  Jar               0.2620    0.2559     0.2350    0.2302     0.2191    0.1921     0.1748
  Handled Bottle    0.4919    0.4896     0.4658    0.4505     0.4014    0.3804     0.3059
  Bag               0.3449    0.3380     0.3273    0.3151     0.2827    0.2518     0.1901
  Pack              0.4643    0.4634     0.4634    0.4634     0.4634    0.4494     0.3789
  mAP for All       0.6186    0.6164     0.6121    0.6069     0.5962    0.5739     0.3994
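For readers reproducing numbers of the kind in Table 4, the sketch below shows a simplified single-image, single-class computation of IoU and AP at a given IoU threshold. It omits cross-image aggregation and the precision interpolation used by standard VOC/COCO evaluators, so its numbers will not exactly match those reported here.

```python
# Simplified IoU and AP-at-threshold computation; a sketch, not a full evaluator.
import numpy as np

def iou(box_a, box_b):
    """Boxes are [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(detections, gt_boxes, iou_thresh=0.5):
    """detections: list of (score, box) for one class in one image; gt_boxes: list of boxes."""
    detections = sorted(detections, key=lambda d: -d[0])   # highest-confidence first
    matched = set()
    tp = np.zeros(len(detections))
    for i, (_, box) in enumerate(detections):
        # Greedily match each detection to the best unmatched ground-truth box.
        candidates = [(iou(box, g), j) for j, g in enumerate(gt_boxes) if j not in matched]
        if candidates:
            best_iou, best_j = max(candidates)
            if best_iou >= iou_thresh:
                tp[i] = 1
                matched.add(best_j)
    recall = np.cumsum(tp) / max(len(gt_boxes), 1)
    precision = np.cumsum(tp) / (np.arange(len(detections)) + 1)
    # Area under the precision-recall curve via simple step integration.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```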