International Journal of Advanced Science and Technology, Vol. 29, No. 7, (2020), pp. 12077-12083

A Comparative Study of Deep Learning Models for Doodle Recognition

Ishan Miglani1, Dinesh Kumar Vishwakarma2
1,2 Department of Information Technology, Delhi Technological University, New Delhi, India
1 miglaniishan@gmail.com, 2 dinesh@dtu.ac.in

Abstract

Recent advancements in the field of Deep Neural Networks have led researchers and industries to challenge the boundaries between humans and machines. Computers are advancing at great speed; if they can understand our doodles or quick line drawings, much more advanced and simplified forms of communication become possible. This has a major effect on handwriting recognition and its important applications in the areas of OCR (Optical Character Recognition), ASR (Automatic Speech Recognition) and NLP (Natural Language Processing). This paper aims to implement various deep learning techniques to efficiently recognize the labels of hand-drawn doodles. In this study, the Google Quick Draw dataset has been used to train and evaluate the models. Various state-of-the-art transfer learning approaches are compared, and the traditional Convolutional Neural Network approach is also examined. Two pre-trained models, VGG16 and MobileNet, are used. We compare these models using metrics such as top-3 accuracy on the training and validation data and Mean Average Precision (MAP@3). The results show that the transfer learning model with the VGG16 architecture outperforms the other methods, giving a top-3 accuracy of 93.97% on 340 categories.

Keywords: Deep Learning, Pictionary, Games, Transfer Learning, Convolutional Neural Network, VGG16, Doodle

1 Introduction:

Drawing and understanding images is a way of communicating when words fall short or language is inadequate due to differing cultural and literacy levels. When successful, this approach can serve a variety of applications, such as an application that lets mute people convey their needs to others, or an interface where a person communicates their needs through doodles. People learning new languages can also draw doodles and get translations. All of these applications require computers to understand our doodles or quick line drawings. This paper aims to build models that can efficiently recognize user-drawn doodles for a variety of purposes, and then to compare their performance. The problem we examine, determining the label of an object in a doodle using only visual cues, is a challenging task. In this paper, we experiment with two main techniques, Convolutional Neural Networks (CNNs) and transfer learning, to understand how different types of architectures perform on this problem. Transfer learning is a deep learning technique that transfers knowledge learned in a source domain to a target domain. Many researchers have previously addressed the problem of image recognition extensively, with and without transfer learning, such as in [1]. Other deep learning techniques have also proven very useful, as discussed in [2], [3]. Although pre-trained ImageNet models have been widely used for transfer learning in other image classification tasks, they have not been commonly applied to hand-drawn images. In the ImageNet competition [4], winners in recent years have mostly used deep learning techniques, particularly CNNs. These factors bear favourably on our analysis and guided our choice of models for experimentation.
The remainder of this paper is organized as follows: Section 2 gives a detailed study of the work done in related fields. Section 3 presents the proposed framework of our models. In Section 4 the implementation and experimental details are discussed. Section 5 describes the evaluations performed and the results obtained. The last section, Section 6, concludes the research work with a detailed insight into its future scope.

2 Related Work:

Labelling doodles has become an important benchmark task. Multiple systems have been proposed for sketch classification; in [8] the authors implement a sketch classifier in an embedded system. Various datasets have also been proposed: the WORDGUESS-160 dataset (Pictionary-style word guessing on hand-drawn object sketches: dataset, analysis, and deep network models) is one such example, and the MNIST dataset [19] has also been used in handwriting recognition research. We use the QuickDraw dataset [5] provided by Google. It contains more than 50 million sketches across 340 categories, and thus provides enough complexity for a fair comparison of the CNN, MobileNet, and transfer learning approaches.

The value of transfer learning can be seen in [9], where a pre-trained VGG16 model was used to classify brain images. This was also the case in [10], where multiple ImageNet-based models were used for the classification of flowers. Transfer learning is therefore a very beneficial method for improving the performance of a model, as the knowledge contained in the pre-trained model can be reused, as summarized in [11] and [12].

MobileNet is a recent and efficient CNN architecture proposed in [6]. What makes MobileNet notable is that it requires very little computational power without sacrificing much accuracy, so it is easy to apply transfer learning to such a model and run it. MobileNet has a very lightweight architecture, as the work of many convolutional layers is done by depthwise separable convolutions. This architecture is very useful for creating efficient models, as seen in [13] and [14], where the existing MobileNet architecture is modified to achieve higher accuracy and efficiency.

Figure 1. CNN Architecture

3 Proposed Methodology:

In our study, we examined two main approaches: traditional Convolutional Neural Networks and various transfer learning models. We used the VGG16 and MobileNet pre-trained architectures to experiment with the transfer learning approaches. We chose VGG16 due to the simplicity of the data, and MobileNet due to the competitive advantage of its size, which will enable our model to be used on a variety of devices.

A. CONVOLUTIONAL NEURAL NETWORK (CNN)

We use a CNN to create a benchmark, as it is a natural candidate for image recognition: its dimensionality reduction suits the huge number of parameters in an image, and it has previously been used successfully in image classification problems [17], [18]. We train a CNN with 3 convolutional layers and 2 dense layers, interspersed with max pooling layers, to extract features and then classify the images. The detailed architecture used is shown in Fig. 1, and a minimal sketch of this baseline is given below.
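The paper specifies the baseline's layer types (3 convolutional layers and 2 dense layers interspersed with max pooling, per Fig. 1) but not the filter or unit counts, so the Keras sketch below is only an illustration under assumed sizes (32/64/128 filters and a 512-unit dense layer), with 64 x 64 RGB inputs over the 340 QuickDraw classes.

```python
# Minimal sketch of the CNN baseline: 3 conv layers and 2 dense layers
# interspersed with max pooling. Filter and unit counts are illustrative
# assumptions; the paper only gives the layer types (see Fig. 1).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 340          # QuickDraw categories used in this study
INPUT_SHAPE = (64, 64, 3)  # 64 x 64 RGB renderings of the doodles

def build_baseline_cnn():
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    return model

model = build_baseline_cnn()
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    # Top-3 accuracy is tracked because it is the main evaluation criterion.
    metrics=[tf.keras.metrics.TopKCategoricalAccuracy(k=3, name="top_3_accuracy")],
)
model.summary()
```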
Figure 2. A sketch of Transfer Learning with VGG16 Architecture

B. TRANSFER LEARNING

With transfer learning, we gain an advantage through generalisation by beginning from patterns learned for a different task. Effectively, instead of starting the learning process from scratch, it allows us to start from patterns discovered while solving a different but related task, thus increasing the adaptability and flexibility of the model by reusing existing deep learning models. The transfer learning block diagram is shown in Fig. 2. This technique is particularly helpful when data of sufficient size is not available. We apply this design of transfer learning in three different ways (a sketch of these strategies is given after the list):

– Strategy 1: define two models with the same architectures as the pre-trained models VGG16 [15] and MobileNet [6], discard the weights of every layer of the pre-trained models, and re-train the models on the QuickDraw dataset from scratch.

– Strategy 2: fine-tune the pre-trained models with the layers frozen up to several convolutional blocks. The lower layers capture the most basic information that can be extracted from the data, and this can often be reused as-is on another problem, since the low-level information tends to be the same.

– Strategy 3: fine-tune all the layers of the CNN, leaving every layer unfrozen, to update the weights of the pre-trained model to better fit our specific problem.
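As a concrete illustration of the three strategies applied to VGG16, the sketch below loads the Keras VGG16 base either without weights (Strategy 1) or with ImageNet weights (Strategies 2 and 3) and toggles which layers stay trainable. The 64 x 64 input size, 340 classes, and the choice of blocks 4 and 5 as the "last two convolutional blocks" follow the settings described in Section 4; the layer-name test and the minimal placeholder head are implementation assumptions.

```python
# Sketch of the three transfer learning strategies with VGG16, assuming
# 64 x 64 RGB inputs and 340 output classes; the one-layer classification
# head here is illustrative only (the actual head is described in Section 4).
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 340
INPUT_SHAPE = (64, 64, 3)

def build_vgg16_transfer(strategy):
    if strategy == 1:
        # Strategy 1: same architecture, no pre-trained weights,
        # trained on QuickDraw from scratch.
        base = VGG16(weights=None, include_top=False, input_shape=INPUT_SHAPE)
    else:
        # Strategies 2 and 3 start from the ImageNet weights.
        base = VGG16(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
        if strategy == 2:
            # Strategy 2: freeze everything except the last two
            # convolutional blocks (blocks 4 and 5 stay trainable).
            for layer in base.layers:
                layer.trainable = layer.name.startswith(("block4", "block5"))
        else:
            # Strategy 3: leave all layers unfrozen and fine-tune everything.
            base.trainable = True

    # Placeholder classification head on top of the convolutional base.
    x = layers.GlobalAveragePooling2D()(base.output)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(base.input, outputs)

model = build_vgg16_transfer(strategy=3)
```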
4 Experimental Setup:

A. Dataset

We use Google's QuickDraw dataset [5], the largest dataset of hand-drawn doodles, consisting of 50 million images across 340 categories. All the images were collected as user-drawn doodles as part of Google's Quick, Draw! experiment. This dataset is highly suited to our needs, as doodles are an accurate representation of what human-to-human interaction may look like when we need to communicate something quickly via an image. We used the comma-separated value (CSV) files of the dataset that are offered publicly by Google. The images in these files are stored as JSON arrays and have been simplified by uniformly scaling the drawings and resampling the strokes. Each drawing lies in a 256 x 256 raw pixel region with values between 0 and 255. All 340 classes from the overall pool were used and stayed constant throughout our experiments. The total number of images used was 50 million. For each class the data was split in the ratio 80:10:10 for the training, validation, and test sets. Some examples of the dataset images are shown in Fig. 3.

Figure 3. Examples of the doodles available in the Quick Draw Dataset

B. Data Preprocessing

The data for the drawings of each class label exists in separate CSV files. We first shuffled the CSVs, generating 100 new files with data from all classes, to make sure that the model receives a random sample of images as input and to remove ordering bias. We used colour-coded rendering of the stroke data to take advantage of the RGB channels while building the CNN, so that the model could distinguish between different strokes. We also augmented the images by randomly flipping, rotating, or blocking out parts of them, adding noise to the images and improving the model's robustness to noise.

C. Experimental Environment

The GPU used in these experiments is an NVIDIA Tesla P100 with 30 GB of RAM, and the deep learning framework is TensorFlow, configured with CUDA Toolkit 10.0. In these experiments, we used Google's QuickDraw dataset, consisting of 50 million images across 340 class labels.

D. Model Description and Training Settings

For the VGG16 and MobileNet models, the models were first retrained on the Google QuickDraw dataset. Second, in VGG16 all the layers except the last 2 convolutional blocks were frozen and the model was trained. On top of the existing architecture we added a global spatial average pooling layer and two dense layers of 512 units with ReLU activation, each followed by a dropout of 0.3, and finally a softmax layer for classification. We used the Adam optimizer [16]. Third, in VGG16 all layers were unfrozen and the weights were updated. The learning rate used for all three approaches was 0.0001, for 70 epochs with a batch size of 400 images. A sketch of this classification head and training configuration is given below.
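The following minimal sketch assembles the classification head and training settings described above on top of a partially frozen VGG16 base (Strategy 2). The categorical cross-entropy loss and the dataset variable names (x_train, y_train, x_val, y_val) are assumptions not stated in the paper.

```python
# Sketch of the classification head and training settings: global average
# pooling, two Dense(512, relu) layers each followed by Dropout(0.3), a
# softmax output, Adam with learning rate 0.0001, 70 epochs, batch size 400.
# The loss function is an assumed choice (categorical cross-entropy).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 340
INPUT_SHAPE = (64, 64, 3)

base = VGG16(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
# Freeze all layers except the last two convolutional blocks.
for layer in base.layers:
    layer.trainable = layer.name.startswith(("block4", "block5"))

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(base.input, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=[tf.keras.metrics.TopKCategoricalAccuracy(k=3, name="top_3_accuracy")],
)

# Hypothetical array names for the 80/10 training and validation splits:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=70, batch_size=400)
```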
5 Result and Analysis:

As explained in Section 4, 64 x 64 RGB images were used in our study. In particular, 80% of the data was used for training, 10% for validation, and 10% for testing. For the same dataset, different deep learning architectures were tested, and the results are given in the following subsections.

A. Performance Evaluation Criteria

Model evaluation was performed by calculating top-3 accuracy, validation top-3 accuracy, and MAP@3. Top-N accuracy measures how often the true class falls within the top N values of the predicted softmax distribution; for instance, with top-3 accuracy the correct class only needs to appear among the three most probable predicted classes to count as correct. Validation top-3 accuracy is the top-3 accuracy calculated on the validation set.

The average of the precisions at the positions where a relevant image occurs is called the Average Precision (AP):

$AP = \frac{1}{R} \sum_{i=1}^{N} rel(i) \cdot P@i$    (1)

where R is the number of relevant items, N is the number of predictions considered, P@i is the precision at cut-off i, and rel(i) equals 1 if the item at position i is relevant and 0 otherwise. mAP is the mean AP across all of the queries, where the number of queries is M:

$mAP = \frac{1}{M} \sum_{q=1}^{M} AP_q$    (2)

For MAP@k, the order of the recommendations is important, as precision@k is the percentage of correct items in the first k recommendations. This metric gives "partial credit" to predictions that are almost correct. Specifically, the 3 most probable guesses per sample can count towards the score (with weights 1, 1/2 and 1/3, respectively). A computational sketch of MAP@3 is given after Table 1.

B. Results

Table 1 summarizes the performance of our proposed models in doodle classification. In terms of top_3_accuracy, the VGG16 model in which all the layers of the pre-trained network were retrained performed the best, giving an accuracy of 93.97% with a val_top_3_accuracy of 93.89% and a MAP@3 score of 0.87. This was closely followed by the VGG16 model with the second strategy, where only a part of the layers of the pre-trained network were trained again; this model showed promising results but was outperformed when all the layers of the network were retrained. The models where only the existing architectures were used, without the pre-trained weights, also performed well, with the VGG16 model giving a top_3_accuracy of 91.99%, a val_top_3_accuracy of 92.41% and a MAP@3 of 0.85, while the MobileNet model gave a top_3_accuracy of 92.67%, a val_top_3_accuracy of 92.69% and a MAP@3 of 0.86. All of these models outperformed the CNN model that was used as a baseline. The results of the best performing model are highlighted in bold in Table 1.

TABLE 1: COMPARISON OF MODEL ACCURACY
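To make the MAP@3 scoring rule concrete for the single-label case used here, the NumPy sketch below credits each sample with 1, 1/2, or 1/3 depending on where the true class appears among the three most probable predictions; the function name and array shapes are illustrative assumptions.

```python
# Sketch of MAP@3 for single-label samples: each sample scores 1, 1/2, or 1/3
# depending on the rank of the true class among the three most probable
# predictions, and 0 if it is absent; the mean over samples is MAP@3.
import numpy as np

def map_at_3(probs, true_labels):
    """probs: (n_samples, n_classes) softmax outputs; true_labels: (n_samples,)."""
    # Indices of the three highest-probability classes, best first.
    top3 = np.argsort(-probs, axis=1)[:, :3]
    scores = np.zeros(len(true_labels))
    for rank in range(3):
        hit = top3[:, rank] == true_labels
        # Credit 1 / (rank + 1) where the true class appears at this rank.
        scores = np.maximum(scores, hit / (rank + 1))
    return scores.mean()

# Example: three samples over four classes.
probs = np.array([[0.6, 0.2, 0.1, 0.1],   # true class ranked 1st -> 1
                  [0.2, 0.5, 0.2, 0.1],   # true class ranked 2nd -> 1/2
                  [0.4, 0.3, 0.2, 0.1]])  # true class ranked 4th -> 0
print(map_at_3(probs, np.array([0, 0, 3])))  # (1 + 0.5 + 0) / 3 = 0.5
```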
6 Conclusion:

In this paper, we proposed and compared five different models for recognizing the labels of doodles in Google's Quick Draw dataset. The five proposed models are a CNN trained from scratch and transfer learning models built from the pre-trained VGG16 and MobileNet networks in two ways: one using the architecture of the pre-trained network without its pre-trained weights, and the other using the network together with its pre-trained weights. In the second approach, two variations were used: one where all the layers of the network are retrained, and another where only part of the layers remain unfrozen and are retrained. Among the models discussed in this paper, the VGG16 model in which all layers are retrained achieved the best results, with a validation top-3 accuracy of 93.89% and a MAP@3 of 0.87, outperforming all other models.

In the future, combining the transferred features of multiple CNNs could further improve classification accuracy. The precision of the classifier could also be improved by incorporating the drawing stroke order. More memory-efficient models, such as shallower or faster architectures, may be considered for embedded devices.

References

1. S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
2. M. T. Islam, B. M. N. Karim Siddique, S. Rahman and T. Jabid, "Image recognition with deep learning," 2018 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Bangkok, 2018, pp. 106-110.
3. S. Panigrahi, A. Nanda and T. Swarnkar, "Deep learning approach for image classification," 2018 2nd International Conference on Data Science and Business Analytics (ICDSBA), Changsha, 2018, pp. 511-51.
4. ImageNet. http://www.image-net.org/. Accessed: 2020-05-01.
5. Quick, Draw! dataset. https://github.com/googlecreativelab/quickdraw-dataset. Accessed: 2018-12.
6. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
7. Models for image classification with weights trained on ImageNet. https://keras.io/applications/. Accessed: 2020-05-01.
8. T. Tsai, P. Chi and K. Cheng, "A sketch classifier technique with deep learning models realized in an embedded system," 2019 IEEE 22nd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), Cluj-Napoca, Romania, 2019, pp. 1-4.
9. T. Kaur and T. K. Gandhi, "Automated brain image classification based on VGG-16 and transfer learning," 2019 International Conference on Information Technology (ICIT), Bhubaneswar, India, 2019, pp. 94-98.
10. Y. Wu, X. Qin, Y. Pan and C. Yuan, "Convolution neural network based transfer learning for classification of flowers," 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), Shenzhen, 2018, pp. 562-566.
11. M. Shaha and M. Pawar, "Transfer learning for image classification," 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, 2018, pp. 656-660.
12. L. Shao, F. Zhu and X. Li, "Transfer learning for visual categorization: a survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 5, pp. 1019-1034, May 2015.
13. D. Sinha and M. El-Sharkawy, "Thin MobileNet: an enhanced MobileNet architecture," 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York City, NY, USA, 2019, pp. 0280-0285.
14. H. Chen and C. Su, "An enhanced hybrid MobileNet," 2018 9th International Conference on Awareness Science and Technology (iCAST), Fukuoka, 2018, pp. 308-312.
15. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556v6, 2015.
16. D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
17. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
18. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
19. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Accessed: 19 January 2011.