International Journal of Advanced Science and Technology, Vol. 29, No. 7, (2020), pp. 12077-12083

A Comparative Study of Deep Learning Models for Doodle Recognition

Ishan Miglani1, Dinesh Kumar Vishwakarma2
1,2 Department of Information Technology, Delhi Technological University, New Delhi, India
1 miglaniishan@gmail.com, 2 dinesh@dtu.ac.in

Abstract

Recent advancements in the field of Deep Neural Networks have led researchers and industries to challenge the boundaries between humans and machines. Computers are advancing at great speed; if they can understand our doodles or quick line drawings, much more advanced and simplified forms of communication become possible. This has a major effect on handwriting recognition and its important applications in the areas of OCR (Optical Character Recognition), ASR (Automatic Speech Recognition) and NLP (Natural Language Processing). This paper aims to implement various deep learning techniques to efficiently recognize the labels of hand-drawn doodles. In this study, the Google Quick Draw dataset has been used to train and evaluate the models. Various state-of-the-art transfer learning approaches are compared, and the traditional Convolutional Neural Network approach is also examined. Two pre-trained models, VGG16 and MobileNet, are used. We compare these models using metrics such as top-3 accuracy on the training and validation data and Mean Average Precision (MAP@3). The results show that the transfer learning model with the VGG16 architecture outperforms the other methods, giving a top-3 accuracy of 93.97% on 340 categories.

Keywords: Deep Learning, Pictionary, Games, Transfer Learning, Convolutional Neural Network, VGG16, Doodle

1 Introduction:

Drawing and understanding images is a way of communicating when words fall short or language is inadequate due to differing cultural and literacy levels. When successful, this approach can serve a variety of applications, such as an application that lets mute people convey their needs to others, or an interface where a person communicates their needs through doodles. People learning new languages can also draw doodles and get translations. All of these applications require computers to understand our doodles or quick line drawings. This paper aims to build models that can efficiently recognize user-drawn doodles for a variety of purposes, and then to compare their performance. The problem we examine, determining the label of an object in a doodle using only visual cues, is a challenging task. In this paper, we experiment with two main techniques, Convolutional Neural Networks (CNNs) and transfer learning, to understand how different types of architectures perform on this problem. Transfer learning is a deep learning technique that transfers knowledge learned in a source domain to a target domain. Many researchers have previously addressed the problem of image recognition extensively, with and without transfer learning, such as in [1]. Other deep learning techniques have also proven very useful, as discussed in [2], [3]. Although pre-trained ImageNet models have been widely used for transfer learning in other image classification tasks, they have not been commonly applied to hand-drawn images. In the ImageNet competition [4], winners in recent years have mostly used deep learning techniques, particularly CNNs. These factors bear favourably on our analysis and guided our choice of models for experimentation.
The remainder of this paper is organized as follows: Section 2 gives a detailed study of the work done in related fields. Section 3 presents the proposed framework of our models. In Section 4 the implementation and experimental details are discussed. Section 5 describes the evaluations performed and the results obtained. The last section, Section 6, concludes the research work with a detailed insight into its future scope.

2 Related Work:

Labelling doodles has become an important benchmark task. Multiple systems have been proposed for sketch classification; in [8] the authors implement a sketch classifier in an embedded system. Various datasets have also been proposed: the WORDGUESS-160 dataset (Pictionary-style word guessing on hand-drawn object sketches: dataset, analysis, and deep network models) is one such example, and the MNIST dataset [19] has also been used in handwriting recognition research. We use the QuickDraw dataset [5] provided by Google. It contains more than 50 million sketches across 340 categories, and thus provides enough complexity for a fair comparison of the CNN, MobileNet, and transfer learning approaches.

The value of transfer learning can be seen in [9], where a pre-trained VGG16 model was used to classify brain images. This was also the case in [10], where multiple ImageNet-based models were used for the classification of flowers. Transfer learning is therefore a very beneficial method for improving the performance of a model, as the knowledge contained in the pre-trained model can be reused, as summarized in [11] and [12].

MobileNet is a recent and efficient CNN architecture proposed in [6]. What makes MobileNet notable is that it requires very little computational power without sacrificing much accuracy, so it is easy to apply transfer learning to such a model and run it. MobileNet has a very lightweight architecture, as the work of many convolutional layers is done by depthwise separable convolutions. This architecture is very useful for creating efficient models, as seen in [13] and [14], where the existing MobileNet architecture is modified to achieve higher accuracy and efficiency.

Figure 1. CNN Architecture

3 Proposed Methodology:

In our study, we examined two main approaches: traditional Convolutional Neural Networks and various transfer learning models. We used the VGG16 and MobileNet pre-trained architectures to experiment with the transfer learning approaches. We chose VGG16 due to the simplicity of the data, and MobileNet due to the competitive advantage of its size, which will enable our model to be used on a variety of devices.

A. CONVOLUTIONAL NEURAL NETWORK (CNN)

We use a CNN to create a benchmark, as it is a natural candidate for image recognition: its dimensionality reduction suits the huge number of parameters in an image, and it has previously been used successfully in image classification problems [17], [18]. We train a CNN with 3 convolutional layers and 2 dense layers, interspersed with max pooling layers, to extract features and then classify the images. The detailed architecture used is shown in Fig. 1, and a minimal sketch of this baseline is given below.
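The paper specifies the baseline's layer types (3 convolutional layers and 2 dense layers interspersed with max pooling, per Fig. 1) but not the filter or unit counts, so the Keras sketch below is only an illustration under assumed sizes (32/64/128 filters and a 512-unit dense layer), with 64 x 64 RGB inputs over the 340 QuickDraw classes.

```python
# Minimal sketch of the CNN baseline: 3 conv layers and 2 dense layers
# interspersed with max pooling. Filter and unit counts are illustrative
# assumptions; the paper only gives the layer types (see Fig. 1).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 340          # QuickDraw categories used in this study
INPUT_SHAPE = (64, 64, 3)  # 64 x 64 RGB renderings of the doodles

def build_baseline_cnn():
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    return model

model = build_baseline_cnn()
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    # Top-3 accuracy is tracked because it is the main evaluation criterion.
    metrics=[tf.keras.metrics.TopKCategoricalAccuracy(k=3, name="top_3_accuracy")],
)
model.summary()
```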
Figure 2. A sketch of Transfer Learning with VGG16 Architecture

B. TRANSFER LEARNING

With transfer learning, we gain an advantage through generalisation by beginning from patterns learned for a different task. Effectively, instead of starting the learning process from scratch, it allows us to start from patterns discovered while solving a different but related task, thus increasing the adaptability and flexibility of the model by reusing existing deep learning models. The transfer learning block diagram is shown in Fig. 2. This technique is particularly helpful when data of sufficient size is not available. We apply this design of transfer learning in three different ways (a sketch of these strategies is given after the list):

– Strategy 1: define two models with the same architectures as the pre-trained models VGG16 [15] and MobileNet [6], discard the weights of every layer of the pre-trained models, and re-train the models on the QuickDraw dataset from scratch.

– Strategy 2: fine-tune the pre-trained models with the layers frozen up to several convolutional blocks. The lower layers capture the most basic information that can be extracted from the data, and this can often be reused as-is on another problem, since the low-level information tends to be the same.

– Strategy 3: fine-tune all the layers of the CNN, leaving every layer unfrozen, to update the weights of the pre-trained model to better fit our specific problem.
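As a concrete illustration of the three strategies applied to VGG16, the sketch below loads the Keras VGG16 base either without weights (Strategy 1) or with ImageNet weights (Strategies 2 and 3) and toggles which layers stay trainable. The 64 x 64 input size, 340 classes, and the choice of blocks 4 and 5 as the "last two convolutional blocks" follow the settings described in Section 4; the layer-name test and the minimal placeholder head are implementation assumptions.

```python
# Sketch of the three transfer learning strategies with VGG16, assuming
# 64 x 64 RGB inputs and 340 output classes; the one-layer classification
# head here is illustrative only (the actual head is described in Section 4).
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 340
INPUT_SHAPE = (64, 64, 3)

def build_vgg16_transfer(strategy):
    if strategy == 1:
        # Strategy 1: same architecture, no pre-trained weights,
        # trained on QuickDraw from scratch.
        base = VGG16(weights=None, include_top=False, input_shape=INPUT_SHAPE)
    else:
        # Strategies 2 and 3 start from the ImageNet weights.
        base = VGG16(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
        if strategy == 2:
            # Strategy 2: freeze everything except the last two
            # convolutional blocks (blocks 4 and 5 stay trainable).
            for layer in base.layers:
                layer.trainable = layer.name.startswith(("block4", "block5"))
        else:
            # Strategy 3: leave all layers unfrozen and fine-tune everything.
            base.trainable = True

    # Placeholder classification head on top of the convolutional base.
    x = layers.GlobalAveragePooling2D()(base.output)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(base.input, outputs)

model = build_vgg16_transfer(strategy=3)
```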
4 Experimental Setup:

A. Dataset

We use Google's QuickDraw dataset [5], the largest dataset of hand-drawn doodles, consisting of 50 million images across 340 categories. All the images were collected as user-drawn doodles as part of Google's Quick, Draw! experiment. This dataset is highly suited to our needs, as doodles are an accurate representation of what human-to-human interaction may look like when we need to communicate something quickly via an image. We used the comma-separated value (CSV) files of the dataset that are offered publicly by Google. The images in these files are stored as JSON arrays and have been simplified by uniformly scaling the drawings and resampling the strokes. Each drawing lies in a 256 x 256 raw pixel region with values between 0 and 255. All 340 classes from the overall pool were used and stayed constant throughout our experiments. The total number of images used was 50 million. For each class the data was split in the ratio 80:10:10 for the training, validation, and test sets. Some examples of the dataset images are shown in Fig. 3.

Figure 3. Examples of the doodles available in the Quick Draw Dataset

B. Data Preprocessing

The data for the drawings of each class label exists in separate CSV files. We first shuffled the CSVs, generating 100 new files with data from all classes, to make sure that the model receives a random sample of images as input and to remove ordering bias. We used colour-coded rendering of the stroke data to take advantage of the RGB channels while building the CNN, so that the model could distinguish between different strokes. We also augmented the images by randomly flipping, rotating, or blocking out parts of them, adding noise to the images and improving the model's robustness to noise.

C. Experimental Environment

The GPU used in these experiments is an NVIDIA Tesla P100 with 30 GB of RAM, and the deep learning framework is TensorFlow, configured with CUDA Toolkit 10.0. In these experiments, we used Google's QuickDraw dataset, consisting of 50 million images across 340 class labels.

D. Model Description and Training Settings

For the VGG16 and MobileNet models, the models were first retrained on the Google QuickDraw dataset. Second, in VGG16 all the layers except the last 2 convolutional blocks were frozen and the model was trained. On top of the existing architecture we added a global spatial average pooling layer and two dense layers of 512 units with ReLU activation, each followed by a dropout of 0.3, and finally a softmax layer for classification. We used the Adam optimizer [16]. Third, in VGG16 all layers were unfrozen and the weights were updated. The learning rate used for all three approaches was 0.0001, for 70 epochs with a batch size of 400 images. A sketch of this classification head and training configuration is given below.
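The following minimal sketch assembles the classification head and training settings described above on top of a partially frozen VGG16 base (Strategy 2). The categorical cross-entropy loss and the dataset variable names (x_train, y_train, x_val, y_val) are assumptions not stated in the paper.

```python
# Sketch of the classification head and training settings: global average
# pooling, two Dense(512, relu) layers each followed by Dropout(0.3), a
# softmax output, Adam with learning rate 0.0001, 70 epochs, batch size 400.
# The loss function is an assumed choice (categorical cross-entropy).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 340
INPUT_SHAPE = (64, 64, 3)

base = VGG16(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
# Freeze all layers except the last two convolutional blocks.
for layer in base.layers:
    layer.trainable = layer.name.startswith(("block4", "block5"))

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.3)(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(base.input, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=[tf.keras.metrics.TopKCategoricalAccuracy(k=3, name="top_3_accuracy")],
)

# Hypothetical array names for the 80/10 training and validation splits:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=70, batch_size=400)
```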
5 Result and Analysis:

As explained in Section 4, 64 x 64 RGB images were used in our study. In particular, 80% of the data was used for training, 10% for validation, and 10% for testing. For the same dataset, different deep learning architectures were tested, and the results are given in the following subsections.

A. Performance Evaluation Criteria

Model evaluation was performed by calculating top-3 accuracy, validation top-3 accuracy, and MAP@3. Top-N accuracy measures how often the true class falls within the top N values of the predicted softmax distribution; for instance, with top-3 accuracy the correct class only needs to appear among the three most probable predicted classes to count as correct. Validation top-3 accuracy is the top-3 accuracy calculated on the validation set.

The average of the precisions at the positions where a relevant image occurs is called the Average Precision (AP):

$AP = \frac{1}{R} \sum_{i=1}^{N} rel(i) \cdot P@i$    (1)

where R is the number of relevant items, N is the number of predictions considered, P@i is the precision at cut-off i, and rel(i) equals 1 if the item at position i is relevant and 0 otherwise. mAP is the mean AP across all of the queries, where the number of queries is M:

$mAP = \frac{1}{M} \sum_{q=1}^{M} AP_q$    (2)

For MAP@k, the order of the recommendations is important, as precision@k is the percentage of correct items in the first k recommendations. This metric gives "partial credit" to predictions that are almost correct. Specifically, the 3 most probable guesses per sample can count towards the score (with weights 1, 1/2 and 1/3, respectively). A computational sketch of MAP@3 is given after Table 1.

B. Results

Table 1 summarizes the performance of our proposed models in doodle classification. In terms of top_3_accuracy, the VGG16 model in which all the layers of the pre-trained network were retrained performed the best, giving an accuracy of 93.97% with a val_top_3_accuracy of 93.89% and a MAP@3 score of 0.87. This was closely followed by the VGG16 model with the second strategy, where only a part of the layers of the pre-trained network were trained again; this model showed promising results but was outperformed when all the layers of the network were retrained. The models where only the existing architectures were used, without the pre-trained weights, also performed well, with the VGG16 model giving a top_3_accuracy of 91.99%, a val_top_3_accuracy of 92.41% and a MAP@3 of 0.85, while the MobileNet model gave a top_3_accuracy of 92.67%, a val_top_3_accuracy of 92.69% and a MAP@3 of 0.86. All of these models outperformed the CNN model that was used as a baseline. The results of the best performing model are highlighted in bold in Table 1.

TABLE 1: COMPARISON OF MODEL ACCURACY
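To make the MAP@3 scoring rule concrete for the single-label case used here, the NumPy sketch below credits each sample with 1, 1/2, or 1/3 depending on where the true class appears among the three most probable predictions; the function name and array shapes are illustrative assumptions.

```python
# Sketch of MAP@3 for single-label samples: each sample scores 1, 1/2, or 1/3
# depending on the rank of the true class among the three most probable
# predictions, and 0 if it is absent; the mean over samples is MAP@3.
import numpy as np

def map_at_3(probs, true_labels):
    """probs: (n_samples, n_classes) softmax outputs; true_labels: (n_samples,)."""
    # Indices of the three highest-probability classes, best first.
    top3 = np.argsort(-probs, axis=1)[:, :3]
    scores = np.zeros(len(true_labels))
    for rank in range(3):
        hit = top3[:, rank] == true_labels
        # Credit 1 / (rank + 1) where the true class appears at this rank.
        scores = np.maximum(scores, hit / (rank + 1))
    return scores.mean()

# Example: three samples over four classes.
probs = np.array([[0.6, 0.2, 0.1, 0.1],   # true class ranked 1st -> 1
                  [0.2, 0.5, 0.2, 0.1],   # true class ranked 2nd -> 1/2
                  [0.4, 0.3, 0.2, 0.1]])  # true class ranked 4th -> 0
print(map_at_3(probs, np.array([0, 0, 3])))  # (1 + 0.5 + 0) / 3 = 0.5
```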
6 Conclusion:

In this paper, we proposed and compared five different models for recognizing the labels of doodles in Google's Quick Draw dataset. The five proposed models are a CNN trained from scratch and transfer learning models built from the pre-trained VGG16 and MobileNet networks in two ways: one using the architecture of the pre-trained network without its pre-trained weights, and the other using the network together with its pre-trained weights. In the second approach, two variations were used: one where all the layers of the network are retrained, and another where only part of the layers remain unfrozen and are retrained. Among the models discussed in this paper, the VGG16 model in which all layers are retrained achieved the best results, with a validation top-3 accuracy of 93.89% and a MAP@3 of 0.87, outperforming all other models.

In the future, combining the transferred features of multiple CNNs could further improve classification accuracy. The precision of the classifier could also be improved by incorporating the drawing stroke order. More memory-efficient models, such as shallower or faster architectures, may be considered for embedded devices.

References

1. S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
2. M. T. Islam, B. M. N. Karim Siddique, S. Rahman and T. Jabid, "Image recognition with deep learning," 2018 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Bangkok, 2018, pp. 106-110.
3. S. Panigrahi, A. Nanda and T. Swarnkar, "Deep learning approach for image classification," 2018 2nd International Conference on Data Science and Business Analytics (ICDSBA), Changsha, 2018, pp. 511-51.
4. ImageNet. http://www.image-net.org/. Accessed: 2020-05-01.
5. Quick, Draw! dataset. https://github.com/googlecreativelab/quickdraw-dataset. Accessed: 2018-12.
6. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
7. Models for image classification with weights trained on ImageNet. https://keras.io/applications/. Accessed: 2020-05-01.
8. T. Tsai, P. Chi and K. Cheng, "A sketch classifier technique with deep learning models realized in an embedded system," 2019 IEEE 22nd International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), Cluj-Napoca, Romania, 2019, pp. 1-4.
9. T. Kaur and T. K. Gandhi, "Automated brain image classification based on VGG-16 and transfer learning," 2019 International Conference on Information Technology (ICIT), Bhubaneswar, India, 2019, pp. 94-98.
10. Y. Wu, X. Qin, Y. Pan and C. Yuan, "Convolution neural network based transfer learning for classification of flowers," 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), Shenzhen, 2018, pp. 562-566.
11. M. Shaha and M. Pawar, "Transfer learning for image classification," 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, 2018, pp. 656-660.
12. L. Shao, F. Zhu and X. Li, "Transfer learning for visual categorization: a survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 5, pp. 1019-1034, May 2015.
13. D. Sinha and M. El-Sharkawy, "Thin MobileNet: an enhanced MobileNet architecture," 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York City, NY, USA, 2019, pp. 0280-0285.
14. H. Chen and C. Su, "An enhanced hybrid MobileNet," 2018 9th International Conference on Awareness Science and Technology (iCAST), Fukuoka, 2018, pp. 308-312.
15. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556v6, 2015.
16. D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
17. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
18. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
19. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Accessed: 19 January 2011.