applied sciences
Article

Image-Captioning Model Compression

Viktar Atliha and Dmitrij Šešok *

Department of Information Technologies, Vilnius Gediminas Technical University, Saulėtekio Al. 11, LT-10223 Vilnius, Lithuania; viktar.atliha@vilniustech.lt
* Correspondence: dmitrij.sesok@vilniustech.lt

Abstract: Image captioning is a very important task, which lies on the boundary between natural language processing (NLP) and computer vision (CV). The current quality of captioning models allows them to be used for practical tasks, but they require both large computational power and considerable storage space. Despite the practical importance of the image-captioning problem, only a few papers have investigated model size compression in order to prepare such models for use on mobile devices. Furthermore, these works usually investigate only decoder compression in a typical encoder–decoder architecture, while the encoder traditionally occupies most of the space. We applied the most efficient model-compression techniques, such as architectural changes, pruning and quantization, to several state-of-the-art image-captioning architectures. As a result, all of these models were compressed by no less than 91% in terms of memory (including the encoder), but lost no more than 2% and 4.5% in the CIDEr and SPICE metrics, respectively. At the same time, the best model achieved 127.4 CIDEr and 21.4 SPICE with a size of only 34.8 MB, which sets a strong baseline for compression of image-captioning models and could be used for practical applications.

Keywords: image captioning; model compression; pruning; quantization

Citation: Atliha, V.; Šešok, D. Image-Captioning Model Compression. Appl. Sci. 2022, 12, 1638. https://doi.org/10.3390/app12031638
Academic Editor: Andrea Prati. Received: 8 January 2022; Accepted: 2 February 2022; Published: 4 February 2022.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

One of the most significant tasks combining two different domains, CV and NLP, is the image-captioning task [1]. Its goal is to automatically generate a caption describing an image given as an input. The description should contain not only a listing of the objects in the image, but should also take into account their attributes, the interactions between them, etc., in order for the description to be as humanlike as possible. This is an important task for many practical applications in everyday life, such as human–computer interaction, help for visually impaired people and image search [2,3].

Typically, image-captioning models are based on an encoder–decoder architecture. Some of the models, such as [4–8], use a standard convolutional neural network (CNN) as an encoder. However, the use of image detectors for this purpose, as in [9,10], has been rising in popularity in recent years. The decoder is a text generator, most often represented as a recurrent neural network (RNN) [11] or a long short-term memory network (LSTM) [12] with attention [13]. More complex models based on the transformer architecture [14], which is the state of the art in a variety of NLP problems, have also been created, using transformers both for sentences [15–18] and for images [19].

Unsurprisingly, the rising interest in image-captioning models has significantly improved their quality in a very short time. At the same time, the improvements come not only from more complex neural network architectures, but also from an increase in the number of parameters and, accordingly, in the size of the captioning model. Moreover, it is not known for certain whether a more successful architecture and training method, or simply a larger number of parameters, leads to the increase in quality. Meanwhile, deep neural network models are increasingly used on mobile devices, which have limitations both in computing power and in storage size. As a result, current state-of-the-art models, such as [9,10], cannot be used on mobile devices,
as they usually weigh 500 MB or more. The problem of model compression for image captioning is very poorly studied. Even the papers that do exist [20–22] usually investigate only decoder compression, although the encoder is an integral part of the model when applying it to new data.

At the same time, model-compression techniques in other related areas of CV and NLP are well studied. For example, the various architectures of object detectors (which are often used as encoders for the image-captioning task) [23–25] have long been studied with the aim of significantly reducing their size and increasing their speed without losing quality (or even while increasing it). On the other hand, methods of compressing NLP models have been investigated for various tasks, ranging from text classification [26] to machine translation [27], which is close in spirit to the image-captioning problem.

In this article, we fill this gap and undertake a comprehensive use and analysis of deep neural network (DNN) model-compression techniques [28] applied to image-captioning models. We investigated various encoder architectures specifically tailored to object detection in a low-resource environment. We also applied model-compression methods to the decoder, reducing the architecture of the model itself as well as applying various pruning [29] and quantization [30] methods. As a result, we were able to significantly reduce the occupied space of two state-of-the-art models, Up-Down [9] and AoANet [10], while not losing much in terms of the important metrics. For example, using our approach, we reduced the size of the classic Up-Down model from 661.4 MB to 58.5 MB (including the encoder), that is, by 91.2%. At the same time, the main image-captioning metrics, CIDEr [31] and SPICE [32], decreased from 120.1 and 21.4 to 118.4 and 20.8, respectively (by 1.4% and 2.8%). For the AoANet model, the size reduction was 95.6%, from 791.8 MB to 34.8 MB, while the CIDEr and SPICE metrics fell by 1.8% and 4.5%, from 129.8 to 127.4 and from 22.4 to 21.4, respectively. Thus, the best-resulting model reaches 127.4 CIDEr and 21.4 SPICE at a size of only 34.8 MB.

The main contributions of this paper are as follows:
• We have proposed the use of modern model-compression methods for the image-captioning task, both for the encoder and for the decoder, in order to reduce the overall size of the model;
• We compared different options for such a reduction, such as different encoder models, different decoder architecture variations and different pruning options, and conducted a study of the effect of quantization;
• The proposed methods allowed us to significantly reduce the size of the models without a significant loss of quality;
• The methods worked universally on two different models, which suggests that they can be successfully used for other architectures used for image-captioning tasks.
The paper is organized as follows:
• In Section 2 we review work related to this paper, covering image captioning in general, neural network model-compression techniques and their particular application to image-captioning tasks;
• In Section 3 we describe our methodology, first covering encoder and then decoder compression approaches;
• In Section 4 we report the design of our experiments as well as their numerical results and their analysis;
• In Section 5 we state the main achievements and discuss the results;
• In Section 6 we conclude the work and name future work directions.
2. Related Work

2.1. Image Captioning

The automatic generation of a caption describing an image is an important task for smoother human–machine interaction. One of the earliest successful methods of such generation, which laid the foundation for modern image-captioning methods, is [4]. It is a typical encoder–decoder architecture, which first generates some representation of the image and then, based on this representation, generates text, most often word by word. This scheme is followed by most modern image-captioning architectures.

Two key works that significantly improved the performance of image-captioning models are [8,9]. The first suggested using the REINFORCE algorithm to directly optimize the discrete quality metric CIDEr, thereby directly improving its value. The second suggested using an object detector as the encoder (the article itself uses Faster R-CNN [24]) and then generating a caption based on the detected objects and their attributes, instead of trying to compress the entire image into one vector and generating the caption from that representation.

In addition, starting with [5], the attention mechanism has been actively used for the image-captioning task; it allows the model to adaptively increase the attention paid to the objects or areas that are most important for generating a text description of the current image. It is used by strong works such as [10,33,34] and others.

Recently, transformers [14] have been gaining more and more popularity, becoming state-of-the-art models for many NLP tasks. Their use improves the quality of image-captioning models as well, which is investigated in [15,16,18].

Research in the field of image captioning is also moving towards the unification of models across other vision–language understanding and generation tasks. Works such as [35,36] use networks pretrained on giant datasets and obtain a model capable of solving a number of vision–language problems. However, because such pretraining requires huge computational resources, it is difficult for researchers to improve such models and to study them broadly.

2.2. Neural Network Model Compression

2.2.1. Effective Architectures

One of the most effective ways to reduce model size is to look for a more efficient architecture that contains fewer parameters but still performs well. The search for such architectures is especially active for tasks that, on the one hand, are very useful to solve in real time on mobile devices and, on the other hand, typically have cumbersome architectures. One of these tasks is detecting objects in an image, together with their classification and the determination of their attributes. For example, Faster R-CNN [24] uses a Region Proposal Network (RPN) and combines it with Fast R-CNN [23], using a convolutional neural network (VGG-16 [37]) as a backbone. SSD [38] initially creates a set of default boxes over different aspect ratios and scales, and then determines whether they contain objects of interest. RetinaNet [39] is a small, dense detector that performs well thanks to training with the focal loss.

As a rule, a convolutional neural network backbone is one of the most important components of a detector. Therefore, the task of selecting an efficient convolutional network architecture is also on the agenda.
Over time, various ideas have appeared to reduce the number of parameters and, accordingly, the space occupied by the model. One of the most popular, MobileNetV3 [40], was found by neural architecture search building on previous MobileNet versions. Another widely used convolutional neural network architecture is EfficientNet [41], which emerged from a dedicated study of how the number of parameters influences model quality. This network is also used in EfficientDet [25], which transfers these ideas to the object-detection domain.
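To make the size differences concrete, the short sketch below (an illustration added here, not an experiment from this paper; it assumes a recent torchvision release) compares the parameter counts of a classic backbone with the mobile-oriented architectures mentioned above:

```python
import torchvision.models as models

# Compare backbone sizes; the parameter count is a rough proxy for the storage
# each architecture needs before any pruning or quantization is applied.
def count_params_millions(model):
    return sum(p.numel() for p in model.parameters()) / 1e6

for name, builder in [
    ("ResNet-101", models.resnet101),
    ("MobileNetV3-Large", models.mobilenet_v3_large),
    ("EfficientNet-B0", models.efficientnet_b0),
]:
    print(f"{name}: {count_params_millions(builder(weights=None)):.1f} M parameters")
```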
2.2.2. Pruning

Pruning is one of the most popular and effective methods for reducing the size of models [42–44]. Its idea is to remove model parameters that are not useful or meaningful when present in the model. Deleting such parameters does not cause much damage to the final quality, but it saves storage space and computing resources.

In general, all pruning methods can be divided into two categories: structured pruning [45,46] and unstructured pruning [47,48]. Within structured pruning, entire rows, columns or channels of layers are removed. Such techniques are used both for reducing CNN models and for models using RNNs. Within unstructured pruning, the removal of weights is not tied to a specific structure of the model; any weights can be removed depending on a criterion of their "importance". Such methods are actively used in different variations for various tasks. For this initial research, we focused on relatively simple methods that can achieve a good result.

2.2.3. Quantization

Another method of compressing a model without losing quality is quantization [49]. The essence of quantization is to use lower-precision numbers for storing the model in order to reduce the occupied space without losing quality. Indeed, most practical applications do not require the precision that standard floating-point data types provide. Quantization approaches can be divided into two groups based on how the clipping range is defined: static and dynamic quantization. In static quantization [50,51], the clipping range is calculated before inference and remains the same during the model's runtime. In the other group of methods, called dynamic quantization [52,53], the clipping range is calculated dynamically for each activation map while the model is applied.

2.3. Model Size Reduction for Image Captioning

The area of model compression for image captioning is underexplored. Only a few works offer their own approaches to reducing the size of image-captioning models. In [54], the authors propose to use SqueezeNet [55] as an encoder and LightRNN [56] as a decoder. The authors of [57] propose a way to reduce model size by tackling the problem of huge embedding matrices, which do not scale to bigger vocabularies. In [20,21], novel pruning approaches for the decoder of image-captioning models are proposed. However, all of these works investigate narrow aspects, for example, reducing the size of only the encoder or only the decoder, or using only one compression method. In contrast, our article is intended to provide a comprehensive understanding of the methods for compressing image-captioning models, using various combinations of approaches in order to reduce the final model size as much as possible without much loss of quality.

3. Methodology

This section describes the methods we used to reduce the size of the models: both the methods used for both investigated architectures (Up-Down and AoANet) and methods specific to each particular model. It is important to note that we did not aim to compare the Up-Down and AoANet models with each other, but rather to compare compression methods across several models. A comparison of the original models can be found in [10].

3.1. Encoder Compression

As mentioned above, the overwhelming majority of state-of-the-art image-captioning architectures use an image object detector as an encoder.
That is, typical models are built according to the following scheme: let I be the input image, E be a detection model, and D be a decoder. Firstly, the method generates k image features $E(I) = V = \{v_1, v_2, \ldots, v_k\}$, $v_i \in \mathbb{R}^d$, such that each image feature encodes a region of the image. Then, the decoder generates the caption D(V) based on these features.

We retain this general scheme, but as the encoder we take models that are better suited to our task. One option is to keep the Faster R-CNN detection model, but with a different backbone (since it is this backbone that contains most of the parameters). As such a backbone, we propose using MobileNetV3, which performs well in memory-restricted settings. For comparison, we also take a detector that was designed from the start for a small memory footprint while maintaining good quality; EfficientDet was chosen as such a detector. Thus, as E we use the Faster R-CNN ResNet101 used in the original Up-Down model, as well as Faster R-CNN MobileNetV3 and EfficientDet. Because the sizes of these detectors are already small, we did not consider other methods of encoder compression.

3.2. Decoder Compression

Assuming that the current models contain more parameters than are necessary to achieve the same quality, we first investigate whether their number can be reduced without changing the nature of the architecture. In this case, the specific way of changing the number of parameters depends on the model under study. Then, pruning methods are applied to the best architecture, and the best-pruned model is finally quantized.

3.2.1. Architecture Changes

Following [9], the decoder model has three important logical parts:

• Embeddings Calculation. In this part, one-hot encoded vectors representing words are transformed into word vectors. Let $\Pi$ be the one-hot encoding of the input word and $W_e \in \mathbb{R}^{E \times |\Sigma|}$ be a word-embedding matrix for a vocabulary $\Sigma$. Then, the word vector $\pi$ is obtained as

  $\pi = W_e \Pi$  (1)

  Here, the size of the parameter matrix depends on the embedding dimension E and the size of the vocabulary $|\Sigma|$. As the size of the vocabulary is fixed by the preprocessing of the dataset, we experimented with changing E and explored its influence on the overall model performance as well as on the model size.

• Top-Down Attention LSTM and Language LSTM. As these two modules are similar to each other in terms of the number of parameters and inner representations, we treat them as one logical part. Both modules are LSTMs, so in general their operation over a single time step is

  $h_t = \mathrm{LSTM}(x_t, h_{t-1})$,  (2)

  where $h_t \in \mathbb{R}^M$. In this case, the sizes of the LSTM parameter matrices strongly depend on the parameter M. We manipulated this parameter, trying to reduce it without causing large losses in the model's quality.

• Attention Module. The third important module of the Up-Down architecture is the Attention Module. It is used to calculate attention weights $\alpha_t$ for the encoder's features $v_i$ (a minimal code sketch of this module is given after this list). It works in the following way:

  $a_{i,t} = w_a^{T} \tanh(W_{va} v_i + W_{ha} h_t^{1})$  (3)

  $\alpha_t = \mathrm{softmax}(a_t)$,  (4)

  where $W_{va} \in \mathbb{R}^{H \times V}$ and $W_{ha} \in \mathbb{R}^{H \times M}$. The main constants that influence the number of weights in this module are H, V and M. V is the size of the encoder vectors, which is fixed by the choice of encoder, M has been discussed previously, and H is the dimension of the inner attention representation. We investigated how changing H influenced both the model size and its performance, along with the other parameters.
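For concreteness, the following is a minimal PyTorch sketch of such an attention module as we read Equations (3) and (4) (our own illustration, not the authors' released code); it also shows how the parameter count is driven by the dimensions V, M and H. The concrete dimension values used below are assumed for illustration only.

```python
import torch
import torch.nn as nn

class UpDownAttention(nn.Module):
    """Sketch of the attention module in Equations (3)-(4)."""
    def __init__(self, feat_dim, hidden_dim, att_dim):  # V, M, H
        super().__init__()
        self.W_va = nn.Linear(feat_dim, att_dim, bias=False)    # W_va in R^{H x V}
        self.W_ha = nn.Linear(hidden_dim, att_dim, bias=False)  # W_ha in R^{H x M}
        self.w_a = nn.Linear(att_dim, 1, bias=False)            # w_a in R^H

    def forward(self, feats, h):
        # feats: (k, V) region features; h: (M,) attention-LSTM hidden state
        a = self.w_a(torch.tanh(self.W_va(feats) + self.W_ha(h))).squeeze(-1)
        return torch.softmax(a, dim=-1)  # attention weights over the k regions

def count_params(m):
    return sum(p.numel() for p in m.parameters())

# Halving M and H (the scale factor discussed below) shrinks the module considerably.
print(count_params(UpDownAttention(2048, 1024, 512)))  # assumed original-sized dimensions
print(count_params(UpDownAttention(2048, 512, 256)))   # reduced dimensions
```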
Thus, in trying to reach a smaller model size without harming its performance metrics, we manipulated the model parameters E, M and H.

The AoANet model described in [10], similarly to Up-Down, has both an embedding-calculation part and an LSTM part, so the reasoning for the parameters E and M is the same as above. However, the other parts of its architecture are different, so we do not experiment with them in this paper.

We reduce all of the parameters E, M and H for Up-Down, and E and M for AoANet, from the values used in the original papers by the same scale factor γ.

3.2.2. Decoder Pruning

Because most of the layers of the decoders we considered are linear or LSTM layers, the most suitable pruning method is unstructured pruning. We propose to carry it out after training, with the model fixed. The decoder has two parts: calculating embeddings and processing them. The embedding part is a separate, important part, because it is responsible for selecting the vectors that will represent the words. In order to determine the effect of pruning the embeddings, as the main semantic part, on the quality of a model, we consider two options: pruning the entire model and pruning everything except the embeddings.

Let the model to be pruned be $M(W_1, W_2)$, where $W_1 \in \mathbb{R}^{m_1}$ is the vector of model parameters to which the pruning algorithm is not applied, and $W_2 \in \mathbb{R}^{m_2}$ is the vector of model parameters to which the pruning algorithm is applied, sorted by increasing l1 norm. Let $\alpha \in [0, 1]$ be the pruning coefficient, and let $A(W, \alpha)$ be the pruning algorithm, which generates a mask of the weights to be pruned. The final model after pruning is then $M(W_1, W_2 \cdot A(W_2, \alpha))$, where $\cdot$ denotes elementwise multiplication.

We concentrate on two methods of choosing the weights to prune:
• random pruning: $A(W, \alpha)$ generates a mask $M \in \{0, 1\}^{|W|}$ in which $M_i = 0$ with probability $\alpha$;
• l1 pruning: $A(W, \alpha)$ generates a mask $M \in \{0, 1\}^{|W|}$ in which $M_i = 0$ for all $i < \alpha|W|$, i.e., the fraction $\alpha$ of weights with the smallest l1 norms is removed.

3.2.3. Decoder Quantization

Having fixed the model, we use post-training dynamic quantization. When converting a floating-point number to an integer, the number is multiplied by a scaling constant and the result is rounded so that it fits into an integer type. This constant can be defined in different ways. The essence of dynamic quantization is that, although the weights of the model are quantized before it is applied and the model is kept compressed, the constants by which the activations are multiplied are calculated on the fly during inference, depending on the input data. This approach allows us to maintain maximum accuracy while storing the model in a compressed form. This quantization method was chosen because we are primarily interested in the accuracy and size of the model rather than the speed of its application.

4. Experiments

4.1. Experimental Setup

To compare the effectiveness of different model-compression methods, we used MSCOCO [58], the most common image-captioning benchmark dataset. It consists of 82,783 training images and 40,504 validation images, with five different captions for each image. We used the standard Karpathy split from [7] for offline evaluation. As a result, the final dataset consists of 113,287 images for training, 5000 images for validation and 5000 images for testing. We also preprocessed the dataset by replacing all words that occurred fewer than five times with a special unknown-word token.
Further following commonly used procedures, we truncated captions to a maximum length of 16 words.

The metrics BLEU [59], METEOR [60], ROUGE-L [61], CIDEr [31] and SPICE [32] were used to compare the results. BLEU originated in machine translation; it uses n-gram precision to calculate a similarity score between reference and generated sentences.
METEOR uses synonym-matching functions along with n-gram matching. ROUGE-L is based on longest-common-subsequence statistics. Two of the most important metrics for image captioning are CIDEr and SPICE, as they are human-consensus metrics. CIDEr is based on n-grams and applies TF-IDF weighting to each of them. SPICE uses semantic graph parsing in order to determine how well the model has captured the attributes of objects and the relationships between them.

For our experiments, we took models based on the public repository https://github.com/ruotianluo/ImageCaptioning.pytorch (accessed on 4 November 2021). The implementation framework was PyTorch. For encoders, the built-in PyTorch Faster R-CNN was used, as well as the repository https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch (accessed on 4 November 2021) for EfficientDet. All models were trained for 30 epochs in the usual way, and then for 10 more epochs in a self-critical style following [8]. The models were trained and tested on the Google Cloud Platform using servers with eight CPU cores, 32 GB of RAM and one Tesla V100 GPU accelerator.

4.2. Encoder Compression

In Table 1, we present a comparison of the Up-Down and AoANet models modified to use different encoders. The variants with a Faster R-CNN ResNet101 encoder correspond to the models in the original papers. It can be observed that all of the proposed variations, both for Up-Down and for AoANet, have quality metrics very close to each other. This aligns well with the fact that, despite their small size, the image-detection models used as encoders in these experiments show strong results in the standalone object-detection task. Furthermore, the use of smaller encoders designed to run on mobile devices reduces the encoder's size from 468.4 MB to 15.1 MB (a 96.8% reduction), losing only 0.6 CIDEr and 0.3 SPICE points. We choose EfficientDet-D0 as the most suitable encoder among those tested, and we use it in all further experiments.

Table 1. Evaluation results of all tested models with different encoders. Both the "Size" and "#Params" columns refer to the encoder.

Encoder Type                     B@1    B@4    M      R      C       S      Size       #Params
Up-Down
  Faster R-CNN ResNet101 [9]     79.8   36.3   27.7   56.9   120.1   21.4   468.4 MB   122.6 M
  Faster R-CNN MobileNetV3       79.3   35.6   27.4   56.8   119.4   21.0   74.2 MB    19.4 M
  EfficientDet-D0                79.4   35.8   27.3   56.7   119.5   21.1   15.1 MB    3.9 M
  EfficientDet-D1                79.7   36.2   27.5   56.9   120.0   21.3   25.7 MB    6.6 M
AoANet
  Faster R-CNN ResNet101 [10]    80.2   38.9   29.2   58.8   129.8   22.4   468.4 MB   122.6 M
  Faster R-CNN MobileNetV3       79.4   38.0   29.5   58.2   127.2   21.9   74.2 MB    19.4 M
  EfficientDet-D0                79.8   38.3   29.6   58.3   129.2   22.1   15.1 MB    3.9 M
  EfficientDet-D1                80.1   38.4   29.8   58.4   129.5   22.1   25.7 MB    6.6 M
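As a rough, standalone illustration of the encoder-size gap in Table 1 (using off-the-shelf torchvision detectors rather than the exact checkpoints from our experiments, and assuming a recent torchvision release), detector parameter counts can be compared directly:

```python
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    fasterrcnn_mobilenet_v3_large_fpn,
)

# Off-the-shelf detectors, instantiated without pretrained weights, used only
# to show how much a lighter backbone shrinks the detection encoder.
for name, builder in [
    ("Faster R-CNN ResNet-50 FPN", fasterrcnn_resnet50_fpn),
    ("Faster R-CNN MobileNetV3-Large FPN", fasterrcnn_mobilenet_v3_large_fpn),
]:
    model = builder(weights=None, weights_backbone=None)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f} M parameters")
```

These detectors differ from the Faster R-CNN ResNet-101 and EfficientDet builds reported in Table 1, so the absolute numbers will not match the table; the point is only the relative gap between backbones.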
4.3. Decoder Architecture Changes

In Table 2 we evaluate models whose decoders were reduced using different scale factors. A value of γ equal to 1 corresponds to the original models, but, as noted above, with the EfficientDet-D0 encoder. For the Up-Down model as well as for AoANet, decoder reduction with a scale factor of 0.5 gives results comparable to the unreduced model, with approximately three times less memory and three times fewer parameters. Reducing the model further leads to a greater loss in quality. For example, the Up-Down model's CIDEr metric decreases by only 0.3 points when the scale factor of 0.5 is used, but by 4.9 points when the scale factor of 0.25 is used, which is a much bigger loss. A similar drop in the CIDEr score is seen for the AoANet model. Thus, the scale factor γ = 0.5 acts as a compromise that both reduces the model size and keeps the quality metrics at the same level. We use decoders reduced with a scale factor of 0.5 in the experiments in the rest of the paper.

Table 2. Evaluation results of all tested models with decoders reduced using different scale factors. Both the "Size" and "#Params" columns refer to the decoder.

Scale Factor   B@1    B@4    M      R      C       S      Size       #Params
Up-Down
  γ = 1        79.4   35.8   27.3   56.7   119.5   21.1   193 MB     50.1 M
  γ = 0.5      79.4   35.2   27.1   56.6   119.2   20.9   68.8 MB    18 M
  γ = 0.25     78.3   33.2   26.2   55.5   114.3   19.8   27.5 MB    7.2 M
AoANet
  γ = 1        79.8   38.3   29.6   58.3   129.2   22.1   323.4 MB   84.8 M
  γ = 0.5      79.5   38.5   29.5   58.3   129.1   22.1   100.7 MB   26.4 M
  γ = 0.25     78.5   36.4   28.6   57.4   122.9   20.9   35.2 MB    9.2 M

4.4. Decoder Pruning

The results of pruning with different methods and pruning coefficients are presented in Figures 1 and 2. The blue and orange lines in both figures correspond to l1 pruning, while the red and green lines correspond to random pruning. "True" denotes pruning the embedding layer and "False" denotes leaving it intact. NNZ stands for "number of non-zero parameters", which is the common measure for comparing pruning techniques.

As can be seen, random pruning is strictly worse than l1 pruning. This could be because removing the least valuable weights of the model expectedly influences the model's quality metrics less than removing random parameters. The other observation is that, although not pruning the embedding layer helps to maintain higher quality at large pruning coefficients for the Up-Down model, it makes almost no difference for the AoANet model. Additionally, a sharp decline in metrics can be observed for the Up-Down model starting from α = 0.1, and for AoANet starting from α = 0.5. Taking this into account, we are in any case not interested in the region of Up-Down model pruning where the choice of pruning or not pruning the embeddings could play any role.

As stated above, the largest pruning coefficients (which lead to greater model compression) with which the models still show good quality are 0.1 and 0.5 for Up-Down and AoANet, respectively. Increasing these values leads to a drop in metrics. More detailed comparisons of the model results near this drop, along with the unpruned model quality, are reported in Table 3. The table confirms the observations obtained from the figures. The chosen boundary values of the pruning coefficients reduce the model size by 10% while losing only 0.6 CIDEr points for the Up-Down model, and reduce the model size by 50% while losing only 1.5 CIDEr points for the AoANet model. Such a large reduction in AoANet could indicate that the overall model is bigger and has more parameters than needed to achieve comparable results with its architecture. Furthermore, this model's weights could be quite sparse, with many of them close to 0; in this case, such extensive pruning would not lead to a big loss in performance.
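The unstructured l1 and random pruning described in Section 3.2.2 can be reproduced with PyTorch's built-in pruning utilities. The helper below is a sketch we provide for illustration (not the code used to produce Table 3); it prunes every Linear, LSTM and, optionally, Embedding weight matrix of a trained decoder with coefficient α:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_decoder(decoder, alpha, prune_embeddings=True, method="l1"):
    """Post-training unstructured pruning of a decoder's weight matrices."""
    for module in decoder.modules():
        if isinstance(module, nn.Embedding) and not prune_embeddings:
            continue  # the "False" setting: leave the embedding layer untouched
        if isinstance(module, (nn.Linear, nn.LSTM, nn.Embedding)):
            weight_names = [n for n, _ in module.named_parameters(recurse=False)
                            if "weight" in n]
            for name in weight_names:
                if method == "l1":
                    prune.l1_unstructured(module, name=name, amount=alpha)
                else:
                    prune.random_unstructured(module, name=name, amount=alpha)
                prune.remove(module, name)  # bake the zero mask into the weights
    return decoder
```

For the boundary settings in Table 3, such a helper would be called with α = 0.1 for the Up-Down decoder and α = 0.5 for the AoANet decoder.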
Table 3. Evaluation results of all tested models' decoders pruned using different pruning coefficients. Both the "Size" and "NNZ" columns refer to the decoder.

Pruning Coefficient   B@1    B@4    M      R      C       S      Size       NNZ
Up-Down
  α = 0               79.4   35.2   27.1   56.6   119.2   20.9   68.8 MB    18 M
  α = 0.1             79.2   35.1   26.9   56.3   118.4   20.8   62 MB      16.2 M
  α = 0.3             77.5   32.6   25.7   54.7   109.4   19.5   48.2 MB    12.6 M
AoANet
  α = 0               79.5   38.5   29.5   58.3   129.1   22.1   100.7 MB   26.4 M
  α = 0.5             79.3   38.2   29.1   58.3   127.6   21.5   50.4 MB    13.2 M
  α = 0.7             75.5   34.1   26.6   56.3   112.0   19.0   30.2 MB    7.9 M

[Figure 1: two panels plotting CIDEr and SPICE validation scores against the pruning coefficient α for the Up-Down model.]

Figure 1. CIDEr and SPICE validation scores depending on pruning coefficient α for different pruning methods applied to the Up-Down model. "L1" indicates that l1 pruning was used, whilst "Random" indicates that random pruning was used. "True" indicates pruning of the embedding layer and "False" indicates no pruning.
[Figure 2: two panels plotting CIDEr and SPICE validation scores against the pruning coefficient α for the AoANet model.]

Figure 2. CIDEr and SPICE validation scores depending on pruning coefficient α for different pruning methods applied to the AoANet model. "L1" indicates that l1 pruning was used, whilst "Random" indicates that random pruning was used. "True" indicates pruning of the embedding layer and "False" indicates no pruning.

4.5. Decoder Quantization

For the quantization experiments, the best models obtained in the previous section were used: the Up-Down model pruned with α = 0.1 and the AoANet model pruned with α = 0.5. The results of the quantization experiments can be found in Table 4. Quantization reduces the model size with almost no loss of quality. The resulting decoder size of the Up-Down model is 43.4 MB, and that of the AoANet model is 19.7 MB.
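The quantization step itself can be expressed with PyTorch's standard dynamic-quantization API. The sketch below uses a stand-in LSTM decoder purely to keep the example self-contained; in the experiments, the pruned Up-Down and AoANet decoders take its place:

```python
import torch
import torch.nn as nn

# Stand-in decoder; the real decoders are the pruned Up-Down and AoANet models.
decoder = nn.LSTM(input_size=512, hidden_size=512, num_layers=2, batch_first=True)

# Post-training dynamic quantization: Linear/LSTM weights are stored as int8,
# while activation scales are computed on the fly for each input at inference.
quantized_decoder = torch.quantization.quantize_dynamic(
    decoder, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)
torch.save(quantized_decoder.state_dict(), "decoder_int8.pth")
```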
Table 4. Evaluation results of all tested models' decoders with and without quantization. The "Size" column refers to the decoder.

                         B@1    B@4    M      R      C       S      Size
Up-Down
  Without quantization   79.2   35.1   26.9   56.3   118.4   20.8   62 MB
  With quantization      79.2   35.3   26.9   56.3   118.4   20.8   43.4 MB
AoANet
  Without quantization   79.5   38.2   29.1   58.3   127.6   21.5   50.4 MB
  With quantization      79.4   38.3   29.0   58.3   127.4   21.4   19.7 MB

5. Discussion

The final models, compressed using all of the proposed methods, are compared with the original ones in Table 5. It can be observed that our methods significantly reduce model size with a comparatively small change in the performance metrics. The Up-Down model's size was reduced from 661.4 MB to 58.5 MB, with a loss of 1.7 points in CIDEr and 0.6 points in SPICE. The AoANet model's size was reduced from 791.8 MB to 34.8 MB, with a loss of 2.4 points in CIDEr and 1 point in SPICE. The proposed methods worked well for both models, which is a good indicator of their generalizability. Sizes of 58.5 MB and 34.8 MB are already small enough for use on mobile devices in real-world applications.

Table 5. Evaluation results of all tested models, both original and compressed using the methods from this paper. Both the "Size" and "NNZ" columns refer to the whole model.

                     B@1    B@4    M      R      C       S      Size       NNZ
Up-Down
  Original model     79.8   36.3   27.7   56.9   120.1   21.4   661.4 MB   172.7 M
  Compressed model   79.2   35.3   26.9   56.3   118.4   20.8   58.5 MB    20.1 M
AoANet
  Original model     80.2   38.9   29.2   58.8   129.8   22.4   791.8 MB   207.4 M
  Compressed model   79.4   38.3   29.0   58.3   127.4   21.4   34.8 MB    17.1 M

6. Conclusions

In this work we proposed the use of different neural network compression techniques for models trained to solve image-captioning tasks. Our extensive experiments showed that all of the proposed methods help to reduce model size with an insignificant loss in quality. The best obtained model reaches 127.4 CIDEr and 21.4 SPICE with a size of only 34.8 MB. This sets a strong baseline for future studies on this topic.

In future work, other compression methods such as knowledge distillation could be investigated in order to reduce the models' sizes even further. More complex pruning or quantization techniques could also be used. Another direction for further research is encoder compression, using not only different architectures but also other methods, for example, pruning and quantization. Reducing the size of more complex and bigger models such as [35,36] is also a promising direction of study.

Author Contributions: Conceptualization, methodology, software, writing—original draft preparation, visualization, investigation, editing, V.A.; writing—review, supervision, project administration, funding acquisition, D.Š. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.
References

1. Staniūtė, R.; Šešok, D. A Systematic Literature Review on Image Captioning. Appl. Sci. 2019, 9, 2024. [CrossRef]
2. Zafar, B.; Ashraf, R.; Ali, N.; Iqbal, M.K.; Sajid, M.; Dar, S.H.; Ratyal, N.I. A novel discriminating and relative global spatial image representation with applications in CBIR. Appl. Sci. 2018, 8, 2242. [CrossRef]
3. Belalia, A.; Belloulata, K.; Kpalma, K. Region-based image retrieval in the compressed domain using shape-adaptive DCT. Multimed. Tools Appl. 2016, 75, 10175–10199. [CrossRef]
4. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
5. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 June 2015; pp. 2048–2057.
6. Lu, J.; Yang, J.; Batra, D.; Parikh, D. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7219–7228.
7. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137.
8. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024.
9. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086.
10. Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 4634–4643.
11. Mikolov, T.; Karafiát, M.; Burget, L.; Černockỳ, J.; Khudanpur, S. Recurrent neural network based language model. Interspeech 2010, 2, 1045–1048.
12. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
13. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
15. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-Memory Transformer for Image Captioning. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 10578–10587.
16. Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled Transformer for Image Captioning. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8928–8937.
17. Yu, J.; Li, J.; Yu, Z.; Huang, Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4467–4480. [CrossRef]
18. Zhu, X.; Li, L.; Liu, J.; Peng, H.; Niu, X. Captioning transformer with stacked attention modules. Appl. Sci. 2018, 8, 739. [CrossRef]
19. He, S.; Liao, W.; Tavakoli, H.R.; Yang, M.; Rosenhahn, B.; Pugeault, N. Image captioning through image transformer. In Proceedings of the Asian Conference on Computer Vision, Online, 30 November–4 December 2020.
20. Tan, J.H.; Chan, C.S.; Chuah, J.H. End-to-End Supermask Pruning: Learning to Prune Image Captioning Models. Pattern Recognit. 2022, 122, 108366. [CrossRef]
21. Tan, J.H.; Chan, C.S.; Chuah, J.H. Image Captioning with Sparse Recurrent Neural Network. arXiv 2019, arXiv:1908.10797.
22. Dai, X.; Yin, H.; Jha, N.K. Grow and prune compact, fast, and accurate LSTMs. IEEE Trans. Comput. 2019, 69, 441–452. [CrossRef]
23. Girshick, R. Fast R-CNN. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448.
24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [CrossRef] [PubMed]
25. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 10781–10790.
26. Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. FastText.zip: Compressing text classification models. arXiv 2016, arXiv:1612.03651.
27. See, A.; Luong, M.T.; Manning, C.D. Compression of Neural Machine Translation Models via Pruning. arXiv 2016, arXiv:1606.09274.
28. Choudhary, T.; Mishra, V.; Goswami, A.; Sarangapani, J. A comprehensive survey on model compression and acceleration. Artif. Intell. Rev. 2020, 53, 5113–5155. [CrossRef]
29. Reed, R. Pruning algorithms-a survey. IEEE Trans. Neural Netw. 1993, 4, 740–747. [CrossRef]
30. Guo, Y. A Survey on Methods and Theories of Quantized Neural Networks. arXiv 2018, arXiv:1808.04752.
31. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575.
32. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 382–398.
33. Huang, L.; Wang, W.; Xia, Y.; Chen, J. Adaptively Aligned Image Captioning via Adaptive Attention Time. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8940–8949.
34. Wang, W.; Chen, Z.; Hu, H. Hierarchical attention network for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8957–8964.
35. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 121–137.
36. Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 5579–5588.
37. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37.
39. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002.
40. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324.
41. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114.
42. Hoefler, T.; Alistarh, D.; Ben-Nun, T.; Dryden, N.; Peste, A. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. arXiv 2021, arXiv:2102.00554.
43. He, Y.; Ding, Y.; Liu, P.; Zhu, L.; Zhang, H.; Yang, Y. Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 2009–2018.
44. Tanaka, H.; Kunin, D.; Yamins, D.L.; Ganguli, S. Pruning neural networks without any data by iteratively conserving synaptic flow. arXiv 2020, arXiv:2006.05467.
45. Anwar, S.; Hwang, K.; Sung, W. Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst. (JETC) 2017, 13, 1–18. [CrossRef]
46. He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.J.; Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 784–800.
47. Lee, N.; Ajanthan, T.; Torr, P. SNIP: Single-Shot Network Pruning Based on Connection Sensitivity. arXiv 2018, arXiv:1810.02340.
48. Xiao, X.; Wang, Z. AutoPrune: Automatic network pruning by regularizing auxiliary parameters. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
49. Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv 2021, arXiv:2103.13630.
50. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
51. Yao, Z.; Dong, Z.; Zheng, Z.; Gholami, A.; Yu, J.; Tan, E.; Wang, L.; Huang, Q.; Wang, Y.; Mahoney, M.; et al. HAWQ-V3: Dyadic neural network quantization. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11875–11886.
52. Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.I.J.; Srinivasan, V.; Gopalakrishnan, K. PACT: Parameterized clipping activation for quantized neural networks. arXiv 2018, arXiv:1805.06085.
53. Li, R.; Wang, Y.; Liang, F.; Qin, H.; Yan, J.; Fan, R. Fully quantized network for object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–25 June 2019; pp. 2810–2819.
54. Parameswaran, S.N. Exploring memory and time efficient neural networks for image captioning. In National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics; Springer: Singapore, 2017; pp. 338–347.
55. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
61. Lin, C.Y.; Och, F.J. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004; p. 605.