Article
Image-Captioning Model Compression
Viktar Atliha and Dmitrij Šešok *

                                          Department of Information Technologies, Vilnius Gediminas Technical University, Saulėtekio Al. 11,
                                          LT-10223 Vilnius, Lithuania; viktar.atliha@vilniustech.lt
                                          * Correspondence: dmitrij.sesok@vilniustech.lt

Abstract: Image captioning is a very important task that lies at the intersection of natural language processing (NLP) and computer vision (CV). The current quality of captioning models allows them to be used for practical tasks, but they require both large computational power and considerable storage space. Despite the practical importance of the image-captioning problem, only a few papers have investigated model size compression in order to prepare these models for use on mobile devices. Furthermore, these works usually investigate only decoder compression in a typical encoder–decoder architecture, while the encoder traditionally occupies most of the space. We applied the most efficient model-compression techniques, such as architectural changes, pruning and quantization, to several state-of-the-art image-captioning architectures. As a result, all of these models were compressed by no less than 91% in terms of memory (including the encoder), but lost no more than 2% and 4.5% in the CIDEr and SPICE metrics, respectively. At the same time, the best model reached 127.4 CIDEr and 21.4 SPICE with a size of only 34.8 MB, which sets a strong baseline for the compression of image-captioning models and can be used in practical applications.

                                          Keywords: image captioning; model compression; pruning; quantization

         
Citation: Atliha, V.; Šešok, D. Image-Captioning Model Compression. Appl. Sci. 2022, 12, 1638. https://doi.org/10.3390/app12031638
Academic Editor: Andrea Prati
Received: 8 January 2022; Accepted: 2 February 2022; Published: 4 February 2022

1. Introduction

     One of the most significant tasks combining two different domains, CV and NLP, is the image-captioning task [1]. Its goal is to automatically generate a caption describing an image given as an input. The description should contain not only a listing of the objects in the image, but should also take into account their attributes, the interactions between them, etc., in order for the description to be as humanlike as possible. This is an important task for many practical applications in everyday life, such as human–computer interaction, help for visually impaired people and image search [2,3].
     Typically, image-captioning models are based on an encoder–decoder architecture. Some of the models, such as [4–8], use a standard convolutional neural network (CNN) as an encoder. However, the use of image detectors for this purpose, such as in [9,10], has been rising in popularity in recent years. The decoder here is a text generator, most often represented as a recurrent neural network (RNN) [11] or a long short-term memory network (LSTM) [12] with attention [13]. More complex models based on the transformer architecture [14], which is the state of the art in a variety of NLP problems, have also been created, using transformers both for sentences [15–18] and for images [19].
     Unsurprisingly, rising interest in image-captioning models has significantly improved their quality in a very short time. At the same time, improvements come not only from more complex neural network architectures, but also from an increase in the number of parameters and, accordingly, in the size of the captioning model. Moreover, it is not known for certain whether it is a more successful architecture and training method, or simply a larger number of parameters, that leads to the increase in quality. At the same time, deep neural network models are increasingly used on mobile devices, which have limitations both in computing power and in storage size. As a result, current state-of-the-art models, such as, for example [9,10], cannot be used on mobile devices,


as they usually weigh 500 MB or more. The problem of model compression for image captioning is very poorly studied. Even the published papers [20–22] usually investigate only decoder compression, although the encoder is an integral part of the model when applying it to new data.
     At the same time, model-compression techniques in other related areas of CV and NLP are well studied. For example, research on various architectures of object detectors (which are often used as encoders for the image-captioning task) [23–25] has been conducted for a long time, with the aim of significantly reducing their size and increasing their speed without losing quality (or even increasing it). On the other hand, methods for compressing NLP models have been investigated for various tasks, ranging from text classification [26] to text translation [27], which is close in meaning to the image-captioning problem.
     In this article, we decided to fill this gap and undertake a comprehensive application and analysis of deep neural network (DNN) model-compression techniques [28] for image-captioning models. We investigated various encoder architectures specifically tailored to object detection in a low-resource environment. We also applied model-compression methods to the decoder, reducing the architecture of the model itself as well as applying various pruning [29] and quantization [30] methods. As a result, we were able to significantly reduce the occupied space of two state-of-the-art models, Up-Down [9] and AoANet [10], while not losing much in terms of important metrics. For example, using our approach, we reduced the size of the classic Up-Down model from 661.4 MB to 58.5 MB (including the encoder), that is, by 91.2%. At the same time, the main image-captioning metrics, CIDEr [31] and SPICE [32], decreased from 120.1 and 21.4 to 118.4 and 20.8, respectively (by 1.4% and 2.8%). For the AoANet model, the size reduction was 95.6%, from 791.8 MB to 34.8 MB, while the CIDEr and SPICE metrics fell by 1.8% and 4.5%, respectively, from 129.8 to 127.4 and from 22.4 to 21.4. Thus, the best-resulting model reaches 127.4 CIDEr and 21.4 SPICE at a size of only 34.8 MB.
                                  The main contributions of this paper are as follows:
                            •    We have proposed the use of modern methods of model compression in relation to
                                 the image-captioning task, both for the encoder and for the decoder in order to reduce
                                 the overall size of the model;
                            •    We compared different options for such a reduction, such as different encoder models,
                                 different decoder architecture variations, as well as different pruning options, and
                                 conducted a study of the effect of quantization;
                             •    The proposed methods allowed us to considerably reduce the size of the models
                                  without a significant loss of quality;
                             •    The methods worked universally on two different models, which suggests that they
                                  can be successfully applied to other image-captioning architectures as well.
                                 The paper is organized as follows:
                             •    In Section 2 we review work related to this paper, including such topics as image-
                                 captioning tasks in general, neural network model-compression techniques and their
                                 particular application to image-captioning tasks;
                            •    In Section 3 we describe our methodology, firstly describing encoder and then decoder
                                 compression approaches;
                            •    In Section 4 we report the design of our experiments as well as their numerical results
                                 and their analysis;
                             •    In Section 5 we summarize the main achievements and discuss the results;
                            •    In Section 6 we conclude the work and name future work directions.

                            2. Related Work
                            2.1. Image Captioning
     The automatic generation of a caption describing an image is important for smoother human–machine interaction. One of the earliest successful methods of such generation, which laid the foundation for modern image-captioning methods, is [4]. It is a typical encoder–decoder architecture, which first generates some representation of the image and then, based on this representation, generates text, most often word by word. This scheme is followed by most modern image-captioning architectures.
                                 Two key works that have significantly improved the performance of image-captioning
                             models are [8,9]. The first work suggested using the REINFORCE algorithm to optimize the non-differentiable quality metric CIDEr directly, thereby improving its value. The second work suggested using a detector of objects in the image as an encoder
                            (the article itself uses Faster R-CNN [24]), and then, based on the detected objects and their
                            attributes, generating a caption, instead of trying to compress the entire image into one
                            vector and generating it based on such representation.
     In addition, starting with [5], the attention mechanism has been actively used for the image-captioning task; it allows the model to adaptively focus on the objects or areas that are most important for generating a text description of the current image. It is used in strong works such as [10,33,34] and others.
                                 Recently, transformers [14] have been gaining more and more popularity, becoming
                            state-of-the-art models for many tasks from the field of NLP. Their use improves the quality
                            of image-captioning models as well, which is investigated in [15,16,18].
     Research in the field of image captioning is also moving towards the unification of models with other vision–language understanding and generation tasks. Works such as [35,36] pretrain networks on giant datasets and obtain models capable of solving a number of vision–language problems. However, because such pretraining requires huge computational resources, it is difficult for researchers to improve such models and study them broadly.

                            2.2. Neural Network Models Compression
                            2.2.1. Effective Architectures
                                  One of the most effective ways to reduce model size is to look for a more efficient
                            architecture that contains fewer parameters but still performs well. The search for such
                             architectures is especially active for tasks that, on the one hand, are very useful to solve in real time on mobile devices and, on the other hand, as a rule, have too cumbersome architectures. One of these tasks is the detection of an object
                             in an image, together with its classification and the determination of its attributes. For example, Faster R-CNN [24] uses a Region Proposal Network (RPN) and combines it with Fast R-CNN [23], using a convolutional neural network (VGG-16 [37]) as a backbone. SSD [38] first creates a set of default boxes over different aspect ratios and scales, and then determines whether they contain objects of interest. RetinaNet [39] is a small, dense detector that performs well thanks to training with the focal loss.
     As a rule, a convolutional neural network backbone is one of the most important components of a detector. Therefore, the task of selecting an efficient convolutional network architecture is also on the agenda. Over time, various ideas have appeared to reduce the number of parameters and, accordingly, the space occupied by the model. One of the most popular, MobileNetV3 [40], was found by neural architecture search building on previous MobileNet versions. Another widely used convolutional neural network architecture is EfficientNet [41], which resulted from a dedicated study of how the number of parameters influences model quality. This network is also used in EfficientDet [25], which transfers these ideas to the object-detection domain.

                            2.2.2. Pruning
     Pruning is one of the most popular and effective methods for reducing the size of models [42–44]. Its idea is to remove model parameters that contribute little to the model's output. Deleting such parameters therefore does not cause much damage to the final quality, while saving storage space and computing resources.
     In general, all pruning methods can be divided into two categories: structured pruning [45,46] and unstructured pruning. Within structured pruning, entire rows, columns or channels of layers are removed. Such techniques are used both for reducing CNN models and for RNN-based models.
     Another category is unstructured pruning [47,48]. In this kind of method, the removal of weights is not tied to a specific structure of the model; any weights can be removed depending on a criterion of their "importance". Such methods
                            are actively used in different variations for various tasks. For initial research, we focused
                            on relatively simple methods that can achieve a good result.

                            2.2.3. Quantization
     Another method of compressing a model without losing quality is quantization [49]. The essence of quantization is to store the model using lower-precision numbers in order to reduce the occupied space. Indeed, most practical applications do not require the precision that standard floating point data types provide.
     Quantization approaches can be divided into two groups based on how the clipping range is defined: static and dynamic quantization. In static quantization [50,51], the clipping range is calculated before inference and remains the same during the model's runtime. In the other group of methods, called dynamic quantization [52,53], the clipping range is dynamically calculated for each activation map during model application.
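
To make the notion of a clipping range concrete, the following toy sketch quantizes a tensor symmetrically to 8 bits with either a fixed or a dynamically computed range. It is a minimal illustration under assumed ranges, not the exact scheme of any particular framework or of this paper.

    import torch

    def quantize_int8(x: torch.Tensor, clip: float):
        """Map values of x (clipped to [-clip, clip]) to int8 and return the scale."""
        scale = clip / 127.0
        q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
        return q, scale

    x = torch.randn(4, 4)

    # Static quantization: the clipping range is fixed before inference
    # (e.g., estimated on calibration data) and reused for every input.
    q_static, s_static = quantize_int8(x, clip=3.0)

    # Dynamic quantization: the clipping range is recomputed from the
    # current activations at runtime.
    q_dynamic, s_dynamic = quantize_int8(x, clip=x.abs().max().item())

    print((q_dynamic.float() * s_dynamic - x).abs().max())  # dequantization error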

                            2.3. Model Size Reducing for Image Captioning
                                 The area of model compression for image captioning is underexplored. Only a few
                            works offer their own approaches to reducing the size of the image-captioning models.
                             In [54], the authors propose to use SqueezeNet [55] as an encoder and LightRNN [56] as a
                            decoder. The authors of [57] propose a way to reduce model size, tackling the problem of
                            huge embedding matrices which are not scalable to bigger vocabularies. In [20,21], novel
                            special pruning approaches to the decoder of image-captioning models are proposed.
                                 However, all of these works investigate narrow aspects, for example, only reducing
                            the size of the encoder or decoder or using only one method of compression. In contrast,
                            our article is intended to provide a comprehensive understanding of the methods for
                            compressing models for image captioning, using various combinations of approaches in
                            order to reduce the final model size as much as possible without much loss of quality.

                            3. Methodology
     This section describes the methods we used to reduce the size of the models. We describe both the methods shared by the two investigated architectures (Up-Down and AoANet) and the methods specific to each particular model. It is important to note that our aim was not to compare the Up-Down and AoANet models with each other, but to compare compression methods across several models. A comparison of the original models can be found in [10].

                            3.1. Encoder Compression
                                 As mentioned above, the overwhelming majority of state-of-the-art image-captioning
                            architectures use an image object detector as an encoder. That is, typical models are built
according to the following scheme: let I be the input image, E be a detection model, and D be a decoder. First, the encoder generates k image features E(I) = V = {v_1, v_2, . . . , v_k}, v_i ∈ R^d, such that each image feature encodes a region of the image. Then, the decoder generates the caption D(V) based on these features.
     We propose to keep this general scheme but, as an encoder, to take models that are more suitable for our task. One option is to keep the Faster R-CNN detection model but with a different backbone (since it is the backbone that contains most of the parameters). As such a backbone, we propose using MobileNetV3, which performs well in memory-restricted settings. For comparison, we also take a detector that was designed from the start for a small memory footprint while maintaining good quality; EfficientDet was chosen as such a detector.
     Thus, as E, we will have the Faster R-CNN ResNet101 used in the original Up-Down model, as well as Faster R-CNN MobileNetV3 and EfficientDet. Because the sizes of these detectors are already small, we have not considered other methods of encoder compression.
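
As an illustration of why swapping the backbone shrinks the encoder, the sketch below compares the parameter counts and approximate float32 sizes of two stock torchvision detectors. Note that torchvision ships a ResNet50-based Faster R-CNN rather than the ResNet101 variant used in the original Up-Down encoder, so the numbers only approximate those reported in Table 1.

    import torch
    from torchvision.models.detection import (
        fasterrcnn_resnet50_fpn,
        fasterrcnn_mobilenet_v3_large_fpn,
    )

    def model_stats(model: torch.nn.Module):
        """Number of parameters and approximate float32 size in MB."""
        n_params = sum(p.numel() for p in model.parameters())
        return n_params, n_params * 4 / 2 ** 20

    # Stock torchvision detectors used as stand-ins for the encoder E;
    # no pretrained weights are loaded, since only the size is of interest here.
    for name, builder in [
        ("Faster R-CNN ResNet50", fasterrcnn_resnet50_fpn),
        ("Faster R-CNN MobileNetV3", fasterrcnn_mobilenet_v3_large_fpn),
    ]:
        n, mb = model_stats(builder())
        print(f"{name}: {n / 1e6:.1f} M params, ~{mb:.1f} MB")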

                            3.2. Decoder Compression
     Assuming that the current models contain more parameters than are necessary to achieve the same quality, we first investigate whether the number of parameters can be reduced without changing the nature of the architecture. The specific way of changing the number of parameters depends on the model under study. Pruning methods are then applied to the best architecture, and the best pruned model is finally quantized.

                            3.2.1. Architecture Changes
                                 Following [9], the decoder model has three important logical parts:
                            •    Embeddings Calculation
     In this part, one-hot encoded vectors representing words are transformed into word vectors. Let Π be the one-hot encoding of the input word and W_e ∈ R^{E×|Σ|} be the word-embedding matrix for a vocabulary Σ. Then, the word vector π is obtained by the following equation:

                                          π = W_e Π                                  (1)
                                 Here, the size of the parameter matrix depends on embedding dimension E and the
                                 size of the vocabulary |Σ|. As the size of the vocabulary is a fixed value based on
                                 the preprocessing of the dataset, we experimented with changing E and explored its
                                 influence on the overall model performance as well as on the model size.
                            •    Top-Down Attention LSTM and Language LSTM
     As these two modules are similar to each other in terms of the number of parameters and inner representations, we treat them as one logical part. Both modules are LSTMs, so in general their operation over a single time step is the following:

                                          h_t = LSTM(x_t, h_{t−1}),                                (2)

     where h_t ∈ R^M. In this case, the sizes of the parameter matrices of the LSTM strongly depend on the parameter M. We manipulated this parameter by trying to reduce it
                                 without causing huge losses in the model’s quality.
                            •    Attention Module
                                 The third important module of the Up-Down architecture is called Attention Module.
     It is used to calculate attention weights α_t for the encoder's features v_i. It works in the following way:

                                          α_{i,t} = w_a^T tanh(W_{va} v_i + W_{ha} h_t^1)                         (3)

                                          α_t = softmax(a_t),                                  (4)

     where W_{va} ∈ R^{H×V}, W_{ha} ∈ R^{H×M}. The main constants that influence the amount of
                                 weights in this module are H, V and M. V is the size of the encoder vectors, which is
                                 fixed by encoder choice, M has been discussed previously, and H is the dimension
                                 of input attention representation. We investigated how changing H influenced both
                                 model size and its performance along with other parameters.

     Thus, in trying to reach a smaller model size without harming its performance metrics, we manipulated the model parameters E, M and H.
     The AoANet model described in [10], similarly to Up-Down, has both embedding calculation and LSTM parts, so the reasoning for the parameters E and M is similar to that above. However, the other parts of the architecture are different, so we do not experiment with them in this paper.
     We reduce the parameters E, M and H for Up-Down, and E and M for AoANet, from the values used in the original papers by a common scale factor γ.
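
For illustration, the following sketch estimates how the decoder parameter count shrinks when E, M and H are scaled by γ. The concrete dimensions (a 10k-word vocabulary, E = M = H = 1000, 2048-dimensional image features) and the simplified module layout are assumptions, not the exact configuration of the released models, so the counts are only approximate.

    import torch.nn as nn

    def count_params(m: nn.Module) -> int:
        return sum(p.numel() for p in m.parameters())

    # Assumed dimensions: vocabulary size, embedding size E, LSTM size M,
    # attention size H and encoder feature size V.
    VOCAB, E0, M0, H0, V_FEAT = 10000, 1000, 1000, 1000, 2048

    def updown_decoder_params(gamma: float) -> int:
        """Approximate parameter count of a simplified Up-Down decoder
        when E, M and H are scaled by the factor gamma."""
        E, M, H = int(E0 * gamma), int(M0 * gamma), int(H0 * gamma)
        parts = nn.ModuleList([
            nn.Embedding(VOCAB, E),                  # W_e, Equation (1)
            nn.LSTMCell(E + M + V_FEAT, M),          # Top-Down Attention LSTM
            nn.LSTMCell(V_FEAT + M, M),              # Language LSTM, Equation (2)
            nn.Linear(V_FEAT, H, bias=False),        # W_va, Equation (3)
            nn.Linear(M, H, bias=False),             # W_ha, Equation (3)
            nn.Linear(H, 1, bias=False),             # w_a, Equation (3)
            nn.Linear(M, VOCAB),                     # output word distribution
        ])
        return count_params(parts)

    for gamma in (1.0, 0.5, 0.25):
        print(f"gamma = {gamma}: ~{updown_decoder_params(gamma) / 1e6:.1f} M parameters")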

                            3.2.2. Decoder Pruning
     Because most of the layers of the decoders we have considered are linear or LSTM layers, the most suitable pruning method is unstructured pruning. We propose to carry it out after training, with the model weights fixed.
     The decoder has two parts: calculating embeddings and processing them. The embedding part is a separate, important part, because it is responsible for selecting the vectors that will represent the words. In order to determine the effect of pruning the embeddings, as the main semantic part, on the quality of the model, we consider two options: pruning the entire model and pruning everything except the embeddings.
     Let the model to be pruned be M(W_1, W_2), where W_1 ∈ R^{m_1} is a vector of model parameters to which the pruning algorithm is not applied, and W_2 ∈ R^{m_2} is a vector of model parameters to which the pruning algorithm is applied, sorted by increasing l1 norm. Let α ∈ [0, 1] be the pruning coefficient and A(W, α) be the pruning algorithm, which generates a mask over the weights to be pruned. The final model after pruning is then M(W_1, W_2 · A(W_2, α)), where · denotes elementwise multiplication.
     We concentrate on two methods of choosing the weights to prune (a minimal PyTorch sketch is given after the list):
•    random pruning: A(W) generates a mask M ∈ {0, 1}^{|W|}, where M_i = 0 with probability α;
•    l1 pruning: A(W) generates a mask M ∈ {0, 1}^{|W|}, where M_i = 0 for all i < α|W| (recall that W is sorted by increasing l1 norm).
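
The following minimal sketch implements both options with torch.nn.utils.prune; the stand-in decoder, the prune_embeddings flag and the choice to prune only weight tensors are our assumptions rather than the exact implementation used in the experiments.

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    def prune_decoder(decoder: nn.Module, alpha: float,
                      method: str = "l1", prune_embeddings: bool = True) -> nn.Module:
        """Post-training unstructured pruning of a trained decoder.

        alpha is the pruning coefficient; method is "l1" or "random";
        prune_embeddings=False corresponds to leaving W_e untouched.
        """
        for module in decoder.modules():
            if isinstance(module, nn.Embedding) and not prune_embeddings:
                continue  # the "False" setting in Figures 1 and 2
            # Prune only weight tensors (LSTMs store them as weight_ih_l0 / weight_hh_l0).
            names = [n for n, _ in module.named_parameters(recurse=False) if "weight" in n]
            for name in names:
                if method == "l1":
                    prune.l1_unstructured(module, name=name, amount=alpha)
                else:
                    prune.random_unstructured(module, name=name, amount=alpha)
                prune.remove(module, name)  # bake the zeroed weights into the tensor
        return decoder

    # Example: prune everything except the embeddings with alpha = 0.5.
    decoder = nn.ModuleDict({"embed": nn.Embedding(10000, 500),
                             "lstm": nn.LSTM(500, 500),
                             "out": nn.Linear(500, 10000)})
    prune_decoder(decoder, alpha=0.5, method="l1", prune_embeddings=False)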

                            3.2.3. Decoder Quantization
     Having fixed the model, we use post-training dynamic quantization. When converting a floating point number to an integer, the number is multiplied by some constant and the result is rounded so that it fits into an integer type. This constant can be defined in different ways. The essence of dynamic quantization is that the weights of the model are quantized ahead of time and stored in compressed form, while the constants used to scale the activations are calculated during inference, depending on the input data. This approach allows us to maintain maximum accuracy while storing the model in a compressed form. This method of quantization was chosen because we are primarily interested in the accuracy and size of the model, and not the speed of its application.
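
A minimal sketch of this step with PyTorch's built-in post-training dynamic quantization is shown below; the stand-in decoder only mimics the layer types of the real decoders, and the file name is illustrative.

    import torch
    import torch.nn as nn

    # Stand-in for a trained decoder built from the layer types that dominate
    # the Up-Down and AoANet decoders (embeddings, LSTM, linear layers).
    decoder = nn.ModuleDict({"embed": nn.Embedding(10000, 500),
                             "lstm": nn.LSTM(500, 500),
                             "out": nn.Linear(500, 10000)})

    # Post-training dynamic quantization: weights of the listed layer types are
    # stored as int8, while activation scales are computed on the fly per input.
    quantized = torch.quantization.quantize_dynamic(
        decoder, {nn.Linear, nn.LSTM}, dtype=torch.qint8
    )

    torch.save(quantized.state_dict(), "decoder_int8.pt")  # noticeably smaller checkpoint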

                            4. Experiments
                            4.1. Experimental Setup
                                  To compare the effectiveness of different model-compression methods, we used
                            MSCOCO [58], which is the most common image caption benchmark dataset. It consists of
                            82,783 training images and 40,504 validation images. There are five different captions for
                            each of the images. We used the standard Karpathy split from [7] for offline evaluation. As
                             a result, the final dataset consists of 113,287 images for training, 5000 images for validation
                            and 5000 images for testing.
     We also preprocessed the dataset by replacing all words that occurred fewer than five times with a special unknown-word token. Further following commonly used procedures, we truncated captions to a maximum length of 16 words.
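
The preprocessing described above can be sketched as follows; the token name "<unk>" and the helper names are assumptions, while the thresholds (five occurrences, 16 words) come from the text.

    from collections import Counter

    UNK, MAX_LEN, MIN_COUNT = "<unk>", 16, 5   # assumed token name; thresholds from the text

    def build_vocab(captions):
        """Keep only words that occur at least MIN_COUNT times."""
        counts = Counter(w for cap in captions for w in cap.lower().split())
        return {w for w, c in counts.items() if c >= MIN_COUNT}

    def preprocess(caption, vocab):
        words = caption.lower().split()[:MAX_LEN]          # truncate captions to 16 words
        return [w if w in vocab else UNK for w in words]   # replace rare words

    # Toy usage with two captions (real preprocessing runs over all MSCOCO captions).
    vocab = build_vocab(["a man riding a horse on a beach",
                         "a zebra grazing in a field of grass"])
    print(preprocess("a man riding an elephant", vocab))
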
                                  Metrics BLEU [59], METEOR [60], ROUGE-L [61], CIDEr [31] and SPICE [32] were
                                    used to compare the results. BLEU originated in machine translation; it uses n-gram precision to calculate a similarity score between reference and generated sentences.

                                    METEOR uses synonym matching functions along with n-gram matching. ROUGE-L is
                                   based on the longest common subsequence statistics. Two of the most important metrics
                                   for image captioning are CIDEr and SPICE, as they are human consensus metrics. CIDEr is
                                   based on n-grams and performs TF-IDF weighting on each of them. SPICE uses semantic
                                   graph parsing in order to determine how well the model has captured the attributes of
                                   objects and their relationships.
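
As a small illustration of n-gram-based scoring, the snippet below computes BLEU-4 for a single caption with NLTK; CIDEr and SPICE require the dedicated COCO caption evaluation toolkit and are not shown here.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Toy BLEU-4 computation illustrating n-gram precision scoring.
    references = [["a", "man", "is", "riding", "a", "horse", "on", "the", "beach"]]
    hypothesis = ["a", "man", "rides", "a", "horse", "on", "a", "beach"]

    score = sentence_bleu(references, hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25),    # BLEU-4
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU-4: {score:.3f}")
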
                                        For our experiments, we took models based on the https://github.com/ruotianluo/
                                   ImageCaptioning.pytorch (accessed on 4 November 2021) public repository. The implemen-
                                   tation framework was PyTorch. For encoders, the built-in PyTorch Faster R-CNN was used,
                                   as well as the https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch (accessed on
                                   4 November 2021) repository for EfficientDet. All models were trained for 30 epochs in the
                                   usual way, and then for 10 more epochs in a self-critical style following [8]. The models
                                   were trained and tested on Google Cloud Platform using servers with eight CPU cores,
                                   32 GB of RAM and one Tesla V100 GPU accelerator.

                                   4.2. Encoder Compression
                                         In Table 1, we present the results of a comparison of Up-Down and AoANet models
                                   modified using different encoders. Variants with a Faster R-CNN ResNet101 encoder
                                    refer to the same models as in the original papers. It can be observed that all of the proposed variations, both for Up-Down and for AoANet, have quality metrics very close to each other. This aligns well with the fact that, despite their small size, the object-detection models used as encoders in these experiments show strong results in the standalone object detection task.
                                         Furthermore, the use of smaller encoders designed to be run on mobile devices helps to reduce the encoder's size from 468.4 MB to 15.1 MB (a 96.8% reduction), losing only 0.6 CIDEr and 0.3 SPICE points. We chose EfficientDet-D0 as the most suitable encoder for this task among all those tested, and we use it in all further experiments.

Table 1. Evaluation results of all tested models with different encoders. Both the "Size" and "#Params" columns refer to the encoder.

 Encoder Type                    B@1     B@4     M       R       C       S       Size        #Params
                                                 Up-Down
 Faster R-CNN ResNet101 [9]      79.8    36.3    27.7    56.9    120.1   21.4    468.4 MB    122.6 M
 Faster R-CNN MobileNetV3        79.3    35.6    27.4    56.8    119.4   21.0    74.2 MB     19.4 M
 EfficientDet-D0                 79.4    35.8    27.3    56.7    119.5   21.1    15.1 MB     3.9 M
 EfficientDet-D1                 79.7    36.2    27.5    56.9    120.0   21.3    25.7 MB     6.6 M
                                                 AoANet
 Faster R-CNN ResNet101 [10]     80.2    38.9    29.2    58.8    129.8   22.4    468.4 MB    122.6 M
 Faster R-CNN MobileNetV3        79.4    38.0    29.5    58.2    127.2   21.9    74.2 MB     19.4 M
 EfficientDet-D0                 79.8    38.3    29.6    58.3    129.2   22.1    15.1 MB     3.9 M
 EfficientDet-D1                 80.1    38.4    29.8    58.4    129.5   22.1    25.7 MB     6.6 M

                                   4.3. Decoder Architecture Changes
     In Table 2 we evaluate models with their decoders reduced using different scale factors. A value of γ equal to 1 corresponds to the original models, but, as said before, with the EfficientDet-D0 encoder. For the Up-Down model as well as for AoANet, decoder reduction with a scale factor of 0.5 shows results comparable to the model without reduction, but with approximately three times less memory and three times fewer parameters. Reducing the model further leads to a greater loss in quality. Thus, for example, the Up-Down model's CIDEr metric decreases by only 0.3 points when a scale factor of 0.5 is used, but by a further 4.9 points when the scale factor of 0.25 is used, which is a much bigger loss. A similar CIDEr drop is seen for the AoANet model. So, in this case, the scale factor γ = 0.5 acts as a compromise which both reduces the model size and keeps the quality metrics at the same level. We use decoders reduced with a scale factor of 0.5 in all further experiments.

Table 2. Evaluation results of all tested models with decoders reduced by different scale factors. Both the "Size" and "#Params" columns refer to the decoder.

 Decoder Scale Factor       B@1         B@4           M             R            C             S           Size         #Params
                                                            Up-Down
 γ=1                        79.4        35.8         27.3          56.7        119.5         21.1        193 MB          50.1 M
 γ = 0.5                    79.4        35.2         27.1          56.6        119.2         20.9        68.8 MB          18 M
 γ = 0.25                   78.3        33.2         26.2          55.5        114.3         19.8        27.5 MB          7.2 M
                                                            AoANet
 γ=1                        79.8        38.3         29.6          58.3        129.2         22.1        323.4 MB        84.8 M
 γ = 0.5                    79.5        38.5         29.5          58.3        129.1         22.1        100.7 MB        26.4 M
 γ = 0.25                   78.5        36.4         28.6          57.4        122.9         20.9         35.2 MB         9.2 M

                              4.4. Decoder Pruning
     The results of pruning using different methods and pruning coefficients are presented in Figures 1 and 2. The blue and orange lines in both figures correspond to l1 pruning, while the red and green lines correspond to random pruning. "True" represents pruning the embedding layer and "False" represents not doing so. NNZ means "number of non-zero parameters", which is the common measure for comparing pruning techniques.
     As can be seen, random pruning is strictly worse than l1 pruning. This is expected, since removing the least valuable weights of the model affects its quality metrics less than removing random parameters.
     Another observation is that, although keeping the embedding layer unpruned helps to maintain higher quality at large pruning coefficients for the Up-Down model, it makes almost no difference for the AoANet model. Additionally, a steep decline in metrics is observed for the Up-Down model starting from α = 0.1, and for AoANet starting from α = 0.5. Taking this into account, we are in any case not interested in the region of Up-Down pruning coefficients where the choice of pruning or not pruning the embeddings plays a role.
     As said before, the largest pruning coefficients (which lead to greater model compression) at which the models still show good quality are 0.1 and 0.5 for Up-Down and AoANet, respectively. Increasing these values leads to a drop in metrics. More detailed comparisons of model results near this drop, along with the unpruned model quality, are reported in Table 3. The table confirms the observations obtained from the figures. The chosen boundary values of the pruning coefficient reduce the decoder size by 10% while losing only 0.8 CIDEr points for the Up-Down model, and by 50% while losing only 1.5 CIDEr points for the AoANet model. Such a large reduction for AoANet could indicate that the model has more parameters than are needed to achieve comparable results with its architecture. Furthermore, this model's weights may be quite sparse, with many of them close to 0, in which case such extensive pruning would not lead to a big loss in performance.

Table 3. Evaluation results of all tested models with decoders pruned using different pruning coefficients. Both the "Size" and "NNZ" columns refer to the decoder.

 Pruning Coefficient        B@1                    B@4             M              R               C             S           Size     NNZ
                                                                         Up-Down
 α=0                        79.4                   35.2           27.1           56.6            119.2         20.9     68.8 MB       18 M
 α = 0.1                    79.2                   35.1           26.9           56.3            118.4         20.8      62 MB       16.2 M
 α = 0.3                    77.5                   32.6           25.7           54.7            109.4         19.5     48.2 MB      12.6 M
                                                                         AoANet
 α=0                        79.5                   38.5           29.5           58.3            129.1         22.1    100.7 MB      26.4 M
 α = 0.5                    79.3                   38.2           29.1           58.3            127.6         21.5    50.4 MB       13.2 M
 α = 0.7                    75.5                   34.1           26.6           56.3             112           19     30.2 MB        7.9 M

[Figure 1: two panels, "CIDEr score dependence on pruning coefficient" and "SPICE score dependence on pruning coefficient", plotting CIDEr and SPICE against α for the four pruning settings.]

                              Figure 1. CIDEr and SPICE validation scores dependence on pruning coefficient α for different
                               pruning methods applied to the Up-Down model. "L1" indicates that l1 pruning was used, whilst
                              “Random” indicates that random pruning was used. “True” indicates pruning of the embedding layer
                              and “False” indicates no pruning.

[Figure 2: two panels, "CIDEr score dependence on pruning coefficient" and "SPICE score dependence on pruning coefficient", plotting CIDEr and SPICE against α for the four pruning settings.]
                            Figure 2. CIDEr and SPICE validation scores depending on pruning coefficient α for different pruning
                            methods applied to AoANet model. “L1” indicates that l1 pruning was used, whilst “Random”
                            indicates that random pruning was used. “True” indicates pruning of the embedding layer and
                            “False” indicates no pruning.

                            4.5. Decoder Quantization
                                 For the quantization experiments, the best models obtained in the previous section
                            were used. Thus, we used the Up-Down model pruned with α = 0.1 and the AoANet
                            model pruned with α = 0.5. The results of the quantization experiments can be found in
                             Table 4. Quantization helps to reduce the model size with almost no loss of quality.
                            The resulting decoder size of the Up-Down model is 43.4 MB, and the decoder size of the
                            AoANet model is 19.7 MB.

Table 4. Evaluation results of all tested models with decoders with and without quantization. The "Size" column refers to the decoder.

                                                              B@1      B@4       M          R           C              S       Size
                                                                              Up-Down
                                   Without quantization       79.2     35.1     26.9       56.3        118.4          20.8    62 MB
                                   With quantization          79.2     35.3     26.9       56.3        118.4          20.8   43.4 MB
                                                                              AoANet
                                   Without quantization       79.5     38.2     29.1       58.3        127.6          21.5   50.4 MB
                                   With quantization          79.4     38.3     29.0       58.3        127.4          21.4   19.7 MB

                              5. Discussion
     The final models compressed using all of the proposed methods are compared with the original ones in Table 5. It can be observed that our methods helped to significantly reduce model size with a comparatively small change in the performance metrics. The Up-Down model's size was reduced from 661.4 MB to 58.5 MB, with a loss of 1.7 points in CIDEr and 0.6 points in SPICE. The AoANet model's size was reduced from 791.8 MB to 34.8 MB, with a loss of 2.4 points in CIDEr and 1 point in SPICE. The proposed methods worked well for both models, which is a good indicator of their generalizability. Sizes of 58.5 MB and 34.8 MB are already small enough for use on mobile devices in real-world applications.

Table 5. Evaluation results of all tested models, both original and compressed using the methods from this paper. Both the "Size" and "NNZ" columns refer to the whole model.

                            B@1           B@4             M           R           C               S            Size           NNZ
                                                              Up-Down
 Original model             79.8          36.3        27.7           56.9       120.1           21.4        661.4 MB         172.7 M
 Compressed model           79.2          35.3        26.9           56.3       118.4           20.8        58.5 MB          20.1 M
                                                              AoANet
 Original model             80.2          38.9        29.2           58.8       129.8           22.4        791.8 MB         207.4 M
 Compressed model           79.4          38.3        29.0           58.3       127.4           21.4        34.8 MB          17.1 M

                              6. Conclusions
     In this work we proposed the use of different neural network compression techniques for models trained to solve image-captioning tasks. Our extensive experiments showed that all of the proposed methods help to reduce model size with an insignificant loss in quality. Thus, the best obtained model reaches 127.4 CIDEr and 21.4 SPICE with a size of only 34.8 MB. This sets a strong baseline for future studies on this topic.
     In future work, other compression methods such as knowledge distillation could be investigated in order to reduce the models' sizes even further. More complex pruning or quantization techniques could also be used. Another direction for further research is encoder compression using not only different architectures but also other methods, for example, pruning and quantization. Reducing the size of more complex and bigger models such as [35,36] is also a promising direction of study.

                              Author Contributions: Conceptualization, methodology, software, writing—original draft prepara-
                              tion, visualization, investigation, editing V.A.; writing—review, supervision, project administration,
                              funding acquisition D.Š. All authors have read and agreed to the published version of the manuscript.
                               Funding: This research received no external funding.
                              Conflicts of Interest: The authors declare no conflict of interest.

References
1.    Staniūtė, R.; Šešok, D. A Systematic Literature Review on Image Captioning. Appl. Sci. 2019, 9, 2024. [CrossRef]
2.    Zafar, B.; Ashraf, R.; Ali, N.; Iqbal, M.K.; Sajid, M.; Dar, S.H.; Ratyal, N.I. A novel discriminating and relative global spatial image
      representation with applications in CBIR. Appl. Sci. 2018, 8, 2242. [CrossRef]
3.    Belalia, A.; Belloulata, K.; Kpalma, K. Region-based image retrieval in the compressed domain using shape-adaptive DCT.
      Multimed. Tools Appl. 2016, 75, 10175–10199. [CrossRef]
4.    Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE
      Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
5.    Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption
      generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 June
      2015; pp. 2048–2057.
6.    Lu, J.; Yang, J.; Batra, D.; Parikh, D. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern
      Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7219–7228.
7.    Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE
      Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137.
8.    Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the
      IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024.
9.    Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image
      captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
      Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086.
10.   Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the International
      Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 4634–4643.
11.   Mikolov, T.; Karafiát, M.; Burget, L.; Černocký, J.; Khudanpur, S. Recurrent neural network based language model. Interspeech
      2010, 2, 1045–1048.
12.   Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
13.   Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the
      International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
14.   Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you
      need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017;
      pp. 5998–6008.
15.   Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-Memory Transformer for Image Captioning. In Proceedings of the
      Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 10578–10587.
16.   Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled Transformer for Image Captioning. In Proceedings of the International Conference on
      Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8928–8937.
17.   Yu, J.; Li, J.; Yu, Z.; Huang, Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans.
      Circuits Syst. Video Technol. 2019, 30, 4467–4480. [CrossRef]
18.   Zhu, X.; Li, L.; Liu, J.; Peng, H.; Niu, X. Captioning transformer with stacked attention modules. Appl. Sci. 2018, 8, 739. [CrossRef]
19.   He, S.; Liao, W.; Tavakoli, H.R.; Yang, M.; Rosenhahn, B.; Pugeault, N. Image captioning through image transformer. In
      Proceedings of the Asian Conference on Computer Vision, Online, 30 November–4 December 2020.
20.   Tan, J.H.; Chan, C.S.; Chuah, J.H. End-to-End Supermask Pruning: Learning to Prune Image Captioning Models. Pattern Recognit.
      2022, 122, 108366. [CrossRef]
21.   Tan, J.H.; Chan, C.S.; Chuah, J.H. Image Captioning with Sparse Recurrent Neural Network. arXiv 2019, arXiv:1908.10797.
22.   Dai, X.; Yin, H.; Jha, N.K. Grow and prune compact, fast, and accurate LSTMs. IEEE Trans. Comput. 2019, 69, 441–452. [CrossRef]
23.   Girshick, R. Fast R-CNN. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 11–18 December
      2015; pp. 1440–1448.
24.   Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural
      Inf. Process. Syst. 2015, 28, 91–99. [CrossRef] [PubMed]
25.   Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the Conference on Computer
      Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 10781–10790.
26.   Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. FastText.zip: Compressing text classification models. arXiv
      2016, arXiv:1612.03651.
27.   See, A.; Luong, M.T.; Manning, C.D. Compression of Neural Machine Translation Models via Pruning. arXiv 2016,
      arXiv:1606.09274.
28.   Choudhary, T.; Mishra, V.; Goswami, A.; Sarangapani, J. A comprehensive survey on model compression and acceleration. Artif.
      Intell. Rev. 2020, 53, 5113–5155. [CrossRef]
29.   Reed, R. Pruning algorithms-a survey. IEEE Trans. Neural Netw. 1993, 4, 740–747. [CrossRef]
30.   Guo, Y. A Survey on Methods and Theories of Quantized Neural Networks. arXiv 2018, arXiv:1808.04752.
31.   Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the
      IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575.
32.   Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In European
      Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 382–398.
33.   Huang, L.; Wang, W.; Xia, Y.; Chen, J. Adaptively Aligned Image Captioning via Adaptive Attention Time. In Proceedings of the
      Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8940–8949.
34.   Wang, W.; Chen, Z.; Hu, H. Hierarchical attention network for image captioning. In Proceedings of the AAAI Conference on
      Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8957–8964.
35.   Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned
      pre-training for vision-language tasks. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 121–137.
36.   Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. Vinvl: Revisiting visual representations in vision-language
      models. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 5579–5588.
37.   Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
38.   Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In European
      Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37.
39.   Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002.
40.   Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for
      mobilenetv3. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019;
      pp. 1314–1324.
41.   Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International
      Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114.
42.   Hoefler, T.; Alistarh, D.; Ben-Nun, T.; Dryden, N.; Peste, A. Sparsity in Deep Learning: Pruning and growth for efficient inference
      and training in neural networks. arXiv 2021, arXiv:2102.00554.
43.   He, Y.; Ding, Y.; Liu, P.; Zhu, L.; Zhang, H.; Yang, Y. Learning filter pruning criteria for deep convolutional neural networks
      acceleration. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 2009–2018.
44.   Tanaka, H.; Kunin, D.; Yamins, D.L.; Ganguli, S. Pruning neural networks without any data by iteratively conserving synaptic
      flow. arXiv 2020, arXiv:2006.05467.
45.   Anwar, S.; Hwang, K.; Sung, W. Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst.
      (JETC) 2017, 13, 1–18. [CrossRef]
46.   He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.J.; Han, S. Amc: Automl for model compression and acceleration on mobile devices. In
      Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 784–800.
47.   Lee, N.; Ajanthan, T.; Torr, P. Snip: Single-Shot Network Pruning Based on Connection Sensitivity. arXiv 2018, arXiv:1810.02340.
48.   Xiao, X.; Wang, Z. Autoprune: Automatic network pruning by regularizing auxiliary parameters. In Proceedings of the Advances
      in Neural Information Processing Systems 32, (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
49.   Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural
      Network Inference. arXiv 2021, arXiv:2103.13630.
50.   Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural
      networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern
      Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
51.   Yao, Z.; Dong, Z.; Zheng, Z.; Gholami, A.; Yu, J.; Tan, E.; Wang, L.; Huang, Q.; Wang, Y.; Mahoney, M.; et al. Hawq-v3: Dyadic
      neural network quantization. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021;
      pp. 11875–11886.
52.   Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.I.J.; Srinivasan, V.; Gopalakrishnan, K. Pact: Parameterized clipping activation
      for quantized neural networks. arXiv 2018, arXiv:1805.06085.
53.   Li, R.; Wang, Y.; Liang, F.; Qin, H.; Yan, J.; Fan, R. Fully quantized network for object detection. In Proceedings of the Conference
      on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2810–2819.
54.   Parameswaran, S.N. Exploring memory and time efficient neural networks for image captioning. In National Conference on Computer
      Vision, Pattern Recognition, Image Processing, and Graphics; Springer: Singapore, 2017; pp. 338–347.
55.   Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer
      parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
61.   Lin, C.Y.; Och, F.J. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram
      statistics. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July
      2004; p. 605.