External Knowledge enabled Text Visual Question Answering

Arka Ujjal Dey^a, Ernest Valveny^b, Gaurav Harit^a
a IIT Jodhpur, Rajasthan, India
b Computer Vision Center, Universitat Autònoma de Barcelona, Bellaterra (Barcelona)

arXiv:2108.09717v5 [cs.CV] 20 Oct 2021

Abstract

The open-ended question answering task of Text-VQA requires reading and reasoning about local, often previously unseen, scene-text content of an image to generate answers. In this work, we propose the generalized use of external knowledge to augment our understanding of the said scene text. We design a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision-language understanding tasks. Through empirical evidence and qualitative results, we demonstrate how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities. We generate results comparable to the state of the art on two publicly available datasets, under the constraints of similar upstream OCR systems and training data.

Keywords: Scene Text, Visual Semantics, Language and Vision, External Knowledge

1. Introduction

The Visual Question Answering (VQA) problem (Antol et al., 2015) has seen considerable attention in the past years. The appeal lies in the challenge itself, of trying to answer open-ended questions about an image, and in the possible applications it may have in automated image understanding. VQA entails working with multiple data modalities, viz. visual objects and question words, and modeling their interaction to generate an answer.

However, despite significant progress (Goyal et al., 2018), the generic VQA setup addressed in most state-of-the-art frameworks (Kim et al., 2018; Anderson et al., 2018) was still ignoring an important information source: scene text. Since the advent of the printing press, text has made inroads into our lives, and thus also into our images (Dey et al., 2021). Scene text is ubiquitous in everyday images, from street or store names (Karaoglu et al., 2017) to advertisements (Hussain et al., 2017) or brand logos. Thus, the definition of the Text-VQA task (Singh et al., 2019a; Biten et al., 2019) generalizes VQA and requires considering not just visual objects but also the scene text content to generate an answer.

Figure 1: Core Idea: Use detected visual and scene text objects, along with external knowledge, to answer questions about an image.
While the VQA task has its own set of challenges (Kafle and Kanan, 2017b), viz. answer vocabulary, upstream detectors, long-tailed answer distribution, language bias (Goyal et al., 2018), and evaluation metrics (Kafle and Kanan, 2017a), Text-VQA makes the VQA problem setup more generic and introduces new challenges requiring better reading and reasoning abilities. This was seen in (Singh et al., 2019a; Biten et al., 2019) when the state-of-the-art fixed-vocabulary VQA models performed poorly on the inherently open-ended Text-VQA task. In the case of VQA, fixed vocabulary-based systems (Anderson et al., 2018) have done reasonably well: using answer statistics on a large dataset can help us find a fixed set of words covering all or most of the answers. For Text-VQA, however, such a fixed vocabulary of answers is not enough, as the task is set up to require reading scene text in the current image. Thus, a hybrid answer space, unique to each image, consisting of a fixed global vocabulary of frequent answers together with the dynamic local set of current scene text tokens, is used in practice. The open-ended nature of Text-VQA also makes it more susceptible to errors due to training data bias, when a model fails to notice an instance-only cue and instead incorrectly focuses more on trained statistics. E.g., in Figure 1, standard Text-VQA models predict the answer to be 'samsung' because of the mobile phone-related visual content of the image. It is only with external knowledge that we bring in awareness about the hitherto unseen text 'vertu' as a mobile phone company, which finally leads to the correct answer. We demonstrate more such errors in Figure 6, row 2. While one can mitigate such dataset biases by additional training on large annotated datasets, we argue for the use of External Knowledge1. While the use of knowledge has been proposed in the past, it is mostly set up as a retrieval problem (Shah et al., 2019; Narasimhan and Schwing, 2018; Wang et al., 2018), where the answer is retrieved as the most relevant knowledge fact. We propose an end-to-end pipeline of extracting, validating, and reasoning with mined external knowledge to generate answers freely from the hybrid answer space. This requires no additional training and is thus a more resource-efficient solution. In Figure 1 we present our core idea, which is to use external knowledge along with visual and scene text cues to generate an answer.

1 https://developers.google.com/knowledge-graph

Thus, in this work, we propose a framework to extract, validate, and reason with external knowledge in language-vision understanding. In summary, our primary contributions are:

• We present a method to extract external knowledge corresponding to scene text content in images and then validate the mined knowledge facts through image context. This validation stage helps us deal with the noisy nature of web-scraped data.

• We propose an end-to-end trainable architecture that enables us to reason with external knowledge. Our proposed External Knowledge enabled Text-VQA (KTVQA) framework integrates a knowledge channel with a standard transformer-based architecture for the Text-VQA task. Through masked attention maps, we control the interaction between the image components and the knowledge facts. To the best of our knowledge, we are the first to propose the generalized use of external knowledge for the Text-VQA task.

• We apply the proposed framework on standard Text-VQA datasets, demonstrating the effectiveness of external knowledge in improving results. We generate results that match the state-of-the-art models without using additional annotated data or pre-training tasks. Our knowledge-related components can be readily adopted into the existing state-of-the-art models, leading to further improved results.

The paper is organized as follows. In Sec. 2 we present the conventional approaches to Text-VQA and relate them to existing knowledge-based systems. We follow this up with our proposed method in Sec. 3, incorporating external knowledge into the Text-VQA task. In Sec. 4 we report our experimental results and draw our insights, and finally, in Sec. 5 we present our concluding remarks.

2. Related Works

In this section, we first describe the Text-VQA problem and the conventional solutions thus far applied to it. We then discuss knowledge enabled systems, in particular how knowledge has been applied to vision-language tasks, and conclude with how we relate these approaches to the Text-VQA problem.
2.1. Conventional methods on the Text-VQA task

The Text-VQA task has received widespread attention in the recent past, both in the number of datasets (Biten et al., 2019; Singh et al., 2019a; Mishra et al., 2019) and in the network architectures proposed (Hu et al., 2020; Kant et al., 2020; Gao et al., 2020). The initial baselines proposed were based on augmenting existing VQA architectures with OCR inputs (Biten et al., 2019; Singh et al., 2019a; Mishra et al., 2019). However, Text-VQA is multimodal by definition, and unlike traditional VQA with only the visual stream, Text-VQA requires us to detect and encode multiple views of the image, viz. scene text and visual. Leveraging the attentional framework of the recently introduced transformer, and cognizant of this multimodal nature of the data, the Multimodal MultiCopy Mesh (M4C) (Hu et al., 2020) model was proposed.

The multimodal transformer based M4C jointly encodes the question, the image, and the scene text, and employs an iterative decoding mechanism that supports multi-step answer decoding. The performance improvements attributed to M4C led to multiple variants of this multimodal transformer-based architecture in subsequent works. While SA-M4C (Kant et al., 2020) proposes encoding spatial relationships between detected objects through additional spatial layers atop M4C, LaAP-Net (Han et al., 2020) proposes an evidence-based multimodal transformer model where, first, the answer bounding box is generated, which is subsequently embedded and used as part of the final answer word decoding. While M4C considers text and visual objects alike and relies entirely on the internal transformer attentional framework, SMA (Gao et al., 2020) has shown that more structured attention, aware of the inter-object relationships, can lead to improved results. Recently, in TAP (Yang et al., 2021), the authors explore the role of Vision Language Pre-training (VLP) (Devlin et al., 2019; Lu et al., 2019; Chen et al., 2020) tasks, such as Image-Text (contrastive) Matching (ITM) and Masked Language Modeling (MLM), in learning a better aligned multimodal representation for the Text-VQA task. They also take into account the significant amount of spatial references in the dataset. However, instead of using them as ground truth information to reason with, like SA-M4C, they design a pre-training task called Relative (spatial) Position Prediction (RPP). While we note the potential of pre-training tasks, we see them as a complementary approach to our work, where we propose the use of external knowledge to deal with training bias, leading to better generalization.

2.2. Knowledge applied to Vision Language Understanding Tasks

Before we begin our discussion on related works involving knowledge, we briefly define the knowledge taxonomy we use. We differentiate between valid and relevant knowledge facts based on the image and question context, respectively. We feel this distinction is necessary to highlight the gap areas in the related work.

Valid Knowledge is the type of knowledge that is correct or true for the given image, but it need not be relevant to the current task of question answering. We propose checking for knowledge validity as an important sub-task for any system that uses noisy web-scraped data as external knowledge.

Relevant Knowledge, on the other hand, is not just valid for the image but also relevant to the current task of question answering. While validity is specific to the image itself, relevance is both image- and task-specific. Thus, for a knowledge fact to be deemed relevant, it must also be checked for validity.

Knowledge enabled Datasets. While the use of knowledge in question answering tasks has been seen in the past, such tasks are mostly proposed together with a knowledge enabled dataset (Shah et al., 2019; Singh et al., 2019b; Mishra et al., 2019; Narasimhan and Schwing, 2018; Wang et al., 2018), where the knowledge facts are part of the annotations provided. In these examples, the image and the corresponding ground truth knowledge facts are related by design, and thus they sidestep the issue of knowledge correctness or validity with respect to the image. In KVQA (Shah et al., 2019), for example, the dataset is built using image search against a list of persons; thus, every visual cue or person in the image is mapped explicitly to a knowledge cue related to that person. Similarly, we see in (Singh et al., 2019b) a large-scale knowledge enabled dataset, comprising images of popular businesses/brands, book covers, and film posters, with the names of the brands, movies, or books acting as anchor entities in the external knowledge base. Thus, for these knowledge enabled datasets, the accompanying knowledge is always valid by design. Such knowledge facts provide extra supervision and forgo the need for checking knowledge validity. Given a set of valid knowledge facts, most methods (Narasimhan and Schwing, 2018; Wang et al., 2018; Narasimhan et al., 2018) use contextual relevance, with the question and image, to select a subset of relevant knowledge facts.
Mined External Knowledge applied to Vision Language Tasks. Closest to our work is (Wu et al., 2016), where the authors propose the use of mined external knowledge for the existing VQA (Antol et al., 2015) task. They use text attributes predicted from an image to query the DBpedia2 knowledge base. The questions in Text-VQA, however, are primarily focused on reading and reasoning about the scene text objects. Thus, instead of externally generated captions or text attributes, which capture the visual objects' view, we exclusively use the scene text objects as our knowledge queries. Further, Wu et al. (2016) do not check the explicit relevance of the individual knowledge facts obtained; instead, they aggregate them into a single noisy knowledge vector, encoded through Doc2vec (Le and Mikolov, 2014), before feeding it to an LSTM pipeline for answer generation. Instead of a single aggregated knowledge representation for the entire image, we leverage individual OCR token-specific knowledge. This enables us to discriminate amongst OCR tokens for answer generation based on their knowledge relevance. In (Ye et al., 2021) we find an example of scene text-based mined external knowledge, but applied to the image-caption pairing task for advertisement images (Hussain et al., 2017). Advertisement images are rich in scene text and often include brand-related keywords which convey semantics, as observed in (Dey et al., 2021). Thus, in (Ye et al., 2021) the authors explore this scene text channel further and use external knowledge from DBpedia to enrich the scene text features. Similar to the contextual relevance schemes discussed for knowledge enabled datasets, an image-guided attention mechanism controls the incorporation of knowledge into the scene text features. The knowledge-enriched scene text and visual objects attend to each other to generate the final image embedding. It is important to note that neither of the above two works ensures knowledge validity. This is despite the fact that DBpedia offers multiple candidate meanings for a given scene text query. Thus, by skipping the validity checks, both these methods allow room for spurious knowledge facts, which can be incorrectly deemed question-relevant by virtue of semantics.

2 https://dbpedia.org

Using Knowledge to Generate Answers. For knowledge fact-based VQA systems (Narasimhan and Schwing, 2018; Wang et al., 2018), where the answer comes from a closed set of image-specific retrieved knowledge facts, answering a question translates to searching amongst a set of relevant knowledge facts. The answers in Text-VQA are, however, more generic. The answer space extends beyond the detected OCR tokens and requires generic English words, names, and numbers, as shown in (Singh et al., 2019a). This is validated by our observations in Table 4, where we note that only about 53.4% of the questions from the validation set could be answered with our Google-OCR-based OCR tokens alone. To account for this, (Singh et al., 2019a) proposed that the answer space be augmented with a fixed vocabulary of common answer words, which, together with the OCR tokens, can answer 84.4% of the questions. This diverse answer space of Text-VQA implies that answering should not be restricted to searching amongst retrieved knowledge facts alone.

Thus, instead of selecting a retrieved knowledge fact as the answer, we propose reasoning with the retrieved knowledge facts, along with the image OCR tokens, question words, and image content, to guide the answer generation, inferring the appropriate answer from the OCR tokens and the fixed vocabulary.

In this work, we propose the use of mined external knowledge for the Text-VQA task, where we deal with noisy or irrelevant knowledge and generate answers that extend beyond the detected knowledge facts. As such, we solve the more general problem of what constitutes valid, relevant knowledge, and how to reason with it during answer generation. We extend the basic M4C framework by including a separate knowledge processing pipeline. Our knowledge pipeline extracts and validates web-scraped knowledge and uses the multimodal transformer to define task-specific relevance and reasoning over the validated knowledge. We use image context to filter out invalid knowledge explicitly. Similar to the multimodal co-attention scheme of (Ye et al., 2021), we propose a knowledge-enabled multimodal transformer allowing the visual, text, knowledge, and question nodes to attend to each other through multiple layers of self-attention. We believe that, through our design choices, our knowledge-related components can be readily incorporated into any of the existing multimodal transformer-based architectures discussed above, leading to further improved results.
Figure 2: Extraction, Validation and Reasoning: the three stages of using external knowledge in Text-VQA.

3. KTVQA: External Knowledge enabled Text-VQA

We now present our External Knowledge enabled Text-VQA framework, which proposes the use of a pre-existing external knowledge base to improve the Text-VQA task. KTVQA, as shown in Figure 2, entails extracting, validating, and reasoning with noisy external knowledge in a multimodal transformer framework. The inputs to the Text-VQA task are question-answer pairs and associated images. While VQA tasks only require exploring the visual content of images, Text-VQA requires leveraging scene text content too. Thus, the input stream consists of detected local visual objects and scene text. To augment our understanding of the scene text objects, we use an external knowledge base to extract candidate meanings, which we validate through context. Our proposal is based on the multimodal transformer-based M4C architecture. We incorporate knowledge into the framework through the inclusion of a separate knowledge channel and a constrained interaction achieved through attention masks.

We now present the details of our method by first introducing the different feature formulations used, followed by our external knowledge pipeline and, finally, our answer generation process.

3.1. Multimodal Features

The input to our multimodal transformer, as illustrated in both Figures 2 and 5, consists of detected visual objects and scene text from the image, the given question, and the mined external knowledge.

Question Words. Given a sequence of L question words, we embed them as {x^{ques}_l} (where l = 1, ..., L) using pre-trained BERT (Devlin et al., 2019), with x^{ques}_l ∈ R^768.

Visual Objects. Following M4C, each of the M visual object regions is encoded as x^{fr}_m using a pre-trained ResNet-based Faster R-CNN (FRCNN) (Ren et al., 2015) model trained on Visual Genome. Additionally, the bounding box information is embedded as x^b_m to preserve spatial information. The appearance and spatial features thus obtained are linearly projected and combined to generate the final set of visual object features {x^{obj}_m} (where m = 1, ..., M) as

x^{obj}_m = LN(W_1 x^{fr}_m) + LN(W_2 x^b_m)    (1)

where W_1 and W_2 are learned projection matrices and LN(·) denotes layer normalization.
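A minimal sketch of the projection in Eq. (1) is given below, assuming 2048-dimensional Faster R-CNN features and 4-dimensional normalized box coordinates; the module name and dimensions are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class VisualObjectEmbedding(nn.Module):
    """Eq. (1): fuse appearance and spatial cues into one 768-d object feature."""
    def __init__(self, frcnn_dim=2048, bbox_dim=4, hidden_dim=768):
        super().__init__()
        self.w1 = nn.Linear(frcnn_dim, hidden_dim)  # W_1: appearance projection
        self.w2 = nn.Linear(bbox_dim, hidden_dim)   # W_2: spatial projection
        self.ln_app = nn.LayerNorm(hidden_dim)
        self.ln_box = nn.LayerNorm(hidden_dim)

    def forward(self, x_fr, x_b):
        # x_fr: (M, frcnn_dim) appearance features; x_b: (M, 4) box coordinates
        return self.ln_app(self.w1(x_fr)) + self.ln_box(self.w2(x_b))
```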
Scene Text. For the N OCR tokens, both the recognized OCR text and the visual appearance are encoded. The same FRCNN-based detector used for object detection above extracts and encodes the visual appearance features corresponding to the bounding box regions as x^{fr}_n, which is combined with x^{ft}_n and x^{ph}_n, the FastText (Bojanowski et al., 2017) and PHOC (Almazán et al., 2014) embeddings of the OCR token, respectively. Finally, the bounding box coordinates are embedded as x^b_n, similar to visual objects, to obtain {x^{ocr}_n} (where n = 1, ..., N) as

x^{ocr}_n = LN(W_3 x^{ft}_n + W_4 x^{fr}_n + W_5 x^{ph}_n) + LN(W_6 x^b_n)    (2)

where W_3, W_4, W_5 and W_6 are learned projection matrices.
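The corresponding sketch for Eq. (2) is below; the 300-d FastText, 604-d PHOC and 2048-d Faster R-CNN input sizes are common defaults rather than values reported in the paper.

```python
import torch
import torch.nn as nn

class OCRTokenEmbedding(nn.Module):
    """Eq. (2): fuse text (FastText, PHOC), appearance and spatial cues per OCR token."""
    def __init__(self, ft_dim=300, fr_dim=2048, phoc_dim=604, bbox_dim=4, hidden_dim=768):
        super().__init__()
        self.w3 = nn.Linear(ft_dim, hidden_dim)    # W_3: FastText projection
        self.w4 = nn.Linear(fr_dim, hidden_dim)    # W_4: appearance projection
        self.w5 = nn.Linear(phoc_dim, hidden_dim)  # W_5: PHOC projection
        self.w6 = nn.Linear(bbox_dim, hidden_dim)  # W_6: spatial projection
        self.ln_feat = nn.LayerNorm(hidden_dim)
        self.ln_box = nn.LayerNorm(hidden_dim)

    def forward(self, x_ft, x_fr, x_ph, x_b):
        # All inputs are (N, dim) tensors for the N OCR tokens
        fused = self.w3(x_ft) + self.w4(x_fr) + self.w5(x_ph)
        return self.ln_feat(fused) + self.ln_box(self.w6(x_b))
```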
External Knowledge. Knowledge obtained from the Google Knowledge Base (GKB) is in the form of a text description, which is encoded using pre-trained BERT sentence embeddings. Corresponding to each encoded OCR token x^{ocr}_n, we have a knowledge node x^k_n, with x^k_n ∈ R^768. This knowledge node is selected from among the candidate meanings following the procedure explained in the next section.

Previous D Steps Prediction. The answer is generated step-wise, one word at a time, either from the closed global answer vocabulary or from the current image's OCR tokens. As seen in Figure 5, the current step's answer word prediction requires, as part of its input, the predicted answer words from the previous D steps {x^{t-1}_d} (where d = 1, ..., D), with x^{t-1}_d ∈ R^768.

3.2. External Knowledge

The knowledge pipeline has three stages, namely extraction, validation, and reasoning, as illustrated in Figure 2, which we elaborate upon now.

Knowledge Extraction. For each detected scene text word, we query GKB. In case of a hit in GKB, for each query scene text word t_i, we get a set of candidate meanings S_i, which we retain and later disambiguate in the knowledge validation step. In Figure 3a, we show a sample response. With the default GKB API setting of a maximum of 10 candidates per query, we observe that the median number of candidates returned per query word is 4 for a validation subset of 25000 words. Based on this observation, we limit the maximum number of candidates to 4.

Figure 3: Knowledge Extraction and Validation. Figure (a) shows knowledge extraction using the Google API and Figure (b) demonstrates knowledge validation and selection through image context. We identify knowledge validation as an important subtask for web-scraped knowledge data that is not anchored to our Text-VQA datasets explicitly.
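The sketch below illustrates this extraction step together with the name/description filtering and tag merging described later in Sec. 4.1, assuming the public Google Knowledge Graph Search API (kgsearch.googleapis.com); mapping the API's response fields onto the name/description/attribute tags of Figure 3a is our assumption, not necessarily the authors' exact setup.

```python
import requests

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def query_gkb(scene_text_word, api_key, max_candidates=4):
    """Return up to `max_candidates` candidate meanings for one OCR token."""
    params = {"query": scene_text_word, "key": api_key, "limit": max_candidates}
    response = requests.get(KG_ENDPOINT, params=params, timeout=10)
    response.raise_for_status()
    candidates = []
    for item in response.json().get("itemListElement", []):
        result = item.get("result", {})
        candidates.append({
            "name": result.get("name", ""),
            # short type tag, e.g. "Mobile phone company" (our mapping to the 'attribute' tag)
            "attribute": result.get("description", ""),
            "description": result.get("detailedDescription", {}).get("articleBody", ""),
        })
    return candidates

def filter_candidates(query_word, candidates):
    """Drop lexically-similar but spurious hits, then merge description and attribute tags."""
    query = query_word.lower()
    kept = []
    for cand in candidates:
        searchable = (cand["name"] + " " + cand["description"]).lower()
        if query in searchable:
            kept.append({"name": cand["name"],
                         "fact": (cand["description"] + " " + cand["attribute"]).strip()})
    return kept
```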

Knowledge Validation through Context. The particular meaning a scene text word assumes depends on the context it occurs in. Thus, disambiguation amongst candidate meanings is done based on the context of the current image. We see an example of this in Figure 3b, where a context word like 'mobile-phone' helps select the valid knowledge description for 'vertu'.

We define the image context c = [t, vc] as the aggregation of the N OCR tokens t and the V class labels vc of the visual objects co-occurring in the current image. This context is encoded with BERT to generate {x^c_p}, with x^c_p ∈ R^768 (where p = 1, ..., N+V).

Given a scene text word t_i and its set of candidate meanings S_i = {S_ij}, we use the context c of the current image to select the valid meaning S_id ∈ S_i. For each candidate meaning S_ij ∈ S_i, we learn a validity score r_j, given by Eq. (3), which compares its BERT embedding with those of the context words. We deem the highest-scoring candidate meaning to be appropriate for the image context, and therefore valid.

r_j = W_c (ReLU(W_a x_{S_ij}) · ReLU(W_a \sum_p x^c_p))    (3)
d = argmax_j r_j    (4)
x^k_i = x_{S_id}    (5)

where W_a and W_c are learned projection matrices. Further, x_{S_ij} and x^c_p are the BERT embeddings of the j-th candidate meaning S_ij and the context words, respectively.
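A sketch of the validation scoring of Eqs. (3)-(5) follows, treating the product in Eq. (3) as element-wise before the final projection W_c; the layer names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeValidator(nn.Module):
    def __init__(self, bert_dim=768, proj_dim=768):
        super().__init__()
        self.w_a = nn.Linear(bert_dim, proj_dim)  # shared projection W_a
        self.w_c = nn.Linear(proj_dim, 1)         # scoring projection W_c

    def forward(self, cand_emb, ctx_emb):
        # cand_emb: (J, bert_dim) BERT embeddings of the J candidate meanings
        # ctx_emb:  (N+V, bert_dim) BERT embeddings of OCR tokens and object labels
        ctx = F.relu(self.w_a(ctx_emb.sum(dim=0)))   # ReLU(W_a * sum_p x_p^c)
        cand = F.relu(self.w_a(cand_emb))            # ReLU(W_a * x_{S_ij}) for all j
        r = self.w_c(cand * ctx).squeeze(-1)         # Eq. (3): validity scores r_j
        d = torch.argmax(r)                          # Eq. (4)
        return cand_emb[d], d                        # Eq. (5): selected knowledge node
```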
Reasoning with External Knowledge. Reasoning with external knowledge implies identifying relevant knowledge facts and enforcing a constrained interaction between knowledge and scene text nodes. We now discuss how implicit relevance is achieved through the transformer, followed by our proposed constrained interaction, which incorporates knowledge into the scene text representation while ensuring legitimacy.

Relevance. While the validity or correctness of the external knowledge can be judged from its image context, it says nothing about its relevance or applicability when generating answers. The decision to include or exclude the external knowledge fact about a scene text node is crucial when it comes to generating answers. Instead of an explicit task to define knowledge relevance, as in Narasimhan and Schwing (2018); Wang et al. (2018), we include the set of knowledge nodes {x^k_n} as part of the transformer input and let the internal attentional framework (Vaswani et al., 2017) select relevant knowledge facts. The transformer attentional framework treats the scene text, visual objects, question words, and knowledge nodes alike, attending to relevant nodes based on the final task of question answering.

Constrained Interaction. The self-attention layer in the transformer architecture allows a node to form its representation using the other nodes in its neighborhood. In practice, this neighborhood is typically the entire set of co-occurring nodes, but it can also be constrained to a smaller subset, thereby limiting which nodes interact. In the past, such constrained interaction through the use of attention masks has found application in causal step-wise decoding (Hu et al., 2020) and in enforcing spatial relations (Kant et al., 2020). In this work, we extend the use of constrained interaction to preserve node-to-node knowledge legitimacy, i.e., the one-to-one relationship between scene text nodes and their corresponding validated knowledge facts.

Figure 4: Constrained Interaction: knowledge inclusion through the attention mask matrix A, which controls the interaction between the nodes. Dark cells allow interaction between the corresponding nodes. A_KT and A_TK, highlighted in blue, are N × N identity matrices, binding the knowledge facts with their corresponding OCR tokens. Highlighted in pink is the D × D lower triangular sub-matrix A_PP enforcing causality between previous decoding steps.

Formally, the attention mask matrix A ∈ R^{E×E} controls the interaction between the L question words, M visual objects, N OCR objects, N knowledge nodes and D previous stage predictions, with E = L + M + N + N + D. Specifically, A_ij is added as a bias term to the softmax layer of the self-attention, effectively controlling j's role in i's new representation. For intuitive understanding, the dark cells in Figure 4 represent 0, and the grey/empty cells represent −∞. As seen in the figure, our attention mask matrix A allows complete interaction between the visual, scene text, and question words, while the scene text words {x^{ocr}_n} and their corresponding mined knowledge facts {x^k_n} are represented as distinct nodes whose interaction is subject to certain constraints.

The scene text words are symbolic representations, with the mined knowledge facts bringing in the associated semantics. This symbolic-semantic one-to-one relation needs to be preserved if we want to later discriminate amongst scene text words based on knowledge. The final answer generation consists of comparing the transformer outputs {z^{ocr}_n} of the scene text words with the current decoding state z^{t-1}_d. Thus, we ensure that for a given scene text word x^{ocr}_n, only its own legitimate knowledge fact x^k_n contributes to this final representation z^{ocr}_n. This is enforced through the identity sub-matrices A_KT and A_TK, highlighted in Figure 4, which preserve node-to-node knowledge legitimacy by allowing only the related knowledge fact and OCR token to attend to each other. Further, the lower triangular sub-matrix A_PP enables step-wise causal decoding, enforcing that the current answer generation can only interact with previously decoded answer tokens.
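A sketch of how such an attention mask A could be assembled is given below, using the additive 0/−∞ convention described above; interactions not explicitly specified in the text (knowledge-node self-attention and decoding steps attending to all modalities) are marked as assumptions in the comments.

```python
import torch

def build_attention_mask(L, M, N, D):
    """Additive mask A (0 = attend, -inf = blocked) over L question words,
    M visual objects, N OCR tokens, N knowledge nodes and D decoding steps."""
    E = L + M + N + N + D
    A = torch.full((E, E), float("-inf"))
    t, k, p = L + M, L + M + N, L + M + N + N  # first OCR / knowledge / decoding index
    # Question words, visual objects and OCR tokens attend freely to each other.
    A[:k, :k] = 0.0
    # A_KT / A_TK: identity sub-matrices bind each knowledge fact to its own OCR token.
    idx = torch.arange(N)
    A[t + idx, k + idx] = 0.0
    A[k + idx, t + idx] = 0.0
    A[k + idx, k + idx] = 0.0  # each knowledge node also sees itself (assumption)
    # Decoding steps attend to all earlier modalities (assumption) and, via A_PP,
    # causally to previously decoded steps only.
    A[p:, :p] = 0.0
    A[p:, p:] = torch.triu(torch.full((D, D), float("-inf")), diagonal=1)
    return A
```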
Formally, the outputs of our multimodal transformer T can be defined as

[z^q, z^{obj}, z^{ocr}, z^k, z^{t-1}] = T([x^q, x^{obj}, x^{ocr}, x^k], x^{t-1}, A)    (6)

where x^{t-1} = {x^{t-1}_d} is the set of predictions from the previous D time steps, which we initialize with the start token.

3.3. Answer Generation

Our answer generation consists of step-wise causal decoding of the transformer outputs, akin to the iterative decoder introduced in (Hu et al., 2020). Briefly, we generate the answer tokens y^{ans}_d, for all d, one token at a time, predicting either from a fixed vocabulary of H words or from amongst the current N OCR tokens. Similar to (Hu et al., 2020; Kant et al., 2020), the attention mask enforces causality by preventing current answer tokens from attending to answer tokens yet to be generated.

The current decoding state z^{t-1}_d is linearly projected through W_voc to generate the H-dimensional vocabulary scores y^{voc}_d, while the N-dimensional OCR scores y^{ocr}_d are generated through a bilinear interaction between the current decoder state z^{t-1}_d and the OCR outputs {z^{ocr}_n} (where n = 1, ..., N). We refer to this as the pointer network in Figure 5. We take the argmax over the concatenated score y^{all}_d = [y^{voc}_d, y^{ocr}_d] to generate the current answer token y^{ans}_d. We successively decode for 12 steps or until we predict the special end marker token.

Figure 5: KTVQA: our proposed External Knowledge enabled Text-VQA.
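A sketch of one such decoding step is shown below, with a linear vocabulary classifier and a bilinear pointer over the OCR outputs; the layer names and vocabulary size are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, hidden_dim=768, vocab_size=5000):
        super().__init__()
        self.w_voc = nn.Linear(hidden_dim, vocab_size)        # vocabulary scores y_voc
        self.w_ptr = nn.Bilinear(hidden_dim, hidden_dim, 1)   # pointer scores y_ocr

    def forward(self, z_dec, z_ocr):
        # z_dec: (hidden_dim,) current decoding state; z_ocr: (N, hidden_dim) OCR outputs
        y_voc = self.w_voc(z_dec)                                       # (vocab_size,)
        y_ocr = self.w_ptr(z_dec.expand_as(z_ocr), z_ocr).squeeze(-1)   # (N,)
        y_all = torch.cat([y_voc, y_ocr], dim=-1)
        # index < vocab_size -> fixed-vocabulary word, otherwise a copied OCR token
        return torch.argmax(y_all)
```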
4. Experiments

We benchmark our method on the TextVQA dataset (Singh et al., 2019a) and the ST-VQA dataset (Biten et al., 2019). Our results show that the improvements from using knowledge are significant for both datasets. We briefly discuss our implementation and dataset details, followed by a comparison of our results with recent methods and a detailed analysis of how knowledge helps, and finally conclude with an ablation study of our model.

4.1. Implementation and Dataset details

Implementation Details. Our implementation is based on the PyTorch Transformers library3. We adopt a multimodal transformer architecture akin to (Hu et al., 2020). Our basic transformer design is similar in spirit, with eight attention heads and four layers, but we add masking to constrain the external knowledge interaction. We use Adam as our optimizer. We start with a learning rate of 1e-4 and reduce it step-wise upon stagnation or plateau.

3 https://pytorch.org/hub/huggingface_pytorch-transformers
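A minimal sketch of this optimization setup follows; the scheduler's patience and decay factor are illustrative choices, as the paper only states that the learning rate is reduced on stagnation.

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the KTVQA transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2)

# Inside the training loop, after each validation pass:
# scheduler.step(val_accuracy)
```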

Dataset 1: TextVQA. The TextVQA dataset consists of 28,408 images collected from OpenImages (Kuznetsova et al., 2020), with 45,336 question-answer pairs. The evaluation metric proposed is accuracy, measured via soft voting over the 10 ground truth answers provided.

We use Google-OCR4 for scene text detection and recognition. Of the 189551 detected scene text words in total, GKB offers an average of 2.29 different candidate meanings for each of the 100972 words found in the knowledge base. In Figure 3a, we see that the GKB response includes three tags: name, description and attribute. Interestingly, we note that GKB exhibits tolerance towards spelling errors, often returning candidates with high lexical similarity. While in some cases this leads to correct candidate meanings even with incorrect OCR, it also leads to a significant number of false positives. Thus, we remove entries where the queried scene text is not present in the 'name' or 'description' tag of the returned candidate meaning. For the remaining words, we merge their description and attribute tags. This results in mined knowledge facts for 57666 words, including 2917 multiword entity names (e.g., 'new york', 'burger king').

4 https://cloud.google.com/vision/docs/ocr

Dataset 2: ST-VQA. The ST-VQA dataset consists of 23,038 images from various sources (Karatzas et al., 2013, 2015; Bigham et al., 2010; Mishra et al., 2013; Krishna et al., 2016; Veit et al., 2016), with 31791 question-answer pairs. To deal with OCR recognition errors that lead to answers with misplaced/missing characters, the authors (Biten et al., 2019) propose the Average Normalized Levenshtein Similarity (ANLS) as an alternate evaluation metric. The Normalized Levenshtein Similarity (NLS) is defined as

NLS = 1 - d_L(y^{ans}, y^{GT}) / max(|y^{ans}|, |y^{GT}|)    (7)

where y^{ans} and y^{GT} are the predicted and ground truth answers respectively, with d_L referring to the edit distance. ANLS refers to the NLS scores averaged over all questions, truncating scores below 0.5 to 0 before averaging.
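A self-contained sketch of NLS/ANLS as defined in Eq. (7) is given below, with a plain-Python edit distance; it assumes a single ground-truth answer per question for simplicity.

```python
def edit_distance(a, b):
    """Levenshtein distance d_L between strings a and b (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def anls(predictions, ground_truths, threshold=0.5):
    """Average NLS over all questions, truncating scores below the threshold to 0."""
    scores = []
    for y_ans, y_gt in zip(predictions, ground_truths):
        nls = 1 - edit_distance(y_ans, y_gt) / max(len(y_ans), len(y_gt), 1)
        scores.append(nls if nls >= threshold else 0.0)
    return sum(scores) / len(scores)
```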
For scene text words, we use the Rosetta-en OCR tokens from Pythia (Singh et al., 2018). From the total of 64830 detected scene text words, GKB offers candidate meanings for 39053 words. The final knowledge annotations mined, after the filtering described above, consist of 23132 words and their candidate meanings.

4.2. Comparison with state-of-the-art

In Table 1 we report our Text-VQA open-ended question answering results on the TextVQA and ST-VQA datasets. In row 7, we present our ablated instance without knowledge, namely TVQA, and in row 8, we present our full model KTVQA. For all models, we specify the OCR system used on each dataset as part of the column information.

Table 1: Results of the Text-VQA task on the TextVQA and ST-VQA datasets.

                                      TextVQA Dataset                        ST-VQA Dataset
    Model                             OCR          Val. Acc.   Test Acc.     OCR          Val. ANLS   Test ANLS
 1  LoRRA (Singh et al., 2019a)       Rosetta-ml   26.56       27.63         -            -           -
 2  M4C (Hu et al., 2020)             Rosetta-en   39.44       39.10         Rosetta-en   0.472       0.462
 3  SMA (Gao et al., 2020)            Rosetta-en   40.05       40.66         Rosetta-en   -           0.466
 4  CRN (Liu et al., 2020)            Rosetta-en   40.39       40.96         Rosetta-en   -           0.483
 5  LaAP-Net (Han et al., 2020)       Rosetta-en   40.68       40.54         Rosetta-en   0.497       0.485
 6  SA-M4C (Kant et al., 2020)        Google-OCR   43.9        -             Google-OCR   0.512       0.504
 7  TVQA                              Google-OCR   41.7        -             Rosetta-en   0.468       -
 8  KTVQA                             Google-OCR   44.26       44.2          Rosetta-en   0.499       0.48

Our primary observation is the 2.5% absolute improvement of KTVQA over the non-knowledge baseline TVQA across datasets. For the TextVQA dataset, our model improves upon the state of the art (Kant et al., 2020). For the ST-VQA dataset, our results are comparable with models using a similar OCR system. Table 1 also highlights the correlation between a better OCR system and improved answer generation. We observe that, in general, models with the Google-OCR system perform better than those with Rosetta OCR systems. To elucidate this further, in Table 4 we report the Upper Bound accuracies on the TextVQA validation set against the different OCR systems. The OCR systems can be judged by comparing their OCR Upper Bound accuracy, which is the percentage of ground truth answers that can be composed of OCR tokens alone. For the validation set, the Rosetta-en OCR tokens have an OCR Upper Bound accuracy of 44.9% compared to 53.4% for Google-OCR, as shown in Table 4. For the ST-VQA dataset, we only have Rosetta-en OCR tokens, which, as we have seen from the TextVQA dataset, are not as good as the Google-OCR tokens. Thus, for the ST-VQA dataset, the erroneous Rosetta-en tokens not only mean missing out on answer words, leading to a lower OCR UB compared to TextVQA, but also amount to fewer knowledge hits compared to TextVQA, as listed in Table 3. While our method generates results comparable to the state of the art on both datasets, what is more promising is that the knowledge channel can be readily incorporated into the existing state-of-the-art models, which could lead to further improvements in those models.

Table 3: Dataset comparison. OCR UB: percentage of answers comprising OCR tokens alone, LA UB: percentage of answers comprising fixed vocabulary words only, OCR + LA UB: percentage of answers comprising OCR tokens and answer vocabulary words, i.e., the maximum achievable accuracy.

    Dataset             TextVQA      ST-VQA
    #Images             28408        23038
    #Question-Answer    45,336       31791
    OCR system          Google-OCR   Rosetta-en
    #Scene-Text         189551       64830
    #Mined Knowledge    57666        23132
    OCR UB (%)          53.4         50.53
    LA UB (%)           57.14        67.16
    OCR + LA UB (%)     84.4         83.33

Across models, pre-training on additional tasks and datasets can lead to significant improvements, as is observed in Table 2. These datasets, however, comprise additional images and question/answer pairs and require resource-intensive manual annotation. Similarly, pre-training requires additional computing resources. In contrast, our mined external knowledge consists of text descriptions for a subset of scene text words, obtained in an unsupervised manner from web-scraped data. Further, our mined knowledge facts form part of our feature set for each image and do not require additional training. However, despite the additional pre-training tasks for TAP (row 4) and the additional pre-training datasets for M4C (row 2) and SA-M4C (row 3), our knowledge enabled model KTVQA (row 1) still performs on par with them. Finally, we note that the improvements in TAP (row 6) can be attributed not just to the additional annotated datasets but also to improved OCR quality, as is evident by comparing rows 4 and 5.

Table 2: Results of related methods on the TextVQA dataset with additional pre-training tasks and datasets.

    Model                        Pre-training Task   Pre-training Dataset       OCR             Val. Acc.   Test Acc.
 1  KTVQA                        -                   -                          Google-OCR      44.26       44.2
 2  M4C (Hu et al., 2020)        -                   ST-VQA                     Rosetta-en      40.55       40.46
 3  SA-M4C (Kant et al., 2020)   -                   ST-VQA                     Google-OCR      45.4        44.6
 4  TAP (Yang et al., 2021)      MLM, ITM, RPP       -                          Rosetta-en      44.4        -
 5  TAP (Yang et al., 2021)      MLM, ITM, RPP       -                          Microsoft-OCR   49.91       49.71
 6  TAP (Yang et al., 2021)      MLM, ITM, RPP       ST-VQA, Textcaps, OCR-CC   Microsoft-OCR   54.71       53.97
Table 4: Upper Bounds and OCR systems. OCR UB: percentage of answers comprising OCR tokens alone, Vocab UB: percentage of answers comprising fixed vocabulary words only, OCR + Vocab UB: percentage of answers comprising OCR tokens and answer vocabulary words, i.e., the maximum achievable accuracy.

                         TextVQA Val accuracy (%)
                     Rosetta-ml   Rosetta-en   Google-OCR
    OCR UB           37.12        44.9         53.4
    Vocab UB         48.46        59.02        57.14
    OCR + Vocab UB   67.56        79.72        84.4

4.3. How knowledge helps

Improved Answer Entity type correctness. The question asked often pre-determines the type of answer entity. Entity-type question words like 'what time', 'which place', 'how wide' point towards distinct answer entity types. In row 1 of Figure 6, we see examples where external knowledge about the OCR tokens enables us to learn this entity type association better. As seen from the responses of TVQA in row 1, we note that the semantic word embedding (Bojanowski et al., 2017) and character information (Almazán et al., 2014) about an OCR token lead to basic entity type correctness, such as numeric versus non-numeric. However, the knowledge descriptions of the scene text words are typically verbose and detailed, allowing us to better associate the scene text words with related question words.

Multiword Entity Recognition. Multiword entities are entities with multiple words in their name, e.g. 'New York', 'Bob Dylan', 'Into The Wild', etc. We detect multiword entities during the external knowledge extraction phase, where we bind together OCR tokens if they are part of the same named entity. E.g., if for the OCR token 'York' we find a returned named entity 'New York' from GKB and the OCR token 'New' is in the neighborhood, we bind them as 'New York'.
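A simplified sketch of this binding step follows; the adjacency-based neighbourhood test and the gkb_names lookup structure are our assumptions for illustration.

```python
def bind_multiword_entities(ocr_tokens, gkb_names):
    """ocr_tokens: recognized words in reading order;
    gkb_names: dict mapping a token to the entity name GKB returned for it."""
    bound, skip = [], set()
    for i, token in enumerate(ocr_tokens):
        if i in skip:
            continue
        name_words = gkb_names.get(token.lower(), "").split()
        span = [w.lower() for w in ocr_tokens[i:i + len(name_words)]]
        if len(name_words) > 1 and span == name_words:
            bound.append(" ".join(name_words))          # e.g. 'new' + 'york' -> 'new york'
            skip.update(range(i, i + len(name_words)))
        else:
            bound.append(token)
    return bound
```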

Figure 6: Examples of knowledge effectiveness. TVQA is our baseline, similar to M4C (Hu et al., 2020), and does not use external knowledge. Correct answers are marked in blue, while incorrect ones are marked in red. Multiword entities (row 3) are marked in purple. In row 1, we observe that with knowledge we are able to improve answer entity correctness. Row 2 shows the effectiveness of external knowledge in dealing with dataset bias. Finally, in row 3, we showcase our multiword entities.

In row 3 of Figure 6, we see examples of correct answers using multiword entities. For popular word pairs like 'New York', our trained multimodal transformer itself should be enough to predict them in sequence, but for a lesser-seen couplet of words such as 'voll damm', such binding is more beneficial.

Mitigating dataset bias. Problems due to dataset biases are well studied (Kafle and Kanan, 2017b; Agrawal et al., 2017). They can be attributed to the data distribution, which results in the network favoring answers or class types of which it has seen more examples, often ignoring the lesser-seen correct answers. Row 2 of Figure 6 shows examples of knowledge mitigating dataset bias. Despite the presence of the correct answer words among the detected OCR tokens, the baseline TVQA failed to produce the correct answer words, instead predicting semantically similar words from the vocabulary. Ostensibly, the predicted answer words are popular, with many examples in the training data. Knowledge helps against bias by bringing in much-needed additional information about lesser-seen OCR tokens.

4.4. Ablation Studies

In this section, we report results on ablated instances of our model across datasets to highlight the contribution of individual model components and to draw insights about the transferable properties of external knowledge.

Table 5: Ablation study 1: Role of knowledge-related components.

                                                TextVQA     ST-VQA
    Model       Validation   Reasoning          Val. Acc.   Val. ANLS
 1  TVQA        -            -                  41.7        0.4893
 2  KTVQAUnC    Context      Unconstrained      42.73       0.4875
 3  KTVQARnd    Random       Constrained        42.89       0.4893
 4  KTVQAAll    Sum of All   Constrained        43.54       0.4856
 5  KTVQA       Context      Constrained        44.26       0.4991

Ablation study 1: Role of Model Components. We propose validating candidate knowledge facts through image context and reasoning with this validated knowledge through constrained interaction. In this section, we investigate the individual roles played by these two knowledge components by comparing our full model with ablated instances. We briefly describe the ablated models below:
Figure 7: Examples of knowledge limitations. The correct answers are marked in Blue, while incorrect ones are marked with Red. Multi-word
Entities are marked with Purple

Table 5: Ablation study 1: Role of model components

      Model       Validation   Reasoning       TextVQA Val. Acc.   ST-VQA Val. ANLS
  1   TVQA        -            -               41.7                0.4893
  2   KTVQAUnC    Context      Unconstrained   42.73               0.4875
  3   KTVQARnd    Random       Constrained     42.89               0.4893
  4   KTVQAAll    Sum of All   Constrained     43.54               0.4856
  5   KTVQA       Context      Constrained     44.26               0.4991

Ablation study 1: Role of Model Components. We propose validating candidate knowledge facts through image context and reasoning with this validated knowledge through constrained interaction. In this section, we investigate the individual role played by these two knowledge components by comparing our full model with ablated instances. We briefly describe the ablated models below:

KTVQA. KTVQA is our full model, which uses context to validate and thus select knowledge, and enforces constrained interaction during reasoning with the knowledge nodes.

TVQA. As described in Sec. 4.2, TVQA is our non-knowledge baseline similar to M4C. It does not use any knowledge nodes and is thus devoid of all of our knowledge pipeline components.

KTVQAUnC. We propose the use of constrained interaction, as detailed in Sec. 3.2, to preserve knowledge legitimacy. In KTVQAUnC, we allow unconstrained interaction between the knowledge and text nodes by using a zero matrix as the attention mask A. The remaining components are the same as in the full KTVQA model, including the context-based validation.
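The constrained mask itself is defined in Sec. 3.2 and is not reproduced here, so the sketch below only illustrates the mechanism being ablated. It assumes the mask is additive, that the constrained variant ties each knowledge node to its corresponding OCR token, and that the unconstrained variant (KTVQAUnC) uses the all-zero matrix, i.e., full attention; shapes and names are illustrative.

```python
import torch


def knowledge_attention_mask(num_ocr: int, num_knw: int,
                             constrained: bool) -> torch.Tensor:
    """Additive attention mask A over the [OCR tokens | knowledge nodes] block.

    A zero entry lets position i attend to position j; -inf blocks it.
    Unconstrained (KTVQAUnC): the all-zero matrix, every node attends to every
    other node. Constrained (assumed form): each knowledge node interacts only
    with its own OCR token, preserving knowledge legitimacy. Question and
    object nodes, handled elsewhere in the full mask, are omitted here.
    """
    n = num_ocr + num_knw
    mask = torch.zeros(n, n)
    if not constrained:
        return mask                            # zero matrix: full attention
    neg_inf = float('-inf')
    mask[:num_ocr, num_ocr:] = neg_inf         # block all OCR -> knowledge links
    mask[num_ocr:, :num_ocr] = neg_inf         # block all knowledge -> OCR links
    for i in range(min(num_ocr, num_knw)):     # re-open the paired links only
        mask[i, num_ocr + i] = 0.0
        mask[num_ocr + i, i] = 0.0
    return mask


# usage: the mask is added to the raw attention scores before the softmax, e.g.
# scores = q @ k.transpose(-2, -1) / d ** 0.5 + knowledge_attention_mask(m, m, True)
```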
KTVQARnd. In Sec. 3.2 we argue for the use of context to validate and select from amongst candidate meanings. KTVQARnd is our ablated instance, which does not use context but instead randomly selects from among the candidate meanings. It is similar to KTVQA in every other respect, including the constrained interaction.

KTVQAAll. This is our ablated instance which, unlike KTVQA or KTVQARnd, does not select one particular candidate meaning but instead aggregates all candidate meanings. Our implementation achieves this by taking the sum of the BERT embeddings of all candidates instead of selecting one.
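As a rough illustration of how the three selection strategies (context-based, random, and sum of all) differ, consider the sketch below. The cosine-similarity scoring against an image-context embedding is an assumed stand-in for the validation score of Sec. 3.2, and all names are illustrative.

```python
import random

import torch
import torch.nn.functional as F


def pick_knowledge_embedding(cand_embs: torch.Tensor,
                             context_emb: torch.Tensor,
                             mode: str = 'context') -> torch.Tensor:
    """Collapse the candidate-meaning embeddings of one OCR token into a single
    knowledge-node embedding.

    cand_embs:   (num_candidates, d) BERT embeddings of the candidate descriptions
    context_emb: (d,) embedding summarising the image context
    mode:        'context' (KTVQA), 'random' (KTVQARnd) or 'all' (KTVQAAll)
    """
    if mode == 'all':          # KTVQAAll: no selection, aggregate by summation
        return cand_embs.sum(dim=0)
    if mode == 'random':       # KTVQARnd: ignore the image context entirely
        return cand_embs[random.randrange(cand_embs.size(0))]
    # KTVQA: score every candidate against the image context and keep the best;
    # cosine similarity is an illustrative stand-in for the validation score.
    scores = F.cosine_similarity(cand_embs, context_emb.unsqueeze(0), dim=-1)
    return cand_embs[scores.argmax()]
```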

As shown in Table 5, for both datasets, the inclusion of knowledge leads to an improvement over the M4C-based non-knowledge baseline TVQA. This improvement is more pronounced for TextVQA than for ST-VQA, by virtue of TextVQA having more knowledge entries, as evident in Table 3. Further, we demonstrate the effectiveness of our constrained interaction through its improvement upon the unconstrained baseline KTVQAUnC. As discussed in Sec. 3.2, our constrained interaction enforces knowledge legitimacy, enabling the model to discriminate amongst OCR tokens based on their corresponding knowledge nodes. Next, we highlight the improvements from context-based candidate knowledge validation and selection by comparing against two baselines: KTVQARnd, which selects randomly, and KTVQAAll, which makes no selection and instead settles for aggregation. Our results validate our hypothesis that context can be used to select valid candidates that help downstream reasoning.

Ablation study 2: Transferable property of external knowledge. We propose the use of externally mined knowledge to augment our understanding of scene text nodes. In this section, we investigate the transferable property of the said external knowledge by comparing our full model with the ablated instances below:

TVQA∗. TVQA∗ is the TVQA model loaded with KTVQA-trained weights, mimicking a model that has seen external knowledge during training but does not use knowledge during testing.

KTVQA∗. KTVQA∗ is the KTVQA model loaded with TVQA-trained weights. It mimics a model that is not trained with knowledge but can use knowledge during testing.
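A minimal sketch of how such a cross-variant weight transfer could be set up is given below, assuming both variants share a common transformer backbone so that parameter names and shapes largely coincide. This is not the authors' training code; the model class and checkpoint path in the usage comment are hypothetical.

```python
import torch


def load_transferred_weights(model: torch.nn.Module, ckpt_path: str) -> None:
    """Initialise one model variant (e.g. TVQA*) with the weights of another
    (e.g. a trained KTVQA checkpoint). Only parameters whose names and shapes
    match are copied; knowledge-specific parameters that the target variant
    does not have are simply skipped."""
    source = torch.load(ckpt_path, map_location='cpu')
    target = model.state_dict()
    transferable = {name: weight for name, weight in source.items()
                    if name in target and weight.shape == target[name].shape}
    target.update(transferable)
    model.load_state_dict(target)


# e.g. tvqa_star = TVQA(); load_transferred_weights(tvqa_star, 'ktvqa_best.pt')
```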
In Table 6, rows 1-2, we observe that even if the use of external knowledge is limited to the training phase, we obtain significant improvements in the test results. This is in line with our previous hypothesis, Sec. 4.3, about how the verbose knowledge descriptions lead to better generalization. Thus, while our model uses no explicit instance-specific knowledge during testing, the knowledge-enabled semantics learned during training lead to the improvement. Rows 3-4 suggest that knowledge cannot simply be plugged in only at test time; learning the text-knowledge association, enabled through our knowledge components, is essential.
Table 6: Ablation study 2: Transferable property of external knowledge

      Model     Knowledge (Train)   Knowledge (Test)   TextVQA Val. Acc.   ST-VQA Val. ANLS
  1   TVQA      ×                   ×                  41.7                0.4893
  2   TVQA∗     ✓                   ×                  43.12               0.4908
  3   KTVQA∗    ×                   ✓                  39.36               0.4635
  4   KTVQA     ✓                   ✓                  44.26               0.4991
4.5. Bottleneck and Limitations

Knowledge effectiveness depends on OCR quality. Valid knowledge extraction depends on correct OCR recognition. Incorrect OCR recognition can lead to either a complete miss or false positives in the knowledge extraction phase. In the first and second images of Figure 7, the incorrect OCR tokens ‘goisol’ (‘goisot’) and ‘ericsso’ (‘ericsson’) cannot be helped by external knowledge, which subsequently leads to incorrect answer predictions in both cases. As discussed in Sec. 4.1, GKB has some tolerance towards incorrect spellings and can return correct meanings even for incorrect OCR tokens. So while in the case of ‘goisot’ we received no hits, for ‘ericsso’ this tolerance allowed us to retrieve the meaning corresponding to ‘ericsson’. However, this does not help, as the final answer words are selected from the OCR tokens, as discussed in Sec. 3.3. Thus, despite having the correct meaning corresponding to ‘ericsson’, our predicted answer word was the incorrect OCR token ‘ericsso’.

Knowledge is not complete. Incompleteness in knowledge can manifest itself in two ways: either as a complete miss or as a false positive that incorrectly refers to another entity with which the queried text shares its name. Knowledge incompleteness is thus a major cause of failure, as illustrated in Figure 7. For the fourth and fifth images, the answer-forming OCR tokens ‘worth’ and ‘vnes’ were absent from the external knowledge base. For the third image, the OCR token ‘suica’ refers to a smart card or a village in Bosnia, according to GKB, which fails to include any reference to a cell phone case.

4.6. Results Summary

In this section we highlight the key takeaways from our results. Our main takeaway is that the use of external knowledge leads to improved generalization and allows us to deal with dataset bias. Our primary results in Table 1 show that, given the same upstream OCR system and training data, our results match the state-of-the-art on two public datasets. Furthermore, Table 2 suggests our model (row 1) is on par with models trained on additional datasets (rows 2, 3), and even with those incorporating pretraining tasks (rows 4, 5).
Through our experiments, with empirical evidence, we also justify our model components. We identify the image contextual validity check as an important subtask for noisy web data and justify it through our results (Table 5, rows 3 and 4 against row 5). Our arguments for node-to-node knowledge legitimacy, achieved through constrained interaction, are validated by our results in Table 5 (row 2 against row 5).
Finally, our results also identify OCR quality and knowledge incompleteness as the major bottlenecks to overcome in the future.

5. Conclusion

In this work, we look at the use of mined external knowledge applied to the open-ended Text-VQA question answering task. Through our experiments, we observe that this external knowledge not only provides invaluable information about unseen scene text elements but also augments our understanding of the text in general with detailed, verbose descriptions. Our knowledge-enabled model is robust to novel text, predicts answers with improved entity type correctness, and can even recognize multiword entities. However, the knowledge pipeline is susceptible to erroneous OCR tokens, which can lead to false positives or complete misses. This also explains how our performance on the datasets is correlated with the OCR systems used.
While the use of knowledge in vision-language understanding tasks has been proposed in the past, we propose an end-to-end pipeline to extract, validate, and reason with noisy, web-scraped external knowledge. We present this as a complementary approach to training on additional annotated data or pre-training tasks to achieve generalization and deal with dataset bias. Finally, we hope our results inspire the community to incorporate external knowledge pipelines into their work, opening them up to knowledge and potential improvements in results.

6. Acknowledgement

This work was supported by the Scheme for Promotion of Academic and Research Collaboration (SPARC), Ministry of Human Resource Development (MHRD), Government of India.

References

Agrawal, A., Kembhavi, A., Batra, D., Parikh, D., 2017. C-vqa: A compositional split of the visual question answering (vqa) v1.0 dataset. ArXiv abs/1704.08243.

Almazán, J., Gordo, A., Fornés, A., Valveny, E., 2014. Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 2552–2566.

Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L., 2018. Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086.

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D., 2015. Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, pp. 2425–2433.

Bigham, J., Jayant, C., Ji, H., Little, G., Miller, A., Miller, R., Miller, R., Tatarowicz, A., White, B., White, S., Yeh, T., 2010. Vizwiz: Nearly real-time answers to visual questions, pp. 333–342.

Biten, A.F., Tito, R., Mafla, A., Gomez, L., Rusinol, M., Valveny, E., Jawahar, C., Karatzas, D., 2019. Scene text visual question answering, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 4291–4301.

Bojanowski, P., Grave, E., Joulin, A., Mikolov, T., 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146.

Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J., 2020. Uniter: Universal image-text representation learning, in: European conference on computer vision, Springer. pp. 104–120.

Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2019. Bert: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT.

Dey, A.U., Ghosh, S.K., Valveny, E., Harit, G., 2021. Beyond visual semantics: Exploring the role of scene text in image understanding. Pattern Recognition Letters 149, 164–171.

Gao, C., Zhu, Q., Wang, P., Li, H., Liu, Y., Hengel, A.V., Wu, Q., 2020. Structured multimodal attentions for textvqa. ArXiv abs/2006.00753.

Goyal, Y., Khot, T., Agrawal, A., Summers-Stay, D., Batra, D., Parikh, D., 2018. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. International Journal of Computer Vision 127, 398–414.

Han, W., Huang, H., Han, T., 2020. Finding the evidence: Localization-aware answer prediction for text visual question answering, in: Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, International Committee on Computational Linguistics.

Hu, R., Singh, A., Darrell, T., Rohrbach, M., 2020. Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9989–9999.

Hussain, Z., Zhang, M., Zhang, X., Ye, K., Thomas, C., Agha, Z., Ong, N., Kovashka, A., 2017. Automatic understanding of image and video advertisements. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1100–1110.

Kafle, K., Kanan, C., 2017a. An analysis of visual question answering algorithms, in: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1983–1991.

Kafle, K., Kanan, C., 2017b. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding 163, 3–20.

Kant, Y., Batra, D., Anderson, P., Schwing, A., Parikh, D., Lu, J., Agrawal, H., 2020. Spatially aware multimodal transformers for textvqa, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, Springer. pp. 715–732.

Karaoglu, S., Tao, R., Gevers, T., Smeulders, A.W., 2017. Words matter: Scene text for image classification and retrieval. IEEE Transactions on Multimedia 19.

Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., Shafait, F., Uchida, S., Valveny, E., 2015. Icdar 2015 competition on robust reading, in: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160.

Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., De Las Heras, L.P., 2013. Icdar 2013 robust reading competition, in: 2013 12th International Conference on Document Analysis and Recognition, IEEE. pp. 1484–1493.

Kim, J.H., Jun, J., Zhang, B.T., 2018. Bilinear attention networks, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 1571–1581.

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D., Bernstein, M.S., Fei-Fei, L., 2016. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 32–73.

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al., 2020. The open images dataset v4. International Journal of Computer Vision 128, 1956–1981.

Le, Q., Mikolov, T., 2014. Distributed representations of sentences and documents, in: Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, JMLR.org. pp. II-1188–II-1196.

Liu, F., Xu, G., Wu, Q., Du, Q., Jia, W., Tan, M., 2020. Cascade reasoning network for text-based visual question answering, in: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4060–4069.

Lu, J., Batra, D., Parikh, D., Lee, S., 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 13–23.

Mishra, A., Karteek, A., Jawahar, C.V., 2013. Image retrieval using textual cues. 2013 IEEE International Conference on Computer Vision, 3040–3047.

Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A., 2019. Ocr-vqa: Visual question answering by reading text in images, in: 2019 International Conference on Document Analysis and Recognition (ICDAR), IEEE. pp. 947–952.

Narasimhan, M., Lazebnik, S., Schwing, A.G., 2018. Out of the box: Reasoning with graph convolution nets for factual visual question answering, in: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 2659–2670.

Narasimhan, M., Schwing, A.G., 2018. Straight to the facts: Learning knowledge base retrieval for factual visual question answering, in: Proceedings of the European Conference on Computer Vision (ECCV).

Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in neural information processing systems, pp. 91–99.

Shah, S., Mishra, A., Yadati, N., Talukdar, P.P., 2019. Kvqa: Knowledge-aware visual question answering, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8876–8884.

Singh, A., Natarajan, V., Jiang, Y., Chen, X., Shah, M., Rohrbach, M., Batra, D., Parikh, D., 2018. Pythia-a platform for vision & language research, in: SysML Workshop, NeurIPS.

Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M., 2019a. Towards vqa models that can read, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317–8326.

Singh, A.K., Mishra, A., Shekhar, S., Chakraborty, A., 2019b. From strings to things: Knowledge-enabled vqa model that can read and reason, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4602–4612.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: Advances in neural information processing systems.

Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.J., 2016. Coco-text: Dataset and benchmark for text detection and recognition in natural images. ArXiv abs/1601.07140.

Wang, P., Wu, Q., Shen, C., Dick, A., van den Hengel, A., 2018. Fvqa: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2413–2427.

Wu, Q., Wang, P., Shen, C., Dick, A., Van Den Hengel, A., 2016. Ask me anything: Free-form visual question answering based on knowledge from external sources, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4622–4630.

Yang, Z., Lu, Y., Wang, J., Yin, X., Florencio, D., Wang, L., Zhang, C., Zhang, L., Luo, J., 2021. Tap: Text-aware pre-training for text-vqa and text-caption, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8751–8761.

Ye, K., Zhang, M., Kovashka, A., 2021. Breaking shortcuts by masking for robust visual reasoning, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3520–3530.

Arka Ujjal Dey received his Bachelor's and Master's degrees in Computer Science from Calcutta University in 2010 and 2012, respectively. In 2015 he obtained his Master of Technology degree from IIT Jodhpur, where he is currently enrolled as a Ph.D. student in the Computer Science and Engineering Department. His research interests lie at the intersection of Computer Vision and Natural Language Processing.

Ernest Valveny received the Ph.D. degree in 1999 from the Universitat Autònoma de Barcelona (UAB), where he is currently an associate professor and member of the Computer Vision Center (CVC). His research work focuses mainly on document analysis and pattern recognition, and more specifically on the fields of robust reading, text recognition and retrieval, and document understanding. He has published more than 20 papers in international journals and more than 100 papers in international conferences. He is a member of IAPR and of the editorial board of the International Journal on Document Analysis and Recognition, and has served as a reviewer for many international journals and conferences.

Gaurav Harit received a Bachelor of Engineering degree in Electrical Engineering from M.B.M. Engineering College, Jai Narayan Vyas University Jodhpur, in 1999. He received his Master of Technology and Ph.D. degrees from IIT Delhi in 2001 and 2007, respectively. He has been a faculty member in the Department of Computer Science and Engineering at the Indian Institute of Technology Jodhpur since 2010. Before joining IIT Jodhpur, he was a faculty member in the Department of Computer Science and Engineering at IIT Kharagpur. His areas of interest are Document Image Analysis and Video Analysis.
