                                                     ICDSST 2021 Proceedings – Online Version
       The EWG-DSS 2021 INTERNATIONAL CONFERENCE ON DECISION SUPPORT SYSTEM TECHNOLOGY
                                                                                       (editors)
                                                              Loughborough, UK, 26-28 May 2021

                          ICDSST 2021 on
      Decision Support Systems, Analytics and Technologies in
               response to Global Crisis Management

   Automated physical document interpretation and validation via
                      artificial intelligence
  Giulia De Poli, Davide Cui, Manuela Bazzarelli, Leone De Marco, Matteo Bregonzio
                                      3rdPlace SRL
                          Foro Buonaparte 71, 20121 Milan, Italy
 giulia.depoli@3rdplace.com, davide.cui@3rdplace.com, manuela.bazzarelli@3rdplace.com,
              leone.demarco@3rdplace.com, matteo.bregonzio@3rdplace.com
                              web-page: www.3rdplace.com

                                         ABSTRACT

Identity validation and data gathering through physical documents (papers and cards) are still
widely adopted in several sectors, including retail banking, insurance, and public
administration. These tasks generally consist of manually transcribing relevant fields from
physical documents into a dedicated software application. This persists for many reasons, such
as the sensitive nature of these operations and the lack of suitable digital alternatives; however,
the process remains time consuming, expensive, and prone to human error. There is therefore
still a strong need to automate physical document interpretation and data extraction, especially
in the instances where documents can rapidly be digitised (scanned). We devised an innovative
A.I.-based pipeline to automatically acquire, process, and validate digitised documents and the
related manually extracted fields. Our solution constitutes a support layer for human agents,
enabling faster document processing and drastically reducing operational errors. The proposed
pipeline leverages Cloud infrastructure for scalability and employs several A.I. techniques,
from Computer Vision to Convolutional Neural Networks. Although accuracy varies depending
on the document type and the specific field to recognise, we observed promising experimental
results: on average, the system reached 92.6% accuracy for document recognition and 81.1%
for field extraction. Furthermore, we foresee a significant reduction in operational time and
errors.

Keywords: Document Processing, Document OCR, Intelligent data capture, Machine
learning, Document validation, Object detection, Computer Vision.

INTRODUCTION
    Nowadays, we are increasingly accustomed to quickly validating our identity or sharing
personal data using digital devices and secrets: passwords, digital signatures, fingerprint
readers, or two-factor authentication. Despite the global digitalisation process, physical
documents remain relevant and widely adopted for sensitive tasks such as opening a new
bank account, loan or insurance applications, and public administration enquiries. Although
governments have made a significant effort to create “digital ready” physical documents, not
all institutions have caught up and, most importantly, a large share of circulating documents
is still of the older kind. Physical documents are, to this day, processed manually by agents
who have to extract the relevant information and transcribe it into specialized software
applications. In such a landscape, there is a strong need to automate, at least partially, this
pipeline in order to make the process more efficient, quicker, and less error prone. To address
this need, we devised and implemented an innovative
system able to automatically extract the relevant information from scanned documents and
compare it with the data extracted by a human agent. The checks performed by the automated
system support the agents by promptly notifying them if inconsistencies are detected between
the analysed documents and the entered information, as shown in Figure 1. Given the high
specificity of the proposed system, we could not find in the literature works that encompassed
the entirety of its scope. On the other hand, automatic document processing is a very popular
domain, for obvious reasons. A huge body of work has been devoted to many different aspects
such as Optical Character Recognition [1], document classification employing Natural
Language Processing techniques [2], etc. In brief, the two main problems tackled by our system
are: (1) identification of the document type from its scanned image (by document type we mean
ID card, driving licence, passport, etc.); (2) for a given document type, extraction of a subset of
salient fields such as name, surname, address, expiry date, etc. Regarding the first problem, the
literature is rich with works that either focus on the subtask of identifying and classifying
different parts of the document (such as tables, graphs) [3,4] or methods that focus on a specific
language (usually English) and need a huge amount of training data [5]. In contrast to this, our
application needed methods that were language agnostic, fast and that did not require a large
training set. Additionally, the system needed to be able to identify several types of documents,
including several formats for each type (i.e., paper version ID and plastic card ID). For this
reason, we took inspiration from template matching techniques [6] and Deep Learning with
transfer learning [7] for solving the document object detection task. Similarly, when developing
the information extraction module, we struggled to find relevant literature suitable for our task:
most works were too specific [8]. To this end, we decided to integrate the method of Rusinol et
al. [9] into our pipeline.

Figure 1. Support system for physical document field extraction. Scenario a: the agent
    correctly transcribes relevant fields from the physical documents into a dedicated software
 application; the file is successfully submitted. Scenario b: a field is erroneously transcribed,
   hence the agent promptly receives a message indicating the mistake, enabling him/her to
                                quickly fix it and resubmit the file.

SYSTEM FLOW AND ARCHITECTURE
   In a nutshell, our system is a support layer for the human agents in charge of manually
processing documents and entering data into specialized software applications. On the one
hand, this support aims to drastically speed up processing and reduce operational errors; on the
other, it needs to be highly compatible with legacy applications. To deliver quasi-real-time
outputs, the proposed architecture exploits Google Cloud Platform technologies such as Cloud
Functions, Compute Engine, Pub/Sub, Cloud Storage, and Cloud Run.
   To better convey the complexity of the problem, it is important to mention that each input
scanned image may contain more than one document (e.g., an identity card and a driving
licence) and the system is not notified of the kind of documents contained in the image.
Additionally, the key-value pairs transcribed by the human agent are provided as input in a
JSON file, without specifying from which document they were extracted. As presented in
Figure 2, the architecture is composed of four logical modules:
   REST Interface - allows the system to receive both the scanned documents (images) and
the information extracted by the human agent (JSON payload). It saves both images and
information to cloud storage and activates a Preprocessor instance.
   Preprocessor - standardizes all images without compromising quality by converting them
to grayscale PNG format; subsequently, for each image, it invokes one Document Service
instance for each possible document type (a minimal sketch of this intake and standardization
flow is given after this list).
   Document Service - Each document service instance will try to detect a specific document
type and, if successful, process the image in order to crop it, straighten it and perform Optical
Character Recognition and field key-value assignment on it.
   Aggregator - Finally, the aggregator receives the outputs of all Document Service
instances: for each instance, either a document was recognized, processed and its field values
recognized and extracted, or not (in this case the output is “not recognized”). Based on these
results, the aggregator is able to determine which fields and documents were recognized and
can compare the extracted field key-value pairs with the ones sent by the agent. If any
documents could not be recognized and/or any fields were not consistent with the input, an
error message with the details is generated. The message is sent by the aggregator to the REST
Interface, which in turn relays the message to the agent.
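   To make the input handling concrete, the following minimal sketch (in Python, using
OpenCV) illustrates the intake and standardization steps described above; the payload layout,
field names, and document type list are illustrative assumptions rather than the production
schema.

    import json
    import cv2

    def handle_submission(image_paths, agent_json_path):
        # Key-value pairs transcribed by the human agent, e.g.
        # {"name": "Mario", "surname": "Rossi", "document_number": "AB1234567"}
        with open(agent_json_path) as f:
            agent_fields = json.load(f)

        standardized = []
        for path in image_paths:
            # Convert each scan to grayscale and store it as a lossless PNG
            # so that downstream stages receive a uniform format.
            gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            out_path = path.rsplit(".", 1)[0] + "_gray.png"
            cv2.imwrite(out_path, gray)
            standardized.append(out_path)

        # One Document Service invocation per (image, document type) pair,
        # since the document type contained in each image is not known upfront.
        document_types = ["identity_card", "driving_licence", "passport"]  # illustrative subset
        jobs = [(img, doc) for img in standardized for doc in document_types]
        return agent_fields, jobs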

      Figure 2. System architecture of the proposed support system. The system input is
composed of scanned images of physical documents and a JSON with the human agent field
 transcriptions. A REST interface is employed to interact with the application used by the
agent. Preprocessor, Document Service, and Aggregator are used to process input images
                           and validate the agent transcriptions.

MATERIALS & METHODS
Dataset
   The training dataset is composed of (1) document images and (2) the related information
extracted by the agents, which is stored as key-value pairs within a JSON file. The analysed
documents belong to six types: fiscal code card, NHS card, identity card (three versions),
driving licence (three versions), passport and residence permit. Some documents appear in
more than one version due to layout changes over time, making a total of ten document layouts.
In addition, each document layout has both a front and a back side. Document layouts are not
equally represented in the dataset; for instance, paper driving licences, which were issued
until 1999, have far fewer samples than modern card driving licences.

Document Service
  The Document Service module is composed of two submodules responsible for: i)
document recognition and ii) field recognition and text extraction.

   The document recognition submodule identifies the area of the image (box) where the
document is present, crops it, and straightens it. Like its parent module, each document
recognition instance is specific to a document type and is trained to recognize its front and
back sides. We developed two methods: Method 1 for documents for which we had at least 100
samples in our training set, and Method 2 otherwise.
    In Method 1, an AutoML Vision [10] model based on Convolutional Neural Networks is
trained specifically for each document type that meets the training quota; in our experiment
this resulted in four models. Each trained model detects three bounding box classes: front
template, back template, or other (for document types different from the one recognized by the model).
Finally, bounding boxes are straightened using image registration. Image registration [11] is
achieved by computing the phase cross-correlation between the Discrete Fourier Transforms
(DFTs) of a reference template and the selected bounding box. The end-to-end flow of
Method 1 is depicted in Figure 3.
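   As an illustration of the straightening step, the sketch below (in Python, using scikit-image
and SciPy) estimates the offset between a reference template and a detected bounding box via
phase cross-correlation of their DFTs; file names are placeholders, the crop is assumed to have
been resized to the template dimensions, and only the translational component of the
registration is shown.

    from skimage import io
    from skimage.registration import phase_cross_correlation
    from scipy.ndimage import shift as nd_shift

    template = io.imread("reference_front_template.png", as_gray=True)
    detected = io.imread("detected_front_crop.png", as_gray=True)

    # Sub-pixel shift required to register the detected crop with the template.
    offset, error, _ = phase_cross_correlation(template, detected, upsample_factor=10)

    # Apply the estimated shift to align the crop with the reference template.
    aligned = nd_shift(detected, shift=offset)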

              Figure 3. Method 1 used to detect, crop, and straighten documents.

   Method 2 is instead used for document types with fewer training samples. It consists of a
template-matching approach based on SIFT (Scale-Invariant Feature Transform) [12]. The
SIFT keypoint matches allow the detected box to be automatically aligned against a reference
template.
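   A minimal sketch of Method 2 (in Python, using OpenCV) is given below: SIFT keypoints
are matched between a reference template and the scanned image, and a homography
estimated from the good matches warps the detected document onto the template frame; file
names and thresholds are illustrative.

    import cv2
    import numpy as np

    template = cv2.imread("reference_template.png", cv2.IMREAD_GRAYSCALE)
    scan = cv2.imread("scanned_document.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp_t, des_t = sift.detectAndCompute(template, None)
    kp_s, des_s = sift.detectAndCompute(scan, None)

    # Match descriptors and keep only good matches via Lowe's ratio test.
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(des_t, des_s, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    if len(good) >= 4:
        src = np.float32([kp_s[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        # Homography mapping the scan onto the template frame, estimated robustly.
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        h, w = template.shape
        aligned = cv2.warpPerspective(scan, H, (w, h))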

Once all document bounding boxes have been extracted, Google Vision API [14] Optical
Character Recognition (OCR) is applied. To improve OCR performance, super-resolution [13]
is first applied to enhance the image quality of the boxes. The OCR returns a set of
text-bounding box pairs utilised in the downstream steps.
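   For illustration, the following sketch (in Python) shows how text-bounding box pairs can be
obtained for a cropped document box with the Google Cloud Vision API; authentication setup
and the preceding super-resolution step are assumed to have been handled elsewhere.

    from google.cloud import vision

    def ocr_boxes(image_bytes):
        # Return (text, bounding box vertices) pairs for a cropped document image.
        client = vision.ImageAnnotatorClient()
        response = client.text_detection(image=vision.Image(content=image_bytes))
        pairs = []
        # The first annotation is the full text; the following ones are per word.
        for annotation in response.text_annotations[1:]:
            vertices = [(v.x, v.y) for v in annotation.bounding_poly.vertices]
            pairs.append((annotation.description, vertices))
        return pairs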
    In our application, given a document type, we select only the bounding boxes corresponding
to the relevant fields (such as “Name”, “Surname”, etc.). Furthermore, for each document type
a geometric template is created in order to define the distances of relevant fields from selected
reference boxes. Reference boxes are non-varying portions of the document used as anchors to
locate the relevant fields within the image. This geometric template is the strategy adopted to
extract the relevant field values.
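   The sketch below (in Python) illustrates the geometric-template idea with purely illustrative
offsets and field names, assuming the OCR boxes have been converted to (x, y, width, height)
tuples: the expected region of a field value is derived from a fixed reference (anchor) box, and
the OCR words falling inside that region are concatenated.

    def extract_field(ocr_pairs, anchor_box, offset, size):
        # ocr_pairs: list of (text, (x, y, w, h)); anchor_box: (x, y, w, h).
        ax, ay, _, _ = anchor_box
        # Region where the field value is expected, relative to the anchor.
        rx, ry = ax + offset[0], ay + offset[1]
        rw, rh = size
        words = [
            text for text, (x, y, w, h) in ocr_pairs
            if rx <= x <= rx + rw and ry <= y <= ry + rh
        ]
        return " ".join(words)

    # Example (illustrative layout): the surname value sits ~40 px below the
    # printed "Surname" label box.
    # surname = extract_field(ocr_pairs, surname_label_box, offset=(0, 40), size=(300, 30))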

Field text validation
   The Aggregator module receives the Document Service outputs and performs two tasks.
The first involves comparing the results from the different Document Service instances and
determining whether one or more documents are present and which types they belong to.
Secondly, it compares the extracted relevant fields with the human agent input and checks for
inconsistencies. The comparison tolerance depends on the field type: some fields, such as the
document number, have to match perfectly, while for others, such as addresses, where the same
address may be reported in slightly different ways (e.g., M.L. King St. vs Martin Luther King
Street), a larger tolerance is applied. Finally, a response is delivered highlighting
missing and inconsistent fields, so that the agent can perform a second check and resubmit.
Otherwise, if everything is consistent, an OK message is returned to the agent.
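   As an illustration, the following sketch (in Python) applies a per-field tolerance to the
comparison between extracted and agent-entered values using a simple string similarity ratio;
the thresholds and field names are illustrative, not those used in production.

    from difflib import SequenceMatcher

    # 1.0 means an exact match is required; lower values allow small variations.
    FIELD_TOLERANCE = {"document_number": 1.0, "name": 0.9, "address": 0.6}

    def normalize(value):
        return " ".join(value.lower().split())

    def fields_consistent(field, extracted, entered):
        similarity = SequenceMatcher(None, normalize(extracted), normalize(entered)).ratio()
        return similarity >= FIELD_TOLERANCE.get(field, 0.9)

    # Example call (illustrative):
    # fields_consistent("address", extracted_address, agent_entered_address)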

RESULTS
    We tested the system using a set of 100 randomly selected procedure submissions composed
of a total of 218 documents. Evaluation was based on two metrics:
    1. Document recognition accuracy: percentage of documents correctly identified
    2. Field extraction accuracy: percentage of correctly extracted fields on the identified
    documents.
The overall document recognition accuracy was 92.6%. It is worth mentioning that document
recognition via Method 1 performed far better than via Method 2, as shown in Table 1. In
general, the system had some difficulty processing blurry images, such as scans of
photocopied documents. Field extraction, with an average accuracy of 81.1%, showed different
performance depending on the specific field to be detected, as shown in Table 2. Some fields,
such as name and surname, which always appear in the same position within the document
template, reported an accuracy of around 90%, while others with a variable position, such as
province, had a lower accuracy. Moreover, the proposed system appears reliable for certain
document layouts such as NHS cards where fields are always printed in the same position. On
the contrary, for paper identity cards, in which fields can be printed in slightly different
positions, field extraction did not achieve such a high accuracy.

    Table 1: Document recognition accuracy reported for a subset of relevant documents
                  Relevant Document                    Method     Accuracy
                  NHS Card                             1          95.4%
                  Identity Card (card version)         1          93.7%
                  Identity Card (paper version)        1          92.1%
                  Driving Licence                      1          91.6%
                  Driving Licence (paper version)      2          89.2%
                  Passport                             2          88.4%
                  Residence Permit                     2          78.1%

          Table 2: Field extraction accuracy reported for a subset of relevant fields
                                    Relevant Field          Accuracy
                               Citizenship                  94.6%
                               Surname                      89.9%
                               Document Number              89.4%
                               Name                         88.3%
                               Release Place                82.7%
                               Date of Birth                81.4%
                               Release Date                 71.5%
                               Province                     70.3%
                               Address                      61.7%

CONCLUSIONS
   This paper proposes a support system for automatically processing scanned documents,
identifying relevant information and validating it against manually extracted data. We
demonstrated that the proposed system handles multiple challenges, including (1) working with
a variety of document types, (2) handling different layouts for each document type, (3) dealing
with the presence of multiple documents in a single image, and (4) handling different
acquisition methods (e.g., smartphone pictures, scanned photocopies, etc.).
   The presented pipeline delivers quasi-real-time output, benefitting from the high flexibility
and scalability of the Cloud. Experimental results on real-world documents are promising for
both document recognition and field extraction, with average accuracies of 92.6% and 81.1%,
respectively.
   Future improvements can be achieved by integrating deep learning models for field
recognition into the pipeline and by improving recognition of blurry, low-quality images.

REFERENCES
   1. Islam, N., Islam, Z., & Noor, N. (2017). A survey on optical character recognition
       system. arXiv preprint arXiv:1710.05703.
   2. Yesenia, O., & Veronica, S. F. (2021, February). Automatic Classification of Research
       Papers Using Machine Learning Approaches and Natural Language Processing. In
       International Conference on Information Technology & Systems (pp. 80-87). Springer,
       Cham.
   3. Tran, D. N., Tran, T. A., Oh, A., Kim, S. H., & Na, I. S. (2015). Table detection from
       document image using vertical arrangement of text blocks. International Journal of
       Contents, 11(4), 77-85.
   4. Saha, R., Mondal, A., & Jawahar, C. V. (2019, September). Graphical object detection
       in document images. In 2019 International Conference on Document Analysis and
       Recognition (ICDAR) (pp. 51-58). IEEE.
   5. Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., ... & Zhou, L. (2020). LayoutLMv2:
       Multi-modal Pre-training for Visually-Rich Document Understanding. arXiv preprint
       arXiv:2012.14740.
   6. Tang, F., & Tao, H. (2007, February). Fast multi-scale template matching using binary
       features. In 2007 IEEE Workshop on Applications of Computer Vision (WACV'07)
       (pp. 36-36). IEEE.
   7. Yabuki, N., Nishimura, N., & Fukuda, T. (2018, June). Automatic object detection from
       digital images by deep learning with transfer learning. In Workshop of the European
       Group for Intelligent Computing in Engineering (pp. 3-15). Springer, Cham.
   8. Chatelain, C., Heutte, L., & Paquet, T. (2006, February). Segmentation-driven
       recognition applied to numerical field extraction from handwritten incoming mail
       documents. In International Workshop on Document Analysis Systems (pp. 564-575).
       Springer, Berlin, Heidelberg.
   9. Rusinol, M., Benkhelfallah, T., & Poulain d’Andecy, V. (2013, August). Field
       extraction from administrative documents by incremental structural templates. In 2013
       12th International Conference on Document Analysis and Recognition (pp. 1100-
       1104). IEEE.
   10. https://cloud.google.com/vision/automl/docs
   11. Guizar-Sicairos, M., Thurman, S. T., & Fienup, J. R. (2008). Efficient subpixel image
       registration algorithms. Optics Letters, 33(2), 156-158.
   12. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints.
       International Journal of Computer Vision, 60(2), 91-110.
   13. Dong, C., Loy, C. C., & Tang, X. (2016, October). Accelerating the super-resolution
       convolutional neural network. In European Conference on Computer Vision (pp. 391-
       407). Springer, Cham.
   14. https://cloud.google.com/vision/docs/ocr
   15. Yu, W., et al. (2020). PICK: Processing key information extraction from documents
       using improved graph learning-convolutional networks. arXiv preprint
       arXiv:2004.07464.
