Web application for Data Import from XLSX into a Relational Database - MASARYK UNIVERSITY

MASARYK UNIVERSITY
    FACULTY OF INFORMATICS

Web application for Data
Import from XLSX into a
  Relational Database

       BACHELOR'S THESIS

       Samuel Toman

        Brno, Spring 2021
This is where a copy of the official signed thesis assignment and a copy of the
Statement of an Author is located in the printed version of the document.
Declaration

Hereby I declare that this paper is my original authorial work, which
I have worked out on my own. All sources, references, and literature
used or excerpted during elaboration of this work are properly cited
and listed in complete reference to the due source.

                                                      Samuel Toman

Advisor: Mgr. Luděk Bártek, Ph.D.

Acknowledgements

I would like to express gratitude towards my advisor Mgr. Luděk
Bártek, Ph.D., for always being available to patiently answer all my
questions. Likewise, I would like to thank my consultants JUDr. Ing.
František Kasl, Ph.D. and JUDr. Pavel Loutocký, Ph.D., BA.

Abstract

Spreadsheets are often used in office environments because they are
both user-friendly and practical. The majority of spreadsheet users are
non-professional programmers, and as such, keeping spreadsheets
user-friendly remains a high priority. Their intuitiveness comes at a
price, however. Due to their design, they are not well suited for
storing and querying large, structured data. They are nevertheless
often relegated to precisely that role. The conversion process from a
spreadsheet to a relational database can often be problematic and
requires some level of technical knowledge. The main objective of this
thesis is to provide a semi-automatic means of importing spreadsheets
into a relational database, easing the process of conversion while
still providing enough modularity to design a suitable database schema.
The thesis examines existing solutions and addresses their shortcomings
in a resulting web application. As part of the thesis, the application
was incorporated into an existing system called "CyQualf."

Keywords
database systems, MySQL, PHP, web application, Docker

Contents

Introduction

1   Data representation in XLSX documents and SQL databases
    1.1 Office Open XML Workbook (XLSX)
         1.1.1 Data representation
    1.2 Relational database
         1.2.1 Relational model

2   Project requirements
    2.1 Functional requirements
         2.1.1 Mapping schema
         2.1.2 HTTP API
    2.2 Non-functional requirements

3   Exploration of existing XLSX to SQL conversion tools
    3.1 Web converters
         3.1.1 SQLizer
         3.1.2 Other web converters
    3.2 Desktop application converters
    3.3 Conclusion
         3.3.1 Missing functionality

4   Technology stack and frameworks
    4.1 PHP language
    4.2 PHP spreadsheet parser
         4.2.1 PhpSpreadsheet
    4.3 JavaScript
         4.3.1 React
    4.4 Docker

5   Implementation and project structure
    5.1 Project structure
    5.2 Server-side
         5.2.1 Server-side file structure
         5.2.2 Parsing the mapping schema
         5.2.3 Mapping relationships
    5.3 Client-side
         5.3.1 Client-side file structure
         5.3.2 Front-end design components

6   Deployment
    6.1 Docker project structure
         6.1.1 Mariadb service
         6.1.2 Adminer service
         6.1.3 Php service
         6.1.4 Server service
         6.1.5 React-frontend service
    6.2 Summary of configuration files

7   Conclusion

Bibliography

A   Usage example
    A.1 Running the application
         A.1.1 Service configurations
    A.2 Using the GUI
    A.3 Using the API

B   Graphical user interface design

List of Tables
1.1   One-to-many relationship represented in XLSX.
5.1   An example worksheet Employee
List of Figures

4.1    A comparison of download counts (from NPM package
       manager) of the three most popular JavaScript front-end
       frameworks/libraries. Downloads measured from April 2019
       to April 2021.
5.1    A class diagram of the mapping schema data structure.
5.2    Component decomposition of the webpage GUI.
A.1    A single table of the mapping schema.
A.2    A mapping schema containing two tables.
B.1    The webpage GUI on a desktop-sized screen width.
B.2    The webpage GUI on a smartphone-sized screen width.
Introduction

Spreadsheet programs are often considered to be a significant factor
in the introduction and establishment of personal computers (PCs),
since spreadsheets were one of the main use cases for the early
PCs [1, p. C-177]. The first spreadsheet application for PCs, VisiCalc,
originally released for the Apple II in 1979, was a huge commercial
success. It was often referred to as the Apple II's first
"killer app" [2], meaning a program so essential that one would buy a
computer just to be able to use it.
    As their continued success shows, spreadsheets provide essential
services, often considered irreplaceable by their users. However, as is
the case with any software, they are not the tool for everything.
Spreadsheets work well enough when manipulating or analyzing manageably
small data, but they begin to struggle once the data grows large. Among
the problems exacerbated by a growing dataset are poor performance,
data redundancy, and error proliferation. Their structure does not
allow them to link and cross-reference data between tables easily,
enforce data integrity rules, or retrieve data using complex queries.
All of these are desirable traits for a system maintaining a sizeable
or critical dataset. In short, a spreadsheet is not a database; it is
not designed for long-term storage of large or essential data. Thus,
the problem of conversion to a proper database emerges.
    The existing web applications for importing spreadsheets into
relational databases are not functionally sufficient to design a
relational schema and subsequently map the data into it. The only
available option is a simple import of the entire worksheet into the
database as-is, without the option of establishing relations. To fully
leverage the advantages of the relational model, the imported tables
would have to be further processed into a new schema, which can be a
difficult procedure. This thesis aims to develop a web application
capable of importing spreadsheet data into a database according to a
user-defined schema, using a pleasant graphical interface.

The first chapter of the thesis contains a quick overview and com-
parison of data representation in spreadsheets and relational databases.
The second chapter describes the project requirements, detailing what
the application should be capable of and how it should behave. The
third chapter explores existing solutions, compares both web and
desktop variants, and draws a conclusion based on this analysis. The
selected technologies and the reasoning for their selection are outlined
in the following, fourth chapter. The fifth chapter details the project
structure and selected parts of the implementation. The following sixth
chapter explains the deployment of the application using Docker, de-
tailing individual Docker services comprising the project. The seventh
chapter contains a conclusion while summarizing the thesis. Addi-
tionally, two appendices concerning the usage of the application and
its graphical design are appended at the end of the thesis.

1 Data representation in XLSX documents and
SQL databases

This chapter offers an overview of the XLSX format in comparison
to relational databases. It describes the differences in data represen-
tation, structure, and functionality between the two paradigms. A
short insight into the file structure of XLSX is also given to deepen the
understanding of the format.

1.1     Office Open XML Workbook (XLSX)

XLSX is a spreadsheet format designed by Microsoft, introduced
together with Microsoft Excel 2007 and standardized by Ecma
International, ISO¹ and IEC². The format was designed to comply with
the Office Open XML specification [3] and served as a successor to the
previous proprietary Excel Binary File Format (XLS) used by earlier
versions of Microsoft Excel. Since its inception in December 2006, XLSX
has become widely supported by most modern spreadsheet programs
because it is a standardized format.
    In contrast to the previous XLS, which is a binary format, XLSX is a
ZIP³-compressed archive containing several XML⁴ files. Compared to
its predecessor, XLSX offers a significant file size reduction [4, p. 324].
As a ZIP archive, the file can be unpacked, revealing the underlying
structure of the format:

      • [Content_Types].xml - Contains references to all XML files
        included in the package.

      • _rels/ - A folder consisting of a single XML file storing
        package-level relationships.

      • docProps/ - A folder containing XML files with overall
        document properties, such as author, last modification date,
        and metadata about the file's content.

      • xl/ - This is the main folder, branching into further
        subfolders and XML files. As a whole, it contains the details
        about the workbook contents and the data itself.

1. International Organization for Standardization
2. International Electrotechnical Commission
3. ZIP is an archive file format, supporting lossless compression
4. Extensible Markup Language, a format designed to be both
human-readable and machine-readable

         Table 1.1: One-to-many relationship represented in XLSX.

 Department Tag    Department     Employee Name     Employee Wage
 ENG               Engineering    Julian Johnson    35000
 ENG               Engineering    Jane Jones        39000
 ACC               Accounting     Martin Moore      28000
 ACC               Accounting     Larry Lewis       46000
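The package layout listed in Section 1.1 can be inspected with any ZIP
library. The following is a minimal sketch in Python (used here purely
for illustration; the thesis itself targets PHP). The helper names
top_level_parts and build_demo_package are hypothetical, and the demo
archive only mimics the directory layout with placeholder files; it is
not a valid workbook.

```python
import io
import zipfile


def top_level_parts(package) -> list[str]:
    """Return the sorted top-level entries of a ZIP-based package."""
    with zipfile.ZipFile(package) as archive:
        tops = set()
        for name in archive.namelist():
            # "xl/workbook.xml" -> "xl/", "[Content_Types].xml" stays as is
            head, sep, _ = name.partition("/")
            tops.add(head + sep)
        return sorted(tops)


def build_demo_package() -> io.BytesIO:
    """Build a stand-in archive that only mimics the XLSX layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for part in (
            "[Content_Types].xml",
            "_rels/.rels",
            "docProps/core.xml",
            "docProps/app.xml",
            "xl/workbook.xml",
            "xl/worksheets/sheet1.xml",
        ):
            archive.writestr(part, "<placeholder/>")
    buffer.seek(0)
    return buffer


print(top_level_parts(build_demo_package()))
```

Passing the path of a real .xlsx file to top_level_parts instead of the
demo buffer should list the same four top-level entries.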

1.1.1     Data representation
The data is represented in tabular form, in rows and columns of cells.
The entire spreadsheet can contain multiple worksheets, where each
worksheet contains its own cells of data. The individual cells can
contain various data types and styles, and can be formatted in different
ways.
    Representing relationships in spreadsheets can be problematic.
Table 1.1 shows how a one-to-many relationship can be represented
between employees and their departments, where a single department
can have multiple employees. Maintaining data this way is error-prone,
primarily because spreadsheets do not enforce data integrity. For
example, nothing prevents deleting the name of the Engineering
department from a single row, which results in inconsistent data.

1.2       Relational database
A relational database is a type of database that stores and provides
access to data items related to one another, thus conforming to the
so-called relational model. The data in this type of database can be
linked and cross-referenced, creating relationships. This facilitates
efficient storage and searching.
    The database is managed by software called a Relational Database
Management System (RDBMS). There are many such systems available, some
proprietary and some open-source. A characteristic feature provided by
the vast majority of RDBMSs is SQL⁵ or some variation of it. SQL serves
as a means of interacting with the database using an English-like
syntax. It provides broad functionality for querying and maintaining
the database.

1.2.1        Relational model
SQL databases do not implement the relational model perfectly; instead,
they approximate the theoretical model with minor deviations. The
general principles of the model still apply, however. For clarity, the
more common SQL terms are used here instead of the formal
relational-model terms; the two sets of terms are treated as
interchangeable. The model organizes data into tables, each consisting
of rows and columns, where a unique key identifies each row. The
columns are constrained by a domain (or data type).
    The model permits a table row to hold a foreign key, referencing
the primary key of another table row, thus establishing proper
relationships between tables. This allows us to represent the earlier
example in Table 1.1 as two separate tables, with an Employee table
holding references to a Department table, forming a one-to-many
relationship. The RDBMS in use ensures the integrity of the data, so
each Employee row can only reference an existing Department.
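The split of Table 1.1 into two tables can be sketched as follows. This
is an illustrative example only: it uses Python with the built-in
SQLite engine so that it is self-contained, whereas the thesis targets
MySQL/MariaDB, and the column names (tag, department_tag) and the
helper add_employee are assumptions made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Department (
        tag  TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE Employee (
        id             INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        wage           INTEGER NOT NULL,
        department_tag TEXT NOT NULL REFERENCES Department(tag)
    );
    """
)
conn.executemany(
    "INSERT INTO Department (tag, name) VALUES (?, ?)",
    [("ENG", "Engineering"), ("ACC", "Accounting")],
)
conn.executemany(
    "INSERT INTO Employee (name, wage, department_tag) VALUES (?, ?, ?)",
    [
        ("Julian Johnson", 35000, "ENG"),
        ("Jane Jones", 39000, "ENG"),
        ("Martin Moore", 28000, "ACC"),
        ("Larry Lewis", 46000, "ACC"),
    ],
)
conn.commit()


def add_employee(name: str, wage: int, department_tag: str) -> bool:
    """Insert an employee; return False if the department does not exist."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Employee (name, wage, department_tag) "
                "VALUES (?, ?, ?)",
                (name, wage, department_tag),
            )
    except sqlite3.IntegrityError:
        return False
    return True


print(add_employee("Nora Nash", 31000, "HR"))   # unknown department
print(add_employee("Nora Nash", 31000, "ENG"))  # existing department
```

Each department name is now stored once, and the database itself
rejects an Employee row that points at a department that is not there.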

5.   Structured Query Language

2 Project requirements

The base requirement was to create a web application developed in
PHP, allowing a semi-automatic data conversion from a spreadsheet
into a MySQL/MariaDB database. This includes a back-end accessible
by an API, handling the data conversion and import into the database,
and a front-end allowing the user to intuitively design a database
schema.

2.1     Functional requirements
Functional requirements define the functions that the system must im-
plement. They generally describe system behavior under specific con-
ditions. Using plain language, functional requirements define "what"
the system should be able to do.
    The back-end portion of the application is responsible for the main
bulk of functionality regarding the conversion of data and its import
into a database. The connection to the target database should be con-
figurable in the back-end by editing database connection details and
credentials. Provided a valid mapping schema and XLSX file, the back-
end should be able to convert and import the spreadsheet data into
the configured database, according to the mapping schema.
    The front-end portion of the application is responsible for easing
the conversion process by providing a visual representation that acts
as a bridge between the user and the back-end functionality. Its
primary purpose is to let the user create the mapping schema using a
graphical interface. It should prompt the user to select an XLSX file,
after which it should allow the user to design a mapping schema for the
selected spreadsheet. The user should be able to use all properties of
the mapping schema through the web page, which includes the following:

      • Creating and deleting tables from the schema.

      • Changing the names of schema tables.

      • Adding and deleting columns from particular tables.

        • Changing the names and data types of columns.

        • Adding and removing references to other tables.

Once the user has selected an XLSX file and designed a valid schema,
the user can import the data into a database using the web page
interface.

2.1.1     Mapping schema

The mapping schema is a JSON¹ file describing how the spreadsheet data
should be mapped into a database. The schema describes tables to be
created in the database if they do not yet exist. The database tables
will then be populated with data from the spreadsheet as defined by
the mapping schema.
    The schema must allow for the following:

        • Defining tables, which includes specifying columns and
          references to other tables.

        • Changing table and column names.

        • Changing column data types.

        • Creating one-to-many and many-to-many relationships between
          tables.

        • Defining multiple tables from a single worksheet by mapping
          different worksheet columns to different tables.

        • Specifying the schema for the entire spreadsheet instead of
          just one worksheet.
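To make these requirements concrete, the sketch below models one
possible shape of such a schema and serializes it to JSON. It is
written in Python for illustration, and every class and field name in
it (Column, Reference, Table, source_column, target_table, and so on)
is hypothetical; the format actually used by the application may
differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Column:
    name: str           # column name in the database table
    data_type: str      # SQL data type of the column
    source_column: str  # worksheet column the values are read from


@dataclass
class Reference:
    column: str         # referencing (foreign key) column in this table
    target_table: str   # name of the referenced table


@dataclass
class Table:
    name: str
    worksheet: str
    columns: list[Column] = field(default_factory=list)
    references: list[Reference] = field(default_factory=list)


def to_json(tables: list[Table]) -> str:
    """Serialize the whole mapping schema as a JSON document."""
    return json.dumps({"tables": [asdict(t) for t in tables]}, indent=2)


# Two tables defined from a single worksheet (the data of Table 1.1),
# linked by a one-to-many reference from Employee to Department.
department = Table(
    name="Department",
    worksheet="Sheet1",
    columns=[
        Column("tag", "VARCHAR(8)", "A"),
        Column("name", "VARCHAR(64)", "B"),
    ],
)
employee = Table(
    name="Employee",
    worksheet="Sheet1",
    columns=[
        Column("name", "VARCHAR(64)", "C"),
        Column("wage", "INT", "D"),
    ],
    references=[Reference(column="department_tag", target_table="Department")],
)

schema_json = to_json([department, employee])
print(schema_json)
```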

1. JavaScript Object Notation is a human-readable format used to store and transmit
data.

2.1.2   HTTP API
The application must provide an HTTP API to access the back-end
converter. Sending an HTTP POST² request to a specific URI³ imports
the spreadsheet into a specified database. The request must include
both the schema and the XLSX spreadsheet. If either the schema or the
spreadsheet file is missing from the POST request, no changes should
be made to the database, and response code 400 shall be returned.
Similarly, if the schema is malformed, or if the spreadsheet file
cannot be parsed or imported into the defined database, response code
400 shall be returned. If the import is successful, response code
200 shall be returned.
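The response-code rules above amount to a small decision procedure,
sketched here in Python for illustration. The function handle_import
and the run_import callback are hypothetical names; the sketch assumes
that the schema is a JSON object, that the importer signals parse or
import failures by raising ValueError, and that the importer leaves the
database unchanged when it fails.

```python
import json
from typing import Callable, Optional


def handle_import(
    schema_text: Optional[str],
    xlsx_bytes: Optional[bytes],
    run_import: Callable[[dict, bytes], None],
) -> int:
    """Return the HTTP status code for one import request."""
    # A missing schema or spreadsheet: reject before touching the database.
    if not schema_text or not xlsx_bytes:
        return 400
    # A malformed schema: reject as well.
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError:
        return 400
    if not isinstance(schema, dict):  # assumption: schema is a JSON object
        return 400
    # The spreadsheet cannot be parsed or imported.
    try:
        run_import(schema, xlsx_bytes)
    except ValueError:
        return 400
    return 200


def demo_importer(schema: dict, data: bytes) -> None:
    """Stand-in importer that accepts everything."""


print(handle_import('{"tables": []}', b"demo", demo_importer))
print(handle_import(None, b"demo", demo_importer))
```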

2.2     Non-functional requirements
Non-functional requirements specify "how" the system should behave
and describe the limits of its functionality. Even when not met, they
do not impact the system's basic functionality, though they usually
impact the user experience.
    The back-end must be implemented using the PHP programming
language. Because data from an XLSX file is converted and imported
into an SQL database, the possibility of SQL injection⁴ arises. The PHP
implementation should be able to prevent SQL injection attempts
originating from the spreadsheet data.
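One common way to meet this requirement is to pass cell values as bound
parameters and to validate any table or column name before it is placed
into a statement. The sketch below illustrates the idea in Python with
SQLite so that it is self-contained; it is not the thesis's PHP
implementation, and the helper names safe_identifier and insert_rows
as well as the identifier pattern are assumptions of the sketch.

```python
import re
import sqlite3

# Assumed whitelist pattern for table and column names.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Accept only plain identifiers; names cannot be bound as parameters."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def insert_rows(conn, table, columns, rows) -> None:
    """Insert spreadsheet rows, binding every cell value as a parameter."""
    table_sql = safe_identifier(table)
    columns_sql = ", ".join(safe_identifier(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table_sql} ({columns_sql}) VALUES ({placeholders})",
        rows,
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, wage INTEGER)")

# A cell value crafted to break out of a naively concatenated statement.
hostile = "x'); DROP TABLE Employee; --"
insert_rows(conn, "Employee", ["name", "wage"],
            [("Jane Jones", 39000), (hostile, 0)])

stored = [row[0] for row in
          conn.execute("SELECT name FROM Employee ORDER BY rowid")]
print(stored)
```

The hostile value is stored as ordinary text and the table survives,
because the value never becomes part of the SQL statement itself.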
    The code should be kept clean and readable by adhering to good
programming principles, such as consistent and descriptive naming
conventions, avoiding repetitiveness and unnecessary complexity. This
ensures easier extensibility and maintainability of the code in the
future. The code should also provide some level of documentation to
aid readability.

2. POST is a request method defined by HTTP, usually used to submit an entity to
the specified resource, often causing a change in state or side effects on the server
[5].
3. Uniform Resource Identifier; a unique sequence of characters specifying a
resource.
4. SQL injection is a security vulnerability in which malicious SQL statements are
inserted for execution.

    The primary purpose of a front-end when an API is available is to
provide a more accessible and intuitive option of using the application
while also retaining the same capabilities as the API, if possible. As
such, the graphical interface should be designed to fulfill this goal,
providing a capable yet straightforward method of use. To further
improve usability, the web page layout should be responsive, meaning
it should adjust itself depending on the available screen size and
provide a comfortable experience on all device sizes.

3 Exploration of existing XLSX to SQL conversion tools

Several solutions capable of importing data from the XLSX format into
an SQL database, or at least solutions that partially achieve this
goal, were encountered during exploration. This chapter evaluates and
compares a selection of them by inspecting their functionality and
capabilities. Dozens of such applications exist, the majority of them
web applications with similar functionality; therefore, this analysis
focuses mainly on the more prominent ones and mentions any unique
features.

3.1      Web converters
The format of a web application grants a level of comfort to the user.
There is no need to download and install a desktop application on
the user's computer, where it would take up memory and computing
resources. The only requirement for a web application is a working
internet connection. With a sufficient connection, it works
independently of the user's machine and its operating system.
Especially for an application that is likely to be single-use for a
significant number of users, a web-based solution seems to be optimal.

3.1.1    SQLizer
At the time of writing, SQLizer¹ is among the more prominent online
converters, based on the results of an anonymized search engine
query. SQLizer allows the user to upload an XLSX file, select a single
worksheet to import, and convert it into an SQL script², which can be
used to import the data into the user's database. The tool is free to
use on files with under 5000 rows of data and offers a paid version
with no limit [6].
    SQLizer is capable of converting data from XLSX, CSV, and JSON,
though we are only interested in XLSX. Several additional settings
related to the conversion from XLSX are available. Some of the more
relevant options are:
        • The option to designate the first row as a header (which will
          be ignored during import).
        • Selecting which worksheet to convert.
        • Converting either the entire worksheet or selecting an area
          within the worksheet to convert.
        • The possibility of naming the resulting SQL table.
The app supports three database types: MySQL, PostgreSQL, and
Microsoft SQL Server.

1. https://sqlizer.io/#/
2. A sequence of SQL commands.

   The UI³ can be characterized as straightforward and simple to
use. The available options are presented as either drop-down lists or
checkboxes. The UI is designed to be responsive and scales well based
on the current display size. It can be used comfortably on both mobile
and desktop. SQLizer also provides a public REST API⁴ with
functionality similar to that of the graphical UI.
   SQLizer lacks some functionality that we require. For example, it
cannot import multiple worksheets at once, though this can be worked
around by importing the worksheets one by one. A more significant
drawback is the tool's inability to create references between imported
tables or references to tables already existing in the database.

3.1.2     Other web converters
More online converters exist, though they offer mostly comparable
functionality with some differences. Not all online converters can be
analyzed within the scope of this thesis. The following ones provide
extra functionality that others do not.
    BeautifyTools⁵ contains a tool for converting XLSX to SQL that is
basically identical to SQLizer in base functionality, albeit missing
some additional capabilities, such as ignoring the first row of the
worksheet as a header. However, it has one unique feature not found in
SQLizer or other similar tools as of the time of writing: the
capability to delete or update the worksheet's data in the database
instead of just importing the data as other converters do.
    RebaseData⁶ contains a broad suite of converters, mostly between
different SQL versions. Among its array of available tools, RebaseData
offers a conversion tool from XLSX to SQL⁷. Functionality-wise, this
tool is identical to SQLizer and BeautifyTools without the extra
settings mentioned before. Besides the web application and a public
REST API, it offers its services as a PHP library, a Python library, a
Java tool, and a Linux command-line tool. Similar to SQLizer,
RebaseData offers a free and a paid version.

3. User Interface; in the case of a web application, it is usually the graphical
interface, also referred to as GUI.
4. Application programming interface that conforms to REST architectural style
constraints.
5. https://beautifytools.com/excel-to-sql-converter.php

3.2    Desktop application converters
A desktop application is an application that runs locally on a single
computer, used to perform a specific task. This type of application has
to be installed on the user's machine, consuming disk space and other
resources. One of the main advantages of desktop applications is their
lower reliance on a network connection. They can better leverage their
host system's resources, whereas a web application needs to make calls
to a server and wait for a response.
    SQL Server Import and Export Wizard⁸ is a tool developed by
Microsoft and is a part of their SQL Server Integration Services (SSIS).
Once the wizard is installed, it does not require an internet
connection to function. Compared to its online counterparts, the Wizard
has expanded functionality. First, instead of returning an SQL script,
it can connect to a provided database and automatically create and
populate tables in it. Another handy feature is the ability to filter
the spreadsheet data using an SQL query before inserting it into the
database. Additionally, the names of columns and tables can be changed,
along with column data types, in a simple and intuitive GUI⁹. The
entire conversion process is described on a Microsoft tutorial page¹⁰.
    Full Convert¹¹ is a powerful database converter. Among the many
conversions supported by the program is the conversion from XLSX into
SQL. Much like the aforementioned SQL Server Wizard, it allows for
modifying tables' names and the types and names of their columns. The
program offers a myriad of extra functionalities, though for our
purposes, the most important of them is referencing other tables, thus
allowing the program to create a custom database schema during
conversion and populate it with selected spreadsheet data.
    Full Convert is by far the most complete solution in terms of
functionality. Its main drawback, however, is its lack of
accessibility. It offers a 30-day free trial period, after which the
least expensive variant costs 699 USD per year for a single user.

6. https://www.rebasedata.com/
7. https://www.rebasedata.com/convert-xlsx-to-mysql-online
8. https://docs.microsoft.com/en-us/sql/relational-databases/import-export/import-data-from-excel-to-sql?view=sql-server-ver15#wiz

3.3     Conclusion
The available web converters generally favor usability and ease of use
at the cost of performance and functionality. They require an internet
connection and do not connect directly to the database to insert the
data. On the other hand, they do not require installation, are
generally free to use at least to some capacity, and are very
straightforward to use.
    Desktop converters generally offer a wider range of functionality,
though the solutions with sufficient capabilities are expensive and
complicated, and they do not offer their services online.

3.3.1   Missing functionality
The Full Convert desktop application offers a wide range of
functionality not supported by the various online converters. Among the
most significant missing features are the following:

    • Filtering what tables and columns to import from the
      spreadsheet.

    • Naming of the resulting tables that are to be created in the
      database.

    • Naming of columns and selecting column types for the resulting
      tables.

    • Establishing references between tables, thus allowing the user
      to create a desired relational schema from the spreadsheet data,
      including one-to-many and many-to-many relationships between
      both the tables that are being imported, and the tables
      already existing in the database.

9. Graphical User Interface.
10. https://docs.microsoft.com/en-us/sql/integration-services/import-export-data/get-started-with-this-simple-example-of-the-import-and-export-wizard?view=sql-server-ver15#heres-the-new-table-of-data-copied-to-sql-server
11. https://www.fullconvert.com/

As of the time of writing, none of the web converters offer the
features stated above. As such, using the available online tools, the
user is only able to import the spreadsheet table as is.
     If a worksheet holds data with one-to-many or many-to-many
relationships in a single table, the existing converters cannot break
this table into multiple interconnected tables and instead keep the
data in a single joined table. The resulting table will hold redundant
and duplicate data, and the database will be unable to leverage many
of the advantages of the relational model. Furthermore, if several
tables are defined in the spreadsheet, and one wishes to create
relationships between them, the online tools do not make this
possible. The references would have to be added manually in the
database later, which is a task requiring further technical expertise.
A tool handling this process automatically for a defined schema would
greatly reduce the difficulty of converting spreadsheets to a
relational database.
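As an illustration of what breaking a joined table into interconnected
tables involves, the following Python sketch splits the rows of
Table 1.1 into a parent and a child table. The function name
split_one_to_many, the generated id values, and the parent_id key are
assumptions made for the example and do not describe any of the
reviewed tools.

```python
def split_one_to_many(rows, parent_fields, child_fields):
    """Split joined rows into deduplicated parents and linked children."""
    parents = {}   # parent value tuple -> generated parent id
    children = []
    for row in rows:
        key = tuple(row[f] for f in parent_fields)
        # Reuse the id of an already seen parent, otherwise assign the next.
        parent_id = parents.setdefault(key, len(parents) + 1)
        child = {f: row[f] for f in child_fields}
        child["parent_id"] = parent_id
        children.append(child)
    parent_rows = [
        dict(zip(parent_fields, key), id=parent_id)
        for key, parent_id in parents.items()
    ]
    return parent_rows, children


joined = [
    {"tag": "ENG", "department": "Engineering",
     "name": "Julian Johnson", "wage": 35000},
    {"tag": "ENG", "department": "Engineering",
     "name": "Jane Jones", "wage": 39000},
    {"tag": "ACC", "department": "Accounting",
     "name": "Martin Moore", "wage": 28000},
    {"tag": "ACC", "department": "Accounting",
     "name": "Larry Lewis", "wage": 46000},
]

departments, employees = split_one_to_many(
    joined, ["tag", "department"], ["name", "wage"]
)
print(departments)
print(employees)
```

The four joined rows yield two department rows and four employee rows,
each employee carrying the id of its department instead of a repeated
copy of the department's name.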

4 Technology stack and frameworks

Selecting the appropriate technologies for specific purposes is an
often underestimated aspect of developing software. Choosing an
incorrect tool for the job, however, can cause significant issues.
Different technologies are designed with different goals in mind;
picking the one suitable for our goals depends on many factors. This
chapter gives the reasoning behind the selection of specific tools.

4.1     PHP language
PHP is a general-purpose scripting language especially suited for web
development [7]. Despite its steady decline over the past years, as of
April 2021, PHP is still the 9th most popular programming language,
according to TIOBE [8]. According to W3Techs, it is used by 79.2% of
all websites whose server-side programming language is known [9].
    For those reasons, PHP was selected to be used in the back-end of
the application to perform the conversion and import of data into an
SQL database.

4.2     PHP spreadsheet parser
First introduced with Microsoft Office in 2007, XLSX is a format
defined in the Office Open XML standard, which has since its
introduction become standardized and is used by many spreadsheet
applications on different platforms.
    Since the format is actually a ZIP-compressed archive that contains
a directory structure of XML text documents, it would be possible to
unpack the ZIP archive and parse the XML files using a PHP XML
parser extension called SimpleXML¹ to extract the relevant data. Using
an XML parser would be unwieldy and require unnecessary programming,
however. Upon further exploration, several PHP libraries capable of
parsing XLSX files specifically were found. The most prominent ones
are described below, each with the reasoning for selecting or
rejecting it.

1.    https://www.php.net/manual/en/book.simplexml.php


     Spreadsheet-reader² is a small pure-PHP spreadsheet reader
specializing in efficient data extraction, capable of handling large
files without running out of memory. Apart from XLSX files, it also
supports reading XLS, CSV, and ODS files. Unfortunately, as of the time
of writing, development has stopped and the library is no longer
maintained, so it is not a viable choice for our purposes.
     SimpleXLSX³ is a lightweight pure-PHP spreadsheet reader focused
specifically on reading XLSX files. It has the smallest size and the
fewest dependencies of all compared spreadsheet parser libraries.
It is currently maintained with regular commits to the master
branch, though it has a relatively small number of contributors and
users. The most significant drawbacks of SimpleXLSX in comparison
to the selected library are its relatively limited functionality and its
lack of comprehensive documentation. The library contains examples of
basic usage in its README⁴ file along with a few example PHP scripts
sufficient to comprehend the usage, though it is not as extensively
documented as the eventually selected library.

4.2.1 PhpSpreadsheet
PhpSpreadsheet⁵ is the library we elected to use in the back-end as an
XLSX parser. It is a pure-PHP library for reading and writing
spreadsheet files, a popular successor to an older and no longer
maintained library called PHPExcel⁶. PhpSpreadsheet is the most
heavyweight of the compared libraries in terms of both size and
dependencies, though in exchange, it provides extra functionality.
    PhpSpreadsheet is by far the most popular of the aforementioned
parsers, with the greatest number of contributors and regular
updates. Even though our application does not currently use much
of the extra functionality, it allows for future improvements without
the need to switch libraries. Its popularity and large number of
contributors increase the chances of it being supported long-term.

2. https://github.com/nuovo/spreadsheet-reader
3. https://github.com/shuchkin/simplexlsx
4. README is a text file usually contained within a git repository, specifying basic
information about the project.
5. https://github.com/PHPOffice/PhpSpreadsheet
6. https://github.com/PHPOffice/PHPExcel


    PhpSpreadsheet provides the most extensive documentation, and
despite its considerably larger codebase compared to its competitors,
it is the easiest to use. The documentation is hosted on a separate
domain⁷ and provides a wide range of tutorials and instructions for
various tasks. A separate site for API documentation is also provided⁸.

4.3     JavaScript
JavaScript is a general-purpose programming language conforming
to the ECMAScript⁹ specification. It is often described as the
programming language of the web due to its strong presence in web
development, both in front-end and back-end. Stack Overflow's 2020
Developer Survey, with a sample size of nearly 65,000 people,
concluded that JavaScript is used by 67.7% of developers and marked it
as the most popular language for the eighth year in a row [10].
Additionally, according to W3Techs, it is used as a client-side
programming language by 97.2% of all websites [11].
    JavaScript's popularity can partially be attributed to its wide variety
of tools and frameworks suited for an array of purposes; this includes
several established frameworks and libraries designed for front-end
development, such as React.js, Angular.js, Vue.js, and more.

4.3.1   React
React.js is an open-source front-end JavaScript library, which was
chosen to implement the web application front-end. Along with
Angular.js and Vue.js, it is one of the three most widely used
JavaScript front-end frameworks/libraries, according to Stack
Overflow's 2020 Developer Survey [10]. It is consistently the most
downloaded package of the three (according to the NPM¹⁰ package
manager) [12], as shown in Figure 4.1.
   React.js is maintained by Facebook and a community of open-source
contributors, and since its initial release in 2013, it has enjoyed
stable growth and is very likely to be continually supported in the
following years.

[Line chart; series: react, vue, @angular/core]

Figure 4.1: A comparison of download counts (from the NPM package
manager) of the three most popular JavaScript front-end
frameworks/libraries. Downloads measured from April 2019 to April 2021.

7. https://phpspreadsheet.readthedocs.io/en/latest/
8. https://phpoffice.github.io/PhpSpreadsheet/
9. A programming language standardized by Ecma International, meant to ensure
web page interoperability across different web browsers.
10. A JavaScript package manager.

4.4          Docker
Docker is a tool allowing the developer to create, deploy, and run
applications using so-called containers [13]. Docker containers are
based on an operating system paradigm called OS-level virtualization,
in which the kernel of the operating system allows the existence of
multiple isolated user spaces, called containers. Each container comes
bundled with its own software and configuration. It can only see its
own contents and is isolated from its host system and other containers,
providing a layer of security. Communication channels between the
containers themselves or with the host system can be established as
needed, however.
    The containers run in their own environments, defined by
the developer and independent of the host system. Therefore, the
project can be developed without considering what system the
application will ultimately be running on. This significantly eases
both the development process and the deployment of the system. The
developed web application runs in several discrete containers, each
providing a service, together creating the desired functionality. The
specific structure and deployment of the application using Docker will
be detailed in Chapter 6.

5 Implementation and project structure

This chapter details selected parts of the implementation and the
development process. The structure of the project and the functionality
of some of its components are also described.

5.1     Project structure
The root of the project has the following structure:

      • api/ - A directory containing the server-side portion of the web
        application, implemented using PHP.

      • frontend/ - A directory containing the client-side portion of
        the web application, implemented using the JavaScript library
        React.js.

      • example_files/ - A folder containing example files, including
        XLSX spreadsheets and corresponding JSON schemas. The
        file schema_definition.txt contains a description of a schema
        example.

      • docker-compose.yml - A YAML¹ configuration file used by
        Docker to configure the application's services.

      • php.Dockerfile - A text document used by Docker, containing
        a series of commands used to assemble an image. This
        particular Dockerfile contains instructions to build an image
        running a PHP FastCGI Process Manager², used by the PHP
        back-end.

      • react.Dockerfile - Similar to php.Dockerfile, except
        responsible for building an image running a React front-end
        server.

1. YAML stands for the recursive acronym YAML Ain't Markup Language; it is a
human-readable language commonly used for configuration files.
2. FastCGI is a variation of the earlier Common Gateway Interface (CGI), a protocol
for interfacing external applications to web servers. FastCGI Process Manager
(PHP-FPM) provides extra features, such as faster uploads, logging, and more [14].


        • nginx.Dockerfile - A Dockerfile used for building an image
          running an Nginx web server.

        • README.md - A Markdown file describing the project,
          with examples and explanations of usage.

The structure of the docker-compose.yml file, together with the
Dockerfiles, will be further expanded upon in the following chapter
focused on the deployment of the application.

5.2       Server-side
As the name suggests, server-side code runs on a server, which awaits
requests from clients. The server-side implementation then processes
the requests, and a response is sent back to the client.

5.2.1     Server-side file structure
The PHP back-end implementation is entirely contained within the
above-mentioned api/ folder. The folder itself contains the following:

        • config/ - A directory containing configuration files used mainly
          by the Docker services running the web server and the
          aforementioned PHP-FPM, along with a db-credentials.env file
          containing database credentials. The application uses these
          credentials to connect to the database that serves as the
          target of the spreadsheet imports. The user can change the
          credentials to connect to a different database. By default,
          they point to a MariaDB database running in a Docker
          container.

        • public/ - A directory holding a single file import_xlsx.php,
          which is a PHP script implementing the HTTP API. The script
          calls classes defined in the source folder.

        • source/ - This directory holds the main portion of the logic
          responsible for converting data from XLSX, mapping it onto a
          user-defined schema, and importing it into a database.


        • vendor/ - The application automatically generates this folder
          when run. It contains back-end dependencies. If additional PHP
          libraries are added to the project in future development, this
          folder needs to be deleted and a new one autogenerated for the
          changes to take effect.

        • composer.json - A JSON configuration file for a PHP package
          manager called Composer. The file defines PHP package
          dependencies and specifies their versions, ensuring future
          compatibility.

        • composer.lock - The Composer package manager automatically
          generates this file. It serves to lock the project dependencies to
          a known state.

5.2.2     Parsing the mapping schema
The mapping schema is appended to the HTTP POST request either
as a JSON file or a string. The application must parse the provided
schema and create an in-memory object representation, which can be
used to map XLSX data into a database.
    The class responsible for parsing the schema is called
SchemaParser. It is an abstract³ class providing two public static⁴
methods, one for parsing a JSON file and the other for parsing a JSON
string. The methods transform the schema into a data structure
described by the class diagram in Figure 5.1. The class diagram is
simplified for the sake of readability and only shows properties
relevant to the representation of the mapping schema. The following
classes are utilized:

        • SheetSchema - Represents a single worksheet of the spread-
          sheet document.

        • TableSchema - Describes a table, which will be created in the
          target database.

3. An abstract class cannot be instantiated.
4. A static method belongs to the class itself and can be invoked without the need
of instantiating the said class.


  SheetSchema                         TableSchema
    • title: string                     • title: string
    • firstRowIsHeader: boolean
    • calculateFomulas: boolean
    • formatData: boolean

  ColumnSchema                        ReferenceSchema
    • name: string                      - referencedTable: string
    • col: int
    • type: string                    ColumnLink
                                        - fromSheetCol: int
                                        - toTableCol: string

  (The associations between the classes carry the multiplicities
  0..* and 1.)

 Figure 5.1: A class diagram of the mapping schema data structure.

        • ColumnSchema - Defines a table column, including its name
          and data type. The property col specifies which column in the
          parent worksheet should be mapped onto this table column.

        • ReferenceSchema - This class represents a table reference. The
          property referencedTable holds the name of a table other than
          the parent table. The parent table will contain a reference to the
          referenced table.

        • ColumnLink - Describes the columns on which to create a link.
          The property fromSheetCol specifies a column in the parent sheet
          and the property toTableCol specifies the column name of
          the referenced table.
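
The parsing step described above can be sketched as follows. This is only an illustrative sketch in JavaScript, whereas the thesis implements SchemaParser in PHP. The nesting of the schema, the method names, and the JSON keys title, tables, columns, name, col, and type are assumptions made for this example; only the keys references, table, mapOnColumns, fromThisSheetCol, and toOtherTableCol appear in the schema snippet shown in Section 5.2.3.

```javascript
// Illustrative sketch only; the thesis implements SchemaParser in PHP.
// Assumed (hypothetical) JSON keys: title, tables, columns, name, col, type.
// Keys taken from the snippet in Section 5.2.3: references, table,
// mapOnColumns, fromThisSheetCol, toOtherTableCol.

class ColumnLink {
  constructor(fromSheetCol, toTableCol) {
    this.fromSheetCol = fromSheetCol;
    this.toTableCol = toTableCol;
  }
}

class ReferenceSchema {
  constructor(referencedTable, links) {
    this.referencedTable = referencedTable;
    this.links = links;
  }
}

class ColumnSchema {
  constructor(name, col, type) {
    this.name = name;
    this.col = col;
    this.type = type;
  }
}

class TableSchema {
  constructor(title, columns, references) {
    this.title = title;
    this.columns = columns;
    this.references = references;
  }
}

class SheetSchema {
  constructor(title, tables) {
    this.title = title;
    this.tables = tables;
  }
}

class SchemaParser {
  // Parses a JSON string holding an array of sheet definitions.
  static parseString(json) {
    const raw = JSON.parse(json);
    if (!Array.isArray(raw)) {
      throw new TypeError("The schema must be a JSON array of sheets.");
    }
    return raw.map((sheet) => SchemaParser.parseSheet(sheet));
  }

  static parseSheet(sheet) {
    const tables = (sheet.tables || []).map((t) => SchemaParser.parseTable(t));
    return new SheetSchema(sheet.title, tables);
  }

  static parseTable(table) {
    const columns = (table.columns || []).map(
      (c) => new ColumnSchema(c.name, c.col, c.type)
    );
    const references = (table.references || []).map(
      (r) =>
        new ReferenceSchema(
          r.table,
          (r.mapOnColumns || []).map(
            (m) => new ColumnLink(m.fromThisSheetCol, m.toOtherTableCol)
          )
        )
    );
    return new TableSchema(table.title, columns, references);
  }
}
```

Only the string variant is shown; a file variant would read the file contents and delegate to the same method. Keeping the parser as a set of static methods mirrors the description above, since no parser state needs to be kept between calls.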

5.2.3     Mapping relationships
The application allows for establishing one-to-one, one-to-many, and
many-to-many relationships between tables. A single worksheet can
be broken up into multiple in-database tables, and relationships can
be established among these tables as well as with other tables.

                Table 5.1: An example worksheet Employee

 Employee Name            Employee Wage        Department
 Julian Johnson           35000                ENG
 Jane Jones               39000                ENG
 Martin Moore             28000                ACC
 Larry Lewis              46000                ACC

Consider the following snippet of a mapping schema:

      "references": [
          {
              "table": "Department",
              "mapOnColumns": [
                  {
                      "fromThisSheetCol": 3,
                      "toOtherTableCol": "DepartmentTag"
                  }
              ]
          }
      ]

Each table must be constructed from a worksheet. Let us assume the
snippet above is set within the context of a worksheet called Employee,
specified in Table 5.1. The snippet references a table Department and
specifies that the tables should be joined on the third column of the
worksheet Employee (which contains department tags) and the Department
column DepartmentTag. In each row, a valid department tag will
be replaced with a foreign key to the corresponding department in
the resulting table. If a corresponding department tag does not
exist, a null value will take the place of the foreign key. This
creates a one-to-many relationship between the tables.
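
The replacement of department tags with foreign keys can be illustrated by a minimal sketch. It is written in JavaScript for illustration and is not the thesis's PHP implementation; the function name resolveReferences, the row layout, the id primary-key field, and the contents of the Department table are assumptions. Only a single column link is handled, and sheet columns are counted from 1, as in the snippet above, where 3 denotes the third column.

```javascript
// Hypothetical helper: replaces the linked sheet column with a foreign key
// to the referenced table, or with null when no matching row exists.
function resolveReferences(sheetRows, reference, referencedRows) {
  // Only the first column link is handled, for brevity.
  const link = reference.mapOnColumns[0];
  const index = link.fromThisSheetCol - 1; // sheet columns are 1-based

  // Index the referenced table by the value of the linked column.
  const idByKey = new Map(
    referencedRows.map((row) => [row[link.toOtherTableCol], row.id])
  );

  return sheetRows.map((row) => {
    const cells = [...row]; // copy, so the input rows stay untouched
    const key = cells[index];
    cells[index] = idByKey.has(key) ? idByKey.get(key) : null;
    return cells;
  });
}

// Rows of the worksheet Employee from Table 5.1.
const employees = [
  ["Julian Johnson", 35000, "ENG"],
  ["Jane Jones", 39000, "ENG"],
  ["Martin Moore", 28000, "ACC"],
  ["Larry Lewis", 46000, "ACC"],
];

// Assumed contents of the Department table (ids are hypothetical).
const departments = [
  { id: 1, DepartmentTag: "ENG" },
  { id: 2, DepartmentTag: "ACC" },
];

const reference = {
  table: "Department",
  mapOnColumns: [{ fromThisSheetCol: 3, toOtherTableCol: "DepartmentTag" }],
};

const resolved = resolveReferences(employees, reference, departments);
// resolved[0] is ["Julian Johnson", 35000, 1]
// resolved[2] is ["Martin Moore", 28000, 2]
```

Several employees end up pointing to the same department row, which is exactly the one-to-many relationship described above.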


5.3       Client-side
Client-side code refers to operations running on the user's machine. In
the case of this application, it primarily refers to a front-end developed
in React.js. It is responsible for what the user sees and interacts with in
the browser, allowing the user to create a mapping schema and send
a request to the back-end.

5.3.1     Client-side file structure

The React front-end is contained within the frontend/ directory. The
directory has the following structure:

        • public/ - This folder is automatically generated when creating a
          React project using the create-react-app command. Among other
          files, it contains index.html, which is the site index template.

        • source/ - The directory holding the front-end implementation.

            - components/ - A folder holding React components
              responsible for the interactive graphical interface. The
              interface serves to create a mapping schema.
            - entities/ - This directory contains JavaScript classes used
              for representing the mapping schema created in the
              graphical interface. When sending a request to the
              back-end, this data structure is converted into JSON and
              attached to the request.

        • node_modules/ - A folder that is automatically generated when
          the application is run, containing packages installed by the
          NPM package manager.

        • package.json - A JSON configuration file used by the NPM
          package manager, listing project dependencies and their
          versions. The node_modules/ folder is generated based on this
          file.
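
The conversion of the entity classes into JSON, mentioned for the entities/ directory above, might look roughly like the following sketch. The class names ColumnEntity and TableEntity, the function serializeSchema, and the JSON key names are hypothetical and are not taken from the thesis's source code.

```javascript
// Hypothetical front-end entities; names and JSON keys are assumptions.
class ColumnEntity {
  constructor(name, col, type) {
    this.name = name;
    this.col = col;
    this.type = type;
  }

  // JSON.stringify calls toJSON automatically when it meets this object.
  toJSON() {
    return { name: this.name, col: this.col, type: this.type };
  }
}

class TableEntity {
  constructor(title) {
    this.title = title;
    this.columns = [];
  }

  addColumn(column) {
    this.columns.push(column);
    return this; // allow chaining
  }

  toJSON() {
    return { title: this.title, columns: this.columns };
  }
}

// Converts the in-memory schema into the JSON string sent with the request.
function serializeSchema(tables) {
  return JSON.stringify(tables);
}

const department = new TableEntity("Department").addColumn(
  new ColumnEntity("DepartmentTag", 1, "VARCHAR(8)")
);
const payload = serializeSchema([department]);
// payload is
// [{"title":"Department","columns":[{"name":"DepartmentTag","col":1,"type":"VARCHAR(8)"}]}]
```

Defining toJSON on each entity keeps the serialized shape in one place, so UI-only state held on the entities would not leak into the request.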


5.3.2   Front-end design components
React.js makes use of so-called React components, which allow the
UI to be split into small, independent, and reusable pieces. A React
component can accept arbitrary inputs and, based on them, returns
elements describing what should appear on the page. This allows us
to create a hierarchy of components interacting with each other. A
component can pass data to its children using the constructor argument
props, and a child component can pass data back to its parent using
callback functions.
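
The data flow between a parent and a child component can be modeled in plain JavaScript, without React itself, so that the sketch runs on its own. The component names TableNameInput and SchemaTable are hypothetical; in a real React application, the returned objects would be React elements and the state would be managed by React.

```javascript
// Plain-JavaScript model of the React data flow; this is not React itself.

// Child "component": receives props and returns a description of its output.
function TableNameInput(props) {
  return {
    type: "input",
    value: props.title,
    // The child reports changes upward through the callback found in props.
    onChange: (newTitle) => props.onTitleChange(newTitle),
  };
}

// Parent "component": owns the state and passes data and a callback down.
function SchemaTable(state) {
  return TableNameInput({
    title: state.title,
    onTitleChange: (newTitle) => {
      state.title = newTitle;
    },
  });
}

const state = { title: "Employee" };
const element = SchemaTable(state);

// Simulate the user renaming the table in the child component.
element.onChange("Staff");
// state.title is now "Staff"
```

Data travels down as props and back up only through the callback, so the parent remains the single owner of the state.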
    A visualization in Figure 5.2 shows the resulting component
hierarchy. Also worth noting is the ability of SchemaComponent to
lay out the individual SchemaTableComponent instances depending
on their number and the available screen space, providing a
responsive design capable of adapting to a range of screen sizes. Figure
B.1 shows an example of the UI adapting to a desktop-sized screen,
and Figure B.2 shows the UI adjusting to a smartphone-sized screen.

6 Deployment

Because the web application was developed using Docker, the process
of deployment is very straightforward. All the dependencies needed
by the application are included in the Docker project and automatically
installed during the deployment process.
    The application will run regardless of the host's operating system
because each of the application's individual services runs in its own
Linux container. The only requirement is having Docker or a Docker
alternative installed on the host machine.

6.1     Docker project structure
The Docker project is built according to the specification in a
docker-compose.yml file. The Compose Specification defines the format of
the file [15]; this project uses Compose file format version 3. The
docker-compose.yml file specifies five services to be created and run,
each inside its own container.
    According to Docker, volumes are the preferred mechanism for
persisting data generated and used by Docker containers [16]. In
contrast to bind mounts, volumes are not dependent on the directory
structure and the OS of the host machine, allowing us to retain easy
portability [17]. Among other functionalities, they provide an easier
way of migrating and backing up than bind mounts, which is why
they are used to persist a database described in Subsection 6.1.1.

6.1.1   Mariadb service

The mariadb service provides a MariaDB database running inside a
container, which the application can use. The database is created with
credentials specified in mariadb-credentials.env. The credentials are
recommended to be changed if the database is to be used.
   The service is built from the mariadb:latest image. MariaDB is an
open-source relational database forked from MySQL. A volume named
mysqldata is assigned to the service, providing data persistence to the
database when the application is restarted. The database listens
on port 3306, but the port is not exposed to the host. Thus, it can only
be accessed by other containers in the network.
    This service is not necessary for the functioning of the web appli-
cation and can be omitted if the user elects to use a different database.
However, if the service is omitted, connection credentials to another
database must be provided in file api/config/db-credentials.env.

6.1.2   Adminer service
Adminer is a tool for managing database content, authored by the Czech
programmer Jakub Vrana [18], allowing the user to view and edit the
database using a user-friendly interface.
     The service is built from the adminer:latest image and is by default
exposed on port 8080, though the port can be configured in the .env
file by changing the variable ADMINER_PORT.
     The adminer service is not necessary for the functioning of the
application and can be omitted if not needed. It serves as a means to
easily view the database after the XLSX import to verify that the data
has been imported according to expectations.

6.1.3   Php service
The php service runs the PHP FastCGI Process Manager (PHP-FPM),
which is an improved PHP FastCGI implementation. FastCGI is
a binary protocol responsible for interfacing external applications with
a web server. The earlier Common Gateway Interface (CGI) protocol
created a new process for each request, which was torn down when
finished. The high overhead of this approach is addressed by FastCGI,
which persists processes to handle a series of requests [19].
    The build of this service is described in a separate php.Dockerfile.
The base image used, php:fpm, provides PHP-FPM preinstalled,
together with all necessary dependencies. Additional packages are
installed in the Dockerfile, along with the PHP package manager
Composer, which is used to install the dependencies specified in the
composer.json file mentioned in Chapter 5 (Subsection 5.2.1). The
entirety of the api/ directory containing the PHP back-end
implementation is copied into the container during the build, and
finally, the php-fpm command is executed, making the service available
for use. Internally, it is accessible on port 9000, though the port is
not exposed to the host machine.
    Since this service is working as a PHP interpreter, it is necessary
for the application and cannot be omitted.

6.1.4   Server service
The server service is running a popular open-source web server named
Nginx, often used in conjunction with PHP-FPM. According to an
April 2021 Web Server Survey by Netcraft, it recently became the most
used web server when accounting for the surveyed websites [20].
    Defined in a separate nginx.Dockerfile, the service is built from
an nginx:alpine-perl image. Alpine-based image versions are smaller
than their usual counterparts. The reason for using a Perl¹-enabled
image is that Perl is used to read environment variables in the
Nginx configuration files.
    The server will internally run on port 5000, and by default, the port
will be exposed to the host machine so that API requests can be made
to the server, which will be automatically forwarded to the PHP-FPM
running on the php service. However, the port to the host machine
can be remapped by changing the property API_PORT in file .env.
The server service is dependent on the php service running and is
necessary for the base functioning of the application.

6.1.5   React-frontend service
The service runs a server responsible for providing the application
front-end. It is defined in a separate react.Dockerfile, built from
a node:alpine base image providing much of the necessary functionality
for running the React front-end. Using the Alpine variant results
in a smaller image size. During the build, the entirety of the frontend/
directory is copied into the container. The JavaScript package manager
NPM, which comes preinstalled on the node image, is used
to install the necessary dependencies defined in the package.json file
mentioned in Chapter 5 (Subsection 5.3.1).
     The server listens on port 3000 by default, but the port can be
changed in the file .env by editing the REACT_PORT variable. The service
depends on the php and server services to run. It is not necessary for
the functioning of the application if the user only plans to use the
API without a GUI.

1.   A general-purpose programming language.

6.2     Summary of configuration files
The following is a summary of the aforementioned configuration files
along with short descriptions:

      • .env - An environment file containing the mapping of ports
        for individual services. The user can change them to suit their
        needs.

      • mariadb-credentials.env - Specifies credentials for a MariaDB
        database specified in the mariadb service. Relevant only if the
        mariadb service is used.

      • api/config/db-credentials.env - Defines a connection to the
        target database. Can be different from the connection specified
        in mariadb-credentials.env.

7 Conclusion

The aim of the thesis was to create a web application implemented
using PHP, capable of a semi-automatic import of data from an XLSX
format into a MySQL/MariaDB database. It provides an intuitive
way of mapping a spreadsheet into a configured database using a
graphical interface or an API. The goal was achieved by creating a
Docker application, resulting in easy deployment regardless of the
host system's OS. The application's functionality is broken up into
separate services, which can be configured to suit the user's needs or
even be entirely omitted if not needed.
    A mapping schema was developed for the purpose of mapping
the spreadsheet into a relational schema. The mapping schema takes
the form of a JSON file that can be automatically generated using
the application's front-end or created manually and used with the
provided API.
    As part of the thesis, existing solutions were explored to gain a
perspective on the available tools and find possible shortcomings and
flaws, so that they can be avoided. The technology stack was detailed,
and the chosen technologies were briefly described in the following
chapter to explain why the specific tools were selected. A description
of the project structure and implementation follows next. Finally, the
deployment of the project is explained, including the specifications
and objectives of the individual services of the Docker application.
    None of the existing web applications can create a mapping schema
for the spreadsheet and directly import it into a database. The only
available solution is a complex and expensive desktop application,
inaccessible to a regular non-corporate user due to its restrictive price.
Therefore, compared to its online alternatives, the developed web ap-
plication provides a major expansion of functionality while retaining
its accessibility.

Bibliography

 1.   HILL, Charles WL; JONES, Gareth R; SCHILLING, Melissa A.
      Strategic management: Theory & cases: An integrated approach.
      Cengage Learning, 2014. ISBN 1285184491.
 2.   How computing's first 'killer app' changed everything [online].
      London: BBC, 2019 [visited on 2021-04-07]. Available from:
      https://www.bbc.com/news/business-47802280.
 3.   Excel (.xlsx) Extensions to the Office Open XML SpreadsheetML File
      Format [online]. Microsoft, 2021 [visited on 2021-04-30]. Available
      from: https://docs.microsoft.com/en-us/openspecs/office_standards/ms-xlsx/2c5dee00-eff2-4b22-92b6-0738acd4475e.
 4.   FAIRHURST, Danielle Stein. Using Excel for business analysis: A
      guide to financial modelling fundamentals. John Wiley & Sons, 2015.
      ISBN 1119062462.
 5.   HTTP request methods [online]. Mozilla, 2021 [visited on
      2021-04-13]. Available from:
      https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods.
 6.   The internet's leading data migration tool [online]. London: SQLizer,
      2021 [visited on 2021-04-09]. Available from:
      https://sqlizer.io/about/.
 7.   PHP: Hypertext Preprocessor [online]. The PHP Group, 2021 [visited
      on 2021-04-18]. Available from: https://www.php.net/.
 8.   TIOBE Index for April 2021 [online]. TIOBE Software BV, 2021
      [visited on 2021-04-18]. Available from:
      https://www.tiobe.com/tiobe-index/.
 9.   Usage statistics of PHP for websites [online]. Q-Success, 2021
      [visited on 2021-04-18]. Available from:
      https://w3techs.com/technologies/details/pl-php.
10.   2020 Developer Survey [online]. Stack Overflow, 2021 [visited on
      2021-04-26]. Available from:
      https://insights.stackoverflow.com/survey/2020.

11.   Usage statistics of JavaScript as client-side programming language on
      websites [online]. Q-Success, 2021 [visited on 2021-04-26]. Available
      from: https://w3techs.com/technologies/details/cp-javascript.
12.   POTTER, John. react vs vue vs @angular/core [online]. 2021 [visited
      on 2021-04-26]. Available from:
      https://www.npmtrends.com/react-vs-vue-vs-@angular/core.
13.   MERKEL, Dirk. Docker: lightweight linux containers for consistent
      development and deployment. Linux journal. 2014, vol. 2014,
      no. 239, p. 2.
14.   FastCGI Process Manager (FPM) [online]. The PHP Group, 2021
      [visited on 2021-04-18]. Available from:
      https://www.php.net/manual/en/install.fpm.php.
15.   The Compose Specification [online]. 2021 [visited on 2021-04-27].
      Available from:
      https://github.com/compose-spec/compose-spec/blob/master/spec.md.
16.   Use volumes [online]. Docker Inc., 2021 [visited on 2021-04-28].
      Available from: https://docs.docker.com/storage/volumes/.
17.   BHAT, Sathyajith. Understanding Docker Volumes. In: Practical
      Docker with Python. Apress, Berkeley, CA, 2018, pp. 91-118. ISBN
      9781484237847.
18.   Adminer [online]. 2021 [visited on 2021-04-27]. Available from:
      https://github.com/vrana/adminer.
19.   FastCGI Specification [online]. Open Market Inc., 1996 [visited on
      2021-04-28]. Available from:
      http://www.mit.edu/~yandros/doc/specs/fcgi-spec.html.
20.   April 2021 Web Server Survey [online]. Netcraft Ltd, 2021 [visited
      on 2021-04-30]. Available from:
      https://news.netcraft.com/archives/2021/04/30/april-2021-web-server-survey.html.
