COMPARISON OF POPULAR DATA PROCESSING SYSTEMS - KTH THESIS REPORT KAMIL NASR - DIVA
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Comparison of Popular Data Processing Systems
KTH Thesis Report

Kamil Nasr

KTH ROYAL INSTITUTE OF TECHNOLOGY
ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Author
Kamil Nasr
Communication Systems Design
KTH Royal Institute of Technology

Examiner and Supervisor
Slimane Ben Slimane
Division of Communication Systems
KTH Royal Institute of Technology
Abstract

Data processing is generally defined as the collection and transformation of data to extract meaningful information. It involves a multitude of processes, such as validation, sorting, summarization and aggregation, to name a few. Many analytics engines exist today for large-scale data processing, namely Apache Spark, Apache Flink and Apache Beam. Each of these engines has its own advantages and drawbacks. In this thesis report, we used all three of these engines to process data from the Carbon Monoxide Daily Summary dataset to determine the emission levels per area and unit of time. We then compared the performance of the three engines using different metrics. The results showed that Apache Beam, while offering greater convenience when writing programs, was slower than Apache Flink and Apache Spark. Spark Runner was the fastest of the Beam runners, and Apache Spark was the fastest data processing framework overall.

Keywords

Apache Spark, Apache Flink, Apache Beam, Spark Runner, Flink Runner, Direct Runner, Big Data Analytics, Data Processing Systems, Benchmarking, Kaggle
Sammanfattning

Databehandling definieras generellt som insamling och omvandling av data för att extrahera meningsfull information. Databehandling involverar en mängd processer, såsom validering, sortering, sammanfattning och aggregering, för att nämna några. Många analysmotorer finns idag för storskalig databehandling, nämligen Apache Spark, Apache Flink och Apache Beam. Var och en av dessa motorer har sina egna fördelar och nackdelar. I den här avhandlingsrapporten använde vi alla dessa tre motorer för att bearbeta data från datasetet Carbon Monoxide Daily Summary för att bestämma utsläppsnivåerna per område och tidsenhet. Sedan jämförde vi prestandan hos dessa tre motorer med olika mått. Resultaten visade att Apache Beam, även om det erbjöd större bekvämlighet när man skriver program, var långsammare än Apache Flink och Apache Spark. Spark Runner i Beam var den snabbaste runnern och Apache Spark var det snabbaste databehandlingsramverket totalt.

Nyckelord

Apache Spark, Apache Flink, Apache Beam, Spark Runner, Flink Runner, Direct Runner, Big Data Analytics, Data Processing Systems, Benchmarking, Kaggle
Acknowledgements

I would like to thank my professor Slimane Ben Slimane for his constant help and support during the writing of this thesis report, as well as all my other professors at KTH who made this possible.
Contents

1 Introduction
  1.1 Motivation
  1.2 Problem
  1.3 Purpose
  1.4 Goal
  1.5 Benefits, Ethics and Sustainability
  1.6 Delimitations
  1.7 Outline

2 Background
  2.1 Data Processing Systems
  2.2 Kaggle
  2.3 Batch and Stream processing
    2.3.1 Batch processing
    2.3.2 Stream processing
  2.4 Apache Beam
    2.4.1 Overview
    2.4.2 Windowing
    2.4.3 Event time and processing time
    2.4.4 Triggers
    2.4.5 Programming model
  2.5 Apache Spark
    2.5.1 Overview
    2.5.2 Apache Spark stack
  2.6 Apache Flink
    2.6.1 Overview
    2.6.2 Architecture

3 Related Work
  3.1 Data Stream Processing Systems
  3.2 Languages
  3.3 Benchmarks
  3.4 Comparisons

4 System Design and Implementation
  4.1 System Design
    4.1.1 Data set
    4.1.2 Queries
  4.2 Implementation
    4.2.1 Apache Beam
    4.2.2 Apache Spark in Batch
    4.2.3 Apache Spark in Stream
    4.2.4 Apache Flink in Batch
    4.2.5 Apache Flink in Stream

5 Results and Evaluation
  5.1 Apache Beam
    5.1.1 Spark Runner
    5.1.2 Flink Runner
    5.1.3 Direct Runner
  5.2 Apache Spark
  5.3 Apache Flink
  5.4 Performance Comparison
    5.4.1 Spark Runner vs Flink Runner vs Direct Runner
    5.4.2 Apache Spark vs Spark Runner
    5.4.3 Apache Flink vs Flink Runner
    5.4.4 Apache Spark vs Apache Flink
  5.5 Evaluation

6 Conclusions and Future Work

References
List of Figures

2.1.1 Generations of Big Data Analytics [57]
2.1.2 Data pipeline evolution [67]
2.2.1 Kaggle [42]
2.3.1 Batch vs Stream Processing [20]
2.3.2 Processing Data Using MapReduce [20]
2.3.3 Real Time Processing in Spark [20]
2.4.1 Apache Beam [48]
2.4.2 The tradeoff between correctness, latency and cost in parallel processing [5]
2.4.3 Overview of Apache Beam [5]
2.4.4 Windowing [5]
2.4.5 Fixed Windows [5]
2.4.6 Sliding Windows [5]
2.4.7 Session Windows [5]
2.4.8 Common Windowing Patterns [3]
2.4.9 Apache Beam programming model [40]
2.5.1 Apache Spark [11]
2.5.2 Architecture of Apache Spark in Cluster Mode [12]
2.5.3 High-level architecture of Apache Spark stack [60]
2.6.1 Apache Flink [6]
2.6.2 Apache Flink Runtime Architecture [24] [35] [36]
2.6.3 Apache Flink Ecosystem [7]
3.1.1 IBM Nexmark Benchmark Evaluation [44]
3.1.2 Datalake in Google Cloud Dataflow [27]
3.1.3 Input data goes through MillWheel computations. An external anomaly notification system consumes the output [2]
3.1.4 Simplified Input Data Stream Management System [14]
3.1.5 STREAM Query plans [14]
3.1.6 Apache Calcite architecture and interaction [18]
3.1.7 Snapshot Algorithm [23]
3.1.8 Apache Samza [9]
3.1.9 Heron [55]
3.1.10 Storm High Level Architecture [68]
3.2.1 Oracle CEP Architecture [56]
3.4.1 Word count comparison between Spark and Flink [49]
3.4.2 Throughput results in function of the task parallelism [45]
3.4.3 Average Execution Times in s [36]
Chapter 1

Introduction

The world of Big Data is expanding massively. People generate a huge amount of data every day through many means, such as online shopping, communications and media consumption. Data, in order to be of value, needs to be operated on, sorted and refined. To do that, many analytics engines, also called data stream processing systems (DSPSs), exist today, namely Apache Spark, Apache Flink and Apache Beam. Even though Beam is better described as an abstraction layer, it comes with many runners, such as Spark Runner, Flink Runner and Direct Runner. Each of these engines can operate on data in both streaming and batch form. Although the engines share many similarities, they all have advantages and drawbacks depending on the use case. The idea behind Apache Beam is to let developers write their code once and then choose the appropriate runner. In theory, this should offer a great deal of flexibility without the need to rewrite code for different engines.

1.1 Motivation

Studying and comparing different engines can give us a lot of insight into when it is ideal to use each one of them, as well as the potential sacrifices compared with the gains of using another. Such knowledge can be of great use for any group of people wanting to operate on and handle large volumes of data, because with the huge portions of time dedicated to writing appropriate code, as well as the seemingly never-ending increases
in data size, any potential optimization on the programming side or the performance side can be crucial.

1.2 Problem

The need to process ever-growing amounts of data has driven the creation of data processing systems, or frameworks. However, switching between these frameworks is not guaranteed to be smooth. Adopting a new system can be quite challenging for most developers, especially when each system uses its own APIs. Apache Beam was created with the premise of a unified model, where code is written once, in Java, Python, Scala or Go, and run on a multitude of execution engines; even so, the increase in convenience might not justify the shift if it comes with large drawbacks in performance and execution times. Even though this balance can only be struck by the developers themselves, it is important to grasp the potential limitations and opportunities of each framework. Moreover, even within Beam itself, which execution engine, or runner, is faster, and in which context, can be of tremendous importance, because the performance differences between Apache Spark and Apache Flink are not guaranteed to carry over to Flink Runner and Spark Runner within Apache Beam. This leads us to four research questions:

• "How does the performance of Apache Spark compare with that of Apache Flink?"
• "How does the performance of Spark Runner compare with that of Flink Runner and Direct Runner?"
• "How does the performance of Flink Runner compare with that of Flink?"
• "How does the performance of Spark Runner compare with that of Spark?"

1.3 Purpose

This thesis report addresses these questions, with the hope of providing some insight into whether the increase in convenience offered by Beam is a net win or a net loss, depending on the context. Regardless of that, understanding the different performance
variations in Apache Spark and Apache Flink should give future developers an idea of which framework or engine best suits their needs before they get locked into one framework, with the possible challenge of switching to another. Using the Carbon Monoxide Daily Summary dataset gives us the opportunity to run these different queries on data that yields valuable information about environmental impact and changes in Carbon Monoxide levels. All of the above hopefully results in a step forward towards more efficient and effective data processing.

1.4 Goal

The goal of this thesis is to sort many different data fields in the very large Carbon Monoxide Daily Summary dataset using a multitude of data processing engines and frameworks, and appropriate queries, in order to determine which ones are faster and in which scenarios.

1.5 Benefits, Ethics and Sustainability

The main beneficiaries of this thesis project are the companies and developers who have an interest in using data processing frameworks in their work. This type of research should give them insight into which technologies suit them better and in which context. It will also facilitate the decision of changing frameworks based on the priorities they give to performance versus convenience. Beyond that, the Carbon Monoxide emission results that are sorted in this experiment should give people who are interested in this cause, as well as people who are in charge of decisions affecting these metrics, some valuable insights into the changes in Carbon Monoxide levels. When it comes to the ethics and sustainability aspects of this project, no major concerns are present. The Carbon Monoxide Daily Summary dataset is publicly available on the Kaggle website, and all data processing frameworks and tools that were used are open source software.
1.6 Delimitations

The main delimitation of this project is that the results and metrics gathered from a particular dataset, even though they might offer valuable insights into the performance differences between the data processing frameworks used, cannot necessarily be generalized to all datasets. In addition, the results and metrics might differ on other machines and for other versions of the data processing systems.

1.7 Outline

The report is structured in the following way. First, in Chapter 2, we present and discuss in detail all the technologies and tools used in this project, in order to give the reader a better understanding of the rest of the report. In Chapter 3, we cover related work that adds more context to the work done in this project. Chapter 4 presents the system design and discusses the experiment in detail. Chapter 5 presents the different results gathered, which should help us answer the research questions. The last chapter, Chapter 6, concludes the report and provides some insight into future work.
Chapter 2

Background

The aim of this chapter is to give more information on the different tools and technologies used in this project. The world of Big Data is ever-growing. With that come many opportunities, but a lot of complexity as well. We hope the reader will have an easier time understanding the experiments and results of this project after going through this chapter.

2.1 Data Processing Systems

Big data processing has evolved over the years and can generally be divided into four generations, as shown in Figure 2.1.1.

Figure 2.1.1: Generations of Big Data Analytics [57]
First generation

The first-generation data processing system is Apache Hadoop [8], which primarily focused on batch data. It introduced the concepts of Map and Reduce and thus provided an open-source implementation of MapReduce [28]. Apache Hadoop offered many advantages, but its biggest limitation was the involvement of many disk operations.

Second generation

The second-generation data processing systems included slight improvements over first-generation systems. One of the most popular second-generation systems is Tez, which introduced interactive programming in addition to batch processing [64] [19].

Third generation

Apache Spark [10] is the most famous third-generation data processing system. It is considered a unified model for both stream and batch processing. The concept of the Resilient Distributed Dataset, or RDD, is at the core of Apache Spark [83]. Machine learning is possible with Apache Spark, as it offers support for iterative processing. One of the advantages of Apache Spark is that it supports in-memory computation as well as processing optimization. Spark applications can be written in Java, R, Python and Scala.

Figure 2.1.2: Data pipeline evolution [67]
Fourth generation

Apache Flink [6] is essentially the fourth-generation data processing system. Unlike many other frameworks, Flink supports true real-time stream processing. It supports both batch and stream processing thanks to the DataSet and DataStream core APIs. Other supported APIs include SQL as well as the Table API. Flink also handles stateful stream processing and iterative processing computations, and it can efficiently handle the challenges of fault tolerance and scalable state management [22] [64].

2.2 Kaggle

Figure 2.2.1: Kaggle [42]

Kaggle.com is a very famous website for data scientists and engineers. It gives them access to a huge volume of datasets. It also hosts frequent competitions and challenges, open for anyone to join, and even awards prizes to winners. The Carbon Monoxide Daily Summary dataset that was used to conduct the experiments in this project was found on Kaggle. It was published by the US Environmental Protection Agency and contains a summary of daily CO levels from 1990 to 2017. Interestingly, Kaggle has been called the "AirBnB for Data Scientists". It has around half a million active users from over 190 countries, and it was acquired by Google in 2017. What is also important about Kaggle is that it aims to give
data scientists, who rarely get the chance to practice on real data before joining a company, the opportunity to practice on the datasets available on the platform in different ways, including the organized competitions and challenges.

2.3 Batch and Stream processing

In the world of Big Data and Data Analytics, batch data processing and stream data processing [20] are very important concepts, and it is crucial to understand the distinction between the two. Generally speaking, in batch processing, data is collected first and then processed, whereas stream processing is real-time, meaning data is sent into the analytics tool piece by piece. Let us discuss the two concepts in more detail and give a few examples and use cases for each of them.

Figure 2.3.1: Batch vs Stream Processing [20]

2.3.1 Batch processing

Batch processing is ideal when we are dealing with relatively large quantities of data, and/or when the sources of this data are old or legacy systems that are not compatible with stream data processing. For example, mainframe [81] data is processed in batch by default. It would be quite time-consuming and inconvenient to use mainframe data with newer analytics environments, hence the challenge in turning it into streaming data. Figure 2.3.2 shows how Hadoop MapReduce, a popular batch processing framework, processes data.
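The Map and Reduce phases that Hadoop popularized can be illustrated with a minimal, framework-free word-count sketch. The function names here are illustrative, not Hadoop API; a real MapReduce job distributes the map, shuffle and reduce phases across a cluster and spills intermediate results to disk.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit an intermediate (word, 1) pair for every word in every record.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: sort and group the intermediate pairs by key, as the
    # framework does between the map and reduce phases.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce: combine the values of each group into a final count.
    return {word: sum(c for _, c in group) for word, group in grouped}

counts = reduce_phase(shuffle(map_phase(["big data", "big analytics"])))
# -> {"analytics": 1, "big": 2, "data": 1}
```

Note that every phase operates on the whole collected input before the next phase starts, which is exactly why this style of processing is batch-oriented and disk-heavy.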
Figure 2.3.2: Processing Data Using MapReduce [20]

Batch processing usually shines in scenarios where real-time analytics are not necessary, as well as in those where the ability to process large amounts of data matters more than the speed of processing it (slower analytics results are acceptable) [20]. Examples of batch processing use cases include:

• Bills
• Customer orders
• Payroll

2.3.2 Stream processing

If we require analytics results in real time, then stream processing is the only way to go. The moment the data is generated, it is fed into the analytics tools using data streams. This allows us to get results that are almost instant. Stream processing can be useful in fraud detection, for example, because it allows real-time detection of anomalies. The latency in stream processing is usually in seconds or milliseconds, which is possible because in stream processing data is analyzed before it hits the disk [20]. Figure 2.3.3 explains how real-time processing works in a tool such as Apache Spark.
Figure 2.3.3: Real Time Processing in Spark [20]

Examples of stream processing use cases include:

• Fraud detection
• Log monitoring
• Customer behavior analysis
• Analyzing social media sentiment

Batch vs Stream processing

The type of data that the data engineer or scientist is dealing with determines to a large extent whether batch or stream processing is more suitable. However, it is possible to transform batch data into stream data in order to leverage real-time analytics results. This can provide the chance to react faster to opportunities or challenges in cases where time constraints apply.

Batch processing:
• Data is collected over a certain period of time
• Data is processed only after it has all been collected
• It can take a long time and is more suitable for large quantities of data with loose time restrictions

Stream processing:
• Data is collected continuously
• Data is processed live, piece by piece
• It is fast and more suitable for data that needs immediate processing
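The batch versus stream contrast above can be made concrete with a toy average computation. This is a sketch with illustrative names, not any framework's API: the batch version waits for the whole dataset, while the streaming version keeps a running result that is available after every element.

```python
def batch_average(readings):
    # Batch: all the data is collected first, then processed in one pass.
    return sum(readings) / len(readings)

class StreamingAverage:
    # Streaming: the running result is updated as each element arrives,
    # so an up-to-date answer exists at every point in time.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

readings = [0.4, 0.6, 0.5, 0.9]
stream = StreamingAverage()
partials = [stream.add(r) for r in readings]  # a result after every element
final = batch_average(readings)               # one result at the very end
```

Both approaches converge to the same answer on the same data; the difference is when that answer becomes available, which is the trade-off the list above describes.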
2.4 Apache Beam

2.4.1 Overview

Figure 2.4.1: Apache Beam [48]

Apache Beam is a parallel computing framework. It is an open SDK based on the Dataflow Model, which was presented by Google in [3]. Google Dataflow is in turn based on the processing frameworks FlumeJava [25] for batch data processing and MillWheel [2] for stream data processing [5]. Most parallel processing frameworks attempt to optimize either latency, correctness or cost. For example, developers might wait longer before beginning processing, in order to make sure that the data to be processed is complete and all late data is present. This most likely results in an increase in correctness, but also an increase in latency. The opposite scenario would be for the developers to start processing early, which results in lower latency but incomplete data and an increase in cost.
Figure 2.4.2: The tradeoff between correctness, latency and cost in parallel processing [5]

The main problem with parallel processing frameworks is that input data is expected to become complete at some point in time. The unified model that Apache Beam is based on offers a solution to this problem: it states that we might never know when, or if, all of our data is present [5]. It is unified in the sense that there is no differentiation between bounded and unbounded datasets (batch and streaming). Apache Beam is able to run on many execution engines, or runners, using the same code, written in Java, Python, Scala or Go. Some of these runners are Direct Runner [77], Spark Runner [76], Flink Runner [72], Google Cloud Dataflow Runner [78], IBM Streams Runner [39], Apache Hadoop MapReduce Runner [73], Hazelcast Jet Runner [34], Apache Nemo Runner [74], Twister2 Runner [70], Apache Samza Runner [75] and JStorm Runner [79] [36]. This should, in theory, give developers a certain degree of flexibility when different runners are better suited for different cases.
Figure 2.4.3: Overview of Apache Beam [5]

The Dataflow model innovates in the windowing and triggering areas.

2.4.2 Windowing

Figure 2.4.4: Windowing [5]

In streaming, data needs to be grouped into finite chunks in order for it to be aggregated; in other words, aggregation cannot be performed over an infinite dataset. This is where windowing comes in. Windowing is time-based, which means that data points are grouped depending on when they were observed (when they happened). Let us give an example of windowing in web analytics. In web analytics data, events are consumed in a streaming pipeline. Each of these events has a userId key and a timestamp stating when the event occurred. If we were interested in windowing the different data points representing user clicks, we would simply create a fixed window.
Fixed Windows

Figure 2.4.5: Fixed Windows [5]

Fixed windows have a predefined static window size, such as 1 hour, and are applied across every userId. This type of windowing allows us to add up all the clicks a user initiated in a specific window, telling us how many clicks occurred in each hour.

Sliding Windows

Figure 2.4.6: Sliding Windows [5]

Another type of windowing is the sliding window. Sliding windows, like fixed windows, have a predefined static window size, but on top of that they have a slide period, which means windows may overlap. For example, every window can still cover 1 hour, so we know the number of user clicks in each hour just as with fixed windows, but the difference is that the results are recalculated every minute instead of waiting for each window to finish.
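The fixed and sliding schemes just described come down to simple timestamp arithmetic. The sketch below is a simplified model with integer-second timestamps, not Beam API: a fixed window owns each timestamp exclusively, while overlapping sliding windows (1-hour size, 1-minute period, as in the example above) each contain the same timestamp.

```python
def fixed_window(ts, size):
    # Each timestamp falls into exactly one window [start, start + size).
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, period):
    # With a slide period shorter than the window size, windows overlap,
    # so one timestamp belongs to size / period different windows.
    result = []
    start = ts - (ts % period)
    while start > ts - size:
        result.append((start, start + size))
        start -= period
    return sorted(result)

# A click at t = 3700 s with 1-hour windows and a 1-minute slide period:
fixed = fixed_window(3700, 3600)           # (3600, 7200)
sliding = sliding_windows(3700, 3600, 60)  # 60 overlapping windows
```

This is why sliding windows can re-emit an hourly count every minute: each new minute starts a fresh 1-hour window that shares most of its data with the previous one.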
Session Windows

Figure 2.4.7: Session Windows [5]

The last type of windowing is called session windowing. Data points are organized into groups according to their keys, and then activity periods are captured in those subgroups, or session windows. Every session window is generally delimited by a timeout gap. In the web analytics example, where events can be grouped based on userId, session windows allow us to group user clicks into sessions. Session windows are not aligned, and thus not applied across every key.

Figure 2.4.8: Common Windowing Patterns [3]

One of the most important contributions of Apache Beam is that it supports unaligned windows, and windowing in general is one of the main elements of the Dataflow model that Apache Beam is based on. We can say that Apache Beam is a data processing system that allows us to treat batch data processing as a special case of stream data processing.
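The timeout-gap rule that delimits session windows can be sketched for the events of a single key (one userId). This is a simplified sketch, not Beam's internal session-merging logic:

```python
def session_windows(timestamps, gap):
    # Consecutive events separated by at most `gap` seconds share a session;
    # a longer silence closes the current session and opens a new one.
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Clicks of one user, with a 10-second timeout gap:
clicks = [1, 3, 4, 60, 62]
sessions = session_windows(clicks, gap=10)
# -> two sessions: [1, 3, 4] and [60, 62]
```

Because each key's sessions depend only on that key's own activity, windows for different users start and end at different times, which is exactly what makes session windows unaligned.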
2.4.3 Event time and processing time

Another central concept of the Dataflow model is the distinction between processing time and event time. Event time is the time at which an event actually takes place; for example, the moment of every user click on a certain website is an event time. Processing time is the time at which an event reaches our system in order to be processed. The distinction is critical because, unlike an ideal scenario where all of our data is always present and we can process all events the moment they occur, we actually need to take late data into consideration.

2.4.4 Triggers

The Dataflow model also includes triggers, whose goal is handling late data. In Apache Beam, developers can use triggers to choose when to emit output results for a certain window. Triggers work hand in hand with windowing: windowing specifies where in event time data are grouped together, and triggering specifies when in processing time the results are emitted [3].

2.4.5 Programming model

Figure 2.4.9: Apache Beam programming model [40]

The Apache Beam SDK comprises the following main elements:

• Pipeline: the pipeline consists of the input data, the transformations on it, and the output, which together make up the application definition.
• PCollection: a PCollection is a bounded or unbounded distributed dataset.

• PTransform: a PTransform is where data transformation happens. It receives PCollection object(s) and outputs PCollection object(s). Apache Beam offers a multitude of transforms, such as:

  – ParDo: ParDo is a generic parallel processing transform. It performs a processing function on each element in the input PCollection, then emits zero or more elements to the output PCollection. ParDo supports side inputs and stateful processing.

  – GroupByKey: GroupByKey, just as the name says, takes a collection of keyed elements and produces another collection where each element comprises a key and the values associated with that key.

  – Flatten: if multiple PCollection objects hold data of the same type, Flatten merges that data into a single PCollection.

Sample Code

The following sample code is a simple Apache Beam version of WordCount, written in Java [69].

// Source: https://beam.apache.org/get-started/try-apache-beam/
// Accessed 2020-09-16
// Example use of Pipeline

package samples.quickstart;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

import java.util.Arrays;

public class WordCount {
  public static void main(String[] args) {
    String inputsDir = "data/*";
    String outputsPrefix = "outputs/part";

    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("Read lines", TextIO.read().from(inputsDir))
        .apply("Find words", FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply("Filter empty words", Filter.by((String word) -> !word.isEmpty()))
        .apply("Count words", Count.perElement())
        .apply("Write results", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> wordCount) ->
                wordCount.getKey() + ": " + wordCount.getValue()))
        .apply(TextIO.write().to(outputsPrefix));
    pipeline.run();
  }
}

Listing 2.1: Sample Beam code in Java [69]
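The event-time versus processing-time distinction of Section 2.4.3 can also be shown in a few lines. In this simplified sketch (not Beam API), each event carries both the time it happened and the time it arrived, and window assignment uses the former, so a late element still lands in the window where it logically belongs.

```python
def window_by_event_time(events, size):
    # Events are (event_time, arrival_time) pairs. Grouping uses event
    # time, so an element that arrives late is still placed in the
    # window where it logically occurred.
    windows = {}
    for event_time, arrival_time in events:
        start = event_time - (event_time % size)
        windows.setdefault(start, []).append((event_time, arrival_time))
    return windows

# The third event happened at t=7 but only arrived at t=25, long after
# the [0, 10) window "should" have closed; it is still grouped there.
events = [(5, 6), (12, 13), (7, 25)]
windows = window_by_event_time(events, size=10)
```

A trigger then decides, in processing time, when the contents of each window are emitted, for example once when the window is believed complete and again whenever a late element such as (7, 25) arrives.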
2.5 Apache Spark

2.5.1 Overview

Apache Spark is an open-source distributed data processing system. Not only does it provide batch data processing, but it also handles stream data processing through its Apache Spark Streaming library. Unlike other data processing engines such as Apache Flink, which processes streams tuple by tuple, Apache Spark uses micro-batches to handle stream data processing.

Figure 2.5.1: Apache Spark [11]

Apache Spark Streaming programs can be written in several programming languages, such as Java, Python, or Scala. Many other libraries are available on top of Spark, such as graph processing libraries and machine learning libraries [85], [63], [36]. Figure 2.5.2 describes the Apache Spark installation architecture. An application is executed as many independent processes distributed across a cluster, and these processes are coordinated by the SparkContext. This coordinator is an object found in the main() function of the application, also known as the Driver Program. The SparkContext is connected to a Cluster Manager, which has the role of resource allocation [36].
Figure 2.5.2: Architecture of Apache Spark in Cluster Mode [12]

Apache Spark supports four cluster managers: Kubernetes [21], Apache Mesos [37], Spark Standalone, and Apache Hadoop YARN (Yet Another Resource Negotiator) [80]. Whenever a connection is established, the SparkContext acquires executors on the Worker Node instances; each executor is a process belonging to only one application, and it performs computations and stores data. This means that, unlike with Flink, applications running on the same cluster are executed in different JVMs, so an external storage system is needed to exchange data between Apache Spark applications [36]. Once the executors are acquired, the SparkContext transmits the program to them as Python files or a JAR; tasks are then sent to the executor processes, and each process can run more than one task in multiple threads [12] [47]. Apache Spark uses a central data structure called the Resilient Distributed Dataset, or RDD. An RDD is a read-only, partitioned record collection, and it can be considered a distributed memory abstraction. Apache Spark Streaming also uses the discretized streams processing model, or DStreams for short, which is essentially a sequence of RDDs. When an incoming data stream reaches the system, it is divided into multiple batches that are stored in RDDs; data transformations are then performed on the RDDs and a DStream is output [83] [84].
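The discretization step just described can be illustrated with a small self-contained Java sketch. This is our own toy code, not Spark's: incoming events are cut into fixed-size micro-batches, and each batch is then processed as a unit, here simply by summing it.

```java
import java.util.*;

// Toy illustration of Spark Streaming's micro-batch model (not Spark code).
public class MicroBatchSketch {

    // Split an incoming event sequence into batches of at most batchSize events.
    public static List<List<Integer>> discretize(List<Integer> events, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < events.size(); i += batchSize) {
            batches.add(events.subList(i, Math.min(i + batchSize, events.size())));
        }
        return batches;
    }

    // Process each micro-batch independently, as a small batch job would.
    public static List<Integer> sumPerBatch(List<List<Integer>> batches) {
        List<Integer> sums = new ArrayList<>();
        for (List<Integer> batch : batches) {
            sums.add(batch.stream().mapToInt(Integer::intValue).sum());
        }
        return sums;
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = discretize(List.of(1, 2, 3, 4, 5), 2);
        System.out.println(batches);          // [[1, 2], [3, 4], [5]]
        System.out.println(sumPerBatch(batches)); // [3, 7, 9]
    }
}
```

In real Spark Streaming the batches are RDDs cut by time interval rather than by element count, but the principle is the same: stream processing is reduced to a sequence of small batch jobs.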
2.5.2 Apache Spark stack

Figure 2.5.3: High-level architecture of the Apache Spark stack [60]

Apache Spark comprises several main components, including Spark core as well as the upper-level libraries depicted in Fig 2.5.3. Spark core can access data in any Hadoop data source, and it can run on different cluster managers. In addition, many packages exist today that work with Spark core and the upper-level libraries [60].

Spark core

Spark core provides a simple programming interface for processing large-scale datasets, and it is the main foundation of Apache Spark. Spark core has APIs in Java, Python, Scala, and R, but its main implementation is in Scala. The Spark core APIs support data transformations, actions, and many other operations, which are essential for the data analysis algorithms found in the upper-level libraries. Spark core also offers many in-memory cluster computing functionalities such as job scheduling, memory management, and data shuffling. These
functionalities make it possible for an Apache Spark application to be developed using the CPU, storage resources, and memory of a cluster [60].

Upper-level libraries

Many upper-level libraries on top of Spark core handle various workloads, such as GraphX [32][82] for graph processing, Spark's MLlib for machine learning [50], Spark SQL [17] for structured data processing, and Spark Streaming [84] for streaming analysis. Any improvement in Spark core naturally improves the upper-level libraries, since they are built on top of it. The RDD abstraction includes extensions for graph representation as well as for stream data representation. In addition, a higher level of abstraction for structured data is provided by the Spark SQL DataFrame and Dataset APIs [60].

Cluster managers and data sources

As stated in the overview, the cluster manager allows the execution of jobs by acquiring resources, and it also handles resource sharing between Spark applications. Spark supports data in Cassandra, HDFS, Alluxio, Hive, HBase, and essentially any Hadoop data source.

Spark applications

Five entities are involved in running a Spark application (as depicted in Fig 2.5.2): a driver program, workers, a cluster manager, tasks, and executors. A driver program defines a high-level control flow for the target computation, and it uses Spark as a library. A worker offers CPU, storage resources, and memory to the Spark application. On each worker, Spark creates a Java Virtual Machine (JVM) process for the Spark application, called an executor. Spark also performs computations, such as processing algorithms, on a cluster and delivers the results to the driver program; this process is referred to as a job. Each Spark application can handle more than one job. Each job is
split into a DAG (directed acyclic graph) of stages, where each stage is a collection of tasks. The smallest unit of work sent to an executor is referred to as a task. A SparkContext is the main entry point for Spark functionality, and the driver program accesses Spark through the SparkContext. A SparkContext also represents a connection to a computing cluster [60].

Sample Code

The following code, written in Java, is an example of searching through error messages in a log file using Apache Spark [62].

// Creates a DataFrame having a single column named "line"
JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
List<StructField> fields = Arrays.asList(
    DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
DataFrame df = sqlContext.createDataFrame(rowRDD, schema);

DataFrame errors = df.filter(col("line").like("%ERROR%"));
// Counts all the errors
errors.count();
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count();
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect();

Listing 2.2: Sample Spark code in Java [62]
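The same filter-then-count dataflow as in Listing 2.2 can be expressed with plain java.util.stream over an in-memory list of log lines. This is only an illustration of the dataflow, not Spark code, and the sample log lines are made up.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the filter/count dataflow from Listing 2.2.
public class LogFilterSketch {

    // Count the lines containing the given substring.
    public static long countMatching(List<String> lines, String needle) {
        return lines.stream().filter(l -> l.contains(needle)).count();
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "INFO starting up",
                "ERROR MySQL connection refused",
                "ERROR disk full",
                "INFO shutting down");
        System.out.println(countMatching(log, "ERROR"));  // 2
        // Chained filters, like errors.filter(...).count() in Listing 2.2
        List<String> errors = log.stream()
                .filter(l -> l.contains("ERROR")).collect(Collectors.toList());
        System.out.println(countMatching(errors, "MySQL")); // 1
    }
}
```

The key difference from Spark is that here everything runs eagerly on one machine, whereas Spark distributes the same logical operations across executors and evaluates them lazily.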
2.6 Apache Flink

Figure 2.6.1: Apache Flink [6]

2.6.1 Overview

Apache Flink [6] is a very popular fourth-generation data processing system that supports both batch and stream processing. Flink is open source, and Flink programs can be written in Java or Scala. Many libraries exist on top of Apache Flink to handle additional functionalities such as graph processing or machine learning [24] [35].

2.6.2 Architecture

Figure 2.6.2: Apache Flink Runtime Architecture [24] [35] [36]

Figure 2.6.2 presents the Apache Flink runtime architecture, which includes a Flink Client, Task Managers, and a Job Manager. Whenever an application is deployed, it is transformed into a dataflow graph by the Flink Client and sent to the Job Manager. The Flink Client can then disconnect from the Job Manager or stay connected
to it if information about the execution progress is needed; this is possible because the Flink Client is not part of the program execution. One of the important roles of the Job Manager is to schedule work across the Task Manager instances and to keep track of the execution. Even though it is possible to have more than one Job Manager instance, only one can be the leader (the others can take over in case of failure). Every Apache Flink installation has at least one Task Manager. A Task Manager is a JVM process, and its instances execute the assigned parts of the program. Task Managers can exchange data between themselves when necessary, and each of them provides at least one task slot. Subtasks within one application can share a task slot even if they belong to separate tasks. Each task is executed by one thread, and multiple operator subtasks can be chained into one task, which brings advantages such as reduced overhead [24] [35] [29] [36].

Figure 2.6.3: Apache Flink Ecosystem [7]

Sample Code

The following Flink Java code is an example of a "WordCount" program in streaming mode that outputs a word occurrence histogram from text files.
public static void main(String[] args) throws Exception {

    // Checking input parameters
    final MultipleParameterTool params = MultipleParameterTool.fromArgs(args);

    // set up the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // make parameters available in the web interface
    env.getConfig().setGlobalJobParameters(params);

    // get input data
    DataStream<String> text = null;
    if (params.has("input")) {
        // union all the inputs from text files
        for (String input : params.getMultiParameterRequired("input")) {
            if (text == null) {
                text = env.readTextFile(input);
            } else {
                text = text.union(env.readTextFile(input));
            }
        }
        Preconditions.checkNotNull(text, "Input DataStream should not be null.");
    } else {
        System.out.println("Executing WordCount example with default input data set.");
        System.out.println("Use --input to specify file input.");
        // get default test text data
        text = env.fromElements(WordCountData.WORDS);
    }

    DataStream<Tuple2<String, Integer>> counts =
            // split up the lines in pairs (2-tuples) containing: (word, 1)
            text.flatMap(new Tokenizer())
                    // group by the tuple field "0" and sum up tuple field "1"
                    .keyBy(value -> value.f0)
                    .sum(1);

    // emit result
    if (params.has("output")) {
        counts.writeAsText(params.get("output"));
    } else {
        System.out.println("Printing result to stdout. Use --output to specify output path.");
        counts.print();
    }
    // execute program
    env.execute("Streaming WordCount");
}

// ****************************************************************
// USER FUNCTIONS
// ****************************************************************

/**
 * Implements the string tokenizer that splits sentences into words as a user-defined
 * FlatMapFunction. The function takes a line (String) and splits it into multiple pairs in the
 * form of "(word,1)" ({@code Tuple2<String, Integer>}).
 */
public static final class Tokenizer
        implements FlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");

        // emit the pairs
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}

Listing 2.3: Sample Flink code in Java [30]
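The operator chaining mentioned in Section 2.6.2 can be sketched in plain Java. This is our own illustration, not Flink internals: fusing two map operators into a single function gives exactly the same result as running them as separate stages, while avoiding the intermediate collection (the per-element hand-off between tasks that chaining eliminates).

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

// Toy illustration of Flink-style operator chaining (not Flink code).
public class ChainingSketch {

    // Two separate passes, as if each operator were its own task.
    public static List<Integer> runUnchained(List<String> in,
                                             Function<String, String> op1,
                                             Function<String, Integer> op2) {
        List<String> intermediate = in.stream().map(op1).collect(Collectors.toList());
        return intermediate.stream().map(op2).collect(Collectors.toList());
    }

    // Operators fused into one function: one pass, no intermediate collection.
    public static List<Integer> runChained(List<String> in,
                                           Function<String, String> op1,
                                           Function<String, Integer> op2) {
        Function<String, Integer> chained = op1.andThen(op2);
        return in.stream().map(chained).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Function<String, String> trim = String::trim;
        Function<String, Integer> len = String::length;
        List<String> data = List.of("  flink ", "task ");
        System.out.println(runChained(data, trim, len)); // [5, 4]
    }
}
```

In Flink the chained subtasks additionally share one thread, so chaining saves thread hand-offs and serialization, not just an intermediate buffer.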
Chapter 3

Related Work

In this chapter, we discuss related work and papers that give further insight into our topic.

3.1 Data Stream Processing Systems

IBM Streams

As previously mentioned, Apache Beam is considered an advanced unified programming model with many runners [4]. One of these runners is IBM Streams [38], whose development, optimization, and performance evaluation were discussed in [44]. The article also discusses the performance differences between IBM Streams, Apache Spark [10], and Apache Flink [6].

Figure 3.1.1: IBM Nexmark Benchmark Evaluation [44]
Google Cloud Dataflow

Apache Beam was inspired by the SDK (programming model) of another DSPS, Google Cloud Dataflow [33]. It was created to work with unordered and unbounded large datasets.

Figure 3.1.2: Datalake in Google Cloud Dataflow [27]

MillWheel

The implementation and model details of the DSPS MillWheel were presented in [2][3]. It is used to build low-latency data processing applications.

Figure 3.1.3: Input data goes through MillWheel computations. An external anomaly notification system consumes the output [2]

STREAM

Apache Beam is not the only abstraction layer that allows data to be processed from many languages such as Java; SQL can also be used to accomplish the same goal, in the form of the Continuous Query Language, or CQL [15]. Stanford University
has developed a Data Stream Management System (DSMS) called the Stanford Stream Data Manager (STREAM) [14], which integrates the SQL-based CQL language with the aim of processing continuous queries over many continuous data streams. The linked paper also compares CQL with other languages.

Figure 3.1.5: STREAM Query plans [14]

Figure 3.1.4: Simplified Input Data Stream Management System [14]

Apache Calcite

A more recent player in the data processing world is Apache Calcite [18]. This tool is quite multifunctional, with the ability to process queries, optimize them, and support query languages. The linked work discusses the architecture behind Apache Calcite, among other topics. It also discusses SQL extensions for geospatial queries and semi-structured data. Other mentioned extensions are data stream processing query extensions, or STREAM extensions, which were inspired by the previously mentioned CQL language. More information is available on the project website [65]. It is also worth mentioning that Apache Calcite is integrated by multiple DSPSs, including Apache Flink and Apache Apex [18].
Figure 3.1.6: Apache Calcite architecture and interaction [18]

Distributed Snapshot Algorithm for Flink

When it comes to Flink, the state management module includes a consistent distributed snapshot algorithm that resembles the Chandy-Lamport protocol [26]. This algorithm is explained in [23].

Figure 3.1.7: Snapshot Algorithm [23]
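The core idea behind such marker-based snapshots can be illustrated with a deliberately tiny, single-channel Java sketch. This is our own toy code, not Flink's implementation: two processes hold "tokens", some tokens are in flight on a FIFO channel, the sender records its state and injects a marker, and the receiver applies every pre-marker message before recording its own state. The recorded states then sum to the true global token count even though the system kept running during the snapshot.

```java
import java.util.*;

// Single-channel sketch of a Chandy-Lamport-style snapshot (toy code).
public class SnapshotSketch {
    static final int MARKER = -1; // sentinel value, never a real token message

    // Returns {recordedSenderState, recordedReceiverState}.
    public static int[] snapshot(int senderTokens, int receiverTokens, int inFlight) {
        Deque<Integer> channel = new ArrayDeque<>();
        channel.add(inFlight);             // tokens sent before the snapshot began

        int recordedSender = senderTokens; // 1. sender records its own state...
        channel.add(MARKER);               // 2. ...then emits the marker

        senderTokens -= 10;                // 3. sender keeps working: these tokens
        channel.add(10);                   //    travel behind the marker

        int recordedReceiver = 0;
        while (!channel.isEmpty()) {
            int msg = channel.poll();
            if (msg == MARKER) {           // 4. marker arrives: record own state;
                recordedReceiver = receiverTokens;
                break;                     //    post-marker traffic is not snapshotted
            }
            receiverTokens += msg;         //    pre-marker messages are applied first
        }
        return new int[]{recordedSender, recordedReceiver};
    }

    public static void main(String[] args) {
        int[] snap = snapshot(100, 50, 7);
        // 100 + 57 == 157, the true global total (100 + 50 + 7)
        System.out.println(snap[0] + " + " + snap[1] + " = " + (snap[0] + snap[1]));
    }
}
```

The real algorithm handles many processes and channels (where recording in-flight channel state matters) and, in Flink's case, aligns markers across input channels; this sketch only shows why a marker flowing with the data yields a consistent cut.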
Samza

Samza [9] was developed by LinkedIn, and its high-level design was presented in [53]. Unlike Flink, which relies on distributed snapshots, Samza replays change logs when attempting failure recovery, and it uses Host Affinity to make recoveries faster. Because Samza does not rely on a distributed snapshot, there is no guarantee of global consistency.

Figure 3.1.8: Apache Samza [9]

Storm and Heron

In the beginning, Twitter used Storm [68] for stream data processing, but it later switched to Heron [71]. Storm uses spouts and bolts to run applications in a distributed way. Heron is essentially an upgraded version of Storm that increases performance and scalability and supports back pressure to avoid data dropping. A self-healing and self-tuning version of Heron was discussed in [31].
Figure 3.1.9: Heron [55]

Figure 3.1.10: Storm High Level Architecture [68]

3.2 Languages

SQL-based Data Stream Processing Extensions

Oracle Continuous Query Language [59] and StreamBase StreamSQL [66] are languages intended to define streaming queries based on SQL. The differences between these two languages were discussed in [41]. The authors also attempt to unify the two languages, but more obstacles need to be overcome.
Figure 3.2.1: Oracle CEP Architecture [56]

On top of Oracle CEP CQL and StreamSQL, other SQL-based data stream processing extensions exist for specific systems, such as SamzaSQL [58] used in the DSPS Samza [54], the Continuous Computation Language (CCL) used in SAP HANA Smart Data Streaming [61], and KSQL [43] used with Apache Kafka, to name a few.

3.3 Benchmarks

Linear Road Benchmark

Generally speaking, DSPSs can be benchmarked with different tools. One of the more popular tools is the Linear Road benchmark [16], an application benchmark with a toolkit for conducting benchmarks. The Linear Road benchmark is comprised of a data generator, followed by a data sender, and finally a result validator. The benchmark is based on the concept of a variable tolling system in a metropolitan area with many highways and a number of moving cars; the tolls vary depending on many aspects of the traffic situation. Reports of the different car positions are transmitted to the DSPS by the data sender. The DSPS then either outputs data or not, depending on the traffic situation on the highways as judged from the received reports. Besides the car reports, an explicit query is also sent as input data, and an answer is always required for this explicit query. The Linear Road benchmark paper mentions four different queries; the last one was not implemented in the paper [16] because it was too complex. The L-rating is the result of a system benchmark: the Linear Road paper defines it as the number of highways that the system can tolerate while still meeting a certain response-time requirement for the queries. The number of highways can be chosen when the data is being generated. The more
highways there are in the system, the higher the input rate is. In the same paper, the Linear Road benchmark is applied to a relational database as well as to a DSPS called Aurora [1], and the paper includes the corresponding results.

StreamBench

StreamBench [46] is another related benchmark, focused on distributed DSPSs. It can more accurately be considered a micro-benchmark, because it is better suited to benchmarking atomic operations than complex applications, as is the case with the previously mentioned Linear Road. StreamBench includes three stateful queries and four stateless queries, for a total of seven queries. Some of these queries have more than one computational step, while others have only one. Unlike stateless queries, stateful queries need to keep state to produce a correct answer or result. When it comes to input, all queries use textual data except one query that processes numerical data. Regarding the architecture of this benchmark, Apache Kafka is the message broker that StreamBench uses for data generation and consumption, which is not the case with the Linear Road benchmark. In the evaluation chapter, Apache Spark and Apache Storm [13] are both benchmarked using StreamBench.

NEXMark

Another DSPS benchmark tool is NEXMark [51]. When it comes to Apache Beam, NEXMark has been ported to Beam from Dataflow and then refactored for the latest version of Beam. It supports all the different runners in Apache Beam. NEXMark consists of a generator (timestamped events), the NEXMarkLauncher (which creates sources and launches the query pipelines), output metrics (such as execution time), and modes (batch and stream) [52].

3.4 Comparisons

Spark and Flink

A comparison of Apache Spark and Apache Flink was discussed in [49]. Multiple queries were used in the experiments, such as a grep query. The authors paid careful attention to how scaling is affected by the total number of cluster
nodes. The paper does not focus on the data stream perspective in the experiments.

Figure 3.4.1: Word count comparison between Spark and Flink [49]

Spark, Flink and Storm

Apache Flink, Apache Spark and Apache Storm are compared in [45]. The authors addressed the behavior of these systems in the context of a node failure, and they also present the different architectures of the DSPSs.

Figure 3.4.2: Throughput results as a function of the task parallelism [45]

Spark, Flink, Apex and Beam

In [36], the authors investigate whether there are any impacts on performance when using Apache Beam (streaming mode) with the following runners: Spark, Flink and Apex.
CHAPTER 3. RELATED WORK Figure 3.4.3: Average Execution Times in s [36] 38
Chapter 4

System Design and Implementation

In this chapter, we discuss how we designed the system for conducting the different experiments.

4.1 System Design

4.1.1 Data set

The dataset that we chose is the Carbon Monoxide Daily Summary dataset, downloaded from Kaggle.com. The choice of this dataset was motivated by two reasons. The first is that it is a relatively large dataset, which allows a meaningful comparison of the different data processing frameworks. The second, equally important reason is that the nature of the dataset itself has meaning to us: being able to benchmark different data processing frameworks and at the same time obtain interesting insights into carbon monoxide emissions is likely to be useful not only for the technical aspect of engine comparison, but also for the environmental cause. We used the first 2 million rows of the Carbon Monoxide dataset on hardware consisting of 16 GB of RAM and an 8-core 8th-generation Intel i5 CPU.

4.1.2 Queries

In order to compare the different data processing systems, in batch as well as in stream mode, we use the 3 following queries:
1. Filtering: Filter the data for a specific county (county == 31) and get the count of records.

2. State management: Get the sum of the arithmetic mean by county where the state is 06 and the year is 2017.

3. Windowing: Get the average for state 05 from the year 1997 until 2015.

4.2 Implementation

In the following section, we present the implementation of our queries in Spark, Flink, and all the runners within Beam, in batch and streaming mode.

4.2.1 Apache Beam

Apache Beam unifies batch and stream processing, so the code is the same for both; this is one of the most important features offered by Beam.

Query 1

Query 1 is the simplest query to implement compared to the other queries. It reads the data from the input file and parses each row individually using the map function. It extracts the county and creates a two-valued tuple with a count. After parsing, we have just the required information and can filter it based on the first tuple value, which is the county. We used the Filter operation to filter the results for county code 31. Furthermore, we grouped the observations per county and counted them with Count.perKey(). Finally, the query writes a final aggregated value for the selected county.

static void runQuery1(WordCountOptions options) {
    Pipeline p = Pipeline.create(options);

    // pipeline to read, filter based on county, map to KV pair, do the count
    p.apply("ReadLines", TextIO.read().from(options.getInputFile()))

        .apply(new PTransform<PCollection<String>, PCollection<KV<String, Long>>>() {
            @Override
            public PCollection<KV<String, Long>> expand(PCollection<String> input) {
                return input.apply(
                    MapElements.into(
                        TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
                        .via(line -> KV.of(line.split(",")[1], 1L))); // index 1 is county_code since we filter on that
            }
        })
        .apply(Filter.by(obj -> obj.getKey().equals("031")))
        .apply(Count.perKey())
        .apply(MapElements.via(new WordCount.FormatAsTextFn()))
        .apply("WriteCounts", TextIO.write().to(options.getOutput()));

    p.run().waitUntilFinish();
}

Listing 4.1: Query 1 implementation using Apache Beam

Query 2

The second query evaluates the performance and the stateful stream processing capabilities of the different engines, since it requires state storage to complete the operation. The implementation streams the data from the input file, parses each row, and creates a CO object with the few attributes that are used for filtering. The query then filters on the state code and a specific year. Afterwards, each object is mapped into a key-value pair of the required information, which is used to sum up all the values for a specific key (the average emission). Finally, we transform the pairs into more user-friendly strings and write
them into the file.

static void runQuery2(WordCountOptions options) {
    Pipeline p = Pipeline.create(options);

    // pipeline to read, filter based on state and year, map to KV pair, sum per key
    p.apply("ReadLines", TextIO.read().from(options.getInputFile()))

        .apply(new CreateCOObjects())
        .apply(Filter.by(obj -> obj.state_code == 6 && obj.date_local.getYear() == 2017))
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.integers(), TypeDescriptors.doubles()))
            .via(row -> KV.of(row.county_code, row.arithmetic_mean)))
        .apply(Sum.doublesPerKey())
        .apply(MapElements
            .into(TypeDescriptors.strings())
            .via(x -> x.getKey().toString() + ": " + x.getValue()))
        .apply("WriteCounts", TextIO.write().to(options.getOutput()));

    p.run().waitUntilFinish();
}

Listing 4.2: Query 2 implementation using Apache Beam

Query 3

The final query compares the performance of the different processing engines in terms of windowing. It creates annual groupings (windows) and calculates the mean for each window. As before, it starts by streaming the data from the input file and transforms each row into CO objects, which are then filtered based on the criteria. Then, we apply fixed-size windowing on the data to obtain the average for a year and calculate the mean emission. Finally, we convert objects into the pair and