Multi-agent architecture for real-time Big Data processing

Page created by Hector Davis

Shopping

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Multi-agent architecture for real-time Big Data processing

2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)

          Multi-agent architecture for real-time Big Data
                            processing
                           Bartłomiej Twardowski                                      Dominik Ryżko
                        Institute of Computer Science                         Institute of Computer Science
                      Warsaw Universtity of Technology                       Warsaw Universtity of Technology
                                 Allegro Group                                         Allegro Group
                      Email: B.Twardowski@ii.pw.edu.pl                         Email: d.ryzko@ii.pw.edu.pl

      Abstract—The paper describes the architecture for processing                II. M OTIVATION AND RELATED WORK
   of Big Data in real-time based on multi-agent system paradigms.
   The overall approach to processing of ofﬂine and online data is
                                                                         The work was motivated by the study on applications of
   presented. Possible applications of the architecture in the area    Big Data processing in Allegro.pl, the largest e-commerce
   of recommendation system is shown, however it is argued the         marketplace platform in Central Europe. The main challenge
   approach is general purpose.                                        was to create an architecture, which would allow efﬁcient
                                                                       processing of various data stored in the long term storage as
                         I. I NTRODUCTION                              well as captured in real-time in order to serve the results of
                                                                       such processing to the on-line users.
      Big Data is one of the most popular research topics nowa-
   days. With the advances in parallel computation architectures       A. Big data
   more and more instances of the algorithms can be run con-              There has already been a lot of work done in the area of
   currently. Systems such as Hadoop allow storage of huge             Big Data applications for Business Intelligence and analytic in
   ammounts of unstructured information and processing them            particular in the e-commerce area [5]. Large vendors both in
   with algorithms such as MapReduce [1], optimized for parallel       America (Amazon, eBay) as well as in Europe (Allegro) and
   and asynchronous computations.                                      other continents were able to build highly scalable platforms
      Also in the database systems we observe the shift of             as well as analytical capabilities around them. Furthermore
   paradigms. In most practical applications of Big Data the tradi-    they analyze social content generated around to get even better
   tional properties of ACID (Atomicity, Consistency, Isolation,       insight into the client experience.
   Durability) are no longer applicable. As the CAP Theorem               An important paradigm which is more and more present in
   states [2], we can assure at the same time only two out of the      the Big Data analysis is the concept of autonomous, intelligent
   following three properties namely:                                  and proactive agents. In the context of this work the notion
                                                                       of agent mining is especially interesting [6]. It combines
     •   Consistency (all nodes see the same data at the same
                                                                       methodologies, technology, tools and systems from the area of
         time)
                                                                       multi-agent technology, data mining and knowledge discovery,
     •   Availability (a guarantee that every request receives a
                                                                       machine learning, statistics and semantic web in order to
         response about whether it was successful or if it has
                                                                       address issues that cannot be tackled by any single technique
         failed)
                                                                       with the same quality and performance.
     •   Partition tolerance (the system continues to operate de-
                                                                          The multi-agent approach can also be applied to streaming
         spite arbitrary message loss or failure of part of the
                                                                       data processing in the actors model. In [7] a S4 platform is
         system)
                                                                       presented, which allows distributed processing of the streams
      There are several important applications of Big Data, which      of data. The authors show how this can be applied to tune
   require new approaches in order to provide robust and usable        parameters in dynamic systems.
   solutions. Among the most popular there are: recommender
   systems [3], social network analysis [4], web mining, algorith-     B. Recommender systems
   mic trading, fraud detection etc. The diversity of these topics        The work on recommender systems has been a popular
   shows the importance of Big Data and imposes challenges on          topic since the beginning of large commercial services on the
   the tools and architectures, which have to be applicable in         Internet. Several approaches to this subject have been proposed
   various scenarios.                                                  [8] with Collaborative Filtering techniques being the most
      This paper shows an architecture, which is designed for          popular. Other approaches include content-based, knowledge-
   different implementations in which both real-time and ofﬂine        based and hybrid approaches.
   Big Data processing is necessary. As an example of application         Multi-agent systems have been often employed to the task of
   a recommender system is shown.                                      building recommender systems. In [9] a recommender system

978-1-4799-4143-8/14 $31.00 © 2014 IEEE                          333
DOI 10.1109/WI-IAT.2014.185

           Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21

can be used together with batch veiws for the complete view
of the data.
B. Other approaches
Several other approaches to processing of Big Data in
real-time have also been proposed. In [13] the following 8
requirements for real-time data processing are proposed:
• keep the data moving
• query using SQL on Streams (StreamSQL)
• handle stream imperfections
Figure 1. The lambda architecture
• generate predictable Outcomes
• integrate stored and streaming data
• guarantee data safety and availability
of personalized TV contents, named AVATAR, is presented.
• partition and scale applications automatically
A collaborative approach to recommendations combined with
• process and respond instantaneously
intelligent agents can be found in [10], [11].
Also other approaches to the subject of real-time or near
III. R EAL - TIME PROCESSING WITH B IG DATA real-time processing of Big Data can be found in the literature
One of the main challenges, with respect to processing [14], [15].
of very large sets of data, is handling of the real-time data IV. A RCHITECTURE FOR MULTI - AGENT B IG DATA
streams. While both offline and online parts of data can ANALYSIS
be process independently, very often we need to provide
Presented approach for processing Big Data in a real-time
answers to questions regarding online events based on the past
manner is trying to solve the main problem of analyzing
history. The lambda architecture comes as an answer to these
large, continuously growing data sets. Lambda architecture is
challenges.
one of the newest method which has lately gained a lot of
A. The Lambda Architecture popularity, mainly because of its simplicity and use of a well-
known tools for processing the data. Three easily recognizable
The lambda architecture has been proposed by Nathan Marz
layers for batch, speed and service is making clear division
[12]. It is based on several assumptions:
of components functionality. Moreover for each of this layer,
• fault tolerance
there is wide selection of solutions for implementation to
• support of ad-hoc queries
choose. Most of them are available on market for years and
• scalability
are known for the reliability.
• extensibility
Despite the fact, that the recalled architecture is presented
The architecture as shown in the Figure 1 is composed of as a simple and a clear one, still many decisions and lots
the following components: of integration work needs to be done. The architecture gives
• Batch Layer - responsible for managing the master dataset only guidelines on how the system should be designed and
and for precomputation of batch views which parts should it consist of. This gives freedom in the
• Serving layer - indexes the batch views for fast ad-hoc choice of existing solutions for a specific job. Still, interaction
queries between batch, service and speed layers has to be handled
• Speed Layer - serves only the new data, which has not properly. Furthermore, even in a single layer a few components
yet been processed by the batch Layer must interact together using different protocols and methods
The Batch layer can be implemented with the use of of communication. In the Big Data environments these means,
systems such as Hadoop. It is responsible for storing the that we have to cope with the integration of distributed data
imputable master data set. Furthermore, with the use of the processing systems where the scalability and reliability should
MapReduce algorithms it contiguously computes views of this be taken into account.
data available to the various applications. The lambda architecture for processing Big Data can be
The Serving layer is responsible for serving the views modeled as a heterogeneous multi-agent environment. Tree
computed by the batch layer. This process can be facilitated distinct layers with different characteristics exists, between
by additional indexing of the data in order to speed up the which multiple components have to interact with each other.
reads. An example of a technology used typically to do this This communication can be simplified using the multi-agent
job is Impala, which is easily integrated with Hadoop used in system approach. Each agent is responsible for specific task
the Batch layer. in data processing, e.g.: receiving data, aggregating result, etc.
Finally, the role of the Speed layer is to compute in real-time Agents are autonomous and distributed. Cooperation between
the data that has just arrived and has not yet been processed agents is done using message passing. All agent are com-
by the batch layer. It serves this data in the form of real-time municating in the same manner. Therefore, the integration is
views, which are incremented the as the new data arrives and simplified.

334

Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21

consideration, this views are fast in-memory data sets prepared
for the quick online access.
Both the Batch Views and the Real-Time Views are created
for a specific use case. This use case problem is solved in
the serving layer. Request from the outside world is handled
by a dedicated Service Agent. The particular problem e.g. a
new bank credit decision, is solved by the appropriate types
of agent. For every new request the Service Agent is created.
To solve the given problem and prepare the response, Service
Agent is collecting necessary data. Historical data are provided
by a precomputed Batch Views. To access this data a Batch
Aggregator Agent is used. This agent query appropriate Batch
Views.
Figure 2. Architecture for multi-agent big-data processing A similar processing is done to collect fresh online data.
Real-Time Aggregator Agent is preparing data from Real-
Time Views data sets. Both batch and real-time views are
The Figure 2 presents the lambda architecture for Big Data combined together to present the whole picture of the data.
processing using multi-agent systems. There are no changes After collecting all required data from the aggregators agents
to the main concept. In MAS approach there are still batch, the response is created. In this point, when request is served
speed and serving layers. Where the batch layer is creating and response is send back to the client, a life-cycle of Service
aggregates - Batch Views - from the all data. The speed layer Agent ends.
is only incremental - Real-Time Views - for new, non-archived Depending on an infrastructure and the system domain, the
data. The service layer which brings both offline and online online layer can hold views for the different time periods. It
calculated data (views) for solving specified problems, e.g. can vary from days to seconds. The main idea behind the
analytic query, new credit decision, music recommendation batch and the speed layers is to work together to present
etc. a coherent view for the data at the service layer. The first
The input data is processed by the system as a data stream. thing which can be noticed in the proposed MAS architecture
Depending on the domain, it can be a stream of: page views, is that all communication in the system is only between the
user transactions, system logs, diagnostic events etc. Stream agents. Every single task presented in the lambda architecture
- as data series - is collected by the Stream Receiver Agent. in previous section is encapsulated inside an autonomous
This agent is responsible for simple data pre-processing like: agent. This results in simplified integration and distributed
filtering, data format change, serialization to objects etc. After computation.
that, every data event in the stream is passed to the Archiver Moreover, in the presented approach it is strongly rec-
Agent and the Stream Processing Agent. Both of these agents ommended to use the same event representation in a both
are in control of handling new data in the batch and the speed processing: batch and online. Despite the differences in the
layer respectively. infrastructure a data schema can be the same (for lambda the
In the batch processing the new data is first written to the most common ones are Hadoop for batch and Storm for online
Data Store, e.g. Hadoop Distributed File System [16]. The processing). What is more, when agents are fine grained and
Data Store has to handle large data sets and store all events designed for a single responsibility then it can be reused in
in the system. Holding every single event from data stream speed and batch layer. For example: the same calculation done
allows to run computation for a selected time period from the in Batch Worker Agent and Real-Time Worker Agent can be
history. The computations are coordinated by the Batch Driver done by the same agent implementation.
Agent. This agent is created for running specific jobs, while
the actual work is done by its slave agents - Batch Worker V. A PPLICATIONS
Agents. Each worker agent processes part of the data, in order
to successfully produce the output of the job - Batch Views. The lambda architecture is showing how the batch and
The Batch Views are all kinds of aggregates that need to be stream processing can work together solving most of the
produced from the stored data. This is a general overview of modern Big Data problems. The batch Big Data processing is
the batch processing, which can be easily implemented in a well known for the most of analytic tasks like segmentation,
distributed processing clusters like YARN[17] or Mesos[18]. ad-hoc queries for data exploration, etc. However, the general
The same events from the incoming data are processed by movement in Big Data processing is towards more responsive,
the speed layer. Here, a Stream Processing Agent is the first real-time systems where results are available near online. The
point of contact. The Processing Agent routes every event to lambda architecture and the proposed multi-agent solution
appropriate Real-Time Worker Agent where the actual work is trying to cope with a real-time and batch processing in
is done. The result are represented by the Real-Time Views a reasonable fashion. The proposed MAS approach can by
which are the online updated aggregates. Due to the speed applied whenever the Big Data analysis need quick response

335

Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21

time and the query results scheduled by batch windows are
not enough.
One of the attractive new areas, where the support of Big
Data processing can results in a better customer experiences,
are recommendation engines. For a long period of time Internet
search engines were used for finding products, media content
and any other kind of interesting information. However, in a
still growing information flood, environment tools for better
user profiling, web personalization and new product recom-
mendation are required. Internet search engines have ceased
to be convenient tools for most users. The recommendation Figure 3. Recommendation system agent-lambda architecture
systems, instead of forcing users for active exploration, are
trying proactively to suggest new interesting content. The
better user is profiled in the recommendation system - the solution can be used. Akka is one of the most popular actor
better content is targeted. Therefore Big Data solutions comes system framework prepared for java and scala language. The
handy. actor model has been designed for parallel computation based
Collaborative filtering is one of the most widely adopted and on multi-agent systems paradigm [24].
successful recommendation methods[19]. Unlike other recom- In the presented recommendation system all communication
mendation approaches no expert knowledge and explicit rules with the external systems can be done using JSON-REST
are needed. System is characterizing users and recommended endpoints. All services should be implemented as an agents.
items implicitly from the previous interactions - events. Thus The data about items, user and new events are collected
it is becoming a good choice for e-commerce[20] systems and respectively by: Item Service Actor, User Service Actor and
online website personalization[21]. Popularity of collaborative Event Service Actor. The Akka framework supports full asyn-
filtering methods has led to new algorithms and methods chronous communication between actors and easy message
for representing user preferences, like matrix factorization passing pattern. For REST services we proposed a spray.io
techniques[3]. server. The server is build on top of Akka and simplifies
In the collaborative filtering algorithms the more data is handling HTTP protocol in actor environment. After receiving
gathered about the users interaction, the more precisely user data on HTTP endpoint, messages are passed to Receiver
can be profiled and better recommendation can be. Users Actor which stores data stream in simple queue system[25]
using web and mobile application leave trace of theirs action in common format (e.g. Avro). A new data is then send to
- events. The simplest form of user events for the web site is a both: batch and speed layer.
click-stream. Collecting data from a click-stream is the most The batch layer is mainly Hadoop ecosystem. New data are
widely used method for web mining and analysis techniques in handled by Archiver Actor which stores data to HDFS - main
Big Data projects. From the collected users event collaborative offline data storage. The data saved in HDFS is processed
filtering algorithms can implicitly [22] predict user preferences by the batch framework in order to produce Batch Model.
based on the others users interactions. For the collaborative filtering recommendation, batch model
Collecting large amount of users events as the online stream consists of a prepared user rating model and calculated item-
is the first difficulty to cope with in collaborative filtering to-item similarity matrix. This Batch Model is produced in
for the large scale application. After the data is gathered, offline using YARN computation cluster on which a Spark[26]
user preferences should be prepared for serving online recom- jobs are run. The Spark framework for a communication and
mendation. Events generated by the users are converted into coordination of its Driver and Workers using as well the Akka
the rating model for specific recommendation items. Then, actor system framework. The Batch Model could be stored in
such prepared data can be used for collaborative filtering Cassandra NoSQL database for further processing and fast
offline model generation. For the item-to-item version of the access.
algorithm this operation applies to preparation of an item In the speed layer, new data is received by the Akka actors
similarity matrix. in the Spark Streaming framework [27]. An architecture of
For building large scale online recommendation system with the Spark Streaming is very similar to its offline equivalent.
Big Data analytic stack, the multi-agent architecture presented The Driver as master coordinator as well as Workers actors
in previous section can be used. In the recommendation exists. But in each worker there is a receiver actor of new
systems new data stream include: items to recommend, users data. This allows to easily scale the processing. All is run on
data and all events generated by users e.g. explicit item rating top of Mesos cluster[18]. Online streaming processing produce
or page view. The problem to solve for the system is to Online Model: an up-to-date user preferences (based on user
provide best online recommendation in a timely manner for events) and new items with rating. Here, the same rating model
every user. A simplified MAS architecture for collaborative as in the batch layer should be used. Online Model is held in
filtering recommendation system is presented on diagram 3. As memory databases e.g. Redis.
a framework for agent based system implementation Akka[23] The data computed by batch and speed layer is prepared in

336

Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21

order to handle a new recommendation requests. The request in                    [14] Y. Zhu and D. Shasha, “Statstream: Statistical monitoring of
                                                                                      thousands of data streams in real time,” in Proceedings of the 28th
JSON-REST format is handled by a Recommendation Service                               International Conference on Very Large Data Bases, ser. VLDB
Actor on spray.io server. Just like any other request to the                          ’02. VLDB Endowment, 2002, pp. 358–369. [Online]. Available:
system. But for this requests processing is somehow different.                        http://dl.acm.org/citation.cfm?id=1287369.1287401
                                                                                 [15] H. H. et al., “Starﬁsh: A self-tuning system for big data analytics,” ser.
The request is being passed to a Recommendation Engine                                CIDR2011, 2011.
where a Recommender Actor is created. Its responsibility is                      [16] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop
to carry the process of recommendation. The Recommender                               distributed ﬁle system,” in MSST, 2010, pp. 1–10.
                                                                                 [17] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar,
Actor is collecting necessary data from Batch Model Actor and                         R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino,
Online Model Actor. Based on request data context, precalcu-                          O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache
lated item similarities and online updated user preferences new                       hadoop yarn: Yet another resource negotiator,” in Proceedings of
                                                                                      the 4th Annual Symposium on Cloud Computing, ser. SOCC ’13.
recommendation are generated for the user. All process ends                           New York, NY, USA: ACM, 2013, pp. 5:1–5:16. [Online]. Available:
when JSON response is send with a set of new recommended                              http://doi.acm.org/10.1145/2523616.2523633
items to the user.                                                               [18] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D.
                                                                                      Joseph, R. Katz, S. Shenker, and I. Stoica, “Mesos: a
                                                                                      platform for ﬁne-grained resource sharing in the data center,”
                          VI. C ONCLUSION                                             Proceedings of the 8th USENIX conference on Networked systems
                                                                                      design and implementation, p. 22, 2011. [Online]. Available:
   The paper presents a multi-agent architecture for processing                       http://dl.acm.org/citation.cfm?id=1972457.1972488
of Big Data. The approach is based on the lambda architecture                    [19] J. Bobadilla, F. Ortega, a. Hernando, and a. Gutiérrez,
                                                                                      “Recommender systems survey,” Knowledge-Based Systems,
proposed before. It has been shown how autonomous agents                              vol. 46, pp. 109–132, Jul. 2013. [Online]. Available:
can enhance the architecture and provide capabilities for robust                      http://linkinghub.elsevier.com/retrieve/pii/S0950705113001044
processing of data in real-time.                                                 [20] Z. H. Z. Huang, D. Zeng, and H. C. H. Chen, “A Comparison of
                                                                                      Collaborative-Filtering Recommendation Algorithms for E-commerce,”
   Possible applications of the proposed architecture to recom-                       IEEE Intelligent Systems, vol. 22, 2007.
mender systems have been discussed. The article argues that                      [21] A. Das, M. Datar, A. Garg, and S. Rajaram, “Google news
                                                                                      personalization: scalable online collaborative ﬁltering,” Proceedings of
the approach is very well suited for such a task, where on-line                       the 16th international conference on, pp. 271–280, 2007. [Online].
events from users visiting an online site have to be combined                         Available: http://portal.acm.org/citation.cfm?id=1242610
with the ofﬂine data on users and items from the past.                           [22] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative Filtering for
                                                                                      Implicit Feedback Datasets,” in Proceedings of the 2008 Eighth IEEE
                                                                                      International Conference on Data Mining, ser. ICDM ’08, vol. 0.
                             R EFERENCES                                              Washington, DC, USA: IEEE Computer Society, Dec. 2008, pp.
                                                                                      263–272. [Online]. Available: http://dx.doi.org/10.1109/icdm.2008.22
 [1] J. Dean and S. Ghemawat, “MapReduce : Simpliﬁed Data Processing on          [23] [Online]. Available: http://akka.io/
     Large Clusters,” Communications of the ACM, vol. 51, pp. 1–13, 2008.        [24] C. Hewitt, “Actor Model for Discretionary, Adaptive Concurrency,”
     [Online]. Available: http://portal.acm.org/citation.cfm?id=1327492               1008.1459, 2010. [Online]. Available: http://arxiv.org/abs/1008.1459
 [2] E. Brewer, “Towards robust distributed systems. (podc invited talk),”       [25] J. Kreps, N. Narkhede, and J. Rao, “Kafka: A distributed messaging
     2000.                                                                            system for log processing,” in Proceedings of the NetDB, 2011.
 [3] Y. Koren, R. Bell, and C. Volinsky, “Matrix Factorization Techniques        [26] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica,
     for Recommender Systems,” Computer, vol. 42, 2009.                               “Spark: cluster computing with working sets,” in Proceedings of the 2nd
 [4] L. Manovich, “Trending: the promises and the challenges of big social            USENIX conference on Hot topics in cloud computing, 2010, pp. 10–10.
     data,” In Debates in the Digital Humanities, 2011.                          [27] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica, “Discretized
 [5] H. Chen, R. Chiang, and V. Storey, “Business intelligence and analytics:         Streams: An Efﬁcient and Fault-tolerant Model for Stream Processing
     From big data to big impact,” MIS Quarterly, vol. 36, pp. 1165–1188,             on Large Clusters,” in Proceedings of the 4th USENIX Conference
     2012.                                                                            on Hot Topics in Cloud Ccomputing, ser. HotCloud’12. Berkeley,
 [6] L. Cao, G. Weiss, and P. Yu, “A brief introduction to agent mining,”             CA, USA: USENIX Association, 2012, p. 10. [Online]. Available:
     Autonomous Agent Multi-Agent Systems, vol. 25, pp. 419–424, 2012.                http://portal.acm.org/citation.cfm?id=2342773
 [7] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed
     Stream Computing Platform,” 2010 IEEE International Conference on
     Data Mining Workshops, pp. 170–177, 2010.
 [8] D. e. a. Jannach, Recommender Systems, An Introduction. Cambridge
     University Press, 2011.
 [9] Y. Blanco-fern, J. Pazos-arias, A. Gil-solla, M. Ramos-cabrer, J. Garc,
     A. Fern, and P. D. Rebeca, “AVATAR : An Advanced Multi-agent
     Recommender System of Personalized TV Contents by Semantic Rea-
     soning,” pp. 415–421, 2004.
[10] J. Carbo and J. M. Molina, “Agent-based collaborative ﬁltering based
     on fuzzy recommendations,” International Journal of Web Engineering
     and Technology, vol. 1, no. 4, p. 414, 2004. [Online]. Available:
     http://www.inderscience.com/link.php?id=6267
[11] J. Gómez-Sanz, J. Pavón, and A. Díaz-Carrasco, “The PSI3 agent
     recommender system,” Web Engineering, pp. 30–39, 2003. [Online].
     Available: http://link.springer.com/chapter/10.1007/3-540-45068-8_5
[12] Nathan Marz and James Warren, “Big data - Principles and best practices
     of scalable realtime data systems (Chapter 1),” 2014.
[13] M. Stonebraker, U. Çetintemel, and S. Zdonik, “The 8
     requirements of real-time stream processing,” SIGMOD Rec.,
     vol. 34, no. 4, pp. 42–47, Dec. 2005. [Online]. Available:
     http://doi.acm.org/10.1145/1107499.1107504

                                                                           337

         Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21

You can also read