Multi-agent architecture for real-time Big Data processing
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) Multi-agent architecture for real-time Big Data processing Bartłomiej Twardowski Dominik Ryżko Institute of Computer Science Institute of Computer Science Warsaw Universtity of Technology Warsaw Universtity of Technology Allegro Group Allegro Group Email: B.Twardowski@ii.pw.edu.pl Email: d.ryzko@ii.pw.edu.pl Abstract—The paper describes the architecture for processing II. M OTIVATION AND RELATED WORK of Big Data in real-time based on multi-agent system paradigms. The overall approach to processing of offline and online data is The work was motivated by the study on applications of presented. Possible applications of the architecture in the area Big Data processing in Allegro.pl, the largest e-commerce of recommendation system is shown, however it is argued the marketplace platform in Central Europe. The main challenge approach is general purpose. was to create an architecture, which would allow efficient processing of various data stored in the long term storage as I. I NTRODUCTION well as captured in real-time in order to serve the results of such processing to the on-line users. Big Data is one of the most popular research topics nowa- days. With the advances in parallel computation architectures A. Big data more and more instances of the algorithms can be run con- There has already been a lot of work done in the area of currently. Systems such as Hadoop allow storage of huge Big Data applications for Business Intelligence and analytic in ammounts of unstructured information and processing them particular in the e-commerce area [5]. Large vendors both in with algorithms such as MapReduce [1], optimized for parallel America (Amazon, eBay) as well as in Europe (Allegro) and and asynchronous computations. other continents were able to build highly scalable platforms Also in the database systems we observe the shift of as well as analytical capabilities around them. Furthermore paradigms. In most practical applications of Big Data the tradi- they analyze social content generated around to get even better tional properties of ACID (Atomicity, Consistency, Isolation, insight into the client experience. Durability) are no longer applicable. As the CAP Theorem An important paradigm which is more and more present in states [2], we can assure at the same time only two out of the the Big Data analysis is the concept of autonomous, intelligent following three properties namely: and proactive agents. In the context of this work the notion of agent mining is especially interesting [6]. It combines • Consistency (all nodes see the same data at the same methodologies, technology, tools and systems from the area of time) multi-agent technology, data mining and knowledge discovery, • Availability (a guarantee that every request receives a machine learning, statistics and semantic web in order to response about whether it was successful or if it has address issues that cannot be tackled by any single technique failed) with the same quality and performance. • Partition tolerance (the system continues to operate de- The multi-agent approach can also be applied to streaming spite arbitrary message loss or failure of part of the data processing in the actors model. In [7] a S4 platform is system) presented, which allows distributed processing of the streams There are several important applications of Big Data, which of data. The authors show how this can be applied to tune require new approaches in order to provide robust and usable parameters in dynamic systems. solutions. Among the most popular there are: recommender systems [3], social network analysis [4], web mining, algorith- B. Recommender systems mic trading, fraud detection etc. The diversity of these topics The work on recommender systems has been a popular shows the importance of Big Data and imposes challenges on topic since the beginning of large commercial services on the the tools and architectures, which have to be applicable in Internet. Several approaches to this subject have been proposed various scenarios. [8] with Collaborative Filtering techniques being the most This paper shows an architecture, which is designed for popular. Other approaches include content-based, knowledge- different implementations in which both real-time and offline based and hybrid approaches. Big Data processing is necessary. As an example of application Multi-agent systems have been often employed to the task of a recommender system is shown. building recommender systems. In [9] a recommender system 978-1-4799-4143-8/14 $31.00 © 2014 IEEE 333 DOI 10.1109/WI-IAT.2014.185 Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21
can be used together with batch veiws for the complete view of the data. B. Other approaches Several other approaches to processing of Big Data in real-time have also been proposed. In [13] the following 8 requirements for real-time data processing are proposed: • keep the data moving • query using SQL on Streams (StreamSQL) • handle stream imperfections Figure 1. The lambda architecture • generate predictable Outcomes • integrate stored and streaming data • guarantee data safety and availability of personalized TV contents, named AVATAR, is presented. • partition and scale applications automatically A collaborative approach to recommendations combined with • process and respond instantaneously intelligent agents can be found in [10], [11]. Also other approaches to the subject of real-time or near III. R EAL - TIME PROCESSING WITH B IG DATA real-time processing of Big Data can be found in the literature One of the main challenges, with respect to processing [14], [15]. of very large sets of data, is handling of the real-time data IV. A RCHITECTURE FOR MULTI - AGENT B IG DATA streams. While both offline and online parts of data can ANALYSIS be process independently, very often we need to provide Presented approach for processing Big Data in a real-time answers to questions regarding online events based on the past manner is trying to solve the main problem of analyzing history. The lambda architecture comes as an answer to these large, continuously growing data sets. Lambda architecture is challenges. one of the newest method which has lately gained a lot of A. The Lambda Architecture popularity, mainly because of its simplicity and use of a well- known tools for processing the data. Three easily recognizable The lambda architecture has been proposed by Nathan Marz layers for batch, speed and service is making clear division [12]. It is based on several assumptions: of components functionality. Moreover for each of this layer, • fault tolerance there is wide selection of solutions for implementation to • support of ad-hoc queries choose. Most of them are available on market for years and • scalability are known for the reliability. • extensibility Despite the fact, that the recalled architecture is presented The architecture as shown in the Figure 1 is composed of as a simple and a clear one, still many decisions and lots the following components: of integration work needs to be done. The architecture gives • Batch Layer - responsible for managing the master dataset only guidelines on how the system should be designed and and for precomputation of batch views which parts should it consist of. This gives freedom in the • Serving layer - indexes the batch views for fast ad-hoc choice of existing solutions for a specific job. Still, interaction queries between batch, service and speed layers has to be handled • Speed Layer - serves only the new data, which has not properly. Furthermore, even in a single layer a few components yet been processed by the batch Layer must interact together using different protocols and methods The Batch layer can be implemented with the use of of communication. In the Big Data environments these means, systems such as Hadoop. It is responsible for storing the that we have to cope with the integration of distributed data imputable master data set. Furthermore, with the use of the processing systems where the scalability and reliability should MapReduce algorithms it contiguously computes views of this be taken into account. data available to the various applications. The lambda architecture for processing Big Data can be The Serving layer is responsible for serving the views modeled as a heterogeneous multi-agent environment. Tree computed by the batch layer. This process can be facilitated distinct layers with different characteristics exists, between by additional indexing of the data in order to speed up the which multiple components have to interact with each other. reads. An example of a technology used typically to do this This communication can be simplified using the multi-agent job is Impala, which is easily integrated with Hadoop used in system approach. Each agent is responsible for specific task the Batch layer. in data processing, e.g.: receiving data, aggregating result, etc. Finally, the role of the Speed layer is to compute in real-time Agents are autonomous and distributed. Cooperation between the data that has just arrived and has not yet been processed agents is done using message passing. All agent are com- by the batch layer. It serves this data in the form of real-time municating in the same manner. Therefore, the integration is views, which are incremented the as the new data arrives and simplified. 334 Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21
consideration, this views are fast in-memory data sets prepared for the quick online access. Both the Batch Views and the Real-Time Views are created for a specific use case. This use case problem is solved in the serving layer. Request from the outside world is handled by a dedicated Service Agent. The particular problem e.g. a new bank credit decision, is solved by the appropriate types of agent. For every new request the Service Agent is created. To solve the given problem and prepare the response, Service Agent is collecting necessary data. Historical data are provided by a precomputed Batch Views. To access this data a Batch Aggregator Agent is used. This agent query appropriate Batch Views. Figure 2. Architecture for multi-agent big-data processing A similar processing is done to collect fresh online data. Real-Time Aggregator Agent is preparing data from Real- Time Views data sets. Both batch and real-time views are The Figure 2 presents the lambda architecture for Big Data combined together to present the whole picture of the data. processing using multi-agent systems. There are no changes After collecting all required data from the aggregators agents to the main concept. In MAS approach there are still batch, the response is created. In this point, when request is served speed and serving layers. Where the batch layer is creating and response is send back to the client, a life-cycle of Service aggregates - Batch Views - from the all data. The speed layer Agent ends. is only incremental - Real-Time Views - for new, non-archived Depending on an infrastructure and the system domain, the data. The service layer which brings both offline and online online layer can hold views for the different time periods. It calculated data (views) for solving specified problems, e.g. can vary from days to seconds. The main idea behind the analytic query, new credit decision, music recommendation batch and the speed layers is to work together to present etc. a coherent view for the data at the service layer. The first The input data is processed by the system as a data stream. thing which can be noticed in the proposed MAS architecture Depending on the domain, it can be a stream of: page views, is that all communication in the system is only between the user transactions, system logs, diagnostic events etc. Stream agents. Every single task presented in the lambda architecture - as data series - is collected by the Stream Receiver Agent. in previous section is encapsulated inside an autonomous This agent is responsible for simple data pre-processing like: agent. This results in simplified integration and distributed filtering, data format change, serialization to objects etc. After computation. that, every data event in the stream is passed to the Archiver Moreover, in the presented approach it is strongly rec- Agent and the Stream Processing Agent. Both of these agents ommended to use the same event representation in a both are in control of handling new data in the batch and the speed processing: batch and online. Despite the differences in the layer respectively. infrastructure a data schema can be the same (for lambda the In the batch processing the new data is first written to the most common ones are Hadoop for batch and Storm for online Data Store, e.g. Hadoop Distributed File System [16]. The processing). What is more, when agents are fine grained and Data Store has to handle large data sets and store all events designed for a single responsibility then it can be reused in in the system. Holding every single event from data stream speed and batch layer. For example: the same calculation done allows to run computation for a selected time period from the in Batch Worker Agent and Real-Time Worker Agent can be history. The computations are coordinated by the Batch Driver done by the same agent implementation. Agent. This agent is created for running specific jobs, while the actual work is done by its slave agents - Batch Worker V. A PPLICATIONS Agents. Each worker agent processes part of the data, in order to successfully produce the output of the job - Batch Views. The lambda architecture is showing how the batch and The Batch Views are all kinds of aggregates that need to be stream processing can work together solving most of the produced from the stored data. This is a general overview of modern Big Data problems. The batch Big Data processing is the batch processing, which can be easily implemented in a well known for the most of analytic tasks like segmentation, distributed processing clusters like YARN[17] or Mesos[18]. ad-hoc queries for data exploration, etc. However, the general The same events from the incoming data are processed by movement in Big Data processing is towards more responsive, the speed layer. Here, a Stream Processing Agent is the first real-time systems where results are available near online. The point of contact. The Processing Agent routes every event to lambda architecture and the proposed multi-agent solution appropriate Real-Time Worker Agent where the actual work is trying to cope with a real-time and batch processing in is done. The result are represented by the Real-Time Views a reasonable fashion. The proposed MAS approach can by which are the online updated aggregates. Due to the speed applied whenever the Big Data analysis need quick response 335 Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21
time and the query results scheduled by batch windows are not enough. One of the attractive new areas, where the support of Big Data processing can results in a better customer experiences, are recommendation engines. For a long period of time Internet search engines were used for finding products, media content and any other kind of interesting information. However, in a still growing information flood, environment tools for better user profiling, web personalization and new product recom- mendation are required. Internet search engines have ceased to be convenient tools for most users. The recommendation Figure 3. Recommendation system agent-lambda architecture systems, instead of forcing users for active exploration, are trying proactively to suggest new interesting content. The better user is profiled in the recommendation system - the solution can be used. Akka is one of the most popular actor better content is targeted. Therefore Big Data solutions comes system framework prepared for java and scala language. The handy. actor model has been designed for parallel computation based Collaborative filtering is one of the most widely adopted and on multi-agent systems paradigm [24]. successful recommendation methods[19]. Unlike other recom- In the presented recommendation system all communication mendation approaches no expert knowledge and explicit rules with the external systems can be done using JSON-REST are needed. System is characterizing users and recommended endpoints. All services should be implemented as an agents. items implicitly from the previous interactions - events. Thus The data about items, user and new events are collected it is becoming a good choice for e-commerce[20] systems and respectively by: Item Service Actor, User Service Actor and online website personalization[21]. Popularity of collaborative Event Service Actor. The Akka framework supports full asyn- filtering methods has led to new algorithms and methods chronous communication between actors and easy message for representing user preferences, like matrix factorization passing pattern. For REST services we proposed a spray.io techniques[3]. server. The server is build on top of Akka and simplifies In the collaborative filtering algorithms the more data is handling HTTP protocol in actor environment. After receiving gathered about the users interaction, the more precisely user data on HTTP endpoint, messages are passed to Receiver can be profiled and better recommendation can be. Users Actor which stores data stream in simple queue system[25] using web and mobile application leave trace of theirs action in common format (e.g. Avro). A new data is then send to - events. The simplest form of user events for the web site is a both: batch and speed layer. click-stream. Collecting data from a click-stream is the most The batch layer is mainly Hadoop ecosystem. New data are widely used method for web mining and analysis techniques in handled by Archiver Actor which stores data to HDFS - main Big Data projects. From the collected users event collaborative offline data storage. The data saved in HDFS is processed filtering algorithms can implicitly [22] predict user preferences by the batch framework in order to produce Batch Model. based on the others users interactions. For the collaborative filtering recommendation, batch model Collecting large amount of users events as the online stream consists of a prepared user rating model and calculated item- is the first difficulty to cope with in collaborative filtering to-item similarity matrix. This Batch Model is produced in for the large scale application. After the data is gathered, offline using YARN computation cluster on which a Spark[26] user preferences should be prepared for serving online recom- jobs are run. The Spark framework for a communication and mendation. Events generated by the users are converted into coordination of its Driver and Workers using as well the Akka the rating model for specific recommendation items. Then, actor system framework. The Batch Model could be stored in such prepared data can be used for collaborative filtering Cassandra NoSQL database for further processing and fast offline model generation. For the item-to-item version of the access. algorithm this operation applies to preparation of an item In the speed layer, new data is received by the Akka actors similarity matrix. in the Spark Streaming framework [27]. An architecture of For building large scale online recommendation system with the Spark Streaming is very similar to its offline equivalent. Big Data analytic stack, the multi-agent architecture presented The Driver as master coordinator as well as Workers actors in previous section can be used. In the recommendation exists. But in each worker there is a receiver actor of new systems new data stream include: items to recommend, users data. This allows to easily scale the processing. All is run on data and all events generated by users e.g. explicit item rating top of Mesos cluster[18]. Online streaming processing produce or page view. The problem to solve for the system is to Online Model: an up-to-date user preferences (based on user provide best online recommendation in a timely manner for events) and new items with rating. Here, the same rating model every user. A simplified MAS architecture for collaborative as in the batch layer should be used. Online Model is held in filtering recommendation system is presented on diagram 3. As memory databases e.g. Redis. a framework for agent based system implementation Akka[23] The data computed by batch and speed layer is prepared in 336 Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21
order to handle a new recommendation requests. The request in [14] Y. Zhu and D. Shasha, “Statstream: Statistical monitoring of thousands of data streams in real time,” in Proceedings of the 28th JSON-REST format is handled by a Recommendation Service International Conference on Very Large Data Bases, ser. VLDB Actor on spray.io server. Just like any other request to the ’02. VLDB Endowment, 2002, pp. 358–369. [Online]. Available: system. But for this requests processing is somehow different. http://dl.acm.org/citation.cfm?id=1287369.1287401 [15] H. H. et al., “Starfish: A self-tuning system for big data analytics,” ser. The request is being passed to a Recommendation Engine CIDR2011, 2011. where a Recommender Actor is created. Its responsibility is [16] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop to carry the process of recommendation. The Recommender distributed file system,” in MSST, 2010, pp. 1–10. [17] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, Actor is collecting necessary data from Batch Model Actor and R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, Online Model Actor. Based on request data context, precalcu- O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache lated item similarities and online updated user preferences new hadoop yarn: Yet another resource negotiator,” in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC ’13. recommendation are generated for the user. All process ends New York, NY, USA: ACM, 2013, pp. 5:1–5:16. [Online]. Available: when JSON response is send with a set of new recommended http://doi.acm.org/10.1145/2523616.2523633 items to the user. [18] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, “Mesos: a platform for fine-grained resource sharing in the data center,” VI. C ONCLUSION Proceedings of the 8th USENIX conference on Networked systems design and implementation, p. 22, 2011. [Online]. Available: The paper presents a multi-agent architecture for processing http://dl.acm.org/citation.cfm?id=1972457.1972488 of Big Data. The approach is based on the lambda architecture [19] J. Bobadilla, F. Ortega, a. Hernando, and a. Gutiérrez, “Recommender systems survey,” Knowledge-Based Systems, proposed before. It has been shown how autonomous agents vol. 46, pp. 109–132, Jul. 2013. [Online]. Available: can enhance the architecture and provide capabilities for robust http://linkinghub.elsevier.com/retrieve/pii/S0950705113001044 processing of data in real-time. [20] Z. H. Z. Huang, D. Zeng, and H. C. H. Chen, “A Comparison of Collaborative-Filtering Recommendation Algorithms for E-commerce,” Possible applications of the proposed architecture to recom- IEEE Intelligent Systems, vol. 22, 2007. mender systems have been discussed. The article argues that [21] A. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: scalable online collaborative filtering,” Proceedings of the approach is very well suited for such a task, where on-line the 16th international conference on, pp. 271–280, 2007. [Online]. events from users visiting an online site have to be combined Available: http://portal.acm.org/citation.cfm?id=1242610 with the offline data on users and items from the past. [22] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative Filtering for Implicit Feedback Datasets,” in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ser. ICDM ’08, vol. 0. R EFERENCES Washington, DC, USA: IEEE Computer Society, Dec. 2008, pp. 263–272. [Online]. Available: http://dx.doi.org/10.1109/icdm.2008.22 [1] J. Dean and S. Ghemawat, “MapReduce : Simplified Data Processing on [23] [Online]. Available: http://akka.io/ Large Clusters,” Communications of the ACM, vol. 51, pp. 1–13, 2008. [24] C. Hewitt, “Actor Model for Discretionary, Adaptive Concurrency,” [Online]. Available: http://portal.acm.org/citation.cfm?id=1327492 1008.1459, 2010. [Online]. Available: http://arxiv.org/abs/1008.1459 [2] E. Brewer, “Towards robust distributed systems. (podc invited talk),” [25] J. Kreps, N. Narkhede, and J. Rao, “Kafka: A distributed messaging 2000. system for log processing,” in Proceedings of the NetDB, 2011. [3] Y. Koren, R. Bell, and C. Volinsky, “Matrix Factorization Techniques [26] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, for Recommender Systems,” Computer, vol. 42, 2009. “Spark: cluster computing with working sets,” in Proceedings of the 2nd [4] L. Manovich, “Trending: the promises and the challenges of big social USENIX conference on Hot topics in cloud computing, 2010, pp. 10–10. data,” In Debates in the Digital Humanities, 2011. [27] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica, “Discretized [5] H. Chen, R. Chiang, and V. Storey, “Business intelligence and analytics: Streams: An Efficient and Fault-tolerant Model for Stream Processing From big data to big impact,” MIS Quarterly, vol. 36, pp. 1165–1188, on Large Clusters,” in Proceedings of the 4th USENIX Conference 2012. on Hot Topics in Cloud Ccomputing, ser. HotCloud’12. Berkeley, [6] L. Cao, G. Weiss, and P. Yu, “A brief introduction to agent mining,” CA, USA: USENIX Association, 2012, p. 10. [Online]. Available: Autonomous Agent Multi-Agent Systems, vol. 25, pp. 419–424, 2012. http://portal.acm.org/citation.cfm?id=2342773 [7] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed Stream Computing Platform,” 2010 IEEE International Conference on Data Mining Workshops, pp. 170–177, 2010. [8] D. e. a. Jannach, Recommender Systems, An Introduction. Cambridge University Press, 2011. [9] Y. Blanco-fern, J. Pazos-arias, A. Gil-solla, M. Ramos-cabrer, J. Garc, A. Fern, and P. D. Rebeca, “AVATAR : An Advanced Multi-agent Recommender System of Personalized TV Contents by Semantic Rea- soning,” pp. 415–421, 2004. [10] J. Carbo and J. M. Molina, “Agent-based collaborative filtering based on fuzzy recommendations,” International Journal of Web Engineering and Technology, vol. 1, no. 4, p. 414, 2004. [Online]. Available: http://www.inderscience.com/link.php?id=6267 [11] J. Gómez-Sanz, J. Pavón, and A. Díaz-Carrasco, “The PSI3 agent recommender system,” Web Engineering, pp. 30–39, 2003. [Online]. Available: http://link.springer.com/chapter/10.1007/3-540-45068-8_5 [12] Nathan Marz and James Warren, “Big data - Principles and best practices of scalable realtime data systems (Chapter 1),” 2014. [13] M. Stonebraker, U. Çetintemel, and S. Zdonik, “The 8 requirements of real-time stream processing,” SIGMOD Rec., vol. 34, no. 4, pp. 42–47, Dec. 2005. [Online]. Available: http://doi.acm.org/10.1145/1107499.1107504 337 Pobrano z http://repo.pw.edu.pl / Downloaded from Repository of Warsaw University of Technology 2021-11-21
You can also read