Real Time Processing
Karthik Ramasamy, Streamlio
Stanford Platform Lab
Real Time Connected World

Internet of Things: 30B connected devices by 2020.
Connected Vehicles: data transferred per vehicle per month growing from 4 MB to 5 GB.
Health Care: 153 exabytes (2013) growing to 2,314 exabytes (2020).
Digital Assistants (predictive analytics): $2B (2012) to $6.5B (2019) [1]; Siri/Cortana/Google Now.
Machine Data: 40% of the digital universe by 2020.
Augmented/Virtual Reality: $150B by 2020 [2]; Oculus/HoloLens/Magic Leap.

[1] http://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/digital-assistants-trends.html
[2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/
Why Real Time?

Real time trends: emerging breakout trends on Twitter (in the form of #hashtags).
Real time conversations: real time sports conversations related to a topic (a recent goal or touchdown).
Real time recommendations: product recommendations based on your behavior and profile.
Real time search: real time search over tweets.
Value of Data: It's Contextual

[Figure: value of data to decision-making versus information half-life. Time-critical decisions (seconds to minutes) enable preventive/predictive action; traditional "batch" business intelligence (hours to months) is actionable at best, then reactive, then merely historical. Courtesy Michael Franklin, BIRTE, 2015.]
What is Real-Time? It's Contextual

Batch (high throughput, > 1 hour): ad-hoc queries, monthly active users, relevance for ads.
Real time (approximate, 10 ms - 1 sec): ad impressions count, hash tag trends.
OLTP (latency sensitive, < 500 ms): deterministic workflows, fanout of Tweets, search for Tweets.
Real real time (low latency, < 1 ms): financial trading.
Real Time Analytics

Interactive: store data and provide results instantly when a query is posed.
Streaming: analyze data as it is being produced.
Real Time Use Cases

Online services (10s of ms): transaction logs, queues, client RPCs.
Real time (10-100s of ms): change propagation, streaming analytics.
Data for batch analytics (secs to mins): log aggregation, events.
Real Time Stack: Many Moving Parts

Components: data collectors, messaging, compute, storage.
Scribe

Open source log aggregation, originally from Facebook; Twitter made significant enhancements for real time event aggregation.
High throughput and scale: delivers 125M messages/min and provides tight SLAs on data reliability.
Runs on every machine: simple, very reliable, and efficient in its use of memory and CPU.
Event Bus & Distributed Log: Next Generation Messaging
Twitter Messaging

[Diagram: Kestrel queues back core business logic (tweets, fanouts, ...), deferred RPC, and search; Gizzard/MySQL serve the databases; Scribe events flow through Kafka into HDFS; BookKeeper provides the durable log layer.]
Kestrel Queue

Message queue | Multiple protocols | Highly scalable | Loosely ordered
Kestrel Queue

[Diagram: producers (P) and consumers (C) attached to Kestrel queues, with ZooKeeper (zk) for coordination.]
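Kestrel's "multiple protocols" include the memcache text protocol, so any memcache client can enqueue and dequeue items: a SET on a queue name enqueues, a GET dequeues. A minimal sketch using the spymemcached client, assuming a local Kestrel instance on its default memcache port (22133); the queue name and payload are illustrative:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class KestrelExample {
    public static void main(String[] args) throws Exception {
        // Kestrel speaks the memcache text protocol (assumption: a local
        // dev instance listening on the default memcache port 22133).
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("localhost", 22133));

        // SET on a queue name enqueues an item (expiration 0 = never expire).
        client.set("events", 0, "user:123 clicked").get();

        // GET on the same name dequeues the next item, or null if empty.
        Object item = client.get("events");
        System.out.println("dequeued: " + item);

        client.shutdown();
    }
}
```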
Kestrel Limitations

Durability is hard to achieve.
Adding subscribers is expensive.
Read-behind degrades performance: too many random I/Os.
Scales poorly as the number of queues increases.
No cross-DC replication.
Kafka Queue

Client state management | Consumer scalability | High throughput | Simple for operations
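For contrast with Kestrel's server-side queues, publishing to Kafka appends to a partitioned, broker-managed log. A minimal producer sketch with the standard Java client; the broker address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumption: a broker running locally on the default port.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key land in the same partition,
            // which preserves per-key ordering.
            producer.send(new ProducerRecord<>("events", "user:123", "clicked"));
        }
    }
}
```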
Kafka Limitations

Relies on the file system page cache: performance degrades when subscribers fall behind (too much random I/O).
Scaling to many topics.
Loss of data.
No centralized operational stats.
Rethinking Messaging

Unified stack: tradeoffs for various workloads.
Durable writes, intra-cluster and geo-replication.
Multi-tenancy.
Scale resources independently.
Cost efficiency.
Ease of manageability.
Event Bus: Pub-Sub

[Diagram: publishers send writes through a write proxy into the distributed log; subscribers read through a read proxy; metadata is stored alongside the log.]
Distributed Log

[Diagram: publishers write through a write proxy onto a BookKeeper ensemble (BK); subscribers read through a read proxy; metadata is kept in ZooKeeper (ZK).]
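A sketch of the write path against the open-source DistributedLog core library. Class and method names follow the project's basic tutorials (the com.twitter.distributedlog packages of the 0.3/0.4-era releases), so treat the exact signatures as assumptions to check against your release; the ZooKeeper URI and stream name are placeholders:

```java
import java.net.URI;
import com.twitter.distributedlog.DistributedLogConfiguration;
import com.twitter.distributedlog.DistributedLogManager;
import com.twitter.distributedlog.LogRecord;
import com.twitter.distributedlog.LogWriter;
import com.twitter.distributedlog.namespace.DistributedLogNamespace;
import com.twitter.distributedlog.namespace.DistributedLogNamespaceBuilder;

public class DistributedLogExample {
    public static void main(String[] args) throws Exception {
        DistributedLogConfiguration conf = new DistributedLogConfiguration();
        // Assumption: a local ZooKeeper plus a bound DL namespace.
        URI uri = URI.create("distributedlog://localhost:2181/messaging/my-namespace");

        DistributedLogNamespace namespace = DistributedLogNamespaceBuilder.newBuilder()
                .conf(conf)
                .uri(uri)
                .build();

        // A DistributedLogManager handles a single named log (stream).
        DistributedLogManager dlm = namespace.openLog("click-events");
        LogWriter writer = dlm.startLogSegmentNonPartitioned();

        // Records carry an application-assigned, monotonically increasing txid.
        long txid = 1L;
        writer.write(new LogRecord(txid, "user:123 clicked".getBytes("UTF-8")));

        writer.close();
        dlm.close();
        namespace.close();
    }
}
```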
Distributed Log @Twitter

Use cases: Manhattan key-value store, durable deferred RPC, real time search indexing, pub-sub system, globally replicated log.
Distributed Log @Twitter

400 TB/day in | 2 trillion events/day processed | 20 PB/day out | 5-10 ms latency
Twitter Heron: Next Generation Streaming Engine
Twitter Heron: A Better Storm

Container-based architecture.
Separate monitoring and scheduling.
Simplified execution model.
Much better performance.
Twitter Heron: Design Goals

Fully API compatible with Storm: directed acyclic graphs of topologies, spouts, and bolts.
Task isolation: ease of debuggability, isolation, and profiling.
Use of mainstream languages: C++, Java, and Python.
Support for backpressure: topologies should be self-adjusting.
Batching of tuples: amortizes the cost of transferring tuples.
Efficiency: reduced resource consumption.
Twitter Heron

Guaranteed message passing | Horizontal scalability | Robust fault tolerance | Concise code, focus on logic
Heron Terminology

Topology: a directed acyclic graph; vertices = computation, edges = streams of data tuples.
Spouts: sources of data tuples for the topology. Examples: Kafka, Kestrel, MySQL, Postgres.
Bolts: process incoming tuples and emit outgoing tuples. Examples: filtering, aggregation, joins, any function.
Heron Topology

[Diagram: Spout 1 and Spout 2 feed Bolts 1-3, which in turn feed Bolts 4 and 5, forming a directed acyclic graph.]
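To make the terminology concrete, here is a minimal word-count topology written against the Storm API that Heron is compatible with (package names here are from Apache Storm 1.x; older code uses backtype.storm, and Heron ships the same classes in its compatibility jars). The spout, bolt, and stream names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountTopology {

    // Spout: a source of tuples; here, an endless stream of random words.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"real", "time", "stream", "heron"};
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(words[random.nextInt(words.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: processes incoming tuples; here it keeps running per-word counts.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            long count = counts.merge(word, 1L, Long::sum);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 2);
        // Route tuples carrying the same "word" to the same bolt task.
        builder.setBolt("counts", new CountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));
        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```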
Stream Groupings

Shuffle grouping: random distribution of tuples across tasks.
Fields grouping: group tuples by a field or multiple fields.
All grouping: replicate tuples to all tasks.
Global grouping: send the entire stream to one task.
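Concretely, each grouping is a single call on the Storm-compatible TopologyBuilder. A sketch reusing WordSpout and CountBolt from the previous example; the bolt choices are illustrative, only the wiring matters here:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExamples {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordCountTopology.WordSpout(), 2);

        // Shuffle: random, load-balanced distribution across the 4 tasks.
        builder.setBolt("shuffled", new WordCountTopology.CountBolt(), 4)
               .shuffleGrouping("words");

        // Fields: the same "word" value always reaches the same task,
        // which is what makes per-key aggregation correct.
        builder.setBolt("byWord", new WordCountTopology.CountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));

        // All: every task receives a copy (e.g., broadcasting configuration).
        builder.setBolt("broadcast", new WordCountTopology.CountBolt(), 4)
               .allGrouping("words");

        // Global: the entire stream converges on a single task.
        builder.setBolt("single", new WordCountTopology.CountBolt(), 1)
               .globalGrouping("words");

        // builder.createTopology() can then be submitted as in the previous sketch.
    }
}
```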
Heron Architecture: High Level

[Diagram: topologies (Topology 1 ... Topology N) are submitted to a scheduler, which launches each one on shared cluster infrastructure.]
Heron Architecture: Topology

[Diagram: a Topology Master owns the logical plan, physical plan, and execution state, synced through ZooKeeper; each container runs a Stream Manager, a Metrics Manager, and several Heron instances (I1-I4), with cluster metrics flowing out of the containers.]
Heron Stream Manager: Backpressure

[Diagram: a pipeline S1 → B2 → B3 → B4; a slow stage causes buffering upstream.]
Stream Manager: Backpressure

[Diagram: four containers, each with a Stream Manager mediating traffic between its local instances (S1, B2, B3, B4) and the other containers; when one instance slows down, its Stream Manager propagates backpressure to the rest.]
[Diagram: the Topology Master coordinates the containers; within each container a Stream Manager connects the local spout and bolt instances (S1/S2, B1-B5) and carries inter-container traffic.]
Heron Stream Manager: Spout Backpressure

[Diagram: when a downstream bolt falls behind, the Stream Managers clamp the local spouts (S1), so the topology runs at the pace of its slowest component.]
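The mechanism these slides describe is a high/low water mark scheme: when buffered data for a slow component crosses a high water mark, local spouts are told to stop reading from their sources; once buffers drain below a low water mark, reading resumes. The sketch below is a generic illustration of that idea, with all names invented; Heron implements this logic inside the Stream Manager:

```java
// Hypothetical sketch of spout backpressure with high/low water marks.
// All names are invented for illustration.
public class BackpressureController {
    /** The local spouts whose source reads can be paused and resumed. */
    public interface SpoutSet {
        void stopReading();
        void resumeReading();
    }

    private static final long HIGH_WATER_MARK = 100L * 1024 * 1024; // bytes buffered
    private static final long LOW_WATER_MARK  = 50L * 1024 * 1024;  // bytes buffered

    private final SpoutSet localSpouts;
    private boolean throttled = false;

    public BackpressureController(SpoutSet localSpouts) {
        this.localSpouts = localSpouts;
    }

    /** Called whenever the outbound buffer for a slow instance grows or shrinks. */
    public void onBufferedBytes(long bufferedBytes) {
        if (!throttled && bufferedBytes > HIGH_WATER_MARK) {
            throttled = true;
            localSpouts.stopReading();   // clamp the sources: the topology now
                                         // runs at the pace of its slowest component
        } else if (throttled && bufferedBytes < LOW_WATER_MARK) {
            throttled = false;
            localSpouts.resumeReading(); // buffers have drained; pull new data again
        }
    }
}
```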
Heron Use Cases

Spam detection, real time ETL, real time BI, real time trends, real time ML, real time ops.
Heron Sample Topologies
Heron @Twitter

Heron has been in production for 3 years, running topologies from 1 stage to 10 stages, with a 3x reduction in cores and memory.
Heron Performance: Settings

Components            Expt #1   Expt #2   Expt #3
Spout                 25        100       200
Bolt                  25        100       200
# Heron containers    25        100       200
Heron Performance: At-Most-Once Semantics

[Charts: throughput (million tuples/min) and CPU usage (# cores used) versus spout parallelism (25, 100, 200), comparing Heron (paper) with Heron (master). Throughput improves 5-6x while CPU usage drops roughly 1.4-1.6x.]
Heron Performance: CPU Usage

[Chart: Heron (paper) versus Heron (master) across spout parallelism of 25, 100, and 200, showing a 4-5x improvement.]
Heron @Twitter

> 400 real time jobs | 500 billion events/day processed | 25-50 ms latency
Stateful Processing in Heron

Optimistic approaches and pessimistic approaches.
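Delivery semantics interact with state: at-most-once (as in the benchmarks above) needs none of this machinery, while at-least-once in the Storm-compatible API is expressed by anchoring emitted tuples to their inputs and acking or failing them, so the spout can replay failures. A minimal sketch; the bolt and field names are illustrative:

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ReliableBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String word = input.getStringByField("word");
            // Anchored emit: ties the new tuple to its input in the ack tree.
            collector.emit(input, new Values(word.toUpperCase()));
            collector.ack(input);   // success: the spout will not replay it
        } catch (Exception e) {
            collector.fail(input);  // failure: the spout replays the tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```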
Tying It Together
Lambda Architecture

Combining batch and real time: [Diagram: new data flows into both a batch layer and a real time layer; clients query the combined results.]
Lambda Architecture: The Good

[Diagram: Scribe feeds the collection pipeline; Event Bus feeds Heron in the compute pipeline; both paths produce results.]
Lambda Architecture: The Bad

Have to write everything twice!
Have to fix everything (maybe twice!).
Subtle differences in semantics.
How much duct tape is required?
What about graphs, ML, SQL, etc.?
Summingbird to the Rescue

[Diagram: a single Summingbird program compiles to both a Heron topology, fed by a message broker and writing to an online key-value result store, and a Scalding/MapReduce batch job, reading from HDFS and writing to a batch key-value result store; clients read the merged results.]
Curious to Learn More?

Storm @Twitter. Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy Ryaboy. SIGMOD 2014.
Twitter Heron: Stream Processing at Scale. Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja. SIGMOD 2015.
Streaming@Twitter. Maosong Fu, Sailesh Mittal, Vikas Kedigehalli, Karthik Ramasamy, Michael Barry, Andrew Jorgensen, Christopher Kellogg, Neng Lu, Bill Graham, Jingwei Wu. IEEE Data Engineering Bulletin, 2015.
Interested in Heron?

Heron is open sourced. Contributions are welcome!
https://github.com/twitter/heron
http://heronstreaming.io
Follow us: @heronstreaming
Interested in DistributedLog?

DistributedLog is open sourced. Contributions are welcome!
https://github.com/twitter/distributedlog
http://distributedlog.io
Follow us: @distributedlog
We Are Hiring @Streamlio

Software engineers, product engineers, systems engineers, UI engineers/UX designers.
Email: careers [at] streaml.io
Any Questions?

What, why, where, when, who, how.
Get in Touch: @karthikz
THANKS FOR ATTENDING !!!