Real Time Processing
Karthik Ramasamy
Streamlio
@karthikz

Stanford Platform Lab

Information Age
Real-time is key


Real Time Connected World

Internet of Things: 30 B connected devices by 2020
Connected Vehicles: data transferred per vehicle per month, 4 MB -> 5 GB
Health Care: 153 exabytes (2013) -> 2,314 exabytes (2020)
Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1]; Siri/Cortana/Google Now
Machine Data: 40% of the digital universe by 2020
Augmented/Virtual Reality: $150B by 2020 [2]; Oculus/HoloLens/Magic Leap

[1] http://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/digital-assistants-trends.html
[2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/

Why Real Time?

Real time trends: emerging breakout trends on Twitter (in the form of #hashtags)
Real time conversations: real time sports conversations related to a topic (a recent goal or touchdown)
Real time recommendations: real time product recommendations based on your behavior and profile
Real time search: real time search of Tweets

Value of Data: it's contextual

[Chart: the value of data to decision-making decays with time (the "information half-life in decision-making"). Time-critical decisions taken in real time or within seconds enable preventive/predictive action; data that is minutes to hours old supports actionable and reactive decisions; after days to months it is historical, the territory of traditional "batch" business intelligence.]

[1] Courtesy Michael Franklin, BIRTE, 2015.

What is Real-Time? It's contextual

BATCH (> 1 hour, high throughput): ad hoc queries, monthly active users, relevance for ads
REAL TIME (10 ms - 1 sec, approximate): ad impression counts, hashtag trends
OLTP (< 500 ms, latency sensitive): deterministic workflows, fanout of Tweets, search for Tweets
REAL REAL TIME (< 1 ms, low latency): financial trading

Real Time Analytics

INTERACTIVE: store the data and provide results instantly when a query is posed
STREAMING: analyze the data as it is being produced

Real Time Use Cases

Online services (10s of ms): transaction logs, queues, RPCs
Real time (10-100s of ms): change propagation, streaming analytics
Data for batch analytics (secs to mins): log aggregation, client events

Real Time Stack
Components: many moving parts

Data collectors
Messaging
Storage
Compute

Scribe

Open source log aggregation, originally from Facebook; Twitter made significant enhancements for real time event aggregation.

High throughput and scale: delivers 125M messages/min and provides tight SLAs on data reliability.

Runs on every machine: simple, very reliable, and uses memory and CPU efficiently.
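To make the client side concrete, here is a rough sketch of how a service might hand an event to a local Scribe daemon over Thrift. It assumes Java stubs generated from Scribe's upstream IDL (a scribe service exposing Log(list<LogEntry>), with LogEntry carrying a category and a message) and the conventional port 1463; the category name and payload are invented for illustration.

import java.util.Collections;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Assumes classes generated by the Thrift compiler from scribe.thrift:
// LogEntry(category, message), ResultCode, and the scribe.Client service stub.
public class ScribeLogExample {
  public static void main(String[] args) throws Exception {
    // A Scribe daemon typically runs on every host; 1463 is its conventional port.
    TTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
    transport.open();
    scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));

    // One entry: a category used to route/aggregate the event, plus an opaque message.
    LogEntry entry = new LogEntry("client_events", "{\"event\":\"page_view\",\"ts\":1469498400}");
    ResultCode rc = client.Log(Collections.singletonList(entry));
    if (rc == ResultCode.TRY_LATER) {
      // The daemon is overloaded; callers usually buffer locally and retry later.
    }

    transport.close();
  }
}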
Event Bus & Distributed Log
        Next Generation Messaging

                          "
12

Twitter Messaging

[Diagram: the messaging systems historically in use at Twitter - core business logic (tweets, fanouts, ...) and deferred RPC riding on Kestrel queues; Gizzard, databases (MySQL), search, and BookKeeper; and Scribe feeding Kafka and HDFS for log aggregation.]

Kestrel Queue

Message queue, multiple protocols, highly scalable, loosely ordered

Kestrel Queue

[Diagram: producers (P) push to independent Kestrel queue servers and consumers (C) pull from them, with ZooKeeper (zk) used for discovery and coordination. A client-side sketch follows.]
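Because Kestrel speaks the memcache text protocol, any memcache client can act as a producer or consumer. The sketch below uses the spymemcached client against Kestrel's default port 22133; the queue name and payload are made up, and the reliable-read variants (get queue/open ... get queue/close) are only noted in comments.

import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class KestrelQueueExample {
  public static void main(String[] args) throws Exception {
    // Kestrel exposes the memcache text protocol, by default on port 22133.
    MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 22133));

    // Enqueue: a memcache "set" on a queue name appends an item to that queue.
    client.set("tweet_events", 0, "{\"tweet_id\":123}").get();

    // Dequeue: a plain "get" pops the next item (null if the queue is empty).
    // Kestrel also supports "get tweet_events/open" ... "get tweet_events/close"
    // for reliable reads that are re-delivered if the consumer dies mid-processing.
    Object item = client.get("tweet_events");
    System.out.println("dequeued: " + item);

    client.shutdown();
  }
}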

Kestrel Limitations

Durability is hard to achieve
Adding subscribers is expensive
Read-behind degrades performance: too many random I/Os
Scales poorly as the number of queues increases
Cross-DC replication

Kafka Queue

Client state management, consumer scalability, high throughput, simple for operations (see the producer/consumer sketch below)
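For comparison, the basic produce/consume loop with the standard Apache Kafka Java clients looks like the sketch below. The broker address, topic, and consumer group are placeholders, and the snippet uses the current KafkaProducer/KafkaConsumer API rather than the client that existed when this talk was given.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaQueueExample {
  public static void main(String[] args) {
    // Producer: append an event to the "tweet_events" topic, keyed by user id.
    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "localhost:9092");
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      producer.send(new ProducerRecord<>("tweet_events", "user42", "{\"tweet_id\":123}"));
    }

    // Consumer: members of the same group share the topic's partitions,
    // which is how Kafka scales consumption horizontally.
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "localhost:9092");
    consumerProps.put("group.id", "trend-counter");
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
      consumer.subscribe(Collections.singletonList("tweet_events"));
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.key() + " -> " + record.value());
      }
    }
  }
}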

Kafka Limitations

Relies on the file system page cache
Performance degradation when subscribers fall behind: too much random I/O
Scaling to many topics
Loss of data
No centralized operational stats

Rethinking Messaging

Unified stack: tradeoffs for various workloads
Durable writes, intra-cluster and geo-replication
Multi-tenancy
Scale resources independently
Cost efficiency
Ease of manageability

Event Bus - Pub-Sub

[Diagram: publishers write through a write proxy into the distributed log; subscribers read through a read proxy; a metadata service tracks the streams.]

Distributed Log

[Diagram: the write proxy appends records to BookKeeper (BK) nodes that make up the distributed log; the read proxy serves subscribers; metadata is kept in ZooKeeper (ZK). A hypothetical client-side sketch of this write/read split follows.]
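To make the write/read split concrete, here is a deliberately hypothetical Java sketch of the interaction the diagrams imply: a publisher appends through the write path and gets back a position in the log, and a subscriber reads forward from a position through the read path. The interface, class names, and in-memory stand-in below are invented for illustration; they are not the actual DistributedLog or Event Bus API.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side view of the Event Bus write/read split (not the real API).
interface EventBusClient {
  // Append a record to a named log stream (the write proxy path); returns its log position.
  long publish(String stream, byte[] payload);

  // Read the record at the given position (the read proxy path); null if nothing is there yet.
  byte[] read(String stream, long position);
}

public class EventBusSketch {
  // Toy in-memory stand-in so the sketch runs; the real log is BookKeeper-backed and durable.
  static final class InMemoryEventBus implements EventBusClient {
    private final Map<String, List<byte[]>> logs = new HashMap<>();

    @Override
    public synchronized long publish(String stream, byte[] payload) {
      List<byte[]> log = logs.computeIfAbsent(stream, s -> new ArrayList<>());
      log.add(payload);
      return log.size() - 1; // position of the appended record
    }

    @Override
    public synchronized byte[] read(String stream, long position) {
      List<byte[]> log = logs.getOrDefault(stream, Collections.emptyList());
      return position < log.size() ? log.get((int) position) : null;
    }
  }

  public static void main(String[] args) {
    EventBusClient client = new InMemoryEventBus();

    // Publisher: append an event and learn the position it landed at.
    long pos = client.publish("tweet_events",
        "{\"tweet_id\":123}".getBytes(StandardCharsets.UTF_8));

    // Subscriber: readers track their own positions, so new subscribers can be added
    // cheaply and can replay from any point without affecting the writer.
    byte[] record = client.read("tweet_events", pos);
    System.out.println(new String(record, StandardCharsets.UTF_8));
  }
}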

Distributed Log @Twitter

01 Manhattan key-value store
02 Durable deferred RPC
03 Real time search indexing
04 Pub-sub system
05 Globally replicated log

Distributed Log @Twitter

400 TB/day in, 20 PB/day out
2 trillion events/day processed
5-10 ms latency
Twitter Heron
Next Generation Streaming Engine

Twitter Heron
Better Storm

Container based architecture
Separate monitoring and scheduling
Simplified execution model
Much better performance

Twitter Heron
Design: Goals

Fully API compatible with Storm: directed acyclic graphs, topologies, spouts and bolts
Task isolation: ease of debuggability, isolation, and profiling
Use of mainstream languages: C++, Java and Python
Support for back pressure: topologies should be self-adjusting
Efficiency: reduce resource consumption
Batching of tuples: amortize the cost of transferring tuples

Twitter Heron

Guaranteed message passing, horizontal scalability, robust fault tolerance, concise code (focus on logic)

Heron Terminology

Topology: a directed acyclic graph where vertices = computation and edges = streams of data tuples
Spouts: sources of data tuples for the topology; examples - Kafka, Kestrel, MySQL, Postgres
Bolts: process incoming tuples and emit outgoing tuples; examples - filtering, aggregation, joins, any function

(A minimal code sketch built from these pieces follows.)
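Since Heron keeps API compatibility with Storm, a topology is assembled with the familiar TopologyBuilder, spout, and bolt classes. The sketch below is a minimal, hypothetical hashtag-counting topology: TweetSpout and HashtagCountBolt are placeholders you would replace with real sources and logic, and the org.apache.storm package names assume the Storm-compatible API; the exact packages depend on the Heron or Storm release you build against.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class HashtagTopology {

  // Placeholder spout: emits a random hashtag per nextTuple() call.
  // A real topology would read from Kafka, Kestrel, MySQL, Postgres, etc.
  public static class TweetSpout extends BaseRichSpout {
    private final String[] tags = {"#heron", "#storm", "#realtime"};
    private final Random random = new Random();
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      collector.emit(new Values(tags[random.nextInt(tags.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("hashtag"));
    }
  }

  // Placeholder bolt: counts hashtags and emits running totals downstream.
  public static class HashtagCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String tag = tuple.getStringByField("hashtag");
      long total = counts.merge(tag, 1L, Long::sum);
      collector.emit(new Values(tag, total));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("hashtag", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // Vertices of the DAG: a spout (source) and a bolt (computation) ...
    builder.setSpout("tweets", new TweetSpout(), 2);
    // ... connected by an edge (stream); fieldsGrouping keeps each hashtag on one task.
    builder.setBolt("hashtag-counts", new HashtagCountBolt(), 4)
        .fieldsGrouping("tweets", new Fields("hashtag"));

    StormSubmitter.submitTopology("hashtag-topology", new Config(), builder.createTopology());
  }
}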

Heron Topology

[Diagram: an example topology DAG - two spouts (Spout 1, Spout 2) feed intermediate bolts (Bolt 1, Bolt 2, Bolt 3), which feed downstream bolts (Bolt 4, Bolt 5).]

Stream Groupings

01 Shuffle grouping: random distribution of tuples
02 Fields grouping: group tuples by a field or multiple fields
03 All grouping: replicate tuples to all tasks
04 Global grouping: send the entire stream to one task

(See the code sketch below for how each grouping is declared.)
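In the Storm-compatible API, a grouping is chosen at the point where a bolt subscribes to a stream. A short sketch, reusing the hypothetical spout and bolt from the topology example above:

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExamples {
  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("tweets", new HashtagTopology.TweetSpout(), 2);

    // Shuffle grouping: tuples are distributed randomly (and roughly evenly) across tasks.
    builder.setBolt("shuffled", new HashtagTopology.HashtagCountBolt(), 4)
        .shuffleGrouping("tweets");

    // Fields grouping: tuples with the same "hashtag" value always reach the same task.
    builder.setBolt("by-hashtag", new HashtagTopology.HashtagCountBolt(), 4)
        .fieldsGrouping("tweets", new Fields("hashtag"));

    // All grouping: every task of the bolt receives a copy of every tuple.
    builder.setBolt("replicated", new HashtagTopology.HashtagCountBolt(), 2)
        .allGrouping("tweets");

    // Global grouping: the entire stream is sent to a single task of the bolt.
    builder.setBolt("single-task", new HashtagTopology.HashtagCountBolt(), 1)
        .globalGrouping("tweets");
  }
}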

Heron Architecture: High Level

[Diagram: a topology is submitted to a scheduler, which launches each topology (Topology 1 ... Topology N) as an independent job on the shared cluster.]

Heron Architecture: Topology

[Diagram: each topology has a Topology Master that keeps the logical plan, physical plan, and execution state in a ZooKeeper cluster; every container runs a Stream Manager, a Metrics Manager, and a set of Heron instances (I1-I4), and the containers sync the physical plan with the Topology Master.]

Heron Stream Manager: Back Pressure

[Diagram: a simple topology with spout S1 and bolts B2, B3 and B4, used to illustrate back pressure.]

Stream Manager: Back Pressure

[Diagram: the instances of S1, B2, B3 and B4 are spread across containers, each with its own Stream Manager; one container also hosts the Topology Master. When an instance in one container slows down, its local Stream Manager signals the other Stream Managers so that upstream containers stop sending tuples faster than the slow instance can absorb them.]

Heron Stream Manager: Spout Back Pressure

[Diagram: when a downstream bolt falls behind, each Stream Manager stops reading from the local spouts (S1), throttling the topology at its sources until the slow instance catches up.]

Heron Use Cases

Real time ETL, real time BI, spam detection, real time trends, real time ML, real time ops

Heron: Sample Topologies

Heron @Twitter

Heron has been in production for 3 years.
[Chart: production topologies range from 1 stage to 10 stages.]
3x reduction in cores and memory

Heron Performance: Settings

COMPONENTS            EXPT #1   EXPT #2   EXPT #3
Spout                      25       100       200
Bolt                       25       100       200
# Heron containers         25       100       200

Heron Performance: At Most Once

[Charts: throughput (million tuples/min) and CPU usage (# cores used) for Heron (paper) vs Heron (master) at spout parallelism 25, 100 and 200; Heron (master) delivers roughly 5-6x the throughput while using about 1.4-1.6x the cores.]

Heron Performance: CPU Usage

[Chart: Heron (paper) vs Heron (master) at spout parallelism 25, 100 and 200, showing roughly a 4-5x difference between the two.]

Heron @Twitter

> 400 real time jobs
500 billion events/day processed
25-50 ms latency

Stateful Processing in Heron

Optimistic approaches and pessimistic approaches
Tying Together

Lambda Architecture
Combining batch and real time

[Diagram: new data is fed to both a batch pipeline and a real time pipeline, and the client combines results from the two.]

Lambda Architecture - The Good

Scribe collection pipeline -> Event Bus -> Heron compute pipeline -> Results

Lambda Architecture - The Bad

Have to write everything twice!
Have to fix everything (maybe twice)
Subtle differences in semantics
How much duct tape is required?
What about graphs, ML, SQL, etc.?

Summingbird to the Rescue

[Diagram: a single Summingbird program is compiled to both a Heron topology (reading from a message broker and writing to an online key-value result store) and a Scalding/MapReduce job (reading from HDFS and writing to a batch key-value result store); the client merges results from both stores.]

Curious to Learn More?

Twitter Heron: Stream Processing at Scale
Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja
Twitter, Inc. / University of Wisconsin - Madison

Streaming@Twitter
Maosong Fu, Sailesh Mittal, Vikas Kedigehalli, Karthik Ramasamy, Michael Barry, Andrew Jorgensen, Christopher Kellogg, Neng Lu, Bill Graham, Jingwei Wu
Twitter, Inc.

Storm @Twitter
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy Ryaboy
Twitter, Inc. / University of Wisconsin - Madison

    Interested in Heron?
  HERON IS OPEN SOURCED
CONTRIBUTIONS ARE WELCOME!
     https://github.com/twitter/heron

        http://heronstreaming.io

FOLLOW US @HERONSTREAMING

 Interested in Distributed Log?
DISTRIBUTED LOG IS OPEN SOURCED
  CONTRIBUTIONS ARE WELCOME!
https://github.com/twitter/distributedlog

            http://distributedlog.io

  FOLLOW US @DISTRIBUTEDLOG

        We are hiring @Streamlio
SOFTWARE ENGINEERS
PRODUCT ENGINEERS
SYSTEM ENGINEERS
UI ENGINEERS/UX DESIGNERS

     EMAIL CAREERS [AT] STREAML.IO

Any Questions?

WHAT   WHY   WHERE   WHEN   WHO   HOW

Get in Touch
@karthikz
THANKS FOR ATTENDING !!!