Petabyte Scale Data at Facebook - Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012
Agenda
1 Types of Data
2 Data Model and API for Facebook Graph Data
3 SLTP (Semi-OLTP) and Analytics Data
4 Why Hive?
Four major types of storage systems
▪ Online Transaction Processing Databases (OLTP)
  ▪ The Facebook Social Graph
▪ Semi-online Light Transaction Processing Databases (SLTP)
  ▪ Facebook Messages and Facebook Time Series
▪ Immutable DataStore
  ▪ Photos, videos, etc.
▪ Analytics DataStore
  ▪ Data Warehouse, Logs storage
Size and Scale of Databases

                                     Total Size               Technology              Bottlenecks
Facebook Graph                       Single-digit petabytes   MySQL and TAO           Random read IOPS
Facebook Messages and Time Series    Tens of petabytes        HBase and HDFS          Write IOPS and storage capacity
Facebook Photos                      High tens of petabytes   Haystack                Storage capacity
Data Warehouse                       Hundreds of petabytes    Hive, HDFS and Hadoop   Storage capacity
Characteristics

                                     Query Latency        Consistency                              Durability
Facebook Graph                       < few milliseconds   Quickly consistent across data centers   No data loss
Facebook Messages and Time Series    < 200 millisec       Consistent within a data center          No data loss
Facebook Photos                      < 250 millisec       Immutable                                No data loss
Data Warehouse                       < 1 min              Not consistent across data centers       No silent data loss
Facebook Graph: Objects and Associations
Data model Objects & Associations 18429207554 (page) fan name: Barack Obama 8636146 birthday: 08/04/1961 admin (user) website: http://… verified: 1 friend … likes liked by friend 604191769 (user) 6205972929 (story)
Facebook Social Graph: TAO and MySQL
An OLTP workload:
▪ Uneven, read-heavy workload
▪ Huge working set with creation-time locality
▪ Highly interconnected data
▪ Constantly evolving
▪ As consistent as possible
Data model: Content-aware data store
▪ Allows for server-side data processing
▪ Can exploit creation-time locality
▪ Graph data model
  ▪ Nodes and Edges: Objects and Associations
▪ Restricted graph API
Data model Objects & Associations ▪ Object -> unique 64 bit ID plus a typed dictionary ▪ (id) -> (otype, (key -> value)* ) ▪ ID 6815841748 -> {‘type’: page, ‘name’: “Barack Obama”, … } ▪ Association -> typed directed edge between 2 IDs ▪ (id1, atype, id2) -> (time, (key -> value)* ) ▪ (8636146, RSVP, 130855887032173) -> (1327719600, {‘response’: ‘YES’}) ▪ Association lists ▪ (id1, atype) -> all assocs with given id1, atype in desc order by time
Data model API
▪ Object: (id) -> (otype, (key -> value)*)
  ▪ obj_add(otype, (k->v)*): creates a new object, returns its id
  ▪ obj_update(id, (k->v)*): updates some or all fields
  ▪ obj_delete(id): removes the object permanently
  ▪ obj_get(id): returns type and fields of the given object, if it exists
Data model API ▪ Association : (id1, atype, id2) -> (time, (key -> value)* ) ▪ assoc_add(id1, atype, id2, time, (k->v)*) : adds/updates the given assoc ▪ assoc_delete(id1, atype, id2) : deletes the given association
Data model API ▪ Association : (id1, atype, id2) -> (time, (key -> value)* ) ▪ assoc_get(id1, atype, id2set) : returns assocs where id2 ∈ id2set ▪ assoc_range(id1, atype, offset, limit, filters*): get relevant matching assocs from the given assoc list ▪ assoc_count(id1, atype): returns size of given assoc list
Architecture Cache & Storage MySQL Storage Web servers TAO Storage Cache
Architecture: Sharding
▪ Object ids and Assoc id1s are mapped to shard ids
[Diagram: web servers route requests to TAO cache servers (s1 … s8), each of which fronts one or more MySQL databases (db1 … db8).]
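A hedged sketch of the routing implied by this slide: the talk only states that object ids and assoc id1s map to shard ids, and shards map to MySQL databases. The modulo mapping below is an assumed stand-in for whatever function TAO actually uses.

    NUM_SHARDS = 8
    SHARD_TO_DB = {1: "db1", 2: "db2", 3: "db3", 4: "db4",
                   5: "db5", 6: "db6", 7: "db7", 8: "db8"}

    def shard_id(object_id):
        return object_id % NUM_SHARDS + 1   # assumed mapping, for illustration

    def db_for(object_id):
        return SHARD_TO_DB[shard_id(object_id)]

    # Because an association list is keyed by id1, a whole assoc_range(id1, ...)
    # call can be answered by the single shard that owns id1.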
Workload
▪ Read-heavy workload
▪ Significant range queries
▪ LinkBench benchmark being open-sourced
  ▪ http://www.github.com/facebook/linkbench
  ▪ Real data distribution of Assocs and their access patterns
Messages & Time Series Database SLTP workload
Facebook Messages
▪ Messages
▪ Chats
▪ Emails
▪ SMS
Why we chose HBase
▪ High write throughput
▪ Horizontal scalability
▪ Automatic failover
▪ Strong consistency within a data center
▪ Benefits of HDFS: fault tolerant, scalable, Map-Reduce toolset
▪ Why is this SLTP?
  ▪ Semi-online: queries run even if part of the database is offline
  ▪ Light transactions: single-row transactions (see the sketch below)
  ▪ Storage-capacity bound rather than IOPS or CPU bound
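As an illustration of the single-row transaction model, the sketch below uses the open-source happybase Python client for HBase; Facebook's own messages stack is Java, and the host, table, and column names here are hypothetical.

    import happybase  # open-source Thrift client for HBase

    conn = happybase.Connection("hbase-thrift-host")   # hypothetical host
    table = conn.table("messages")                     # hypothetical table

    # All cells share one row key, so HBase applies the put atomically:
    # this is the "light transactions" (single-row) guarantee.
    table.put(b"user:8636146", {
        b"msg:000123":  b"hello",
        b"meta:000123": b'{"thread": 42, "ts": 1327719600}',
    })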
What we store in HBase
▪ Small messages
▪ Message metadata (thread/message indices)
▪ Search index
▪ Large attachments stored in Haystack (the photo store)
Size and scale of Messages Database
▪ 6 billion messages/day
▪ 74 billion operations/day
▪ At peak: 1.5 million operations/sec
▪ 55% read, 45% write operations
▪ Average write operation inserts 16 records
▪ All data is LZO compressed
▪ Growing at 8 TB/day
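A quick consistency check on these numbers: 74 billion operations/day ÷ 86,400 seconds/day ≈ 0.86 million operations/sec on average, so the quoted 1.5 million ops/sec peak is roughly 1.75× the daily mean.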
Haystack: The Photo Store
Facebook Photo DataStore

               2009                     2012
Total Size     15 billion photos        High tens of petabytes
               1.5 petabytes
Upload Rate    30 million photos/day    300 million photos/day
               3 TB/day                 30 TB/day
Serving Rate                            555K images/sec
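A back-of-the-envelope check on the upload figures: 30 TB/day across 300 million uploads/day works out to about 100 KB per uploaded photo in 2012, the same average as 3 TB over 30 million photos in 2009.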
Haystack-based Design
[Diagram: the browser fetches photos through a CDN; web servers consult the Haystack Directory, and reads go through the Haystack Cache to the Haystack Store.]
Haystack Internals
▪ Log-structured, append-only object store
▪ Built on commodity hardware
▪ Application-aware replication
▪ Images stored as needles in 100 GB xfs needle files
▪ An in-memory index for each needle file
▪ 32 bytes of index per photo
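A sketch of what a 32-byte in-memory index entry could look like; the field split below (key, alternate key, offset, size) is an assumption for illustration, not Haystack's actual layout.

    import struct

    # Assumed layout: 8-byte photo key, 8-byte alternate key,
    # 8-byte offset into the needle file, 8-byte needle size.
    NEEDLE_INDEX = struct.Struct("<QQQQ")       # 4 x 8 bytes = 32 bytes

    def pack_entry(key, alt_key, offset, size):
        return NEEDLE_INDEX.pack(key, alt_key, offset, size)

    entry = pack_entry(6205972929, 1, 4096, 98304)
    assert len(entry) == 32                     # matches 32 bytes per photo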
Hive Analytics Warehouse
Life of a photo tag in Hadoop/Hive storage
▪ User tags a photo on www.facebook.com
▪ Log line generated; log line reaches Scribe (10 sec)
▪ Log line lands in Scribe log storage on HDFS
  ▪ Puma realtime analytics on HBase: count users tagging photos in the last hour (1 min)
▪ copier/loader: log line reaches the Hive warehouse (15 min)
▪ Scrapes of the MySQL DB: user info reaches the Hive warehouse (1 day)
▪ Periodic analysis with Hive (nocron): daily report on count of photo tags by country (1 day)
▪ Ad hoc analysis with Hive (hipal): count photos tagged by females age 20-25 yesterday
Analytics Data Growth (last 4 years)

         Facebook Users   Scribe Data/Day   Nodes in Warehouse   Queries/Day   Size (Total)
Growth   14X              60X               250X                 260X          2500X
Why use Hive instead of a Parallel DBMS?
▪ Stonebraker/DeWitt from the DBMS community:
  ▪ Quote: "major step backwards"
  ▪ Published benchmark results which show that Hive is not as performant as a traditional DBMS
  ▪ http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
What is BigData? Prospecting for gold…
▪ "Finding gold in the wild west"
▪ A platform for huge data experiments
▪ A majority of queries are searching for a single gold nugget
▪ Great advantage in keeping all data in one queryable system
▪ No structure to data; specify structure at query time
How to measure performance
▪ Traditional database systems:
  ▪ Latency of queries
▪ Big Data systems:
  ▪ How much data can we store and query? (the 'Big' in BigData)
  ▪ How much data can we query in parallel?
  ▪ What is the value of this system?
Measure Cost of Storage
▪ Distributed network encoding of data
  ▪ Encoding is better than replication
  ▪ Use algorithms that minimize network transfer for data repair
  ▪ Trade off CPU for storage & network
▪ Remember the lineage of data, e.g. record the query that created it
  ▪ If data is not accessed for some time, delete it
  ▪ If a query occurs, recompute the data using query lineage (see the sketch below)
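A minimal sketch of the lineage idea from the last two bullets: remember the query that created a derived table, and if the table has been aged out, replay that query instead of keeping the bytes around. The function names and the exists/execute/read hooks are hypothetical; the talk does not describe a concrete mechanism.

    LINEAGE = {}  # table name -> query that created it

    def create_table(name, query, execute):
        LINEAGE[name] = query      # record the lineage at creation time
        execute(query)

    def read_table(name, exists, execute, read):
        if not exists(name):       # the cold copy was deleted to save space
            execute(LINEAGE[name]) # recompute the data from its lineage
        return read(name)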
Measure Network Encoding
▪ Start the same: triplicate every data block A, B, C
▪ Background encoding:
  ▪ Combine the third replicas of blocks from a single file to create a parity block A+B+C (XOR encoding)
  ▪ Remove the third replicas
▪ Reed-Solomon encoding for much older files
[Diagram: a file with three blocks A, B and C, before and after encoding.]
http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
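A small Python sketch of the XOR step described above: the parity block is the byte-wise XOR of A, B and C, and any one lost block can be rebuilt from the other three. With one parity block replacing the third replica of three data blocks, effective replication drops from 3× to about 2.3× ((2×3 + 1)/3 ≈ 2.33).

    def xor_blocks(*blocks):
        # Byte-wise XOR of equally sized blocks, e.g. parity = A ^ B ^ C.
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    A, B, C = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
    parity = xor_blocks(A, B, C)
    assert xor_blocks(B, C, parity) == A   # recover a lost block from survivors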
Measuring Data Discovery: Crowd-Sourcing
▪ There are 50K tables in a single warehouse
▪ Users are data administrators themselves
▪ Questions about a table are directed to the users of that table
▪ Automatic query lineage tools
Measuring Testability
▪ Traditional systems:
  ▪ Recreate load using tests
  ▪ Validate results
▪ Big Data systems:
  ▪ Cannot replicate production load in a test environment
  ▪ Deploy the new software on a small percentage of the service
  ▪ Monitor metrics
  ▪ Rolling upgrades: gradually deploy to a larger section of the service
Fault Tolerance and Elasticity
▪ Commodity machines
  ▪ Faults are the norm
▪ Anomalous behavior rather than complete failures
  ▪ 10% of machines are always 50% slower than the others
Measuring Fault Tolerance and Elasticity
▪ Fault tolerance is a must
  ▪ Continuously kill machines during benchmarking
  ▪ Slow down 10% of machines during the benchmark
▪ Elasticity is necessary
  ▪ Add/remove machines during benchmarking
Measuring Value of the System
▪ Cost/GB is decreasing with time
  ▪ So users can store more data
▪ But users need a metric to determine whether this cost is worth it
  ▪ What is the VALUE of this system?
  ▪ A metric that aids the user (and not the service provider)
Value per Byte (VB) for the System
▪ A new metric named VB
▪ Compare differences in value over time
  ▪ If VB increases with time, the user is satisfied
▪ When you touch a byte, its VB is MAX (say 100)
▪ System VB = weighted sum of the VB of each byte in the system
VB – Even a turtle ages with time
▪ The VB decreases with time
▪ A more recent access has more value than an older access
▪ Different ageing models (linear, exponential), as sketched below
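A sketch of the exponential ageing model hinted at above; MAX_VB and the half-life are illustrative parameters, since the talk does not fix them.

    MAX_VB = 100.0          # value of a byte at the moment it is touched
    HALF_LIFE_DAYS = 30.0   # assumed decay rate

    def byte_vb(days_since_access):
        # exponential ageing: value halves every HALF_LIFE_DAYS;
        # a linear model would be max(0.0, MAX_VB * (1 - days_since_access / T))
        return MAX_VB * 0.5 ** (days_since_access / HALF_LIFE_DAYS)

    def system_vb(ages, weights):
        # weighted sum of the VB of each byte, as defined on the previous slide
        return sum(w * byte_vb(a) for a, w in zip(ages, weights))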
Why use Hive instead of a Parallel DBMS?
▪ Stonebraker/DeWitt from the DBMS community:
  ▪ Quote: "Hadoop is a major step backwards"
  ▪ Published benchmark results which show that Hadoop/Hive is not as performant as a traditional DBMS
  ▪ http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
  ▪ A Hive query is 50 times slower than the equivalent DBMS query
▪ Stonebraker's conclusion: Facebook's 4000-node cluster (100 PB) can be replaced by a 20-node DBMS cluster
▪ What is wrong with the above conclusion?
Hive/Hadoop instead of Parallel DBMS
▪ Dr. Stonebraker's proposal would put 5 PB per node on the DBMS
  ▪ What would be the I/O throughput of that system? Abysmal
  ▪ How many concurrent queries could it support? Certainly not 100K concurrent clients
▪ He is using the wrong metric to draw a conclusion
▪ Hive/Hadoop is very, very slow
  ▪ Hive/Hadoop needs to be fixed to reduce query latency
  ▪ But an existing DBMS cannot replace Hive/Hadoop
Future Challenges
New trends in storage software
▪ Trends:
  ▪ SSDs getting cheaper; increasing number of CPUs per server
  ▪ SATA disk capacities reaching 4-8 TB per disk, falling $/GB
▪ New projects:
  ▪ Evaluate OLTP databases that scale linearly with the number of CPUs
  ▪ Prototype storing cold photos on spin-down disks
Questions? dhruba@gmail.com http://hadoopblog.blogspot.com/