Hive and Pig!

   Juliana Freire
  New York University

Some slides from J. Lin
Need for High-Level Languages!
• Hadoop is great for large-data processing!
 • But writing Java programs for everything is verbose and slow
 • Not everyone wants to (or can) write Java code
• Solution: develop higher-level data processing languages
 • Hive: HQL is like SQL
 • Pig: Pig Latin is a bit like Perl
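 • A hedged flavor of what “HQL is like SQL” means: the word count below is
   one query over a hypothetical words(word string) table, versus a mapper,
   reducer, and driver class in Java

    -- counts each distinct word; table and column names are illustrative
    SELECT word, COUNT(1) AS freq
    FROM words
    GROUP BY word
    ORDER BY freq DESC;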
Hive and Pig!
• Hive: data warehousing application in Hadoop
 • Query language is HQL, variant of SQL
 • Tables stored on HDFS as flat files
 • Developed by Facebook, now open source
• Pig: large-scale data processing system
 • Scripts are written in Pig Latin, a dataflow language
 • Developed by Yahoo!, now open source
 • Roughly 1/3 of all Yahoo! internal jobs
• Common idea:
 • Provide higher-level language to facilitate large-data processing
 • Higher-level language “compiles down” to Hadoop jobs
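 • In Hive you can inspect this compilation directly: prefixing a query with
   EXPLAIN prints the plan as a DAG of stages, most of them MapReduce jobs
   (sketch below reuses the hypothetical words table from earlier)

    EXPLAIN
    SELECT word, COUNT(1) FROM words GROUP BY word;
    -- output includes a stage dependency graph and, per stage,
    -- the map and reduce operator trees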
Hive: Background!
• Started at Facebook
• Data was collected by nightly cron jobs into Oracle DB
• “ETL” via hand-coded Python
• Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that

   [Diagram: OLTP systems feed an ETL (Extract, Transform, and Load)
   pipeline built on Hadoop, which loads into OLAP]

    Source: cc-licensed slide by Cloudera
Hive Components!
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe

    Source: cc-licensed slide by Cloudera
Data Model!
• Tables: analogous to tables in RDBMS
   • Typed columns (int, float, string, boolean)
   • Structs: {a INT; b INT}
   • Also list/arrays, map (for JSON-like data)
   • Each table has a corresponding HDFS directory; the data is serialized
     and stored in files within that directory. Hive provides builtin
     serialization formats which exploit compression and lazy
     de-serialization; users can also add custom serialize and de-serialize
     methods (SerDe's) written in Java. Hive also supports external tables
     on data stored in HDFS, NFS, or local directories.
• Partitions
   • For example, range-partition tables by date
   • Partitions determine the distribution of data within sub-directories
     of the table directory: if table T lives in /wh/T and is partitioned
     on ds and ctry, then data with ds value 20090101 and ctry value US is
     stored in files under /wh/T/ds=20090101/ctry=US
• Buckets
   • Hash partitions within ranges (useful for sampling, join optimization)
   • Data in each partition may in turn be divided into buckets based on
     the hash of a column; each bucket is stored as a file in the
     partition directory

                                                  [Thusoo et al., VLDB 2009]
    Source: cc-licensed slide by Cloudera
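• A hedged HiveQL sketch tying the data model together (table and column
  names are illustrative; the DDL syntax is standard Hive):

   CREATE TABLE status_updates (
     userid INT,
     status STRING,
     pos    STRUCT<a:INT, b:INT>,        -- nested struct column
     words  ARRAY<STRING>,               -- list column
     props  MAP<STRING, STRING>          -- map column (JSON-like data)
   )
   PARTITIONED BY (ds STRING)            -- e.g., one partition per day
   CLUSTERED BY (userid) INTO 32 BUCKETS;  -- hash buckets per partition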
Metastore!
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and many other relational databases

     Source: cc-licensed slide by Cloudera
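• To see what the metastore holds for a table, the standard commands below
  can be used (table name carried over from the earlier sketch):

   DESCRIBE EXTENDED status_updates;   -- columns, SerDe, HDFS location
   SHOW PARTITIONS status_updates;     -- e.g., ds=2009-03-20, ds=2009-03-21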
Physical Layout!
• Warehouse directory in HDFS
 • E.g., /user/hive/warehouse
• Tables stored in subdirectories of warehouse
 • Partitions form subdirectories of tables
 • Each table has a corresponding HDFS directory
• Actual data stored in flat files
 • Users can associate a table with a serialization format
 • Control char-delimited text, or SequenceFiles
 • With custom SerDe, can use arbitrary format

     Source: cc-licensed slide by Cloudera
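• A hedged sketch of how this layout is controlled in the DDL ('\001' is
  Hive's default control-character delimiter; the SerDe class name is
  hypothetical):

   CREATE TABLE visits (user STRING, url STRING, time STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'  -- delimited text
   STORED AS TEXTFILE;                               -- or SEQUENCEFILE
   -- custom format instead: ROW FORMAT SERDE 'com.example.MySerDe'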
Hive: Example!
• Hive looks similar to an SQL database
• Relational join on two tables:
 • Table of word counts from the Shakespeare collection
 • Table of word counts from the Bible
          SELECT s.word, s.freq, k.freq FROM shakespeare s
            JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
            ORDER BY s.freq DESC LIMIT 10;

          word    s.freq    k.freq
          the      25848     62394
          I        23031      8854
          and      19671     38985
          to       18038     13526
          of       16700     34654
          a        14170      8057
          you      12702      2720
          my       11297      4135
          in       10797     12445
          is        8882      6884

    Source: Material drawn from Cloudera training VM
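• For completeness, a hedged sketch of how such an input table might be
  defined and populated (the path and tab delimiter are assumptions; the
  training VM ships prepared data):

   CREATE TABLE shakespeare (word STRING, freq INT)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

   LOAD DATA INPATH '/tmp/shakespeare_freq' INTO TABLE shakespeare;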
Hive: Another Example!
• Running example from the Hive paper (StatusMeme, inspired by Facebook
  Lexicon): when users update their status, the updates are logged into
  flat files in an NFS directory /logs/status_updates, rotated every day,
  and loaded into Hive on a daily basis into a table
  status_updates(userid int, status string, ds string)
  using a load statement like:

   LOAD DATA LOCAL INPATH '/logs/status_updates'
   INTO TABLE status_updates PARTITION (ds='2009-03-20')

• Each status update record contains the user identifier (userid), the
  actual status string (status), and the date (ds) when the update
  occurred; the table is partitioned on ds. Detailed user profile
  information, like the user's gender and school, is available in the
  profiles(userid int, school string, gender int) table.
• Goal: daily counts of status updates by school (into
  school_summary(school string, cnt int, ds string)) and by gender (into
  gender_summary(gender int, cnt int, ds string)), generated by a single
  multi-table insert with one scan of the join of status_updates and
  profiles. The output tables are also partitioned on ds, and HiveQL
  allows users to insert query results into a specific partition:

   FROM (SELECT a.status, b.school, b.gender
         FROM status_updates a JOIN profiles b
              ON (a.userid = b.userid AND a.ds='2009-03-20')
        ) subq1
   INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
   SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
   INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
   SELECT subq1.school, COUNT(1) GROUP BY subq1.school

                                                  [Thusoo et al., VLDB 2009]
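• To see why the single-scan multi-table insert matters, a hedged sketch of
  the naive alternative: two separate statements, each recomputing the join:

   INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
   SELECT b.gender, COUNT(1) FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid AND a.ds='2009-03-20') GROUP BY b.gender;

   INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
   SELECT b.school, COUNT(1) FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid AND a.ds='2009-03-20') GROUP BY b.school;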
Hive: Another Example!
• HiveQL provides MapReduce constructs: here, user-supplied scripts compute
  the ten most popular memes per school from the join of status_updates and
  profiles
 REDUCE subq2.school, subq2.meme, subq2.cnt
        USING 'top10.py' AS (school, meme, cnt)
 FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
       FROM (MAP b.school, a.status
             USING 'meme-extractor.py' AS (school, meme)
             FROM status_updates a JOIN profiles b
                  ON (a.userid = b.userid)
            ) subq1
       GROUP BY subq1.school, subq1.meme
       DISTRIBUTE BY school, meme
       SORT BY school, meme, cnt DESC
      ) subq2;

   Source: Material drawn from Cloudera training VM
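• MAP ... USING and REDUCE ... USING are Hive's streaming constructs
  (variants of TRANSFORM): rows are converted to strings and piped to the
  external script's standard input, and its standard output is parsed back
  into rows; the flexibility comes at the cost of the string conversion.
  Generic form (script name hypothetical):

   SELECT TRANSFORM(col1, col2) USING 'my_script.py' AS (out1, out2)
   FROM some_table;
   -- rows reach the script's stdin as tab-separated strings;
   -- stdout lines are parsed back into (out1, out2)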
Example Data Analysis Task!

                      Find users who tend to visit “good” pages.

Visits                                      Pages

 user   url                  time            url               pagerank
 Amy    www.cnn.com          8:00            www.cnn.com          0.9
 Amy    www.crap.com         8:05            www.flickr.com       0.9
 Amy    www.myblog.com       10:00           www.myblog.com       0.7
 Amy    www.flickr.com       10:05           www.crap.com         0.2
 Fred   cnn.com/index.htm    12:00           . . .
 . . .

      Pig Slides adapted from Olston et al.
Conceptual Dataflow!
          Load                               Load
  Visits(user, url, time)             Pages(url, pagerank)
            |                                 |
    Canonicalize URLs                         |
            |                                 |
            +-------- Join (url = url) -------+
                            |
                     Group by user
                            |
                Compute average pagerank
                            |
                 Filter (avgPR > 0.5)

Pig Slides adapted from Olston et al.
System-Level Dataflow!
   [Diagram: the same dataflow at the MapReduce level. One map phase loads
   Visits and canonicalizes URLs while another loads Pages; a reduce joins
   by url; a second map/reduce pass groups by user; a final stage computes
   the average pagerank and filters, yielding the answer.]

     Pig Slides adapted from Olston et al.
MapReduce Code!
(Abridged: the full hand-written MapReduce listing runs to roughly 170
lines; elisions are marked. The complete version is in Olston et al.'s
Pig slides.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class MRExample {
    // Tag each Pages record with "1" so the join reducer knows its source.
    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            // Prepend an index to the value so we know which file it came from.
            oc.collect(new Text(key),
                       new Text("1" + line.substring(firstComma + 1)));
        }
    }
    // Keep only users aged 18 to 25; tag their records with "2".
    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int age = Integer.parseInt(line.substring(firstComma + 1));
            if (age < 18 || age > 25) return;
            oc.collect(new Text(line.substring(0, firstComma)),
                       new Text("2" + line.substring(firstComma + 1)));
        }
    }
    // Reduce-side join: split values by source tag, emit the cross product.
    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                String value = iter.next().toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            for (String s1 : first)
                for (String s2 : second)
                    oc.collect(null, new Text(key + "," + s1 + "," + s2));
        }
    }
    // ... LoadJoined, ReduceUrls, LoadClicks, and LimitClicks classes
    // (count visits per url, then keep the top 100) elided ...

    public static void main(String[] args) throws IOException {
        // One JobConf per MapReduce stage, chained via JobControl.
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);
        // ... four more JobConf blocks of the same shape (filter users,
        // join, group by url, top 100) and their Job objects elided ...

        JobControl jc =
            new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        // ... addJob calls for the remaining jobs, with dependencies
        // declared via Job.addDependingJob ...
        jc.run();
    }
}

                    Pig Slides adapted from Olston et al.
Pig Latin Script!

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;

Pages = load '/data/pages' as (url, pagerank);

VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate group as user,
                AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;

store GoodUsers into '/data/good_users';

    Pig Slides adapted from Olston et al.
Java vs. Pig Latin!

                1/20 the lines of code                                     1/16 the development time
   [Bar charts comparing Hadoop vs. Pig: lines of code (y-axis 0-180) and
   development time in minutes (y-axis 0-300)]

                                       Performance on par with raw Hadoop!

  Pig Slides adapted from Olston et al.
Pig takes care of…!
• Schema and type checking
• Translating into efficient physical dataflow
 • (i.e., sequence of one or more MapReduce jobs)
• Exploiting data reduction opportunities
 • (e.g., early partial aggregation via a combiner)
• Executing the system-level dataflow
 • (i.e., running the MapReduce jobs)
• Tracking progress, errors, etc.
References!
• Getting started with Pig: http://pig.apache.org/docs/r0.11.1/start.html
• Pig Tutorial: http://pig.apache.org/docs/r0.7.0/tutorial.html
• Hive Tutorial: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Questions!