Hive and Pig! Juliana Freire - New York University
Need for High-Level Languages! • Hadoop is great for large-data processing! • But writing Java programs for everything is verbose and slow • Not everyone wants to (or can) write Java code • Solution: develop higher-level data processing languages • Hive: HQL is like SQL • Pig: Pig Latin is a bit like Perl
Hive and Pig! • Hive: data warehousing application in Hadoop • Query language is HQL, variant of SQL • Tables stored on HDFS as flat files • Developed by Facebook, now open source • Pig: large-scale data processing system • Scripts are written in Pig Latin, a dataflow language • Developed by Yahoo!, now open source • Roughly 1/3 of all Yahoo! internal jobs • Common idea: • Provide higher-level language to facilitate large-data processing • Higher-level language “compiles down” to Hadoop jobs
Hive: Background!
• Started at Facebook
• Data was collected by nightly cron jobs into Oracle DB
• "ETL" via hand-coded python
• Grew from 10s of GBs (2006) to 1 TB/day new data (2007), now 10x that
[Diagram: OLTP sources feed Hadoop via ETL (Extract, Transform, and Load), which in turn feeds OLAP]
Source: cc-licensed slide by Cloudera
Hive Components!
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe
[Figure 1: Hive Architecture, from Thusoo et al., VLDB 2009]
Source: cc-licensed slide by Cloudera
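To see the compiler and execution engine at work, Hive's EXPLAIN statement prints the plan a query compiles into. The sketch below is illustrative only; it assumes the status_updates table from the Hive paper's running example, and the exact plan output differs across Hive versions.

-- Ask the compiler for the plan it produces for a simple aggregation:
EXPLAIN
SELECT ds, COUNT(1)
FROM status_updates
GROUP BY ds;
-- The output describes a DAG of stages (e.g., a map-reduce stage followed by
-- a fetch stage), which the execution engine then submits as Hadoop jobs.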
Data Model!
• Tables: analogous to tables in RDBMS
  • Typed columns (int, float, string, boolean)
  • Structs: {a INT; b INT}
  • Also, list: map (for JSON-like data)
• Partitions
  • For example, range-partition tables by date
• Buckets
  • Hash partitions within ranges (useful for sampling, join optimization)
[Thusoo et al., VLDB 2009]
Source: cc-licensed slide by Cloudera
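A hedged sketch of how these concepts map onto HiveQL DDL; the table and column names below are illustrative, not taken from the slides.

-- Illustrative table with typed columns, a struct, a map, a date partition,
-- and hash buckets on userid.
CREATE TABLE page_views (
  userid   INT,
  url      STRING,
  referrer STRUCT<host: STRING, path: STRING>,
  props    MAP<STRING, STRING>
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;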
Metastore! • Database: namespace containing a set of tables • Holds table definitions (column types, physical layout) • Holds partitioning information • Can be stored in Derby, MySQL, and many other relational databases Source: cc-licensed slide by Cloudera
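The CLI commands that report schemas and partitions are answered from the metastore rather than by scanning data. A small sketch, reusing the status_updates table from the Hive paper's running example (output details vary by Hive version):

SHOW TABLES;
DESCRIBE EXTENDED status_updates;   -- column types, SerDe, HDFS location
SHOW PARTITIONS status_updates;     -- e.g., ds=2009-03-20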
Physical Layout! • Warehouse directory in HDFS • E.g., /user/hive/warehouse • Tables stored in subdirectories of warehouse • Partitions form subdirectories of tables • Each table has a corresponding HDFS directory • Actual data stored in flat files • Users can associate a table with a serialization format • Control char-delimited text, or SequenceFiles • With custom SerDe, can use arbitrary format Source: cc-licensed slide by Cloudera
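A brief sketch of how a table definition chooses its serialization format and where its data then lands in HDFS; the table name and field delimiter here are illustrative.

-- Plain delimited text stored under the warehouse directory:
CREATE TABLE logs (line STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;                 -- or STORED AS SEQUENCEFILE
-- Data for a given partition then lives in files under
--   /user/hive/warehouse/logs/ds=2009-03-20/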
Hive: Example!
• Hive looks similar to an SQL database
• Relational join on two tables:
  • Table of word counts from Shakespeare collection
  • Table of word counts from the bible

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

the   25848   62394
I     23031    8854
and   19671   38985
to    18038   13526
of    16700   34654
a     14170    8057
you   12702    2720
my    11297    4135
in    10797   12445
is     8882    6884

Source: Material drawn from Cloudera training VM
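The slides do not show the setup for these tables; a minimal sketch, assuming tab-delimited (word, frequency) files already sit at hypothetical HDFS paths:

-- Hypothetical setup for the two word-count tables used in the join above.
CREATE TABLE shakespeare (word STRING, freq INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
CREATE TABLE bible (word STRING, freq INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/shakespeare_freq' INTO TABLE shakespeare;
LOAD DATA INPATH '/tmp/bible_freq' INTO TABLE bible;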
Hive: Another Example!
• Running example from the Hive paper: status_updates(userid int, status string, ds string) is loaded daily from flat files in /logs/status_updates and partitioned on ds; profiles(userid int, school string, gender int) holds user profile information

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2009-03-20')

• A multi-table insert computes the daily counts of status updates by school (into school_summary) and by gender (into gender_summary) using a single scan of the join of status_updates and profiles
• The output tables are also partitioned on ds; HiveQL lets users insert query results into a specific partition of the output table

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')
     ) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school

[Thusoo et al., VLDB 2009]
Hive: Another Example!
• HiveQL provides MapReduce constructs

REDUCE subq2.school, subq2.meme, subq2.cnt
USING 'top10.py' AS (school, meme, cnt)
FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
      FROM (MAP b.school, a.status
            USING 'meme-extractor.py' AS (school, meme)
            FROM status_updates a JOIN profiles b
            ON (a.userid = b.userid)
           ) subq1
      GROUP BY subq1.school, subq1.meme
      DISTRIBUTE BY school, meme
      SORT BY school, meme, cnt DESC
     ) subq2;

Source: Material drawn from Cloudera training VM
Example Data Analysis Task!
Find users who tend to visit "good" pages.

Visits:
user   url                 time
Amy    www.cnn.com         8:00
Amy    www.crap.com        8:05
Amy    www.myblog.com      10:00
Amy    www.flickr.com      10:05
Fred   cnn.com/index.htm   12:00
...

Pages:
url              pagerank
www.cnn.com      0.9
www.flickr.com   0.9
www.myblog.com   0.7
www.crap.com     0.2
...

Pig Slides adapted from Olston et al.
Conceptual Dataflow!
• Load Visits(user, url, time), then canonicalize URLs
• Load Pages(url, pagerank)
• Join on url = url
• Group by user
• Compute average pagerank per user
• Filter avgPR > 0.5
Pig Slides adapted from Olston et al.
System-Level Dataflow!
[Diagram: parallel load of Visits and Pages, canonicalize, join by url, group by user, compute average pagerank, filter to get the answer; each stage is drawn as many parallel instances]
Pig Slides adapted from Olston et al.
MapReduce Code!
[Roughly 200 lines of hand-written Java from the original Pig slides, implementing a comparable task ("find top 100 sites for users 18 to 25") directly on Hadoop: Mapper/Reducer classes LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, and LimitClicks, plus a main() that configures and chains five jobs ("Load Pages", "Load and Filter Users", "Join Users and Pages", "Group URLs", "Top 100 sites") via JobConf and JobControl]
Pig Slides adapted from Olston et al.
Pig Latin Script!

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url) as url, time;

Pages = load '/data/pages' as (url, pagerank);

VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate group as user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;

store GoodUsers into '/data/good_users';

Pig Slides adapted from Olston et al.
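For comparison, a rough HiveQL version of the same dataflow. This is only a sketch under assumptions: it presumes tables visits(user, url, time) and pages(url, pagerank) already exist in Hive and that a user-defined canonicalize() function has been registered for URL normalization; neither comes with Hive out of the box.

-- Sketch: users whose visited pages have average pagerank above 0.5.
-- `user` is backticked because it is a reserved word in some Hive versions.
SELECT v.`user`, AVG(p.pagerank) AS avgpr
FROM (SELECT `user`, canonicalize(url) AS url FROM visits) v
JOIN pages p ON (v.url = p.url)
GROUP BY v.`user`
HAVING AVG(p.pagerank) > 0.5;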
Java vs. Pig Latin!
• 1/20 the lines of code
• 1/16 the development time
[Bar charts: lines of code, Hadoop vs. Pig; development time in minutes, Hadoop vs. Pig]
• Performance on par with raw Hadoop!
Pig Slides adapted from Olston et al.
Pig takes care of…! • Schema and type checking • Translating into efficient physical dataflow • (i.e., sequence of one or more MapReduce jobs) • Exploiting data reduction opportunities • (e.g., early partial aggregation via a combiner) • Executing the system-level dataflow • (i.e., running the MapReduce jobs) • Tracking progress, errors, etc.
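Pig can show the plans behind all of this. A small sketch using the GoodUsers alias from the script above, run from the Grunt shell after entering the script:

-- EXPLAIN prints the logical, physical, and MapReduce plans Pig compiles to;
-- ILLUSTRATE shows sample tuples flowing through each step of the dataflow.
EXPLAIN GoodUsers;
ILLUSTRATE GoodUsers;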
References! • Getting started with Pig: http://pig.apache.org/docs/r0.11.1/start.html • Pig Tutorial: http://pig.apache.org/docs/r0.7.0/tutorial.html • Hive Tutorial: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Questions!