1. Hive - Data Warehousing & Analytics on Hadoop. Wednesday, June 10, 2009, Santa Clara Marriott. Namit Jain, Zheng Shao, Facebook
2. Agenda
- Hive Progress and Roadmap
3.
4. Why Another Data Warehousing System?
- ~1TB per day in March 2008
5.
6. Let's try Hadoop
- Superior in availability/scalability/manageability
- Efficiency not that great, but can throw more hardware at it
- Partial Availability/resilience/scale more important than ACID
- Cons: Programmability and Metadata
- Map-reduce hard to program (users know sql/bash/python)
- Need to publish data in well known schemas
7. Let's try Hadoop (continued)
- RDBMS> select key, count(1) from kv1 where key > 100 group by key;
- /tmp/reducer.sh: uniq -c | awk '{print $2"\t"$1}'
- /tmp/map.sh: awk -F '\001' '{if($1 > 100) print $1}'
- $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
- $ bin/hadoop dfs -cat /tmp/largekey/part*
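To make the division of labour in the streaming job concrete, here is a minimal Python sketch of what the two scripts compute. It assumes the rows of kv1 are Ctrl-A (`\x01`) delimited with the key in the first field, as the mapper's awk invocation suggests. The function names `map_phase` and `reduce_phase` and the sample rows are hypothetical; the real job runs the shell scripts under Hadoop streaming, not this code.

```python
from itertools import groupby


def map_phase(lines, threshold=100):
    """Mimic map.sh: emit the first Ctrl-A-delimited field when it exceeds threshold."""
    for line in lines:
        key = line.split("\x01")[0]
        # awk compares numerically when the field looks like a number.
        if float(key) > threshold:
            yield key


def reduce_phase(sorted_keys):
    """Mimic reducer.sh: like `uniq -c`, count runs of identical adjacent keys."""
    for key, run in groupby(sorted_keys):
        yield key, sum(1 for _ in run)


# Hypothetical sample rows standing in for /user/hive/warehouse/kv1.
rows = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]

# Hadoop sorts mapper output by key before the reducer sees it; sorted() stands in for that.
counts = dict(reduce_phase(sorted(map_phase(rows))))
print(counts)  # {'238': 2, '311': 1}
```

Even in this toy form, the filter, the sort, and the count are three separate pieces of plumbing for what is a one-line SQL query.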
8. What is HIVE?
- A system for managing and querying structured data built on top of Hadoop
- SQL as a familiar data warehousing tool
- Extensibility: Types, Functions, Formats, Scripts
- Scalability and Performance
9. Simplifying Hadoop
- RDBMS> select key, count(1) from kv1 where key > 100 group by key;
- hive> select key, count(1) from kv1 where key > 100 group by key;
10.
11. Data Warehousing at Facebook Today
- [Diagram: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]
12. Hive/Hadoop Usage @ Facebook
- Eg: Daily/Weekly aggregations of impression/click counts
- SELECT pageid, count(1) AS imps FROM imp_table WHERE date = '2009-05-01' GROUP BY pageid;
- Complex measures of user engagement
- Eg: how many group admins broken down by state/country
- Data Mining (Assembling training data)
- Eg: User Engagement as a function of user attributes
- Anomalous patterns for Site Integrity
- Application API usage patterns
13. Hadoop Usage @ Facebook
- Compression Factor ~5x (gzip)
- 3,200 jobs/day with 800K map-reduce tasks/day
- 55TB of compressed data scanned daily
- 15TB of compressed output data written to HDFS
- 150 active users within Facebook
14.
- Hive Progress and Roadmap
15.
- CREATE TABLE clicks(key STRING, value STRING) PARTITIONED BY (ds STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe' WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003') LOCATION '/hive/clicks';
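As a rough illustration of how such a partitioned table maps onto HDFS, the sketch below builds the directory paths for the table, one partition, and one bucket file. The helper `data_path` is hypothetical and only mirrors the layout shown on the Data Model slide (table location, then a `ds=<value>` directory, then a bucket number); it is not part of Hive's API.

```python
import posixpath


def data_path(table_location, partition=None, bucket=None):
    """Build the HDFS path for a table, one of its partitions, or one bucket file."""
    path = table_location
    if partition:
        # Each partition column contributes one `col=value` directory level.
        for col, value in partition.items():
            path = posixpath.join(path, f"{col}={value}")
    if bucket is not None:
        # Assumption: the bucket is stored under the partition, named by its number.
        path = posixpath.join(path, str(bucket))
    return path


print(data_path("/hive/clicks"))                                   # /hive/clicks
print(data_path("/hive/clicks", {"ds": "2008-03-25"}))             # /hive/clicks/ds=2008-03-25
print(data_path("/hive/clicks", {"ds": "2008-03-25"}, bucket=0))   # /hive/clicks/ds=2008-03-25/0
```

Because the partition value is encoded in the path, a query that filters on `ds` only needs to read the matching directories.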
16. Data Model
- Tables (e.g. clicks): /hive/clicks on HDFS
- Logical Partitioning: /hive/clicks/ds=2008-03-25
- Hash Partitioning: /hive/clicks/ds=2008-03-25/0
- MetaStore (Metastore DB): Tables, Data Location, Bucketing Info, Partitioning Cols
17. HIVE: Components
- [Diagram: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API, MetaStore (DB), Parser, Planner, Optimizer, Execution, SerDe (Thrift, CSV, JSON..), Map Reduce, HDFS]
18. Hive Query Language
- Subqueries in from clause
- SELECT s.key, count(1) FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s WHERE s.ds = '2009-04-22' GROUP BY s.key
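The idea behind `TABLESAMPLE (BUCKET 1 OUT OF 32)` can be sketched as follows: rows are assigned to one of 32 buckets by hashing a column, and the query reads only the rows in bucket 1. This is a minimal sketch under stated assumptions: the CRC32 hash, the helper names `bucket_of` and `tablesample`, and the generated rows are all stand-ins chosen for reproducibility, not Hive's actual hash function or API.

```python
import zlib


def bucket_of(key, num_buckets):
    """Assign a key to a bucket numbered 1..num_buckets (stand-in hash: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets + 1


def tablesample(rows, bucket, out_of):
    """Keep only the rows whose key falls in the requested bucket."""
    return [row for row in rows if bucket_of(row[0], out_of) == bucket]


# Hypothetical (key, value) rows standing in for one partition of clicks.
rows = [(f"key{i}", f"value{i}") for i in range(3200)]
sample = tablesample(rows, bucket=1, out_of=32)
print(len(sample), "of", len(rows), "rows sampled")
```

Since every row lands in exactly one bucket, the 32 samples together cover the table without overlap, and each is roughly 1/32 of the data when the hash spreads keys evenly.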
19.
- INSERT INTO TABLE pv_gender_sum
- SELECT gender, count(DISTINCT userid)
- INSERT INTO DIRECTORY /user/facebook/tmp/pv_age_sum.dir
- SELECT age, count(DISTINCT userid)
- INSERT INTO LOCAL DIRECTORY /home/me/pv_age_sum.dir
- SELECT age, count(DISTINCT userid)
20. Hive Query Language (continued)
- Pluggable Map-reduce scripts
- Pluggable User Defined Functions
- Pluggable User Defined Types
- Complex object types: List of Maps
21. Pluggable Map-Reduce Scripts
- MAP pv_users.userid, pv_users.date
- INSERT INTO TABLE pv_users_reduced
22. Map Reduce Example
- [Diagram: Machine 1 and Machine 2, each with Local Map, Global Shuffle, Local Sort, Local Reduce]
23. Hive QL Join
- INSERT INTO TABLE pv_users
- SELECT pv.pageid, u.age
- FROM page_view pv JOIN user u
- ON (pv.userid = u.userid);
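A join like this can be run as a reduce-side join: mappers tag each row with its source table and emit it under the join key, the shuffle groups rows by key, and reducers pair up the two sides. The sketch below is a minimal single-process imitation of that flow, using the sample rows from the "Hive QL Join in Map Reduce" slide. The function names are hypothetical, and the tag values 1 and 2 follow the slide's `<tag, value>` notation.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]  # pageid, userid, time
user = [(111, 25, "female"), (222, 32, "male")]                               # userid, age, gender


def map_phase():
    """Emit (join key, (table tag, payload)) pairs from both inputs."""
    for pageid, userid, _time in page_view:
        yield userid, (1, pageid)   # tag 1 marks a page_view row
    for userid, age, _gender in user:
        yield userid, (2, age)      # tag 2 marks a user row


def shuffle(pairs):
    """Group values by key and order the keys, as the shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """For each key, pair every page_view value with every user value."""
    for _userid, values in groups.items():
        pageids = [v for tag, v in values if tag == 1]
        ages = [v for tag, v in values if tag == 2]
        for pageid in pageids:
            for age in ages:
                yield pageid, age


pv_users = list(reduce_phase(shuffle(map_phase())))
print(pv_users)  # [(1, 25), (2, 25), (1, 32)]
```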
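24. Hive QL Join in Map Reduce
- page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)
- user (userid, age, gender): (111, 25, female), (222, 32, male)
- Map output from page_view (key, value): (111, <1, 1>), (111, <1, 2>), (222, <1, 1>)
- Map output from user (key, value): (111, <2, 25>), (222, <2, 32>)
- After Shuffle/Sort, first reducer input (key, value): (111, <1, 1>), (111, <1, 2>), (111, <2, 25>)
- After Shuffle/Sort, second reducer input (key, value): (222, <1, 1>), (222, <2, 32>)
- Reduce output pv_users (pageid, age): (1, 25), (2, 25) from the first reducer; (1, 32) from the second
25. Join Optimizations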
- User specified small tables stored in hash tables on the mapper backed by jdbm
- INSERT INTO TABLE pv_users
- SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
- FROM page_view pv JOIN user u
- ON (pv.userid = u.userid);
- Exploit table/column statistics for deciding strategy
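The map-side join described above can be sketched in a few lines: the table named in the `MAPJOIN` hint is loaded into an in-memory hash table, and each row of the other table probes it directly, so no shuffle or reduce phase is needed. In this sketch a plain Python dict stands in for the jdbm-backed hash table, the function name `map_join` is hypothetical, and the sample rows are illustrative.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]  # pageid, userid, time
user = [(111, 25, "female"), (222, 32, "male")]                               # userid, age, gender


def map_join(hashed_rows, streamed_rows):
    """Join inside the mapper: hash one table on userid, probe it per streamed row."""
    # Build phase: following the MAPJOIN(pv) hint, page_view is the hashed table.
    table = defaultdict(list)
    for pageid, userid, _time in hashed_rows:
        table[userid].append(pageid)
    # Probe phase: each streamed user row looks up its matching page views.
    for userid, age, _gender in streamed_rows:
        for pageid in table.get(userid, []):
            yield pageid, age


pv_users = list(map_join(page_view, user))
print(pv_users)  # [(1, 25), (2, 25), (1, 32)]
```

The trade-off is memory: this only works when the hashed table is small enough to fit on every mapper, which is why the user has to name it explicitly.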
26. Hive QL Map Join
page_view user Hash table pv_users key