Hadoop Summit 2009 Hive


1. Hive - Data Warehousing & Analytics on Hadoop

  • Wednesday, June 10, 2009, Santa Clara Marriott
  • Namit Jain, Zheng Shao (Facebook)

2. Agenda

  • Introduction
  • Facebook Usage
  • Hive Progress and Roadmap
  • Open Source Community

3. Introduction

4. Why Another Data Warehousing System?

  • Data, data and more data
  • ~1TB per day in March 2008
  • ~10TB per day today

5.

6. Let's try Hadoop

  • Pros
    • Superior in availability/scalability/manageability
    • Efficiency not that great, but throw more hardware
    • Partial Availability/resilience/scale more important than ACID
  • Cons: Programmability and Metadata
    • Map-reduce hard to program (users know sql/bash/python)
    • Need to publish data in well known schemas
  • Solution: HIVE

7. Let's try Hadoop (continued)

  • RDBMS> select key, count(1) from kv1 where key > 100 group by key;
  • vs.
  • $ cat > /tmp/reducer.sh
  • uniq -c | awk '{print $2"\t"$1}'
  • $ cat > /tmp/map.sh
  • awk -F '\001' '{if($1 > 100) print $1}'
  • $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
  • $ bin/hadoop dfs -cat /tmp/largekey/part*
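The two shell scripts above amount to a filter in the mapper and a count-per-key in the reducer. A minimal Python sketch of the same logic follows; it assumes rows are delimited by the control character \001 and that keys are integers, and the sample rows are made up for illustration.

```python
from itertools import groupby

def map_rows(lines, sep="\x01"):
    # Mapper: keep the first field of each row when it is greater than 100.
    for line in lines:
        key = line.rstrip("\n").split(sep)[0]
        if int(key) > 100:
            yield key

def reduce_keys(sorted_keys):
    # Reducer: input arrives sorted, so equal keys are adjacent (like uniq -c).
    for key, group in groupby(sorted_keys):
        yield key, sum(1 for _ in group)

def run(lines):
    # The sort stands in for the shuffle/sort step between map and reduce.
    return list(reduce_keys(sorted(map_rows(lines))))

# Hypothetical sample rows in a key<\001>value layout.
rows = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]
counts = run(rows)
```

Even in this compact form, the user has to think about delimiters, sorting, and grouping by hand, which is the programmability cost the slide points at.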

8. What is HIVE?

  • A system for managing and querying structured data built on top of Hadoop
    • Map-Reduce for execution
    • HDFS for storage
    • Metadata on raw files
  • Key Building Principles:
    • SQL as a familiar data warehousing tool
    • Extensibility: Types, Functions, Formats, Scripts
    • Scalability and Performance

9. Simplifying Hadoop

  • RDBMS> select key, count(1) from kv1 where key > 100 group by key;
  • vs.
  • hive> select key, count(1) from kv1 where key > 100 group by key;

10. Facebook Usage

11. Data Warehousing at Facebook Today

  • Diagram labels: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL

12. Hive/Hadoop Usage @ Facebook

  • Types of Applications:
    • Reporting
      • Eg: Daily/Weekly aggregations of impression/click counts
      • SELECT pageid, count(1) AS imps FROM imp_table WHERE date = '2009-05-01' GROUP BY pageid;
      • Complex measures of user engagement
    • Ad hoc Analysis
      • Eg: how many group admins broken down by state/country
    • Data Mining (Assembling training data)
      • Eg: User Engagement as a function of user attributes
    • Spam Detection
      • Anomalous patterns for Site Integrity
      • Application API usage patterns
    • Ad Optimization

13. Hadoop Usage @ Facebook

  • Cluster Capacity:
    • 600 nodes
    • ~2.4PB (80% used)
  • Data statistics:
    • Source logs/day: 6TB
    • Dimension data/day: 4TB
    • Compression Factor ~5x (gzip)
  • Usage statistics:
    • 3200 jobs/day with 800K map-reduce tasks/day
    • 55TB of compressed data scanned daily
    • 15TB of compressed output data written to hdfs
    • 150 active users within Facebook
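A few derived figures follow directly from the numbers on this slide. The sketch below is back-of-the-envelope arithmetic only; it assumes decimal units (1PB = 1000TB), and the variable names are illustrative.

```python
# Figures quoted on the slide.
nodes = 600
capacity_tb = 2400          # ~2.4PB, assuming 1PB = 1000TB
used_percent = 80
source_logs_tb_per_day = 6
dimension_data_tb_per_day = 4
jobs_per_day = 3200
tasks_per_day = 800_000
active_users = 150

# Derived figures.
used_tb = capacity_tb * used_percent // 100                      # space in use
capacity_per_node_tb = capacity_tb / nodes                       # average per node
daily_ingest_tb = source_logs_tb_per_day + dimension_data_tb_per_day
tasks_per_job = tasks_per_day // jobs_per_day                    # average job size
jobs_per_user = jobs_per_day / active_users                      # average per user

print(f"used: {used_tb}TB, per node: {capacity_per_node_tb:.1f}TB")
print(f"ingest: {daily_ingest_tb}TB/day, tasks/job: {tasks_per_job}, "
      f"jobs/user/day: {jobs_per_user:.1f}")
```

The daily ingest of 6TB + 4TB is consistent with the "~10TB per day today" figure quoted earlier in the deck.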

14. Hive Progress and Roadmap

15.

  • CREATE TABLE clicks(key STRING, value STRING)
    PARTITIONED BY (ds STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe'
    WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003')
    LOCATION '/hive/clicks';

16. Data Model

  • Tables (clicks): stored in HDFS at /hive/clicks
  • Logical Partitioning: /hive/clicks/ds=2008-03-25
  • Hash Partitioning: /hive/clicks/ds=2008-03-25/0
  • MetaStore (Metastore DB): Data Location, Bucketing Info, Partitioning Cols

17. HIVE: Components

  • Diagram labels: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API, Parser, Planner, Optimizer, Execution, SerDe (Thrift, CSV, JSON..), MetaStore, DB, Map Reduce, HDFS

18. Hive Query Language

  • SQL
    • Subqueries in from clause
    • Equi-joins
    • Multi-table Insert
    • Multi-group-by
  • Sampling
    • SELECT s.key, count(1) FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s WHERE s.ds = '2009-04-22' GROUP BY s.key
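Bucket sampling can be pictured as hash partitioning: each row lands in one of 32 buckets according to a hash of its key, and the query reads only one of them. The sketch below illustrates that idea under stated assumptions: `zlib.crc32` stands in for Hive's own hash function, the sample rows are invented, and the path layout mirrors the /hive/clicks/ds=2008-03-25/0 example from the data model slide.

```python
import zlib

NUM_BUCKETS = 32

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # crc32 is a stand-in hash (assumption); any stable hash shows the idea.
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def bucket_path(table_dir, ds, bucket):
    # Partition directory first, then one entry per bucket.
    return f"{table_dir}/ds={ds}/{bucket}"

def sample_bucket(rows, bucket_number, num_buckets=NUM_BUCKETS):
    # bucket_number is 1-based, as in BUCKET 1 OUT OF 32.
    wanted = bucket_number - 1
    return [r for r in rows if bucket_of(r["key"], num_buckets) == wanted]

# Hypothetical rows for one partition of the clicks table.
rows = [{"key": f"user{i}", "ds": "2009-04-22"} for i in range(1000)]
sample = sample_bucket(rows, 1)
```

Because every row belongs to exactly one bucket, reading a single bucket touches roughly 1/32 of the data when the hash spreads keys evenly.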

19.

      • FROM pv_users
      • INSERT INTO TABLE pv_gender_sum
        • SELECT gender, count(DISTINCT userid)
        • GROUP BY gender
      • INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
        • SELECT age, count(DISTINCT userid)
        • GROUP BY age
      • INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
        • SELECT age, count(DISTINCT userid)
        • GROUP BY age;
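The point of a multi-table insert is that the source table is scanned once while several aggregations are computed from that scan. A rough Python sketch of the idea follows; the sample rows and the function name `multi_group_by` are hypothetical, and this is not a description of how Hive plans the query.

```python
from collections import defaultdict

def multi_group_by(rows):
    # One pass over the input feeds both aggregations.
    users_by_gender = defaultdict(set)
    users_by_age = defaultdict(set)
    for row in rows:
        users_by_gender[row["gender"]].add(row["userid"])
        users_by_age[row["age"]].add(row["userid"])
    # count(DISTINCT userid) per group is the size of each set.
    gender_sum = {g: len(u) for g, u in users_by_gender.items()}
    age_sum = {a: len(u) for a, u in users_by_age.items()}
    return gender_sum, age_sum

# Hypothetical pv_users rows; user 111 appears twice but is counted once.
pv_users = [
    {"userid": 111, "gender": "female", "age": 25},
    {"userid": 111, "gender": "female", "age": 25},
    {"userid": 222, "gender": "male", "age": 32},
]
pv_gender_sum, pv_age_sum = multi_group_by(pv_users)
```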

20. Hive Query Language (continued)

  • Extensibility
    • Pluggable Map-reduce scripts
    • Pluggable User Defined Functions
    • Pluggable User Defined Types
      • Complex object types: List of Maps
    • Pluggable Data Formats
      • Apache Log Format
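A pluggable map-reduce script is an ordinary program that reads rows from standard input and writes rows to standard output, with columns separated by tabs. The sketch below shows what a map/reduce script pair might look like in Python. The logic is hypothetical (the deck does not show the scripts' contents): the map step swaps (userid, date) into (date, userid), and the reduce step counts rows per date, assuming its input arrives grouped by date.

```python
from itertools import groupby

def map_script(lines):
    # Input rows: userid<TAB>date. Output rows: date<TAB>userid.
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_script(lines):
    # Input rows: date<TAB>userid, already grouped by date.
    # Output rows: date<TAB>number of rows for that date.
    keyed = (line.rstrip("\n").split("\t") for line in lines)
    for date, group in groupby(keyed, key=lambda cols: cols[0]):
        yield f"{date}\t{sum(1 for _ in group)}"

# Hypothetical input; sorted() stands in for clustering by date.
page_views = ["111\t2009-04-22\n", "222\t2009-04-22\n", "111\t2009-04-23\n"]
mapped = list(map_script(page_views))
reduced = list(reduce_script(sorted(mapped)))
```

As standalone scripts, each function would be wrapped in a loop over `sys.stdin` that prints its results, so the same file works both from the shell and from a query.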

21. Pluggable Map-Reduce Scripts

      • FROM (
        • FROM pv_users
        • MAP pv_users.userid, pv_users.date
        • USING 'map_script'
        • AS dt, uid
        • CLUSTER BY dt) map
      • INSERT INTO TABLE pv_users_reduced
        • REDUCE map.dt, map.uid
        • USING 'reduce_script'
        • AS date, count;

22. Map Reduce Example

  • Diagram labels: Machine 1, Machine 2; Local Map, Global Shuffle, Local Sort, Local Reduce

23. Hive QL Join

    • INSERT INTO TABLE pv_users
    • SELECT pv.pageid, u.age
    • FROM page_view pv
    • JOIN user u
    • ON (pv.userid = u.userid);
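One way such a join can run as a map-reduce job is a reduce-side join: the mappers tag each row with its source table, the shuffle groups rows by userid, and the reducer pairs page views with ages. The sketch below uses the sample rows from the "Hive QL Join in Map Reduce" slide; the function names are illustrative, and tag 1 marks page_view rows while tag 2 marks user rows.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def map_phase(page_view_rows, user_rows):
    # Emit (join key, (table tag, payload)) for rows of both tables.
    for pageid, userid, _time in page_view_rows:
        yield userid, (1, pageid)
    for userid, age, _gender in user_rows:
        yield userid, (2, age)

def shuffle_sort(pairs):
    # Group values by key and sort them, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}

def reduce_phase(groups):
    # For each userid, pair every page view with every matching user row.
    for _userid, values in groups.items():
        pageids = [payload for tag, payload in values if tag == 1]
        ages = [payload for tag, payload in values if tag == 2]
        for pageid in pageids:
            for age in ages:
                yield pageid, age

pv_users = list(reduce_phase(shuffle_sort(map_phase(page_view, user))))
```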

24. Hive QL Join in Map Reduce

  • page_view (pageid, userid, time)
    • 1, 111, 9:08:01
    • 2, 111, 9:08:13
    • 1, 222, 9:08:14
  • user (userid, age, gender)
    • 111, 25, female
    • 222, 32, male
  • Map output from page_view (key, value)
    • 111, <1, 1>
    • 111, <1, 2>
    • 222, <1, 1>
  • Map output from user (key, value)
    • 111, <2, 25>
    • 222, <2, 32>
  • Shuffle and Sort: reducer input (key, value)
    • Key 111: <1, 1>, <1, 2>, <2, 25>
    • Key 222: <1, 1>, <2, 32>
  • Reduce output: pv_users (pageid, age)
    • 1, 25
    • 2, 25
    • 1, 32

25. Join Optimizations

  • Map Joins
    • User specified small tables stored in hash tables on the mapper backed by jdbm
    • No reducer needed
    • INSERT INTO TABLE pv_users
    • SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
    • FROM page_view pv JOIN user u
    • ON (pv.userid = u.userid);
  • Future
    • Exploit table/column statistics for deciding strategy
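The map join idea can be sketched as a hash join done entirely on the mapper: the small table is loaded into an in-memory hash table, and each row of the large table probes it, so no shuffle or reducer is needed. The sketch below is a simplified illustration, not Hive's implementation: it assumes `user` is the small table, uses a plain Python dict where the slide mentions jdbm-backed hash tables, and reuses the sample rows from the join slides.

```python
page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def build_hash_table(small_rows):
    # Build phase: index the small table by join key (userid).
    ages_by_userid = {}
    for userid, age, _gender in small_rows:
        ages_by_userid.setdefault(userid, []).append(age)
    return ages_by_userid

def map_join(big_rows, ages_by_userid):
    # Probe phase: each mapper streams its share of the large table
    # and looks up matches locally.
    for pageid, userid, _time in big_rows:
        for age in ages_by_userid.get(userid, ()):
            yield pageid, age

pv_users = list(map_join(page_view, build_hash_table(user)))
```

The trade-off is memory: the approach only works when the hash table for the small table fits on each mapper, which is why the slide describes the small tables as user specified.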

Facebook 26. Hive QL Map Join Facebook page_view user Hash table pv_users key