Cost-based query optimization in Apache Hive 0.14

  • View
    596

  • Download
    7

Embed Size (px)

DESCRIPTION

Slides from a talk given by Julian Hyde to the Hive User Group Meetup NYC Edition, on October 15th, 2014.

Text of Cost-based query optimization in Apache Hive 0.14

  • 1. Cost-based query optimization in Apache Hive 0.14 Julian Hyde Julian Hyde Page # Hortonworks Inc. 2014 Hive User Group, NYC October 15th, 2014
  • 2. About me Julian Hyde Architect at Hortonworks Open source: Founder & lead, Apache Calcite (query optimization framework) Founder & lead, Pentaho Mondrian (analysis engine) Committer, Apache Drill Contributor, Apache Hive Contributor, Cascading Lingual (SQL interface to Cascading) Past: SQLstream (streaming SQL) Broadbase (data warehouse) Oracle (SQL kernel development) Page # Hortonworks Inc. 2014
  • 3. Hadoop - A New Data Architecture for New Data APP LICA TIO NS DAT A SYS TEM Business Analytics Custom Applications Packaged Applications RDBMS EDW MPP REPOSITORIES SOU RCE S Existing Sources (CRM, ERP, Clickstream, Logs) Page # Hortonworks Inc. 2014 OLTP, ERP, CRM Systems Unstructured documents, emails Server logs Clickstream Sentiment, Web Data Sensor. Machine Data Geolocation New Data Requirements: ! Scale Economics Flexibility Traditional Data Architecture
  • 4. Interactive SQL-IN-Hadoop Delivered Stinger Initiative DELIVERED Next generation SQL based interactive query in Hadoop Speed Page # Hortonworks Inc. 2014 Hortonworks Inc. 2013 Improve Hive query performance has increased by 100X to allow for interactive query times (seconds) Scale The only SQL interface to Hadoop designed for queries that scale from TB to PB SQL Support broadest range of SQL semantics for analytic applications running against Hadoop SQL Business Analytics Custom Apps Window Functions Apache Hive Apache MapReduce Apache Tez Apache YARN !! 1 !! Apache Hive Contribution an Open Community at its finest 1,672 Jira Tickets Closed 145 Developers 44 Companies ~390,000 Lines Of Code Added (2x) N HDFS (Hadoop Distributed File System) Stinger Project Stinger Phase 1: Base Optimizations SQL Types SQL Analytic Functions ORCFile Modern File Format Sting!er Phase 2: Delivered HDP 2.1 SQL Types SQL Analytic Functions Advanced Optimizations Performance Boosts via YARN !S tinger Phase 3 Hive on Apache Tez Query Service (always on) Buffer Cache Cost Based Optimizer (Calcite) 13 Months Governance & Integration Security Operations Data Access Data Management ORC File
  • 5. Hive Single tool for all SQL use cases Page # Hortonworks Inc. 2014 OLTP, ERP, CRM Systems Unstructured documents, emails Server logs Clickstream Sentiment, Web Data Sensor. Machine Data Geolocation Interactive Analytics Batch Reports / Deep Analytics Hive - SQL ETL / ELT
  • 6. Incremental cutover to cost-based optimization Release Date Remarks Apache Hive 0.12 October 2013 Rule-based Optimizations Page # Hortonworks Inc. 2014 No join reordering Main optimizations: predicate push- Apache Hive 0.13 April 2014 Old-style JOIN & push-down conditions: FROM t1, t2 WHERE HDP 2.1 April 2014 Cost-based ordering of joins Apache Hive 0.14 soon Bushy joins, large joins, better operator coverage, better statistics,
  • 7. Apache Calcite (incubating) Apache Calcite Page # Hortonworks Inc. 2014
  • 8. Apache Calcite Apache incubator project since May, 2014 Query planning framework Relational algebra, rewrite rules, cost model Extensible Usable standalone (JDBC) or embedded Adoption Lingual SQL interface to Cascading Apache Drill Apache Hive Adapters: Splunk, Spark, MongoDB, JDBC, CSV, JSON, Web tables, In-memory data Page # Hortonworks Inc. 2014
  • 9. Conventional DB architecture Page # Hortonworks Inc. 2014
  • 10. Calcite architecture Page # Hortonworks Inc. 2014
  • 11. Expression tree Splunk Table: splunk MySQL Page # Hortonworks Inc. 2014 Key: product_id join Key: product_name Agg: count group Condition: action = 'purchase' filter Key: c DESC sort scan scan Table: products SELECT p.product_name, COUNT(*) AS c FROM splunk.splunk AS s JOIN mysql.products AS p ON s.product_id = p.product_id WHERE s.action = 'purchase' GROUP BY p.product_name ORDER BY c DESC
  • 12. Expression tree (optimized) Splunk Table: splunk Page # Hortonworks Inc. 2014 Key: product_id join Key: product_name Agg: count group Condition: action = 'purchase' filter Key: c DESC sort scan MySQL scan Table: products SELECT p.product_name, COUNT(*) AS c FROM splunk.splunk AS s JOIN mysql.products AS p ON s.product_id = p.product_id WHERE s.action = 'purchase' GROUP BY p.product_name ORDER BY c DESC
  • 13. Calcite APIs and SPIs Page # Hortonworks Inc. 2014 Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider RelMdColumnUniquensss RelMdDistinctRowCount RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule MergeFilterRule PushAggregateThroughUnionRule 100+ more Global transformations Unification (materialized view) Column trimming De-correlation Relational algebra RelNode (operator) TableScan Filter Project Union Aggregate RelDataType (type) RexNode (expression) RelTrait (physical property) RelConvention (calling-convention) RelCollation (sortedness) TBD (bucketedness/distribution) JDBC driver Metadata Schema Table Function TableFunction TableMacro Lattice
  • 14. Just added to Calcite - materialized views & lattices Page # Hortonworks Inc. 2014 Query: SELECT x, SUM(y) FROM t GROUP BY x In-memory materialized queries Tables on disk http://hortonworks.com/blog/dmmq/
  • 15. Calcite Lattice () 1 (z, s) 43.4k (y, m) 60 Key ! z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 Page # Hortonworks Inc. 2014 (z, s, g, y, m) 912k (s, g, y, m) 6k (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (z, s, g) 87k (g, y) 10 (g, y, m) 120 (g, m) 24
  • 16. Now back to Hive Page # Hortonworks Inc. 2014 Apache Calcite
  • 17. CBO in Hive Why cost-based optimization? Ease of Use Join Reordering View Chaining Ad hoc queries involving multiple views Enables BI Tools as front ends to Hive More efficient & maintainable query preparation process Laying the groundwork for deeper optimizations, e.g. materialized views Page # Hortonworks Inc. 2014 17
  • 18. Query preparation Hive 0.13 SQL parser Page # Hortonworks Inc. 2014 Semantic analyzer Logical Optimizer Physical Optimizer Abstract Syntax Tree (AST) Hive SQL Annotated AST Plan Tez Tuned Plan
  • 19. Query preparation full CBO SQL parser Page # Hortonworks Inc. 2014 Semantic analyzer Annotated AST Translate to algebra Physical Optimizer Abstract Syntax Tree (AST) Hive SQL Tez Tuned Plan Calcite optimizer RelNode
  • 20. Query preparation Hive 0.14 SQL parser Page # Hortonworks Inc. 2014 Semantic analyzer Logical Optimizer Physical Optimizer Hive SQL AST with optimized join-ordering Tez Tuned Plan Translate to algebra Calcite optimizer