Hive sourcecodereading

  • View

  • Download

Embed Size (px)

Text of Hive sourcecodereading

  • 1. Hive Source CodeReadingHadoop Source Code Reading Vol.9(5/30)

2. OverviewMotivationWhat component did I read?Set up Development EnvironmentUseful document for Code ReadingLets try!My understandingWhere does Hive tune performance?My impression2012531 3. MotivationCuriosityHow does Hive convert HiveQL toMapReduce jobs?What optimizing processes are adopted? select item, sum(value) from item_tbl group by item;map reduce2012531 4. What component did I read? 5. Set up Development EnvironmentSee Setting up is troublesome...(-_-;) I read r1342841.2012531 6. Useful documentHive Anatomy anatomyInternal Hive internal-hive2012531 7. Lets try! 8. a few minutes later 9. a few hours later 10. Its difcult...ComplexMethod length is long 11. And there is this ticket... 12. 2012531 13. Terrible...Its not good... 14. anyway... 15. My understanding HiveQLParser AST Semantic AnalyzerOperatorLogical Plan Generation QB Logical OptimizerOperatorPhysical Plan Generation Prepare ExecutionTask Physical OptimizerTask Start MapReduce job2012531 16. Operator ExampleFetchOperatorSelectOperatorFilterOperatorGroupByOperatorTableScanOperatorReduceSinkOperatorJoinOperator2012531 17. Explain Example 1hive> explain select * from aaa;...STAGE DEPENDENCIES: Stage-0 is a root stageSTAGE PLANS: Stage: Stage-0Fetch Operator limit: -12012531 18. Explain Example 22012531 19. OptimizerRule-basedOptimizer is a set of Transformation rules onthe operator treeThe transformation rules are specied by aregexp pattern on the tree.The rule with minimum cost are executed.The cost is regexp pattern match length.2012531 20. Cost Calculation Example HiveQL is select key from src; Optimizer is ColumnPruner. SelectOperator is expressed as SEL%. Regexp pattern of ColumnPrunerSelectProc rule is expressed as SEL%. The cost is 4. Pattern.compile(SEL%).matcher(SEL%).group ().length()2012531 21. Logical Optimizer ExampleColumnPrunerPredicatePushDownPartitionPrunerGroupByOptimizerUnionProcessorJoinReorder2012531 22. Physical Optimizer ExampleSkewJoinResolverCommonJoinResolver If, resolver start.IndexWhereResolverMetadataOnlyOptimizer2012531 23. Break 24. Where does Hive tune performance? Plan Physical Optimizerhive.exec.parallel Prepare Execution2012531 25. trueLogical Plan Generationmap side aggregation(in-mapper combining) 26. false Physical OptimizerIf, map join start. 27. hive.exec.paralleldefault falsePrepare ExecutionIf hive.exec.parallel=true, parallel launch(likeparallel query in Oracle) start. It is effective inmulti table insert.2012-04-03 19:34:09,346 Stage-4 map = 0%, reduce = 0%2012-04-03 19:34:16,576 Stage-5 map = 0%, reduce = 0%2012-04-03 19:34:47,630 Stage-4 map = 100%, reduce = 0%2012-04-03 19:34:57,716 Stage-5 map = 100%, reduce = 0%2012-04-03 19:35:40,147 Stage-4 map = 100%, reduce = 67%2012-04-03 19:35:46,430 Stage-4 map = 100%, reduce = 100%2012-04-03 19:35:46,443 Stage-5 map = 100%, reduce = 100%2012531 28. My impressionThere are a few choices of performance tuning.(, hive.exec.parallel)Data model, ETL ow is more important thanperformance tuning.Current optimization in Hive is rule-based. Ifindex and statistics and cost-based optimizationare more developed, there will be more choicesof performance tuning.2012531