Hive sourcecodereading

  • View
    6.962

  • Download
    0

Embed Size (px)

Text of Hive sourcecodereading

  • 1. Hive Source CodeReadingHadoop Source Code Reading Vol.9(5/30)http://hadoop-scr9th.eventbrite.com/@wyukawa2012531

2. OverviewMotivationWhat component did I read?Set up Development EnvironmentUseful document for Code ReadingLets try!My understandingWhere does Hive tune performance?My impression2012531 3. MotivationCuriosityHow does Hive convert HiveQL toMapReduce jobs?What optimizing processes are adopted? select item, sum(value) from item_tbl group by item;map reduce2012531 4. What component did I read?http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook2012531 5. Set up Development EnvironmentSee http://glowing-moon-7493.herokuapp.com/questions/171/hive Setting up is troublesome...(-_-;) I read r1342841.2012531 6. Useful documentHive Anatomy http://www.slideshare.net/nzhang/hive- anatomyInternal Hive http://www.slideshare.net/recruitcojp/ internal-hive2012531 7. Lets try!http://matome.naver.jp/odai/2133708130778775801/21337140017833510032012531 8. a few minutes later http://matome.naver.jp/odai/2133708130778775801/21337140015833460032012531 9. a few hours later http://matome.naver.jp/odai/2133708130778775801/21337140016833480032012531 10. Its difcult...ComplexMethod length is long http://matome.naver.jp/odai/2133708130778775801/21337140016833491032012531 11. And there is this ticket... http://matome.naver.jp/odai/2133708130778775801/21337140015833458032012531 12. 2012531 13. Terrible...Its not good... http://matome.naver.jp/odai/2133622524999684001/21338103077617701032012531 14. anyway... http://matome.naver.jp/odai/2133708130778775801/21337140015833467032012531 15. My understanding HiveQLParser AST Semantic AnalyzerOperatorLogical Plan Generation QB Logical OptimizerOperatorPhysical Plan Generation Prepare ExecutionTask Physical OptimizerTask Start MapReduce job2012531 16. Operator ExampleFetchOperatorSelectOperatorFilterOperatorGroupByOperatorTableScanOperatorReduceSinkOperatorJoinOperator2012531 17. Explain Example 1hive> explain select * from aaa;...STAGE DEPENDENCIES: Stage-0 is a root stageSTAGE PLANS: Stage: Stage-0Fetch Operator limit: -12012531 18. Explain Example 22012531 19. OptimizerRule-basedOptimizer is a set of Transformation rules onthe operator treeThe transformation rules are specied by aregexp pattern on the tree.The rule with minimum cost are executed.The cost is regexp pattern match length.2012531 20. Cost Calculation Example HiveQL is select key from src; Optimizer is ColumnPruner. SelectOperator is expressed as SEL%. Regexp pattern of ColumnPrunerSelectProc rule is expressed as SEL%. The cost is 4. Pattern.compile(SEL%).matcher(SEL%).group ().length()2012531 21. Logical Optimizer ExampleColumnPrunerPredicatePushDownPartitionPrunerGroupByOptimizerUnionProcessorJoinReorder2012531 22. Physical Optimizer ExampleSkewJoinResolverCommonJoinResolver If hive.auto.convert.join=true, resolver start.IndexWhereResolverMetadataOnlyOptimizer2012531 23. Break http://matome.naver.jp/odai/2133708130778775801/21337140015833467032012531 24. Where does Hive tune performance?hive.map.aggrLogical Plan Generationhive.auto.convert.join Physical Optimizerhive.exec.parallel Prepare Execution2012531 25. hive.map.aggrdefault trueLogical Plan Generationmap side aggregation(in-mapper combining) http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-nal.pdf2012531 26. hive.auto.convert.joindefault false Physical OptimizerIf hive.auto.convert.join=true, map join start.http://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/4706679289192012531 27. hive.exec.paralleldefault falsePrepare ExecutionIf hive.exec.parallel=true, parallel launch(likeparallel query in Oracle) start. It is effective inmulti table insert.2012-04-03 19:34:09,346 Stage-4 map = 0%, reduce = 0%2012-04-03 19:34:16,576 Stage-5 map = 0%, reduce = 0%2012-04-03 19:34:47,630 Stage-4 map = 100%, reduce = 0%2012-04-03 19:34:57,716 Stage-5 map = 100%, reduce = 0%2012-04-03 19:35:40,147 Stage-4 map = 100%, reduce = 67%2012-04-03 19:35:46,430 Stage-4 map = 100%, reduce = 100%2012-04-03 19:35:46,443 Stage-5 map = 100%, reduce = 100%2012531 28. My impressionThere are a few choices of performance tuning.(hive.auto.convert.join, hive.exec.parallel)Data model, ETL ow is more important thanperformance tuning.Current optimization in Hive is rule-based. Ifindex and statistics and cost-based optimizationare more developed, there will be more choicesof performance tuning.2012531