Innovations in Apache Hadoop MapReduce, Pig and Hive for improving query performance

Vinod Kumar Vavilapalli, Gopal Vijayaraghavan

© Hortonworks Inc. 2013

Operation Stinger

Performance at any cost

• Scalability
  – Already works great; just don’t break it for performance gains
• Isolation + Security
  – Queries from different users run as different users
• Fault tolerance
  – Keep all of MR’s safety nets to work around bad nodes in clusters
• UDFs
  – Make sure they are “User” defined and not “Admin” defined

First things first
• How far can we push Hive as it exists today?

Benchmark spec
• The TPC-DS benchmark data + query set
• Query 27 (big joins small)
  – For all items sold in stores located in specified states during a given year, find the average quantity, average list price, average list sales price, and average coupon amount for a given gender, marital status, education and customer demographic.
• Query 82 (big joins big)
  – List all items and current prices sold through the store channel from certain manufacturers in a given price range that consistently had a quantity between 100 and 500 on hand in a 60-day period.

TL;DR
• TPC-DS Query 27, Scale=200, 10 EC2 nodes (40 disks)

TL;DR - II
• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)

Forget the actual benchmark
• First of all, YMMV
  – Software
  – Hardware
  – Setup
  – Tuning
• Text formats seem to be the staple of all comparisons
  – Really?
  – Everybody’s using it but only for benchmarks!

What did the trick?
• MapReduce?
• HDFS?
• Or is it just Hive?

Optional Advice

RCFile
• Binary RCFiles
• Hive pushes down column projections
• Less I/O, less CPU
• Smaller files
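
Why column projection pushdown means less I/O can be sketched with a toy columnar row group. This is only an illustration under stated assumptions: the `RowGroup` class, the column names and the comma-separated encoding are hypothetical and are not the real RCFile on-disk format.

```python
# Toy model of a columnar row group. NOT the real RCFile layout:
# the class, column names and encoding are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class RowGroup:
    # One byte string per column, holding that column's values for the group.
    columns: dict

    def read(self, projection):
        """Decode only the projected columns and report the bytes touched."""
        bytes_read = 0
        out = {}
        for name in projection:
            raw = self.columns[name]
            bytes_read += len(raw)
            out[name] = raw.decode("utf-8").split(",")
        return out, bytes_read


group = RowGroup(columns={
    "item_id": b"1,2,3",
    "price": b"9.99,4.50,7.25",
    "description": b"long text a,long text b,long text c",
})

# A query that needs two columns never touches the wide description column.
all_cols, all_bytes = group.read(["item_id", "price", "description"])
two_cols, two_bytes = group.read(["item_id", "price"])
```

In a row-oriented text file the reader would have to scan and parse every field of every row, including the ones the query never uses.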

Data organization
• No data system at scale is loaded once & left alone
• Partitions are essential
• Data flows into new partitions every day
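
A minimal sketch of why partitions matter: when each day lands in its own `key=value` directory, a query for one day can skip every other directory before reading any data. The warehouse paths, the `ds` partition key and the `prune_partitions` helper are hypothetical names chosen for illustration.

```python
# Sketch of partition pruning over key=value directory names.
# Paths and the "ds" partition key are assumptions for illustration.

def prune_partitions(partition_dirs, wanted_dates):
    """Keep only partition directories whose ds= value is in wanted_dates."""
    selected = []
    for path in partition_dirs:
        leaf = path.rstrip("/").split("/")[-1]
        key, _, value = leaf.partition("=")
        if key == "ds" and value in wanted_dates:
            selected.append(path)
    return selected


dirs = [
    "/warehouse/store_sales/ds=2013-03-18",
    "/warehouse/store_sales/ds=2013-03-19",
    "/warehouse/store_sales/ds=2013-03-20",
]

# Only one of the three daily partitions needs to be scanned.
selected = prune_partitions(dirs, {"2013-03-20"})
```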

A closer look
• Now revisiting the benchmark and its results

Query 27 - Before

Before

Query 27 - After

After

Query 82 - Before

Query 82 - After

What changed?
• Job count / correct plan
• Correct data formats
• Correct data organization
• Correct configuration

What changed?

Data Organization

Data Formats

Query Plan

Is that all?
• NO!
• In Hive
  – Metastore
  – RCFile issues
  – CPU intensive code
• In YARN+MR
  – Parallelism
  – Spin-up times
  – Data locality
• In HDFS
  – Bad disks/deteriorating nodes

In Hive

Hive Metastore
• 1+N Select problem
  – SELECT partitions FROM tables;
  – /* for each needed partition */ SELECT * FROM Partition ..
  – For query 27, generates > 5000 queries! 4-5 seconds lost on each call!
  – Lazy loading or Include/Join are general solutions
• DataNucleus/ORM issues
  – 100K NPEs try.. Catch.. Ignore..
• Metastore DB Schema revisit
  – Denormalize some/all of it?
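
The 1+N select pattern and the join-based alternative can be sketched against an in-memory SQLite database. The `tbls` and `parts` tables below are a hypothetical, simplified stand-in and are not the real metastore schema; the point is only the difference in the number of round trips.

```python
# Sketch of the 1+N select problem versus a single joined query.
# The tbls/parts schema is an assumption, not the real Hive metastore schema.
import sqlite3


def build_db(num_partitions=5):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbls (tbl_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE parts (part_id INTEGER PRIMARY KEY, tbl_id INTEGER, location TEXT)"
    )
    conn.execute("INSERT INTO tbls VALUES (1, 'store_sales')")
    conn.executemany(
        "INSERT INTO parts VALUES (?, 1, ?)",
        [(i, f"/warehouse/store_sales/ds={i}") for i in range(num_partitions)],
    )
    return conn


def fetch_one_plus_n(conn):
    """One query for the partition ids, then one more query per partition."""
    queries = 1
    ids = [row[0] for row in conn.execute(
        "SELECT part_id FROM parts WHERE tbl_id = 1 ORDER BY part_id")]
    locations = []
    for part_id in ids:
        row = conn.execute(
            "SELECT location FROM parts WHERE part_id = ?", (part_id,)).fetchone()
        locations.append(row[0])
        queries += 1
    return locations, queries


def fetch_joined(conn):
    """Fetch the same data with a single joined query."""
    rows = conn.execute(
        "SELECT p.location FROM tbls t JOIN parts p ON p.tbl_id = t.tbl_id "
        "WHERE t.name = 'store_sales' ORDER BY p.part_id"
    ).fetchall()
    return [row[0] for row in rows], 1


conn = build_db(5)
slow_result, slow_queries = fetch_one_plus_n(conn)
fast_result, fast_queries = fetch_joined(conn)
```

Both functions return the same locations; the first issues one query per partition plus one, which is the shape that grows to thousands of queries on a heavily partitioned table.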

RCFile issues
• RCFiles do not split well
  – Row groups and row group boundaries
• Small row groups vs big row groups
  – Sync() vs min split
  – Storage packing
• Run-length information is lost
  – Unnecessary deserialization costs
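
What is lost when run-length information is thrown away can be shown with a generic run-length encoding. This is a sketch of the general technique, not the encoding RCFile or any Hive file format actually uses; `rle_encode` and `rle_sum` are hypothetical helper names.

```python
# Generic run-length encoding sketch; not the actual encoding of any Hive format.

def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_sum(runs):
    """Sum a column directly on its runs, without expanding them into rows."""
    return sum(v * n for v, n in runs)


column = [5, 5, 5, 5, 7, 7, 9]
runs = rle_encode(column)
total = rle_sum(runs)
```

A reader that still sees the runs does three multiplications here; a reader that only sees the expanded values has to deserialize and add all seven.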

ORC file format
• A single file as output of each task
  – Dramatically simplifies integration with Hive
  – Lowers pressure on the NameNode
• Support for the Hive type model
  – Complex types (struct, list, map, union)
  – New types (datetime, decimal)
  – Encoding specific to the column type
• Split files without scanning for markers
• Bound the amount of memory required for reading or writing

CPU intensive code
• Hive query engine processes one row at a time
  – Very inefficient in terms of CPU usage
• Lazy deserialization: layers
• Object inspector calls
• Lots of virtual method calls

Tighten your loops

Vectorization to the rescue
• Process a row batch at a time instead of a single row
• Row batch to consist of column vectors
  – The column vector will consist of array(s) of primitive types as far as possible
• Each operator will process the whole column vector at a time
• File formats to give out vectorized batches for processing
• Underlying research promises
  – Better instruction pipelines and cache usage
  – Mechanical sympathy
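
The shape of a vectorized operator pipeline can be sketched as follows. This only illustrates the data layout (a row batch made of primitive column vectors, with each operator looping over a whole column); `RowBatch`, `filter_greater_than` and `add_columns` are hypothetical names, not Hive classes, and Python itself will not show the CPU gains that tight loops over primitive arrays give on the JVM.

```python
# Sketch of batch-at-a-time processing over column vectors.
# Class and function names are assumptions, not Hive's implementation.
from array import array


class RowBatch:
    def __init__(self, columns, size):
        self.columns = columns            # column name -> array of primitives
        self.size = size
        self.selected = list(range(size))  # indexes of rows still alive


def filter_greater_than(batch, col, threshold):
    """Filter operator: one tight loop over a single column vector."""
    vec = batch.columns[col]
    batch.selected = [i for i in batch.selected if vec[i] > threshold]


def add_columns(batch, left, right, out):
    """Arithmetic operator: writes a new column vector for the selected rows."""
    a, b = batch.columns[left], batch.columns[right]
    result = array("d", [0.0]) * batch.size
    for i in batch.selected:
        result[i] = a[i] + b[i]
    batch.columns[out] = result


batch = RowBatch(
    {"price": array("d", [1.0, 5.0, 10.0]), "tax": array("d", [0.1, 0.5, 1.0])},
    size=3,
)
filter_greater_than(batch, "price", 2.0)
add_columns(batch, "price", "tax", "total")
```

Each operator runs once per batch instead of once per row, so per-row overheads such as object inspection and virtual dispatch are paid far less often.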

Vectorization: Prelim results
• Functionality
  – Some arithmetic operators and filters using primitive type columns
  – Have a basic integration benchmark to prove that the whole setup works
• Performance
  – Micro benchmark
  – More than 30x improvement in the CPU time
  – Disclaimer:
    – Micro benchmark!
    – Does not yet include I/O or deserialization costs, or complex and string data types

In YARN+MR

Data Locality
• CombineInputFormat
• AM interaction with locality
• Short-circuit reads!
• Delay scheduling
  – Good for throughput
  – Bad for latency
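
The throughput versus latency trade-off of delay scheduling can be sketched with a simplified model: a task declines non-local node offers for a while in the hope that a node holding its data frees up. The `assign` function and its `max_skips` parameter are hypothetical and much simpler than any real YARN scheduler.

```python
# Simplified delay-scheduling sketch; not the logic of any real YARN scheduler.

def assign(offers, preferred_nodes, max_skips):
    """Walk node offers in order.

    Accept the first data-local node, or fall back to a non-local node
    once max_skips offers have been declined.
    Returns (node, offers_skipped, is_local).
    """
    skipped = 0
    for node in offers:
        if node in preferred_nodes:
            return node, skipped, True
        if skipped >= max_skips:
            return node, skipped, False
        skipped += 1
    return None, skipped, False


# Waiting through two offers buys a data-local placement...
local_case = assign(["n3", "n4", "n1"], {"n1"}, max_skips=2)
# ...but when no local node shows up in time, the wait was pure added latency.
remote_case = assign(["n3", "n4", "n5", "n1"], {"n1"}, max_skips=2)
# With no delay at all, the task starts immediately on a non-local node.
no_delay_case = assign(["n3"], {"n1"}, max_skips=0)
```

Every skipped offer is time the task spends not running, which is why the same knob that improves cluster throughput hurts the latency of a short query.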

Parallelism
• Can tune it (to some extent)
  – Controlling splits/reducer count
• Hive doesn’t know dynamic cluster status
  – Benchmarks max out clusters, real jobs may or may not
• Hive does not let you control parallelism
  – Particularly in case of multiple jobs in a query
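
How split size and reducer count translate into parallelism can be sketched with the usual rules of thumb: the split size is the block size clamped between a minimum and a maximum, and the reducer estimate is the input size divided by a bytes-per-reducer target, capped at a maximum. The function names are hypothetical, and the model is deliberately simplified (for example, it ignores the slack Hadoop allows on the last split); the parameters correspond roughly to settings such as `mapreduce.input.fileinputformat.split.minsize`/`maxsize`, `hive.exec.reducers.bytes.per.reducer` and `hive.exec.reducers.max`.

```python
# Back-of-the-envelope model of map and reduce parallelism.
# Simplified on purpose; function names are assumptions for illustration.
import math

MB = 1024 * 1024
GB = 1024 * MB


def split_size(block_size, min_size, max_size):
    """Block size clamped between the minimum and maximum split size."""
    return max(min_size, min(max_size, block_size))


def estimate_map_tasks(file_size, block_size, min_size, max_size):
    return math.ceil(file_size / split_size(block_size, min_size, max_size))


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """At least one reducer, at most max_reducers."""
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))


# A 1 GB file with 128 MB blocks gives 8 map tasks by default...
default_maps = estimate_map_tasks(1 * GB, 128 * MB, 1, 256 * MB)
# ...and raising the minimum split size to 256 MB halves that.
fewer_maps = estimate_map_tasks(1 * GB, 128 * MB, 256 * MB, 256 * MB)

reducers = estimate_reducers(10 * GB, 1 * GB, 999)
capped_reducers = estimate_reducers(10 * GB, 1 * GB, 4)
```

Note that these estimates depend only on data size and static settings, not on how busy the cluster is, which is the limitation the slide points out.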

Spin-up times
• AM startup costs
• Task startup costs
• Multiple waves of map tasks

Apache Tez
• Generic DAG workflow
• Container re-use
• AM pool service

AM Pool Service
• Pre-launches a pool of AMs
• Jobs submitted to these pre-launched AMs
  – Saves 3-5 seconds
• Pre-launched AMs can pre-allocate containers
• Tasks can be started as soon as the job is submitted
  – Saves 2-3 seconds

Container reuse
• Tez MapReduce AM supports container reuse
• Launched JVMs are re-used between tasks
  – About 4-5 seconds saved in case of multiple waves
• Allows future enhancements
  – Re-using task data structures across splits
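
A rough tally of the per-job savings quoted on the AM Pool Service and Container reuse slides, under the assumption (not stated in the talk) that the three savings are independent and simply add up for a job that runs multiple waves of tasks:

```python
# Adds up the second-ranges quoted on the slides.
# Treating them as independent and additive is an assumption.

def total_savings(ranges):
    """Sum (low, high) second-ranges into one overall (low, high) range."""
    low = sum(lo for lo, _ in ranges)
    high = sum(hi for _, hi in ranges)
    return low, high


savings = {
    "pre-launched AM": (3, 5),
    "pre-allocated containers": (2, 3),
    "container reuse across waves": (4, 5),
}

low, high = total_savings(savings.values())
```

Under that assumption the combined saving is on the order of ten seconds per job, which matters most for short, interactive queries.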

In HDFS

Speculation/bad disks
• No cluster remains at 100% forever
• Bad disks cause latency issues
  – Speculation is one defense, but it is not enough
  – Fault tolerance is a safety net
• Possible solutions:
  – More feedback from HDFS about stale nodes, bad/slow disks
  – Volume scheduling
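
The idea behind speculation can be sketched as picking out tasks whose progress rate lags far behind their peers, so that a duplicate attempt can be started elsewhere. This is a deliberately simplified heuristic: the `pick_speculation_candidates` function and its `slowdown` threshold are hypothetical and do not reproduce the speculator that MapReduce actually ships.

```python
# Simplified straggler detection; not the actual MapReduce speculator.
from statistics import mean


def pick_speculation_candidates(progress_rates, slowdown=0.5):
    """Return ids of tasks progressing slower than slowdown * mean rate."""
    if not progress_rates:
        return []
    cutoff = slowdown * mean(progress_rates.values())
    return sorted(t for t, rate in progress_rates.items() if rate < cutoff)


# map_2 is reading from a slow disk and crawls along at a fifth of the pace.
rates = {"map_0": 1.0, "map_1": 0.9, "map_2": 0.2, "map_3": 1.1}
candidates = pick_speculation_candidates(rates)
```

A heuristic like this only reacts after a task is already slow, which is one way to read the slide's point that speculation alone is not enough and that feedback about bad disks should arrive before tasks are placed on them.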

General guidelines
• Benchmarking
  – Be wary of benchmarks! Including ours!
  – Algebra with X

General guidelines contd.
• Benchmarks: To repeat, YMMV.
• Benchmark *your* use-case.
• Decide your problem size
  – If (smallData) {
      MySQL/Postgres/Your smart phone
    } else {
      – Make it work
      – Make it scale
      – Make it faster
    }
• If it is (seems to be) slow, file a bug, spend a little time!
• Replacing systems without understanding them
  – Is an easy way to have an illusion of progress

Related talks
• “Optimizing Hive Queries” by Owen O’Malley
• “What’s New and What’s Next in Apache Hive” by Gunther Hagleitner

Credits
• Arun C Murthy
• Bikas Saha
• Gopal Vijayaraghavan
• Hitesh Shah
• Siddharth Seth
• Vinod Kumar Vavilapalli
• Alan Gates
• Ashutosh Chauhan
• Vikram Dixit
• Gunther Hagleitner
• Owen O’Malley
• Jitendra Nath Pandey
• Yahoo!, Facebook, Twitter, SAP and Microsoft all contributing.

Q&A
• Thanks!