Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Page 1: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Page 2: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

FASTER, FASTER, FASTER: THE TRUE STORY OF A MOBILE ANALYTICS DATA MART ON HIVE

Mithun Radhakrishnan
Josh Walters

Page 3: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

About myself
• Mithun Radhakrishnan
• Hive Engineer at Yahoo
• Hive Committer
• Has an irrational fear of spider monkeys
• [email protected]
• @mithunrk

Page 4: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

RECAP

Page 5: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

2015 Hadoop Summit, San Jose, California

Page 6: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

From: The [REDACTED] ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...

Dear YHive team,

We have partitioned our table using the following 6 partition keys: {hourly-timestamp, name, property, geo-location, shoe-size, and so on…}. For a given timestamp, the combined cardinality of the remaining partition-keys is about 10000/hr.

If queries on partitioned tables are supposed to be faster, how come queries on our table take forever just to get off the ground?

Yours gigantically, Project [REDACTED]
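A rough back-of-envelope for the numbers in the email, as a sketch only. The 10,000 partitions per hour figure comes from the email; the query windows and the function name `partitions_in_window` are illustrative assumptions. It shows how quickly the number of partitions a query must resolve grows with the time range, which is one plausible reason such queries are slow to start.

```python
# Back-of-envelope partition counts for the table described in the email.
# PARTITIONS_PER_HOUR is from the email; the time windows are assumptions.
PARTITIONS_PER_HOUR = 10_000


def partitions_in_window(hours: int, per_hour: int = PARTITIONS_PER_HOUR) -> int:
    """Partitions a query touches when its time filter spans `hours` hours."""
    return hours * per_hour


for label, hours in [("1 hour", 1), ("1 day", 24), ("30 days", 24 * 30)]:
    print(f"{label:>8}: {partitions_in_window(hours):>10,} partitions")
```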

Page 7: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

ABOUT ME
• Josh Walters
• Data Engineer at Yahoo
• I build lots of data pipelines
• Can eat a whole plate of deep fried cookie dough
• http://joshwalters.com
• @joshwalters

Page 8: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

WHAT IS THE CUSTOMER NEED?
• Faster ETL
• Faster queries
• Faster ramp up

Page 9: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

CASE STUDY: MOBILE DATA MART
• Mobile app usage data
• Optimize performance
• Interactive analytics

Page 10: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

LOW HANGING FRUIT
• Tez Tez Tez!
• Vectorized query execution
• Map-side aggregations
• Auto-convert map join

Page 11: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

DATA PARTITIONING
• Want thousands of partitions
• Deep data partitioning
• Difficult to do at scale

Page 13: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

SOLID STATE DRIVES
• Didn’t really help
• Ended up CPU bound
• Regular drives are fine

Page 14: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

ORC!
• Used in largest data systems
• 90% boost on sorted columns
• 30x compression versus raw text
• Fits well with our tech stack
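One way to see why sorted columns can benefit so much from a columnar format with per-block statistics is a toy model of min/max filtering. This is a minimal sketch and not ORC's actual reader: the `RowGroup` class, the group size, and the sample data are all hypothetical. It counts how many blocks a point lookup has to read when the column is unsorted versus sorted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """A block of rows plus the min/max statistics kept for one column."""
    values: List[int]
    lo: int
    hi: int


def build_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size blocks and record each block's range."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def groups_to_read(groups: List[RowGroup], target: int) -> int:
    """Count blocks whose [lo, hi] range could contain `target`."""
    return sum(1 for g in groups if g.lo <= target <= g.hi)


# Hypothetical column: the ids 0..999, stored shuffled and stored sorted.
unsorted_ids = [(i * 37) % 1000 for i in range(1000)]
sorted_ids = sorted(unsorted_ids)

unsorted_groups = build_row_groups(unsorted_ids, group_size=100)
sorted_groups = build_row_groups(sorted_ids, group_size=100)

print("unsorted:", groups_to_read(unsorted_groups, 500), "of", len(unsorted_groups))
print("sorted:  ", groups_to_read(sorted_groups, 500), "of", len(sorted_groups))
```

With the shuffled layout every block's range spans almost the whole domain, so no block can be skipped; with the sorted layout only one block's range contains the target.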

Page 15: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

SKETCH ALL THE THINGS
• Very accurate
• Can store sketches in Hive
• Union, intersection, difference
• 75% boost on relevant queries

Page 16: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

SKETCH ALL THE THINGS
SELECT COUNT(DISTINCT id)
FROM DB.TABLE
WHERE ...; -- ~100 seconds

SELECT estimate(sketch(id))
FROM DB.TABLE
WHERE ...; -- ~25 seconds
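To give a feel for what a distinct-count sketch does, here is a minimal K-Minimum-Values (KMV) estimator in Python. It is an illustration under stated assumptions, not the Hive UDFs from the slide: the Python functions `sketch`, `union`, and `estimate` are hypothetical stand-ins that only borrow the names used in the query, and the hash choice and default `k` are arbitrary.

```python
import hashlib
import heapq
from typing import Iterable, List

MAX_HASH = float(2 ** 64)


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / MAX_HASH


def sketch(items: Iterable[str], k: int = 1024) -> List[float]:
    """Keep the k smallest distinct hash values, sorted ascending.

    For brevity this materializes all hashes first; a streaming version
    would keep only k values in memory.
    """
    return heapq.nsmallest(k, {_hash01(x) for x in items})


def union(a: List[float], b: List[float], k: int = 1024) -> List[float]:
    """Sketch of the combined input: merge, then keep the k smallest."""
    return heapq.nsmallest(k, set(a) | set(b))


def estimate(s: List[float], k: int = 1024) -> float:
    """Estimate the distinct count from the k-th smallest hash value."""
    if len(s) < k:
        return float(len(s))  # fewer than k distinct items: count is exact
    return (k - 1) / s[-1]


# Hypothetical data: 200,000 rows containing 50,000 distinct ids.
ids = [f"user-{i % 50_000}" for i in range(200_000)]
print(round(estimate(sketch(ids))))
```

Because a sketch is just a small sorted list, two sketches built on different partitions can be merged with `union` without rescanning the rows, which is what makes storing sketches in a table attractive.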

Page 17: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

SKETCH ALL THE THINGS

Standard Deviation     1       2       3
Confidence Interval    68%     95%     99%
K = 16                 25%     51%     77%
K = 512                4%      8%      13%
K = 4096               1%      3%      4%
K = 16384              < 1%    1%      2%
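The error figures in this table are consistent with a relative standard error of roughly 1/sqrt(K - 1), scaled by the number of standard deviations and truncated to whole percents. That formula is an assumption here (the slide does not state it), and `relative_error` is a hypothetical helper name. A short sketch that recomputes the rows:

```python
import math


def relative_error(k: int, num_std_devs: int) -> float:
    """Assumed error bound: num_std_devs / sqrt(k - 1), as a fraction."""
    return num_std_devs / math.sqrt(k - 1)


for k in (16, 512, 4096, 16384):
    row = [f"{relative_error(k, n):.1%}" for n in (1, 2, 3)]
    print(f"K = {k:<6}", *row)
```

The practical reading is that quadrupling K halves the error, at the cost of a sketch four times as large.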

Page 18: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

MORE SKETCH INFO
• Summarization, Approx. and Sampling: Tradeoffs for Improving Query, Hadoop Summit, 2015
• http://datasketches.github.io

Page 19: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

ADVANCED QUERIES
• Desire for complex queries
• Retention, funnels, etc.
• A lot can be done with UDFs

Page 20: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

FUNNEL ANALYSIS
• Complex to write, difficult to reuse
• Slow, requires multiple joins
• Using UDFs, now runs in seconds, not hours
• https://github.com/yahoo/hive-funnel-udf
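The idea of replacing multiple self-joins with a single aggregation pass can be sketched in plain Python. This is only an illustration of the technique and does not reproduce the API of the hive-funnel-udf project; the `Event` tuple layout, the function names, and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int, str]  # (user_id, timestamp, action)


def funnel_depth(actions: List[str], steps: List[str]) -> int:
    """How many funnel steps a time-ordered action list completes, in order."""
    depth = 0
    for action in actions:
        if depth < len(steps) and action == steps[depth]:
            depth += 1
    return depth


def funnel_counts(events: Iterable[Event], steps: List[str]) -> List[int]:
    """Number of users reaching each step, grouping events by user once."""
    by_user: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for user, ts, action in events:
        by_user[user].append((ts, action))

    counts = [0] * len(steps)
    for timeline in by_user.values():
        ordered = [action for _, action in sorted(timeline)]
        for i in range(funnel_depth(ordered, steps)):
            counts[i] += 1
    return counts


events = [
    ("u1", 1, "view"), ("u1", 2, "cart"), ("u1", 3, "buy"),
    ("u2", 1, "view"), ("u2", 2, "buy"),
    ("u3", 1, "cart"), ("u3", 2, "view"),
]
print(funnel_counts(events, ["view", "cart", "buy"]))
```

Each user's events are grouped and ordered once, so adding a funnel step lengthens the `steps` list instead of adding another join.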

Page 21: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

REALLY FAST OLAP
• OLAP type queries are the most common
• Aggregate only queries: group, count, sum, …
• Can we optimize for such queries?

Page 22: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

OLAP WITH DRUID
• Interactive, sub-second latency
• Ingest raw records, then aggregate
• Open source, actively developed
• http://druid.io

Page 23: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

BI TOOL
• Many options
• Don’t cover all needs
• Need graphs and dashboards

Page 24: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

CARAVEL
• Hive, Druid, Redshift, MySQL, …
• Simple query construction
• Open source, actively developed
• https://github.com/airbnb/caravel

Page 25: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

WHAT WE LEARNED
• Product teams need custom data marts
• Complex to build and run
• Just want to focus on business logic

Page 26: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

DATA MART IN A BOX!
• Generalized ETL pipeline
• Easy to spin-up
• Automatic continuous delivery
• Just give us a query!

Page 27: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

DATA MART ARCHITECTURE

Page 28: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

INFRASTRUCTURE WORK
• We didn’t do this alone
• Partners in grid team fixed many pain points

Page 29: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Y!HIVE

Page 30: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Hive on Tez - Interactive Queries in Shared Clusters
[Figures: Dedicated Queue Metrics; Shared Cluster Metrics]

Page 31: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Page 32: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Increased Configurability or Increased Complexity?
[Chart: LOC (y-axis, 0 to 4500) by Hive release, from Hive 0.2 through Hive 2.0 and Hive Master]

Page 33: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Performance Tuning
• Out of the box:
  • Tez container reuse
    • set tez.am.container.reuse.enabled=true;
  • Tez speculative execution
    • set tez.am.speculation.enabled=true;
  • Reduce-side vectorization
    • set hive.vectorized.execution.reduce.enabled=true;
    • set hive.vectorized.execution.reduce.groupby.enabled=true;

Page 34: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Performance Tuning
• Understand your data:
  • Use ORC’s index-based filtering:
    • set hive.optimize.index.filter=true;
  • Bloom filters
    • ALTER TABLE my_orc SET TBLPROPERTIES("orc.bloom.filter.columns"="foo,bar");
    • Cardinality?
  • Sort on filter-column
    • Trade-offs: Parallelism vs. filtering
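To illustrate why a Bloom filter on a filter column can let a reader skip data, here is a minimal Bloom filter in Python. It is a sketch under stated assumptions, not ORC's implementation: the `BloomFilter` class, its sizes, the hash scheme, and the `stripes` data are all hypothetical. A lookup consults each block's filter first and reads only blocks that might contain the value.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str) -> Iterator[int]:
        """Derive num_hashes bit positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical blocks of a column, each with its own filter.
stripes = {
    "stripe-0": ["alice", "bob"],
    "stripe-1": ["carol", "dave"],
}
filters = {}
for name, values in stripes.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters[name] = bf

to_read = [name for name, bf in filters.items() if bf.might_contain("carol")]
print(to_read)
```

The cardinality question on the slide matters because a column with many distinct values per block needs more bits per filter to keep the false-positive rate low.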

Page 35: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Performance Tuning
• Understand your queries:
  • Prefer LIKE and INSTR over REGEXP*
  • Compile-time date/time functions:
    • current_date()
    • current_timestamp()
  • Queries generated from UI tools

Page 36: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Performance Improvements - ORC
• Index-based filtering available to Pig / MR users
  • HCatLoader, HCatInputFormat
• Split-calculation improvements
  • Block-based BI
  • Parallel ETL
• Disabled dictionaries for Complex data types
  • OOMs

Page 37: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Performance Improvements - Joins
• Skew Joins
  • Already solved for Pig
  • Hive for ETL
  • Current Hive solution: Explicit values. (Wishful thinking)
  • Poisson sampling
• Faster sorted-merge joins
  • Wide-tables
  • SpillableRowContainers
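The slide does not say how sampling was applied to skew joins, so the following is only a generic illustration of the idea: instead of listing skewed values explicitly, sample the join keys and flag any key that dominates the sample. Poisson sampling here means each row is included independently with probability `p`. The function names, the threshold, and the data are hypothetical.

```python
import random
from collections import Counter
from typing import Iterable, List


def poisson_sample(keys: Iterable[str], p: float, rng: random.Random) -> List[str]:
    """Include each key independently with probability p."""
    return [k for k in keys if rng.random() < p]


def skewed_keys(sample: List[str], min_share: float) -> List[str]:
    """Keys that account for at least `min_share` of the sampled rows."""
    if not sample:
        return []
    counts = Counter(sample)
    return sorted(k for k, c in counts.items() if c / len(sample) >= min_share)


# Hypothetical join input: one hot key plus 1,000 keys that appear once.
rng = random.Random(42)
join_keys = ["hot"] * 9_000 + [f"k{i}" for i in range(1_000)]

sample = poisson_sample(join_keys, p=0.05, rng=rng)
print(skewed_keys(sample, min_share=0.5))
```

Keys flagged this way could then be handled separately from the rest of the join, without anyone having to know the hot values in advance.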

Page 38: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Performance Improvements – Various Sundries
• Improvements for data-discovery
  • HCatClient-based users
  • Oozie, GDM
  • 10x improvement
• Fetch Operator improvements:
  • SELECT * FROM partitioned_table LIMIT 100;
  • Lazy-load partitions

Page 39: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Performance Improvements: Hive’s AvroSerDe
• Avro Format is popular
  • Self describing
  • Flexible
  • Generic
  • Quirky
• Intermediate stages in pipelines
• Development

Page 40: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

“There is no mature, no stable. The only constant is change… [Our] work on feeds often involves new columns, several times a day.”

Page 41: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Page 42: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

The Problem
• AvroSerDe needs read-schema at job-runtime (i.e. map-side)
  • Stored on HDFS
• ETL Jobs need 10-20K maps
  • Replication factor
  • Data-node outage
• It gets steadily worse
  • Block-replication on node-loss
  • Task attempt retry
  • More nodes lost
  • Rinse and repeat

Page 43: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Page 44: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

The Solution
• Reconcile metastore-schema against read-schema?
  • toAvroSchema( fromAvroSchema( avroSchema )) != avroSchema
• Store schema in TBLPROPERTIES?
• Cache read-schema during SerDe::initialize()
  • Once per map-task
• Prefetch read-schema at query-planning phase
  • Once per job
  • Separate optimizer
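The difference between reading the schema once per map task and once per job can be made concrete with a toy model. This is a simulation under stated assumptions, not Hive code: `CountingStore` is a hypothetical stand-in for HDFS, the schema path and contents are made up, and the task count is taken from the lower end of the 10-20K range mentioned in the problem statement.

```python
from typing import Dict


class CountingStore:
    """Stand-in for a file system that counts schema-file reads."""

    def __init__(self, files: Dict[str, str]) -> None:
        self.files = files
        self.reads = 0

    def read(self, path: str) -> str:
        self.reads += 1
        return self.files[path]


def run_job_per_task(store: CountingStore, path: str, num_tasks: int) -> int:
    """Each task reads the schema once when it initializes."""
    for _ in range(num_tasks):
        schema = store.read(path)  # once per map task
        assert schema
    return store.reads


def run_job_prefetched(store: CountingStore, path: str, num_tasks: int) -> int:
    """The planner reads the schema once and ships it with the job config."""
    job_conf = {"schema": store.read(path)}  # once per job
    for _ in range(num_tasks):
        schema = job_conf["schema"]  # tasks never touch the store
        assert schema
    return store.reads


SCHEMA_PATH = "/schemas/events.avsc"  # hypothetical path
files = {SCHEMA_PATH: '{"type": "record", "name": "Event", "fields": []}'}

per_task_reads = run_job_per_task(CountingStore(files), SCHEMA_PATH, 10_000)
prefetched_reads = run_job_prefetched(CountingStore(files), SCHEMA_PATH, 10_000)
print(per_task_reads, prefetched_reads)
```

With prefetching, the number of reads against the schema file no longer grows with the number of map tasks, so a lost data node holding that file affects far fewer task attempts.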

Page 45: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

We’re not done yet
• Row-oriented format
• Skew-join
• Stats storage

Page 46: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Thanks
• Team effort
  • Chris Drome
  • Selina Zhang
  • Michael Natkovich
  • Olga Natkovich
  • Sameer Raheja
  • Ravi Sankurati

Page 47: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Q&A

Page 48: Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive