Sergey Kovalev (Altoros): Practical Steps to Improve Apache Hive Performance

Practical Steps to Improve Hive Query Performance

Sergey Kovalev

Software Engineer at Altoros

How Hive works

1. Use partitions whenever possible

/folder1/video_data/file1

id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013

/folder1/video_data/2012/file1

1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012

/folder1/video_data/2013/file1

3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013

SELECT * FROM video WHERE uploadYear = '2013-04-08';
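The effect of the directory layout above can be modeled outside Hive. The following Python sketch is illustrative only and is not how Hive is implemented: the `PARTITIONS` dictionary and the `scan` helper are hypothetical names, and the rows are the sample data from the listing above. It shows that a filter on the partition column lets the reader skip whole directories.

```python
# Illustrative model of partition pruning; not Hive's actual implementation.
# Keys stand for partition directories, values for the rows stored in them.
PARTITIONS = {
    "/folder1/video_data/2012": [
        (1, "title1", "channelId1", "description1"),
        (2, "title2", "channelId2", "description2"),
    ],
    "/folder1/video_data/2013": [
        (3, "title3", "channelId3", "description3"),
        (4, "title4", "channelId4", "description4"),
    ],
}


def scan(upload_year=None):
    """Return (rows, directories_read) for an optional partition filter."""
    rows, dirs_read = [], []
    for directory, stored_rows in PARTITIONS.items():
        year = directory.rsplit("/", 1)[-1]
        # With a filter on the partition column, non-matching directories
        # are skipped without opening any of their files.
        if upload_year is not None and year != str(upload_year):
            continue
        dirs_read.append(directory)
        rows.extend(stored_rows)
    return rows, dirs_read


rows, dirs_read = scan(upload_year=2013)
print(dirs_read)
print([row[0] for row in rows])
```

Without the filter, `scan()` reads both directories, which corresponds to the full scan that an unpartitioned table always pays for.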

1. Use partitions whenever possible

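create table video (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
)
PARTITIONED BY (uploadYear date)
STORED AS ORC;

-- Dynamic partitioning must be enabled to insert without a static partition value.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Assumes uploadYear is the last column of video_external.
insert into table video PARTITION (uploadYear) select * from video_external;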

2. Use bucketing

create table video (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING
)
CLUSTERED BY (channelId) INTO 2 BUCKETS
STORED AS ORC;

create table channel (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC;

SELECT v.title FROM video v JOIN channel ch ON v.channelId = ch.id WHERE ch.viewCount > 1000;

2. Use bucketing

/folder1/video_data/file1

id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2012
4, title4, channelId4, description4, 2012
5, title5, channelId5, description5, 2013
6, title6, channelId6, description6, 2013
7, title7, channelId7, description7, 2013
8, title8, channelId8, description8, 2013


2, title2, channelId2, description2, 2012
4, title4, channelId4, description4, 2012
6, title6, channelId6, description6, 2013
8, title8, channelId8, description8, 2013


1, title1, channelId1, description1, 2012
3, title3, channelId3, description3, 2012
5, title5, channelId5, description5, 2013
7, title7, channelId7, description7, 2013

2. Use bucketing

/folder1/channel_data/file1

id, title, description, viewCount
channelId1, title1, description1, viewCount1
channelId2, title2, description2, viewCount2
channelId3, title3, description3, viewCount3
channelId4, title4, description4, viewCount4
channelId5, title5, description5, viewCount5
channelId6, title6, description6, viewCount6
channelId7, title7, description7, viewCount7
channelId8, title8, description8, viewCount8

/folder1/channel_data/file1

channelId2, title2, description2, viewCount2
channelId4, title4, description4, viewCount4
channelId6, title6, description6, viewCount6
channelId8, title8, description8, viewCount8

/folder1/channel_data/file2

channelId1, title1, description1, viewCount1
channelId3, title3, description3, viewCount3
channelId5, title5, description5, viewCount5
channelId7, title7, description7, viewCount7
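How a row ends up in a particular bucket file can be sketched as a hash of the bucketing column modulo the number of buckets. The Python model below is a minimal sketch under stated assumptions: it assumes the legacy rule in which a STRING key is hashed like Java's `String.hashCode()` and the bucket is `(hash & Integer.MAX_VALUE) % numBuckets`. Newer Hive versions may use a different hash function, so real file assignments can differ. The names `java_string_hash` and `bucket_for` are hypothetical.

```python
def java_string_hash(text):
    """32-bit hash following the algorithm of Java's String.hashCode()."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h


def bucket_for(key, num_buckets):
    """Assumed rule: (hash & Integer.MAX_VALUE) % num_buckets."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_buckets


NUM_BUCKETS = 2
channel_ids = [f"channelId{i}" for i in range(1, 9)]

# Group the sample channel ids by the bucket they hash to.
buckets = {b: [] for b in range(NUM_BUCKETS)}
for channel_id in channel_ids:
    buckets[bucket_for(channel_id, NUM_BUCKETS)].append(channel_id)

for bucket, ids in buckets.items():
    print(bucket, ids)
```

Because `video` is bucketed on `channelId` and `channel` on `id` with the same bucket count, equal keys land in the same bucket number in both tables, so a join only has to pair bucket i of one table with bucket i of the other.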

3. Partitions + bucketing

create table video (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
)
PARTITIONED BY (uploadYear date)
CLUSTERED BY (channelId) INTO 2 BUCKETS
STORED AS ORC;


3. Partitions + bucketing

/folder1/video_data/file1

id, title, channelId, viewCount, uploadYear
1, title1, channelId1, viewCount1, 2012
2, title2, channelId2, viewCount2, 2012
3, title3, channelId3, viewCount3, 2012
4, title4, channelId4, viewCount4, 2012
5, title5, channelId5, viewCount5, 2013
6, title6, channelId6, viewCount6, 2013
7, title7, channelId7, viewCount7, 2013
8, title8, channelId8, viewCount8, 2013

/folder1/video_data/2012/file1

2, title2, channelId2, viewCount2, 2012
4, title4, channelId4, viewCount4, 2012
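Combining both techniques means the partition column selects the directory and the bucketing column selects the file inside it. The sketch below models that placement in Python. It is an illustration, not Hive code: `bucket_for` is a hypothetical helper that assumes a Java-style string hash modulo the bucket count, and the rows are reduced to the three fields that matter for placement.

```python
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Assumed bucketing rule: Java-style string hash modulo bucket count."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return (h & 0x7FFFFFFF) % num_buckets


# (id, channelId, uploadYear) taken from the sample rows above.
rows = [(i, f"channelId{i}", 2012 if i <= 4 else 2013) for i in range(1, 9)]

layout = defaultdict(list)
for video_id, channel_id, upload_year in rows:
    # Partition column picks the directory, bucketing column picks the file.
    layout[(upload_year, bucket_for(channel_id, 2))].append(video_id)

for (year, bucket), ids in sorted(layout.items()):
    print(year, bucket, ids)
```

A query that filters on `uploadYear` and joins on `channelId` can then skip the other year's directory and still pair buckets by number.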




4. Use joins optimization

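Shuffle join/Common join: both tables are shuffled by the join key so that matching rows meet on the same reducer. It works for tables of any size, but the shuffle makes it the most expensive option.


Map-side join: the smaller table is loaded into memory on every mapper, so the large table is joined without a reduce phase.


Sort-merge-bucket (SMB) join: both tables are bucketed and sorted on the join key, so matching buckets can be merged directly.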
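The three join strategies named above can be modeled in a few lines of Python. This is a simplified sketch, not Hive's execution code: the sample rows and the function names are hypothetical, the "reducers" are plain lists, and the sort-merge variant merges a single pair of sorted inputs where Hive would merge each pair of matching buckets.

```python
videos = [("title1", "channelId1"), ("title2", "channelId2"),
          ("title3", "channelId1"), ("title4", "channelId3")]
channels = [("channelId1", 5000), ("channelId2", 10), ("channelId3", 2000)]


def shuffle_join(videos, channels, num_reducers=2):
    """Both inputs are repartitioned by join key, then joined per reducer."""
    reducers = [{"v": [], "c": []} for _ in range(num_reducers)]
    for title, channel_id in videos:
        reducers[hash(channel_id) % num_reducers]["v"].append((title, channel_id))
    for channel_id, views in channels:
        reducers[hash(channel_id) % num_reducers]["c"].append((channel_id, views))
    out = []
    for part in reducers:
        for title, v_key in part["v"]:
            for c_key, views in part["c"]:
                if v_key == c_key:
                    out.append((title, views))
    return sorted(out)


def map_side_join(videos, channels):
    """The small table is loaded into an in-memory hash table; no shuffle."""
    small = dict(channels)
    return sorted((title, small[cid]) for title, cid in videos if cid in small)


def sort_merge_join(videos, channels):
    """Both inputs are sorted on the join key and merged in one pass."""
    left = sorted(videos, key=lambda v: v[1])
    right = sorted(channels)  # channel ids are unique in this sample
    out, j = [], 0
    for title, cid in left:
        # Advance the right cursor until it reaches or passes the left key.
        while j < len(right) and right[j][0] < cid:
            j += 1
        if j < len(right) and right[j][0] == cid:
            out.append((title, right[j][1]))
    return sorted(out)


print(shuffle_join(videos, channels))
print(map_side_join(videos, channels))
print(sort_merge_join(videos, channels))
```

All three functions return the same rows. They differ in how much data has to move and how much has to fit in memory, which is the trade-off behind choosing a join strategy.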

5. Choose the right input format

Row data vs. column store
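The difference between a row layout and a column layout such as ORC can be illustrated with plain Python structures. This is a conceptual sketch only: real columnar formats add encoding, compression, and indexes, and the variable names here are hypothetical.

```python
rows = [
    {"id": 1, "title": "title1", "viewCount": 100},
    {"id": 2, "title": "title2", "viewCount": 250},
    {"id": 3, "title": "title3", "viewCount": 75},
]

# Row layout: values of one record are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column layout: values of one column are stored together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}

# SELECT SUM(viewCount): the row layout touches every field of every record,
# the column layout reads a single contiguous list.
values_read_row = sum(len(record) for record in row_store)
values_read_column = len(column_store["viewCount"])
total = sum(column_store["viewCount"])
print(values_read_row, values_read_column, total)
```

For analytical queries that touch only a few columns of a wide table, the column layout reads a fraction of the values that the row layout does.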

6. Other optimizations

Avoid highly normalized table structures

Compress map/reduce output

For map output compression, execute set mapred.compress.map.output = true.

For job output compression, execute set mapred.output.compress = true.

Use parallel execution

SET hive.exec.parallel=true;
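Parallel execution helps when a query compiles into stages that do not depend on each other. The idea can be sketched in Python with a thread pool. The stage functions below are hypothetical stand-ins for independent subqueries and do not correspond to any Hive API.

```python
from concurrent.futures import ThreadPoolExecutor


def stage_count_videos():
    return 8  # stand-in for an independent subquery


def stage_count_channels():
    return 8  # stand-in for another independent subquery


def stage_combine(videos, channels):
    return {"videos": videos, "channels": channels}


def run_sequential():
    # Default behaviour: stages run one after another.
    return stage_combine(stage_count_videos(), stage_count_channels())


def run_parallel():
    # Independent stages are submitted together; the final stage waits for both.
    with ThreadPoolExecutor(max_workers=2) as pool:
        videos = pool.submit(stage_count_videos)
        channels = pool.submit(stage_count_channels)
        return stage_combine(videos.result(), channels.result())


print(run_parallel() == run_sequential())
```

The result is the same either way. The gain is in wall-clock time, and only when the cluster has spare capacity to run the stages side by side.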

7. Use the 'explain' keyword to improve the query execution plan

EXPLAIN query...

8. Stinger Initiative

Use cost-based optimization

Use vectorization

Transactions with ACID semantics
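Vectorization replaces row-at-a-time processing with operators that work on batches of rows (1024 rows per batch in Hive). The Python sketch below only illustrates the control flow and is not a performance benchmark: the function names are hypothetical and a Python list slice stands in for a column vector.

```python
BATCH_SIZE = 1024  # batch size used by Hive's vectorized execution

view_counts = list(range(5000))


def count_row_at_a_time(values, threshold):
    count = 0
    for value in values:  # one operator invocation per row
        if value > threshold:
            count += 1
    return count


def count_vectorized(values, threshold):
    count = 0
    for start in range(0, len(values), BATCH_SIZE):
        batch = values[start:start + BATCH_SIZE]  # one invocation per batch
        count += sum(1 for value in batch if value > threshold)
    return count


print(count_row_at_a_time(view_counts, 1000))
print(count_vectorized(view_counts, 1000))
```

In a real engine the per-batch loop runs over primitive arrays, which cuts per-row call overhead and makes better use of the CPU cache.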

8. Hive on Tez

8. Sub-Second Queries with Hive LLAP

A new approach using a hybrid engine that leverages Tez and a new component called LLAP (Live Long and Process)

Questions?
