Practical Steps to Improve Hive Queries Performance
Sergey Kovalev
Software Engineer at Altoros
How Hive works
1. Use partitions whenever possible
Without partitions:
/folder1/video_data/file1
id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013
Partitioned by uploadYear:
/folder1/video_data/2012/file1
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
/folder1/video_data/2013/file1
3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013
SELECT * FROM video WHERE uploadYear = '2013-04-08';
create table video (
id STRING,
title STRING,
description STRING,
viewCount BIGINT
) PARTITIONED BY (uploadYear date)
STORED AS ORC;
insert into table video PARTITION (uploadYear) select * from video_external;
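The insert above names the partition column without a value, which is a dynamic-partition insert. A minimal sketch of the session settings this usually needs, assuming video_external exposes the columns listed below (its actual schema is not shown on the slides) and that uploadYear comes last in the select list, since Hive takes dynamic partition values from the trailing columns:

```sql
-- Allow Hive to create partitions from the values found in the data.
SET hive.exec.dynamic.partition=true;
-- nonstrict: no static partition value is required in the PARTITION clause.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must be the last column of the select list.
-- The column names of video_external are an assumption.
INSERT INTO TABLE video PARTITION (uploadYear)
SELECT id, title, description, viewCount, uploadYear
FROM video_external;

-- List the partitions that the insert created.
SHOW PARTITIONS video;
```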
2. Use bucketing
create table video (
id STRING,
channelId STRING,
title STRING,
description STRING
) CLUSTERED BY (channelId)
INTO 2 BUCKETS
STORED AS ORC;
create table channel (
id STRING,
title STRING,
description STRING,
viewCount BIGINT
) CLUSTERED BY (id)
INTO 2 BUCKETS
STORED AS ORC;
SELECT v.title FROM video v JOIN channel ch ON v.channelId = ch.id WHERE ch.viewCount > 1000;
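Bucketed tables only help if the data is actually written into the declared buckets. A hedged sketch of loading both tables and enabling the bucket-aware map join; channel_external is a hypothetical staging table, and the column list of video_external is assumed:

```sql
-- Needed on Hive releases before 2.0 so that inserts honor the bucket
-- definition; later releases always enforce bucketing.
SET hive.enforce.bucketing=true;

-- Column names of the staging tables are assumptions.
INSERT OVERWRITE TABLE video
SELECT id, channelId, title, description
FROM video_external;

-- channel_external is hypothetical, used here only for illustration.
INSERT OVERWRITE TABLE channel
SELECT id, title, description, viewCount
FROM channel_external;

-- Allow a map-side join to load only the matching bucket of the smaller table.
SET hive.optimize.bucketmapjoin=true;
```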
Source data:
/folder1/video_data/file1
id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2012
4, title4, channelId4, description4, 2012
5, title5, channelId5, description5, 2013
6, title6, channelId6, description6, 2013
7, title7, channelId7, description7, 2013
8, title8, channelId8, description8, 2013
After bucketing by channelId into 2 buckets:
/folder1/video_data/file1
2, title2, channelId2, description2, 2012
4, title4, channelId4, description4, 2012
6, title6, channelId6, description6, 2013
8, title8, channelId8, description8, 2013
/folder1/video_data/file2
1, title1, channelId1, description1, 2012
3, title3, channelId3, description3, 2012
5, title5, channelId5, description5, 2013
7, title7, channelId7, description7, 2013
Source data:
/folder1/channel_data/file1
id, title, description, viewCount
channelId1, title1, description1, viewCount1
channelId2, title2, description2, viewCount2
channelId3, title3, description3, viewCount3
channelId4, title4, description4, viewCount4
channelId5, title5, description5, viewCount5
channelId6, title6, description6, viewCount6
channelId7, title7, description7, viewCount7
channelId8, title8, description8, viewCount8
After bucketing by id into 2 buckets:
/folder1/channel_data/file1
channelId2, title2, description2, viewCount2
channelId4, title4, description4, viewCount4
channelId6, title6, description6, viewCount6
channelId8, title8, description8, viewCount8
/folder1/channel_data/file2
channelId1, title1, description1, viewCount1
channelId3, title3, description3, viewCount3
channelId5, title5, description5, viewCount5
channelId7, title7, description7, viewCount7
3. Partitions + bucketing
create table video (
id STRING,
channelId STRING,
title STRING,
description STRING,
viewCount BIGINT
) PARTITIONED BY (uploadYear date)
CLUSTERED BY (channelId)
INTO 2 BUCKETS
STORED AS ORC;
Source data:
/folder1/video_data/file1
id, title, channelId, viewCount, uploadYear
1, title1, channelId1, viewCount1, 2012
2, title2, channelId2, viewCount2, 2012
3, title3, channelId3, viewCount3, 2012
4, title4, channelId4, viewCount4, 2012
5, title5, channelId5, viewCount5, 2013
6, title6, channelId6, viewCount6, 2013
7, title7, channelId7, viewCount7, 2013
8, title8, channelId8, viewCount8, 2013
Partitioned by uploadYear, then bucketed by channelId into 2 buckets:
/folder1/video_data/2012/file1
2, title2, description2, viewCount2, 2012
4, title4, description4, viewCount4, 2012
/folder1/video_data/2012/file2
1, title1, description1, viewCount1, 2012
3, title3, description3, viewCount3, 2012
/folder1/video_data/2013/file1
6, title6, description6, viewCount6, 2013
8, title8, description8, viewCount8, 2013
/folder1/video_data/2013/file2
5, title5, description5, viewCount5, 2013
7, title7, description7, viewCount7, 2013
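A query can benefit from both layouts at once. The sketch below assumes the channel table from the bucketing slides and reuses the date literal from the earlier partition query purely as an illustration:

```sql
-- The uploadYear filter restricts the scan to a single partition directory;
-- the join key channelId is the bucketing column of video.
SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE v.uploadYear = '2013-04-08'
  AND ch.viewCount > 1000;
```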
4. Use joins optimization
Shuffle join/Common join:
Map-side join:
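A map-side join loads the smaller table into memory on each mapper and so avoids the shuffle phase. A minimal sketch using the video and channel tables from the bucketing slides; the threshold value shown is the commonly documented default, not a recommendation from the talk:

```sql
-- Let Hive convert a common join into a map join when one side is small.
SET hive.auto.convert.join=true;
-- Size threshold in bytes below which a table counts as "small".
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;

-- Alternative: request the map join explicitly with a hint.
-- The hint is ignored while hive.ignore.mapjoin.hint is true.
SELECT /*+ MAPJOIN(ch) */ v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```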
Sort-merge-bucket (SMB) join:
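An SMB join needs both tables bucketed and sorted on the join key, with compatible bucket counts. The sketch below uses hypothetical tables video_smb and channel_smb, which are variants of the earlier tables with a SORTED BY clause added:

```sql
-- Hypothetical variants of video and channel, additionally sorted on the join key.
CREATE TABLE video_smb (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING
)
CLUSTERED BY (channelId) SORTED BY (channelId) INTO 2 BUCKETS
STORED AS ORC;

CREATE TABLE channel_smb (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
)
CLUSTERED BY (id) SORTED BY (id) INTO 2 BUCKETS
STORED AS ORC;

-- Allow Hive to pick a sort-merge-bucket join for suitably laid out tables.
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT v.title
FROM video_smb v
JOIN channel_smb ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```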
5. Choose the right input format
Row data vs. column store
6. Other optimizations
Avoid highly normalized table structures
Compress map/reduce output
For map output compression, execute set mapred.compress.map.output = true.
For job output compression, execute set mapred.output.compress = true.
Use parallel execution
SET hive.exec.parallel=true;
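The compression and parallelism settings above can be combined in one session preamble. This is a sketch only: the choice of Snappy as the codec is an assumption (it must be available on the cluster), and the thread count shown is the usual default:

```sql
-- Compress intermediate data passed between the stages of a query.
SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output.
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Run independent stages of a query in parallel.
SET hive.exec.parallel=true;
-- Upper bound on the number of stages executed at the same time.
SET hive.exec.parallel.thread.number=8;
```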
7. Use the 'explain' keyword to improve the query execution plan
EXPLAIN query...
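As an illustration, the join query from the bucketing slides can be inspected as follows. The plan output depends on the Hive version and the session settings, so none is reproduced here:

```sql
-- Show the stages and operators Hive plans to run (for example, whether
-- the join is executed as a common join or a map join).
EXPLAIN
SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;

-- EXTENDED adds details such as input paths and file formats.
EXPLAIN EXTENDED
SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```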
8. Stinger Initiative
Use cost-based optimization
Use vectorization
Transactions with ACID semantics
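Cost-based optimization and vectorization are switched on through session settings. A hedged sketch, assuming the channel table from the earlier slides; the optimizer needs table and column statistics to make useful decisions, and vectorized execution works on ORC data:

```sql
-- Enable the cost-based optimizer and let it read statistics.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Collect table-level and column-level statistics for the optimizer.
ANALYZE TABLE channel COMPUTE STATISTICS;
ANALYZE TABLE channel COMPUTE STATISTICS FOR COLUMNS id, viewCount;

-- Process rows in batches instead of one at a time.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
```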
8. Hive on Tez
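Assuming Tez is installed on the cluster, switching a session from MapReduce to Tez is a single setting:

```sql
-- Run subsequent queries of this session on Tez instead of MapReduce.
SET hive.execution.engine=tez;

SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```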
8. Sub-Second Queries with Hive LLAP
A new approach using a hybrid engine that leverages Tez and a new component called LLAP (Live Long and Process)
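On a Hive 2.x cluster where the LLAP daemons are already running (an assumption, since the talk does not cover deployment), a session can be pointed at LLAP roughly as follows:

```sql
-- LLAP runs on top of Tez.
SET hive.execution.engine=tez;
-- Execute query fragments in the long-lived LLAP daemons
-- instead of freshly started containers.
SET hive.execution.mode=llap;
-- Send all eligible operators to LLAP.
SET hive.llap.execution.mode=all;
```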
Questions?