Upload
qubole
View
355
Download
1
Embed Size (px)
DESCRIPTION
A session from Qubole Best Practice Webinar Series- “Big Data Secrets from the Pros”. Covers how to make Apache Hive queries run faster by a. Better layout of data on HDFS via partitioning and bucketing b. Designing test queries by using block and bucket sampling before running the queries on large datasets c. Using bucket map joins and parallel processing to run queries faster Visit www.qubole.com for more information.
Citation preview
Effec%ve Hive Queries Secrets From the Pros
We will be star,ng at 11:03 PDT Use the Chat Pane in GoToWebinar to Ask Ques%ons!
Assess your level and learn new stuff This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts) ?
AGENDA This Webinar provides %ps on improving the performance and beJer u%lizing resources using the following best prac%ces:
• Data Layout (Par%%ons and Buckets) • Data Sampling (Bucket and Block sampling) • Data Processing (Bucket Map Join and Parallel execu%on)
Dataset Used
# of records: 276M records Columns: I%nerary ID Year &Quarter of Travel Trip Origin City & State Trip Des%na%on City & State Distance between Origin & Des%na%on
Airline Bookings All Includes stops at intermediate ci%es
# of records: 116M records Columns: I%nerary ID Year &Quarter of Travel Trip Origin City & State Trip Des%na%on City & State Distance between Origin & Des%na%on
Airline Bookings Origin Only Only first leg of travel
# of records: 50 Columns: State code & Name Popula%on
Census Human popula%on by US State
#1 -‐ Data Par%%oning • Problem PaJern
– Query a subset of data in a table – Subset iden%fied by “Column_Name = X” filter
• Solu%on paJern – Layout data in sub-‐directories with each directory associated with a value of the par%%on column
– The filter on par%%on column just picks a single sub directory • Approach
– Use PARTITION BY clause • Benefit
– Par%%on pruning – 2.7x faster on a query on Airline Bookings Dataset (29 seconds)
#1 -‐ Data Par%%oning
Airline Bookings All Table
Origin State (Par%%on Column / Sub-‐directory) CA WY AL
File1001.dat
File1002.dat
File100n.dat
File3001.dat
File3002.dat
File300n.dat
Filex001.dat
Filex002.dat
Filex00n.dat
Files inside the par%%on
SELECT origin_city, origin_state FROM Airline_Bookings_All WHERE origin_state = ‘CA’
CREATE TABLE Airline_Bookings_All …. PARTITIONED BY (origin_state STRING)
#2 -‐ Data Bucke%ng
• Problem PaJern – Join data in two large tables efficiently – Sample data inside a table efficiently
• Solu%on paJern – More efficient processing by storing data in hash buckets
• Approach • Use bucke%ng using CLUSTERED BY .. INTO n BUCKETS
• Benefit – Bucket Map Join – Bucket Sampling
#2 – Data Bucke%ng CREATE TABLE Airline_Bookings_All … CLUSTERED BY (i%nid) INTO 64 BUCKETS
set hive.enforce.bucke%ng = true; INSERT OVERWRITE TABLE Airline_Bookings_All SELECT … FROM ..
Ailrine_Bookings_All
File00.dat
File63.dat
File01.dat Each File contains all
the rows that correspond to the same hash of i%nid
column
#2 -‐ Data Bucke%ng
a
File1001.dat
File1002.dat
File100n.dat
Filex001.dat
Filex002.dat
Filex00m.dat
Files containing table data bucketed on a
column
b
set hive.op%mize.bucketmapjoin = true; SELECT /*+ MAPJOIN(a, b) */ a.*, b.* FROM Airline_Bookings_All a JOIN Airline_Bookings_Origin_Only b ON a.i%nid = b.i%nid
Note: 1. Both the tables are bucketed on i%nid column 2. The numbers of buckets in the two tables are a strict mul%ple of each other
#3 -‐ Bucket Sampling
• Problem PaJern – Work on joinable samples of data from different tables
• Solu%on paJern – Use Bucket Sampling
• Approach • TABLESAMPLE (BUCKET x OUT OF Y ON column)
• Benefit – Useful while working with sample data and joins
#3 -‐ Bucket Sampling
Filex002.dat
Filex030.dat
Filex064.dat
Files containing bookings data bucketed on i%nid
a
SELECT a.*, b.* FROM Airline_Bookings_All TABLESAMPLE(bucket 30 out of 64 on i%nid) a , Airline_Bookings_Origin_Only TABLESAMPLE(bucket 30 out of 64 on i%nid) b WHERE a.i%nid = b.i%nid
Filex001.dat
Filex063.dat
Filey002.dat
Filey030.dat
Filey064.dat
b
Filey001.dat
Filey063.dat
#4 – Block Sampling • Problem PaJern
– View a sample of a data with in a table – Sample size expressed as number of rows, %age of data, or number of MBs
• Solu%on paJern – Use Block sampling
• Approach – Use TABLESAMPLE (n%, nM, or n ROWS)
• Benefit – Geyng a random sample from the table – More op%ons to specify how many samples to generate
#5 – Parallel Execu%on
SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct FROM (
SELECT itinid, year, quarter, origin, originstate FROM air_travel_bookings_8
)a JOIN (
SELECT itinid, origin, originstate FROM air_travel_origins_8
)B ON ( A.itinid = b.itinid and a.origin = b.origin and a.originstate = b.originstate) GROUP BY
a.year, a.quarter, a.origin, a.originstate;
Stage 1
Stage 2
Stage 3
Stage 1
Stage 2
Stage 3
Stage 1 Stage 2
Stage 3
set hive.exec.parallel = false;
set hive.exec.parallel = true;
Summary
• Iterate quickly on Query Design – Use Bucket and Block Sampling
• Run queries faster – Par%%oning to invoke Par%%on Pruning – Bucke%ng to invoke Bucket Map Joins – Execute complex queries in parallel
THANK YOU
Managed Cluster Built-‐In Connectors Friendly User-‐Interface Dedicated Support
• 100% Managed Hadoop Cluster in the Cloud • Auto-‐Scaling Cluster. Full Life-‐cycle Management • +12 Connectors to Applica%ons and Data Sources • 14-‐Day Free Trial (free account available) • 24/7 Customer Support
What’s Included?
è www.qubole.com/try ç