Query Optimization and JIT-based Vectorized Execution in Apache Tajo
Hyunsik Choi, Research Director, Gruter
Hadoop Summit North America 2014


Apache Tajo is an open-source big data warehouse system on Hadoop. This slide deck presents two major performance-improvement efforts in the Tajo project. The first is query optimization, including cost-based join ordering and progressive optimization. The second is JIT-based vectorized processing.


Page 1: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Query Optimization and JIT-based Vectorized Execution in Apache Tajo
Hyunsik Choi
Research Director, Gruter
Hadoop Summit North America 2014

Page 2: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Talk Outline

• Introduction to Apache Tajo

• Key Topics
  – Query Optimization in Apache Tajo
    • Join Order Optimization
    • Progressive Optimization
  – JIT-based Vectorized Engine

Page 3: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

About Me

• Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
• PhD (Computer Science & Engineering, 2013), Korea University
• Director of Research, Gruter Corp, Seoul, South Korea

• Open-source Involvement
  – Full-time contributor to Apache Tajo (2013.6 ~ )
  – Apache Tajo PMC member and committer (2013.3 ~ )
  – Apache Giraph PMC member and committer (2011.8 ~ )

• Contact Info
  – Email: [email protected]
  – LinkedIn: http://linkedin.com/in/hyunsikchoi/

Page 4: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Apache Tajo

• Open-source “SQL-on-Hadoop” big data warehouse system

• Apache Top-level project since March 2014

• Supports SQL standards

• Supports both low-latency queries and long-running batch queries

• Features
  – Joins (inner and all outer types), group-by, and sort
  – Window functions
  – Most SQL data types supported (except for Decimal)

• Recent 0.8.0 release
  – https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0

Page 5: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Overall Architecture

Page 6: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Query Optimization

Page 7: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Optimization in Tajo

Query Optimization Steps

Page 8: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Logical Plan Optimization in Tajo

• Rewrite Rules
  – Projection Push Down
    • Pushes expressions down to operators as low as possible
    • Narrows the set of columns to read
    • Removes duplicated expressions when several expressions share a common sub-expression
  – Selection Push Down
    • Reduces the number of rows to be processed as early as possible
  – Extensible rewrite rule interfaces
    • Allow developers to write their own rewrite rules

• Join order optimization
  – Enumerates possible join orders
  – Determines an optimized join order in a greedy manner
  – Currently uses a simple cost model based on table volumes

Page 9: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Join Optimization - Greedy Operator Ordering

Set<LogicalNode> remainRelations = new LinkedHashSet<LogicalNode>();
for (RelationNode relation : block.getRelations()) {
  remainRelations.add(relation);
}

LogicalNode latestJoin;
JoinEdge bestPair;

while (remainRelations.size() > 1) {
  // Find the best join pair among all joinable operators in the candidate set.
  bestPair = getBestPair(plan, joinGraph, remainRelations);

  // remainRels = remainRels \ Ti
  remainRelations.remove(bestPair.getLeftRelation());
  // remainRels = remainRels \ Tj
  remainRelations.remove(bestPair.getRightRelation());

  latestJoin = createJoinNode(plan, bestPair);
  remainRelations.add(latestJoin);
}

findBestOrder() in GreedyHeuristicJoinOrderAlgorithm.java

Page 10: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Progressive Optimization (in DAG controller)

• Query plans are often suboptimal because they rely on estimated statistics

• Progressive Optimization:
  – Collects statistics from the running query at runtime
  – Re-optimizes the remaining plan stages

• Determines optimal ranges and partitions at runtime based on operator type (join, aggregation, and sort) (since v0.2)

• In-progress work (planned for 1.0)
  – Re-optimize join orders
  – Re-optimize the distributed join plan
    • e.g., convert a symmetric shuffle join into a broadcast join
    • Shrink multiple stages into fewer stages

Page 11: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

JIT-based Vectorized Query Engine

Page 12: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorized Processing - Motivation

• Work so far has focused on I/O throughput
  – Achieved 70-110 MB/s in disk-bound queries

• Increasing customer demand for faster storage such as SAS disks and SSDs

• Benchmark tests (BMT) with fast storage indicate that performance is likely CPU-bound rather than disk-bound

• The current execution engine is based on the tuple-at-a-time approach

Page 13: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

What is Tuple-at-a-time model?

• Every physical operator produces one tuple at a time by recursively calling next() on its child operators

(Figure: tuples flow up the operator tree while next() calls flow down)

Upside
• Simple interface
• Arbitrary operator combinations

Downside (performance degradation)
• Too many function calls
• Too many branches
  • Bad for CPU pipelining
  • Bad data/instruction cache hit rates
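A minimal sketch of this iterator-style (Volcano) interface, using simplified, hypothetical operator and Tuple classes rather than Tajo's actual physical operator API:

import java.util.function.Predicate;

// Illustrative only: a simplified tuple-at-a-time (Volcano) operator model.
class Tuple {
  Object[] values;
  Tuple(Object[] values) { this.values = values; }
}

interface PhysicalOperator {
  Tuple next();   // returns the next tuple, or null when the input is exhausted
}

class FilterOp implements PhysicalOperator {
  private final PhysicalOperator child;
  private final Predicate<Tuple> pred;

  FilterOp(PhysicalOperator child, Predicate<Tuple> pred) {
    this.child = child;
    this.pred = pred;
  }

  public Tuple next() {
    Tuple t;
    // One virtual call to the child and at least one branch per tuple:
    // this per-row overhead is exactly what the vectorized engine removes.
    while ((t = child.next()) != null) {
      if (pred.test(t)) {
        return t;
      }
    }
    return null;
  }
}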

Page 14: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Performance Degradation

The current implementation also uses:

• Immutable Datum classes wrapping Java primitives
  – Used in expression evaluation and serialization

Resulting in:
• Object creation overheads
• A big memory footprint (particularly inefficient for in-memory operations)

• Expression trees
  – Each primitive operator evaluation involves a function call (interpretation overhead)

Page 15: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Benchmark Breakdown

• TPC-H Q1:

select
  l_returnflag,
  l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from lineitem
where l_shipdate <= '1998-09-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus

Page 16: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Benchmark Breakdown

• TPC-H dataset (scale factor = 3)
  – 17,996,609 (about 18M) rows in lineitem

• Plain-text lineitem table (2.3 GB)

• CSV dataset converted to a Parquet file
  – To minimize the effect of other factors that may impact CPU cost
  – No compression
  – 256 MB block size, 1 MB page size

• Results in a single 1 GB Parquet file

Page 17: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Benchmark Breakdown

• H/W environment
  – CPU: i7-4770 (3.4 GHz), 32 GB RAM
  – 1 SATA disk (WD2003FZEX)
    • Read throughput: 105-167 MB/s (avg. 144 MB/s) according to http://hdd.userbenchmark.com

• Single thread and single machine

• The benchmark directly calls next() on the root of the physical operator tree

Page 18: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Benchmark Breakdown

(Chart: TPC-H Q1 elapsed time breakdown in milliseconds; scan throughput about 100 MB/s. CPU accounts for about 50% of total query processing time in TPC-H Q1.)

Page 19: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Benchmark Breakdown

(Chart, in milliseconds: FROM lineitem; GROUP BY l_returnflag; GROUP BY l_returnflag, l_shipflag; sum(…) x 4; avg(…) x 3; TPC-H Q1)

Page 20: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Benchmark Analysis

• Much room for improvement

• Each tuple evaluation may incur overheads in the tuple-at-a-time model
  – Cache misses and branch mispredictions are not easy to measure

• Each expression incurs non-trivial CPU cost
  – Interpretation overheads
  – Composite keys seem to degrade performance

• Too many objects are created (YourKit profiler analysis)
  – Object creation is hard to avoid because in-memory operators must retain all tuple and datum instances

• Hash aggregation
  – Java HashMap: effective, but not cheap (see the sketch below)
  – Non-trivial GC time observed in other tests when distinct keys > 10M
  – Java objects: big memory footprint, cache misses
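As an illustration of the hash aggregation overhead described above (a hedged sketch, not Tajo's code), a plain Java HashMap-based group-by allocates a key object and an accumulator object per distinct group:

import java.util.HashMap;
import java.util.Map;

// Illustrative only: a naive HashMap-based hash aggregation.
// Every distinct group key retains a key object and allocates an accumulator object,
// which is where the memory footprint and GC pressure come from.
class NaiveHashAggregation {
  static class Acc { double sum; long count; }

  static Map<String, Acc> aggregate(String[] groupKeys, double[] values) {
    Map<String, Acc> groups = new HashMap<>();
    for (int i = 0; i < groupKeys.length; i++) {
      // computeIfAbsent allocates a new Acc for each distinct key;
      // with >10M distinct keys this allocation dominates GC time.
      Acc acc = groups.computeIfAbsent(groupKeys[i], k -> new Acc());
      acc.sum += values[i];
      acc.count++;
    }
    return groups;
  }
}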

Page 21: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Our Solution

• Vectorized processing
  – Columnar processing on primitive arrays

• JIT compilation helps the vectorization engine
  – Eliminates vectorization impediments

• Unsafe-based in-memory structure for vectors
  – No object creation

• Unsafe-based Cuckoo hash table
  – Fast lookup and no GC

Page 22: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorized Processing

• Originated in database research
  – C-Store, MonetDB, and Vectorwise

• Recently adopted in Hive 0.13

• Key ideas:
  – Use primitive-type arrays as column values
  – Small and simple loop processing
  – In-cache processing
  – Fewer branches, better CPU pipelining
  – SIMD
    • SIMD in Java?? See HotSpot's auto-vectorization:
      http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/file/tip/src/share/vm/opto/superword.cpp

Page 23: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorized Processing

Example relation (Id, Name, Age):
  (101, abc, 22), (102, def, 37), (104, ghi, 45), (105, jkl, 25),
  (108, mno, 31), (112, pqr, 27), (114, owx, 35)

N-ary storage model (NSM): rows stored one after another
  101 abc 22 | 102 def 37 | 104 ghi 45 | 105 jkl 25 | 108 mno 31 | 112 pqr 27 | 114 owx 35

Decomposition storage model (DSM): each column's values stored contiguously
  Id:   101 102 104 105 108 112 114
  Name: abc def ghi jkl mno pqr owx
  Age:  22  37  45  25  31  27  35

Page 24: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorized Processing

Decomposition storage model: each column (Id, Name, Age) is stored as one long array covering all rows (bad cache hits).

Vectorized model: each column is split into vector blocks, each fitting in cache (better cache hits):
  vector block A: Id 101 102 104 105 | Name abc def ghi jkl | Age 22 37 45 25
  vector block B: Id 108 112 114     | Name mno pqr owx     | Age 31 27 35

Page 25: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorized Processing

void MapAddLongIntColCol(int vecNum, long[] result, long[] col1, int[] col2, int[] selVec) {
  if (selVec == null) {
    // No selection vector: process every row in the vector block.
    for (int i = 0; i < vecNum; i++) {
      result[i] = col1[i] + col2[i];
    }
  } else {
    // Selection vector: process only the selected row indexes.
    int selIdx;
    for (int i = 0; i < vecNum; i++) {
      selIdx = selVec[i];
      result[selIdx] = col1[selIdx] + col2[selIdx];
    }
  }
}

Example: Add primitive for long and int vectors

Page 26: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorized Processing

void SelLEQLongIntColCol(int vecNum, int[] resSelVec, long[] col1, int[] col2, int[] selVec) {
  if (selVec == null) {
    int selected = 0;
    for (int rowIdx = 0; rowIdx < vecNum; rowIdx++) {
      // Branch-free selection: always write the row index, then advance the
      // output position only when the predicate col1 <= col2 holds.
      resSelVec[selected] = rowIdx;
      selected += col1[rowIdx] <= col2[rowIdx] ? 1 : 0;
    }
  } else {
    …
  }
}

Example: Less than equal filter primitive for long and int vectors

Page 27: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorized Processing

(Figure: an example of vectorized processing for TPC-H Q1. Column values for l_shipdate, l_discount, l_extprice, l_tax, and returnflag are split into vector blocks 1-3; each block flows through the primitives l_shipdate <= '1998-09-01', 1 - l_discount, l_extprice * l_tax, and finally aggregation.)
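A minimal sketch of how such a pipeline could chain selection, map, and aggregation primitives over one vector block (assumed method names, simplified relative to Tajo's generated primitives):

// Illustrative pipeline over one vector block; the primitives mirror those
// on the previous slides but are not Tajo's actual generated code.
class VectorizedQ1Fragment {
  // Filter: col <= constant; fills resSelVec and returns the number of selected rows.
  static int selLEQLongConst(int vecNum, int[] resSelVec, long[] col, long constant) {
    int selected = 0;
    for (int rowIdx = 0; rowIdx < vecNum; rowIdx++) {
      resSelVec[selected] = rowIdx;
      selected += col[rowIdx] <= constant ? 1 : 0;   // branch-free selection
    }
    return selected;
  }

  // Map: result = col1 * col2, restricted to the rows in selVec.
  static void mapMulDoubleColCol(int selNum, double[] result,
                                 double[] col1, double[] col2, int[] selVec) {
    for (int i = 0; i < selNum; i++) {
      int idx = selVec[i];
      result[idx] = col1[idx] * col2[idx];
    }
  }

  // Aggregate: sum over the selected rows.
  static double aggSumDoubleCol(int selNum, double[] col, int[] selVec) {
    double sum = 0;
    for (int i = 0; i < selNum; i++) {
      sum += col[selVec[i]];
    }
    return sum;
  }

  // Chain the primitives for one block: filter on shipdate, multiply, then sum.
  static double sumForBlock(int vecNum, long[] shipdate, double[] extprice,
                            double[] tax, long shipdateUpperBound) {
    int[] selVec = new int[vecNum];
    double[] product = new double[vecNum];
    int selNum = selLEQLongConst(vecNum, selVec, shipdate, shipdateUpperBound);
    mapMulDoubleColCol(selNum, product, extprice, tax, selVec);
    return aggSumDoubleCol(selNum, product, selVec);
  }
}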

Page 28: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorized Processing in Tajo

• Unsafe-based in-memory structure for vectors
  – Fast direct memory access
  – More opportunities to use byte-level operations

• Vectorization + just-in-time compilation
  – Byte code for vectorization primitives is generated at runtime
  – Significantly reduces branches and interpretation overheads

Page 29: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Unsafe-based In-memory Structure for Vectors

• One memory chunk is divided into multiple fixed-length vectors

• Variable-length values are stored in pages of a variable area
  – Only pointers are stored in the fixed-length vector

• Less data copying and object creation

• Fast direct access

• Easy byte-level operations
  – Guava’s FastByteComparisons compares two strings via long comparisons
    • Forked it to directly access string vectors

(Figure: memory layout with a fixed area and a variable area; the variable-length field vector in the fixed area holds pointers into the variable area.)
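A minimal sketch of an Unsafe-backed fixed-length long vector (illustrative only; the class and method names are assumptions, not Tajo's actual vector classes):

import sun.misc.Unsafe;
import java.lang.reflect.Field;

// Illustrative off-heap long vector backed by a single raw memory chunk.
class UnsafeLongVector {
  private static final Unsafe UNSAFE;
  static {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      UNSAFE = (Unsafe) f.get(null);
    } catch (Exception e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  private final long address;   // base address of the raw memory chunk
  private final int size;

  UnsafeLongVector(int size) {
    this.size = size;
    // One direct allocation; no per-element Java objects, no GC pressure.
    this.address = UNSAFE.allocateMemory((long) size * Long.BYTES);
  }

  void put(int idx, long value) { UNSAFE.putLong(address + (long) idx * Long.BYTES, value); }

  long get(int idx) { return UNSAFE.getLong(address + (long) idx * Long.BYTES); }

  void free() { UNSAFE.freeMemory(address); }
}

Because the data lives in one off-heap chunk, a whole vector is released with a single free() call and never touches the garbage collector.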

Page 30: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Vectorization + Just-in-time Compilation

• Even for a single operation type, many type combinations are required:
  – INT vector (+,-,*,/,%) INT vector
  – INT vector (+,-,*,/,%) INT single value
  – INT single value (+,-,*,/,%) INT vector
  – INT column (+,-,*,/,%) LONG vector
  – …
  – FLOAT column …

• ASM is used to generate Java byte code at runtime for the various primitives (see the sketch below)
  – Cheaper code maintenance
  – Composite keys for sort, group-by, and hash functions

• Fewer branches and nested loops

• Complex vectorization primitive generation (planned)
  – Combining multiple primitives into one primitive
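A hedged sketch of runtime byte code generation with ASM (assuming the ASM 5 library is on the classpath; the interface and class names are hypothetical, not Tajo's generator). One class is generated per operator/type combination, here int + int:

import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import static org.objectweb.asm.Opcodes.*;

public class PrimitiveGenerator {

  // Hypothetical primitive interface; a real engine would take vector arguments.
  public interface IntBinOp { int apply(int a, int b); }

  // Tiny class loader exposing defineClass for the generated byte code.
  static class DynLoader extends ClassLoader {
    Class<?> define(String name, byte[] b) { return defineClass(name, b, 0, b.length); }
  }

  public static IntBinOp generateAddIntInt() throws Exception {
    String name = "GenAddIntInt";
    ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES | ClassWriter.COMPUTE_MAXS);
    cw.visit(V1_7, ACC_PUBLIC, name, null, "java/lang/Object",
             new String[]{"PrimitiveGenerator$IntBinOp"});

    // Default constructor: calls Object.<init>().
    MethodVisitor ctor = cw.visitMethod(ACC_PUBLIC, "<init>", "()V", null, null);
    ctor.visitCode();
    ctor.visitVarInsn(ALOAD, 0);
    ctor.visitMethodInsn(INVOKESPECIAL, "java/lang/Object", "<init>", "()V", false);
    ctor.visitInsn(RETURN);
    ctor.visitMaxs(0, 0);
    ctor.visitEnd();

    // int apply(int a, int b) { return a + b; }
    // The opcode (IADD) and the descriptor "(II)I" are the parts a generator
    // would vary per operator and per type combination.
    MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "apply", "(II)I", null, null);
    mv.visitCode();
    mv.visitVarInsn(ILOAD, 1);
    mv.visitVarInsn(ILOAD, 2);
    mv.visitInsn(IADD);
    mv.visitInsn(IRETURN);
    mv.visitMaxs(0, 0);
    mv.visitEnd();
    cw.visitEnd();

    Class<?> generated = new DynLoader().define(name, cw.toByteArray());
    return (IntBinOp) generated.newInstance();
  }
}

A caller would invoke generateAddIntInt().apply(1, 2); the generated method body is ordinary byte code, so the JVM's JIT can inline and optimize it like hand-written Java.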

Page 31: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Unsafe-based Cuckoo Hash Table

• Advantages of a Cuckoo hash table
  – Uses multiple hash functions
  – No linked lists
  – Only one item in each bucket
  – Worst-case constant lookup time

• A single direct memory allocation backs the whole hash table
  – Indexed chunks are used as buckets
  – No GC overheads even when rehashing all buckets

• Simple and fast lookup (see the sketch below)

• The current implementation only supports fixed-length hash buckets
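A minimal on-heap sketch of the cuckoo hashing idea for long keys (two hash functions, one item per bucket, bounded displacement); Tajo's actual table is Unsafe-based and off-heap, and the constants and class name here are assumptions:

import java.util.Arrays;

// Illustrative cuckoo hash set for long keys; one item per bucket, two hash functions.
class CuckooLongSet {
  private static final long EMPTY = Long.MIN_VALUE;   // assumed unused sentinel value
  private long[] table;
  private int capacity;

  CuckooLongSet(int capacity) {
    this.capacity = capacity;
    this.table = new long[capacity];
    Arrays.fill(table, EMPTY);
  }

  private int h1(long k) { return (int) ((k * 0x9E3779B97F4A7C15L) >>> 33) % capacity; }
  private int h2(long k) { return (int) ((k * 0xC2B2AE3D27D4EB4FL) >>> 33) % capacity; }

  boolean contains(long k) {
    // Worst-case constant lookup: at most two buckets to probe.
    return table[h1(k)] == k || table[h2(k)] == k;
  }

  void insert(long k) {
    if (contains(k)) return;
    long cur = k;
    int idx = h1(cur);
    for (int kicks = 0; kicks < 32; kicks++) {   // bounded number of displacements
      long evicted = table[idx];
      table[idx] = cur;
      if (evicted == EMPTY) return;
      // Move the evicted key to its alternate bucket.
      cur = evicted;
      idx = (h1(cur) == idx) ? h2(cur) : h1(cur);
    }
    rehash();        // too many displacements: grow the table and reinsert everything
    insert(cur);     // place the key that is still homeless
  }

  private void rehash() {
    long[] old = table;
    capacity *= 2;
    table = new long[capacity];
    Arrays.fill(table, EMPTY);
    for (long k : old) if (k != EMPTY) insert(k);
  }
}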

Page 32: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Benchmark Breakdown: Tajo JIT + Vectorized Engine

(Chart: TPC-H Q1 elapsed time in milliseconds broken down into: scanning lineitem (throughput 138 MB/s), expression evaluation (projection), hashing group-by key columns, finding all hash bucket ids, and aggregation.)

Page 33: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Summary

• Tajo uses join order optimization and re-optimizes special cases while queries are running

• JIT-based vectorized engine prototype
  – Significantly reduces CPU time through:
    • Vectorized processing
    • Unsafe-based in-memory vector structure
    • Unsafe-based Cuckoo hashing

• Future work
  – Generate single complex primitives that process multiple operators at a time
  – Improvements toward production quality

Page 34: Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Get Involved!

• We are recruiting contributors!

• General
  – http://tajo.apache.org

• Getting Started
  – http://tajo.apache.org/docs/0.8.0/getting_started.html

• Downloads
  – http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html

• JIRA – Issue Tracker
  – https://issues.apache.org/jira/browse/TAJO

• Join the mailing lists
  – [email protected]
  – [email protected]