
ORC File & Vectorization - Improving Hive Data Storage and Query Performance


DESCRIPTION

Hive’s RCFile has been the standard format for storing Hive data for the last three years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 adds a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so a query that reads only one column reads only the required bytes. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column, both per set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t relevant to the query. Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column-store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10x to 100x, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC through a vectorized iterator.


Copyright 2013 by Hortonworks and Microsoft

ORC File & Vectorization Improving Hive Data Storage and Query Performance

June 2013

Page 1

Owen O’Malley

[email protected]

@owen_omalley

Jitendra Pandey

[email protected]

Eric Hanson

[email protected]

[email protected]

ORC – Optimized RC File

Page 2

History

Page 3

Remaining Challenges

Page 4

Requirements

Page 5

File Structure

Page 6

Stripe Structure

Page 7

File Layout

Page 8

[Diagram: file layout — an ORC file is a sequence of large (256 MB) stripes; each stripe holds Index Data, Row Data, and a Stripe Footer. Within a stripe’s Row Data, each column (Column 1 … Column 8) is stored contiguously, and a column may be written as several streams (e.g. Stream 2.1 – Stream 2.4). The file ends with a File Footer followed by a Postscript.]

Compression

Page 9

Integer Column Serialization

Page 10
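The bullets for this slide aren’t captured in the transcript. As a minimal illustrative sketch of the run-length/delta idea described in the overview (not ORC’s actual RLE format — the class and encoding layout here are invented): a run of integers with a constant difference can be stored as (length, base, delta) instead of the raw values.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of delta + run-length encoding for an integer column, in the
// spirit of ORC's lightweight integer serialization. Invented format:
// a run of >= 3 values with a constant delta becomes {length, base, delta};
// everything else becomes a single literal {1, value, 0}.
public class IntRunLengthSketch {
    public static List<long[]> encode(long[] values) {
        List<long[]> out = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            if (i + 1 < values.length) {
                long delta = values[i + 1] - values[i];
                int j = i + 1;
                while (j + 1 < values.length && values[j + 1] - values[j] == delta) {
                    j++;
                }
                int runLen = j - i + 1;
                if (runLen >= 3) {
                    out.add(new long[] { runLen, values[i], delta });
                    i = j + 1;
                    continue;
                }
            }
            out.add(new long[] { 1, values[i], 0 }); // single literal
            i++;
        }
        return out;
    }

    public static long[] decode(List<long[]> runs) {
        int n = 0;
        for (long[] r : runs) n += (int) r[0];
        long[] out = new long[n];
        int k = 0;
        for (long[] r : runs) {
            long v = r[1];
            for (long c = 0; c < r[0]; c++) {
                out[k++] = v;
                v += r[2]; // advance by the constant delta
            }
        }
        return out;
    }
}
```

For example, the column [1, 2, 3, 4, 5, 9, 9, 9] encodes to just two runs, {5, 1, 1} and {3, 9, 0}, which is why sorted or slowly-changing columns compress so well.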

String Column Serialization

Page 11
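This slide’s bullets aren’t captured either. As a sketch of the dictionary-encoding idea named in the overview (invented class — the real ORC writer also sorts the dictionary and writes separate length streams): each distinct string is stored once, and rows store only small integer ids into that dictionary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of dictionary encoding for a string column, in the spirit of
// ORC's string serialization. Each distinct value gets an id; the row
// stream then holds only ids, which RLE/bit-packing can shrink further.
public class StringDictionarySketch {
    public final List<String> dictionary = new ArrayList<>();
    public final List<Integer> rowIds = new ArrayList<>();
    private final Map<String, Integer> ids = new HashMap<>();

    public void add(String value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = dictionary.size();
            dictionary.add(value);   // store the string once
            ids.put(value, id);
        }
        rowIds.add(id);              // rows reference it by id
    }

    public String get(int row) {
        return dictionary.get(rowIds.get(row));
    }
}
```

A low-cardinality column such as a US state code needs at most 50 dictionary entries no matter how many rows it has.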

Hive Compound Types

Page 12

[Diagram: type tree — column IDs are assigned by a pre-order walk of the type tree: 0 Struct (root), with children 1 Int, 2 Map (key 3 String, value 4 Struct with children 5 String and 6 Double), and 7 Time.]

Compound Type Serialization

Page 13

Generic Compression

Page 14

Column Projection

Page 15

How Do You Use ORC

Page 16
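The bullets for this slide aren’t in the transcript. In Hive 0.11, a table is stored as ORC with standard DDL; a sketch (table and column names are invented, and TBLPROPERTIES values may vary by version):

```sql
-- Create a new table stored as ORC
CREATE TABLE orc_table (
  userid BIGINT,
  name   STRING,
  amount DOUBLE
) STORED AS ORC TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Switch an existing table's format so new partitions are written as ORC
ALTER TABLE existing_table SET FILEFORMAT ORC;

-- Or make ORC the default file format for new tables
SET hive.default.fileformat=Orc;
```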

Managing Memory

Page 17

TPC-DS File Sizes

Page 18

ORC Predicate Pushdown

Page 19
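The slide content isn’t captured, but the overview describes the mechanism: the reader compares a pushed-down predicate against the min/max statistics kept for each 10,000-row set and skips sets that cannot match. A sketch of that check (invented names — the real reader evaluates Hive SearchArguments against ORC’s per-row-group statistics):

```java
// Sketch of index-based row-group skipping, in the spirit of ORC
// predicate pushdown. Stats mirrors the per-row-group min/max kept
// in ORC's lightweight index.
public class RowGroupSkipSketch {
    public static class Stats {
        public final long min;
        public final long max;
        public Stats(long min, long max) { this.min = min; this.max = max; }
    }

    // Predicate "col >= constant": if even the group's maximum fails,
    // no row in the group can match, so the whole group is skipped.
    public static boolean canSkipGreaterEquals(Stats s, long constant) {
        return s.max < constant;
    }

    // Predicate "col = constant": skip when the constant falls
    // outside the group's [min, max] range.
    public static boolean canSkipEquals(Stats s, long constant) {
        return constant < s.min || constant > s.max;
    }
}
```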

Additional Details

Page 20

Current work for Hive 0.12

Page 21

Future Work

Page 22

Comparison

Page 23

                             RC File   Trevni   Parquet    ORC
Hive Integration             Y         N        N          Y
Active Development           N         N        Y          Y
Hive Type Model              N         N        N          Y
Shred complex columns        N         Y        Y          Y
Splits found quickly         N         Y        Y          Y
Files per bucket             1         many     1 or many  1
Versioned metadata           N         Y        Y          Y
Run length data encoding     N         N        Y          Y
Store strings in dictionary  N         N        Y          Y
Store min, max, sum, count   N         N        N          Y
Store internal indexes       N         N        N          Y
No overhead for non-null     N         N        N          Y (≥ 0.12)
Predicate pushdown           N         N        N          Y (≥ 0.12)

Vectorization

Page 24

Vectorization

Page 25

Why row-at-a-time execution is slow

Page 26

• Hive uses Object Inspectors to work on a row

• Enables level of abstraction

• Costs major performance

• Exacerbated by using lazy serdes

• Inner loop has many method calls, new() allocations, and if-then-else branches

• Lots of CPU instructions

• Pipeline stalls → poor instructions/cycle

• Poor cache locality

How the code works (simplified)

Page 27

class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  void evaluate(VectorizedRowBatch batch) {
    long[] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}

No method calls

Low instruction count

Cache locality to 1024 values

No pipeline stalls

SIMD in Java 8

Vectorization project

Page 28

Preliminary performance results

• NOT a benchmark

• 218 million row fact table of real data, 25 columns

• 18GB raw data

• 6 core, 12 thread workstation, 1 disk, 16GB RAM

• select a, b, count(*) from t where c >= const group by a, b -- 53 row result

Page 29

warm start times   RC non-vectorized           ORC non-vectorized      ORC vectorized
                   (default, not compressed)   (default, compressed)   (default, compressed)
Runtime (sec)      261                         58                      43
Total CPU (sec)    381                         159                     42

Thanks to contributors!

Page 30

• Microsoft Big Data:

• Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia

• Hortonworks:

• Jitendra Pandey, Owen O’Malley, Gopal V

• Others:

• Teddy Choi, Tim Chen

Jitendra/Eric are joint leads