
ORC File – Optimizing Your Big Data
Owen O’Malley, Co-founder Hortonworks
Apache Hadoop, Hive, ORC, and Incubator
@owen_omalley


Overview


In the Beginning…

Hadoop applications used text or SequenceFile
– Text is slow and not splittable when compressed

– SequenceFile only supports key and value and user-defined serialization

Hive added RCFile
– User controls the columns to read and decompress

– No type information and user-defined serialization

– Finding splits was expensive

Avro files created
– Type information included!

– Had to read and decompress entire row


ORC File Basics

Columnar format
– Enables user to read & decompress just the bytes they need

Fast
– See https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet

Indexed

Self-describing
– Includes all of the information about types and encoding

Rich type system
– All of Hive’s types including timestamp, struct, map, list, and union
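To make the rich type system concrete, here is a minimal Java sketch (not from the talk; the field names are illustrative) that builds such a schema with ORC's TypeDescription API:

import org.apache.orc.TypeDescription;

public class SchemaSketch {
  public static void main(String[] args) {
    // Hive-style type string covering timestamp, struct, map, and list
    TypeDescription schema = TypeDescription.fromString(
        "struct<ts:timestamp,"
        + "user:struct<name:string,age:int>,"
        + "tags:map<string,string>,"
        + "scores:array<double>>");
    System.out.println(schema);  // this type information is stored in the file footer
  }
}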


File Compatibility

Backwards compatibility
– Automatically detect the version of the file and read it.

Forward compatibility
– Most changes are made so old readers will read the new files

– Maintain the ability to write old files via orc.write.format

– Always write old version until your last cluster upgrades

Current file versions
– 0.11 – Original version

– 0.12 – Updated run length encoding (RLE)
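A hedged sketch (not from the slides) of pinning the Java writer to the old 0.11 format while older readers remain in the fleet; the enum value is assumed from the ORC 1.4 API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class CompatSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep writing the 0.11 format until the last reader cluster upgrades
    Writer writer = OrcFile.createWriter(new Path("/tmp/compat.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(TypeDescription.fromString("struct<name:string>"))
            .version(OrcFile.Version.V_0_11));  // same effect as orc.write.format=0.11
    writer.close();
  }
}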


File Structure

File contains a list of stripes, which are sets of rows
– Default size is 64 MB

– Large stripe size enables efficient reads

Footer
– Contains the list of stripe locations

– Type description

– File and stripe statistics

Postscript
– Compression parameters

– File format version


Stripe Structure

Indexes
– Offsets to jump to start of row group

– Row group size defaults to 10,000 rows

– Minimum, Maximum, and Count of each column

Data
– Data for the stripe organized by column

Footer
– List of stream locations

– Column encoding information


File Layout

[Diagram: each ~64 MB stripe holds Index Data, Row Data laid out column by column (columns 1–8 in the example), and a Stripe Footer; the file ends with the File Metadata, File Footer, and Postscript.]


Schema Evolution

ORC now supports schema evolution
– Hive 2.1 – append columns or type conversion

– Upcoming Hive 2.3 – map columns or inner structures by name

– User passes desired schema to ORC reader

Type conversions
– Most types will convert although some are ugly.

– If the value doesn’t fit in the new type, it will become null.

Cautions
– Name mapping requires ORC files written by Hive ≥ 2.0

– Some of the type conversions are slow
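A minimal sketch of passing the desired schema to the Java reader (file path and column names are made up); missing columns come back as null and convertible types are converted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class EvolveSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("/tmp/people.orc"),
        OrcFile.readerOptions(conf));
    // Desired schema: an appended column (email) and a widened type (age)
    TypeDescription wanted = TypeDescription.fromString(
        "struct<name:string,age:bigint,email:string>");
    RecordReader rows = reader.rows(reader.options().schema(wanted));
    VectorizedRowBatch batch = wanted.createRowBatch();
    while (rows.nextBatch(batch)) {
      // process batch.size rows; values that don't fit the new type become null
    }
    rows.close();
  }
}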


Using ORC


From Hive or Presto

Modify your table definition:
– create table my_table (
      name string,
      address string
  ) stored as orc;

Import data:
– insert overwrite table my_table select * from my_staging;

Use either configuration or table properties
– tblproperties ("orc.compress"="NONE")

– set hive.exec.orc.default.compress=NONE;


From Java

Use the ORC project rather than Hive’s ORC.
– Hive’s master branch uses it.

– Maven group id: org.apache.orc version: 1.4.0

– nohive classifier avoids interfering with Hive’s packages

Two levels of access
– orc-core – Faster access, but uses Hive’s vectorized API

– orc-mapreduce – Row by row access, simpler OrcStruct API

MapReduce API implements WritableComparable
– Can be shuffled

– Need to specify type information in configuration for shuffle or output
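As a sketch of the orc-core path (file name and columns are made up), the writer uses the vectorized API mentioned above:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class WriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    TypeDescription schema =
        TypeDescription.fromString("struct<name:string,age:int>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/people.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector name = (BytesColumnVector) batch.cols[0];
    LongColumnVector age = (LongColumnVector) batch.cols[1];
    for (int i = 0; i < 10_000; i++) {
      int row = batch.size++;
      byte[] bytes = ("user-" + i).getBytes(StandardCharsets.UTF_8);
      name.setRef(row, bytes, 0, bytes.length);
      age.vector[row] = 20 + (i % 50);
      if (batch.size == batch.getMaxSize()) {  // flush a full batch
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size != 0) {
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}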


From C++

Pure C++ client library
– No JNI or JDK so client can estimate and control memory

Combine with pure C++ HDFS client from HDFS-8707
– Work ongoing in feature branch, but should be committed soon.

Reader is stable and in production use.

Alibaba has created a writer and is contributing it to Apache ORC.
– Should be in the next release, ORC 1.5.0.


Command Line

Using hive --orcfiledump from Hive
– -j -p – pretty prints the metadata as JSON

– -d – prints data as JSON

Using java -jar orc-tools-1.4.0-uber.jar from ORC
– meta – print the metadata as JSON

– data – print data as JSON

– convert – convert JSON to ORC

– json-schema – scan a set of JSON documents to find the matching schema


Optimization


Stripe Size

Makes a huge difference in performance
– orc.stripe.size or hive.exec.orc.default.stripe.size

– Controls the amount of buffering in the writer. Default is 64 MB

– Trade off

• Large stripes = larger, more efficient reads

• Small stripes = Less memory and more granular processing splits

Multiple files written at the same time will shrink stripes
– Use Hive’s hive.optimize.sort.dynamic.partition

– Sorting dynamic partitions means only one writer at a time
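A sketch of the same knob when writing directly with the Java API (values illustrative; the properties above are the Hive equivalents):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class StripeSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 64 MB stripes trade writer memory for larger, more efficient reads
    Writer writer = OrcFile.createWriter(new Path("/tmp/tuned.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(TypeDescription.fromString("struct<name:string,address:string>"))
            .stripeSize(64L * 1024 * 1024)   // orc.stripe.size
            .compress(CompressionKind.ZLIB));
    writer.close();
  }
}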


HDFS Block Padding

The stripes don’t align exactly with HDFS blocks

HDFS scatters blocks around cluster

Often want to pad to block boundaries
– Costs space, but improves performance

– hive.exec.orc.default.block.padding – true

– hive.exec.orc.block.padding.tolerance – 0.05

[Diagram: ~64 MB stripes padded so they do not straddle HDFS block boundaries; each stripe holds Index Data, Row Data, and a Stripe Footer, and the file ends with the File Metadata, File Footer, and Postscript.]
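A minimal Java-side sketch of the padding knobs (the hive.exec.orc.* properties above are the Hive session equivalents; paddingTolerance is assumed to mirror the 0.05 default):

import org.apache.hadoop.conf.Configuration;
import org.apache.orc.OrcFile;

public class PaddingSketch {
  // Returns writer options that pad stripes to HDFS block boundaries
  public static OrcFile.WriterOptions padded(Configuration conf) {
    return OrcFile.writerOptions(conf)
        .blockPadding(true)       // hive.exec.orc.default.block.padding
        .paddingTolerance(0.05);  // hive.exec.orc.block.padding.tolerance
  }
}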


Predicate Push Down

Reader is given a SearchArg
– Limited set of predicates over columns and literal values

– Reader will skip over any parts of file that can’t contain valid rows

ORC indexes at three levels:
– File

– Stripe

– Row Group (10k rows)

Reader still needs to apply predicate to filter out single rows
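A hedged sketch of building and applying a SearchArg from Java (file path and column are illustrative, matching the row pruning example later in the deck):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class SargSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("/tmp/lineitem.orc"),
        OrcFile.readerOptions(conf));
    // Only stripes and row groups whose statistics (or bloom filter) can
    // contain l_orderkey = 1212000001 are read and decompressed.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
          .equals("l_orderkey", PredicateLeaf.Type.LONG, 1212000001L)
        .end()
        .build();
    RecordReader rows = reader.rows(reader.options()
        .searchArgument(sarg, new String[]{"l_orderkey"}));
    // ... iterate batches; the predicate must still be re-applied per row
    rows.close();
  }
}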


Row Pruning

Every primitive column has minimum and maximum at each level
– Sorting your data within a file helps a lot

– Consider sorting instead of making lots of partitions

Writer can optionally include bloom filters
– Provides a probabilistic bitmap of hashcodes

– Only works with equality predicates at the row group level

– Requires significant space in the file

– Manually enabled by using orc.bloom.filter.columns

– Use orc.bloom.filter.fpp to set the false positive rate (default 0.05)

– Set the default charset in JVM via -Dfile.encoding=UTF-8
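For example (a sketch; the column name is illustrative), bloom filters are enabled per column on the writer, matching the orc.bloom.filter.* properties above:

import org.apache.hadoop.conf.Configuration;
import org.apache.orc.OrcFile;

public class BloomSketch {
  // Returns writer options that add a bloom filter for one column
  public static OrcFile.WriterOptions withBlooms(Configuration conf) {
    return OrcFile.writerOptions(conf)
        .bloomFilterColumns("l_orderkey")  // orc.bloom.filter.columns
        .bloomFilterFpp(0.05);             // orc.bloom.filter.fpp (default 0.05)
  }
}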


Row Pruning Example

TPC-H
– from tpch1000.lineitem where l_orderkey = 1212000001;

Rows Read
– Nothing – 5,999,989,709

– Min/Max – 540,000

– BloomFilter – 10,000

Time Taken
– Nothing – 74 sec

– Min/Max – 4.5 sec

– BloomFilter – 1.3 sec


Split Calculation

Hive’s OrcInputFormat has three strategies for split calculation
– BI

• Small fast queries

• Splits based on HDFS blocks

– ETL

• Large queries

• Read file footer and apply SearchArg to stripes

• Can include footer in splits (hive.orc.splits.include.file.footer)

– Hybrid

• If small files or lots of files, use BI


LLAP – Live Long & Process

Provides a persistent service to speed up Hive
– Caches ORC and text data

– Saves the cost of YARN container & JVM spin-up

– JIT finishes after first few seconds

Cache uses ORC’s RLE
– Decompresses zlib or Snappy

– RLE is fast and saves memory

– Automatically caches hot columns and partitions

Allows Spark to use Hive’s column and row security


Current Work In Progress


Speed Improvements for ACID

Hive supports ACID transactions on ORC tables
– Uses delta files in HDFS to store changes to each partition

– Delta files store insert/update/delete operations

– Used to support SQL insert commands

Unfortunately, update operations don’t allow predicate push down on the deltas

In the upcoming Hive 2.3, we added a new ACID layout
– It changes updates into an insert and a delete

– Allows predicate pushdown even on the delta files

Also added SQL merge command in Hive 2.2


Column Encryption (ORC-14)

Allows users to encrypt some of the columns of the file
– Provides column level security even with access to raw files

– Uses Key Management Server from Ranger or Hadoop

– Includes both the data and the index

– Daily key rolling can anonymize data after 90 days

User specifies how data is masked if the user doesn’t have access
– Nullify

– Redact

– SHA256


Thank You
@owen_omalley
[email protected]