
ORC File – Optimizing Your Big Data
Owen O’Malley, Co-founder Hortonworks
Apache Hadoop, Hive, ORC, and Incubator
@owen_omalley


Overview


In the Beginning…

Hadoop applications used text or SequenceFile
– Text is slow and not splittable when compressed

– SequenceFile only supports key and value and user-defined serialization

Hive added RCFile
– User controls the columns to read and decompress

– No type information and user-defined serialization

– Finding splits was expensive

Avro files created
– Type information included!

– Had to read and decompress entire row


ORC File Basics

Columnar format
– Enables user to read & decompress just the bytes they need

Fast
– See https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet

Indexed

Self-describing
– Includes all of the information about types and encoding

Rich type system
– All of Hive’s types including timestamp, struct, map, list, and union
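To make the rich type system concrete, here is a minimal Java sketch (not from the talk; the field names are illustrative) that builds such a schema with ORC's TypeDescription API:

import org.apache.orc.TypeDescription;

public class SchemaSketch {
  public static void main(String[] args) {
    // Hive-style type string covering timestamp, struct, map, and list
    TypeDescription schema = TypeDescription.fromString(
        "struct<ts:timestamp,"
        + "user:struct<name:string,age:int>,"
        + "tags:map<string,string>,"
        + "scores:array<double>>");
    System.out.println(schema);  // this type information is stored in the file footer
  }
}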


File Compatibility

Backwards compatibility
– Automatically detect the version of the file and read it.

Forward compatibility
– Most changes are made so old readers will read the new files

– Maintain the ability to write old files via orc.write.format

– Always write old version until your last cluster upgrades

Current file versions
– 0.11 – Original version

– 0.12 – Updated run length encoding (RLE)
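A hedged sketch (not from the slides) of pinning the Java writer to the old 0.11 format while older readers remain in the fleet; the enum value is assumed from the ORC 1.4 API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class CompatSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep writing the 0.11 format until the last reader cluster upgrades
    Writer writer = OrcFile.createWriter(new Path("/tmp/compat.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(TypeDescription.fromString("struct<name:string>"))
            .version(OrcFile.Version.V_0_11));  // same effect as orc.write.format=0.11
    writer.close();
  }
}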


File Structure

File contains a list of stripes, which are sets of rows
– Default size is 64 MB

– Large stripe size enables efficient reads

Footer
– Contains the list of stripe locations

– Type description

– File and stripe statistics

Postscript
– Compression parameters

– File format version


Stripe Structure

Indexes
– Offsets to jump to start of row group

– Row group size defaults to 10,000 rows

– Minimum, Maximum, and Count of each column

Data
– Data for the stripe organized by column

Footer
– List of stream locations

– Column encoding information


File Layout

[Diagram: each ~64 MB stripe holds Index Data, Row Data laid out column by column (columns 1–8 in the example), and a Stripe Footer; the file ends with the File Metadata, File Footer, and Postscript.]


Schema Evolution

ORC now supports schema evolution
– Hive 2.1 – append columns or type conversion

– Upcoming Hive 2.3 – map columns or inner structures by name

– User passes desired schema to ORC reader

Type conversions
– Most types will convert although some are ugly.

– If the value doesn’t fit in the new type, it will become null.

Cautions
– Name mapping requires ORC files written by Hive ≥ 2.0

– Some of the type conversions are slow
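A minimal sketch of passing the desired schema to the Java reader (file path and column names are made up); missing columns come back as null and convertible types are converted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class EvolveSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("/tmp/people.orc"),
        OrcFile.readerOptions(conf));
    // Desired schema: an appended column (email) and a widened type (age)
    TypeDescription wanted = TypeDescription.fromString(
        "struct<name:string,age:bigint,email:string>");
    RecordReader rows = reader.rows(reader.options().schema(wanted));
    VectorizedRowBatch batch = wanted.createRowBatch();
    while (rows.nextBatch(batch)) {
      // process batch.size rows; values that don't fit the new type become null
    }
    rows.close();
  }
}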


Using ORC


From Hive or Presto

Modify your table definition:
– create table my_table (
      name string,
      address string
  ) stored as orc;

Import data:
– insert overwrite table my_table select * from my_staging;

Use either configuration or table properties
– tblproperties ("orc.compress"="NONE")

– set hive.exec.orc.default.compress=NONE;


From Java

Use the ORC project rather than Hive’s ORC.
– Hive’s master branch uses it.

– Maven group id: org.apache.orc version: 1.4.0

– nohive classifier avoids interfering with Hive’s packages

Two levels of access
– orc-core – Faster access, but uses Hive’s vectorized API

– orc-mapreduce – Row by row access, simpler OrcStruct API

MapReduce API implements WritableComparable
– Can be shuffled

– Need to specify type information in configuration for shuffle or output
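As a sketch of the orc-core path (file name and columns are made up), the writer uses the vectorized API mentioned above:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class WriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    TypeDescription schema =
        TypeDescription.fromString("struct<name:string,age:int>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/people.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector name = (BytesColumnVector) batch.cols[0];
    LongColumnVector age = (LongColumnVector) batch.cols[1];
    for (int i = 0; i < 10_000; i++) {
      int row = batch.size++;
      byte[] bytes = ("user-" + i).getBytes(StandardCharsets.UTF_8);
      name.setRef(row, bytes, 0, bytes.length);
      age.vector[row] = 20 + (i % 50);
      if (batch.size == batch.getMaxSize()) {  // flush a full batch
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size != 0) {
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}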


From C++

Pure C++ client library
– No JNI or JDK so client can estimate and control memory

Combine with pure C++ HDFS client from HDFS-8707
– Work ongoing in feature branch, but should be committed soon.

Reader is stable and in production use.

Alibaba has created a writer and is contributing it to Apache ORC.
– Should be in the next release, ORC 1.5.0.


Command Line

Using hive --orcfiledump from Hive
– -j -p – pretty prints the metadata as JSON

– -d – prints data as JSON

Using java -jar orc-tools-1.4.0-uber.jar from ORC
– meta – print the metadata as JSON

– data – print data as JSON

– convert – convert JSON to ORC

– json-schema – scan a set of JSON documents to find the matching schema


Optimization


Stripe Size

Makes a huge difference in performance
– orc.stripe.size or hive.exec.orc.default.stripe.size

– Controls the amount of buffering in the writer. Default is 64 MB

– Trade off

• Large stripes = larger, more efficient reads

• Small stripes = Less memory and more granular processing splits

Multiple files written at the same time will shrink stripes
– Use Hive’s hive.optimize.sort.dynamic.partition

– Sorting dynamic partitions means only one writer at a time
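A sketch of the same knob when writing directly with the Java API (values illustrative; the properties above are the Hive equivalents):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class StripeSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 64 MB stripes trade writer memory for larger, more efficient reads
    Writer writer = OrcFile.createWriter(new Path("/tmp/tuned.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(TypeDescription.fromString("struct<name:string,address:string>"))
            .stripeSize(64L * 1024 * 1024)   // orc.stripe.size
            .compress(CompressionKind.ZLIB));
    writer.close();
  }
}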


HDFS Block Padding

The stripes don’t align exactly with HDFS blocks

HDFS scatters blocks around cluster

Often want to pad to block boundaries
– Costs space, but improves performance

– hive.exec.orc.default.block.padding – true

– hive.exec.orc.block.padding.tolerance – 0.05

[Diagram: ~64 MB stripes padded so they do not straddle HDFS block boundaries; each stripe holds Index Data, Row Data, and a Stripe Footer, and the file ends with the File Metadata, File Footer, and Postscript.]
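A minimal Java-side sketch of the padding knobs (the hive.exec.orc.* properties above are the Hive session equivalents; paddingTolerance is assumed to mirror the 0.05 default):

import org.apache.hadoop.conf.Configuration;
import org.apache.orc.OrcFile;

public class PaddingSketch {
  // Returns writer options that pad stripes to HDFS block boundaries
  public static OrcFile.WriterOptions padded(Configuration conf) {
    return OrcFile.writerOptions(conf)
        .blockPadding(true)       // hive.exec.orc.default.block.padding
        .paddingTolerance(0.05);  // hive.exec.orc.block.padding.tolerance
  }
}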


Predicate Push Down

Reader is given a SearchArg
– Limited set of predicates over columns and literal values

– Reader will skip over any parts of file that can’t contain valid rows

ORC indexes at three levels:
– File

– Stripe

– Row Group (10k rows)

Reader still needs to apply predicate to filter out single rows
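A hedged sketch of building and applying a SearchArg from Java (file path and column are illustrative, matching the row pruning example later in the deck):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class SargSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("/tmp/lineitem.orc"),
        OrcFile.readerOptions(conf));
    // Only stripes and row groups whose statistics (or bloom filter) can
    // contain l_orderkey = 1212000001 are read and decompressed.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
          .equals("l_orderkey", PredicateLeaf.Type.LONG, 1212000001L)
        .end()
        .build();
    RecordReader rows = reader.rows(reader.options()
        .searchArgument(sarg, new String[]{"l_orderkey"}));
    // ... iterate batches; the predicate must still be re-applied per row
    rows.close();
  }
}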


Row Pruning

Every primitive column has minimum and maximum at each level
– Sorting your data within a file helps a lot

– Consider sorting instead of making lots of partitions

Writer can optionally include bloom filters
– Provides a probabilistic bitmap of hashcodes

– Only works with equality predicates at the row group level

– Requires significant space in the file

– Manually enabled by using orc.bloom.filter.columns

– Use orc.bloom.filter.fpp to set the false positive rate (default 0.05)

– Set the default charset in JVM via -Dfile.encoding=UTF-8
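For example (a sketch; the column name is illustrative), bloom filters are enabled per column on the writer, matching the orc.bloom.filter.* properties above:

import org.apache.hadoop.conf.Configuration;
import org.apache.orc.OrcFile;

public class BloomSketch {
  // Returns writer options that add a bloom filter for one column
  public static OrcFile.WriterOptions withBlooms(Configuration conf) {
    return OrcFile.writerOptions(conf)
        .bloomFilterColumns("l_orderkey")  // orc.bloom.filter.columns
        .bloomFilterFpp(0.05);             // orc.bloom.filter.fpp (default 0.05)
  }
}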


Row Pruning Example

TPC-H
– from tpch1000.lineitem where l_orderkey = 1212000001;

Rows Read
– Nothing – 5,999,989,709

– Min/Max – 540,000

– BloomFilter – 10,000

Time Taken
– Nothing – 74 sec

– Min/Max – 4.5 sec

– BloomFilter – 1.3 sec


Split Calculation

Hive’s OrcInputFormat has three strategies for split calculation
– BI

• Small fast queries

• Splits based on HDFS blocks

– ETL

• Large queries

• Read file footer and apply SearchArg to stripes

• Can include footer in splits (hive.orc.splits.include.file.footer)

– Hybrid

• If small files or lots of files, use BI


LLAP – Live Long & Process

Provides a persistent service to speed up Hive
– Caches ORC and text data

– Saves the cost of YARN container & JVM spin-up

– JIT finishes after first few seconds

Cache uses ORC’s RLE
– Decompresses zlib or Snappy

– RLE is fast and saves memory

– Automatically caches hot columns and partitions

Allows Spark to use Hive’s column and row security


Current Work In Progress


Speed Improvements for ACID

Hive supports ACID transactions on ORC tables
– Uses delta files in HDFS to store changes to each partition

– Delta files store insert/update/delete operations

– Used to support SQL insert commands

Unfortunately, update operations don’t allow predicate push down on the deltas

In the upcoming Hive 2.3, we added a new ACID layout
– It changes updates into an insert and a delete

– Allows predicate pushdown even on the delta files

Also added SQL merge command in Hive 2.2


Column Encryption (ORC-14)

Allows users to encrypt some of the columns of the file
– Provides column level security even with access to raw files

– Uses Key Management Server from Ranger or Hadoop

– Includes both the data and the index

– Daily key rolling can anonymize data after 90 days

User specifies how data is masked if the user doesn’t have access
– Nullify

– Redact

– SHA256


Thank You
@owen_omalley
[email protected]