ORC File & Vectorization - Improving Hive Data Storage and Query Performance

Hive’s RCFile has been the standard format for storing Hive data for the last three years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 adds a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that a query reading only one column reads only the required bytes. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that are irrelevant to the query.

Columnar storage formats like ORC reduce I/O and storage use, but it is just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10x to 100x, for structured data processing. We describe how we are adding vectorized query execution to Hive, coupling it with ORC via a vectorized iterator.
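To give a feel for the lightweight, type-aware encodings mentioned above, here is a minimal run-length encoder for a long column. This is a sketch only: the class and method names are invented for illustration and do not reflect ORC's actual writer, which combines run length, delta, and dictionary encoding per the column's type.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of run-length encoding on an integer column.
// Not ORC's actual format; names are invented for this example.
public class RunLengthSketch {

    // Encode a column as (value, runLength) pairs.
    static List<long[]> encode(long[] column) {
        List<long[]> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new long[]{column[i], j - i});
            i = j;
        }
        return runs;
    }

    // Decode back to the original column.
    static long[] decode(List<long[]> runs, int length) {
        long[] out = new long[length];
        int pos = 0;
        for (long[] run : runs) {
            for (long k = 0; k < run[1]; k++) {
                out[pos++] = run[0];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] column = new long[10_000];
        for (int i = 0; i < column.length; i++) {
            column[i] = i / 1000;          // low-cardinality, sorted data
        }
        List<long[]> runs = encode(column);
        // 10,000 values collapse to 10 runs.
        System.out.println("values: " + column.length + ", runs: " + runs.size());
    }
}
```

On low-cardinality or sorted data like the above, 10,000 raw values collapse to 10 (value, length) pairs, which is why type-aware encoders can shrink files dramatically before generic compression is even applied.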


  • 1. ORC File & Vectorization: Improving Hive Data Storage and Query Performance. June 2013. Copyright 2013 by Hortonworks and Microsoft. Owen O'Malley (owen@hortonworks.com, @owen_omalley), Jitendra Pandey (jitendra@hortonworks.com), Eric Hanson (ehans@microsoft.com)

2. ORC: Optimized RC File
3. History
4. Remaining Challenges
5. Requirements
6. File Structure
7. Stripe Structure
8. File Layout
   [diagram: the file is divided into 256MB stripes, each holding index data, row data, and a stripe footer; a file footer and postscript close the file. Within a stripe, index data and row data are organized by column (columns 1 through 8 in the example), and each column's data is stored as one or more streams (streams 2.1 through 2.4 for column 2).]
9. Compression
10. Integer Column Serialization
11. String Column Serialization
12. Hive Compound Types
   [diagram: an example type tree with numbered nodes: 0 Struct containing 1 Int, 2 Map of 3 String to 4 Struct (of 5 String and 6 Double), and 7 Time.]
13. Compound Type Serialization
14. Generic Compression
15. Column Projection
16. How Do You Use ORC
17. Managing Memory
18. TPC-DS File Sizes
19. ORC Predicate Pushdown
20. Additional Details
21. Current work for Hive 0.12
22. Future Work
23. Comparison

   | Feature                     | RC File | Trevni | Parquet   | ORC      |
   |-----------------------------|---------|--------|-----------|----------|
   | Hive integration            | Y       | N      | N         | Y        |
   | Active development          | N       | N      | Y         | Y        |
   | Hive type model             | N       | N      | N         | Y        |
   | Shred complex columns       | N       | Y      | Y         | Y        |
   | Splits found quickly        | N       | Y      | Y         | Y        |
   | Files per bucket            | 1       | many   | 1 or many | 1        |
   | Versioned metadata          | N       | Y      | Y         | Y        |
   | Run length data encoding    | N       | N      | Y         | Y        |
   | Store strings in dictionary | N       | N      | Y         | Y        |
   | Store min, max, sum, count  | N       | N      | N         | Y        |
   | Store internal indexes      | N       | N      | N         | Y        |
   | No overhead for non-null    | N       | N      | N         | Y (0.12) |
   | Predicate pushdown          | N       | N      | N         | Y (0.12) |

24. Vectorization
25. Vectorization
26. Why row-at-a-time execution is slow
   - Hive uses Object Inspectors to work on a row: this enables a level of abstraction but costs major performance, and is exacerbated by using lazy serdes
   - The inner loop has many method calls, new() calls, and if-then-else branches
   - The result is lots of CPU instructions, pipeline stalls, poor instructions/cycle, and poor cache locality
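Slide 19's predicate pushdown builds on the per-10,000-row min/max statistics described in the abstract: a row group whose value range cannot satisfy the filter is skipped without reading its rows. The sketch below shows that skipping logic in miniature; the class and method names are invented for illustration and are not ORC's reader API.

```java
// Illustrative sketch of min/max-based row-group skipping for a
// "c >= threshold" predicate. Names are invented for this example.
public class RowGroupSkipSketch {

    static class GroupStats {
        final long min, max;
        GroupStats(long min, long max) { this.min = min; this.max = max; }
    }

    // A group can only contain matching rows if its maximum reaches the threshold.
    static boolean mightMatchGte(GroupStats stats, long threshold) {
        return stats.max >= threshold;
    }

    public static void main(String[] args) {
        GroupStats[] groups = {
            new GroupStats(0, 999),      // cannot match: skipped unread
            new GroupStats(1000, 4999),  // cannot match: skipped unread
            new GroupStats(5000, 9000),  // might match: row data is read
        };
        long threshold = 6000;
        int read = 0;
        for (GroupStats g : groups) {
            if (mightMatchGte(g, threshold)) {
                read++;                  // only now would row data be decoded
            }
        }
        System.out.println("groups read: " + read + " of " + groups.length);
    }
}
```

Note the test is one-sided: statistics can prove a group has no matches, but a group that passes may still contain no matching rows, so the predicate is re-evaluated on the rows that are actually read.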
27. How the code works (simplified)

   class LongColumnAddLongScalarExpression {
     int inputColumn;
     int outputColumn;
     long scalar;

     void evaluate(VectorizedRowBatch batch) {
       long[] inVector =
           ((LongColumnVector) batch.columns[inputColumn]).vector;
       long[] outVector =
           ((LongColumnVector) batch.columns[outputColumn]).vector;
       if (batch.selectedInUse) {
         for (int j = 0; j < batch.size; j++) {
           int i = batch.selected[j];
           outVector[i] = inVector[i] + scalar;
         }
       } else {
         for (int i = 0; i < batch.size; i++) {
           outVector[i] = inVector[i] + scalar;
         }
       }
     }
   }

   No method calls in the inner loop, low instruction count, cache locality across 1024 values, no pipeline stalls, SIMD in Java 8.

28. Vectorization project
29. Preliminary performance results (NOT a benchmark)
   - 218 million row fact table of real data, 25 columns, 18GB raw data
   - 6 core, 12 thread workstation, 1 disk, 16GB RAM
   - Query: select a, b, count(*) from t where c >= const group by a, b (53 row result)
   - Warm start times:

   |                 | RC non-vectorized (default, not compressed) | ORC non-vectorized (default, compressed) | ORC vectorized (default, compressed) |
   |-----------------|-----|-----|----|
   | Runtime (sec)   | 261 | 58  | 43 |
   | Total CPU (sec) | 381 | 159 | 42 |

30. Thanks to contributors!
   - Microsoft Big Data: Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia
   - Hortonworks: Jitendra Pandey, Owen O'Malley, Gopal V
   - Others: Teddy Choi, Tim Chen
   - Jitendra and Eric are joint leads
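To run the slide's simplified expression outside Hive, the toy harness below supplies minimal stand-ins for VectorizedRowBatch and LongColumnVector (the real classes live in Hive's vectorized execution package and carry more state, such as null handling). This is an illustrative sketch, not Hive's implementation; only the inner loop mirrors the slide.

```java
// Toy harness for the vectorized add-scalar expression: minimal stand-ins
// for Hive's VectorizedRowBatch and LongColumnVector so the tight loop runs.
public class VectorizedAddDemo {

    static class LongColumnVector {
        final long[] vector;
        LongColumnVector(int size) { this.vector = new long[size]; }
    }

    static class VectorizedRowBatch {
        final LongColumnVector[] columns;
        final int[] selected;   // indices of live rows when selectedInUse
        int size;
        boolean selectedInUse;
        VectorizedRowBatch(int numCols, int maxSize) {
            columns = new LongColumnVector[numCols];
            for (int i = 0; i < numCols; i++) {
                columns[i] = new LongColumnVector(maxSize);
            }
            selected = new int[maxSize];
        }
    }

    // Same inner loop as the slide: add a scalar to every value in a column,
    // writing the result into another column of the batch.
    static void addScalar(VectorizedRowBatch batch, int in, int out, long scalar) {
        long[] inVector = batch.columns[in].vector;
        long[] outVector = batch.columns[out].vector;
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                outVector[i] = inVector[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                outVector[i] = inVector[i] + scalar;
            }
        }
    }

    public static void main(String[] args) {
        VectorizedRowBatch batch = new VectorizedRowBatch(2, 1024);
        batch.size = 1024;
        for (int i = 0; i < batch.size; i++) {
            batch.columns[0].vector[i] = i;
        }
        addScalar(batch, 0, 1, 10);
        System.out.println(batch.columns[1].vector[0] + " "
                + batch.columns[1].vector[1023]);  // prints "10 1033"
    }
}
```

The two loop variants matter: when an earlier filter has marked only some rows as live (selectedInUse), the expression touches just those indices; otherwise it streams straight through the array, which is the branch-free, cache-friendly path the slide's bullet points describe.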