Sharpest Tool in the - Vertica · 2018-09-12 · Spark Hive Pig HDFS MapReduce HCatalog HBase. #SeizeTheData 4 ... Relative performance of Impala and Vertica/Parquet Numbers greater

Page 2

Sharpest Tool in the Hadoop Toolbox
Vertica SQL on Hadoop
Bob Hansen, Deepak Majeti, James Clampffer

Page 3

#SeizeTheData

We are making Vertica the fastest structured data processor for Hadoop.

[Diagram: Hadoop ecosystem components: Kafka, Spark, Hive, Pig, HDFS, MapReduce, HCatalog, HBase]

Page 4

Accessing your existing data is easy to do

CREATE HCATALOG SCHEMA hive WITH
    HOSTNAME='hcat.mycorp.com'
    HCATALOG_SCHEMA='tweets';

SELECT
    keyword,
    EXTRACT(month FROM created_at),
    AVG(score)
FROM hive.tweets.tweet_keywords
WHERE created_at >= now() - '3 months'::interval
GROUP BY 1, 2
ORDER BY 2;

Page 5

Hadoop has two popular formats for columnar data: Parquet and ORC

Column-oriented file formats used by popular Hadoop data-processing tools such as Hive, Spark, Drill, and Impala.

Page 6

Columnar formats efficiently pack data into Hadoop blocks

File is broken into blocks (row groups in Parquet, stripes in ORC)

Typical size up to 256 MB (size of an HDFS block)

Structured: metadata contains information about the file, including DDL, statistics, etc.
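The layout described on this slide can be illustrated with a minimal sketch. It is not how Parquet or ORC are implemented: the function name `write_row_groups`, the in-memory dict layout, and grouping by row count (real files size their row groups or stripes in bytes) are all assumptions made for illustration.

```python
from typing import Dict, List


def write_row_groups(rows: List[dict], rows_per_group: int) -> dict:
    """Split rows into row groups, store each column contiguously,
    and record per-group statistics in a footer (hypothetical layout)."""
    columns = list(rows[0].keys()) if rows else []
    groups: List[Dict[str, list]] = []
    footer = {"schema": columns, "row_groups": []}
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        # Columnar layout: all values of one column are kept together.
        data = {c: [r[c] for r in chunk] for c in columns}
        groups.append(data)
        # Footer metadata: row count and (min, max) per column.
        footer["row_groups"].append({
            "num_rows": len(chunk),
            "stats": {c: (min(data[c]), max(data[c])) for c in columns},
        })
    return {"row_groups": groups, "footer": footer}


if __name__ == "__main__":
    sample = [{"id": 1, "score": 0.5}, {"id": 2, "score": 0.1}, {"id": 3, "score": 0.9}]
    print(write_row_groups(sample, rows_per_group=2)["footer"])
```

A reader can consult the footer first and decide which row groups and columns to fetch, which is what makes the statistics useful.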

Page 7

VSQLoH (Vertica SQL on Hadoop) allows you to use your Hadoop data fast

Page 8

Big Data SQL Performance Tournament

[Tournament bracket. Labels: Cloudera, Hortonworks; Parquet: libhdfs++, Parquet: webhdfs, ORC: libhdfs++, ORC: webhdfs]

Page 10

Big Data SQL Performance Tournament: Impala vs Spark

Impala is 4x-60x faster
Impala succeeded in 60 queries that Spark failed
Both Impala and Spark failed 18 queries

Measured under TPC Benchmark™ DS standards

[Chart: relative performance of Spark and Impala (y-axis, 0x to 70x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for Impala; numbers less than 1 are better for Spark.]
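The tournament charts report a per-query ratio plus counts of queries that only one system completed. A minimal sketch of that bookkeeping follows, assuming the ratio is a run-time ratio (the slides do not state the exact formula). The function name `compare` and the timings in the example are invented for illustration and are not taken from the benchmark.

```python
from typing import Dict, List, Optional, Tuple

# Query name -> run time in seconds, or None if the query failed.
Times = Dict[str, Optional[float]]


def compare(a: Times, b: Times) -> Tuple[List[Tuple[str, float]], List[str], List[str], List[str]]:
    """Compare system A against system B over the queries both dicts contain.

    Returns (ratios, only_a_ok, only_b_ok, both_failed). Each ratio is
    time_b / time_a, so values greater than 1 favor A.
    """
    ratios, only_a, only_b, both_failed = [], [], [], []
    for q in sorted(a.keys() & b.keys()):
        ta, tb = a[q], b[q]
        if ta is None and tb is None:
            both_failed.append(q)
        elif tb is None:
            only_a.append(q)
        elif ta is None:
            only_b.append(q)
        else:
            ratios.append((q, tb / ta))
    # Order queries by relative run-time, as on the chart's x-axis.
    ratios.sort(key=lambda item: item[1])
    return ratios, only_a, only_b, both_failed


if __name__ == "__main__":
    # Illustrative timings only.
    system_a = {"q1": 2.0, "q2": 10.0, "q3": None, "q4": 5.0, "q5": None}
    system_b = {"q1": 8.0, "q2": 5.0, "q3": None, "q4": None, "q5": 1.0}
    print(compare(system_a, system_b))
```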

Page 12

Big Data SQL Performance Tournament: Tez vs HAWQ

HAWQ is 2x – 60x faster
Tez is up to 3x faster
HAWQ succeeded in 22 queries that Tez failed
Tez succeeded in 16 queries that HAWQ failed
Both Tez and HAWQ failed 18 queries

Measured under TPC Benchmark™ DS standards

[Chart: relative performance of Tez and HAWQ (y-axis, 0x to 70x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for HAWQ; numbers less than 1 are better for Tez.]

Page 14

Big Data SQL Performance Tournament: Parquet: libhdfs++ vs Parquet: webhdfs

Comparable

Measured under TPC Benchmark™ DS standards

[Chart: relative performance of Vertica/ORC with libhdfs++ and webhdfs (y-axis, 0x to 6x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for libhdfs++; numbers less than 1 are better for webhdfs.]

Page 15

Big Data SQL Performance Tournament

[Tournament bracket, repeated. Advancing: Parquet: libhdfs++]

Page 16

Big Data SQL Performance Tournament: ORC: libhdfs++ vs ORC: webhdfs

Comparable

Measured under TPC Benchmark™ DS standards

[Chart: relative performance of Vertica/ORC with libhdfs++ and webhdfs (y-axis, 0x to 7x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for libhdfs++; numbers less than 1 are better for webhdfs.]

Page 17

Big Data SQL Performance Tournament

[Tournament bracket, repeated. Advancing: Parquet: libhdfs++, ORC: libhdfs++]

Page 18

Big Data SQL Performance Tournament: Impala vs Vertica/Parquet

Vertica is 2x – 30x faster
Similar
Vertica succeeded with 19 queries Impala failed

Measured under TPC Benchmark™ DS standards

[Chart: relative performance of Impala and Vertica/Parquet (y-axis, 0x to 35x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for Vertica; numbers less than 1 are better for Impala.]

Page 19

Big Data SQL Performance Tournament

[Tournament bracket, repeated. Advancing: Parquet: libhdfs++, ORC: libhdfs++; Parquet]

Page 20

Big Data SQL Performance Tournament: HAWQ vs Vertica/ORC

Vertica is 2x – 73x faster
HAWQ up to 4x faster
Vertica succeeded in 34 queries HAWQ failed

Measured under TPC Benchmark™ DS standards

[Chart: relative performance of HAWQ and Vertica/ORC (y-axis, 0x to 80x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for Vertica; numbers less than 1 are better for HAWQ.]

Page 21

Big Data SQL Performance Tournament

[Tournament bracket, repeated. Advancing: Parquet: libhdfs++, ORC: libhdfs++; Parquet, ORC]

Page 22

Big Data SQL Performance Tournament: Vertica/Parquet vs Vertica/ORC

Libhdfs++ is 1.2x – 2.5x faster
Comparable

Measured under TPC Benchmark™ DS standards

[Chart: relative performance of Vertica/Parquet and Vertica/ORC (y-axis, 0.0x to 3.0x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for Parquet; numbers less than 1 are better for ORC.]

Page 23

Big Data SQL Performance Tournament

[Tournament bracket, repeated. Advancing: Parquet: libhdfs++, ORC: libhdfs++; Parquet, ORC; ROS]

Page 24

Big Data SQL Performance Tournament: ROS vs VSQLOH

ROS is 2x – 11x faster

Measured under TPC Benchmark™ DS standards

[Chart: relative performance of Vertica/ROS and Vertica/Parquet (y-axis, 0x to 12x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for ROS; numbers less than 1 are better for Parquet.]

Page 25

VSQLoH is fast because of our open source investments

Page 26

Vertica developed libParquet and libOrc for speed and stability

Most systems use Java SerDes and Java vectorized readers. These do not couple well with C++-based systems because of the lack of control over resources and the lack of tighter integration.

libOrc (https://orc.apache.org)
• Development started early 2015
• HPE + Hortonworks collaboration

libParquet (https://parquet.apache.org)
• Development started early 2016
• HPE + Cloudera collaboration

Page 27

Optimizations

Read only the data you need:
• Column selection
• Partition pruning
• Predicate pushdown
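The three optimizations listed on this slide can be sketched together on a toy table. This is an illustrative model, not Vertica's reader: the `TABLE` structure, the `scan` function, and the sample rows are hypothetical, and the per-row-group statistics stand in for the metadata that ORC and Parquet files carry.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a partitioned columnar table. Each partition
# (keyed by created_year_month) holds row groups; each row group stores
# per-column (min, max) statistics alongside its rows.
TABLE: Dict[int, List[dict]] = {
    201607: [
        {"stats": {"score": (0.1, 0.4)},
         "rows": [{"keyword": "spark", "score": 0.1, "lang": "en"},
                  {"keyword": "hive", "score": 0.4, "lang": "en"}]},
    ],
    201608: [
        {"stats": {"score": (0.2, 0.3)},
         "rows": [{"keyword": "pig", "score": 0.2, "lang": "en"},
                  {"keyword": "hdfs", "score": 0.3, "lang": "de"}]},
        {"stats": {"score": (0.7, 0.9)},
         "rows": [{"keyword": "orc", "score": 0.7, "lang": "en"},
                  {"keyword": "parquet", "score": 0.9, "lang": "en"}]},
    ],
}


def scan(table: Dict[int, List[dict]], partition: int,
         columns: List[str], min_score: float) -> Tuple[List[dict], int]:
    """Return (matching rows, number of row groups actually read)."""
    groups_read = 0
    out: List[dict] = []
    # Partition pruning: only the requested partition is visited at all.
    for group in table.get(partition, []):
        _, hi = group["stats"]["score"]
        # Predicate pushdown: skip a row group whose max cannot satisfy
        # score >= min_score, without touching its rows.
        if hi < min_score:
            continue
        groups_read += 1
        for row in group["rows"]:
            if row["score"] >= min_score:
                # Column selection: project only the requested columns.
                out.append({c: row[c] for c in columns})
    return out, groups_read


if __name__ == "__main__":
    print(scan(TABLE, 201608, ["keyword", "score"], 0.5))
```

In this toy run only one of the three row groups is read, which is the point of combining the three techniques.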

Page 28

How much do we gain?

Resources: https://github.com/apache/orc/pull/43/files

Page 29

Fast is no good if it doesn’t work reliably

Page 30

VSQLoH will run your SQL queries out of the box

Running unmodified TPC-DS benchmark queries

[Bar chart: Successful Unaltered TPC-DS Queries (axis 0 to 90). Bar values: 56, 23, 18, 98, 64.]

Page 31

VSQLoH’s fine-grained resource management ensures that queries will complete without running out of memory

Running concurrent select TPC-DS queries

[Bar chart: Concurrent queries before error (axis 0 to 60).]

Page 32

WebHDFS was not a good fit for Vertica’s use case

WebHDFS was intended to be easy for people to use, not for high performance. It is meant to be accessed from curl or a web browser:

curl webhdfs://host:port/webhdfs/v1/my_file_path
or http://host:port/webhdfs/v1/my_file_path

[Diagram: paths between Vertica and the HDFS Server. Labels: Web Server, HDFS Client, WebHDFS, WebHDFS Interface, libhdfs++ Interface, HDFS Client (JVM), libhdfs Interface]
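For context, WebHDFS is a REST interface, so each request is an HTTP URL under `/webhdfs/v1/` with an `op` query parameter (for example `OPEN` or `LISTSTATUS`). The sketch below only builds such URLs and makes no network call. The helper name `webhdfs_url` and the host and port values are placeholders chosen for illustration.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str = "OPEN", **params: str) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation and any extra parameters travel in the query string.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Placeholder host and port; substitute the NameNode's HTTP address.
    print(webhdfs_url("host", 50070, "/my_file_path"))
    print(webhdfs_url("host", 50070, "/user/customers", op="LISTSTATUS"))
```

Every read goes through an HTTP server in front of HDFS, which is convenient for people and scripts but adds overhead for a database engine issuing many reads.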

Page 33

Libhdfs++ is developed from scratch with a focus on performance

• Implemented in C++ with minimal dependencies.
• Supports Linux and OSX.
• All interfaces are non-blocking (unless you want them to be).
• Minimal memory footprint; all memory is explicitly freed as soon as possible.

[Chart: Find of 1 directory, Java vs C++, time in seconds (axis 0 to 3).]
[Chart: Find across 1M directories, Java vs C++, time in seconds and memory in MB (axis 0 to 600).]
Callouts: 2.4 seconds, 0.012 seconds

Page 34

Libhdfs++ is developed from scratch with a focus on performance

HDFS JIRAs
• HDFS-7280
• HDFS-7279
• HDFS-7270
• HDFS-7945

Page 35

Libraries were implemented with bindings to other languages in mind

libhdfs++, liborc, libparquet

• Pure C wrapper APIs allow functionality to be accessed from nearly any other language.
• Write prototypes and tools in scripting languages, and move to native implementations if required.
• More language bindings lead to more adoption and more contributors.
• Development here: http://issues.apache.org/jira/browse/HDFS-8707

Developed within the Apache community
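As a rough illustration of why a plain C API helps with bindings, a scripting language can call a C function after declaring only its signature. The sketch below uses `strlen` from the C standard library as a stand-in, since the wrapper functions of libhdfs++, liborc, and libparquet are not shown in these slides. It assumes a POSIX system, where `ctypes.CDLL(None)` exposes the symbols already loaded into the process.

```python
import ctypes

# Load the symbols already linked into the running process (POSIX only).
# The C standard library stands in for a hypothetical C wrapper library.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Call C strlen on the UTF-8 encoding of text."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("hdfs"))
```

The same pattern (load the library, declare argument and return types, call) would apply to any library that exports C functions, with no compiled glue code needed for a prototype.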

Page 36

In upcoming releases, it will be faster

Page 37

Caching Hadoop data in Vertica’s ROS format will supercharge your queries

CREATE CACHE VIEW fast_tweets
FROM hive.tweets.tweet_keywords
WHERE created_year_month BETWEEN 201509 AND 201608;

SELECT keyword, EXTRACT(month FROM created_at), AVG(score)
FROM fast_tweets
WHERE created_year_month = 201608
GROUP BY 1, 2
ORDER BY 2;

Page 38

ROS has 10 years of R&D to make it the fastest format around

[Diagram labels: ROS, VSQLOH, HDFS, ROS, SELECT…]

ROS is 2x – 11x faster

[Chart: relative performance of Vertica/ROS and Vertica/Parquet (y-axis, 0x to 12x) per TPC-DS query, sorted by relative run-time (x-axis). Numbers greater than 1 are better for ROS; numbers less than 1 are better for Parquet.]

Page 39

Complex types allow richer, more semantically clean data

Complex types enable expression of SQL queries in a natural and intuitive way

SELECT customer, orders.total_cost
FROM customers
WHERE orders.total_sales > 4000 AND orders.products.id = 'B2';

Page 40

Writing data to HDFS will make Vertica a central part of your workflow

Support writing data in Vertica to ORC and Parquet formats.

SELECT * FROM customers AS COPY TO 'hdfs:///user/customers' PARQUET;

[Diagram labels: HDFS, ORC / Parquet]

Page 41

Who wants some fast?

Page 42

Vertica SQL on Hadoop Summary

[Diagram: Hadoop ecosystem components: Kafka, Spark, Hive, Pig, HDFS, MapReduce, HCatalog, HBase]

• High Performance Vertica Engine
    • Beats HAWQ, Impala, Spark, Tez
• All TPC-DS queries run out of the box
• Supports major HDFS file formats
    • ORC, Parquet
• Native readers enable tighter integration
    • Partition pruning, predicate pushdown, column selection
• Libhdfs++ enables efficient communication with HDFS
• Roadmap for more features and further performance improvements