Kyle Bader - Red Hat; Yuan Zhou, Yong Fu, Jian Zhang - Intel; May, 2018 · 2019-02-26

Kyle Bader - Red Hat

Yuan Zhou, Yong Fu, Jian Zhang - Intel

May, 2018

Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Optimization Notice

Agenda

▪ Background and Motivations

▪ The Workloads, Reference Architecture Evolution and Performance Optimization

▪ Performance Comparison with Remote HDFS

▪ Summary & Next Step



DISCONTINUITY IN BIG DATA INFRASTRUCTURE - WHY?

HADOOP, SPARK, SPARKSQL, HIVE, MAPREDUCE, PRESTO, IMPALA, KAFKA, NiFi

CONGESTION in busy analytic clusters causing missed SLAs.

MULTIPLE TEAMS COMPETING and sharing the same big data resources.



MODERN BIG DATA ANALYTICS PIPELINE: KEY TERMINOLOGY

Stages: DATA GENERATION, INGEST, STREAM PROCESSING, TRANSFORM/MERGE/JOIN, DATA SCIENCE, MACHINE LEARNING, DATA ANALYTICS

• Sensors • Click-stream • Transactions • Call-detail records

• NiFi • Kafka

• Presto • Impala • SparkSQL

• TensorFlow

• Kafka • Hadoop • Spark

• Spark • Hadoop



CAUSING CUSTOMERS TO PICK A SOLUTION

#1 Get a bigger cluster for many teams to share.

#2 Give each team their own dedicated cluster, each with a copy of PBs of data.

#3 Give teams the ability to spin up/spin down clusters which can share data sets.


#1 SINGLE LARGE CLUSTER

• Lacks isolation: noisy neighbors hinder SLAs.

• Lacks elasticity: single rigid cluster.



#2 MULTIPLE SMALL CLUSTERS

• No dataset sharing

• Cost of duplicate storage

• Still lacks elasticity

• Can’t scale


#3 ON DEMAND ANALYTIC CLUSTERS WITH A SHARED DATA LAKE

HIT SERVICE-LEVEL AGREEMENTS: Give teams their own compute clusters.

ELIMINATE IDLE RESOURCES: By right-sizing de-coupled compute and storage.

BUY 10s OF PBs INSTEAD OF 100s: Share data sets across clusters instead of duplicating them.

INCREASE AGILITY: With spin-up/spin-down clusters.


#3 ON DEMAND ANALYTIC CLUSTERS WITH A SHARED DATA LAKE

PUBLIC CLOUD (AWS): AWS EC2 PROVISIONING; AWS S3 SHARED DATASETS; clusters: Hadoop, Presto, Spark

PRIVATE CLOUD: OPENSTACK PROVISIONING; CEPH SHARED DATASETS; clusters: Hadoop, Presto, Spark


GENERATION 1: MONOLITHIC HADOOP STACKS

ANALYTICS + INFRASTRUCTURE

Analytics vendors provide analytics software.

Analytics vendors provide single-purpose infrastructure.


GENERATION 2: DECOUPLED STACK WITH PRIVATE CLOUD INFRASTRUCTURE

Analytics vendors provide analytics software.

Private cloud provides infrastructure services:

• Provisioned Compute Pool via OpenStack

• Shared Datasets on Ceph Object Store


Workloads

Simple Read/Write

▪ DFSIO: TestDFSIO is the canonical example of a benchmark that attempts to measure the storage's capacity for reading and writing bulk data.

▪ Terasort: a popular benchmark that measures the amount of time to sort one terabyte of randomly distributed data on a given computer system.

Data Transformation

▪ ETL: Taking data as it is originally generated and transforming it to a format (Parquet, ORC) that is more tuned for analytical workloads.

Batch Analytics

▪ Consistently executing an analytical process over a large set of data.

▪ Leveraging 54 queries derived from TPC-DS*, with intensive reads across objects in different buckets.


Bigdata on Object Storage Performance Overview -- Batch analytics

• Significant performance improvement from Hadoop 2.7.3/Spark 2.1.1 to Hadoop 2.8.1/Spark 2.2.0 (improvement in s3a)

• Batch analytics performance of the 10-node Intel AFA is almost on par with the 60-node HDD cluster

[Chart: 1TB Dataset Batch Analytics and Interactive Query, Hadoop/Spark/Presto Comparison (lower is better). Y-axis: Query Time (s). Categories: Parquet (SSD), ORC (SSD), Parquet (HDD), ORC (HDD). Series: Hadoop 2.8.1 / Spark 2.2.0, Presto 0.170, Presto 0.177. Data labels as extracted: 2244, 5060, 2120, 3446, 3968, 3852, 6573, 7719.]

HDD: 740 HDD OSDs, 40x compute nodes, 20x storage nodes

SSD: 20 SSDs, 5 compute nodes, 5 storage nodes


Hardware Configuration -- Dedicated LB

5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem
• 2x10G 82599 10Gb NIC
• 2x SSDs
• 3x Data storage (can be eliminated)
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL 7.3

5x Storage Node, 2 RGW nodes, 1 LB node
• Intel(R) Xeon(R) CPU E5-2699 v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as journal
• 4x 1.6TB Intel® SSD DC S3510 as data drive
• 2x 400G S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3

[Diagram labels: Head; 5x compute nodes (Hadoop, Hive, Spark, Presto); LB; RGW1, RGW2; MON; OSD1 to OSD5; links labeled 1x10Gb NIC, 2x10Gb NIC, 4x10Gb NIC (bonding).]


Improve Query Success Ratio with Functional Troubleshooting

[Chart: 1TB Query Success % (54 TPC-DS Queries). Y-axis: Query Success %. Categories: hive-parquet, spark-parquet, presto-parquet (untuned), presto-parquet (tuned).]

[Chart: 1TB & 10TB Query Success % (54 TPC-DS Queries). Categories: spark-parquet, spark-orc, presto-parquet, presto-parquet. Label: tuned.]

[Chart: Count of Issue Type (axis 0 to 16). Categories: Ceph issue, Compatible issue, Deployment issue, Improper default configuration, Middleware issue, Runtime issue, S3a driver issue.]

• 100% of the selected TPC-DS queries passed with tunings

• Improper default configuration
  • small capacity size
  • wrong middleware configuration
  • improper Hadoop/Spark configuration for different data sizes and formats


Optimizing HTTP Requests -- The bottlenecks

Compute time takes the biggest part (compute time = read data + sort).

A new connection is opened every time; connections are not reused. Two snapshots of the established connections, taken 2 seconds apart:

ESTAB 0 0 ::ffff:10.0.2.36:44446 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44454 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44374 ::ffff:10.0.2.254:80
ESTAB 159724 0 ::ffff:10.0.2.36:44436 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44448 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44338 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44438 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44414 ::ffff:10.0.2.254:80
ESTAB 0 480 ::ffff:10.0.2.36:44450 ::ffff:10.0.2.254:80 timer:(on,170ms,0)
ESTAB 0 0 ::ffff:10.0.2.36:44442 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44390 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44326 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44452 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44394 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44444 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44456 ::ffff:10.0.2.254:80

2 seconds interval ======================

ESTAB 0 0 ::ffff:10.0.2.36:44508 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44476 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44524 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44374 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44500 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44504 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44512 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44506 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44464 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44518 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44510 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44442 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44526 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44472 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44466 ::ffff:10.0.2.254:80

HTTP 500 errors in the RGW log:

2017-07-18 14:53:52.259976 7fddd67fc700 1 ====== starting new request req=0x7fddd67f6710 =====
2017-07-18 14:53:52.271829 7fddd5ffb700 1 ====== starting new request req=0x7fddd5ff5710 =====
2017-07-18 14:53:52.273940 7fddd7fff700 0 ERROR: flush_read_list(): d->client_c->handle_data() returned -5
2017-07-18 14:53:52.274223 7fddd7fff700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2017-07-18 14:53:52.274253 7fddd7fff700 0 ERROR: s->cio->send_content_length() returned err=-5
2017-07-18 14:53:52.274257 7fddd7fff700 0 ERROR: s->cio->print() returned err=-5
2017-07-18 14:53:52.274258 7fddd7fff700 0 ERROR: STREAM_IO(s)->print() returned err=-5
2017-07-18 14:53:52.274267 7fddd7fff700 0 ERROR: STREAM_IO(s)->complete_header() returned err=-5
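One way to quantify the "connection not reused" observation is to compare the client-side ports of two snapshots: a port that shows up in both belongs to a connection that survived the interval. The sketch below does this for `ss`-style text. The helper name `local_ports` and the shortened sample snapshots (an excerpt of the listings on this slide) are illustrative assumptions, not part of the original tooling.

```python
def local_ports(ss_output):
    """Return the set of local (client-side) ports of ESTAB sockets in ss-style text."""
    ports = set()
    tokens = ss_output.split()
    for i, tok in enumerate(tokens):
        # Assumed layout: ESTAB <recv-q> <send-q> <local addr:port> <peer addr:port>
        if tok == "ESTAB" and i + 3 < len(tokens):
            ports.add(int(tokens[i + 3].rsplit(":", 1)[1]))
    return ports


# Excerpt of the first snapshot.
before = """
ESTAB 0 0 ::ffff:10.0.2.36:44446 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44454 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44374 ::ffff:10.0.2.254:80
ESTAB 159724 0 ::ffff:10.0.2.36:44436 ::ffff:10.0.2.254:80
"""

# Excerpt of the snapshot taken 2 seconds later.
after = """
ESTAB 0 0 ::ffff:10.0.2.36:44508 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44476 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44524 ::ffff:10.0.2.254:80
ESTAB 0 0 ::ffff:10.0.2.36:44374 ::ffff:10.0.2.254:80
"""

reused = local_ports(before) & local_ports(after)
fresh = local_ports(after) - local_ports(before)
print(f"reused ports: {sorted(reused)}, new since last snapshot: {len(fresh)}")
```

Applied to the full listings, the same comparison would find only ports 44374 and 44442 in both snapshots, which is consistent with most connections being opened anew within the 2-second window.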


Optimizing HTTP Requests -- S3A input policy

Background

The S3A filesystem client supports the notion of input policies, similar to that of the POSIX fadvise() API call. This tunes the behavior of the S3A client to optimize HTTP GET requests for various use cases. To optimize HTTP GET requests, you can take advantage of the S3A experimental input policy fs.s3a.experimental.input.fadvise.

Ticket: https://issues.apache.org/jira/browse/HADOOP-13203

Seek in an IO stream

diff = targetPos - pos
diff > 0: skip forward
diff < 0: seek backward
diff = 0: sequential read
Close stream & open connection again

Solution

Enable the random read policy in Hadoop:

<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
<property>
  <name>fs.s3a.readahead.range</name>
  <value>64K</value>
</property>

By reducing the cost of closing existing HTTP requests, this is highly efficient for file IO accessing a binary file through a series of `PositionedReadable.read()` and `PositionedReadable.readFully()` calls.
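The seek handling and the effect of the input policy can be illustrated with a small toy model. This is a simplified sketch, loosely modeled on the behaviour described on this slide; the function names `seek_action` and `get_range_end`, and the exact thresholds, are assumptions for illustration and are not taken from the Hadoop source.

```python
def seek_action(pos, target_pos, readahead_range):
    """Decide how to reach target_pos on an already open GET stream (toy model)."""
    diff = target_pos - pos
    if diff == 0:
        return "continue"   # already at the target: plain sequential read
    if 0 < diff <= readahead_range:
        return "skip"       # short forward seek: read and discard bytes
    return "reopen"         # backward or long forward seek: close stream, new GET


def get_range_end(policy, target_pos, length, file_size, readahead_range):
    """Exclusive end offset requested by the next GET under a given policy (toy model)."""
    if policy == "random":
        # Request only what is needed plus readahead, so closing the stream is cheap.
        return min(file_size, target_pos + max(readahead_range, length))
    # Sequential-style policy: request through to the end of the object.
    return file_size


READAHEAD = 64 * 1024  # matches the 64K readahead range configured on this slide

print(seek_action(100, 100, READAHEAD))        # no movement needed
print(seek_action(100, 4196, READAHEAD))       # short forward seek
print(seek_action(100, 50, READAHEAD))         # backward seek
print(get_range_end("random", 0, 4096, 10**9, READAHEAD))
print(get_range_end("sequential", 0, 4096, 10**9, READAHEAD))
```

In this model a columnar reader that jumps around a large file leaves at most one readahead window unread when it reopens under the random policy, instead of abandoning a GET that spans the rest of the object.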


Optimizing HTTP Requests -- Performance

• The readahead feature is supported from Hadoop 2.8.1, but is not enabled by default. Applying the random read policy fixed the HTTP 500 issue and improved performance by 3x.

• The all-flash storage architecture also shows a great performance benefit and low TCO compared with HDD storage.

[Chart: 1TB Batch Analytics Query on Parquet. Y-axis: seconds. Hadoop 2.7.3/Spark 2.1.1: 8305; Hadoop 2.8.1/Spark 2.2.0 (untuned): 8267; Hadoop 2.8.1/Spark 2.2.0 (tuned): 2659.]

[Chart: 1TB Dataset Batch Analytics Query, Hadoop/Spark Comparison (lower is better). Y-axis: seconds. Categories: Hadoop 2.8.1/Spark 2.2.0 [SSD], Hadoop 2.8.1/Spark 2.2.0 [HDD]. Series: Parquet, ORC. Data labels as extracted: 2659, 2120, 4810, 3346.]


New bottleneck on Load Balancer

• The load balancer became the bottleneck for network bandwidth.

• Many messages were observed blocked at the load balancer server (sending to the s3a driver), but few were blocked on the receiving side at the s3a driver.

[Chart: Network IO over the full run. Series: Sum of rxkB/s, Sum of txkB/s. Y-axis: 0 to 3000000.]

[Chart: Network IO, enlarged on a single query. Series: Sum of rxkB/s, Sum of txkB/s. Y-axis: 0 to 2500000.]


Hardware Configuration -- More RGWs with round-robin DNS

5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem
• 2x10G 82599 10Gb NIC
• 2x SSDs
• 3x Data storage (can be eliminated)
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL 7.3

5x Storage Node, 2 RGW nodes, 1 LB node
• Intel(R) Xeon(R) CPU E5-2699 v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as journal
• 4x 1.6TB Intel® SSD DC S3510 as data drive
• 2x 400G S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3

[Diagram labels: Head; DNS Server; 5x compute nodes (Hadoop, Hive, Spark, Presto); RGW1, RGW2, RGW2, RGW2, RGW2; MON; OSD1 to OSD5; links labeled 1x10Gb NIC, 2x10Gb NIC.]


Performance evaluation -- More RGWs and round-robin DNS

• 18% performance improvement with more RGWs and round-robin DNS

• Query 42 (which has less shuffle) is 1.64x faster in the new architecture

[Chart: 1TB Batch Analytics Query (Parquet). Y-axis: seconds. Dedicated single load balancer: 2659; DNS: 2244.]

[Chart: Query 42 Query Time (10TB dataset, Parquet). Y-axis: seconds. Dedicated load balancer: 129; DNS RR: 78.]
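The improvement figures on this slide can be cross-checked from the chart labels. The short calculation below assumes the values 2659 s and 2244 s (1TB batch analytics) and 129 s and 78 s (Query 42) belong to the dedicated load balancer and the DNS round-robin setup respectively, as the chart order suggests. It reports both the speedup ratio and the fraction of time saved, since "18%" could be read either way.

```python
def speedup(before_s, after_s):
    """How many times faster the 'after' run is (ratio of elapsed times)."""
    return before_s / after_s


def time_reduction(before_s, after_s):
    """Fraction of elapsed time saved by the 'after' run."""
    return (before_s - after_s) / before_s


# (dedicated load balancer, DNS round robin), in seconds, read off the chart labels
runs = {
    "1TB batch analytics": (2659, 2244),
    "Query 42 (10TB)": (129, 78),
}

for name, (before, after) in runs.items():
    print(f"{name}: {speedup(before, after):.2f}x faster, "
          f"{time_reduction(before, after):.1%} less time")
```

Under these assumptions the batch run comes out roughly 1.18x faster (about 16% less elapsed time), so the quoted 18% appears to be the speedup ratio. Query 42 comes out near 1.65x from the rounded chart labels, close to the 1.64x quoted, which was presumably computed from unrounded timings.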


Key Resource Utilization Comparison

• The compute side (Hadoop s3a driver) can read more data from the OSDs faster, which shows that the DNS deployment brings a bigger improvement in network throughput than a single gateway with bonding/teaming technology.

[Chart: (DNS) S3A Driver Network IO. Series: Sum of rxkB/s, Sum of txkB/s. Y-axis: kB/s.]

[Chart: (Dedicated LB server) S3A Driver Network IO. Series: Sum of rxkB/s, Sum of txkB/s. Y-axis: kB/s.]


Hardware Configuration -- RGW and OSD Collocated

5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @

5x Storage Node, 2 RGW nodes, 1 LB node
• Intel(R) Xeon(R) CPU E5-2699 v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and RocksDB
• 4x 1.6TB Intel® SSD DC S3510 as data drive
• 2x 400G S3700 SSDs
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3

[Diagram labels: Head; DNS Server; 5x compute nodes (Hadoop, Hive, Spark, Presto); RGW1+OSD1 to RGW5+OSD5; MON; links labeled 1x10Gb NIC, 3x10Gb NIC (ECMP).]


Performance under RGW & OSD Collocated

• No extra dedicated RGW servers are needed; with ECMP enabled, the RGW instance and the OSD go through different network interfaces.

• No performance degradation, but lower TCO.

[Chart: 1TB Batch Analytics Query on Parquet. Y-axis: seconds. Dedicated RGWs: 2242; RGW + OSD co-located: 2162.]

[Chart: Other workloads (10TB Parquet ETL, 1TB dfsio_write). Series: RGW + OSD co-located, dedicated RGWs. Axis: 0 to 2500.]


RGW & OSD collocated -- RGW scaling

• Scaling out RGWs can improve performance until the OSDs (storage) saturate.

• So the number of RGWs that gives the best performance should be decided by the bandwidth of each RGW server and the throughput of the OSDs.

[Chart: 1TB DFSIO write. X-axis: # of RGWs (1, 2, 3, 4, 5). Y-axis: MB/s. Values: 950, 1730, 2119, 2233, 2254.]

[Chart: 10TB ETL (lower is better). X-axis: # of RGWs (6, 9, 18). Y-axis: seconds. Values: 337, 252, 250.]
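The saturation point can be located from the DFSIO write measurements by looking at the marginal gain of each additional RGW. The sketch below assumes the five chart labels map to 1 through 5 RGWs in order; the function name `knee` and the 10% cut-off are illustrative choices, not values from the original study.

```python
def knee(throughput_by_rgws, min_gain=0.10):
    """Return the RGW count after which adding one more gains less than min_gain."""
    counts = sorted(throughput_by_rgws)
    for prev, cur in zip(counts, counts[1:]):
        gain = throughput_by_rgws[cur] / throughput_by_rgws[prev] - 1
        if gain < min_gain:
            return prev
    return counts[-1]  # never flattened within the measured range


# 1TB DFSIO write throughput in MB/s, keyed by number of RGWs (assumed mapping)
dfsio_write = {1: 950, 2: 1730, 3: 2119, 4: 2233, 5: 2254}

for n in sorted(dfsio_write)[1:]:
    gain = dfsio_write[n] / dfsio_write[n - 1] - 1
    print(f"{n - 1} -> {n} RGWs: {gain:+.1%}")

print("knee at", knee(dfsio_write), "RGWs")
```

With these numbers the gain drops from about 82% (1 to 2 RGWs) to about 5% (3 to 4 RGWs), which fits the slide's point that RGW scale-out helps only until the OSDs saturate.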


Using Ceph RBD as shuffle storage -- Eliminate the physical drive on the compute node

• Remote RBD volumes on the compute nodes act as shuffle devices instead of physical shuffle devices.

• For most queries the performance is not impacted.

[Chart: 1TB Batch Analytics (physical shuffle storage vs RBD shuffle storage). Y-axis: 0 to 200. Series: physical-query_time, rbd-query_time. X-axis: queries 3, 7, 12, 13, 15, 17, 19, 20, 21, 24, 25, 26, 27, 28, 29, 31, 32, 34, 39, 40, 42, 43, 45, 46, 48, 49, 51, 52, 55, 56, 60, 63, 64, 65, 66, 68, 71, 73, 75, 76, 79, 82, 83, 84, 85, 87, 88, 89, 90, 91, 92, 93, 96, 97.]


Compute-side caching

5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @

5x Storage Node, 5 RGW nodes
• Intel(R) Xeon(R) CPU E5-2699 v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 1x Intel® P3700 1.0TB SSD as WAL and RocksDB
• 4x 1.6TB Intel® SSD DC S3510 as data drive
• 1 OSD instance on each S3510 SSD
• RHEL 7.3
• RHCS 2.3

5x Caching Node (co-located with compute)
• 1TB SSD (P3700)
Software:
• Alluxio* 1.7.0

[Diagram labels: Head; DNS Server; 5x compute nodes (Hadoop, Hive, Spark, Presto), each with SSD Caching; RGW1+OSD1 to RGW5+OSD5; MON; links labeled 1x10Gb NIC, 3x10Gb NIC.]


Compute-side caching for I/O intensive queries

• Compute-side caching brings better efficiency (10% - 30%) for I/O intensive queries

[Chart: I/O intensive query performance. Y-axis: seconds. Queries: query19.sql, query42.sql, query43.sql, query52.sql, query55.sql, query63.sql, query68.sql. Series: S3, Compute-side Caching. Data labels as extracted: 25.55, 17.10, 13.83, 16.88, 16.49, 16.53, 33.15, 21.84, 12.13, 12.50, 12.65, 12.30, 12.96, 21.89. Annotation: -37%.]


Hardware Configuration -- Remote HDFS

5x Compute Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.2GHz, 128GB mem
• 2x10G 82599 10Gb NIC
• 2x S3700 as shuffle storage
Software:
• Hadoop 2.7.3
• Spark 2.1.1
• Hive 2.2.1
• Presto 0.177
• RHEL 7.3

5x Data Node
• Intel(R) Xeon(R) CPU E5-2699 v4 2.20GHz
• 128GB Memory
• 2x 82599 10Gb NIC
• 7x 400G S3700 SSDs as data store
• RHEL 7.3

[Diagram labels: Name Node + Node Manager + Spark; 4x Node Manager + Spark; Data Node 1 to Data Node 5; links labeled 1x10Gb NIC.]


Bigdata on Cloud vs. Remote HDFS -- Batch Analytics

On-par performance compared with remote HDFS

• With optimizations, big data analytics on object storage is on par with remote HDFS, especially on Parquet format data.

• The performance of the s3a driver is close to that of the native DFS client, which demonstrates that a separated compute and storage solution delivers performance comparable to the combined solution.

[Chart: 1TB Dataset Batch Analytics and Interactive Query comparison with remote HDFS (lower is better). Categories: Object Store, Remote HDFS. Series: Parquet, ORC. Data labels as extracted: 2244, 1921, 5060, 2839.]


Bigdata on Cloud vs. Remote HDFS -- DFSIO

• DFSIO write performance on Ceph is better than on remote HDFS (43%), but read performance is 34% lower.

• Writes to Ceph hit a disk bottleneck.

• Shuffle storage blocks on consuming data from the data lake.

[Chart: Disk Bandwidth. Series: Sum of rkB/s, Sum of wkB/s. Y-axis: 0 to 300000.]

[Chart: DFSIO 1TB on Parquet (Throughput). Y-axis: MB/s. Categories: remote HDFS (replication 1), remote HDFS (replication 3), Ceph over s3a. Series: dfsio write, dfsio read. Data labels as extracted: 4257, 1644, 2354, 3845, 3532.]


Bigdata on Cloud vs. Remote HDFS -- Terasort

• The time cost at the Reduce stage is the biggest part.

• Reads and writes happen concurrently.

[Chart: OSD Data Drive Disk Bandwidth. Series: Sum of rkB/s, Sum of wkB/s.]

[Chart: OSD Data Drive IO Latencies. Series: Sum of await, Sum of svctm.]

[Chart: 1TB Terasort Total Throughput (MB/s). Remote HDFS: 887; Ceph over s3a: 442.]


Bigdata on Cloud vs. Remote HDFS -- Ongoing rename optimizations

“Renaming” overhead can be improved!

DirectOutputCommitter: An implementation in Spark 1.6 that returns the destination address as the working directory, so there is no need to rename/move task output. It is not robust against failures and was removed in Spark 2.0.

IBM's "Stocator" committer: Targets OpenStack Swift, with good robustness, but it is another file system, not s3a.

Staging committer: One of the new s3a committers; needs a large amount of disk capacity for staged data.

Magic committer: One of the new s3a committers; if you know your object store is consistent, or you use S3Guard, this committer has higher performance.
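Why renaming is costly on an object store can be seen in a toy model: with no native rename, moving task output from a temporary prefix to its final location means copying every object and deleting the original. The sketch below compares that with writing directly to the destination, which is the idea behind the direct committer above. `ToyObjectStore` and both commit functions are hypothetical illustrations; they are not the S3A or Spark committer APIs, and they ignore failure handling entirely.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store with no native rename."""

    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        # Emulated rename: copy each object to its new key, then delete the old one.
        for key in [k for k in self.objects if k.startswith(src)]:
            data = self.objects.pop(key)
            self.bytes_copied += len(data)
            self.objects[dst + key[len(src):]] = data


def commit_by_rename(store, parts):
    """Write task output under a temporary prefix, then rename it into place."""
    for name, data in parts.items():
        store.put("out/_temporary/" + name, data)
    store.rename_prefix("out/_temporary/", "out/")


def commit_direct(store, parts):
    """Write task output straight to its final location (no rename, no safety net)."""
    for name, data in parts.items():
        store.put("out/" + name, data)


parts = {"part-0000": b"x" * 1000, "part-0001": b"y" * 2000}
renamed, direct = ToyObjectStore(), ToyObjectStore()
commit_by_rename(renamed, parts)
commit_direct(direct, parts)
print("extra bytes copied by rename:", renamed.bytes_copied)
print("extra bytes copied by direct write:", direct.bytes_copied)
```

Both paths end with the same objects in place, but the rename path moves every byte a second time. The direct path avoids that cost at the price of leaving partial output visible if a task fails, which matches the robustness concern noted for DirectOutputCommitter.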


Summary and Next Steps

Summary

• Big data on a Ceph data lake is functionally ready, as validated by the industry-standard decision-making workload TPC-DS.

• Big data on the cloud delivers on-par performance with remote HDFS for batch analytics; write-intensive operations still need further optimization.

• All-flash solutions demonstrated a significant TCO benefit compared with HDD solutions.

Next

• Expand the scope of analytic workloads

• Optimize rename operations to improve performance

• Accelerate performance with

  • a speed-up layer for shuffle

  • compute-side caching


Experiment environment

Cluster: Hadoop head | Hadoop slave | Load balancer | OSD | RGW

Roles:
• Hadoop head: Hadoop name node, secondary name node, resource manager, data node, node manager, Hive metastore service, Yarn history server, Spark history server, Presto server
• Hadoop slave: data node, node manager, Presto server
• Load balancer: Haproxy
• OSD: Ceph osd
• RGW: Ceph rados gateway

# of nodes: 1 | 5 | 1 | 5 | 5

Processor: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 44 cores, HT enabled | Intel(R) Xeon(R) CPU E31280 @ 3.50GHz, 4 cores, HT enabled

Memory: 128GB | 128GB | 128GB | 32GB

Storage: 4x 1TB HDD, 2x Intel S3510 480GB SSD (vs S3700 metrics) | 1x Intel S3510 480GB SSD | 1x Intel® P3700 1.6TB as journal, 4x 1.6TB Intel® SSD DC S3510, 2x 400GB S3700 as data store | 1x Intel S3510 480GB SSD

Network: 10GB | 40GB | 10GB+10GB | 10GB

(Rows with fewer than five entries span merged cells in the original table.)


SW Configuration

Hadoop version: 2.7.3/2.8.1
Spark version: 2.1.1/2.2.0
Hive version: 2.2.1
Presto version: 0.177
Executor memory: 22GB
Executor cores: 5
# of executors: 24
JDK version: 1.8.0_131
Memory.overhead: 5GB

S3A Key Performance Configuration

fs.s3a.connection.maximum: 10
fs.s3a.threads.max: 30
fs.s3a.socket.send.buffer: 8192
fs.s3a.socket.recv.buffer: 8192
fs.s3a.threads.keepalivetime: 60
fs.s3a.max.total.tasks: 1000
fs.s3a.multipart.size: 100M
fs.s3a.block.size: 32M
fs.s3a.readahead.range: 64k
fs.s3a.fast.upload: true
fs.s3a.fast.upload.buffer: array
fs.s3a.fast.upload.active.blocks: 4
fs.s3a.experimental.input.fadvise: random
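For reuse, the S3A settings listed above can be rendered as Hadoop `<property>` entries. The sketch below is one possible way to do that with the Python standard library. The helper name `to_core_site` is hypothetical, the values are copied from the table (with the fadvise value taken to be `random`), and where the resulting XML should be placed depends on the deployment.

```python
import xml.etree.ElementTree as ET

# Values copied from the "S3A Key Performance Configuration" table.
S3A_SETTINGS = {
    "fs.s3a.connection.maximum": "10",
    "fs.s3a.threads.max": "30",
    "fs.s3a.socket.send.buffer": "8192",
    "fs.s3a.socket.recv.buffer": "8192",
    "fs.s3a.threads.keepalivetime": "60",
    "fs.s3a.max.total.tasks": "1000",
    "fs.s3a.multipart.size": "100M",
    "fs.s3a.block.size": "32M",
    "fs.s3a.readahead.range": "64k",
    "fs.s3a.fast.upload": "true",
    "fs.s3a.fast.upload.buffer": "array",
    "fs.s3a.fast.upload.active.blocks": "4",
    "fs.s3a.experimental.input.fadvise": "random",
}


def to_core_site(settings):
    """Render a name/value mapping as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = to_core_site(S3A_SETTINGS)
print(xml_text)
```

Keeping the settings in one mapping makes it easier to diff tuned and untuned runs, since the same dictionary can be rendered once per experiment.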



Legal Disclaimer & Optimization Notice

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


http://www.intel.com/benchmarks

https://software.intel.com/en-us/articles/optimization-notice


Resource Utilization on 1TB Parquet

Low resource utilization

[Charts, one set each for the compute node, OSD node and RGW node: Cpu Utilization (Average of %idle, %steal, %iowait), Disk Bandwidth (Sum of rkB/s, Sum of wkB/s), Memory Utilization (Total), Network IO (Sum of rxkB/s, Sum of txkB/s).]

Resource Utilization on 10TB Parquet

[Charts, one set each for the compute node, OSD node and RGW node: Cpu Utilization (Average of %idle, %steal, %iowait), Disk Bandwidth (Sum of rkB/s, Sum of wkB/s), Memory Utilization (Total), Network IO (Sum of rxkB/s, Sum of txkB/s).]
