Optimizing PowerEdge Configurations for Hadoop
Michael Pittaro, Principal Architect, Big Data Solutions, Dell




Hadoop hardware configurations for Dell PowerEdge servers. (Presentation from the 2013 Dell Enterprise Forum.)


Page 1: Optimizing Dell PowerEdge Configurations for Hadoop

Optimizing PowerEdge Configurations for Hadoop

Michael Pittaro, Principal Architect, Big Data Solutions, Dell

Page 2: Optimizing Dell PowerEdge Configurations for Hadoop

Big Data is when the data itself is part of the problem.

Volume

• A large amount of data, growing at large rates

Velocity

• The speed at which the data must be processed

Variety

• The range of data types and data structures

What is Big Data?

Page 3: Optimizing Dell PowerEdge Configurations for Hadoop

Dell | Cloudera Apache Hadoop Solution


Retail Telco Media Web Finance

Page 4: Optimizing Dell PowerEdge Configurations for Hadoop

• A Proven Big Data Platform – Cloudera CDH4 Hadoop Distribution with Cloudera Manager – Validated and Supported Reference Architecture – Production deployments across all verticals

• Dell Crowbar provides deployment and management at scale

– Integrated with Cloudera Manager – Bare metal to deployed cluster in hours – Lifecycle management for ongoing operations

• Dell Partner Ecosystem

– Pentaho for Data Integration – Pentaho for Reporting and Visualization – Datameer for Spreadsheet style analytics and visualization – Clarity and Dell Implementation Services

Dell | Cloudera Apache Hadoop Solution


Page 5: Optimizing Dell PowerEdge Configurations for Hadoop

• Customers want results – Performance – Predictability – Reliability – Availability – Management – Monitoring

• Customers want value

• Big Data has many options – Servers – Networking – Software – Tools – Application Code – Fast Evolution

• Wide range of applications

The Problem with Big Data Projects


Page 6: Optimizing Dell PowerEdge Configurations for Hadoop

• Tested Server Configurations • Tested Network Configurations • Base Software Configuration

– Big Data Software – OS Infrastructure – Operational Infrastructure

• Predefined configuration – Recommended starting point

• Patterns, Use Cases, and Best Practices are emerging in Big Data

• Reference Architectures help package this knowledge for reuse

A Reference Architecture Fills The Gap


Page 7: Optimizing Dell PowerEdge Configurations for Hadoop

• PowerEdge R720, R720xd – Balanced Compute and Storage

• PowerEdge C6105 – Scale Out Computing – Large Disk Capacity

• PowerEdge C8000 – Scale Out Computing – Flexible Configuration


Reference Architecture: Servers

Page 8: Optimizing Dell PowerEdge Configurations for Hadoop

Top of Rack: Force10 S60 (1GbE) or Force10 S4810 (10GbE)

Cluster Aggregation: redundant pair of Force10 S4810 switches

Bonded connections to the nodes provide redundant networking

Reference Architecture: Networking


Page 9: Optimizing Dell PowerEdge Configurations for Hadoop

• Hadoop – Cloudera CDH 4 – Cloudera Manager – Hadoop Tools

• Infrastructure Management

– Nagios – Ganglia

• Configuration Management

– Predefined parameters – Role based configuration


Reference Architecture: Software

Hadoop ecosystem components: Hive, Pig, HBase, Sqoop, Oozie, Hue, Flume, Whirr, ZooKeeper

Page 10: Optimizing Dell PowerEdge Configurations for Hadoop

Tying it all Together: Crowbar


Dell “Crowbar” Ops Management spans the full stack:

• APIs, User Access, & Ecosystem Partners
• Big Data Infrastructure & Dell Extensions – HDFS, HBase, Hive, Pig, Cloudera, Nagios, Ganglia, Pentaho, Force10
• Core Components & Operating Systems – Crowbar, Deployer, Provisioner, Network, RAID, BIOS, IPMI, NTP, DNS, Logging
• Physical Resources

Page 11: Optimizing Dell PowerEdge Configurations for Hadoop


Hadoop Node Architecture

• Admin Node – Crowbar, Cloudera Manager, Nagios, Ganglia
• Edge Node – Hadoop clients
• Data Nodes – Task Tracker and Data Node daemons on each
• Master Name Node and Secondary Name Node; Job Tracker
• High Availability option – Active Name Node, Standby Name Node, and Journal Nodes

Page 12: Optimizing Dell PowerEdge Configurations for Hadoop


Hadoop Cluster Scaling

Page 13: Optimizing Dell PowerEdge Configurations for Hadoop

Learning The Reference Architecture

• Read it! – Read it again – Keep it under your pillow

• Three Documents

– Reference Architecture – Deployment Guide – User's Guide

• Deploy it

– Works on 4 or 5 nodes

• Available through the Dell Sales Team


Page 14: Optimizing Dell PowerEdge Configurations for Hadoop

Leveraging the Reference Architecture

• Start with the base configuration – It works, and eliminates mix and match problems – There are a lot of subtle details hidden behind the configurations

• Easy changes: processor, memory, disk

– Will generally not break anything – Will affect performance, however

• Harder changes: Hadoop configuration

– Mainly, need to know what you're doing here – We have experience and recommendations

• Hardest Changes: Optimization for workloads

– The default configuration is a general purpose one – Specific workloads must be tested and benchmarked


Page 15: Optimizing Dell PowerEdge Configurations for Hadoop

• Assume 1.5 Hadoop Tasks per physical core – Turn Hyperthreading on – This allows headroom for other processes

• Configure Hadoop Task slots – 2/3 map tasks – 1/3 reduce tasks

• Dual Socket 6 core Xeon example › mapred.tasktracker.map.tasks.maximum: 12 › mapred.tasktracker.reduce.tasks.maximum: 6

• Faster is better

– Hadoop compression uses processor cycles – Most Hadoop jobs are I/O bound, not processor bound – The Map / Reduce balance depends on actual workload – It’s hard to optimize more without knowing the actual workload

Selecting Processors

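The sizing rule on this slide can be sketched as a small calculation (an illustrative helper, not part of the Dell solution; the function name is mine): roughly 1.5 task slots per physical core, split 2/3 map and 1/3 reduce.

```python
# Illustrative sketch of the slide's task-slot sizing rule:
# ~1.5 Hadoop task slots per physical core, 2/3 map and 1/3 reduce.

def task_slots(sockets, cores_per_socket, tasks_per_core=1.5):
    """Return (map_slots, reduce_slots) for one data node."""
    physical_cores = sockets * cores_per_socket
    total_slots = int(physical_cores * tasks_per_core)
    map_slots = total_slots * 2 // 3        # 2/3 of slots for map tasks
    reduce_slots = total_slots - map_slots  # remaining 1/3 for reduce
    return map_slots, reduce_slots

# Dual-socket, 6-core Xeon example from the slide:
# mapred.tasktracker.map.tasks.maximum = 12
# mapred.tasktracker.reduce.tasks.maximum = 6
print(task_slots(2, 6))  # (12, 6)
```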

Page 16: Optimizing Dell PowerEdge Configurations for Hadoop

• Hadoop scales processing and storage together – The cluster grows by adding more data nodes – The ratio of processor to storage is the main adjustment

• Generally, aim for a 1 spindle / 1 core ratio

– I/O is large blocks (64 MB to 256 MB) – Primarily sequential read/write, very little random I/O – 8 tasks will read or write 8 individual spindles

• Drive Sizes and Types

– NL SAS or Enterprise SATA 6 Gb/sec – Drive size is mainly a price decision

• Depth per node

– Up to 48 TB/node is common – 112 TB/node is possible – Consider how much data is ‘active’ – Very deep storage impacts recovery performance

Spindle / Core / Storage Depth Optimization

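The 1 spindle / 1 core guideline translates into a simple capacity estimate. A minimal sketch, assuming a helper name and drive sizes of my own choosing:

```python
# Rough node-storage sizing following the 1 spindle per core guideline
# (illustrative only; the drive sizes below are assumptions).

def node_storage_tb(cores, drive_tb, spindles_per_core=1.0):
    """Raw capacity per data node at the target spindle/core ratio."""
    spindles = int(cores * spindles_per_core)
    return spindles, spindles * drive_tb

# A 12-core node with 4 TB NL-SAS drives at a 1:1 ratio lands at the
# "common" depth the slide mentions:
spindles, raw_tb = node_storage_tb(12, 4)
print(spindles, raw_tb)  # 12 spindles, 48 TB raw
```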

Page 17: Optimizing Dell PowerEdge Configurations for Hadoop

PowerEdge C8000 Hadoop Scaling - 16 core Xeon


[Chart: cores, storage (TB), and IOPS as data nodes are added, comparing (1) 12-spindle 3 TB nodes versus (3) 6-spindle 3 TB nodes – series: Cores (1), Storage (1), IOPS (1), Storage (3), IOPS (3)]
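The chart's comparison can be approximated numerically. This is an illustrative sketch with assumed per-step configurations and a helper name of my own, not the measured data behind the chart:

```python
# Compare growing a cluster one 12-spindle 3 TB node at a time versus
# three 6-spindle 3 TB nodes at a time. The three-node option adds more
# cores, spindles (IOPS), and storage per step, at the cost of more
# chassis and network ports.

def cluster_totals(steps, nodes_per_step, cores_per_node,
                   spindles_per_node, drive_tb):
    nodes = steps * nodes_per_step
    return {
        "cores": nodes * cores_per_node,
        "storage_tb": nodes * spindles_per_node * drive_tb,
        "spindles": nodes * spindles_per_node,
    }

dense = cluster_totals(10, 1, 16, 12, 3)  # (1) 12-spindle 3 TB nodes
wide = cluster_totals(10, 3, 16, 6, 3)    # (3) 6-spindle 3 TB nodes
print(dense)  # {'cores': 160, 'storage_tb': 360, 'spindles': 120}
print(wide)   # {'cores': 480, 'storage_tb': 540, 'spindles': 180}
```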

Page 18: Optimizing Dell PowerEdge Configurations for Hadoop

• Workload optimization requires profiling and benchmarking

• HBase versus pure Map/Reduce are different – I/O patterns are different – HBase requires more memory – Cloudera RTQ (Impala) is I/O intensive

• Map Reduce usage varies

– I/O intensive to CPU intensive

• Ingestion and Transfer impact the edge (gateway) nodes

• Heterogeneous clusters versus dedicated clusters? – Cloudera has added support for heterogeneous clusters and nodes – A dedicated cluster makes sense if the workload is consistent

› Primarily for ‘data’ businesses

Workload Optimization: Hadoop has widely varying workloads


Page 19: Optimizing Dell PowerEdge Configurations for Hadoop

Reference Architecture Options

• High Availability – Networking configuration – Master / Secondary Name Node configuration

• Alternative Switches

– It’s possible – Contact us for advice

• Cluster Size

– The Reference Architecture scales easily to around 720 nodes – Beyond that, a network engineer needs to take a closer look

• Node Size

– Memory recommendations are a starting point – Disk / Core balance is a never ending debate


Page 20: Optimizing Dell PowerEdge Configurations for Hadoop

Model – Data Node Configuration – Comments

• R720xd – Dual socket, 12 cores, 24 x 2.5” spindles – Most popular platform for Hadoop
• C8000 – Dual socket, 16 cores, 16 x 3.5” spindles – Popular for deep/dense Hadoop applications
• C6100 / C6105 – Dual socket, 8/12 cores, 12 x 3.5” spindles – Two node version; C6100 is hardware EOL
• C2100 – Dual socket, 12 cores, 12 x 3.5” spindles – Popular; hardware EOL but often repurposed for Hadoop
• R620 – Dual socket, 8 cores, 10 x 2.5” spindles – 1U form factor
• C6220 – Dual socket, 8 cores, 6 x 2.5” spindles – Core/spindle ratio is not ideal for Hadoop

In the Wild – Dell Customer Hadoop Configurations


Page 21: Optimizing Dell PowerEdge Configurations for Hadoop

SecureWorks : Based on R720xd Reference Architecture

SecureWorks operates 24 hours a day, 365 days a year, helping protect its customers’ assets in real time

Challenge Collecting, processing, and analyzing massive amounts of data from customer environments

Results • Reduced cost of data storage to ~21 cents per gigabyte

• 80% savings over previous proprietary solution

• 6 months faster deployment

• < 1 yr. payback on entire investment

• Data doubles every 18 months, magnifying savings

Page 22: Optimizing Dell PowerEdge Configurations for Hadoop

Further Information

• Dell Hadoop Home Page – http://www.dell.com/hadoop

• Dell Cloudera Apache Hadoop install with Crowbar (video)

– http://www.youtube.com/watch?v=ZWPJv_OsjEk

• Cloudera CDH4 Documentation – http://ccp.cloudera.com/display/CDH4DOC/CDH4+Documentation

• Crowbar homepage and documentation on GitHub

– http://github.com/dellcloudedge/crowbar/wiki

• Open Source Crowbar Installers – http://crowbar.zehicle.com/


Page 23: Optimizing Dell PowerEdge Configurations for Hadoop

Q&A


Page 24: Optimizing Dell PowerEdge Configurations for Hadoop

Thank you!


Page 25: Optimizing Dell PowerEdge Configurations for Hadoop


Notices & Disclaimers

Copyright © 2013 by Dell, Inc.

No part of this document may be reproduced or transmitted in any form without the written permission from Dell, Inc.

This document could include technical inaccuracies or typographical errors. Dell may make improvements or changes in the product(s) or program(s) described herein at any time without notice. Any statements regarding Dell’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

References in this document to Dell products, programs, or services do not imply that Dell intends to make such products, programs, or services available in all countries in which Dell operates or does business. Any reference to a Dell Program Product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program that does not infringe Dell’s intellectual property rights may be used.

The information provided in this document is distributed “AS IS” without any warranty, either expressed or implied. Dell EXPRESSLY DISCLAIMS any warranties of merchantability, fitness for a particular purpose OR INFRINGEMENT. Dell shall have no responsibility to update this information.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any Dell patents or copyrights.

Dell, Inc. 300 Innovative Way Nashua, NH 03063 USA