VMWORLD2012
Inside the Hadoop Machine
Jeff Buell, VMware, Inc.
Richard McDougall, VMware, Inc.
Sanjay Radia, Hortonworks
APP-CAP2956
#vmworldapps
2
Disclaimer
! This session may contain product features that are currently under development.
! This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
! Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
! Technical feasibility and market demand will affect final delivery.
! Pricing and packaging for any new technologies or features discussed or presented have not been determined.
3
Broad Application of Hadoop technology
! Horizontal Use Cases
• Log Processing / Click Stream Analytics
• Machine Learning / sophisticated data mining
• Web crawling / text processing
• Extract Transform Load (ETL) replacement
• Image / XML message processing
• General archiving / compliance
! Vertical Use Cases
• Financial Services
• Mobile / Telecom
• Internet Retailer
• Scientific Research
• Pharmaceutical / Drug Discovery
• Social Media
Hadoop’s ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
4
How does Hadoop enable parallel processing?
Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
! A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
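To make the "simple programming model" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the job. It is plain Python, not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative only. On a real cluster the map and reduce calls would run in parallel on different hosts.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    intermediate = []
    for split in splits:  # each split could be mapped on a different host
        intermediate.extend(map_phase(split))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the elephant runs", "the cluster runs the job"]))
```

The programmer supplies only the map and reduce functions; splitting the input, moving intermediate data, and rerunning failed tasks are left to the framework.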
5
Hadoop System Architecture
! MapReduce: Programming framework for highly parallel data processing
! Hadoop Distributed File System (HDFS): Distributed data storage
6
Job Tracker Schedules Tasks Where the Data Resides
[Figure: a job's input file is divided into Split 1, Split 2, and Split 3 (64 MB each), stored as Block 1, Block 2, and Block 3 (64 MB each) on the DataNodes of Host 1, Host 2, and Host 3. The Job Tracker assigns Task 1, Task 2, and Task 3 to the Task Trackers on those hosts.]
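The idea on this slide, running each task on a host that already stores its block, can be sketched as a small assignment function. This is a simplified illustration and not the JobTracker's actual algorithm (in Hadoop 1 the Task Trackers request work through heartbeats); the function name `schedule` and its arguments are hypothetical.

```python
def schedule(splits, block_locations, free_slots):
    """Assign each split to a host, preferring a host that stores its block.

    splits          -- list of split names
    block_locations -- dict: split name -> list of hosts holding a replica
    free_slots      -- dict: host -> number of free task slots (mutated)
    Returns a dict: split name -> (host, is_local).
    """
    assignment = {}
    for split in splits:
        # Hosts that hold the data and still have a free slot.
        local = [h for h in block_locations.get(split, []) if free_slots.get(h, 0) > 0]
        if local:
            host, is_local = local[0], True
        else:
            # Fall back to the host with the most free slots; data is read remotely.
            candidates = [h for h, n in free_slots.items() if n > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            host, is_local = max(candidates, key=lambda h: free_slots[h]), False
        free_slots[host] -= 1
        assignment[split] = (host, is_local)
    return assignment


if __name__ == "__main__":
    locations = {"split1": ["host1"], "split2": ["host2"], "split3": ["host3"]}
    slots = {"host1": 1, "host2": 1, "host3": 1}
    print(schedule(["split1", "split2", "split3"], locations, slots))
```

With one block and one free slot per host, as in the diagram, every task is data-local. A remote read happens only when the hosts holding the block have no free slots.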
7
Hadoop Distributed File System
8
Hadoop Data Locality and Replication
9
Hadoop Topology Awareness
10
Why Virtualize Hadoop?
! Elastic Scaling
• Shrink and expand cluster on demand
• Resource guarantee
• Independent scaling of compute and data
! Highly Available
• No more single point of failure
• One click to set up
• High availability for MR jobs
! Simple to Operate
• Rapid deployment
• Unified operations across enterprise
• Easy clone of cluster
11
Enterprise Challenges with Using Hadoop
! Deployment
• Slow to provision
• Complex to keep running and tune
! Single Points of Failure
• Single point of failure with NameNode and JobTracker
• No HA for Hadoop framework components (Hive, HCatalog, etc.)
! Low Utilization
• Dedicated clusters to run Hadoop with low CPU utilization
• No easy way to share resources between Hadoop and non-Hadoop workloads
• Noisy neighbors, lack of resource containment
! Need for Multi-tenant Isolation, Resource Management, etc.
• Noisy neighbor: no performance or security isolation between different tenants/users
• Lack of configuration isolation: can't run multiple versions on the cluster
12
Virtualization enables a Common Infrastructure for Big Data
Single purpose clusters for various business applications lead to cluster sprawl.
! Simplify
• Single Hardware Infrastructure
• Unified operations
! Optimize
• Shared Resources = higher utilization
• Elastic resources = faster on-demand access
[Figure: cluster sprawl (separate MPP DB, Hadoop, and HBase clusters) versus cluster consolidation (MPP DB, Hadoop, and HBase sharing one virtualization platform).]
13
Deploy a Hadoop Cluster in under 30 Minutes
Deploy vHelperOVF to vSphere
Select configuration template
Automate deployment
Select Compute, memory, storage and network
Done
Step 1: Deploy Serengeti virtual appliance on vSphere.
Step 2: A few simple commands to stand up Hadoop Cluster.
14
A Tour Through Serengeti
$ ssh serengeti@serengeti-vm
$ serengeti
serengeti>
15
A Tour Through Serengeti
serengeti> cluster create --name myElephant
serengeti> cluster list --name myElephant
name: myElephant, distro: cdh, status:RUNNING
NAME    ROLES                                  INSTANCE  CPU  MEM(MB)  TYPE
---------------------------------------------------------------------------
master  [hadoop_NameNode, hadoop_jobtracker]   1         2    7500     LOCAL  50
name: myElephant, distro: cdh, status:RUNNING
NAME    ROLES                                  INSTANCE  CPU  MEM(MB)  TYPE
---------------------------------------------------------------------------
master  [hive, hadoop_client, pig]             1         1    3700     LOCAL  50
NAME                HOST                              IP
-----------------------------------------------------------------
myElephant-client0  rmc-elephant-009.eng.vmware.com   10.0.20.184
16
A Tour Through Serengeti
$ ssh [email protected]
$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
…
17
Serengeti Spec File
[
  "distro":"apache",                    <- Choice of Distro
  {
    "name": "master",
    "roles": [ "hadoop_NameNode", "hadoop_jobtracker" ],
    "instanceNum": 1,
    "instanceType": "MEDIUM",
    "ha":true,                          <- HA Option
  },
  {
    "name": "worker",
    "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
    "instanceNum": 5,
    "instanceType": "SMALL",
    "storage": {                        <- Choice of Shared Storage or Local Disk
      "type": "LOCAL",
      "sizeGB": 10
    }
  },
]
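The spec on the slide is abbreviated and is not valid JSON as printed. The sketch below builds the same two node groups as a Python dictionary and serializes it to well-formed JSON. The wrapper key `nodeGroups` and the helper names (`node_group`, `build_spec`, `total_instances`) are assumptions made for illustration; the field names inside each group are the ones shown on the slide.

```python
import json


def node_group(name, roles, instance_num, instance_type, ha=None, storage=None):
    """Build one node group using the field names shown on the slide."""
    group = {
        "name": name,
        "roles": roles,
        "instanceNum": instance_num,
        "instanceType": instance_type,
    }
    if ha is not None:
        group["ha"] = ha
    if storage is not None:
        group["storage"] = storage
    return group


def build_spec():
    return {
        "distro": "apache",
        # "nodeGroups" is an assumed wrapper key; the slide shows a bare list.
        "nodeGroups": [
            node_group("master", ["hadoop_NameNode", "hadoop_jobtracker"],
                       1, "MEDIUM", ha=True),
            node_group("worker", ["hadoop_datanode", "hadoop_tasktracker"],
                       5, "SMALL", storage={"type": "LOCAL", "sizeGB": 10}),
        ],
    }


def total_instances(spec):
    """Number of VMs the spec asks for across all node groups."""
    return sum(group["instanceNum"] for group in spec["nodeGroups"])


if __name__ == "__main__":
    spec = build_spec()
    print(json.dumps(spec, indent=2))
    print("VMs requested:", total_instances(spec))
```

Generating the file from code makes it easy to vary `instanceNum` or the storage type per cluster while keeping the output syntactically valid.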
18
Configuring Distros
{
  "name" : "cdh",
  "version" : "3u3",
  "packages" : [
    {
      "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
      "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
      "roles" : ["hive"],
      "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
      "roles" : ["pig"],
      "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
  ]
},
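The distro entry maps each role to the tarball that provides it. A minimal sketch of that lookup follows, assuming the entry sits in a list of distro entries (the trailing comma on the slide suggests this); the function name `tarball_for` is hypothetical and is not part of Serengeti.

```python
# One distro entry, copied from the slide; the enclosing list is an assumption.
MANIFEST = [
    {
        "name": "cdh",
        "version": "3u3",
        "packages": [
            {
                "roles": ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
                          "hadoop_datanode", "hadoop_client"],
                "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz",
            },
            {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
            {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"},
        ],
    },
]


def tarball_for(manifest, distro, role):
    """Return the tarball path that provides `role` in `distro`, or None."""
    for entry in manifest:
        if entry["name"] != distro:
            continue
        for package in entry["packages"]:
            if role in package["roles"]:
                return package["tarball"]
    return None


if __name__ == "__main__":
    for role in ("hadoop_datanode", "hive", "pig"):
        print(role, "->", tarball_for(MANIFEST, "cdh", role))
```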
19
Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
Community Projects Commercial Vendors
• Support major distributions and multiple projects
• Contribute Hadoop Virtualization Extension (HVE) to the open source community
20
Use Local Disk where it’s Needed
Storage type    Cost per GB   $1M gets       IOPS      Bandwidth
SAN Storage     $2 - $10      0.5 petabytes  200,000   8 GB/s
NAS Filers      $1 - $5       1 petabyte     200,000   10 GB/s
Local Storage   $0.05         10 petabytes   400,000   250 GB/s
21
Extend Virtual Storage Architecture to Include Local Disk
! Shared Storage: SAN or NAS
• Easy to provision
• Automated cluster rebalancing
! Hybrid Storage
• SAN for boot images, VMs, other workloads
• Local disk for Hadoop & HDFS
• Scalable bandwidth, lower cost/GB
[Figure: two groups of three hosts, each host running a mix of Hadoop VMs and other VMs, one group per storage model.]
22
Hadoop has Significant Ephemeral Data
[Figure: data flow of a job through map tasks and reduce tasks. DFS input data is read from HDFS (12% of bandwidth) and DFS output data is written to HDFS (12% of bandwidth). Ephemeral data takes 75% of disk bandwidth: spills and logs (spill*.out), map output (file.out), shuffle (map_*.out), sort, and combine (intermediate.out).]
23
Virtualized Hadoop Performance
! Issues of interest
• Native vs various virtual configurations
• Local disks vs Fibre Channel SAN
• Effect of protecting Hadoop master daemons with Fault Tolerance
• Public cloud (renting) vs private cloud (buying)
…24x HP DL380 G7: 2x X5687, 72 GB, 16x SAS 146 GB, Broadcom 10 GbE adapter, Qlogic 8 Gb/s HBA
Arista 7124SX 10 GbE switch
EMC VNX7500
24
Configuration
! Software
• vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
• RHEL 6.1 x86_64
• Cloudera CDH3u4
• Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
! Hadoop VMs
• Processors (16 logical threads), memory (72 GB), disks (12) partitioned among 1, 2, or 4 VMs per host
• Separate VMs for NameNode and JobTracker for storage and FT tests
! Hadoop configuration
• One map and one reduce task per vCPU (= logical thread)
• Machines are highly loaded
• 256 MB block size
• FT tests: 8 to 256 MB block sizes to vary load on NN and JT
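The partitioning rule on this slide can be worked through numerically. The sketch below splits the host resources listed above evenly among 1, 2, or 4 VMs and derives the task slots from "one map and one reduce task per vCPU". The even split and the names (`HOST`, `partition_host`) are assumptions; the memory figure is an upper bound, since the slide does not say how much memory is held back for the hypervisor.

```python
# Per-host resources as listed on the slide.
HOST = {"logical_threads": 16, "memory_gb": 72, "disks": 12}


def partition_host(vms_per_host, host=HOST):
    """Split one host's resources evenly among its Hadoop VMs."""
    for key in ("logical_threads", "disks"):
        if host[key] % vms_per_host:
            raise ValueError(f"{key} does not divide evenly among {vms_per_host} VMs")
    vcpus = host["logical_threads"] // vms_per_host
    return {
        "vcpus": vcpus,
        # Upper bound: ignores memory reserved for the hypervisor.
        "memory_gb_upper_bound": host["memory_gb"] / vms_per_host,
        "disks": host["disks"] // vms_per_host,
        # One map and one reduce task per vCPU, as stated on the slide.
        "map_slots": vcpus,
        "reduce_slots": vcpus,
    }


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "VM(s) per host:", partition_host(n))
```

In Hadoop 1.x the slot counts would normally be set through the TaskTracker properties `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`; the deck does not show the configuration files that were actually used.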
25
Native versus Virtual Platforms, 24 hosts, 12 disks/host
[Figure: bar chart of elapsed time in seconds (lower is better; axis 0 to 450) for TeraGen, TeraSort, and TeraValidate. Series: Native, 1 VM, 2 VMs, 4 VMs.]
26
Local vs Various SAN Storage Configurations
[Figure: bar chart of elapsed time ratio to local disks (lower is better; axis 0 to 4.5) for TeraGen, TeraSort, and TeraValidate. Series: Local disks; SAN JBOD; SAN RAID-0, 16 KB page size; SAN RAID-0; SAN RAID-5.]
16 x HP DL380G7, EMC VNX 7500, 96 physical disks
27
Performance Effect of FT for Master Daemons
! NameNode and JobTracker placed in separate UP VMs
! Small overhead: Enabling FT causes 2-4% slowdown for TeraSort
! 8 MB case places similar load on NN & JT as >200 hosts with 256 MB
[Figure: TeraSort elapsed time ratio to FT off (axis 1 to 1.04) versus HDFS block size in MB (256, 64, 16, 8).]
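Shrinking the block size raises the load on the master daemons because the same data set turns into more blocks for the NameNode to track and more map tasks for the JobTracker to schedule. A rough count is sketched below under stated assumptions: the 1 TB data set is taken as 10^12 bytes in one contiguous file, a block size of N MB means N x 2^20 bytes, and there is one map task per block. The sketch does not attempt to reproduce the ">200 hosts" equivalence from the slide.

```python
DATA_BYTES = 10**12   # assumed size of the 1 TB TeraSort input
MIB = 2**20           # assumed meaning of "MB" for HDFS block sizes


def block_count(block_size_mb, data_bytes=DATA_BYTES):
    """Blocks needed to store the data set, rounding the last block up."""
    block_bytes = block_size_mb * MIB
    return -(-data_bytes // block_bytes)  # integer ceiling division


if __name__ == "__main__":
    baseline = block_count(256)
    for size in (256, 64, 16, 8):
        blocks = block_count(size)
        print(f"{size:>3} MB blocks: {blocks:>7} blocks, "
              f"{blocks / baseline:.1f}x the 256 MB case")
```

Going from 256 MB to 8 MB blocks multiplies the block and task count by roughly 32, which is why the small-block runs stress the FT-protected NameNode and JobTracker VMs much harder.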
28
Different Clouds for Different Folks
! Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts
! Google/MapR: SaaS on Google Compute Engine
! vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4
! Vastly different cluster sizes
• Compare throughput (MB sorted per second) normalized with resources
! Cost: rental or estimate of running continuously for 3 years
System        #cores  #disks  TeraSort, s  MB/s/core  MB/s/disk  Cost
Yahoo!        11680   5840    62           1.3        2.6        ~$7
Google/MapR   5024    1256    80           2.4        9.5        $16
vSphere 5.1   192     192     442          11.2       11.2       ~$2
vSphere 5.1   192     288     359          13.8       9.2        ~$2
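The normalized throughput columns can be recomputed from the core counts, disk counts, and TeraSort times in the table. The sketch below assumes each run sorted 10^12 bytes and that "MB" means 2^20 bytes; neither is stated on the slide, but together they reproduce the published figures to one decimal place. The names `RUNS` and `normalized` are illustrative.

```python
DATA_BYTES = 10**12   # assumed: each run sorted 1 TB = 10^12 bytes
MIB = 2**20           # assumed: "MB" in the table means 2^20 bytes

# (system, cores, disks, TeraSort elapsed seconds), copied from the table
RUNS = [
    ("Yahoo!", 11680, 5840, 62),
    ("Google/MapR", 5024, 1256, 80),
    ("vSphere 5.1", 192, 192, 442),
    ("vSphere 5.1", 192, 288, 359),
]


def normalized(cores, disks, seconds):
    """Return (MB/s per core, MB/s per disk), rounded to one decimal."""
    mb_per_second = DATA_BYTES / MIB / seconds
    return round(mb_per_second / cores, 1), round(mb_per_second / disks, 1)


if __name__ == "__main__":
    for system, cores, disks, seconds in RUNS:
        per_core, per_disk = normalized(cores, disks, seconds)
        print(f"{system:<12} {per_core:>5} MB/s/core {per_disk:>5} MB/s/disk")
```

Normalizing by cores and disks is what makes a 24-host cluster comparable with clusters of more than a thousand hosts: the smaller cluster is slower in absolute terms but extracts more throughput from each core.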
29
30
VMware-Hortonworks Joint Engineering
! Hortonworks goal
• Expand Hadoop ecosystem
• Provide first class support of various platforms
• Hadoop should run well on VMs
• VMs offer several advantages as presented earlier
• Take advantage of vSphere for HA
! First class support for VMs
• Topology plugins (HADOOP-8468)
  • 2 VMs can be on same host
  • Pick closer data
  • Schedule tasks closer
  • Don't put two replicas on same host
• MR-tmp on HDFS using block pools
  • Elastic Compute-VMs will not need local disk
• Fast communications within VMs
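The rule "don't put two replicas on same host" matters once several DataNode VMs share one physical machine, because losing that machine would otherwise lose multiple replicas at once. The sketch below illustrates the constraint only. It is not the HADOOP-8468 implementation, it ignores racks and load, and the names `place_replicas` and `vm_to_host` are hypothetical.

```python
def place_replicas(vm_to_host, writer_vm, replication=3):
    """Pick DataNode VMs for a block so that no two replicas share a physical host.

    vm_to_host -- dict: DataNode VM name -> physical host name
    writer_vm  -- VM that writes the block; it receives the first replica
    """
    targets = [writer_vm]
    used_hosts = {vm_to_host[writer_vm]}
    for vm in sorted(vm_to_host):
        if len(targets) == replication:
            break
        host = vm_to_host[vm]
        if host not in used_hosts:  # skip VMs co-located with an existing replica
            targets.append(vm)
            used_hosts.add(host)
    if len(targets) < replication:
        raise ValueError("not enough distinct physical hosts for the replication factor")
    return targets


if __name__ == "__main__":
    topology = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB",
                "vm4": "hostB", "vm5": "hostC"}
    print(place_replicas(topology, "vm1"))
```

A placement policy that sees only VMs would treat vm1 and vm2 as independent nodes. Telling Hadoop which VMs share a host is what lets it avoid that choice, and the same information lets readers and the scheduler prefer data on a co-located VM.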
31
Hadoop Full-Stack High Availability
[Figure: an HA cluster of three servers hosts the master daemons (NN, JT) with N+K failover. Labels: failover of the NN; JT into safemode; jobs from apps running outside; slave nodes of the Hadoop cluster.]
32
HA is in HDP 1.0 Using Total System Availability Architecture
33
HA in Hadoop 1 with HDP1
! Full Stack High Availability
• NameNode
  • Clients pause automatically
  • JobTracker pauses automatically
• Other Hadoop master services (JT, …) coming
! Use industry proven HA framework
• VMware vSphere HA
  • Failover, fencing, …
  • Corner cases are tricky; if not addressed, they lead to corruption
• Additional benefits:
  • N-N & N+K failover
  • Migration for maintenance
34
Hadoop NN/JT HA with vSphere
35
Namenode Failover Times
! 60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes
• Failure detection and failover: 0.5 to 2 minutes
• NameNode startup (exit safemode): 30 sec
! 180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes
• Failure detection and failover: 0.5 to 2 minutes
• NameNode startup (exit safemode): 110 sec
For vSphere, OS bootup is needed; its 10-20 seconds is included above.
Cold failover is good enough for small/medium clusters
Failure detection and automatic failover dominates
36
Summary
! Advantages of Hadoop on VMs
• Cluster management
• Cluster consolidation
• Greater elasticity in mixed environment
• Alternate multi-tenancy to capacity scheduler's offerings
! HA for Hadoop Master Daemons
• vSphere based HA for NN, JT, … in Hadoop 1
• Total System Availability Architecture
37
38
Elastic Scaling and Multi-tenancy of Hadoop on vSphere
Current Hadoop: Combined Storage/Compute
1. Hadoop in VM
• Single Tenant
• Fixed Resources
2. Separate Compute and Data
• Single Tenant
• Elastic Compute
3. Multiple Clusters
• Multiple Tenants
• Elastic Compute
[Figure: VMs arranged over Compute and Storage layers for each of the three stages, with tenants T1 and T2 in the third.]
39
Virtual Hadoop Node
Datanode
Separated Compute and Data
Virtualization Host
Virtual Hadoop Node
Other Workload
VMDK
Task Tracker
Slot Slot
Virtual Hadoop Node
VMDK
Task Tracker
Slot Slot
Virtual Hadoop Node
Virtual Hadoop Node
Task Tracker
Slot Slot
Truly Elastic Hadoop: Scalable through virtual nodes
40
References
www.projectserengeti.org
www.hortonworks.com
www.cloudera.com
Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301
MapR/Google blog: www.mapr.com/blog/google-mapr