
Turbo-Charging Open Source Hadoop - Event … · Turbo-Charging Open Source Hadoop for Faster, more Meaningful Insights ... •20 year history in high-performance computing Powers


Turbo-Charging Open Source Hadoop for Faster, more Meaningful Insights

Gord Sissons Senior Manager, Technical Marketing IBM Platform Computing [email protected]


Agenda

• Some Context – IBM Platform Computing

• Low-latency scheduling meets open-source

• Breakthrough performance

• Multi-tenancy (for real!)

• Cluster-sprawl - The elephant in the room

• Sidestep the looming challenges


• Acquired by IBM in 2012

• 20 year history in high-performance computing

• 2000+ global customers

• 23 of 30 largest enterprises

• High-performance, mission-critical, extreme scale

• Comprehensive capability

De facto standard for commercial high-performance computing

IBM Platform Computing

Powers financial analytics grids for 60% of top investment banks

Over 5 million CPUs under management

Breakthrough performance in Big Data analytics


Technical Computing - HPC

• Platform LSF Family: Scalable, comprehensive workload management for demanding heterogeneous environments

• Platform HPC: Simplified, integrated HPC management software bundled with systems

Analytics Infrastructure Software

• Platform Symphony Family: High-throughput, low-latency compute and data intensive analytics

• An SOA infrastructure for analytics

• Extreme performance and scale

• Complex Computations (e.g., risk)

• Big Data Analytics via MapReduce


• Financial firms compete on their ability to maximize use of capital

• Monte-Carlo simulation is a staple technique for simulating market outcomes

• Underlying instruments are increasingly complex

• A crush of new regulation

Our worldview – shaped by time critical analytics

Compute this over 5,000 market scenarios comprised of 200 risk factors over 10 years for all instruments and all portfolios – NOW!
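
To give a sense of what such a request involves, here is a minimal Monte Carlo sketch. The scenario and risk-factor counts come from the slide; the linear exposure model, the standard-normal shocks, and the function names are illustrative assumptions and do not describe any Platform Symphony API.

```python
import random

def simulate_portfolio_pnl(sensitivities, n_scenarios, seed=0):
    """Return one simulated P&L value per market scenario.

    sensitivities: hypothetical per-risk-factor exposures (a linear
    model, assumed here purely for illustration).
    """
    rng = random.Random(seed)
    pnl = []
    for _ in range(n_scenarios):
        # Draw one standard-normal shock per risk factor for this scenario.
        shocks = [rng.gauss(0.0, 1.0) for _ in sensitivities]
        pnl.append(sum(s * x for s, x in zip(sensitivities, shocks)))
    return pnl

def value_at_risk(pnl, confidence=0.99):
    """Loss threshold exceeded in roughly (1 - confidence) of scenarios."""
    ordered = sorted(pnl)
    index = int((1.0 - confidence) * len(ordered))
    return -ordered[index]

N_SCENARIOS = 5000       # from the slide
N_RISK_FACTORS = 200     # from the slide
exposures = [1.0] * N_RISK_FACTORS   # assumed unit exposure per factor
pnl = simulate_portfolio_pnl(exposures, N_SCENARIOS)
var_99 = value_at_risk(pnl)
```

Each scenario is computed independently of the others, which is why this kind of workload spreads naturally across a grid.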


• A heterogeneous grid management platform

• A high-performance SOA middleware environment

• Supports diverse compute & data intensive applications

• ISV applications – Many applications in this space are open source

• In-house developed applications (C/C++, C#/.NET, Java, Excel, R etc.)

• Support for Linux / Power Linux, Windows + other OS

• React instantly to time-critical requirements

• A multi-tenant shared services platform

• Implements a fully compatible MapReduce run-time for open-source Hadoop

IBM Platform Symphony


[Figure: MapReduce data flow. Input Files (split 0 to split 5) → Map Phase → Intermediate Files → Reduce Phase → Output Files (output 0 to output 2). Also shown: Client (C) and Master (M).]

De facto “Big Data” standard

• Pioneered at Google / Yahoo!

• Framework for writing applications to rapidly process vast datasets

• More cost effective than traditional data warehouse / BI infrastructure

• Dramatic performance gains

• Java based

• From our perspective: Just another distributed computing problem

Hadoop MapReduce
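
The map, intermediate, and reduce phases described on this slide can be sketched in a few lines. This is a single-process illustration of the programming model using a word count; it is not Hadoop or Symphony code, and the function names are made up for the example.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(mapped_pairs):
    """Group intermediate values by key, as happens between the phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(splits):
    intermediate = []
    for split in splits:   # in a cluster, each split could go to a different node
        intermediate.extend(map_phase(split))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = run_job(["big data big insights", "open source hadoop", "big clusters"])
```

Because map calls only see their own split and reduce calls only see their own key, the scheduler is free to place each task wherever capacity is available.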


[Figure: IBM Platform Symphony (Resource Orchestration, Workload Manager) running workloads A, B, C and D on a shared pool of resources. Labels: Various open-source & commercial apps; IBM InfoSphere BigInsights; Open Source Hadoop.]


[Figure: IBM Big Data Platform stack]

Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics

Big Data Platform: Systems Management, Application Development, Visualization & Discovery, Accelerators, Information Integration & Governance, Data Warehouse, Hadoop System, Stream Computing

Agile, multi-tenant shared infrastructure

IBM InfoSphere BigInsights & Platform Symphony

• Comprehensive platform

• Data at rest, data in motion

• Extensive library of data connectors

• Rich development tools

• Application accelerators

• Web-based management console


Big problems demand big infrastructure

• Exploit threads

• Power 7+ - 4 threads per core vs. 2 threads per core

• High Throughput

• Extreme memory and I/O bandwidth

• Better Java implementation

• Optimized JVM on Power 7+

• Superior I/O

• Massive I/O bandwidth

• Parallel file system

• Your choice of HDFS or GPFS

• Ideal match for Apache Hadoop MapReduce framework

• Massively parallel processing across Linux clusters


PERFORMANCE


Berkeley SWIM

• “Real-world” MapReduce benchmark – synthesize and replay captured real-world workloads

• Developed by Yanpei Chen and others at UCB - https://github.com/SWIMProjectUCB/SWIM/wiki

• Viewed as an advance over existing synthetic MapReduce benchmarks including GridMix2, PigMix, Hive BM etc.

• Represents workloads comprised of short, large and huge jobs stressing disk, network IO, CPU and memory

• Promoted by Cloudera – advantages of SWIM promoted at Hadoop World 2011 - http://www.slideshare.net/cloudera/hadoop-world-2011-hadoop-and-performance-todd-lipcon-yanpei-chen-cloudera


Benchmark: SWIM: Facebook 2010 Workload

[Chart: elapsed time in seconds (axis 0 to 8000), Symphony 6.1 vs. Hadoop 1.0.1, three bars each.]

>7x FASTER!


• Open-source software for De Novo Genome Assembly – key contributors are Jeremy Lewi, Avijit Gupta, Ruschil Gupta, Michael Schatz and others

• Sequencing large genomes is too large a problem for conventional algorithms

• It turns out that the de Bruijn graph fundamental to genome sequencing is readily represented as key-value pairs – ideal for processing with MapReduce

• Contrail runs a pipeline where each pipeline stage is implemented as a MapReduce job to exploit parallelism

http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail

Benchmark: Contrail
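
The key-value representation mentioned above can be illustrated with a small sketch: a map step emits one graph edge per k-mer, and a reduce step groups edges by source node. This is a generic textbook construction, not Contrail's actual code; the reads, the value of k, and the function names are made up for illustration.

```python
from collections import defaultdict

def map_read(read, k):
    """Emit (prefix, suffix) pairs: one de Bruijn edge per k-mer in the read."""
    pairs = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        pairs.append((kmer[:-1], kmer[1:]))   # edge between two (k-1)-mers
    return pairs

def reduce_edges(pairs):
    """Group by source node to collect each node's set of successors."""
    graph = defaultdict(set)
    for source, target in pairs:
        graph[source].add(target)
    return graph

reads = ["ACGTAC", "GTACGG"]   # hypothetical reads
pairs = [p for read in reads for p in map_read(read, k=3)]
graph = reduce_edges(pairs)
```

Here the node `CG` ends up with two successors, `GT` and `GG`, because the two reads diverge after it; resolving such branches is the job of later pipeline stages.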


Benchmark: Contrail

2.3x FASTER!


Hardware

Cluster: 1 PowerLinux P7+ master node, 9 PowerLinux P7+ slave nodes
CPU: 16 processor cores per server (128 total)
Memory: 128 GB per server (1280 total)
Internal Storage: 6 x 600 GB internal SAS drives per server (36 TB total)
Storage Expansion: 24 x 600 GB SAS drives in IBM EXP24S SFF Gen2-bay Drawer, per server (144 TB total)
Network: 2 x 10GbE connections per server
Switch: BNT BLADE RackSwitch G8264

Software

OS: Red Hat Enterprise Linux 6.2
Java: IBM Java 64-bit Version 7 SR1
HDFS: Hadoop v1.1.3 (1 node as NameNode and 9 nodes as DataNodes)
Platform Symphony MapReduce: Advanced Edition 6.1.0.1 (1 node as Management Host and 9 nodes as Compute Hosts)

Record Terasort results on Power 7+

[Chart: normalized sorting rate per core. Values shown: 0.54 GB/min/core and 1.01 GB/min/core. Systems compared: 9-node Power 7+ cluster (Hadoop 1.1.3, Symphony) and 18-node Sandy Bridge cluster (Intel E5-2667, Cloudera).]

IBM internal unaudited result – details of Intel system benchmark at http://www.hp.com/hpinfo/newsroom/press_kits/2012/HPDiscover2012/Hadoop_Appliance_Fact_Sheet.pdf

90% FASTER!


Understanding the advantage

Symphony 6.1 can schedule ~50x more tasks per second

Raw Scheduler Performance (tasks per second)

Hadoop 0.20.2: 3.3
Hadoop: 30.3
Symphony 6.1: 1516

Hadoop results taken from Hadoop World 2011 performance presentation, Lipcon & Chen

50x FASTER!
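
The ~50x figure follows directly from the tasks-per-second numbers on this slide. A quick check, using only those numbers (the dictionary keys mirror the chart labels; the second Hadoop entry has no version on the slide):

```python
tasks_per_second = {
    "Hadoop 0.20.2": 3.3,
    "Hadoop": 30.3,          # version not given on the slide
    "Symphony 6.1": 1516.0,
}

symphony = tasks_per_second["Symphony 6.1"]
# Ratio of Symphony's scheduling rate to each Hadoop result.
speedups = {
    name: symphony / rate
    for name, rate in tasks_per_second.items()
    if name != "Symphony 6.1"
}
```

The ratio against the 30.3 tasks/sec result is about 50, which matches the headline; against the 0.20.2 result it is roughly 459.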


Other Grid Server (Client, Broker, Engines)

Each engine polls the broker ~5 times per second (configurable); work is sent when an engine is ready.

Time spent per task:

• Serialize input data

• Network transport (client to broker)

• Broker compute time

• Wait for engine to poll broker

• Network transport (broker to engine)

• De-serialize input data

• Compute result

• Serialize result

• Post result back to broker

IBM Platform Symphony is (much) faster because:

• Efficient C language routines use CDR (common data representation) and IOCP rather than slow, heavyweight XML data encoding

• Network transit time is reduced by avoiding the text-based HTTP protocol and encoding data in the more compact CDR binary format

• Processing time for all Symphony services is reduced by using a native HPC C/C++ implementation for system services rather than Java

• Platform Symphony has a more efficient “push model” that entirely avoids the architectural problems with polling

Platform Symphony (Client, SSM, Engines)

Time spent per task:

• Serialize input

• Network transport

• SSM compute time & logging

• Network transport (SSM to engine)

• De-serialize

• Compute result

• Serialize

• Network transport (engine to SSM)

No wait time due to polling, faster serialization/de-serialization, more network-efficient protocol
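
To see why polling alone adds latency, consider a minimal model of the wait a task incurs before the next poll. The ~5 polls per second comes from the slide; the assumption that work arrives at a uniformly random point within a poll interval, and the function name, are illustrative.

```python
import random

def mean_polling_wait(polls_per_second, n_tasks=100_000, seed=0):
    """Average time a task waits at the broker for the next engine poll.

    Assumes each task arrives at a uniformly random moment within a
    poll interval (an illustrative simplification).
    """
    rng = random.Random(seed)
    interval = 1.0 / polls_per_second
    total = 0.0
    for _ in range(n_tasks):
        arrival = rng.uniform(0.0, interval)   # time since the last poll
        total += interval - arrival            # wait until the next poll
    return total / n_tasks

polling_wait = mean_polling_wait(polls_per_second=5)   # ~5 polls/sec, per the slide
push_wait = 0.0   # push model: no poll interval to wait out
```

Under these assumptions the average wait is half the poll interval, about 0.1 seconds per task, which is significant when the tasks themselves are short.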


• C++ native code

• Optimized binary network protocols

• Fast object serialization

• JVM pre-start & re-use

• Generic slots enabling full cluster utilization

• Efficient push-based scheduling model

• Uses Symphony common data for JAR transport

• Shuffle-stage optimizations

• Intelligent pre-emption

Many performance optimizations


MULTITENANCY


“I need an updated counterparty credit risk analysis for the final earnings report by 2:00 pm”

“I wonder if teenagers in California still think red shoes are cool?”

Different workloads demand different SLAs


[Figure: four separate clusters, labeled A, B, C and D, each dedicated to one workload: Big Data App #1, Big Data App #2, Big Data App #3, and an existing analytic workload (SPSS).]

Cluster Sprawl - Silos of underutilized, incompatible clusters


[Figure: one shared cluster under Resource Orchestration and a Workload Manager, with resources labeled A, B, C and D allocated among Big Data App #1, Big Data App #2, Big Data App #3, and the existing analytic workload (SPSS).]

Dynamic resource sharing among heterogeneous tenants


Agile sharing at run-time while preserving ownership and application SLAs

Ensuring SLAs is critical
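
One way to picture sharing that preserves ownership is a simple lend-and-reclaim allocation: each tenant is guaranteed the slots it owns, and idle slots are lent to tenants with excess demand. The sketch below is a hypothetical model with made-up slot counts; it does not represent Platform Symphony's actual scheduling policy or configuration.

```python
def allocate(owned, demand):
    """Share slots among tenants while honoring ownership.

    owned:  slots each tenant owns (its guaranteed share)
    demand: slots each tenant currently wants
    Each tenant first receives min(owned, demand); slots left idle are
    then lent to tenants whose demand exceeds what they own.
    """
    allocation = {t: min(owned[t], demand.get(t, 0)) for t in owned}
    idle = sum(owned.values()) - sum(allocation.values())
    for tenant in sorted(owned):               # deterministic lending order
        unmet = demand.get(tenant, 0) - allocation[tenant]
        lend = min(unmet, idle)
        if lend > 0:
            allocation[tenant] += lend
            idle -= lend
    return allocation

owned = {"A": 16, "B": 16, "C": 16, "D": 16}   # hypothetical slot counts
# B and C are mostly idle, so A borrows their unused slots.
quiet = allocate(owned, {"A": 40, "B": 4, "C": 0, "D": 16})
# B and C become busy again, so A falls back to its owned share.
busy = allocate(owned, {"A": 40, "B": 16, "C": 16, "D": 16})
```

Recomputing the allocation whenever demand changes models reclaim: a borrower never holds slots that an owner currently needs.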


Multiple deployment options

• Pure Data for Hadoop appliances

• Power and Intel based Big Data Reference Architectures

• Your choice of distribution

IBM BigInsights, Cloudera, MapR, Apache, Hortonworks etc.

• Your choice of file system

HDFS or GPFS


Summing up

A unique solution for open-source Big Data Analytics

• Exceptional performance

• Lower infrastructure cost

• Multiple Hadoop distributions

• Simplified application life cycle management

• Sophisticated multi-tenancy

• Optional GPFS file system


Next steps

• Review the benchmarks

• Take the TCO challenge

• Contact us

Gord Sissons - [email protected]

http://www.ibm.com/platformcomputing/products/symphony/

http://www.ibm.com/platformcomputing/products/symphony/highperfhadoop.html

http://www-03.ibm.com/systems/power/software/linux/powerlinux/

http://bigdatauniversity.com
