© 2013 IBM Corporation Hadoop – It’s Not Just Internal Storage John Sing, Executive Consultant IBM Systems and Technology Group Session 1185A Tuesday, June 11, 2013 11 June 2013

Hadoop_Its_Not_Just_Internal_Storage_V14


DESCRIPTION

John Sing's Edge 2013 presentation, detailing when, where, and how external storage products and/or system software (e.g. GPFS) can be used effectively in a Hadoop storage environment. Many Hadoop situations absolutely require direct attached storage. However, there are many situations where shared external storage makes sense in a Hadoop environment. This presentation details how, why, and where, and promotes taking an intelligent, Hadoop-aware approach to deciding between internal storage and external shared storage. Full awareness of Hadoop considerations is essential to selecting either internal or external shared storage in a Hadoop environment.


Page 1: Hadoop_Its_Not_Just_Internal_Storage_V14


Hadoop – It’s Not Just Internal Storage

John Sing, Executive Consultant, IBM Systems and Technology Group

Session 1185A, Tuesday, June 11, 2013

Page 2: Hadoop_Its_Not_Just_Internal_Storage_V14

IBM Storage Solutions for Big Data

John Sing: 31 years of experience with IBM in high-end servers, storage, and software

– 2009-Present: IBM Executive Strategy Consultant: IT Strategy and Planning, Enterprise Large Scale Storage, Internet Scale Workloads and Data Center Design, Big Data Analytics, HA/DR/BC

– 2002-2008: IBM IT Data Center Strategy, Large Scale Systems, Business Continuity, HA/DR/BC, IBM Storage

– 1998-2001: IBM Storage Subsystems Group - Enterprise Storage Server Marketing Manager, Planner for ESS Copy Services (FlashCopy, PPRC, XRC, Metro Mirror, Global Mirror)

– 1994-1998: IBM Hong Kong, IBM China Marketing Specialist for High-End Storage

– 1989-1994: IBM USA Systems Center Specialist for High-End S/390 processors

– 1982-1989: IBM USA Marketing Specialist for S/370, S/390 customers (including VSE and VSE/ESA)


You may follow my daily IT research blog

– http://www.delicious.com/atsf_arizona

You may follow me on Slideshare.net:

– http://www.slideshare.net/johnsing1

My LinkedIn:

– http://www.linkedin.com/in/johnsing

Page 3: Hadoop_Its_Not_Just_Internal_Storage_V14


Agenda

Understanding today’s Hadoop environments

– Hadoop architecture, usage cases, deployments

– Hadoop design, performance, and cost considerations

Differing Hadoop perspectives: Applications/Business Line vs. Operations

– Understanding implications of direct attached storage (DAS) vs. shared storage

Intelligently choosing Hadoop storage solutions

– Usage cases where Direct Attached Storage makes sense

– Intelligent usage cases where shared storage makes sense

– Future evolution of storage, Hadoop, and the intersection of the two

IBM Hadoop, Storage, Big Data hardware and software components, tools, offerings


Page 4: Hadoop_Its_Not_Just_Internal_Storage_V14


Understanding today’s Hadoop environments

Hadoop – It’s Not Just Internal Storage

Page 5: Hadoop_Its_Not_Just_Internal_Storage_V14


What is Hadoop?

Instead of the traditional IT computation model:

Which brings the data to the function/program on application server

Loads data into memory on an application server and processes it

Unfortunately, this doesn’t scale for internet-scale Big Data problems

Apache Hadoop: open source framework for data-intensive applications

Inspired by Google technologies (MapReduce, GFS)

Well-suited to batch-oriented, read-intensive applications

Yahoo! adopted these technologies and open sourced them into the Apache Hadoop project

Hadoop has become a pervasive enabler of internet-scale applications, working with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner

CPU + disks of commodity storage = Hadoop “node”

Hadoop nodes today are running mission-critical production in massive clusters

Tens of thousands of servers

New nodes can be added as needed to the cluster, without changing:

Data formats, how data is loaded, how jobs are written


Tutorials: http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

Page 6: Hadoop_Its_Not_Just_Internal_Storage_V14


The World of Hadoop: worldwide usage

eBay

LinkedIn

Yahoo!

Facebook

New York Times

Many, many more…

http://www.datanami.com/datanami/2012-04-26/six_super-scale_hadoop_deployments.html

One source for Hadoop users (but not the only one!): http://wiki.apache.org/hadoop/PoweredBy

Page 7: Hadoop_Its_Not_Just_Internal_Storage_V14


Hadoop is today a well-developed ecosystem

Hadoop

– Overall name of software stack

HDFS

– Hadoop Distributed File System

MapReduce

– Software compute framework

• Map = queries

• Reduce = aggregates answers

Hive

– Hadoop-based data warehouse

Pig

– Hadoop-based language

HBase

– Non-relational database, fast lookups

Flume

– Populate Hadoop with data

Oozie

– Workflow processing system

Whirr

– Libraries to spin up Hadoop on Amazon EC2, Rackspace, etc.

Avro

– Data serialization

Mahout

– Data mining

Sqoop

– Connectivity to non-Hadoop data stores

BigTop

– Packaging / interop of all Hadoop components

http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond

http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/

http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

Page 8: Hadoop_Its_Not_Just_Internal_Storage_V14


Hadoop vendor ecosystem today

http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://www.forbes.com/special-report/2013/industry-atlas.html

Page 9: Hadoop_Its_Not_Just_Internal_Storage_V14


Why understand the Hadoop stack and environment?

Hadoop is being used for much more than just internet-scale Big Data analytics

Hadoop is increasingly being used by enterprises for inexpensive data storage

– As an industry we’re strongly exploiting a much wider variety of data types

– With tools like Hadoop, it’s become affordable to ingest, analyze, have available an internet-scale “Big Landing Zone” Hadoop cluster for storing data

• Previously not viable to keep online

– Hadoop cluster also then can run internet-scale analytics on this data

– Significant driver: move to Hadoop to reduce traditional database licensing costs

Storage industry dynamics:

– Today, JBOD storage in a server chassis might be as low as 4-6 cents/raw GB

• At these prices, adding 50TB usable to a Hadoop cluster might only cost $10K in total, including server

• Even at typical Hadoop 3X copies, this is still less initial cost than enterprise storage at 26 cents/GB

• Not saying this includes all factors, but these dynamics clearly affect the decision

– And then, there’s flash storage coming…..

Must understand full depth of the Hadoop environment and storage industry dynamics:

– In order to decide if/when/where Hadoop internal storage or shared storage is appropriate
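The storage price comparison on this slide can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration using the per-GB prices quoted above; the class and method names are hypothetical, and it assumes 1 TB = 1,000 GB, a single usable copy on enterprise storage, and no allowance for servers, RAID overhead, networking, or operating costs.

```java
// Back-of-the-envelope media cost comparison using the per-GB prices quoted on the slide.
// Assumptions (not from the slide): 1 TB = 1,000 GB, enterprise storage holds one usable
// copy, and servers, RAID overhead, networking and operating costs are ignored.
class StorageCostSketch {

    /** Raw capacity in GB needed to hold usableGb with the given number of copies. */
    static long rawGb(long usableGb, int copies) {
        return usableGb * copies;
    }

    /** Media cost in whole US cents for rawGb at centsPerGb. */
    static long costCents(long rawGb, long centsPerGb) {
        return rawGb * centsPerGb;
    }

    public static void main(String[] args) {
        long usableGb = 50_000;                 // 50 TB usable
        long hadoopRaw = rawGb(usableGb, 3);    // typical Hadoop 3X copies
        System.out.println("JBOD at 4 cents/GB:        $" + costCents(hadoopRaw, 4) / 100);
        System.out.println("JBOD at 6 cents/GB:        $" + costCents(hadoopRaw, 6) / 100);
        System.out.println("Enterprise at 26 cents/GB: $" + costCents(rawGb(usableGb, 1), 26) / 100);
    }
}
```

Under these assumptions, three copies on JBOD come to roughly $6,000 to $9,000 of media, against about $13,000 for a single copy on enterprise storage, which is in line with the slide's point about initial cost.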

Page 10: Hadoop_Its_Not_Just_Internal_Storage_V14


Why Hadoop was created for Big Data

Traditional approach: Move data to program

Big Data approach: Move function/programs to data

[Diagram, two panels. Traditional approach: a user request reaches the application server, which queries the database server; the data is returned, processed, and the result is sent back. Big Data approach: a user request reaches the master node, which sends the function to the data nodes; each data node queries and processes its own data, and a consolidated result is sent back.]

Traditional approach

– Application server and Database server are separate

– Data can be on multiple servers

– Analysis Program can run on multiple Application servers

– Network is still in the middle

– Data has to go through network

Big Data approach

– Analysis Program runs where the data is: on the Data Node

– Only the Analysis Program has to go through the network

– Analysis Program needs to be MapReduce aware

– Highly scalable: 1000s of nodes, petabytes and more

Thank you to: Pascal VEZOLLE/France/IBM@IBMFR and Francois Gibello/France/IBM for the use of this slide

Page 11: Hadoop_Its_Not_Just_Internal_Storage_V14


Example of Hadoop in action

[Diagram, two panels. Traditional approach (move data to program): user request, application server, database server, query data, return data, process data, send result. Big Data approach (move program to data): user request, master node, send function to process on data, data nodes query and process data, send consolidated result.]

Example:

How many hours does Clint Eastwood appear in all the movies he has made?

All movies need to be parsed to find Clint’s face

• Traditional approach: All movies are uploaded to the application server, through the network

• Big Data approach: The Analysis Program and a copy of Clint’s picture are downloaded to the data nodes, through the network

Thank you to: Pascal VEZOLLE/France/IBM@IBMFR and Francois Gibello/France/IBM for the use of this slide

Page 12: Hadoop_Its_Not_Just_Internal_Storage_V14


Hadoop principles: Storage, HDFS and MapReduce

Hadoop Distributed File System = HDFS: where Hadoop stores the data

– HDFS file system spans all the nodes in a cluster with locality awareness

Hadoop data storage, computation model

– Data stored in a distributed file system, spanning many inexpensive computers

– Send function/program to the data nodes

– i.e. distribute application to compute resources where the data is stored

– Scalable to thousands of nodes and petabytes of data

MapReduce Application

1. Map Phase (break job into small parts)

2. Shuffle (transfer interim output for final processing)

3. Reduce Phase (boil all output down to a single result set)

Return a single result set

public static class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emit (word, 1) for every token in the input line.
  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();

  // Sum the counts emitted for each word.
  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();

. . .


Distribute map tasks to cluster

Hadoop Data Nodes

Data is loaded, spread, resident in Hadoop cluster

Performance = tuning MapReduce workflow, network, application, servers, and storage

http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/ http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/allenwittenauer/2012-lihadoopperf
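To make the Map, Shuffle, and Reduce phases on this slide concrete, here is a minimal single-process sketch of the same word count in plain Java. It deliberately uses no Hadoop classes, so nothing here is the Hadoop API; the class and method names are hypothetical, and a real job would run the map and reduce steps in parallel across data nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of the three phases named on the slide. No Hadoop classes are
// used; the class and method names are hypothetical.
class MapShuffleReduceSketch {

    /** Map phase: turn one input line into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle: group the interim values by key so each key is reduced once. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Run all three phases over the input lines and return the single result set. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> interim = new ArrayList<>();
        for (String line : lines) {
            interim.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(interim).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or not", "to be")));
    }
}
```

Keeping the three phases as separate methods mirrors the slide: only `map` and `reduce` are application logic, while the grouping in `shuffle` is what the framework does between them.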

Page 13: Hadoop_Its_Not_Just_Internal_Storage_V14


Big Data Hadoop system architecture

[Diagram: management nodes for Hadoop and the cluster (Namenode nodes, JobTracker nodes) and data nodes with local disks, connected by a 1-10Gb Ethernet or InfiniBand network]

IO performance = type and # of disks

Reference architecture: 12-24 disks, ~1.5GB/s, >35TB, 12-16 CPUs per datanode

Hadoop Distributed File System (HDFS)

• HDFS stores data across multiple data nodes; the Namenode knows where the data is

• HDFS assumes data nodes and disks will fail, so it achieves reliability by replicating data across multiple data nodes (typically 3 or more)

• The HDFS file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS

• The HDFS Name Node is a single point of failure

Scaling granularity: data node, scaling both IO and CPU

Locality awareness

Note: any other location for data adds network latency
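The replication behaviour described on this slide can be illustrated with a toy model: a name-node-like map records which data nodes hold each block, and each block is copied to several distinct nodes so that one node failure loses no data. This is a sketch only, with hypothetical names and a simple round-robin choice; it is not the actual HDFS placement algorithm, which also takes racks into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of block replication: a name-node-like map records which data nodes hold each
// block. Illustration only, not the HDFS placement algorithm; all names are hypothetical.
class ReplicaPlacementSketch {
    private final List<String> dataNodes;
    private final int replication;
    private final Map<String, List<String>> blockMap = new LinkedHashMap<>(); // block -> holders

    ReplicaPlacementSketch(List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("need at least one data node per replica");
        }
        this.dataNodes = new ArrayList<>(dataNodes);
        this.replication = replication;
    }

    /** Place a block on `replication` distinct nodes, chosen round-robin. */
    List<String> addBlock(String blockId) {
        int start = blockMap.size() % dataNodes.size();
        List<String> holders = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            holders.add(dataNodes.get((start + i) % dataNodes.size()));
        }
        blockMap.put(blockId, holders);
        return holders;
    }

    /** Nodes that can still serve the block after the given node has failed. */
    List<String> survivors(String blockId, String failedNode) {
        List<String> alive = new ArrayList<>(blockMap.get(blockId));
        alive.remove(failedNode);
        return alive;
    }

    public static void main(String[] args) {
        ReplicaPlacementSketch nameNode =
            new ReplicaPlacementSketch(Arrays.asList("dn1", "dn2", "dn3", "dn4"), 3);
        System.out.println(nameNode.addBlock("blk_0"));          // [dn1, dn2, dn3]
        System.out.println(nameNode.addBlock("blk_1"));          // [dn2, dn3, dn4]
        System.out.println(nameNode.survivors("blk_1", "dn3"));  // [dn2, dn4]
    }
}
```

Note that the block map itself lives in one place in this sketch, which is the same structural weakness the slide points out for the HDFS Name Node.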

Page 14: Hadoop_Its_Not_Just_Internal_Storage_V14


Differing Hadoop storage perspectives

Hadoop – It’s Not Just Internal Storage

Page 15: Hadoop_Its_Not_Just_Internal_Storage_V14


Understanding Hadoop rationale for Direct Attached Storage (latency)

Primary Hadoop design goal is affordability at internet scale:

Data is loaded into Hadoop cluster with data locality

– Spreading data across Data Nodes

– Achieve lowest disk latency through direct attached storage

Send programs to data (not other way around)

– Data in general does not move within the Hadoop cluster

Key performance components: disk latency, network interconnect, utilization, bandwidth

Based on low capital expenditure, low cost commodity components

– Goal: lowest capital cost at scale (adapters, switches and # of ports)

Hadoop Application and performance tuning:

Fallacy: “all Hadoop jobs are IO-bound”

Truth: there are many, many Hadoop workflow and tuning variables, widely varying workloads

– CPU/storage ratio different for different workloads

Network latency is major performance impact on Hadoop cluster

– Adding external storage layer network latency causes major retuning of network

Hadoop Application team to Operations:

“Until you’ve read the Hadoop book, please don’t waste my time”

http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/ http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/

Page 16: Hadoop_Its_Not_Just_Internal_Storage_V14


Yet, there are valid operational issues with Hadoop from Enterprise Shared Storage management, cost standpoint

Servers under-utilized?

Another storage silo?

Amount of physical storage required per usable GB/TB?

Reliability as Hadoop application goes into mission critical production?

Hadoop-specific storage management, migration, backup, recovery?

Hadoop-specific skill set?

Ability to understand what data is used where?

Audit, security, legacy application integration?

Share Hadoop storage (and servers) dynamically, in a pool with other data center resources?

Ultimately, it becomes a matter of perspective, type of infrastructure, and associated priority. Let’s explore this further...

Page 17: Hadoop_Its_Not_Just_Internal_Storage_V14


Today: two different types of IT

Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/

Internet scale workloads

Transactional IT

Page 18: Hadoop_Its_Not_Just_Internal_Storage_V14


Today’s two major IT workload types

Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/

Transactional IT

Internet scale workloads

Page 19: Hadoop_Its_Not_Just_Internal_Storage_V14


How to build these two different clouds

Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/

Transactional IT

Internet scale workloads

Page 20: Hadoop_Its_Not_Just_Internal_Storage_V14


Hadoop storage choices based on perspective:

This is where a Hadoop external shared storage infrastructure may often be found

This is where Hadoop DAS-focused infrastructure may often be found

Page 21: Hadoop_Its_Not_Just_Internal_Storage_V14


Differing valid perspectives on Hadoop storage issues

Hadoop Applications, Business Line team: very specific reasons why Direct Attached Storage is used:

Performance and throughput (lowest latency)

Low cost commodity components

– Cost of JBOD at 4-6 cents/GB today

– Even at 3x copies, still very inexpensive

Many Hadoop workflow, software components to tune:

– Map and Reduce workflow

– Memory allocation and usage

– Algorithms, tuning at all levels

– What are the tasks doing

Hadoop overall cluster configuration

– Server and DAS storage configuration

– 3X copies for performance reasons

– Squeeze out all latency

– Network topology, speeds, utilization

– Compression

– Type of data

Etc.

Operations team: very specific reasons why shared storage is desired:

Cost: CAPEX / OPEX?

– Fixed server/storage ratio?

– Low server % utilization = excess cost?

Reliability?

Backup? Disaster Recovery?

Another silo of storage?

Managing data:

– Within the Hadoop cluster

– Between Hadoop and other existing storage?

Clearly, different perspectives!

Page 22: Hadoop_Its_Not_Just_Internal_Storage_V14


Bottom line on Direct Attached Storage (DAS) vs. Shared Storage for Hadoop

Avoid “brute force” one-for-one direct replacement of Hadoop direct attached storage with external shared storage

– This is too blunt an instrument

• Doesn’t intelligently consider Hadoop design characteristics, performance requirements, overall Hadoop cluster tuning, workload variations, customer’s environment

Instead, use an intelligent, blended Hadoop storage approach, with full awareness of the Hadoop stack and customer environment, and multiple perspectives:

– To identify cases where Direct Attached Storage (DAS) makes sense

• Many Hadoop cases where DAS is the correct Hadoop primary storage choice

• For issues of very large scale, performance and throughput, minimize network, adapter costs

– To identify cases where shared storage makes sense

• While maintaining the Hadoop benefits of DAS latency, cost, scale

• Specific intelligent implementations are effective, if designed properly with full Hadoop stack awareness

Without an intelligent in-depth Hadoop-aware approach:

– The solution likely will not meet Hadoop performance or cost objectives

• Replacing DAS one-for-one with external shared storage today isn’t cost-effective at true internet scale

• SAN switches / port costs today cannot affordably reach thousands of data nodes

– Must use intelligent approach, otherwise SAN/NAS will introduce significant % disk IO latency increase

• Requiring rebalancing of entire Hadoop cluster and requiring more expensive networking

http://www.snia.org/sites/default/education/tutorials/2012/fall/big_data/DrSamFineberg_Hadoop_Storage_Options-v21.pdf

Page 23: Hadoop_Its_Not_Just_Internal_Storage_V14


Intelligently choosing Hadoop storage solutions

Hadoop – It’s Not Just Internal Storage

Page 24: Hadoop_Its_Not_Just_Internal_Storage_V14


Intelligently using Hadoop shared storage: goals

Wish to perform mixed workloads on a shared storage infrastructure

– Some storage for Hadoop, other storage for other things, all on the same storage devices

Have a desire to trade off reduced number of Hadoop copies by exploiting higher storage reliability

– Saving on total Hadoop physical storage space

Exploit external storage placement/migration/storage mgmt strategy and capabilities

Exploit configurable storage recovery policies, backup/restore

Exploit your existing storage infrastructure in balanced, cost-effective way

Reduce need for Hadoop storage allocation skills and manual management of Hadoop data

Exploit existing shared storage infrastructure tooling / performance monitors

Add audit, security, legacy integration opportunities leveraged out of existing infrastructure

– Avoiding silo’d Hadoop storage environment

Decoupling servers from storage:

– Enable using smaller servers (less power, cooling)

– Enable better use of resources on differing workloads with differing server/storage ratios

– Dynamically allocate servers and storage to work on differing and changing analytics workloads
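The goal of trading Hadoop copies for more reliable storage can be quantified with simple arithmetic. The sketch below is illustrative only: the 100 TB usable figure is an arbitrary example rather than a number from the presentation, the names are hypothetical, and any RAID or spare overhead inside the more reliable storage is ignored.

```java
// Raw capacity needed for a given usable capacity at different Hadoop copy counts.
// The 100 TB figure is an arbitrary illustration, not a number from the presentation,
// and any RAID overhead inside the more reliable storage is ignored.
class CopyCountSketch {

    /** Physical TB required to hold usableTb with the given number of copies. */
    static long physicalTb(long usableTb, int copies) {
        return usableTb * copies;
    }

    /** Whole-percent saving in physical capacity when going from `from` to `to` copies. */
    static long savingPercent(int from, int to) {
        return Math.round(100.0 * (from - to) / from);
    }

    public static void main(String[] args) {
        long usableTb = 100;
        System.out.println(physicalTb(usableTb, 3) + " TB physical at 3 copies");
        System.out.println(physicalTb(usableTb, 2) + " TB physical at 2 copies");
        System.out.println(savingPercent(3, 2) + "% less physical storage");
    }
}
```

Going from three copies to two reduces physical capacity by about a third; whether that saving outweighs the cost per GB of the more reliable storage depends on the prices in a given environment.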

Page 25: Hadoop_Its_Not_Just_Internal_Storage_V14


Intelligent usage cases for shared external storage in Hadoop

Intelligent usage cases where external shared storage supplements and is appropriate for Hadoop:

Stage 1:

– Intelligently directly attach larger external storage arrays, or external filesystems, for Hadoop primary storage while still preserving Direct Attached Storage data locality and the ability to reach internet scale

– While using external storage to bring desired function or reduce number of Hadoop copies

– Examples: N series Open Solution for Hadoop; GPFS File Placement Optimizer

Stage 2:

– Augment Hadoop DAS primary storage with 2nd storage layer (external file system, NAS, or SAN) as a data protection or archival layer.

– Intelligently allocating, importing, exporting data appropriately

Stage 3:

– Directly replace primary node-based DAS with external shared storage (file system, NAS or SAN)

– Appropriate for certain clusters and certain Hadoop environments where:

• Network rebalancing, adapter/network costs, scale are in line with shared storage benefits

• Example: IBM GPFS Storage Server


Hadoop Stages originally published by John Webster, Evaluator Group, http://www.evaluatorgroup.com/about/principals/

http://searchstorage.techtarget.com/video/Alternatives-to-DAS-in-Hadoop-storage

http://searchstorage.techtarget.com/answer/Can-shared-storage-be-used-with-Hadoop-architecture

http://searchstorage.techtarget.com/video/Understanding-storage-in-the-Hadoop-cluster

Page 26: Hadoop_Its_Not_Just_Internal_Storage_V14


IBM Big Data Networked Storage Solution for Hadoop

http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html

Stage 1 example: IBM DCS3700 with Hadoop replication count = 2

Still direct attached data locality

Page 27: Hadoop_Its_Not_Just_Internal_Storage_V14


IBM Big Data Networked Storage Solution for Hadoop

http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html

Stage 1

[Diagram labels: Hadoop storage building blocks; IBM Storage; Hadoop replication count = 2; improved Namenode protection]

Page 28: Hadoop_Its_Not_Just_Internal_Storage_V14


Another option: Hadoop environment using IBM GPFS-FPO (File Placement Optimizer)

[Diagram labels: users, jobs, MapReduce cluster, GPFS-FPO]

GPFS File Placement Optimizer instead of HDFS: still places disk local to each server

Aggregates the local disk space into a single redundant shared GPFS file system

Designed for MapReduce workloads

Unlike HDFS, GPFS-FPO is POSIX compliant, so data maintenance is easy

Intended as a drop-in replacement for open source HDFS (IBM BigInsights product may be required)
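One way to read the POSIX compliance point is that ordinary file tooling and standard file APIs can be used for data maintenance. The sketch below shows what that looks like with the standard `java.nio.file` API; it is only an illustration, the temporary directory stands in for a mounted file system path, and the file names are hypothetical. Nothing in it is specific to GPFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Ordinary java.nio.file calls against a directory. The temporary directory below merely
// stands in for a mount point on a POSIX-compliant file system; file names are hypothetical.
class PosixMaintenanceSketch {

    /** Write a staged file, rename it into place, and read it back. */
    static List<String> writeRenameRead(Path dir) throws IOException {
        Path staged = dir.resolve("ingest.tmp");
        Path published = dir.resolve("ingest.csv");
        Files.write(staged, Arrays.asList("id,value", "1,42"), StandardCharsets.UTF_8);
        Files.move(staged, published, StandardCopyOption.REPLACE_EXISTING);
        return Files.readAllLines(published, StandardCharsets.UTF_8);
    }

    /** Run the sketch in a fresh temporary directory. */
    static List<String> demo() {
        try {
            return writeRenameRead(Files.createTempDirectory("posix-sketch"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The same write, rename, and read steps would run unchanged against any directory path the operating system exposes, which is the practical convenience the slide is pointing at.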

Stage 1

[Diagram label: IBM General Parallel File System FPO, instead of HDFS]

Page 29: Hadoop_Its_Not_Just_Internal_Storage_V14


GPFS File Placement Optimizer shared storage advantages in Hadoop environment

[Table comparing GPFS 3.5 and HDFS on the following criteria]

Performance: Terasort (large reads); HBase (small write); metadata intensive

Enterprise readiness: POSIX compliance; metadata replication; distributed name node

Protection & Recovery: snapshot; asynchronous replication; backup

Security & Integrity: access control lists

Ease of Use: policy-based ingest

Stage 1

Page 30: Hadoop_Its_Not_Just_Internal_Storage_V14


Augment Hadoop Storage with external storage

[Diagram labels: datanodes, management nodes, Namenode nodes, JobTracker nodes, HDFS; compute nodes, management nodes, job submission nodes, batch scheduler nodes; external storage]

Possibilities:

•Allocate one of Hadoop copies externally

•Move data back and forth between Hadoop and external storage

Stage 2

Page 31: Hadoop_Its_Not_Just_Internal_Storage_V14


Another option: augment Hadoop with IBM General Parallel File System in “Stage 2” configuration

[Diagram labels: datanodes, management nodes, Namenode nodes, JobTracker nodes; compute nodes, management nodes, job submission nodes, batch scheduler nodes; GPFS Storage Servers; GPFS-FPO; POSIX GPFS]

Add GPFS cluster: POSIX world

All nodes can write/read data

• Integration with existing or new external GPFS cluster

• Policy based file movement in/out of GPFS-File Placement Optimizer pool

• Seamlessly integrate tape as part of the same namespace

Stage 2

Page 32: Hadoop_Its_Not_Just_Internal_Storage_V14


Replace Hadoop DAS with intelligent external Hadoop storage implementation

[Diagram labels: compute nodes 1-3, Namenode nodes, JobTracker nodes, HDFS, GPFS Storage Servers; file system /gpfs with per-node paths /gpfs/node1/dsk1, /gpfs/node1/dsk2 … /gpfs/node1/dskX, and likewise for node2 and node3]

Stage 3

Example: GPFS Storage Server

Page 33: Hadoop_Its_Not_Just_Internal_Storage_V14


IBM Big Data Networked Storage Solution for Hadoop

http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html

Stage 3

[Diagram labels: Hadoop; improved Namenode protection; Hadoop storage building blocks; other IBM Storage; Hadoop replication count = 2; IBM NAS filer; NAS; SAN]

Page 34: Hadoop_Its_Not_Just_Internal_Storage_V14


Future evolution: Hadoop, storage, intersection of the two

Big Data workloads, Hadoop, and storage are all fast-moving targets that continue to evolve

– Already in mid-2013, we’re seeing HDFS 2.0 offering HA, snapshots, better resiliency

• http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum

– We are seeing a huge adoption rate of Hadoop as inexpensive, deep storage

More importantly, very soon flash storage costs will start to affect Hadoop reference architectures

– By 2015, SSD costs will reach a point (15 cents/GB) that affects future, yet-to-be-determined Hadoop deployments

– This will start to move the Hadoop bottleneck from storage to the network interconnect

– Whoever best solves that future network interconnect issue will be the next big Hadoop winner

Today’s intelligent Hadoop usage cases will continue to evolve quickly. Watch this space!

Page 35: Hadoop_Its_Not_Just_Internal_Storage_V14


IBM Hadoop Storage components, tools, offerings

Hadoop – It’s Not Just Internal Storage

Page 36: Hadoop_Its_Not_Just_Internal_Storage_V14


Big Data application stack

[Diagram of the Big Data application stack, with these labels:]

Users

User Interface Layer: Reports, Dashboards, Mashups, Search, Ad hoc reporting, Spreadsheets

Analytic Process Layer: Real-time computing and analysis, stream computing, entity analytics, data mining, data proximity, content management, text analytics, etc.

Infrastructure layer: Virtualization, central end-to-end management, control, deployment on software, server, storage in a geographically dispersed environment

Security, authorization

OS software

Location of competitive advantage: Analytics applications

Cloud infrastructure layer

Servers, storage

IBM Big Data Software

Visualization layer

Analytics layer

Page 37: Hadoop_Its_Not_Just_Internal_Storage_V14


IBM Big Data Analytics Solutions

[Diagram labels: Streaming Data; Traditional Warehouse; Internet-Scale Data Sets; Analytics on Data at Rest; Analytics on Structured Data; Analytics on Data In-Motion; Data Warehouse; IBM InfoSphere BigInsights; IBM InfoSphere Streams; Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources]

Page 38: Hadoop_Its_Not_Just_Internal_Storage_V14


Big Data infrastructure layer

[Diagram of the Big Data application stack, with these labels:]

Users

User Interface Layer: Reports, Dashboards, Mashups, Search, Ad hoc reporting, Spreadsheets

Analytic Process Layer: Real-time computing and analysis, stream computing, entity analytics, data mining, data proximity, content management, text analytics, etc.

Infrastructure layer: Virtualization, central end-to-end management, control, deployment on software, server, storage in a geographically dispersed environment

Security, authorization

OS software

Cloud infrastructure layer

Servers, storage

Visualization layer

Analytics layer

Page 39: Hadoop_Its_Not_Just_Internal_Storage_V14


IBM Direct Attached Storage solutions for Hadoop

Rack-Level Features

– Up to 20 System x3630 M4 nodes

– Up to 6 System x3550 M4 Management nodes

– Up to 960TB storage

– Up to 240 Intel Sandy Bridge cores

– Up to 3,840GB memory

– Up to two 10Gb Ethernet (IBM G8264-T) switches

– Scalable to multi-rack configurations

Available Enterprise and Performance Features

– Redundant storage

– Redundant networking

– High performance cores

– Increased memory

– High performance networking

Reference architecture: High volume x86 systems

Integrated solution: PureData System for Hadoop

Each system has local storage

Page 40: Hadoop_Its_Not_Just_Internal_Storage_V14


IBM external Big Data storage: GPFS Storage Server scalable building block approach

[Diagram labels: x3650 M4 Server; JBOD Disk Enclosure; GPFS software RAID]

Storage solution includes Data Servers, Disk (2TB or 3TB NL-SAS, SSD), Software, InfiniBand / Ethernet with no Storage Controllers

GSS 24 (Light and Fast): 2 3650 servers + 4 JBOD enclosures, 20U rack, 10 GB/sec

GSS 26 (Workhorse): 2 3650 servers + 6 JBOD enclosures, 28U, 12 GB/sec

High-Density Option: 6 3650 servers + 18 JBOD enclosures, 2 standard 42U racks, 36 GB/sec
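The building block idea on this slide can be sketched as simple multiplication. The example below takes the GSS 26 figures quoted above and totals three such blocks, assuming (as a simplification that the slide does not state) that throughput scales linearly with the number of blocks. The class name and fields are hypothetical.

```java
// Totals for N identical storage building blocks, assuming linear scaling of throughput.
// The per-block figures passed in main are the GSS 26 numbers quoted on the slide;
// the linear-scaling assumption and all names are illustrative only.
class BuildingBlockSketch {
    final int servers;
    final int jbodEnclosures;
    final int gbPerSec;

    BuildingBlockSketch(int servers, int jbodEnclosures, int gbPerSec) {
        this.servers = servers;
        this.jbodEnclosures = jbodEnclosures;
        this.gbPerSec = gbPerSec;
    }

    /** Totals for n identical blocks. */
    BuildingBlockSketch times(int n) {
        return new BuildingBlockSketch(servers * n, jbodEnclosures * n, gbPerSec * n);
    }

    @Override
    public String toString() {
        return servers + " servers + " + jbodEnclosures + " JBOD, " + gbPerSec + " GB/sec";
    }

    public static void main(String[] args) {
        BuildingBlockSketch gss26 = new BuildingBlockSketch(2, 6, 12);
        System.out.println(gss26.times(3)); // 6 servers + 18 JBOD, 36 GB/sec
    }
}
```

Three GSS 26 blocks total 6 servers, 18 JBOD enclosures, and 36 GB/sec, which matches the figures listed for the High-Density Option; whether that option is literally built from three GSS 26 blocks is an assumption here.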

Page 41: Hadoop_Its_Not_Just_Internal_Storage_V14


IBM Shared Storage infrastructure solutions

[Diagram category labels: High Volume & Availability: Mainframe & Open; Storage for Distributed Systems; Storage management SW; Optimized System Storage; Integrated Innovation; Integrated Solutions; Data protection & retention]

[Diagram product labels:]

Tivoli Storage Productivity Center

Tivoli Storage FlashCopy Manager

Tivoli Storage Manager

Tivoli Key Lifecycle Manager

XIV

SONAS

DS8000

N series

Storwize V7000 Unified

Storwize V7000

DS3500/DCS3700

Storage Virtualization SW and SVC

Real-time Compression

Deduplication

Virtual Storage Center

Easy Tier

IBM Active Cloud Engine™

Linear Tape File System (LTFS)

Tape Library TS3310

Tape Virtualization TS7740

Tape Automation TS3500

Tape drives LTO 3, 4 and 5

ProtecTIER TS7610/20/50

Page 42: Hadoop_Its_Not_Just_Internal_Storage_V14

IBM solutions for a Big Data world

[Diagram; labels recovered below]

– IBM Netezza
– InfoSphere Streams
– GPFS Storage Server
– “Unified” Storage: Storwize V7000
– “File” Storage: Scale Out NAS (SONAS)
– “Block” Storage
– Disks 3TB, 4TB:
• Storwize V7000
• XIV Gen3
• DS8800
– Solid State Drives (SSD):
• Storwize V7000
• XIV Gen3
• DS8800
– IBM Tape Systems: TS3500, 2.7 ExaBytes

Page 43: Hadoop_Its_Not_Just_Internal_Storage_V14

Learning Points

In many, perhaps most, cases traditional Hadoop Direct Attached Storage (DAS) is appropriate

However, there are many use cases where Hadoop external shared storage, intelligently implemented, brings significant value

Stage 1:
– Intelligently direct-attach larger external storage arrays, or external file systems, as Hadoop primary storage, while still preserving DAS data locality and the ability to reach internet scale
– Use the external storage to bring desired function or to reduce the number of Hadoop copies

Stage 2:
– Augment Hadoop DAS primary storage with a 2nd storage layer (external file system, NAS, or SAN) as a data protection or archival layer
– Intelligently allocate, import, and export data as appropriate

Stage 3:
– Directly replace primary node-based DAS with external shared storage (file system, NAS or SAN)
– Appropriate for certain clusters and certain Hadoop environments where:
• Network rebalancing, adapter/network costs, and scale are in line with shared storage benefits

Most importantly, the Hadoop and storage topic is fast moving and constantly evolving
– Soon, adoption of flash as Hadoop primary storage will significantly change Hadoop dynamics
– This will move the Hadoop bottleneck from storage to network
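
The closing point, that flash would move the Hadoop bottleneck from storage to network, can be illustrated with a back-of-the-envelope comparison. Everything numeric here is a hypothetical assumption rather than a figure from the deck: 12 drives per node, roughly 100 MB/sec per spinning disk, roughly 400 MB/sec per flash drive, and a single 10Gb Ethernet link per node. The model is deliberately crude and ignores data locality, protocol overhead and CPU limits.

```python
def node_bottleneck(drives, mb_per_sec_per_drive, nic_gbit_per_sec):
    """Return which resource is smaller on one node: local storage or the network link.

    All inputs are illustrative assumptions, not measured values.
    """
    storage_mb = drives * mb_per_sec_per_drive
    network_mb = nic_gbit_per_sec * 1000 / 8   # 1 Gbit/sec = 125 MB/sec
    limit = "storage" if storage_mb < network_mb else "network"
    return limit, storage_mb, network_mb


# Hypothetical node with 12 spinning disks behind one 10Gb link.
hdd = node_bottleneck(drives=12, mb_per_sec_per_drive=100, nic_gbit_per_sec=10)

# Same node with the disks swapped for (hypothetical) flash drives.
flash = node_bottleneck(drives=12, mb_per_sec_per_drive=400, nic_gbit_per_sec=10)

print(hdd)    # ('storage', 1200, 1250.0)
print(flash)  # ('network', 4800, 1250.0)
```

With these assumed numbers, the disk-based node is limited by its drives, while the flash-based node can read several times faster than one 10Gb link can carry.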

Page 44: Hadoop_Its_Not_Just_Internal_Storage_V14

Page 45: Hadoop_Its_Not_Just_Internal_Storage_V14

Trademarks and disclaimers

© IBM Corporation 2011. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country.

Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries.

Other product and service names might be trademarks of IBM or other companies. Information is provided "AS IS" without warranty of any kind.

The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your IBM representative or Business Partner for the most current pricing in your geography.

Photographs shown may be engineering prototypes. Changes may be incorporated in production models.

Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml.

ZSP03490-USEN-00

Page 46: Hadoop_Its_Not_Just_Internal_Storage_V14

Appendix

Page 47: Hadoop_Its_Not_Just_Internal_Storage_V14

Recommend you download and read this very informative IBM book

“Understanding Big Data”
– Published April 2012
– Free download
– Well worth reading to understand the components of Big Data, and how to exploit them

Part I: The Big Deal about Big Data
– Chapter 1 – What is Big Data? Hint: You’re a Part of it Every Day
– Chapter 2 – Why Big Data is Important
– Chapter 3 – Why IBM for Big Data

Part II: Big Data: From the Technology Perspective
– Chapter 4 – All About Hadoop: The Big Data Lingo Chapter
– Chapter 5 – IBM InfoSphere BigInsights – Analytics for “At Rest” Big Data
– Chapter 6 – IBM InfoSphere Streams – Analytics for “In Motion” Big Data

Download your free copy here: http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF

Page 48: Hadoop_Its_Not_Just_Internal_Storage_V14

IBM InfoSphere BigInsights = IBM Hadoop distribution

Core Hadoop

BigInsights Basic Edition
– Free download with web support
– Limit to <= 10 TB of data
– (Optional: 24x7 paid support, Fixed Term License)

BigInsights Enterprise Edition
– Enterprise-grade features:
• Easy installation and programming
• Analytics tooling/visualization
• Administration tooling
• Development tooling
• High Availability
• Flexible storage
• Recoverability
• Security
– Tiered terabyte-based pricing

Professional Services Offerings: QuickStart, Bootcamp, Education, Custom Development