DESCRIPTION
John Sing's Edge 2013 presentation, detailing when, where, and how external storage products and/or system software (i.e. GPFS) can be used effectively in a Hadoop storage environment. Many Hadoop situations absolutely require direct attached storage; however, there are many situations where shared external storage may make sense in a Hadoop environment. This presentation details how, why, and where, and promotes taking an intelligent, Hadoop-aware approach to deciding between internal storage and external shared storage. Full awareness of Hadoop considerations is essential when selecting either internal or external shared storage in a Hadoop environment.
© 2013 IBM Corporation
Hadoop – It’s Not Just Internal Storage
John Sing, Executive Consultant, IBM Systems and Technology Group
Session 1185A, Tuesday, June 11, 2013
IBM Storage Solutions for Big Data
John Sing: 31 years of experience with IBM in high-end servers, storage, and software
– 2009–Present: IBM Executive Strategy Consultant: IT Strategy and Planning, Enterprise Large Scale Storage, Internet Scale Workloads and Data Center Design, Big Data Analytics, HA/DR/BC
– 2002–2008: IBM IT Data Center Strategy, Large Scale Systems, Business Continuity, HA/DR/BC, IBM Storage
– 1998–2001: IBM Storage Subsystems Group: Enterprise Storage Server Marketing Manager, Planner for ESS Copy Services (FlashCopy, PPRC, XRC, Metro Mirror, Global Mirror)
– 1994–1998: IBM Hong Kong, IBM China Marketing Specialist for High-End Storage
– 1989–1994: IBM USA Systems Center Specialist for High-End S/390 processors
– 1982–1989: IBM USA Marketing Specialist for S/370, S/390 customers (including VSE and VSE/ESA)
You may follow my daily IT research blog
– http://www.delicious.com/atsf_arizona
You may follow me on Slideshare.net:
– http://www.slideshare.net/johnsing1
My LinkedIn:
– http://www.linkedin.com/in/johnsing
Agenda
Understanding today's Hadoop environments
– Hadoop architecture, usage cases, deployments
– Hadoop design, performance, and cost considerations
Differing Hadoop perspectives: Applications/Business Line vs. Operations
– Understanding implications of direct attached storage (DAS) vs. shared storage
Intelligently choosing Hadoop storage solutions
– Usage cases where Direct Attached Storage makes sense
– Intelligent usage cases where shared storage makes sense
– Future evolution of storage, Hadoop, and the intersection of the two
IBM Hadoop, Storage, Big Data hardware and software components, tools, offerings
Understanding today’s Hadoop environments
Hadoop – It’s Not Just Internal Storage
What is Hadoop?
Instead of the traditional IT computation model, which:
– Brings the data to the function/program on an application server
– Loads data into memory on an application server and processes it
Unfortunately, that model doesn't scale to internet-scale Big Data problems
Apache Hadoop: open source framework for data-intensive applications
Inspired by Google technologies (MapReduce, GFS)
Well-suited to batch-oriented, read-intensive applications
Yahoo! adopted these technologies and open sourced them into the Apache Hadoop project
Hadoop has become a pervasive enabler of internet-scale applications, working with thousands of nodes, petabytes of data in highly parallel, cost effective manner
CPU + disks of commodity storage = Hadoop “node”
Hadoop nodes today run mission-critical production workloads in massive clusters
10s of thousands of servers
New nodes can be added as needed to the cluster, without changing:
Data formats, how data is loaded, how jobs are written
Tutorials: http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
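The scaling problem can be made concrete with a little arithmetic: a minimal sketch of how long it takes to ship a petabyte-class dataset to an application server (the link speed and dataset size are illustrative assumptions, not figures from the deck):

```java
// Why "move the data to the program" breaks down at petabyte scale:
// time to ship a dataset over the network to an application server.
public class DataMovementTime {
    // Hours to transfer `dataTB` terabytes over a `gbit` gigabit/s link.
    static double transferHours(double dataTB, double gbit) {
        double terabits = dataTB * 8;              // TB -> Tb
        double seconds = terabits * 1000 / gbit;   // Tb -> Gb, then / (Gb/s)
        return seconds / 3600;
    }

    public static void main(String[] args) {
        // 1 PB (1000 TB) over a single 10 Gb/s link: ~222 hours, over 9 days.
        System.out.printf("%.0f hours%n", transferHours(1000, 10));
        // Shipping a few MB of program code to the data is effectively free.
    }
}
```

This is the motivation for the Hadoop model on the following slides: the program moves, the data stays put.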
The World of Hadoop: worldwide usage
eBay
Yahoo!
New York Times
Many, many more…
http://www.datanami.com/datanami/2012-04-26/six_super-scale_hadoop_deployments.html
One source for Hadoop users (but not the only one!): http://wiki.apache.org/hadoop/PoweredBy
Hadoop is today a well-developed ecosystem
Hadoop – overall name of the software stack
HDFS – Hadoop Distributed File System
MapReduce – software compute framework
• Map = queries
• Reduce = aggregates answers
Hive – Hadoop-based data warehouse
Pig – Hadoop-based language
HBase – non-relational database for fast lookups
Flume – populates Hadoop with data
Oozie – workflow processing system
Whirr – libraries to spin up Hadoop on Amazon EC2, Rackspace, etc.
Avro – data serialization
Mahout – data mining
Sqoop – connectivity to non-Hadoop data stores
BigTop – packaging / interop of all Hadoop components
http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond
http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/
http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
Hadoop vendor ecosystem today
http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg
http://www.forbes.com/special-report/2013/industry-atlas.html
Why understand the Hadoop stack and environment?
Hadoop is being used for much more than just internet-scale Big Data analytics
Hadoop is increasingly being used by enterprises for inexpensive data storage
– As an industry we’re strongly exploiting a much wider variety of data types
– With tools like Hadoop, it's become affordable to ingest, analyze, and keep available an internet-scale “Big Landing Zone” Hadoop cluster for storing data
• Previously not viable to keep online
– Hadoop cluster also then can run internet-scale analytics on this data
– Significant driver: move to Hadoop to reduce traditional database licensing costs
Storage industry dynamics:
– Today, JBOD storage in a server chassis might be as low as 4–6 cents/raw GB
• At these prices, adding 50 TB usable to a Hadoop cluster might cost only about $10K in total, including the server
• Even at the typical Hadoop 3X copies, this is still less initial cost than enterprise storage at 26 cents/GB
• This doesn't include all factors, but these dynamics clearly affect the decision
– And then, there’s flash storage coming…..
Must understand full depth of the Hadoop environment and storage industry dynamics:
– In order to decide if/when/where Hadoop internal storage or shared storage is appropriate
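The price dynamics on this slide can be sanity-checked with simple arithmetic; a minimal sketch using the per-GB figures quoted above (the slide's own 2013 estimates, decimal units, server cost excluded):

```java
// Rough cost comparison for Hadoop JBOD vs. enterprise storage,
// using the per-GB figures quoted on this slide (2013 estimates).
public class HadoopStorageCost {
    // Cost of storing `usableTB` terabytes when Hadoop keeps `copies`
    // full replicas, at `centsPerGB` cents per raw gigabyte.
    static double costUsd(double usableTB, int copies, double centsPerGB) {
        double rawGB = usableTB * 1000.0 * copies;   // decimal TB -> GB
        return rawGB * centsPerGB / 100.0;           // cents -> dollars
    }

    public static void main(String[] args) {
        // 50 TB usable, 3 Hadoop copies, JBOD at 6 cents per raw GB: $9,000
        double jbod = costUsd(50, 3, 6.0);
        // Same 50 TB usable, 1 copy, enterprise storage at 26 cents/GB: $13,000
        double enterprise = costUsd(50, 1, 26.0);
        System.out.printf("JBOD 3x: $%.0f, enterprise 1x: $%.0f%n", jbod, enterprise);
        // Even triplicated, commodity JBOD undercuts one enterprise copy,
        // which is the dynamic the slide describes.
    }
}
```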
Why Hadoop was created for Big Data
Traditional approach: move data to program
Big Data approach: move function/programs to data
[Diagram: in the traditional approach, the application server receives the user request, queries the database server, data is returned and processed on the application server, and the result is sent to the user. In the Big Data approach, the master node receives the user request, sends the function to the data nodes, each data node queries and processes its local data, and a consolidated result is sent back.]
Traditional approach:
– Application server and database server are separate
– Data can be on multiple servers
– Analysis program can run on multiple application servers
– The network is still in the middle: data has to go through the network
Big Data approach:
– Analysis program runs where the data is: on the data node
– Only the analysis program has to go through the network
– Analysis program needs to be MapReduce-aware
– Highly scalable: 1000s of nodes, petabytes and more
Thank you to: Pascal VEZOLLE/France/IBM@IBMFR and Francois Gibello/France/IBM for the use of this slide
Example of Hadoop in action
Traditional approach: move data to program
Big Data approach: move program to data
Example: how many hours does Clint Eastwood appear across all the movies he has made?
– All movies need to be parsed to find Clint's face
– Traditional approach: all movies are uploaded to the application server, through the network
– Big Data approach: only the analysis program and a copy of Clint's picture are downloaded to the data nodes, through the network
Thank you to: Pascal VEZOLLE/France/IBM@IBMFR and Francois Gibello/France/IBM for the use of this slide
Hadoop principles: storage, HDFS, and MapReduce
Hadoop Distributed File System (HDFS): where Hadoop stores the data
– HDFS file system spans all the nodes in a cluster with locality awareness
Hadoop data storage and computation model:
– Data is stored in a distributed file system, spanning many inexpensive computers
– Send the function/program to the data nodes, i.e. distribute the application to the compute resources where the data is stored
– Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map phase (break the job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
Return a single result set
Example mapper and reducer (the canonical Hadoop WordCount):

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Map tasks are distributed to the cluster; data is loaded, spread, and resident across the Hadoop data nodes.
Performance = tuning the MapReduce workflow, network, application, servers, and storage.
http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
http://www.slideshare.net/allenwittenauer/2012-lihadoopperf
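The map/shuffle/reduce flow above can be simulated without a cluster. A minimal in-memory sketch in plain Java (illustrative only: this is not the Hadoop API, and the class name and sample inputs are invented for the example):

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Map phase: each input "split" emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        return Arrays.stream(split.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // Shuffle: group the interim (word, 1) pairs by key across all splits.
    // Reduce: sum the values for each key into the final result set.
    static Map<String, Integer> run(List<String> splits) {
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (String split : splits)
            for (Map.Entry<String, Integer> kv : map(split))
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
        Map<String, Integer> result = new HashMap<>();
        shuffled.forEach((word, ones) ->
            result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            run(List.of("hadoop stores data", "hadoop moves code to data"));
        System.out.println(counts.get("hadoop")); // 2
        System.out.println(counts.get("data"));   // 2
    }
}
```

In real Hadoop, each split's map runs on the data node holding that split, and the shuffle moves only the interim (key, value) pairs across the network, which is the point of the slides above.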
Big Data Hadoop system architecture
[Diagram: management nodes, Namenode nodes, and JobTracker nodes manage Hadoop and the cluster; data nodes with local disks are connected by a 1–10 Gb Ethernet or InfiniBand network.]
I/O performance = type and number of disks
Reference architecture: 12–24 disks, ~1.5 GB/s, >35 TB, 12–16 CPUs per data node
Hadoop Distributed File System (HDFS)
• HDFS stores data across multiple data nodes; the Namenode knows where the data is
• HDFS assumes data nodes and disks will fail, so it achieves reliability by replicating data across multiple data nodes (typically 3 or more)
• The HDFS file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS
• The HDFS Namenode is a single point of failure
Scaling granularity: data node, scaling both IO and CPU
Locality awareness
Note: any other location for data adds network latency
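The replication arithmetic matters for the storage decisions in the rest of this deck; a minimal sketch of raw capacity consumed under full replication versus parity protection (the 8+2 parity overhead is an illustrative assumption, not something the deck specifies):

```java
// Raw disk consumed for a given usable capacity under HDFS-style
// replication, versus parity-protected shared storage.
public class StorageOverhead {
    // HDFS: every block is stored `copies` times in full.
    static double hdfsRawTB(double usableTB, int copies) {
        return usableTB * copies;
    }

    // Parity-protected shared storage, e.g. 8+2 erasure/RAID: 25% overhead.
    static double raidRawTB(double usableTB, double parityOverhead) {
        return usableTB * (1.0 + parityOverhead);
    }

    public static void main(String[] args) {
        // 100 TB usable at the HDFS default of 3 copies: 300 TB raw.
        System.out.println(hdfsRawTB(100, 3));     // 300.0
        // Dropping to 2 copies on reliable external storage: 200 TB raw.
        System.out.println(hdfsRawTB(100, 2));     // 200.0
        // The same 100 TB usable on 8+2-protected storage: 125 TB raw.
        System.out.println(raidRawTB(100, 0.25));  // 125.0
    }
}
```

This gap is why the later "Stage 1" slides pair external storage with a Hadoop replication count of 2.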
Differing Hadoop storage perspectives
Hadoop – It’s Not Just Internal Storage
Understanding Hadoop rationale for Direct Attached Storage (latency)
Primary Hadoop design goal is affordability at internet scale:
– Data is loaded into the Hadoop cluster with data locality: spreading data across the data nodes achieves the lowest disk latency through direct attached storage
– Send programs to the data (not the other way around): data in general does not move within the Hadoop cluster
– Key performance components: disk latency, network interconnect, utilization, bandwidth
– Based on low capital expenditure, low-cost commodity components: the goal is lowest capital cost at scale (adapters, switches, and number of ports)
Hadoop Application and performance tuning:
– Fallacy: “all Hadoop jobs are I/O-bound.” Truth: there are many Hadoop workflow and tuning variables, and widely varying workloads
– The CPU/storage ratio differs for different workloads
– Network latency is a major performance factor in a Hadoop cluster: adding the network latency of an external storage layer forces major retuning of the network
Hadoop Application team to Operations:
“Until you’ve read the Hadoop book, please don’t waste my time”
http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/
Yet there are valid operational issues with Hadoop from an enterprise shared storage management and cost standpoint
– Servers under-utilized?
– Another storage silo?
– Amount of physical storage required per usable GB/TB?
– Reliability as Hadoop applications go into mission-critical production?
– Hadoop-specific storage management, migration, backup, recovery?
– Hadoop-specific skill set?
– Ability to understand what data is used where?
– Audit, security, legacy application integration?
– Can Hadoop storage (and servers) be shared dynamically, in a pool with other data center resources?
Ultimately, it becomes a matter of perspective, type of infrastructure, and associated priority. Let's explore this further.
Today: two different types of IT
Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/
Internet-scale workloads vs. transactional IT
Today’s two major IT workload types
Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/
Transactional IT vs. internet-scale workloads
How to build these two different clouds
Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/
Transactional IT vs. internet-scale workloads
Hadoop storage choices based on perspective:
This is where a Hadoop external shared storage infrastructure may often be found.
This is where a Hadoop DAS-focused infrastructure may often be found.
Differing valid perspectives on Hadoop storage issues
Very specific reasons why Direct Attached Storage is used:
– Performance and throughput (lowest latency)
– Low-cost commodity components: JBOD at 4–6 cents/GB today; even at 3X copies, still very inexpensive
– Many Hadoop workflow and software components to tune: Map and Reduce workflow; memory allocation and usage; algorithms and tuning at all levels; what the tasks are doing
– Hadoop overall cluster configuration: server and DAS storage configuration; 3X copies for performance reasons; squeezing out all latency; network topology, speeds, utilization; compression; type of data
– Etc.
Very specific reasons why shared storage is desired:
– Cost (CAPEX/OPEX): fixed server/storage ratio? Low server utilization = excess cost?
– Reliability?
– Backup? Disaster recovery?
– Another silo of storage?
– Managing data: within the Hadoop cluster, and between Hadoop and other existing storage?
Hadoop Applications, Business Line team
Operations team
Clearly, different perspectives!
Bottom line on Direct Attached Storage (DAS) vs. Shared Storage for Hadoop

Avoid a “brute force” one-for-one direct replacement of Hadoop direct attached storage with external shared storage:
– This is too blunt an instrument
• It doesn't intelligently consider Hadoop design characteristics, performance requirements, overall Hadoop cluster tuning, workload variations, or the customer's environment

Instead, take an intelligent, blended Hadoop storage approach, with full awareness of the Hadoop stack and customer environment, and multiple perspectives:
– To identify cases where Direct Attached Storage (DAS) makes sense
• In many Hadoop cases DAS is the correct primary storage choice: very large scale, performance and throughput, minimized network and adapter costs
– To identify cases where shared storage makes sense
• While maintaining the Hadoop benefits of DAS latency, cost, and scale
• Specific intelligent implementations are effective, if designed properly with full Hadoop stack awareness

Without an intelligent, in-depth, Hadoop-aware approach, you likely will not meet Hadoop performance or cost objectives:
• Replacing DAS one-for-one with external shared storage today isn't cost-effective at true internet scale
• SAN switch and port costs today cannot affordably reach thousands of data nodes
– Used unintelligently, SAN/NAS introduces a significant percentage increase in disk I/O latency, requiring rebalancing of the entire Hadoop cluster and more expensive networking
http://www.snia.org/sites/default/education/tutorials/2012/fall/big_data/DrSamFineberg_Hadoop_Storage_Options-v21.pdf
Intelligently choosing Hadoop storage solutions
Hadoop – It’s Not Just Internal Storage
Intelligently using Hadoop shared storage: goals
Perform mixed workloads on a shared storage infrastructure: some storage for Hadoop, other storage for other things, all on the same storage devices
Trade off a reduced number of Hadoop copies by exploiting higher storage reliability, saving on total Hadoop physical storage space
Exploit external storage placement/migration/storage management strategy and capabilities
Exploit configurable storage recovery policies, backup/restore
Exploit your existing storage infrastructure in a balanced, cost-effective way
Reduce the need for Hadoop storage allocation skills and manual management of Hadoop data
Exploit existing shared storage infrastructure tooling and performance monitors
Add audit, security, and legacy integration opportunities leveraged from existing infrastructure, avoiding a silo'd Hadoop storage environment
Decouple servers from storage:
– Enable smaller servers (less power, cooling)
– Enable better use of resources on differing workloads with differing server/storage ratios
– Dynamically allocate servers and storage to work on differing and changing analytics workloads
Intelligent usage cases for shared external storage in Hadoop
Intelligent usage cases where external shared storage supplements and is appropriate for Hadoop:
Stage 1:
– Intelligently direct-attach larger external storage arrays, or external file systems, for Hadoop primary storage while still preserving Direct Attached Storage data locality and the ability to reach internet scale
– While using external storage to bring desired function or reduce number of Hadoop copies
– Examples: Nseries Open Solution for Hadoop; GPFS File Placement Optimizer
Stage 2:
– Augment Hadoop DAS primary storage with 2nd storage layer (external file system, NAS, or SAN) as a data protection or archival layer.
– Intelligently allocating, importing, exporting data appropriately
Stage 3:
– Directly replace primary node-based DAS with external shared storage (file system, NAS or SAN)
– Appropriate for certain clusters and certain Hadoop environments where:
• Network rebalancing, adapter/network costs, scale are in line with shared storage benefits
• Example: IBM GPFS Storage Server
Hadoop stages originally published by John Webster, Evaluator Group: http://www.evaluatorgroup.com/about/principals/
http://searchstorage.techtarget.com/video/Alternatives-to-DAS-in-Hadoop-storage
http://searchstorage.techtarget.com/answer/Can-shared-storage-be-used-with-Hadoop-architecture
http://searchstorage.techtarget.com/video/Understanding-storage-in-the-Hadoop-cluster
IBM Big Data Networked Storage Solution for Hadoop
http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
Stage 1 example: IBM DCS3700 with Hadoop replication count = 2; still direct-attached data locality
IBM Big Data Network Storage Solution for Hadoop
http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
Stage 1: Hadoop storage building blocks on IBM Storage, with Hadoop replication count = 2 and improved Namenode protection
Another option: Hadoop environment using IBM GPFS-FPO (File Placement Optimizer)
[Diagram: users submit MapReduce jobs to a MapReduce cluster running on GPFS-FPO.]
– GPFS File Placement Optimizer replaces HDFS; disks remain local to each server
– Aggregates the local disk space into a single redundant shared GPFS file system
– Designed for MapReduce workloads
– Unlike HDFS, GPFS-FPO is POSIX-compliant, so data maintenance is easy
– Intended as a drop-in replacement for open source HDFS (the IBM BigInsights product may be required)
Stage 1: IBM General Parallel File System FPO instead of HDFS
GPFS File Placement Optimizer shared storage advantages in a Hadoop environment (Stage 1): GPFS 3.5 vs. HDFS
– Performance: Terasort (large reads), HBase (small writes), metadata-intensive workloads
– Enterprise readiness: POSIX compliance, metadata replication, distributed name node
– Protection & recovery: snapshots, asynchronous replication, backup
– Security & integrity: access control lists
– Ease of use: policy-based ingest
Augment Hadoop Storage with external storage
[Diagram (Stage 2): the Hadoop cluster (management nodes, Namenode nodes, JobTracker nodes, and data nodes running HDFS) sits alongside compute nodes, management nodes, job submission nodes, and batch scheduler nodes backed by external storage.]
Possibilities:
– Allocate one of the Hadoop copies externally
– Move data back and forth between Hadoop and external storage
Another option: augment Hadoop with IBM General Parallel File System in “Stage 2” configuration
[Diagram (Stage 2): the Hadoop cluster's GPFS-FPO pool is integrated with an external POSIX GPFS cluster backed by GPFS Storage Servers; all nodes can read and write data.]
– Integration with an existing or new external GPFS cluster
– Policy-based file movement in/out of the GPFS File Placement Optimizer pool
– Seamlessly integrate tape as part of the same namespace
Replace Hadoop DAS with an intelligent external Hadoop storage implementation
[Diagram (Stage 3): compute nodes mount a single shared /gpfs namespace (/gpfs/node1/dsk1 … /gpfs/node3/dskX) served by GPFS Storage Servers, replacing per-node HDFS disks. Example: GPFS Storage Server.]
IBM Big Data Network Storage Solution for Hadoop
http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
Stage 3: Hadoop storage building blocks on other IBM storage (IBM NAS filer or SAN), with Hadoop replication count = 2 and improved Namenode protection
Future evolution: Hadoop, storage, intersection of the two
Continued evolution of Big Data workloads, Hadoop, and storage are all fast moving targets
– Already in mid-2013, we're seeing HDFS 2.0 offering HA, snapshots, and better resiliency
• http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum
– We are seeing a huge adoption rate of Hadoop as inexpensive, deep storage

More importantly, very soon flash storage costs will start to affect Hadoop reference architectures:
– By 2015, SSD costs will reach a point (~15 cents/GB) that will reshape future, yet-to-be-determined Hadoop deployments
– This will start to move the Hadoop bottleneck from storage to the network interconnect
– Whoever best solves that future network interconnect issue will be the next big Hadoop winner
Today’s intelligent Hadoop usage cases will continue to evolve quickly. Watch this space!
IBM Hadoop storage components, tools, offerings
Hadoop – It’s Not Just Internal Storage
Big Data application stack
User interface layer (visualization layer): reports, dashboards, mashups, search, ad hoc reporting, spreadsheets
Analytic process layer (analytics layer): real-time computing and analysis, stream computing, entity analytics, data mining, data proximity, content management, text analytics, etc. This is the location of competitive advantage: analytics applications.
Infrastructure layer (cloud infrastructure layer: servers, storage, OS software): virtualization; central end-to-end management, control, and deployment of software, servers, and storage in a geographically dispersed environment
Security/authorization spans all layers, from users down through the infrastructure.
[Diagram also labels: Users; IBM Big Data Software]
IBM Big Data Analytics Solutions
[Diagram: IBM InfoSphere Streams provides analytics on data in motion (streaming data); IBM InfoSphere BigInsights provides analytics on data at rest (internet-scale data sets); a traditional data warehouse provides analytics on structured data. All draw on traditional/relational and non-traditional/non-relational data sources.]
Big Data infrastructure layer
Infrastructure layer (cloud infrastructure layer: servers, storage, OS software): virtualization; central end-to-end management, control, and deployment of software, servers, and storage in a geographically dispersed environment
IBM Direct Attached Storage solutions for Hadoop
Rack-level features:
– Up to 20 System x3630 M4 nodes
– Up to 6 System x3550 M4 management nodes
– Up to 960 TB storage
– Up to 240 Intel Sandy Bridge cores
– Up to 3,840 GB memory
– Up to two 10 Gb Ethernet (IBM G8264-T) switches
– Scalable to multi-rack configurations

Available enterprise and performance features: redundant storage, redundant networking, high-performance cores, increased memory, high-performance networking

Reference architecture: high-volume x86 systems; each system has local storage
Integrated solution: PureData System for Hadoop
IBM external Big Data storage: GPFS Storage Server scalable building-block approach
The storage solution includes data servers, disk (2 TB or 3 TB NL-SAS, SSD), GPFS software RAID, and InfiniBand/Ethernet, with no storage controllers; building blocks combine x3650 M4 servers and JBOD disk enclosures:
– GSS 24 (light and fast): 2 x3650 servers + 4 JBOD enclosures, 20U rack, 10 GB/s
– GSS 26 (workhorse): 2 x3650 servers + 6 JBOD enclosures, 28U, 12 GB/s
– High-density option: 6 x3650 servers + 18 JBOD enclosures, two 42U standard racks, 36 GB/s
IBM shared storage infrastructure solutions

– High volume & availability (mainframe & open): DS8000, XIV, SONAS
– Storage for distributed systems: N series, Storwize V7000 / V7000 Unified, DS3500/DCS3700
– Storage management software: Tivoli Storage Productivity Center, Tivoli Storage FlashCopy Manager, Tivoli Storage Manager, Tivoli Key Lifecycle Manager, Virtual Storage Center
– Integrated innovation: storage virtualization software and SVC, Real-time Compression, deduplication, Easy Tier, IBM Active Cloud Engine, Linear Tape File System (LTFS)
– Data protection & retention: tape library TS3310, tape virtualization TS7740, tape automation TS3500, LTO 3, 4, and 5 tape drives, ProtecTIER TS7610/20/50
IBM solutions for a Big Data world
– “Block” storage: Storwize V7000, XIV Gen3, DS8800 (3 TB and 4 TB disks, solid state drives)
– “File” storage: Scale Out NAS (SONAS), GPFS Storage Server
– “Unified” storage: Storwize V7000 Unified
– IBM tape systems: TS3500 (2.7 exabytes)
– Analytics: IBM Netezza, InfoSphere Streams
Learning Points
– There are many cases (indeed, most) where traditional Hadoop Direct Attached Storage is appropriate
– However, there are many intelligent usage cases where Hadoop external shared storage, intelligently implemented, brings significant value:
• Stage 1: intelligently direct-attach larger external storage arrays or external file systems for Hadoop primary storage, while still preserving DAS data locality and the ability to reach internet scale, using external storage to bring desired function or reduce the number of Hadoop copies
• Stage 2: augment Hadoop DAS primary storage with a second storage layer (external file system, NAS, or SAN) as a data protection or archival layer, intelligently allocating, importing, and exporting data
• Stage 3: directly replace primary node-based DAS with external shared storage (file system, NAS, or SAN), appropriate for certain clusters and Hadoop environments where network rebalancing, adapter/network costs, and scale are in line with shared storage benefits
– Most importantly, the Hadoop-and-storage topic is fast-moving and constantly evolving:
• Soon, adoption of flash as Hadoop primary storage will significantly change Hadoop dynamics, moving the Hadoop bottleneck from storage to the network
Trademarks and disclaimers
© IBM Corporation 2011. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries.
Other product and service names might be trademarks of IBM or other companies. Information is provided "AS IS" without warranty of any kind.
The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.
Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your IBM representative or Business Partner for the most current pricing in your geography.
Photographs shown may be engineering prototypes. Changes may be incorporated in production models.
Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml.
ZSP03490-USEN-00
Appendix
Recommend you download and read this very informative IBM book
"Understanding Big Data"
– Published April 2012
– Free download
– Well worth reading to understand the components of Big Data, and how to exploit them

Part I: The Big Deal about Big Data
– Chapter 1 – What is Big Data? Hint: You're a Part of it Every Day
– Chapter 2 – Why Big Data is Important
– Chapter 3 – Why IBM for Big Data

Part II: Big Data: From the Technology Perspective
– Chapter 4 – All About Hadoop: The Big Data Lingo Chapter
– Chapter 5 – IBM InfoSphere BigInsights – Analytics for "At Rest" Big Data
– Chapter 6 – IBM InfoSphere Streams – Analytics for "In Motion" Big Data
Download your free copy here: http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
IBM InfoSphere BigInsights = IBM Hadoop distribution
Both editions build on core Hadoop:

BigInsights Basic Edition
– Free download with web support
– Limited to <= 10 TB of data
– Optional: 24x7 paid support (Fixed Term License)

BigInsights Enterprise Edition
– Tiered, terabyte-based pricing
– Easy installation and programming
– Enterprise-grade features: analytics tooling/visualization, administration tooling, development tooling, High Availability, flexible storage, recoverability, security

Professional Services Offerings: QuickStart, Bootcamp, Education, Custom Development