NextGen Infrastructure for Big Data
Author: Anil Vasudeva, President & Chief Analyst, IMEX Research

NextGen Infrastructure for Big Data © 2012 Storage Networking Industry Association. All Rights Reserved. Source: © IMEX Research – Big Data Industry Report ©2011-12

SNIA Legal Notice

The material contained in this tutorial is copyrighted by the SNIA and author unless otherwise noted. Member companies and individual members may use this material in presentations and literature under the following conditions:

Any slide or slides used must be reproduced in their entirety without modification. The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.

This presentation is a project of the SNIA Education Committee. Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney. The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

Abstract

This session will appeal to business planning, marketing, technology system integrators and data center managers seeking to understand the drivers behind the demand for, and rise of, Big Data.

The internet has spawned an explosion in data growth in the form of data sets, called Big Data, that are so large they are difficult to store, manage and analyze using traditional RDBMS, which are tuned for Online Transaction Processing (OLTP) only. Not only is this new data heavily unstructured, voluminous, rapidly streaming and difficult to harness, but, even more importantly, the infrastructure cost of the HW and SW required to crunch it using traditional RDBMS, to derive any analytics or business intelligence online (OLAP) from it, is prohibitive. To capitalize on the Big Data trend, a new breed of Big Data technologies (such as Hadoop and others) and many companies have emerged, leveraging new parallelized processing, commodity hardware, open source software and tools to capture and analyze these new data sets and provide a price/performance that is 10 times better than existing Database/Data Warehousing/Business Intelligence systems.

Learning Objectives
• The presentation will illustrate the existing operational challenges businesses face today using RDBMS systems despite using fast-access in-memory and solid state storage technologies.
• It details how IT is harnessing the emergent Big Data to manage massive amounts of data, and new techniques such as parallelization and virtualization to solve complex problems, in order to empower businesses with knowledgeable decision-making.
• It lays out the rapidly evolving Big Data technology ecosystem: different Big Data technologies from Hadoop, distributed file systems and emerging NoSQL derivatives for implementation in private and hybrid cloud-based environments, and the storage infrastructure requirements to store, access, secure and prepare data for analytics and visualization while manipulating it rapidly to derive business intelligence online, to run businesses smartly.

Big Data in IT Industry Roadmap

• Standardization: Standard IT infrastructure with volume economics for HW and system SW (servers, storage, networking devices; system software: OS, MW & data mgmt. SW).
• Integration/Consolidation: Integrate physical infrastructure/blades to meet CAPSIMS ®IMEX (Cost, Availability, Performance, Scalability, Inter-operability, Manageability & Security).
• Virtualization: Pools resources. Provisions, optimizes, monitors and shuffles resources to optimize delivery of various business services.
• Automation: Automatically maintains application SLAs (Self-Configuration, Self-Healing ©IMEX, Self-Acctg. Charges etc.).
• Cloudization: On-Premises > Private Clouds > Public Clouds. DC to cloud-aware infrastructure & apps. Cascade migration to SPs/public clouds.
• Analytics – BI: Predictive analytics on unstructured data. From dashboard visualization to prediction engines using Big Data.

NextGen IT Infrastructure

[Figure: path of a client request across the NextGen IT infrastructure. Clients (home networks, Web 2.0 social networks such as Facebook, Twitter, YouTube, remote/branch offices, suppliers/partners) connect over cable/DSL, cellular and wireless links through edge ISPs and the core optical internet to public cloud centers (servers, VPN, IaaS, PaaS, SaaS, vertical clouds) and to the enterprise virtualized data center/on-premise cloud. Inside the data center: Tier-1 edge apps (caching, proxy, FW, SSL, IDS, DNS, LB, web servers), Tier-2 apps (application servers: HA, file/print, ERP, SCM, CRM), Tier-3 database (database servers, middleware, data mgmt.), switches (Layer 4-7, Layer 2, 10GbE, FC storage), FC/IP SANs, and a management, directory, security and policy middleware platform.]

A request for data from a remote client to a data center or cloud crosses a myriad of systems and devices. The key is identifying bottlenecks and improving performance.

Harnessing Big Data for Business Insights

Majority of data growth is being driven by unstructured data and billions of large objects

Information is at the center of New Wave of opportunity

80% of the world's data is unstructured, driven by the rise of mobile devices, collaboration and machine-generated data.

Data Sources

Big Data Infrastructure

Business Insights

Unstructured Big Data can provide NextGen analytics to help businesses make better-informed decisions in:
▪ Product Strategy
▪ Targeting Sales
▪ Just-In-Time Supply-Chain Economics
▪ Business Performance Optimization
▪ Predictive Analytics & Recommendations
▪ Country Resources Management

Corporate Need: Business Perf... Optimization

Corporate Need: Real Time Analytics

Corporate Need: Business Insights

Item | Issue | Solution
Store | Information exploding. Volume: digital content doubling every 18 months. Velocity: >80% growth driven from unstructured data. Variety: sources of data changing. | A unified information/content storage methodology that enables users to manage the volume, velocity and variety of information from multiple sources.
Manage | Complexity in "managing" information: need to classify, synchronize, aggregate, integrate, share, transform, profile, move, cleanse, protect, retire. | A solution portfolio of tools and services to manage all types of information in a hybrid storage environment.
Analyze | Current solutions limited to BI tools focused on structured and lagging information. | Build/buy packaged real-time predictive analytical solutions for unstructured analytics tools.
Collaborate | Multiple access methods needed to meet needs of a diverse audience. | Centralized share, collaborate and act on insights anytime, anywhere on any device.
Model/Adapt | Ability to understand how the information impacts the business. How to transfer to action. | Model information on current operations w/potential strategy impact. Leverage tech. to adapt.

Opportunity: Converting Big Data Deluge into Predictive Analytics & Insights

Personal Location Services | Data Generated (TB/Year)
Navigation Devices | 600
Navigation Apps on Phones | 20
Smart Phone Opt-In Tracking | 1000
Geo Targeted Ads | 20
People Locator (Emergency Calls/Search..) | 10
Location Based Services (e.g. Games) | 5
Other | 45
Total (Est.) | 1700

Big Data Predictive Analytics

Issues with Existing RDBMS

Key Issues with RDBMS Technologies
• Handling Mixed Unstructured Data - RDBMS don't handle non-tabular data (notorious for doing a poor job on recursive data structures)
• Legacy Archaic Architecture - RDBMS don't parallelize well to accommodate commodity HW clusters
• Speed - Seek time of physical storage has not kept pace with network speed improvements
• Scale - Difficult to scale out RDBMS efficiently; clustering beyond a few servers is notoriously hard
• Integration - Data processing tasks need to combine data from non-related sources, over a network
• Volume - Data volumes have grown from 10s of GB to 100s of TB to PBs in recent years. Existing tabular RDBMS can't handle such large DBs

[Figure: Big Data Analytics requirements: Fault Tolerance, Availability, Deep Insights, EU Adhoc Analytics, Cost, Unstructured Data, Latency, Scalability.]

Issues with Existing RDBMS Present RDBMS struggling to Store & Analyze Big Data

Big Data - Database Solutions

Big Data – The New Face of DBs

Big Data Paradigm - The New Face of DB Systems
• Adopts schema-free architecture
• Can do away with legacy relational DB systems: some data have sparse attributes and do not need the relational property
• Key-oriented queries: some data are stored/retrieved mainly by primary key, w/o complex joins
• Trade-off of Consistency, Availability & Partition Tolerance
• Scale out, not up: online load balancing, cluster growth
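The schema-free, key-oriented access pattern listed above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's API: the `KeyValueStore` class, its hash-based partitioning and the sample records are all hypothetical, and each in-memory partition merely stands in for one node of a scale-out cluster.

```python
class KeyValueStore:
    """Toy schema-free store: each record is a dict with arbitrary attributes."""

    def __init__(self, num_partitions=4):
        # Each partition stands in for one node of a scale-out cluster (assumption).
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Hash partitioning: the key alone decides which partition holds the record.
        return self.partitions[hash(key) % len(self.partitions)]

    def put(self, key, record):
        # No schema is declared; records may carry different (sparse) attributes.
        self._partition_for(key)[key] = dict(record)

    def get(self, key):
        # Retrieval is by primary key only, with no joins.
        return self._partition_for(key).get(key)


store = KeyValueStore()
store.put("user:1", {"name": "Ada", "city": "London"})
store.put("user:2", {"name": "Lin", "device": "phone", "last_login": "2012-04-01"})
print(store.get("user:2")["device"])
```

The two records share only the `name` attribute, which is the sparse-attribute case the slide describes: nothing forces them into a common table layout.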

Analytics – The Next Frontier in IT

Key Innovations: HW Technologies

Key Innovations – Solid State Storage

Source: IMEX Research SSD Industry Report ©2011

Note: 2U storage rack; 2.5” HDD max cap = 400GB / 24 HDDs, de-stroked to 20%; 2.5” SSD max cap = 800GB / 36 SSDs

[Chart: HDD and SSD units (millions, scale 0-600) and IOPS/GB (scale 0-8), 2009-2014.]

Key to Database performance are random IOPS. SSDs outshine HDD in IO price/performance – a major reason, besides better space and power, for their explosive growth.

Storage - IOPS/GB & Price Erosion - HDD vs. SSDs

Innovations – DB SW Technologies

Tech Innovation timeline, 1985-2015 (each row listed earliest to latest):
• OLTP Transactions DB SW: Rows Locking, Optimizer, Parallel Query, Clustering, XML, Grid, Open Source/Hadoop
• OLAP Analytics DB SW: Indexing, Partitioning, Columnar, Materialized View, Bit Mapped Index, In-Memory, Query Binding
• Hardware: 32 bit, SMP, NUMA, 64 bit, Multi-core/Blades, Flash, MPP
• Big Data: Multi-core, Columnar, In-Memory, MPP, Visualization

[Chart: OLTP Database Innovation Progress, 1985-2015: $/TPMc and TPMc/Processor on logarithmic scales.]

[Chart: Big Data: Analytics DB Technology Impact, 1985-2015: DW Size (TB) and $/GB on logarithmic scales.]

Big Data - Key Requirements

Types of Data Organizations Analyze

Data Type | Hadoop | Non Hadoop
Customer/member data | 59% | 68%
Transactional data from applications | 44% | 68%
Application Logs | 69% | 37%
Other Types of Event Data | 64% | 23%
Network Monitoring/Network Traffic | 41% | 33%
Online Retail Transactions | 33% | 34%
Other Log Files | 51% | 26%
Call Data Records | 28% | 32%
Web Logs | 46% | 21%
Text data from social media and online | 36% | 15%
Search logs | 36% | 11%
Trade/quote data | 18% | 15%
Intelligence/defense data | 18% | 11%
Multimedia (audio/video/images) | 21% | 9%
Weather | 8% | 3%
Smartmeter data | 3% | 6%
Other (please specify) | 3% | 5%

Big Data – Architectural Goals

Big Data Platform
• Meet enterprise criteria
• Meet requirements of V3 (volume, velocity, variety)
• Analyze data in native format

Big Data - Market Requirements

Unified system: pre-integrated for ease of installation and management
• Platform - Large-scale indexing pre-integrated using Hadoop Foundation
• Integrated Text Analytics - Address unstructured data
• Usability - User-friendly admin console including HDFS Explorer and query languages
• Enterprise-Class Features - Provisioning, Storage, Scheduler, Advanced Security
• Supports a search-centric, document-based XML data model
  o Stores documents within a transactional repository
• Schema-Free:
  o No advance knowledge of the document structure (its "schema") needed
  o Index words and values from each of the loaded documents together with its document structure
• Standard commodity hardware leveraged
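The schema-free indexing requirement (index words and values together with the document structure, without knowing the schema in advance) might look roughly like the sketch below. It is an illustration under stated assumptions: documents are modeled as nested Python dicts rather than XML, and `walk`, `build_index` and the sample documents are hypothetical names, not part of any product mentioned in the deck.

```python
from collections import defaultdict


def walk(node, path=()):
    """Yield (path, value) pairs for every leaf of a nested dict/list document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, path)
    else:
        yield path, node


def build_index(documents):
    """Map each (structure path, word) pair to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for path, value in walk(doc):
            for word in str(value).lower().split():
                index[("/".join(path), word)].add(doc_id)
    return index


# Two documents with different shapes; no schema is declared up front.
docs = {
    "d1": {"title": "Big Data", "author": {"name": "Anil"}},
    "d2": {"title": "Data Warehousing", "tags": ["big", "olap"]},
}
index = build_index(docs)
print(sorted(index[("title", "data")]))  # ['d1', 'd2']
print(sorted(index[("tags", "big")]))    # ['d2']
```

Because the path is part of the index key, a query can distinguish "big" appearing in a title from "big" appearing in a tag, which is the point of indexing structure alongside content.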

Big Data – Market Requirements

Architectural
• Shared-nothing clustered DB architecture
  o Programmable and extensible application servers
• Support massive scalability to petabytes of source data
• Support open-source XQuery- and XSLT-driven architecture
• Simple to deploy, develop and manage (UI & RESTful interface)
• Support extreme mixed workloads - a wide variety of data types including arbitrarily hierarchical data structures, images, waveforms, data logs etc.
• Support thousands of geographically dispersed on-line users and programs executing a variety of requests, from ad hoc queries to strategic analysis
• Load data before declaring or discovering its structure
• Load data in batch and streaming fashion
• Integrate data from multiple sources during the load process at very high rates
• Spread I/O and data across instances
• Provide consistent performance with linear cost
• Leverage open source SW - low costs, multiple sources, Hadoop Foundation tools
• Connectivity with Oracle DB, Teradata Warehouse, JDBC Connectivity,

Big Data - Market Requirements

Real Time Analytics Execution
• Execute "streaming" analytic queries in real time on incoming load data
• Update data in place at full load speeds
• Schedule and execute complex multi-hundred-node workflows
• Join a billion-row dimension table to a trillion-row fact table without pre-clustering the dimension table with the fact table

Performance
• Analyze data at very high rates (>GB/sec)
• Predictable sub-ms response time for highly constrained standard SQL queries

Availability
• Ability to configure without any single point of failure
• Auto-failover, extreme high availability
  o Automated failover and process continuation without operational interruption when processing nodes fail

Big Data – Product Metrics Choices

Big Data - Product Metrics
• Data Set Size: PB, TB, GB
• Data Structure: Transaction, Machine, Unstructured, Other
• Access/Use: Transaction, Search, Analytics
• Parallel Processing: Appliance, Cluster < 1K, Cluster > 1K
• Memory: In-Memory, Flash
• DB Technique: Columnar, Zero Sharing, No SQL
• Data Cataloging SW: Text, Image, Audio, Video

Advantage: Big Data Products

Characteristic | Legacy Paradigm | Big Data Paradigm
Structure | Transactional/Corporate | Unstructured/Derivative/Internet
Mode | Data Collection | Data Analysis
Focus | Find Answers | Find Questions
Facility | Reportive / What Happened? | Analytic / Why did it Happen? Predictive / What will Happen Next?
Opportunity | Very Small Growth | Massive Growth
Players | Legacy Players | Agile Start Ups, well funded
Impact | Analyze Existing Businesses | Create New Businesses

Advantage: Big Data Products

Characteristic | Traditional RDBMS | Big Data/MapReduce
Data Size | GB | PB
Access | Interactive | Batch/Near Real-Time
Latency | Low | High
Data Updates | Read & Write Many Times | Write Once, Read Many Times
Schema/Structure | Static Schema | Dynamic Schema
Language | SQL | UQL/Procedural (Java, C++..)
Integrity | High | Not 100%
Works Well for | Process-Intensive Jobs | Data-Intensive Jobs
Works Well w Data Size | Gigabytes | Petabytes
Data/Processing Interactions | Low latency/high BW is a precursor to success. Ntwk. BW can be a bottleneck, causing nodes to be idle | Sends code to data, instead of sending data to other nodes (requiring lower BW in cluster)
Fault Tolerance | Coordinating processes with node failures is a challenge | Fault tolerant for HW/SW failures
Scaling | Non-linear | Linear
Pgm-Distribution of Jobs | Difficult | Simple & Effective

Big Data Ecosystem

Big Data Stack

[Figure: Big Data stack spanning Data Sources, Analytics Platform and Applications. Components shown: Collector, Admin Center, Data Importer, RHive, Enterprise, PerfMon, Query Plan, Hive Workflow, Oracle-to-Hive, Search, REST/JSON API, Data Exporter, Databases (Oracle, IBM, Teradata…), Advanced Analytics, OLAP Server, OLTP Server, ETL, Ad-hoc Query, Data Store (Hadoop), Real-time Queries, DBA, Streaming Data, Devices.]

Merging Hadoop innovations into NextGen DBMS

Hadoop’s Fit in Enterprise Stack

Big Data - Hadoop Architecture

Hadoop architecture layers (top to bottom):
• Programming Languages: Data Flow (Pig), SQL (Hive)
• Computations: Distributed Computing Framework (MapReduce)
• Metadata (HCatalog)
• Tabular Storage: Column-Stg (HBase)
• Object Storage: Hadoop Distributed File System (HDFS)
Spanning all layers: Coordination (Zookeeper), Management (HMS)

Big Data Connectors to EDW/BI

BI Framework - Interoperable with Enterprise Data Warehousing. Stack layers (bottom to top):
• Servers
• Operating System, Hypervisor/VMs
• Big Data Storage Framework (HDFS)
• Big Data Processing Framework (MapReduce)
• Big Data Access Framework: Pig, Hive, Sqoop
• Big Data (Connectors)
• Big Data Orchestration Framework: HBase, Avro, Flume, ZooKeeper
• BI Applications (Query, Analytics, Reporting, Statistics) and EDW
Services alongside the stack: Backup & Recovery, Management, Security, Network

Big Data Infrastructure – Map Reduce

MapReduce: A Distributed Computing Model
• Typical pipeline: Input > Map > Shuffle/Sort > Reduce > Output
• Easy to use: developer writes a few functions
• Moves compute to data: schedules work on the HDFS node holding the data
• Scans through data, reducing seeks
• Automatic reliability and re-execution on failure
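The Input > Map > Shuffle/Sort > Reduce > Output pipeline can be illustrated with a single-process word count in plain Python. This is only a sketch of the programming model, not Hadoop code: the function names (`map_phase`, `shuffle_sort`, `reduce_phase`, `run_job`) are hypothetical, and a real cluster would run the map and reduce functions in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_sort(pairs):
    """Shuffle/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Input > Map > Shuffle/Sort > Reduce > Output, all in one process."""
    mapped = (pair for record in records for pair in map_phase(record))
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]


result = run_job(["big data", "big clusters", "data data"])
print(result)  # [('big', 2), ('clusters', 1), ('data', 3)]
```

The developer supplies only `map_phase` and `reduce_phase`; grouping, ordering and (on a real cluster) scheduling and re-execution are the framework's job, which is what "developer writes a few functions" refers to.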

HDFS Architecture: Actively Maintaining High Availability

[Figure: a Name Node holds the namespace state and block map, with persistent namespace metadata and journal kept on NFS. The hierarchical namespace maps file name >> block IDs >> block locations. Data Nodes on JBOD storage map block ID >> data and hold the block replicas (b1, b2, b3), scaling I/O and storage horizontally. Data Nodes send heartbeats and block reports to the Name Node and periodically check block checksums. Recovery sequence: 0. bad/lost block, 1. replicate, 2. copy, 3. block received.]

Big Data Infrastructure – HDFS

• HDFS is an immutable file system: read, write, sync/flush; no random writes
• Storage servers are used for computation: move computation to data
• Fault tolerant and easy to manage: built-in redundancy, tolerates disk and node failure, auto-manages addition/removal of nodes, one operator per 8K nodes
• Not a SAN, but high-bandwidth network access to data via Ethernet
• Typically used to solve problems not feasible with traditional systems: large storage capacity (>100PB raw), large IO/computational bandwidth (>4K nodes/cluster), scale by adding commodity HW, cost ~$1.5/GB incl. MR cluster
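The name-node bookkeeping described for HDFS (file name to block IDs to block locations, plus re-replication when a data node is lost) can be modeled with a toy in-memory class. This is a simplified sketch, not the HDFS implementation: the `NameNode` class, its round-robin placement and the replication factor of 3 are assumptions made for illustration, and real HDFS placement is rack-aware and driven by heartbeats and block reports.

```python
REPLICATION = 3  # assumed target number of copies per block


class NameNode:
    """Toy model of the name node's namespace and block map."""

    def __init__(self, data_nodes):
        self.live_nodes = set(data_nodes)
        self.namespace = {}  # file name -> list of block ids
        self.block_map = {}  # block id -> set of data nodes holding a copy

    def add_file(self, name, block_ids):
        self.namespace[name] = list(block_ids)
        nodes = sorted(self.live_nodes)
        for i, block_id in enumerate(block_ids):
            # Round-robin placement, purely for illustration.
            self.block_map[block_id] = {
                nodes[(i + r) % len(nodes)] for r in range(REPLICATION)
            }

    def node_failed(self, node):
        """Drop a failed node and copy its blocks to surviving nodes."""
        self.live_nodes.discard(node)
        for holders in self.block_map.values():
            holders.discard(node)
            candidates = sorted(self.live_nodes - holders)
            while len(holders) < REPLICATION and candidates:
                holders.add(candidates.pop(0))  # stands in for "replicate, copy"

    def locate(self, name):
        """Resolve file name >> block ids >> block locations."""
        return {b: sorted(self.block_map[b]) for b in self.namespace[name]}


nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("/logs/a", ["b1", "b2", "b3"])
nn.node_failed("dn2")
print(nn.locate("/logs/a"))
```

After `dn2` fails, every block is back at three replicas on the surviving nodes, which is the built-in redundancy that lets the cluster tolerate disk and node failures without operator action.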

Hadoop Distributed File System

HDFS Architecture

HDFS Characteristics
• Based on Google GFS (Google File System)
• Redundant storage for massive amounts of data
• Data is distributed across all nodes at load time, enabling efficient MapReduce processing
• Runs on commodity hardware; assumes high failure rate for components
• Works well with lots of large files
• Built around "write once, read many times"
• Large streaming reads, not random access
• High throughput more important than low latency

Hadoop Architecture - Overview Hadoop Data Processing Architecture

Key Technologies Required for Big Data

• Cloud Infrastructure
• Virtualization
• Networking
• Storage
  o In-Memory Database (Solid State Memory)
  o Tiered Storage Software (Performance Enhancement)
  o Deduplication (Cost Reduction)
  o Data Protection (Back Up, Archive & Recovery)

Cloud Infrastructure for Big Data

Cloud service layers:
• SaaS (Software-as-a-Service)
  Examples: eMail - Yahoo!, Google…; Collaboration - Facebook, Twitter…; Bus. Apps - SalesForce, GoogleApps, Intuit…
• PaaS (Platform Tools & Services): deploy developed platforms ready for application SW on cloud-aware infrastructure
  Examples: Amazon EC2, Force.com, Navitaire
• IaaS (Infrastructure HW & Services): servers, network, storage; management, reporting
  Examples: Amazon S3, Nirvanix

Cloud Services Providers
• Public - Multitenancy, On Demand
• Private - On Premises, Enterprise
• Hybrid - Interoperable P2P
Examples: Public - BT, Telstra, T-Systems, France Telecom; Private – Hybrid – IBM/Cloudburst,

Cloud Infrastructure for Big Data

[Figure: cloud computing stack. SaaS applications, each with its own App SLA, run on PaaS platform tools and services (.Net, Python, EJB, Ruby, PHP, …) and operating systems, over IaaS virtualized resources (servers, storage, networks), with a management layer alongside. Deployment options: public cloud (service providers), private cloud (enterprise) and hybrid cloud.]

An application's SLA dictates the resources required to meet its specific requirements for availability, performance, cost, security, manageability etc.

Private Cloud Requirements for Big Data

[Figure: app silos, virtualization (VZ), private cloud storage and public cloud storage plotted against costs and operational flexibility.]

• Automation: Automated provisioning, moving data and processes seamlessly. Allocation and self-tuning of resources to meet workload requirements.
• Self-Service: Access resources on demand to speed deployment and delivery. Scale resources up/down to optimize their usage; release when not needed.
• Service Catalog: Choose pre-defined IT services per user/dept. Define SLAs to efficiently meet services.
• Service Analytics: Monitor and analyze usage for chargeback. Interactively auto-tune performance and availability to SLA requirements.

Virtualization: Workloads Consolidation

Source: Dan Olds & IMEX Research 2009

• A single server 1.5x larger than a standard 2-way server will handle the consolidated load of 6 servers.
• VZ manages the workloads, and important apps get the compute resources they need automatically w/o operator intervention.
• Physical consolidation of 15-20:1 is easily possible.
• Reasonable goal for VZ x86 servers: 40-50% utilization on large systems (>4-way), rising as dual/quad-core processors become available.
• Savings result in real estate, power & cooling, high availability, hardware and management.

Virtualization: TCO Savings

[Chart: cost over 3 years without VZ vs. with VZ (scale $0 to $16,000), for 995 pre-virtualization (VZ) servers vs. 78 VZ servers. Cost components: provisioning, hardware, SAN, network, power & cooling, DC real estate, disaster recovery, downtime, VZ SW & support.]

For TCO analysis, email imex@imexresearch.com or call (408) 268-0800.
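As a quick arithmetic check on the server counts cited in the chart (995 pre-virtualization servers, 78 virtualized servers), the implied consolidation ratio can be computed as below. Only the two counts come from the slide; the variable names are illustrative, and the calculation says nothing about the dollar figures.

```python
servers_before = 995  # pre-virtualization servers (from the chart)
servers_after = 78    # virtualized servers (from the chart)

ratio = servers_before / servers_after
reduction = 1 - servers_after / servers_before

print(f"Consolidation ratio: {ratio:.1f}:1")
print(f"Physical servers eliminated: {reduction:.1%}")
```

The result, roughly 12.8:1, sits a little below the 15-20:1 physical consolidation the deck calls easily possible.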

Storage Infrastructure for Big Data

Storage Efficiency
• Virtualization: Mapping P > V, VM management
• Performance: In-memory DB, auto-tiering SSD/HDD
• Costs Reduction: Thin provisioning, deduplication
• Availability: RAID/auto recover, HA, snapshots, CDP, cloning, DRS
• Security: Encryption/DLP

Service Efficiency
• Storage-as-a-Service: Service catalogs by workloads etc.
• Policy Infrastructure: Service level attributes, service measurements
• Performance Analytics: IOPS/response time, bandwidth
• Automation: Unified SAN/NAS protocols, auto-learning workload forensics, provisioning to match workloads, assured auto recovery

Storage Architecture - Impact from VZ

Source: IMEX Research SSD Industry Report ©2011

[Figure: storage architecture building blocks. Data Protection: Replication, RAID (0, 1, 5, 6, 10), Virtual Tape, Back Up/Archive/DR. Virtualization. Storage Efficiency: MAID, Deduplication, Thin Provisioning, Auto Tiering.]

Virtualization (VZ) requires shared storage for:
• VMotion
• Storage VMotion
• HA/DRS
• Fault Tolerance

Additional capacity is consumed for:
• VZ snapshots
• VM kernel etc.

Storage – Issues & Solutions

Technologies Reducing Storage Costs

[Chart: storage costs reduction vs. capacity requirements under traditional data growth.]

Capacity savings by technology:
• Snapshots ~75%
• Thin Provisioning ~30%
• Deduplication ~25-95%
• Auto-Tiering 65-95%
• Thin Replication ~95%
• RAID*DP ~40% vs. R10
• Virtual Clones ~80%

CAPACITY SAVINGS ~ xx %

Storage Architecture Impacting Big Data

Auto-Tiering System using Flash SSDs

• Data Class (Tiers 0, 1, 2, 3)
• Storage Media Type (Flash/Disk/Tape)
• Policy Engines (Workload Mgmt.)
• Transparent Migration (Data Placement)
• File Virtualization (Uninterrupted App. Opns. in Migration)

Source: IMEX Research SSD Industry Report ©2011

[Figure: storage architecture building blocks. Data Protection: Replication, RAID (0, 1, 5, 6, 10), Virtual Tape, Back Up/Archive/DR. Storage Virtualization. Storage Efficiency: MAID, Deduplication, Thin Provisioning, Auto-Tiering.]


DRAM

Flash SSD

Performance Disk

Capacity Disk

Tape
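A policy engine for auto-tiering across the tiers listed above (DRAM, Flash SSD, Performance Disk, Capacity Disk, Tape) can be sketched as a simple threshold rule on access frequency. The thresholds, function names and sample data sets below are hypothetical and chosen only to make the example run; a real policy engine would also weigh capacity, cost and workload history.

```python
TIERS = ["DRAM", "Flash SSD", "Performance Disk", "Capacity Disk", "Tape"]

# Hypothetical policy: minimum accesses per day needed to qualify for each tier.
THRESHOLDS = [10_000, 1_000, 100, 1, 0]


def choose_tier(accesses_per_day):
    """Return the fastest tier whose threshold the access rate meets."""
    for tier, minimum in zip(TIERS, THRESHOLDS):
        if accesses_per_day >= minimum:
            return tier
    return TIERS[-1]


def plan_migrations(placement, access_counts):
    """Return {item: (current_tier, target_tier)} for items that should move."""
    moves = {}
    for item, current in placement.items():
        target = choose_tier(access_counts.get(item, 0))
        if target != current:
            moves[item] = (current, target)
    return moves


placement = {
    "redo_log": "Capacity Disk",
    "archive_2009": "Flash SSD",
    "orders": "Performance Disk",
}
counts = {"redo_log": 5_000, "archive_2009": 0, "orders": 250}
print(plan_migrations(placement, counts))
```

Here the hot log is promoted to flash and the idle archive is demoted to tape, while the moderately used table stays put; transparent migration then means applications keep running while the data is moved.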

Data Storage: Hierarchical Usage

Figure: I/O Access Frequency vs. Percent of Corporate Data. X-axis: % of Corporate Data (1%, 2%, 10%, 50%, 100%). Y-axis: % of I/O Accesses (65%, 75%, 95%).

• SSD: Logs, Journals, Temp Tables, Hot Tables
• FCoE/SAS Arrays: Tables, Indices, Hot Data
• SATA: Back Up Data, Archived Data
• Cloud Storage: Offsite DataVault

Source: IMEX Research - Cloud Infrastructure Report ©2009-12


SSD Storage: Filling Price/Perf. Gaps

Figure: Price ($/GB) vs. Performance (I/O Access Latency). Plotted technologies: CPU, SDRAM, DRAM, SCM, NOR, NAND, PCIe SSD, SATA SSD, HDD, Tape.

• HDD becoming cheaper, not faster
• DRAM getting faster (to feed faster CPUs) and larger (to feed multi-cores and multi-VMs from virtualization)
• SSD segmenting into PCIe SSD cache (as back end to DRAM) and SATA SSD (as front end to HDD)

Source: IMEX Research SSD Industry Report ©2010-12


SSD Storage - Performance & TCO

SAN TCO using HDD vs. Hybrid Storage (Cost, $K)

Cost component     HDD Only   HDD/SSD
Power & Cooling    14.2       5.2
Rack Space         75         28
SSDs               0          64
HDD SATA           0          36
HDD FC             145        0

SAN Performance Improvements using SSD

Figure: FC-HDD Only vs. SSD/SATA-HDD, showing Performance (IOPS, axis 0-300) and $/IOP (axis 0-10).
• IOPS Improvement: 475%
• $/IOPS Improvement: 800%

Source: IMEX Research SSD Industry Report ©2011


Workloads Characterization

* IOPS for a required response time (ms): IOPS = #Channels × Latency^-1
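The footnote's relation, IOPS = #Channels × Latency^-1, can be worked through numerically. The sketch below reads "channels" as the number of I/Os serviced in parallel, each taking one latency period; that reading, the function names and the example figures are assumptions for illustration.

```python
def iops(channels: int, latency_ms: float) -> float:
    """IOPS = #Channels x (1 / Latency), with latency given in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return channels * 1000.0 / latency_ms


def channels_needed(target_iops: float, latency_ms: float) -> float:
    """Invert the relation: parallel channels required to reach a target IOPS."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return target_iops * latency_ms / 1000.0


# Assumed example figures: 8 channels at disk-like and flash-like latencies.
for latency in (5.0, 0.1):
    print(f"8 channels at {latency} ms -> {iops(8, latency):,.0f} IOPS")
```

For a fixed response-time target, the same relation shows that more IOPS can only come from more parallel channels or from lower latency per I/O.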

Figure: workloads positioned by IOPS* (y-axis: 10, 100, 1K, 10K, 100K, 1000K) and bandwidth (x-axis: 1, 5, 10, 50, 100, 500 MB/sec).

Workloads shown:
• OLTP, eCommerce, Transaction Processing
• Data Warehousing, OLAP, Business Intelligence
• Web 2.0, Audio, Video, Scientific Computing, Imaging HPC, TP HPC

RAID annotations on the chart: (RAID 0, 3) and (RAID 1, 5, 6).

Source: IMEX Research - Cloud Infrastructure Report ©2009-12


Workloads Characterization

Storage performance, management and costs are big issues in running databases.

Data Warehousing workloads are I/O intensive
• Predominantly read based, with low hit ratios on buffer pools
• High concurrent sequential and random read levels: sequential reads require a high level of I/O bandwidth (MB/sec); random reads require high IOPS
• Write rates driven by life cycle management and sort operations

OLTP workloads are strongly random I/O intensive
• Random I/O is more dominant
• Read/write ratios of 80/20 are most common but can be 50/50
• Can be difficult to build out test systems with sufficient I/O characteristics

Batch workloads are more write intensive
• Sequential writes require a high level of I/O bandwidth (MB/sec)

Backup & Recovery times are critical for these workloads
• Backup operations drive high levels of sequential I/O
• Recovery operations drive high levels of random I/O

Source: IMEX Research SSD Industry Report ©2011
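The bandwidth (MB/sec) and IOPS requirements described for these workloads are linked by the I/O size: throughput equals IOPS multiplied by the size of each I/O. The sketch below makes that arithmetic explicit. The two workload profiles and their block sizes are assumed examples, not measurements from the report.

```python
def bandwidth_mb_per_sec(iops: float, block_size_kb: float) -> float:
    """Throughput in MB/sec = IOPS x I/O size (KB) / 1024."""
    return iops * block_size_kb / 1024.0


def iops_for_bandwidth(mb_per_sec: float, block_size_kb: float) -> float:
    """IOPS needed to sustain a given throughput at a given I/O size."""
    return mb_per_sec * 1024.0 / block_size_kb


# Assumed example profiles: (IOPS, I/O size in KB).
profiles = {
    "OLTP random 8 KB reads": (20_000, 8),
    "Data warehouse sequential 1 MB scans": (500, 1024),
}
for name, (rate, size_kb) in profiles.items():
    mb = bandwidth_mb_per_sec(rate, size_kb)
    print(f"{name}: {rate} IOPS -> {mb:.1f} MB/sec")
```

With small random I/Os the IOPS figure is the constraint, while large sequential I/Os reach high MB/sec at modest IOPS, which is why the two workload families stress storage differently.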


Best Practices – Storage in Big Data Apps

Goals & Implementation
• Establish goals for SLAs (Performance/Cost/Availability), BC/DR (RPO/RTO) & Compliance
• Increase performance for DB, OLTP and OLAP apps: Random I/O > 20x, Sequential I/O Bandwidth > 5x
• Remove stale data from production resources to improve performance
• Use partitioning software to classify data by frequency of access (recent usage) and capacity (percent of total data), using general guidelines such as: Hyperactive (1%), Active (5%), Less Active (20%), Historical (74%)
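To see what the guideline percentages mean in capacity terms, the following sketch splits a total data set across the four activity classes. The shares come from the slide; the 100 TB total and the function name `capacity_by_class` are assumptions for the example.

```python
# Guideline shares from the slide; treat them as rules of thumb.
GUIDELINE_SHARES = {
    "Hyperactive": 0.01,
    "Active": 0.05,
    "Less Active": 0.20,
    "Historical": 0.74,
}


def capacity_by_class(total_tb: float, shares=GUIDELINE_SHARES):
    """Split a total capacity (TB) across activity classes by share."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 100%")
    return {name: total_tb * share for name, share in shares.items()}


# Assumed example: a 100 TB data set.
for name, tb in capacity_by_class(100.0).items():
    print(f"{name:<12} {tb:6.1f} TB")
```

Under these guidelines only about 6% of capacity (Hyperactive plus Active) is a candidate for the fastest tiers, which is what makes a small flash tier economical.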

Implementation
• Optimize tiering by classifying hot & cold data
• Improve query performance by reducing the number of I/Os
• Reduce the number of disks needed by 25-50% using advanced compression software achieving 2-4x compression
• Match data classification to tiered devices accordingly: Flash, High Perf Disk, Low Cost Capacity Disk, Online Lowest Cost Archival Disk/Tape
• Balance cost vs. performance of Flash: More Data in Flash > Higher Cache Hit Ratio > Improved Data Performance
• Create and auto-manage tiering (Monitoring, Migrations, Placements) without manual intervention
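The chain "More Data in Flash > Higher Cache Hit Ratio > Improved Data Performance" can be illustrated with the standard weighted-average model of access latency. The latency figures below (0.2 ms for flash, 8 ms for HDD) are assumed round numbers for the example, not values from the report.

```python
def effective_latency_ms(hit_ratio: float, flash_ms: float, hdd_ms: float) -> float:
    """Average latency when a fraction `hit_ratio` of I/Os is served from flash."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * flash_ms + (1.0 - hit_ratio) * hdd_ms


FLASH_MS, HDD_MS = 0.2, 8.0  # assumed device latencies
for hit_ratio in (0.0, 0.5, 0.9, 0.99):
    latency = effective_latency_ms(hit_ratio, FLASH_MS, HDD_MS)
    print(f"hit ratio {hit_ratio:.0%}: {latency:.2f} ms average")
```

Because every miss still pays the full HDD latency, most of the gain arrives only at high hit ratios, which is the cost vs. performance balance the slide refers to.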


Best Practices: I/O Forensics in Storage-Tiering

Source: IBM & IMEX Research SSD Industry Report 2011 ©IMEX 2010-12

LBA Monitoring and Tiered Placement
• Every workload has a unique I/O access signature
• Historical performance data for a LUN can identify performance skews & hot data regions by LBAs
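A minimal sketch of the idea of finding hot regions by LBA: bucket a trace of accessed LBAs into fixed-size extents, count I/Os per extent, and pick the smallest set of extents that absorbs a chosen share of the I/O. The extent size, the 80% share, the sample trace and the function names are all assumptions for illustration; production tiering software uses its own monitoring granularity and policies.

```python
from collections import Counter

EXTENT_BLOCKS = 1024  # assumed extent size, in logical blocks


def extent_heat(lba_trace, extent_blocks=EXTENT_BLOCKS):
    """Count I/Os per extent; extent index = LBA // extent size."""
    return Counter(lba // extent_blocks for lba in lba_trace)


def hot_extents(heat, io_share=0.8):
    """Hottest extents that together receive at least `io_share` of all I/Os."""
    total = sum(heat.values())
    hot, covered = [], 0
    for extent, count in heat.most_common():
        if covered >= io_share * total:
            break
        hot.append(extent)
        covered += count
    return hot


# Assumed sample trace: most accesses fall in the first extent.
trace = [10, 20, 30, 1500, 15, 25, 5000, 12, 18, 9000]
heat = extent_heat(trace)
print("I/Os per extent:", dict(heat))
print("Hot extents:", hot_extents(heat))
```

The skew is visible directly in the counts: one extent out of four touched receives most of the I/O, so it is the natural candidate for placement on SSD.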

Diagram: Storage-Tiering at LBA/Sub-LUN Level (Storage-Tiered Virtualization). A Logical Volume maps to Physical Storage, with Hot Data on SSD Arrays and Cold Data on HDD Arrays, and Automatic Migration (Policy Based) between them.


Best Practices: Cached Storage

Configuration: App/DB Server with LSI MegaRAID CacheCade Pro 2.0

Application improvement, cached vs. HDD only:
• Oracle OLTP Benchmarks: 681%
• SQL Server OLTP Benchmark: 1251%
• Neoload (Web Server Simulation): 533%
• SysBench (MySQL OLTP Server): 150%

Chart: TPS for All HDD, Smart Flash Cache, and Persist Data on Warpdrive (data labels: 0, 373, 655; axis 0-700).

Chart: Response Time for All HDD, Smart Flash Cache, and Persist Data on Warpdrive (data labels: 0, 330, 660; axis 0-700).


Big Data Targets: Analytics

Key Industries Benefitting from Big Data Analytics

Figure: Global IT Spending by Industry Verticals 2010-15 ($B). Axes: CAGR % (2010-15) and 5 Year Cum Global IT Spending 2010-15 ($B).


Big Data Targets – Storage Infrastructure

Figure: Data Intensity by Industry Vertical, in Installed Terabytes/$M Rev 2010 (axis 0.0-0.9). Verticals shown:
• Banking & Financial Services
• Media & Entertainment
• Healthcare Providers
• Professional Services
• Telecommunications
• Pharma, Life Sciences & Medical Products
• Retail & Wholesale
• Utilities
• Industrial Electronics & Electrical Equipment
• SW Publishing & Internet Services
• Consumer Products
• Insurance
• Transportation
• Energy

Value Potential of Using Big Data by Data Intensive Verticals


Big Data Targets: Storage Infrastructure

Big Data Storage Potential: Data Stored by Large US Enterprises

Industry                             Share of stored data (US, 2009 PB)   Stored data, TB/firm (>1K employees, US)
Discrete Manufacturing               14%                                  967
Government                           12%                                  1,312
Communications and Media             10%                                  1,792
Process Manufacturing                10%                                  831
Banking                              9%                                   1,931
Health Care Providers                6%                                   370
Securities & Investment Srvcs        6%                                   3,866
Professional Services                6%                                   278
Retail                               5%                                   697
Education                            4%                                   319
Insurance                            4%                                   870
Transportation                       3%                                   801
Wholesale                            3%                                   536
Utilities                            3%                                   1,507
Resource Industries                  2%                                   825
Consumer and Recreational Services   2%                                   150
Construction                         1%                                   231


Big Data Targets: Savings with Open Source

Legacy BI vs. Open Source Big Data Analytics


Rise of Big Data Adoption


Key Takeaways

Big Data is creating a paradigm shift in the IT industry
• Leverage the opportunity to optimize your computing infrastructure with Big Data infrastructure after due diligence in the selection of vendors/products, industry testing and interoperability
• Apply the best storage technologies listed in this presentation and elsewhere

Optimize Big Data Analytics for Query Response Time vs. # of Users
• Improve query response time for a given number of users (IOPS), or serve more users (IOPS) for a given query response time

Select Automated Storage Management Software
• Data forensics and tiered placement
• Every workload has a unique I/O access signature
• Historical performance data for a LUN can identify performance skews & hot data regions by LBAs
• Non-disruptively migrate hot data using auto-tiering software

Optimize Infrastructure to Meet Needs of Applications/SLA
• Performance economics/benefits
• Typically 4-8% of data becomes a candidate and, when migrated to higher performance tiering, can provide a response time reduction of ~65% at peak loads

Many industry verticals and applications will benefit from using Big Data

Source: IMEX Research SSD Industry Report ©2011


Q&A / Feedback

Many thanks to the following individuals for their contributions to this tutorial.

Source: IMEX Research

Joseph White, Anil Vasudeva

Send any questions or comments on this presentation to SNIA: [email protected]