AnilVasudeva NextGen Storage Big Data


    NextGen Infrastructure for Big Data. 2012 Storage Networking Industry Association. All Rights Reserved. Source: IMEX Research Big Data Industry Report 2011-12


    Abstract

    NextGen Infrastructure for Big Data

    This session will appeal to Business Planning, Marketing, Technology System Integrators and Data Center Managers seeking to understand the drivers behind the demand for and rise of Big Data.

    Abstract

    The internet has spawned an explosion in data growth in the form of data sets, called Big Data, that are so large they are difficult to store, manage and analyze using traditional RDBMS, which are tuned for Online Transaction Processing (OLTP) only. Not only is this new data heavily unstructured, voluminous, rapidly streaming and difficult to harness but, even more importantly, the infrastructure cost of the HW and SW required to crunch it using traditional RDBMS, to derive any analytics or business intelligence online (OLAP) from it, is prohibitive.

    To capitalize on the Big Data trend, a new breed of Big Data technologies (such as Hadoop and others) and many companies have emerged, leveraging new parallelized processing, commodity hardware, open source software and tools to capture and analyze these new data sets and provide a price/performance that is 10 times better than existing Database/Data Warehousing/Business Intelligence systems.

    Learning Objectives

    The presentation will illustrate the existing operational challenges businesses face today using RDBMS systems despite using fast-access in-memory and solid state storage technologies. It details how IT is harnessing the emergent Big Data to manage massive amounts of data, and new techniques such as parallelization and virtualization to solve complex problems, in order to empower businesses with knowledgeable decision-making.

    It lays out the rapidly evolving big data technology ecosystem: different big data technologies from Hadoop, distributed file systems and emerging NoSQL derivatives for implementation in private and hybrid cloud-based environments, and the storage infrastructure requirements to store, access, secure and prepare data for analytics and visualization while manipulating it rapidly to derive business intelligence online, to run businesses smartly.


    NextGen IT Infrastructure

    [Figure: path of a request from remote clients to the data center or cloud. Clients (home networks, Web 2.0 social networks such as Facebook, Twitter and YouTube, remote/branch offices, suppliers/partners) connect over cable/DSL, cellular and wireless access, through ISPs and the optical core/edge of the Internet, to the public cloud center (servers, VPN; IaaS, PaaS, SaaS; vertical clouds) and to the enterprise VZ data center (on-premise cloud).]

    Enterprise data center tiers:
    - Tier-1 Edge Apps: Caching, Proxy, FW, SSL, IDS, DNS, LB, Web Servers
    - Tier-2 Apps: Application Servers (HA, File/Print, ERP, SCM, CRM Servers)
    - Tier-3 Data Base: Database Servers, Middleware, Data Mgmt.
    - Management, Directory, Security, Policy; Middleware Platform
    - Switches: Layer 4-7, Layer 2, 10GbE, FC Stg.; FC/IP SANs

    A request for data from a remote client to a data center or cloud crosses a myriad of systems and devices. The key is identifying bottlenecks and improving performance.


    Harnessing Big Data for Business Insights

    - The majority of data growth is being driven by unstructured data and billions of large objects.
    - Information is at the center of a new wave of opportunity.
    - 80% of the world's data is unstructured, driven by the rise in mobility devices, collaboration and machine-generated data.

    [Figure: Data Sources > Big Data Infrastructure > Business Insights]


    Unstructured Big Data can provide NextGen analytics to help businesses make better-informed decisions in:
    - Product Strategy
    - Targeting Sales
    - Just-In-Time Supply-Chain Economics
    - Business Performance Optimization
    - Predictive Analytics & Recommendations
    - Country Resources Management

    Corporate Need: Business Perf... Optimization


    Corporate Need: Business Insights

    Item: Store
      Issue: Information exploding. Volume: digital content doubling every 18 months. Velocity: >80% of growth driven by unstructured data. Variety: sources of data changing.
      Solution: A unified information/content storage methodology that enables users to manage the volume, velocity and variety of information from multiple sources.

    Item: Manage
      Issue: Complexity in "managing" information. Need to classify, synchronize, aggregate, integrate, share, transform, profile, move, cleanse, protect, retire.
      Solution: A solution portfolio of tools and services to manage all types of information in a hybrid storage environment.

    Item: Analyze
      Issue: Current solutions limited to BI tools focused on structured and lagging information.
      Solution: Build/buy packaged real-time predictive analytical solutions for unstructured analytics tools.

    Item: Collaborate
      Issue: Multiple access methods needed to meet the needs of a diverse audience.
      Solution: Centralized share, collaborate and act on insights anytime, anywhere, on any device.

    Item: Model/Adapt
      Issue: Ability to understand how the information impacts the business. How to transfer to action.
      Solution: Model information on current operations w/ potential strategy impact. Leverage tech. to adapt.


    Opportunity: Converting Big Data Deluge into Predictive Analytics & Insights

    Big Data Predictive Analytics

    Personal Location Services Data Generated by | TB/Year
    Navigation Devices | 600
    Navigation Apps on Phones | 20
    Smart Phone Opt-In Tracking | 1000
    Geo Targeted Ads | 20
    People Locator (Emergency Calls/Search..) | 10
    Location based Services (e.g. Games) | 5
    Other | 45
    Total (Est.) | 1700


    Issues with Existing RDBMS

    Key Issues with RDBMS Technologies

    - Handling Mixed Unstructured Data: RDBMS don't handle non-tabular data (notorious for doing a poor job on recursive data structures).
    - Legacy Archaic Architecture: RDBMS don't parallelize well to accommodate commodity HW clusters.
    - Speed: Seek time of physical storage has not kept pace with network speed improvements.
    - Scale: Difficult to scale out RDBMS efficiently; clustering beyond a few servers is notoriously hard.
    - Integration: Data processing tasks need to combine data from non-related sources, over a network.
    - Volume: Data volumes have grown from 10s of GB to 100s of TB to PBs in recent years. Existing tabular RDBMS can't handle such large DBs.

    [Figure: Big Data Analytics needs: Fault Tolerance, Availability, Deep Insights, EU Adhoc Analytics, Cost, Unstructured Data, Latency, Scalability]


    Present RDBMS struggling to Store & Analyze Big Data


    Big Data - Database Solutions


    Big Data - The New Face of DBs

    Big Data Paradigm - the new face of DB systems:
    - Adopts Schema-Free Architecture: can do away with legacy relational DB systems; some data have sparse attributes and do not need relational properties.
    - Key-Oriented Queries: some data is stored/retrieved mainly by primary key, w/o complex joins.
    - Trade-off of Consistency, Availability & Partition Tolerance.
    - Scale Out, not Up: online load balancing, cluster growth.
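The schema-free, key-oriented access pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `KeyValueStore`, its methods and the sample records are hypothetical names invented here, not part of any product named in the deck, and a real store would add persistence, partitioning and replication.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Hypothetical in-memory store: records are looked up by primary key only."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, **attributes: Any) -> None:
        # No schema is declared: each record may carry a different set of attributes.
        self._records[key] = dict(attributes)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        # Retrieval is a single key lookup, with no joins involved.
        return self._records.get(key)


store = KeyValueStore()
store.put("user:1", name="Ada", email="ada@example.com")
store.put("user:2", name="Lin", last_login="2012-03-01", device="phone")
print(store.get("user:2"))
```

The two records share only the `name` attribute, which is the "sparse attributes" case: nothing forces empty columns onto records that lack a field.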


    Analytics - The Next Frontier in IT


    Key Innovations: HW Technologies


    Key Innovations - Solid State Storage

    Storage - IOPS/GB & Price Erosion - HDD vs. SSDs

    [Figure: units shipped (millions, 0-600) and IOPS/GB (0-8) for HDD vs. SSD, 2009-2014]

    Note: 2U storage rack; 2.5" HDD max cap = 400GB / 24 HDDs, de-stroked to 20%; 2.5" SSD max cap = 800GB / 36 SSDs.

    Key to database performance are random IOPS. SSDs outshine HDDs in I/O price/performance, a major reason, besides better space and power, for their explosive growth.

    Source: IMEX Research SSD Industry Report 2011
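The drive counts in the chart note above imply very different usable capacities per 2U rack. The sketch below works through that arithmetic. It assumes that "de-stroked to 20%" means only 20% of each HDD's capacity is used, which is one reading of the note rather than something the slide states; the function name is hypothetical.

```python
def usable_capacity_gb(drives: int, drive_capacity_gb: int, usable_percent: int = 100) -> int:
    """Usable capacity of a shelf of identical drives, in GB (integer arithmetic)."""
    return drives * drive_capacity_gb * usable_percent // 100


# Figures from the chart note; the 20% usable share is an assumed reading of "de-stroked".
hdd_gb = usable_capacity_gb(drives=24, drive_capacity_gb=400, usable_percent=20)
ssd_gb = usable_capacity_gb(drives=36, drive_capacity_gb=800)

print(hdd_gb)           # 1920
print(ssd_gb)           # 28800
print(ssd_gb / hdd_gb)  # 15.0
```

Under that reading, the SSD configuration offers 15 times the usable capacity in the same rack space.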


    Innovations - DB SW Technologies

    Technology innovations, 1985 to 2015 (listed in the order shown on the timeline):
    - OLTP Transactions DB SW: Row Locking, Optimizer, Parallel Query, Clustering, XML, Grid, Open Source/Hadoop
    - OLAP Analytics DB SW: Indexing, Partitioning, Columnar, Materialized View, Bit Mapped Index, In-Memory, Query Binding
    - Hardware: 32 bit, SMP, NUMA, 64 bit, Multi-core/Blades, Flash, MPP
    - Big Data: Multi-core, Columnar, In-Memory, MPP, Visualization

    [Figure: OLTP Database Innovation Progress. $/TPMc and TPMc/Processor, log scale, 1985-2015]

    [Figure: Big Data: Analytics DB Technology Impact. DW Size (TB) and $/GB, log scale, 1985-2015]


    Big Data - Key Requirements

    Types of Data Organizations Analyze

    Type of data | Hadoop | Non Hadoop
    Customer/member data | 59% | 68%
    Transactional data from applications | 44% | 68%
    Application Logs | 69% | 37%
    Other Types of Event Data | 64% | 23%
    Network Monitoring/Network Traffic | 41% | 33%
    Online Retail Transactions | 33% | 34%
    Other Log Files | 51% | 26%
    Call Data Records | 28% | 32%
    Web Logs | 46% | 21%
    Text data from social media and online | 36% | 15%
    Search logs | 36% | 11%
    Trade/quote data | 18% | 15%
    Intelligence/defense data | 18% | 11%
    Multimedia (audio/video/images) | 21% | 9%
    Weather | 8% | 3%
    Smartmeter data | 3% | 6%
    Other (please specify) | 3% | 5%


    Big Data - Market Requirements

    - Unified system: pre-integrated for ease of installation and management
    - Platform: large-scale indexing, pre-integrated using Hadoop Foundation; integrated text analytics to address unstructured data
    - Usability: user-friendly admin console including HDFS Explorer, query languages
    - Enterprise-class features: provisioning, storage, scheduler, advanced security
    - Supports a search-centric, document-based XML data model
      o Store documents within a transactional repository
    - Schema-free:
      o No advance knowledge of the document structure (its "schema") needed
      o Index words and values from each of the loaded documents together with its document structure
    - Standard commodity hardware leveraged
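The schema-free requirement above, indexing words and values together with the document structure, can be illustrated with a small inverted index keyed by (path, word). This is a sketch under assumptions: the documents are plain Python dictionaries rather than XML, and `walk`, `build_index` and the sample data are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, Set, Tuple


def walk(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, leaf value) pairs for an arbitrarily nested document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, f"{path}/{key}")
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, path)
    else:
        yield path, node


def build_index(docs: Dict[str, Any]) -> Dict[Tuple[str, str], Set[str]]:
    """Map (structural path, word) to the ids of the documents containing it."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for doc_id, doc in docs.items():
        for path, value in walk(doc):
            for word in str(value).lower().split():
                index[(path, word)].add(doc_id)
    return index


# Two documents with different structures; no schema is declared up front.
docs = {
    "d1": {"title": "Big Data storage", "author": {"name": "Anil"}},
    "d2": {"title": "Flash storage", "tags": ["ssd", "flash"]},
}
index = build_index(docs)
print(sorted(index[("/title", "storage")]))
```

Because the path is part of the key, a query can distinguish "flash" in a title from "flash" in a tag without the index ever having been told which fields exist.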


    - Architectural: shared-nothing clustered DB architecture
      o Programmable and extensible application servers
    - Support massive scalability to petabytes of source data
    - Support open-source XQuery- and XSLT-driven architecture
    - Simple to deploy, develop and manage (UI & RESTful interface)
    - Support extreme mixed workloads: a wide variety of data types including arbitrarily hierarchical data structures, images, waveforms, data logs etc.
    - Support thousands of geographically dispersed online users and programs executing a variety of requests, from ad hoc queries to strategic analysis
    - Load data before declaring or discovering its structure
    - Load data in batch and streaming fashion
    - Integrate data from multiple sources during the load process at very high rates
    - Spread I/O and data across instances
    - Provide consistent performance with linear cost
    - Leverage open source SW: low costs, multiple sources, Hadoop Foundation tools
    - Connectivity with Oracle DB, Teradata Warehouse, JDBC connectivity


    Real-Time Analytics Execution
    - Execute streaming analytic queries in real time on incoming load data
    - Update data in place at full load speeds
    - Schedule and execute complex multi-hundred-node workflows
    - Join a billion-row dimension table to a trillion-row fact table without pre-clustering the dimension table with the fact table

    Performance
    - Analyze data at very high rates (>GB/sec)
    - Predictable sub-ms response time for highly constrained standard SQL queries

    Availability
    - Ability to configure without any single point of failure
    - Auto-failover, extreme high availability
      o Automated failover and process continuation without operational interruption when processing nodes fail


    Big Data Product Metrics Choices

    [Figure: Big Data product metrics]
    - Data Set Size: PB, TB, GB
    - Data Structure: Transaction, Machine, Unstructured, Other
    - Access/Use: Transaction, Search, Analytics
    - Parallel Processing: Appliance, Cluster < 1K, Cluster > 1K
    - Memory: In-Memory, Flash
    - DB Technique: Columnar, Zero Sharing, NoSQL
    - Data Cataloging SW: Text, Image, Audio, Video


    Advantage: Big Data Products

    Characteristic | Legacy Paradigm | Big Data Paradigm
    Structure | Transactional/Corporate | Unstructured/Derivative/Internet
    Mode | Data Collection | Data Analysis
    Focus | Find Answers | Find Questions
    Facility | Reportive / What Happened? | Analytic / Why did it Happen? Predictive / What will Happen Next?
    Opportunity | Very Small Growth | Massive Growth
    Players | Legacy Players | Agile Start-Ups, well funded
    Impact | Analyze Existing Businesses | Create New Businesses


    Characteristic | Traditional RDBMS | Big Data/MapReduce
    Data Size | GB | PB
    Access | Interactive | Batch/Near Real-Time
    Latency | Low | High
    Data Updates | Read & Write Many Times | Write Once, Read Many Times
    Schema/Structure | Static Schema | Dynamic Schema
    Language | SQL | UQL/Procedural (Java, C++..)
    Integrity | High | Not 100%
    Works Well for | Process Intensive Jobs | Data Intensive Jobs
    Works Well w/ Data Size | Gigabytes | Petabytes
    Data/Processing Interactions | Low latency/high BW a precursor to success. Ntwk. BW can be a bottleneck, causing nodes to be idle | Sends code to data, instead of sending data to other nodes (requiring lower BW in cluster)
    Fault Tolerance | Coordinating processes with node failures a challenge | Fault tolerant for HW/SW failures
    Scaling | Non-linear | Linear
    Pgm-Distribution of Jobs | Difficult | Simple & Effective


    Big Data Ecosystem


    Big Data Stack

    Merging Hadoop innovations into NextGen DBMS

    [Figure: analytics platform built around a Hadoop data store. Components shown: Data Sources (Databases from Oracle, IBM and Teradata; OLTP Server; Streaming Data; Devices; Applications), Collector, Data Importer, ETL, Oracle-to-Hive, Data Store (Hadoop), Hive Workflow, RHive, Search, Rest/JSON API, Data Exporter, Ad-hoc Query, Real-time Queries, Advanced Analytics, OLAP Server, DBA, Admin Center, Enterprise PerfMon, Query Plan]


    Hadoop's Fit in Enterprise Stack


    Big Data Connectors to EDW/BI

    BI Framework - Interoperable with Enterprise Data Warehousing

    [Figure: layered stack]
    - BI Applications (Query, Analytics, Reporting, Statistics); EDW
    - Big Data (Connectors)
    - Big Data Access Framework: Pig, Hive, Sqoop
    - Big Data Orchestration Framework: HBase, Avro, Flume, ZooKeeper
    - Big Data Processing Framework (MapReduce)
    - Big Data Storage Framework (HDFS)
    - Hypervisor/VMs
    - Operating System
    - Servers
    - Alongside all layers: Network, Security, Management, Backup & Recovery


    Big Data Infrastructure - MapReduce

    MapReduce
    - A distributed computing model
    - Typical pipeline: Input > Map > Shuffle/Sort > Reduce > Output
    - Easy to use: the developer writes a few functions
    - Moves compute to data: schedules work on the HDFS node that holds the data
    - Scans through data, reducing seeks
    - Automatic reliability and re-execution on failure

    [Figure: Hadoop stack]
    - Data Flow (Pig), SQL (Hive), Column Storage (HBase)
    - Metadata (HCatalog)
    - MapReduce (Computing Framework)
    - Hadoop Distributed File System (HDFS)
    - Coordination (ZooKeeper), Management (HMS)
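The Input > Map > Shuffle/Sort > Reduce > Output pipeline named above can be sketched as a single-process word count. This is a teaching sketch, not the Hadoop API: it runs in one Python process, so it shows the data flow between the phases but none of the distribution, scheduling or re-execution that the framework provides. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Shuffle/Sort: group all values by key and order the groups by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle_sort(mapped))


counts = run_job(["big data", "big storage", "data data"])
print(counts)  # {'big': 2, 'data': 3, 'storage': 1}
```

The developer-written parts correspond to `map_phase` and `reduce_phase`; in a real cluster the framework supplies everything that `shuffle_sort` and `run_job` stand in for here.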


    Hadoop Distributed File System

    HDFS Architecture

    HDFS Characteristics
    - Based on Google GFS (Google File System)
    - Redundant storage for massive amounts of data
    - Data is distributed across all nodes at load time, for efficient MapReduce processing
    - Runs on commodity hardware; assumes a high failure rate for components
    - Works well with lots of large files
    - Built around write once, read many times
    - Large streaming reads, not random access
    - High throughput is more important than low latency
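Redundant, distributed storage of the kind listed above can be illustrated by splitting a file into fixed-size blocks and placing several replicas of each block on different nodes. The sketch below uses a simple round-robin placement, which is an assumption made for brevity and not HDFS's actual placement policy; the block size, node names and function names are hypothetical example values.

```python
from typing import Dict, List


def place_blocks(file_size: int, block_size: int, nodes: List[str],
                 replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block of a file to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]], failed: str) -> bool:
    """True if every block still has at least one replica on a surviving node."""
    return all(any(node != failed for node in replicas)
               for replicas in placement.values())


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size=300, block_size=128, nodes=nodes)
print(placement)
print(readable_after_failure(placement, "node1"))
```

With three replicas per block, losing any single node leaves every block readable, which is how a system built on failure-prone commodity hardware can still offer redundant storage.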


    Hadoop Architecture - Overview

    Hadoop Data Processing Architecture


    Key Technologies Required for Big Data

    - Cloud Infrastructure
    - Virtualization
    - Networking
    - Storage
      o In-Memory Data Base (Solid State Memory)
      o Tiered Storage Software (Performance Enhancement)
      o Deduplication (Cost Reduction)
      o Data Protection (Back Up, Archive & Recovery)


    Cloud Infrastructure for Big Data

    - SaaS: Software-as-a-Service (Servers, Network, Storage; Management, Reporting)
      Examples: eMail - Yahoo!, Google; Collaboration - Facebook, Twitter; Bus. Apps - SalesForce, GoogleApps, Intuit
    - PaaS: Platform Tools & Services (deploy developed platforms ready for application SW on cloud-aware infrastructure)
      Examples: Amazon EC2, Force.com, Navitaire
    - IaaS: Infrastructure HW & Services (Servers, Network, Storage; Management, Reporting)
      Examples: Amazon S3, Nirvanix
    - Cloud Services Providers: Public (multitenancy, on demand), Private (on premises, enterprise), Hybrid (interoperable P2P)
      Examples: Public - BT, Telstra, T-Systems, France Telecom; Private/Hybrid - IBM/Cloudburst


    [Figure: cloud computing layers across public cloud service providers, private cloud (enterprise) and hybrid cloud, with an App SLA attached to each application]
    - SaaS: Applications
    - PaaS: Platform Tools & Services (.Net, Python, EJB, Ruby, PHP ..)
    - IaaS: Operating Systems, Virtualization, Resources (Servers, Storage, Networks)
    - Management

    The application's SLA dictates the resources required to meet specific requirements of availability, performance, cost, security, manageability etc.


    Virtualization: Workloads Consolidation

    - A single server 1.5x larger than a standard 2-way server will handle the consolidated load of 6 servers.
    - VZ manages the workloads, and important apps get the compute resources they need automatically, w/o operator intervention.
    - Physical consolidation of 15-20:1 is easily possible.
    - A reasonable goal for VZ x86 servers is 40-50% utilization on large systems (>4-way), rising as dual/quad-core processors become available.
    - Savings result in Real Estate, Power & Cooling, High Availability, Hardware and Management.

    Source: Dan Olds & IMEX Research 2009


    Storage Infrastructure for Big Data

    Storage Efficiency
    - Virtualization: Mapping P > V, VM Management
    - Performance: In-Memory DB, Auto-Tiering (SSD/HDD)
    - Costs Reduction: Thin Provisioning, Deduplication
    - Availability: RAID/Auto Recover, HA, Snapshots, CDP, Cloning, DRS
    - Security: Encryption/DLP

    Service Efficiency
    - Storage-as-a-Service: Service Catalogs by Workloads etc.
    - Policy Infrastructure: Service Level Attributes
    - Service Measurements: Performance Analytics (IOPS/Response Time, Bandwidth)
    - Automation: Unified SAN/NAS Protocols, Auto-learning Workload Forensics, Provisioning to Match Workloads, Assured Auto Recovery


    Storage Architecture - Impact from VZ

    - Virtualization (VZ) requires shared storage for: VMotion, Storage VMotion, HA/DRS, Fault Tolerance
    - Additional capacity consumed for: VZ snapshots, VM kernel etc.

    [Figure: storage technology stack]
    - Data Protection: Replication, RAID 0,1,5,6,10, Virtual Tape, Back Up/Archive/DR
    - Storage Efficiency: Virtualization, MAID, Deduplication, Thin Provisioning, Auto-Tiering

    Source: IMEX Research SSD Industry Report 2011


    Storage Issues & Solutions

    Technologies Reducing Storage Costs

    [Figure: capacity requirements under traditional data growth vs. storage cost reduction from the technologies below]

    Technology | Capacity Savings
    Snapshots | ~75%
    Thin Provisioning | ~30%
    Deduplication | ~25-95%
    Auto-Tiering | 65-95%
    Thin Replication | ~95%
    RAID-DP | ~40% vs. R10
    Virtual Clones | ~80%
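A capacity-savings percentage of the kind quoted above translates into required capacity with one line of arithmetic. The sketch below applies a single technology at a time to a hypothetical 100 TB requirement; the slide does not say how savings from several technologies combine, so no stacking is attempted. The function name and the 100 TB figure are assumptions for illustration.

```python
def capacity_after_savings(raw_tb: float, savings_percent: float) -> float:
    """Capacity still required after one technology saves `savings_percent` of it."""
    if not 0 <= savings_percent <= 100:
        raise ValueError("savings_percent must be between 0 and 100")
    return raw_tb * (100 - savings_percent) / 100


# Hypothetical 100 TB requirement, using two of the percentages from the slide.
print(capacity_after_savings(100, 30))  # thin provisioning at ~30%: 70.0 TB
print(capacity_after_savings(100, 80))  # virtual clones at ~80%: 20.0 TB
```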


    Storage Architecture Impacting Big Data

    Auto-Tiering System using Flash SSDs
    - Data Class (Tiers 0,1,2,3)
    - Storage Media Type (Flash/Disk/Tape)
    - Policy Engines (Workload Mgmt.)
    - Transparent Migration (Data Placement)
    - File Virtualization (Uninterrupted App. Opns. in Migration)

    [Figure: storage technology stack]
    - Data Protection: Replication, RAID 0,1,5,6,10, Virtual Tape, Back Up/Archive/DR
    - Storage Efficiency: Storage Virtualization, MAID, Deduplication, Thin Provisioning, Auto-Tiering

    Data Storage: Hierarchical Usage
    DRAM > Flash SSD > Performance Disk > Capacity Disk > Tape

    Source: IMEX Research SSD Industry Report 2011
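The policy-engine idea in the auto-tiering system above, where data classes map to storage media and data migrates as its access pattern changes, can be sketched as a threshold table. Everything specific here is an assumption for illustration: the access-rate thresholds, the pairing of tier numbers with media types, and the names `TIER_POLICY`, `choose_tier` and `plan_migrations` do not come from the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical policy: minimum accesses per day for each tier; first match wins.
TIER_POLICY: List[Tuple[float, str]] = [
    (1000.0, "Tier 0: Flash SSD"),
    (100.0, "Tier 1: Performance Disk"),
    (1.0, "Tier 2: Capacity Disk"),
]
ARCHIVE_TIER = "Tier 3: Tape"


def choose_tier(accesses_per_day: float) -> str:
    """Pick the fastest tier whose threshold the access rate meets."""
    for threshold, tier in TIER_POLICY:
        if accesses_per_day >= threshold:
            return tier
    return ARCHIVE_TIER


def plan_migrations(current: Dict[str, str],
                    access_rates: Dict[str, float]) -> Dict[str, str]:
    """Return only the data sets whose target tier differs from their current tier."""
    moves: Dict[str, str] = {}
    for name, rate in access_rates.items():
        target = choose_tier(rate)
        if current.get(name) != target:
            moves[name] = target
    return moves


current = {"logs": "Tier 2: Capacity Disk", "archive_2009": "Tier 3: Tape"}
rates = {"logs": 2000.0, "archive_2009": 0.1}
print(plan_migrations(current, rates))
```

Only `logs` is scheduled to move in this example, since its access rate now qualifies it for flash while the archive already sits on the tier its rate calls for.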


    I/O Access Frequency vs. Percent of Corporate Data

    [Figure: % of I/O accesses (65%, 75%, 95%) vs. % of corporate data (1%, 2%, 10%, 50%, 100%), by storage tier]
    - SSD: Logs, Journals, Temp Tables, Hot Tables
    - FCoE/SAS Arrays: Tables, Indices, Hot Data
    - Cloud Storage/SATA: Back Up Data, Archived Data, Offsite Data Vault

    Data Storage: Hierarchical Usage

    Source: IMEX Research - Cloud Infrastructure Report 2009-12


    SSD Storage: Filling Price/Perf. Gaps

    [Figure: price ($/GB) vs. performance (I/O access latency) for Tape, HDD, SATA SSD, PCIe SSD, NAND, NOR, SCM, DRAM, SDRAM and CPU]
    - HDD is becoming cheaper, not faster.
    - DRAM is getting faster (to feed faster CPUs) and larger (to feed multi-cores and multi-VMs from virtualization).
    - SSD is segmenting into PCIe SSD cache (as back end to DRAM) and SATA SSD (as front end to HDD).

    Source: IMEX Research SSD Industry Report 2010-12


    Workloads Characterization

    [Figure: workloads plotted by IOPS* (axis ticks 10, 100, 1K, 10K, 100K, 1000K) against MB/sec (axis ticks 1, 5, 10, 50, 100, 500).]

    Workload groups shown:
    - OLTP, eCommerce, Transaction Processing
    - Data Warehousing, OLAP, Business Intelligence
    - Web 2.0, Audio, Video, Scientific Computing, Imaging, HPC

    Region labels: TP, HPC

    RAID annotations on the chart: RAID 0, 3 and RAID 1, 5, 6

    * IOPS for a required response time (ms); IOPS = #Channels × Latency^-1

    Source: IMEX Research - Cloud Infrastructure Report 2009-12
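The chart footnote relates IOPS to channel count and latency (IOPS = #Channels × Latency^-1). The following is a minimal Python sketch of that relation, not something taken from the slides: the function names `achievable_iops` and `channels_needed` and the sample figures are assumptions made for illustration, and the model assumes each channel completes exactly one I/O per latency interval with no queuing overhead.

```python
import math


def achievable_iops(channels: int, latency_ms: float) -> float:
    """IOPS = #Channels * (1 / latency), with latency given in milliseconds."""
    if channels <= 0 or latency_ms <= 0:
        raise ValueError("channels and latency_ms must be positive")
    # 1000 / latency_ms is the number of I/Os one channel finishes per second.
    return channels * (1000.0 / latency_ms)


def channels_needed(target_iops: float, latency_ms: float) -> int:
    """Smallest number of parallel channels that reaches target_iops."""
    if target_iops <= 0 or latency_ms <= 0:
        raise ValueError("target_iops and latency_ms must be positive")
    return math.ceil(target_iops * latency_ms / 1000.0)


if __name__ == "__main__":
    # Illustrative numbers only (assumed, not from the slides):
    # one channel at 5 ms latency vs. eight channels at 0.5 ms latency.
    print(achievable_iops(1, 5.0))
    print(achievable_iops(8, 0.5))
    print(channels_needed(100_000, 0.5))
```

Read the other way round, the same relation says that for a fixed response-time target, more IOPS can only come from more parallel channels, which is one reason flash devices with many internal channels sit at the high-IOPS end of the chart.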


    Workloads Characterization

    Storage performance, management and costs are big issues in running databases

    Data Warehousing workloads are I/O intensive
    - Predominantly read based, with low hit ratios on buffer pools
    - High concurrent sequential and random read levels
    - Sequential reads require a high level of I/O bandwidth (MB/sec)
    - Random reads require high IOPS
    - Write rates driven by life cycle management and sort operations

    OLTP workloads are strongly random I/O intensive
    - Random I/O is more dominant
    - Read/write ratios of 80/20 are most common but can be 50/50
    - Can be difficult to build out test systems with sufficient I/O characteristics

    Batch workloads are more write intensive
    - Sequential writes require a high level of I/O bandwidth (MB/sec)

    Backup & Recovery times are critical for these workloads
    - Backup operations drive high levels of sequential I/O
    - Recovery operations drive high levels of random I/O

    Source: IMEX Research SSD Industry Report 2011
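The slide separates bandwidth-bound sequential I/O (MB/sec) from IOPS-bound random I/O. The two measures are linked by transfer size: throughput equals IOPS multiplied by the size of each I/O. The Python sketch below works through that arithmetic; the function names and the 8 KiB and 1 MiB transfer sizes are assumptions picked for illustration, not figures from the presentation.

```python
def throughput_mib_per_s(iops: float, io_size_kib: float) -> float:
    """Throughput implied by a given IOPS rate at a fixed transfer size."""
    if iops < 0 or io_size_kib <= 0:
        raise ValueError("iops must be >= 0 and io_size_kib must be > 0")
    return iops * io_size_kib / 1024.0


def iops_for_throughput(mib_per_s: float, io_size_kib: float) -> float:
    """IOPS needed to sustain a given throughput at a fixed transfer size."""
    if mib_per_s < 0 or io_size_kib <= 0:
        raise ValueError("mib_per_s must be >= 0 and io_size_kib must be > 0")
    return mib_per_s * 1024.0 / io_size_kib


if __name__ == "__main__":
    # Assumed small random reads: many operations, modest bandwidth.
    print(throughput_mib_per_s(10_000, 8))
    # Assumed large sequential reads: few operations, high bandwidth.
    print(throughput_mib_per_s(200, 1024))
```

With these assumed sizes, 10,000 small random I/Os per second move less data than 200 large sequential ones, which is why the two workload classes stress different limits of the storage system.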


    Best Practices: I/O Forensics in Storage-Tiering

    Source: IBM & IMEX Research SSD Industry Report 2011 IMEX 2010-12

    LBA Monitoring and Tiered Placement
    - Every workload has a unique I/O access signature
    - Historical performance data for a LUN can identify performance skews & hot data regions by LBAs

    Storage-Tiering at LBA/Sub-LUN Level

    [Figure: Storage-Tiered Virtualization. A logical volume is mapped onto physical storage, with hot data on SSD arrays and cold data on HDD arrays, and automatic (policy-based) migration between them.]
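Identifying hot data regions by LBA can be sketched as a simple heat count over fixed-size extents. The Python example below is a hedged illustration only: the function `hot_extents`, the extent size, and the sample trace are hypothetical, and real auto-tiering software uses its own monitoring windows and placement policies, which the slides do not specify.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def hot_extents(
    lba_trace: Iterable[int], extent_blocks: int, ssd_extents: int
) -> Tuple[List[int], float]:
    """Pick the most-accessed extents and report the share of I/Os they absorb.

    lba_trace     -- logical block addresses touched, one entry per I/O
    extent_blocks -- number of blocks per sub-LUN extent (assumed granularity)
    ssd_extents   -- how many extents fit on the fast tier
    """
    if extent_blocks <= 0 or ssd_extents < 0:
        raise ValueError("extent_blocks must be > 0 and ssd_extents >= 0")
    # Heat map: number of accesses that fall into each extent.
    heat = Counter(lba // extent_blocks for lba in lba_trace)
    total = sum(heat.values())
    if total == 0:
        return [], 0.0
    hottest = [extent for extent, _ in heat.most_common(ssd_extents)]
    captured = sum(heat[extent] for extent in hottest)
    return hottest, captured / total


if __name__ == "__main__":
    # Hypothetical trace: most accesses cluster in the first 1024 blocks.
    trace = [5, 7, 3, 1030, 6, 2, 4100, 1, 5, 9000, 4, 7]
    placement, share = hot_extents(trace, extent_blocks=1024, ssd_extents=1)
    print(placement, share)
```

In this made-up trace a single extent out of the four touched receives 9 of the 12 accesses, so placing only that extent on the SSD tier would serve three quarters of the I/O from flash.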


    Best Practices: Cached Storage

    App/DB Server

    LSI MegaRAID CacheCade Pro 2.0

    Application                        Improvement, cached vs. HDD only
    Oracle OLTP Benchmarks             681%
    SQL Server OLTP Benchmark          1251%
    Neoload (Web Server Simulation)    533%
    SysBench (MySQL OLTP Server)       150%

    [Figure: TPS for All HDD, Smart Flash Cache, and Persist Data on Warpdrive; axis 0 to 700; data labels 0, 373, 655]

    [Figure: Response Time for All HDD, Smart Flash Cache, and Persist Data on Warpdrive; axis 0 to 700; data labels 0, 330, 660]


    Big Data Targets: Analytics

    Key Industries Benefiting from Big Data Analytics

    [Figure: Global IT Spending by Industry Verticals 2010-15 ($B). Axes: CAGR % (2010-15) and 5-Year Cumulative Global IT Spending 2010-15 ($B).]


    Big Data Targets Storage Infrastructure

    Value Potential of Using Big Data by Data Intensive Verticals

    [Figure: Data Intensity by Industry Vertical, measured in Installed Terabytes/$M Rev 2010 (axis 0.0 to 0.9). Verticals, in the order listed:]
    - Banking & Financial Services
    - Media & Entertainment
    - Healthcare Providers
    - Professional Services
    - Telecommunications
    - Pharma, Life Sciences & Medical Products
    - Retail & Wholesales
    - Utilities
    - Industrial Electronics & Electrical Equipment
    - SW Publishing & Internet Services
    - Consumer Products
    - Insurance
    - Transportation
    - Energy


    Big Data Targets: Savings with Open Source

    [Figure: Legacy BI vs. Open Source Big Data Analytics]


    Rise of Big Data Adoption


    Key Takeaways

    Big Data creating paradigm shift in IT Industry
    - Leverage the opportunity to optimize your computing infrastructure with Big Data Infrastructure after performing due diligence in selection of vendors/products, industry testing and interoperability.

    Apply best storage technologies listed in this presentation and elsewhere

    Optimize Big Data Analytics for Query Response Time vs. # of Users
    - Improve query response time for a given number of users (IOPS), or serve more users (IOPS) for a given query response time

    Select Automated Storage Management Software
    - Data Forensics and Tiered Placement
    - Every workload has a unique I/O access signature
    - Historical performance data for a LUN can identify performance skews & hot data regions by LBAs
    - Non-disruptively migrate hot data using auto-tiering software

    Optimize Infrastructure to meet needs of Applications/SLA

    Performance Economics/Benefits
    - Typically 4-8% of data becomes a candidate for migration; when moved to a higher-performance tier, it can provide a response time reduction of ~65% at peak loads.
    - Many industry verticals and applications will benefit from using Big Data

    Source: IMEX Research SSD Industry Report 2011
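The performance-economics figures quoted on this slide (4-8% of data as tiering candidates, roughly 65% lower response time at peak load) can be turned into a rough sizing estimate. The Python sketch below only applies that arithmetic: the function name `tier_estimate`, the 100 TB capacity, and the 20 ms baseline are assumptions for illustration, and the 65% reduction is the slide's typical figure, not a guarantee for any given workload.

```python
from typing import Tuple


def tier_estimate(
    total_tb: float,
    hot_fraction: float,
    baseline_ms: float,
    reduction: float = 0.65,
) -> Tuple[float, float]:
    """Return (fast-tier capacity in TB, estimated peak response time in ms).

    hot_fraction -- share of data that is a tiering candidate (slide: 0.04-0.08)
    reduction    -- fractional response-time reduction (slide: about 0.65)
    """
    if total_tb < 0 or baseline_ms < 0:
        raise ValueError("total_tb and baseline_ms must be non-negative")
    if not 0.0 <= hot_fraction <= 1.0 or not 0.0 <= reduction <= 1.0:
        raise ValueError("hot_fraction and reduction must be within [0, 1]")
    fast_tier_tb = total_tb * hot_fraction
    tiered_ms = baseline_ms * (1.0 - reduction)
    return fast_tier_tb, tiered_ms


if __name__ == "__main__":
    # Assumed environment: 100 TB of data, 20 ms peak response time on HDD only.
    for fraction in (0.04, 0.08):
        print(fraction, tier_estimate(100.0, fraction, 20.0))
```

Under these assumptions, a 100 TB environment would need on the order of 4 to 8 TB of fast-tier capacity, and a 20 ms peak response time would drop to about 7 ms.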


    Q&A / Feedback

    Many thanks to the following individuals for their contributions to this tutorial.

    Source: IMEX Research

    Joseph White

    Anil Vasudeva

    Send any questions or comments on this presentation to SNIA: [email protected]
