EMC Hadoop Storage Strategy, presented at EMC World 2014
© Copyright 2014 EMC Corporation. All rights reserved.
EMC Hadoop Storage Strategy
Ed Walsh - @vEddieW
Jim Ruddy - @Darth_Ruddy
Dan Baskette - @dbbaskette
CHANGES IN ANALYTICS
DATA VOLUME
DATA VELOCITY
DATA TYPES
APPS
DATA LAKE TECHNOLOGY
HADOOP DISTRIBUTED FILE SYSTEM
Apache Open Source Specification
• Highly scalable & portable
• Structured and unstructured data
• Analytics API interface standard
• Storage hardware flexibility
• Performance optimized for large file access
HDFS TRADE-OFFS
• Optimized for streaming writes; poor for random seeks
• Write-once file system
• Hardware failure results in reduced performance
• Specialized file system, not designed for general use
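HDFS Architecture
[Diagram: a Client, a NameNode, a Secondary NameNode (now called checkpoint or backup node), and three DataNodes holding HDFS data. The Client asks the NameNode "Where do I read or write data?" and the NameNode answers "Just these nodes". Arrows labeled "Data" and "Status" connect the DataNodes to the rest of the cluster.]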
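The lookup flow in the HDFS Architecture diagram can be sketched as a toy model. This is a simplified illustration only, not the Hadoop client API: the class and function names (`NameNode`, `DataNode`, `write_file`, `read_file`), the tiny block size, the replica count, and the round-robin placement are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy DataNode: stores block contents keyed by block id."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Toy NameNode: holds metadata only (which nodes hold which block)."""
    block_map: dict = field(default_factory=dict)  # path -> [(block_id, [node names])]

    def locate(self, path):
        # Answers the client's question: "Where do I read or write data?"
        return self.block_map[path]


def write_file(namenode, datanodes, path, data, block_size=4, replicas=2):
    """Split data into blocks and place each block on `replicas` DataNodes."""
    if path in namenode.block_map:
        # Mirrors the "write-once" trade-off: existing files are not rewritten.
        raise FileExistsError(f"{path} already exists (write-once)")
    names = sorted(datanodes)
    layout = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#{index}"
        # Round-robin placement (an assumption; real placement is more involved).
        targets = [names[(index + r) % len(names)] for r in range(replicas)]
        for target in targets:
            datanodes[target].blocks[block_id] = data[offset:offset + block_size]
        layout.append((block_id, targets))
    namenode.block_map[path] = layout


def read_file(namenode, datanodes, path):
    """Ask the NameNode for locations, then fetch the bytes from DataNodes."""
    parts = []
    for block_id, targets in namenode.locate(path):
        # Use the first listed node that still holds the block.
        holder = next(t for t in targets if block_id in datanodes[t].blocks)
        parts.append(datanodes[holder].blocks[block_id])
    return b"".join(parts)


datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
namenode = NameNode()
write_file(namenode, datanodes, "/logs/a.txt", b"hello hdfs!")
print(read_file(namenode, datanodes, "/logs/a.txt"))
```

In this sketch the NameNode never touches file contents; it only answers "just these nodes", and the client moves data directly to and from the DataNodes. Because each block has two replicas here, a read still succeeds after one DataNode loses its blocks.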
DATA LAKE TECHNOLOGY
HADOOP TIER
[Diagram: seven DataNodes, each backed by HDFS]
PROCESSING TIER – MR, HIVE, ETC.
DEEP SCALE SQL ANALYTICS – PIVOTAL HAWQ
IN MEMORY TIER
SQL OBJECTS JSON
DATABASES
Operational data is the focus (it is in memory, mostly)
Continue to work with RDBs
All data, history in HDFS
HDFS data files directly accessible inside Hadoop
Analytic results routed to memory tier
DATA LAKE STORAGE FEATURES
NO SILOS
• Multi-protocol access
• Simultaneous access for unstructured data
• Separation of storage from access protocol
OPTIMIZED COST
• Choice of storage hardware
• Multi-vendor, no lock-in
LIMITLESS SCALE
• Expand capacity as needed
• Massive scale-out
• Highly available
INTEGRATED HDFS WITH HADOOP DISTRO
STRENGTHS
• Tightly coupled with Hadoop software
• Low cost
• Storage hardware choice
• Integrated software support
• Data locality
HDFS STORAGE ARRAY INTERFACE
STRENGTHS
• No ingest necessary
• NameNode Fault Tolerance
• Eliminate 3x mirroring
• Multi-protocol access
• Simultaneous Multi-Hadoop distribution support
• Smart-Dedupe for Hadoop
• SEC 17a-4 Compliance
• Kerberos Authentication
• Application Multi-tenancy
EMC Hadoop Starter Kit: https://community.emc.com/docs/DOC-26892
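The "Eliminate 3x mirroring" point can be put in rough numbers. The sketch below compares the raw capacity needed under HDFS's default three-copy replication with an array that protects data internally. The 100 TB data set and the 1.25x array overhead are assumptions chosen for illustration; the real overhead depends on the product and its configuration.

```python
def raw_capacity_needed(usable_tb, overhead_factor):
    """Raw capacity required to hold `usable_tb` of data.

    overhead_factor is raw/usable, e.g. 3.0 for three full copies.
    """
    return usable_tb * overhead_factor


usable_tb = 100  # hypothetical data set size

# Default HDFS replication keeps three copies of every block.
triple_copy = raw_capacity_needed(usable_tb, 3.0)

# ASSUMPTION: the array's own protection scheme costs 25% extra raw capacity.
array_protected = raw_capacity_needed(usable_tb, 1.25)

print(f"3x replication: {triple_copy:.0f} TB raw")
print(f"array protection (assumed 1.25x): {array_protected:.0f} TB raw")
print(f"raw capacity saved: {triple_copy - array_protected:.0f} TB")
```

Under these assumed figures, 100 TB of data needs 300 TB raw with three copies versus 125 TB raw on the array.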
HDFS BY STORAGE VIRTUALIZATION SOFTWARE
STRENGTHS
• Multi-protocol access
  Object, HDFS, Block (iSCSI), more coming
  Write file, read object & vice versa
• NameNode Fault Tolerance
• Eliminate 3x mirroring
• Compute & data locality
• Application multi-tenancy
• Heterogeneous Storage: pool server storage and enterprise arrays
  EMC, NetApp, Hitachi
EMC Hadoop Starter Kit: https://community.emc.com/docs/DOC-34442
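"Write file, read object & vice versa" means a single stored copy is reachable through more than one access protocol. A minimal sketch of that idea follows. It is a toy in-memory model, not any EMC product's API: the class `MultiProtocolStore`, its method names, and the rule that the first path component acts as the bucket are all assumptions for illustration.

```python
class MultiProtocolStore:
    """Toy store: one copy of the data, reachable through two naming schemes."""

    def __init__(self):
        self._data = {}  # (bucket, key) -> bytes

    @staticmethod
    def _key_from_path(path):
        # "/bucket/dir/name" -> ("bucket", "dir/name")  (assumed mapping)
        bucket, _, name = path.lstrip("/").partition("/")
        return bucket, name

    # File-style interface
    def write_file(self, path, payload):
        self._data[self._key_from_path(path)] = payload

    def read_file(self, path):
        return self._data[self._key_from_path(path)]

    # Object-style interface
    def put_object(self, bucket, key, payload):
        self._data[(bucket, key)] = payload

    def get_object(self, bucket, key):
        return self._data[(bucket, key)]


store = MultiProtocolStore()

# Write as a file, read back as an object.
store.write_file("/sensors/2014/05/readings.csv", b"t,temp\n0,21.5\n")
print(store.get_object("sensors", "2014/05/readings.csv"))

# Write as an object, read back as a file.
store.put_object("logs", "web/access.log", b"GET /index.html 200\n")
print(store.read_file("/logs/web/access.log"))
```

Because both interfaces resolve to the same key, no copy or ingest step sits between the application that writes the data and the one that analyzes it.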
ANALYTICS APPLIANCES
STRENGTHS
• Rapid deployment
• Predictable performance & scale
• Optimum resource utilization
• Integrated, simplified management
• Simplified support & maintenance
• Optimized cost
• Highest Reliability, Availability, and Stability
Traditional Analytics Architecture
[Diagram: source systems (RMT, Historian, IMAS, Alarm, LIMS), Oracle, pre-aggregated tables, analytics servers (SAS, R), and BI tools (SSRS, Panopticon, Web; Cognos)]
Modern Analytics Architecture
EMC Data Lake Architecture
[Diagram: source systems (RMT, Historian, IMAS, Alarm, LIMS); a "Real Time" feed; Historian; GemFire XD, HAWQ, and HDFS; a Reporting DB; Alpine/Chorus (Pivotal); analytics servers (SAS, R); and BI servers (SSRS, Panopticon, Web; Tableau or other)]
MODERN ANALYTICS USING DATA LAKE
DEMO
EMC DATA LAKE CAPABILITIES
Documents (XLS, PPT, DOC)
SQL Databases
Rich Media (PDF, JPG, Video, Streaming)
Sensor Data (GPS coordinates, temperature measurements)
Unstructured Context (Web Server Logs,
Scale Effortlessly | Store Efficiently | Access Globally
The Emerging Data Platform Ecosystem
Business Data Lake
[Diagram: tiered reference architecture. Labels recovered from the figure, grouped by tier where the grouping is clear:]
Data Sources: Clickstream, Sensors, Telemetrics, Weblogs, Network Data, CRM, ERP Data, Collab
Ingestion Tier: Real-time, Batch, Micro batch
Insights Tier: SQL, MapReduce, NoSQL, Spark, R
Action Tier: Real-time Insights, Batch Insights, Interactive Insights
Operations Tier, Data Services Tier, Processing Tier: MDM, RDM, Audit and Policy mgmt, Data mgmt services, Systems monitoring and management, Relational Database, MPP Database, In-memory processing, Workflow Management, Hadoop, App Server, Web Services
Data Management Tier: HDFS, Software-defined Storage, Enterprise SAN/NAS
Infrastructure Tier: Public Cloud, Hybrid Cloud, Private Cloud
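The Ingestion Tier distinguishes real-time, batch, and micro-batch ingestion. As a rough illustration of the middle ground, the sketch below groups an incoming event stream into small fixed-size batches. The function name `micro_batches`, the batch size, and the sample events are assumptions; production micro-batching is usually driven by time windows rather than simple counts.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield lists of up to `batch_size` events from any iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Hypothetical clickstream events
clicks = [f"click-{i}" for i in range(7)]

for batch in micro_batches(clicks, 3):
    # Each batch would be handed to the processing tier as one unit.
    print(len(batch), batch)
```

A batch size of 1 approaches real-time handling and a very large batch size approaches classic batch loading, which is why micro-batching is often described as a tunable point between the two.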
Business Data Lake
EMC Federation Solutions
[Diagram: tiered reference architecture. Labels recovered from the figure, grouped by tier where the grouping is clear:]
Data Sources: Clickstream, Sensors, Telemetrics, Weblogs, Network Data, CRM, ERP Data, Collab
Ingestion Tier: Real-time, Batch, Micro batch
Insights Tier: SQL, MapReduce, NoSQL, Spark, R
Action Tier: Real-time Insights, Batch Insights, Interactive Insights
Operations Tier, Data Services Tier, Processing Tier: MDM, RDM, Audit and Policy mgmt, Data mgmt services, Systems monitoring and management, Relational Database, MPP Database, In-memory processing, Workflow Management, Hadoop, App Server, Web Services
Data Management Tier: HDFS, Software-defined Storage, Enterprise SAN/NAS
Infrastructure Tier: Public Cloud, Hybrid Cloud, Private Cloud
VMware vCloud Suite, vCloud Hybrid Services