2015 SNIA Analytics and Big Data Summit. © NetApp All Rights Reserved.
Successfully Deploying Alternative Storage Architectures for Hadoop
Gus Horn, Iyer Venkatesan
NetApp
Agenda
Hadoop and storage
Alternative storage architecture for Hadoop
Use cases and customer examples
Guidelines and best practices
NFS Connector for Hadoop
Conclusion and next steps
Hadoop and Storage
Traditional Hadoop Storage Flow
Ingest (logs, images, text) goes to DataNode A
Ingest is replicated to DataNodes B and C (replication R=3)
[Figure: NameNode and DataNodes A, B, and C connected through a network switch; blocks data1 to data4 written to DataNode A are replicated to DataNodes B and C]
Implications of three copies
Network congestion
Server congestion, RAM utilization
Hadoop and memory
Memory issues account for a large part of support calls (root cause: server memory contention)
Reducing server replication reduces memory consumption, for a more reliable, faster cluster
Server replication can be messy
[Figure: Server A internals (CPU, memory controller, RAM/DIMM, I/O controller, disk drives) connected over the network to Servers B and C; each server holds the master of its own LUN (A, B, or C) plus copies of the other two]
Hadoop uses server-based replication to keep three copies
Causes high levels of I/O over the server system bus
Causes poor disk utilization (1/3 of raw capacity)
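The arithmetic behind the "1/3 of raw capacity" point can be sketched as follows. This is a simplified model, not a measurement: it assumes data is ingested from a client outside the cluster, so each block crosses the network once per copy. The function names are hypothetical and only serve the illustration.

```python
def usable_fraction(replication: int) -> float:
    """Fraction of raw disk capacity that holds unique data."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return 1 / replication


def network_transfers_per_block(replication: int) -> int:
    """Network hops per ingested block.

    Assumption: the client is off-cluster, so the first copy is one
    transfer and every additional replica is one more transfer.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return replication


def traffic_reduction(old_replication: int, new_replication: int) -> float:
    """Relative drop in ingest traffic when lowering the replication factor."""
    old = network_transfers_per_block(old_replication)
    new = network_transfers_per_block(new_replication)
    return 1 - new / old


if __name__ == "__main__":
    print(f"R=3 usable fraction: {usable_fraction(3):.3f}")
    print(f"R=2 usable fraction: {usable_fraction(2):.3f}")
    print(f"Traffic reduction R=3 -> R=2: {traffic_reduction(3, 2):.1%}")
```

Under this model, moving from three copies to two raises the usable fraction from one third to one half and cuts ingest traffic by one third. If the ingesting client runs on a DataNode, the first copy is local and the numbers differ.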
Alternative DAS Architecture
Dedicated storage with E-Series
External DAS architecture
Higher capacity and density
– 180TB in 4U
– Smaller footprint in the datacenter
Two copies of data (not three)
– Less network congestion, better throughput
– Less data to manage, higher efficiency
High availability for Hadoop
– Reliable NameNode protection
– Jobs continue when nodes go offline
– Faster cluster recovery
NetApp Storage Layout for HDFS
Two 7-disk RAID 5 groups with two LUNs per node
Dedicated set of disks per DataNode
Shared-nothing architecture
Spare disks shared globally
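A rough capacity calculation for this layout is sketched below. It reads the slide as two RAID 5 groups of seven disks per DataNode, with one disk's worth of parity per group, and combines that with a replication factor of 2. The disk size in the usage line is a made-up value, and the function names are hypothetical.

```python
def raid5_data_fraction(disks_per_group: int = 7) -> float:
    """RAID 5 spends one disk's worth of capacity per group on parity."""
    if disks_per_group < 3:
        raise ValueError("RAID 5 needs at least 3 disks per group")
    return (disks_per_group - 1) / disks_per_group


def node_capacity_tb(disk_tb: float,
                     groups_per_node: int = 2,
                     disks_per_group: int = 7) -> float:
    """Capacity presented to one DataNode after RAID 5 parity (before HDFS replication)."""
    data_disks = groups_per_node * (disks_per_group - 1)
    return data_disks * disk_tb


def effective_fraction(replication: int = 2, disks_per_group: int = 7) -> float:
    """Share of raw disk capacity that ends up holding unique data."""
    return raid5_data_fraction(disks_per_group) / replication


if __name__ == "__main__":
    # 3 TB per disk is an illustrative assumption, not a figure from the slides.
    print(f"Per-node capacity: {node_capacity_tb(3.0):.1f} TB")
    print(f"Effective fraction, RAID 5 + R=2: {effective_fraction():.3f}")
```

With these assumptions, RAID 5 plus two copies leaves about 3/7 of raw capacity for unique data, compared with 1/3 for three copies on unprotected disks. Globally shared spares are not counted here.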
Use Cases
Service Provider Leveraging Hadoop
Significant growth in network log data from remote data centers that couldn't be consolidated
Analytical queries couldn't be run with existing tools, so stakeholders couldn't access the data
[Figure labels: remote servers; central servers; analytics solution (Hadoop HDFS/MapReduce, archiving and indexing tools, UI + search tool); analysts; business users]
Faster consolidation, indexing, and searching of log data
Information needed for auditing and compliance
New analytics capabilities
Eight-node Hadoop cluster with open source search and indexing tools
Security Use Case in Government
Challenges
Protect IT/data assets from cyber attacks
Implementation: how to combine big data with cyber analytics
Benefits
Defensive perimeter around financial data to thwart potential attacks
Better situational awareness
Required both Hadoop and custom analytical application for complete solution
[Figure label: customer analytics application]
Alternative Architecture in Healthcare
Challenges
Extract, transform, load (ETL) offload for increasing amounts of unstructured data
Integration of Hadoop with traditional systems
Benefits
Cost-effective ingest solution for semi-structured and unstructured data
New treatment analytics capabilities
Highly available Hadoop cluster
[Figure labels: images, insurance claims, patient records; Hadoop; data warehouse; business intelligence]
Other customers and use cases
Manufacturing: electronics, industrial coating
High Tech: semiconductor design and packaging, networking
Healthcare: hospitals, pharmaceutical, managed healthcare, clinical testing
Transportation: airline, automotive
Consumer: retail, household goods
Financial Services: insurance, banking, mobile payments
Government: education, security
Telco/SP: wireless hotspots, log analysis
Advantages of Alternative Architecture
Feature comparison: External or Managed DAS vs. White Box DAS

Replication count
  External or Managed DAS: 2 (reduces required hardware by one third); single copy planned
  White Box DAS: 3 minimum

Application availability
  External or Managed DAS: enterprise hardware RAID 5, 6 and Dynamic Disk Pools; much higher uptime (five nines)
  White Box DAS: slower recovery from disk drive failure and NameNode failure; less uptime

Performance
  External or Managed DAS: consistent performance during "healthy and unhealthy" modes of operation; 33% less network traffic
  White Box DAS: degradation of up to 240% with a single drive failure

Fan-In Ratio
  External or Managed DAS: up to 8:1 (nodes per E-Series); SAS options: I-Band, FC
  White Box DAS: limited scalability, only with internal drives

Solution Architecture
  External or Managed DAS: validated designs and Technical Reports, expediting time to market and reducing risk
  White Box DAS: iterative, time-consuming tuning process; multiple failure points; resource intensive

Growth Flexibility
  External or Managed DAS: storage and compute decoupled; non-disruptive lifecycle management
  White Box DAS: can only grow both simultaneously; disruptive migration and rebalancing

DataNode Management
  External or Managed DAS: non-disruptive DataNode replacement; no rebalancing or migration
  White Box DAS: disruptive DataNode replacement; must rebalance and/or migrate content
Best practices from customer use cases
Start with the use case or business problem to leverage new data sources
Determine the workload, technologies, infrastructure
Enhance or update your data warehouse and BI tools (ETL offload and active archiving)
Think about redesigning or updating the analytic platform
Best Practices
Minimize network overhead
– Replication factor of 2 and RAID 5
– Use compression wherever possible
Storage and Hadoop optimization
– Start with 4:1 storage to compute ratio
– Allocate 30% of storage capacity to map output
– Disk group layout
Turn on rack awareness
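Rack awareness in Hadoop is commonly enabled by pointing the `net.topology.script.file.name` property at a script that receives host names or IP addresses as arguments and prints one rack path per argument. A minimal sketch of such a script follows. The subnet-to-rack table and the rack names are invented for illustration and would need to match the real cluster.

```python
import sys

# Hypothetical mapping from the first three octets of a DataNode IP
# address to a rack path. Replace with the real cluster topology.
RACK_BY_SUBNET = {
    "10.1.1": "/dc1/rack1",
    "10.1.2": "/dc1/rack2",
}

# Rack path returned for hosts that are not in the table.
DEFAULT_RACK = "/default-rack"


def resolve_rack(host: str) -> str:
    """Return the rack path for one host name or IP address."""
    subnet = host.rsplit(".", 1)[0]  # drop the last dotted component
    return RACK_BY_SUBNET.get(subnet, DEFAULT_RACK)


def resolve_all(hosts):
    """Resolve every argument, preserving order, as Hadoop expects."""
    return [resolve_rack(h) for h in hosts]


if __name__ == "__main__":
    # Hadoop may pass several hosts per invocation; print one rack per host.
    print(" ".join(resolve_all(sys.argv[1:])))
```

Hosts that do not match any entry fall back to a default rack, so an incomplete table degrades to no rack awareness for those nodes rather than failing.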
Best Practices
Use E5560 (or later) as storage array, supporting four DataNodes
Use FAS22xx for diskless and network boot, storage administration
Separate network for data; separate for node interconnect
Use jumbo frames and 10GbE
Determine the number of DataNodes by storage and job run requirements
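One way to turn "storage and job run requirements" into a node count is sketched below. The formula is an assumption for illustration, not a NetApp sizing rule: it takes the larger of a storage-driven count and a throughput-driven count, then rounds up to a multiple of four because the slide pairs each storage array with four DataNodes. All parameter names and sample values are hypothetical.

```python
import math


def datanodes_needed(data_tb: float,
                     replication: int,
                     usable_tb_per_node: float,
                     scan_tb_per_job: float,
                     node_throughput_tb_per_hour: float,
                     job_window_hours: float,
                     nodes_per_array: int = 4) -> int:
    """Estimate DataNode count from storage and job run requirements."""
    # Storage requirement: every replica must fit on some node.
    by_storage = math.ceil(data_tb * replication / usable_tb_per_node)

    # Job run requirement: enough nodes to scan the job's input
    # within the allowed window at the assumed per-node throughput.
    per_node_tb = node_throughput_tb_per_hour * job_window_hours
    by_job = math.ceil(scan_tb_per_job / per_node_tb)

    nodes = max(by_storage, by_job)

    # Round up to whole storage arrays (four DataNodes per array).
    return math.ceil(nodes / nodes_per_array) * nodes_per_array


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(datanodes_needed(200, 2, 30, 10, 0.5, 4))
```

In practice the per-node throughput would come from a pilot run on the target workload rather than from a data sheet.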
Best practices (continued)
Start a POC or pilot sooner rather than later
– POC is for business validation
– Pilot is for technology validation
Focus on performance after deployment
Application and cluster size determine most of the configuration
Putting the Stack Together
Storage and File Systems
Servers, Networking, Hardware
Data Management
Applications and Analytics
Reporting/Dashboard/Visualization
Scenario for storage and analytics
Hadoop diagram courtesy Hortonworks
NetApp FAS Storage, NFS-based: Enterprise Data
1) Data is sitting on FAS, NFS-based storage
2) If Hadoop or MapReduce analysis is needed, HDFS-based storage has to be created
3) Data has to be moved to the newly created Hadoop storage
4) Analysis can now be done on the data
[Figure labels: Hadoop Analytics stack with HDFS, YARN, MapReduce, HBase, Spark]
Introducing NetApp NFS Connector
Hadoop diagram courtesy Hortonworks
NetApp FAS Storage, NFS-based: Enterprise Data
MapReduce analytics natively on data sitting on FAS, NFS-based storage
NFS Connector is a thin software application between MapReduce and NFS
[Figure labels: NFS Connector working directly on NFS data; Hadoop Analytics stack with HDFS, YARN, MapReduce, HBase, Spark]
Next Steps
Download information at netapp.com/hadoop
– Technical Reports, Solution Guides, Cisco Validated Designs, Solution Briefs
Start a POC
– Engage NetApp or partner
Contact us: [email protected] or [email protected] or NetApp System Engineer
Thank You!