Data Management is a suite of services that manage the lifecycle of data on the Hadoop clusters.
Data Management on Hadoop @ Y!
Seetharam Venkatesh ([email protected])
Hadoop Data Infrastructure Lead/Architect
Agenda
1 Introduction
2 Challenging Data Landscape
3 The Solution
4 Future opportunities
Introduction
Replication
Data Export
Anonymization
Retention
Archival
Acquisition
Why is Data Management Critical?
Challenging Data Landscape
[Diagram: data flows between Data Warehouses, Databases, NAS, and Tape across data centers, with Hadoop Clusters in the middle]
• Steady growth in data movement volumes per day
• SLA requirements (minutes to a day)
• BCP requirements (Hot-Hot, Hot-Warm)
• Feeds with varying periodicity (minutes to monthly)
Data Acquisition
Challenge → Solution
• Steady growth in data volumes → Heavy lifting delegated to map-only jobs
• Diverse interfaces, API contracts → Pluggable interfaces; adaptors for specific APIs
• Data sources have diverse serving capacity → Throttling enables uniform load on the data source
• Data source isolation → Asynchronous scheduling; progress monitored per source
• Varying data formats, file sizes, long tails, failures → Conversion as a map-reduce job; coalesce, chunking, checkpoint
• Data quality → Pluggable validations
• BCP → Supports Hot-Hot, Hot-Warm
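The "coalesce, chunking" idea above can be sketched as a simple planning step: group many small files (and pieces of large ones) into transfer units of roughly equal size so no single map task is starved or overloaded. This is an illustrative sketch only; the function name and greedy strategy are assumptions, not the system's actual implementation.

```python
def plan_chunks(file_sizes, target_bytes):
    """Greedily group files into chunks of roughly target_bytes each.

    Illustrative only: the real acquisition pipeline coalesces small files
    and chunks transfers inside a map-reduce job; this hypothetical helper
    just shows the bin-packing intuition behind it.
    """
    chunks, current, current_size = [], [], 0
    # Largest files first, so small files fill the gaps they leave.
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > target_bytes:
            chunks.append(current)          # close the current chunk
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Example: one 90-unit file fills a chunk alone; smaller files share chunks.
print(plan_chunks({"a": 90, "b": 60, "c": 40, "d": 10}, 100))
# → [['a'], ['b', 'c'], ['d']]
```

Each chunk then becomes one map task's unit of work, which keeps per-task load roughly uniform regardless of the input's file-size distribution.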
Data Replication
Challenge → Solution
• Steady growth in data volumes → Heavy lifting delegated to map-only jobs (DistCp v2)
• Cluster proximity, availability → Tree of copies with at most one cross-datacenter copy
• Long tails → Dynamic split assignment; each map picks up only one file at a time (DistCp v2)
• Data export → Export as a replication target (push); ad hoc access uses HDFS Proxy (pull)
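Why dynamic split assignment tames long tails can be seen in a toy simulation: with static splits, a worker pre-assigned one huge file plus its share of small ones finishes long after everyone else; when each worker pulls one file at a time from a shared queue (as DistCp v2's dynamic strategy does), idle workers absorb the small files. The function names below are hypothetical, for illustration only.

```python
import heapq

def makespan_static(durations, workers):
    """Pre-assign files round-robin (static splits); return the makespan."""
    totals = [0] * workers
    for i, d in enumerate(durations):
        totals[i % workers] += d
    return max(totals)

def makespan_dynamic(durations, workers):
    """Each worker pulls the next single file when free; return the makespan."""
    finish = [0] * workers          # min-heap of worker finish times
    heapq.heapify(finish)
    for d in durations:
        t = heapq.heappop(finish)   # earliest-free worker takes the file
        heapq.heappush(finish, t + d)
    return max(finish)

# One 10-unit file plus nine 1-unit files, copied by 2 workers:
jobs = [10, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(makespan_static(jobs, 2))   # → 14  (one worker gets the big file AND half the rest)
print(makespan_dynamic(jobs, 2))  # → 10  (the other worker drains all the small files)
```

The gap widens as file-size skew grows, which is exactly the long-tail behavior the slide calls out.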
Data Lifecycle Management
Challenge → Solution
• Aging data expires → Retention removes old data (as required for legal compliance and for capacity purposes)
• Data privacy → Anonymization of Personally Identifiable Information
• SOX compliance & audit → Archival/restoration to/from tape (13 months)
• SEC compliance & audit → Archival/restoration to/from tape (7 years)
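The retention row above amounts to a periodic sweep that selects partitions older than a policy window. A minimal sketch, assuming feeds land under date-named directories (`.../YYYY-MM-DD`); the real service is driven by per-feed policy and deletes via HDFS, neither of which is shown here.

```python
from datetime import date, timedelta

def expired_partitions(partitions, retention_days, today):
    """Return date-partitioned paths older than the retention window.

    Hypothetical helper: assumes each path ends in an ISO date directory.
    """
    cutoff = today - timedelta(days=retention_days)
    out = []
    for path in partitions:
        # Parse the trailing .../YYYY-MM-DD component as the partition date.
        day = date.fromisoformat(path.rstrip("/").rsplit("/", 1)[-1])
        if day < cutoff:
            out.append(path)
    return sorted(out)

parts = ["/data/feed/2012-01-01", "/data/feed/2012-03-01"]
print(expired_partitions(parts, 30, date(2012, 3, 15)))
# → ['/data/feed/2012-01-01']
```

With a 13-month or 7-year window the same sweep instead feeds the tape-archival step rather than a delete.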
Operability, Manageability
Challenge → Solution
• Monitor and administer data loading across clusters, colos → Central dashboard for monitoring and administration; integrated view of jobs running across clusters and colos
• Interoperability across incompatible Hadoop versions → Support for various Hadoop versions using a reverse class loader; one data-loading instance per colo that can work across clusters
• Maintenance windows, failures, system shutdown → Partial copy + auto resume; automatic resume upon restart
• SLA management → Introspection via metrics
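The SLA-monitoring idea reduces to comparing each recurring feed's expected next arrival against the clock. A minimal sketch under stated assumptions: the function, its grace-period parameter, and the "instances overdue" output are all hypothetical, not the dashboard's actual model.

```python
from datetime import datetime, timedelta

def feed_status(last_arrival, period, now, grace=timedelta(0)):
    """Classify a recurring feed as on-time or late against its SLA.

    Illustrative only: "late" here means the next instance is overdue
    beyond an optional grace period.
    """
    due = last_arrival + period
    if now <= due + grace:
        return "on-time"
    missed = (now - due) // period + 1   # whole instances overdue
    return f"late ({missed} instance(s) overdue)"

# Hourly feed, last seen at midnight:
print(feed_status(datetime(2012, 1, 1), timedelta(hours=1),
                  datetime(2012, 1, 1, 0, 30)))   # → on-time
print(feed_status(datetime(2012, 1, 1), timedelta(hours=1),
                  datetime(2012, 1, 1, 3, 30)))   # → late (3 instance(s) overdue)
```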
Architecture
Highlights
• “Workflows” abstraction over MR jobs
• More workflows than Oozie within Y!
• Amounts to >30% of jobs launched on the clusters
• Occupies less than 10% of cluster capacity (slots)
• Solves recurring batch data transfers
• 2300+ feeds with varying periodicity (5m to monthly)
• 100+ TB/day of data movement
• SLAs
• Central dashboard
• SLA monitoring with ETA on feeds
Future
1 HCatalog/Oozie integration
2 Self service
3 Support for event-level data
4 Storage efficiency
Data Management on Hadoop @ Y!
Thank You
Q&A