Data Management is a suite of services that manage the lifecycle of data on the Hadoop clusters.
Data Management on Hadoop @ Y!
Seetharam Venkatesh ([email protected])
Hadoop Data Infrastructure Lead/Architect
Agenda
1 Introduction
2 Challenging Data Landscape
3 The Solution
4 Future opportunities
Introduction
Replication
Data Export
Anonymization
Retention
Archival
Acquisition
Why is Data Management Critical?
Challenging Data Landscape
[Diagram: data flows between Data Warehouses, Databases, NAS, and Tape across data centers, with Hadoop Clusters in the middle]
• Steady growth in data movement volumes per day
• SLA requirements (minutes to a day)
• BCP requirements (Hot-Hot, Hot-Warm)
• Feeds with varying periodicity (minutes to monthly)
Data Acquisition
Challenge → Solution
• Steady growth in data volumes → Heavy lifting delegated to map-only jobs
• Diverse interfaces, API contracts → Pluggable interfaces; adaptors for specific APIs
• Data sources have diverse serving capacity → Throttling enables uniform load on the data source
• Data source isolation → Asynchronous scheduling; progress monitored per source
• Varying data formats, file sizes, long tails, failures → Conversion as a map-reduce job; coalesce, chunking, checkpoint
• Data quality → Pluggable validations
• BCP → Supports Hot-Hot, Hot-Warm
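The "coalesce, chunking" idea above can be sketched as a simple planning step: group many small files (and pieces of large ones) into transfer units of roughly equal size so no single map task is starved or overloaded. This is an illustrative sketch only; the function name and greedy strategy are assumptions, not the system's actual implementation.

```python
def plan_chunks(file_sizes, target_bytes):
    """Greedily group files into chunks of roughly target_bytes each.

    Illustrative only: the real acquisition pipeline coalesces small files
    and chunks transfers inside a map-reduce job; this hypothetical helper
    just shows the bin-packing intuition behind it.
    """
    chunks, current, current_size = [], [], 0
    # Largest files first, so small files fill the gaps they leave.
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > target_bytes:
            chunks.append(current)          # close the current chunk
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Example: one 90-unit file fills a chunk alone; smaller files share chunks.
print(plan_chunks({"a": 90, "b": 60, "c": 40, "d": 10}, 100))
# → [['a'], ['b', 'c'], ['d']]
```

Each chunk then becomes one map task's unit of work, which keeps per-task load roughly uniform regardless of the input's file-size distribution.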
Data Replication
Challenge → Solution
• Steady growth in data volumes → Heavy lifting delegated to map-only jobs (DistCp v2)
• Cluster proximity, availability → Tree of copies with at most one cross-datacenter copy
• Long tails → Dynamic split assignment; each map picks up only one file at a time (DistCp v2)
• Data export → Export as a replication target (push); ad hoc access uses HDFS Proxy (pull)
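Why dynamic split assignment tames long tails can be seen in a toy simulation: with static splits, a worker pre-assigned one huge file plus its share of small ones finishes long after everyone else; when each worker pulls one file at a time from a shared queue (as DistCp v2's dynamic strategy does), idle workers absorb the small files. The function names below are hypothetical, for illustration only.

```python
import heapq

def makespan_static(durations, workers):
    """Pre-assign files round-robin (static splits); return the makespan."""
    totals = [0] * workers
    for i, d in enumerate(durations):
        totals[i % workers] += d
    return max(totals)

def makespan_dynamic(durations, workers):
    """Each worker pulls the next single file when free; return the makespan."""
    finish = [0] * workers          # min-heap of worker finish times
    heapq.heapify(finish)
    for d in durations:
        t = heapq.heappop(finish)   # earliest-free worker takes the file
        heapq.heappush(finish, t + d)
    return max(finish)

# One 10-unit file plus nine 1-unit files, copied by 2 workers:
jobs = [10, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(makespan_static(jobs, 2))   # → 14  (one worker gets the big file AND half the rest)
print(makespan_dynamic(jobs, 2))  # → 10  (the other worker drains all the small files)
```

The gap widens as file-size skew grows, which is exactly the long-tail behavior the slide calls out.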
Data Lifecycle Management
Challenge → Solution
• Aging data expires → Retention removes old data (as required for legal compliance and for capacity purposes)
• Data privacy → Anonymization of Personally Identifiable Information
• SOX compliance & audit → Archival/restoration to/from tape (13 months)
• SEC compliance & audit → Archival/restoration to/from tape (7 years)
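The retention row above amounts to a periodic sweep that selects partitions older than a policy window. A minimal sketch, assuming feeds land under date-named directories (`.../YYYY-MM-DD`); the real service is driven by per-feed policy and deletes via HDFS, neither of which is shown here.

```python
from datetime import date, timedelta

def expired_partitions(partitions, retention_days, today):
    """Return date-partitioned paths older than the retention window.

    Hypothetical helper: assumes each path ends in an ISO date directory.
    """
    cutoff = today - timedelta(days=retention_days)
    out = []
    for path in partitions:
        # Parse the trailing .../YYYY-MM-DD component as the partition date.
        day = date.fromisoformat(path.rstrip("/").rsplit("/", 1)[-1])
        if day < cutoff:
            out.append(path)
    return sorted(out)

parts = ["/data/feed/2012-01-01", "/data/feed/2012-03-01"]
print(expired_partitions(parts, 30, date(2012, 3, 15)))
# → ['/data/feed/2012-01-01']
```

With a 13-month or 7-year window the same sweep instead feeds the tape-archival step rather than a delete.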
Operability, Manageability
Challenge → Solution
• Monitor and administer data loading across clusters, colos → Central dashboard for monitoring and administration; integrated view of jobs running across clusters and colos
• Interoperability across incompatible Hadoop versions → Support for various Hadoop versions using a reverse class loader; one data-loading instance per colo that can work across clusters
• Maintenance windows, failures, system shutdown → Partial copy + auto resume; automatic resume upon restart
• SLA management → Introspection via metrics
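The SLA-monitoring idea reduces to comparing each recurring feed's expected next arrival against the clock. A minimal sketch under stated assumptions: the function, its grace-period parameter, and the "instances overdue" output are all hypothetical, not the dashboard's actual model.

```python
from datetime import datetime, timedelta

def feed_status(last_arrival, period, now, grace=timedelta(0)):
    """Classify a recurring feed as on-time or late against its SLA.

    Illustrative only: "late" here means the next instance is overdue
    beyond an optional grace period.
    """
    due = last_arrival + period
    if now <= due + grace:
        return "on-time"
    missed = (now - due) // period + 1   # whole instances overdue
    return f"late ({missed} instance(s) overdue)"

# Hourly feed, last seen at midnight:
print(feed_status(datetime(2012, 1, 1), timedelta(hours=1),
                  datetime(2012, 1, 1, 0, 30)))   # → on-time
print(feed_status(datetime(2012, 1, 1), timedelta(hours=1),
                  datetime(2012, 1, 1, 3, 30)))   # → late (3 instance(s) overdue)
```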
Architecture
Highlights
• “Workflows” abstraction over MR jobs
• More workflows than Oozie within Y!
• Amounts to >30% of jobs launched on the clusters
• Occupies less than 10% of cluster capacity (slots)
• Solves recurring batch data transfers
• 2300+ feeds with varying periodicity (5m to monthly)
• 100+ TB/day of data movement
• SLAs
• Central dashboard
• SLA monitoring with ETA on feeds
Future
1 HCatalog/Oozie integration
2 Self service
3 Support for event-level data
4 Storage efficiency
Data Management on Hadoop @ Y!
Thank You
Q&A