Latest information on Apache Hadoop YARN from Big Data Camp LA by Bikas Saha
© Hortonworks Inc. 2013
Apache Hadoop: Next Generation Compute Platform - YARN
Bikas Saha (@bikassaha)
1st Generation Hadoop: Batch Focus
HADOOP 1.0: Built for Web-Scale Batch Apps

[Diagram: each usage pattern (batch, interactive, online) runs as a single app on its own dedicated HDFS cluster]

All other usage patterns MUST leverage the same infrastructure. This forces the creation of silos to manage mixed workloads.
Hadoop 1 Architecture
JobTracker
- Manages cluster resources & job scheduling

TaskTracker
- Per-node agent
- Manages tasks
Hadoop 1 Limitations
Scalability
- Max cluster size ~5,000 nodes
- Max concurrent tasks ~40,000
- Coarse synchronization in the JobTracker

Availability
- A JobTracker failure kills all queued & running jobs

Resource Utilization
- Hard partition of resources into map and reduce slots leads to non-optimal resource utilization

Alternate Paradigms and Services
- Lacks support for alternate paradigms and services
- Iterative applications in MapReduce are 10x slower
Hadoop 2 - YARN Architecture
ResourceManager (RM)
- Central agent
- Manages and allocates cluster resources

NodeManager (NM)
- Per-node agent
- Manages tasks, enforces allocations
Data Processing Engines Run Natively IN Hadoop
- BATCH: MapReduce
- INTERACTIVE: Tez
- STREAMING: Storm, S4, …
- GRAPH: Giraph
- MICROSOFT: REEF
- SAS: LASR, HPA
- ONLINE: HBase
- OTHERS

[Diagram: the engines above run on Apache YARN (cluster resource management), which runs on HDFS2 (redundant, reliable storage)]

Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: Double the processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads

The Data Operating System for Hadoop 2.0
5 Key Benefits of YARN
1. New Applications & Services
2. Improved cluster utilization
3. Scale
4. Experimental Agility
5. Shared Services
YARN: Efficiency with Shared Services
Yahoo! leverages YARN:
- 40,000+ nodes running YARN across over 365PB of data
- ~400,000 jobs per day for about 10 million hours of compute time
- Estimated a 60% - 150% improvement in node usage per day using YARN
- Eliminated a colo (~10K nodes) due to increased utilization
Key Improvements in YARN
Framework supporting multiple applications
- Separates generic resource brokering from application logic
- Defines protocols/libraries and provides a framework for custom application development
- Shares the same Hadoop cluster across applications

Application Agility and Innovation
- Using Protocol Buffers for RPC gives wire compatibility
- MapReduce becomes an application in user space, unlocking safe innovation
- Multiple versions of an app can co-exist, enabling experimentation
- Easier upgrade of framework and applications
Key Improvements in YARN
Scalability
- Removed complex app logic from the RM to scale further
- State-machine, message-passing based loosely coupled design

Cluster Utilization
- Generic resource container model replaces fixed Map/Reduce slots; container allocations are based on locality and memory (CPU coming soon)
- Cluster is shared among multiple applications

Reliability and Availability
- Simpler RM state makes it easier to save and restart (work in progress)
- Application checkpointing can allow an app to be restarted; the MapReduce application master saves its state in HDFS
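The generic container model can be sketched in a few lines. This is a toy illustration, not YARN's scheduler code: the `Node` and `ContainerRequest` types and the two-pass matching are my own simplification of "allocate by memory, prefer locality, fall back when locality can't be met".

```python
# Minimal sketch (assumed, not YARN's actual scheduler): a generic
# container request is matched against node capacity by memory,
# preferring the requested locality, instead of fixed map/reduce slots.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_mb: int

@dataclass
class ContainerRequest:
    memory_mb: int
    preferred_nodes: list  # locality preference, not a guarantee

def allocate(request, nodes):
    """Return the name of the node the container is placed on, or None."""
    # First pass: honor the locality preference when capacity allows.
    for node in nodes:
        if node.name in request.preferred_nodes and node.free_mb >= request.memory_mb:
            node.free_mb -= request.memory_mb
            return node.name
    # Second pass: locality requests may not always be met; fall back
    # to any node with enough free memory.
    for node in nodes:
        if node.free_mb >= request.memory_mb:
            node.free_mb -= request.memory_mb
            return node.name
    return None  # nothing free: the AM asks again on a later heartbeat

nodes = [Node("n1", 2048), Node("n2", 4096)]
print(allocate(ContainerRequest(3072, ["n1"]), nodes))  # n1 too small, falls back to n2
```

Because any resource on any node can host any container, mixed workloads no longer strand capacity the way fixed map/reduce slots did.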
YARN as Cluster Operating System
[Diagram: a grid of NodeManagers under a central ResourceManager/Scheduler, running containers from three concurrent applications: a Batch job (map 1.1, map 1.2, reduce 1.1), an Interactive SQL job (vertex 1.1.1, 1.1.2, 1.2.1, 1.2.2), and a Real-Time app (nimbus0, nimbus1, nimbus2)]
Multi-Tenancy with Capacity Scheduler
• Queues
  - Economics as queue capacity
  - Hierarchical queues
• SLAs
  - Preemption
• Resource Isolation
  - Linux: cgroups
  - MS Windows: Job Control
  - Roadmap: virtualization (Xen, KVM)
• Administration
  - Queue ACLs
  - Run-time re-configuration for queues
  - Charge-backs
[Diagram: Capacity Scheduler hierarchical queues under the ResourceManager's Scheduler. Under root: Adhoc 10% (Dev 10%, Prod 70%, Reserved 20%), DW 70% (Dev 20%, Prod 80%, with Prod split into P0 70% and P1 30%), Mrkting 20%]
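Hierarchical queue capacities compose multiplicatively: a child's absolute share of the cluster is its percentage of its parent's absolute share. The sketch below illustrates that arithmetic with the queue layout from the slide's example diagram; the nesting and the `_pct` convention are my own illustrative encoding, not a Capacity Scheduler config format.

```python
# Sketch: how hierarchical queue capacities compose in the Capacity
# Scheduler. A child's absolute cluster share = (child %) x (parent's
# absolute share). The dict encoding here is illustrative only.

def absolute_capacities(tree, parent_share=1.0, prefix="root"):
    """Flatten a nested {name: percent-or-subtree} dict into absolute shares."""
    shares = {}
    for name, value in tree.items():
        if name == "_pct":          # a subtree's own percentage, not a child
            continue
        path = f"{prefix}.{name}"
        if isinstance(value, dict):
            share = parent_share * value["_pct"] / 100
            shares[path] = share
            shares.update(absolute_capacities(value, share, path))
        else:
            shares[path] = parent_share * value / 100
    return shares

queues = {
    "Adhoc": {"_pct": 10, "Dev": 10, "Prod": 70, "Reserved": 20},
    "DW": {"_pct": 70, "Dev": 20,
           "Prod": {"_pct": 80, "P0": 70, "P1": 30}},
    "Mrkting": 20,
}

shares = absolute_capacities(queues)
# DW.Prod.P0 gets 70% of 80% of 70% = 39.2% of the cluster
print(round(shares["root.DW.Prod.P0"] * 100, 1))  # 39.2
```

Because shares are relative, an admin can re-divide one queue's capacity without touching its siblings, which is what makes run-time queue re-configuration tractable.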
YARN Eco-system
Applications Powered by YARN
Apache Giraph – Graph Processing
Apache Hama - BSP
Apache Hadoop MapReduce – Batch
Apache Tez – Batch/Interactive
Apache S4 – Stream Processing
Apache Samza – Stream Processing
Apache Storm – Stream Processing
Apache Spark – Iterative/Interactive applications
Elastic Search – Scalable Search
Cloudera Llama – Impala on YARN
DataTorrent – Data Analysis
HOYA – HBase on YARN
RedPoint - Data Management
Frameworks Powered By YARN
Weave by Continuuity
REEF by Microsoft
Spring support for Hadoop 2
There's an app for that...
YARN App Marketplace!
YARN APIs & Client Libraries
Application Client Protocol: client to RM interaction
- Library: YarnClient
- Application lifecycle control
- Access cluster information

Application Master Protocol: AM to RM interaction
- Library: AMRMClient / AMRMClientAsync
- Resource negotiation
- Heartbeat to the RM

Container Management Protocol: AM to NM interaction
- Library: NMClient / NMClientAsync
- Launch allocated containers
- Stop running containers

Use external frameworks like Weave/REEF/Spring
YARN Application Flow
[Diagram: the Application Client uses YarnClient (Application Client Protocol) to talk to the ResourceManager, and an app-specific API to talk to its Application Master. The Application Master uses AMRMClient (Application Master Protocol) to negotiate resources with the ResourceManager, and NMClient (Container Management Protocol) to have NodeManagers launch App Containers]
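The flow above can be simulated end to end with toy stand-ins. None of the classes below are real Hadoop APIs; they are mock objects I'm using to show who calls whom over which protocol: client submits to the RM, the RM launches the AM, and the AM negotiates containers and asks NodeManagers to start them.

```python
# Toy simulation of the YARN application flow (assumed stand-ins, not
# real Hadoop APIs). Comments name the protocol each call represents.

class NodeManager:
    def __init__(self, name):
        self.name = name

    def start_container(self, task):          # Container Management Protocol
        return f"{self.name} running {task}"

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes
        self.events = []

    def submit_application(self, app_name):   # Application Client Protocol
        self.events.append(f"RM: launching AM for {app_name}")
        return ApplicationMaster(app_name, self)

    def allocate(self, n):                    # Application Master Protocol
        granted = self.nodes[:n]              # simplistic grant: first n nodes
        self.events.append(f"RM: granted {len(granted)} containers")
        return granted

class ApplicationMaster:
    def __init__(self, app_name, rm):
        self.app_name, self.rm = app_name, rm

    def run(self, tasks):
        # Negotiate containers with the RM, then launch one task per grant.
        nodes = self.rm.allocate(len(tasks))
        return [nm.start_container(t) for nm, t in zip(nodes, tasks)]

rm = ResourceManager([NodeManager("nm1"), NodeManager("nm2")])
am = rm.submit_application("wordcount")       # client-side call
print(am.run(["map1", "map2"]))               # ['nm1 running map1', 'nm2 running map2']
```

Note the separation the slides emphasize: the RM only brokers resources; all application logic (what to run in each container) lives in the AM.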
YARN Best Practices
Use the provided client libraries

Resource negotiation
- You may ask, but you may not get what you want - immediately
- Locality requests may not always be met
- Resources like memory/CPU are guaranteed

Failure handling
- Remember, anything can fail (or YARN can preempt your containers)
- AM failures are handled by YARN, but container failures are handled by the application

Checkpointing
- Checkpoint AM state for AM recovery
- If tasks are long running, checkpoint task state
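The checkpointing practice above amounts to: persist completed work to durable storage, and on restart skip anything already recorded. A minimal sketch, assuming a JSON file as the durable store (the MapReduce AM uses HDFS; the `CheckpointingAM` class and its task model are hypothetical):

```python
# Sketch of AM checkpointing (assumed design, not the MapReduce AM's
# code): progress is written to durable storage after each completed
# task, so a restarted AM resumes instead of redoing finished work.

import json
import os
import tempfile

class CheckpointingAM:
    def __init__(self, checkpoint_path, tasks):
        self.path = checkpoint_path
        self.tasks = tasks
        self.done = self._recover()            # resume from a prior run, if any

    def _recover(self):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return set(json.load(f))
        return set()

    def _checkpoint(self):
        with open(self.path, "w") as f:
            json.dump(sorted(self.done), f)

    def run(self):
        ran = []
        for task in self.tasks:
            if task in self.done:
                continue                       # finished before the failure
            ran.append(task)                   # (real work would happen here)
            self.done.add(task)
            self._checkpoint()                 # persist after every completion
        return ran

path = os.path.join(tempfile.mkdtemp(), "am.ckpt")
first = CheckpointingAM(path, ["t1", "t2", "t3"])
first.run()                                    # completes all three tasks
restarted = CheckpointingAM(path, ["t1", "t2", "t3", "t4"])
print(restarted.run())                         # only the new work: ['t4']
```

Checkpointing after every task is the simplest policy; a real AM would batch writes or checkpoint on a timer to keep the durable store from becoming a bottleneck.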
YARN Best Practices
Cluster dependencies
- Try to make zero assumptions about the cluster
- Your application bundle should deploy everything required using YARN's local resources

Client-only installs if possible
- Simplifies cluster deployment and multi-version support

Securing your application
- YARN does not secure communications between the AM and its containers
Testing/Debugging your Application
MiniYARNCluster
- Regression tests

Unmanaged AM
- Support to run the AM outside of a YARN cluster for manual testing

Logs
- Log aggregation support to push all logs into HDFS
- Accessible via CLI and UI
YARN Future Work
ResourceManager high availability and work-preserving restart
- Work in progress

Scheduler enhancements
- SLA-driven scheduling, low-latency allocations
- Multiple resource types: disk/network/GPUs/affinity

Rolling upgrades

Long-running services
- Better support for running services like HBase
- Discovery of services, upgrades without downtime

More utilities/libraries for application developers
- Failover/checkpointing
• http://hadoop.apache.org
Key Take-Aways
YARN - a distributed application framework to build/run multiple applications (the original being MapReduce)

YARN is completely backwards compatible for existing MapReduce apps

YARN allows different applications to share the same cluster

YARN enables fine-grained resource management via generic resource containers

YARN provides better control over application upgrades via wire compatibility
Thank you!
Download Sandbox: Experience Apache Hadoop
Both 2.0 and 1.x versions available!
http://hortonworks.com/products/hortonworks-sandbox/

Additional Questions?