State of Resource Management in Big Data: What it is and Why You Should Care
Khalid Ahmed, Senior Technical Staff Member (STSM), Architect, IBM Platform Computing
Yong Feng, Architect, IBM Platform Computing
Contents
1. Background
2. Resource Management Architectures
3. Comparisons: YARN, Mesos, Kubernetes
4. Use Cases
IBM Platform Computing: Infrastructure software for high performance applications
– Acquired by IBM in 2012
– 20 years managing distributed scale-out systems with 2000+ customers in many industries
– Market leading workload, resource and cluster management
– Unmatched scalability (small clusters to global grids) and enterprise production-proven reliability
– Heterogeneous environments – x86 and Power plus 3rd party systems, virtual and bare metal, accelerators / GPU, cloud, etc.
– Shared services for both compute and data intensive workloads
– 23 of the 30 largest commercial enterprises
– Over 5M CPUs under management
– 60% of the top financial services companies
Resource Management Terminology
• Cluster Management
• Resource Allocation
• Distributed & Parallel Execution
• Scheduling & Placement
• Workload Management
• Batch Queuing
History of Resource Management in Distributed Systems
1990s – High-performance Computing; Batch Queuing Systems; Message Passing Interface (MPI)
  Systems: NQS/DQS, Platform LSF, Sun Grid Engine, Globus
2000s – P2P Computing; Parallel SOA; Big Data – MR v1; Virtualization
  Systems: VMware, United Devices, DataSynapse, Platform Symphony, Apache Hadoop
2010-2015 – Big Data – MR v2; Cloud Computing; Virtualization
  Systems: OpenStack, Apache YARN, Apache Mesos
2015+ – Containerization; Hyperconverged/Hyperscale; Hybrid Cloud; Data Center OS (DCOS)
  Systems: Docker, Kubernetes, Swarm, Cloud Foundry
What problem are we trying to solve? Creating infrastructure silos to accommodate apps is inefficient

Many new solution workloads arrive in addition to existing apps – overnight batch financial reporting, counterparty credit risk modeling, distributed ETL and sensitivity analysis, Hadoop-based sentiment analysis. This leads to costly, complex, siloed, under-utilized infrastructure and replicated data. Low utilization = higher cost.
Convergence of Compute & Data: A Data-centric Architecture for High Performance

Old compute-centric model:
• Data lives on disk and tape
• Move data to the CPU as needed
• Deep storage hierarchy

New data-centric model:
• Data lives in persistent storage/memory
• Many CPUs surround and use it
• Shallow/flat storage hierarchy
• Massive parallelism of data & computing, enabled by flash and phase-change memory, manycore processors and FPGAs

Big Data and Exascale High Performance Computing are driving many similar computer systems requirements: move the compute to the data!
Data Center OS: System Software for Hyperscale Datacenters
Nodes run a node OS and a node agent on virtual or physical hardware; the Data Center OS layers the following services across them:
• Patterns & REST API
• Distributed Services Manager – manages the lifecycle of long-running services
• Resource Manager – aggregates and shares resources across multiple frameworks
• Remote Execution & Container Management – manages the execution of containers (discovery, clustering, load-balancing)
• Distributed File/Block/Object System – persistent storage for applications and services, supporting multiple protocols
• Node agents – in effect, the device drivers for nodes

Nodes become the resources managed by the Data Center OS. Specialized hardware (storage, network switches, routers) becomes software services on commodity hardware.
Resource Manager Architectures
What is expected from a Resource Manager

An open-source resource management solution manages the resources used by services on a shared infrastructure. It should:
1. Hide the details of resource management and failure handling, so that users can focus on application development
2. Operate with high availability and reliability, and support applications in doing the same
3. Run workloads across tens of thousands of machines efficiently

Core capabilities:
• Resource Abstraction
• Workload Placement
• High Availability
• Monitoring
• Membership Management
• Workload Provisioning and Execution
• Scalability
• Troubleshooting
• Resource Sharing and Planning
• Security and Isolation
• Performance
• Service Management

Who is building what:
• Hortonworks, Cloudera, MapR – YARN
• Docker – Swarm
• Mesosphere, Twitter, eBay, Netflix – Mesos
• Google, Red Hat, CoreOS – Kubernetes

We need a common solution to manage the resources of large clusters (~10K machines) shared by multiple workloads:
• Sharing policies: tenant reservations, shares, isolation
• Placement policies: topology-driven affinity, anti-affinity, proximity, min/max/desired
• Execution: container and non-container

The target stack: HPC, PaaS, data services, long-running services, batch jobs and other workloads, all on a common resource management layer over shared infrastructure and data.
Hadoop YARN
YARN is not the first general Resource Management platform. So what’s different? It’s data!
• Store all your data in one place … (HDFS)
• Interact with that data in multiple ways … (YARN Platform + Apps)
• Scale as you go, shared, multi-tenant, secure … (The Hadoop Stack)
YARN Architecture
• Resource management framework: central
  – Resource Manager (RM) controls resource allocation
  – Application Master (AM) negotiates with the RM for resources and launches executors to run jobs (sketched below)
• Resource allocation policies
  – Policy plug-ins; currently supported: capacity scheduler, fair sharing
• Framework integration
  – Implement a client to launch the application through the RM
  – Implement a driver for the application scheduler to communicate with the RM and Node Manager
  – Make the framework executor available to YARN
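To make the RM/AM negotiation concrete, here is a minimal sketch (not from the deck) against the Hadoop 2.x Java client API: an Application Master that registers with the RM, requests one container, and launches a framework executor on it through the Node Manager. The executor command path is a placeholder.

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MiniAppMaster {
      public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Register this Application Master with the Resource Manager.
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();
        rm.registerApplicationMaster("", 0, "");

        // Negotiate: ask the RM for one 1 GB / 1 vcore container.
        rm.addContainerRequest(new ContainerRequest(
            Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        // Poll until the RM grants the container, then start the framework
        // executor on it via the Node Manager.
        NMClient nm = NMClient.createNMClient();
        nm.init(conf);
        nm.start();
        boolean launched = false;
        while (!launched) {
          for (Container c : rm.allocate(0.0f).getAllocatedContainers()) {
            ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("/path/to/my-executor"), // placeholder
                null, null, null);
            nm.startContainer(c, ctx);
            launched = true;
          }
          Thread.sleep(1000);
        }
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
      }
    }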
Mesos in BDAS (the Berkeley Data Analytics Stack)
Mesos Programming Interface
Mesos Architecture
• Resource management framework: hierarchical
  – Mesos offers resources to frameworks
  – Framework schedulers accept or reject the offered resources (sketched below)
• Resource allocation policies
  – Pluggable allocation modules; currently supports fair sharing
  – Resource allocation decisions are delegated to the allocation modules
  – Resource preferences are communicated to Mesos through common APIs
• Framework integration
  – Modify the framework scheduler to communicate with the Mesos master through its API
  – Make the framework executor binary available to Mesos
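The accept-or-reject flow above can be sketched with the classic Mesos v0 Java API (a minimal illustration, not from the deck; the master address and task command are placeholders): the framework scheduler launches one shell task on the first offer it accepts and declines everything else.

    import java.util.Collections;
    import java.util.List;
    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;
    import org.apache.mesos.Protos.*;

    public class MiniFramework implements Scheduler {
      private boolean launched = false;

      // Mesos offers resources; accept the first offer, decline the rest.
      public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
        for (Offer offer : offers) {
          if (launched) {
            driver.declineOffer(offer.getId());
            continue;
          }
          TaskInfo task = TaskInfo.newBuilder()
              .setName("demo-task")
              .setTaskId(TaskID.newBuilder().setValue("task-0"))
              .setSlaveId(offer.getSlaveId())
              .addResources(Resource.newBuilder().setName("cpus")
                  .setType(Value.Type.SCALAR)
                  .setScalar(Value.Scalar.newBuilder().setValue(1.0)))
              .setCommand(CommandInfo.newBuilder().setValue("echo hello")) // placeholder
              .build();
          driver.launchTasks(Collections.singletonList(offer.getId()),
                             Collections.singletonList(task));
          launched = true;
        }
      }

      // Remaining Scheduler callbacks, empty for brevity.
      public void registered(SchedulerDriver d, FrameworkID id, MasterInfo m) {}
      public void reregistered(SchedulerDriver d, MasterInfo m) {}
      public void offerRescinded(SchedulerDriver d, OfferID id) {}
      public void statusUpdate(SchedulerDriver d, TaskStatus s) {}
      public void frameworkMessage(SchedulerDriver d, ExecutorID e, SlaveID s, byte[] b) {}
      public void disconnected(SchedulerDriver d) {}
      public void slaveLost(SchedulerDriver d, SlaveID s) {}
      public void executorLost(SchedulerDriver d, ExecutorID e, SlaveID s, int status) {}
      public void error(SchedulerDriver d, String message) {}

      public static void main(String[] args) {
        FrameworkInfo fw = FrameworkInfo.newBuilder()
            .setUser("") // empty user lets Mesos fill in the current user
            .setName("mini-framework").build();
        new MesosSchedulerDriver(new MiniFramework(), fw, "127.0.0.1:5050").run();
      }
    }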
Kubernetes Basic Concepts

• Only supports container-based applications/workloads
  – Currently only Docker and Rocket
• POD: the smallest schedulable unit (see the sketch after this list)
  – All containers within a POD are placed onto the same host and share the same (network) namespace
• Replication group: manages one or more PODs
  – Uses POD labels to ensure that only the desired number of PODs with specific labels is running at any time
  – Used for scale-up/down, failure recovery and rolling upgrades
• Services: find and load-balance across one or more PODs
  – Use POD labels to define the endpoints of a service
  – Handle changes in IP address, host, number of PODs, etc.
  – Service records are published in i) environment variables and ii) DNS service entries
• Namespaces: multi-tenancy support
  – PODs, services and replication controllers can be put into different namespaces to provide logical isolation for the purpose of management
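As one concrete illustration of PODs, replication groups, services and labels, here is a minimal sketch using the fabric8 Kubernetes Java client (an assumption on our part; the deck does not prescribe a client, and the names, image and namespace are placeholders): a replication controller keeps three app=web PODs running, and a service load-balances across whatever PODs carry that label.

    import io.fabric8.kubernetes.api.model.ReplicationController;
    import io.fabric8.kubernetes.api.model.ReplicationControllerBuilder;
    import io.fabric8.kubernetes.api.model.Service;
    import io.fabric8.kubernetes.api.model.ServiceBuilder;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class K8sConceptsDemo {
      public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
          // Replication controller: keep 3 PODs labeled app=web running.
          ReplicationController rc = new ReplicationControllerBuilder()
              .withNewMetadata().withName("web-rc").endMetadata()
              .withNewSpec()
                .withReplicas(3)
                .addToSelector("app", "web")
                .withNewTemplate()
                  .withNewMetadata().addToLabels("app", "web").endMetadata()
                  .withNewSpec()
                    .addNewContainer().withName("web").withImage("nginx").endContainer()
                  .endSpec()
                .endTemplate()
              .endSpec().build();

          // Service: a stable endpoint that load-balances across app=web PODs.
          Service svc = new ServiceBuilder()
              .withNewMetadata().withName("web-svc").endMetadata()
              .withNewSpec()
                .addToSelector("app", "web")
                .addNewPort().withPort(80).endPort()
              .endSpec().build();

          client.replicationControllers().inNamespace("default").create(rc);
          client.services().inNamespace("default").create(svc);
        }
      }
    }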
Kubernetes architecture
• K8s master: API server, scheduler and controller manager; cluster and service state is kept in the etcd service; driven from the kubectl CLI
• K8s minions: each runs a kubelet, a proxy and cAdvisor
• Many components are pluggable:
  – schedulers
  – container runtime
  – persistent data store
  – cloud providers
  – ...
Comparison of Open-source Resource Managers
Offer vs Request

Offer model (Mesos-style): per-workload framework schedulers (jobs of type A, jobs of type B) run above a master that holds the cluster state. The master (1) partitions resources among the frameworks and (2) offers them; each framework scheduler (3) schedules tasks and (4) accepts or declines the offers, and the master may (5) revoke an offer.
• The master has no knowledge of the workloads; workloads have only a partial view of the system.
• Issues: offers are computed without any workload awareness and may be unsuitable for a workload.
• Possible solution: optimistic offers.

Request model (YARN-style): a framework (1) requests resources, and the master (2) allocates resources based on workload priorities and requirements and returns the allocation; the framework then schedules against its allocation, including (4) small, short-lived tasks (jobs of type A) alongside long-running work (jobs of type B); the master may (5) reclaim resources, and the framework (6) returns resources when done.
• The master knows the entire state and a coarse-grained definition of the workloads; workloads have a partial view, but one selected based on the workload specification.
• Issues: a more complex protocol, and the master takes on some properties of a monolithic scheduler.
• Possible solution: a multi-level scheduler.
Comparison: Mesos vs YARN vs Kubernetes
(Here Mesos = Mesos + Marathon and YARN = YARN + Slider.)

• Container support: YARN is planning to support Docker. Mesos supports both Docker and its own unified container. Kubernetes supports only containers as its execution facility.
• Placement policies: YARN focuses more on affinity. Marathon supports several placement constraints and policies. Kubernetes borrows some placement policies from Marathon and supports its own specific placement constraints.
• Resource sharing: YARN has fairly good support for resource sharing (priority/preemption/fair share). Mesos does not support priority, and its preemption is weak. Kubernetes supports only quota.
• Service management: Marathon and Kubernetes both support service lifecycle management; Slider is still incubating.
• Maturity: YARN has the longest development history and probably the most deployments; Mesos and Kubernetes are relatively new.
Spark on YARN
Cluster Mode – the Spark driver runs inside the YARN Application Master:

spark-submit --class MYCLASS --master yarn-cluster MYJAR

Client Mode – the Spark driver runs in the submitting client process:

spark-submit --class MYCLASS --master yarn-client MYJAR
Spark on Mesos
Coarse-grain Mode:

conf.set("spark.mesos.coarse", "true")

Fine-grain Mode:

conf.set("spark.mesos.coarse", "false")
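For context, a minimal Java skeleton (assumed; the master URL and app name are placeholders) showing where these conf.set() lines live:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MesosModeDemo {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("mesos-mode-demo")
            .setMaster("mesos://127.0.0.1:5050") // placeholder Mesos master
            .set("spark.mesos.coarse", "true");  // "false" switches to fine-grain mode
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run jobs as usual ...
        sc.stop();
      }
    }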
Spark on YARN vs Spark on Mesos

• Spark on YARN
  – Coarse-grained
  – Fixed size for each Spark executor (see the sizing sketch after this list)
    • Resources can be wasted if there are not enough tasks in an executor
  – Leverages YARN's data-aware scheduling
• Spark on Mesos (coarse-grain mode)
  – Coarse-grained
  – Cannot launch multiple executors on the same host (fixed in Spark 2.0.0 by SPARK-5095)
    • Newly offered resources cannot be used
    • Cannot fully use large memory, due to JVM GC issues with very big heaps
  – Spark schedules tasks by data affinity within the offer
• Spark on Mesos (fine-grain mode)
  – Fine-grained
  – Extra overhead when launching tasks
  – Resources may not be rescheduled in time after a task finishes, because of the Mesos scheduling interval
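To illustrate the fixed-size point for Spark on YARN, a small sketch (values are arbitrary examples) of the up-front executor sizing, equivalent to the --num-executors / --executor-memory / --executor-cores flags of spark-submit:

    import org.apache.spark.SparkConf;

    public class YarnSizingDemo {
      public static void main(String[] args) {
        // Executor count and size are fixed at submission time on YARN (Spark 1.x
        // without dynamic allocation); idle executors keep their full allocation.
        SparkConf conf = new SparkConf()
            .setAppName("yarn-sizing-demo")
            .set("spark.executor.instances", "4") // --num-executors 4
            .set("spark.executor.memory", "4g")   // --executor-memory 4g
            .set("spark.executor.cores", "2");    // --executor-cores 2
        // Pass conf to a SparkContext under --master yarn-cluster or yarn-client.
      }
    }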
USE CASES
Applications in Financial Services

Latency tiers:
• Real-time: streams and FPGA-based applications near market feeds – algorithmic trading / HFT / "black-box" / "robo-trading", order flow, CEP, protocol conversion
• Near real-time: analytic tasks are often time-critical, supporting trading desks – "real-time" market risk, pre-trade and post-trade analytics
• Batch: long-running modeling and reporting

Data sources: exchange/ECN data feeds, plus diverse sources of structured and unstructured big data – RDBMS, DFS (HDFS, GPFS), in-memory caches, etc.

The workloads span data-intensive and compute-intensive: program trading, arbitrage, trend following; exotics and derivative pricing; sentiment analysis; counterparty risk and CVA; CRM; anti-money laundering (AML); credit scoring; ETL; incremental modeling; fraud detection; forex (FX), interest rates (IR) and equities; mining of unstructured data; sensitivity analysis; model backtesting; regulatory reporting; actuarial analysis; deeper counterparty modeling; variable annuities; VaR; ALM; mortgage analytics; strategy mining; data mining; predictive analytics; optimization; trade surveillance; portfolio stress testing; P&L analysis; document processing; non-structured data query; check processing; image analytics.
Example – Genome Sequencing

All the DNA contained in a living cell makes up the genome. The alphabet of the genome contains only four letters: A, C, G and T. Just as a book uses words and letters to tell a story, these letters in the genome encode the genes that carry out all cellular functions. Genomics is the study of the DNA sequence and the meaning of these letters in the genome (e.g. genes and mutations), so that scientists can precisely tell the story of life – your life story.

Next-generation sequencing pipeline for faster results, with complex workflows and dependencies:

FASTQ → map to reference (BWA) → SAM → mark duplicates & sort (Picard, Samtools; parallelized with ADAM on Spark) → BAM → realignment & recalibration (GATK) → recalibrated BAM → variant analysis (GATK, MuTect) → VCF
Challenges – Genome Sequencing

• Poor resource utilization – peaks and valleys of different workloads
• Orchestrating multi-phase workflows among many collaborating applications and distributed workloads, with sub-flows and parallel flows, across diverse infrastructure
• Lack of reliable parallelism in workflows due to the variety of workload types and resource needs – move to job arrays, MPI/MPI2, distributed messaging & cache, MapReduce and Spark frameworks?
• Data, application and resource silos, causing inefficiencies in data movement, application integration and resource sharing

Today this typically means four separate clusters, each with its own workload manager, its own pool of resources and its own storage interface: a MapReduce cluster on HDFS, a SOA cluster on NFS, a batch cluster on POSIX storage, and a Spark cluster on object storage.
A Life Science Application Workflow with Hybrid Workloads – Genome Sequencing

• Genome Analysis Toolkit (GATK): a widely adopted genomics workflow from the Broad Institute
• ADAM: genomics formats and processing patterns for cloud-scale computing, from UC Berkeley
• The GATK pipeline is optimized using ADAM on Spark, which parallelizes the mark-duplicate and sort processing
• Results are replicated to remote sites (Site A, Site B, Site C) and shared worldwide immediately
Platform Computing is Part of IBM Software Defined Infrastructure

• Software Defined Compute (IBM Platform Computing / DCOS):
  – Symphony: high-performance analytics (low-latency parallel)
  – Symphony MapReduce: Hadoop / big data
  – Application Service Controller: application frameworks (long-running services)
  – LSF: high-performance computing (batch, serial, MPI, workflow)
  – Example applications and frameworks range from homegrown to traditional commercial applications
• Software Defined Storage: Spectrum Scale
• Software Defined Infrastructure Management: IBM Platform Cluster Manager (bare-metal provisioning), IBM Cloud Manager with OpenStack (virtual machine provisioning), IBM Platform Computing Cloud Service (SoftLayer APIs & services), and other compute management software
• Physical infrastructure: on-premises, on-cloud or hybrid – hypervisors, x86, Linux on z
Resource Management Community Activities
• Active development with the Mesos community – 11 IBM developers
• 100+ JIRAs delivered or in progress
• Leading several work streams: POWER support, optimistic offers, container support, Swarm and Kubernetes integration
• YARN plug-in to Platform Symphony
• Technical preview of Mesos with IBM value-add (ASC) on Docker Hub – both x86 and POWER images
For more information: ibm.com/systems