State of Resource Management in Big Data: What it is and Why You Should Care
Khalid Ahmed, Senior Technical Staff Member (STSM), Architect, IBM Platform Computing
Yong Feng, Architect, IBM Platform Computing
Contents
1. Background
2. Resource Management Architectures
3. Comparisons: YARN, Mesos, Kubernetes
4. Use Cases
IBM Platform Computing: Infrastructure software for high performance applications
– Acquired by IBM in 2012
– 20 years managing distributed scale-out systems with 2000+ customers in many industries
– Market leading workload, resource and cluster management
– Unmatched scalability (small clusters to global grids) and enterprise production-proven reliability
– Heterogeneous environments – x86 and Power plus 3rd party systems, virtual and bare metal, accelerators / GPU, cloud, etc.
– Shared services for both compute and data intensive workloads
– 23 of the 30 largest commercial enterprises
– Over 5M CPUs under management
– 60% of the top financial services companies
Resource Management Terminology
• Cluster Management
• Resource Allocation
• Distributed & Parallel Execution
• Scheduling & Placement
• Workload Management
• Batch Queuing
History of Resource Management in Distributed Systems
1990s – High-performance Computing; Batch Queuing Systems; Message Passing Interface (MPI)
  Systems: NQS/DQS, Platform LSF, Sun Grid Engine, Globus
2000s – P2P Computing; Parallel SOA; Big Data – MR v1; Virtualization
  Systems: VMware, United Devices, DataSynapse, Platform Symphony, Apache Hadoop
2010-2015 – Big Data – MR v2; Cloud Computing; Virtualization
  Systems: OpenStack, Apache YARN, Apache Mesos
2015+ – Containerization; Hyperconverged/Hyperscale; Hybrid Cloud; Data Center OS (DCOS)
  Systems: Docker, Kubernetes, Swarm, Cloud Foundry
What problem are we trying to solve? Creating infrastructure silos to accommodate apps is inefficient

Many new solution workloads arrive in addition to existing apps – overnight batch financial reporting, counterparty credit risk modeling, distributed ETL and sensitivity analysis, Hadoop-based sentiment analysis. This leads to costly, complex, siloed, under-utilized infrastructure and replicated data. Low utilization = higher cost.
Convergence of Compute & Data: A Data-centric Architecture for High Performance

Old compute-centric model:
• Data lives on disk and tape
• Move data to the CPU as needed
• Deep storage hierarchy

New data-centric model:
• Data lives in persistent storage/memory
• Many CPUs surround and use it
• Shallow/flat storage hierarchy
• Massive parallelism of data & computing, enabled by flash and phase-change memory, manycore processors and FPGAs

Big Data and Exascale High Performance Computing are driving many similar computer systems requirements: move the compute to the data!
Data Center OS: System Software for Hyperscale Datacenters
Nodes run a node OS and a node agent on virtual or physical hardware; the Data Center OS layers the following services across them:
• Patterns & REST API
• Distributed Services Manager – manages the lifecycle of long-running services
• Resource Manager – aggregates and shares resources across multiple frameworks
• Remote Execution & Container Management – manages the execution of containers (discovery, clustering, load-balancing)
• Distributed File/Block/Object System – persistent storage for applications and services, supporting multiple protocols
• Node agents – in effect, the device drivers for nodes

Nodes become the resources managed by the Data Center OS. Specialized hardware (storage, network switches, routers) becomes software services on commodity hardware.
Resource Manager Architectures
What is expected from a Resource Manager

An open-source resource management solution manages the resources used by services on a shared infrastructure. It should:
1. Hide the details of resource management and failure handling, so that users can focus on application development
2. Operate with high availability and reliability, and support applications in doing the same
3. Run workloads across tens of thousands of machines efficiently

Core capabilities:
• Resource Abstraction
• Workload Placement
• High Availability
• Monitoring
• Membership Management
• Workload Provisioning and Execution
• Scalability
• Troubleshooting
• Resource Sharing and Planning
• Security and Isolation
• Performance
• Service Management

Who is building what:
• Hortonworks, Cloudera, MapR – YARN
• Docker – Swarm
• Mesosphere, Twitter, eBay, Netflix – Mesos
• Google, Red Hat, CoreOS – Kubernetes

We need a common solution to manage the resources of large clusters (~10K machines) shared by multiple workloads:
• Sharing policies: tenant reservations, shares, isolation
• Placement policies: topology-driven affinity, anti-affinity, proximity, min/max/desired
• Execution: container and non-container

The target stack: HPC, PaaS, data services, long-running services, batch jobs and other workloads, all on a common resource management layer over shared infrastructure and data.
Hadoop YARN
YARN is not the first general Resource Management platform. So what’s different? It’s data!
• Store all your data in one place … (HDFS)
• Interact with that data in multiple ways … (YARN Platform + Apps)
• Scale as you go, shared, multi-tenant, secure … (The Hadoop Stack)
YARN Architecture
• Resource management framework: central
  – Resource Manager (RM) controls resource allocation
  – Application Master (AM) negotiates with the RM for resources and launches executors to run jobs (sketched below)
• Resource allocation policies
  – Policy plug-ins; currently supported: capacity scheduler, fair sharing
• Framework integration
  – Implement a client to launch the application through the RM
  – Implement a driver for the application scheduler to communicate with the RM and Node Manager
  – Make the framework executor available to YARN
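To make the RM/AM negotiation concrete, here is a minimal sketch (not from the deck) against the Hadoop 2.x Java client API: an Application Master that registers with the RM, requests one container, and launches a framework executor on it through the Node Manager. The executor command path is a placeholder.

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MiniAppMaster {
      public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Register this Application Master with the Resource Manager.
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();
        rm.registerApplicationMaster("", 0, "");

        // Negotiate: ask the RM for one 1 GB / 1 vcore container.
        rm.addContainerRequest(new ContainerRequest(
            Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        // Poll until the RM grants the container, then start the framework
        // executor on it via the Node Manager.
        NMClient nm = NMClient.createNMClient();
        nm.init(conf);
        nm.start();
        boolean launched = false;
        while (!launched) {
          for (Container c : rm.allocate(0.0f).getAllocatedContainers()) {
            ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("/path/to/my-executor"), // placeholder
                null, null, null);
            nm.startContainer(c, ctx);
            launched = true;
          }
          Thread.sleep(1000);
        }
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
      }
    }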
Mesos in BDAS (the Berkeley Data Analytics Stack)
Mesos Programming Interface
Mesos Architecture
• Resource management framework: hierarchical
  – Mesos offers resources to frameworks
  – Framework schedulers accept or reject the offered resources (sketched below)
• Resource allocation policies
  – Pluggable allocation modules; currently supports fair sharing
  – Resource allocation decisions are delegated to the allocation modules
  – Resource preferences are communicated to Mesos through common APIs
• Framework integration
  – Modify the framework scheduler to communicate with the Mesos master through its API
  – Make the framework executor binary available to Mesos
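The accept-or-reject flow above can be sketched with the classic Mesos v0 Java API (a minimal illustration, not from the deck; the master address and task command are placeholders): the framework scheduler launches one shell task on the first offer it accepts and declines everything else.

    import java.util.Collections;
    import java.util.List;
    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;
    import org.apache.mesos.Protos.*;

    public class MiniFramework implements Scheduler {
      private boolean launched = false;

      // Mesos offers resources; accept the first offer, decline the rest.
      public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
        for (Offer offer : offers) {
          if (launched) {
            driver.declineOffer(offer.getId());
            continue;
          }
          TaskInfo task = TaskInfo.newBuilder()
              .setName("demo-task")
              .setTaskId(TaskID.newBuilder().setValue("task-0"))
              .setSlaveId(offer.getSlaveId())
              .addResources(Resource.newBuilder().setName("cpus")
                  .setType(Value.Type.SCALAR)
                  .setScalar(Value.Scalar.newBuilder().setValue(1.0)))
              .setCommand(CommandInfo.newBuilder().setValue("echo hello")) // placeholder
              .build();
          driver.launchTasks(Collections.singletonList(offer.getId()),
                             Collections.singletonList(task));
          launched = true;
        }
      }

      // Remaining Scheduler callbacks, empty for brevity.
      public void registered(SchedulerDriver d, FrameworkID id, MasterInfo m) {}
      public void reregistered(SchedulerDriver d, MasterInfo m) {}
      public void offerRescinded(SchedulerDriver d, OfferID id) {}
      public void statusUpdate(SchedulerDriver d, TaskStatus s) {}
      public void frameworkMessage(SchedulerDriver d, ExecutorID e, SlaveID s, byte[] b) {}
      public void disconnected(SchedulerDriver d) {}
      public void slaveLost(SchedulerDriver d, SlaveID s) {}
      public void executorLost(SchedulerDriver d, ExecutorID e, SlaveID s, int status) {}
      public void error(SchedulerDriver d, String message) {}

      public static void main(String[] args) {
        FrameworkInfo fw = FrameworkInfo.newBuilder()
            .setUser("") // empty user lets Mesos fill in the current user
            .setName("mini-framework").build();
        new MesosSchedulerDriver(new MiniFramework(), fw, "127.0.0.1:5050").run();
      }
    }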
Kubernetes Basic Concepts

• Only supports container-based applications/workloads
  – Currently only Docker and Rocket
• POD: the smallest schedulable unit (see the sketch after this list)
  – All containers within a POD are placed onto the same host and share the same (network) namespace
• Replication group: manages one or more PODs
  – Uses POD labels to ensure that only the desired number of PODs with specific labels is running at any time
  – Used for scale-up/down, failure recovery and rolling upgrades
• Services: find and load-balance across one or more PODs
  – Use POD labels to define the endpoints of a service
  – Handle changes in IP address, host, number of PODs, etc.
  – Service records are published in i) environment variables and ii) DNS service entries
• Namespaces: multi-tenancy support
  – PODs, services and replication controllers can be put into different namespaces to provide logical isolation for the purpose of management
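As one concrete illustration of PODs, replication groups, services and labels, here is a minimal sketch using the fabric8 Kubernetes Java client (an assumption on our part; the deck does not prescribe a client, and the names, image and namespace are placeholders): a replication controller keeps three app=web PODs running, and a service load-balances across whatever PODs carry that label.

    import io.fabric8.kubernetes.api.model.ReplicationController;
    import io.fabric8.kubernetes.api.model.ReplicationControllerBuilder;
    import io.fabric8.kubernetes.api.model.Service;
    import io.fabric8.kubernetes.api.model.ServiceBuilder;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class K8sConceptsDemo {
      public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
          // Replication controller: keep 3 PODs labeled app=web running.
          ReplicationController rc = new ReplicationControllerBuilder()
              .withNewMetadata().withName("web-rc").endMetadata()
              .withNewSpec()
                .withReplicas(3)
                .addToSelector("app", "web")
                .withNewTemplate()
                  .withNewMetadata().addToLabels("app", "web").endMetadata()
                  .withNewSpec()
                    .addNewContainer().withName("web").withImage("nginx").endContainer()
                  .endSpec()
                .endTemplate()
              .endSpec().build();

          // Service: a stable endpoint that load-balances across app=web PODs.
          Service svc = new ServiceBuilder()
              .withNewMetadata().withName("web-svc").endMetadata()
              .withNewSpec()
                .addToSelector("app", "web")
                .addNewPort().withPort(80).endPort()
              .endSpec().build();

          client.replicationControllers().inNamespace("default").create(rc);
          client.services().inNamespace("default").create(svc);
        }
      }
    }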
Kubernetes architecture
• K8s master: API server, scheduler and controller manager; cluster and service state is kept in the etcd service; driven from the kubectl CLI
• K8s minions: each runs a kubelet, a proxy and cAdvisor
• Many components are pluggable:
  – schedulers
  – container runtime
  – persistent data store
  – cloud providers
  – ...
Comparison of Open-source Resource Managers
Offer vs Request

Offer model (Mesos-style): per-workload framework schedulers (jobs of type A, jobs of type B) run above a master that holds the cluster state. The master (1) partitions resources among the frameworks and (2) offers them; each framework scheduler (3) schedules tasks and (4) accepts or declines the offers, and the master may (5) revoke an offer.
• The master has no knowledge of the workloads; workloads have only a partial view of the system.
• Issues: offers are computed without any workload awareness and may be unsuitable for a workload.
• Possible solution: optimistic offers.

Request model (YARN-style): a framework (1) requests resources, and the master (2) allocates resources based on workload priorities and requirements and returns the allocation; the framework then schedules against its allocation, including (4) small, short-lived tasks (jobs of type A) alongside long-running work (jobs of type B); the master may (5) reclaim resources, and the framework (6) returns resources when done.
• The master knows the entire state and a coarse-grained definition of the workloads; workloads have a partial view, but one selected based on the workload specification.
• Issues: a more complex protocol, and the master takes on some properties of a monolithic scheduler.
• Possible solution: a multi-level scheduler.
Comparison: Mesos vs YARN vs Kubernetes
(Here Mesos = Mesos + Marathon and YARN = YARN + Slider.)

• Container support: YARN is planning to support Docker. Mesos supports both Docker and its own unified container. Kubernetes supports only containers as its execution facility.
• Placement policies: YARN focuses more on affinity. Marathon supports several placement constraints and policies. Kubernetes borrows some placement policies from Marathon and supports its own specific placement constraints.
• Resource sharing: YARN has fairly good support for resource sharing (priority/preemption/fair share). Mesos does not support priority, and its preemption is weak. Kubernetes supports only quota.
• Service management: Marathon and Kubernetes both support service lifecycle management; Slider is still incubating.
• Maturity: YARN has the longest development history and probably the most deployments; Mesos and Kubernetes are relatively new.
Spark on YARN
Cluster Mode – the Spark driver runs inside the YARN Application Master:

spark-submit --class MYCLASS --master yarn-cluster MYJAR

Client Mode – the Spark driver runs in the submitting client process:

spark-submit --class MYCLASS --master yarn-client MYJAR
Spark on Mesos
Coarse-grain Mode:

conf.set("spark.mesos.coarse", "true")

Fine-grain Mode:

conf.set("spark.mesos.coarse", "false")
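For context, a minimal Java skeleton (assumed; the master URL and app name are placeholders) showing where these conf.set() lines live:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MesosModeDemo {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("mesos-mode-demo")
            .setMaster("mesos://127.0.0.1:5050") // placeholder Mesos master
            .set("spark.mesos.coarse", "true");  // "false" switches to fine-grain mode
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run jobs as usual ...
        sc.stop();
      }
    }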
Spark on YARN vs Spark on Mesos

• Spark on YARN
  – Coarse-grained
  – Fixed size for each Spark executor (see the sizing sketch after this list)
    • Resources can be wasted if there are not enough tasks in an executor
  – Leverages YARN's data-aware scheduling
• Spark on Mesos (coarse-grain mode)
  – Coarse-grained
  – Cannot launch multiple executors on the same host (fixed in Spark 2.0.0 by SPARK-5095)
    • Newly offered resources cannot be used
    • Cannot fully use large memory, due to JVM GC issues with very big heaps
  – Spark schedules tasks by data affinity within the offer
• Spark on Mesos (fine-grain mode)
  – Fine-grained
  – Extra overhead when launching tasks
  – Resources may not be rescheduled in time after a task finishes, because of the Mesos scheduling interval
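To illustrate the fixed-size point for Spark on YARN, a small sketch (values are arbitrary examples) of the up-front executor sizing, equivalent to the --num-executors / --executor-memory / --executor-cores flags of spark-submit:

    import org.apache.spark.SparkConf;

    public class YarnSizingDemo {
      public static void main(String[] args) {
        // Executor count and size are fixed at submission time on YARN (Spark 1.x
        // without dynamic allocation); idle executors keep their full allocation.
        SparkConf conf = new SparkConf()
            .setAppName("yarn-sizing-demo")
            .set("spark.executor.instances", "4") // --num-executors 4
            .set("spark.executor.memory", "4g")   // --executor-memory 4g
            .set("spark.executor.cores", "2");    // --executor-cores 2
        // Pass conf to a SparkContext under --master yarn-cluster or yarn-client.
      }
    }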
USE CASES
Applications in Financial Services

Latency tiers:
• Real-time: streams and FPGA-based applications near market feeds – algorithmic trading / HFT / "black-box" / "robo-trading", order flow, CEP, protocol conversion
• Near real-time: analytic tasks are often time-critical, supporting trading desks – "real-time" market risk, pre-trade and post-trade analytics
• Batch: long-running modeling and reporting

Data sources: exchange/ECN data feeds, plus diverse sources of structured and unstructured big data – RDBMS, DFS (HDFS, GPFS), in-memory caches, etc.

The workloads span data-intensive and compute-intensive: program trading, arbitrage, trend following; exotics and derivative pricing; sentiment analysis; counterparty risk and CVA; CRM; anti-money laundering (AML); credit scoring; ETL; incremental modeling; fraud detection; forex (FX), interest rates (IR) and equities; mining of unstructured data; sensitivity analysis; model backtesting; regulatory reporting; actuarial analysis; deeper counterparty modeling; variable annuities; VaR; ALM; mortgage analytics; strategy mining; data mining; predictive analytics; optimization; trade surveillance; portfolio stress testing; P&L analysis; document processing; non-structured data query; check processing; image analytics.
Example – Genome Sequencing

All the DNA contained in a living cell makes up the genome. The alphabet of the genome contains only four letters: A, C, G and T. Just as a book uses words and letters to tell a story, these letters in the genome encode the genes that carry out all cellular functions. Genomics is the study of the DNA sequence and the meaning of these letters in the genome (e.g. genes and mutations), so that scientists can precisely tell the story of life – your life story.

Next-generation sequencing pipeline for faster results, with complex workflows and dependencies:

FASTQ → map to reference (BWA) → SAM → mark duplicates & sort (Picard, Samtools; parallelized with ADAM on Spark) → BAM → realignment & recalibration (GATK) → recalibrated BAM → variant analysis (GATK, MuTect) → VCF
Challenges – Genome Sequencing

• Poor resource utilization – peaks and valleys of different workloads
• Orchestrating multi-phase workflows among many collaborating applications and distributed workloads, with sub-flows and parallel flows, across diverse infrastructure
• Lack of reliable parallelism in workflows due to the variety of workload types and resource needs – move to job arrays, MPI/MPI2, distributed messaging & cache, MapReduce and Spark frameworks?
• Data, application and resource silos, causing inefficiencies in data movement, application integration and resource sharing

Today this typically means four separate clusters, each with its own workload manager, its own pool of resources and its own storage interface: a MapReduce cluster on HDFS, a SOA cluster on NFS, a batch cluster on POSIX storage, and a Spark cluster on object storage.
A Life Science Application Workflow with Hybrid Workloads – Genome Sequencing

• Genome Analysis Toolkit (GATK): a widely adopted genomics workflow from the Broad Institute
• ADAM: genomics formats and processing patterns for cloud-scale computing, from UC Berkeley
• The GATK pipeline is optimized using ADAM on Spark, which parallelizes the mark-duplicate and sort processing
• Results are replicated to remote sites (Site A, Site B, Site C) and shared worldwide immediately
Platform Computing is Part of IBM Software Defined Infrastructure

• Software Defined Compute (IBM Platform Computing / DCOS):
  – Symphony: high-performance analytics (low-latency parallel)
  – Symphony MapReduce: Hadoop / big data
  – Application Service Controller: application frameworks (long-running services)
  – LSF: high-performance computing (batch, serial, MPI, workflow)
  – Example applications and frameworks range from homegrown to traditional commercial applications
• Software Defined Storage: Spectrum Scale
• Software Defined Infrastructure Management: IBM Platform Cluster Manager (bare-metal provisioning), IBM Cloud Manager with OpenStack (virtual machine provisioning), IBM Platform Computing Cloud Service (SoftLayer APIs & services), and other compute management software
• Physical infrastructure: on-premises, on-cloud or hybrid – hypervisors, x86, Linux on z
Resource Management Community Activities
• Active development with the Mesos community – 11 IBM developers
• 100+ JIRAs delivered or in progress
• Leading several work streams: POWER support, optimistic offers, container support, Swarm and Kubernetes integration
• YARN plug-in to Platform Symphony
• Technical preview of Mesos with IBM value-add (ASC) on Docker Hub – both x86 and POWER images
For more information: ibm.com/systems