
Page 1: State of Resource Management in Big Data

© 2015 IBM Corporation

State of Resource Management in Big Data: What it is and Why You Should Care

Khalid AhmedSenior Technical Staff Member(STSM), Architect, IBM Platform Computing

[email protected]

Yong FengArchitect, IBM Platform Computing

[email protected]

Page 2: State of Resource Management in Big Data


Contents

1. Background

2. Resource Management Architectures

3. Comparisons: YARN, Mesos, Kubernetes

4. Use Cases

Page 3: State of Resource Management in Big Data


IBM Platform Computing: Infrastructure software for high-performance applications

– Acquired by IBM in 2012

– 20 years managing distributed scale-out systems with 2000+ customers in many industries

– Market leading workload, resource and cluster management

– Unmatched scalability (small clusters to global grids) and enterprise production-proven reliability

– Heterogeneous environments – x86 and Power plus 3rd party systems, virtual and bare metal, accelerators / GPU, cloud, etc.

– Shared services for both compute and data intensive workloads

23 of 30 largest commercial enterprises

Over 5M CPUs under management

60% of top financial services companies

Page 4: State of Resource Management in Big Data


Resource Management Terminology

Cluster Management • Workload Management • Resource Allocation • Scheduling & Placement • Distributed & Parallel Execution • Batch Queuing

Page 5: State of Resource Management in Big Data


History of Resource Management in Distributed Systems

1990s: High-performance computing, batch queuing systems, Message Passing Interface (MPI)
– Examples: Platform LSF, Sun Grid Engine, NQS/DQS

2000s: P2P computing, parallel SOA, Big Data (MapReduce v1), virtualization
– Examples: VMware, United Devices, Apache Hadoop, DataSynapse, Globus, Platform Symphony

2010-2015: Big Data (MapReduce v2), cloud computing, virtualization
– Examples: OpenStack, Apache YARN, Apache Mesos

2015+: Containerization, hyperconverged/hyperscale, hybrid cloud, Data Center OS (DCOS)
– Examples: Docker, Kubernetes, Swarm, Cloud Foundry

Page 6: State of Resource Management in Big Data


What problem are we trying to solve? Creating infrastructure silos to accommodate apps is inefficient.

Many new solution workloads arrive in addition to existing apps: batch overnight financial reporting, counterparty credit risk modeling, distributed ETL and sensitivity analysis, Hadoop-based sentiment analysis. Giving each its own silo leads to costly, complex, under-utilized infrastructure and replicated data. Low utilization = higher cost.

Page 7: State of Resource Management in Big Data


Convergence of Compute & Data: Data-centric Architecture for High Performance

Old compute-centric model:
– Data lives on disk and tape
– Move data to CPU as needed
– Deep storage hierarchy

New data-centric model:
– Data lives in persistent storage/memory
– Many CPUs surround and use it
– Shallow/flat storage hierarchy
– Massive parallelism of data & computing
– Enabling hardware: flash, phase-change memory, manycore, FPGA

Big Data and Exascale High Performance Computing are driving many similar computer systems requirements: move the compute to the data!

Page 8: State of Resource Management in Big Data


Data Center OS: System Software for Hyperscale Datacenters

Each node (virtual or physical hardware plus a node OS) runs a node agent; the node OS acts as the "device driver" for that node. Above the nodes sit the Data Center OS layers:

– Node agents: remote execution and container management (discovery, clustering, load-balancing)
– Distributed file/block/object system: persistent storage for applications and services, supporting multiple protocols
– Resource manager: aggregates and shares resources across multiple frameworks
– Distributed services manager: manages the lifecycle of long-running services
– Patterns & REST API on top

Nodes become the resources managed by the Data Center OS. Specialized hardware (storage, network switches, routers) becomes software services on commodity hardware.
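The aggregation idea above can be shown with a toy sketch (illustrative only, not a real DCOS API): node agents register each node's capacity with a resource manager, which then exposes one shared pool instead of per-node silos.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpus: int
    mem_gb: int

class ResourceManager:
    def __init__(self):
        self.nodes = {}

    def register(self, node):
        # A node agent's registration/heartbeat reports the node's capacity.
        self.nodes[node.name] = node

    def total_capacity(self):
        # The aggregate view offered to frameworks.
        return (sum(n.cpus for n in self.nodes.values()),
                sum(n.mem_gb for n in self.nodes.values()))

rm = ResourceManager()
rm.register(Node("n1", 16, 64))
rm.register(Node("n2", 32, 128))
print(rm.total_capacity())  # (48, 192)
```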

Page 9: State of Resource Management in Big Data


Resource Manager Architectures

Page 10: State of Resource Management in Big Data


What is expected from a Resource Manager

Capabilities:
• Resource abstraction
• Workload placement
• High availability
• Monitoring
• Membership management
• Workload provisioning and execution
• Scalability
• Troubleshooting
• Resource sharing and planning
• Security and isolation
• Performance
• Service management

Goals:
1. Hide the details of resource management and failure handling so that users can focus on application development
2. Operate with high availability and reliability, and help applications do the same
3. Run workloads across tens of thousands of machines efficiently

Open-source resource management solutions manage resources used by services on a shared infrastructure:
• Hortonworks, Cloudera, MapR – YARN
• Docker – Swarm
• Mesosphere, Twitter, eBay, Netflix – Mesos
• Google, Red Hat, CoreOS – Kubernetes

We need a common solution to manage resources of large clusters (~10K machines) shared by multiple workloads:
• Sharing policies: tenant reservation, shares, isolation
• Placement policies: topology-driven affinity, anti-affinity, proximity, min/max/desired
• Execution: container and non-container

A common resource management layer sits over the shared infrastructure and data, serving HPC, PaaS, data services, long-running services, batch jobs and other workloads.
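The placement policies listed above (affinity, anti-affinity) can be sketched as a simple host filter. This is a hypothetical illustration, not any particular scheduler's API: a task names labels it must be co-located with and labels it must avoid.

```python
def place(task, hosts):
    """Return names of hosts satisfying the task's label constraints."""
    candidates = []
    for h in hosts:
        labels = h["labels"]
        # Affinity: every required label must be present on the host.
        if any(l not in labels for l in task.get("affinity", [])):
            continue
        # Anti-affinity: none of the forbidden labels may be present.
        if any(l in labels for l in task.get("anti_affinity", [])):
            continue
        candidates.append(h["name"])
    return candidates

hosts = [
    {"name": "h1", "labels": {"ssd", "rack1"}},
    {"name": "h2", "labels": {"rack1"}},
    {"name": "h3", "labels": {"ssd", "rack2"}},
]
task = {"affinity": ["ssd"], "anti_affinity": ["rack1"]}
print(place(task, hosts))  # ['h3']
```

A real scheduler would then rank the surviving candidates (proximity, spread, min/max/desired counts) before picking one.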

Page 11: State of Resource Management in Big Data


Hadoop YARN

YARN is not the first general-purpose resource management platform. So what's different? It's the data!

• Store all your data in one place … (HDFS)

• Interact with that data in multiple ways … (YARN Platform + Apps)

• Scale as you go, shared, multi-tenant, secure … (The Hadoop Stack)

Page 12: State of Resource Management in Big Data


YARN Architecture

• Resource management framework: central
– The Resource Manager (RM) controls resource allocation
– An Application Master (AM) negotiates with the RM for resources and launches executors to run jobs

• Resource allocation policies
– Policy plug-ins; currently supports the Capacity Scheduler and fair sharing

• Framework integration
– Implement a client to launch the application through the RM
– Implement a driver for the application's scheduler to communicate with the RM and Node Managers
– Make the framework executor available to YARN
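The RM/AM split above can be simulated in a few lines. This is a toy model of the request flow, not the real YARN API: the AM asks the central RM for containers and launches one executor per granted container.

```python
class ResourceManager:
    """Central authority: the only component that grants containers."""
    def __init__(self, free_containers):
        self.free = free_containers

    def allocate(self, n):
        granted = min(n, self.free)
        self.free -= granted
        return ["container-%d" % i for i in range(granted)]

class ApplicationMaster:
    """Per-application scheduler: negotiates with the RM, runs executors."""
    def __init__(self, rm):
        self.rm = rm
        self.executors = []

    def run_job(self, tasks_needed):
        for c in self.rm.allocate(tasks_needed):
            self.executors.append("executor-on-" + c)
        return len(self.executors)

rm = ResourceManager(free_containers=3)
am = ApplicationMaster(rm)
print(am.run_job(5))  # 3: only three containers were free
```

Note the asymmetry with Mesos: here the application *requests* what it needs and the master decides, rather than the master *offering* resources for the framework to accept or decline.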

Page 13: State of Resource Management in Big Data


Mesos in BDAS

Mesos is the resource management layer at the base of the Berkeley Data Analytics Stack (BDAS), with Spark and related frameworks running above it.

Page 14: State of Resource Management in Big Data


Mesos Programming Interface

Page 15: State of Resource Management in Big Data


Mesos Architecture

• Resource management framework: hierarchical (two-level)
– Mesos offers resources
– Framework schedulers accept or reject the offered resources

• Resource allocation policies
– Pluggable allocation modules; currently supports fair sharing
– Resource allocation decisions are delegated to the allocation modules
– Resource preferences are communicated to Mesos through common APIs

• Framework integration
– Modify the framework scheduler to communicate with the Mesos master through its API
– Make the framework executor binary available to Mesos
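The accept/reject step above is the heart of the offer model. A toy sketch (not the real Mesos API): the master sends an offer, and the framework scheduler takes only what it can use and declines the rest, which the master can then re-offer elsewhere.

```python
def framework_scheduler(offer, cpus_needed):
    """Accept up to cpus_needed CPUs from an offer; decline the remainder."""
    accepted = min(offer["cpus"], cpus_needed)
    declined = offer["cpus"] - accepted
    return accepted, declined

# The master offers 8 CPUs on agent a1; this framework only needs 5.
offer = {"agent": "a1", "cpus": 8}
accepted, declined = framework_scheduler(offer, cpus_needed=5)
print(accepted, declined)  # 5 3
```

Because the decision lives in the framework, Mesos stays workload-agnostic; the cost is that the master may offer resources a framework cannot use, which is the weakness discussed in the offer-vs-request comparison later.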

Page 16: State of Resource Management in Big Data


Kubernetes Basic concepts

• Supports only container-based applications/workloads
– Currently only Docker and Rocket (rkt)

• Pod: the smallest schedulable unit
– All containers within a pod are placed onto the same host and share the same (network) namespace

• Replication controller: manages one or more pods
– Uses pod labels to ensure that the desired number of pods with specific labels is running at any time
– Used for scale up/down, failure recovery and rolling upgrades

• Services: find and load-balance across one or more pods
– Use pod labels to define the endpoints of a service
– Handle changes in IP address, host, number of pods, etc.
– Service records are published via i) environment variables and ii) DNS entries

• Namespaces: multi-tenancy support
– Pods, services and replication controllers can be put into different namespaces to provide logical isolation for management purposes
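The label mechanism that replication controllers and services both rely on can be sketched in plain Python (pod names and labels here are illustrative, not real API objects): a selector matches pods purely by labels, so it keeps working as pods come and go.

```python
pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-2", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "db-1",  "labels": {"app": "db"}},
]

def select(pods, selector):
    """Return names of pods whose labels contain every selector key/value."""
    return [p["name"] for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

# A service with selector app=web load-balances across these endpoints:
print(select(pods, {"app": "web"}))  # ['web-1', 'web-2']
```

A replication controller uses the same matching to count running pods and start or stop replicas until the count equals the desired number.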

Page 17: State of Resource Management in Big Data


Kubernetes architecture

• K8s master: API server, scheduler and controller manager; cluster state is stored in an etcd service
• K8s minions (nodes): each runs a kubelet, a proxy and cAdvisor
• kubectl: command-line client that talks to the API server

Many components are pluggable: schedulers, container runtime, persistent data store, cloud providers, ...

Page 18: State of Resource Management in Big Data


Comparison of Open-source Resource Managers

Page 19: State of Resource Management in Big Data


Offer vs. Request

Offer model (e.g. Mesos):
• The master has no knowledge of workloads; each framework scheduler has only a partial view of the system.
• Protocol: (2) the master offers resources to a framework; (3) the framework scheduler schedules its jobs onto the offer; (4) the framework accepts or declines offers; (5) the master may revoke offers.
• Issue: offers are computed without any workload awareness and may be unsuitable for a workload.
• Possible solution: optimistic offers.

Request model (e.g. YARN):
• The master knows the entire cluster state and a coarse-grained definition of the workloads; frameworks still have a partial view, but resources are selected based on the workload specification.
• Protocol: (1) the framework requests resources and the master partitions resources among frameworks; (2) the master allocates resources based on workload priorities and requirements and returns the allocation; (3) the framework schedules; (4) small, short-lived tasks are scheduled onto the allocation; (5) the master reclaims resources; (6) the framework returns resources.
• Issues: a more complex protocol, and the master takes on some properties of a monolithic scheduler.
• Possible solution: multi-level scheduling.

(In the original diagram, framework A runs short, small jobs and framework B runs long-running jobs.)
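To complement the offer-model sketch shown with the Mesos architecture, here is a toy request-model allocator (purely illustrative, not the YARN API): frameworks send sized, prioritized requests, and the master allocates from its global view in priority order.

```python
def allocate(requests, total_cpus):
    """Grant requests from one global pool, highest priority first."""
    grants = {}
    free = total_cpus
    for req in sorted(requests, key=lambda r: -r["priority"]):
        granted = min(req["cpus"], free)
        grants[req["framework"]] = granted
        free -= granted
    return grants, free

requests = [
    {"framework": "batch",    "cpus": 6, "priority": 1},
    {"framework": "services", "cpus": 4, "priority": 2},
]
print(allocate(requests, total_cpus=8))
```

Because the master sees every request, it can enforce priorities and preemption globally; the price is exactly the issue noted above: the master drifts toward a monolithic scheduler.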

Page 20: State of Resource Management in Big Data


Comparison: Mesos vs. YARN vs. Kubernetes

(Here "Mesos" means Mesos + Marathon and "YARN" means YARN + Slider. The original slide scored each stack per feature on a scale of Complete / Many features / Some features / Few features / None.)

• Container support: YARN is planning to support Docker. Mesos supports both Docker and its own unified container. Kubernetes supports only containers as its execution facility.

• Placement policies: YARN focuses more on affinity. Marathon supports several placement constraints and policies. Kubernetes borrows some placement policies from Marathon and supports its own specific placement constraints.

• Resource sharing: YARN has good support for resource sharing (priority/preemption/fair share); Mesos does not support priority, and its preemption is weak; Kubernetes supports only quotas.

• Service management: Marathon and Kubernetes both support service lifecycle management; Slider is still incubating.

• Maturity: YARN has the longest development history and probably the most deployments; Mesos and Kubernetes are relatively new.

Page 21: State of Resource Management in Big Data


Spark on YARN

Cluster mode:

    spark-submit --master yarn-cluster --class MYCLASS MYJAR

Client mode:

    spark-submit --master yarn-client --class MYCLASS MYJAR

Page 22: State of Resource Management in Big Data


Spark on Mesos

Coarse-grained mode:

    conf.set("spark.mesos.coarse", "true")

Fine-grained mode:

    conf.set("spark.mesos.coarse", "false")

Page 23: State of Resource Management in Big Data


Spark on YARN vs. Spark on Mesos

• Spark on YARN
– Coarse-grained
– Fixed size for each Spark executor; resources can be wasted if there are not enough tasks for an executor
– Leverages YARN's data-aware scheduling

• Spark on Mesos (coarse-grained mode)
– Coarse-grained
– Cannot launch multiple executors on the same host (fixed in Spark 2.0.0 by SPARK-5095), so newly offered resources on that host cannot be used, and large memory cannot be fully exploited due to JVM GC issues with very large heaps
– Spark schedules tasks by data affinity within the offer

• Spark on Mesos (fine-grained mode)
– Fine-grained
– Extra overhead when launching tasks
– Resources may not be rescheduled in time after a task finishes, because of the Mesos scheduling interval

Page 24: State of Resource Management in Big Data


USE CASES

Page 25: State of Resource Management in Big Data


Applications in Financial Services

By latency tier:
• Real-time: streams and FPGA-based applications near market feeds (exchange/ECN data feeds); algorithmic trading / HFT / "black-box" / "robo-trading" such as program trading, arbitrage, trend following, forex and FX/IR/equities order flow; CEP and protocol conversion
• Near real-time: analytic tasks that are often time-critical, supporting trading desks; "real-time" market risk, counterparty risk and CVA, pre-trade and post-trade analytics, fraud detection, incremental modeling, credit scoring
• Batch/long-running: exotics and derivative pricing, sensitivity analysis, model backtesting, VaR, ALM, portfolio stress testing, P&L analysis, actuarial analysis, variable annuities, mortgage analytics, regulatory reporting, ETL

Big Data workloads draw on diverse sources of structured and unstructured data (RDBMS, DFS such as HDFS and GPFS, in-memory caches, etc.): sentiment analysis, CRM, anti-money laundering (AML), mining of unstructured data, strategy and data mining, predictive analytics, optimization, trade surveillance, deeper counterparty modeling, document processing, non-structured data query, check processing, image analytics.

Together these span both data-intensive and compute-intensive workloads.

Page 26: State of Resource Management in Big Data


Example - Genome Sequencing

All the DNA contained in a living cell makes up the genome. The alphabet of the genome contains only four letters: A, C, G and T. Just as a book uses words and letters to tell a story, these letters in the genome encode the genes that carry out all cellular functions. Genomics is the study of the DNA sequence and the meaning of these letters in the genome (e.g. genes and mutations), so that scientists can precisely tell the story of life.

Next-generation sequencing pipeline for faster results (complex workflows and dependencies):

    FASTQ -> map to reference (BWA) -> SAM -> mark duplicates & sort (Picard/Samtools, or ADAM on Spark) -> BAM -> realignment & recalibration (GATK) -> recalibrated BAM -> variant analysis (MuTect) -> VCF

The resulting VCF file is, in effect, your life story.
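The pipeline above is a linear chain of stages, each consuming the previous stage's output format. A minimal workflow sketch (stage names are labels for illustration, not real tool wrappers) makes the dependency structure explicit:

```python
# Each stage: (name, expected input format, output format).
stages = [
    ("BWA: map to reference",          "FASTQ",            "SAM"),
    ("Picard: mark duplicates & sort", "SAM",              "BAM"),
    ("GATK: realign & recalibrate",    "BAM",              "recalibrated BAM"),
    ("MuTect: variant analysis",       "recalibrated BAM", "VCF"),
]

def run_pipeline(stages, data):
    """Run stages in order, checking each stage's input format."""
    for name, expected_in, out in stages:
        assert data == expected_in, "stage %r expected %r, got %r" % (
            name, expected_in, data)
        data = out  # pretend the tool ran and produced its output format
    return data

print(run_pipeline(stages, "FASTQ"))  # VCF
```

A real workflow engine adds what this sketch omits: per-stage resource requests, parallel sub-flows, and retries, which is exactly where resource management enters the picture.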

Page 27: State of Resource Management in Big Data


Challenges – Genome Sequencing

• Poor resource utilization: peaks and valleys of different workloads
• How to orchestrate multi-phase workflows among many collaborating apps of distributed workloads, with sub-flows and parallel flows, across diverse infrastructure
• Lack of reliable parallelism in workflows due to the variety of workload types and resource needs. Move to job arrays, MPI/MPI-2, distributed messaging & caching, MapReduce, or Spark frameworks?
• Data, app and resource silos causing inefficiencies in data movement, app integration and resource sharing

Today this typically runs as four separate silos, each with its own workload manager and dedicated resources:
• Cluster #1: MapReduce apps on HDFS
• Cluster #2: SOA apps on NFS
• Cluster #3: batch apps on POSIX storage
• Cluster #4: Spark apps on object storage

Page 28: State of Resource Management in Big Data


A Life Science App Workflow with Hybrid Workloads - Genome Sequencing

• Genome Analysis Toolkit (GATK): a widely adopted genomics workflow from the Broad Institute
• ADAM: genomics formats and processing patterns for cloud-scale computing, from UC Berkeley

The GATK workflow is optimized using ADAM on Spark (parallelizing the mark-duplicates and sort processing). Results are replicated to remote sites (Site A, Site B, Site C) and shared worldwide immediately.

Page 29: State of Resource Management in Big Data


Platform Computing is Part of IBM Software Defined Infrastructure

IBM Platform Computing/DCOS provides Software Defined Compute:
• LSF: high-performance computing (batch, serial, MPI, workflow)
• Symphony: high-performance analytics (low-latency parallel)
• Symphony MapReduce: Hadoop / big data
• Application Service Controller: application frameworks (long-running services)
Example applications range from homegrown codes to traditional commercial applications.

Alongside it in the stack:
• Spectrum Scale: Software Defined Storage
• Software Defined Infrastructure management: IBM Platform Cluster Manager (bare-metal provisioning), IBM Cloud Manager with OpenStack (virtual machine provisioning), IBM Platform Computing Cloud Service (SoftLayer APIs & services), plus other compute management software
• Physical infrastructure: on-premises, on-cloud or hybrid; hypervisor, x86, Linux on z

Page 30: State of Resource Management in Big Data


IBM Platform Computing


Resource Management Community Activities

• Active development with the Mesos community: 11 IBM developers
• 100+ JIRAs delivered or in progress
• Leading several work streams: POWER support, optimistic offers, container support, Swarm and Kubernetes integration
• YARN plug-in to Platform Symphony
• Technical preview of Mesos with IBM value-add (ASC) on Docker Hub: both x86 and POWER images

Page 31: State of Resource Management in Big Data


For more information: ibm.com/systems