Coflow: Recent Advances and What’s Next?
Mosharaf Chowdhury
University of Michigan
Datacenter-Scale Computing
Geo-Distributed Computing
Fast Analytics Over the WAN
Rack-Scale Computing
Proactive Analytics Before You Think!
Coflow Networking (Open Source)
Apache Spark (Open Source)
Cluster File System (Facebook)
Resource Allocation (Microsoft)
DAG Scheduling (Apache YARN)
Cluster Caching (Alluxio)
Rack-Scale Computing (< 0.01 ms)
Datacenter-Scale Computing (~1 ms)
Geo-Distributed Computing (> 100 ms)
Big Data
The volume of data businesses want to make sense of is increasing
Increasing variety of sources
• Web, mobile, wearables, vehicles, scientific, …
Cheaper disks, SSDs, and memory
Stalling processor speeds
Big Datacenters for Massive Parallelism
2005–2015: MapReduce, Hadoop, Spark, Hive, Dryad, DryadLINQ, Spark Streaming, GraphX, GraphLab, Pregel, Storm, Dremel, BlinkDB
1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI’2012.
Distributed Data-Parallel Applications
Multi-stage dataflow
• Computation interleaved with communication
Computation stage (e.g., Map, Reduce)
• Distributed across many machines
• Tasks run in parallel
Communication stage (e.g., Shuffle)
• Between successive computation stages (e.g., between the Map stage and the Reduce stage)
A communication stage cannot complete until all of its data have been transferred
Communication is Crucial for Performance
As SSD-based and in-memory systems proliferate,the network is likely to become the primary bottleneck
1. Based on a month-long trace with 320,000 jobs and 150 Million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Facebook jobs spend ~25% of their runtime, on average, in intermediate communication1
Faster Communication Stages: The Traditional Networking Approach
Flow: transfers data from a source to a destination
• The independent unit of allocation, sharing, load balancing, and/or prioritization
Existing Solutions
1980s–2000s, Per-Flow Fairness: GPS, WFQ, RED, CSFQ, ECN, XCP, RCP
2005–2015, Flow Completion Time: DCTCP, D3, D2TCP, PDQ, FCP, DeTail, pFabric
Independent flows cannot capture the collective communication behavior common in data-parallel applications
Why Do They Fall Short?
[Figure: a shuffle across a datacenter fabric with three input links and three output links; senders s1–s3 feed the input links, and receivers r1–r2 sit behind the output links]
Why Do They Fall Short?
Per-Flow Fair Sharing
[Figure: bandwidth over time on the links to r1 and r2; under per-flow fair sharing, the flows finish at times 3, 3, and 5]
Shuffle Completion Time = 5
Avg. Flow Completion Time = 3.66
Solutions focusing on flow completion time cannot further decrease the shuffle completion time
Improve Application-Level Performance1
[Figure: the same per-flow fair sharing schedule; Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
Slow down faster flows to accelerate slower flows
Data-Proportional Allocation
[Figure: with rates proportional to remaining data, every flow finishes together at time 4]
Shuffle Completion Time = 4
Avg. Flow Completion Time = 4
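The idea above can be sketched in a few lines of Python: within a coflow, give each flow on a link a rate proportional to its remaining data, so that all flows finish at the same time. This is a minimal illustration of data-proportional allocation, not the Orchestra implementation; the flow names and the unit link capacity are assumptions.

```python
def data_proportional_rates(flow_sizes, capacity=1.0):
    """Share one link so each flow's rate is proportional to its
    remaining data; then size/rate is identical for every flow,
    and they all finish together."""
    total = sum(flow_sizes.values())
    return {f: capacity * s / total for f, s in flow_sizes.items()}

# Three shuffle flows of 1, 2, and 2 units sharing a unit-capacity link.
sizes = {"s1->r2": 1.0, "s2->r2": 2.0, "s3->r2": 2.0}
rates = data_proportional_rates(sizes)
finish_times = {f: sizes[f] / rates[f] for f in sizes}
# Every flow completes at t = 5 (= total demand / link capacity):
# slowing the small flow down costs nothing, because the stage cannot
# finish before the last byte arrives anyway.
```

Because the completion time of the stage is fixed by total demand over capacity, throttling faster flows never delays the shuffle while freeing bandwidth elsewhere.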
Coflow: a communication abstraction for data-parallel applications to express their performance goals
1. Size of each flow;
2. Total number of flows;
3. Endpoints of individual flows;
4. Dependencies between coflows.
Coflow patterns: Single Flow, Parallel Flows, Aggregation, Broadcast, Shuffle, All-to-All
How to schedule coflows online …
#1 … for faster completion of coflows?
#2 … to meet more deadlines?
#3 … for fair allocation of the network?
[Figure: N senders and N receivers connected through the datacenter network]
Varys, Aalo & HUG
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
2. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM’2015.
3. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI’2016.
Benefits of Inter-Coflow Scheduling
[Example: two unit-capacity links; Coflow 1 sends 3 units on Link 1, while Coflow 2 sends 2 units on Link 1 and 6 units on Link 2]
Fair Sharing: Coflow1 comp. time = 5; Coflow2 comp. time = 6
Smallest-Flow First1,2: Coflow1 comp. time = 5; Coflow2 comp. time = 6
The Optimal: Coflow1 comp. time = 3; Coflow2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
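The completion times in this two-coflow, two-link example can be checked with a tiny per-link simulation. The sketch below assumes unit-capacity links and a demand layout consistent with the numbers shown (Coflow 1: 3 units on Link 1; Coflow 2: 2 units on Link 1 and 6 units on Link 2); it is illustrative code, not any system’s scheduler.

```python
def fair_share_finish(flows):
    """Max-min fair sharing on one unit-capacity link: all active
    flows split the link equally until the smallest one drains."""
    remaining, finish, t = dict(flows), {}, 0.0
    while remaining:
        rate = 1.0 / len(remaining)            # equal split
        step = min(remaining.values()) / rate  # time until next finish
        t += step
        for f in list(remaining):
            remaining[f] -= rate * step
            if remaining[f] <= 1e-9:
                del remaining[f]
                finish[f] = t
    return finish

# Link 1 carries Coflow 1 (3 units) and Coflow 2 (2 units);
# Link 2 carries only Coflow 2 (6 units).
link1 = fair_share_finish({"c1": 3.0, "c2": 2.0})
link2 = fair_share_finish({"c2": 6.0})
cct_fair = {"c1": link1["c1"], "c2": max(link1["c2"], link2["c2"])}

# Coflow-aware schedule: give Link 1 to Coflow 1 first, then Coflow 2.
cct_aware = {"c1": 3.0, "c2": max(3.0 + 2.0, 6.0)}
# Fair sharing yields coflow completion times (5, 6); the coflow-aware
# schedule yields (3, 6): Coflow 1 finishes faster, Coflow 2 is unhurt.
```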
Inter-Coflow Scheduling is NP-Hard
Concurrent open shop scheduling with coupled resources
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
• Must consider matching constraints
[Figure: the same two-coflow example over the fabric, with demands of 3, 6, and 2 units across the two links]
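One concrete instance of such an ordering heuristic is Varys-style Smallest-Effective-Bottleneck-First (SEBF): a coflow can finish no sooner than its most heavily loaded port allows, so coflows are ordered by that bottleneck time. A minimal sketch; the port layout, coflow names, and unit port capacity are illustrative assumptions.

```python
def bottleneck_time(coflow, capacity=1.0):
    """A coflow can finish no sooner than its most-loaded port:
    Gamma = max over ports of (bytes crossing that port / capacity)."""
    per_port = {}
    for (src, dst), size in coflow.items():
        per_port[src] = per_port.get(src, 0.0) + size
        per_port[dst] = per_port.get(dst, 0.0) + size
    return max(per_port.values()) / capacity

def sebf_order(coflows):
    """Order coflows smallest-bottleneck-first."""
    return sorted(coflows, key=lambda name: bottleneck_time(coflows[name]))

coflows = {
    "shuffle_a": {("s1", "r1"): 3.0},                     # bottleneck = 3
    "shuffle_b": {("s2", "r1"): 2.0, ("s3", "r2"): 6.0},  # bottleneck = 6
}
order = sebf_order(coflows)  # shuffle_a is scheduled first
```

An ordering alone is not a full schedule: rates must still respect the matching constraints at each port, which is where the coupled-resource difficulty enters.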
Many Problems to Solve

System | Clairvoyant | Objective | Optimal
Varys  | Yes         | Min CCT   | No
Aalo   | No          | Min CCT   | No
HUG    | No          | Fair CCT  | Yes
Coflow-Based Architecture
Centralized master-slave architecture
• Applications use a client library to communicate with the master
• Actual timing and rates are determined by the coflow scheduler
[Figure: a master/coordinator running the coflow scheduler; each machine runs computation tasks alongside a local daemon at the network interface]
1. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM’2016.
Coflow API
Change the applications
• At the very least, we need to know what a coflow is
• For clairvoyant versions, we need more information
• Changing the framework can enable ALL jobs to take advantage of coflows
DO NOT change the applications1
• Infer coflows from network traffic patterns
• Design robust coflow schedulers that can tolerate misestimations
• Our current solution only works for coflows without dependencies; we need DAG support!
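To make the API discussion concrete, here is a toy sketch of what a coflow client library could look like, loosely inspired by Varys’s register/put/unregister pattern. Every name and signature below is an illustrative assumption, not the actual library interface.

```python
class CoflowClient:
    """Toy stand-in for a coflow client library: the framework tags
    each flow with a coflow id; a central scheduler would then pace
    all flows of a coflow together."""

    def __init__(self):
        self._next_id = 0
        self.flows = {}

    def register(self):
        """Called once per communication stage (e.g., one shuffle)."""
        self._next_id += 1
        self.flows[self._next_id] = []
        return self._next_id

    def put(self, coflow_id, src, dst, nbytes):
        """Tag one flow as belonging to the given coflow."""
        self.flows[coflow_id].append((src, dst, nbytes))

    def unregister(self, coflow_id):
        """Stage finished; the scheduler can forget this coflow."""
        return self.flows.pop(coflow_id)

client = CoflowClient()
shuffle = client.register()
client.put(shuffle, "mapper1", "reducer1", 64 << 20)
client.put(shuffle, "mapper2", "reducer1", 32 << 20)
done = client.unregister(shuffle)  # two tagged flows
```

The point of the abstraction is that only the framework (e.g., Spark’s shuffle layer) needs these calls; individual user jobs inherit coflow-aware transfers for free.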
Performance Benefits of Using Coflows
[Bar chart: communication-time overhead relative to Varys (lower is better); per-flow fairness, FIFO, per-flow prioritization2,3, FIFO-LM4, and non-clairvoyant baselines range from 1.10× to 22.07×, with Varys = 1.00×]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.
The Need for Coordination
[Plot: average coordination time vs. the number of (emulated) Aalo slaves, growing from roughly 8 ms to 992 ms as the number of slaves grows from 100 to 100,000]
Coordination is necessary to determine, in real time:
• Coflow size (sum);
• Coflow rates (max);
• Partial order of coflows (ordering).
Coordination can be a large source of overhead
• It matters little for large coflows in slow networks, but …
How do we perform decentralized coflow scheduling?
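The three aggregations above are simple to state in code. The sketch below shows a coordinator merging per-slave reports into global coflow sizes (sum), rates (max), and a smaller-first ordering; the report format and field names are assumptions for illustration, not Aalo’s wire protocol.

```python
def coordinate(reports):
    """Merge per-slave reports of the form
    {coflow: {"bytes": sent_so_far, "rate": observed_rate}} into
    global size (sum), rate (max), and an ordering over coflows."""
    size, rate = {}, {}
    for report in reports:
        for coflow, stats in report.items():
            size[coflow] = size.get(coflow, 0) + stats["bytes"]
            rate[coflow] = max(rate.get(coflow, 0.0), stats["rate"])
    # Smaller-first ordering over globally observed sizes.
    order = sorted(size, key=size.get)
    return size, rate, order

slave1 = {"c1": {"bytes": 30, "rate": 0.5},
          "c2": {"bytes": 10, "rate": 0.2}}
slave2 = {"c1": {"bytes": 20, "rate": 0.8}}
size, rate, order = coordinate([slave1, slave2])
# size = {"c1": 50, "c2": 10}; rate = {"c1": 0.8, "c2": 0.2};
# order = ["c2", "c1"] -- no single slave could compute these alone.
```

The aggregation itself is cheap; the overhead in the plot comes from gathering the reports and pushing decisions back out, which is exactly what decentralization would have to avoid.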
Coflow-Aware Load Balancing
Especially useful in asymmetric topologies
• For example, in the presence of switch or link failures
Provides an additional degree of freedom
• During path selection
• For dynamically determining load-balancing granularity
Increased need for coordination, but at an even higher cost
Coflow-Aware Routing
Relevant in topologies without full bisection bandwidth
• When topologies have temporary in-network oversubscription
• In geo-distributed analytics
Scheduling-only solutions do not work well
• Calls for joint routing-and-scheduling solutions
• Must take network utilization into account
• Must avoid frequent path changes
Increased need for coordination
Coflows in Circuit-Switched Networks
Circuit switching is relevant again due to the rise of optical networks
• Provides very high bandwidth
• Expensive to set up new circuits
Co-scheduling applications and coflows
• Schedule tasks so that already-established circuits can be reused
• Perform in-network aggregation over existing circuits instead of waiting for new circuits to be created
Extension to Multiple Resources1
A DAG of coflows is very similar to a job DAG of stages
• Same principle applies, but with new challenges
Consider both fungible resources (bandwidth) and non-fungible resources (CPU cores)
• Across the entire DAG
1. Altruistic Scheduling in Multi-Resource Clusters, OSDI’2016.
Coflow: a communication abstraction for data-parallel applications to express their performance goals
Key open challenges
1. Better theoretical understanding
2. Efficient solutions for decentralization, topologies, multi-resource settings, estimation over DAGs, circuit switching, etc.
More information
1. Papers: http://www.mosharaf.com/publications/
2. Software/simulator/workloads: https://github.com/coflow