Coflow: Recent Advances and What’s Next?
Mosharaf Chowdhury
University of Michigan
Datacenter-Scale Computing
Geo-Distributed Computing
Fast Analytics Over the WAN
Rack-Scale Computing
Proactive Analytics Before You Think!
Coflow Networking (Open Source)
Apache Spark (Open Source)
Cluster File System (Facebook)
Resource Allocation (Microsoft)
DAG Scheduling (Apache YARN)
Cluster Caching (Alluxio)
Rack-Scale Computing (< 0.01 ms)
Datacenter-Scale Computing (~1 ms)
Geo-Distributed Computing (> 100 ms)
Big Data
The volume of data businesses want to make sense of is increasing
Increasing variety of sources
• Web, mobile, wearables, vehicles, scientific, …
Cheaper disks, SSDs, and memory
Stalling processor speeds
Big Datacenters for Massive Parallelism
2005–2015: MapReduce, Hadoop, Spark, Hive, Dryad, DryadLINQ, Spark Streaming, GraphX, GraphLab, Pregel, Storm, Dremel, BlinkDB
1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI’2012.
Distributed Data-Parallel Applications
Multi-stage dataflow
• Computation interleaved with communication
Computation stage (e.g., Map, Reduce)
• Distributed across many machines
• Tasks run in parallel
Communication stage (e.g., Shuffle)
• Between successive computation stages (e.g., between the Map stage and the Reduce stage)
A communication stage cannot complete until all of its data have been transferred
Communication is Crucial for Performance
As SSD-based and in-memory systems proliferate,the network is likely to become the primary bottleneck
1. Based on a month-long trace with 320,000 jobs and 150 Million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Facebook jobs spend ~25% of their runtime, on average, in intermediate communication1
Faster Communication Stages: The Traditional Networking Approach
Flow: transfers data from a source to a destination
• The independent unit of allocation, sharing, load balancing, and/or prioritization
Existing Solutions
1980s–2000s, Per-Flow Fairness: GPS, WFQ, RED, CSFQ, ECN, XCP, RCP
2005–2015, Flow Completion Time: DCTCP, D3, D2TCP, PDQ, FCP, DeTail, pFabric
Independent flows cannot capture the collective communication behavior common in data-parallel applications
Why Do They Fall Short?
[Figure: a shuffle across a datacenter fabric with three input links and three output links; senders s1–s3 feed the input links, and receivers r1–r2 sit behind the output links]
Why Do They Fall Short?
Per-Flow Fair Sharing
[Figure: bandwidth over time on the links to r1 and r2; under per-flow fair sharing, the flows finish at times 3, 3, and 5]
Shuffle Completion Time = 5
Avg. Flow Completion Time = 3.66
Solutions focusing on flow completion time cannot further decrease the shuffle completion time
Improve Application-Level Performance1
[Figure: the same per-flow fair sharing schedule; Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
Slow down faster flows to accelerate slower flows
Data-Proportional Allocation
[Figure: with rates proportional to remaining data, every flow finishes together at time 4]
Shuffle Completion Time = 4
Avg. Flow Completion Time = 4
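The idea above can be sketched in a few lines of Python: within a coflow, give each flow on a link a rate proportional to its remaining data, so that all flows finish at the same time. This is a minimal illustration of data-proportional allocation, not the Orchestra implementation; the flow names and the unit link capacity are assumptions.

```python
def data_proportional_rates(flow_sizes, capacity=1.0):
    """Share one link so each flow's rate is proportional to its
    remaining data; then size/rate is identical for every flow,
    and they all finish together."""
    total = sum(flow_sizes.values())
    return {f: capacity * s / total for f, s in flow_sizes.items()}

# Three shuffle flows of 1, 2, and 2 units sharing a unit-capacity link.
sizes = {"s1->r2": 1.0, "s2->r2": 2.0, "s3->r2": 2.0}
rates = data_proportional_rates(sizes)
finish_times = {f: sizes[f] / rates[f] for f in sizes}
# Every flow completes at t = 5 (= total demand / link capacity):
# slowing the small flow down costs nothing, because the stage cannot
# finish before the last byte arrives anyway.
```

Because the completion time of the stage is fixed by total demand over capacity, throttling faster flows never delays the shuffle while freeing bandwidth elsewhere.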
Coflow: a communication abstraction for data-parallel applications to express their performance goals
1. Size of each flow;
2. Total number of flows;
3. Endpoints of individual flows;
4. Dependencies between coflows.
Coflow patterns: Single Flow, Parallel Flows, Aggregation, Broadcast, Shuffle, All-to-All
How to schedule coflows online …
#1 … for faster completion of coflows?
#2 … to meet more deadlines?
#3 … for fair allocation of the network?
[Figure: N senders and N receivers connected through the datacenter network]
Varys, Aalo & HUG
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
2. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM’2015.
3. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI’2016.
Benefits of Inter-Coflow Scheduling
[Example: two unit-capacity links; Coflow 1 sends 3 units on Link 1, while Coflow 2 sends 2 units on Link 1 and 6 units on Link 2]
Fair Sharing: Coflow1 comp. time = 5; Coflow2 comp. time = 6
Smallest-Flow First1,2: Coflow1 comp. time = 5; Coflow2 comp. time = 6
The Optimal: Coflow1 comp. time = 3; Coflow2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
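The completion times in this two-coflow, two-link example can be checked with a tiny per-link simulation. The sketch below assumes unit-capacity links and a demand layout consistent with the numbers shown (Coflow 1: 3 units on Link 1; Coflow 2: 2 units on Link 1 and 6 units on Link 2); it is illustrative code, not any system’s scheduler.

```python
def fair_share_finish(flows):
    """Max-min fair sharing on one unit-capacity link: all active
    flows split the link equally until the smallest one drains."""
    remaining, finish, t = dict(flows), {}, 0.0
    while remaining:
        rate = 1.0 / len(remaining)            # equal split
        step = min(remaining.values()) / rate  # time until next finish
        t += step
        for f in list(remaining):
            remaining[f] -= rate * step
            if remaining[f] <= 1e-9:
                del remaining[f]
                finish[f] = t
    return finish

# Link 1 carries Coflow 1 (3 units) and Coflow 2 (2 units);
# Link 2 carries only Coflow 2 (6 units).
link1 = fair_share_finish({"c1": 3.0, "c2": 2.0})
link2 = fair_share_finish({"c2": 6.0})
cct_fair = {"c1": link1["c1"], "c2": max(link1["c2"], link2["c2"])}

# Coflow-aware schedule: give Link 1 to Coflow 1 first, then Coflow 2.
cct_aware = {"c1": 3.0, "c2": max(3.0 + 2.0, 6.0)}
# Fair sharing yields coflow completion times (5, 6); the coflow-aware
# schedule yields (3, 6): Coflow 1 finishes faster, Coflow 2 is unhurt.
```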
Inter-Coflow Scheduling is NP-Hard
Concurrent open shop scheduling with coupled resources
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
• Must consider matching constraints
[Figure: the same two-coflow example over the fabric, with demands of 3, 6, and 2 units across the two links]
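One concrete instance of such an ordering heuristic is Varys-style Smallest-Effective-Bottleneck-First (SEBF): a coflow can finish no sooner than its most heavily loaded port allows, so coflows are ordered by that bottleneck time. A minimal sketch; the port layout, coflow names, and unit port capacity are illustrative assumptions.

```python
def bottleneck_time(coflow, capacity=1.0):
    """A coflow can finish no sooner than its most-loaded port:
    Gamma = max over ports of (bytes crossing that port / capacity)."""
    per_port = {}
    for (src, dst), size in coflow.items():
        per_port[src] = per_port.get(src, 0.0) + size
        per_port[dst] = per_port.get(dst, 0.0) + size
    return max(per_port.values()) / capacity

def sebf_order(coflows):
    """Order coflows smallest-bottleneck-first."""
    return sorted(coflows, key=lambda name: bottleneck_time(coflows[name]))

coflows = {
    "shuffle_a": {("s1", "r1"): 3.0},                     # bottleneck = 3
    "shuffle_b": {("s2", "r1"): 2.0, ("s3", "r2"): 6.0},  # bottleneck = 6
}
order = sebf_order(coflows)  # shuffle_a is scheduled first
```

An ordering alone is not a full schedule: rates must still respect the matching constraints at each port, which is where the coupled-resource difficulty enters.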
Many Problems to Solve

System | Clairvoyant | Objective | Optimal
Varys  | Yes         | Min CCT   | No
Aalo   | No          | Min CCT   | No
HUG    | No          | Fair CCT  | Yes
Coflow-Based Architecture
Centralized master-slave architecture
• Applications use a client library to communicate with the master
• Actual timing and rates are determined by the coflow scheduler
[Figure: a master/coordinator running the coflow scheduler; each machine runs computation tasks alongside a local daemon at the network interface]
1. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM’2016.
Coflow API
Change the applications
• At the very least, we need to know what a coflow is
• For clairvoyant versions, we need more information
• Changing the framework can enable ALL jobs to take advantage of coflows
DO NOT change the applications1
• Infer coflows from network traffic patterns
• Design robust coflow schedulers that can tolerate misestimations
• Our current solution only works for coflows without dependencies; we need DAG support!
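To make the API discussion concrete, here is a toy sketch of what a coflow client library could look like, loosely inspired by Varys’s register/put/unregister pattern. Every name and signature below is an illustrative assumption, not the actual library interface.

```python
class CoflowClient:
    """Toy stand-in for a coflow client library: the framework tags
    each flow with a coflow id; a central scheduler would then pace
    all flows of a coflow together."""

    def __init__(self):
        self._next_id = 0
        self.flows = {}

    def register(self):
        """Called once per communication stage (e.g., one shuffle)."""
        self._next_id += 1
        self.flows[self._next_id] = []
        return self._next_id

    def put(self, coflow_id, src, dst, nbytes):
        """Tag one flow as belonging to the given coflow."""
        self.flows[coflow_id].append((src, dst, nbytes))

    def unregister(self, coflow_id):
        """Stage finished; the scheduler can forget this coflow."""
        return self.flows.pop(coflow_id)

client = CoflowClient()
shuffle = client.register()
client.put(shuffle, "mapper1", "reducer1", 64 << 20)
client.put(shuffle, "mapper2", "reducer1", 32 << 20)
done = client.unregister(shuffle)  # two tagged flows
```

The point of the abstraction is that only the framework (e.g., Spark’s shuffle layer) needs these calls; individual user jobs inherit coflow-aware transfers for free.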
Performance Benefits of Using Coflows
[Bar chart: communication-time overhead relative to Varys (lower is better); per-flow fairness, FIFO, per-flow prioritization2,3, FIFO-LM4, and non-clairvoyant baselines range from 1.10× to 22.07×, with Varys = 1.00×]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.
The Need for Coordination
[Plot: average coordination time vs. the number of (emulated) Aalo slaves, growing from roughly 8 ms to 992 ms as the number of slaves grows from 100 to 100,000]
Coordination is necessary to determine, in real time:
• Coflow size (sum);
• Coflow rates (max);
• Partial order of coflows (ordering).
Coordination can be a large source of overhead
• It matters little for large coflows in slow networks, but …
How do we perform decentralized coflow scheduling?
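The three aggregations above are simple to state in code. The sketch below shows a coordinator merging per-slave reports into global coflow sizes (sum), rates (max), and a smaller-first ordering; the report format and field names are assumptions for illustration, not Aalo’s wire protocol.

```python
def coordinate(reports):
    """Merge per-slave reports of the form
    {coflow: {"bytes": sent_so_far, "rate": observed_rate}} into
    global size (sum), rate (max), and an ordering over coflows."""
    size, rate = {}, {}
    for report in reports:
        for coflow, stats in report.items():
            size[coflow] = size.get(coflow, 0) + stats["bytes"]
            rate[coflow] = max(rate.get(coflow, 0.0), stats["rate"])
    # Smaller-first ordering over globally observed sizes.
    order = sorted(size, key=size.get)
    return size, rate, order

slave1 = {"c1": {"bytes": 30, "rate": 0.5},
          "c2": {"bytes": 10, "rate": 0.2}}
slave2 = {"c1": {"bytes": 20, "rate": 0.8}}
size, rate, order = coordinate([slave1, slave2])
# size = {"c1": 50, "c2": 10}; rate = {"c1": 0.8, "c2": 0.2};
# order = ["c2", "c1"] -- no single slave could compute these alone.
```

The aggregation itself is cheap; the overhead in the plot comes from gathering the reports and pushing decisions back out, which is exactly what decentralization would have to avoid.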
Coflow-Aware Load Balancing
Especially useful in asymmetric topologies
• For example, in the presence of switch or link failures
Provides an additional degree of freedom
• During path selection
• For dynamically determining load-balancing granularity
Increased need for coordination, but at an even higher cost
Coflow-Aware Routing
Relevant in topologies without full bisection bandwidth
• When topologies have temporary in-network oversubscription
• In geo-distributed analytics
Scheduling-only solutions do not work well
• Calls for joint routing-and-scheduling solutions
• Must take network utilization into account
• Must avoid frequent path changes
Increased need for coordination
Coflows in Circuit-Switched Networks
Circuit switching is relevant again due to the rise of optical networks
• Provides very high bandwidth
• Expensive to set up new circuits
Co-scheduling applications and coflows
• Schedule tasks so that already-established circuits can be reused
• Perform in-network aggregation over existing circuits instead of waiting for new circuits to be created
Extension to Multiple Resources1
A DAG of coflows is very similar to a job DAG of stages
• Same principle applies, but with new challenges
Consider both fungible resources (bandwidth) and non-fungible resources (CPU cores)
• Across the entire DAG
1. Altruistic Scheduling in Multi-Resource Clusters, OSDI’2016.
Coflow: a communication abstraction for data-parallel applications to express their performance goals
Key open challenges
1. Better theoretical understanding
2. Efficient solutions for decentralization, topologies, multi-resource settings, estimation over DAGs, circuit switching, etc.
More information
1. Papers: http://www.mosharaf.com/publications/
2. Software/simulator/workloads: https://github.com/coflow