Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Improving the Scalability of Data Center Networks with Traffic-aware Virtual

Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement IEEE INFOCOM 2010

INTRODUCTION The scalability of modern data centers has become a practical concern and has attracted significant attention in recent years. existing solutions that require changes in the network architecture and the routing protocols to balance traffic load With an increasing trend towards more communication intensive applications in data centers, the bandwidth usage between virtual machines (VMs) is rapidly growing.

INTRODUCTION This paper proposes using traffic-aware virtual machine (VM) placement to improve the network scalability Many VM placement solutions seek to consolidate VMs for CPU, physical memory and power consumption savings, yet without considering consumption of network resources This paper tackling the scalability issue by optimizing the placement of VMs on host machines.

INTRODUCTION e.g. VMs with large mutual bandwidth usage are assigned to host machines in close proximity design a two-tier approximate algorithm that efficiently solves the VM placement problem for very large problem sizes

INTRODUCTION Contributions 1.We address the scalability issue of data center networks with network-aware VM placement. We formulate it as an optimization problem, prove its hardness and propose a novel two-tier algorithm.

INTRODUCTION Contributions 2.We analyze the impact of data center network architectures and traffic patterns on the scalability gains attained by network-aware VM placement.

INTRODUCTION Contributions 3.We measure traffic patterns in production data center environments, and use the data to evaluate the proposed algorithm as well as the impact analysis.

INTRODUCTION Problem definition: the Traffic-aware VM Placement Problem (TVMPP) as an optimization problem. Input: traffic matrix, cost matrix Output: where VMs should be placed Goal: minimize the total cost

Data Center Traffic Patterns Examine traces from two data-center-like systems: 1. a data warehouse hosted by IBM Global Services the incoming and outgoing traffic rates for 17 thousand VMs 2. the incoming and outgoing TCP connections for 68 VMs 10 days measurement

Data Center Traffic Patterns Uneven distribution of traffic volumes from VMs

Data Center Traffic Patterns Stable per-VM traffic at large timescale:

Data Center Traffic Patterns Weak correlation between traffic rate and latency:

Data Center Traffic Patterns The potential benefit : increased network scalability and reduced average traffic latency. The observed traffic stability over large timescale suggests that it is feasible to find good placements based on past traffic statistics

Data Center Network Architectures

Problem Formulation Cij: the communication cost from slot i to j Dij: traffic rate from VM i to j ei: external traffic rate for VMi gi:communication cost between VMi and the gateway

Problem Formulation The above objective function is equivalent to X: permutation matrix

Problem Formulation we define Cij as the number of switches on the routing path from VM i to j. Accordingly, optimizing TVMPP is equivalent to minimizing average traffic latency caused by network infrastructure. Offline scenario & Online scenario

Complexity Analysis TVMPP falls into the category of Quadratic Assignment Problem (QAP) QAP: n things are put into n location, with distance & flows NP-hard, finding the optimality of QAP problems with size > 15 is practically impossible

Complexity Analysis Theorem 1: For a TVMPP problem defined on a data center that takes one of the topology in Figure 4, finding the TVMPP optimality is NP-hard

Theorem 1 This can be proved by a reduction from the Balanced Minimum K-cut Problem (BMKP) BMKP: G = (V,E) undirected, weighted graph with n vertices A k-cut on G is defined as a subset of E that partition G into k components BMKP is NP-hard

Theorem 1 Now considering a data center network, regardless of which topology being used, we can always create a network topology that satisfy: n slots that are partitioned into k slot-clusters of equal size n/k Every two slots have a connection with certain cost within the same cluster : ci across clusters cost: co co > ci

Theorem 1 Suppose there are n VMs with traffic matrix D. By assigning these n VMs to the n slots, we obtain a TVMPP problem if we define a graph with the n VMs as nodes and D as edge weights, we obtain a BMKP problem It can be shown that when the TVMPP is optimal, the associated BMKP is also optimal. And vise versa

Theorem 1 when the TVMPP is optimal, if we swap any two VMs i, j that have been assigned to two slot-cluster r1, r2 respectively, the k-cut weight will increase. Let s1 denote the set of VMs assigned to r1 Let s2 denote the set of VMs assigned to r2

the TVMPP objective value increases: The amount of change for the k-cut weight is:

ALGORITHMS Proposition 1: Suppose 0 a1 a2... an and 0 b1 b2... bn, the following inequalities hold for any permutation on [1,..., n]

ALGORITHMS First, according to Proposition 1, solving TVMPP is intuitively equivalent to finding a mapping of VMs to slots such that VM pairs with heavy mutual traffic be assigned to slot pairs with low-cost connections.

ALGORITHMS The second design principle is divide-and-conquer: we partition VMs into VM-clusters and partition slots into slot clusters. VM-clusters are obtained via classical min-cut graph algorithm which ensures that VM pairs with high mutual traffic rate are within the same VM cluster Slot-clusters are obtained via standard clustering techniques which ensures slot pairs with low-cost connections belong to the same slot-cluster

ALGORITHMS SlotClustering: Minimum k-clustering,NP-hard. ( an approximation ratio 2 ) O(nk) VMMinKcut: minimum k-cut algorithm O(n4) Assign VMs to slots Recursive call

ALGORITHMS

IMPACT OF NETWORK ARCHITECTURES AND TRAFFIC PATTERNS Global Traffic Model: each VM sends traffic to every other VM at equal and constant rate For any permutation, matrix X, holds This simplifies the TVMPP problem to the following:

IMPACT OF NETWORK ARCHITECTURES AND TRAFFIC PATTERNS which is the classical Linear Sum Assignment Problem (LSAP). The complexity for LSAP is O(n3) Random placement:

IMPACT OF NETWORK ARCHITECTURES AND TRAFFIC PATTERNS

Partitioned Traffic Model: Under the partitioned traffic model, each VM belongs to a group of VMs and it sends traffic only to other VMs in the same group The GLB is a lower bound for the optimal objective value of a QAP problem

IMPACT OF NETWORK ARCHITECTURES AND TRAFFIC PATTERNS

observation

EVALUATION

DISCUSSION Combining VM migration with dynamic routing protocols VM placement by joint network and server resource optimization:

Amos Brocco, Apostolos Malatras, Ye Huang, Beat Hirsbrunner Department of Informatics University of Fribourg, Switzerland ARiA: A Protocol for Dynamic Fully Distributed Grid Meta-Scheduling ICDCS 2010

INTRODUCTION An advantage of grid systems is their ability to guarantee efficient meta-scheduling (optimal allocation of jobs across a pool of sites with diverse local scheduling policies ) The centralized nature of current meta-scheduling solutions is not well suited for the envisioned increasing scale and dynamicity

INTRODUCTION This paper focuses on grid task meta-scheduling, by presenting a fully distributed protocol named ARiA to achieve efficient global dynamic scheduling across multiple sites The meta-scheduling process is performed online, and takes into account the availability of new resources as well as changes in actual allocation policies

ARiA PROTOCOL Job Submission Phase Job Acceptance Phase Dynamic Rescheduling Phase

Job Submission Phase Jobs are assigned a universal unique identifier (UUID) Nodes receiving job submissions are referred to as initiators for these jobs Initiators issue resource discovery queries across the grid peer-to-peer overlay by broadcasting REQUEST messages

Job Acceptance Phase If the request cannot be satisfied, the message is further forwarded on the peer-to-peer overlay otherwise a cost value for the job based on actual resources and current scheduling is computed and sent back to the jobs initiator by means of an ACCEPT message The initiator evaluates incoming ACCEPT responses, and selects the best qualified node, and sends an ASSIGN message

Dynamic Rescheduling Phase the assignee attempts to find candidates for rescheduling of jobs in its queue while their execution has not yet started by the INFORM messagges The structure of INFORM messages relates to that of REQUEST messages

EVALATION For the evaluation of ARiA, an overlay of 500 nodes with a target average path length of 9 hops was deployed in a custom simulator The average nodes degree attained during simulations was 4, resulting in about 2000 overlay links

EVALATION In all scenarios a total of 1000 jobs is submitted to random nodes on the grid. Unless otherwise specified, jobs are submitted at 10 seconds intervals when dynamic rescheduling is enabled, INFORM messages are sent for at most 2 scheduled jobs every 5 minutes

EVALATION REQUEST messages are forwarded on the overlay for at most 9 hops; at each step, at most 4 random neighbors of the current node are contacted INFORM messages a more lightweight approach is followed, with at most 8 hops and up to 2 neighbors

EVALATION

Amir Epstein Dean H. Lorenz Ezra Silvera Inbar Shapira Virtualization Technologies, System Technologies & Services IBM Haifa Research Lab, Haifa, Israel Virtual Appliance Content Distribution for a Global Infrastructure Cloud Service IEEE INFOCOM 2010

INTRODUCTION An emerging cloud service is a virtual server shop, that allows cloud customers to order virtual appliances to be delivered virtually on the cloud Global cloud providers need to create customized virtual-server disk images and deliver them on time to meet the customer reservations and service level In order to reduce provisioning time and meet reservation deadlines, one approach is to stage images on storage near the customer

INTRODUCTION This introduces an optimization problem of finding an optimal staging schedule, according to network bandwidth, pending reservations schedule, and customer value Continuous model vs integral model

Problem Definition a staging storage space with capacity C n appliance deployment requests Each request i is for an appliance that consume Ci staging capacity and has a desired due date di propagation time pi (assume pi = kCi)

Continuous model

Lemma 4.1: Any feasible schedule S can be turned into right-tight feasible schedule by right-shifting job execution intervals. Lemma 4.2: There exists an optimal schedule in which the jobs are processed in EDD (Earliest Due Date) order

Continuous model Algorithm 1 is a dynamic program that finds an optimal schedule that is right-tight and in EDD order O(nW ) time and space

THE INTEGRAL MODEL p1=3, d1=5 p2=3, d2=8 P3=2, d3=9 Capacity = 5 s1=2, s2=5, s3=0 But can not be an EDD order

SOLUTION Our algorithms for the integral model have two steps In the first step, we solve the problem for the continuous model In the second step, we discard jobs from this schedule, without losing too much weight, to obtain a feasible schedule for the integral model

THE INTEGRAL MODEL Lemma 5.1: For unweighted jobs, any feasible schedule S for the continuous model can be transformed to a feasible schedule S for the integral model with w(S)>=1/2w(S). Lemma 5.2: For weighted jobs, any feasible schedule S for the continuous model can be transformed to a feasible schedule S* for the integral model with w(S*)>=1/2w(S)

SOLUTION

IMPLEMENTATION AND SIMULATION RESULTS

Documents

Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Improving the Scalability of Data Center Networks with Traffic-aware Virtual