
Network Topology Optimization for Data Aggregation with Splitting

Soham Das and Sartaj Sahni
Department of Computer and Information Science and Engineering

University of Florida, Gainesville, USA

Email: {sdas, sahni}@cise.ufl.edu

Abstract—In this paper, we develop algorithms for the data aggregation problem which arises in the context of big-data applications that employ the MapReduce operation. For the case when source racks can send their data to the aggregator using multiple paths, we show that an aggregation tree topology that minimizes aggregation time can be constructed in polynomial time. We also consider the problem of constructing aggregation trees that minimize total network traffic subject to the primary constraint that aggregation time is minimized. Heuristics for this problem are presented. Experiments show that allowing multiple paths reduces aggregation time by up to 99% relative to the aggregation trees constructed using the LPT rule [3]. This reduction in aggregation time, however, comes with up to a 35% increase in total network traffic when racks have more than 2 optical links and up to a 98% increase in total network traffic when each rack has 2 optical links.

Keywords—Data Center Networks; Software Defined Networking; Big Data applications; MapReduce tasks

I. INTRODUCTION

Data center networks are comprised of a large number of racks interconnected via top-of-rack (ToR) switches. The performance of big-data applications may be enhanced by dynamically reconfiguring, via software defined networking (SDN), the data center network to suit the application. In many big-data applications, the time required for this reconfiguring is insignificant compared to the overall execution time of the application. The achievable network topologies are limited by the number, k, k ≥ 2, of optical links at each ToR switch. In this paper, we focus on the construction of network topologies that are trees and that minimize the time required by the aggregation phase of a MapReduce operation. We note that the MapReduce paradigm is employed by many big-data applications and that the aggregation time in these applications is a dominant component of overall execution time [2].

Fig. 1: Example aggregation tree topologies

Consider the case when data is to be aggregated from 5 racks to a single aggregator rack. Figure 1(a) shows a possible aggregation tree that may be used for this purpose. In this figure, A is the aggregator rack and X(d), X ∈ {B, C, D, E, F}, is a source rack that has d units of data that are to be aggregated into A. The edges denote optical links between pairs of ToR switches. In the aggregation tree of Figure 1(a), node B uses three optical links to connect to nodes A, D, and F, and so k must be at least 3. For the data aggregation, each node sends its data using the unique tree path to the aggregator. Assume that data is transmitted in packets, that a node can receive incoming data from all children in parallel, and that while a node is receiving packets from its children, it can transmit previously received packets to its parent. If each optical link has a bandwidth of 10 Gbps, the data from B, D, and F can be aggregated into A in (4 + 8 + 4)/10 = 1.6s and that from C and E can be aggregated in (8 + 4)/10 = 1.2s. Since A receives data from B and C in parallel, the total aggregation time is max{1.6, 1.2} = 1.6s. Figure 1(b) shows another aggregation tree with k = 3. In this tree, node F has been split into two nodes F1 and F2 (we use a broken line to connect nodes that represent the same rack; solid lines denote optical links). So, rack F uses two paths (F1 to A and F2 to A) to route its data to the aggregator; each path carries 2 units of F's data. The path from F1 to A goes through B and that from F2 goes through C. Note that this multi-path (or split) routing of F's data is possible because rack F has k = 3 optical links, of which the topology of Figure 1(b) uses only 2. The aggregation time using this topology is reduced to max{0.4 + 0.8 + 0.2, 0.8 + 0.2 + 0.4} = 1.4s.

In practice, one attempts to reduce aggregation time by randomly distributing tasks across the racks. With the advent of SDN, it is possible to dynamically configure the network topology at the start of the aggregation phase, when the amount of data in each rack that is to be aggregated is known. This dynamic configuration can reduce the aggregation time versus a randomized distribution done in the Map phase. We also note that while randomization algorithms run in the Map phase may improve the expected performance of the aggregation phase of MapReduce, they cannot improve the worst-case performance of this phase. Further, for the example of Figure 1, no distribution (random or otherwise) of the tasks can reduce aggregation time to 1.4s (without splitting), as no subset of the racks B–F has data totaling 14. Hence, aggregation with splitting, which is made possible by the ability to configure network topology in SDNs, may be used to improve application performance and network utilization.

The total network traffic is the sum of the amounts of data moved through each optical link. For the topology of Figure 1(b) the total network traffic is 14 (link AB) + 14 (link AC) + 8 (link BD) + 2 (link BF) + 2 (link CF) + 4 (link CE) = 44Gb. If we swap nodes B and D in Figure 1(b), the total network traffic becomes 40Gb and the aggregation time remains 1.4s.
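To make these two quantities concrete, the following short sketch (our own tree encoding and function names, not from the paper) computes the aggregation time and the total network traffic of a given aggregation tree, assuming each node is a pair (data, children) and the aggregator is the root with zero data:

```python
def subtree_stats(node):
    """node = (data, children). Returns (subtree_data, subtree_traffic);
    each child's subtree data also crosses the link from that child up to
    this node, which is how the total network traffic accumulates."""
    data, children = node
    total_data, traffic = data, 0
    for child in children:
        c_data, c_traffic = subtree_stats(child)
        total_data += c_data
        traffic += c_traffic + c_data
    return total_data, traffic

def aggregation_time_and_traffic(aggregator, bandwidth=10):
    # Aggregation time is set by the heaviest subtree hanging off the aggregator.
    _, subtrees = aggregator
    heaviest = max(subtree_stats(t)[0] for t in subtrees)
    return heaviest / bandwidth, subtree_stats(aggregator)[1]

# Figure 1(b): B(4) carries D(8) and F1(2); C(8) carries E(4) and F2(2).
fig1b = (0, [(4, [(8, []), (2, [])]),
             (8, [(4, []), (2, [])])])
print(aggregation_time_and_traffic(fig1b))   # (1.4, 44), matching the text
```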

The focus of this paper is the Single Aggregator Network Topology Optimization problem with data Splitting (SANTOS), in which we are to determine a tree topology that minimizes the aggregation time; each source rack may appear as several nodes in the aggregation tree and thereby split its data traffic utilizing different optical links. We also study the related problem, SANTOS-NT, in which we wish to minimize total network traffic subject to the constraint that aggregation time is minimized. That is, from among the many solutions to SANTOS, obtain one with least total network traffic. The main contributions of this paper are:

1) We propose two classes of algorithms for SANTOS. The complexity of one of these is O(n) and that of the other is O(n log k), where k ≥ 2 is the number of optical links per rack and n is the number of source racks. When these algorithms are extended to require that racks be considered in sorted order, their complexity becomes O(n log n) as the sort takes this much time.

2) We develop O(n log n) heuristics to reorganize the nodes in any solution to SANTOS in an attempt to minimize total network traffic subject to the constraints that (a) the nodes in each of the up to k subtrees of the aggregator do not change and (b) the degree of each node is unchanged.

3) We identify two classes of solutions to SANTOS-NT in which total network traffic can be minimized in O(n) time subject to the minimization of aggregation time. However, it takes O(n log n) time to generate the initial SANTOS solution, which is then reorganized in O(n) time to minimize total network traffic.

4) Through extensive experiments, we demonstrate the benefit of permitting racks to use multiple data paths to the aggregator (i.e., to split their data). The aggregation time reduces by up to 99% using data splitting. Although data splitting can reduce or increase total network traffic, our experiments indicate that it is more likely to increase total traffic. This is not surprising, as splitting increases the total number of nodes in the aggregation tree and hence the total number of hops on paths to the aggregator. We establish the superiority of our hybrid greedy algorithm GDFD for SANTOS-NT over our other proposed SANTOS-NT algorithms. GDFD limits the increase in network traffic, relative to the aggregation trees obtained using the LPT rule of [3], to a maximum of 35% when racks have more than 2 optical links each.

The remainder of this paper is organized as follows. Related work is reviewed in Section II and two polynomial-time algorithms for SANTOS are proposed in Section III. Heuristics and algorithms to reorganize the nodes in an aggregation topology so as to minimize total network traffic are developed in Section IV. Our experimental results are presented in Section V and we conclude in Section VI.

II. RELATED WORK

The current paper is a follow-up to our work [3] on single aggregator network topology optimization (SANTO) under the constraint that each rack appears as exactly one node in the aggregation tree (i.e., racks cannot split their data over multiple paths). Consequently, we have reproduced much of the related-work section of [3] here and added a summary of the new results of [3].

There has been considerable interest in configuring the topology of data center networks to improve the performance of applications. For example, Greenberg et al. [6] have proposed the use of programmable commodity switches to reduce network cost and enhance performance through the use of Valiant Load Balancing; Al-Fares et al. [7] propose using commodity Ethernet switches to support the full aggregate bandwidth of large clusters; Guo et al. [8] use servers as nodes in the interconnect; Das et al. [9] explore the use of OpenFlow to control routing according to application need; and Webb et al. [10] propose to isolate applications and use different routing mechanisms for them in fat-tree based data centers. The flow of big-data traffic in data center networks has also been studied. For example, Kavulya et al. [11] have analyzed 10 months of MapReduce logs from the M45 supercomputing cluster at Yahoo! and Benson et al. [12] have conducted an empirical study of the network traffic in 10 data centers that include university, enterprise, and cloud data centers.

Wang, Ng and Shaikh [2] describe an "integrated network control for big data applications" that comprises "OpenFlow-enabled top-of-rack (ToR) switches". They propose an algorithm for SANTO as well as for the more general case of multiple aggregators. Their algorithm for SANTO has two steps. In the first step, the racks are sorted into decreasing order of the amount of data they need to send to the aggregator and, in the second step, the racks are placed into the tree topology in this order. The objective is to place racks with "higher demand closer to the aggregator". In [3], we show that SANTO is NP-hard and that the approximation ratio of the SANTO algorithm of Wang, Ng, and Shaikh [2] is (k + 1)/2, where k is the degree of ToR (top-of-rack) switches. Further, we show that this algorithm exhibits an anomalous behavior: an increase in the switch degree may result in an increase in the aggregation time. We propose a SANTO algorithm that is based on the longest processing time (LPT) scheduling rule and whose approximation ratio is (4/3 − 1/(3k)) [4], [5]. In [3], we show that for every instance of SANTO, the time required to aggregate using the LPT method is no more than that using the Wang et al. algorithm. Moreover, if the LPT algorithm is used to partition the source racks into k groups and the racks within each group are organized into a subtree of the aggregation tree using the algorithm of Wang et al. [2], then total network traffic is reduced [3]. In fact, the total network traffic in each subtree is minimum for the racks in that subtree.

The SANTOS model used in this paper differs from the SANTO model used in [2], [3] in that SANTOS allows racks to split their data across multiple paths to the aggregator. By permitting the use of multiple paths, we potentially reduce the aggregation time. In fact, the aggregation time may reduce by a factor of up to k. For example, consider the case when there is only 1 source rack (i.e., n = 1). If the time to aggregate without splitting is t, then by equally splitting across k links, the aggregation time becomes t/k. It is easy to verify that a reduction by a factor larger than k is not possible.


III. ALGORITHMS FOR SANTOS

Let k, k ≥ 2, be the number of optical links in a ToR switch; n the number of source racks; and di the amount of data in rack i, 1 ≤ i ≤ n, that is to be aggregated. For a given aggregation tree T, the aggregation time, A(T), is max{Dj}/B, where Dj is the amount of data to be transmitted to the aggregator by the racks in subtree j, 1 ≤ j ≤ k, of the root of the aggregation tree and B is the bandwidth of an optical link. Since ∑_{j=1}^{k} Dj = ∑_{i=1}^{n} di, A(T) ≥ LB = ⌈(∑_{i=1}^{n} di)/k⌉/B (here, LB denotes a lower bound on the aggregation time). To achieve this lower bound, we need to place the racks into up to k subtrees so that Dj ≤ LB, 1 ≤ j ≤ k. In case rack i is placed into more than one subtree, the portion of its data di that each subtree must handle needs to be specified. Additionally, we need to specify how many of rack i's k optical links are available to each subtree in which the rack is placed. In the example of Figure 1(b), the 5 source racks are placed into 2 non-empty subtrees of the aggregator tree; rack F is in both subtrees and the remaining 4 racks are in exactly 1 subtree each; F's data of 4 units is evenly split between the two subtrees in which it is placed; one of F's optical links is assigned to each subtree and the remaining link is idle; D1 = D2 = 14; and A(T) = 14/B = 1.4s, where B = 10. The lower bound LB for this example is ⌈28/3⌉/10 = 1.0s. This lower bound is achieved, for example, by the aggregation tree of Figure 2(a), which has 3 subtrees with D1 = D2 = 10 and D3 = 8; the aggregator and rack D use all three of their optical links; racks B and E use 2 links each and racks C and F use only 1 each.

For simplicity, in the remainder of this paper, we assume that the di's have been normalized so that B = 1. There are at least 2 different high-level strategies that may be used to place the n source racks into k groups such that max{Dj} = LB and no rack requires more than k links: depth first (Algorithm 1) and greedy (Algorithm 2). In each, the racks are considered one at a time in some order and the rack under current consideration is assigned to one or more groups.

Algorithm 1 Depth First Algorithm

Input: n racks with rack i having di amount of data and k optical links.
Input: LB = ⌈∑_{i=1}^{n} di / k⌉.
Output: k groups such that max{Dj} = LB.

1: g = 1; i = 1.
2: Dj = 0, 1 ≤ j ≤ k.
3: while i ≤ n do
4:   if Dg + di > LB then
5:     Assign rack i to group g.
6:     Group g gets LB − Dg of rack i's data.
7:     di = di − (LB − Dg). Dg = LB. g = g + 1.
8:   else
9:     Assign rack i and all of di to group g.
10:    Dg = Dg + di. i = i + 1.
11:    if Dg = LB then
12:      g = g + 1.
13:    end if
14:  end if
15: end while

Fig. 2: Topologies for lower bound on aggregation time

In Algorithm 1, which we shall refer to as Algorithm DF, we first place racks into group 1 until D1 equals LB. This may require us to split the data in the last considered rack. Then we move on to place racks in the next group and so on until all racks and their data have been placed. As an example, consider the case when n = 5, d1:5 = (3, 8, 4, 7, 2), and k = 3. Now, LB = 8. Racks 1 and 2 are first assigned to group 1, with rack 2's data being split so that group 1 is to handle 5 units of rack 2's data. This makes D1 = 8. Next we assign racks to group 2 beginning with rack 2, which has 3 units of data remaining. Racks 3 and 4 are also assigned to group 2, with rack 4's data split so that group 2 is assigned 1 unit of this data, making D2 = 8. Finally, racks 4 (with its remaining 6 units of data) and 5 are assigned to group 3, making D3 = 8. Figure 2(b) shows a possible aggregation tree for this grouping. In this tree, each subtree of the aggregator is a chain and the aggregation time is LB. Notice that this aggregation topology may be realized when k = 3 as no rack requires more than 3 links.

Correctness of Algorithm DF: It is easy to see that max{Dj} = LB. We need also show that each of the k groups of racks can be organized into an aggregation subtree using at most k optical links per rack. It is sufficient to show that we can link each group into a chain as in Figure 2(b). First, note that if a rack has not been split, it needs at most 2 links (one to link to its parent, which may be the aggregator, and another to link to its child (if any)). If a rack is split k − 1 times (i.e., it is placed in all k groups), then it is the sole rack in at least k − 2 of the groups (the first and last groups are the only possible exceptions). Also, in this case, the split rack is the only rack that is split in groups 1 and k. If we make this rack the last rack (leaf rack) in each subtree chain, this rack uses k links while the other source racks use 2 links each, and the aggregator uses k links. On the other hand, if a rack is split s < k − 1 times, it can be made the leaf rack in the first s − 1 of these groups (note that the rack is the sole rack in all but the first and last group to which it is assigned), and a non-leaf rack in the last one (assuming the last group has more than 1 rack), using (s − 1 + 2) ≤ k links.

Complexity of Algorithm DF: The while loop is iterated n + S times, where S is the total number of splits. It is easy to see that S < k. Further, we may assume that k ≤ n. Hence the number of iterations is O(n). Each iteration of the while loop takes O(1) time. So, the overall complexity of Algorithm DF is O(n).
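A minimal Python sketch of Algorithm DF as described above; the grouping representation (lists of (rack index, amount) pairs) and the function name are ours, not the paper's:

```python
import math

def depth_first_grouping(d, k):
    """Algorithm DF (sketch): fill groups one at a time up to
    LB = ceil(sum(d)/k), splitting a rack when it would overflow the
    current group. Returns (groups, loads, LB)."""
    lb = math.ceil(sum(d) / k)
    groups = [[] for _ in range(k)]
    loads = [0] * k
    g = 0
    for i, di in enumerate(d):
        remaining = di
        while remaining > 0:
            take = min(lb - loads[g], remaining)
            groups[g].append((i, take))
            loads[g] += take
            remaining -= take
            if loads[g] == lb and g < k - 1:
                g += 1                     # current group is full; move on
    return groups, loads, lb

# The example from the text: d1:5 = (3, 8, 4, 7, 2), k = 3 gives LB = 8,
# loads (8, 8, 8), and racks 2 and 4 split across two groups each.
print(depth_first_grouping([3, 8, 4, 7, 2], 3))
```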

In Algorithm 2, which we refer to as Algorithm G, the selection of a group to place a rack is done using the greedy criterion "place the rack in the group that has the smallest amount of data assigned so far". This criterion seeks to minimize the maximum Dj.


Algorithm 2 Greedy Algorithm

Input: n racks with rack i having di amount of data and k optical links.
Input: LB = ⌈∑_{i=1}^{n} di / k⌉.
Output: k groups such that max{Dj} = LB.

1: Dj = 0, 1 ≤ j ≤ k.
2: i = 1.
3: while i ≤ n do
4:   Let g be such that Dg = min{Dj}.
5:   if Dg + di > LB then
6:     Assign rack i to group g.
7:     Group g gets LB − Dg of rack i's data.
8:     di = di − (LB − Dg). Dg = LB.
9:   else
10:    Assign rack i and all of di to group g.
11:    Dg = Dg + di. i = i + 1.
12:  end if
13: end while

The algorithm begins with Dj = 0 for all j, which represents the state when no racks have been assigned. If the assignment of a rack causes the total data assigned to a group to exceed LB, only part of the data associated with this rack is assigned and the process is repeated to complete the assignment of the remaining data.

Consider the n = 5 and k = 3 instance used to illustrate algorithm DF. The first rack is assigned to group 1, making D1 = 3. The second rack is assigned to group 2, making D2 = 8, and the third rack is assigned to group 3, making D3 = 4. The fourth rack is then split between groups 1 and 3, with group 1 getting 5 units of its data and group 3 getting 2. Now, D1 = D2 = 8 and D3 = 6. Finally, rack 5 is assigned to group 3, making D3 = 8. Figure 2(c) shows a possible aggregation tree for this grouping; the aggregation time is LB and the aggregation topology may be realized when k = 3 as no rack requires more than 3 links.

Correctness of Algorithm G: As was the case for algorithm DF, it is easy to see that max{Dj} = LB, and so we need also show that each of the k groups of racks can be organized into an aggregation subtree using at most k optical links per rack. For this, it is sufficient to show that we can link each group into a chain. The proof is similar to that for algorithm DF. Unlike the groups constructed by algorithm DF, which have at most 2 split racks each, a group of algorithm G may contain more than 2 split racks. Suppose the data in rack R is assigned to s groups (i.e., the rack is split s − 1 times). In the first s − 1 groups, rack R is the last assigned rack and can be made a leaf (as each non-empty group has exactly one last assigned rack and its aggregation tree has at least one leaf). If we assign one link to R in each of the first s − 1 groups to which it is assigned and 2 in the last group to which it is assigned (except in the case when s = k, when R gets 1 link in the last group as well), R has enough links in each group to be part of that group's aggregation tree.

Complexity of Algorithm G: For the complexity, we see that the while loop is iterated n + S = O(n) times, where S is the total number of splits. In each iteration of the while loop we need to find the minimum Dj. This takes O(log k) time if we store the Dj in a priority queue structure such as a min heap [1]. So, the overall complexity of Algorithm G is O(n log k).
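A corresponding sketch of Algorithm G, using Python's heapq module as the min-heap over group loads (again, the data structures are our own):

```python
import heapq
import math

def greedy_grouping(d, k):
    """Algorithm G (sketch): place the current rack in the group with the
    smallest load, splitting it whenever the group would exceed LB."""
    lb = math.ceil(sum(d) / k)
    groups = [[] for _ in range(k)]
    heap = [(0, j) for j in range(k)]      # (load, group index)
    heapq.heapify(heap)
    for i, di in enumerate(d):
        remaining = di
        while remaining > 0:
            load, g = heapq.heappop(heap)  # group with least data so far
            take = min(lb - load, remaining)
            groups[g].append((i, take))
            remaining -= take
            heapq.heappush(heap, (load + take, g))
    return groups, lb

# The example from the text: the fourth rack (d = 7) is split between
# groups 1 and 3, giving D1 = D2 = D3 = 8.
print(greedy_grouping([3, 8, 4, 7, 2], 3))
```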

Note that the proposed algorithms DF and G result in an aggregation time equal to the lower bound LB. Wang's algorithm [2] and LPT, for the case when splitting is not permitted, result in a worst-case aggregation time of (k + 1)/2 × LB and (4/3 − 1/(3k)) × LB (assuming LB > max{di}), respectively. Thus, relative to these algorithms, splitting can reduce the aggregation time by a factor of up to (k + 1)/2 and (4/3 − 1/(3k)), respectively. Note that, when LB < max{di}, the factor can be as high as k, as mentioned earlier. As an example, for k = 2, the percentage reduction is up to 50% relative to Wang's algorithm and up to 16.67% relative to LPT. These bounds increase to approximately 450% and 30%, respectively, for k = 10.

IV. HEURISTICS AND ALGORITHMS FOR SANTOS-NT

Fig. 3: Example aggregation tree topologies

Many aggregation trees are possible for a grouping of racks constructed by either of the algorithms DF and G. Figures 2(b) and (c) gave possible aggregation trees for the groupings constructed by each of these algorithms for an example data set. Figure 3 gives another possible aggregation tree for each of these groupings. While all example trees have the same aggregation time, they result in different total network traffic. The network traffic for the trees of Figures 2(b) and (c) is 37 and 35, respectively, while for the trees of Figure 3, the total network traffic is 34 and 33, respectively.

In this section, we consider the problem of constructing aggregation trees, for a given grouping of racks, that minimize total network traffic. We note that, in [3], we showed how to do this when all racks in a group have the same number of links available. The traffic minimization algorithm of [3] constructs a tree in which every non-leaf node (except possibly the last) has k − 1 children and then assigns the racks to the nodes of this tree in decreasing order of di by levels, top to bottom. This algorithm cannot be applied directly to the groupings obtained by algorithms DF and G because there is some flexibility in distributing the k links available to a split rack across the groups in which that rack appears. For example, if k = 8 and rack 5 is placed in 3 groups X, Y, and Z, then every set of positive integers x, y, z such that x + y + z = 8 defines a plausible assignment of links to rack 5 in these groups, respectively. For almost all of these assignments aggregation trees can be constructed for X, Y, and Z and it is difficult to determine a priori which assignment will minimize network traffic.

A. The Heuristics BySize and ByDegree

Suppose that a rack is placed in q ≥ 1 groups by either DF or G. This rack is assigned 1 optical link in each of the first (q − 1) groups and (k − (q − 1)) links in the last group. From the correctness proof, it follows that this link assignment makes it possible to organize all the racks in a group into an aggregation tree using no more links than what is assigned to the rack in this group. We propose two simple heuristics, BySize and ByDegree, that attempt to minimize total network traffic using this link assignment scheme. This results in 4 algorithms: DFSize, DFDeg, GSize and GDeg. Both heuristics construct the k subtrees of the overall aggregation tree one group at a time. In BySize, we sort the racks (with the exception of the rack, if any, that has been assigned only 1 link) in a group into decreasing order of their data size. The first rack is made the root of the aggregation subtree for this group. If this rack has been assigned q links (note that q > 1 except possibly when the group has only 1 rack), the next q − 1 racks are made its children, and so on. The rack with 1 link (if any) is made a child of a rack closest to the root that has a non-utilized link. We note that when all racks have the same number of links, BySize constructs the same aggregation subtrees as does the traffic minimization algorithm of [3]. The complexity of BySize is O(n log n), because of the need to sort the racks by data size.

In ByDegree we maximize the number of nodes at each level of an aggregation subtree. For this, we sort the racks in a group (with the exception of the rack, if any, that has 1 link) into decreasing order of their number of links; racks with the same number of links are sorted in decreasing order of their data size. The construction of the aggregation subtree then follows the same process as used in BySize. The complexity of ByDegree is also O(n log n).
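The subtree construction shared by the two heuristics can be sketched as follows, under our reading of the description above: racks are attached breadth-first in the chosen order, each accepting children up to its assigned link count minus one (one link is reserved for its parent or, at the subtree root, for the aggregator), and single-link racks, if any, are attached last to the shallowest rack with a spare link. The (id, data, links) encoding and names are ours; ByDegree would differ only in the sort key.

```python
from collections import deque

def build_subtree(racks):
    """racks: list of (rack_id, data, links) for one group, for BySize.
    Returns parent[rack_id] (None for the subtree root, whose parent is
    the aggregator). Assumes the group's link assignment is feasible."""
    one_link = [r for r in racks if r[2] == 1]
    rest = sorted((r for r in racks if r[2] > 1),
                  key=lambda r: r[1], reverse=True)   # decreasing data size
    order = rest + one_link                           # 1-link racks go last
    root_id, _, root_links = order[0]
    parent = {root_id: None}
    # FIFO of (rack_id, spare_links): the front is always the unfilled rack
    # closest to the root, so racks are placed level by level.
    avail = deque([(root_id, root_links - 1)])
    for rid, _, links in order[1:]:
        pid, spare = avail.popleft()
        parent[rid] = pid
        if spare > 1:
            avail.appendleft((pid, spare - 1))        # keep filling this rack
        if links > 1:
            avail.append((rid, links - 1))
    return parent
```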

B. Minimizing Network Traffic For Decreasing Order DF

We are able to minimize total network traffic for the grouping that results when algorithm DF processes racks in decreasing order of di. Let DFD be the resulting grouping algorithm. The complexity of DFD is O(n log n) because of the need to sort the racks prior to running DF. The grouping that results from DFD has valuable properties that are stated in the following lemma.

Lemma 1. Suppose that rack i is placed in groups G1, · · · , Gq, in this order, by DFD. Let dij be the amount of data for rack i in group Gj. Note that ∑_j dij = di.

1) Rack i is the last rack assigned to Gj, 1 ≤ j < q.
2) dij is the smallest data amount associated with any rack in Gj, except possibly the first rack assigned to Gj, 1 ≤ j < q.

Proof: Follows from the definition of DF and the fact that the racks are assigned in decreasing order of di.

Using Lemma 1, the fact that the groupings constructed by algorithm DFD have at most two split racks in each group with one of these split racks having the least or second least data in the group, and the traffic minimization proof of [3], we arrive at the following observations regarding an aggregation tree that has minimum total traffic for the given grouping. Let Tmin(G) denote this minimum total traffic subtree. Let Fi be the first rack assigned to Gi and Li the last. Note that since a rack may be assigned to several groups, it is possible that Li and Fi+1 represent the same rack. If Li is rack R and Li's total data is dR, then dLi is the amount of dR that is assigned to group Gi by DFD.

1) If Li occurs in only 1 group, there is a Tmin in which Li is assigned k links. However, when Li occurs in more than one group, we need to decide how many links to allocate to it in Gi. For each of the following cases/observations, we assume Li occurs in more than 1 group.

2) When k = 2, the total number of groups is 2 and i = 1 (as only the last rack assigned to G1 may be split by DFD). In this case, we have no option but to assign Li 1 link in G1 and 1 in G2. Tmin is obtained by ordering the racks in G1 in decreasing order of assigned data top to bottom (Li becomes a leaf) and G2 is similarly ordered except that F2 must be the leaf in G2's aggregation subtree. In the following cases, we assume k > 2.

3) If dFi ≥ dLi, it is sufficient to assign Li 1 link, as Li has least data in Gi and there is an optimal topology (i.e., a subtree with minimum total traffic) for Gi (and hence, a Tmin) in which Li is a leaf. So, we may optimize Gi's topology independent of the remaining groups. For the remaining cases, assume dFi < dLi. Now, i > 1 as dF1 ≥ dL1 in DFD groupings.

4) If |Gi| > 2, then there is a rack C in Gi such that dC ≥ dLi > dFi. So, there is an optimal topology for Gi in which Li is a leaf (Fi will also be a leaf). So, it is sufficient to assign Li 1 link, which is the minimum number of links assignable to Li. Again, we may optimize Gi's topology independent of the remaining groups.

5) When |Gi| = 2, Gi's optimal topology has Li as root and so Li needs 2 links. Let Gj be the last group to which the rack Li is assigned. So, Li = Fj. Groups i+1, · · · , j−1 have only one rack, which is the same as rack Li. In each of these groups, we may assign the rack 1 link; this is sufficient for optimality of each group's topology and is also the minimum number of links that can be assigned. When k = 3, i = 2 and j = 3, and there is a Tmin in which either L2 has 2 links and F3 has 1 or L2 has one link and F3 has 2. Let Tmin1 be the optimal topology for the first assignment and Tmin2 that for the second. Each Tmini is obtained by using the optimal topology for the individual Gi's while restricted to the specified link assignment. Tmin is the Tmini with the smaller total traffic. For the following cases, k > 3.

6) If |Gj| > 2, it is sufficient to assign Lj 1 link as there is an optimal topology for Gj (as Gj has a rack C with dC ≥ dLj) in which Lj is a leaf. So, to minimize the total traffic in Gi through Gj, we have only 2 pairs of link assignments for Li and Fj (Li gets 1 or 2 links and Fj gets k − (j − i − 1) − x, where x is the number of links assigned to Li) to check. Note again that we optimize the topology for Gi, · · · , Gj independent of the remaining groups and that, to obtain the optimal topology for Gi, · · · , Gj, we need to compute the optimal topology for Gi for 2 link assignments and that for Gj for 2 link assignments. The optimal topology for each of Gi+1, · · · , Gj−1 is trivially obtained as each has only 1 rack.

7) If |Gj| = 2 and dFj ≥ dLj, it is sufficient to assign Lj 1 link (as there is an optimal topology for Gj in which Lj is a leaf) and we have two pairs of assignments for Li and Fj to check. Again, we can compute the optimal topology for Gi, · · · , Gj independent of the remaining groups as in the preceding case.

8) Suppose |Gj| = 2 and dFj < dLj. If j = i + 1, it is sufficient to assign 2 links to each of Li and Fj (note that this is possible as k > 3) and decouple the optimization of Gi and Gj from each other. So, assume j > i + 1. Let q = j − i > 1. The total data ∑_{u=i}^{j−1} dLu + dFj in rack Li is more than dLi + (q − 1)LB (LB is the lower bound on aggregation time, and groups Gu, i < u < j, have LB amount of data assigned, all from rack Li), which in turn is more than (q − 0.5)LB (as dFi < dLi, total data in Gi is LB, and |Gi| = 2). Since there is at least one rack (i.e., Fi) that has at least as much total data as does Li, and dFi < 0.5LB, the rest of the data in rack Fi must have been assigned to at least q groups with index less than i. Hence, i > q > 1 and so i ≥ 3. If we assign 2 links to Li, 2 to Fj (recall that Li and Fj are the same rack), and 1 to R = Li in each of Gi+1 to Gj−1, the total number of links assigned is j − i + 3 ≤ k − i + 3 ≤ k. So, we can decouple the optimization of Gi and Gj from each other.

From the above observations, we see that Tmin can be determined by computing at most 2 optimal topologies for each group independently (each computation uses a different link assignment for one of the racks). From the at most 2k optimal topologies computed, Tmin can be determined in O(k) time.

We now turn our attention to computing an optimal topology for a group of racks given the number of links assigned to each. When |Gi| ≤ 2, its optimal topology is easily computed. So, assume |Gi| > 2. In this case, all racks other than Fi and Li have k links. If Fi also has k links, the optimal topology may be computed using the algorithm of [3]. A slight modification of the algorithm of [3] may be applied when Fi has only 1 link, as now Fi and Li must be leaves. So, assume that Fi has fewer than k links but more than 1. The number of links assigned to Li is not important as Li will be a leaf and needs just 1 link.

Let U be an infinite tree with the following properties:

1) One node of U is called the designated node.
2) Every node other than the designated node has exactly k − 1 children.
3) The designated node has q − 1 children, where q is the number of links assigned to Fi.

Note that there are many infinite trees U that satisfy the above 3 properties. A feasible tree for Gi is any tree with |Gi| nodes that is derived from a U by retaining the root of U, the designated node, and |Gi| − 2 other nodes. Every feasible tree has the property that the racks of Gi can be assigned to its nodes so that the degree of each node is no more than the number of links assigned to the rack assigned to it.

Figure 4(a) shows a U with k = 4 and a designated node N. Figure 4(b) shows a possible DFD group Gi with 2 links assigned to Fi. Figures 4(c) and (d) show two feasible trees for Gi. A node label X(Y) indicates that rack X in Gi is placed in node Y of U.

Fig. 4: Feasible and canonical trees for Gi with k = 4

Lemma 2. Let T be a feasible tree for Gi. The total network traffic using T is minimized by assigning the racks (other than Fi) in Gi in decreasing order of their assigned data size to the nodes of T top to bottom; rack Fi is assigned to the designated node.

Proof: Follows from the proof used in [3] for the optimality of assigning racks in decreasing order of their data size.

Let V be an infinite tree that, in addition to satisfying all the properties of U, satisfies the following property:

Property 1. When we number the nodes in U by levels from top to bottom, and within a level from left to right, beginning with the number 1, the number assigned to the designated node is ≤ |Gi|.

Note that the tree of Figure 4(a) satisfies this property with respect to the Gi of Figure 4(b) and so this tree is an example of both U and V.

A canonical aggregation tree for Gi is constructed from V by retaining those nodes that are numbered 1 through |Gi|. Clearly, there are several canonical trees for Gi. It is easy to see that every canonical tree is a feasible tree and that there exist feasible trees that are not canonical. The tree of Figure 4(d) is a canonical tree for Gi while that of Figure 4(c) is a non-canonical feasible tree for Gi.

Let A(j) denote a tree A in which the designated node is at level j and let N(A(j), l) be the number of nodes at level l of tree A(j).

Lemma 3.
1) Let A(j) and B(j) be canonical trees for Gi (in both, the designated node is at level j). Then N(A(j), q) = N(B(j), q), q ≥ 1.
2) Let A(l) and B(j), l < j, be canonical trees for Gi (note that the designated node is closer to the root in A(l) than in B(j)). Then ∑_{q=1}^{s} N(A(l), q) ≤ ∑_{q=1}^{s} N(B(j), q), s ≥ 1.
3) Let A(j) and B(j), respectively, be a feasible and a canonical tree for Gi. Then ∑_{q=1}^{s} N(A(j), q) ≤ ∑_{q=1}^{s} N(B(j), q), s ≥ 1.


Proof: Follows from the definitions of feasible and canonical trees and the fact that the degree (i.e., number of children) of the designated node is less than k − 1.

Lemma 4. Let A(j) and B(j) be two canonical trees for Gi. The minimum network traffic using A(j) and B(j) is the same.

Proof: Since the number of nodes at each level of A(j) is the same as that at each level of B(j) (Lemma 3), the optimal rack-to-node assignment used in the proof of Lemma 2 assigns each rack to the same level in A(j) as in B(j). Hence, both optimal assignments yield the same total network traffic.

Lemma 5. Let G′i be Gi with the number of links assigned to each rack being k. Suppose that Fi is placed at level l in the minimum traffic tree constructed by the algorithm of [3] for G′i. There is a feasible tree for Gi in which the designated node is at a level ≥ l and this feasible tree minimizes network traffic for Gi.

Proof: Let A(j), j < l, be a feasible tree for Gi. Let B(l) be a minimum traffic tree for G′i (the racks are assigned in decreasing order of data size to nodes in a tree in which each node, except the last few, has k − 1 children [3]). Assume that the rack-to-node assignment in both A(j) and B(l) is done using the same rack ordering and that rack Li is the last rack, except when dFi < dLi, in which case Li is the next to last rack (this specification is necessary to resolve ties in data size among racks). Since Fi has fewer than k links, A(j) must have a node M at a level q, q > j, such that dr ≥ dFi, where r is the rack assigned to node M. Obtain A′(q) from A(j) by swapping M and the designated node (the subtrees of these nodes are not relocated by this swap). Further, if node M has p non-empty subtrees in A(j) and the designated node has s < p non-empty subtrees, then p − s of the subtrees of M are relocated from level q to level j as well. This swapping of racks and relocating of subtrees to obtain A′(q) ensures that A′(q) is a feasible tree for Gi. Note also that this swapping and relocation does not increase total traffic. In fact, the change in total traffic is (dFi − (dr + sum of data amounts in the relocated subtrees (if any))) ∗ (q − j) ≤ 0. In this way the designated node may be moved progressively to a level ≥ l, ensuring a feasible tree with the minimum network traffic for Gi in which the designated node is at a level ≥ l.

Lemma 6. Let A(j) be a feasible tree and B(j) a canonical tree for Gi. The minimum network traffic using A(j) is ≥ that using B(j).

Proof: The minimum network traffic using either A(j) or B(j) is obtained using the rack-to-node assignment used in Lemma 2. Since ∑_{q=1}^{s} N(A(j), q) ≤ ∑_{q=1}^{s} N(B(j), q) for s ≥ 1 (Lemma 3), each rack is assigned to a node at the same or greater distance from the root in A(j) than in B(j).

Theorem 1. There is a canonical tree A(j) for Gi, with j ≥ l, that minimizes the network traffic for Gi, where l is the level at which rack Fi is placed in the minimum traffic tree for Gi constructed using the algorithm of [3] under the assumption that each rack of Gi has k links.

Proof: Follows from Lemmas 2–6.

Complexity Analysis. Let ni = |Gi|. Note that the racks in Gi, except possibly Fi, are assigned to Gi in decreasing order of their data size and so there is no need to do another sorting of racks to implement the rack-to-node assignment strategy of Lemma 2.

When k = 2, a minimum traffic tree for Gi is easily constructed in O(ni) time. So, the minimum traffic aggregation tree may be constructed from G1 and G2 in O(n) time.

When k > 2, we can compute l, as defined in Lemma 5, in O(ni) time using the algorithm of [3]. A canonical tree A(j) may be constructed in O(ni) time. The minimum traffic using A(j) may be determined in O(ni) time using the rack-to-node assignment strategy of Lemma 2. Since the height of every canonical tree for Gi is O(log ni), the minimum traffic (canonical) tree for Gi may be determined in O(ni log ni) time. Hence, the at most 2k minimum canonical trees needed for all k groups may be determined in O(n log n) time. From these, the minimum traffic aggregation tree for the k groups can be found in O(k) time. So, the minimum aggregation tree for the groups Gi, 1 ≤ i ≤ k, may be determined in O(n log n) time. This time may be reduced to O(n) by noting that the minimum traffic for Gi can be determined in O(ni) time rather than in O(ni log ni) time. For this, we compute the minimum traffic for canonical trees A(j), j ≥ l, in increasing order of j. The computation for A(l) takes O(ni) time. For A(j), j > l, we first compute the prefix sums of the decreasing-order sequence of data sizes in G′i (G′i is obtained from Gi by removing Fi). Using these prefix sums, the minimum traffic for A(j) can be determined from that for A(j − 1) in O(log ni) time (see the following example). Since O(log ni) values of j need to be tried, the minimum traffic tree for Gi is computed in O(ni + log² ni) = O(ni) time. The time to do this computation for all k groups is therefore O(n).

Example 1. Consider an instance of the problem where n = 17, d1:17 = (24, 1 × 16), and k = 4. DFD forms the following groups: (10), (10), (4, 1 × 6), (1 × 10). Let us look at Group 3, which is shown in Figure 5(a). Rack d1 = F3 is split across groups 1 and 2, so it has (k − 2) = 2 of its links left for this group. The total traffic for this topology is 23. If we move d1 down to the next level, and make the necessary adjustments to the topology, we obtain the tree in Figure 5(b). We find that the total traffic for Figure 5(b) reduces to 22.

As we move F3 from level 1 to 2, we examine each level v ≥ 1 in the group to see what adjustments are to be made to the topology. We notice that level v loses a set of consecutive racks, say de:f, where e ≤ f, to level (v − 1), and these racks account for a reduction in the traffic, because each of these racks is placed one level closer to the aggregator in the new topology. If we look at Figure 5(b), levels 2, 3 and 4 lose racks d2:2, d3:4 and d6:7, respectively. The reduction in total traffic at each level v can then be computed as ∆tv = Pf − Pe−1, where Pr = ∑_{j=1}^{r} dj are the prefix sums of the decreasing-order sequence of data sizes in G′3 (G′3 is obtained from G3 by removing F3). At the same time, since F3 is moved one level down, it accounts for an increase in the total traffic by a similar argument. So, the total change in traffic when F3 is moved down from level 1 to level 2 is (F3's data in this group) − ∑_{v=2}^{h} ∆tv = 4 − 5 = −1, where h is the height of the subtree. In this way we can calculate the total traffic in the subtree when F3 is moved further down, and we place F3 at the level that minimizes the total traffic.
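The search just described can be sketched as follows (our own code and naming, not the paper's): for each candidate level j for Fi, build the per-level node counts of the corresponding canonical tree, assign the remaining racks in decreasing order of data size by levels (Lemma 2), and keep the level with the smallest total traffic; prefix sums over the sorted data give each level's contribution. On Group 3 of Example 1 it reproduces the traffic values 23 (Fi at level 1) and 22 (Fi at level 2).

```python
from itertools import accumulate

def level_counts(n_nodes, k, q, j):
    """Nodes per level (level 1 = subtree root) of a canonical tree with
    n_nodes nodes in which the designated node has q - 1 children, every
    other node has k - 1 children, and the designated node is at level j."""
    counts, width, level, placed = [], 1, 1, 0
    while placed < n_nodes:
        take = min(width, n_nodes - placed)
        counts.append(take)
        placed += take
        # capacity of the next level in the underlying infinite tree V
        width = width * (k - 1) - ((k - q) if level == j else 0)
        level += 1
    return counts

def traffic_for_level(rest_desc, f_data, k, q, j):
    """Total traffic of canonical tree A(j): Fi (carrying f_data) occupies
    the designated node at level j; the other racks, given in decreasing
    order of data size (rest_desc), fill the remaining slots top to bottom
    (Lemma 2). Prefix sums give each level's contribution."""
    prefix = [0] + list(accumulate(rest_desc))
    counts = level_counts(len(rest_desc) + 1, k, q, j)
    traffic, idx = f_data * j, 0
    for level, c in enumerate(counts, start=1):
        slots = c - (1 if level == j else 0)     # one slot at level j is Fi's
        traffic += (prefix[idx + slots] - prefix[idx]) * level
        idx += slots
    return traffic

# Group 3 of Example 1: F3 carries 4 units here and has q = 2 links;
# G'_3 is six racks of 1 unit each.
rest, f_data, k, q = [1] * 6, 4, 4, 2
print([traffic_for_level(rest, f_data, k, q, j) for j in (1, 2, 3)])
# -> [23, 22, 25]; the values 23 and 22 match Figures 5(a) and 5(b).
```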

Fig. 5: Group 3 for Example 1

Regardless of which strategy is used to determine the minimum traffic aggregation tree for a given set of Gi's, the overall complexity of our DFD algorithm for SANTOS-NT is O(n log n), as DFD takes this much time to determine the Gi's that minimize aggregation time.

C. The Hybrid Algorithm GreedyDFD

The traffic minimization algorithm of Section IV-B does not generalize to a corresponding version GD (greedy decreasing order) of the minimum aggregation time grouping algorithm G of Section III, because a group constructed by algorithm GD may contain more than two split racks. Intuitively, we expect algorithm GD to construct groupings that result in less traffic, as this algorithm distributes the racks with large data amounts across groups rather than clustering these racks into a single group as is done by algorithm DFD. To take advantage of this property of GD and also be able to construct minimum traffic trees for the obtained groupings, we propose a hybrid algorithm GDFD (Greedy-DF-Decreasing order) in which algorithm G is run until the first time any rack is split. This splitting completes one of the groups being constructed (the total data assigned to this group is LB). The remaining racks are assigned to the remaining k − 1 groups using algorithm DF. In GDFD, racks are assigned in decreasing order of their data size.

Since the groups generated by GDFD have the same properties as do those generated by DFD, Lemma 1 applies to the GDFD groups (G1 is the first group to fill, G2 the next, and so on). As a result, Theorem 1 applies and we may use the same algorithm as used for DFD to construct a minimum traffic aggregation tree for the groupings of GDFD.

V. EXPERIMENTS

We performed experiments to assess the expected improvement in aggregation time using data splitting compared to using the LPT algorithm of [3], which does not do data splitting. We also measured the change in total network traffic resulting from the heuristics and algorithms of Section IV. Our experiments are modeled along the lines of the experiments we conducted in [3].

Fig. 6: Average percentage change in aggregation time: LPT vs. DFD, n = 10000

The distribution of the amount of data available at each rack may have wide variations depending on the specific application. We can have a fixed amount of data to aggregate at each rack if, say, we are scanning a big text corpus distributed in the network and returning frequencies of a set of keywords for each document. On the other hand, if we are executing a search query on the same text corpus and returning the documents matching the keywords, the amount of data to be sent to the aggregator from each rack may vary to a large extent. So, in order to assess the performance of our algorithms in such real scenarios, we use the following data sets:

• Zipfian 1. The di's are drawn from a Zipfian distribution with parameter 2.

• Zipfian 2. The di's are drawn from a Zipfian distribution with parameter 2.5.

• Zipfian 3. The di's are drawn from a Zipfian distribution with parameter 3.

• Uniform. The di's are drawn from a uniform distribution with values in [50, 100].

The parameter for the Zipfian distribution denotes the value of the exponent for the power-law distribution. The Zipfian distribution models the case when racks have widely varying amounts of data to return. We use a link bandwidth of 1 so that the data transmission time equals the amount of data transmitted through a link. This does not affect our results as we analyze the percentage change in the values of aggregation time and total network traffic. We have used n = 1000 and 10000 racks and the number, k, of optical links per rack is varied between 2 and 100 in our experiments. For each choice of k and data distribution, we run the algorithms on 10 randomly generated instances. For the heuristics DFSize, DFDeg, GSize and GDeg, we use 10 random orderings of the racks for each k and data distribution. Our experiments were conducted on a 64-bit PC with a 2.80 GHz AMD Athlon(tm) II X2 B22 processor and 8GB RAM.
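The following is a minimal sketch of how such data sets might be generated; the paper only names the distributions and their parameters, so the use of NumPy's generators here is our assumption.

```python
import numpy as np

def make_dataset(name, n, seed=0):
    """Sketch of the data-set generation used in the experiments
    (our assumption; only the distributions and parameters are given)."""
    rng = np.random.default_rng(seed)
    if name == "zipfian1":
        return rng.zipf(2.0, n)            # power-law exponent 2
    if name == "zipfian2":
        return rng.zipf(2.5, n)
    if name == "zipfian3":
        return rng.zipf(3.0, n)
    if name == "uniform":
        return rng.uniform(50, 100, n)     # values in [50, 100]
    raise ValueError(f"unknown data set: {name}")

d = make_dataset("zipfian1", 10000)        # one instance with n = 10000 racks
```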

A. Aggregation Time

Figures 6 and 7 show the average change in aggregation time when DFD is used (note that all our splitting algorithms give the same optimal aggregation time), compared to LPT, for 10,000 and 1,000 rack data centers, respectively. With 10,000 racks, the average reduction in aggregation time for any k using splitting was up to 90%, 40%, and 2.5% for the three Zipfian data sets. On uniform data sets, the average reduction in aggregation time for any k was at most 0.5%, an indicator that LPT does near-optimal partitioning on uniform data sets.


Fig. 7: Average percentage change in aggregation time: LPT vs. DFD, n = 1000

Fig. 8: Average percentage change in total network traffic: n = 10000

The maximum percentage reduction in aggregation time for any instance of each of our 4 data sets was 98%, 78%, 13% and 0.55%, respectively, and the standard deviation in the aggregation times for the 4 data sets was less than 29%, 33%, 4.9% and 0.048%, respectively.

With n = 1000 racks, the average reduction in aggregation time (relative to LPT) for any k was up to 90%, 70% and 46% for the three Zipfian data sets and up to 4.5% for the Uniform data set. The corresponding maximum percentages are 99%, 90%, 83% and 5.1% and the corresponding standard deviations were less than 28%, 30%, 30% and 0.09%, respectively.

Our experiments indicate that the benefits of splitting increase as k increases and decrease as n increases. These results are not entirely surprising, as splitting can reduce aggregation time by up to a factor of k and, as n increases, the likelihood that there is a perfect partition without splitting increases.

B. Total Network Traffic

When racks are split, the aggregation tree has more nodes and hence may have more levels than when no rack is split. This increase in the number of nodes has the potential to increase the total network traffic, as the number of hops to the aggregator may increase for some racks (relative to the optimal traffic LPT aggregation tree of [3]). Our experiments show that this indeed is the case. Figures 8 and 9 show the average change in total traffic (relative to the optimal traffic LPT aggregation tree) for the heuristics and algorithms developed in Section IV. For the Zipfian 1 data set with n = 10000 (Figure 8(a)), DFD, GDFD, DFSize, DFDeg, GSize and GDeg result in an average increase in total traffic of about 48%, 15%, 43%, 43%, 55% and 55%, respectively. The maximum increase is 60%, 35%, 85%, 85%, 85% and 85%, respectively. The standard deviation in the total traffic is less than 6%, 10%, 29%, 29%, 25% and 25%, respectively. The results for the other two Zipfian data sets are similar (Figures 8(b) and (c)). For the Uniform data set (Figure 8(d)), the average increase in total traffic is 5.5% for DFD and 0.5% for each of DFSize, DFDeg, GSize and GDeg. GDFD shows no increase in traffic relative to LPT for the Uniform data set. The standard deviation for the Uniform data set was at most 0.03%.

For the smaller data center with 1000 racks the experimental results are similar to those for the 10,000 rack data center. For the Zipfian 1 data set (Figure 9(a)), DFD, GDFD, DFSize, DFDeg, GSize and GDeg have an average maximum increase in total traffic for any k of about 48%, 15%, 50%, 50%, 47% and 47%, respectively. The maximum percentages for all the algorithms are nearly 98% for k = 2, but reduce to 17%, 17%, 35%, 36%, 66% and 66%, respectively, for larger k. The standard deviation is less than 18%, 29%, 25%, 25%, 25% and 25%, respectively. The results for the other two Zipfian data sets are similar (Figures 9(b) and (c)). For the Uniform data set (Figure 9(d)), the average increase in total traffic is 5.75% for DFD, 3% for each of DFSize and DFDeg, and 2% for each of GSize and GDeg, relative to LPT. GDFD again has the same total traffic as does LPT and the standard deviation is about 0.2%.

Overall, GDFD results in the best solutions to SANTOS-NT.

VI. CONCLUSION

We have developed two classes of algorithms for SANTOS, DF and G, which take O(n) and O(n log k) time, respectively, where k ≥ 2 is the number of optical links per rack and n is the number of source racks. When these algorithms are extended to require that racks be considered in sorted order, their complexity becomes O(n log n). O(n log n) heuristics to reorganize the nodes in any solution to SANTOS in an attempt to minimize total network traffic subject to minimizing the aggregation time were also developed. Further, two classes of solutions to SANTOS (i.e., DFD and GDFD) for which total network traffic can be minimized in O(n) time were identified. Since it takes O(n log n) time to generate the initial SANTOS solution, which is then reorganized in O(n) time to minimize total network traffic, the overall complexity of these algorithms for SANTOS-NT is O(n log n). Through extensive experiments, we demonstrate the benefit of permitting racks to use multiple data paths to the aggregator (i.e., to split their data). The aggregation time reduces by up to 99% using data splitting. This reduction in aggregation time comes at the cost of increased network traffic. We establish the superiority of our hybrid greedy algorithm GDFD for SANTOS-NT over our other proposed SANTOS-NT algorithms with respect to total network traffic. GDFD limits the increase in network traffic to a maximum of 35% when racks have more than 2 links each.

Fig. 9: Average percentage change in total network traffic: n = 1000

ACKNOWLEDGMENT

This research was supported, in part, by the National Science Foundation under grants CNS-0829916 and CNS-0905308.

REFERENCES

[1] Ellis Horowitz, Sartaj Sahni, and Dinesh Mehta, Data Structures, Algorithms, and Applications in C++, McGraw Hill, NY, 1998.

[2] Guohui Wang, T. S. Eugene Ng, and Anees Shaikh, Programming your network at run-time for big data applications, Proceedings of the First Workshop on Hot Topics in Software Defined Networks, HotSDN '12, 2012.

[3] Soham Das and Sartaj Sahni, Network topology optimization for data aggregation, IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid), 2014.

[4] R. L. Graham, Bounds on multiprocessing timing anomalies, SIAM Journal on Applied Mathematics, 1969, volume 17, number 2, pages 416–429.

[5] E. G. Coffman, Jr. and Ravi Sethi, A generalized bound on LPT sequencing, Proceedings of the 1976 ACM SIGMETRICS Conference on Computer Performance Modeling, Measurement and Evaluation, SIGMETRICS '76, 1976.

[6] Albert Greenberg, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta, Towards a next generation data center architecture: scalability and commoditization, Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow, PRESTO '08, 2008.

[7] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat, A scalable, commodity data center network architecture, Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, SIGCOMM '08, 2008.

[8] Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and Songwu Lu, DCell: a scalable and fault-tolerant network structure for data centers, Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, SIGCOMM '08, 2008.

[9] S. Das, Y. Yiakoumis, G. Parulkar, N. McKeown, P. Singh, D. Getachew, and P. D. Desai, Application-aware aggregation and traffic engineering in a converged packet-circuit network, Optical Fiber Communication Conference and Exposition (OFC/NFOEC) and the National Fiber Optic Engineers Conference, 2011.

[10] Kevin C. Webb, Alex C. Snoeren, and Kenneth Yocum, Topology switching for data center networks, Proceedings of the 11th USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services, Hot-ICE '11, 2011.

[11] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, An analysis of traces from a production MapReduce cluster, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2010.

[12] Theophilus Benson, Aditya Akella, and David A. Maltz, Network traffic characteristics of data centers in the wild, Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC '10, 2010.