


SCDP: Systematic Rateless Coding for Efficient Data Transport in Data Centres

Mohammed Alasmar∗, George Parisis∗, Jon Crowcroft†
∗School of Engineering and Informatics, University of Sussex, UK, Email: {m.alasmar, g.parisis}@sussex.ac.uk

    †Computer Laboratory, University of Cambridge, UK, Email: [email protected]

Abstract—In this paper we propose SCDP, a novel, general-purpose data transport protocol for data centres that, in contrast to all other protocols proposed to date, natively supports one-to-many and many-to-one data communication, which is extremely common in modern data centres. SCDP does so without compromising on efficiency for short and long unicast flows. SCDP achieves this by integrating RaptorQ codes with receiver-driven data transport, in-network packet trimming and Multi-Level Feedback Queuing (MLFQ): (1) RaptorQ codes enable efficient one-to-many and many-to-one data transport; (2) on top of RaptorQ codes, receiver-driven flow control, in combination with in-network packet trimming, enables efficient usage of network resources as well as multi-path transport and packet spraying for all transport modes; Incast and Outcast are eliminated; (3) the systematic nature of RaptorQ codes, in combination with MLFQ, enables fast, decoding-free completion of short flows. We extensively evaluate SCDP in a wide range of simulated scenarios with realistic data centre workloads. For one-to-many and many-to-one transport sessions, SCDP performs significantly better compared to NDP. For short and long unicast flows, SCDP performs equally well or better compared to NDP.

Index Terms—Data centre networking, data transport protocol, fountain coding, modern workloads.

    I. INTRODUCTION

Data centres support the provision of core Internet services and it is therefore crucial to have in place data transport mechanisms that ensure high performance for the diverse set of supported services. Data centres consist of a large number of commodity servers and switches; they support multiple paths among (possibly multi-homed) servers, very large aggregate bandwidth and very low latency communication with shallow buffers at the switches.

One-to-many and many-to-one communication. A significant portion of data traffic in modern data centres is produced by applications and services that replicate data for resilience purposes. For example, distributed storage systems, such as GFS/HDFS [1], [2] and Ceph [3], replicate data blocks across the data centre (with or without daisy chaining¹). Partition-aggregate [4], [5], streaming telemetry [6], [7], [8], and distributed messaging [9], [10] applications also produce similar traffic workloads. Multicast has already been deployed in data centres² and, with the advent of P4, scalable multicasting is becoming practical [11]. As a result,

¹https://patents.google.com/patent/US20140215257
²e.g. https://www.rackspace.com/en-gb/cloud/networks

much research on scalable network-layer multicasting in data centres has recently emerged [12], [13], [14], [15], [16].

Existing data centre transport protocols are suboptimal in terms of network and server utilisation for these workloads. One-to-many data transport is implemented through multi-unicasting or daisy chaining for distributed storage. As a result, copies of the same data are transmitted multiple times, wasting network bandwidth and creating hotspots that severely hurt the performance of short, latency-sensitive flows.

In many application scenarios, multiple copies of the same data can be found in the network at the same time (e.g. in replicated distributed storage), but only one replica server is used to fetch it. Fetching data, in parallel, from all available replica servers (many-to-one data transport) would provide significant benefits in terms of eliminating hotspots and naturally balancing load among servers.

These performance limitations are illustrated in Figure 1, where we plot the application goodput for TCP and NDP [17] in a distributed storage scenario with 1 and 3 replicas. When a single replica is stored in the data centre, NDP performs very well, as also demonstrated in [17]. TCP performs poorly³. On the other hand, when three replicas are stored in the network, both NDP and TCP perform poorly in both write and read workloads. Writing data involves either multi-unicasting replicas to all three servers (bottom two lines in Figure 1a) or daisy chaining replica servers (the line with the diamond marker); although daisy chaining performs better, avoiding the bottleneck at the client’s uplink, they both consume excessive bandwidth by moving multiple copies of the same block in the data centre. Fetching a data block from a single server when it is stored in two more servers creates hotspots at servers’ uplinks due to collisions from randomly selecting a replica server for each read request (see 3-sender goodput performance in Figure 1b).

Long and short flows. Modern cloud applications commonly have strict latency requirements [18], [19], [20], [21], [22], [23]. At the same time, background services require high network utilisation [24], [25], [26], [27]. A plethora of mechanisms and protocols have been proposed to date to

³It is well-established that TCP is ill-suited for meeting throughput and latency requirements of applications in data centre networks, therefore we will be using NDP [17] as the baseline protocol throughout this paper.

arXiv:1909.08928v1 [cs.NI] 19 Sep 2019


[Figure 1: two goodput plots; x-axis: rank of transport session (0–10,000), y-axis: goodput (Gbps, 0–1). (a) One-to-many (write), with lines for 1 Replica NDP, 3 Replicas NDP (daisy chain), 3 Replicas NDP (multi-unicast), 1 Replica TCP and 3 Replicas TCP. (b) Many-to-one (read), with lines for 1 Sender NDP, 3 Senders NDP, 1 Sender TCP and 3 Senders TCP.]

Figure 1: Goodput results in a 250-server FatTree topology with 1 Gbps link speed & 10 µs link delay. Background traffic is present to simulate congestion. Results are for 10,000 (a) write and (b) read block requests (2MB each). Each I/O request is ‘assigned’ to a host in the network, which is selected uniformly at random and acts as the client. Requests’ arrival times follow a Poisson process with λ = 1000. Replica selection and placement is based on HDFS’ default policy (see Section IV-A for a full description of the experimental setup).

provide efficient access to network resources to data centre applications, by exploiting support for multiple equal-cost paths between any two servers [28], [17], [26], [29] and hardware capable of low latency communication [22], [30], [31] and eliminating Incast [32], [33], [34], [35] and Outcast [36]. Recent proposals commonly focus on a single dimension of the otherwise complex problem space; e.g. TIMELY [37], DCQCN [38], QJUMP [39] and RDMA over Converged Ethernet v2 [40] focus on low latency communication but do not support multi-path routing. Other approaches [27], [26] do provide excellent performance for long flows but perform poorly for short flows [28], [24]. None of these protocols supports efficient one-to-many and many-to-one communication.

Contribution. In this paper we propose SCDP⁴, a general-purpose transport protocol for data centres that, unlike any other protocol proposed to date, supports efficient one-to-many and many-to-one communication. This, in turn, results in significantly better overall network utilisation, minimising hotspots and providing more resources to long and short unicast flows. At the same time, SCDP supports fast completion of latency-sensitive flows and consistently high-bandwidth communication for long flows. SCDP eliminates Incast [32], [33], [35] and Outcast [36]. All these are made possible by integrating RaptorQ codes [43], [44] with receiver-driven data transport [17], [22], in-network packet trimming [45], [17] and Multi-Level Feedback Queuing (MLFQ) [46]. RaptorQ codes are systematic and rateless, induce minimal network overhead and support excellent encoding/decoding performance with low memory footprint (§II). They naturally enable one-to-many (§III-E) and many-to-one (§III-F) data transport. They support per-packet (encoded symbol) multi-path routing and multi-homed network topologies [47], [48] (§III-C); packet reordering does not affect SCDP’s performance, in contrast to protocols like [24], [18], [28]. In combination with receiver-driven flow control (§III-D) and packet trimming (§III-C), SCDP eliminates Incast and Outcast, playing well with switches’ shallow buffers. The systematic nature of RaptorQ codes enables fast, decoding-free completion of latency-sensitive flows by prioritising newly established ones, therefore eliminating loss (except under very heavy loads) (§III-H). Long flows are latency-insensitive so lost symbols can be recovered by repair ones; SCDP employs pipelining of source blocks, which alleviates the decoding overhead for large data blocks and maximises application goodput (§III-G). SCDP is a simple-to-tune protocol, which, as with NDP and scalable multicasting, will be deployable when P4 switches [49] are deployed in data centres.

⁴SCDP builds on our early work on integrating fountain coding in data transport protocols [41], [42].

SCDP performance overview. We found that SCDP improves goodput performance by up to ∼50% compared to NDP with different application workloads involving one-to-many and many-to-one communication (§IV-A). Equally importantly, it reduces the average FCT for short flows by up to ∼45% compared to NDP under two realistic data centre traffic workloads (§IV-B). For short flows, decoding latency is minimised by the combination of the systematic nature of RaptorQ codes and MLFQ; even in a 70% loaded network, decoding was needed for only 9.6% of short flows. This percentage was less than 1% in a 50% congested network (§IV-G). The network overhead induced by RaptorQ codes is negligible compared to the benefits of supporting one-to-many and many-to-one communication. Only 1% network overhead was introduced under a heavily congested network (§IV-F). RaptorQ codes have been shown to perform exceptionally well even on a single core, in terms of encoding/decoding rates. We therefore expect that with hardware offloading, in combination with SCDP’s block pipelining mechanism (§III-G), the required computational overhead will be insignificant.


[Figure 2: the sender’s encoder fragments a source block into source symbols 1–8 and generates repair symbols a–c; encoding symbols traverse the network, where symbols 4 and b are lost; the receiver collects the received symbols (plus one extra symbol of overhead) and decodes the source block.]

Figure 2: RaptorQ-based communication

    II. RAPTORQ ENCODING AND DECODING

Encoding. RaptorQ codes are rateless and systematic. The input to the encoder is one or more source blocks; for each one of these source blocks, the encoder creates a potentially very large number of encoding symbols (rateless coding). All source symbols (i.e. the original fragments of a source block) are amongst the set of encoding symbols (systematic coding). All other symbols are called repair symbols. Senders initially send source symbols, followed by repair symbols, if needed.

Decoding. The decoder decodes a source block after receiving a number of encoding symbols that must be equal to or larger than the number of source symbols; all symbols contribute to the decoding process equally. In a lossless communication scenario, decoding is not required, because all source symbols are available (systematic coding).

Performance. In the absence of loss, RaptorQ codes do not incur any network or computational overhead. The trade-off associated with RaptorQ codes when loss occurs is with respect to some (1) minimal network overhead to enable successful decoding of the original fragments and (2) computational overhead for decoding the received symbols to the original fragments. RaptorQ codes behave exceptionally well in both respects. With two extra encoding symbols (compared to the number of original fragments), the decoding failure probability is 10⁻⁶. It is important to note that decoding failure is not fatal; instead, more encoding symbols can be requested. The time complexity of RaptorQ encoding and decoding is linear in the number of source symbols. RaptorQ codes support excellent performance for all block sizes, including very small ones, which is very important for building a general-purpose data transport protocol that is able to handle different types of workloads equally efficiently. In [50], the authors report encoding and decoding speeds of over 10 Gbps using a RaptorQ software prototype running on a single core. With hardware offloading, RaptorQ codes would be able to support data transport at line speeds in modern data centre deployments. On top of that, multiple blocks can be decoded in parallel, independently of each other. Decoding small source blocks is even faster, as reported in [51]. The decoding performance depends neither on the order in which symbols arrive nor on which particular symbols are received.

Example. Before explaining in detail how RaptorQ codes

are integrated in SCDP, we present a simple example of point-to-point communication between two hosts, which is illustrated in Figure 2.⁵ On the sender side, a single source block is passed to the encoder that fragments it into K = 8 equal-sized source symbols S1, S2, ..., S8. The encoder uses the source symbols to generate repair symbols Sa, Sb, Sc (here, the decision to encode 3 repair symbols is arbitrary). Encoding symbols are transmitted to the network, along with the respective encoding symbol identifiers (ESI) and source block numbers (SBN) [43]. As shown in Figure 2, symbols S4 and Sb are lost. Symbols take different paths in the network but this is transparent to the receiver, which only needs to collect a specific amount of encoding symbols (source and/or repair). The receiver could have been receiving symbols from multiple senders through different network interfaces. In this example, the receiver attempts to decode the original source block upon receiving 9 symbols, i.e. one extra symbol, which is a necessary network overhead (as shown in Figure 2). Decoding is successful and the source block is passed to the receiver application. As mentioned above, if no loss had occurred, there would be no need for decoding and the data would have been directly passed to the application.
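The example above can be sketched with a toy systematic, rateless XOR code. This is a stand-in for RaptorQ, whose actual generator matrix and GF(256) arithmetic are specified in RFC 6330; all names, the symbol sizes and the neighbour-selection rule below are illustrative assumptions. The key properties it shares with RaptorQ are that the first K encoding symbols are the source symbols themselves (systematic), an unbounded number of repair symbols can be generated (rateless), and any sufficiently large subset of symbols suffices for decoding, regardless of which symbols were lost.

```python
import random

K = 8        # source symbols per block (toy size)
SYM = 4      # symbol size in bytes (real symbols would be MTU-sized)

def rows_for_esi(esi, k=K):
    """Deterministic neighbour set for encoding symbol `esi`.
    ESIs 0..k-1 are systematic (the source symbols themselves);
    higher ESIs are repair symbols: the XOR of a pseudo-random subset
    of source symbols. Seeding the PRNG with the ESI means sender and
    receiver agree on the equations without any extra signalling."""
    if esi < k:
        return {esi}
    rng = random.Random(esi)
    return set(rng.sample(range(k), rng.randint(2, k)))

def encode(block, esi):
    """Produce the encoding symbol with identifier `esi`."""
    symbols = [block[i * SYM:(i + 1) * SYM] for i in range(K)]
    out = bytes(SYM)
    for i in rows_for_esi(esi):
        out = bytes(a ^ b for a, b in zip(out, symbols[i]))
    return out

def decode(received, k=K):
    """Gaussian elimination over GF(2). `received` maps ESI -> payload.
    Returns the source block, or None if more symbols are needed."""
    pivot = {}                           # leading index -> (index set, payload)
    for esi, p in received.items():
        idxs, payload = set(rows_for_esi(esi)), bytearray(p)
        while idxs:
            lead = min(idxs)
            if lead not in pivot:
                pivot[lead] = (idxs, payload)
                break
            pidx, ppay = pivot[lead]
            idxs ^= pidx                 # cancel the pivot equation
            for j in range(SYM):
                payload[j] ^= ppay[j]
    if len(pivot) < k:
        return None                      # rank deficient: pull more symbols
    for col in range(k - 1, -1, -1):     # back-substitution
        idxs, payload = pivot[col]
        for other in list(idxs - {col}):
            for j in range(SYM):
                payload[j] ^= pivot[other][1][j]
            idxs.discard(other)
    return b"".join(bytes(pivot[i][1]) for i in range(k))
```

As in the figure, a receiver that loses some source symbols simply keeps collecting repair symbols until the system becomes solvable; if no source symbol is lost, no decoding is needed at all.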

    III. SCDP DESIGN

In this section, we describe SCDP in detail. We first present an overview of the protocol and discuss its key design decisions. We define SCDP’s packet types and switch service model, the supported communication modes and how efficiency is provided for short and long flows.

    A. Design Overview

Figure 3 illustrates SCDP’s key components (shown in rectangles) and how these are brought together to tackle the challenges identified in Section I (shown in ellipses). SCDP is a receiver-driven transport protocol, which allows for swift reactions to congestion when observing loss; more specifically trimmed headers, as discussed in Section III-B (no Incast, no hotspots in Figure 3). Initially, senders push a pre-specified number of symbol packets, starting with

⁵Note that Figure 2 does not illustrate SCDP’s underlying mechanisms for requesting encoding symbols and flow control. It is only intended to showcase the main features of RaptorQ codes, which SCDP builds on. The design of SCDP is discussed extensively in Section III.


[Figure 3: a block diagram linking SCDP’s key components (RaptorQ codes, receiver-driven data transport, packet trimming, MLFQ) to the properties they provide (one-to-many, many-to-one, no Incast, no Outcast, no hotspots, high network utilisation, fast FCT), via: rateless coding with symbol ordering not important; per-symbol spraying, minimal overhead and multi-homing; pulling at link capacity; swift feedback, swift reactions to congestion and request pacing; decoding-free completion of short flows with no handshaking; and no tail drop.]

Figure 3: SCDP’s key components

source symbols; subsequently, receivers request additional symbols at their link capacity (no Incast) until they can decode the respective source block. Transport sessions are initiated immediately, without any handshaking, by the source symbols pushed by the sender(s) (fast FCT in Figure 3). In SCDP, there are no explicit acknowledgements; a request for an encoding symbol implicitly acknowledges the reception of a symbol. SCDP adopts packet trimming to provide fast congestion feedback to receivers (no Incast, no Outcast in Figure 3) and MLFQ, as in [46], to eliminate losses for short flows (except under extreme congestion); this, along with the implicit connection establishment, results in fast, decoding-free completion of (almost all) short flows (fast FCT). RaptorQ codes incur minimal network overhead due to the extra repair symbols when loss occurs, therefore network utilisation is not affected (high network utilisation in Figure 3).

A unique feature of SCDP that differentiates it from all previous proposals is its support for one-to-many (multicast) and many-to-one (multi-source) communication modes (see Figure 3). In SCDP’s one-to-many communication mode, a sender initially pushes a window of symbols to all receivers, which then start pulling additional (source and/or repair) ones. Senders aggregate pull requests and multicast a new symbol after receiving a pull request from all receivers. In the many-to-one mode, a receiver pulls encoding symbols from multiple senders. Duplicate symbols are avoided by having senders partition the stream of repair symbols in a distributed fashion [43]. Senders initially select and send a subset of source symbols, before sending repair symbols. Multi-source transport enables natural load balancing: (1) at the server level, each server contributes symbols at its available capacity; (2) at the network level, more symbols come through less congested paths. Unicast transport is a specialisation of many-to-one transport with one sender.
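The distributed partitioning of the repair stream can be sketched with a simple striding rule: sender i takes every n_s-th repair ESI, so senders never emit duplicate repair symbols without any coordination. This particular rule is an illustrative assumption, not the paper's scheme; the real partitioning operates over the ESI space defined by the RaptorQ specification [43].

```python
def repair_esis(sender_idx, n_senders, k, count):
    """Toy repair-stream partitioning: sender `sender_idx` of
    `n_senders` emits every n_senders-th repair ESI. Repair ESIs
    start at k, since ESIs 0..k-1 identify the source symbols."""
    return [k + sender_idx + j * n_senders for j in range(count)]
```

Because the strides are disjoint by construction, the receiver can mix repair symbols from all senders and every one of them contributes new information to decoding.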

In SCDP, receivers are oblivious of the provenance of encoding symbols. Symbols can follow different paths in the network, enabling per-packet ECMP routing. Symbol reordering does not affect decoding performance. Symbols can also be received from different network interfaces, enabling multi-homed communication [47], [48] (high network utilisation).

    B. Packet Types

SCDP supports three types of packets. A symbol packet carries one MTU-sized encoding symbol, either source or repair, the respective source block number (SBN) (i.e. the source block it belongs to), and the encoding symbol identifier (ESI), which identifies a symbol within a stream of source and repair symbols for a specific source block [43]. Data packets also include port numbers for identifying transport sessions, a priority field that is set by the sender and a syn flag that is set for all symbol packets that senders initially push.

A pull packet is sent by a receiver to request a symbol and contains a sequence number and a fin flag. Note that multiple symbol packets may be sent in response to a single pull request, as described in Section III-D. The sequence number is only used to indicate to a sender how many symbols to send (e.g. if pull requests get reordered due to packet spraying in the network).⁶ The fin flag is used to identify the last pull request; upon receiving such a pull request, a sender sends the last symbol packet for this SCDP session.

Header packets are trimmed versions of symbol packets. Whenever a network switch receives a symbol packet that cannot be buffered, instead of dropping it, it trims its payload (i.e. a source or repair RaptorQ symbol) and forwards the remaining header with the highest priority. Header packets are a signal of congestion and are used by receivers for flow control and to always keep a window’s worth of symbol packets on the fly.

⁶Note that RaptorQ codes are rateless, therefore there is no need for receivers to request lost symbols; a new symbol will contribute equally to the decoding of the source block.


    C. Switch Service Model

SCDP relies on network switching functionality that is either readily available in today’s data centre networks [22] or is expected to be [17] when P4 switches are widely deployed. Note that it does not require any more switch functionality than NDP [17], QJUMP [39], or PIAS [46] do.

Priority scheduling and packet trimming. In order to support latency-sensitive flows, we employ MLFQ [46] and packet trimming [45]. We assume that network switches support a small number of queues (with respective priority levels). The top priority queue is only used for header and pull packets. This is crucial for swiftly providing feedback to receivers about loss in the network. Given that both types of packets are very small, it is extremely unlikely that the respective queue gets full and that they are dropped⁷. The rest of the queues are very short and are used to buffer symbol packets. Switches perform weighted round-robin scheduling between the top-priority (header/pull) queue and the symbol packet queues. This guards against congestion collapse, a situation where a switch only forwards trimmed headers and all symbol packets are trimmed (to headers). When a data packet is to be transmitted, the switch selects the head packet from the highest priority, non-empty queue. In combination with the priority setting mechanism, this minimises loss for short flows, enabling fast, decoding-free completion.

Multipath routing. SCDP packets are sprayed to all available equal-cost paths to the destination⁸ in the network. SCDP relies on ECMP and spraying could be done either by using randomised source ports [24], or the ESI of symbol and header packets and the sequence number of pull packets.
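The switch service model above can be sketched as follows. This is a toy software model, not switch dataplane code: the queue capacities and the round-robin weight are assumed values, and the trimmed header keeps only the fields the receiver needs (ESI, SBN).

```python
from collections import deque

SYMBOL_CAP = 8        # shallow per-priority symbol queues (assumed size)
CONTROL_WEIGHT = 4    # control packets served per symbol packet (assumed)

class Switch:
    """Toy model of SCDP's switch service model: a top-priority queue
    for header/pull packets, short per-priority symbol queues, payload
    trimming instead of tail drop, and weighted round-robin so symbol
    packets are never starved by headers (no congestion collapse)."""
    def __init__(self, levels=3):
        self.control = deque()                    # header + pull packets
        self.symbol = [deque() for _ in range(levels)]
        self._served = 0                          # control packets since last symbol

    def enqueue(self, pkt):
        if pkt["type"] in ("header", "pull"):
            self.control.append(pkt)              # tiny packets, rarely dropped
        elif len(self.symbol[pkt["prio"]]) < SYMBOL_CAP:
            self.symbol[pkt["prio"]].append(pkt)
        else:                                     # queue full: trim, don't drop
            self.control.append({"type": "header",
                                 "esi": pkt["esi"], "sbn": pkt["sbn"]})

    def dequeue(self):
        symbols_waiting = any(self.symbol)
        if self.control and (self._served < CONTROL_WEIGHT or not symbols_waiting):
            self._served += 1
            return self.control.popleft()
        self._served = 0
        for q in self.symbol:                     # highest-priority non-empty queue
            if q:
                return q.popleft()
        return None
```

The weighted round-robin in `dequeue` is what prevents the pathological state where trimmed headers monopolise the link: after `CONTROL_WEIGHT` control packets, one symbol packet is always transmitted if any is queued.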

    D. Unicast Transport Sessions

A unicast SCDP transport session is implicitly opened by a sender by pushing a window of symbol packets to the receiver. Senders tag outgoing symbol packets with a priority value, which is used by the switches when scheduling their transmission (§III-C). The priority of outgoing symbol packets is gradually degraded, when specific thresholds are reached. Calculating these thresholds can be done as in PIAS [46] or AuTO [30]. After receiving the initial window of packets, the receiver takes control of the flow of incoming packets by pacing pull requests to the sender. A pull request carries a sequence number which is auto-incremented for each incoming symbol packet. The sender keeps track of the sequence number of the last pull request and, upon receiving a new pull request, it will send one or more packets to fill the gap between the sequence numbers of the last and current request. Such gaps may appear when pull requests are reordered due to packet spraying. Senders ignore pull requests with sequence numbers that have already been ‘served’; i.e. when they had previously responded to the respective pull requests.
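The sender-side pull handling described above can be sketched as follows; the class and field names are illustrative, not from the SCDP implementation. The point is that because symbols are interchangeable, the sender only needs a single counter: it fills the gap up to the highest sequence number seen and ignores anything at or below it, so reordered pulls never trigger duplicate transmissions.

```python
class UnicastSender:
    """Sketch of SCDP's sender-side pull handling: track the highest
    pull sequence number served and fill any gap with fresh encoding
    symbols (which symbols doesn't matter - the code is rateless)."""
    def __init__(self, symbol_stream):
        self.symbols = symbol_stream     # iterator over encoding symbols
        self.last_served = 0             # highest pull sequence number served

    def on_pull(self, seq):
        """Return the symbol packets to transmit for this pull request."""
        if seq <= self.last_served:
            return []                    # already served: reordered/duplicate pull
        count = seq - self.last_served   # fill the whole gap in one go
        self.last_served = seq
        return [next(self.symbols) for _ in range(count)]
```

For example, if pull #3 overtakes pull #2 in the network, the sender transmits two symbols on receiving #3 and then silently drops the late-arriving #2.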

⁷Receivers employ a simple timeout mechanism, as in [17], to recover from the unlikely losses of pull and header packets.

⁸In SCDP’s one-to-many communication mode there are many destinations. In Section III-E, we describe this communication mode in detail.

Receivers maintain a single queue of pull requests for all active transport sessions. Flow control’s objective is to keep the receiver’s incoming link as fully utilised as possible at all times. This dictates the pace at which receivers send pull requests to all different senders. Receivers buffer encoding symbols along with their ESI and SBN and start decoding a source block upon receiving either K source symbols, where K is the total number of source symbols, or K + o source and repair symbols, when loss occurs (o is the induced network overhead). We found that o = 2 extra symbols, when loss occurs, is the sweet spot with respect to the overhead and decoding failure probability trade-off.
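The receiver-side decode trigger can be sketched as below (a toy model with illustrative names): with all K source symbols in hand the block is passed through with no decoding at all, and otherwise decoding starts once K + o symbols of any kind have arrived, with o = 2 as in the text.

```python
OVERHEAD = 2   # o = 2 extra symbols when loss occurs, as in the text

class ReceiverBlock:
    """Sketch of the per-source-block decode trigger at an SCDP
    receiver: pass through at K source symbols, decode at K + o
    mixed source/repair symbols."""
    def __init__(self, k):
        self.k = k
        self.source = set()              # ESIs of received source symbols
        self.repair = set()              # ESIs of received repair symbols

    def on_symbol(self, esi):
        """Record a symbol; return the action to take, if any."""
        (self.source if esi < self.k else self.repair).add(esi)
        if len(self.source) == self.k:
            return "pass-through"        # all source symbols: no decoding needed
        if len(self.source) + len(self.repair) >= self.k + OVERHEAD:
            return "decode"              # enough mixed symbols to decode
        return None                      # keep pulling
```

This is why MLFQ matters for short flows: if their packets are never trimmed, the K source symbols all arrive and the "pass-through" branch makes completion decoding-free.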

The receiver sets the fin flag in the pull request for the last symbol (a source or repair symbol at that point) that it sends to the sender. Note that this may not actually be the last request the receiver sends, because the symbol packet that is sent in response to that request may get trimmed. All pull requests for the last required symbol (not a specific one) are sent with the fin flag on. The sender responds to fin-enabled pull requests by sending the next symbol in the potentially very large stream of source and repair symbols with the highest priority. It finally releases the transport session only after a time period that ensures that the last prioritised symbol packet was not trimmed. This time period is very short; in the very unlikely case that the prioritised symbol packet was trimmed, the respective header would be prioritised along with the pull packet subsequently sent by the receiver.

    E. One-to-many Transport Sessions

One-to-many transport sessions exploit support for network-layer multicast (e.g. with [11], [12], [13], [14], [15], [16]) and coordination at the application layer; for example, in a distributed storage scenario, multicast groups could be pre-established for different replica server groups or set up on demand by a metadata storage server. This would eliminate the associated latency overhead for establishing multicast groups on the fly and is practical for other data centre multicast workloads, such as streaming telemetry and distributed messaging, where destination servers are known at deployment time. With recent advances in scalable data centre multicasting, a very large number of multicast groups can be deployed with manageable overhead in terms of switch state and packet size. For example, Elmo [11] encodes multicast group information inside packets, therefore minimising the need to store state at the network switches. With small group sizes, as in the common data centre use cases mentioned above, Elmo can support an extremely large number of groups, which can be encoded directly in packets, eliminating any maintenance overhead associated with churn in the multicast state. “In a three-tier data centre topology with 27K hosts, Elmo supports a million multicast groups using a 325-byte packet header, requiring as few as 1.1K multicast group-table entries on average in leaf switches, with a traffic overhead as low as 5% over ideal multicast” [11].

As with unicast transport sessions, an SCDP sender initially pushes IW (syn-enabled) symbol packets tagged with


the highest priority. Receivers then request more symbols by sending respective pull packets. The sender sends a new symbol packet only after receiving a request from all receivers within the same multicast group. Receivers queue and pace pull packets as in all other transport modes. Depending on the network conditions and server load, a receiver may get behind in terms of received symbols. The rateless property of RaptorQ codes is ideal for such situations; within a single transport session, receivers may receive a different set of symbols but they will all decode the original source block as long as the required number of symbols is collected, regardless of which symbols they missed (see Section II). On the other hand, some receivers may end up receiving more symbols than what would be required to decode the original source block. This is unnecessary network overhead induced by SCDP but, in Section IV-F, we show that even under severe congestion, SCDP performs significantly better than NDP, exploiting the support for network-layer multicast. In extreme scenarios where receivers become unresponsive, this overhead increases significantly, as all other receivers will be unnecessarily receiving a potentially very large number of symbols. In such situations, detaching the straggler server from the multicast group (at the application layer) would trivially solve the issue.
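The pacing rule above (one new symbol only after a pull from every receiver in the group) can be sketched as follows. This is an illustrative model under stated assumptions, not SCDP's API; class and method names are hypothetical.

```python
class MulticastSender:
    """Sketch: multicast the next symbol only once every receiver in the
    group has pulled for it."""
    def __init__(self, group):
        self.group = set(group)
        self.pending = set(group)   # receivers we are still waiting on
        self.next_symbol = 0
        self.multicast_log = []     # symbol IDs actually multicast

    def on_pull(self, receiver) -> None:
        self.pending.discard(receiver)
        if not self.pending:
            # All receivers asked for more: multicast one new symbol and
            # start waiting for the next round of pulls.
            self.multicast_log.append(self.next_symbol)
            self.next_symbol += 1
            self.pending = set(self.group)

m = MulticastSender(["r1", "r2", "r3"])
m.on_pull("r1"); m.on_pull("r2")   # still waiting on r3: nothing sent
m.on_pull("r3")                     # now symbol 0 is multicast
```

The sketch also shows why a slow receiver throttles the whole group, and why detaching a straggler (shrinking `group`) immediately restores progress.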

    F. Many-to-one Transport Sessions

Many-to-one data transport is a generalisation of the unicast transport discussed in Section III-D. Senders initialise a multi-source transport session by pushing an initial window IW_i of symbol packets to the receiver. As in the unicast transport mode, these symbol packets have the syn flag set, are tagged with the highest priority and contain source or repair symbols. The total number of initially pushed symbol packets, IW_total = Σ_{i=1}^{n_s} IW_i, where n_s is the total number of senders, is selected to be larger than the initial window IW used in unicast transport sessions. This is to enable natural load balancing in the data centre in the presence of slow senders or hotspots in the network. In that case, SCDP ensures that a subset of senders (e.g. 2 out of 3 in a 3-replica scenario) can still fill the receiver's downstream link. In Section IV-E, we show that initial window sizes that are greater than 10 symbol packets result in the same (high) goodput performance. A large initial window would inevitably result in more trimmed symbol packets, which however would not affect short, latency-sensitive flows that would always be prioritised over longer multi-source sessions.

As discussed in Section II, RaptorQ codes are rateless and all symbols contribute equally to the decoding process, therefore the receiver is agnostic to the origin of each symbol. In many-to-one communication scenarios, senders are coordinated at the application layer. For example, in a distributed storage scenario, clients can either resolve the IP addresses of servers in a deterministic way (e.g. as in [3], [52]) or by asking a metadata server (e.g. as in [53]). Before fetching the data, they are aware of (1) the total number of senders n_s and (2) the server index i in the set of all senders. As a result, they can partition the potentially large stream of source and repair (if needed) symbols so that each one produces unique symbol packets.
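One simple way to realise this partitioning, given only n_s and the sender index i, is to assign sender i the symbol IDs congruent to i modulo n_s. The sketch below is illustrative; the paper only requires that the senders' symbols be unique, not this particular scheme.

```python
def sender_symbols(i: int, ns: int, how_many: int) -> list[int]:
    """Sender i of ns emits the symbol IDs (ESIs) congruent to i modulo ns,
    so the senders jointly cover the stream without overlap (one plausible
    partitioning, not necessarily SCDP's)."""
    return [esi for esi in range(how_many * ns) if esi % ns == i]

# Three senders (e.g. a 3-replica read), four symbols each:
streams = [sender_symbols(i, 3, 4) for i in range(3)]
# streams[0] == [0, 3, 6, 9], streams[1] == [1, 4, 7, 10], streams[2] == [2, 5, 8, 11]
```

Because all symbols contribute equally to decoding, the receiver simply counts symbols across the three streams; it never needs to know which sender produced which one.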

G. Maximising Goodput for Long Flows through Source Block Pipelining

With RaptorQ codes, if loss occurs, the receiver must decode the source block after collecting the required number of source and repair symbols (§II). This induces latency before the data can become available to the application.

For large source blocks, SCDP masks this latency by splitting the large source block into many smaller blocks, instead of encoding and decoding the whole block. The smaller blocks are then pipelined over a single SCDP session. With pipelining, a receiver decodes each one of these smaller source blocks while receiving symbol packets for the next one, effectively masking the latency induced by decoding, except for the last source block. The latency for decoding this last smaller block is considerably smaller compared to decoding the whole block at once.9 For short, latency-sensitive flows, this latency could be a serious issue, but SCDP strives to eliminate losses, resulting in fast, decoding-free completion of short flows (§III-H).
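The latency saving can be made concrete with a back-of-the-envelope model. This is a sketch under simplifying assumptions we introduce here (decoding a sub-block takes no longer than receiving the next one, and decoding cost grows linearly with block size); it is not a claim about RaptorQ's actual decoder complexity.

```python
def pipelined_completion(n_blocks: int, recv_t: float, dec_t: float) -> float:
    """Receive/decode pipeline: block k decodes while block k+1 is being
    received, so only the final (small) decode is exposed.
    Assumes dec_t <= recv_t, otherwise decoding becomes the bottleneck."""
    assert dec_t <= recv_t
    return n_blocks * recv_t + dec_t

def monolithic_completion(n_blocks: int, recv_t: float, dec_t: float) -> float:
    # No pipelining: receive everything, then decode the whole block.
    # Optimistically for this baseline, we assume decoding cost is linear
    # in block size (n_blocks times the per-sub-block decode time).
    return n_blocks * recv_t + n_blocks * dec_t

# 10 sub-blocks, 1.0 time units to receive each, 0.2 to decode each:
saved = monolithic_completion(10, 1.0, 0.2) - pipelined_completion(10, 1.0, 0.2)
```

With these numbers the pipeline hides 9 of the 10 decode steps, saving 1.8 time units; the saving grows with the number of sub-blocks.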

H. Minimising Network Overhead and Completion Time for Short Flows

SCDP ensures that a window of IW symbol packets is on the fly throughout the lifetime of a transport session. The window decreases by one symbol packet for the last IW packets that the sender sends. As long as no loss is detected (through receiving a trimmed header), a receiver sends K − IW pull requests in total. For every received trimmed header (i.e. observed loss), the receiver sends a pull request, and, subsequently, the sender sends a new symbol, which equally contributes to the decoding of the source block. This ensures that SCDP does not induce any unnecessary overhead; i.e. symbol packets that are unnecessary for the decoding of the respective source block. The target for the total number of received symbols also changes when loss is detected. Initially, all receivers aim at receiving K source symbols. Upon receiving the first trimmed header, the target changes to K + 2, which ensures that decoding failure is extremely unlikely to occur (see Section II).
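The receiver-side accounting just described can be condensed into a few lines. This is an illustrative sketch (the class and method names are ours, not SCDP's): K − IW pulls in the lossless case, one extra pull per trimmed header, and a one-time bump of the target from K to K + 2 on the first loss.

```python
class Receiver:
    """Sketch of SCDP's receiver-side pull/target accounting."""
    def __init__(self, K: int, IW: int):
        self.K, self.IW = K, IW
        self.target = K        # symbols needed to complete the block
        self.pulls = 0         # extra pulls triggered by losses
        self.loss_seen = False

    def lossless_pulls(self) -> int:
        # IW symbols are pushed unsolicited, so only K - IW pulls are
        # needed when nothing is trimmed.
        return self.K - self.IW

    def on_trimmed_header(self) -> None:
        if not self.loss_seen:
            self.loss_seen = True
            self.target = self.K + 2  # two extra symbols make decoding
                                      # failure extremely unlikely (§II)
        self.pulls += 1               # one replacement pull per loss

r = Receiver(K=100, IW=12)
```

Note that the target grows by exactly two regardless of how many packets are trimmed: each loss costs one replacement symbol, and the +2 cushion is paid only once.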

By prioritising earlier packets of a session over later ones through MLFQ, SCDP minimises loss for short flows. This has an extremely important corollary in terms of SCDP's computational cost; no decoding is required for the great majority of short flows, therefore completion times are almost always near-optimal. We extensively evaluate this aspect of SCDP's design in Section IV-G. It is important to note that for all supported types of communication, there is no latency induced due to encoding, because repair

9For the experimental evaluation presented in Section IV, we have integrated pipelining into the developed SCDP model and simulated the respective latency following the results reported in [51].


[Figure 4 panels: (a) NDP one-to-many (multi-unicast); (b) NDP one-to-many (daisy chain); (c) SCDP one-to-many; (d) NDP many-to-one (single-block); (e) SCDP many-to-one.]

Figure 4: Illustration of read and write workloads and replica placement policy used in comparing goodput performance for SCDP and NDP. For clarity, the core of the data centre is omitted and replica groups (and the respective transport sessions) are selected to be in the same pod. In our simulations, the selection of remote racks to store data blocks is random and racks in different pods can be selected. Network-layer multicast is supported in SCDP one-to-many communication.

symbols can be generated while source symbols are sent; i.e. there can always be one or more repair symbols ready before they are needed.

    IV. EXPERIMENTAL EVALUATION

We have extensively evaluated SCDP's performance through large-scale, packet-level simulations. We have developed models of SCDP, NDP, the switch service model and network-layer multicast support [54] in OMNeT++10. Our results are fully reproducible. For our experimentation we have used a 250-server FatTree topology with 25 core switches and 5 aggregation switches in each pod (50 aggregation switches in total). This is a typical size for a simulated data centre topology, also used in the evaluation of recent data centre transport proposals [22], [46], [31], [23]. The values for the link capacity, link delay and switch buffer size are 1 Gbps, 10µs and 20 packets, respectively. The buffer is allocated to 5 packet queues with different scheduling priorities. The thresholds for demoting the priority of a specific session are statically assigned to 10KB, 100KB, 1MB and 10MB, respectively11. The top priority queue is for pull and header packets, which are very small, therefore we can avoid timeouts by setting its size to a relatively large value (as also done in [17]). Unless otherwise stated, the initial window IW for one-to-one and one-to-many sessions is set to 12 symbol packets. For many-to-one sessions the initial window is set to 6 symbol packets per sender. For all experiments we set the block size for pipelining to 100 MTU-sized symbol packets. We have run each simulation 5 times with different seeds and

10Some of our models that we use in this paper have been published at the OMNeT++ Community Summit [55].

11In a real-world deployment these would be set dynamically, e.g. as in AuTO [30].

report average (with 95% confidence intervals) or aggregate values.
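The static MLFQ demotion described in the setup can be sketched as a mapping from a session's bytes sent so far to a data queue. This is one plausible reading of the configuration, not the simulator's code: we assume the top-priority queue carries pull and header packets, each crossed threshold demotes the session one level, and sessions are capped at the lowest data queue.

```python
# Demotion thresholds in bytes, taken from the simulation setup above.
THRESHOLDS = [10_000, 100_000, 1_000_000, 10_000_000]
N_DATA_QUEUES = 4  # assumed: the fifth, top-priority queue is for pull/header packets

def data_queue(bytes_sent: int) -> int:
    """Map a session's bytes sent so far to a data queue
    (1 = highest data priority): each crossed threshold demotes the
    session one level, capped at the lowest queue."""
    crossed = sum(bytes_sent >= t for t in THRESHOLDS)
    return min(1 + crossed, N_DATA_QUEUES)
```

Under this mapping a 50KB session sits one level below a fresh one, while anything beyond 1MB competes in the lowest data queue, which is why long background flows cannot delay short ones.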

A. Goodput for One-to-Many and Many-To-One Communication

In this section we measure the application goodput for SCDP and NDP in a distributed storage setup with 3 replicas (as depicted in Figure 4). The setup involves many-to-one and one-to-many communication. In each run, we simulate 2000 transport sessions (or I/O requests at the storage layer) with sizes of 1MB and 4MB (rs in the figures). Transport session arrival times follow a Poisson process; we have used different λ values (2000 and 4000) to assess the performance of the studied protocols under different loads. Each I/O request is ‘assigned’ to a host in the network (Ci in Figure 4), which is selected uniformly at random and acts as the client. Replica selection and placement is based on HDFS’ default policy. More specifically, we assume that clients are not data nodes themselves, therefore a data block is placed on a randomly selected data node (Ri in Figure 4). One replica is stored on a node in a different remote rack, and the last replica is stored on a different node in the same remote rack. A client will read a block from a server located in the same rack, or a randomly selected one, if no replica is stored in the same rack. In order to simulate congestion in the core of the network, 30% of the nodes run background long flows, the scheduling of which is based on a permutation traffic matrix.

One-to-many transport sessions. We evaluate SCDP's performance in one-to-many traffic workloads and assess how it benefits from the underlying support for network-layer multicast, compared to NDP. One-to-many communication with NDP is implemented through (1) multi-unicasting data to multiple recipients (Figure 4a) or (2) daisy-chaining the transmission of replicas through the respective servers


[Figure 5 panels: (a) rs = 1MB, λ = 2000; (b) rs = 1MB, λ = 4000; (c) rs = 4MB, λ = 2000; (d) rs = 4MB, λ = 4000. Each panel plots goodput (Gbps) against the rank of transport session for SCDP, NDP (daisy chain) and NDP (multi-unicast).]

    Figure 5: Performance comparison for SCDP and NDP - write I/O with 3 replicas (one-to-many)

[Figure 6 panels: (a) rs = 1MB, λ = 2000; (b) rs = 1MB, λ = 4000; (c) rs = 4MB, λ = 2000; (d) rs = 4MB, λ = 4000. Each panel plots goodput (Gbps) against the rank of transport session for SCDP, NDP (multi-block) and NDP (single-block).]

    Figure 6: Performance comparison for SCDP and NDP - read I/O with 3 replicas (many-to-one)

(Figure 4b). In daisy-chaining, each replica starts transmitting the data to the next replica server (according to HDFS's placement policy) as soon as it starts receiving data from another replica server. Daisy-chaining eliminates the bottleneck at the client's uplink. We measure the overall goodput from the time the client initiates the transmission until the last server receives the whole data. The results for various loads and I/O request sizes are shown in Figure 5. In all figures, flows are ranked according to the measured goodput performance (shown on the y axis). SCDP, with its natural load balancing and the support of multicast (Figure 4c), significantly outperforms NDP even when daisy-chaining is used for replicating data. Daisy-chaining is effective compared to multi-unicasting when the network is not heavily loaded. In SCDP, around 50% of the sessions experience goodput that is over 90% of the available bandwidth for 1MB sessions and λ = 2000. The remaining 50% of sessions still get goodput performance over 60% of the available bandwidth. When the network load is heavier, daisy-chaining does not provide any significant benefits over multi-unicasting because data needs to be moved in the data centre multiple times and congestion gets severe. For λ = 4000 and 4MB sessions, NDP's performance is significantly worse for most sessions, whereas SCDP still offers an acceptable transport service to all sessions. SCDP fully exploits the support for network-layer multicasting, providing superior performance to all storage clients because the required network bandwidth is minimised. Minimising the bandwidth requirements for one-to-many flows, which are extremely common in the data centre, makes space for regular short and long flows. For the experimental setup with the heaviest network load (λ = 4000 and 4MB sessions), we have measured the average goodput for SCDP background traffic to be 0.408 Gbps, compared

to 0.252 Gbps for the respective NDP experiment12. This is 15.6% of the available bandwidth freed up for all other flows. We evaluate the positive effect that SCDP has with respect to network hotspots in Section IV-C.

Many-to-one transport sessions. In the many-to-one scenario (Figure 6), clients read previously stored data from the network. SCDP naturally balances this load according to servers' capacity and network congestion, as discussed in Section III-F (see Figure 4e). With NDP, clients read data either from a replica server located in the same rack or from a randomly selected server, if there is no replica stored in the same rack. For NDP, we simulate both a single-block (see Figure 4d) and a multi-block request workload. The latter enables parallelisation at the application layer (e.g. the read-ahead optimisation where a client reads multiple consecutive blocks under the assumption that they will soon be requested). Here, we simulate a 3-block read-ahead policy and measure the overall goodput from the time the I/O request is issued until all 3 blocks are fetched. To make the results as comparable to each other as possible, for the 3-block setup we use blocks whose size is one third of the size of the single-block scenario (as reported in Figure 6). We do not include multi-block results for SCDP as they are almost identical to the single-block case, confirming the argument that it naturally distributes the load without any application-layer parallelisation.

In Figure 6 we observe that SCDP significantly outperforms NDP for all different request sizes and λ values. Even under heavy load, SCDP provides acceptable performance to all transport sessions. This is the result of (1) the natural and dynamic load balancing provided to SCDP's many-to-

12Note that this improvement for background flows is despite these running at the lowest possible priority, given that they span the whole duration of the simulation.


[Figure 7 panels: (a) (0, 100KB] average FCT (ms) vs load; (b) (0, 100KB] 99th percentile FCT (ms) vs load; (c) (100KB, 1MB] average FCT (ms) vs load; (d) (1MB, 10MB] goodput (Gbps) vs rank of transport session for loads 0.5 and 0.8. All panels compare SCDP and NDP.]

    Figure 7: Web search workload with unicast flows as background traffic

[Figure 8 panels: (a) (0, 100KB] average FCT (ms) vs load; (b) (0, 100KB] 99th percentile FCT (ms) vs load; (c) (100KB, 1MB] average FCT (ms) vs load; (d) (1MB, 10MB] goodput (Gbps) vs rank of transport session for loads 0.5 and 0.8. All panels compare SCDP and NDP.]

    Figure 8: Data mining workload with unicast flows as background traffic

[Figure 9 panels: (a) (0, 100KB] average FCT (ms) vs load; (b) (0, 100KB] 99th percentile FCT (ms) vs load; (c) (100KB, 1MB] average FCT (ms) vs load; (d) (1MB, 10MB] goodput (Gbps) vs rank of transport session for loads 0.5 and 0.8. All panels compare SCDP and NDP.]

    Figure 9: Web search workload with a mixture of one-to-many and many-to-one sessions as background traffic

Workload           0-10KB   10KB-100KB   100KB-1MB   1MB-   Average flow size
Web Search [18]    49%      3%           18%         20%    1.6MB
Data Mining [56]   78%      5%           8%          9%     7.4MB

Table I: Flow size distribution of realistic workloads

one sessions and (2) MLFQ; long background flows run at the lowest priority to boost the performance of shorter flows. Around 82% of the sessions experience goodput that is above 90% of the available bandwidth for 1MB sessions and λ = 2000. In contrast, NDP offers this good performance to only 10% of the sessions. For λ = 4000 and 4MB sessions, NDP's performance is significantly worse for most sessions, whereas SCDP still offers good performance to all sessions. Notably, the performance difference between SCDP and NDP increases with the congestion in the network, with SCDP being able to provide acceptable levels of performance where NDP would not (e.g. in the presence of hotspots or in over-subscribed networks).

    B. Performance Benchmarking with Realistic Workloads

SCDP is a general-purpose transport protocol for data centres; it is therefore crucial that it provides high performance for all supported transport modes and traffic workloads. In this section, we use realistic workloads reported by data centre operators to evaluate SCDP's applicability and effectiveness beyond one-to-many and many-to-one sessions. Here, we consider two typical services; web search and data mining [56], [18]. The respective flow size distributions are shown in Table I. They are both heavy-tailed; i.e. a small fraction of long flows contribute most of the traffic. We have chosen the workloads to cover a wide range of average flow sizes ranging from 64KB to 7.4MB. We simulate four target loads of background traffic (0.5, 0.6, 0.7 and 0.8). We generate 20000 transport sessions, the inter-arrival time of which follows a Poisson process with λ = 2500. In Figures 7a and 7c and 8a and 8c, we report the average flow completion time (FCT) of flows with sizes in (0 − 1MB). For the shortest flows (0 − 100KB) we also report the 99th percentile of the measured FCTs (Figures 7b and 8b). Finally, Figures 7d and 8d illustrate the measured


[Figure 10 panels: (a) (0, 100KB] average FCT (ms) vs load; (b) (0, 100KB] 99th percentile FCT (ms) vs load; (c) (100KB, 1MB] average FCT (ms) vs load; (d) (1MB, 10MB] goodput (Gbps) vs rank of transport session for loads 0.5 and 0.8. All panels compare SCDP and NDP.]

    (d) (1MB, 10MB]

    Figure 10: Data mining workload with a mixture of one-to-many and many-to-one sessions as background traffic

[Figure 11 panels: (a) Incast: goodput comparison (goodput (Gbps) vs number of senders in parallel, for SCDP, NDP and TCP with 70KB and 256KB blocks; error bars show 95% confidence intervals); (b) Incast: FCT CDF with 70 senders (SCDP 256KB vs NDP 256KB); (c) Outcast: setup (2 flows from the same pod and 12 flows from different pods towards two receivers under the same ToR switch, with the bottleneck between the aggregate and ToR switch); (d) Outcast: average goodput (Gbps) for App1 and App2 under SCDP and TCP.]

    Figure 11: Incast and Outcast evaluation

goodput for flows with sizes in (1MB, 10MB] (for load values of 0.5 and 0.8).

SCDP performs better in all scenarios due to the decoding-free completion of (almost all) short flows and the supported MLFQ. Note that when loss occurs, SCDP sessions must exchange 2 additional symbols; they also pay the ‘decoding latency’ price. For very short flows, the 99th percentile FCT is close to the average one for all loads, which indicates that this rarely happens. We study the extent to which this overhead and the associated decoding latency are required in Section IV-G. For higher loads, NDP performs even worse than SCDP because of the lack of support for MLFQ, which results in the trimming of more packets belonging to short flows. Note that the FCT of short flows in web search is larger than in data mining. This is mainly because the percentage of long flows in the former workload is larger than in the latter, resulting in a higher overall load (for all fixed loads of background traffic). A key message here is that SCDP provides significantly better tail performance for short flows compared to NDP, especially as the network load increases, despite the (very unlikely) potential for decoding and network overhead. For flows with sizes in (1MB, 10MB], we observe that goodput with SCDP is better compared to NDP; tail performance is also better.

    C. Minimising Hotspots in the Network

SCDP increases network utilisation by exploiting support for network-layer multicasting and enabling load balancing when data is fetched simultaneously from multiple servers, as demonstrated in Section IV-A. This, in turn, makes space in the network for regular short and long flows. In this

section, we evaluate this performance benefit. We use as background traffic a 50%/50% mixture of write and read I/O requests (4MB each) that produce one-to-many and many-to-one traffic, respectively. We repeat the experiment of the previous section and evaluate the performance benefits of SCDP over NDP with respect to minimising hotspots and maximising network utilisation for regular short and long flows.

In Figures 9a and 9c, we observe that SCDP's performance is almost identical to the one reported in Figures 7a and 7c (similarly between Figure 8 and Figure 10). In contrast, NDP's performance deteriorates significantly because the background traffic requires more bandwidth (one-to-many) and results in hotspots at servers' uplinks (many-to-one). Tail performance for SCDP gets only marginally worse (the 99th percentile increases from 0.277ms to 0.287ms for the web search workload at load 0.5), whereas NDP's performance gets significantly worse (the 99th percentile increases from 0.306ms to 0.381ms). The observed behaviour is more pronounced in the web search workload which, as described in the previous section, results in higher overall network utilisation compared to the data mining workload.

    D. Eliminating Incast and Outcast

SCDP eliminates Incast by integrating packet trimming and not relying on retransmissions of lost packets (given the rateless nature of RaptorQ codes). We have simulated Incast by having multiple senders (ranging from 1 to 70) sending blocks of data (70KB and 256KB each, in two separate experiments) to a single receiver. All transport sessions were synchronised and background traffic was present to simulate congestion. Figure 11a illustrates the measured


[Figure 12 panels: (a) Goodput performance (goodput (Gbps) vs rank of transport session for IW = 4, 8, 12, 16, 20 and 24); (b) Number of trimmed packets vs IW.]

    Figure 12: The effect of the IW value

aggregated goodput for all SCDP, NDP and TCP flows. Error bars represent the 95% confidence interval. As expected, TCP's performance collapses when the number of senders increases. SCDP performs slightly better compared to NDP even when a large number of servers send data to the receiver at the same time. This is attributed to the decoding-free completion of these flows, in combination with the packet trimming and the lack of retransmissions for SCDP. Figure 11b shows the CDF of the FCTs in the presence of Incast with 70 senders. We observe that for the vast majority of transport sessions, SCDP provides superior performance compared to NDP.

SCDP eliminates Outcast by employing receiver-driven flow control and packet trimming, which prevent port blackout. We have simulated a classic Outcast scenario, where two receivers that are connected to the same ToR switch receive traffic from senders located in the same pod (2 flows crossing 4 hops) and different pods (12 flows crossing 6 hops), respectively. Flow size is 200KB and all flows start at the same time. This is illustrated in Figure 11c. Here, the bottleneck link lies between the aggregate switch and the ToR switch, which is different from the Incast setup. Figure 11d shows the aggregate goodput for the two groups of flows, for SCDP and TCP. TCP Outcast manifests itself through (1) unfair sharing of the bottleneck bandwidth (around 113 and 274 Mbps for the two groups of flows, respectively) and (2) suboptimal overall performance (around 0.387 Gbps). SCDP eliminates Outcast as the bottleneck is shared fairly between the two groups of flows (around 460 and 435 Mbps, respectively, with overall goodput around 0.9 Gbps).

E. The Effect of the Initial Window Size

A key parameter of SCDP is the initial window IW of symbol packets that a sender pushes to the network. Throughout the lifetime of a transport session this window is maintained and only decreased for the last IW pull packets. In many-to-one transport sessions the sum of all the initial windows for all senders is set to be larger than the initial window for the one-to-one and one-to-many modes. This is to enable natural load balancing between

all senders in the presence of congestion in the network (see Section III-F). In this section we evaluate the effect that the initial window has on the performance of SCDP. The experimental setup is as described in Section IV-A, with 1.5MB unicast sessions (we evaluated one-to-many and many-to-one sessions as well, which showed similar results to the unicast sessions).

In Figure 12a, we observe that for very small values of the initial window, the goodput is very low and the receiver's downlink underutilised. As the window increases, utilisation approaches the maximum available link capacity (for 10 symbol packets). For larger values, the measured goodput is the same (full link capacity). This means that for many-to-one sessions, increasing the sum of initial windows for all senders does not have any negative impact on the goodput. However, increasing the window inevitably leads to more trimmed packets due to the added network load, which would be beyond the receiver's downlink capacity. This is illustrated in Figure 12b, where the average number of trimmed packets for session sizes of 1.5MB grows from 13 for an initial window of 12 symbol packets to 32 for an initial window of 20. This increase and the resulting necessity for decoding (and extra overhead) does not negatively affect many-to-one sessions, which are commonly not latency-sensitive.

    F. Overhead in One-to-Many Sessions

In Section III-E, we identified a limitation of SCDP with respect to unnecessary network overhead which may occur in one-to-many transport sessions in the presence of congestion. This is due to receivers getting behind with the reception of symbols. Consequently, up-to-date receivers will be receiving more symbols than what they actually need. In order to evaluate the extent of this limitation we set up an experiment similar to the one presented in Section IV-A. Figures 13a and 13b depict the CDF of the number of symbols that were sent unnecessarily for different values of λ and session sizes. We observe that as the network load increases, the number of sessions that induce unnecessary network overhead increases. It is important to note that, even when this happens, the measured goodput for SCDP


[Figure 13 panels: (a) One-to-many - 1MB: CDF of overhead (unnecessary symbols) for λ = 4000, 1500 and 250; (b) One-to-many - 3MB: CDF of overhead (unnecessary symbols) for λ = 4000, 1500 and 250; (c) Goodput - 1MB and 3MB - λ = 4000: goodput (Gbps) vs rank of transport session for SCDP and NDP.]
Figure 13: Unnecessary network overhead in one-to-many sessions

is significantly better than that of NDP. Figure 13c illustrates the measured goodput for the examined session sizes and the highest network load (λ = 4000). Clearly, SCDP significantly outperforms NDP despite the potential for some unnecessary network overhead. The benefit of exploiting network-layer multicast makes this potential overhead negligible.

    G. Network Overhead and Induced Decoding Latency

SCDP provides zero-overhead data transport when no loss occurs. In the opposite case, 2 extra symbols (compared to the number of original fragments) are required by the decoder to decode the source block (with extremely high probability). Additionally, the required decoding induces latency in receiving the original source block. Short flows in data centres are commonly latency-sensitive, so SCDP must be able to provide decoding-free completion of such flows. To assess the efficacy of our MLFQ-based approach, we measure the number of unicast flows that suffer symbol packet loss for different network loads ranging from 0.5 to 0.7. For each network load, we examine different λ values for the Poisson inter-arrival rate of the studied short flows (150KB). In each simulation, we generate 5000 sessions with the respective λ value as their inter-arrival time. In Figure 14, we observe that for load values of 0.5 and 0.6, the times that a short flow would require decoding and 2 extra symbol packets is very small (0.44% and 1.2% of the

    Load 0.5 Load 0.6 Load 0.70

    100

    200

    300

    400

    # tim

    es d

    ecod

    ing

    need

    ed

    = 1000 = 2000 = 4000

    = 6000 = 8000

    Figure 14: Number of sessions that needed decoding fordifferent loads and session inter-arrival times. The totalnumber of simulated sessions is 5000

    0 2 4 6 8 10 12 14 16 18Time (seconds)

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    Goo

    dput

    (Gbp

    s)

    flow 1flow 2flow 3

    flow 4flow 5

    Figure 15: Convergence test

    flows, respectively, when λ = 8000), rendering the respectiveoverhead negligible.
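The small overhead can also be seen with a back-of-the-envelope sketch (our own illustration; the i.i.d. loss model and the assumption of roughly 1KB symbols are ours): with K source fragments and per-packet loss rate p, a receiver must pull on average (K + 2)/(1 - p) symbol packets for K + 2 of them to arrive, since RaptorQ decodes with extremely high probability given two extra symbols [43].

```python
def expected_pulls(K, p, overhead=2):
    """Expected number of symbol packets a receiver must request so that
    K + overhead of them survive an assumed i.i.d. per-packet loss rate p.
    With two extra received symbols, RaptorQ decoding succeeds with
    extremely high probability (RFC 6330)."""
    assert 0.0 <= p < 1.0
    return (K + overhead) / (1.0 - p)
```

For a 150KB short flow split into 150 symbol packets, 1% loss raises the expected pull count from 152 to roughly 153.5, i.e. under two extra packets.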

    H. Resource Sharing

SCDP achieves excellent fairness without needing additional mechanisms. SCDP's principles for resource sharing are as follows: (1) receivers pull symbol packets from one or more senders in the data centre at a pace that matches their downlink bandwidth. Given that servers are uniformly connected to the network with respect to link speeds, SCDP enables fair sharing of the network among servers. (2) A receiver will pull symbol packets for each SCDP session on a round-robin basis.

As a result, SCDP enables fair sharing of its downlink across all transport sessions running at a specific receiver.13 (3) SCDP employs MLFQ in the network. Naturally, this prioritisation scheme provides fairness between competing flows only within the same priority level. In Figure 15 we report goodput results with respect to the convergence behaviour of 5 SCDP unicast sessions that start sequentially at 2-second intervals, each lasting 18 seconds, from 5 sending servers to the same receiving server under the same ToR switch. SCDP performs equally well to DCTCP in this respect [18]. Clearly, flows acquire a fair share of the available bandwidth very quickly. Each incoming flow is initially prioritised over the ongoing flows (MLFQ) but, given the reported time scales, this cannot be shown in Figure 15. We have repeated this experiment with a larger number of flows and found that SCDP converges quickly, with all flows achieving their fair share.

13It would be straightforward to support priority scheduling at the receiver level, as in NDP.
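Principle (2) above can be sketched as a simple receiver-side scheduler (a minimal illustration; the class and method names are ours, not SCDP's implementation): one pull-request credit is handed to each active session in turn, so every session receives an equal share of the downlink regardless of its size.

```python
from collections import deque

class RoundRobinPuller:
    """Receiver-side sketch: hand one pull credit to each active
    session in turn, so the downlink is shared fairly between
    sessions regardless of how many symbols each still needs."""
    def __init__(self):
        self.sessions = deque()

    def add_session(self, session_id, symbols_needed):
        self.sessions.append([session_id, symbols_needed])

    def next_pull(self):
        """Return the session to send the next PULL packet for, or None."""
        if not self.sessions:
            return None
        session = self.sessions.popleft()
        session[1] -= 1                    # one symbol packet requested
        if session[1] > 0:
            self.sessions.append(session)  # re-queue unfinished sessions
        return session[0]
```

A session that still needs many symbols is interleaved with newly arrived ones rather than monopolising the downlink, which is what yields the per-receiver fairness described above.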

    V. CONCLUSION

In this paper, we proposed SCDP, a general-purpose transport protocol for data centres that is the first to exploit network-layer multicast in the data centre and to balance load across senders in many-to-one communication, while performing at least as well as the state of the art with respect to goodput and flow completion time for long and short unicast flows, respectively. Supporting one-to-many and many-to-one application workloads is very important given how extremely common they are in modern data centres [11]. SCDP achieves this remarkable combination by integrating systematic rateless coding with receiver-driven flow control, packet trimming and in-network priority scheduling.

RaptorQ codes incur some minimal network overhead, only when loss occurs in the network, but our experimental evaluation showed that this is negligible compared to the significant performance benefits of supporting one-to-many and many-to-one workloads. RaptorQ codes also incur computational overhead and associated latency when loss occurs. However, we showed that this is rare for short flows because of MLFQ. For long flows, block pipelining alleviates the problem by splitting large blocks into smaller ones and decoding each of these smaller blocks while retrieving the next one. As a result, latency is incurred only for the last smaller block. RaptorQ codes have been shown to perform at line speeds even on a single core; we expect that with hardware offloading the overall overhead will not be significant.
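The block-pipelining argument can be illustrated with a short sketch (our own, with placeholder fetch and decode stubs; SCDP's actual implementation is not shown here): decoding of source block i overlaps with retrieval of block i + 1, so decoding latency is exposed only for the final block.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(block_id):
    # Placeholder: pull K + 2 symbol packets for this source block.
    return f"symbols-{block_id}"

def decode(symbols):
    # Placeholder: RaptorQ-decode a source block from its symbols.
    return symbols.replace("symbols", "block")

def pipelined_transfer(num_blocks):
    """Overlap decoding of block i with fetching of block i + 1,
    so decoding latency is exposed only for the last block."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as decoder:
        pending = None
        for b in range(num_blocks):
            symbols = fetch(b)                 # network-bound
            if pending is not None:
                results.append(pending.result())
            pending = decoder.submit(decode, symbols)  # CPU-bound, overlapped
        results.append(pending.result())
    return results
```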

To the best of our knowledge, SCDP is the first protocol that provides native support for one-to-many and many-to-one communication in data centres, and our extensive evaluation shows promising performance results.

    REFERENCES

[1] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in SOSP, 2003.
[2] "HDFS architecture guide [Online], accessed Sep 2019," https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
[3] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, "Ceph: A scalable, high-performance distributed file system," in USENIX, 2006.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in USENIX OSDI '04, 2004.
[5] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in NSDI, USENIX, 2012.
[6] "Open config. streaming telemetry," accessed on 11/06/2019. [Online]. Available: http://blog.sflow.com/2016/06/streaming-telemetry.html
[7] "Microsoft Azure," accessed on 11/06/2019. [Online]. Available: https://azure.microsoft.com/en-us/blog/cloud-service-fundamentals-telemetry-reporting/
[8] M. L. Massie, B. N. Chun, and D. E. Culler, "The ganglia distributed monitoring system: design, implementation, and experience," in Elsevier Parallel Computing, 2014.
[9] "Akka: Build powerful reactive, concurrent, and distributed applications more easily – using UDP," accessed on 11/06/2019. [Online]. Available: https://doc.akka.io/docs/akka/2.5.4/java/io-udp.html
[10] "JGroups: A toolkit for reliable messaging," accessed on 11/06/2019. [Online]. Available: http://www.jgroups.org/overview.html
[11] M. Shahbaz, L. Suresh, N. Feamster, J. Rexford, O. Rottenstreich, and M. Hira, "Elmo: Source-routed multicast for cloud services," arXiv, May 2018. [Online]. Available: http://arxiv.org/abs/1802.09815
[12] M. McBride and O. Komolafe, "Multicast in the data center overview," IETF Internet-Draft, 2019.
[13] D. Li, M. Xu, Y. Liu, X. Xie, Y. Cui, J. Wang, and G. Chen, "Reliable multicast in data center networks," IEEE Transactions on Computers, 2014.
[14] D. Li, J. Yu, J. Yu, and J. Wu, "Exploring efficient and scalable multicast routing in future data center networks," in INFOCOM, 2011.
[15] W. Cui and C. Qian, "Dual-structure data center multicast using software defined networking," arXiv, 2014. [Online]. Available: http://arxiv.org/abs/1403.8065
[16] X. Li and M. J. Freedman, "Scaling IP multicast on datacenter topologies," in CoNEXT, 2013.
[17] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, and M. Wójcik, "Re-architecting datacenter networks and stacks for low latency and high performance," in Proc. of SIGCOMM, 2017.
[18] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, "Data Center TCP (DCTCP)," in Proc. of SIGCOMM, 2010.
[19] A. Munir, I. A. Qazi, Z. A. Uzmi, A. Mushtaq, S. N. Ismail, M. S. Iqbal, and B. Khan, "Minimizing flow completion times in data centers," in IEEE INFOCOM, 2013.
[20] H. Xu and B. Li, "RepFlow: Minimizing flow completion times with replicated flows in data centers," in IEEE INFOCOM, 2014.
[21] Y. Lu, G. Chen, L. Luo, K. Tan, Y. Xiong, X. Wang, and E. Chen, "One more queue is enough: Minimizing flow completion time with explicit priority notification," in IEEE INFOCOM, 2017.
[22] B. Montazeri, Y. Li, M. Alizadeh, and J. Ousterhout, "Homa: A receiver-driven low-latency transport protocol using network priorities," in Proc. of ACM SIGCOMM, 2018.
[23] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker, "pFabric: Minimal near-optimal datacenter transport," in Proc. of SIGCOMM, 2013.
[24] M. Kheirkhah, I. Wakeman, and G. Parisis, "MMPTCP: A multipath transport protocol for data centers," in Proc. of INFOCOM, 2016.
[25] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley, "Improving Datacenter Performance and Robustness with Multipath TCP," in Proc. of SIGCOMM, 2011.
[26] A. Dixit, P. Prakash, Y. C. Hu, and R. R. Kompella, "On the impact of packet spraying in data center networks," in Proc. of INFOCOM, 2013.
[27] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, "Hedera: Dynamic Flow Scheduling for Data Center Networks," in Proc. of USENIX, 2010.
[28] C. Raiciu, C. Pluntke, S. Barre, A. Greenhalgh, D. Wischik, and M. Handley, "Data Center Networking with Multipath TCP," in Proc. of SIGCOMM. ACM, 2010.
[29] Y. Cui, L. Wang, X. Wang, H. Wang, and Y. Wang, "FMTCP: A fountain code-based multipath transmission control protocol," IEEE/ACM Transactions on Networking, 2015.
[30] L. Chen, J. Lingys, K. Chen, and F. Liu, "AuTO: Scaling Deep Reinforcement Learning for Datacenter-scale Automatic Traffic Optimization," in Proc. of ACM SIGCOMM, 2018.
[31] P. X. Gao, A. Narayan, G. Kumar, R. Agarwal, S. Ratnasamy, and S. Shenker, "pHost: Distributed Near-optimal Datacenter Transport over Commodity Network Fabric," in Proc. of CoNEXT, 2015.
[32] Y. Chen, R. Griffith, J. Liu, and A. Joseph, "Understanding TCP incast throughput collapse in datacenter networks," in Proc. of SIGCOMM, 2009.
[33] Y. Chen, R. Griffith, D. Zats, A. D. Joseph, and R. Katz, "Understanding TCP incast and its implications for big data workloads," in Proc. of USENIX, 2012.


[34] C. Jiang, D. Li, and M. Xu, "LTTP: An LT code based transport protocol for many-to-one communication in data centers," IEEE Journal on Selected Areas in Communications, 2014.
[35] C. J. Zheng, D. Li, M. Xu, and K., "A Coding-based Approach to Mitigate TCP Incast in Data Center Networks," in International Conference on Distributed Computing Systems Workshops, 2012.
[36] P. Prakash, A. Dixit, Y. C. Hu, and R. Kompella, "The TCP Outcast Problem: Exposing Unfairness in Data Center Networks," in Proc. of USENIX, 2012.
[37] R. Mittal, V. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats, "TIMELY: RTT-based Congestion Control for the Datacenter," in Proc. of SIGCOMM, 2015.
[38] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang, "Congestion control for large-scale RDMA deployments," in SIGCOMM, 2015.
[39] M. P. Grosvenor, M. Schwarzkopf, I. Gog, R. N. Watson, A. W. Moore, S. Hand, and J. Crowcroft, "Queues don't matter when you can JUMP them!" in USENIX NSDI, 2015.
[40] "InfiniBand Trade Association. RoCEv2," 2014. [Online]. Available: https://cw.infinibandta.org/document/dl/7781
[41] M. Alasmar, G. Parisis, and J. Crowcroft, "PolyRaptor: Embracing path and data redundancy in data centres for efficient data transport," in Proceedings of the ACM SIGCOMM 2018 Conference on Posters and Demos, 2018.
[42] G. Parisis, T. Moncaster, A. Madhavapeddy, and J. Crowcroft, "Trevi: Watering Down Storage Hotspots with Cool Fountain Codes," in Proc. of HotNets, 2013.
[43] M. Luby, A. Shokrollahi, M. Watson, T. Stockhammer, and L. Minder, "RaptorQ Forward Error Correction Scheme for Object Delivery," IETF RFC 6330, 2011.
[44] A. Shokrollahi and M. Luby, "Raptor Codes," in Foundations and Trends in Communications and Information Theory, Now Publishers, 2011.
[45] P. Cheng, F. Ren, R. Shu, and C. Lin, "Catch the Whole Lot in an Action: Rapid Precise Packet Loss Notification in Data Center," in Proc. of USENIX, 2014.
[46] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and H. Wang, "Information-agnostic flow scheduling for commodity data centers," in Proc. of NSDI, USENIX, 2015.
[47] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, "BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers," in Proc. of SIGCOMM, 2009.
[48] A. Singla, C.-Y. Hong, L. Popa, and P. Godfrey, "Jellyfish: Networking data centers randomly," in Proc. of USENIX, 2012.
[49] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker, "P4: Programming protocol-independent packet processors," in ACM SIGCOMM, 2014.
[50] M. G. Luby, R. Padovani, T. J. Richardson, L. Minder, and P. Aggarwal, "Liquid cloud storage," CoRR, vol. abs/1705.07983, 2017. [Online]. Available: http://arxiv.org/abs/1705.07983
[51] M. Luby, L. Minder, and P. Aggarwal, "Performance of CodornicesRq software package," International Computer Science Institute, May 2019. [Online]. Available: https://www.codornices.info/performance
[52] G. Parisis, G. Xylomenos, and T. Apostolopoulos, "DHTbd: A reliable block-based storage system for high performance clusters," in Proc. of IEEE/ACM CCGrid, 2011.
[53] P. J. Braam, "File systems for clusters from a protocol perspective," http://www.lustre.org.
[54] Z. Guo and Y. Yang, "Multicast fat-tree data center networks with bounded link oversubscription," in IEEE INFOCOM, 2013.
[55] M. Alasmar and G. Parisis, "Evaluating modern data centre transport protocols in OMNeT++/INET," in Proceedings of the 6th OMNeT++ Community Summit Conference, Hamburg, Germany, 2019.
[56] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," in Proc. of SIGCOMM, 2009.
