High-Fidelity Latency Measurements in Low-Latency Networks
Ramana Rao Kompella, Myungjin Lee (Purdue), Nick Duffield (AT&T Labs – Research)


Page 1: High-Fidelity Latency Measurements in Low-Latency Networks

Page 2

Stanford 2

Low Latency Applications

Many important data center applications require low end-to-end latencies (microseconds):
- High Performance Computing – lose parallelism
- Cluster Computing, Storage – lose performance
- Automated Trading – lose arbitrage opportunities

Page 3

Low Latency Applications (cont.)

- Cloud applications: Recommendation Systems, Social Collaboration
  - All-up SLAs of 200 ms [AlizadehSigcomm10]
  - The SLA covers backend computation time as well as the network, so network latencies have little budget

Page 4

Latency Measurements are Needed

[Figure: data center topology with ToR switches, edge routers, and core routers; a 1 ms latency spike raises the question: which router causes the problem?]

- At every router, high-fidelity measurements are critical to localize root causes
- Once the root cause is localized, operators can fix it by rerouting traffic, upgrading links, or performing detailed diagnosis
- Measurement within a router is necessary

Page 5

Vision: Knowledge Plane

[Figure: a Knowledge Plane sits above the Data Center Network; latency measurements are pushed or pulled into it, and applications such as SLA diagnosis, routing/traffic engineering, and scheduling/job placement issue queries through a query interface and receive responses.]

Page 6

Contributions Thus Far…

- Aggregate latency estimation
  - Lossy Difference Aggregator – Sigcomm 2009
  - FineComb – Sigmetrics 2011
  - mPlane – ReArch 2009
- Differentiated latency estimation (per-flow latency measurements at every hop)
  - Multiflow Estimator – Infocom 2010
  - Reference Latency Interpolation – Sigcomm 2010
  - RLI across Routers – Hot-ICE 2011
  - Delay Sketching – (under review at Sigcomm 2011)
- Scalable query interface (per-packet latency measurements)
  - MAPLE – (under review at Sigcomm 2011)

Page 7

1) PER-FLOW MEASUREMENTS WITH REFERENCE LATENCY INTERPOLATION [SIGCOMM 2010]

Page 8

Obtaining Fine-Grained Measurements

- Native router support (SNMP, NetFlow): no latency measurements
- Active probes and tomography: too many probes required (~10,000 Hz), wasting bandwidth
- Expensive high-fidelity measurement boxes: the London Stock Exchange uses Corvil boxes, but they cannot be placed ubiquitously
- Recent work, LDA [Kompella09Sigcomm]: computes average latency/variance accurately within a switch; a good start, but may not be sufficient to diagnose flow-specific problems

Page 9

From Aggregates to Per-Flow

[Figure: delay vs. time at a switch queue; within a measurement interval, some packets see large delays and others small delays around the average latency.]

- Observation: average latencies differ significantly across flows at a router
- Goal of this paper: how to obtain per-flow latency measurements in a scalable fashion?

Page 10

Measurement Model

[Figure: a router with ingress interface I and egress interface E; latency is measured between them.]

- Assumption: time synchronization between router interfaces
- Constraint: cannot modify regular packets to carry timestamps (that would require intrusive changes to the router forwarding path)

Page 11

Naïve Approach

For each flow key:
- Store timestamps for each packet at I and E
- After a flow stops sending, I sends the packet timestamps to E
- E computes individual packet delays
- E aggregates average latency, variance, etc. for each flow

Problem: high communication costs
- At 10 Gbps, a few million packets per second
- Sampling reduces communication, but also reduces accuracy

[Figure: worked example: the egress subtracts each packet's ingress timestamp from its egress timestamp and averages the per-flow differences, e.g., avg. delay = 22/2 = 11 for one flow and 32/2 = 16 for another.]

Page 12

A (Naïve) Extension of LDA

- Maintain LDAs, each with many counters, for the flows of interest
- Problem: (potentially) high communication costs, proportional to the number of flows

[Figure: ingress I and egress E each keep one LDA per flow, holding a packet count and a sum of timestamps; coordination between matching LDAs yields per-flow latency.]

Page 13

Key Observation: Delay Locality

[Figure: delay vs. time; three packet delays D1, D2, D3, each with a surrounding window whose mean delays are WD1, WD2, WD3.]

- True mean delay = (D1 + D2 + D3) / 3
- Localized mean delay = (WD1 + WD2 + WD3) / 3
- How close is the localized mean delay to the true mean delay as the window size varies?

Page 14

Key Observation: Delay Locality (cont.)

[Figure: local mean delay per key vs. true mean delay per key (ms); small windows track the true mean closely: RMSRE = 0.054 for a 0.1 ms window, 0.16 for 10 ms, and 1.72 for 1 s (essentially the global mean).]

Data sets are from a real router and synthetic queueing models.

Page 15

Exploiting Delay Locality

- Reference packets are injected regularly at the ingress I
  - Special packets carrying an ingress timestamp
  - Provide reference delay values (a substitute for window averages)
  - Used to approximate the latencies of regular packets

[Figure: delay vs. time; reference packets carrying ingress timestamps sample points on the delay curve.]

Page 16

RLI Architecture

- Component 1: Reference Packet Generator
  - Injects reference packets regularly at the ingress
- Component 2: Latency Estimator
  - Estimates packet latencies and updates per-flow statistics
  - Estimates directly at the egress, with no extra state maintained at the ingress side (reduces storage and communication overheads)

[Figure: regular packets and a reference packet (R) carrying an ingress timestamp travel from ingress I, where the reference packet generator runs, to egress E, where the latency estimator runs.]

Page 17

Component 1: Reference Packet Generator

Question: when to inject a reference packet?
- Idea 1 (1-in-n): inject one reference packet every n packets. Problem: low accuracy under low utilization
- Idea 2 (1-in-τ): inject one reference packet every τ seconds. Problem: bad when short-term delay variance is high
- Our approach: dynamic injection based on utilization
  - High utilization → low injection rate; low utilization → high injection rate
  - The adaptive scheme works better than fixed-rate schemes (details in the paper)

Page 18

Component 2: Latency Estimator

Question 1: how to estimate latencies using reference packets?
Solution: several estimators are possible:
- Use only the delay of the left reference packet (RLI-L)
- Use linear interpolation between the left and right reference packets (RLI)
- Other non-linear estimators are possible (e.g., shrinkage)

[Figure: delay vs. time; for a regular packet only the arrival time is known, while for the surrounding reference packets (L, R) both arrival time and delay are known; the linear interpolation line between L and R gives the estimated delay, with some error relative to the true delay.]
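
The interpolation step can be sketched in a few lines. This is a reconstruction from the slide, not the authors' implementation; modeling a reference packet as an (arrival_time, delay) pair is my assumption.

```python
# Hypothetical sketch of the RLI estimators. A reference packet is modeled
# as an (arrival_time, delay) pair; a regular packet exposes only its
# arrival time at the egress.

def rli_estimate(left_ref, right_ref, arrival_time):
    """RLI: linearly interpolate between the left and right reference packets."""
    (t_l, d_l), (t_r, d_r) = left_ref, right_ref
    if t_r == t_l:                      # degenerate spacing: fall back to RLI-L
        return d_l
    alpha = (arrival_time - t_l) / (t_r - t_l)
    return d_l + alpha * (d_r - d_l)

def rli_l_estimate(left_ref, arrival_time):
    """RLI-L: use only the left reference packet's delay."""
    return left_ref[1]
```

For example, with reference delays of 10 μs at t = 0 and 20 μs at t = 1, a regular packet arriving at t = 0.25 is estimated at 12.5 μs by RLI, and at 10 μs by RLI-L.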

Page 19

Component 2: Latency Estimator (cont.)

Question 2: how to compute per-flow latency statistics?
Solution: maintain 3 counters per flow at the egress side:
- C1: number of packets
- C2: sum of packet delays
- C3: sum of squares of packet delays (for estimating variance)

To minimize state, any flow selection strategy can be used to maintain counters for only a subset of flows.

[Figure: each regular packet's delay is estimated from the interpolation buffer (left and right reference packets); the selected flow's counters C1, C2, C3 are updated with 1, the delay, and the squared delay; when a flow is exported, avg. latency = C2 / C1.]
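
A minimal sketch of the egress-side per-flow state, assuming delays have already been estimated (e.g., by interpolation); the class and method names are illustrative, not from the paper.

```python
from collections import defaultdict

class FlowStats:
    """Per-flow counters (C1, C2, C3) kept at the egress, as on the slide."""

    def __init__(self):
        # flow_key -> [C1 packet count, C2 sum of delays, C3 sum of squared delays]
        self.counters = defaultdict(lambda: [0, 0.0, 0.0])

    def update(self, flow_key, delay):
        c = self.counters[flow_key]
        c[0] += 1
        c[1] += delay
        c[2] += delay * delay

    def export(self, flow_key):
        """On flow export, derive mean and variance from the three counters."""
        c1, c2, c3 = self.counters.pop(flow_key)
        mean = c2 / c1
        variance = c3 / c1 - mean * mean    # E[D^2] - (E[D])^2
        return mean, variance
```

Using the slide's example delays of 4 and 5: C1 = 2, C2 = 9, C3 = 41, so the exported mean is 4.5 and the variance is 0.25.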

Page 20

Experimental Setup

- Data sets (no public data center traces with timestamps)
  - Real router traces with synthetic workloads: WISC
  - Real backbone traces with synthetic queueing: CHIC and SANJ
- Simulation tool: open-source NetFlow software (YAF)
  - Supports the reference packet injection mechanism
  - Simulates a queueing model with a RED active queue management policy
- Experiments with different link utilizations

Page 21

Accuracy under High Link Utilization

[Figure: CDF of relative error; the median relative error is 10–12%.]

Page 22

Comparison with Other Solutions

[Figure: average relative error vs. utilization, with a packet sampling rate of 0.1%; RLI's error is 1–2 orders of magnitude lower than the other solutions'.]

Page 23

Overhead of RLI

- Bandwidth overhead is low: less than 0.2% of link capacity
- Impact on packet loss is small: the packet loss difference with and without RLI is at most 0.001% at around 80% utilization

Page 24

Summary

- A scalable architecture for high-fidelity per-flow latency measurements between router interfaces
- Achieves a median relative error of 10–12%
- Obtains 1–2 orders of magnitude lower relative error than existing solutions
- Measurements are obtained directly at the egress side

Page 25

Contributions Thus Far…

- Aggregate latency estimation
  - Lossy Difference Aggregator – Sigcomm 2009
  - FineComb – Sigmetrics 2011
  - mPlane – ReArch 2009
- Differentiated latency estimation
  - Multiflow Estimator – Infocom 2010
  - Reference Latency Interpolation – Sigcomm 2010
  - RLI across Routers – Hot-ICE 2011
  - Virtual LDA – (under review at Sigcomm 2011)
- Scalable query interface
  - MAPLE – (under review at Sigcomm 2011)

Page 26

2) SCALABLE PER-PACKET LATENCY MEASUREMENT ARCHITECTURE (UNDER REVIEW AT SIGCOMM 2011)

Page 27

MAPLE Motivation

- LDA and RLI are ossified in their aggregation level
- Not suitable for obtaining arbitrary sub-population statistics
- A single packet's delay may be important
- Key goal: how to enable a flexible and scalable architecture for packet latencies?

Page 28

MAPLE Architecture

[Figure: at router A, a timestamp unit records T1 for packet P1; at router B, a packet latency store records delay D1; a central monitor sends a query Q(P1) to the query engine and receives the answer A(P1).]

- Timestamping is not strictly required: MAPLE can work with RLI-estimated latencies

Page 29

Packet Latency Store (PLS)

- Challenge: how to store packet latencies in the most efficient manner?
- Naïve idea: hash tables do not scale well
  - At a minimum, they require a label (32 bits) plus a timestamp (32 bits) per packet
  - To avoid collisions, a large number of hash table entries is needed (~147 bits/pkt for a collision rate of 1%)
- Can we do better?

Page 30

Our Approach

- Idea 1: cluster packets
  - There are typically few dominant delay values
  - Cluster packets into equivalence classes and associate one delay value with each cluster
  - Choose cluster centers such that the error is small
- Idea 2: provision storage
  - Naïvely, one Bloom filter per cluster (Partitioned Bloom Filter)
  - We propose a new, more efficient data structure: the Shared-Vector Bloom Filter (SVBF)

Page 31

Selecting Representative Delays

- Approach 1: logarithmic delay selection
  - Divide the delay range into logarithmic intervals, e.g., 0.1–10,000 μs → 0.1–1 μs, 1–10 μs, …
  - Simple to implement, with bounded relative error, but accuracy may not be optimal
- Approach 2: dynamic clustering
  - A k-means (k-medians) clustering formulation
  - Minimizes the average absolute error of packet latencies (minimizes total Euclidean distance)
- Approach 3: hybrid clustering
  - Split centers equally between static and dynamic: the best of both worlds
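
Approach 1 can be sketched concretely. The 0.1–10,000 μs range is from the slide; the decade-per-bucket granularity and the geometric-midpoint representative are my assumptions.

```python
import math

# Illustrative logarithmic delay selection: one bucket per decade over
# 0.1-10,000 us (0.1-1, 1-10, 10-100, 100-1,000, 1,000-10,000).

LO, HI, NUM_BUCKETS = 0.1, 10_000.0, 5

def log_bucket(delay_us):
    """Index of the decade interval that a delay (in microseconds) falls in."""
    d = min(max(delay_us, LO), HI)               # clamp into the covered range
    return min(int(math.floor(math.log10(d / LO))), NUM_BUCKETS - 1)

def representative(idx):
    """Geometric midpoint of a bucket; bounds relative error per decade."""
    return LO * (10 ** idx) * math.sqrt(10)
```

A delay is stored as its bucket's representative, so the relative error is bounded by the bucket width (here, a factor of sqrt(10) at worst), matching the slide's "bounded relative error" claim.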

Page 32

K-means

- Goal: determine k centers every measurement cycle
  - Can be formulated as a k-means clustering problem
- Problem 1: running k-means exactly is typically hard
  - The basic algorithm has O(n^(k+1) log n) run time
  - Heuristics (Lloyd's algorithm) are also complicated in practice
- Solution: sampling and streaming algorithms
  - Use sampling to reduce n to pn
  - Use a streaming k-medians algorithm (approximate but sufficient)
- Problem 2: centers cannot be found and membership recorded at the same time
- Solution: pipelined implementation
  - Use the previous interval's centers as an approximation for the current interval
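
The sample-then-cluster idea can be illustrated with a plain Lloyd-style stand-in; the paper uses a streaming k-medians algorithm [CharikarSTOC03] instead, and the function below (names, sampling rate, iteration count) is entirely my sketch.

```python
import random

# Sketch: sample delays at rate p, then alternate nearest-center assignment
# and 1-D median updates (k-medians minimizes absolute error).

def pick_centers(delays, k, p=0.1, iters=10, seed=0):
    rng = random.Random(seed)
    sample = [d for d in delays if rng.random() < p] or list(delays)[:k]
    uniq = sorted(set(sample))
    centers = uniq if len(uniq) <= k else sorted(rng.sample(uniq, k))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for d in sample:  # assign each sampled delay to its nearest center
            i = min(range(len(centers)), key=lambda j: abs(d - centers[j]))
            groups[i].append(d)
        centers = sorted(sorted(g)[len(g) // 2] if g else c   # median update
                         for g, c in zip(groups, centers))
    return centers
```

With two dominant delay values, the centers converge onto them; in the pipelined design, centers computed from epoch i's sample would only be applied to packets two epochs later.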

Page 33

Streaming k-Medians [CharikarSTOC03]

[Figure: in hardware, the packet stream feeds both the storage data structure and a packet-sampling stage; in software, an online clustering stage reduces the np packets sampled in epoch i to O(k log(np)) centers at epoch i+1, and an offline clustering stage produces the k centers used for packets in epoch i+2; the storage data structure is flushed to DRAM/SSD after every epoch for archival.]

Page 34

Naïve: Partitioned Bloom Filter (PBF)

[Figure: one Bloom filter per cluster center c1–c4. Insertion: the packet latency is matched (in parallel) against the centers, and bits in the closest center's Bloom filter are set by hashing the packet contents. Lookup: all Bloom filters are queried with the packet contents; the filter in which all hashed bits are 1 identifies the center.]

Page 35

Problems with PBF

- Provisioning is hard
  - Cluster sizes are not known a priori
  - Bloom filter sizes are over- or under-estimated
- Lookup complexity is higher
  - The data structure must be re-partitioned every cycle
  - Multiple random locations in the bitmap must be read (one per hash function)

Page 36

Shared-Vector Bloom Filter

[Figure: a single bit vector shared by all centers c1–c4. Insertion: the packet latency is matched (in parallel) against the centers; each hash of the packet contents locates a group of bits (one bit per center), and the bit at the matched center's offset within the group is set to 1. Lookup: for each hash, a bulk read fetches the whole group; ANDing the groups across hash functions leaves a 1 only at the matching center's position (the offset is the center id).]
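
The group-offset idea can be sketched as follows; this is a reconstruction from the slide, with a made-up SHA-256-based hashing scheme, not the hardware design.

```python
import hashlib

class SVBF:
    """Sketch of a Shared-Vector Bloom Filter: one bit vector shared by all
    cluster centers; each hash picks a k-bit group, and the bit at offset
    `center` within the group records cluster membership."""

    def __init__(self, num_groups, k_centers, num_hashes=3):
        self.m, self.k, self.h = num_groups, k_centers, num_hashes
        self.bits = [0] * (num_groups * k_centers)

    def _groups(self, key):
        for i in range(self.h):  # illustrative stand-in for h hash functions
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key, center):
        for g in self._groups(key):
            self.bits[g * self.k + center] = 1   # offset by the matched center

    def lookup(self, key):
        """AND the k-bit groups from all hashes; return candidate center ids."""
        acc = [1] * self.k
        for g in self._groups(key):
            grp = self.bits[g * self.k:(g + 1) * self.k]  # one bulk read per hash
            acc = [a & b for a, b in zip(acc, grp)]
        return [c for c, bit in enumerate(acc) if bit]
```

Because each hash touches one contiguous group of bits, a lookup needs only h bulk (burst) reads, instead of one random read per hash per Bloom filter as in the PBF.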

Page 37

Comparing PBF and SVBF

- PBF
  - (−) Lookup is not easily parallelizable
  - (−) Provisioning is hard, since the number of packets per BF is not known a priori
- SVBF
  - (+) A single Bloom filter is used
  - (+) Burst reads at word length
- COMB [Hao10Infocom]
  - (+) A single BF with groups of hash functions
  - (−) More memory usage than SVBF, and burst reads are not possible

Page 38

Comparing Storage Needs

For the same classification failure rate of 1% and 50 centers (k = 50):

Data structure | # of hash functions | Capacity (bits/entry) | Insertion | Lookup | Note
Hash table     | 1                   | 147                   | 1         | 1      | Stores only the latency value (no label)
PBF            | 9                   | 12.8                  | 9         | 450    | Provisioning is hard (12.8 only if cardinality is known beforehand)
COMB           | 7                   | 12.8                  | 14        | 77     | Alternate combinations exist
SVBF           | 9                   | 12.8                  | 9         | 27     | Burst reads; provisioning is easy

Page 39

Tie-Breaking Heuristic

- Bloom filters have false positives, and lookups search across all centers, so multiple centers may return a match
- The tie-breaking heuristic returns the matching center with the highest cardinality
  - A counter per center stores the number of packets matched to it (cluster cardinality)
  - Works well in practice (especially with skewed distributions)

Page 40

Estimation Accuracy

[Figure: CDF of absolute error (μs).]

Page 41

Accuracy of Aggregates

[Figure: CDF of relative error.]

Page 42

MAPLE Architecture: 2) Query Engine

[Figure: the central monitor sends a query Q(P1) to the query engine at router B and receives the answer A(P1).]

Page 43

Query Interface

- Assumption: the path of a packet is known
  - Possible to determine using forwarding tables
  - In OpenFlow-enabled networks, the controller has this information
- Query answer: a latency estimate
  - Answer types: (1) Match, (2) Multi-Match, (3) No-Match

Page 44

Query Bandwidth

- Query method 1: query using a packet hash
  - Hashed over invariant fields in the packet header
  - High query bandwidth for aggregate latency statistics (e.g., flow-level latencies)
- Query method 2: query using the flow key and IP identifier (IPID)
  - Supports range search to reduce query bandwidth overhead
  - Inserts: hash the flow key and IPID
  - Queries: send a flow key plus ranges of contiguous IPIDs, e.g., (f1, 1–5), (f1, 20–35)
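
The range-compressed query message can be sketched as below; the (flow_key, start, end) tuple encoding is illustrative, not the paper's wire format.

```python
# Collapse contiguous IPID runs of a flow into (flow_key, start, end)
# ranges, as in the slide's query message (f1, 1-5), (f1, 20-35).

def compress_ipids(flow_key, ipids):
    if not ipids:
        return []
    runs, sorted_ids = [], sorted(set(ipids))
    start = prev = sorted_ids[0]
    for i in sorted_ids[1:]:
        if i == prev + 1:
            prev = i                      # extend the current run
        else:
            runs.append((flow_key, start, prev))
            start = prev = i              # begin a new run
    runs.append((flow_key, start, prev))
    return runs
```

Since IPIDs within a flow are often sequential, long runs collapse into single ranges, which is where the bandwidth reduction on the next slide comes from.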

Page 45

Query Bandwidth Compression

[Figure: CDF of compression ratio; the median per-flow compression reduces query bandwidth by 90%.]

Page 46

Storage

- An OC-192 interface carries up to ~5 million packets per second, i.e., about 60 Mbits per second of latency records; assuming 10% utilization, 6 Mbits per second
- DRAM (16 GB): about 40 minutes of packets
- SSD (256 GB): about 10 hours of packets (enough time for diagnosis)

Page 47

Summary

- RLI and LDA are ossified in their aggregation level
- Proposed MAPLE as a mechanism to compute measurements across arbitrary sub-populations
  - Relies on clustering of dominant delay values
  - A novel SVBF data structure reduces storage and lookup complexity

Page 48

Conclusion

- Many applications demand low latencies
- Network operators need high-fidelity tools for latency measurements
- Proposed RLI for fine-grained per-flow measurements
- Proposed MAPLE to store per-packet latencies in a scalable way and compose latency aggregates across arbitrary sub-populations
- Many other solutions (papers on my web page)

Page 49

Sponsors

- CNS-1054788 (NSF CAREER): Towards a Knowledge Plane for Data Center Networks
- CNS-0831647 (NSF NECO): Architectural Support for Fault Management
- Cisco Systems: Designing Router Primitives for Monitoring Network Health