Reliable Multicast for Time-Critical Systems Mahesh Balakrishnan Ken Birman Cornell University

Mission-Critical Datacenters

COTS Datacenters: online e-tailers, search engines, corporate applications, web services

Mission-Critical Apps need: scalability, availability, fault-tolerance

… Timeliness!

The Time-Critical Datacenter

Migrating time-critical applications to commodity datacenters…

… conversely, providing datacenter web-services with time-critical performance.

What’s a Time-Critical System?

Not ‘real time’, but ‘real fast’!

Financial calculators, military command and control… air traffic control (ATC)

… foobooks.com!

Technology Gap: Real-Time focuses on determinism, scale-up architectures

The French ATC System

Mid to late 90s

Teams of 3-5 air traffic controllers on a cluster of desktop consoles

50-200 of these console clusters in an air traffic control center

Why study the French ATC?

ATC Subsystems

Radar Image

Weather Alert

Track Updates

Updates to Flight Plans

Console to Console State Updates

System Management and Monitoring

ATC center to center Updates

Multicast ubiquitous…

Two Kinds of Multicast

Virtually Synchronous Multicast: very reliable, not particularly fast

Unreliable Multicast: very fast, not particularly reliable

Nothing in between!

Two Kinds of Subsystems

Category 1: Complete reliability (virtual synchrony), e.g. routing decisions

Category 2: Careful application design + natural hardware properties + management policies, e.g. radar

Multicast in the French ATC

Engineering Lessons:

Structure the application to tolerate partial failures

Exploit natural hardware properties

Can we generalize to modern systems?

Research Direction: Time-Critical Reliability

Can we design communication primitives that encapsulate these lessons?

Anatomy of a Cloned Service

RACS

Updates multicast to whole group

Queries unicast to single nodes

Services

An Amazon web-page is constructed by 100s of co-operating services*

Multicast is used for:

Updating Cloned Services

Publish-Subscribe / Eventing

Datacenter Management/Monitoring

* Werner Vogels, CTO of amazon.com, at SOSP 2005

Multicast in the Datacenter

A node is in many multicast groups:

One for each service it hosts

One for each topic it subscribes to

One or more administration groups

Large Numbers of Overlapping Groups!
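To make the overlapping-group picture concrete, here is a minimal sketch of one node joining several IP multicast groups on a single UDP socket. The group addresses and the names `make_mreq`, `join_groups`, and `open_receiver` are illustrative assumptions, not part of the original talk; only the standard `socket` calls are real.

```python
import socket

def make_mreq(group_ip, iface_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq structure: group address + local interface."""
    return socket.inet_aton(group_ip) + socket.inet_aton(iface_ip)

def join_groups(sock, group_ips):
    """Join every group in group_ips on an already-bound UDP socket."""
    for group_ip in group_ips:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        make_mreq(group_ip))

def open_receiver(port, group_ips):
    """Open one UDP socket that receives traffic for all listed groups."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    join_groups(sock, group_ips)
    return sock

# Hypothetical group assignment for one node (addresses are made up):
NODE_GROUPS = {
    "service:inventory": "239.1.1.1",   # a service the node hosts
    "topic:price-updates": "239.1.2.1", # a topic the node subscribes to
    "admin:monitoring": "239.1.3.1",    # an administration group
}
# sock = open_receiver(5000, NODE_GROUPS.values())  # needs a multicast-capable network
```

The call to `open_receiver` is left commented out because it only succeeds on a host with a multicast-capable interface.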

Service Semantics

Product Popularity Service

Shipping Scheduler

Store Inventory

User History Service

Product Recommendations

User Profile Data

Data Store Services: stale data can result in overselling / underselling and loss of real-world dollars

Cache Services: updated periodically by back-end data stores

The Challenge

Datacenter blades are failure-prone:

Crash failures

Byzantine behavior

Bursty packet loss: end-host kernels drop packets when subjected to traffic spikes.

A New Reliability Model

Rapid delivery is more important than perfect reliability

Probabilistic Timeliness

Graceful Degradation
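One way to read "probabilistic timeliness" is as a statement about the fraction of messages delivered within a deadline, rather than a hard bound on every message. The sketch below computes that fraction from a list of measured delivery latencies. The function names and the sample values are hypothetical and purely illustrative; they are not measurements from the talk.

```python
def fraction_within(latencies_ms, deadline_ms):
    """Empirical P(delivery latency <= deadline).

    A latency of None means the message was never delivered,
    so it counts against every deadline.
    """
    if not latencies_ms:
        raise ValueError("need at least one sample")
    on_time = sum(1 for lat in latencies_ms
                  if lat is not None and lat <= deadline_ms)
    return on_time / len(latencies_ms)

def deadline_for(latencies_ms, target):
    """Smallest observed latency that meets the target fraction, else None."""
    delivered = sorted(lat for lat in latencies_ms if lat is not None)
    for candidate in delivered:
        if fraction_within(latencies_ms, candidate) >= target:
            return candidate
    return None

# Made-up samples: nine deliveries and one message that was given up on.
samples = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 8.0, 12.0, 40.0, None]
print(fraction_within(samples, 10.0))  # fraction delivered within 10 ms
print(deadline_for(samples, 0.9))      # deadline needed for 90% delivery
```

A designer can pick the deadline from such a curve instead of provisioning for a conservative worst case, and a rising loss rate shows up as a gradual drop in the fraction rather than a hard failure.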

Wanted: a multicast primitive that

1. Scales to large numbers of arbitrarily overlapping multicast groups

2. Delivers multicasts quickly

3. Tolerates datacenter failure modes – bursty packet loss, node failures

4. Offers probabilistic properties

5. ‘Gives up’ on lost data after a threshold period

Ricochet: Lateral Error Correction

Receivers exchange error correction XORs of multicast traffic

Works very well with multiple groups – scales up to a thousand groups per node

Probabilistic Timeliness: probability distribution of delivery latencies
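The core idea of exchanging XORs can be sketched in a few lines: a receiver XORs several packets it has received and sends the result to a peer, and a peer that is missing exactly one of those packets can rebuild it locally. This is a simplified illustration, not the Ricochet implementation; the packet ids, payloads, and the names `make_repair` and `recover` are assumptions made for the example.

```python
from functools import reduce

def xor_bytes(a, b):
    """XOR two byte strings, zero-padding the shorter one."""
    n = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(n, b"\0"), b.ljust(n, b"\0")))

def make_repair(packets):
    """Build a repair packet: the ids covered plus the XOR of their payloads."""
    ids = frozenset(packets)
    payload = reduce(xor_bytes, packets.values(), b"")
    return ids, payload

def recover(received, repair):
    """Rebuild the single missing packet covered by a repair, if possible.

    Returns (packet_id, payload), or None when zero or more than one
    of the covered packets is missing. Payloads are assumed to be of
    equal length; otherwise the result carries zero padding.
    """
    ids, payload = repair
    missing = [pid for pid in ids if pid not in received]
    if len(missing) != 1:
        return None
    for pid in ids:
        if pid in received:
            payload = xor_bytes(payload, received[pid])
    return missing[0], payload

# Receiver A got all three packets and sends one XOR to receiver B.
sent = {1: b"radar-01", 2: b"track-02", 3: b"plan--03"}
repair = make_repair(sent)
b_received = {1: sent[1], 3: sent[3]}   # receiver B lost packet 2
print(recover(b_received, repair))      # (2, b'track-02')
```

A single XOR repairs only one loss among the packets it covers, so a burst that hits two of them cannot be recovered from that repair alone.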

Predictive Total Ordering (Plato)

Delivers messages to applications with no ordering delay in most cases

Orders messages only if there is a high probability of out-of-order delivery across different nodes

Probabilistic Timeliness: probability distribution of ordered delivery latency

Performance

SRM takes seconds to recover lost packets

Ricochet recovers almost all packets within ~70 milliseconds

Conclusion

Moving from real-time (R/T) to time-critical (T/C) yields huge benefits!

Ricochet is faster… slashes latency… scalable…

A clean delivery delay curve is a powerful design tool, replacing traditional hard (but conservative) limits

We’re open for business:

Software and detailed paper available for download

Give it a try… tell us what you think!

www.cs.cornell.edu/projects/quicksilver/ricochet.html
