The Time-less Datacenter
Paul Borrill and Alan H. Karp, Earth Computing (The Datacenter Resilience Company)
Stanford EE Computer Systems Colloquium, Wednesday, November 16, 2016
http://ee380.stanford.edu


  • The Time-less Datacenter

    Paul Borrill and Alan H. Karp Earth Computing

    The Datacenter Resilience Company

    Stanford EE Computer Systems Colloquium, Wednesday, November 16, 2016

    http://ee380.stanford.edu

  • Cloud Computing

    The Three Taxes:
    1. Complexity
    2. Fragility
    3. Vulnerability

    2

  • Earth Computing | The Datacenter Resilience Company

    Twitter Today

    Systems can fail in catastrophic ways, leading to death or tremendous financial loss. Although there are many potential causes, including physical failure, human error, and environmental factors, design errors are increasingly becoming the most serious culprit*

    3

    *NASA Formal Methods Program: https://shemesh.larc.nasa.gov/fm/fm-why-new.html


  • Key Computer Science Problems

    Reliable Consensus
    • Generals Problem: no fixed-length protocol exists to guarantee a reliable solution in an environment where messages can get lost
    • Slow node vs. link failure indistinguishability, i.e. what can one side of a failed link assume about a partner or cohort on the other side?

    FLP Result
    • Impossibility of distributed consensus with one faulty process

    Key Idea
    • Don't depend on processes to provide liveness; use a new kind of link

    4

  • Problem: Event Ordering is Hard

    • In a distributed system over a general network, we can't tell whether an event at process R happened before an event at process Q, unless P caused R in some way
    • Causal Trees provide this guarantee when they are stable
    • Dynamic Causal Trees provide guarantees through failure & healing, iff you have AIT on each link
    • Needs Atomic Information Transfer (AIT) in the link

    [Figure: spacetime diagram of processes P, Q, and R exchanging messages (slope ≤ c), each event labeled with its vector clock, showing each process's causal history and future effect]

    5
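    The causal-history labels in the diagram above are vector clocks. A minimal sketch of how vector clocks decide happened-before (the P/Q/R names follow the slide's diagram; the functions are illustrative, not an Earth Computing API):

```python
# Illustrative vector clocks for three processes P, Q, R (names from the slide's diagram).
def vc_merge(a, b):
    """Pairwise max of two vector clocks (what a receiver does on message arrival)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def happened_before(a, b):
    """True iff the event stamped a causally precedes the event stamped b."""
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b)) and a != b

p = {"P": 1, "Q": 0, "R": 0}           # P's send event
q = vc_merge({"P": 0, "Q": 1, "R": 0}, p)
q["Q"] += 1                            # Q's receive event: merge, then tick

assert happened_before(p, q)           # the send precedes the receive
r = {"P": 0, "Q": 0, "R": 1}           # an unrelated event at R
assert not happened_before(p, r) and not happened_before(r, p)  # concurrent
```

    Concurrent events are incomparable in both directions, which is exactly why a general network cannot order events without extra structure such as a causal tree.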

  • Problem: Consensus is Hard

    • Failure detectors have failed to solve the problem
    • 2PC (Fail-Stop): vulnerable to coordinator failure (no safety proof)
    • 3PC: vulnerable to network partitions (no liveness proof)
    • Paxos (Fail-Recover): robust algorithm, but hard to understand & get right
    • Causal Trees make roles robust, easier to understand & verify

    6

  • Why? Because the Network is Flaky!

    • App developers believe the network is the problem
      • Networks drop, delay, duplicate & reorder packets
    • Networking people believe the apps are the problem
      • The network end-to-end principle: apps should retry to distinguish between delays & drops … but retries* ruin TCP's ordering guarantees
    • Both are incorrect. The solution requires a simple but fresh perspective

    Peter Bailis, Kyle Kingsbury. The network is reliable.
    http://aphyr.com/posts/288-the-network-is-reliable

    * Application retries (i.e. opening a new socket)

    7

  • Datacenter Failures Cascade

    Switches are DReDDful: they Drop, Reorder, Delay and Duplicate packets

    • Interdependent failures
    • Reconstruction storms
    • Timeout storms
    • Gossip storms
    • Cascade failures

    8

  • It's Time to Simplify

    Delta, Amazon, Google, Apple, Netflix, Paypal …

    9

  • The Big Idea

    World Wide Web (Tim Berners-Lee)
    Key Idea: 2 Simple Sets of Rules
    • Document Language (html)
    • Connection Protocol (http)
    ONE-WAY LINKS
    Cloud Computing: mere mortals can now get their computers to talk to each other

    Earth Computing
    Key Idea: 2 Simple Sets of Rules
    • Graph Language (gvml)
    • Connection Protocol (eclp)
    TWO-WAY LINKS
    Mere mortals can now manage their infrastructures

    10

  • Distributed Systems Primitives

    CAS (Atomic Instruction)
    Key Idea: Lock-Free Data Structures
    • New concurrency libraries
    • Atomic RMW
    • Concurrent safety, non-blocking
    • Shared memory
    {While (CAS(oldvalue, newvalue) != newvalue)}

    AIT (Atomic Information)
    Key Idea: Recoverable Atomic Tokens
    • Deterministic, in-order
    • Reversible atomic message
    • Deterministic recoverability, durable indivisible property
    • Reversible token
    {Transfer (AIT(tokenID, Notify=NO) != Continue)}

    11
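    The CAS retry loop on this slide can be sketched as follows. This is a hedged illustration: `AtomicCell` and `lock_free_increment` are invented names, and a mutex stands in for the atomicity a hardware CAS instruction provides.

```python
import threading

class AtomicCell:
    """Toy cell with compare-and-swap; a lock emulates hardware atomicity."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def cas(self, expected, new):
        """Atomically: if value == expected, set it to new. Returns success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def get(self):
        with self._lock:
            return self._value

def lock_free_increment(cell):
    """The non-blocking retry pattern the slide's pseudocode gestures at."""
    while True:
        old = cell.get()
        if cell.cas(old, old + 1):   # lost a race? loop and retry
            return old + 1

counter = AtomicCell(0)
lock_free_increment(counter)
lock_free_increment(counter)
assert counter.get() == 2
```

    AIT plays the analogous role across a link: a retryable, all-or-nothing transfer whose outcome both sides can recover.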

  • C2C Lattice of Cells & Links

    [Figure: typical Clos-based DC topology, spine-and-leaf architecture: DC Gateways (GW), Spine Nodes (SN), Leaf Nodes (LN), ToR switches and VMs, with DCs joined over DCI/WAN]

    Today's Networking: Servers & Switches (servers, any-to-any (IP) addressing)
    EARTH Computing: Cells & Links (simpler wiring: N2N, switchless)

    12

  • The Datacenter Simplified

    The Datacenter Today: Internal Segregation Firewalls
    EARTH: Dynamic Confinement Domains (fundamentally simpler)

    13

  • Earth Computing Network Fabric

    Split infrastructure into:
    • Cloud: datacenter accessed by untrusted legacy protocols
    • Earth: dynamic, resilient, programmable topologies
    • Core: where data is immutable, secure, protected, & resilient to perturbations (failures, disasters, attacks)

    [Figure: Data Center split into Cloudplane, EarthCore, and the Outside World]

    14

  • Confidential | Earth Computing Inc.

    The Big Idea

    [Figure: Cloudplane, EarthCore, Outside World]

    15

  • Logical Foundation for Resilience

    [Figure: a Fabric of cells, each cell consisting of a CellAgent with six NICs, connected cell-to-cell]

    16

  • EARTH Computing Link Protocol (ECLP)

    New Distributed Systems Foundation
    • Events: replaces heartbeats, timeouts
    • Addresses the Common Knowledge* problem

    [Figure: cells (Cell Agent + NICs) joined by cables]

    *Knowledge and Common Knowledge in a Distributed Environment, Joseph Y. Halpern & Yoram Moses '90 (initial version 1984).
    http://www.cs.cornell.edu/home/halpern/papers/common_knowledge.pdf

    17
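    AIT and ECLP are not publicly specified, so the following is only a hypothetical sketch of the recovery idea the slides describe: both halves of a link remember the token's phase, so a crashed-and-restarted side can learn from the surviving half whether delivery completed. `LinkHalf`, `transfer`, and the phase names are invented for illustration.

```python
class LinkHalf:
    """One end of a link; remembers the token parked on it and the transfer phase."""
    def __init__(self):
        self.token = None
        self.phase = "idle"            # idle -> prepared -> committed

def transfer(sender, receiver, token):
    """Hand the token to the receiving half, then commit on both halves."""
    receiver.token, receiver.phase = token, "prepared"  # token parked on far half
    sender.phase = "committed"         # sender learns the park succeeded
    receiver.phase = "committed"
    sender.token = None                # token now lives on exactly one half

a, b = LinkHalf(), LinkHalf()
a.token = "T1"
transfer(a, b, "T1")
assert b.token == "T1" and a.token is None
# If a cell crashes mid-transfer, the surviving half's phase tells the restarted
# side whether the token was delivered -- no heartbeat or timeout required.
```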

  • Composable Presence Management

    [Figure: two cells (CellAgent + NICs) joined through routers (Router + NICs) by cables]

    18

  • Composable Presence Management

    [Figure: two cells (CellAgent + NICs) joined through routers (Router + NICs) by cables]

    19

  • Demo

    20

  • Two Generals Problem

    21

  • Example Use Cases

    • Two-Phase Commit
    • Paxos
    • Link Reversal

    22

  • Example Use Cases

    • Two-phase commit: The prepare phase asks whether the receiving agent is ready to accept the token. This serves two purposes: communication liveness and agent readiness. Links provide the communication-liveness test, and we can avoid blocking on agent readiness by having the link store the token on the receiving half of the link. If there is a failure, both sides know, and both sides know what to do next.

    • Paxos: "Agents may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted." The assumption is that when a node has failed and restarted, it can't remember the state it needs to recover. With AIT, the other half of the link can tell it the state to recover from.

    • Reliable tree generation: Binary link-reversal algorithms work by reversing the directions of some edges, transforming an arbitrary directed acyclic input graph into an output graph with at least one route from each node to a special destination node. The resulting graph can thus be used to route messages in a loop-free manner. Links store the direction of the arrow (head and tail); AIT facilitates the atomic swap of the arrow's tail and head to maintain loop-free routes during failure and recovery.

    23
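    The link-reversal use case can be illustrated with a minimal full-reversal sketch. This is an assumption: the slides don't name a specific algorithm, so this follows the classic height-based scheme, with distinct heights and each edge pointing from the higher node to the lower.

```python
def full_reversal(heights, adj, dest):
    """Repeatedly lift every sink (other than dest) above its neighbors until
    dest is the only sink; then every node has a downhill route to dest."""
    while True:
        sinks = [v for v in heights if v != dest
                 and all(heights[u] > heights[v] for u in adj[v])]
        if not sinks:
            return heights
        for v in sinks:
            heights[v] = max(heights[u] for u in adj[v]) + 1

# A line of three cells A - B - C with destination A.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
h = full_reversal({"A": 0, "B": 2, "C": 1}, adj, "A")

# Every non-destination node now has at least one lower-height (outgoing)
# neighbor, i.e. a loop-free route toward A.
assert all(any(h[u] < h[v] for u in adj[v]) for v in adj if v != "A")
```

    In the slides' setting, the head and tail of each edge would live in the link itself, and AIT would make each reversal an atomic, recoverable step.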

  • Common Knowledge

    Courtesy: Adrian Colyer, The Morning Paper.
    https://blog.acolyer.org/2015/02/16/knowledge-and-common-knowledge-in-a-distributed-environment/

    24

  • Implementation on Smart NICs

    Cell Hardware: containing processor, memory, storage and (e.g., 6) physical network ports (PHYs)
    Cell Software: Primary Agent

    [Figure: Agent connected through Network Interface Controllers and Network Connectors; adjacent cells form an Entanglement Synchronization Domain]

    25

  • TRAPHs (Tree-gRAPHs)

    New Distributed Systems Foundation
    • Simple Provisioning, Confinement, Elasticity, Migration, Failover

    [Figure: NAL (Network Asset Layer) running on bare metal]

    26

  • Demo: Simulator

    27

  • Questions?

    • Don't make the datacenter look like the Internet
      • Simpler to configure/reconfigure
      • More resilient to all perturbations
      • Easier to secure

    • Key Ideas
      • RAFE: Reliable Address-Free Ethernet
      • Replace switches with cell-to-cell links
      • Don't rely on blueprints; discover the wiring
      • Event driven => no network timeouts
      • Keep state in links for recovery
      • No VLANs, no network-layer encryption
      • Scalable design: local-only view
      • No IP; service addressing
      • Self-recovering from link & server failures
      • RAFE is a discovery process rather than a configuration process

    [Figure: Cloudplane, EarthCore, Outside World]

    28