B4:Experience with a Globally Deployed Software Defined WAN


Why?

• To save money!

– WAN hardware and links are over-provisioned

– But this hardware is expensive!

– And Google’s traffic between DCs is increasing rapidly!


Assumptions/Insights

• Control over applications, servers, switches

• Only a few dozen datacenters

• Applications can:

– handle failures

– adapt to changing bandwidth

– signal traffic patterns and importance through class and priority


Implementation

• Full control over WAN routing

– WAN-scale SDN deployment

• Managing the links in a smart way

– Traffic Engineering (TE)


TAKING CONTROL OVER WAN LINKS Step-1

Before SDN, Google ran B4 as a single Autonomous System (AS) using the BGP and IS-IS protocols. In the current implementation, each WAN site is treated as a separate AS, and iBGP is used between them as a backup. At the global level, the SDN gateway controls traffic flow between sites.
• RIB: routing information base

• RPC: remote procedure call

• RAP: routing application proxy, written as an SDN application. It handles routing updates, carries routing protocol packets between Quagga and the OpenFlow (OF) switches, and relays interface updates from the switches to Quagga. The RAP caches the Quagga RIB and translates RIB entries into entries in Onix’s NIB (network information base).

DECIDING WHO GETS RESOURCES Step-2


Traffic Engineering (TE)

• Goal: share bandwidth among competing applications, possibly using multiple paths.

• Google defines bandwidth sharing in terms of max-min fairness.

• Basics of max-min fairness:

– No source gets a resource share larger than its demand.

– Sources with unsatisfied demands get an equal share of the resource.
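The two max-min fairness rules above can be illustrated with a small progressive-filling sketch. It assumes a single shared link and equally weighted sources, which is a simplification and not a description of B4's actual allocator; the function name `max_min_fair` and the example demands are hypothetical.

```python
def max_min_fair(capacity, demands):
    """Split `capacity` among sources so that no source exceeds its demand
    and all unsatisfied sources end up with an equal share.

    Assumption: one shared link, equal weights (illustrative only).
    """
    alloc = {name: 0.0 for name in demands}
    unsatisfied = set(demands)
    remaining = float(capacity)

    while unsatisfied and remaining > 0:
        share = remaining / len(unsatisfied)
        # Sources whose outstanding demand fits in the equal share
        # are fully satisfied in this round.
        done = {s for s in unsatisfied if demands[s] - alloc[s] <= share}
        if not done:
            # Nobody can be satisfied: split what is left equally and stop.
            for s in unsatisfied:
                alloc[s] += share
            remaining = 0.0
            break
        for s in done:
            remaining -= demands[s] - alloc[s]
            alloc[s] = float(demands[s])
        unsatisfied -= done

    return alloc


if __name__ == "__main__":
    # A and B are fully satisfied; C receives whatever capacity is left.
    print(max_min_fair(10, {"A": 2, "B": 4, "C": 10}))
```

In this example the small demands are met in full and the remaining capacity goes to the one source that cannot be satisfied, which is the behavior both rules describe.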


TE Optimization Algorithm

• Traditional solutions are expensive

• Google’s solution

– Aggregate flows into flow groups and tunnel groups

– 25x faster, and utilizes at least 99% of the bandwidth

• Three steps

– Tunnel selection: select tunnels for each flow group (FG)

– Tunnel group generation: allocate bandwidth to FGs

– Tunnel group quantization: change split ratios in each tunnel group (TG) to match the granularity supported by switch hardware
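The quantization step can be made concrete with a short sketch. It rounds ideal split ratios to multiples of 1/`units` using largest-remainder rounding, which is a stand-in technique chosen for illustration and not necessarily the procedure Google uses; the function name `quantize_splits` and the parameter `units` (the number of equal slices the hardware is assumed to support) are hypothetical.

```python
import math


def quantize_splits(ratios, units):
    """Round split ratios to multiples of 1/units while keeping their sum at 1.

    Assumption: hardware can only split traffic in `units` equal slices.
    Technique: largest-remainder rounding (illustrative substitute).
    """
    total = sum(ratios)
    # Ideal number of slices per tunnel, possibly fractional.
    scaled = [r / total * units for r in ratios]
    counts = [math.floor(x) for x in scaled]
    leftover = units - sum(counts)

    # Give the slices lost to rounding to the tunnels that lost the most.
    by_remainder = sorted(
        range(len(ratios)),
        key=lambda i: scaled[i] - counts[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1

    return [c / units for c in counts]


if __name__ == "__main__":
    # Ideal ratios over three tunnels, hardware limited to quarters.
    print(quantize_splits([0.7, 0.2, 0.1], 4))
```

Coarser granularity (smaller `units`) pushes the realized split further from the ideal one, which is why the quantization step matters for how closely the hardware can follow the computed allocation.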


TE State and OpenFlow

• B4 switches operate in 3 roles:

1. Encapsulating switch initiates tunnels and splits traffic between them.

2. Transit switch forwards packets based on their outer header.

3. Decapsulating switch terminates tunnels then forwards packets using regular routes.
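The three roles can be pictured with a toy model. Everything in it is an assumption made for illustration: the `Packet` fields, the table layouts, and the use of a CRC32 hash of the flow ID to pick a tunnel are hypothetical and do not reflect B4's real data plane or OpenFlow messages.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    flow_id: str                        # identifies the flow (hypothetical field)
    dst_site: str                       # inner destination
    outer_tunnel: Optional[str] = None  # outer header, set while tunneled


def encapsulate(pkt, tunnels):
    """Encapsulating switch: pick a tunnel by weighted hashing and tag the packet.

    `tunnels` is a list of (tunnel_id, integer_weight) pairs.
    """
    slots = [tid for tid, weight in tunnels for _ in range(weight)]
    index = zlib.crc32(pkt.flow_id.encode()) % len(slots)
    pkt.outer_tunnel = slots[index]     # same flow always maps to the same tunnel
    return pkt


def transit_next_hop(pkt, transit_table):
    """Transit switch: forward using only the outer header."""
    return transit_table[pkt.outer_tunnel]


def decapsulate(pkt, routes):
    """Decapsulating switch: strip the outer header, then use regular routes."""
    pkt.outer_tunnel = None
    return routes[pkt.dst_site]


if __name__ == "__main__":
    pkt = encapsulate(Packet("flow-42", "siteC"), [("T1", 3), ("T2", 1)])
    print(pkt.outer_tunnel)
    print(transit_next_hop(pkt, {"T1": "siteB", "T2": "siteD"}))
    print(decapsulate(pkt, {"siteC": "port-7"}))
```

Hashing on the flow ID keeps all packets of one flow on the same tunnel while the integer weights approximate the desired split between tunnels.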


Using TE and shortest path together


RESULTS Step-3


Link Utilization


Failures

Google conducted experiments to test the recovery time from different types of failures. Their results are summarized below:

Recovery from a transit switch failure is slow because the encapsulating switch must update table entries for potentially several tunnels, and this operation typically takes 100 ms.

Experience with an outage

• During a move of switches from one physical location to another, two switches were manually configured with the same ID.

• As a result, the network state never converged to the topology.

• The system recovered after all traffic was stopped, buffers were emptied, and the OFCs were restarted from scratch.
