TungstenFabric (Contrail) at Scale in Workday
Mick McCarthy, Software Engineer @ Workday
David O’Brien, Software Engineer @ Workday

TungstenFabric at Scale in Workday - Linux Foundation Events

Agenda

Introduction

Contrail at Workday

Scale

High Availability

Weekly Production Release Cycle

Segmentation

Conclusion

Q & A

Introduction

Mick McCarthy
Personal Intro

• Software Engineer @ Workday
• Providing Network services to the Workday Private Cloud based on OpenStack

David O’Brien
Personal Intro

• Software Engineer @ Workday
• Providing Network services to the Workday Private Cloud based on OpenStack

Contrail at Scale in Workday
Topic Intro

• Workday - Enterprise SaaS
  ‒ HCM, Finance, Payroll

Contrail @ Workday

History

• Running Contrail in Production since early 2016
• Versions in Production
  ‒ 2.21.x => single controller - non-HA
  ‒ 3.2.x => 3 controller - HA

Use Cases

• Providing Networking Services for OpenStack based Private Cloud
  ‒ Overlay Networking (MPLSoGRE)
  ‒ DNS
  ‒ DHCP
  ‒ Segmentation

Contrail Architecture

Scale

Scale

35+ OpenStack/Contrail Clusters

300K+ Cores

4K+ Hypervisors

20K+ Virtual Machines (Immutable Images)

150K+ Contrail Network Policies

100+ Tenant Networks

15+ Critical Workday Services

High Availability

High Availability

Deployment Topology

High Availability - Benefits

1. Fault Tolerance
2. Throughput
3. ZDT (zero-downtime) upgrades

High Availability - Challenges

1. Operational Complexity
2. “24 x 7” availability
3. “HA” HAProxy?
4. ZDT (zero-downtime) upgrades

High Availability - Lessons Learned

(1/4) Observability
(2/4) Orchestration
(3/4) Smaller clusters (more of them)

High Availability - Lessons Learned

(4/4) Contrail DNS
  ‒ Hard to configure internal DNS delegations
  ‒ Contrail DNS keeps 2 out of 3 as active

Weekly Production Release Cycle

Immutable Images

• Workday Services packaged as VM images
• New service version is a new version of a VM image
• Weekly service deployments (tight patch window)
• 20K+ VM deletion and recreation

Control Plane Usage

• Number of POST /v2.0/ports.json per sec

Challenges

• Duplicate IPs
  ‒ Contrail bug visible only under high control plane load
  ‒ Contrail uses Zookeeper to figure out the next available IP in a subnet
  ‒ Caused by a Zookeeper race condition
• Delayed DHCP
  ‒ Contrail control plane slowing down under high load
  ‒ vRouter handing out short-term, incomplete DHCP leases
  ‒ Freed up the Contrail control plane by adding memcache for Keystone (helped a bit)
  ‒ Optimized client side to reduce Contrail API traffic (big relief)
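The duplicate-IP race above is the classic check-then-claim problem: two allocators both see an address as free before either records it. A minimal sketch of a race-free alternative follows, assuming a store that offers an atomic "create if absent" operation. The `InMemoryStore` class, the `/ipam/...` path layout, and all function names are hypothetical stand-ins for illustration. This is not Contrail's actual allocator, and the slides do not describe how the bug was fixed.

```python
import ipaddress
import threading


class NodeExists(Exception):
    """Raised when a path has already been claimed."""


class InMemoryStore:
    """Hypothetical stand-in for a coordination service.

    The only property relied on is that create() is atomic:
    exactly one caller can claim a given path.
    """

    def __init__(self):
        self._nodes = {}
        self._lock = threading.Lock()

    def create(self, path, value):
        with self._lock:
            if path in self._nodes:
                raise NodeExists(path)
            self._nodes[path] = value


def allocate_ip(store, subnet, owner):
    """Claim the first free host address by attempting the atomic create.

    There is no separate "is it free?" check, so two callers can
    never both win the same address.
    """
    for host in ipaddress.ip_network(subnet).hosts():
        try:
            store.create(f"/ipam/{subnet}/{host}", owner)
            return str(host)
        except NodeExists:
            continue  # someone else owns it, try the next address
    raise RuntimeError(f"subnet {subnet} exhausted")


def allocate_concurrently(subnet, workers, per_worker):
    """Simulate a burst of port creations from several threads."""
    store = InMemoryStore()
    results = []
    results_lock = threading.Lock()

    def worker(n):
        for i in range(per_worker):
            ip = allocate_ip(store, subnet, f"port-{n}-{i}")
            with results_lock:
                results.append(ip)

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `allocate_concurrently("10.0.0.0/24", 8, 20)` hands out 160 addresses with no duplicates, because ownership is decided by the single atomic create rather than by a read followed by a write.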

Challenges

• Contrail Schema Failover
  ‒ Schema busy processing requests (CPU intensive)
  ‒ Schema gevent greenlet does not yield to Zookeeper heartbeats
  ‒ Multi master schema (causes data corruption)
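The failover problem above follows from cooperative scheduling: a task that never yields starves every other task, including the one sending heartbeats. The sketch below reproduces the effect with Python's standard `asyncio` as an analog for gevent greenlets (gevent itself is not used here). The function names and the `yield_every` parameter are hypothetical, and this is an illustration of the scheduling behaviour, not Contrail's schema transformer code.

```python
import asyncio


async def heartbeat(state):
    """Stand-in for a session keepalive: counts each chance it gets to run."""
    while not state["done"]:
        state["beats"] += 1
        await asyncio.sleep(0)  # hand control back to the event loop


async def process(items, state, yield_every):
    """CPU-bound loop. With yield_every=0 it never yields to other tasks."""
    for i in range(items):
        sum(range(100))  # placeholder for CPU-intensive work
        if yield_every and (i + 1) % yield_every == 0:
            await asyncio.sleep(0)  # explicit cooperative yield
    state["done"] = True


async def run(items, yield_every):
    """Return how many heartbeats were sent while the work was running."""
    state = {"beats": 0, "done": False}
    hb = asyncio.create_task(heartbeat(state))
    await process(items, state, yield_every)
    await hb
    return state["beats"]
```

`asyncio.run(run(100, 0))` returns 0: the heartbeat task is scheduled but never gets a turn before the work finishes. `asyncio.run(run(100, 10))` returns a positive count because the worker yields periodically. In a real deployment, a starved heartbeat is what lets the session expire and a second master take over.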

Data Plane Usage

• bps through the gateway routers (Juniper MX40)

Challenges

• Contrail Analytics overwhelmed with Flow data
  ‒ Too much flow telemetry data
  ‒ High CPU usage (all cores)
  ‒ High IO/Disk usage on Cassandra

Lessons Learned

• Always have a Production like environment
  ‒ End to End
  ‒ Exactly as it happens in Production
• Test frequently
  ‒ Ideally in CI to narrow down the changes
• Design fault tolerance with SLA in mind
  ‒ Not enough redundancy from SLA point of view
  ‒ Loss of one DNS would bring down DNS briefly and violate SLA
• Monitoring, Monitoring, Monitoring

Segmentation

Segmentation

• Fine Grained Network Policies
  ‒ Layer 4 rules per service port
  ‒ 150K+ rules
• Tenant Isolation
  ‒ Subnet for each tenant
  ‒ DNS subdomain per tenant

Challenges - Network Policies

• Size of compiled ACLs
  ‒ Proportional to the number of rules across all policies applied to a virtual network.
  ‒ Starts becoming bigger than the max HTTP request body size allowed by the Python Bottle web server
• Policy Updates
  ‒ CPU intensive, needs to walk the policy, network graph
  ‒ Schema transformer gets really busy
  ‒ Processing happens in a greenlet, doesn’t yield to anything else
  ‒ Removed a lot of east west policies in Dev clusters
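To make the ACL size point concrete, the sketch below estimates how a serialized rule list grows with the rule count and checks it against a request body limit. Everything here is an assumption for illustration: the rule fields, the JSON encoding, and the 1 MiB `ASSUMED_MAX_BODY_BYTES` limit are hypothetical and do not reflect Contrail's real ACL format or the limit configured at Workday.

```python
import json

# Hypothetical limit, chosen only for illustration.
ASSUMED_MAX_BODY_BYTES = 1024 * 1024


def compiled_acl_bytes(rule_count):
    """Serialized size of an ACL with rule_count Layer 4 rules (made-up schema)."""
    rules = [
        {
            "protocol": "tcp",
            "src": "10.0.0.0/24",
            "dst": "10.0.1.0/24",
            "dst_port": 8000 + (i % 1000),
            "action": "pass",
        }
        for i in range(rule_count)
    ]
    return len(json.dumps({"acl_entries": rules}).encode("utf-8"))


def fits_in_request(rule_count, limit=ASSUMED_MAX_BODY_BYTES):
    """True if the serialized ACL would fit in a single request body."""
    return compiled_acl_bytes(rule_count) <= limit
```

Under these assumptions the payload grows roughly linearly with the rule count, so a handful of rules fits easily while a rule count in the range mentioned on the slides (150K+) does not fit in one request body.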

Challenges - Tenant Isolation

• IP Address Management
  ‒ Unique, non overlapping subnet per tenant
  ‒ OpenStack custom Heat plugins integrated with internal IPAM system for creating and allocating subnets to tenant networks
• Reverse DNS
  ‒ Could not get reverse DNS to work with Contrail
  ‒ Simplifying the DNS stack
  ‒ Moving to native Neutron extensions
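The "unique, non overlapping subnet per tenant" requirement above can be sketched with Python's standard `ipaddress` module. This is a toy allocator under stated assumptions: the supernet, prefix length, tenant names, and function names are hypothetical, and it does not represent the internal IPAM system or the Heat plugins mentioned on the slide.

```python
import ipaddress
from itertools import combinations


def allocate_tenant_subnets(supernet, prefix_len, tenants):
    """Give each tenant the next unused subnet carved out of supernet."""
    pool = ipaddress.ip_network(supernet).subnets(new_prefix=prefix_len)
    allocation = {}
    for tenant in tenants:
        try:
            allocation[tenant] = next(pool)
        except StopIteration:
            raise ValueError(f"{supernet} has no /{prefix_len} left for {tenant}") from None
    return allocation


def has_overlap(networks):
    """True if any two networks in the collection share addresses."""
    return any(a.overlaps(b) for a, b in combinations(networks, 2))
```

For example, `allocate_tenant_subnets("10.20.0.0/16", 24, ["hcm", "finance", "payroll"])` assigns 10.20.0.0/24, 10.20.1.0/24, and 10.20.2.0/24 in that order, and `has_overlap` on the resulting values is False. Carving subnets from one iterator is what guarantees uniqueness here; a real IPAM system would also have to persist allocations and handle release.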

Lessons Learned

• Is fine grained L4 isolation really required?
  ‒ East West wide open within an environment
  ‒ Mutual TLS
• Dedicated CIDRs per cluster
  ‒ Easier to separate out gateway routers in a more fine grained manner. Advertise relevant cluster prefix to the underlay.
  ‒ IPv4 address space limitation
• Contrail’s concurrency model
• Automation

Conclusion

1. Contrail
2. Plan Ahead for Scaling
3. Smaller clusters (more of them)
4. DevOps mindset

Thank You
Q&A