60
Lies, Damned lies, and timeouts Coping with failures in datacenters Yao Yue

Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Lies, Damned lies, and timeouts

Coping with failures

in datacenters

Yao Yue

Page 2: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve
Page 3: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Who am I?

Cache @ Twitter

Now working on performance in general

@thinkingfish

Page 4: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

I’ve been telling my coworkers the same things for years…

Why do I want to give this talk?

Page 5: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

What do I know?

▸ My views are heavily influenced by two things:

▸ In-memory, datacenter-scale caching

▸ Twitter’s environment

Page 6: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Cache in datacenters

▸ Distributed in-memory KV store

▸ Serving many, many requests

▸ with very tight latency expectation

Page 7: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Twitter’s runtime

▸ Architecture: monolithic -> SOA/Microservices

▸ Owns datacenters

▸ Services mostly on JVMs

▸ Jobs deployed with containers

▸ Scale: up to many thousands of nodes per service

Page 8: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Living in an imperfect world

Page 9: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Consensus over unreliable link is UNSOLVABLE

Two unhappy generals

?

Two General Problem / Paradox

Page 10: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

timeouts & retries

The engineering approach to cope with failures

Page 11: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

timeouts & retries

The engineering approach to cope with failures

Page 12: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Redundant requests

The engineering approach to cope with failures

Page 13: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Further mitigation

The engineering approach to cope with failures

YOU KNOW WHAT? FORGET ABOUT IT…

Page 14: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Coping with failureTimeouts, Retries, and preventions

Page 15: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Coping with failure

Request

Retry (N)

Failure Prevention

Page 16: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Timeout and retry

▸ Timeout: an approximation of failure

▸ False positive is possible

▸ Retry: an effort to reduce failure

▸ Has a cost

▸ Has an effect on system

Page 17: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Timeout and retry get close to the heart of the difficulty of distributed systems, and yet we treat them casually because they’re often presented as configuration.

Page 18: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

What works intuitively may lead to catastrophes.

Page 19: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Timeout

Page 20: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Timeout could be misleading

TIMEOUT INFORMS ME ABOUT COMMUNICATION TO REMOTE SERVICE

SERVICE A SERVICE B

???

Page 21: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

One service

SERVICE

LIBRARY+VM

KERNEL

HARDWARE

NETWORK

Page 22: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

SERVICE A

LIBRARY+VM

KERNEL

HARDWARE

NETWORK

SERVICE B

LIBRARY+VM

KERNEL

HARDWARE

One service Another service

Page 23: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

SERVICE A

LIBRARY+VM

KERNEL

HARDWARE

NETWORK

SERVICE B

DO NOT

CARE

One service Another service

Page 24: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

One serviceSERVICE

LIBRARY+VM

KERNEL

HARDWARE

NETWORK

resource contention, head-of-line blocking, locking…

garbage collection, function calls with indeterministic timing…

system calls with indeterministic timing, background tasks, unfair scheduling…

blocking IO, congestion, cycle stealing…

packet drop, queue backup…

Page 25: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Multitenant runtime

SERVICE A

LIBRARY+VM

KERNEL

HARDWARE

OTHER SERVICE

OTHER SERVICE

Page 26: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Multi-tenancy with Noisy Neighbors

OTHER SERVICE

OTHER SERVICE

KERNEL

HARDWARE

SERVICE A

LIBRARY+VM

Page 27: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Timeout cascade

…SERVICE A

SERVICE B

SERVICE C

Accountability Gap

Accountability Gap

Page 28: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Inconvenient truths about timeout

▸ Timeouts often do not indicate remote service health

▸ The optimal timeout is a moving target

▸ Often less predictable in shared environment

▸ Have gaps in overall timeline

Page 29: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Chained dependencies

SERVICE ASERVICE C

SERVICE B

Q: A’s requests time out, but not B’s, which on-call engineer(s) should be paged?

Page 30: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Chained dependencies

SERVICE ASERVICE C

SERVICE B

Q: A’s requests time out, so do B’s, which on-call engineer(s) should be paged?

Page 31: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Oftentimes modern application architecture is a meshMess

Page 32: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

correlation != causality

Page 33: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

timeouts != causality

Page 34: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Retry

Page 35: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Retries carry risks

Request

Retry (N)

RETRIES IMPROVE THE SUCCESS RATE OF SERVICE???

Page 36: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Retry is extra work

2x requests

Page 37: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Retry is extra work

2x requests 2x responses

Page 38: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Retry != Replay

▸ Potential behavioral change

▸ State change

▸ Hidden retries in lower stack

▸ e.g. TCP

Page 39: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Other retry key decisions

▸ How many times?

▸ How to set timeout for each retry?

▸ How much delay between retries?

▸ Where to send it?

Page 40: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Inconvenient truths about retry

▸ Retries can create positive feedback loop

▸ Retries can change system state and behavior

▸ Sophisticated retry configuration has many knobs

Page 41: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Chained dependencies

SERVICE ASERVICE C

SERVICE B

If C is slow…

Page 42: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Chained dependencies

SERVICE ASERVICE C

SERVICE B

If B is slow…

Page 43: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Retry at scale- “How to DDoS A cache?”

▸ Timeout ⇒ connection teardown

▸ Large number of clients

▸ Stateful backend, fixed route

▸ Full connectivity mesh

▸ Variable connections per route

▸ Container-based deploy

Page 44: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Transient hotspot ⇒ initial timeouts (closing connections) ⇒ connection storm (fixed route) ⇒ massive timeouts (full mesh) ⇒ bigger storm (multiple, variable connections) ⇒ NIC saturation / backend offline ⇒ frontend failing, site down

Retry at scale- “How to DDoS A cache?”

Page 45: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Service mesh can make retries significantly worse

▸ Top-level retries affect whole system

▸ Bigger multipliers toward the bottom

▸ Hard to predict bottleneck

Page 46: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Wrong implicit assumption, single input, single feedback loop

Problems with naive failure coping mechanisms

Page 47: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

How to improve?

Page 48: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Address the positive feedback loop

Timeout

Retry++

Page 49: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Slow down, retry less

Timeout

Retry+ +

Page 50: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Introduce other feedback loop

Timeout

Retry+ +

-

Page 51: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Good timeouts

▸ Minimize false positives caused by local stack

▸ Attempt to root cause with other signals

Page 52: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Good retries

▸ Customize for purpose of request

▸ Be as conservative as possible

▸ Circuit breakers for timeout induced retries

▸ State-based rules

▸ Back-pressure

Page 53: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Manage impact across services

▸ Enforce top-down budget

▸ Apply and act on back pressure

Using micro services? Your task just got much harder…

▸ Tracing helps

Page 54: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

drill, baby, drill

▸ Test common failure scenarios

▸ Test failures at different parts

▸ Test partial failures

▸ Test relaxed timeout/retry/escalation config

Page 55: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Example: a typical cache config

(Guide w/ examples, 2K+ words excl. code)

▸ Overall timeout: 500ms* ▸ Request timeout: 150ms* ▸ Connect timeout: 200ms ▸ Pipelining: request timeout does not reset connection

▸ Read retry: 2 tries*, no write-back, no backoff ▸ Write retry: 3 tries*, random backoff (5-200ms) ▸ Overall retry budget: 20% of requests, minimum 10 retries per second (10-sec credit window)

▸ Blackhole: 5 consecutive failures*, revive after 30 seconds ▸ Centralized topology manager, changes dampened >1min

CACHE SLO: P999 <5MS

Page 56: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Reduce varianceMaking timeout/retry easier

Page 57: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Example: Making cache more predictable

▸ Staying off shared hosts

▸ Rate limiting new connections with iptables

▸ Setting CPU affinity for sirq, application

▸ Removing shared locks

▸ All blocking IO on background threads

Page 58: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

There’s no magic- Important to know when to give up

?

Page 59: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Understand common anomalies, Break the loop, Form a bigger picture,

and you’ll be happy most of the time.

Page 60: Coping with failures in datacenters Lies, Damned …Coping with failures in datacenters Yao Yue Who am I? Cache @ Twitter Now working on performance in general @thinkingfish I’ve

Questions?