48
Stability Patterns …and Antipatterns © Michael Nygard, 2007-2012 1 Michael Nygard [email protected] @mtnygard Saturday, June 23, 12

Stability patterns presentation

Embed Size (px)

DESCRIPTION

by Michael Nygard - @mtnygard - Velocity Conf 2012

Citation preview

Page 1: Stability patterns presentation

Stability Patterns…and Antipatterns

© Michael Nygard, 2007-2012 1

Michael [email protected]

@mtnygard

Saturday, June 23, 12

Page 2: Stability patterns presentation

Stability Antipatterns

2Saturday, June 23, 12

Page 3: Stability patterns presentation

Integration Points

Integrations are the #1 risk to stability.

Every out of process call can and will eventually kill your system.

Yes, even database calls.

Saturday, June 23, 12

Page 4: Stability patterns presentation

Example: Wicked database hang

Saturday, June 23, 12

Page 5: Stability patterns presentation

“In Spec” vs. “Out of Spec”

“In Spec” failuresTCP connection refusedHTTP response code 500Error message in XML response

Example: Request-Reply using XML over HTTP

Well-Behaved Errors Wicked Errors

“Out of Spec” failures

TCP connection accepted, but no data sentTCP window full, never clearedServer replies with “EHLO”Server sends link farm HTMLServer streams Weird Al mp3s

Saturday, June 23, 12

Page 6: Stability patterns presentation

Remember This

Necessary evil.

Peel back abstractions.

Large systems fail faster than small ones.

Useful patterns: Circuit Breaker, Use Timeouts, Use Decoupling Middleware, Handshaking, Test Harness

Saturday, June 23, 12

Page 7: Stability patterns presentation

Chain Reaction

Failure moves horizontally across tiers

Common in search engines and app servers

Saturday, June 23, 12

Page 8: Stability patterns presentation

Remember This

One server down jeopardizes the rest.

Hunt for Resource Leaks.

Useful pattern: Bulkheads

Saturday, June 23, 12

Page 9: Stability patterns presentation

Cascading Failure

Failure moves vertically across tiers

Common in enterprise services & SOA

Saturday, June 23, 12

Page 10: Stability patterns presentation

Remember This

“Damage Containment”

Stop cracks from jumping the gap

Scrutinize resource pools

Useful patterns: Use Timeouts, Circuit Breaker

Saturday, June 23, 12

Page 11: Stability patterns presentation

Too many, too clicky

Some malicious users

Buyers

Front-page viewers

Screen scrapers

Users

Saturday, June 23, 12

Page 12: Stability patterns presentation

Handle Traffic Surges Gracefully

Degrade features automatically

Shed load.

Don’t keep sessions for bots.

Reduce per-user burden:

IDs, not object graphs.

Query parameters, not result sets.

Saturday, June 23, 12

Page 13: Stability patterns presentation

Blocked Threads

All request threads blocked = “crash”

Impossible to test away

Learn to use java.util.concurrent or System.Threading.(Ruby & PHP coders, just avoid threads completely.)

Saturday, June 23, 12

Page 14: Stability patterns presentation

Pernicious and Cumulative

Hung request handlers = less capacity.Hung request handler = frustrated user/caller

Each remaining thread serves 1/(N-1) extra requests

Saturday, June 23, 12

Page 15: Stability patterns presentation

Example: Blocking calls

String key = (String)request.getParameter(PARAM_ITEM_SKU);Availability avl = globalObjectCache.get(key);

Object obj = items.get(id);if(obj == null) { obj = strategy.create(id);}…

In a request-processing method

In GlobalObjectCache.get(String id), a synchronized method:

In the strategy:public Object create(Object key) throws Exception { return omsClient.getAvailability(key);}

Saturday, June 23, 12

Page 16: Stability patterns presentation

Remember This

Use proven constructs.

Don’t wait forever.

Scrutinize resource pools.

Beware the code you cannot see.

Useful patterns: Use Timeouts, Circuit Breaker

Saturday, June 23, 12

Page 17: Stability patterns presentation

Attacks of Self-Denial

BestBuy: XBox 360 Preorder

Amazon: XBox 360 Discount

Victoria’s Secret: Online Fashion Show

Anything on FatWallet.com

Saturday, June 23, 12

Page 18: Stability patterns presentation

Defenses

Avoid deep linksStatic landing pagesCDN diverts or throttles usersShared-nothing architecture Session only on 2nd clickDeal pool

Saturday, June 23, 12

Page 19: Stability patterns presentation

Remember This

Open lines of communication.

Support your marketers.

Saturday, June 23, 12

Page 20: Stability patterns presentation

Unbalanced Capacities

OnlineStore

SiteScopeNYC

Customers

SiteScopeSan Francisco

20 Hosts

75 Instances

3,000 Threads

OrderManagement

6 Hosts

6 Instances

450 Threads

Scheduling

1 Host

1 Instance

25 Threads

Saturday, June 23, 12

Page 21: Stability patterns presentation

Scaling Ratios

Dev QA Prod

Online Store 1/1/1 2/2/2 20/300/6

Order Management 1/1/1 2/2/2 4/6/2

Scheduling 1/1/1 2/2/2 4/2

Saturday, June 23, 12

Page 22: Stability patterns presentation

Unbalanced Capacities

Scaling effect between systems

Sensitive to traffic & behavior patterns

Stress both sides of the interface in QA

Simulate back end failures during testing

Saturday, June 23, 12

Page 23: Stability patterns presentation

SLA Inversion

Frammitz

99.99%

Corporate MTA

99.999%

SpamCannon's

DNS

98.5%

SpamCannon's

Applications

99%

Corporate DNS

99.9%

Inventory

99.9%

Message

Broker

99%

Partner 1's

Application

No SLA

Partner 1's

DNS

99%

Message

Queues

99.99%

Pricing and

Promotions

No SLA

What SLA can Frammitz really guarantee?Saturday, June 23, 12

Page 24: Stability patterns presentation

Remember This

No empty promises.

Monitor your dependencies.

Decouple from your dependencies.

Measure availability by feature, not by server.

Beware infrastructure services: DNS, SMTP, LDAP.

Saturday, June 23, 12

Page 25: Stability patterns presentation

Unbounded Result Sets

Development and testing is done with small data sets

Test databases get reloaded frequently

Queries often bonk badly with production data volume

Saturday, June 23, 12

Page 26: Stability patterns presentation

Unbounded Result Sets: Databases

SQL queries have no inherent limits

ORM tools are bad about this

Appears as slow performance degradation

Saturday, June 23, 12

Page 27: Stability patterns presentation

Unbounded Result Sets: SOA

Chatty remote protocols, N+1 query problem

Hurts caller and provider

Caller is naive, trusts server not to hurt it.

Saturday, June 23, 12

Page 28: Stability patterns presentation

Remember This

Test with realistic data volumesDon’t trust data producers.Put limits in your APIs.

Saturday, June 23, 12

Page 29: Stability patterns presentation

Stability Patterns

29Saturday, June 23, 12

Page 30: Stability patterns presentation

Circuit Breaker

Ever seen a remote call wrapped with a retry loop?

int remainingAttempts = MAX_RETRIES;

while(--remainingAttempts >= 0) { try { doSomethingDangerous(); return true; } catch(RemoteCallFailedException e) { log(e); }}return false;

Why?Saturday, June 23, 12

Page 31: Stability patterns presentation

Faults Cluster

Fast retries good for for dropped packets(but let TCP do that)

Most other faults require minutes to hours to correct

Immediate retries very likely to fail again

Saturday, June 23, 12

Page 32: Stability patterns presentation

Faults Cluster

Problems with the remote host, application or

the network will probably persist

for an long time... minutes

or hours

Saturday, June 23, 12

Page 33: Stability patterns presentation

Bad for Users and Systems

Systems:

Ties up threads, reducing overall capacity.

Multiplies load on server, at the worst times.

Induces a Cascading Failure

Users:

Wait longer to get an error response.

What happens after final retry?

Saturday, June 23, 12

Page 34: Stability patterns presentation

Stop Banging Your Head

Wrap a “dangerous” call

Count failures

After too many failures, stop passing calls

After a “cooling off” period, try the next call

If it fails, wait some more before calling again

Closed

on call / pass throughcall succeeds / reset countcall fails / count failurethreshold reached / trip breaker

Open

on call / failon timeout / attempt reset

pop

Half-Open

on call/pass throughcall succeeds/resetcall fails/trip breaker

attemptreset

reset pop

Saturday, June 23, 12

Page 35: Stability patterns presentation

Considerations

Sever malfunctioning features

Degrade gracefully on caller

Critical work must be queued for later

Saturday, June 23, 12

Page 36: Stability patterns presentation

Remember This

Stop doing it if it hurts.

Expose, monitor, track, and report state changes

Good against: Cascading Failures, Slow Responses

Works with: Use Timeouts

Saturday, June 23, 12

Page 37: Stability patterns presentation

Bulkheads

Partition the system

Allow partial failure without losing service

Applies at different granularity levels

Saturday, June 23, 12

Page 38: Stability patterns presentation

Common Mode Dependency

Foo Bar

Baz

Foo and Bar are coupled via Baz

Saturday, June 23, 12

Page 39: Stability patterns presentation

With Bulkheads

Foo Bar

Baz

Baz

Pool 1

Baz

Pool 2

Foo and Bar have dedicated resources from Baz.

Saturday, June 23, 12

Page 40: Stability patterns presentation

Remember This

Save part of the ship

Decide if less efficient use of resources is OK

Pick a useful granularity

Very important with shared-service models

Monitor each partition’s performance to SLA

Saturday, June 23, 12

Page 41: Stability patterns presentation

Test Harness

Real-world failures are hard to create in QA

Integration tests work for “in-spec” errors, but not “out-of-spec” errors.

Saturday, June 23, 12

Page 42: Stability patterns presentation

“In Spec” vs. “Out of Spec”

“In Spec” failuresTCP connection refusedHTTP response code 500Error message in XML response

Example: Request-Reply using XML over HTTP

Well-Behaved Errors Wicked Errors

“Out of Spec” failures

TCP connection accepted, but no data sentTCP window full, never clearedServer replies with “EHLO”Server sends link farm HTMLServer streams Weird Al mp3s

Saturday, June 23, 12

Page 43: Stability patterns presentation

“Out-of-spec” errors happen all the time in the

real world.

They never happenduring testing...

unless you force them to.43

Saturday, June 23, 12

Page 44: Stability patterns presentation

Daemon listening on network

Substitutes for the remote end of an interface

Can run locally (dev) or remotely (dev or QA)

Is totally evil

Killer Test Harness

Saturday, June 23, 12

Page 45: Stability patterns presentation

Port Nastiness19720 Allows connections requests into the queue, but never accepts them.

19721 Refuses all connections

19722 Reads requests at 1 byte / second

19723 Reads HTTP requests, sends back random binary

19724 Accepts requests, sends responses at 1 byte / sec.

19725 Accepts requests, sends back the entire OS kernel image.

19726 Send endless stream of data from /dev/random

Just a Few Evil Ideas

Now those are some out-of-spec errors.

45Saturday, June 23, 12

Page 46: Stability patterns presentation

Remember This

Force out-of-spec failures

Stress the caller

Build reusable harnesses for L1-L6 errors

Supplement, don’t replace, other testing methods

Saturday, June 23, 12

Page 47: Stability patterns presentation

Integration Points

Cascading Failures

Users

Blocked Threads

Attacks ofSelf-Denial

Scaling Effects

UnbalancedCapacities

Slow Responses

SLA Inversion

UnboundedResult Sets Use Timeouts

Circuit Breaker

Bulkheads

Steady State

Fail Fast

Handshaking

Test Harness

DecouplingMiddleware

counters

prevents

counters

counters

reduces impact

mitigates

finds problems in

damage

mutual

aggravation

found

nearleads to

leads toleads to

results from

violating

counters

counters

counters can avoid

leads to

avoids

counters

counters

exacerbates

lead to

works with

counters

leads to

Chain Reactions

Saturday, June 23, 12

Page 48: Stability patterns presentation

© Michael Nygard, 2007-2012 48

Michael [email protected]

@mtnygard

Saturday, June 23, 12