Operational Costs of Technical Debt

Slide deck from my presentation at #velocityconf 2014: http://oreil.ly/NtSknc

The Operational Cost of Technical Debt

Kurt Andersen

LinkedIn Site Reliability

kurta@linkedin.com

@drkurta

In our daily work, there are things which slow us down or make us inefficient

INversion

LinkedIn 2002 2011

With awareness and will, you can fix those problems

Kafka 0.7 → 0.8

Upgrading from 0.7

0.8, the release in which added replication, was our first backwards-incompatible release: major changes were made . . .

The upgrade from 0.7 to 0.8.x requires a special tool for migration.

This migration can be done without downtime.

from https://kafka.apache.org/documentation.html

We have all dealt with processes, systems or procedures that are lodged in the past

LinkedIn’s use of memcached

What do we need to solve the problems of technical debt?

Examples in action:

• INversion

• Kafka 0.7 → 0.8 migration

• 6-year-old version of memcached

• Wire-line format change: from Java-serialized objects via RPC to REST+JSON

If we recognize the problems and evaluate the costs correctly, we make better decisions about how to spend our efforts

Technical Debt is a Decision

Technical debt accumulates from a series of small choices

"Doing it this way" is good enough for now

We'll just skip version N+1 and look at N+2 or higher

Changing to version X is going to take a lot of work

"This is the way we do things, it is not open to discussion"

Infrastructure becomes technical debt when the focus stays on shiny new features

"We can't afford the time to upgrade infrastructure, we have to ship features A + B"

"What have you done for me lately" is more sellable than preventing problems

Y2k

Past decisions become debt unless they are updated to reflect new realities

Assumptions/predictions which are made early in the design process can be way off the mark.

Mary, Mary, quite contrary

How does your system scale?

“One in a million” happens multiple times per hour or minute at web scale
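
The arithmetic behind that claim can be sketched as follows. The request rates are hypothetical values chosen for illustration; they are not figures from the talk.

```python
def rare_events_per_hour(requests_per_second: float,
                         probability: float = 1e-6) -> float:
    """Expected number of rare events per hour at a given request rate."""
    seconds_per_hour = 3600
    return requests_per_second * seconds_per_hour * probability


# Hypothetical request rates, for illustration only.
for rps in (1_000, 20_000, 100_000):
    per_hour = rare_events_per_hour(rps)
    print(f"{rps:>7} req/s: about {per_hour:.1f} per hour, "
          f"{per_hour / 60:.2f} per minute")
```

At a thousand requests per second, a one-in-a-million event is expected several times an hour; at a hundred thousand requests per second it is expected several times a minute.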

What are the direct costs of technical debt?

System outages and errors increase

The development process became more and more bogged down in conflict resolution under the branch-based development model

Teams develop work-arounds and procedures that are more complicated than the problem

Signs you are dealing with tech debt:

1) “cult” ops

2) “red face” quotient

3) working around problems rather than fixing them

New features are blocked when the infrastructure can’t deal with new loads.

Capacity uplifts become increasingly painful or impossible

Constant rollbacks and rework cause stress on everyone, not just dev and ops

What are the indirect costs of technical debt?

Technical debt devalues ops in favor of new feature development

"No one gets promoted for retiring debt"

“Our ops guys are so good, they can make anything work”

Supporting zombies leads to finger-pointing and avoidance

Zombies are unsupported and unsupportable

Zombies require active intervention to stop

Technical debt leads to demoralization

Being constantly reactive is no fun

Friction for teams like customer support makes it harder than necessary to provide excellent support

How do you balance retiring technical debt against other development work?

Recognize debt choices and decisions

Never say "never"

Keep an open mind

Revisit old decisions as usage and requirements change

Measure the right things

Time to Repair and Effort

Impact: frequency, severity and reach

Error rates

Capacity/Headroom
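
One way these measurements might be rolled up is sketched below. The `Incident` record, its field names, and the severity scale are assumptions made for illustration; the talk does not prescribe a data model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    minutes_to_repair: float
    engineer_hours: float   # effort spent on the repair
    severity: int           # assumed scale: 1 (minor) to 4 (critical)
    users_affected: int     # reach


def summarize(incidents: list[Incident], days_observed: int) -> dict:
    """Roll up repair time, effort, and impact; needs at least one incident."""
    return {
        "mean_time_to_repair_min": mean(i.minutes_to_repair for i in incidents),
        "total_effort_hours": sum(i.engineer_hours for i in incidents),
        "incidents_per_week": len(incidents) * 7 / days_observed,
        "worst_severity": max(i.severity for i in incidents),
        "total_reach": sum(i.users_affected for i in incidents),
    }


def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def headroom(peak_load: float, capacity: float) -> float:
    """Fraction of capacity still unused at peak load."""
    return 1.0 - peak_load / capacity
```

Tracking the same numbers before and after a fix gives a way to show what retiring a piece of debt actually bought.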

If you were implementing package X today, what would you do differently?

Evaluate all the costs: either to fix or to tolerate
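
A minimal sketch of that comparison, assuming both costs can be expressed in engineer-hours. The function names and figures are hypothetical, and a real evaluation would also need to weigh the indirect costs listed earlier.

```python
def cost_to_tolerate(hours_per_month: float, months: int) -> float:
    """Cumulative hours spent on work-arounds, rollbacks and firefighting."""
    return hours_per_month * months


def months_to_break_even(fix_cost_hours: float,
                         tolerate_hours_per_month: float) -> float:
    """Months until a one-time fix costs less than continuing to tolerate."""
    if tolerate_hours_per_month <= 0:
        return float("inf")  # nothing is being paid, so the fix never pays off
    return fix_cost_hours / tolerate_hours_per_month


# Hypothetical figures: a 120-hour upgrade versus 20 hours/month of toil.
fix_hours = 120
toil_per_month = 20
print("Break-even after", months_to_break_even(fix_hours, toil_per_month), "months")
print("Cost of tolerating for a year:", cost_to_tolerate(toil_per_month, 12), "hours")
```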

Make active decisions

What is your job?

How did our examples turn out?

• INversion
• memcached
• Kafka 0.7 → 0.8
• Rest.Li

INversion

1. Check code into trunk
2. Peer review
3. Release from trunk
4. Continuous integration
5. Service owners own their services
6. Canary all deployments
7. New features ramped, not binary
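
The idea of ramping a feature instead of switching it on for everyone at once can be sketched as below. This is an illustrative percentage ramp keyed on a hash of the member id, not a description of LinkedIn's actual mechanism; `in_ramp` and its parameters are hypothetical names.

```python
import hashlib


def in_ramp(member_id: str, feature: str, percent: float) -> bool:
    """Return True if this member falls inside the feature's current ramp.

    The bucket depends only on the feature and member id, so a member
    who sees the feature at 10% still sees it when the ramp grows to 50%.
    """
    digest = hashlib.sha256(f"{feature}:{member_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Hypothetical usage: widen the ramp in steps instead of flipping a switch.
for percent in (1, 10, 50, 100):
    enabled = sum(in_ramp(str(i), "new-profile-page", percent)
                  for i in range(10_000))
    print(f"{percent:>3}% ramp: enabled for {enabled} of 10000 members")
```

Because the bucketing is deterministic, a ramp can be widened or pulled back to 0% without members flickering in and out of the feature.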

Moving beyond the debt crisis

Advance our standards, set upon our foes
Our ancient word of courage, fair Saint George,
Inspire us with the spleen of fiery dragons!
Upon them! victory sits on our helms.

Richard III, Act V, Scene 3

Transforming the way the world works.

Kurt Andersen

kurta@linkedin.com

@drkurta

Appendix

Values

Members first

Relationships matter

Be open, honest, and constructive

Demand excellence

Take intelligent risks

Act like an owner
