METRICS - Coda Hale€¦ · The enterprise social network. Saturday, April 9, 2011

Preview:

Citation preview

METRICS

METRICS EVERYWHERESaturday, April 9, 2011

METRICS

METRICS EVERYWHERESaturday, April 9, 2011

Make better decisions by using numbers.

Saturday, April 9, 2011

Coda Hale@coda

github.com/codahale

Saturday, April 9, 2011

www.yammer.comThe enterprise social network.

Saturday, April 9, 2011

I write code.

Saturday, April 9, 2011

But that’s notactually my job.

Saturday, April 9, 2011

code

Saturday, April 9, 2011

codebusiness

value

Saturday, April 9, 2011

What the hell is business value?

Saturday, April 9, 2011

A new feature.

Saturday, April 9, 2011

An improvedexisting feature.

Saturday, April 9, 2011

Fewer bugs.

Saturday, April 9, 2011

Not pissing our users off with a slow site.

Saturday, April 9, 2011

Not pissing our users off with a slow site.

ugly

Saturday, April 9, 2011

Not pissing our users off with a slow site.

uglypretty

Saturday, April 9, 2011

Making futurechanges easier.

Saturday, April 9, 2011

Adding a unit test before fixing that bug.

Saturday, April 9, 2011

Business value is anything which makes people more likely to

give us money.

Saturday, April 9, 2011

We want to generate more business value.

Saturday, April 9, 2011

We need to makebetter decisionsabout our code.

Saturday, April 9, 2011

Our code generates business valuewhen it runs.

Saturday, April 9, 2011

Our code generates business valuewhen it runs,

not when we write it.

Saturday, April 9, 2011

We need to knowwhat our code does

when it runs.

Saturday, April 9, 2011

We can’t do this unless we measure it.

Saturday, April 9, 2011

Why measure it?

Saturday, April 9, 2011

territorymap ≠

Saturday, April 9, 2011

cityofSanFrancisco

mapof

SanFrancisco

Saturday, April 9, 2011

thewayitis

thewaywetalk

Saturday, April 9, 2011

thethinginitself

thething

wethink of

Saturday, April 9, 2011

realityperception ≠

Saturday, April 9, 2011

MIND THE GAP

Saturday, April 9, 2011

We have amental model

of what our code does.

Saturday, April 9, 2011

It’s a mental model.It’s not the code.

Saturday, April 9, 2011

It is often wrong.

Saturday, April 9, 2011

Confusion.

Saturday, April 9, 2011

“This code can’t possibly work.”

Saturday, April 9, 2011

(It works.)

Saturday, April 9, 2011

MIND THE GAP

Saturday, April 9, 2011

“This code can’t possibly fail.”

Saturday, April 9, 2011

(It fails.)

Saturday, April 9, 2011

MIND THE GAP

Saturday, April 9, 2011

Which is faster?

Saturday, April 9, 2011

Which is faster?items.sort_by { |i| i.name }

Saturday, April 9, 2011

Which is faster?items.sort_by { |i| i.name }

items.sort { |a, b| a.name <=> b.name }

Saturday, April 9, 2011

We don’t know.

Saturday, April 9, 2011

We don’t know.

def sort_by(&blk) sleep(100) # FIXME: I AM POISON super(&blk)end

Saturday, April 9, 2011

We don’t know.

def sort_by(&blk) sleep(100) # FIXME: I AM POISON super(&blk)end

def sort(&blk) # TODO: make not explode raise Exception.new("Haw haw!")end

Saturday, April 9, 2011

We can’t know untilwe measure it.

Saturday, April 9, 2011

This affects how we make decisions.

Saturday, April 9, 2011

“Our application is slow. This page takes 500ms.

Fix it.”

Saturday, April 9, 2011

Find the bottleneck!

Saturday, April 9, 2011

Find the bottleneck!

SQL Query

Saturday, April 9, 2011

Find the bottleneck!

SQL Query

Template Rendering

Saturday, April 9, 2011

Find the bottleneck!

SQL Query

Template Rendering

Session Storage

Saturday, April 9, 2011

We don’t know.

Saturday, April 9, 2011

Find The Bottleneck 2.0!

SQL Query

Template Rendering

Session Storage

Saturday, April 9, 2011

Find The Bottleneck 2.0!

SQL Query

Template Rendering

Session Storage

53ms

Saturday, April 9, 2011

Find The Bottleneck 2.0!

SQL Query

Template Rendering

Session Storage

53ms

1ms

Saturday, April 9, 2011

Find The Bottleneck 2.0!

SQL Query

Template Rendering

Session Storage

53ms

1ms

315ms

Saturday, April 9, 2011

Find The Bottleneck 2.0!

SQL Query

Template Rendering

Session Storage

53ms

1ms

315ms

Saturday, April 9, 2011

Confusion.

Saturday, April 9, 2011

Saturday, April 9, 2011

We made a better decision.

Saturday, April 9, 2011

We improve our mental model by measuringwhat our code does.

Saturday, April 9, 2011

territorymap ≠

Saturday, April 9, 2011

territorymap→

Saturday, April 9, 2011

We use ourmental model

to decide what to do.

Saturday, April 9, 2011

A bettermental model

makes us better at deciding what to do.

Saturday, April 9, 2011

A bettermental model

makes us better at generating

business value.

Saturday, April 9, 2011

Measuring makes your decisions better.

Saturday, April 9, 2011

But only if we’re measuring

the right thing.

Saturday, April 9, 2011

We need to measure our code where it

matters.

Saturday, April 9, 2011

In the wild.

Saturday, April 9, 2011

Generatingbusiness value.

Saturday, April 9, 2011

Saturday, April 9, 2011

PRODUCTION

Saturday, April 9, 2011

Continuously measuring code in production.

Saturday, April 9, 2011

Metrics

Saturday, April 9, 2011

MetricsJava/Scala

Saturday, April 9, 2011

github.com/codahale/metrics

MetricsJava/Scala

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

Each metric is associated with a class

and has a name.

Saturday, April 9, 2011

An autocomplete service for city names.

Saturday, April 9, 2011

An autocomplete service for city names.

> GET /complete?q=San%20Fra

Saturday, April 9, 2011

An autocomplete service for city names.

> GET /complete?q=San%20Fra

< HTTP/1.1 200 RAD<< ["San Francisco"]

Saturday, April 9, 2011

What does this code do that affects its business value?

Saturday, April 9, 2011

And how can we measure that?

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

GaugeThe instantaneous value of something.

Saturday, April 9, 2011

# of cities

Saturday, April 9, 2011

metrics.gauge("cities") { cities.size }

Saturday, April 9, 2011

metrics.gauge("cities") { cities.size }

Saturday, April 9, 2011

metrics.gauge("cities") { cities.size }

Saturday, April 9, 2011

“The service has 589 cities registered.”

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

GaugesCounters

MetersHistograms

Timers

Saturday, April 9, 2011

CounterAn incrementing and decrementing value.

Saturday, April 9, 2011

# of open connections

Saturday, April 9, 2011

val counter = metrics.counter("connections")

counter.inc()

counter.dec()

Saturday, April 9, 2011

val counter = metrics.counter("connections")

counter.inc()

counter.dec()

Saturday, April 9, 2011

val counter = metrics.counter("connections")

counter.inc()

counter.dec()

Saturday, April 9, 2011

val counter = metrics.counter("connections")

counter.inc()

counter.dec()

Saturday, April 9, 2011

“There are 594 active sessions on that server.”

Saturday, April 9, 2011

GaugesCounters

MetersHistograms

Timers

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

MeterThe average rate of events

over a period of time.

Saturday, April 9, 2011

# of requests/sec

Saturday, April 9, 2011

val meter = metrics.meter("requests", SECONDS)

meter.mark()

Saturday, April 9, 2011

val meter = metrics.meter("requests", SECONDS)

meter.mark()

Saturday, April 9, 2011

val meter = metrics.meter("requests", SECONDS)

meter.mark()

Saturday, April 9, 2011

val meter = metrics.meter("requests", SECONDS)

meter.mark()

Saturday, April 9, 2011

mean rate = # of events

elapsed time

Saturday, April 9, 2011

time

# ofrequests

Saturday, April 9, 2011

# ofrequests

time

Saturday, April 9, 2011

# ofrequests

time

Saturday, April 9, 2011

MIND THE GAP

Saturday, April 9, 2011

Recency.

Saturday, April 9, 2011

mean rate = # of events

elapsed time

Saturday, April 9, 2011

mean rate = # of events

elapsed time

Saturday, April 9, 2011

COGNITIVE HAZARD

Saturday, April 9, 2011

Exponentially weighted moving average.

Saturday, April 9, 2011

-(1-α)kmt-1 + (1-(1-α)k)Yt

k

Saturday, April 9, 2011

-(1-α)kmt-1 + (1-(1-α)k)Yt

k

Saturday, April 9, 2011

# ofrequests

time

Saturday, April 9, 2011

# ofrequests

time

Saturday, April 9, 2011

# ofrequests

time

Saturday, April 9, 2011

# ofrequests

time

Saturday, April 9, 2011

1-minute rate

Saturday, April 9, 2011

1-minute rate5-minute rate

Saturday, April 9, 2011

1-minute rate5-minute rate15-minute rate

Saturday, April 9, 2011

“We went from 3,000 requests/sec to<500 a second.”

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

HistogramThe statistical distribution of values in a stream of data.

Saturday, April 9, 2011

# of cities returned

Saturday, April 9, 2011

val histogram = metrics.histogram("response-sizes")

histogram.update(response.cities.size)

Saturday, April 9, 2011

val histogram = metrics.histogram("response-sizes")

histogram.update(response.cities.size)

Saturday, April 9, 2011

val histogram = metrics.histogram("response-sizes")

histogram.update(response.cities.size)

Saturday, April 9, 2011

minimum

Saturday, April 9, 2011

minimummaximum

Saturday, April 9, 2011

minimummaximum

mean

Saturday, April 9, 2011

minimummaximum

meanstandard deviation

Saturday, April 9, 2011

Quantiles

Saturday, April 9, 2011

Quantilesmedian

Saturday, April 9, 2011

Quantilesmedian

75th percentile

Saturday, April 9, 2011

Quantilesmedian

75th percentile95th percentile

Saturday, April 9, 2011

Quantilesmedian

75th percentile95th percentile98th percentile

Saturday, April 9, 2011

Quantilesmedian

75th percentile95th percentile98th percentile99th percentile

Saturday, April 9, 2011

Quantilesmedian

75th percentile95th percentile98th percentile99th percentile

99.9th percentileSaturday, April 9, 2011

We can’t keep all of these values.

Saturday, April 9, 2011

1,000 req/sec

Saturday, April 9, 2011

1,000 req/sec

×

Saturday, April 9, 2011

1,000 req/sec

×1,000 actions/req

Saturday, April 9, 2011

1,000 req/sec

×1,000 actions/req

×

Saturday, April 9, 2011

1,000 req/sec

×1,000 actions/req

×1 day

Saturday, April 9, 2011

1,000 req/sec

×1,000 actions/req

×1 day

=

Saturday, April 9, 2011

1,000 req/sec

×1,000 actions/req

×1 day

=>86 billion values

Saturday, April 9, 2011

1,000 req/sec

×1,000 actions/req

×1 day

=>86 billion values

>640GB of data/day

Saturday, April 9, 2011

1,000 req/sec

×1,000 actions/req

×1 day

=>86 billion values

>640GB of data/dayNot gonna happen.

Saturday, April 9, 2011

COGNITIVE HAZARD

Saturday, April 9, 2011

Reservoir sampling.Keep a statistically representative sample

of measurements as they happen.

Saturday, April 9, 2011

Vitter’s Algorithm R.

Vitter, J. (1985).Random sampling with a reservoir.

ACM Transactions on Mathematical Software (TOMS), 11(1), 57.Saturday, April 9, 2011

time

# ofcities

Saturday, April 9, 2011

time

# ofcities

Saturday, April 9, 2011

time

# ofcities

Saturday, April 9, 2011

time

# of cities

Saturday, April 9, 2011

time

# of cities

Saturday, April 9, 2011

MIND THE GAP

Saturday, April 9, 2011

Vitter’s Algorithm R produces uniform

samples.

Saturday, April 9, 2011

Recency.

Saturday, April 9, 2011

SUPER-DUPERCOGNITIVE HAZARD

Saturday, April 9, 2011

Saturday, April 9, 2011

Forward-decaying priority sampling.

Cormode, G., Shkapenyuk, V., Srivastava, D., & Xu, B. (2009).Forward Decay: A Practical Time Decay Model for Streaming Systems.

ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering.Saturday, April 9, 2011

Maintain a statistically representative sample of the last 5 minutes.

Saturday, April 9, 2011

time

# of cities

Saturday, April 9, 2011

time

# of cities

Saturday, April 9, 2011

time

# of cities

Saturday, April 9, 2011

time

# of cities

Saturday, April 9, 2011

Uniform Biased

Saturday, April 9, 2011

“95% of autocomplete results return 3 cities or

less.”

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

TimerA histogram of durations and

a meter of calls.

Saturday, April 9, 2011

# of ms to respond

Saturday, April 9, 2011

val timer = metrics.timer("requests", MILLISECONDS, SECONDS)

timer.time { handle(req, resp) }

Saturday, April 9, 2011

val timer = metrics.timer("requests", MILLISECONDS, SECONDS)

timer.time { handle(req, resp) }

Saturday, April 9, 2011

val timer = metrics.timer("requests", MILLISECONDS, SECONDS)

timer.time { handle(req, resp) }

Saturday, April 9, 2011

val timer = metrics.timer("requests", MILLISECONDS, SECONDS)

timer.time { handle(req, resp) }

Saturday, April 9, 2011

val timer = metrics.timer("requests", MILLISECONDS, SECONDS)

timer.time { handle(req, resp) }

Saturday, April 9, 2011

“At ~2,000 req/sec, our 99% latency jumps

from 13ms to 453ms.”

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

Now what?

Saturday, April 9, 2011

Instrument it.

Saturday, April 9, 2011

Instrument it.If it could affect your code’s

business value, add a metric.

Saturday, April 9, 2011

Instrument it.If it could affect your code’s

business value, add a metric.Our services have 40-50 metrics.

Saturday, April 9, 2011

Collect it.

Saturday, April 9, 2011

Collect it.JSON via HTTP.

Saturday, April 9, 2011

Collect it.JSON via HTTP.Every minute.

Saturday, April 9, 2011

Monitor it.

Saturday, April 9, 2011

Monitor it.Nagios/Zabbix/Whatever

Saturday, April 9, 2011

Monitor it.Nagios/Zabbix/Whatever

If it affects business value, someone should get woken up.

Saturday, April 9, 2011

Aggregate it.

Saturday, April 9, 2011

Aggregate it.Ganglia/Graphite/Cacti/Whatever

Saturday, April 9, 2011

Aggregate it.Ganglia/Graphite/Cacti/Whatever

Place current values in historical context.

Saturday, April 9, 2011

Aggregate it.Ganglia/Graphite/Cacti/Whatever

Place current values in historical context.See long-term patterns.

Saturday, April 9, 2011

Go faster.

Saturday, April 9, 2011

Shorten ourdecision-making cycle.

Saturday, April 9, 2011

Observe

Saturday, April 9, 2011

ObserveOrient

Saturday, April 9, 2011

ObserveOrientDecide

Saturday, April 9, 2011

ObserveOrientDecideAct

Saturday, April 9, 2011

ObserveOrientDecideAct

Saturday, April 9, 2011

Observe

What is the 99% latency of our autocomplete service right now?

Saturday, April 9, 2011

Observe

What is the 99% latency of our autocomplete service right now?

~500ms

Saturday, April 9, 2011

Orient

How does this compare toother parts of our system,

both currently and historically?

Saturday, April 9, 2011

Orient

How does this compare toother parts of our system,

both currently and historically?

way slower

Saturday, April 9, 2011

Decide

Should we make it faster?Or should we add feature X?

Saturday, April 9, 2011

Decide

Should we make it faster?Or should we add feature X?

make it faster

Saturday, April 9, 2011

Act!

Write some code.

Saturday, April 9, 2011

Act!

Write some code.

def sort_by(&blk) #sleep(100) # WTF DUDE super(&blk)end

Saturday, April 9, 2011

10 Print "Rinse"20 Print "Repeat"30 Goto 10

Saturday, April 9, 2011

If we do this fasterwe will win.

Saturday, April 9, 2011

Fewer bugs.

Saturday, April 9, 2011

More features.

Saturday, April 9, 2011

Happier users.

Saturday, April 9, 2011

Money.Saturday, April 9, 2011

tl;dr

Saturday, April 9, 2011

We might write code.

Saturday, April 9, 2011

We have to generatebusiness value.

Saturday, April 9, 2011

In order to know how well our code is generating

business value, we need metrics.

Saturday, April 9, 2011

GaugesCountersMeters

HistogramsTimers

Saturday, April 9, 2011

Monitor them for current problems.

Saturday, April 9, 2011

Aggregate them for historical perspective.

Saturday, April 9, 2011

territorymap ≠

Saturday, April 9, 2011

territorymap→

Saturday, April 9, 2011

Improve our mental model of our code.

Saturday, April 9, 2011

MIND THE GAP

Saturday, April 9, 2011

ObserveOrientDecideAct

Saturday, April 9, 2011

If you’re on the JVM, use Metrics.

Saturday, April 9, 2011

If you’re on the JVM, use Metrics.

github.com/codahale/metrics

Saturday, April 9, 2011

If not,you can build this.

Saturday, April 9, 2011

Please build this.

Saturday, April 9, 2011

Make better decisions by using numbers.

Saturday, April 9, 2011

Thank you.

Saturday, April 9, 2011

Saturday, April 9, 2011