Actionable Metrics Enabling Decision-Making in Netflix’s Decentralized Environment Cloud Tech III October 6, 2012 Roy Rapoport @royrapoport, rsr@netflix.com Thursday, October 18, 12

Cloud Tech III: Actionable Metrics

DESCRIPTION

Presentation for Cloud Tech III on How Netflix Thinks of Metrics

Page 1: Cloud Tech III: Actionable Metrics

Actionable Metrics: Enabling Decision-Making in Netflix’s Decentralized Environment

Cloud Tech III, October 6, 2012

Roy Rapoport

@royrapoport, rsr@netflix.com

Thursday, October 18, 12

Page 2: Cloud Tech III: Actionable Metrics

Me

• Been in tech for about 20 years

• Systems engineering, networking, software development, QA, release management

• Time at Netflix: 1195 days (3y:3m:1w)

• (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff...)

Page 7: Cloud Tech III: Actionable Metrics

Metrics Humor

% of instances with even public IP addresses

Page 12: Cloud Tech III: Actionable Metrics

Technology Overview

• SoA, REST, Mostly Java

• Simple overall architecture:

Page 16: Cloud Tech III: Actionable Metrics

Culture Overview

• Freedom and Responsibility

• Distributed Operations

• Get out of the way of Developers

Page 20: Cloud Tech III: Actionable Metrics

The Metric Lifecycle

• Send

• Look

• Alert

Page 21: Cloud Tech III: Actionable Metrics

Systems

• Flexible

• Scalable

• Self-Service

Page 22: Cloud Tech III: Actionable Metrics

Telemetry

Flexible, Scalable, Self-Service

import netflix.metrics
[...]
        self.nm = netflix.metrics.Metrics("core_cag")
[...]
    def api(self):
        self.nm.nfCounter("api")
        [...]
        self.nm.nfCounter("application_%s" % application)
        [...]
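
The `netflix.metrics` module on this slide is internal to Netflix, so the snippet cannot be run outside it. A minimal sketch of the same pattern follows, assuming a hypothetical in-memory `Metrics` class whose `nfCounter` only increments a named counter. The `<prefix>_<name>` naming is an assumption inferred from the `core_cag_api` metric that appears later in the deck; shipping values to a telemetry backend is not modeled.

```python
from collections import Counter


class Metrics:
    """Hypothetical stand-in for netflix.metrics.Metrics (not the real client)."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.counts = Counter()

    def nfCounter(self, name):
        # Assumption: counters are namespaced as "<prefix>_<name>".
        self.counts["%s_%s" % (self.prefix, name)] += 1


class AlertGateway:
    """Toy service that counts every API call, overall and per application."""

    def __init__(self):
        self.nm = Metrics("core_cag")

    def api(self, application):
        self.nm.nfCounter("api")
        self.nm.nfCounter("application_%s" % application)


gateway = AlertGateway()
for app in ["api_proxy", "api_proxy", "billing"]:
    gateway.api(app)
print(dict(gateway.nm.counts))
```

The point of the pattern is that adding a metric costs one line in the service code, with no registration step elsewhere, which is what makes the telemetry self-service.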

Page 23: Cloud Tech III: Actionable Metrics

Visualization

Flexible, Scalable, Self-Service

Page 31: Cloud Tech III: Actionable Metrics

Alerting

Flexible, Scalable, Self-Service

• Static vs Dynamic Thresholds

• Compare to history
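
The difference between the two threshold styles can be sketched as follows. This is an illustration only, not Netflix's alerting engine: the function names, the 25% tolerance, and the sample traffic numbers are all made up.

```python
from statistics import mean


def static_alert(value, floor):
    """Static threshold: fire whenever the value drops below a fixed floor."""
    return value < floor


def dynamic_alert(value, history, tolerance=0.25):
    """Dynamic threshold: fire when the value deviates from the historical
    average for the same time slot by more than `tolerance` (a fraction)."""
    baseline = mean(history)
    if baseline == 0:
        return value != 0
    return abs(value - baseline) / baseline > tolerance


# Hypothetical requests per minute seen at 03:00 and at 20:00 on previous days.
night_history = [110, 95, 105, 90]
evening_history = [980, 1020, 1000, 1010]

# A fixed floor of 500 fires every night, although low night traffic is normal.
print(static_alert(100, floor=500), dynamic_alert(100, night_history))
# A drop to 600 in the evening stays above the floor but is far from normal.
print(static_alert(600, floor=500), dynamic_alert(600, evening_history))
```

The static rule is noisy at night and blind in the evening; comparing against history handles both cases with one setting.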

Page 32: Cloud Tech III: Actionable Metrics

For Example ...

What the ...

Last 3 hours’ core_tools.core_cag_api

Page 33: Cloud Tech III: Actionable Metrics

For Example ...

Visualization (Continued)

Last 4 days’ core_tools.core_cag_api

Even more questions!

Page 34: Cloud Tech III: Actionable Metrics

For Example ...

Visualization (Continued)

Last 10 days’ core_tools.core_cag_api

What caused the spike?

Page 35: Cloud Tech III: Actionable Metrics

For Example ...

Visualization (Continued)

Show alert volume per application

Someone had a rough few days...

Page 36: Cloud Tech III: Actionable Metrics

Don’t Like Surprises...

{
  "alerts": [
    {
      "applyTo": "cluster",
      "condition": {
        "minPercent": 90.0,
        "noise": 0.2,
        "maxPercent": 25.0,
        "type": "DoubleExponential"
      },
      "metricName": "core_cag_api",
      "severity": "major"
    }
  ],
  "clusters": [ "core_tools" ]
}
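
The `"DoubleExponential"` condition type presumably refers to double exponential smoothing (Holt's linear method), which forecasts the next value from a smoothed level and a smoothed trend. The slide does not define `minPercent`, `maxPercent`, or `noise`, so the sketch below rests on an assumed reading: alert when the observed value is more than `maxPercent` above the forecast or more than `minPercent` below it. `noise` is not modeled, and the smoothing factors and sample data are invented. It also assumes non-negative values, as with a call counter.

```python
def holt_forecasts(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts from double exponential smoothing.

    forecasts[i] is the prediction for series[i + 2], made before seeing it.
    """
    level, trend = series[1], series[1] - series[0]
    forecasts = []
    for x in series[2:]:
        forecasts.append(level + trend)
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def is_surprise(observed, predicted, min_percent=90.0, max_percent=25.0):
    """Assumed semantics: outside [predicted - min%, predicted + max%]."""
    upper = predicted * (1 + max_percent / 100.0)
    lower = predicted * (1 - min_percent / 100.0)
    return observed > upper or observed < lower


# Hypothetical per-minute counts for core_cag_api: steady growth, then a spike.
calls = [100, 110, 120, 130, 140, 300]
predictions = holt_forecasts(calls)
for observed, predicted in zip(calls[2:], predictions):
    print(observed, predicted, is_surprise(observed, predicted))
```

Because the forecast follows the trend, steady growth raises no alert; only the jump to 300 against a forecast of 150 is flagged.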

Page 37: Cloud Tech III: Actionable Metrics

Threshold Tuning

• An Abbreviated History ...

Page 41: Cloud Tech III: Actionable Metrics

Threshold Tuning (in the beginning)

• Systems owned by IT

• Want an alert? Submit a ticket

• Want to tune an alert? Submit a ticket

Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor.

- http://www.descentinthedark.com/_d_/dark_priests.php

Page 45: Cloud Tech III: Actionable Metrics

Threshold Tuning (It gets better)

• You get to configure your own thresholds

• Freedom!

• Also, you have to configure your own thresholds

Page 49: Cloud Tech III: Actionable Metrics

Threshold Tuning (Are we there yet?)

• Play with historical data

• Huge difference

• Still falls short
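
"Play with historical data" amounts to backtesting: replaying a candidate threshold over recorded values to see how often it would have fired. A minimal sketch, with an invented function name and invented data:

```python
def backtest(history, threshold):
    """Return the indices at which `value > threshold` would have alerted."""
    return [i for i, value in enumerate(history) if value > threshold]


# Hypothetical hourly error counts; hour 8 is the only real incident.
errors_per_hour = [3, 5, 4, 12, 6, 4, 11, 5, 48, 7]

for candidate in (10, 20, 60):
    fired = backtest(errors_per_hour, candidate)
    print("threshold %d: %d alert(s) at hours %s" % (candidate, len(fired), fired))
```

A threshold of 10 pages twice for nothing, 60 misses the incident, and 20 catches only hour 8. That is the "huge difference"; it "still falls short" because a person has to repeat the exercise for every metric and every time traffic patterns shift.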

Page 51: Cloud Tech III: Actionable Metrics

Threshold Tuning (Yeah, that’s the ticket)

• Computers can be good at this

Page 57: Cloud Tech III: Actionable Metrics

If Time Allows ...

Page 61: Cloud Tech III: Actionable Metrics

Events vs Metrics

• Irregular Interval

• Point in time

• Lack magnitude

Page 65: Cloud Tech III: Actionable Metrics

Why Build It?

• Change management

• Vs Change control

• What Changed?

• Better Alerting

Page 72: Cloud Tech III: Actionable Metrics

Chronos

• Rapidly Prototyped

• Adapters and reporters

• Easy querying

• Alarming

• Something happened

• ... X times in Y minutes

• Something didn’t happen
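
The three alarm conditions listed above can be expressed over a plain list of event timestamps. The slides do not show Chronos's actual API, so the function names and sample events below are hypothetical.

```python
from datetime import datetime, timedelta


def count_in_window(timestamps, now, window):
    """Number of events t with now - window < t <= now."""
    return sum(1 for t in timestamps if now - window < t <= now)


def happened_too_often(timestamps, now, times, window):
    """'Something happened X times in Y minutes.'"""
    return count_in_window(timestamps, now, window) >= times


def did_not_happen(timestamps, now, window):
    """'Something didn't happen' within the window, e.g. a missed nightly job."""
    return count_in_window(timestamps, now, window) == 0


now = datetime(2012, 10, 6, 12, 0)
deploys = [now - timedelta(minutes=m) for m in (2, 7, 9, 45)]
backups = [now - timedelta(hours=30)]

# Three deploys in the last ten minutes trips the rate alarm.
print(happened_too_often(deploys, now, times=3, window=timedelta(minutes=10)))
# No backup in the last 24 hours trips the absence alarm.
print(did_not_happen(backups, now, window=timedelta(hours=24)))
```

The plain "something happened" alarm is the special case `times=1`.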

Page 74: Cloud Tech III: Actionable Metrics

Chronos

• Rapidly Prototyped

• Adapters and reporters

• Easy querying

• Alarming

• Medium volume

• Recursive

• Recursive

Page 81: Cloud Tech III: Actionable Metrics

End Result

• Massive decrease in change control tickets

• Not talking about SOX or PCI

• Better visibility into changes

• Decreased TTR

• Especially for bad code deployments

• You should do this

Page 82: Cloud Tech III: Actionable Metrics

I Didn’t Mention

• End-to-end testing and alerting

• External availability and performance

• Open Connect

• Jobs

Page 83: Cloud Tech III: Actionable Metrics

Questions?
