
What Should I Instrument, And How Should I Do It?



Logistics

● I’m Baron Schwartz: @xaprb or [email protected]
● I will post the slides from this talk
● This is a follow-on to What Should I Monitor And How Should I Do It

○ https://youtu.be/zLjhFrUhqxg


What’s The Goal?

Assumption: you’re building and operating a service.

You want to instrument it so you can build and operate it better.

You want observability.

● In the present
● In the past
● In the future? (Predictability)

Observability is how well an external observer can infer a system’s internal state.


What Should I Observe?

There’s a lot to measure in a complex system. What’s important?

● It’s more important to observe the work than the service itself.

● But it’s important to observe how the service responds to the workload.


Some Convenient Blueprints

Brendan Gregg’s USE Method

● Utilization, Saturation, Errors
● http://www.brendangregg.com/usemethod.html

Tom Wilkie’s RED Method

● Measure request {Rate, Errors, Duration} (a code sketch follows below)
● https://www.slideshare.net/weaveworks/interactive-monitoring-for-kubernetes

The SRE Book’s 4 Golden Signals

● Latency, traffic, errors, and saturation
● https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html
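To make the RED method above concrete, here is a minimal Go sketch (Go because the talk mentions #golang later) of HTTP middleware that records request rate, errors, and duration. The package layout, metric names, and the choice to count status >= 500 as errors are my illustrative assumptions, not part of the talk.

```go
// Minimal RED-method sketch: HTTP middleware that records request
// {Rate, Errors, Duration}. Names and error definition are illustrative.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// redStats accumulates the three RED metrics for an endpoint.
type redStats struct {
	requests int64 // Rate: total requests; a collector turns this into a rate
	errors   int64 // Errors: failed responses (here, status >= 500)
	nanos    int64 // Duration: total time spent serving, for mean latency
}

// recorder captures the status code the wrapped handler writes.
type recorder struct {
	http.ResponseWriter
	status int
}

func (r *recorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler and records rate, errors, and duration.
func instrument(stats *redStats, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		rec := &recorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)

		atomic.AddInt64(&stats.requests, 1)
		atomic.AddInt64(&stats.nanos, int64(time.Since(start)))
		if rec.status >= 500 {
			atomic.AddInt64(&stats.errors, 1)
		}
	})
}

func main() {
	var stats redStats
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	http.Handle("/", instrument(&stats, hello))
	http.ListenAndServe(":8080", nil)
}
```

A fuller implementation would likely record a latency distribution (histogram or percentiles) rather than only a running total.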


Some Formal Laws

Queueing Theory

● Utilization, arrival rate, throughput, latency

Little’s Law

● Concurrency, latency, throughput

Universal Scalability Law

● Throughput, concurrency
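For reference, two of the laws above can be stated compactly; the notation (N, X, R, λ, σ, κ) is mine, not from the slides.

```latex
% Little's Law: mean concurrency N equals throughput X times mean latency (residence time) R.
N = X \cdot R

% Universal Scalability Law (Gunther): throughput at concurrency N, where \lambda is
% single-worker throughput, \sigma the contention penalty, and \kappa the coherency
% (crosstalk) penalty.
X(N) = \frac{\lambda N}{1 + \sigma (N - 1) + \kappa N (N - 1)}
```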


The Zen of Performance

The unifying concept in observing a service is two perspectives on requests.

External (customer’s) view:

● Request (singular), and its latency and success.

Internal (operator’s) view:

● Requests (plural, population), and their latency distribution, rates, and concurrency.

● System resources/components and their throughput, utilization, and backlog.


Much Confusion Comes From One-Sided Views

Many people, when asked if a service is working well, will look at the service for problems.

But you can only answer that question by looking at the service’s work. From that, you may need to examine the service to see why it isn’t working well.

Both are necessary. You need instrumentation that enables both perspectives.


Metrics That Matter

All of the metrics in all of the methods & laws mentioned are important.

● Throughput, concurrency, latency, utilization, backlog/load/saturation, rates

All of them are time-related, either point-in-time or measured over a duration.

● Time is the zeroth performance metric (perfdynamics.com).


Your Service Must Provide These Data

If your service is to be observable, it needs to be possible to observe these things.

● You can provide the data directly, by instrumenting your service (sketched below).
● An instrumented system (e.g. OS) can implicitly offer a framework.
● Or you can use a framework to build your service (e.g. Coda’s Metrics).
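As a hedged sketch of the first option, providing the data directly: Go’s standard-library expvar package exposes counters and computed values as JSON over HTTP. The metric names here are illustrative.

```go
// Sketch: exposing service metrics directly with Go's standard expvar package.
// Metric names are illustrative.
package main

import (
	"expvar"
	"net/http"
	"time"
)

var (
	started      = time.Now()
	ordersServed = expvar.NewInt("orders_served") // incremented by request handlers
)

func init() {
	// A computed value, re-evaluated each time /debug/vars is scraped.
	expvar.Publish("uptime_seconds", expvar.Func(func() any {
		return time.Since(started).Seconds()
	}))
}

func main() {
	http.HandleFunc("/order", func(w http.ResponseWriter, r *http.Request) {
		ordersServed.Add(1)
		w.Write([]byte("ok"))
	})
	// Importing expvar registers a JSON dump of all published vars at /debug/vars.
	http.ListenAndServe(":8080", nil)
}
```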


Service and Component Instrumentation

It’s not enough to just instrument your service’s input and output.

● You need internal components and subsystems to be observable too.
● Common examples: buffers, queues, locks, mutexes, persistence (a queue sketch follows below).

It’s easy to see that a clear architecture can help.

● Are subsystems loosely coupled and cohesive, with clear boundaries?
● Are they well defined?
● Can you draw an architecture/block diagram of them? (cf. Brendan Gregg)

Metrics on components rarely help much, beyond the basics.
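As a small illustration of making one such component observable, here is a hypothetical work queue that reports its depth (backlog), throughput, and cumulative wait time. The design and names are mine, not from the talk.

```go
// Sketch of making an internal component observable: a work queue that
// reports its depth (backlog) and how long items wait before being dequeued.
package queue

import (
	"sync"
	"time"
)

type item struct {
	enqueued time.Time
	payload  any
}

// Queue is a FIFO buffer that tracks basic observability metrics.
type Queue struct {
	mu        sync.Mutex
	items     []item
	dequeued  int64         // throughput: items processed
	totalWait time.Duration // backlog pressure: cumulative time spent queued
}

func (q *Queue) Push(payload any) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, item{enqueued: time.Now(), payload: payload})
}

func (q *Queue) Pop() (any, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return nil, false
	}
	it := q.items[0]
	q.items = q.items[1:]
	q.dequeued++
	q.totalWait += time.Since(it.enqueued)
	return it.payload, true
}

// Stats returns a point-in-time snapshot: current depth, items processed,
// and cumulative wait time, suitable for emitting as metrics.
func (q *Queue) Stats() (depth int, dequeued int64, totalWait time.Duration) {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items), q.dequeued, q.totalWait
}
```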


The Process List Is Golden

Focus more on requests/work than components. This is a well-trodden path. Every mature request-oriented service has a process table/list.

● UNIX: process table, visible with `ps`
● Apache: ServerStatus
● MySQL: SHOW PROCESSLIST
● PostgreSQL: pg_stat_activity
● MongoDB: db.currentOp()

A process table tracks the existence and state of every process/worker in the system, and tasks/requests that it is executing.


Common Attributes Of Process Tables

Request itself

● E.g. SQL text, command line + args, verb + URL + query params
● Parent request/stage/span, if possible

State of request

● At a minimum: working or waiting (where? func/module/mutex…)
● Ideally: stages/states of execution (parsing, planning, checking auth…)

Timings

● Timestamp of start; ideally timestamps of state changes too
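Putting these attributes together, a process-list entry might look roughly like the following in Go. Field and state names are illustrative, not taken from pm or any of the systems mentioned above.

```go
// Sketch of a process-list entry carrying the attributes above: the request
// itself, its parent, its current state, and timestamps of every state change.
package proclist

import (
	"sync"
	"time"
)

type State string

const (
	StateParsing  State = "parsing"
	StatePlanning State = "planning"
	StateWorking  State = "working"
	StateWaiting  State = "waiting"
)

// Entry describes one in-flight request.
type Entry struct {
	ID       uint64
	Request  string    // e.g. SQL text, command line + args, or verb + URL + query params
	ParentID uint64    // parent request/stage/span, if known
	Started  time.Time // timestamp of start

	mu          sync.Mutex
	state       State
	transitions []Transition // timestamp of every state change
}

// Transition records when the request entered a given state.
type Transition struct {
	To State
	At time.Time
}

// SetState moves the request to a new state and records when it happened.
func (e *Entry) SetState(s State) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.state = s
	e.transitions = append(e.transitions, Transition{To: s, At: time.Now()})
}

// Snapshot returns the current state and the state-change history, which is
// what a SHOW PROCESSLIST-style view would render.
func (e *Entry) Snapshot() (State, []Transition) {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.state, append([]Transition(nil), e.transitions...)
}
```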


One Example

At VividCortex, we built github.com/VividCortex/pm for API/service processlists.

● It’s for #golang
● HTTP and web browser interface
● See every request in-flight
● Kill requests
● Check request state and timings

This provides observability “now.” But not historical observability.


Extending Observability To Historical Views

“Current state” observability is the foundation of historical views. The process list can be the foundation of request history and metrics.

For requests:

● Log every state transition/change a request makes.
● Emit metrics on aggregates at these points, or at regular intervals (a sketch follows below).

○ See previous slides for which metrics to emit!

● Capture traces of requests for distributed tracing.

For components:

● Emit metrics from each component at regular intervals (ditto on prev. slides).
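A sketch of how the request-side bullets above could be wired up: log each state transition as it happens, fold time-in-state into aggregates, and flush those aggregates on an interval. All names and structure are assumptions for illustration.

```go
// Sketch of turning "current state" into history: log each state transition
// and fold time-in-state into aggregates that are emitted on an interval.
package history

import (
	"log"
	"sync"
	"time"
)

// Aggregator accumulates per-state totals between emissions.
type Aggregator struct {
	mu     sync.Mutex
	counts map[string]int64         // transitions into each state
	timeIn map[string]time.Duration // total time spent in each state
}

func NewAggregator() *Aggregator {
	return &Aggregator{counts: map[string]int64{}, timeIn: map[string]time.Duration{}}
}

// Transition is called whenever a request leaves one state for another.
func (a *Aggregator) Transition(reqID uint64, from, to string, spentInFrom time.Duration) {
	// 1. Log the transition itself (raw history of this request).
	log.Printf("req=%d state %s -> %s after %s", reqID, from, to, spentInFrom)

	// 2. Fold it into the aggregates (metrics over the population).
	a.mu.Lock()
	a.counts[to]++
	a.timeIn[from] += spentInFrom
	a.mu.Unlock()
}

// EmitEvery periodically flushes the aggregates, e.g. to a metrics system;
// here they are just logged.
func (a *Aggregator) EmitEvery(interval time.Duration) {
	for range time.Tick(interval) {
		a.mu.Lock()
		counts, timeIn := a.counts, a.timeIn
		a.counts = map[string]int64{}
		a.timeIn = map[string]time.Duration{}
		a.mu.Unlock()

		for state, n := range counts {
			log.Printf("metric state=%s transitions=%d time_in_state=%s", state, n, timeIn[state])
		}
	}
}
```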


Logging, Metrics, Traces

Peter Bourgon drew a diagram that helps illustrate some concepts.

https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html


Logging

What should you log? I tend to agree with Dave Cheney:

I believe that there are only two things you should log:

1. Things that developers care about when they are developing or debugging software.
2. Things that users care about when using your software.

Obviously these are debug and info levels, respectively.

https://dave.cheney.net/2015/11/05/lets-talk-about-logging
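A tiny illustration of those two levels using Go’s log/slog package; the messages are invented, and this is my example rather than Cheney’s.

```go
// Two log levels matching the two audiences: Debug for developers,
// Info for users of the software. Uses Go's log/slog (Go 1.21+).
package main

import "log/slog"

func main() {
	// Developer-facing detail, useful while developing or debugging.
	// (The default handler suppresses Debug unless a lower level is configured.)
	slog.Debug("cache miss, falling back to origin", "key", "user:42")

	// User-facing operational fact.
	slog.Info("listening for connections", "addr", ":8080")
}
```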



Logging and Traces

I am not a fan of “sampling” the way it’s commonly done.

● It’s a euphemism for “let’s ignore most things.”
● Every request should be measured.

It’s typically implemented in terribly biased ways that cause all kinds of problems (e.g. “slow” query logs ignore fast-but-frequent requests).

● I prefer keeping representative samples of raw data.
● But not ignoring/dropping the rest: at least aggregating it into metrics.



Representative Sampling Is Possible To Do

https://www.vividcortex.com/resources/sampling-a-stream-with-probabilistic-sketch
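One simple, well-known way to keep a representative sample of a stream is classic reservoir sampling. The sketch below is that textbook algorithm, not necessarily the probabilistic-sketch approach described in the linked post; note that every request is still counted, and only the retained raw examples are sampled.

```go
// Classic reservoir sampling: keep a fixed-size, uniformly representative
// sample of a stream of requests without dropping the rest from metrics.
package sampling

import "math/rand"

// Reservoir holds up to k items, each stream element having an equal
// probability of being in the final sample.
type Reservoir struct {
	k     int
	seen  int64
	items []string
}

func NewReservoir(k int) *Reservoir {
	return &Reservoir{k: k}
}

// Observe is called for every request: every request is still counted
// (seen feeds the aggregate metrics), and a representative subset is kept.
func (r *Reservoir) Observe(item string) {
	r.seen++
	if len(r.items) < r.k {
		r.items = append(r.items, item)
		return
	}
	// Replace an existing sample with probability k/seen.
	if j := rand.Int63n(r.seen); j < int64(r.k) {
		r.items[j] = item
	}
}

// Sample returns the current representative sample.
func (r *Reservoir) Sample() []string { return append([]string(nil), r.items...) }

// Seen returns the total number of observations, so nothing is "ignored."
func (r *Reservoir) Seen() int64 { return r.seen }
```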


Observability Culture

Observability is more than a Silicon Valley buzzword. It’s a culture, like DevOps.

How can you build a culture of observability?

● You get what you incentivize: incentivize the data/metrics and you’ll get them.
● Prioritize the end, not the means.
● Understand the difference between culture and the visible artifacts of culture.

Many a company has tried to imitate Netflix or Etsy and gotten different results.

● See McFunley’s talk, for example: http://pushtrain.club/


What Should You Reward?

● Clarity and intentionality; purposefulness
● Empathy
● Shared ownership and responsibility
● Attendance at DevOpsDays

What should you think twice about rewarding?

● Metrics/data/graphs, in a vacuum for their own sake
● Keep in mind Etsy’s “if it moves, graph it” slogan is a means, not an end

○ https://codeascraft.com/2011/02/15/measure-anything-measure-everything/


Parting Thoughts

I’m a fan of defining the problem before working on the solution.

● Clarity of purpose tends to influence decisions for the better.
● Explicit goal of observability and intelligibility tends to improve operability.
● Clear understanding of performance focuses on KPIs, not vanity metrics.

Some further thoughts at https://www.vividcortex.com/resources/architecting-highly-monitorable-apps