Monitoring microservices with Prometheus

Tobias Schmidt - MicroCPH May 17, 2017

github.com/grobie @dagrobie tobidt@gmail.com

Monitoring

● Ability to observe and understand systems and their behavior.
○ Know when things go wrong
○ Understand and debug service misbehavior
○ Detect trends and act in advance

● Blackbox vs. Whitebox monitoring
○ Blackbox: Observes systems externally with periodic checks
○ Whitebox: Provides internally observed metrics

● Whitebox: Different levels of granularity
○ Logging
○ Tracing
○ Metrics

Monitoring

● Metrics monitoring system and time series database
○ Instrumentation (client libraries and exporters)
○ Metrics collection, processing and storage
○ Querying, alerting and dashboards
○ Analysis, trending, capacity planning
○ Focused on infrastructure, not business metrics

● Key features
○ Powerful query language for metrics with label dimensions
○ Stable and simple operation
○ Built for modern dynamic deploy environments
○ Easy setup

● What it’s not
○ Logging system
○ Designed for perfect answers

Prometheus

Instrumentation case study

Gusta: a simple like service

● Service to handle everything around liking a resource

○ List all likes on a resource

○ Create a like on a resource

○ Delete a like on a resource

● Implementation

○ Written in golang

○ Uses the gokit.io toolkit

Gusta overview

// Like represents all information of a single like.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

// Service describes all methods provided by the gusta service.
type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}

Gusta core

// main.go

// Storage backend: an in-memory store for this demo.
var store gusta.Store
store = gusta.NewMemoryStore()

// Core service, wrapped in a logging middleware.
var s gusta.Service
s = gusta.NewService(store)
s = gusta.LoggingMiddleware(logger)(s)

// HTTP transport in front of the service.
var h http.Handler
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))

http.Handle("/", h)
if err := http.ListenAndServe(*httpAddr, nil); err != nil {
	logger.Log("exit error", err)
}

Gusta server

./gusta

ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080

ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null

ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null

ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null

ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null

ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null

ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found"

ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null

Gusta server

Basic Instrumentation

Providing operational insight

● “Four golden signals” cover the essentials

○ Latency

○ Traffic

○ Errors

○ Saturation

● Similar concepts: RED and USE methods

○ RED (for requests): Rate, Errors, Duration

○ USE (for resources): Utilization, Saturation, Errors

● Information about the service itself

● Interaction with dependencies (other services, databases, etc.)

What information should be provided?

● Direct instrumentation
○ Traffic, Latency, Errors, Saturation
○ Service specific metrics (and interaction with dependencies)
○ Prometheus client libraries provide packages to instrument HTTP requests out of the box

● Exporters
○ Utilization, Saturation
○ node_exporter: CPU, memory, IO utilization per host
○ wmi_exporter does the same for Windows
○ cAdvisor (Container advisor) provides similar metrics for each container

Where to get the information from?

// main.go
import "github.com/prometheus/client_golang/prometheus"

var registry = prometheus.NewRegistry()

registry.MustRegister(
	prometheus.NewGoCollector(),
	// Signature as of 2017; newer client_golang releases take a
	// prometheus.ProcessCollectorOpts value instead.
	prometheus.NewProcessCollector(os.Getpid(), ""),
)

// Pass down registry when creating HTTP handlers.
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)

Initializing Prometheus client library

var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"

// One counter per endpoint; the response code is the only variable label.
requests := prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name:        "gusta_http_server_requests_total",
		Help:        "Total number of requests handled by the HTTP server.",
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{"code"},
)
registry.MustRegister(requests)

h = promhttp.InstrumentHandlerCounter(requests, h)

Counting HTTP requests

var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"

// One histogram per endpoint; bucket bounds are in seconds.
requestDuration := prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:        "gusta_http_server_request_duration_seconds",
		Help:        "A histogram of latencies for requests.",
		Buckets:     []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1},
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{},
)
registry.MustRegister(requestDuration)

h = promhttp.InstrumentHandlerDuration(requestDuration, h)

Observing HTTP request latency

Exposing metrics

Observing the current state

● Prometheus is a pull based monitoring system

○ Instances provide an HTTP endpoint that exposes their metrics

○ Prometheus uses service discovery or static target lists to collect the state periodically

● Centralized management

○ Prometheus decides how often to scrape instances

● Prometheus stores the data on local disk

○ In a big outage, you could run Prometheus on your laptop!

How to collect the metrics?

// main.go
// ...

// Serve everything registered in the registry under /metrics.
http.Handle("/metrics", promhttp.HandlerFor(
	registry,
	promhttp.HandlerOpts{},
))

Exposing the metrics via HTTP

curl -s http://localhost:8080/metrics | grep requests

# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.

# TYPE gusta_http_server_requests_total counter

gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3

gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429

gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51

gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14

gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3

Request metrics

curl -s http://localhost:8080/metrics | grep request_duration

# HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.

# TYPE gusta_http_server_request_duration_seconds histogram

gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414

gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429

gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984

gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429

Latency metrics
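
Two quick numbers can be read straight off the output above: the mean latency is `_sum / _count`, and the share of requests at or below a bucket bound is that bucket's count divided by the total. The following sketch does that arithmetic with the values from the slide; the function names are hypothetical.

```go
package main

import "fmt"

// meanLatency derives the average request duration in seconds
// from a histogram's _sum and _count series.
func meanLatency(sum, count float64) float64 {
	if count == 0 {
		return 0
	}
	return sum / count
}

// fractionWithin returns the share of requests that fell into a
// cumulative bucket, given the bucket count and the total count.
func fractionWithin(bucketCount, total float64) float64 {
	if total == 0 {
		return 0
	}
	return bucketCount / total
}

func main() {
	// Values taken from the GET /api/v1/likes/{id} series above.
	fmt.Printf("mean latency: %.1fµs\n", meanLatency(0.047897984, 429)*1e6)
	fmt.Printf("within le=0.00025: %.1f%%\n", fractionWithin(414, 429)*100)
}
```

Note that these are averages over the whole lifetime of the process. In Prometheus the same ratios are usually computed over `rate()` of the series to get a recent window.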

curl -s http://localhost:8080/metrics | grep process

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.

# TYPE process_cpu_seconds_total counter

process_cpu_seconds_total 892.78

# HELP process_max_fds Maximum number of open file descriptors.

# TYPE process_max_fds gauge

process_max_fds 1024

# HELP process_open_fds Number of open file descriptors.

# TYPE process_open_fds gauge

process_open_fds 23

# HELP process_resident_memory_bytes Resident memory size in bytes.

# TYPE process_resident_memory_bytes gauge

process_resident_memory_bytes 9.3446144e+07

Out-of-the-box process metrics

Collecting metrics

Scraping all service instances

# Scrape all targets every 5 seconds by default.
global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  # Scrape the Prometheus server itself.
  - job_name: prometheus
    static_configs:
      - targets: [localhost:9090]

  # Scrape the Gusta service.
  - job_name: gusta
    static_configs:
      - targets: [localhost:8080]

Static configuration

scrape_configs:
  # Scrape the Gusta service using Consul.
  - job_name: consul
    consul_sd_configs:
      - server: localhost:8500
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prod,.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: job

Consul service discovery
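
The `keep` rule above only retains targets whose Consul tags include `prod`. A small sketch of that matching logic follows, under two assumptions that should be checked against the Prometheus documentation for your version: the tag list is joined with `,` and wrapped in a leading and trailing separator, and the relabel regex is anchored at both ends. The `keep` function is a hypothetical helper, not Prometheus code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keep reports whether a target with the given tags would survive a
// "keep" relabel rule with the given pattern.
func keep(pattern string, tags []string) bool {
	// Assumption: tags arrive as ",tag1,tag2," in __meta_consul_tags.
	joined := "," + strings.Join(tags, ",") + ","
	// Assumption: the pattern must match the whole label value.
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(joined)
}

func main() {
	const pattern = `.*,prod,.*`
	fmt.Println(keep(pattern, []string{"gusta", "prod"}))    // true
	fmt.Println(keep(pattern, []string{"gusta", "staging"})) // false
	fmt.Println(keep(pattern, []string{"preprod"}))          // false
}
```

The surrounding separators are what make `,prod,` a safe pattern: a tag such as `preprod` does not match, and `prod` matches regardless of its position in the list.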

Target overview

Simple Graph UI

Dashboards

Human-readable metrics

Grafana example

Alerts

Actionable metrics

ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance down for more than 5 minutes.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 5 minutes.",
  }

ALERT RunningOutOfFileDescriptors
  IF process_open_fds / process_max_fds * 100 > 95
  FOR 2m
  ANNOTATIONS {
    summary = "Instance has many open file descriptors.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
  }

Alert examples

ALERT GustaHighErrorRate
  IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
     / sum without(code, instance) (rate(gusta_http_server_requests_total[1m]))
     * 100 > 0.1
  FOR 2m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high error rate.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
  }

ALERT GustaHighLatency
  IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high latency.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
  }

Alert examples

ALERT FilesystemRunningFull
  IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
  FOR 1h
  ANNOTATIONS {
    summary = "Filesystem space is filling up.",
    description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
  }

Alert examples
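
`predict_linear` fits a straight line through the samples in the given window and extrapolates it into the future. A minimal least-squares sketch of that idea follows. `sample` and `predictLinear` are hypothetical names, the data is invented, and the prediction here is taken relative to the last sample, which is a simplification of what Prometheus does.

```go
package main

import "fmt"

// sample is one (time, value) pair; t is in seconds.
type sample struct {
	t float64
	v float64
}

// predictLinear fits v = slope*t + intercept by least squares and
// returns the fitted value `ahead` seconds after the last sample.
// The bool is false when no line can be fitted.
func predictLinear(samples []sample, ahead float64) (float64, bool) {
	n := float64(len(samples))
	if n < 2 {
		return 0, false
	}
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		sumT += s.t
		sumV += s.v
		sumTV += s.t * s.v
		sumTT += s.t * s.t
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, false // all samples share the same timestamp
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	last := samples[len(samples)-1].t
	return slope*(last+ahead) + intercept, true
}

func main() {
	// Invented data: free space shrinking by 1 GB per hour over 6 hours.
	var window []sample
	for h := 0; h <= 6; h++ {
		window = append(window, sample{t: float64(h * 3600), v: 10e9 - float64(h)*1e9})
	}
	predicted, ok := predictLinear(window, 24*60*60)
	fmt.Println(ok, predicted < 0) // a negative prediction would trigger the alert
}
```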

Summary

● Monitoring is essential to run, understand and operate services.
● Prometheus
○ Client instrumentation
○ Scrape configuration
○ Querying
○ Dashboards
○ Alert rules
● Important metrics
○ Four golden signals: Latency, Traffic, Errors, Saturation
● Best practices

● https://prometheus.io
● Talks, Articles, Videos: https://www.reddit.com/r/PrometheusMonitoring/
● Our “StackOverflow”: https://www.robustperception.io/blog/
● Ask the community: https://prometheus.io/community/
● Google’s SRE book: https://landing.google.com/sre/book/index.html
● USE method: http://www.brendangregg.com/usemethod.html
● My philosophy on alerting: https://goo.gl/UnvYhQ

Sources

Thank you

● High availability

○ Run two identical servers

● Scaling

○ Shard by datacenter / team / service ( / instance )

● Aggregation across Prometheus servers

○ Federation

● Retention time

○ Generic remote storage support available.

● Pull vs. Push

○ Doesn’t matter in practice. Advantages depend on use case.

● Security

○ The project focuses on being a monitoring system; securing it is left to the user.
