A Whirlwind Tour of Etsy's Monitoring Stack


Daniel Schauenberg

[email protected]

@mrtazz


Item by TheBackPackShoppe

How comfortable are you deploying a change right now?

“If this is your first day at Etsy, you deploy the site”


Ganglia

• System level metrics

• Instance per DC/environment

• > 220k RRD files

• Fully configured through Chef role attributes


Rainbow Graphs!


StatsD

• Single instance on one server

• Traffic mostly from 70 Web & 24 API servers

• Node.js

• Heavy Sampling

• Graphite as backend
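
To make "Heavy Sampling" concrete, here is a minimal Python sketch of a sampled StatsD counter. It is not Etsy's client code. The metric name, host, and port are assumptions chosen for illustration. The line format (`name:value|c|@rate`) is the standard StatsD counter syntax, and the server scales a sampled count back up by 1/rate.

```python
import random
import socket


def format_counter(name, value=1, sample_rate=1.0):
    """Build a StatsD counter line, e.g. 'web.requests:1|c|@0.1'."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # Tell the server which fraction of events is actually sent.
        line += f"|@{sample_rate}"
    return line


def send_counter(name, value=1, sample_rate=1.0,
                 host="127.0.0.1", port=8125, rng=random.random):
    """Send a counter with probability sample_rate; return True if sent.

    host/port are assumptions (8125 is the usual StatsD default).
    """
    if sample_rate < 1.0 and rng() >= sample_rate:
        return False  # dropped by sampling, no packet is sent
    payload = format_counter(name, value, sample_rate).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))  # UDP, fire and forget
    return True
```

With `sample_rate=0.1` roughly nine out of ten calls return without touching the network, which is what keeps a single StatsD instance viable behind many web and API servers.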


Graphite

• Application level metrics

• 96G RAM, 20 Cores, 7.3T SSD RAID 10

• 525k metrics/minute

• Mirrored Master/Master Setup

• Functionally sharded relays


[Diagram: CNAME, then relays and caches on each of the two mirrored masters. Shard labels: statsdtimers, statsdcounts, statsd, chef, logster, fqld, search, generic]
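
One way to read "functionally sharded relays" is that each metric is routed to a relay by what it measures rather than by a hash of its name. The Python sketch below illustrates that idea only. The shard names come from the diagram, but the regular expressions, the rule order, and the choice of `generic` as the fallback are assumptions, not Etsy's actual relay rules. `plaintext_line` formats a datapoint in Graphite's plaintext protocol (`path value timestamp`).

```python
import re

# (shard name, pattern) pairs, checked in order; first match wins.
# Shard names are from the slide; every pattern here is hypothetical.
RELAY_RULES = [
    ("statsdtimers", re.compile(r"^stats\.timers\.")),
    ("statsdcounts", re.compile(r"^stats_counts\.")),
    ("statsd", re.compile(r"^stats\.")),
    ("chef", re.compile(r"^chef\.")),
    ("logster", re.compile(r"^logster\.")),
    ("search", re.compile(r"^search\.")),
]
DEFAULT_SHARD = "generic"  # assumed catch-all for everything else


def pick_shard(metric_path):
    """Return the name of the shard that should receive this metric."""
    for shard, pattern in RELAY_RULES:
        if pattern.search(metric_path):
            return shard
    return DEFAULT_SHARD


def plaintext_line(metric_path, value, timestamp):
    """Format one datapoint in Graphite's plaintext protocol."""
    return f"{metric_path} {value} {int(timestamp)}\n"
```

Rule order matters in this sketch: the specific `stats.timers.` rule has to come before the broader `stats.` rule, or timers would land on the general statsd shard.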


Syslog-ng

• Web, Search, Gearman, Photos, Nagios, Network, VPN

• 1.2GB written/minute

• Chef role attribute based config

• Rule ordering!


github.com/etsy/logster

• Extract metrics from log files

• Written in Python

• Runs every minute via cron
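
As a rough illustration of "extract metrics from log files", the standalone Python sketch below counts HTTP responses by status class and reports a per-second rate. It mirrors the two-step shape of a logster parser (`parse_line` for each new line, `get_state` once per run) but deliberately does not import logster. The class name, the `Metric` tuple, the metric names, and the log format are all assumptions.

```python
import re
from collections import namedtuple

# Stand-in for a metric record; not logster's own class.
Metric = namedtuple("Metric", ["name", "value", "units"])


class HttpStatusParser:
    """Hypothetical parser: count responses per status class."""

    # Matches the status code after the quoted request in a
    # common-log-format line, e.g. ... "GET / HTTP/1.1" 200 2326
    LINE_RE = re.compile(r'" (?P<status>\d{3}) ')

    def __init__(self):
        self.counts = {"2xx": 0, "3xx": 0, "4xx": 0, "5xx": 0}

    def parse_line(self, line):
        """Update the counters from one log line; ignore other lines."""
        match = self.LINE_RE.search(line)
        if match is None:
            return
        bucket = match.group("status")[0] + "xx"
        if bucket in self.counts:
            self.counts[bucket] += 1

    def get_state(self, duration):
        """Return per-second rates over `duration` seconds."""
        return [
            Metric(f"http_{bucket}", count / duration, "Responses per sec")
            for bucket, count in sorted(self.counts.items())
        ]
```

A cron-driven run would feed only the lines written since the previous run to `parse_line`, then pass the elapsed seconds to `get_state` and ship the result to Graphite or Ganglia.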


Splunk

• Indexes all of our log files

• Easy search for patterns

• Saved searches for interesting ones

• Basically using it as a glorified grep


Logstash

• Experimental status

• Makes it easier to integrate different sources

• Easy to set up in dev environment

• Trying to figure out where/how it fits into our infrastructure


Eventinator

• Tracks all events in our infrastructure

• Chef runs and changes

• DNS changes

• Network

• Deploys

• Server provisioning and decommissioning

• ~ 12 million events in the last 2 years


Chef

• rules everything around me

• Same cookbooks on prod and dev

• every node runs Chef every 10 minutes

• ton of knife plugins and handlers


> 120 recipes


Nagios

• 2 instances in each DC/environment

• Fully Chef generated configuration

• Service checks and contacts in git

• Notifications via email->SMS gateway

• ~75% ops on-call


github.com/lozzd/nagdash


Nagios Herald

• Add context to Nagios alerts

• What are the first 5 things you do when you get paged?

• You already have the phone in your hand

• Nagios notification handler
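
The idea of a notification handler that adds context can be sketched in a few lines of Python. This is not Nagios Herald's code or API; it only illustrates the pattern. The `NAGIOS_*` keys are the environment variables Nagios exports for its macros when environment macros are enabled. The `graph_link` provider and its URL are hypothetical.

```python
def base_alert(env):
    """Render the bare alert from Nagios macro environment variables."""
    return "{state}: {service} on {host}\n{output}".format(
        state=env.get("NAGIOS_SERVICESTATE", "UNKNOWN"),
        service=env.get("NAGIOS_SERVICEDESC", "unknown service"),
        host=env.get("NAGIOS_HOSTNAME", "unknown host"),
        output=env.get("NAGIOS_SERVICEOUTPUT", ""),
    )


def graph_link(env):
    """Hypothetical context provider: point at a graph for the host."""
    host = env.get("NAGIOS_HOSTNAME", "unknown")
    # example.com URL and metric path are placeholders, not real endpoints
    return f"Graph: https://graphite.example.com/render?target=servers.{host}.load"


def add_context(env, providers):
    """Append the output of each context provider to the base alert."""
    lines = [base_alert(env)]
    for provider in providers:
        try:
            extra = provider(env)
        except Exception as exc:
            # A failing provider must never swallow the page itself.
            extra = f"(context unavailable: {exc})"
        if extra:
            lines.append(extra)
    return "\n".join(lines)
```

Each provider answers one of the "first 5 things you do when you get paged", so the answer is already in the message when it arrives. Calling `add_context(os.environ, [graph_link])` from a notification command would be one way to wire it up.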


The Toys are real


There’s another side of heaven


Ops Weekly


Summary

• Set of trusted tools

• Enhance them where they fall short

• Try out new things

• Write tools where applicable

• Continuous monitoring and adaptation


codeascraft.com

etsy.com/codeascraft/talks

etsy.github.com

etsy.com/careers


Questions?
