Graphite: Graphs for the modern age

OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age

DESCRIPTION

Graphite is a time-series data charting package, similar to MRTG and Cacti. This talk covers Graphite from the basics to how Booking.com scaled it to millions of datapoints per second.

Graphite basics

● Graphite generates graphs from time-series data
  – Think MRTG or Cacti
  – More flexible than those
● Written in Python
  – This does impact performance
● Web based and easy to use
  – For once, not a marketing buzzword

The church of Graphs

● Pattern recognition
● Correlation
● Analytics
● Anomaly detection

Helpful Graphite features

● Out of order data insertion
● Ability to compare corresponding time periods (time travel)
● Custom retention periods
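Out of order insertion works because every datapoint carries its own timestamp. A minimal sketch, assuming Carbon's plaintext line format (`<metric path> <value> <timestamp>` followed by a newline); the metric name and the timestamps are made up for illustration, and nothing is actually sent over the network:

```python
def format_datapoint(metric, value, timestamp):
    """Return one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    return f"{metric} {value} {int(timestamp)}\n"


def format_batch(points):
    """Join (metric, value, timestamp) tuples into one payload, in the order given."""
    return "".join(format_datapoint(m, v, t) for m, v, t in points)


# Hypothetical metric name; the timestamps are deliberately not chronological.
now = 1_400_000_000
payload = format_batch([
    ("user.example.requests", 42, now),
    ("user.example.requests", 40, now - 120),  # older point sent after a newer one
])
print(payload, end="")
```

In a real setup the payload would be written to a Carbon listener (TCP port 2003 is the usual default for the plaintext protocol); that part is left out here so the sketch runs anywhere.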

Moving parts

● Relays
  – Send data to correct backend store
    ● Pattern matching on metric names
    ● Consistent hashing
● Storage
  – Flat, fixed size files
    ● These are created when the metric is first recorded
    ● Changing later is hard
● Webapp
  – Django based application offering a web API and JavaScript based frontend application
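Because the storage files are fixed size, their size can be worked out from the retention definition before the first datapoint arrives. A rough sketch, assuming Whisper's layout of a 16-byte metadata header, 12 bytes of archive info per archive and 12 bytes per datapoint; the retention string is a made-up example, and only the unit-suffixed `precision:duration` form is handled:

```python
# Sizes assumed from Whisper's on-disk layout (struct formats "!2LfL", "!3L", "!Ld").
METADATA_SIZE = 16
ARCHIVE_INFO_SIZE = 12
POINT_SIZE = 12

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert '10s', '1m', '7d' and similar into seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(definition):
    """Parse 'precision:duration,...' into (seconds_per_point, points) pairs."""
    archives = []
    for part in definition.split(","):
        precision, duration = part.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


def whisper_file_size(definition):
    """Bytes allocated up front for one metric with the given retentions."""
    archives = parse_retentions(definition)
    header = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    return header + sum(points * POINT_SIZE for _, points in archives)


# Hypothetical policy: 10-second points for 6 hours, then 1-minute points for 7 days.
print(whisper_file_size("10s:6h,1m:7d"))
```

Every slot is allocated when the file is created, which is one way to see why changing retention later is hard: the file has to be resized and its contents rewritten.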

Data output

● Web API
  – Everything is an HTTP GET
  – A number of functions for data manipulation
● Graphite offers outputs in multiple formats
  – Graphical (PNG, SVG)
  – Structured (JSON, CSV)
  – Raw data
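Since everything is an HTTP GET, a graph or a data export is just a URL. A small sketch that builds render URLs, assuming the `/render` endpoint with its `target`, `from`, `until` and `format` parameters; the host and metric name are placeholders, and `timeShift` is used to illustrate comparing a series with the same period one week earlier:

```python
from urllib.parse import urlencode


def render_url(base, targets, fmt="json", start="-1h", until="now"):
    """Build a GET URL for the render endpoint; one `target` parameter per series."""
    params = [("target", t) for t in targets]
    params += [("from", start), ("until", until), ("format", fmt)]
    return f"{base.rstrip('/')}/render?{urlencode(params)}"


BASE = "http://graphite.example.com"  # hypothetical host
METRIC = "sys.web01.load"             # hypothetical metric name

# A PNG of the last 24 hours, usable directly as an image source.
png_url = render_url(BASE, [METRIC], fmt="png", start="-24h")

# The same series next to itself shifted back one week, returned as JSON.
compare_url = render_url(BASE, [METRIC, f'timeShift({METRIC},"1w")'], fmt="json")

print(png_url)
print(compare_url)
```

Switching `fmt` between `png`, `svg`, `json`, `csv` and `raw` is all it takes to move between the output formats listed above.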

Using Graphite

● Custom pages pulling in PNG images
  – Just <img src="some url here">
● Using the default frontend
  – For single, one off graphs
  – Debugging problems
● Using builtin dashboards
  – Users create their own dashboards
  – Third party dashboard tools
● Using third party libraries
  – JSON is nice for this
  – Cubism, D3.js, Rickshaw, etc.
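To show why JSON is convenient for third party code, here is a minimal sketch that reads a render response, assuming the usual shape of a list of series, each with a `target` and a list of `[value, timestamp]` pairs. The sample payload is invented for illustration, not real output:

```python
import json

# Invented sample in the shape of a JSON render response.
sample = (
    '[{"target": "sys.web01.load", '
    '"datapoints": [[0.5, 1400000000], [null, 1400000060], [0.7, 1400000120]]}]'
)


def latest_values(body):
    """Map each target to its most recent non-null value (None if it has none)."""
    result = {}
    for series in json.loads(body):
        values = [v for v, _ts in series["datapoints"] if v is not None]
        result[series["target"]] = values[-1] if values else None
    return result


print(latest_values(sample))
```

Null values mark time slots with no recorded data, so consumers generally need to skip or interpolate them, as the sketch does by filtering.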

Using Graphite

● API
  – Monitoring
  – Runtime performance tuning
● Postmortem analytics
● Performance debugging

Making Graphite scale

● Original setup
  – Small cluster
    ● Two frontend boxes, two backend
  – RAID 1+0 with 4 spinning disks
● This works well, with about 200 machines
  – All those individual files force a lot of seeks

Scaling out - try 1

● Add more backend boxes
  – Manual rules to split traffic
  – Pattern matching based on metric names
● Balancing traffic is hard
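The manual rules can be pictured as an ordered list of regular expressions, each mapped to a backend. This is an illustrative sketch only, not the relay's actual configuration format; the patterns and backend names are hypothetical:

```python
import re

# Hypothetical rules, checked in order; the first match wins.
RULES = [
    (re.compile(r"^sys\.web"), "backend-a"),
    (re.compile(r"^sys\."), "backend-b"),
]
DEFAULT_BACKEND = "backend-c"


def route(metric):
    """Return the backend for a metric name according to the first matching rule."""
    for pattern, backend in RULES:
        if pattern.search(metric):
            return backend
    return DEFAULT_BACKEND


for name in ("sys.web01.load", "sys.db01.load", "user.test.counter"):
    print(name, "->", route(name))
```

Nothing in such rules knows how many metrics each pattern will attract, which is why balancing traffic by hand is hard: every new naming prefix can shift load onto one box.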

Scaling up

● Replace spinning disks with SSDs
● Massive performance improvement due to more IOPS
  – Still not as much as we needed
● Losing an SSD meant we had a box die
  – This has been fixed
● SSDs are not as reliable as spinning rust
  – SSDs last for between 12 and 14 months

Sharding – take II

● At about 10 storage servers, manually maintaining regular expressions became painful
● Keeping disk usage balanced was even harder
  – Anyone is allowed to create graphs

Sharding - take II

● Replace regular expressions with consistent hashing
● Switch to RAID 0
  – We have switched back to RAID 1
● Store data on two nodes in each ring
● Mirror rings in datacenters
● Shuffle metrics to avoid losing data and disk space
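Consistent hashing with two copies per ring can be sketched as below. This is a generic hash ring written for illustration, not Carbon's own implementation; the node names, the number of virtual points per node and the use of MD5 are all assumptions:

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent hash ring with virtual points per node."""

    def __init__(self, nodes, points_per_node=100):
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # First 32 bits of the MD5 digest, used as a position on the ring.
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get_nodes(self, metric, copies=2):
        """Walk clockwise from the metric's position and return `copies` distinct nodes."""
        start = bisect.bisect(self._keys, self._hash(metric)) % len(self._ring)
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == copies:
                    break
        return found


ring = HashRing(["store1", "store2", "store3"])  # hypothetical storage nodes
print(ring.get_nodes("sys.web01.load"))
```

The metric name alone decides where data goes, so no rule file needs maintaining, and losing one node still leaves a second copy of each metric elsewhere in the ring.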

Disk usage

● Graphite uses a lot of disk IO
  – Background graph is in thousands on the Y axis
  – Individual files increase seek times
● There are a lot of stat(2) calls
  – This hasn't been investigated yet

Naming conventions

● Graphite has no rules for names
● We adopted:
  – sys.* is for system metrics
  – user.* is for testing/other stuff
  – Anything else which makes sense is acceptable

Collecting metrics

● We have all sorts of homegrown scripts
  – Shell
  – Perl
  – Python
  – PowerShell
● Originally used collectd for system metrics
  – The version of collectd we were using had memory usage issues
    ● These were fixed later

Collecting metrics

● System metrics are now collected by Diamond
● Diamond is a Python application
  – Base framework + metric collection scripts
  – Added custom patches for internal metrics
  – Added patches to send monitoring data directly to Nagios for passive checks

Relay issues

● The Python relaying implementation eats CPU
● Started with relays directly on the cluster
  – Still need more CPU
● Added relays in each datacenter
  – Still need more CPU
● Ran multiple instances on each relay host
  – Still need more CPU
● Finally rewrote in C and added more relay hosts
  – This works for us (and we have breathing room)

Data visibility

● We send data to multiple places
  – Metrics get dropped
● Small application in Go which gets data from multiple locations and gives us a single merged result set
  – Prototyped in Python, which was too slow
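The merging idea can be illustrated in a few lines: when two stores hold copies of the same series and each has dropped different points, take the first non-null value per timestamp. This is a sketch of the concept only, not the code of the Go application from the talk; the function name and sample data are hypothetical:

```python
def merge_series(*copies):
    """Merge lists of (timestamp, value) pairs; the first non-null value per timestamp wins."""
    merged = {}
    for series in copies:
        for ts, value in series:
            if merged.get(ts) is None:
                merged[ts] = value
    return sorted(merged.items())


# Two hypothetical copies of one metric, each missing a different point.
copy_a = [(0, 1.0), (60, None), (120, 3.0)]
copy_b = [(0, 1.0), (60, 2.0), (120, None)]

print(merge_series(copy_a, copy_b))
```

A timestamp stays null in the result only when every copy is missing it, so gaps in one location are hidden as long as another location received the point.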

statsd

● We had statsd running, but unused for a long time
  – statsd use is still relatively small
  – Only a few internal applications use it
  – We already have an analytics framework for this
● The PCI vulnerability scanner reliably crashed it
  – This was patched and pushed upstream

Business metrics

● Turns out, developers like Graphite
  – They don't reliably understand Whisper semantics
    ● Querying Graphite like SQL doesn't work
  – They create a large number of named metrics
    ● foo.bar.YYYY-MM-DD
● Disk space use is a sudden concern
  – Especially when you don't try and restrict this (feature, not bug)
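A quick way to see why date-stamped names hurt: each distinct metric name gets its own fixed-size file, so putting the date in the name multiplies the file count. A small sketch, assuming one file per name; the per-file size below is a made-up figure that in practice depends on the retention configuration:

```python
from datetime import date, timedelta


def dated_metric_names(prefix, start, days):
    """One metric name per day, following the foo.bar.YYYY-MM-DD pattern."""
    return [f"{prefix}.{(start + timedelta(days=i)).isoformat()}" for i in range(days)]


names = dated_metric_names("foo.bar", date(2014, 1, 1), 365)

# Hypothetical per-file size in bytes; the real value depends on retention settings.
ASSUMED_FILE_BYTES = 146_920

# A year of daily names costs 365 files where a single time series would cost one.
print(len(names), "files,", len(names) * ASSUMED_FILE_BYTES, "bytes")
```

Each of those files would also hold at most a day of real data inside a file sized for the full retention period, so most of the allocated space stays empty.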

Scaling out clusters

● Different groups have different requirements
  – Multiple backend rings, same frontend
    ● Unix systems
    ● Windows
    ● Networking
    ● Business metrics
    ● User testing

Current problems

● Hardware
  – Need more CPU
    ● Especially on the frontends where we do a lot of maths
  – Better disk reliability on SSDs
    ● Replacing disks is expensive
  – More disk IO
    ● SSDs are now maxed out under stat(2) calls
    ● Testing Fusion IO cards
      – 10% faster, but we don't know about reliability yet

Current problems

● People
  – If you need a graph, put the data in Graphite
    ● Even if the data isn't time series data
● Frontend scalability
  – The default frontend doesn't work well with a few thousand hosts
● Software upgrades
  – Our last Whisper upgrade caused data recording to stop

Current problems

● Manageability
  – Getting rid of older, non-required metrics is a lot of effort
  – Adding hosts into a ring requires manual rebalancing effort

Future possibilities

● Testing Cassandra as a backend (cyanite)
● Anomaly detection
  – Tested Skyline, didn't scale
● More business metrics
● Sparse metrics
  – Metrics with a lot of nulls, but potentially a lot of named metrics involved

Peopleware

● Hiring people to work on interesting challenges
  – Sysadmins, developers
  – http://www.booking.com/jobs
● Booking.com will be sponsoring a Graphite dev summit in June (tentatively just before the devopsdays Amsterdam event)

Reference URLs

● Graphite
  – https://github.com/graphite-project
● Graphite API
  – http://graphite.readthedocs.org/en/latest/functions.html
● C Carbon relay
  – https://github.com/grobian/carbon-c-relay
● Zipper
  – https://github.com/grobian/carbonserver
● Cyanite
  – https://github.com/pyr/cyanite
  – https://github.com/brutasse/graphite-cyanite

?