Graphite is a timeseries data charting package, similar to MRTG and Cacti. This talk will cover Graphite starting from the basics to how booking.com scaled it to millions of datapoints per second.
Graphite
Graphs for the modern age
Graphite basics
● Graphite generates graphs from timeseries data
  – Think MRTG or Cacti
  – More flexible than those
● Written in Python
  – This does impact performance
● Web based and easy to use
  – For once, not a marketing buzzword
The church of Graphs
● Pattern recognition
● Correlation
● Analytics
● Anomaly detection
Helpful Graphite features
● Out of order data insertion
● Ability to compare corresponding time periods (time travel)
● Custom retention periods
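Custom retention periods are set per metric pattern in Whisper's storage-schemas.conf. A minimal sketch; the section names, patterns, and periods below are illustrative, not taken from the talk:

```
# First matching pattern wins, top to bottom.
[carbon]
pattern = ^carbon\.
retentions = 60s:90d

# Everything else: 10-second points for 6 hours, then rolled up.
[default]
pattern = .*
retentions = 10s:6h,1m:7d,10m:5y
```

Note that these settings only apply when a Whisper file is first created, which is why (as the Moving parts slide says) changing them later is hard.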
Moving parts
● Relays
  – Send data to correct backend store
    ● Pattern matching on metric names
    ● Consistent hashing
● Storage
  – Flat, fixed size files
    ● These are created when the metric is first recorded
    ● Changing later is hard
● Webapp
  – Django based application offering a web API and JavaScript based frontend application
Data output
● Web API
  – Everything is an HTTP GET
  – A number of functions for data manipulation
● Graphite offers outputs in multiple formats
  – Graphical (PNG, SVG)
  – Structured (JSON, CSV)
  – Raw data
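Since everything is an HTTP GET, a query is just a URL against the /render endpoint. A small sketch; the host and metric name are assumptions, but `target`, `from`, and `format` are standard render-API parameters:

```python
from urllib.parse import urlencode

# Hypothetical Graphite installation -- adjust for your own.
GRAPHITE = "http://graphite.example.com"

def render_url(target, frm="-1h", fmt="json"):
    """Build a /render API URL: everything is an HTTP GET."""
    params = urlencode({"target": target, "from": frm, "format": fmt})
    return f"{GRAPHITE}/render?{params}"

# Data-manipulation functions wrap the metric name server-side,
# e.g. a 10-point moving average:
url = render_url("movingAverage(sys.web1.load.avg, 10)")
```

Swapping `format=json` for `png`, `svg`, `csv`, or `raw` selects the other output formats listed above.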
Using Graphite
● Custom pages pulling in PNG images
  – Just <img src="some url here">
● Using the default frontend
  – For single, one off graphs
  – Debugging problems
● Using builtin dashboards
  – Users create their own dashboards
  – Third party dashboard tools
● Using third party libraries
  – JSON is nice for this
  – Cubism, D3.js, Rickshaw, etc.
Using Graphite
● API
  – Monitoring
  – Runtime performance tuning
● Postmortem analytics
● Performance debugging
Making Graphite scale
● Original setup
  – Small cluster
    ● Two frontend boxes, two backend
  – RAID 1+0 with 4 spinning disks
    ● This works well with about 200 machines
  – All those individual files force a lot of seeks
Scaling out - try 1
● Add more backend boxes
  – Manual rules to split traffic
  – Pattern matching based on metric names
● Balancing traffic is hard
Scaling up
● Replace spinning disks with SSDs
● Massive performance improvement due to more IOPS
  – Still not as much as we needed
● Losing an SSD meant we had a box die
  – This has been fixed
● SSDs are not as reliable as spinning rust
  – SSDs last for between 12 and 14 months
Sharding – take II
● At about 10 storage servers, manually maintaining regular expressions became painful
● Keeping disk usage balanced was even harder
  – Anyone is allowed to create graphs

Sharding – take II
● Replace regular expressions with consistent hashing
● Switch to RAID 0
  – We have since switched back to RAID 1
● Store data on two nodes in each ring
● Mirror rings in datacenters
● Shuffle metrics to avoid losing data and disk space
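Consistent hashing removes the hand-written rules: each metric name hashes to a position on a ring of storage nodes, so adding a node only moves a fraction of the metrics. A toy sketch of the idea; carbon's real ring differs in its hashing details, and the node names here are made up:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring for routing metrics to storage nodes."""

    def __init__(self, nodes, replicas=100):
        # Place each node at many pseudo-random positions ("virtual
        # nodes") so load spreads evenly around the ring.
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.positions = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, metric):
        # A metric belongs to the first node at or after its hash,
        # wrapping around the end of the ring.
        idx = bisect(self.positions, self._hash(metric)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["store-a", "store-b", "store-c"])
node = ring.get_node("sys.web1.cpu.user")
```

Storing each metric on two ring nodes (as above) means a single node can die without data loss, at the cost of the rebalancing effort noted under "Current problems".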
Disk usage
● Graphite uses a lot of disk IO
  – Background graph is in thousands on the Y axis
  – Individual files increase seek times
● There are a lot of stat(2) calls
  – This hasn't been investigated yet
Naming conventions
● Graphite has no rules for names
● We adopted:
  – sys.* is for system metrics
  – user.* is for testing/other stuff
  – Anything else which makes sense is acceptable
Collecting metrics
● We have all sorts of homegrown scripts
  – Shell
  – Perl
  – Python
  – PowerShell
● Originally used collectd for system metrics
  – The version of collectd we were using had memory usage issues
    ● These have since been fixed

Collecting metrics
● System metrics are now collected by Diamond
● Diamond is a Python application
  – Base framework + metric collection scripts
  – Added custom patches for internal metrics
  – Added patches to send monitoring data directly to Nagios for passive checks
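Whatever the collector (shell, Perl, Diamond, ...), it ultimately speaks carbon's plaintext protocol: one "path value timestamp" line per datapoint. A minimal sketch; the host name is an assumption, 2003 is carbon's default plaintext port:

```python
import socket
import time

def metric_line(path, value, timestamp=None):
    """Format one datapoint in carbon's plaintext line protocol."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"

def send_metric(path, value, host="carbon.example.com", port=2003):
    """Ship a single datapoint to a carbon relay or cache."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(metric_line(path, value).encode())
```

The timestamp travels with the value, which is what makes the out-of-order insertion mentioned earlier possible.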
Relay issues
● The Python relaying implementation eats CPU
● Started with relays directly on the cluster
  – Still need more CPU
● Added relays in each datacenter
  – Still need more CPU
● Ran multiple instances on each relay host
  – Still need more CPU
● Finally rewrote in C and added more relay hosts
  – This works for us (and we have breathing room)
Data visibility
● We send data to multiple places
  – Metrics get dropped
● Small application in Go which gets data from multiple locations and gives us a single merged resultset
  – Prototyped in Python, which was too slow
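The core of that merge is simple: fetch the same series from every location and fill each null with a value from another copy. A much-simplified sketch of the idea (not the actual Go code):

```python
def merge_series(primary, secondary):
    """Merge two datapoint lists for the same metric and time window,
    preferring non-null values from the primary copy."""
    return [a if a is not None else b for a, b in zip(primary, secondary)]

# Each copy dropped different datapoints; the merge recovers all three.
merged = merge_series([1.0, None, 3.0], [None, 2.0, None])
```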
statsd
● We had statsd running, but unused for a long time
  – statsd use is still relatively small
  – Only a few internal applications use it
  – We already have an analytics framework for this
● The PCI vulnerability scanner reliably crashed it
  – This was patched and pushed upstream
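For reference, statsd's wire format is a one-line UDP datagram, "bucket:value|type". A sketch; the host name is an assumption, 8125 is statsd's default UDP port:

```python
import socket

def statsd_packet(bucket, value, metric_type="c"):
    """Format a statsd datagram; "c" is a counter, "ms" a timer."""
    return f"{bucket}:{value}|{metric_type}".encode()

def send(bucket, value, host="statsd.example.com", port=8125):
    """Fire-and-forget a counter increment over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(statsd_packet(bucket, value), (host, port))
```

Since it is UDP, malformed input must be tolerated -- which is exactly what the vulnerability-scanner crash above exercised.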
Business metrics
● Turns out, developers like Graphite
  – They don't reliably understand Whisper semantics
    ● Querying Graphite like SQL doesn't work
  – They create a large number of named metrics
    ● foo.bar.YYYY-MM-DD
    ● Disk space use is a sudden concern
  – Especially when you don't try and restrict this (feature, not bug)
Scaling out clusters
● Different groups have different requirements
  – Multiple backend rings, same frontend
    ● Unix systems
    ● Windows
    ● Networking
    ● Business metrics
    ● User testing
Current problems
● Hardware
  – Need more CPU
    ● Especially on the frontends where we do a lot of maths
  – Better disk reliability on SSDs
    ● Replacing disks is expensive
  – More disk IO
    ● SSDs are now maxed out under stat(2) calls
    ● Testing Fusion-io cards
      – 10% faster, but we don't know about reliability yet
Current problems
● People
  – If you need a graph, put the data in Graphite
    ● Even if the data isn't timeseries data
● Frontend scalability
  – The default frontend doesn't work well with a few thousand hosts
● Software upgrades
  – Our last Whisper upgrade caused data recording to stop
Current problems
● Manageability
  – Getting rid of older, non-required metrics is a lot of effort
  – Adding hosts into a ring requires manual rebalancing effort
Future possibilities
● Testing Cassandra as a backend (Cyanite)
● Anomaly detection
  – Tested Skyline, didn't scale
● More business metrics
● Sparse metrics
  – Metrics with a lot of nulls, but potentially a lot of named metrics involved
Peopleware
● Hiring people to work on interesting challenges
  – Sysadmins, developers
  – http://www.booking.com/jobs
● Booking.com will be sponsoring a Graphite dev summit in June (tentatively just before the devopsdays Amsterdam event)
Reference URLs
● Graphite
  – https://github.com/graphite-project
● Graphite API
  – http://graphite.readthedocs.org/en/latest/functions.html
● C Carbon relay
  – https://github.com/grobian/carbon-c-relay
● Zipper
  – https://github.com/grobian/carbonserver
● Cyanite
  – https://github.com/pyr/cyanite
  – https://github.com/brutasse/graphite-cyanite
?