…And Metrics For All
Paul O’Connor
github.com/pauloconnor
2015-05-19
About Yelp
Founded: 2004
Monthly Active Users: ~142 Million
Non-US Monthly Users: ~31 Million
Reviews: ~77 Million
Local Businesses: 2.1 Million
Territories: Available in 31 countries
What are metrics?
Name             Value      Timestamp
server1.load.1m  28.826667  1431950640
server1.load.1m  29.188333  1431950700
server1.load.1m  29.231667  1431950760
server1.load.1m  29.083333  1431950820
server1.load.1m  29.710000  1431950880
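Each data point above is a name, a value and a Unix timestamp. A minimal Python sketch of building and parsing such a line, assuming the space-separated layout shown on the slide; the helper names `format_metric` and `parse_metric` are invented for illustration.

```python
import time


def format_metric(name, value, timestamp=None):
    """Build one 'name value timestamp' line (timestamp in Unix seconds)."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def parse_metric(line):
    """Split a line back into (name, value, timestamp)."""
    name, value, timestamp = line.split()
    return name, float(value), int(timestamp)


line = format_metric("server1.load.1m", 28.826667, 1431950640)
print(line, end="")
print(parse_metric(line))
```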
Graphite Components
• Carbon:
  • relay
  • cache
  • aggregator
• Whisper
• Web app
Carbon Relay
• Deals with 2 things:
  • Replication
  • Sharding
Relay Methods
• Rules
    [replicate]
    pattern = ^services\.ads\..+
    servers = 10.1.2.3, 10.2.2.3
    continue = true
• Consistent Hashing
  • Defines a sharding strategy across multiple backends
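To make the sharding idea concrete, here is a toy consistent-hash ring in Python. It is a sketch of the general technique only, not carbon's actual implementation; the class name `HashRing`, the replica count and the node names are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (illustrative, not carbon's own code)."""

    def __init__(self, nodes, replicas=100):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            # Several virtual points per node spread the load more evenly.
            for i in range(replicas):
                self._ring.append((self._position(f"{node}:{i}"), node))
        self._ring.sort()

    @staticmethod
    def _position(key):
        # First 8 hex characters of MD5 give a 32-bit ring position.
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get_node(self, metric):
        """Return the first node at or after the metric's ring position."""
        pos = self._position(metric)
        idx = bisect.bisect_left(self._ring, (pos, ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])  # hypothetical backends
print(ring.get_node("server1.load.1m"))
```

The same metric name always lands on the same backend, and adding a backend only moves the metrics that now belong to the new one.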
Carbon Cache
• Receives metrics and persists them to disk
• Writes based on storage schemas
Storage Schemas
• Details retention rates for storing metrics
    [databases_10sec_1year]
    pattern = ^servers\.db.*$
    retentions = 10s:7d,1m:30d,5m:90d,30m:365d
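Each `precision:duration` pair in `retentions` defines one archive with a fixed number of points. The Python sketch below works out those counts for the schema above; the function names and the unit table are assumptions made for the example.

```python
# Seconds per unit suffix (assumed set, enough for the schema above).
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a duration such as '10s' or '7d' to seconds."""
    return int(spec[:-1]) * UNITS[spec[-1]]


def parse_retentions(retentions):
    """Return a list of (seconds_per_point, number_of_points) per archive."""
    archives = []
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


archives = parse_retentions("10s:7d,1m:30d,5m:90d,30m:365d")
for step, points in archives:
    print(f"{step:>5}s per point, {points} points")
print("total points:", sum(points for _, points in archives))
```

Because every archive holds a fixed number of points, the size of the file for a metric is known as soon as its schema is chosen.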
Storage Aggregation
• Rules for aggregating data to lower-precision retentions
    [all_min]
    pattern = \.min$
    xFilesFactor = 0.1
    aggregationMethod = min
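As a rough illustration of what `xFilesFactor` and `aggregationMethod` control, the Python sketch below rolls one window of high-precision points up into a single lower-precision point. It is a simplified model, not Whisper's code; `roll_up` and the sample window are invented for the example.

```python
AGGREGATORS = {
    "average": lambda xs: sum(xs) / len(xs),
    "sum": sum,
    "min": min,
    "max": max,
    "last": lambda xs: xs[-1],
}


def roll_up(points, method="min", x_files_factor=0.1):
    """Aggregate one window of points; None marks a missing point."""
    known = [p for p in points if p is not None]
    # Store nothing unless enough of the window is actually populated.
    if not known or len(known) / len(points) < x_files_factor:
        return None
    return AGGREGATORS[method](known)


window = [None, 4.0, None, None, 2.5, None]  # six 10s points -> one 1m point
print(roll_up(window))                       # 2 of 6 known, above 0.1
print(roll_up(window, x_files_factor=0.5))   # 2 of 6 known, below 0.5
```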
Carbon Aggregator
• Buffers metrics before forwarding to carbon cache
• Rolls up metrics based on rules
Aggregation Rules
• Not to be confused with storage aggregation
• Tells the carbon aggregator what to aggregate and how
output_template (frequency) = method input_pattern
<env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests
prod.applications.apache.www01.requests
prod.applications.apache.www02.requests
prod.applications.apache.www03.requests
prod.applications.apache.www04.requests
prod.applications.apache.www05.requests
prod.applications.apache.all.requests
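The rule above can be mimicked in a few lines of Python. This is only a sketch of the matching and summing step: it ignores the 60-second frequency and buffering, and the helpers `compile_input_pattern` and `aggregate_sum`, plus the sample values, are assumptions.

```python
import re
from collections import defaultdict


def compile_input_pattern(pattern):
    """Turn '<env>.applications.<app>.*.requests' into a regex."""
    parts = []
    for part in pattern.split("."):
        field = re.fullmatch(r"<(\w+)>", part)
        if field:
            # <name> captures exactly one dot-separated node.
            parts.append(f"(?P<{field.group(1)}>[^.]+)")
        else:
            # '*' matches within a single node only.
            parts.append(re.escape(part).replace(r"\*", "[^.]*"))
    return re.compile(r"\.".join(parts))


def aggregate_sum(input_pattern, output_template, datapoints):
    """Sum values of matching metrics under the templated output name."""
    regex = compile_input_pattern(input_pattern)
    totals = defaultdict(float)
    for name, value in datapoints:
        match = regex.fullmatch(name)
        if match:
            output = re.sub(r"<(\w+)>", lambda m: match.group(m.group(1)),
                            output_template)
            totals[output] += value
    return dict(totals)


points = [(f"prod.applications.apache.www0{i}.requests", 10.0 * i)
          for i in range(1, 6)]
print(aggregate_sum("<env>.applications.<app>.*.requests",
                    "<env>.applications.<app>.all.requests", points))
```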
Whisper
• Fixed size database
• Allows for roll ups
• Allows for backfilling data
Web App
• Django based app for rendering graphs
Putting it all together
• Carbon cache listening on port 2003
• Write to disk
• Listen with web
Getting more complicated
• Carbon relay using consistent hashing to multiple caches
• Individual caches responsible for specific metrics
More Relays
• Use HAProxy to load balance between relays
• Use more relays to use CPU
Even more relays
• Useful for sending metrics to other locations
Replicate the metrics
• Duplicate your metrics for backup and redundancy
More caches instead
• Consistent hash across multiple nodes
Where does the aggregator fit?
• Aggregator uses a lot of CPU. Put it on its own node
Scaling further
• Use nodes for particular functions:
  • Use forwarding relay nodes solely to forward
  • Have consistent hashing nodes
  • Have aggregation nodes
Getting your data back out
• Graphite Dashboard
• Third Party Dashboard
• We use Grafana http://grafana.org/
• Graphite-api https://github.com/brutasse/graphite-api
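Dashboards like these ultimately ask the Graphite web app for data over HTTP. The Python sketch below builds a request for the render endpoint and reads a JSON reply of the shape that endpoint returns (a list of series, each with `[value, timestamp]` pairs). The host `graphite.example.com` and the sample payload are made up for illustration.

```python
import json
from urllib.parse import urlencode


def render_url(base_url, target, time_from="-1h", fmt="json"):
    """Build a render URL for one target."""
    query = urlencode({"target": target, "from": time_from, "format": fmt})
    return f"{base_url.rstrip('/')}/render?{query}"


def known_points(payload):
    """Map each target to [(timestamp, value)], skipping null values."""
    return {
        series["target"]: [(ts, value) for value, ts in series["datapoints"]
                           if value is not None]
        for series in json.loads(payload)
    }


print(render_url("http://graphite.example.com", "server1.load.1m"))

# Illustrative payload only; a real one would come from the URL above.
sample = ('[{"target": "server1.load.1m", "datapoints": '
          '[[28.826667, 1431950640], [null, 1431950700]]}]')
print(known_points(sample))
```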
Tips
• Aggregate before ingestion
• Control the metrics that can be sent
• Metrics are a gas - they expand to fill all available room
• Use C implementation of carbon
• Use the latest webapp
Optimize your dashboard queries
• services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99
• 2154 results
• 35 seconds to just find these files on disk
• Running functions against these results
• Timeout after a minute
• Dashboard automatically refreshing every 10 seconds
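One way to see why a query like this is expensive is to count how many series each wildcard pulls in. The Python sketch below matches dotted names node by node with shell-style globs; it only approximates how wildcards expand, and the host, worker and tween names are invented.

```python
from fnmatch import fnmatchcase


def matches(pattern, name):
    """Match a dotted metric name against a glob pattern, node by node."""
    pattern_nodes = pattern.split(".")
    name_nodes = name.split(".")
    return len(pattern_nodes) == len(name_nodes) and all(
        fnmatchcase(node, glob) for glob, node in zip(pattern_nodes, name_nodes)
    )


# Hypothetical metric tree: 3 hosts x 2 workers x 2 tweens x 2 percentiles.
names = [
    f"services.biz_app.{host}.{worker}.timers."
    f"pyramid_uwsgi_metrics_tweens_{tween}.{pct}"
    for host in ("host1", "host2", "host3")
    for worker in ("w1", "w2")
    for tween in ("auth", "cache")
    for pct in ("p50", "p99")
]

wide = "services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99"
narrow = "services.biz_app.host1.w1.timers.pyramid_uwsgi_metrics_tweens_*.p99"
print(sum(matches(wide, n) for n in names), "series for the wide query")
print(sum(matches(narrow, n) for n in names), "series for the narrow query")
```

Each extra wildcard multiplies the number of files to find and read, so pinning even one node cuts the work considerably.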
What’s the Future?
• InfluxDB
• Cassandra
• Third party
We’re hiring!
http://www.yelp.com/careers
Hiring SREs in Dublin, London, New York, San Francisco