Observer: a real-life time-series application
Kévin Lovato - @alprema
Index
• Observer introduction
• Architecture overview
• CQL schema
• Feedback
  – Schema
  – Read/Write access
• Numbers
Observer introduction
Key features
• Publish metrics from anywhere
• Track & investigate business issues
• Alert users in case of unusual behavior
• Integrate with the existing infrastructure
Architecture overview
• Publishers send raw metrics to the Aggregator
• The Aggregator aggregates metrics (sec, min, hour) and writes them to C*
• The WebDashboard loads metric data from C* and serves clients over HTTP
• Clients receive live metric data pushed through the bus (WebSocket)
• The DataCruncher loads and computes all metrics for the day, then writes daily computations (avg, percentiles, etc.) to C*
• The Alertor catches up from C* on startup, receives live metric data through the bus, and sends alerts on the bus
CQL schema
Metric_OneSec
• Schema: ((MetricId, Day), UtcDate), Value
  (wide row keyed on MetricId + Day, one column per UtcDate holding a Value)
• TTL: 8 days
• Max columns per row: 86 400
• Average row size: 1.4 MB
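Declared in CQL, the layout above could look roughly like this (the table and key names come from the slides; the column types and the use of `default_time_to_live` are assumptions):

```sql
-- Sketch of the Metric_OneSec table described above.
-- The partition key (MetricId, Day) buckets each metric per day, so a
-- partition holds at most 86 400 one-second points; the table-level TTL
-- expires data after 8 days (8 * 86 400 = 691 200 seconds).
CREATE TABLE Metric_OneSec (
    MetricId  text,       -- assumption: metric identifier type
    Day       date,       -- bucket component: one partition per metric per day
    UtcDate   timestamp,  -- clustering column: the second being recorded
    Value     double,
    PRIMARY KEY ((MetricId, Day), UtcDate)
) WITH default_time_to_live = 691200;
```

Metric_OneMin and Metric_OneHour follow the same pattern, with FirstDayOfWeek (or no bucket at all) in place of Day and a longer TTL.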
Metric_OneMin
• Schema: ((MetricId, FirstDayOfWeek), UtcDate), Value
  (wide row keyed on MetricId + FirstDayOfWeek, one column per UtcDate holding a Value)
• TTL: 60 days
• Max columns per row: 10 080
• Average row size: 300 KB
Metric_OneHour
• Schema: (MetricId, UtcDate), Value
  (wide row keyed on MetricId alone, one column per UtcDate holding a Value)
• TTL: 10 years
• Average row size: 45 KB
Daily_Aggregate
• Schema: (MetricId, Date), Average, Count, Percentiles, …
  (wide row keyed on MetricId, one set of columns — Date.Average, Date.Count, … — per Date)
• No TTL
• Average row size: 23 KB
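A sketch of Daily_Aggregate in CQL, assuming the percentiles are stored in a map (the slides only list the column names, so the types are guesses):

```sql
-- Hypothetical Daily_Aggregate table: one partition per metric, one
-- clustered row per day, with the precomputed statistics as columns.
-- No TTL, since the daily aggregates are kept forever.
CREATE TABLE Daily_Aggregate (
    MetricId    text,
    Date        date,
    Average     double,
    Count       bigint,
    Percentiles map<double, double>,  -- assumption: percentile -> value
    PRIMARY KEY (MetricId, Date)
);
```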
Feedback – Schema
Row sizing
• Avoid rows spanning long time periods
• Avoid large amounts of data per row (< 100 MB is good)
• Make buckets using another key component (e.g. Day, FirstDayOfWeek)
TTLs
• Don't use them if you don't really need them (extra space wasted)
• Make sure to set them right the first time (or you will need to reinsert your data)
• Consider lowering gc_grace_seconds for your CF (tombstones are useless for TTL'd time-series)
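Shortening the grace period on a TTL-only table could look like the following (table name reused from the earlier slides; the value 0 assumes the table never receives explicit deletes, so expired tombstones need not be propagated by repair):

```sql
-- For time-series data that only ever expires via TTL (no DELETEs),
-- tombstones do not need to survive until the next repair, so the
-- grace period can be cut down; 0 is the aggressive option.
ALTER TABLE Metric_OneSec WITH gc_grace_seconds = 0;
```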
General best practices
• Consider disabling inter-DC read repair on your CF (read_repair_chance)
• Use collection types (map<>, etc.)
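Disabling the global (cross-DC) read-repair chance might look like this, keeping the local-DC one if desired (note this table option exists in the Cassandra 2.x/3.x era the talk targets and was later removed in 4.0):

```sql
-- read_repair_chance triggers read repair across all replicas,
-- including remote DCs; setting it to 0 avoids cross-DC traffic while
-- dclocal_read_repair_chance still covers the local DC.
ALTER TABLE Metric_OneSec
WITH read_repair_chance = 0
 AND dclocal_read_repair_chance = 0.1;
```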
Feedback – Read / Write
Obvious, but…
• Avoid Thrift (reading huge rows can take down your cluster)
• Do not disable paging (same effect as using Thrift)
• Use prepared statements
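A prepared statement is created by the driver from a CQL string with bind markers; the server parses it once and subsequent executions send only the values. The statement below reuses the Metric_OneSec columns from the earlier slides as an illustration:

```sql
-- Prepared once, executed many times with just the bound values.
INSERT INTO Metric_OneSec (MetricId, Day, UtcDate, Value)
VALUES (?, ?, ?, ?);
```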
Batches
• Warning: not intended for performance
• But… they can improve insert performance under adequate conditions
• Use small (< 5 KB) unlogged batches
• Benchmark with your own use case
• Don't tell @PatrickMcFadin you did it
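A small unlogged batch, as suggested above, works best when every statement targets the same partition (the metric id, day, and values below are made up for illustration):

```sql
-- UNLOGGED skips the batchlog; both inserts hit the same partition
-- ((MetricId, Day)), so the batch is applied by a single replica set.
BEGIN UNLOGGED BATCH
  INSERT INTO Metric_OneSec (MetricId, Day, UtcDate, Value)
  VALUES ('metric_a', '2015-09-21', '2015-09-21 10:00:00', 1.0);
  INSERT INTO Metric_OneSec (MetricId, Day, UtcDate, Value)
  VALUES ('metric_a', '2015-09-21', '2015-09-21 10:00:01', 2.0);
APPLY BATCH;
```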
Asynchronous queries
• Mandatory if you want to be fast (for anything over 1 query)
• For massive reads, send your queries in bunches and wait for them together
General best practices
• Benchmark all heavy operations in terms of cluster load (a faster implementation might just be killing the cluster for everyone else)
• Watch out for CL: ONE (we experienced slowdowns when the coordinator queried a different DC under heavy load)
Numbers time
• Total number of metrics: 17K
• Metrics inserted: 10K/s
• Data points daily aggregation speed: 500K/s
• DC size: 3 nodes (spinning disks)
Future
• Use DTCS (maybe TWCS? CASSANDRA-9666 / CASSANDRA-10195)
• Move to SSDs everywhere
Interested? We’re hiring
Questions?
Image credits – The Noun Project • Björn Andersson • Creative Stall • Gregor Cresnar • Justin Blake • Lemon Liu • Mark Shorter • Shawn Schmidt • Stéphanie Rusch