Realtime Analytics with Cassandra
or: How I Learned to Stop Worrying and
Love Counting
What is Realtime Analytics?
e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
http://blog.twitter.com/2011/03/numbers.html
Introduction
Live & historical aggregates...
Realtime trends...
Drill downs and roll ups
Okay, so how are we going to do it?
For each tweet,
increment a bunch of counters,
such that answering a query
is as easy as reading some counters
Preparing the data
Step 1: Get a feed of the tweets
Step 2: Tokenise the tweet
Step 3: Increment counters in time buckets for each token
12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!
[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
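The three steps above can be sketched in a few lines of Python. An in-memory dict stands in for Cassandra's counter columns, and the tokeniser is deliberately crude (the real one would also stem, e.g. "rocks" → "rock"):

```python
import re
from collections import defaultdict

# In-memory stand-in for Cassandra counter columns: {(bucket, token): count}.
counters = defaultdict(int)

def tokenise(tweet):
    """Lower-case the tweet and split it into bare tokens, dropping #, @ and punctuation."""
    return re.findall(r"[a-z0-9]+", tweet.lower())

def record(timestamp, tweet):
    """Steps 2 + 3: tokenise, then bump one counter per (time bucket, token)."""
    bucket = timestamp[:5].replace(":", "")   # "12:34:04" -> "1234" (minute bucket)
    for token in tokenise(tweet):
        counters[(bucket, token)] += 1

record("12:34:04", "Man, @acunu rocks!")
```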
Querying
Step 1: Do a range query
Step 2: Result table
Step 3: Plot pretty graph
start: [01/05/11, acunu]
end: [30/05/11, acunu]
Key #Mentions
[01/05/11 00:01, acunu] 3
[01/05/11 00:02, acunu] 5
... ...
[Graph: #mentions of ‘acunu’ per day, May through Nov]
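Steps 1 and 2 can be sketched as a range query over sorted (timestamp, term) keys. A plain dict stands in for Cassandra here, and the dates are ISO-style (year first) purely so they sort lexicographically; the slides' dd/mm/yy keys would need an order-preserving encoding:

```python
# In-memory stand-in for the counter store: {(timestamp, term): count}.
counters = {
    ("2011-05-01 00:01", "acunu"): 3,
    ("2011-05-01 00:02", "acunu"): 5,
    ("2011-05-30 23:59", "acunu"): 7,
    ("2011-06-01 00:00", "acunu"): 2,   # falls outside the query below
}

def range_query(start, end, term):
    """Return sorted (timestamp, count) pairs for `term` between start and end."""
    return sorted((ts, n) for (ts, t), n in counters.items()
                  if t == term and start <= ts <= end)

rows = range_query("2011-05-01", "2011-05-31", "acunu")
# `rows` is the result table; feed it to a plotting library for the pretty graph.
```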
Except it’s not that easy...
• Cassandra best practice is to use RandomPartitioner, so range queries over row keys are not possible
• Could manually work out each row in the range and do lots of point gets
• This would suck - each query would be 100s of random IOs on disk
• Need to use wide rows: the range query becomes a column slice, so each query is ~1 IO - denormalisation
So instead of this...
Key #Mentions
[01/05/11 00:01, acunu] 3
[01/05/11 00:02, acunu] 5
... ...
We do this...
Key 00:01 00:02 ...
[01/05/11, acunu] 3 5 ...
[02/05/11, acunu] 12 4 ...
... ... ... ...
Row key is ‘big’ time bucket
Column key is ‘small’ time bucket
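The wide-row layout above can be sketched as a dict of dicts (standing in for Cassandra's rows of counter columns): the day is the row key, the minute is the column key, and reading one day's counts is a single row read — i.e. one column slice:

```python
from datetime import datetime
from collections import defaultdict

# rows[row_key][column_key] -> count; a dict-of-dicts stands in for Cassandra.
rows = defaultdict(lambda: defaultdict(int))

def increment(when, term):
    row_key = (when.strftime("%d/%m/%y"), term)   # 'big' bucket: day
    col_key = when.strftime("%H:%M")              # 'small' bucket: minute
    rows[row_key][col_key] += 1

def query_day(day, term):
    """One row read returns every minute's count for that day: a single column slice."""
    return dict(rows[(day, term)])

increment(datetime(2011, 5, 1, 0, 1), "acunu")
increment(datetime(2011, 5, 1, 0, 1), "acunu")
increment(datetime(2011, 5, 1, 0, 2), "acunu")
```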
Demo
./painbird.py -u tom_wilkie
Now it’s your turn...
1. Get a twitter account - http://twitter.com
2. Get some Cassandra VMs - http://goo.gl/O9hkv
3. Cluster them up
4. Get the code - http://goo.gl
5. Implement the missing bits!
6. (Prizes for the ones that spot bugs!)
Cluster them up
• SSH in, set password (on both!)
• Check you can connect to the UI
• Use UI (click add host)
Get the code
SSH into one of the VMs:
# curl https://acunu-oss.s3.amazonaws.com/painbird.tar.gz | tar zxf -
# curl -o pycassa.rpm https://acunu-oss.s3.amazonaws.com/pycassa.rpm
# rpm -i pycassa.rpm
# cd release
# ./painbird.py -u tom_wilkie
Implement the “core”
• In core.py
• def insert_tweet(cassandra, tweet):
• def do_query(cassandra, term, start, finish):
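One way the two stubs could be filled in (a sketch, not the prize-winning answer): `insert_tweet` bumps a counter in the (day, token) row's minute column, and `do_query` does one column slice per day and sums it. The `cassandra` object is assumed to expose a pycassa-style counter API — `add(row_key, column)` to increment, `get(row_key)` returning `{column: count}`; the tweet fields and the tiny in-memory fake below are illustrative assumptions so the sketch runs without a cluster:

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

def insert_tweet(cassandra, tweet):
    """Sketch: tokenise, then bump one counter per (day, token) wide row,
    in the minute's counter column. Assumes pycassa-style add(row_key, column)."""
    day = tweet["created_at"].strftime("%d/%m/%y")
    minute = tweet["created_at"].strftime("%H:%M")
    for token in re.findall(r"[a-z0-9]+", tweet["text"].lower()):
        cassandra.add((day, token), minute)

def do_query(cassandra, term, start, finish):
    """Sketch: one column slice per day row, summed into a daily total.
    Assumes get(row_key) returns {column: count} for one wide row."""
    results, day = {}, start
    while day <= finish:
        key = day.strftime("%d/%m/%y")
        results[key] = sum(cassandra.get((key, term)).values())
        day += timedelta(days=1)
    return results

# Tiny in-memory stand-in so the sketch runs without a Cassandra cluster.
class FakeCassandra:
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(int))
    def add(self, row_key, column):
        self.rows[row_key][column] += 1
    def get(self, row_key):
        return dict(self.rows[row_key])

c = FakeCassandra()
insert_tweet(c, {"created_at": datetime(2011, 5, 1, 0, 1), "text": "Man, @acunu rocks!"})
totals = do_query(c, "acunu", datetime(2011, 5, 1), datetime(2011, 5, 1))
```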
Check your data
-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)
Extensions
UI
• Pretty graphs
• Automatically periodically update
• Search multiple terms
Painbird
• mentions of multiple terms
• sentiment analysis - http://www.nltk.org/
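For the sentiment-analysis extension, a minimal lexicon-based scorer shows the shape of the idea (NLTK provides far richer tools; the word lists here are illustrative placeholders, not a real lexicon):

```python
# Toy positive/negative word lists - placeholders, not a real sentiment lexicon.
POSITIVE = {"rocks", "love", "great", "awesome"}
NEGATIVE = {"woe", "hate", "sucks", "terrible"}

def sentiment(tweet):
    """Net sentiment: (# positive words) - (# negative words)."""
    words = [w.strip("#@!.,;") for w in tweet.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sentiment("Man, @acunu rocks!")   # positive
```

The per-tweet score could be accumulated in the same time-bucketed counters as the mention counts, giving sentiment-over-time for free.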