Upload
acunu
View
3.928
Download
0
Embed Size (px)
DESCRIPTION
A workshop on data modelling in Cassandra. Works through a messaging application example.
Citation preview
Outline
• What is data modelling?
• What do I need to know to come up with a model?
• Options and available tools
• Denormalisation
• Example and demo: scalable messaging application
Wednesday, 28 March 2012
What is data modelling?
Wednesday, 28 March 2012
Data modelling
• How you organise your data
• Store all in one big value?
• Store as columns in one row or lots of rows?
• Use counters?
• Can I avoid read-modify-write?
Wednesday, 28 March 2012
Why care about it?
• Performance
• Ensure good load balancing
• Disk usage
• Future proofing
Wednesday, 28 March 2012
Performance
• Bad data model: do read-modify-write on large column
• Good data model: just overwrite updated data
• Difference? Could be 100 ops/s vs. 100k ops/s
1000x improvement
Wednesday, 28 March 2012
Performance
• Cacheability
• Ensure your cache isn’t polluted by uncacheable things
• Cached reads are ~100x faster than uncached
Wednesday, 28 March 2012
What do you need?
Wednesday, 28 March 2012
Optimise for queries
• Data model design starts with queries
• What are the common queries?
Wednesday, 28 March 2012
Workload
• How many inserts?
• How many reads?
• Do inserts depend on current data?
• Is data write-once?
Wednesday, 28 March 2012
Sizes• How big are the values?
• Are some ‘users’ bigger than others?
• How cacheable is your data?
Wednesday, 28 March 2012
How do I get this?
• Back of the envelope calculation
• Monitor existing solution
• Prototype a solution
Wednesday, 28 March 2012
Options and tools
Wednesday, 28 March 2012
Keyspaces and Column Families
SQL Cassandra
Keyspace
Column Family
Database
Tablerow/ col_1 col_1
row/key col_1 col_1row/key col_1 col_2
Wednesday, 28 March 2012
Options and tools
• Rows
• Columns
• Supercolumns
• Composite columns
Wednesday, 28 March 2012
Rows and columns
col1 col2 col3 col4 col5 col6 col7row1 x x xrow2 x x x x xrow3 x x x x xrow4 x x x xrow5 x x x xrow6 xrow7 x x x
Wednesday, 28 March 2012
Column options
• Regular columns
• Super columns: columns within columns
• Composite columns: multi-dimensional column names
Wednesday, 28 March 2012
Composite columnsalice: { m2: { Sender: bob, Subject: ‘paper!’, ... }}
bob: { m1: { Sender: alice, Subject: ‘rock?’, ... }}
charlie: { m1: { Sender: alice, Subject: ‘rock?’, ... }, m2: { Sender: bob, Subject: ‘paper!’, ... }}
Wednesday, 28 March 2012
Tools
• Counters: atomic inc and dec
• Expiring columns: TTL
• Secondary indexes: your WHERE clause
Wednesday, 28 March 2012
Rows vs columns
• Row key is the shard key
• Need lots of rows for scalability
• Don’t be afraid of large-ish rows
• But don’t make them too big
• Avoid range queries across rows, but use them within rows
Wednesday, 28 March 2012
Range queries
• Within a row:
SELECT col3..col5 FROM Standard1 WHERE KEY=row1
row1 col1 col2 col5 col6 col8
Wednesday, 28 March 2012
Range queries• Across rows:
SELECT * FROM table WHERE key > row2 LIMIT 2
Wednesday, 28 March 2012
Range queries
row4
row2
row1row3
SELECT * FROM table WHERE key > row2 LIMIT 2
> row2, row1
Wednesday, 28 March 2012
Range queries
• Range queries within rows ‘get_slice’ are fine
• Avoid range queries across rows ‘get_range_slices’
Wednesday, 28 March 2012
Batching• Overhead on each call
• Batch together inserts, better if in the same row
• Reduce read ops, use large get_slice reads
Wednesday, 28 March 2012
Denormalisation
Wednesday, 28 March 2012
• Hard drive performance constraints:
• Sequential IO at 100s MB/s
• Seek at 100 IO/s
• Avoid random IO
Denormalisation
Wednesday, 28 March 2012
Denormalisation
• Store columns accessed at similar times near to each other
• => put them in the same row
• Involves copying
• Copying isn’t bad - pre flood prices <$100 per TB
Wednesday, 28 March 2012
Messaging ApplicationWednesday, 28 March 2012
Messaging application
• Users can send messages to other users
• Horizontally scalable
• Expect users to send to lots of recipients
Wednesday, 28 March 2012
• In an RDBMS we might have a table for :
• Users
• Messages (sender is unique)
• Mappings, Message → Receiver
Messaging
Wednesday, 28 March 2012
A relational model
Users
username
IdMessages
Subject
Content
Date
Sender_Id
Id1
1
1
∞
Example Relational DB model
∞
Msg_ReceiptId
Message_Id
User_Id
Is_read
∞
Wednesday, 28 March 2012
Querying
SELECT * FROM MessagesWHERE Messages.Sender_Id = <id>ORDER BY Messages.Date DESCLIMIT 10;
Most recent 10 messages sent by a user :
Most recent 10 messages received by a user :SELECT Messages.*
FROM Messages, Msg_ReceiptWHERE Msg_Receipt.User_Id = <id>AND Msg_Receipt.Message_Id = Messages.IdORDER BY Messages.Date DESCLIMIT 10;
Wednesday, 28 March 2012
Under the hood
id msg_id user_id
0 0 0
1 3 1
2 4 2
3 6000 0
Msg_Receipt
id subject ...
0 a
1 b
2 c
3 d
4 e
...
6000 x
Messages
Wednesday, 28 March 2012
Under the hood
• Normalisation => seeks
• So denormalise
• Hit capacity limit of one node quickly
Wednesday, 28 March 2012
Back of the envelope...
• 1 M users
• Message size 1 KB
• Each user has 5000 messages
• => 5 TB data
Wednesday, 28 March 2012
Back of the envelope...
• Reading 10 messages => 10 seeks
• If 10k active at once, need 100k seeks/s
• => need 1000 disks
• With 8 disks per node, RF 3, that’s 375 nodes
Wednesday, 28 March 2012
• Denormalize: messages are immutable
• Insert them into everyone’s inbox
• Read 10 messages is one seek
• Paging is sequential
• => 10x fewer nodes: 38 nodes now!
Back of the envelope...
Wednesday, 28 March 2012
In Cassandra
• Use a row per user
• Composite columns, with TimeUUID as ID
• Gives time ordering on messages
• Inserts go to all recipients
Wednesday, 28 March 2012
Messaging exampleFrom: aliceTo: bob, charlieSubject: rock?
m1
alice
bob
charlie
sender subject
alice rock?sender subject
alice rock?
Wednesday, 28 March 2012
Messaging exampleFrom: bobTo: alice, charlieSubject: paper!
m1 m2
alice
bob
charlie
sender subject
alice rock?
sender subject
alice rock?
sender subject
bob paper!
sender subject
bob paper!
Wednesday, 28 March 2012
Dataalice: { m2: { Sender: bob, Subject: ‘paper!’, ... }}
bob: { m1: { Sender: alice, Subject: ‘rock?’, ... }}
charlie: { m1: { Sender: alice, Subject: ‘rock?’, ... }, m2: { Sender: bob, Subject: ‘paper!’, ... }}
Wednesday, 28 March 2012
Demo
• Pycassa
• Send message
• List messages
• Unread count
Wednesday, 28 March 2012