Data modelling workshop Richard Low [email protected] @richardalow Wednesday, 28 March 2012

Cassandra EU 2012 - Data modelling workshop by Richard Low

DESCRIPTION

A workshop on data modelling in Cassandra. Works through a messaging application example.


Page 1: Cassandra EU 2012 - Data modelling workshop by Richard Low

Data modelling workshop

Richard Low

[email protected] @richardalow

Wednesday, 28 March 2012

Page 2: Cassandra EU 2012 - Data modelling workshop by Richard Low

Outline

• What is data modelling?

• What do I need to know to come up with a model?

• Options and available tools

• Denormalisation

• Example and demo: scalable messaging application


Page 3: Cassandra EU 2012 - Data modelling workshop by Richard Low

What is data modelling?


Page 4: Cassandra EU 2012 - Data modelling workshop by Richard Low

Data modelling

• How you organise your data

• Store all in one big value?

• Store as columns in one row or lots of rows?

• Use counters?

• Can I avoid read-modify-write?


Page 5: Cassandra EU 2012 - Data modelling workshop by Richard Low

Why care about it?

• Performance

• Ensure good load balancing

• Disk usage

• Future proofing


Page 6: Cassandra EU 2012 - Data modelling workshop by Richard Low

Performance

• Bad data model: do read-modify-write on large column

• Good data model: just overwrite updated data

• Difference? Could be 100 ops/s vs. 100k ops/s

1000x improvement
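The gap between the two models can be illustrated with a toy in-memory store. This is a minimal sketch, not Cassandra code: `CountingStore`, `append_read_modify_write` and `append_overwrite` are hypothetical names, and the store only counts operations and bytes moved rather than measuring real throughput.

```python
import json


class CountingStore:
    """Toy key-value store that counts reads, writes and bytes moved."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0
        self.bytes_moved = 0

    def get(self, key):
        self.reads += 1
        value = self.data.get(key, "")
        self.bytes_moved += len(value)
        return value

    def put(self, key, value):
        self.writes += 1
        self.bytes_moved += len(value)
        self.data[key] = value


def append_read_modify_write(store, user, message):
    """Bad model: the whole inbox is one large value that is read, changed and rewritten."""
    raw = store.get(user)
    inbox = json.loads(raw) if raw else []
    inbox.append(message)
    store.put(user, json.dumps(inbox))


def append_overwrite(store, user, msg_id, message):
    """Good model: each message is its own small value, written without a read."""
    store.put(f"{user}:{msg_id}", message)


rmw, blind = CountingStore(), CountingStore()
for i in range(1000):
    append_read_modify_write(rmw, "alice", f"message {i}")
    append_overwrite(blind, "alice", i, f"message {i}")

print("read-modify-write:", rmw.reads, "reads,", rmw.bytes_moved, "bytes moved")
print("overwrite only:   ", blind.reads, "reads,", blind.bytes_moved, "bytes moved")
```

The read-modify-write path moves the whole growing value on every insert, so its cost grows with the size of the inbox, while the overwrite path stays constant per message.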


Page 7: Cassandra EU 2012 - Data modelling workshop by Richard Low

Performance

• Cacheability

• Ensure your cache isn’t polluted by uncacheable things

• Cached reads are ~100x faster than uncached


Page 8: Cassandra EU 2012 - Data modelling workshop by Richard Low

What do you need?


Page 9: Cassandra EU 2012 - Data modelling workshop by Richard Low

Optimise for queries

• Data model design starts with queries

• What are the common queries?


Page 10: Cassandra EU 2012 - Data modelling workshop by Richard Low

Workload

• How many inserts?

• How many reads?

• Do inserts depend on current data?

• Is data write-once?


Page 11: Cassandra EU 2012 - Data modelling workshop by Richard Low

Sizes

• How big are the values?

• Are some ‘users’ bigger than others?

• How cacheable is your data?


Page 12: Cassandra EU 2012 - Data modelling workshop by Richard Low

How do I get this?

• Back of the envelope calculation

• Monitor existing solution

• Prototype a solution


Page 13: Cassandra EU 2012 - Data modelling workshop by Richard Low

Options and tools


Page 14: Cassandra EU 2012 - Data modelling workshop by Richard Low

Keyspaces and Column Families

SQL         Cassandra
Database    Keyspace
Table       Column Family

[Diagram: a column family drawn as rows, each with a row key followed by its columns (row/key, col_1, col_2)]


Page 15: Cassandra EU 2012 - Data modelling workshop by Richard Low

Options and tools

• Rows

• Columns

• Supercolumns

• Composite columns


Page 16: Cassandra EU 2012 - Data modelling workshop by Richard Low

Rows and columns

[Figure: sparse grid of rows row1 to row7 against columns col1 to col7. Each row has only some columns set: row1 has 3, row2 has 5, row3 has 5, row4 has 4, row5 has 4, row6 has 1 and row7 has 3.]


Page 17: Cassandra EU 2012 - Data modelling workshop by Richard Low

Column options

• Regular columns

• Super columns: columns within columns

• Composite columns: multi-dimensional column names


Page 18: Cassandra EU 2012 - Data modelling workshop by Richard Low

Composite columns

alice: {
  m2: { Sender: bob, Subject: ‘paper!’, ... }
}

bob: {
  m1: { Sender: alice, Subject: ‘rock?’, ... }
}

charlie: {
  m1: { Sender: alice, Subject: ‘rock?’, ... },
  m2: { Sender: bob, Subject: ‘paper!’, ... }
}
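One way to read "multi-dimensional column names" is that each column name is a tuple such as (message id, field), so the nested view above flattens into a single sorted run of columns per row. The sketch below uses plain Python dicts and tuples as a stand-in; `to_composite` is a hypothetical helper, and the fields elided with "..." on the slide are left out.

```python
# Nested view, as on the slide: row key -> message id -> field -> value.
nested = {
    "alice": {"m2": {"Sender": "bob", "Subject": "paper!"}},
    "bob": {"m1": {"Sender": "alice", "Subject": "rock?"}},
    "charlie": {
        "m1": {"Sender": "alice", "Subject": "rock?"},
        "m2": {"Sender": "bob", "Subject": "paper!"},
    },
}


def to_composite(row):
    """Flatten {msg_id: {field: value}} into sorted ((msg_id, field), value) pairs."""
    flat = {
        (msg_id, field): value
        for msg_id, fields in row.items()
        for field, value in fields.items()
    }
    # Sorting tuples orders by message id first, then by field name,
    # so all the fields of one message sit next to each other.
    return sorted(flat.items())


charlie = to_composite(nested["charlie"])
for name, value in charlie:
    print(name, "=", value)
```

Because the message id is the first component, a slice over one id prefix returns every field of that message.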


Page 19: Cassandra EU 2012 - Data modelling workshop by Richard Low

Tools

• Counters: atomic inc and dec

• Expiring columns: TTL

• Secondary indexes: your WHERE clause


Page 20: Cassandra EU 2012 - Data modelling workshop by Richard Low

Rows vs columns

• Row key is the shard key

• Need lots of rows for scalability

• Don’t be afraid of large-ish rows

• But don’t make them too big

• Avoid range queries across rows, but use them within rows


Page 21: Cassandra EU 2012 - Data modelling workshop by Richard Low

Range queries

• Within a row:

SELECT col3..col5 FROM Standard1 WHERE KEY=row1

row1 col1 col2 col5 col6 col8


Page 22: Cassandra EU 2012 - Data modelling workshop by Richard Low

Range queries

• Across rows:

SELECT * FROM table WHERE key > row2 LIMIT 2


Page 23: Cassandra EU 2012 - Data modelling workshop by Richard Low

Range queries

row4

row2

row1

row3

SELECT * FROM table WHERE key > row2 LIMIT 2

> row2, row1


Page 24: Cassandra EU 2012 - Data modelling workshop by Richard Low

Range queries

• Range queries within rows (‘get_slice’) are fine

• Avoid range queries across rows (‘get_range_slices’)
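The difference between the two kinds of range query can be sketched in plain Python. This is an illustration under assumptions, not Cassandra's implementation: `slice_columns` and `token` are hypothetical helpers, and rows are assumed to be placed by an MD5 hash of the row key, which the slides do not state.

```python
import bisect
import hashlib


def slice_columns(row, start, finish):
    """Return columns whose names fall in [start, finish], as a slice within one row."""
    names = sorted(row)  # columns within a row are kept sorted by name
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return [(name, row[name]) for name in names[lo:hi]]


# The row from the earlier slide: col3 and col4 do not exist.
row1 = {"col1": "a", "col2": "b", "col5": "c", "col6": "d", "col8": "e"}
print(slice_columns(row1, "col3", "col5"))


def token(key):
    """Assumed placement rule: hash the row key to get its position."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


keys = ["row1", "row2", "row3", "row4"]
ring_order = sorted(keys, key=token)
print(ring_order)  # hash order, generally unrelated to key order
```

Within a row the slice is one contiguous, ordered run. Across rows, hashing scatters neighbouring keys, so a range over keys does not come back in key order and may touch many nodes.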


Page 25: Cassandra EU 2012 - Data modelling workshop by Richard Low

Batching

• Overhead on each call

• Batch together inserts, better if in the same row

• Reduce read ops, use large get_slice reads
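A rough sketch of the batching idea: collect individual column inserts and group them by row key, so that each row is written in one call. `group_by_row` is a hypothetical helper working on plain tuples; a real client library would provide its own batch interface.

```python
from collections import defaultdict


def group_by_row(inserts):
    """Group (row_key, column, value) inserts into one mutation dict per row."""
    batches = defaultdict(dict)
    for row_key, column, value in inserts:
        batches[row_key][column] = value
    return dict(batches)


inserts = [
    ("bob", ("m1", "Sender"), "alice"),
    ("bob", ("m1", "Subject"), "rock?"),
    ("charlie", ("m1", "Sender"), "alice"),
    ("charlie", ("m1", "Subject"), "rock?"),
]

batches = group_by_row(inserts)
print(len(inserts), "single inserts become", len(batches), "row batches")
```

Four separate calls become two, and each batch touches a single row.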


Page 26: Cassandra EU 2012 - Data modelling workshop by Richard Low

Denormalisation


Page 27: Cassandra EU 2012 - Data modelling workshop by Richard Low

Denormalisation

• Hard drive performance constraints:

• Sequential IO at 100s MB/s

• Seek at 100 IO/s

• Avoid random IO
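A quick calculation shows why random IO dominates. The seek rate is the figure from the slide; the sequential rate is assumed to be 100 MB/s (the low end of "100s MB/s"), and the 10 items of 1 KB each are an illustrative workload.

```python
SEEKS_PER_SECOND = 100                # from the slide
SEQUENTIAL_BYTES_PER_SECOND = 100e6   # assumption: 100 MB/s


def random_read_seconds(n_items):
    """Each item is stored somewhere else, so each read costs one seek."""
    return n_items / SEEKS_PER_SECOND


def sequential_read_seconds(n_items, item_bytes):
    """One seek to reach the data, then a single sequential transfer."""
    transfer = n_items * item_bytes / SEQUENTIAL_BYTES_PER_SECOND
    return 1 / SEEKS_PER_SECOND + transfer


scattered = random_read_seconds(10)
contiguous = sequential_read_seconds(10, 1000)
print(f"10 scattered items:  {scattered * 1000:.1f} ms")
print(f"10 contiguous items: {contiguous * 1000:.1f} ms")
```

With these assumptions the transfer time is negligible next to a single seek, so the cost is set almost entirely by how many seeks the layout forces.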


Page 28: Cassandra EU 2012 - Data modelling workshop by Richard Low

Denormalisation

• Store columns accessed at similar times near to each other

• => put them in the same row

• Involves copying

• Copying isn’t bad - pre flood prices <$100 per TB


Page 29: Cassandra EU 2012 - Data modelling workshop by Richard Low

Messaging Application

Page 30: Cassandra EU 2012 - Data modelling workshop by Richard Low

Messaging application

• Users can send messages to other users

• Horizontally scalable

• Expect users to send to lots of recipients


Page 31: Cassandra EU 2012 - Data modelling workshop by Richard Low

Messaging

• In an RDBMS we might have tables for:

• Users

• Messages (sender is unique)

• Mappings, Message → Receiver


Page 32: Cassandra EU 2012 - Data modelling workshop by Richard Low

A relational model

Example relational DB model

Users: Id, username

Messages: Id, Subject, Content, Date, Sender_Id

Msg_Receipt: Id, Message_Id, User_Id, Is_read

[Diagram: lines link the three tables; the labels extracted from them are 1, 1, 1]


Page 33: Cassandra EU 2012 - Data modelling workshop by Richard Low

Querying

Most recent 10 messages sent by a user:

SELECT * FROM Messages
WHERE Messages.Sender_Id = <id>
ORDER BY Messages.Date DESC
LIMIT 10;

Most recent 10 messages received by a user:

SELECT Messages.*
FROM Messages, Msg_Receipt
WHERE Msg_Receipt.User_Id = <id>
AND Msg_Receipt.Message_Id = Messages.Id
ORDER BY Messages.Date DESC
LIMIT 10;
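The two queries can be tried against an in-memory SQLite database. This is a sketch for experimentation only: the column types, the sample users, the timestamps and the helper names `sent_by` and `received_by` are assumptions, and only the table and column names come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (Id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE Messages (
    Id INTEGER PRIMARY KEY, Subject TEXT, Content TEXT, Date TEXT,
    Sender_Id INTEGER REFERENCES Users(Id));
CREATE TABLE Msg_Receipt (
    Id INTEGER PRIMARY KEY,
    Message_Id INTEGER REFERENCES Messages(Id),
    User_Id INTEGER REFERENCES Users(Id),
    Is_read INTEGER DEFAULT 0);
-- Sample data (illustrative): alice=1, bob=2, charlie=3.
INSERT INTO Users VALUES (1, 'alice'), (2, 'bob'), (3, 'charlie');
INSERT INTO Messages VALUES
    (1, 'rock?', '', '2012-03-28T10:00', 1),
    (2, 'paper!', '', '2012-03-28T10:05', 2);
INSERT INTO Msg_Receipt (Message_Id, User_Id) VALUES (1, 2), (1, 3), (2, 1), (2, 3);
""")


def sent_by(user_id):
    """Most recent 10 messages sent by a user (subjects only)."""
    return conn.execute(
        "SELECT Subject FROM Messages WHERE Sender_Id = ? "
        "ORDER BY Date DESC LIMIT 10", (user_id,)).fetchall()


def received_by(user_id):
    """Most recent 10 messages received by a user; needs a join via Msg_Receipt."""
    return conn.execute(
        "SELECT Messages.Subject FROM Messages, Msg_Receipt "
        "WHERE Msg_Receipt.User_Id = ? AND Msg_Receipt.Message_Id = Messages.Id "
        "ORDER BY Messages.Date DESC LIMIT 10", (user_id,)).fetchall()


print(sent_by(1))
print(received_by(3))
```

The received-messages query has to look up each matching receipt and then fetch the corresponding message row, which is where the extra seeks in a normalised layout come from.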


Page 34: Cassandra EU 2012 - Data modelling workshop by Richard Low

Under the hood

Msg_Receipt

id    msg_id    user_id
0     0         0
1     3         1
2     4         2
3     6000      0

Messages

id      subject    ...
0       a
1       b
2       c
3       d
4       e
...
6000    x


Page 35: Cassandra EU 2012 - Data modelling workshop by Richard Low

Under the hood

• Normalisation => seeks

• So denormalise

• Hit capacity limit of one node quickly


Page 36: Cassandra EU 2012 - Data modelling workshop by Richard Low

Back of the envelope...

• 1 M users

• Message size 1 KB

• Each user has 5000 messages

• => 5 TB data


Page 37: Cassandra EU 2012 - Data modelling workshop by Richard Low

Back of the envelope...

• Reading 10 messages => 10 seeks

• If 10k active at once, need 100k seeks/s

• => need 1000 disks

• With 8 disks per node, RF 3, that’s 375 nodes
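The estimate can be reproduced step by step. The figures are those on the slides, with 1 KB taken as 1000 bytes and 100 seeks per second per disk taken from the hard drive slide; the calculation also assumes each active user issues one 10-message read per second, which the slides imply but do not state.

```python
import math

# Data volume
users = 1_000_000
messages_per_user = 5000
message_bytes = 1000  # 1 KB
total_bytes = users * messages_per_user * message_bytes
print(total_bytes / 1e12, "TB of data")

# Seek budget for the normalised layout
active_users = 10_000
seeks_per_read = 10        # one seek per message
seeks_per_disk = 100       # seeks per second a single disk can serve
disks_per_node = 8
replication_factor = 3

seeks_needed = active_users * seeks_per_read        # seeks per second
disks = seeks_needed / seeks_per_disk
nodes = math.ceil(disks / disks_per_node * replication_factor)
print(seeks_needed, "seeks/s ->", int(disks), "disks ->", nodes, "nodes")
```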


Page 38: Cassandra EU 2012 - Data modelling workshop by Richard Low

Back of the envelope...

• Denormalise: messages are immutable

• Insert them into everyone’s inbox

• Reading 10 messages is one seek

• Paging is sequential

• => 10x fewer nodes: 38 nodes now!
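The same estimate can be written as a function of seeks per read, to compare the normalised and denormalised layouts. `nodes_needed` is a hypothetical helper; its defaults are the figures from the slides, and it again assumes one read per active user per second.

```python
import math


def nodes_needed(seeks_per_read, active_users=10_000, seeks_per_disk=100,
                 disks_per_node=8, replication_factor=3):
    """Nodes required to serve one read per active user per second."""
    disks = active_users * seeks_per_read / seeks_per_disk
    return math.ceil(disks / disks_per_node * replication_factor)


normalised = nodes_needed(seeks_per_read=10)    # one seek per message
denormalised = nodes_needed(seeks_per_read=1)   # whole inbox page in one seek
print(normalised, "nodes vs", denormalised, "nodes")
```

The denormalised figure is 37.5 before rounding, which is where the slide's 38 nodes comes from.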


Page 39: Cassandra EU 2012 - Data modelling workshop by Richard Low

In Cassandra

• Use a row per user

• Composite columns, with TimeUUID as ID

• Gives time ordering on messages

• Inserts go to all recipients
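The layout above can be sketched with an in-memory dict standing in for the column family. This is not Cassandra client code: `inboxes`, `send` and `latest_subjects` are hypothetical names, and Python's `uuid.uuid1()` is used as a stand-in for a TimeUUID because it embeds a timestamp.

```python
import uuid
from collections import defaultdict

# Row key (username) -> {(msg_id, field): value}, i.e. composite column names.
inboxes = defaultdict(dict)


def send(sender, recipients, subject):
    """Write the message into every recipient's row under a time-based id."""
    msg_id = uuid.uuid1()
    for user in recipients:
        inboxes[user][(msg_id, "sender")] = sender
        inboxes[user][(msg_id, "subject")] = subject
    return msg_id


def latest_subjects(user, limit=10):
    """Read one row and return subjects, newest first, using the id's timestamp."""
    row = inboxes[user]
    ids = sorted({msg_id for msg_id, _ in row},
                 key=lambda u: (u.time, u.clock_seq), reverse=True)
    return [row[(msg_id, "subject")] for msg_id in ids[:limit]]


send("alice", ["bob", "charlie"], "rock?")
send("bob", ["alice", "charlie"], "paper!")
print(latest_subjects("charlie"))
```

A write costs one insert per recipient, but reading a user's latest messages touches only that user's row.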


Page 40: Cassandra EU 2012 - Data modelling workshop by Richard Low

Messaging example

From: alice
To: bob, charlie
Subject: rock?

            m1
alice
bob         sender: alice, subject: rock?
charlie     sender: alice, subject: rock?


Page 41: Cassandra EU 2012 - Data modelling workshop by Richard Low

Messaging example

From: bob
To: alice, charlie
Subject: paper!

            m1                               m2
alice                                        sender: bob, subject: paper!
bob         sender: alice, subject: rock?
charlie     sender: alice, subject: rock?    sender: bob, subject: paper!


Page 42: Cassandra EU 2012 - Data modelling workshop by Richard Low

Data

alice: {
  m2: { Sender: bob, Subject: ‘paper!’, ... }
}

bob: {
  m1: { Sender: alice, Subject: ‘rock?’, ... }
}

charlie: {
  m1: { Sender: alice, Subject: ‘rock?’, ... },
  m2: { Sender: bob, Subject: ‘paper!’, ... }
}


Page 43: Cassandra EU 2012 - Data modelling workshop by Richard Low

Demo

• Pycassa

• Send message

• List messages

• Unread count
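The demo itself is not reproduced in the slides. As a stand-in, the three operations can be sketched in memory without Pycassa or a running cluster. Everything here is hypothetical: the `Mailbox` class, its method names, and `mark_read`, which is added only so the unread count can go down. A `Counter` plays the role of a Cassandra counter column.

```python
import uuid
from collections import Counter, defaultdict


class Mailbox:
    """In-memory stand-in for the demo: one row per user plus an unread counter."""

    def __init__(self):
        self.rows = defaultdict(dict)   # username -> {msg_id: fields}
        self.unread = Counter()         # username -> unread count

    def send_message(self, sender, recipients, subject):
        """Insert the message into each recipient's row and bump their counter."""
        msg_id = uuid.uuid1()  # time-based id, standing in for a TimeUUID
        for user in recipients:
            self.rows[user][msg_id] = {"sender": sender, "subject": subject,
                                       "read": False}
            self.unread[user] += 1
        return msg_id

    def list_messages(self, user, limit=10):
        """Return the newest messages first, from a single row."""
        row = self.rows.get(user, {})
        ids = sorted(row, key=lambda u: (u.time, u.clock_seq), reverse=True)
        return [row[msg_id] for msg_id in ids[:limit]]

    def mark_read(self, user, msg_id):
        """Hypothetical helper: flag a message as read and decrement the counter."""
        message = self.rows[user][msg_id]
        if not message["read"]:
            message["read"] = True
            self.unread[user] -= 1


box = Mailbox()
m1 = box.send_message("alice", ["bob", "charlie"], "rock?")
m2 = box.send_message("bob", ["alice", "charlie"], "paper!")
box.mark_read("charlie", m1)
print([m["subject"] for m in box.list_messages("charlie")])
print(box.unread["charlie"])
```

Keeping the unread count as a counter avoids scanning the inbox row to count unread messages on every page load.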
