120
Introduction to Cassandra Wellington Python User Group Aaron Morton @aaronmorton 1/12/2010

Nzpug welly-cassandra-02-12-2010

Embed Size (px)

Citation preview

Page 1: Nzpug welly-cassandra-02-12-2010

Introduction to Cassandra

Wellington Python User GroupAaron Morton @aaronmorton

1/12/2010

Page 2: Nzpug welly-cassandra-02-12-2010

Disclaimer.This is an introduction not

a reference.

Page 3: Nzpug welly-cassandra-02-12-2010

I may, from time to time and for the best possible

reasons, bullshit you.

Page 4: Nzpug welly-cassandra-02-12-2010

What do you already know about Cassandra?

Page 5: Nzpug welly-cassandra-02-12-2010

Get ready.

Page 6: Nzpug welly-cassandra-02-12-2010

The next slide has a lot on it.

Page 7: Nzpug welly-cassandra-02-12-2010

Cassandra is a distributed, fault tolerant, scalable, column oriented data

store.

Page 8: Nzpug welly-cassandra-02-12-2010

A word about “column oriented”.

Page 9: Nzpug welly-cassandra-02-12-2010

Relax.

Page 10: Nzpug welly-cassandra-02-12-2010

It’s different to a row oriented DB like MySQL.

So...

Page 11: Nzpug welly-cassandra-02-12-2010
Page 12: Nzpug welly-cassandra-02-12-2010

For now, think about keys and values. Where each

value is a hash / dict.

Page 13: Nzpug welly-cassandra-02-12-2010

{‘foo’ : {‘bar’ : ‘baz’,},}

{key : {col_name : col_value,},}

Page 14: Nzpug welly-cassandra-02-12-2010

Cassandra’s data model and on disk storage are based on the Google Bigtable

paper from 2006.

Page 15: Nzpug welly-cassandra-02-12-2010

The distributed cluster design is based on the

Amazon Dynamo paper from 2007.

Page 16: Nzpug welly-cassandra-02-12-2010

Easy.Lets store ‘foo’ somewhere.

Page 17: Nzpug welly-cassandra-02-12-2010

'foo'

Page 18: Nzpug welly-cassandra-02-12-2010

But I want to be able to read it back if one machine

fails.

Page 19: Nzpug welly-cassandra-02-12-2010

Lets distribute it on 3 of the 5 nodes I have.

Page 20: Nzpug welly-cassandra-02-12-2010

This is the Replication Factor.

Called RF or N.

Page 21: Nzpug welly-cassandra-02-12-2010

Each node has a token that identifies the upper value of

the key range it is responsible for.

Page 22: Nzpug welly-cassandra-02-12-2010

#1<= E

#2<= J

#3<= O

#4<= T

#5<= Z

Page 23: Nzpug welly-cassandra-02-12-2010

Client connects to a random node and asks it to coordinate storing the ‘foo’

key.

Page 24: Nzpug welly-cassandra-02-12-2010

Each node knows about all other nodes in the cluster,

including their tokens.

Page 25: Nzpug welly-cassandra-02-12-2010

This is achieved using a Gossip protocol. Every

second each node shares it’s full view of the cluster with 1 to 3 other nodes.

Page 26: Nzpug welly-cassandra-02-12-2010

Our coordinator is node 5. It knows node 2 is

responsible for the ‘foo’ key.

Page 27: Nzpug welly-cassandra-02-12-2010

#1<= E

#2'foo'

#3<= O

#4<= T

#5<= Z

Client

Page 28: Nzpug welly-cassandra-02-12-2010

But there is a problem...

Page 29: Nzpug welly-cassandra-02-12-2010

What if we have lots of values between F and J?

Page 30: Nzpug welly-cassandra-02-12-2010

We end up with a “hot” section in our ring of

nodes.

Page 31: Nzpug welly-cassandra-02-12-2010

That’s bad mmmkay?

Page 32: Nzpug welly-cassandra-02-12-2010

You shouldn't have a hot section in your ring.

mmmkay?

Page 33: Nzpug welly-cassandra-02-12-2010

A Partitioner is used to apply a transform to the

key. The transformed values are also used to define the

range for a node.

Page 34: Nzpug welly-cassandra-02-12-2010

The Random Partitioner applies a MD5 transform. The range of all possible

keys values is changed to a 128 bit number.

Page 35: Nzpug welly-cassandra-02-12-2010

There are other Partitioners, such as the

Order Preserving Partitioner. But start with the Random Partitioner.

Page 36: Nzpug welly-cassandra-02-12-2010

Let’s pretend all keys are now transformed to an

integer between 0 and 9.

Page 37: Nzpug welly-cassandra-02-12-2010

Our 5 node cluster now looks like.

Page 38: Nzpug welly-cassandra-02-12-2010

#1<= 2

#2<= 4

#3<= 6

#4<= 8

#5<= 0

Page 39: Nzpug welly-cassandra-02-12-2010

Pretend our ‘foo’ key transforms to 3.

Page 40: Nzpug welly-cassandra-02-12-2010

#1<= 2

#2"3"

#3<= 6

#4<= 8

#5<= 0

Client

Page 41: Nzpug welly-cassandra-02-12-2010

Good start.

Page 42: Nzpug welly-cassandra-02-12-2010

But where are the replicas?We want to replicate the

‘foo’ key 3 times.

Page 43: Nzpug welly-cassandra-02-12-2010

A Replication Strategy is used to determine which

nodes should store replicas.

Page 44: Nzpug welly-cassandra-02-12-2010

It’s also used to work out which nodes should have a

value when reading.

Page 45: Nzpug welly-cassandra-02-12-2010

Simple Strategy orders the nodes by their token and

places the replicas clockwise around the ring.

Page 46: Nzpug welly-cassandra-02-12-2010

Network Topology Strategy is aware of the racks and

Data Centres your servers are in. Can split replicas

between DC’s.

Page 47: Nzpug welly-cassandra-02-12-2010

Simple Strategy will do in most cases.

Page 48: Nzpug welly-cassandra-02-12-2010

Our coordinator will send the write to all 3 nodes at

once.

Page 49: Nzpug welly-cassandra-02-12-2010

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

Page 50: Nzpug welly-cassandra-02-12-2010

Once the 3 replicas tell the coordinator they have

finished, it will tell the client the write completed.

Page 51: Nzpug welly-cassandra-02-12-2010

Done.Let’s go home.

Page 52: Nzpug welly-cassandra-02-12-2010

Hang on.What about fault tolerant?What if node #4 is down?

Page 53: Nzpug welly-cassandra-02-12-2010

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

Page 54: Nzpug welly-cassandra-02-12-2010

After all, two out of three ain’t bad.

Page 55: Nzpug welly-cassandra-02-12-2010

The client must specify a Consistency Level for each

operation.

Page 56: Nzpug welly-cassandra-02-12-2010

Consistency Level specifies how many nodes must

agree before the operation is a success.

Page 57: Nzpug welly-cassandra-02-12-2010

For reads is known as R.For writes is known as W.

Page 58: Nzpug welly-cassandra-02-12-2010

Here are the simple ones (there are a few more)...

Page 59: Nzpug welly-cassandra-02-12-2010

One.The coordinator will only

wait for one node to acknowledge the write.

Page 60: Nzpug welly-cassandra-02-12-2010

Quorum.N/2 + 1

Page 61: Nzpug welly-cassandra-02-12-2010

All.

Page 62: Nzpug welly-cassandra-02-12-2010

The cluster will work to eventually make all copies

of the data consistent.

Page 63: Nzpug welly-cassandra-02-12-2010

To get consistent behaviour make sure that R + W > N.

You can do this by...

Page 64: Nzpug welly-cassandra-02-12-2010

Always using Quorum for read and writes.

Or...

Page 65: Nzpug welly-cassandra-02-12-2010

Use All for writes and One for reads.

Or...

Page 66: Nzpug welly-cassandra-02-12-2010

Use All for reads and One for writes.

Page 67: Nzpug welly-cassandra-02-12-2010

Try our write again, using Quorum consistency level.

Page 68: Nzpug welly-cassandra-02-12-2010

Coordinator will wait for 2 nodes to complete the write before telling the client it has completed.

Page 69: Nzpug welly-cassandra-02-12-2010

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

Page 70: Nzpug welly-cassandra-02-12-2010

What about when node 4 comes online?

Page 71: Nzpug welly-cassandra-02-12-2010

It will not have our “foo” key.

Page 72: Nzpug welly-cassandra-02-12-2010

Won’t somebody please think of the “foo” key!?

Page 73: Nzpug welly-cassandra-02-12-2010

During our write the coordinator will send a

Hinted Handoff to one of the online replicas.

Page 74: Nzpug welly-cassandra-02-12-2010

Hinted Handoff tells the node that one of the

replicas was down and needs to be updated later.

Page 75: Nzpug welly-cassandra-02-12-2010

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

send "3" to #4

Page 76: Nzpug welly-cassandra-02-12-2010

When node 4 comes back up, node 3 will eventually

process the Hinted Handoffs and send the

“foo” key to it.

Page 77: Nzpug welly-cassandra-02-12-2010

#1<= 2

#2"3"

#3"3"

#4"3"

#5<= 0

Client

Page 78: Nzpug welly-cassandra-02-12-2010

What if the “foo” key is read before the Hinted Handoff is processed?

Page 79: Nzpug welly-cassandra-02-12-2010

#1<= 2

#2"3"

#3"3"

#4""

#5<= 0

Client

send "3" to #4

Page 80: Nzpug welly-cassandra-02-12-2010

At our Quorum CL the coordinator asks all nodes that should have replicas to

perform the read.

Page 81: Nzpug welly-cassandra-02-12-2010

The Replication Strategy is used by the coordinator to

find out which nodes should have replicas.

Page 82: Nzpug welly-cassandra-02-12-2010

Once CL nodes have returned, their values are

compared.

Page 83: Nzpug welly-cassandra-02-12-2010

If the do not match a Read Repair process is kicked off.

Page 84: Nzpug welly-cassandra-02-12-2010

A timestamp provided by the client during the write is used to determine the

“latest” value.

Page 85: Nzpug welly-cassandra-02-12-2010

The “foo” key is written to node 4, and consistency

achieved, before the coordinator returns to the

client.

Page 86: Nzpug welly-cassandra-02-12-2010

At lower CL’s the Read Repair happens in the

background and is probabilistic.

Page 87: Nzpug welly-cassandra-02-12-2010

We can force Cassandra to repair everything using the

Anti Entropy feature.

Page 88: Nzpug welly-cassandra-02-12-2010

Anti Entropy is the main feature for achieving

consistency. RR and HH are optimisations.

Page 89: Nzpug welly-cassandra-02-12-2010

Anti Entropy is started manually via command line

or Java JMX.

Page 90: Nzpug welly-cassandra-02-12-2010

Great so far.

Page 91: Nzpug welly-cassandra-02-12-2010

But ratemylolcats.com is going to be huge.

How do I store 100 Million pictures of cats?

Page 92: Nzpug welly-cassandra-02-12-2010

Add more nodes.

Page 93: Nzpug welly-cassandra-02-12-2010

More disk capacity, disk IO, memory, CPU, network IO.

More everything.

Page 94: Nzpug welly-cassandra-02-12-2010

Except.

Page 95: Nzpug welly-cassandra-02-12-2010

Less “work” per node.

Page 96: Nzpug welly-cassandra-02-12-2010

Linear scaling.

Page 97: Nzpug welly-cassandra-02-12-2010

Clusters of 100+ TB.

Page 98: Nzpug welly-cassandra-02-12-2010

And now for the data model.

Page 99: Nzpug welly-cassandra-02-12-2010

From the outside in.

Page 100: Nzpug welly-cassandra-02-12-2010

A Keyspace is the container for everything in your

application.

Page 101: Nzpug welly-cassandra-02-12-2010

Keyspaces can be thought of as Databases.

Page 102: Nzpug welly-cassandra-02-12-2010

A Column Family is a container for ordered and

indexed Columns.

Page 103: Nzpug welly-cassandra-02-12-2010

Columns have a name, value, and timestamp

provided by the client.

Page 104: Nzpug welly-cassandra-02-12-2010

The CF indexes the columns by name and

supports get operations by name.

Page 105: Nzpug welly-cassandra-02-12-2010

CF’s do not define which columns can be stored in

them.

Page 106: Nzpug welly-cassandra-02-12-2010

Column Families have a large memory overhead.

Page 107: Nzpug welly-cassandra-02-12-2010

You typically have few (<10) CF’s in your Keyspace. But

there is no limit.

Page 108: Nzpug welly-cassandra-02-12-2010

We have Rows. Rows have a key.

Page 109: Nzpug welly-cassandra-02-12-2010

Rows store columns in one or more Column Families.

Page 110: Nzpug welly-cassandra-02-12-2010

Different rows can store different columns in the same Column Family.

Page 111: Nzpug welly-cassandra-02-12-2010

User CF

username => fredd_o_b => 04/03

username => bobcity => wellington

key => fred

key => bob

Page 112: Nzpug welly-cassandra-02-12-2010

A key can store different columns in different Column Families.

Page 113: Nzpug welly-cassandra-02-12-2010

User CF

username => fredd_o_b => 04/03

09:01 => tweet_6009:02 => tweet_70

key => fred

key => fred

Timeline CF

Page 114: Nzpug welly-cassandra-02-12-2010

Here comes the Super Column Family to ruin it all.

Page 115: Nzpug welly-cassandra-02-12-2010

Arrgggghhhhh.

Page 116: Nzpug welly-cassandra-02-12-2010

A Super Column Family is a container for ordered and indexes Super Columns.

Page 117: Nzpug welly-cassandra-02-12-2010

A Super Column has a name and an ordered and indexed list of Columns.

Page 118: Nzpug welly-cassandra-02-12-2010

So the Super Column Family just gives another

level to our hash.

Page 119: Nzpug welly-cassandra-02-12-2010

Social Super CF

following => {bob => 01/01/2010, tom => 01/02/2010}

followers => {bob => 01/01/2010}

key => fred

Page 120: Nzpug welly-cassandra-02-12-2010

How about some code?