Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University Subject : Cassandra - A Decentralized Structured Storage System Professor

Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University

Subject : Cassandra - A Decentralized Structured Storage System

Professor : Dr. sh.Esmaili

The Student’s Identifiers :Mr. Houshyar Mohammadi Talvar(Slides 4 to 17)Miss.Hakimi(Slides 19 to 27)Mr. Hossien Sadrizadeh(Slides 29 to 65)

The Date :June 6th 2012 , (On Thursday , 25th Khordad 1391 )

1 /66


Contenet Of The Presentation:• Abstract• Introduction• Related Work• Data Model• API• System Arcgitecture

• Partitioning• Replication• Membership• Bootstrapping• Scaling the Cluster• Local Persistance• Implementation Details

• Practical Experiences• Facebook Inbox Search• Conclusion• Acknowledgements

2 / 66


Mr. Houshyar Mohammadi Talvar

Slides From 4 To 17

3 / 66


Cassandra is a distributed storage system for managing verylarge amounts of structured data spread out across manycommodity servers, while providing highly available servicewith no single point of failure.

Abstract

4 / 66


IntroductionFacebook runs the largest social networking platform that serves hundreds of millions users at peak times using tens of thousands of servers located in many data centers around the world.

5 / 66


Related Work

Systems like Ficus and Coda replicate files for high availability at the expense of consistency. Update conflicts are typically managed using specialized conflict resolution procedures.

Bayou Coda Ficus Dynamo

6 / 66


7 / 66


8 / 66


9 / 66


10 / 66


Data Model

A table in Cassandra is a distributed multi dimensional map indexed by a key. The value is an object which is highly structured.

Cassandra exposes two kinds of columns families

Simple column families

Super column families

11 / 66


KEYColumnFamily1 Name : MailList Type : Simple Sort : Name

Name : tid1

Value : <Binary>

TimeStamp : t1

Name : tid2

Value : <Binary>

TimeStamp : t2

Name : tid3

Value : <Binary>

TimeStamp : t3

Name : tid4

Value : <Binary>

TimeStamp : t4

ColumnFamily2 Name : WordList Type : Super Sort : Time

Name : aloha

C1

V1

T1

C2

V2

T2

C3

V3

T3

C4

V4

T4

Name : dude

C2

V2

T2

C6

V6

T6

Column Families are declared

upfront

Columns are added and modified dynamically

SuperColumns are added and modified

dynamically

Columns are added and modified dynamically

Data Model(Continue)

12 / 66


Any column within a column family is accessed using the convention column family : column any column within a column family that is of type super is accessed using the convention column family :super column : column.

Data Model(Continue)

13 / 66


The Cassandra API consists of the following three simple methods.

insert(table; key; rowMutation) get(table; key; columnName) delete(table; key; columnName)

API

14 / 66


System Architecture

The architecture of a storage system that needs to operate in a production setting is complex.

In addition to the actual data persistence component, the system needs to have the following characteristics

15 / 66


System Architecture(Continue) scalable and robust solutions for load balancing membership and failure detection failure recovery replica synchronization overload handling state transfer concurrency and job scheduling request marshalling request routing system monitoring and alarming configuration management

16 / 66


Read

Query

Closest replica

Cassandra Cluster

Replica A

Result

Replica B Replica C

Digest Query

Result

Client

17 / 66


Miss. Hakimi

Slides From 19 To 27

18 / 66


Partitioning

One of the key design features for Cassandra is the ability to scale incrementally.

19 / 66


01

1/2

F

E

D

C

B

A N=3

h(key2)

h(key1)Partitioning and Replication

20 /66


Membership

Cluster membership in Cassandra is based on Scuttlebutt, a very efficint anti-entropy Gossip based mechanism.

21 / 66


Membership

• Gossip protocol is used for cluster membership.• Super lightweight with mathematically provable

properties.• State disseminated in O(logN) rounds where N is

the number of nodes in the cluster.• Every T seconds each member increments its

heartbeat counter and selects one other member to send its list to.

• A member merges the list with its own list .

22 / 66


Failure Detection

Failure detection is a mechanism by which a node can locally determine if any other node in the system is up or down.

23 / 66


Accrual Failure Detector

• Valuable for system management, replication, load balancing etc.

• Defined as a failure detector that outputs a value, PHI, associated with each process.

• Also known as Adaptive Failure detectors - designed to adapt to changing network conditions.

• The value output, PHI, represents a suspicion level.• Applications set an appropriate threshold, trigger

suspicions and perform appropriate actions.• In Cassandra the average time taken to detect a failure is

10-15 seconds with the PHI threshold set at 5

24 / 66


Bootstrapping

When a node starts for the first time, it chooses a random token for its position in the ring.

25 / 66


Scaling The Cluster

When a new node is added into the system, it gets assigned a token such that it can alleviate a heavily loaded node.

26 / 66


Local Persistence

The Cassandra system relies on the local file system for data persistence.

27 / 66


Hossien SadrizadehSlides From 29 To 65

28 / 66


Implementaion Details• The following abstractions are need for Cassandra

Process on a Single Machine.• Partitioning module.• Cluster membership and Failure detection module.• Storage engine module.

• Each of these module has been implemented from the ground using Java.

• Each of these modules rely on an event driven where the message processing pipeline and the task pipeline are split into multiple stage along the line of the SEDA architecture.(Staged Event-Driven Architecture).

29 / 66


SEDA1

1 SEDA: An Architecture for Well-Conditioned,Scalable Internet Services (Matt Welsh, David Culler, and Eric Brewer )

• SEDA combines of threads and event-based programming models to manage :• Concurrency.• I/O.• Schedulaing.• Resource management needs of Internet services.

• In SEDA, applications consist of:• A network of event-driven stages.• Each stage connected by explicit queues.

• SEDA is intended to support massive concurrency demands and simplify the construction of well-conditioned services.

30 / 66


SEDA(Continue)

Thread Server Design :Each incoming request is dispatched to a separate threads, which processes the request and returns a result to the client. Edges represent control flow between components .

31 / 66


Routing

• All system control messages rely on UDP1 based messageing while the application related messages for replication and request routing relies on TCP2.

• The request routing modules are implemented using a certain state machine.

1.UDP : User Datagram Protocol.(a connectionless protocol)2.TCP : Transfer Control Protocol.(a connection-Oriented protocol)

32 / 66


What Happened When a Read/Write Request From a Node In The Cluster ?

33 / 66


Partitioning1

• In cassandra, the total data managed by the cluster is represented as a circular space or ring.

• The ring is divided up into ranges equal to the number of nodes, which each node being responsible for one or more ranges of the overall data.

• Before a node can join the ring, it must be assigned a token.

• The token determines the node’s position on the ring and the range of data it is responsible for.

1 http://www.datastax.com/docs/0.8/cluster_architecture

34 / 66


Partitioning – Single Data Center(Continue)1


A cluster with 4 nodes, the row keys managed by the cluster were numbers in the range of 0 to 100.

Each node is assigned a token that represents a point in this range.

In this simple example, the token values are 0, 25, 50, and 75.

35 / 66


Partitioning – Replica Placement(Continue)1


• In multi-data center deployments, replica placement is calculated per data center.

• Additional replicas in the same data center are placed by walking the ring clockwise until it reaches the first node in another rack.

36 / 66


Partitioning – Multi Data Center(Continue)1


37 / 66


Partitioning – Multi Data Center(Continue)1


The goal is to ensure that the nodes for each data center have token assignments that evenly divide the overall range.

38 / 66


About Client Request• All nodes in cassandra are peers.

• A client read/write request can go to any node in the cluster.

• When a client connect to a node and issues a read/write request , that node serves as a proxy the coordinator for that particular operation.

• The job of the coordinator is to act between the client application and the nodes(replicas)that own the data being requested.

39 / 66


About Client Request(Continue)• The coordinator sends the write request to all replicas

that own the row being written. if all replica nodes are up and available.

• They will get the write regardless of the consistency level specified by the client.

• The write consistency level determines how many replica nodes must respond with a success acknowledgement in order for the write to be considered successful.

40 / 66


An Example To WriteFor example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row.

41 / 66


11

1

3

2

8

4

6

5

7

10Client

9

12

R1

R2

R3

42 / 66


Replication Factor & Replication In Cassandra

• Replication factor The total number of replicas across the cluster is often referred to as the replication factor.

• A replication factor of 1 means that there is only one copy of each row, and a replication factor of 2 means two copies of each row.

• Replication Is the process of storing copies of data on multiple nodes.

43 / 66


Replication Factor In Cassandra

Replication is the process of storing copies of data on multiple nodes.

44 / 66


Commit Log

• The cassandra system base on the local file system for data persistance.

• We have a dedicated disk on each machine for the commit log.

• The write into the in-memory data structure is performed only after a successful write into the commit log.

45 / 66


Commit Log & In-Memory Structure• The cassandra ,first writes data to a commit

log(for durability), and then an in-memory table structure called memtable.

• A write is successful when :1. First, It is written to the commit log.2. Second, write in the Memory.

• Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable.

46 / 66


Structure Of Commit Log

• Every commit log has a header which is basically:• A bit vector with fixed size.• The size of the bit vector is more than the number of column

families.

• These bit vectors are per commit log and also hold in memory.

47 / 66


Write Operation Into The Commit Log

• The write operation into the commit log can either be in normal mode or in fast sync mode.

• In the fast sync mode the writes to the commit log are buffered.(if the machine is crashed some of data maybe loss).

48 / 66


Implementaion The Commit Log(Continue)

• Traditional databases are not designed to handle high write throughput.

• Cassandra do writes to disk into sequential writes thus maximize disk write throughput.

• Since the files dumped to the disk are never changed then no locks need to be taken while reading them.for instance the server of cassandra is practically lockless for read/write operation.

49 / 66


When Should We Delete The Commit Log?

In any logging system , we need a mechanism to purge commit log entries.

Question :Is there any different between delete a commit log and delete the entries of commit log ?

50 / 66


Implementaion The Index• The cassandra system indexes all database on

primary key.

• The data file on disk is broken down into a sequence of blocks.

• Each block:• Contains at most 128 keys.• Is demarcated by a block index.

• The block index capture the relative offset of a key within the block and the size of its data.

51 / 66


Layout Of a Sample Block

Structure of a block and their index demarcated in memory

52 / 66


Implementaion The Index(Continue)

• When an in-memory data structure(block) is dumped to disk a block index is generated and their offsets written out to disk as indics.

• This index is also hold in memory for fast access.

53 / 66


What Happened When a Typical Read Is Take Place?

54 / 66


What Should We1 Do When The Number Of Files Are Increased On The Disk ?

• Over time the number of data files will increase on disk.

• We perform a compaction process, very much like the Bigtable system.

• Merges multiple files into one ;essentially merge sort on a cluster of sorted data files.

• Periodically a compaction process is run to compact all related data files into one big file.

1 The Research team

55 / 66


Practical Experiences• In the design process of cassandra, we learnt a lot of

usefull experience and it is very benefitical for us.

• We experimented with various implementations of Failure Detectors.if the size of cluster is grown then the time of detected faliure is increased.

• Most application only require atomic operation per key per replica, but there are some application to do on secondray indexes.(because most developers work on RDBMS).

• Cassandra is a completely decentralized system(distributed system).

56 / 66


Ganglia /ģæŋ.lia/• Old monitoring is not benefit anymore,because the

Cassandra system is well integerated with Ganglia (a distributed monitoring tool)1.

• Ganglia is a scalable distributed system monitor tool for high-performance computing system such as clusters.

• The strategy uses a distributed tree structure that enables organizations to monitor an arbitrarily large number of clusters while placing bounds on the required processing load.

1 Matthew L.Massie,Brent N.Chun, and David E, Culler.The Ganglia distributed monitoring system

57 / 66


Ganglia(Continue)

Ganglia is comprised of two components:• Gmon,local-area monitoring system .• Gmeta wide-area system.

Ganglia local and wide area monitor interaction. Gmon runs on each cluster node; gmeta can fail over between nodes.

Gmon uses UDP multicast.Gmon communicates with its Gmeta counterpart usingXML streams sent over TCP connections.

58 / 66


Facebook Inbox Search

• what is the matter?

• Millions of messages are sent everyday on Facebook.

• Messages stored in different data centers.

• How to handle indexing all of this information for Inbox search ?

59 / 66


Facebook Inbox Search

• For inbox search we have to make a list of all messages per user that have been exchanged between the sender and recipients.

• There are two kinds of search features:• Term search.• Search interaction.

60 / 66


Term Search / Search Interaction

• Term search :• Key = user ID.• Super column = the words that make up the message

become.

• Search interaction :• Key = user ID.• Super column = the recipients id’s.

61 / 66


An Actual Example

• The current system store about 50TB of data on a 150 node cluster.

• The previous data are spread out between east and west coast data center.

• Some measure we product them are in the following table.

62 / 66


What Works Are There To Do On The Future ?

The works that we can do them are:

• Adding compression.• Secondary index support.

63 / 66


Cassandra Goals(Conclusion)

• High scalability.

• High performance.• Throughput. • Response time.

• High availability.

64 / 66


Headline Summary• All implimentation use of java.

• Use the UDP anf TCP protocol for routing.

• Ring mechanism used for clustering.

• All the nodes in the ring are peers.

• Use of replication.

• Use of commit log to persistence files.

• As use of sequential write we have a high throughput.

• All files broken into some blocks.

• It doesn’t use of lock to write/read.

• Use of compression to compat the files.

65 / 66


Now,PleaseAsk Your Questions !

66 / 66

Documents

Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University Subject : Cassandra - A Decentralized Structured Storage System Professor