Upload
marion-ross
View
217
Download
1
Tags:
Embed Size (px)
Citation preview
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Subject : Cassandra - A Decentralized Structured Storage System
Professor : Dr. sh.Esmaili
The Student’s Identifiers :Mr. Houshyar Mohammadi Talvar(Slides 4 to 17)Miss.Hakimi(Slides 19 to 27)Mr. Hossien Sadrizadeh(Slides 29 to 65)
The Date :June 6th 2012 , (On Thursday , 25th Khordad 1391 )
1 /66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Contenet Of The Presentation:• Abstract• Introduction• Related Work• Data Model• API• System Arcgitecture
• Partitioning• Replication• Membership• Bootstrapping• Scaling the Cluster• Local Persistance• Implementation Details
• Practical Experiences• Facebook Inbox Search• Conclusion• Acknowledgements
2 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Mr. Houshyar Mohammadi Talvar
Slides From 4 To 17
3 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Cassandra is a distributed storage system for managing verylarge amounts of structured data spread out across manycommodity servers, while providing highly available servicewith no single point of failure.
Abstract
4 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
IntroductionFacebook runs the largest social networking platform that serves hundreds of millions users at peak times using tens of thousands of servers located in many data centers around the world.
5 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Related Work
Systems like Ficus and Coda replicate files for high availability at the expense of consistency. Update conflicts are typically managed using specialized conflict resolution procedures.
Bayou Coda Ficus Dynamo
6 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
7 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
8 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
9 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
10 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Data Model
A table in Cassandra is a distributed multi dimensional map indexed by a key. The value is an object which is highly structured.
Cassandra exposes two kinds of columns families
Simple column families
Super column families
11 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
KEYColumnFamily1 Name : MailList Type : Simple Sort : Name
Name : tid1
Value : <Binary>
TimeStamp : t1
Name : tid2
Value : <Binary>
TimeStamp : t2
Name : tid3
Value : <Binary>
TimeStamp : t3
Name : tid4
Value : <Binary>
TimeStamp : t4
ColumnFamily2 Name : WordList Type : Super Sort : Time
Name : aloha
C1
V1
T1
C2
V2
T2
C3
V3
T3
C4
V4
T4
Name : dude
C2
V2
T2
C6
V6
T6
Column Families are declared
upfront
Columns are added and modified dynamically
SuperColumns are added and modified
dynamically
Columns are added and modified dynamically
Data Model(Continue)
12 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Any column within a column family is accessed using the convention column family : column any column within a column family that is of type super is accessed using the convention column family :super column : column.
Data Model(Continue)
13 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
The Cassandra API consists of the following three simple methods.
insert(table; key; rowMutation) get(table; key; columnName) delete(table; key; columnName)
API
14 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
System Architecture
The architecture of a storage system that needs to operate in a production setting is complex.
In addition to the actual data persistence component, the system needs to have the following characteristics
15 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
System Architecture(Continue) scalable and robust solutions for load balancing membership and failure detection failure recovery replica synchronization overload handling state transfer concurrency and job scheduling request marshalling request routing system monitoring and alarming configuration management
16 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Read
Query
Closest replica
Cassandra Cluster
Replica A
Result
Replica B Replica C
Digest Query
Result
Client
17 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Miss. Hakimi
Slides From 19 To 27
18 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Partitioning
One of the key design features for Cassandra is the ability to scale incrementally.
19 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
01
1/2
F
E
D
C
B
A N=3
h(key2)
h(key1)Partitioning and Replication
20 /66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Membership
Cluster membership in Cassandra is based on Scuttlebutt, a very efficint anti-entropy Gossip based mechanism.
21 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Membership
• Gossip protocol is used for cluster membership.• Super lightweight with mathematically provable
properties.• State disseminated in O(logN) rounds where N is
the number of nodes in the cluster.• Every T seconds each member increments its
heartbeat counter and selects one other member to send its list to.
• A member merges the list with its own list .
22 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Failure Detection
Failure detection is a mechanism by which a node can locally determine if any other node in the system is up or down.
23 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Accrual Failure Detector
• Valuable for system management, replication, load balancing etc.
• Defined as a failure detector that outputs a value, PHI, associated with each process.
• Also known as Adaptive Failure detectors - designed to adapt to changing network conditions.
• The value output, PHI, represents a suspicion level.• Applications set an appropriate threshold, trigger
suspicions and perform appropriate actions.• In Cassandra the average time taken to detect a failure is
10-15 seconds with the PHI threshold set at 5
24 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Bootstrapping
When a node starts for the first time, it chooses a random token for its position in the ring.
25 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Scaling The Cluster
When a new node is added into the system, it gets assigned a token such that it can alleviate a heavily loaded node.
26 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Local Persistence
The Cassandra system relies on the local file system for data persistence.
27 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Hossien SadrizadehSlides From 29 To 65
28 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Implementaion Details• The following abstractions are need for Cassandra
Process on a Single Machine.• Partitioning module.• Cluster membership and Failure detection module.• Storage engine module.
• Each of these module has been implemented from the ground using Java.
• Each of these modules rely on an event driven where the message processing pipeline and the task pipeline are split into multiple stage along the line of the SEDA architecture.(Staged Event-Driven Architecture).
29 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
SEDA1
1 SEDA: An Architecture for Well-Conditioned,Scalable Internet Services (Matt Welsh, David Culler, and Eric Brewer )
• SEDA combines of threads and event-based programming models to manage :• Concurrency.• I/O.• Schedulaing.• Resource management needs of Internet services.
• In SEDA, applications consist of:• A network of event-driven stages.• Each stage connected by explicit queues.
• SEDA is intended to support massive concurrency demands and simplify the construction of well-conditioned services.
30 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
SEDA(Continue)
Thread Server Design :Each incoming request is dispatched to a separate threads, which processes the request and returns a result to the client. Edges represent control flow between components .
31 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Routing
• All system control messages rely on UDP1 based messageing while the application related messages for replication and request routing relies on TCP2.
• The request routing modules are implemented using a certain state machine.
1.UDP : User Datagram Protocol.(a connectionless protocol)2.TCP : Transfer Control Protocol.(a connection-Oriented protocol)
32 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
What Happened When a Read/Write Request From a Node In The Cluster ?
33 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Partitioning1
• In cassandra, the total data managed by the cluster is represented as a circular space or ring.
• The ring is divided up into ranges equal to the number of nodes, which each node being responsible for one or more ranges of the overall data.
• Before a node can join the ring, it must be assigned a token.
• The token determines the node’s position on the ring and the range of data it is responsible for.
1 http://www.datastax.com/docs/0.8/cluster_architecture
34 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Partitioning – Single Data Center(Continue)1
1 http://www.datastax.com/docs/0.8/cluster_architecture
A cluster with 4 nodes, the row keys managed by the cluster were numbers in the range of 0 to 100.
Each node is assigned a token that represents a point in this range.
In this simple example, the token values are 0, 25, 50, and 75.
35 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Partitioning – Replica Placement(Continue)1
1 http://www.datastax.com/docs/0.8/cluster_architecture
• In multi-data center deployments, replica placement is calculated per data center.
• Additional replicas in the same data center are placed by walking the ring clockwise until it reaches the first node in another rack.
36 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Partitioning – Multi Data Center(Continue)1
1 http://www.datastax.com/docs/0.8/cluster_architecture
37 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Partitioning – Multi Data Center(Continue)1
1 http://www.datastax.com/docs/0.8/cluster_architecture
The goal is to ensure that the nodes for each data center have token assignments that evenly divide the overall range.
38 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
About Client Request• All nodes in cassandra are peers.
• A client read/write request can go to any node in the cluster.
• When a client connect to a node and issues a read/write request , that node serves as a proxy the coordinator for that particular operation.
• The job of the coordinator is to act between the client application and the nodes(replicas)that own the data being requested.
39 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
About Client Request(Continue)• The coordinator sends the write request to all replicas
that own the row being written. if all replica nodes are up and available.
• They will get the write regardless of the consistency level specified by the client.
• The write consistency level determines how many replica nodes must respond with a success acknowledgement in order for the write to be considered successful.
40 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
An Example To WriteFor example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row.
41 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
11
1
3
2
8
4
6
5
7
10Client
9
12
R1
R2
R3
42 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Replication Factor & Replication In Cassandra
• Replication factor The total number of replicas across the cluster is often referred to as the replication factor.
• A replication factor of 1 means that there is only one copy of each row, and a replication factor of 2 means two copies of each row.
• Replication Is the process of storing copies of data on multiple nodes.
43 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Replication Factor In Cassandra
Replication is the process of storing copies of data on multiple nodes.
44 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Commit Log
• The cassandra system base on the local file system for data persistance.
• We have a dedicated disk on each machine for the commit log.
• The write into the in-memory data structure is performed only after a successful write into the commit log.
45 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Commit Log & In-Memory Structure• The cassandra ,first writes data to a commit
log(for durability), and then an in-memory table structure called memtable.
• A write is successful when :1. First, It is written to the commit log.2. Second, write in the Memory.
• Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable.
46 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Structure Of Commit Log
• Every commit log has a header which is basically:• A bit vector with fixed size.• The size of the bit vector is more than the number of column
families.
• These bit vectors are per commit log and also hold in memory.
47 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Write Operation Into The Commit Log
• The write operation into the commit log can either be in normal mode or in fast sync mode.
• In the fast sync mode the writes to the commit log are buffered.(if the machine is crashed some of data maybe loss).
48 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Implementaion The Commit Log(Continue)
• Traditional databases are not designed to handle high write throughput.
• Cassandra do writes to disk into sequential writes thus maximize disk write throughput.
• Since the files dumped to the disk are never changed then no locks need to be taken while reading them.for instance the server of cassandra is practically lockless for read/write operation.
49 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
When Should We Delete The Commit Log?
In any logging system , we need a mechanism to purge commit log entries.
Question :Is there any different between delete a commit log and delete the entries of commit log ?
50 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Implementaion The Index• The cassandra system indexes all database on
primary key.
• The data file on disk is broken down into a sequence of blocks.
• Each block:• Contains at most 128 keys.• Is demarcated by a block index.
• The block index capture the relative offset of a key within the block and the size of its data.
51 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Layout Of a Sample Block
Structure of a block and their index demarcated in memory
52 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Implementaion The Index(Continue)
• When an in-memory data structure(block) is dumped to disk a block index is generated and their offsets written out to disk as indics.
• This index is also hold in memory for fast access.
53 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
What Happened When a Typical Read Is Take Place?
54 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
What Should We1 Do When The Number Of Files Are Increased On The Disk ?
• Over time the number of data files will increase on disk.
• We perform a compaction process, very much like the Bigtable system.
• Merges multiple files into one ;essentially merge sort on a cluster of sorted data files.
• Periodically a compaction process is run to compact all related data files into one big file.
1 The Research team
55 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Practical Experiences• In the design process of cassandra, we learnt a lot of
usefull experience and it is very benefitical for us.
• We experimented with various implementations of Failure Detectors.if the size of cluster is grown then the time of detected faliure is increased.
• Most application only require atomic operation per key per replica, but there are some application to do on secondray indexes.(because most developers work on RDBMS).
• Cassandra is a completely decentralized system(distributed system).
56 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Ganglia /ģæŋ.lia/• Old monitoring is not benefit anymore,because the
Cassandra system is well integerated with Ganglia (a distributed monitoring tool)1.
• Ganglia is a scalable distributed system monitor tool for high-performance computing system such as clusters.
• The strategy uses a distributed tree structure that enables organizations to monitor an arbitrarily large number of clusters while placing bounds on the required processing load.
1 Matthew L.Massie,Brent N.Chun, and David E, Culler.The Ganglia distributed monitoring system
57 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Ganglia(Continue)
Ganglia is comprised of two components:• Gmon,local-area monitoring system .• Gmeta wide-area system.
Ganglia local and wide area monitor interaction. Gmon runs on each cluster node; gmeta can fail over between nodes.
Gmon uses UDP multicast.Gmon communicates with its Gmeta counterpart usingXML streams sent over TCP connections.
58 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Facebook Inbox Search
• what is the matter?
• Millions of messages are sent everyday on Facebook.
• Messages stored in different data centers.
• How to handle indexing all of this information for Inbox search ?
59 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Facebook Inbox Search
• For inbox search we have to make a list of all messages per user that have been exchanged between the sender and recipients.
• There are two kinds of search features:• Term search.• Search interaction.
60 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Term Search / Search Interaction
• Term search :• Key = user ID.• Super column = the words that make up the message
become.
• Search interaction :• Key = user ID.• Super column = the recipients id’s.
61 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
An Actual Example
• The current system store about 50TB of data on a 150 node cluster.
• The previous data are spread out between east and west coast data center.
• Some measure we product them are in the following table.
62 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
What Works Are There To Do On The Future ?
The works that we can do them are:
• Adding compression.• Secondary index support.
63 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Cassandra Goals(Conclusion)
• High scalability.
• High performance.• Throughput. • Response time.
• High availability.
64 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Headline Summary• All implimentation use of java.
• Use the UDP anf TCP protocol for routing.
• Ring mechanism used for clustering.
• All the nodes in the ring are peers.
• Use of replication.
• Use of commit log to persistence files.
• As use of sequential write we have a high throughput.
• All files broken into some blocks.
• It doesn’t use of lock to write/read.
• Use of compression to compat the files.
65 / 66
Cassandra-A Decentrilized Structured Storage System Azad Kurdistan University
Now,PleaseAsk Your Questions !
66 / 66