A NOSQL STUDY: APACHE CASSANDRA
Shujaat Hussain
Data Model
A single column
A single row
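The column and row slides can be sketched in a few lines of Python. This is only an illustrative model, not Cassandra's storage code: the class names `Column` and `Row` are hypothetical, and the sketch assumes the usual description of a column as a (name, value, timestamp) triple, with the newest timestamp winning when the same column is written twice.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    """A single column: a name, a value, and a write timestamp."""
    name: str
    value: str
    timestamp: int  # assumed to be supplied by the client


@dataclass
class Row:
    """A single row: a key plus any number of columns, looked up by name."""
    key: str
    columns: dict = field(default_factory=dict)  # column name -> Column

    def insert(self, name, value, timestamp):
        current = self.columns.get(name)
        # Assumption: last write wins, decided by timestamp.
        if current is None or timestamp >= current.timestamp:
            self.columns[name] = Column(name, value, timestamp)

    def sorted_columns(self):
        # Columns within a row are kept in name order.
        return [self.columns[name] for name in sorted(self.columns)]


row = Row(key="a")
row.insert("cat", "meow", timestamp=2)
row.insert("bar", "drinks", timestamp=1)
row.insert("cat", "stale", timestamp=1)  # older write, ignored
column_names = [c.name for c in row.sorted_columns()]
```

Rows do not need to share the same set of columns, which is where the schema flexibility listed under Pros comes from.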
CAP Theorem
Consistency: the system is in a consistent state after an operation
Availability: the system is “always on”, no downtime
Partition tolerance: the system continues to function even when split into disconnected subsets (by a network disruption)
Performance vs. MySQL with 50 GB of data
MySQL: 300 ms write, 350 ms read
Cassandra: 0.12 ms write, 15 ms read
Querying: Overview
You need a key or keys:
  Single: key='a'
  Range: key='a' through 'f'
And columns to retrieve:
  Slice: cols={bar through kite}
  By name: key='b' cols={bar, cat, llama}
Nothing like SQL "WHERE col='faz'"
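The four access patterns above can be mimicked with a toy in-memory store. This is a minimal sketch and not a Cassandra client API: `STORE` and the helper names are hypothetical, the data is made up, and the key range assumes row keys are kept in sorted order (in Cassandra that depends on the partitioner).

```python
from bisect import bisect_left, bisect_right

# Hypothetical data: row key -> {column name -> value}
STORE = {
    "a": {"ant": 1, "bar": 2, "cat": 3, "kite": 4, "llama": 5},
    "b": {"bar": 6, "cat": 7, "llama": 8, "zebra": 9},
    "f": {"dog": 10, "faz": 11},
    "g": {"bar": 12},
}


def key_range(store, start, end):
    """Row keys from start through end (inclusive), in sorted order."""
    keys = sorted(store)
    return keys[bisect_left(keys, start):bisect_right(keys, end)]


def slice_columns(row, start, end):
    """Columns whose names fall between start and end (inclusive)."""
    return {name: row[name] for name in sorted(row) if start <= name <= end}


def columns_by_name(row, names):
    """Only the explicitly named columns that exist in the row."""
    return {name: row[name] for name in names if name in row}


single = STORE["a"]                                            # Single: key='a'
ranged = key_range(STORE, "a", "f")                            # Range: 'a' through 'f'
sliced = slice_columns(STORE["a"], "bar", "kite")              # Slice: bar through kite
named = columns_by_name(STORE["b"], ["bar", "cat", "llama"])   # By name
```

Every helper starts from a row key. Answering something like WHERE col='faz' would mean scanning every row, which is why the slide says nothing like it is available.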
Digg is a social news site that allows people to discover and share content from anywhere on the Internet by submitting stories and links, and voting and commenting on submitted stories and links.
Problems
  Terabytes of data; high transaction rate (read-dominated)
  Multiple clusters
  Management nightmare (high effort, error-prone)
  Unsatisfied availability requirements (geographic isolation)
Solution
  Cassandra as primary data store
  Datacenter and rack-aware replication
Twitter is a social networking and microblogging service that enables its users to send and read tweets, text-based posts of up to 140 characters.
Terabytes of data, ~1,000,000 ops/s
Facebook Inbox Search: 100 TB, 160 nodes, 1/2 billion writes per day (2-year-old number?)
Pros
Advantages
  Massive scalability
  High availability
  Lower cost (than competitive solutions at that scale)
  (Usually) predictable elasticity
  Schema flexibility, sparse & semi-structured data
Cons
Disadvantages
  Limited query capabilities (so far)
  Eventual consistency is not intuitive to program for
  Makes client applications more complicated
  No standardization
  Portability might be an issue
  Insufficient access control
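To see why eventual consistency complicates client code, consider a toy model of two replicas. This is a sketch under stated assumptions, not how Cassandra is implemented: `Replica` and `read_newest` are hypothetical names, and conflicts are assumed to be resolved by keeping the value with the highest timestamp.

```python
class Replica:
    """A toy replica holding key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Keep only the newest version of each key.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        return self.data.get(key)


def read_newest(replicas, key):
    """Ask every replica and return the value with the highest timestamp."""
    answers = [r.read(key) for r in replicas]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return max(answers, key=lambda a: a[0])[1]


r1, r2 = Replica(), Replica()
r1.write("k", "v1", timestamp=1)
r2.write("k", "v1", timestamp=1)
r1.write("k", "v2", timestamp=2)      # the update has not reached r2 yet

stale = r2.read("k")[1]               # a read served by r2 alone is stale
fresh = read_newest([r1, r2], "k")    # consulting both replicas finds the update
```

The application has to decide, per read, whether a possibly stale answer from one replica is acceptable or whether it should pay for consulting more of them.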