An in-depth exploration of Priam, a sidekick application that helps Cassandra run inside Amazon's cloud.
Running Cassandra in the Cloud:
An Introduction to Priam
Jason Brown
@jasobrown
jasedbrown@gmail.com
www.linkedin.com/in/jasedbrown
About me
● Senior Software Engineer, Netflix
● Apache Cassandra committer
● E-Commerce Architect, Major League Baseball Advanced Media
● Wireless developer (J2ME and BREW)
Netflix Databases
● Oracle in the datacenter
● Migrate to EC2
  ○ SimpleDB at first
  ○ Cassandra
Cassandra meet EC2
● shell script(s)
● python scripts
● backup / restore
● centralized model
● installing Python 2.7 broke CentOS yum
● first time we ran it in prod, my cluster was destroyed
Hello, Priam!
Priam, the father of Cassandra (http://en.wikipedia.org/wiki/Priam)
Java web app
● Token Assignment
● Backup / Restore
● Multi-region support
● Configuration management
Branches
Each Priam branch corresponds to a C* version
● priam 1.1 -> c* 1.1
● priam master -> c* 1.2
● ??? -> c* trunk
Token Assignment
● Cassandra needs an assigned token
● Priam tries to
  ○ replace a dead instance
  ○ join as a new node
● External storage for known cluster members
  ○ host name / IP addr / instance id
  ○ token
  ○ region / availability zone
Replacing a dead node
● Get known nodes in region/AZ from storage
  ○ {A, B, C}
● Get live nodes in region/AZ from ASG API
  ○ {A, B}
● Take over a dead node's token
  ○ C
● Uses C*'s replace_token
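The comparison above amounts to a set difference between the nodes registered in storage and the nodes the ASG reports as live. A minimal sketch in Java, assuming both sets have already been fetched; the class and method names are hypothetical and are not Priam's actual API:

```java
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

public class DeadNodeFinder {
    /**
     * Returns a node that is registered in storage but not reported live,
     * or an empty Optional when every known node is alive.
     */
    public static Optional<String> findDeadNode(Set<String> known, Set<String> live) {
        // Sorted copy so the choice is deterministic when several nodes are dead.
        Set<String> dead = new TreeSet<>(known);
        dead.removeAll(live);
        return dead.stream().findFirst();
    }

    public static void main(String[] args) {
        Optional<String> dead = findDeadNode(Set.of("A", "B", "C"), Set.of("A", "B"));
        System.out.println(dead.orElse("none")); // prints C
    }
}
```

When the result is empty there is nothing to replace, and the instance would fall through to the "join as a new node" path.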
Joining as a new node
● Calculate token
  ○ per-region offset
  ○ determine 'slot' in region/AZ
  ○ derive token
Region hash offset
● Each region needs a different base offset
  ○ avoids token collisions
int hash = "us-east-1".hashCode();
Determining slot
A new node takes the next numbered slot in its AZ
● looks for other registered nodes in SimpleDB

Node Slotting Layout
+--------+--------+--------+
| zone A | zone B | zone C |
+--------+--------+--------+
|   0    |   1    |   2    |
+--------+--------+--------+
|   3    |   4    |   5    |
+--------+--------+--------+
|   6    |   7    |   8    |
+--------+--------+--------+
|   9    |   10   |   11   |
+--------------------------+
(ascii art rocks)
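In the layout above, slots are numbered round-robin across zones, so a zone with index z (of n zones) owns slots z, z + n, z + 2n, and so on. A rough sketch of picking a slot, assuming the set of already-registered slots has been read from storage and that "next" means the lowest free slot in the zone; `SlotFinder` and `nextSlot` are illustrative names, not Priam's:

```java
import java.util.Set;

public class SlotFinder {
    /**
     * Returns the lowest free slot belonging to the given zone.
     * With three zones, zone index 0 owns slots 0, 3, 6, ...; index 1 owns 1, 4, 7, ...
     */
    public static int nextSlot(int zoneIndex, int zoneCount, Set<Integer> takenSlots) {
        int slot = zoneIndex;
        while (takenSlots.contains(slot)) {
            slot += zoneCount; // step down one row in the same zone column
        }
        return slot;
    }

    public static void main(String[] args) {
        // Zone B (index 1) with slots 0 through 4 taken: 1 and 4 are used, so 7 is next.
        System.out.println(nextSlot(1, 3, Set.of(0, 1, 2, 3, 4))); // prints 7
    }
}
```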
Here's your token
MAXIMUM_TOKEN
    .divide(regionNodeCount)
    .multiply(mySlot)
    .add(regionHashOffset);

example:
100 / 10 (ten nodes in region) * 3 (in slot three) + 12 (region hash offset) = 42
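The formula can be tried out with `java.math.BigInteger`. This is a sketch under stated assumptions: the method name and parameter types are made up for illustration, and the maximum token is passed in rather than hard-coded because its value is not given here.

```java
import java.math.BigInteger;

public class TokenCalculator {
    /** token = (maxToken / regionNodeCount) * slot + regionHashOffset */
    public static BigInteger token(BigInteger maxToken, int regionNodeCount,
                                   int slot, int regionHashOffset) {
        return maxToken
                .divide(BigInteger.valueOf(regionNodeCount))
                .multiply(BigInteger.valueOf(slot))
                .add(BigInteger.valueOf(regionHashOffset));
    }

    public static void main(String[] args) {
        // Toy numbers from the example: token space 100, ten nodes, slot 3, offset 12.
        System.out.println(token(BigInteger.valueOf(100), 10, 3, 12)); // prints 42
    }
}
```

Because the division happens first, each slot gets an evenly sized share of the token space, and the per-region offset shifts every token in a region by the same amount.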
Seeds
● first node in each AZ, in every region
● except if current node is in the first slot
  ○ seeds cannot auto bootstrap
Multi-region communication
AWS security groups block ingress requests
Intra-region: whitelist by other in-region SG
Inter-region: whitelist by IP address
  ○ must use public IP address!
Whitelisting IP address
● Seed nodes compare
  ○ IP addresses in the current region's SG
  ○ entries in SimpleDB database
● Add new nodes' IP addresses to SG
● Remove dead nodes' IP addresses from SG
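The reconciliation above can be pictured as two set differences. A minimal sketch, assuming the security group's IP addresses and the registered node addresses are already loaded into sets; the class and method names are hypothetical, and the actual AWS calls are left out:

```java
import java.util.Set;
import java.util.TreeSet;

public class SecurityGroupDiff {
    /** Addresses registered in storage but missing from the security group. */
    public static Set<String> toAdd(Set<String> sgIps, Set<String> registeredIps) {
        Set<String> result = new TreeSet<>(registeredIps);
        result.removeAll(sgIps);
        return result;
    }

    /** Addresses still in the security group that no registered node uses. */
    public static Set<String> toRemove(Set<String> sgIps, Set<String> registeredIps) {
        Set<String> result = new TreeSet<>(sgIps);
        result.removeAll(registeredIps);
        return result;
    }

    public static void main(String[] args) {
        Set<String> sg = Set.of("203.0.113.1", "203.0.113.2");
        Set<String> registered = Set.of("203.0.113.2", "203.0.113.3");
        System.out.println("add: " + toAdd(sg, registered));       // add: [203.0.113.3]
        System.out.println("remove: " + toRemove(sg, registered)); // remove: [203.0.113.1]
    }
}
```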
                        ++
       us-east-1        ||        eu-west-1
   +-------------+      ||
   |  simpleDB   |      ||
   +-------------+      ||
                        ||
                 +--+   ||   +--+
                 |S |   ||   |S |
                 |e |   ||   |e |
                 |c |   ||   |c |
   +----------+  |G |   ||   |G |  +----------+
   |   c* 1   |  |r |   ||   |r |  |   c* 2   |
   +----------+  |p |   ||   |p |  +----------+
                 |  |   ||   |  |
                 |1 |   ||   |2 |
                 +--+   ||   +--+
                        ||
                        ++
Backup
Two types:
● Snapshot
  ○ invokes nodetool snapshot
  ○ once a day, cron-like
● Incremental
  ○ copy all newly flushed sstables
Backup location
Upload to S3 bucket in same region
Bucket lifecycle rules
● configure TTL for data
Backup path
Bucket: netflix-cassandra-data
Path: base dir / region / cluster name / token / snapshot time / [SNP | SST | META] / keyspace / column family / data file
example: test_backup/us-east-1/cass_jasobrown/42/1234567/SNP/jasobrown/dog/jasobrown-dog-ja-1-Data.db
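As an illustration of the path layout, the components can simply be joined with slashes. This is only a sketch; `BackupPath.build` is a hypothetical helper, not a Priam class:

```java
public class BackupPath {
    /** Joins the path components in the order listed above, separated by '/'. */
    public static String build(String baseDir, String region, String clusterName,
                               String token, String snapshotTime, String type,
                               String keyspace, String columnFamily, String dataFile) {
        return String.join("/", baseDir, region, clusterName, token, snapshotTime,
                type, keyspace, columnFamily, dataFile);
    }

    public static void main(String[] args) {
        // Reproduces the example path for a snapshot (SNP) file.
        System.out.println(build("test_backup", "us-east-1", "cass_jasobrown", "42",
                "1234567", "SNP", "jasobrown", "dog", "jasobrown-dog-ja-1-Data.db"));
    }
}
```

Putting region, cluster name, and token ahead of the snapshot time keeps all backups for one node under a single prefix.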
Restore
● best with same size cluster as source
● best if tokens match with source

Uses (besides the obvious)
● prod to test refresh
● reproduce prod data problems
● incremental restore - WIP
Configuration Management
Control aspects of Priam and C*
● yaml
● startup script(s) env values
Netflix needs this as we have ~55 production clusters, with slightly different configs
So, does Netflix actually use Priam?
55 production clusters, > 750 nodes
Internal extensions
● Hook into internal DNS, properties systems
● Alternative storage to SimpleDB
● BI messaging integration - WIP
● C* JMX monitoring
Monitoring
● Poll C* every 60 seconds
● selected JMX metrics
● publish to internal metrics aggregator
  ○ currently uses Netflix's OSS Servo library (github.com/Netflix/servo)
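The polling loop above could be sketched with the standard `javax.management` API and a scheduled executor. This is an assumption-laden illustration: it reads the JVM's own `java.lang:type=Runtime` MBean as a stand-in for a Cassandra metric, prints instead of publishing to an aggregator, and does not use Servo.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class JmxPoller {
    /** Reads one attribute from one MBean over the given connection. */
    public static Object read(MBeanServerConnection conn, String bean, String attribute)
            throws Exception {
        return conn.getAttribute(new ObjectName(bean), attribute);
    }

    public static void main(String[] args) {
        // Local platform MBean server; a remote C* node would need a JMX connector instead.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                Object uptime = read(conn, "java.lang:type=Runtime", "Uptime");
                System.out.println("uptime_ms=" + uptime); // stand-in for publishing a metric
            } catch (Exception e) {
                e.printStackTrace(); // log and keep polling on the next tick
            }
        }, 0, 60, TimeUnit.SECONDS);
    }
}
```

Catching exceptions inside the task matters here, because an uncaught exception would cancel all later runs of a `scheduleAtFixedRate` task.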
Next directions
Commit log backups
Datastax Enterprise support
● security
● solr
● configuration
c* 1.2 virtual nodes (a/k/a vnodes)
auto scaling
Thank you!
Q & A time
@jasobrown