Four Orders of Magnitude: Running Large Scale Accumulo Clusters
Aaron Cordova, Accumulo Summit, June 2014
Scale, Security, Schema
Scale
to scale (1) - (vt) to change the size of something
“let’s scale the cluster up to twice the original size”
to scale (2) - (vi) to function properly at a large scale
“Accumulo scales”
What is Large Scale?
Notebook Computer
• 16 GB DRAM
• 512 GB Flash Storage
• 2.3 GHz quad-core i7 CPU
Modern Server
• 100s of GB DRAM
• 10s of TB on disk
• 10s of cores
Large Scale
[Chart: data sizes in RAM and on disk, from laptop and single server through 10-, 100-, 1,000-, and 10,000-node clusters, spanning 10 GB to 100 PB]
Data Composition
[Chart: data volume by month, January through April (0 to 180), broken out as Original Raw, Derivative, QFDs, and Indexes]
Accumulo Scales
• From GB to PB, Accumulo keeps two things low:
• Administrative effort
• Scan latency
Scan Latency
[Chart: scan latency in seconds (0 to 0.05) vs. cluster size (0 to 1,000 nodes)]
Administrative Overhead
[Chart: failed machines vs. admin interventions (0 to 12) as cluster size grows from 0 to 1,000 nodes]
Accumulo Scales
• From GB to PB, three things grow linearly:
• Total storage size
• Ingest Rate
• Concurrent scans
Ingest Benchmark
[Chart: ingest rate in millions of entries per second (0 to 100) vs. cluster size (0 to 1,000 nodes)]
AWB Benchmark
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf
1000 machines
100 M entries written per second
408 terabytes
7.56 trillion total entries
Graph Benchmark
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
1200 machines
4.4 trillion vertices
70.4 trillion edges
149 M edges traversed per second
1 petabyte
Graph Analysis
[Chart: graph sizes in billions of edges, log scale: Twitter 1.5, Yahoo! 6.6, Facebook 1,000, Accumulo 70,000]
Accumulo is designed after Google’s BigTable
BigTable powers hundreds of applications at Google
BigTable serves 2+ exabytes
http://hbasecon.com/sessions/#session33
600 M queries per second organization wide
From 10 to 10,000
Starting with ten machines (10^1)
One rack
1 TB RAM
10-100 TB Disk
Hardware failures rare
Test Application Designs
Designing Applications for Scale
Keys to Scaling
1. Live writes go to all servers
2. User requests are satisfied by few scans
3. Turning updates into inserts
Keys to Scaling
Writes on all servers Few Scans
Hash / UUID Keys

RowID Col Value
af362de4 Bob
b23dc4be Annie
b98de2ff Joe
c48e2ade $30
c7e43fb2 $25
d938ff3d 32
e2e4dac4 59
e98f2eab3 43
Key Value
userA:name Bob
userA:age 43
userA:account $30
userB:name Annie
userB:age 32
userB:account $25
userC:name Joe
userC:age 59
Uniform writes
Monitor: Participating Tablet Servers
MyTable
Servers Hosted Tablets … Ingest
r1n1 1500 200k
r1n2 1501 210k
r2n1 1499 190k
r2n2 1500 200k
Hash / UUID Keys

RowID Col Value
af362de4 Bob
b23dc4be Annie
b98de2ff Joe
c48e2ade $30
c7e43fb2 $25
d938ff3d 32
e2e4dac4 59
e98f2eab3 43
3 x 1-entry scans on 3 servers
get(userA)
Keys to Scaling
Writes on all servers Few Scans
Hash / UUID Keys
Group for Locality

Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
userA:name Bob
userA:age 43
userA:account $30
userB:name Annie
userB:age 32
userB:account $25
userC:name Joe
userC:age 59
RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
Still fairly uniform writes
Group for Locality

RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
1 x 3-entry scan on 1 server
get(userA)
Keys to Scaling
Writes on all servers Few Scans
Grouped Keys
Temporal Keys

Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
RowID Col Value
20140101 44
20140102 22
20140103 23
Temporal Keys

Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
Temporal Keys

Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
Always write to one server
No write parallelism
Temporal Keys

RowID Col Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
Fetching ranges uses few scans
get(20140101 to 201404)
Keys to Scaling
Writes on all servers Few Scans
Temporal Keys
Binned Temporal Keys

Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
RowID Col Value
0_20140101 44
1_20140102 22
2_20140103 23
Uniform Writes
Binned Temporal Keys

Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
RowID Col Value
0_20140101 44
0_20140104 25
1_20140102 22
1_20140105 31
2_20140103 23
2_20140106 27
Uniform Writes
Binned Temporal Keys

Key Value
userA:name Bob
userA:age 43
userB:name Annie
userB:age 32
userC:name Fred
userC:age 29
userD:name Joe
userD:age 59
Key Value
20140101 44
20140102 22
20140103 23
20140104 25
20140105 31
20140106 27
20140107 25
20140108 17
RowID Col Value
0_20140101 44
0_20140104 25
0_20140107 25
1_20140102 22
1_20140105 31
1_20140108 17
2_20140103 23
2_20140106 27
Uniform Writes
Binned Temporal Keys

RowID Col Value
0_20140101 44
0_20140104 25
0_20140107 25
1_20140102 22
1_20140105 31
1_20140108 17
2_20140103 23
2_20140106 27
One scan per bin
get(20140101 to 201404)
Keys to Scaling
Writes on all servers Few Scans
Binned Temporal Keys
Keys to Scaling
• Key design is critical
• Group data under common row IDs to reduce scans
• Prepend bins to row IDs to increase write parallelism
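The binned temporal key scheme above can be sketched in Python (hypothetical helper names; a plain dict stands in for the Accumulo table, and a real client would build these row IDs into Java Mutations):

```python
NUM_BINS = 3

def binned_row_id(date_str):
    # Prepend a stable bin ID so consecutive dates land in different
    # tablet ranges, spreading writes across servers.
    bin_id = int(date_str) % NUM_BINS  # any stable assignment works
    return "%d_%s" % (bin_id, date_str)

def scan_date_range(table, start, end):
    # Issue one range scan per bin, then merge results back into date order.
    results = []
    for b in range(NUM_BINS):
        lo, hi = "%d_%s" % (b, start), "%d_%s" % (b, end)
        results.extend(kv for kv in sorted(table.items()) if lo <= kv[0] <= hi)
    return sorted(results, key=lambda kv: kv[0].split("_", 1)[1])
```

Writes spread over NUM_BINS tablet ranges; a date-range query costs one scan per bin, which is the tradeoff the slides illustrate.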
Splits
• Pre-split or organic splits
• Going from dev to production, can ingest a representative sample, obtain split points and use them to pre-split a larger system
• Hundreds or thousands of tablets per server is ok
• Want at least one tablet per server
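Obtaining split points from a representative sample can be sketched like this (hypothetical helper; in a Java client the resulting points would be passed to TableOperations.addSplits):

```python
def split_points(sample_keys, num_tablets):
    # Pick num_tablets - 1 evenly spaced keys from a sorted sample;
    # these become the tablet boundaries used to pre-split the table.
    keys = sorted(sample_keys)
    step = len(keys) / num_tablets
    return [keys[int(i * step)] for i in range(1, num_tablets)]
```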
Effect of Compression
• Similar sorted keys compress well
• May need more data than you think to auto-split
Inserts are fast: 10s of thousands per second per machine
Updates *can* be …
Update Types
• Overwrite
• Combine
• Complex
Update - Overwrite
• Performance same as insert
• Ignore (don’t read) existing value
• Accumulo’s Versioning Iterator does the overwrite
Update - Overwrite

RowID Col Value
af362de4 name Annie
af362de4 age 32
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:age -> 34
Update - Overwrite

RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:age -> 34
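The overwrite-by-insert behavior can be sketched in Python, mimicking what the Versioning Iterator (configured to keep one version) does at read time (hypothetical helper names):

```python
def latest_versions(entries):
    # entries: (row, col, timestamp, value) tuples, one per insert.
    # For each cell keep only the value with the greatest timestamp,
    # so a later insert silently overwrites an earlier one.
    latest = {}
    for row, col, ts, value in entries:
        if (row, col) not in latest or ts > latest[(row, col)][0]:
            latest[(row, col)] = (ts, value)
    return {cell: value for cell, (ts, value) in latest.items()}
```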
Update - Combine
• Things like X = X + 1
• Normally one would have to read the old value to do this, but Accumulo Iterators allow multiple inserts to be combined at scan time or at compaction time
• Performance is same as inserts
Update - Combine

RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:account -> +10
Update - Combine

RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
af362de4 account $10
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
userB:account -> +10
Update - Combine

RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $25
af362de4 account $10
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
getAccount(userB) $35
Update - Combine
After compaction
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
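The combine-at-read behavior can be sketched like this, standing in for a summing combiner configured on the account column (hypothetical helper names; real combiners are Java Iterators applied at scan and compaction time):

```python
def summed_view(entries):
    # entries: (row, col, number) tuples, one per insert.
    # Sum all inserts for each cell, so an update is just another
    # insert and never requires reading the old value first.
    totals = {}
    for row, col, value in entries:
        totals[(row, col)] = totals.get((row, col), 0) + value
    return totals
```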
Update - Complex
• Some updates require looking at more data than Iterators have access to - such as multiple rows
• These require reading the data out in order to write the new value
• Performance will be much slower
Update - Complex

userC:account = getBalance(userA) + getBalance(userB)
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
c48e2ade account $40
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
35+30 = 65
Update - Complex

userC:account = getBalance(userA) + getBalance(userB)
RowID Col Value
af362de4 name Annie
af362de4 age 34
af362de4 account $35
c48e2ade name Joe
c48e2ade age 59
c48e2ade account $65
e2e4dac4 name Bob
e2e4dac4 age 43
e2e4dac4 account $30
35+30 = 65
Planning a Larger-Scale Cluster (10^2 - 10^4)
Storage vs Ingest
[Chart: storage in terabytes (1x1TB vs. 12x3TB disk configurations, 10 TB to 120,000 TB) and ingest rate in millions of entries per second vs. number of machines, 10 to 10,000]
Model for Ingest Rates
A = 0.85^(log2 N) * N * S
N - Number of machines
S - Single server throughput (entries / second)
A - Aggregate cluster throughput (entries / second)
Doubling the cluster size multiplies the aggregate write rate by about 1.7x (85% efficiency per doubling)
Estimating Machines Required
N = 2^(log2(A/S) / 0.7655347)
N - Number of machines
S - Single server throughput (entries / second)
A - Target aggregate throughput (entries / second)
Doubling the cluster size multiplies the aggregate write rate by about 1.7x (85% efficiency per doubling)
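Both formulas can be written as runnable code (0.7655347 is log2(1.7), which is what makes the two expressions inverses of each other):

```python
import math

def aggregate_rate(n, s):
    # A = 0.85^(log2 N) * N * S: aggregate ingest rate for n machines
    # each capable of s entries per second on their own.
    return (0.85 ** math.log2(n)) * n * s

def machines_needed(a, s):
    # N = 2^(log2(A/S) / 0.7655347): the model above solved for N.
    return 2 ** (math.log2(a / s) / 0.7655347)
```

Doubling n multiplies aggregate_rate by 0.85 * 2 = 1.7.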
Predicted Cluster Sizes
[Chart: predicted number of machines (0 to 12,000) vs. target millions of entries per second (0 to 600)]
100 Machines (10^2)
Multiple racks
10 TB RAM
100 TB - 1PB Disk
Some hardware failures in the first week (burn-in)
Expect 3 failed HDs in first 3 mo
Another 4 within the first year
http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf
Can store and index the Common Crawl Corpus
2.8 Billion web pages 541 TB
commoncrawl.org
One year of Twitter: 182 billion tweets
483 TB
http://www.sec.gov/Archives/edgar/data/1418091/000119312513390321/d564001ds1.htm
Deploying an Application
[Diagram: Users -> Clients -> Tablet Servers]
May not see the effect of writing to disk for a while
1000 machines (10^3)
Multiple rows of racks
100 TB RAM
1-10 PB Disk
Hardware failure is a regular occurrence
Hard drive failure about every 5 days on average
Failures will be skewed towards the beginning of the year
Can traverse the ‘brain graph’ 70 trillion edges, 1 PB
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Facebook Graph 1s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
Netflix Video Master Copies 3.14 PB
http://www.businessweek.com/articles/2013-05-09/netflix-reed-hastings-survive-missteps-to-join-silicon-valleys-elite
World of Warcraft Backend Storage 1.3 PB
http://www.datacenterknowledge.com/archives/2009/11/25/wows-back-end-10-data-centers-75000-cores/
Webpages, live on the Internet 14.3 Trillion
http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
Things like the difference between two compression algorithms start to make a big difference
Use range compactions to effect changes on portions of a table
Lay off ZooKeeper
Watch Garbage Collector and NameNode ops
Garbage collection > 5 minutes?
Start thinking about NameNode Federation
Accumulo 1.6
Multiple NameNodes
[Diagram: Accumulo running over two NameNodes, each with its own DataNodes]
Multiple HDFS Clusters

Multiple NameNodes
[Diagram: Accumulo running over two NameNodes sharing one set of DataNodes]
Multiple NameNodes, shared DataNodes (Federation; requires Hadoop 2.0)
More NameNodes = higher risk of one going down.
Can use HA NameNodes in conjunction with Federation
10,000 machines (10^4)
You, my friend, are here to kick a** and chew bubble gum
1 PB RAM
10-100 PB Disk
1 hardware failure every hour on average
Entire Internet Archive 15 PB
http://www.motherjones.com/media/2014/05/internet-archive-wayback-machine-brewster-kahle
A year’s worth of data from the Large Hadron Collider
15 PB
http://home.web.cern.ch/about/computing
0.1% of all Internet traffic in 2013 43.6 PB
http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html
Facebook Messaging Data 10s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
Facebook Photos 240 billion
High 10s of PB
http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
Must use multiple NameNodes
Can tune back heartbeats and the periodicity of central processes in general
Can combine multiple PB data sets
Up to 10 quadrillion entries in a single table
While maintaining sub-second lookup times
Only with Accumulo 1.6
Dealing with data over time
Data Over Time - Patterns
• Initial Load
• Increasing Velocity
• Focus on Recency
• Historical Summaries
Initial Load
• Get a pile of old data into Accumulo fast
• Latency not important (data is old)
• Throughput critical
Bulk Load RFiles
[Diagram: MapReduce writes RFiles, which are bulk loaded into Accumulo]
Increasing velocity
If your data isn’t big today, wait a little while
Accumulo scales up dynamically, online. No downtime
This is the first sense of 'to scale': changing size
Scaling Up
[Diagram: Clients -> Accumulo Tablet Servers -> HDFS Data Nodes]
• 3 physical servers, each running a Tablet Server process and a Data Node process
• Start 3 new Tablet Server processes and 3 new Data Node processes
• The master immediately assigns tablets to the new Tablet Servers
• Clients immediately begin querying the new Tablet Servers
• New Tablet Servers read data from old Data Nodes
• New Tablet Servers write data to new Data Nodes
Never really seen anyone do this
Except myself
20 machines in Amazon EC2
to 400 machines
all during the same MapReduce job reading data out of Accumulo, summarizing, and writing back
Scaled back down to 20 machines when done
Just killed Tablet Servers
Decommissioned Data Nodes for safe data consolidation to the remaining 20 nodes
Other ways to go from 10^x to 10^(x+1)
Accumulo Table Export
followed by HDFS DistCP to new cluster
Maybe new replication feature
Newer Data is Read more Often
Accumulo keeps newly written data in memory
Block Cache can keep recently queried data in memory
Combining Iterators make it easy to maintain summaries of large amounts of raw events
Reduces storage burden
Historical Summaries
[Chart: unique entities stored vs. raw events processed, April through July (0 to 8,000)]
Age-off iterator can automatically remove data over a certain age
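What the age-off iterator does can be sketched in a few lines (hypothetical helper names; the real AgeOffFilter runs as a Java Iterator at scan and compaction time):

```python
def age_off(entries, ttl_millis, now_millis):
    # entries: (key, timestamp_millis, value) tuples.
    # Drop anything older than the configured time-to-live.
    cutoff = now_millis - ttl_millis
    return [e for e in entries if e[1] >= cutoff]
```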
IBM estimates 2.5 exabytes of data are created every day
http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
90% of available data created in last 2 years
http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
25 new 10k node Accumulo clusters per day
Accumulo is doing its part to get in front of the big data trend
Questions ?
@aaroncordova