YCSB: Yahoo! Cloud Serving Benchmark
Scalable Distributed Systems
Antonio L. Severien [email protected]
João [email protected]
Overview
• Distributed Databases
• Cassandra
• HBase
• YCSB General View
• YCSB Details
• Amazon EC2
• YCSB Results
• YCSB Future
• Conclusions
• References
Distributed Databases
Traditional RDBMS
• ACID transactions
• Query language (SQL)
• Data tied to the modeling (hard to analyze)
• Scalable to a limit
Distributed Databases
• Not ACID
• Not Relational
• Column oriented (key-value)
• CAP (Consistency, Availability, Partition tolerance)
• Big Data (Massively scalable)
Distributed Databases
• Sherpa/PNUTS
• BigTable
• HBase, Hypertable, HTable
• Megastore
• Azure
• Cassandra
• Amazon Web Services
• S3, SimpleDB, EBS
• CouchDB
• Voldemort
• Dynomite
• Tokyo
• Redis
• MongoDB
Distributed Databases
• NoSQL databases have different designs and architectures
Cassandra: Thrift, Gossip, Token ring
…
HBase: HDFS, ZooKeeper, Hadoop (MapReduce)
BigTable: GFS, Chubby (Lock Service), MapReduce
Cassandra: Highlights
• High availability
• Incremental scalability
• Eventually consistent
• Tradeoffs between consistency and latency
• Minimal administration
• No SPOF (Single Point of Failure)
Cassandra: CAP-aware
• Cassandra favors Availability and Partition tolerance (AP) and is eventually consistent
• Providing strong Consistency in Cassandra increases latency
• Partitioning: token oriented
• Explicit replication: replication factor ≤ total nodes
• High-level clients: Python, Java, C#, .NET, Scala, Ruby, PHP, Erlang, Haskell, etc.
• Thrift driver-level interface
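The token-oriented partitioning and the replication-factor limit can be illustrated with a small sketch. This is not Cassandra's implementation: the MD5-based token, the `TokenRing` class and the node names are assumptions made for illustration. It only shows a key being hashed onto a ring and stored on the next `replication_factor` nodes clockwise, which is why the replication factor cannot exceed the number of nodes.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Toy token ring: each node owns one token; a key goes to the next token clockwise."""

    def __init__(self, nodes, replication_factor):
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication factor must be between 1 and the number of nodes")
        self.replication_factor = replication_factor
        # Sort (token, node) pairs so the ring can be searched with bisect.
        self.ring = sorted((token_for(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.ring]

    def replicas(self, key):
        """Return the nodes storing `key`: the owner plus the following nodes on the ring."""
        start = bisect.bisect_left(self.tokens, token_for(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication_factor)]


# Hypothetical three-node cluster with two copies of every key.
ring = TokenRing(["node-a", "node-b", "node-c"], replication_factor=2)
print(ring.replicas("user1000"))
```

Asking for more replicas than there are nodes raises `ValueError` in this sketch, mirroring the "replication factor ≤ total nodes" constraint.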
Cassandra: Data Model
• Cluster: the machines (nodes) in a logical Cassandra instance; can contain multiple keyspaces
• Keyspace: a namespace for ColumnFamilies
• ColumnFamilies: contain multiple columns, each with a name, value and timestamp, referenced by row keys; analogous to a table in an RDBMS
• SuperColumns: columns with subcolumns
• Rows
• Columns
keyA Column1 Column2 Column3
keyB Column5 Column6 Column10
Column
Byte[] Name
Byte[] Value
I64 Timestamp
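The column structure above (name, value, timestamp) and its grouping into rows can be sketched as follows. The `Column` and `ColumnFamily` classes are hypothetical names, not a Cassandra client API, and the rule that the newest timestamp wins on conflicting writes is stated here as an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # I64 in the slide's notation


@dataclass
class ColumnFamily:
    """Rows addressed by row key; each row maps a column name to a Column."""
    rows: Dict[bytes, Dict[bytes, Column]] = field(default_factory=dict)

    def insert(self, row_key: bytes, column: Column) -> None:
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        # Assumption: the write with the newest timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def get(self, row_key: bytes, name: bytes) -> Column:
        return self.rows[row_key][name]


users = ColumnFamily()
users.insert(b"keyA", Column(b"Column1", b"old", 1))
users.insert(b"keyA", Column(b"Column1", b"new", 2))
users.insert(b"keyA", Column(b"Column1", b"stale", 1))  # older timestamp, ignored
print(users.get(b"keyA", b"Column1").value)
```

Rows in the same column family need not share column names, which matches the `keyA` and `keyB` rows shown above.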
Cassandra
Partitioning Replication
HBase
“HBase is more a datastore than a database”
• It lacks many of the features of an RDBMS
• Distributed and scalable big data store.
• Regions model
• Strong consistency
HBase
Built on top of Hadoop Distributed Filesystem (HDFS)
HBase
• The NameNode is responsible for maintaining the filesystem metadata.
• The DataNodes are responsible for storing HDFS blocks.
Note: In our case study, we were only interested in the HDFS layer.
[Figure: HDFS architecture with the NameNode and the DataNodes]
HBase
• Data is stored into HBase tables.
• Tables are made of rows and columns.
• All columns belong to a particular column family.
Important note: All column family members are stored together.
• As a result, a query confined to a single column family performs better
YCSB General View
• Which is the best NoSQL DB?
• How to compare?
• Yahoo! Cloud Serving Benchmark (YCSB)
• Benchmarking tool
• Evaluates the performance of key-value and cloud DBs on a common set of workloads
• Client: an extensible workload generator
• Yahoo! Research
• Brian F. Cooper - [email protected]
• Joint work with Adam Silberstein, Erwin Tam, Raghu Ramakrishnan and Russell Sears
YCSB Details
• How does it work?
[Figure: YCSB Client architecture. Components: Workload Executor, Client Threads, Statistics and DB Interface Layer; the client connects to the Cloud Serving Store]
Workload file
• Read/write mix
• Record size
• Popularity distribution
• …
Command line
• DB to use
• Workload to use
• Target throughput
• Number of threads
• …
YCSB Details
Benchmark Tiers
• Performance: measure the latency/throughput curve; increase throughput until saturation
• Scalability
• Scale up: increase hardware, data size and throughput proportionally
• Elastic speedup: add servers while running a workload
YCSB Details
Load phase
- Load the database
$ ycsb load cassandra-10 -p hosts=127.0.0.1 -P workloadX
Transactions phase
- Executes the workload
$ ycsb run cassandra-10 -p hosts=127.0.0.1 -P workloadX
Random Load Distribution
YCSB Details
# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
# Application example: Session store recording recent actions
#
# Read/update ratio: 50/50
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
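To make the role of the proportion settings concrete, here is a minimal sketch of how a client could read such a workload file and pick each operation according to the configured mix. It is not YCSB's Java code: `parse_properties` and `choose_operation` are hypothetical helpers, and the zipfian choice of which record to touch is left out.

```python
import random

WORKLOAD_A = """
# Read/update ratio: 50/50
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
"""


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def choose_operation(props, rng):
    """Pick one operation name with probability equal to its configured proportion."""
    names = ["read", "update", "scan", "insert"]
    weights = [float(props.get(name + "proportion", 0)) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


props = parse_properties(WORKLOAD_A)
rng = random.Random(42)  # fixed seed so the sketch is repeatable
counts = {}
for _ in range(int(props["operationcount"])):
    op = choose_operation(props, rng)
    counts[op] = counts.get(op, 0) + 1
print(counts)
```

With workload A the counts split roughly evenly between reads and updates, and scans and inserts never occur because their proportions are 0.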
YCSB Details
• Execution parameters
$ ./bin/ycsb run cassandra-10 -P workloads/workloada -s -threads 10 -target 100 > transactions.dat
[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 1
[UPDATE], 95thPercentileLatency(ms), 1
[UPDATE], 99thPercentileLatency(ms), 1
[UPDATE], Return=0, 491
[UPDATE], 0, 464
[UPDATE], 1, 27
[UPDATE], 2, 0
[UPDATE], 3, 0
[UPDATE], 4, 0
...
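Each measurement line in this output has the shape `[SECTION], name, value`, so the file redirected to `transactions.dat` is easy to post-process. The sketch below is one possible parser; `parse_ycsb_output` is a hypothetical helper rather than part of YCSB, and the sample is a subset of the lines shown above.

```python
SAMPLE = """\
[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], 95thPercentileLatency(ms), 1
[UPDATE], 0, 464
[UPDATE], 1, 27
"""


def parse_ycsb_output(text):
    """Group '[SECTION], name, value' lines into {section: {name: value}}."""
    results = {}
    for line in text.splitlines():
        if not line.startswith("["):
            continue  # skip anything that is not a measurement line
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue
        section, name, value = parts
        results.setdefault(section.strip("[]"), {})[name] = float(value)
    return results


parsed = parse_ycsb_output(SAMPLE)
update = parsed["UPDATE"]
# The numeric names are histogram buckets: operations that completed in 0 ms, 1 ms, ...
print(update["0"] + update["1"] == update["Operations"])
```

In the run shown, the two non-empty buckets (464 and 27) add up to the 491 update operations.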
YCSB Details
$ ./bin/ycsb run basic -P workloads/workloada -P large.dat -s -threads 10 -target 100 -p measurementtype=timeseries -p timeseries.granularity=2000 > transactions.dat
[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 338
[UPDATE], Return=0, 50396
[UPDATE], 0, 0.10264765784114054
[UPDATE], 2000, 0.026989343690867442
[UPDATE], 4000, 0.0352882703777336
[UPDATE], 6000, 0.004238958990536277
[UPDATE], 8000, 0.052813085033008175
[UPDATE], 10000, 0.0
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 230
[READ], Return=0, 49604
[READ], 0, 0.08997245741099663
[READ], 2000, 0.02207505518763797
[READ], 4000, 0.03188493260913297
[READ], 6000, 0.004869141813755326
[READ], 8000, 0.04355329949238579
[READ], 10000, 0.005405405405405406
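The overall throughput reported in both runs is consistent with the number of operations divided by the run time in seconds. The short check below reproduces that arithmetic; `throughput_ops_per_sec` is a hypothetical helper, and the 1,000 operations assumed for the first run come from `operationcount=1000` in workload A rather than from the truncated output.

```python
import math


def throughput_ops_per_sec(operations, runtime_ms):
    """Overall throughput: completed operations divided by elapsed seconds."""
    return operations * 1000.0 / runtime_ms


# First run: 1,000 operations (assumed from operationcount) in 10,110 ms.
first = throughput_ops_per_sec(1000, 10110)

# Second run: 50,396 updates + 49,604 reads in 10,077 ms.
second = throughput_ops_per_sec(50396 + 49604, 10077)

print(math.isclose(first, 98.91196834817013, rel_tol=1e-9))
print(math.isclose(second, 9923.58836955443, rel_tol=1e-9))
```

The first run stays just under its `-target 100`, while the second run's reported throughput is far above the same target value.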
YCSB Details
Status Output
Amazon EC2 Configuration
Large Instance
• 7.5 GB memory
• 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
• 850 GB instance storage
• 64-bit platform
• I/O Performance: High
• API name: m1.large
Experiment Set-up
• Cassandra cluster: 3 nodes + 1 node (Elasticity)
• HBase cluster: 3 nodes
Amazon EC2 Usage
Load phase: 60,000,000 records of 1 KB
• Cassandra
• HBase
Amazon EC2 Usage
Transaction phase:
- 10,000 records
- 1,000,000 operations
- 250 threads
Cassandra
YCSB Cassandra Results: Update Heavy Workload (50/50)
[Figure: two panels, Update and Read. x-axis: Throughput (ops/sec), 0 to 6,000; y-axis: Average Latency (ms), 0 to 60]
YCSB HBase Results
[Figure: two panels, Update HBase 0.90.5 and Read HBase 0.90.5. x-axis: Throughput (ops/sec), 471 to 845; y-axis: Average Latency (ms), 0.00 to 0.30 for Update and 0.00 to 1200.00 for Read]
YCSB Cassandra Results
[Figure: Elasticity Cassandra 1.0. x-axis: Time (milliseconds), 0 to 400,000; y-axis: Latency (ms), 0 to 80,000]
YCSB Future
Provide statistics for:
- Availability
- Replication
Additional Distributed Databases
Currently supported:
- Cassandra
- MapKeeper
- MongoDB
- Redis
- Voldemort
- VMware vFabric GemFire
- HBase
Conclusions
• YCSB provides a common ground for benchmarking cloud DB services
• Good for learning and experimenting with different distributed databases
• Open source, extensible for new databases
• Laboratory with Amazon EC2 provided good insight into setting up cloud services
• Challenges:
• Installation problems
• Hard-to-follow documentation
• Working in a distributed environment requires lots of configuration
References
• YCSB (Yahoo! Cloud Serving Benchmark): https://github.com/brianfrankcooper/YCSB/wiki
• Yahoo! Research: http://research.yahoo.com/Web_Information_Management/YCSB
• BigTable: http://en.wikipedia.org/wiki/BigTable
• Cassandra: http://wiki.apache.org/cassandra/
• HBase: http://hbase.apache.org/
Questions