Scalable Table Stores: Tools for Understanding Advanced Key-Value Systems for Hadoop
Garth Gibson Professor, Carnegie Mellon Univ., & CTO, Panasas Inc.
with Julio Lopez, Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao (CMU); to appear in SoCC, October 2011
with Wittawat Tantisiriroj, Swapnil Patil (CMU) and Seung Woo Son, Sam Lang, Rob Ross (Argonne National Lab); to appear in SC11, November 2011
The Future is Data-Led
• NIST machine translation competition: translate 100 articles, Arabic to English
• 2005 outcome: Google wins, qualitatively better on its first entry
• Brute force statistics with more data & compute: 200M words from UN translations, 1 billion words of English grammar, a 1000-processor cluster
[Figure: BLEU scores (0.0-0.7) of the entrants (Google, ISI, IBM+CMU, UMD, JHU+CU, Edinburgh, Systran, Mitre, FSC), annotated with quality levels from useless, through topic identification, usable translation, and human-editable translation, up to expert human translator]
Source: IEEE Intelligent Systems, March/April 2009
Science of Many Types is Data-Led

Contact | Field | Comments
J Lopez, CSD | Astrophysics | SDSS digital sky survey including spectroscopy, 50TB
T Di Matteo, Physics | Astrophysics | Bigben BHCosmo hydrodynamics (1B particles simulated), 30TB
F Gilman, Physics | Astrophysics | Large Synoptic Survey Telescope, LSST (2012) digital sky survey, 15TB/day
C Langmead, CSD | Biology | Xray, NMR, CryoEM images; sim'd molecular dynamics trajectories
J Bielak, CE | Earth sciences | USGS sensor images; sim'd 4D earthquake wavefields, >10TB/run
D Brumley, ECE | Cyber security | Worldwide Malware Archive; 2TB, doubling each year
O Mutlu, ECE | Genomics | 50GB per compressed genome sequencing; expands to TBs to process
B Yu, ECE | Neuroscience | Neural recordings (electrodes, optical) for prosthetics; 10-100GB each
J Callan, LTI | Info Retrieval | ClueWeb09, 25TB, 1B high-rank web pages, 10 languages
T Mitchell, MLD | Machine Learning | English sentences of ClueWeb for continuous automated reading (5TB)
M Hebert, RI | Image Understanding | Flickr archive (>4TB); broadcast TV archive; street video; soldier video
Y Sheikh, RI | Virtual Reality | Terascale VR sensor: 1,000 cameras + 200 microphones, up to 5TB/sec
C Guestrin, CSD | Machine Learning | Blog update archives, 2TB now + 2.7TB/yr (about 500K blogs/day)
C Faloutsos, CSD | Data Mining | Wikipedia change archive (1TB), fly embryo images (1.5TB), links from Yahoo web
S Vogel, LTI | Machine Translation | Pre-filtered N-gram language model based on statistics on word alignment, 100TB
J Baker, LTI | Machine Translation | Spoken language recording archive, many languages, many sources, up to 1PB
B Becker, RI | Computer Vision | Social network image/video archive for training computer vision systems, 1-5TB
CMU PDL History of Scalable Storage
• 1995: DARPA funds Network-Attached Secure Disks (NASD)
• 1999+: NASD spin-offs
  – Object Storage Device standardized by T10/SCSI in 2004, 2009
  – Panasas parallel storage system, Gibson co-founder & CTO
    • Primary storage on the first petascale computer: LANL Roadrunner
    • Also: NIH, Citadel, ING, BNP, BP, ConocoPhillips, PetroChina, StatOil, Ferrari, BMW, 3M, Lockheed Martin, Northrop Grumman, Sandia, NASA
• Lustre Linux open-source parallel file system
  – With Panasas, Lustre & PVFS, 3/4 of top500.org machines are object-based
• Graduates go to storage, server & internet companies
  – e.g., Google File System (2003) & BigTable (2006) cloud database
• Parallel NFS achieves IETF RFC in 2010, spurred on by Panasas
  – Linux adoption in 2.6.39, 3.0 and 3.1 (2011)
For the Experience, Operate A Cloud
• Two clusters: 3 TF, 2.2 TB, 142 nodes, 1.1K cores, ½ PB
• Available to CMU eScience users as a Hadoop queue:
  – IR, ML classes
  – ML research
  – Comp bio research
  – Astro research
  – Geo research
  – Malware analysis
  – Social network analysis
  – Systems research
PDL & OpenCirrus

[Figure: PDL cluster topology. CMU OpenCloud: two logical racks of 32 worker nodes plus 6-7 RAID-protected storage nodes each, connected over 38-39 SFP+ twinax 10GE links to 48-port 10-GE switches joined by 6x 10GE trunks. CMU OpenCirrus: four logical racks of 19-20 worker nodes, each on 2 x 1GE links to 1-GE-down/10-GE-up switches with 2x 10GE trunks into a 24-port 10-GE switch. An external switch ties the clusters together over 2 x 10GE SR optical links, with a 10 Gbps LR optical link to NLR reaching the other OpenCloud sites.]
To Understand: Cloud FS vs. Parallel FS
To be published in SC11, November 2011
• Hadoop's storage library, HDFS, is replaceable
• Replace it with PVFS, a user-level parallel FS, to understand the differences
• A shim layer gives PVFS the HDFS behaviors Hadoop expects:
  – Buf: prefetching (HDFS is write-once, so add deep prefetch)
  – Map: layout exposure (map each stripe unit to its node for optimized task launch)
  – Rep: replicate data (no HW RAID!)
Replication inside a PVFS file
• PVFS, like most cluster/parallel file systems, assumes RAID hardware
• HDFS, like GoogleFS, avoids RAID hardware because of how it scales
• Teach the PVFS client to replicate internally (a hybrid approximating HDFS), as sketched below
• Code is not production quality: the error path is too hard for academics :-)
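The client-side replication idea is easiest to see in miniature. Below is a hedged Java sketch, not the actual shim code: ServerConn is a hypothetical stand-in for a PVFS server connection, and the real client replicates inside a PVFS file with far more careful error handling (the caveat above).

```java
import java.io.IOException;

// Illustration of client-side replication: instead of relying on RAID
// hardware, the client writes each chunk to every replica server before
// acknowledging the write. 'ServerConn' is a hypothetical interface, not
// part of PVFS; the real shim's error handling is much more involved.
public class ReplicatingWriter {
    interface ServerConn {
        void writeChunk(long offset, byte[] data) throws IOException;
    }

    private final ServerConn[] replicas; // e.g., 3 servers per chunk

    ReplicatingWriter(ServerConn[] replicas) { this.replicas = replicas; }

    void write(long offset, byte[] data) throws IOException {
        for (ServerConn server : replicas) {
            server.writeChunk(offset, data); // all replicas before ack
        }
    }
}
```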
Interesting Implementation Issues
• HDFS performance is disk-bound by chunk creation
• PVFS has insufficient parallelism in a single stream
Differences Not Visible in Apps
• OpenCloud apps: astrophysics, social network analysis
• Hadoop helps:
  – Job scheduler does load balancing
  – Dataset is a directory of files
Scalable Table Stores
• Inspired by Google's BigTable
• Reported to scale to >76 PB in one "database" and >10M operations/sec
• B-tree with giant nodes
• Data model is dynamic: lots of columns, strings everywhere
• Writeback of mutations written as sorted, indexed log files
• Read misses search all logs: log-structured merge (LSM) trees, as in the sketch below
• Layered on GFS (HDFS)
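The write/read asymmetry above is the log-structured merge idea. A toy Java sketch (not any particular system's code) makes it concrete: writes land in a sorted in-memory buffer, flushes produce immutable sorted runs, and a read miss searches runs from newest to oldest. Real systems add on-disk indexes, bloom filters, and compaction of the runs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy LSM store: a sorted memtable absorbs mutations; when full it is
// frozen into a run (standing in for a sorted, indexed log file); reads
// that miss the memtable search all runs, newest first.
public class TinyLsm {
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final Deque<SortedMap<String, String>> runs = new ArrayDeque<>();
    private final int flushThreshold;

    public TinyLsm(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            runs.addFirst(memtable);   // newest run first
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memtable.get(key);  // hit in the write buffer?
        if (v != null) return v;
        for (SortedMap<String, String> run : runs) { // miss: search all logs
            v = run.get(key);
            if (v != null) return v;   // newest run wins
        }
        return null;
    }
}
```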
Extending a Prior Benchmark Tool
• Yahoo! Cloud Serving Benchmark (YCSB) tool
  – Steady-state load of CRUD (create-read-update-delete) operations
Benchmark tool:
• Java application
  – Many systems have Java APIs
  – Other systems via HTTP/REST, JNI or some other solution
• Inputs: a workload parameter file (R/W mix, record size, data set, …) and command-line parameters (DB to use, target throughput, number of threads, …)
• YCSB client: a workload executor drives client threads through a DB client layer to the cloud DB, collecting stats
• Extensible: plug in new clients (a skeleton follows below); define new workloads
github.com/brianfrankcooper/YCSB [SoCC10]
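Plugging in a new client means subclassing YCSB's DB class. The skeleton below follows the SoCC10-era signatures (string-valued fields, int status codes); later YCSB versions changed field values to ByteIterator, so treat this as a sketch of the plug-in shape rather than a drop-in class.

```java
import java.util.HashMap;
import java.util.Set;
import java.util.Vector;
import com.yahoo.ycsb.DB;
import com.yahoo.ycsb.DBException;

// Skeleton YCSB database binding: one method per CRUD operation plus
// scan. The workload executor calls these from many client threads.
public class MyStoreClient extends DB {
    @Override
    public void init() throws DBException {
        // Connect to the store using settings from getProperties().
    }

    @Override
    public int read(String table, String key, Set<String> fields,
                    HashMap<String, String> result) {
        // Fetch 'fields' (all fields if null) of row 'key' into result.
        return 0; // 0 = OK in this YCSB version
    }

    @Override
    public int scan(String table, String startkey, int recordcount,
                    Set<String> fields,
                    Vector<HashMap<String, String>> result) {
        // Range scan of 'recordcount' rows starting at 'startkey'.
        return 0;
    }

    @Override
    public int update(String table, String key,
                      HashMap<String, String> values) {
        return 0;
    }

    @Override
    public int insert(String table, String key,
                      HashMap<String, String> values) {
        return 0;
    }

    @Override
    public int delete(String table, String key) {
        return 0;
    }
}
```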
Adv. Features of YCSB++
• High ingest rate features
  – Deep batch writing
  – Pre-splitting tablets (given a future insert distribution)
  – Bulk load: format map files externally with MapReduce
• Read features
  – Read-after-write: what price eventual consistency?
  – Offloading filtering to servers
  – Security ACLs: what performance price?
• Better interpretation of monitoring
  – Integrate knowledge of services & user jobs (Otus)
• To be published in SoCC (October 2011): www.pdl.cmu.edu/ycsb++/
YCSB++ Framework
[Figure: YCSB++ framework. Client nodes run the YCSB client (with our extensions): a workload executor drives client threads through DB clients (HBase, IcyTable, other DBs) against the storage servers, collecting stats. Extensions: new workloads, API extensions, multi-phase processing. Inputs: a workload parameter file (R/W mix, record size, data set, …) and command-line parameters (e.g., DB name, NumThreads). Ganglia monitoring gathers Hadoop, HDFS and OS metrics alongside YCSB metrics. YCSB client coordination uses ZooKeeper-based barrier sync and event notification, as sketched below.]
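To give a flavor of that coordination, here is a minimal barrier sketch on the stock ZooKeeper client API. The znode path and client count are illustrative assumptions; this is the textbook ZooKeeper barrier recipe, not YCSB++'s actual implementation.

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Barrier: each client creates an ephemeral child under a shared znode,
// then waits until all expected clients have appeared before starting
// the next benchmark phase together.
public class PhaseBarrier implements Watcher {
    private final ZooKeeper zk;
    private final String root; // e.g. "/ycsb-barrier" (assumed path)
    private final int size;    // number of client processes expected
    private final Object mutex = new Object();

    public PhaseBarrier(String hosts, String root, int size) throws Exception {
        this.zk = new ZooKeeper(hosts, 30000, this);
        this.root = root;
        this.size = size;
        try {
            zk.create(root, new byte[0], Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ok) {
            // another client created the barrier node first; fine
        }
    }

    // Watcher callback: any child change wakes up waiting threads.
    @Override
    public void process(WatchedEvent event) {
        synchronized (mutex) { mutex.notifyAll(); }
    }

    // Block until all 'size' clients have announced themselves.
    public void enter(String clientId) throws Exception {
        zk.create(root + "/" + clientId, new byte[0],
                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        while (true) {
            synchronized (mutex) {
                List<String> children = zk.getChildren(root, true); // set watch
                if (children.size() >= size) return;
                mutex.wait();
            }
        }
    }
}
```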
Accumulo
• github.com/MiloPolte/YCSB (pushing to main branch)
• Accepted into the Apache Incubator Sept 12
Extensions for Monitoring (Otus)
• Service stats (Hadoop, HBase, HDFS, …)
• Walk the process group tree looking for specific command lines
• Aggregate stats for subgroups
• Customizable displays
[Figure: Otus displays. Virtual memory (bytes) broken down into the MR job, data node, task tracker and other processes, overlaid with running map tasks; HDFS DataNode read requests (ops) from remote clients; HDFS DataNode CPU usage.]
github.com/Otus/otus
Server-side Filtering
• Filtering when little data is desired leads to excessive prefetching on the server, because the server keeps scanning to fill the scanner batch
• Fix: size the scanner batch to the expected result size (a scaled buffer); see the sketch below
• The HBase table was decomposed into more columnar stores, so Accumulo does more work
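In HBase's client API, the relevant knob is the scanner's per-RPC row batch. A sketch in the 0.90-era API follows; the table name, column family, and caching value are illustrative assumptions, not values from the experiments above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Server-side filtering with a scanner batch scaled to the expected
// result size, so a selective filter does not force the region servers
// to prefetch far past the few matching rows.
public class FilteredScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");

        Scan scan = new Scan();
        // Push the predicate to the region servers, not the client.
        scan.setFilter(new SingleColumnValueFilter(
            Bytes.toBytes("family"), Bytes.toBytes("field0"),
            CompareOp.EQUAL, Bytes.toBytes("target-value")));
        // Small expected result: use a small per-RPC batch instead of
        // one big default that the server must scan to fill.
        scan.setCaching(10);

        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            // process matching rows
        }
        scanner.close();
        table.close();
    }
}
```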
[Figure: server-side filtering performance, DoD BigTable vs. HBase]
Batch Writers & Eventual Consistency
• Small batches burn excessive client CPU, limiting throughput
• Large batches saturate servers, limiting the benefit of batching; see the sketch below
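For HBase, deep batch writing maps onto the client-side write buffer. A sketch in the 0.90-era API follows; the table name, family, and the 1 MB buffer size are illustrative (1 MB is just one point on the buffer-size sweep plotted on the next slide).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Client-side batch writing: Puts accumulate in a write buffer and are
// shipped as one RPC batch when the buffer fills, trading per-operation
// latency (and read-after-write visibility) for ingest throughput.
public class BatchWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");
        table.setAutoFlush(false);             // buffer, not per-Put RPCs
        table.setWriteBufferSize(1024 * 1024); // 1 MB batch

        for (int i = 0; i < 1000000; i++) {
            Put put = new Put(Bytes.toBytes(String.format("user%09d", i)));
            put.add(Bytes.toBytes("family"), Bytes.toBytes("field0"),
                    Bytes.toBytes("value" + i));
            table.put(put); // sent only when the write buffer fills
        }
        table.flushCommits(); // push the final partial batch
        table.close();
    }
}
```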
Batch Writers & Eventual Consistency
[Figure: CDFs of the fraction of requests vs. read-after-write time lag (ms, log scale from 1 to 100000) for write buffer sizes of 10 KB, 100 KB, 1 MB and 10 MB: (a) HBase, (b) IcyTable]
• Deferred writing wins on throughput, but write-to-visibility latency can reach 100 seconds (measured as sketched below)
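The measurement itself is simple in outline. The sketch below uses a hypothetical TableClient interface as a stand-in for whatever store client is under test; YCSB++ actually coordinates the writer and reader across machines with ZooKeeper event notification rather than shared memory.

```java
// Read-after-write probe: one client writes a key and records the time;
// another polls reads until the key becomes visible and reports the lag.
public class ReadAfterWriteProbe {
    // Hypothetical stand-in for the store client under test.
    interface TableClient {
        void put(String key, String value);
        String get(String key); // null until the write is visible
    }

    static long measureLagMillis(TableClient writer, TableClient reader,
                                 String key) throws InterruptedException {
        long writeTime = System.currentTimeMillis();
        writer.put(key, "probe-value");
        while (reader.get(key) == null) {
            Thread.sleep(1); // poll until the write becomes visible
        }
        return System.currentTimeMillis() - writeTime;
    }
}
```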
Pre- (and post-) Tablet Splitting
• 6 servers
• Per server: preload 1M rows; load 8M rows; measure @ 100 ops/s
• 20% faster load if pre-split (see the sketch below)
• Post-load rebalancing hurts for minutes
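For HBase, pre-splitting means handing split keys to table creation so every server takes inserts from the start. A sketch in the 0.90-era API, assuming a uniform distribution over keys like "user000…"; the table name, family, and region count are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Pre-split table creation: split keys chosen from the expected future
// insert distribution spread the load instead of hammering one tablet.
public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("usertable");
        desc.addFamily(new HColumnDescriptor("family"));

        // One initial region per server (6 here), with split points
        // placed uniformly over the expected key range.
        int regions = 6;
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            splits[i - 1] =
                Bytes.toBytes(String.format("user%03d", i * 1000 / regions));
        }
        admin.createTable(desc, splits);
    }
}
```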
Improving Ingest Speed: Bulk Load
• Fastest ingest: format map files with MapReduce, ingest/import them with bulk load, rebalance during the measurement phase
• Test: preload; monitor/measure; format; bulk load; monitor/measure; sleep 5 minutes; monitor/measure
• Per server: preload 1M rows; load 8M rows; measure @ 100 ops/s
• Import turns out to be nearly instant (see the sketch below), but rebalancing is not
• Loading 48M rows one at a time: 1400-1600 secs (23-26 mins)
• Bulk load, including formatting time: 5-12 mins (2-5x faster)
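In HBase terms, the import step just moves MapReduce-formatted store files into the table, which is why it is nearly instant. A sketch using the 0.90-era bulk-load entry point; the output path is an illustrative assumption, and the MapReduce job that formats the HFiles is not shown.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

// Bulk import: a prior MapReduce job wrote store files (HFiles) under
// /tmp/bulk-output via HFileOutputFormat; this step hands them to the
// region servers, which adopt the files without rewriting the rows.
public class BulkImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/tmp/bulk-output"), table);
        table.close();
    }
}
```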
[Figure: bulk-load timeline: MapReduce formatting, then import (when data becomes available), then rebalancing (when queries may slow down); together they make up end-to-end ingest time]
Scaling & Bulk Loading
• Per server: preload 1M rows, load 8M rows; Accumulo

Phase durations in minutes (36 MapFiles in all cases):
Servers | PreLoad | PL-Rebalance | BulkLoad | BL-Rebalance
6 | 0.9 | 2.1 | 3.95 | 1.7
24 | 1.5 | 3.3 | 7.4 | 1.8
54 | 2.5 | 4.5 | 12.8 | 2.2

• Scaling MR means more files & more compaction
Rebalancing Timeline (54 Servers/36 MapFiles)
[Figure: rebalancing activity over time, annotated with the numbered phases of the 54-server run]
• Phase 1 rebalancing starts late
• Too much rebalancing work
So How Do We Test At Scale?
• At cloud scale, very few users can afford extended experiment time on public clouds
• Many systems experiments want to be repeatable, isolated, instrumented, fault-injected, and run on specialized kernels
• Almost no one running a public cloud could (would) (SHOULD) support such invasive apps
LANL was going to trash this!
NSF PRObE to the Rescue
• NSF funds the New Mexico Consortium to recycle LANL supercomputers
• PRObE: Parallel Reconfigurable Observational Environment
  – Low-level systems research facility
  – Days to weeks of dedicated usage
  – Complete control of hardware and software
  – Fault injection and failure statistics
PRObE Hardware Plan
• Spring 2012: Sitka (2048 cores) acquired
  – 1024 nodes, dual-socket single-core AMD Opteron; 4GB RAM per core; full fat-tree Myrinet
• Summer 2012: Kodiak (2048 cores) acquired
  – 1024 nodes, dual-socket single-core AMD Opteron; 4GB RAM per core; fat-tree SDR InfiniBand
  – A 128-node version at CMU, Marmot, standing up now
• Fall 2011: Susitna (1700 cores) being acquired
  – 26 nodes, 16-core CPUs, 1GB RAM per core, QDR InfiniBand, GPU
  – Planning to build at CMU soon
• Fall 2013: Nome (1600 cores) anticipated
  – 200 nodes, quad-socket dual-core AMD Opteron; 2GB RAM per core, fat-tree DDR InfiniBand
• Fall 2013: Matanuska (3456 cores) anticipated
  – 36 nodes, 24-core CPUs, 1-2GB RAM per core, 100Gbit Ethernet
PRObE Software
• First, "none" is allowed: researchers can put any software they want onto the clusters
• Second, a well-known tool for managing research clusters of hardware: Emulab (www.emulab.org), from the Flux Group at U. Utah
  – Runs on the staging clusters as well as the large clusters
  – Enhanced for PRObE hardware, scale, networks, resource partitioning policies, remote power and console, failure injection, deep instrumentation
• PRObE provides hardware support (spares)
For Systems Research Users
• NSF "who can apply" rules
  – Includes international and corporate research projects ("best" in partnership with a US university)
• newmexicoconsortium.org/probe
On the Education Front: BigData Masters
• Extends the MSIT Very Large Information Systems (VLIS) program
• Tracks for BigData "systems" and "applications"
• One year on campus, including two project courses, plus a 7-month internship at the end
  – Already using Hadoop on the OpenCloud cluster in some courses
• Systems courses: Distributed Computing, Storage Systems, Cloud Computing, Data Mining, Parallel Computer Architecture & Programming
• Applications courses: VLIS, Software Engineering, Machine Learning, Information Retrieval
• Seeking students, internship & permanent employers
• It's all about expanding the training of BigData professionals
Research Sponsors
Companies of the Parallel Data Consortium: APC, EMC, Facebook, Google, Hewlett-Packard, Hitachi, Intel, Microsoft, NEC, NetApp, Oracle, Panasas, Riverbed, Samsung, Seagate, STEC, Symantec, VMware