Scalable Table Stores: Tools for Understanding Advanced Key-Value Systems for Hadoop
Garth Gibson Professor, Carnegie Mellon Univ., & CTO, Panasas Inc.
with Julio Lopez, Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao (CMU); to appear in SoCC, October 2011
with Wittawat Tantisiriroj, Swapnil Patil (CMU) and Seung Woo Son, Sam Lang, Rob Ross (Argonne National Lab); to appear in SC11, November 2011
The Future is Data-Led
• NIST machine translation competition: translate 100 articles, Arabic to English
• 2005 outcome: Google wins, qualitatively better on its first entry
• Brute force statistics with more data & compute: 200M words from UN translations, 1 billion words of English grammar, a 1000-processor cluster
[Figure: BLEU scores (0.0-0.7) of the entrants (Google, ISI, IBM+CMU, UMD, JHU+CU, Edinburgh, Systran, Mitre, FSC), annotated with quality levels from useless, through topic identification, usable translation, and human-editable translation, up to expert human translator]
Source: IEEE Intelligent Systems, March/April 2009
Science of Many Types is Data-Led

Contact | Field | Comments
J Lopez, CSD | Astrophysics | SDSS digital sky survey including spectroscopy, 50TB
T Di Matteo, Physics | Astrophysics | Bigben BHCosmo hydrodynamics (1B particles simulated), 30TB
F Gilman, Physics | Astrophysics | Large Synoptic Survey Telescope, LSST (2012) digital sky survey, 15TB/day
C Langmead, CSD | Biology | Xray, NMR, CryoEM images; sim'd molecular dynamics trajectories
J Bielak, CE | Earth sciences | USGS sensor images; sim'd 4D earthquake wavefields, >10TB/run
D Brumley, ECE | Cyber security | Worldwide Malware Archive; 2TB, doubling each year
O Mutlu, ECE | Genomics | 50GB per compressed genome sequencing; expands to TBs to process
B Yu, ECE | Neuroscience | Neural recordings (electrodes, optical) for prosthetics; 10-100GB each
J Callan, LTI | Info Retrieval | ClueWeb09, 25TB, 1B high-rank web pages, 10 languages
T Mitchell, MLD | Machine Learning | English sentences of ClueWeb for continuous automated reading (5TB)
M Hebert, RI | Image Understanding | Flickr archive (>4TB); broadcast TV archive; street video; soldier video
Y Sheikh, RI | Virtual Reality | Terascale VR sensor: 1,000 cameras + 200 microphones, up to 5TB/sec
C Guestrin, CSD | Machine Learning | Blog update archives, 2TB now + 2.7TB/yr (about 500K blogs/day)
C Faloutsos, CSD | Data Mining | Wikipedia change archive (1TB), fly embryo images (1.5TB), links from Yahoo web
S Vogel, LTI | Machine Translation | Pre-filtered N-gram language model based on statistics on word alignment, 100TB
J Baker, LTI | Machine Translation | Spoken language recording archive, many languages, many sources, up to 1PB
B Becker, RI | Computer Vision | Social network image/video archive for training computer vision systems, 1-5TB
CMU PDL History of Scalable Storage
• 1995: DARPA funds Network-Attached Secure Disks (NASD)
• 1999+: NASD spin-offs
  – Object Storage Device standardized by T10/SCSI in 2004, 2009
  – Panasas parallel storage system, Gibson co-founder & CTO
    • Primary storage on the first petascale computer: LANL Roadrunner
    • Also: NIH, Citadel, ING, BNP, BP, ConocoPhillips, PetroChina, StatOil, Ferrari, BMW, 3M, Lockheed Martin, Northrop Grumman, Sandia, NASA
• Lustre Linux open-source parallel file system
  – With Panasas, Lustre & PVFS, 3/4 of top500.org machines are object-based
• Graduates go to storage, server & internet companies
  – e.g., Google File System (2003) & BigTable (2006) cloud database
• Parallel NFS achieves IETF RFC in 2010, spurred on by Panasas
  – Linux adoption in 2.6.39, 3.0 and 3.1 (2011)
For the Experience, Operate A Cloud
• Two clusters: 3 TF, 2.2 TB, 142 nodes, 1.1K cores, ½ PB
• Available to CMU eScience users as a Hadoop queue:
  – IR, ML classes
  – ML research
  – Comp bio research
  – Astro research
  – Geo research
  – Malware analysis
  – Social network analysis
  – Systems research
PDL & OpenCirrus

[Figure: PDL cluster topology. CMU OpenCloud: two logical racks of 32 worker nodes plus 6-7 RAID-protected storage nodes each, connected over 38-39 SFP+ twinax 10GE links to 48-port 10-GE switches joined by 6x 10GE trunks. CMU OpenCirrus: four logical racks of 19-20 worker nodes, each on 2 x 1GE links to 1-GE-down/10-GE-up switches with 2x 10GE trunks into a 24-port 10-GE switch. An external switch ties the clusters together over 2 x 10GE SR optical links, with a 10 Gbps LR optical link to NLR reaching the other OpenCloud sites.]
To Understand: Cloud FS vs. Parallel FS
To be published in SC11, November 2011
• Hadoop's storage library, HDFS, is replaceable
• Replace it with PVFS, a user-level parallel FS, to understand the differences
• A shim layer gives PVFS the HDFS behaviors Hadoop expects:
  – Buf: prefetching (HDFS is write-once, so add deep prefetch)
  – Map: layout exposure (map each stripe unit to its node for optimized task launch)
  – Rep: replicate data (no HW RAID!)
Replication inside a PVFS file
• PVFS, like most cluster/parallel file systems, assumes RAID hardware
• HDFS, like GoogleFS, avoids RAID hardware because of how it scales
• Teach the PVFS client to replicate internally (a hybrid approximating HDFS), as sketched below
• Code is not production quality: the error path is too hard for academics :-)
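The client-side replication idea is easiest to see in miniature. Below is a hedged Java sketch, not the actual shim code: ServerConn is a hypothetical stand-in for a PVFS server connection, and the real client replicates inside a PVFS file with far more careful error handling (the caveat above).

```java
import java.io.IOException;

// Illustration of client-side replication: instead of relying on RAID
// hardware, the client writes each chunk to every replica server before
// acknowledging the write. 'ServerConn' is a hypothetical interface, not
// part of PVFS; the real shim's error handling is much more involved.
public class ReplicatingWriter {
    interface ServerConn {
        void writeChunk(long offset, byte[] data) throws IOException;
    }

    private final ServerConn[] replicas; // e.g., 3 servers per chunk

    ReplicatingWriter(ServerConn[] replicas) { this.replicas = replicas; }

    void write(long offset, byte[] data) throws IOException {
        for (ServerConn server : replicas) {
            server.writeChunk(offset, data); // all replicas before ack
        }
    }
}
```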
Interesting Implementation Issues
• HDFS performance is disk-bound by chunk creation
• PVFS has insufficient parallelism in a single stream
Differences Not Visible in Apps
• OpenCloud apps: astrophysics, social network analysis
• Hadoop helps:
  – Job scheduler does load balancing
  – Dataset is a directory of files
Scalable Table Stores
• Inspired by Google's BigTable
• Reported to scale to >76 PB in one "database" and >10M operations/sec
• B-tree with giant nodes
• Data model is dynamic: lots of columns, strings everywhere
• Writeback of mutations written as sorted, indexed log files
• Read misses search all logs: log-structured merge (LSM) trees, as in the sketch below
• Layered on GFS (HDFS)
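The write/read asymmetry above is the log-structured merge idea. A toy Java sketch (not any particular system's code) makes it concrete: writes land in a sorted in-memory buffer, flushes produce immutable sorted runs, and a read miss searches runs from newest to oldest. Real systems add on-disk indexes, bloom filters, and compaction of the runs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy LSM store: a sorted memtable absorbs mutations; when full it is
// frozen into a run (standing in for a sorted, indexed log file); reads
// that miss the memtable search all runs, newest first.
public class TinyLsm {
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final Deque<SortedMap<String, String>> runs = new ArrayDeque<>();
    private final int flushThreshold;

    public TinyLsm(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            runs.addFirst(memtable);   // newest run first
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memtable.get(key);  // hit in the write buffer?
        if (v != null) return v;
        for (SortedMap<String, String> run : runs) { // miss: search all logs
            v = run.get(key);
            if (v != null) return v;   // newest run wins
        }
        return null;
    }
}
```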
Extending a Prior Benchmark Tool
• Yahoo! Cloud Serving Benchmark (YCSB) tool
  – Steady-state load of CRUD (create-read-update-delete) operations
Benchmark tool:
• Java application
  – Many systems have Java APIs
  – Other systems via HTTP/REST, JNI or some other solution
• Inputs: a workload parameter file (R/W mix, record size, data set, …) and command-line parameters (DB to use, target throughput, number of threads, …)
• YCSB client: a workload executor drives client threads through a DB client layer to the cloud DB, collecting stats
• Extensible: plug in new clients (a skeleton follows below); define new workloads
github.com/brianfrankcooper/YCSB [SoCC10]
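Plugging in a new client means subclassing YCSB's DB class. The skeleton below follows the SoCC10-era signatures (string-valued fields, int status codes); later YCSB versions changed field values to ByteIterator, so treat this as a sketch of the plug-in shape rather than a drop-in class.

```java
import java.util.HashMap;
import java.util.Set;
import java.util.Vector;
import com.yahoo.ycsb.DB;
import com.yahoo.ycsb.DBException;

// Skeleton YCSB database binding: one method per CRUD operation plus
// scan. The workload executor calls these from many client threads.
public class MyStoreClient extends DB {
    @Override
    public void init() throws DBException {
        // Connect to the store using settings from getProperties().
    }

    @Override
    public int read(String table, String key, Set<String> fields,
                    HashMap<String, String> result) {
        // Fetch 'fields' (all fields if null) of row 'key' into result.
        return 0; // 0 = OK in this YCSB version
    }

    @Override
    public int scan(String table, String startkey, int recordcount,
                    Set<String> fields,
                    Vector<HashMap<String, String>> result) {
        // Range scan of 'recordcount' rows starting at 'startkey'.
        return 0;
    }

    @Override
    public int update(String table, String key,
                      HashMap<String, String> values) {
        return 0;
    }

    @Override
    public int insert(String table, String key,
                      HashMap<String, String> values) {
        return 0;
    }

    @Override
    public int delete(String table, String key) {
        return 0;
    }
}
```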
Adv. Features of YCSB++
• High ingest rate features
  – Deep batch writing
  – Pre-splitting tablets (given a future insert distribution)
  – Bulk load: format map files externally with MapReduce
• Read features
  – Read-after-write: what price eventual consistency?
  – Offloading filtering to servers
  – Security ACLs: what performance price?
• Better interpretation of monitoring
  – Integrate knowledge of services & user jobs (Otus)
• To be published in SoCC (October 2011): www.pdl.cmu.edu/ycsb++/
YCSB++ Framework
[Figure: YCSB++ framework. Client nodes run the YCSB client (with our extensions): a workload executor drives client threads through DB clients (HBase, IcyTable, other DBs) against the storage servers, collecting stats. Extensions: new workloads, API extensions, multi-phase processing. Inputs: a workload parameter file (R/W mix, record size, data set, …) and command-line parameters (e.g., DB name, NumThreads). Ganglia monitoring gathers Hadoop, HDFS and OS metrics alongside YCSB metrics. YCSB client coordination uses ZooKeeper-based barrier sync and event notification, as sketched below.]
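To give a flavor of that coordination, here is a minimal barrier sketch on the stock ZooKeeper client API. The znode path and client count are illustrative assumptions; this is the textbook ZooKeeper barrier recipe, not YCSB++'s actual implementation.

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Barrier: each client creates an ephemeral child under a shared znode,
// then waits until all expected clients have appeared before starting
// the next benchmark phase together.
public class PhaseBarrier implements Watcher {
    private final ZooKeeper zk;
    private final String root; // e.g. "/ycsb-barrier" (assumed path)
    private final int size;    // number of client processes expected
    private final Object mutex = new Object();

    public PhaseBarrier(String hosts, String root, int size) throws Exception {
        this.zk = new ZooKeeper(hosts, 30000, this);
        this.root = root;
        this.size = size;
        try {
            zk.create(root, new byte[0], Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ok) {
            // another client created the barrier node first; fine
        }
    }

    // Watcher callback: any child change wakes up waiting threads.
    @Override
    public void process(WatchedEvent event) {
        synchronized (mutex) { mutex.notifyAll(); }
    }

    // Block until all 'size' clients have announced themselves.
    public void enter(String clientId) throws Exception {
        zk.create(root + "/" + clientId, new byte[0],
                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        while (true) {
            synchronized (mutex) {
                List<String> children = zk.getChildren(root, true); // set watch
                if (children.size() >= size) return;
                mutex.wait();
            }
        }
    }
}
```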
Accumulo
• github.com/MiloPolte/YCSB (pushing to main branch)
• Accepted into the Apache Incubator Sept 12
Extensions for Monitoring (Otus)
• Service stats (Hadoop, HBase, HDFS, …)
• Walk the process group tree looking for specific command lines
• Aggregate stats for subgroups
• Customizable displays
[Figure: Otus displays. Virtual memory (bytes) broken down into the MR job, data node, task tracker and other processes, overlaid with running map tasks; HDFS DataNode read requests (ops) from remote clients; HDFS DataNode CPU usage.]
github.com/Otus/otus
Server-side Filtering
• Filtering when little data is desired leads to excessive prefetching on the server, because the server keeps scanning to fill the scanner batch
• Fix: size the scanner batch to the expected result size (a scaled buffer); see the sketch below
• The HBase table was decomposed into more columnar stores, so Accumulo does more work
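In HBase's client API, the relevant knob is the scanner's per-RPC row batch. A sketch in the 0.90-era API follows; the table name, column family, and caching value are illustrative assumptions, not values from the experiments above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Server-side filtering with a scanner batch scaled to the expected
// result size, so a selective filter does not force the region servers
// to prefetch far past the few matching rows.
public class FilteredScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");

        Scan scan = new Scan();
        // Push the predicate to the region servers, not the client.
        scan.setFilter(new SingleColumnValueFilter(
            Bytes.toBytes("family"), Bytes.toBytes("field0"),
            CompareOp.EQUAL, Bytes.toBytes("target-value")));
        // Small expected result: use a small per-RPC batch instead of
        // one big default that the server must scan to fill.
        scan.setCaching(10);

        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            // process matching rows
        }
        scanner.close();
        table.close();
    }
}
```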
[Figure: server-side filtering performance, DoD BigTable vs. HBase]
Batch Writers & Eventual Consistency
• Small batches burn excessive client CPU, limiting throughput
• Large batches saturate servers, limiting the benefit of batching; see the sketch below
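For HBase, deep batch writing maps onto the client-side write buffer. A sketch in the 0.90-era API follows; the table name, family, and the 1 MB buffer size are illustrative (1 MB is just one point on the buffer-size sweep plotted on the next slide).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Client-side batch writing: Puts accumulate in a write buffer and are
// shipped as one RPC batch when the buffer fills, trading per-operation
// latency (and read-after-write visibility) for ingest throughput.
public class BatchWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");
        table.setAutoFlush(false);             // buffer, not per-Put RPCs
        table.setWriteBufferSize(1024 * 1024); // 1 MB batch

        for (int i = 0; i < 1000000; i++) {
            Put put = new Put(Bytes.toBytes(String.format("user%09d", i)));
            put.add(Bytes.toBytes("family"), Bytes.toBytes("field0"),
                    Bytes.toBytes("value" + i));
            table.put(put); // sent only when the write buffer fills
        }
        table.flushCommits(); // push the final partial batch
        table.close();
    }
}
```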
Batch Writers & Eventual Consistency
[Figure: CDFs of the fraction of requests vs. read-after-write time lag (ms, log scale from 1 to 100000) for write buffer sizes of 10 KB, 100 KB, 1 MB and 10 MB: (a) HBase, (b) IcyTable]
• Deferred writing wins on throughput, but write-to-visibility latency can reach 100 seconds (measured as sketched below)
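The measurement itself is simple in outline. The sketch below uses a hypothetical TableClient interface as a stand-in for whatever store client is under test; YCSB++ actually coordinates the writer and reader across machines with ZooKeeper event notification rather than shared memory.

```java
// Read-after-write probe: one client writes a key and records the time;
// another polls reads until the key becomes visible and reports the lag.
public class ReadAfterWriteProbe {
    // Hypothetical stand-in for the store client under test.
    interface TableClient {
        void put(String key, String value);
        String get(String key); // null until the write is visible
    }

    static long measureLagMillis(TableClient writer, TableClient reader,
                                 String key) throws InterruptedException {
        long writeTime = System.currentTimeMillis();
        writer.put(key, "probe-value");
        while (reader.get(key) == null) {
            Thread.sleep(1); // poll until the write becomes visible
        }
        return System.currentTimeMillis() - writeTime;
    }
}
```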
Pre- (and post-) Tablet Splitting
• 6 servers
• Per server: preload 1M rows; load 8M rows; measure @ 100 ops/s
• 20% faster load if pre-split (see the sketch below)
• Post-load rebalancing hurts for minutes
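For HBase, pre-splitting means handing split keys to table creation so every server takes inserts from the start. A sketch in the 0.90-era API, assuming a uniform distribution over keys like "user000…"; the table name, family, and region count are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Pre-split table creation: split keys chosen from the expected future
// insert distribution spread the load instead of hammering one tablet.
public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("usertable");
        desc.addFamily(new HColumnDescriptor("family"));

        // One initial region per server (6 here), with split points
        // placed uniformly over the expected key range.
        int regions = 6;
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            splits[i - 1] =
                Bytes.toBytes(String.format("user%03d", i * 1000 / regions));
        }
        admin.createTable(desc, splits);
    }
}
```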
Improving Ingest Speed: Bulk Load
• Fastest ingest: format map files with MapReduce, ingest/import them with bulk load, rebalance during the measurement phase
• Test: preload; monitor/measure; format; bulk load; monitor/measure; sleep 5 minutes; monitor/measure
• Per server: preload 1M rows; load 8M rows; measure @ 100 ops/s
• Import turns out to be nearly instant (see the sketch below), but rebalancing is not
• Loading 48M rows one at a time: 1400-1600 secs (23-26 mins)
• Bulk load, including formatting time: 5-12 mins (2-5x faster)
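In HBase terms, the import step just moves MapReduce-formatted store files into the table, which is why it is nearly instant. A sketch using the 0.90-era bulk-load entry point; the output path is an illustrative assumption, and the MapReduce job that formats the HFiles is not shown.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

// Bulk import: a prior MapReduce job wrote store files (HFiles) under
// /tmp/bulk-output via HFileOutputFormat; this step hands them to the
// region servers, which adopt the files without rewriting the rows.
public class BulkImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(new Path("/tmp/bulk-output"), table);
        table.close();
    }
}
```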
[Figure: bulk-load timeline: MapReduce formatting, then import (when data becomes available), then rebalancing (when queries may slow down); together they make up end-to-end ingest time]
Scaling & Bulk Loading
• Per server: preload 1M rows, load 8M rows; Accumulo

Phase durations in minutes (36 MapFiles in all cases):
Servers | PreLoad | PL-Rebalance | BulkLoad | BL-Rebalance
6 | 0.9 | 2.1 | 3.95 | 1.7
24 | 1.5 | 3.3 | 7.4 | 1.8
54 | 2.5 | 4.5 | 12.8 | 2.2

• Scaling MR means more files & more compaction
Rebalancing Timeline (54 Servers/36 MapFiles)
[Figure: rebalancing activity over time, annotated with the numbered phases of the 54-server run]
• Phase 1 rebalancing starts late
• Too much rebalancing work
So How Do We Test At Scale?
• At cloud scale, very few users can afford extended experiment time on public clouds
• Many systems experiments want to be repeatable, isolated, instrumented, fault-injected, and run on specialized kernels
• Almost no one running a public cloud could (would) (SHOULD) support such invasive apps
LANL was going to trash this!
NSF PRObE to the Rescue
• NSF funds the New Mexico Consortium to recycle LANL supercomputers
• PRObE: Parallel Reconfigurable Observational Environment
  – Low-level systems research facility
  – Days to weeks of dedicated usage
  – Complete control of hardware and software
  – Fault injection and failure statistics
PRObE Hardware Plan
• Spring 2012: Sitka (2048 cores) acquired
  – 1024 nodes, dual-socket single-core AMD Opteron; 4GB RAM per core; full fat-tree Myrinet
• Summer 2012: Kodiak (2048 cores) acquired
  – 1024 nodes, dual-socket single-core AMD Opteron; 4GB RAM per core; fat-tree SDR InfiniBand
  – A 128-node version at CMU, Marmot, standing up now
• Fall 2011: Susitna (1700 cores) being acquired
  – 26 nodes, 16-core CPUs, 1GB RAM per core, QDR InfiniBand, GPU
  – Planning to build at CMU soon
• Fall 2013: Nome (1600 cores) anticipated
  – 200 nodes, quad-socket dual-core AMD Opteron; 2GB RAM per core, fat-tree DDR InfiniBand
• Fall 2013: Matanuska (3456 cores) anticipated
  – 36 nodes, 24-core CPUs, 1-2GB RAM per core, 100Gbit Ethernet
PRObE Software
• First, "none" is allowed: researchers can put any software they want onto the clusters
• Second, a well-known tool for managing research clusters of hardware: Emulab (www.emulab.org), from the Flux Group at U. Utah
  – Runs on the staging clusters as well as the large clusters
  – Enhanced for PRObE hardware, scale, networks, resource partitioning policies, remote power and console, failure injection, deep instrumentation
• PRObE provides hardware support (spares)
For Systems Research Users
• NSF "who can apply" rules
  – Includes international and corporate research projects ("best" in partnership with a US university)
• newmexicoconsortium.org/probe
On the Education Front: BigData Masters
• Extends the MSIT Very Large Information Systems (VLIS) program
• Tracks for BigData "systems" and "applications"
• One year on campus, including two project courses, plus a 7-month internship at the end
  – Already using Hadoop on the OpenCloud cluster in some courses
• Systems courses: Distributed Computing, Storage Systems, Cloud Computing, Data Mining, Parallel Computer Architecture & Programming
• Applications courses: VLIS, Software Engineering, Machine Learning, Information Retrieval
• Seeking students, internship & permanent employers
• It's all about expanding the training of BigData professionals
Research Sponsors
Companies of the Parallel Data Consortium: APC, EMC, Facebook, Google, Hewlett-Packard, Hitachi, Intel, Microsoft, NEC, NetApp, Oracle, Panasas, Riverbed, Samsung, Seagate, STEC, Symantec, VMware