Analysis of HDFS Under HBase: A Facebook Messages Case Study
Tyler Harter, Dhruba Borthakur*, Siying Dong*, Amitanand Aiyer*, Liyin Tang*, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin-Madison *Facebook Inc.
Why Study Facebook Messages?
Represents an important type of application. Universal backend for:
▪ Cellphone texts
▪ Chats
▪ Emails
Represents HBase over HDFS
▪ Common backend at Facebook and other companies
▪ Similar stack used at Google (BigTable over GFS)
Represents layered storage
Building a Distributed Application (Messages)
We have many machines with many disks. How should we use them to store messages?
One option: use machines and disks directly. Very specialized, but very high development cost.
The layered option:
▪ Use HBase for K/V logic
▪ Use HDFS for replication
▪ Use Local FS for allocation
[Figure: Messages over HBase workers, over the Hadoop File System, over per-disk local file systems on Machines 1-3.]
Layered Storage Discussion
Layering Questions
▪ Is layering free performance-wise?
▪ Can layer integration be useful?
▪ Should there be multiple HW layers?
Layering Advantages
▪ Simplicity (thus fewer software bugs)
▪ Lower development costs
▪ Code sharing between systems
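To make the layered design concrete, here is a minimal sketch of a put() descending the stack (illustrative Python with hypothetical class names; the real stack is Java, and these are not the actual HBase/HDFS APIs): HBase supplies the key/value logic, HDFS turns one write into three replicated writes, and each local FS handles allocation on its own disks.

```python
# Minimal sketch of the layered write path (hypothetical names, not the real APIs).
class LocalFS:
    """Bottom layer: allocates and stores blocks on one machine's disks."""
    def __init__(self):
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data  # the real FS also handles allocation, journaling

class HDFS:
    """Middle layer: replicates each block across machines."""
    def __init__(self, machines, replication=3):
        self.machines, self.replication = machines, replication

    def write_block(self, block_id, data):
        for fs in self.machines[:self.replication]:  # 1 write here => 3 writes below
            fs.write(block_id, data)

class HBase:
    """Top layer: key/value logic (write-ahead log + MemTable) over HDFS files."""
    def __init__(self, hdfs):
        self.hdfs, self.memtable = hdfs, {}

    def put(self, key, value):
        self.hdfs.write_block(("LOG", key), value)  # logged for crash recovery
        self.memtable[key] = value                  # buffered until a flush

machines = [LocalFS() for _ in range(3)]
store = HBase(HDFS(machines))
store.put("user42:msg1", "hello")  # one put() -> three local-FS writes
```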
Outline
Intro
▪ Messages stack overview
▪ Methodology: trace-driven analysis and simulation
▪ HBase background
Results
▪ Workload analysis
▪ Hardware simulation: adding a flash layer
▪ Software simulation: integrating layers
Conclusions
Methodology
Actual stack: Messages over HBase over HDFS over the Local FS, traced at the HDFS layer.
Hadoop Trace FS (HTFS)
▪ Collects request details
▪ Reads/writes, offsets, lengths
▪ 9 shadow machines
▪ 8.3 days
The HDFS traces feed a MapReduce analysis pipeline, producing the workload analysis.
The same traces also drive a simulated stack: inferred local traces support HBase+HDFS what-ifs and local storage what-ifs, producing the simulation results.
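As a hedged illustration of the trace-driven side of this methodology (the record layout and field names below are assumptions, not HTFS's actual format): since HTFS logs per-request details such as read/write, offset, and length, first-pass statistics like the cross-layer R/W ratios reduce to a scan over the trace.

```python
import csv

# Hypothetical HTFS-like trace: one row per request, e.g. "R,4096,65536".
def rw_bytes(trace_path):
    """Total bytes read and written, summed over all traced requests."""
    reads = writes = 0
    with open(trace_path) as f:
        for op, _offset, length in csv.reader(f):
            if op == "R":
                reads += int(length)
            else:
                writes += int(length)
    return reads, writes

# reads, writes = rw_bytes("htfs_trace.csv")
# print(f"{writes / (reads + writes):.0%} writes")  # ~1% at baseline HDFS
```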
Methodology
Before the results, some background: how does HBase use HDFS?
Outline
Intro
▪ Messages stack overview
▪ Methodology: trace-driven analysis and simulation
▪ HBase background
Results
▪ Workload analysis
▪ Hardware simulation: adding a flash layer
▪ Software simulation: integrating layers
Conclusions
HBase's HDFS Files
Four activities do HDFS I/O:
▪ Logging
▪ Flushing
▪ Foreground reads
▪ Compaction
HBase state: a MemTable in HBase memory, plus LOG and DATA files in HDFS.
▪ Logging: when HBase receives a put(), it appends to the LOG.
▪ Flushing: after many puts, the MemTable is full, so HBase flushes it to a sorted DATA file.
▪ Foreground reads: after many flushes, DATA files accumulate, and get() requests may check many of these.
▪ Compaction: compaction merge sorts the DATA files back into a single DATA file.
Baseline I/O:
▪ Flushing and foreground reads are always required
HBase overheads:
▪ Logging: useful for crash recovery (but not normal operation)
▪ Compaction: improves performance (but not required for correctness)
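A toy LSM-style sketch of the four activities (illustrative Python; HBase's real classes and file formats differ): put() appends to a log and buffers in a MemTable, a full MemTable flushes to a sorted data file, get() may consult the MemTable plus every data file, and compaction merge sorts the files back into one.

```python
import heapq

class MiniLSM:
    """Toy LSM tree showing HBase's four HDFS activities (assumed names)."""
    def __init__(self, memtable_limit=4):
        self.log = []          # 1. logging: write-ahead log, only read on recovery
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.data_files = []   # each flush appends one sorted [(key, value)] list

    def put(self, key, value):
        self.log.append((key, value))              # logging
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 2. flushing: full MemTable becomes a sorted DATA file.
        self.data_files.append(sorted(self.memtable.items()))
        self.memtable.clear()

    def get(self, key):
        # 3. foreground reads: may check the MemTable plus many DATA files.
        if key in self.memtable:
            return self.memtable[key]
        for data in reversed(self.data_files):     # newest file first
            for k, v in data:
                if k == key:
                    return v
        return None

    def compact(self):
        # 4. compaction: merge sort all DATA files into one (newest value wins).
        merged = {}
        for k, v in heapq.merge(*self.data_files):
            merged[k] = v      # merge is stable, so later (newer) files override
        self.data_files = [sorted(merged.items())]
```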
Outline
Intro
▪ Messages stack overview
▪ Methodology: trace-driven analysis and simulation
▪ HBase background
Results
▪ Workload analysis
▪ Hardware simulation: adding a flash layer
▪ Software simulation: integrating layers
Conclusions
Workload Analysis Questions
At each layer, what activities read or write?
How large is the dataset?
How large are created files?
How sequential is I/O?
Cross-layer R/W Ratios
[Figure: reads and writes (TB, 0-100) at each layer; cache misses shrink reads down the stack, while compaction, LOG, and R1/R2/R3 replicas grow writes.]
▪ Baseline HDFS I/O: 1% writes
▪ All HDFS I/O: 21% writes
▪ Local FS: 45% writes
▪ Disk: 64% writes
Workload Analysis Conclusions
① Layers amplify writes: 1% => 64%
▪ Logging, compaction, and replication increase writes
▪ Caching decreases reads
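A quick worked check of the replication step of that amplification, assuming HDFS triples writes (R1/R2/R3) while each read is served from a single replica:

```python
# All HDFS I/O is 21% writes; replication (3x) applies to writes only.
reads, writes = 0.79, 0.21
local_writes = 3 * writes                         # R1 + R2 + R3
fraction = local_writes / (reads + local_writes)
print(f"{fraction:.0%} writes at the local FS")   # -> 44%, ~45% in the trace
# Caching then filters reads (not writes) at the disk, pushing this to 64%.
```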
Workload Analysis Questions
At each layer, what activities read or write?
How large is the dataset?
How large are created files?
How sequential is I/O?
Cross-layer Dataset (Accessed Data)
[Figure: footprint (TB, 0-40) at each layer, split into read vs. written data (LOG, compact, and R1/R2/R3 replicas).]
▪ Baseline HDFS I/O: 18% written
▪ All HDFS I/O: 77% written
▪ Local (FS/disk): 91% written
Workload Analysis Conclusions
① Layers amplify writes: 1% => 64%
② Most touched data is only written
Cold Data
[Figure: local (FS/disk) footprint (TB, 0-120): the R1/R2/R3 accessed data plus a large region of cold data.]
Workload Analysis Conclusions
① Layers amplify writes: 1% => 64%
② Most touched data is only written
③ The dataset is large and cold: 2/3 of 120TB never touched
Workload Analysis Questions
At each layer, what activities read or write?
How large is the dataset?
How large are created files?
How sequential is I/O?
Created Files: Size Distribution
[Figure: CDF of created file sizes; percent of files (0-100%) vs. file size (1KB-1GB, log scale).]
▪ 50% of files are <750KB
▪ 90% of files are <6.3MB
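A hedged sketch of how these percentiles can be summarized from per-file sizes (illustrative Python; the paper's actual pipeline is MapReduce over the HDFS traces):

```python
import statistics

# sizes: bytes of each created file, recovered from the HDFS create/write records.
sizes = [400_000, 600_000, 750_000, 1_500_000, 3_000_000, 6_300_000]  # toy data
deciles = statistics.quantiles(sizes, n=10)   # 10th..90th percentile cut points
print(f"median={statistics.median(sizes):,.0f}B  p90={deciles[-1]:,.0f}B")
# In the traced workload: median ~750KB, 90th percentile ~6.3MB.
```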
Workload Analysis Conclusions
① Layers amplify writes: 1% => 64%
② Most touched data is only written
③ The dataset is large and cold: 2/3 of 120TB never touched
④ Files are very small: 90% smaller than 6.3MB
Workload Analysis Questions
At each layer, what activities read or write?
How large is the dataset?
How large are created files?
How sequential is I/O?
Workload Analysis Conclusions
① Layers amplify writes: 1% => 64%
② Data is read or written, but rarely both
③ The dataset is large and cold: 2/3 of 120TB never touched
④ Files are very small: 90% smaller than 6.3MB
⑤ Fairly random I/O: 130KB median read run
Outline
Intro
▪ Messages stack overview
▪ Methodology: trace-driven analysis and simulation
▪ HBase background
Results
▪ Workload analysis
▪ Hardware simulation: adding a flash layer
▪ Software simulation: integrating layers
Conclusions
Hardware Architecture: Workload Implications
Option 1: pure disk
▪ Very random reads
▪ Small files
Option 2: pure flash
▪ Large dataset
▪ Mostly very cold
▪ >$10K / machine
Option 3: hybrid
▪ Process of elimination
Evaluate cost and performance of 36 hardware combinations (3x3x4)
▪ Disks: 10, 15, or 20
▪ RAM (cache): 10, 30, or 100GB
▪ Flash (cache): 0, 60, 120, or 240GB
Assumptions:
Hardware   Cost        Performance
HDD        $100/disk   10ms seek, 100MB/s
RAM        $5/GB       zero latency
Flash      $0.8/GB     0.5ms
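The cost side of each of the 36 configurations follows directly from the assumptions table; a sketch (the latency side requires the trace-driven cache simulation, so it is omitted here, and the helper name is illustrative):

```python
from itertools import product

COST_PER_DISK, COST_PER_GB_RAM, COST_PER_GB_FLASH = 100, 5, 0.8  # from the table

def config_cost(disks, ram_gb, flash_gb):
    return (disks * COST_PER_DISK
            + ram_gb * COST_PER_GB_RAM
            + flash_gb * COST_PER_GB_FLASH)

# The 3x3x4 = 36 combinations evaluated in the simulations:
for disks, ram, flash in product([10, 15, 20], [10, 30, 100], [0, 60, 120, 240]):
    print(f"{disks} disks, {ram}GB RAM, {flash}GB flash: "
          f"${config_cost(disks, ram, flash):.0f}")
```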
Hardware Architecture: Simulation Results
[Figure: cost/performance tradeoff for the 36 hardware combinations; foreground latency (ms, 0-20) vs. cost ($900-$2700). Upgrades decrease latency but increase cost; some upgrades are good, others bad.]
▪ Upgrading disk: 10 => 15 => 20 disks
▪ Upgrading RAM: 10GB => 30GB => 100GB
▪ Upgrading flash: no flash! => 60GB => 120GB => 240GB
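A minimal sketch of the kind of simulation behind these points (a toy model; the paper's simulator is more detailed): replay the trace's block reads through a RAM-then-flash LRU hierarchy and charge each request the latency of the layer that serves it, using the earlier table's device latencies.

```python
from collections import OrderedDict

class LRU:
    """Fixed-capacity LRU cache of block IDs; inserts on miss, evicts oldest."""
    def __init__(self, capacity_blocks):
        self.cap, self.blocks = capacity_blocks, OrderedDict()

    def hit(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.blocks[block] = True
        if len(self.blocks) > self.cap:
            self.blocks.popitem(last=False)  # evict least-recently-used block
        return False

def mean_latency_ms(block_trace, ram_blocks, flash_blocks,
                    flash_ms=0.5, disk_ms=10):
    """RAM hits are free, flash hits cost 0.5ms, misses pay a 10ms disk seek."""
    ram, flash = LRU(ram_blocks), LRU(flash_blocks)
    total = 0.0
    for block in block_trace:
        if ram.hit(block):
            continue
        total += flash_ms if flash.hit(block) else disk_ms
    return total / len(block_trace)

# e.g., mean_latency_ms(trace_blocks, ram_blocks=30_000, flash_blocks=60_000)
```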
Outline
Intro
▪ Messages stack overview
▪ Methodology: trace-driven analysis and simulation
▪ HBase background
Results
▪ Workload analysis
▪ Hardware simulation: adding a flash layer
▪ Software simulation: integrating layers
Conclusions
Software Architecture: Workload Implications
Writes are greatly amplified
▪ 1% at HDFS (excluding overheads) to 64% at disk (given 30GB RAM)
▪ We should optimize writes
61% of writes are for compaction
▪ We should optimize compaction
▪ Compaction interacts with replication inefficiently
Replication Overview
[Figure: Machines 1-3, each running an HBase worker over an HDFS worker; red lines show cross-machine replication traffic.]
Problem: Network I/O (red lines)
▪ Normally, one machine compacts and the merged output is re-replicated over the network to the others.
Solution: Ship Computation to Data (do Local Compaction)
▪ Each machine is told to "do compact" and merge sorts its own local copies, so compaction traffic stays off the network.
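A toy accounting of why shipping the computation helps, assuming (per the figure) 3-way replication and that normal compaction must re-replicate its merged output over the network; the function and its simple cost model are illustrative:

```python
def compaction_network_tb(compacted_tb, replication=3, local=False):
    """Network I/O needed to re-replicate merged compaction output (toy model)."""
    if local:
        return 0.0   # each replica merge sorts its own local copies
    return compacted_tb * (replication - 1)   # ship output to the other replicas

# With local compaction those shipped bytes become local disk reads instead;
# the next slide measures this as ~9% extra disk I/O with a 30GB cache.
```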
Local Compaction
[Figure: read I/O (TB, 0-10) vs. cache size (GB, 0-400) for net (normal), net (local), disk (normal), and disk (local); annotated gaps of 2.2TB (network) and 2.4TB (disk).]
Normally 3.5TB of network I/O
Local comp: 62% reduction
Network I/O becomes disk I/O
▪ 9% overhead (30GB cache)
▪ Compaction reads: (a) usually miss, (b) pollute the cache
Still good!
▪ Disk I/O is cheaper than network
Outline
Intro
▪ Messages stack overview
▪ Methodology: trace-driven analysis and simulation
▪ HBase background
Results
▪ Workload analysis
▪ Hardware simulation: adding a flash layer
▪ Software simulation: integrating layers
Conclusions
Conclusion 1: Messages is a New HDFS Workload
Original GFS paper:
▪ “high sustained bandwidth is more important than low latency”
▪ “multi-GB files are the common case”
We find files are small and reads are random
▪ 50% of files <750KB
▪ >75% of reads are random
Conclusion 2: Layering is Not Free
Layering "proved to be vital for the verification and logical soundness" of the THE operating system ~ Dijkstra
We find layering is not free
▪ Over half of network I/O for replication is unnecessary
Layers can amplify writes, multiplicatively
▪ E.g., logging overhead (10x) with replication (3x) => 30x write increase
Layer integration can help
▪ Local compaction reduces network I/O caused by layers
Conclusion 3: Flash Should not Replace Disk
Jim Gray predicted (for ~2012) that “tape is dead, disk is tape, flash is disk”
We find flash is a poor disk replacement for Messages
▪ Data is very large and mostly cold
▪ Pure flash would cost >$10K/machine
However, a small flash tier is useful
▪ A 60GB SSD cache can double performance for a 5% cost increase