Storage Systems for Big Data
Sameer Tiwari, Hadoop Storage Architect, Pivotal Inc.
[email protected], @sameertech

Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis



There is a plethora of storage solutions for big data, each having its own pros and cons. The objective of this talk is to delve deeper into specific classes of storage types, like distributed file systems, in-memory key-value stores, and Big Table stores, and provide insights on how to choose the right storage solution for a specific class of problems: for instance, running large analytic workloads, iterative machine learning algorithms, and real-time analytics. The talk covers HDFS, HBase, and a brief introduction to Redis.


Page 1: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Storage Systems for Big Data

Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
[email protected], @sameertech


Page 3: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Storage Hierarchy

- POSIX filesystem (*nix): general-purpose FS
- HDFS: large distributed storage, high aggregate throughput
- HBase: large indexed tables, fast random access, consistent
- Redis: in-memory KV store, extremely fast access
- Other KV store(s)


Page 5: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Hadoop Distributed File System (HDFS)

● History
○ Based on the Google File System paper (2003)
○ Built at Yahoo by a small team

● Goals
○ Tolerance to hardware failure
○ Sequential access as opposed to random
○ High aggregate throughput for large data sets
○ "Write Once Read Many" paradigm

Page 6: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - Key Components

Client1-FileA

NameNode

DataNode 1 DataNode 2 DataNode 3 DataNode 4

Client2-FileB

Rack 1 Rack 2

Page 7: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - Key Components

Client1-FileA

NameNode

DataNode 1 DataNode 2 DataNode 3 DataNode 4

Client2-FileB

Rack 1 Rack 2

NN ops: File.create() and metadata

FileA: metadata (e.g. size, owner, ...); AB1: D1, D3, D4; AB2: D1, D3, D4

FileB: metadata (e.g. size, owner, ...); BB1: D1, D2, D4

Page 8: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - Key Components

Client1-FileA

NameNode

DataNode 1 DataNode 2 DataNode 3 DataNode 4

Client2-FileB

Rack 1 Rack 2

NN ops: File.create() and metadata

FileA: metadata (e.g. size, owner, ...); AB1: D1, D3, D4; AB2: D1, D3, D4

FileB: metadata (e.g. size, owner, ...); BB1: D1, D2, D4

AB1

BB1

DN ops: data blocks

File.write()

Page 9: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - Key Components

Client1-FileA

NameNode

DataNode 1 DataNode 2 DataNode 3 DataNode 4

Block placement: DataNode 1: AB1, AB2, BB1; DataNode 2: BB1; DataNode 3: AB1, AB2; DataNode 4: AB1, AB2, BB1

Client2-FileB

Rack 1 Rack 2

NN ops: File.create() and metadata

DN ops: data blocks, File.write()

FileA: metadata (e.g. size, owner, ...); AB1: D1, D3, D4; AB2: D1, D3, D4

FileB: metadata (e.g. size, owner, ...); BB1: D1, D2, D4

Replication pipelining

Page 10: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Client1-FileA

NameNode

HDFS - Communication

HDFS Client API. RPC:ClientProtocol

Page 11: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Client1-FileA

NameNode

DataNode 1

AB1 AB2

BB1

HDFS - Communication

HDFS Client API. RPC:ClientProtocol

HDFS client API: DataNodeProtocol (non-RPC, streaming, heavy buffering)

Page 12: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Client1-FileA

NameNode

DataNode 1 DataNode 2

AB1 AB2 BB1

BB1

HDFS - Communication

HDFS Client API. RPC:ClientProtocol

HDFS client API: DataNodeProtocol (non-RPC, streaming, heavy buffering)

RPC DataNodeProtocol:
- DN registration: at init time
- Heartbeat: stats about activity and capacity (every few secs)
- Block report: list of blocks (hourly)
- Block received: triggered by client upload

AB2

Page 13: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Client1-FileA

NameNode

DataNode 1 DataNode 2

AB1 AB2 BB1

BB1

HDFS - Communication

HDFS Client API. RPC:ClientProtocol

HDFS client API: DataNodeProtocol (non-RPC, streaming, heavy buffering)

RPC DataNodeProtocol:
- DN registration: at init time
- Heartbeat: stats about activity and capacity (every few secs)
- Block report: list of blocks (hourly)
- Block received: triggered by client upload

AB2: replication pipelining (streaming)

Page 14: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - NameNode 1 of 4

● Heart of HDFS. Typically lots of memory, ~128 GB
● Hosts two important tables
● The HDFS namespace: File->Block mapping
○ Persisted for backup
● The iNode table: Block->DataNode mapping
○ Not persisted
○ Re-built from block reports
● HDFS is a journaled file system
○ Maintains a WAL called the edit log
○ The edit log is merged into fsimage at a preset log size
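The two NameNode tables can be sketched as a toy model. All class and method names here are hypothetical illustrations, not the actual NameNode code; the point is that the file->block table is durable while the block->datanode table is rebuilt from block reports:

```java
import java.util.*;

// Toy model of the NameNode's two tables (hypothetical names).
class ToyNameNode {
    // Namespace: file -> ordered list of block ids (persisted in fsimage).
    private final Map<String, List<String>> fileToBlocks = new HashMap<>();
    // Block -> datanodes holding it (never persisted; rebuilt from block reports).
    private final Map<String, Set<String>> blockToNodes = new HashMap<>();

    void createFile(String file, List<String> blocks) {
        fileToBlocks.put(file, new ArrayList<>(blocks));
    }

    // A datanode's block report re-populates the in-memory table.
    void blockReport(String datanode, Collection<String> blocks) {
        for (String b : blocks) {
            blockToNodes.computeIfAbsent(b, k -> new HashSet<>()).add(datanode);
        }
    }

    // Resolve a file to the set of datanodes holding each of its blocks.
    List<Set<String>> locate(String file) {
        List<Set<String>> result = new ArrayList<>();
        for (String b : fileToBlocks.getOrDefault(file, List.of())) {
            result.add(blockToNodes.getOrDefault(b, Set.of()));
        }
        return result;
    }
}
```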

Page 15: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - NameNode 2 of 4

● Can take on 3 roles
● Regular mode: hosts the HDFS namespace
● Backup mode: Secondary NN
○ Downloads fsimage regularly
○ Merges changes to the namespace
○ It's a misnomer; it's more of a checkpointing server
● Safemode: startup time
○ It's a read-only mode
○ Collects data from active DNs

Page 16: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - NameNode 3 of 4
HA using Quorum Journal Manager (Hadoop 2.0+)

Active NN Standby NN

DataNodes

JournalNodes

Clients

ZK Cluster

Page 17: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - NameNode 4 of 4

● Replication monitor: fixes over/under-replicated blocks
○ Replica modes: corrupt, current, out-of-date, under-construction
● Lease management: during file creation
○ Ensures a single writer (multiple readers are OK)
○ Synchronously checks the active lease
○ Asynchronously checks the entire tree of leases
● Heartbeat monitor: collects DN stats and marks a DN as down if no heartbeat is received for ~10 mins

Page 18: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - DataNode

● Typical machine: ~4 TB x 12 disks, JBOD
● Has no idea about HDFS; only knows about blocks
● Serves 2 types of requests
○ NN requests for block create/delete/replicate
○ Block R/W requests from clients
● Maintains only one table
○ Block -> Real bytes on the local FS
○ Stored locally and not backed up
○ The DN can re-build this table by scanning its local dir

Page 19: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - DataNode

● Creates a checksum file for each block
● Runs blockScanner() to find corrupt blocks
● DataNode to NameNode communication
○ Init: registration
○ Sends a heartbeat to the NN every few secs
○ Block completion: blockReceived()
○ Lets the NN respond with block commands
○ Sends a full block report every hour
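The checksum-per-block idea behind blockScanner() can be sketched like this. This is a toy model using CRC32; the real HDFS checksum format (per-chunk CRCs in a sidecar file) differs, and the class name is hypothetical:

```java
import java.util.zip.CRC32;

// Toy sketch of per-block checksumming and corruption detection.
class ToyBlockScanner {
    // Compute a CRC32 checksum over the block's bytes.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    // blockScanner() idea: recompute and compare against the stored checksum.
    static boolean isCorrupt(byte[] block, long storedChecksum) {
        return checksum(block) != storedChecksum;
    }
}
```

A DataNode that finds a corrupt block this way can report it to the NameNode, which then schedules re-replication from a healthy replica.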

Page 20: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - Typical Deployment

[Diagram: a master switch feeds aggregator switches 1, 2, 3, ...; each aggregator switch feeds the TOR (top-of-rack) switches of Rack 1 through Rack N (10-20 racks)]

Page 21: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - Limitations

● The NN holds the namespace in a single Java process
● 64 GB heap == ~250 million files + blocks
○ Federation sort of solves the problem
○ Moving the namespace to a KV store is one solution
● Enterprise features slowly being added
○ Snapshots
○ NFS access
○ Geo replication
○ Erasure coding to reduce 3X copies to 1.3X

Page 22: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - Advanced Concepts

● Support for fadvise readahead and drop-behind
● HDFS takes advantage of multiple disks
○ Individual disk failures do not cause DN failures
○ Spills are parallelized
● Replica and task placement
○ Done by DNSToSwitchMapping.resolve()
○ User-supplied rack topology
○ IP address -> Rack id mapping
○ net.topology.* settings in core-site.xml
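The rack-resolution idea behind DNSToSwitchMapping.resolve() can be sketched as a minimal toy model. The class name and topology data are hypothetical; the real implementation is driven by the net.topology.* configuration and typically shells out to a user-supplied script:

```java
import java.util.*;

// Toy stand-in for DNSToSwitchMapping.resolve(): map node addresses
// to rack ids from a user-supplied topology table.
class ToyRackMapper {
    private final Map<String, String> topology;

    ToyRackMapper(Map<String, String> topology) {
        this.topology = topology;
    }

    // Hosts missing from the topology fall back to a default rack.
    List<String> resolve(List<String> hosts) {
        List<String> racks = new ArrayList<>();
        for (String h : hosts) {
            racks.add(topology.getOrDefault(h, "/default-rack"));
        }
        return racks;
    }
}
```

The NameNode uses this mapping to place replicas across racks (e.g. not all three copies behind one TOR switch) and to schedule tasks near their data.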

Page 23: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HDFS - Advanced Concepts

● A couple of tools for perf monitoring
○ Ganglia for HDFS
○ Nagios for the general health of the machine

Page 24: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Storage Hierarchy

- POSIX filesystem (*nix): general-purpose FS
- HDFS: large distributed storage, high aggregate throughput
- HBase: large indexed tables, fast random access, consistent
- Redis: in-memory KV store, extremely fast access
- Other KV store(s)


Page 26: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase

● History
○ Based on Google's Bigtable paper (2006)
○ Built at Powerset (later acquired by Microsoft)
○ Facebook and Yahoo use it extensively (~1000 machines)

● Goals
○ Random R/W access
○ Tables with billions of rows X millions of columns
○ Often referred to as a "NoSQL" data store
○ High-speed ingest rate, e.g. Facebook: ~1 billion messages+chats per day
○ Good consistency model

Page 27: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Key Components

Master(s), active and backup: NameNode, JobTracker, HMaster

Slaves, many: DataNode, TaskTracker, HRegionServer

ZK Cluster

Client

Page 28: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Data Model

● The Google Bigtable paper, in section 2, says:

"A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."

Let’s break that down over the next few slides...

Page 29: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Data Model

● Data is stored in tables
● Tables have rows and columns
● That's where the similarity ends
○ Columns are grouped into column families
● Rows are stored in sorted (increasing) order
○ Implies there is only one primary key
● Rows can be sparsely populated
○ Variable-length rows are common
● The same row can be updated multiple times
○ Each update is stored as a versioned update

Page 30: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Data Model
Conceptual View

Row Key | Time Stamp | ColumnFamily "contents" | ColumnFamily "anchor"
"com.cnn.www" | t9 | | anchor:cnnsi.com = "CNN"
"com.cnn.www" | t8 | | anchor:my.look.ca = "CNN.com"
"com.cnn.www" | t5 | contents:html = "<html>..." |
"com.cnn.www" | t3 | contents:html = "<html>..." |

Annotations:
- Row key: byte array, sorted by byte order
- Column = column family:qualifier; e.g. two columns in the "anchor" family, a single column in "contents"; values are byte arrays
- Versions: timestamps in milliseconds

Page 31: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Data Model
Physical View

Row Key | Time Stamp | ColumnFamily "anchor"
"com.cnn.www" | t9 | anchor:cnnsi.com = "CNN"
"com.cnn.www" | t8 | anchor:my.look.ca = "CNN.com"

Row Key | Time Stamp | ColumnFamily "contents"
"com.cnn.www" | t5 | contents:html = "<html>..."
"com.cnn.www" | t3 | contents:html = "<html>..."

Page 32: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Table Objects

A logical table (data: R1-R40) is split into shards called regions, e.g. Region1 holds rows R1-R10, Region2 holds R11-R20, ...

Regions are hosted by region servers, ~200 regions per server

Each region has an HLog/WAL and, per store, a MemStore plus HFiles; the HLog and HFiles are persisted as HDFS blocks

Page 33: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Data Model Operations

○ The HTable class offers 4 operations: get, put, delete and scan
○ The first 3 have a single or batch mode available

// Scan example

public static final byte[] CF1 = "empData1".getBytes();
public static final byte[] ATTR1 = "empId".getBytes();
HTable htable = new HTable(/* ... */); // create an instance of HTable

Scan scan = new Scan();
scan.addColumn(CF1, ATTR1);
scan.setStartRow(Bytes.toBytes("200"));
scan.setStopRow(Bytes.toBytes("500"));
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // do something with it...
  }
} finally {
  rs.close();
}

Page 34: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Data Versioning

○ By default a put() uses the current timestamp, but you can override it
○ Get.setMaxVersions() or Get.setTimeRange()
○ By default a get() returns the latest version, but you can ask for any
○ All data model operations return results in sorted order: Row:CF:Col:Version
○ Delete flavors: delete col+ver, delete col, delete col family, delete row
○ Deletes work by creating tombstone markers
○ LIMITATIONS:
■ A delete() masks a put() until a major compaction takes place
■ Major compactions can change get() results
○ All operations are ATOMIC within a row
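The tombstone behavior can be sketched as a toy model for a single row+column. Names are hypothetical and real HBase tracks tombstones per delete flavor, but the masking-until-major-compaction semantics are the point:

```java
import java.util.*;

// Toy model of delete-via-tombstone: a delete writes a marker that
// masks earlier puts until a major compaction drops both.
class ToyVersionedStore {
    // Versions of one cell, newest timestamp first.
    private final TreeMap<Long, String> versions =
        new TreeMap<>(Comparator.reverseOrder());
    private long tombstoneAt = Long.MIN_VALUE;

    void put(long ts, String value) { versions.put(ts, value); }

    // Delete: record a tombstone marker, do not remove data yet.
    void delete(long ts) { tombstoneAt = Math.max(tombstoneAt, ts); }

    // get(): latest version, unless it is masked by the tombstone.
    String get() {
        Map.Entry<Long, String> e = versions.firstEntry();
        if (e == null || e.getKey() <= tombstoneAt) return null;
        return e.getValue();
    }

    // Major compaction physically drops masked cells and the tombstone.
    void majorCompact() {
        versions.tailMap(tombstoneAt, true).clear(); // timestamps <= tombstone
        tombstoneAt = Long.MIN_VALUE;
    }
}
```

This also illustrates the listed limitation: a put() with a timestamp older than an existing tombstone stays invisible until a major compaction removes the marker.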

Page 35: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Read Path

The client resolves row locations through a catalog hierarchy, then reads:

1. Ask the ZK cluster: where is -ROOT-? Answer: RegionServer1
2. -ROOT- (the table that keeps track of the .META. table; rows are .META.,region,key -> regionInfo, server) answers: where is .META.? Answer: RegionServer2
3. .META. (the table of all regions in the system; never splits; rows are table,startKey,id -> regionInfo, server) gives the region for the row
4. The client sends HTable.get() to the owning region server
5. The region server checks the MemStore, then HFile-1, HFile-2, ...
6. The row is returned to the client

Page 36: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Write Path

The same catalog lookup applies, then the write:

1. Ask the ZK cluster: where is -ROOT-? Answer: RegionServer1
2. -ROOT- answers: where is .META.? Answer: RegionServer2
3. .META. gives the region for the row
4. The client sends HTable.put() to the owning region server
5. The put is appended to the HLog/WAL (stored as HDFS blocks) and applied to the MemStore
6. A return code goes back to the client; the MemStore is later flushed offline to HFiles

Page 37: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Shell

○ Table metadata: e.g. create/alter/drop/describe table
○ Table data: e.g. put/scan/delete/count row(s)
○ Admin: e.g. flush/rebalance/compact regions, split tables
○ Replication tools: e.g. add/enable/list/start/stop replication
○ Security: e.g. grant/revoke/list user permissions

Shell interaction example:

hbase(main):001:0> create 'myTable', 'myColFam1'
0 row(s) in 3.8890 seconds

hbase(main):002:0> put 'myTable', 'row-1', 'myColFam1:col1', 'value-1'
0 row(s) in 0.1840 seconds

hbase(main):003:0> scan 'myTable'
ROW                COLUMN+CELL
 row-1             column=myColFam1:col1, timestamp=1457381922312, value=value-1
1 row(s) in 0.1160 seconds

Page 38: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - Advanced Topics

○ Bulk Loading○ Cluster Replication○ Merging and Splitting of regions○ Predicate pushdown using Server side Filters○ Bloom filters○ Co-Processors○ Snapshots○ Performance Tuning

Page 39: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - What it's not

○ HBase is not for everyone
○ Has no support for
■ SQL
■ Joins
■ Secondary indexes
■ Transactions
■ JDBC driver
○ Works well with large deployments
○ Requires good working knowledge of the Hadoop eco-system

Page 40: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

HBase - What it's good at

● Strongly consistent reads/writes
● Automatic sharding
● Automatic RegionServer failover
● Supports MapReduce, with HBase as both source and sink
● Works on top of HDFS
● Provides a Java client API and a REST/Thrift API
● Block cache and Bloom filter support
● Web UI and JMX support, for operational management

Page 41: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Storage Hierarchy

- POSIX filesystem (*nix): general-purpose FS
- HDFS: large distributed storage, high aggregate throughput
- HBase: large indexed tables, fast random access, consistent
- Redis: in-memory KV store, extremely fast access
- Other KV store(s)


Page 43: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Redis

● Redis is an open source, in-memory key-value store, with disk persistence
● Originally written at LLOOGG by Salvatore Sanfilippo, ~2009
● Written in ANSI C; works on most Linux systems
● No external dependencies
● Very small: ~1 MB memory per instance
● Values can be data structures: String, Hash, Set, Sorted Set
● Compressed in-memory representation of data
● Clients are available in lots of languages: C, C#, Clojure, Scala, Lua, ...

Page 44: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Redis Key Components

[Diagram: N single-threaded server processes, one per CPU (CPU-1 through CPU-N); each has a highly optimized network layer and highly optimized memory storage, sharing the machine's network and memory]


Page 46: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Redis Network Layer

Client <-> TCP Server

- Typical request/response system
- For 10K requests, 20K network calls
- If each call takes 1 ms, 20 secs are lost
- Use batching, called pipelining
- Send one response for 10K requests
- Saving 10 seconds for 10K calls

Page 47: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Redis Network Layer

Client <-> TCP Server

Pipelining: requests 1, 2, 3, 4, ..., 10000 are batched together; replies are collected in a response queue

Page 48: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Redis Network Layer

Client <-> TCP Server, with pipelined requests and a response queue

● Bypass the OS socket layer abstraction
○ Uses low-level epoll(), kqueue(), select() calls
● Low overhead from waiting threads
● Allows handling of close to 10K concurrent clients
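What pipelining actually puts on the wire can be sketched with the Redis wire protocol (RESP), where each command is an array of bulk strings. This encoder is a simplified sketch (it assumes ASCII arguments, so string length equals byte length, and omits the socket round trip); the point is that many commands are concatenated into one payload and sent with a single write:

```java
// Sketch of RESP command encoding and client-side pipelining.
class RespEncoder {
    // RESP: "*<argc>\r\n" then "$<len>\r\n<arg>\r\n" for each argument.
    static String encode(String... args) {
        StringBuilder sb = new StringBuilder("*").append(args.length).append("\r\n");
        for (String a : args) {
            sb.append("$").append(a.length()).append("\r\n").append(a).append("\r\n");
        }
        return sb.toString();
    }

    // Pipelining: concatenate many commands into a single network payload,
    // so 10K requests cost one write and one batched read instead of 20K calls.
    static String pipeline(String[][] commands) {
        StringBuilder sb = new StringBuilder();
        for (String[] cmd : commands) sb.append(encode(cmd));
        return sb.toString();
    }
}
```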

Page 49: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Redis Memory Optimizations

● Integer encoding for small values
● Small hashes are converted to arrays
○ Leverages CPU caching
● Uses 32-bit builds when possible
● Leads to 5X to 10X memory savings
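The integer-encoding idea can be sketched as: if a value parses as an integer, store a fixed-size number instead of the raw string bytes. This is a toy illustration with a hypothetical class name, not Redis's actual object-encoding code:

```java
// Toy sketch of integer encoding for small values.
class ToyEncoding {
    // Store a compact 8-byte long when the value is numeric,
    // otherwise fall back to the raw string.
    static Object encodeValue(String v) {
        try {
            return Long.valueOf(v);
        } catch (NumberFormatException e) {
            return v;
        }
    }
}
```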

Page 50: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Redis Enterprise Features

[Diagram: a client talks to two shards (Shard 1, Shard 2); each shard is a cluster (Cluster 1, Cluster 2) consisting of a Redis master with Slave1 and Slave2 kept in sync via async replication]

Page 51: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Redis WrapUp

● Super-fast in-memory KV store
● Provides a CLI
● Typical apps will require client-side coding
● Spills to disk for large data sets, with reduced performance
● Upcoming "cluster" feature will keep 3 copies for HA

Page 52: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Storage Hierarchy

- POSIX filesystem (*nix): general-purpose FS
- HDFS: large distributed storage, high aggregate throughput
- HBase: large indexed tables, fast random access, consistent
- Redis: in-memory KV store, extremely fast access
- Other KV store(s)

Page 53: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Questions?

Page 54: Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

Storage Systems for Big Data

Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
[email protected], @sameertech