DESCRIPTION
There is a plethora of storage solutions for big data, each with its own pros and cons. The objective of this talk is to delve deeper into specific classes of storage types, such as distributed file systems, in-memory key-value stores, and Bigtable-style stores, and to provide insight on how to choose the right storage solution for a specific class of problems: for instance, running large analytic workloads, iterative machine learning algorithms, or real-time analytics. The talk covers HDFS, HBase, and a brief introduction to Redis.
Storage Systems for Big Data
Sameer Tiwari, Hadoop Storage Architect, Pivotal Inc.
[email protected], @sameertech
Storage Hierarchy
- POSIX file system (*nix): general-purpose FS
- HDFS: large distributed storage, high aggregate throughput
- HBase: large indexed tables, fast random access, consistent
- Redis: in-memory KV store, extremely fast access
- Other KV store(s)
Hadoop Distributed File System (HDFS)
● History
  ○ Based on the Google File System paper (2003)
  ○ Built at Yahoo by a small team
● Goals
  ○ Tolerance to hardware failure
  ○ Sequential access as opposed to random
  ○ High aggregate throughput for large data sets
  ○ The "Write Once, Read Many" paradigm
HDFS - Key Components
[Diagram: Client1 (writing FileA) and Client2 (writing FileB) talk to a NameNode and to DataNodes 1-4, which are spread across Rack 1 and Rack 2.]
● NN ops: metadata operations such as File.create() go from the client to the NameNode
● The NameNode holds, per file, its metadata (e.g. size, owner, ...) plus the block-to-DataNode mapping:
  ○ FileA: block AB1 on D1, D3, D4; block AB2 on D1, D3, D4
  ○ FileB: block BB1 on D1, D2, D4
● DN ops: data blocks written via File.write() stream from the client directly to the DataNodes (see the sketch below)
● DataNodes propagate replicas among themselves via replication pipelining
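To make the split concrete, here is a minimal sketch of the client write path using the standard Hadoop FileSystem API; the cluster address and file path are assumptions, not from the talk. create() is a ClientProtocol RPC to the NameNode, while the subsequent writes stream block data to the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster address
    FileSystem fs = FileSystem.get(conf);
    // Metadata op: create() asks the NameNode to enter the file into the namespace.
    FSDataOutputStream out = fs.create(new Path("/data/fileA"));
    // Data op: write() streams block data directly to DataNodes, which
    // replicate it down the pipeline.
    out.write("hello hdfs".getBytes("UTF-8"));
    out.close(); // completes the block and the replication pipeline
    fs.close();
  }
}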
HDFS - Communication
[Diagram: Client1 talks to the NameNode over RPC and to the DataNodes over a streaming channel; DataNodes replicate block AB2 between themselves via pipelining (streaming).]
● Client <-> NameNode: HDFS client API, RPC (ClientProtocol)
● Client <-> DataNode: non-RPC streaming with heavy buffering (DataTransferProtocol)
● DataNode -> NameNode: RPC (DataNodeProtocol)
  ○ DN registration: at init time
  ○ Heartbeat: stats about activity and capacity (every few secs)
  ○ Block report: list of blocks (hourly)
  ○ Block received: triggered by a client upload
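The read path mirrors this; a minimal sketch under the same assumptions as the write example above. open() is a ClientProtocol RPC that fetches block locations from the NameNode, and read() then streams bytes from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster address
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path("/data/fileA")); // NN: locate blocks
    BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    for (String line; (line = r.readLine()) != null; ) {
      System.out.println(line); // bytes stream straight from the DataNodes
    }
    r.close();
    fs.close();
  }
}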
HDFS - NameNode 1 of 4
● The heart of HDFS. Typically lots of memory, ~128 GB
● Hosts two important tables
  ○ The HDFS namespace: file -> block mapping (persisted for backup)
  ○ The inode table: block -> DataNode mapping (not persisted; rebuilt from block reports)
● HDFS is a journaled file system
  ○ Maintains a WAL called the edit log
  ○ The edit log is merged into the fsimage at a preset log size
HDFS - NameNode 2 of 4
● Can take on 3 roles
● Regular mode: hosts the HDFS namespace
● Backup mode: Secondary NN
  ○ Downloads the fsimage regularly
  ○ Merges changes to the namespace
  ○ "Secondary" is a misnomer; it is more of a checkpointing server
● Safemode: at startup time
  ○ A read-only mode
  ○ Collects data from active DNs
HDFS - NameNode 3 of 4
HA using the Quorum Journal Manager (Hadoop 2.0+)
[Diagram: an Active NN and a Standby NN share edits through a quorum of JournalNodes; a ZK cluster coordinates failover; DataNodes report to both NNs; clients fail over from one NN to the other.]
HDFS - NameNode 4 of 4
● Replication monitor: fixes over/under-replicated blocks
  ○ Replica modes: corrupt, current, out-of-date, under-construction
● Lease management: during file creation
  ○ Ensures a single writer (multiple readers are OK)
  ○ Synchronously checks the active lease
  ○ Asynchronously checks the entire tree of leases
● Heartbeat monitor: collects DN stats and marks a DN as down if no heartbeat is received for ~10 mins
HDFS - DataNode
● Typical machine: ~4 TB x 12 disks, JBOD
● Has no idea about HDFS files; only knows about blocks
● Serves 2 types of requests
  ○ NN requests for block create/delete/replicate
  ○ Block R/W requests from clients
● Maintains only one table
  ○ Block -> real bytes on the local FS
  ○ Stored locally and not backed up
  ○ The DN can rebuild this table by scanning its local dirs
● Creates a checksum file for each block
● Runs blockScanner() to find corrupt blocks
● DataNode-to-NameNode communication
  ○ Init: registration
  ○ Sends a heartbeat to the NN every few secs
  ○ Block completion: blockReceived()
  ○ Lets the NN respond with block commands
  ○ Sends a full block report every hour
HDFS - Typical Deployment
[Diagram: multiple racks (10-20 per group), each with a top-of-rack (TOR) switch; TOR switches feed aggregator switches 1, 2, 3, ..., which connect to a master switch.]
HDFS - Limitations
● The NN holds the namespace in a single Java process
  ○ A 64 GB heap == ~250 million files + blocks
  ○ Federation sort of solves the problem
  ○ Moving the namespace to a KV store is one solution
● Enterprise features are slowly being added
  ○ Snapshots
  ○ NFS access
  ○ Geo-replication
  ○ Erasure coding to reduce 3X copies to ~1.3X

HDFS - Advanced Concepts
● Support for fadvise readahead and drop-behind
● HDFS takes advantage of multiple disks
  ○ Individual disk failures do not cause DN failures
  ○ Spills are parallelized
● Replica and task placement
  ○ Done by DNSToSwitchMapping().resolve()
  ○ User-supplied rack topology
  ○ IP address -> rack id mapping
  ○ net.topology.* settings in core-site.xml (see the sketch below)
● A couple of tools for perf monitoring
  ○ Ganglia for HDFS
  ○ Nagios for the general health of the machine
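As a reference for the topology point above, a hedged sketch of wiring in a user-supplied topology script via core-site.xml; the property name is standard Hadoop, while the script path is an assumption.

<property>
  <name>net.topology.script.file.name</name>
  <!-- The script receives IP addresses/hostnames and prints one rack id per
       input, e.g. /rack1. Without a script, nodes land in /default-rack. -->
  <value>/etc/hadoop/conf/topology.sh</value>
</property>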
Storage Hierarchy
- POSIX file system (*nix): general-purpose FS
- HDFS: large distributed storage, high aggregate throughput
- HBase: large indexed tables, fast random access, consistent
- Redis: in-memory KV store, extremely fast access
- Other KV store(s)
HBase
● History
  ○ Based on Google's Bigtable (2006)
  ○ Built at Powerset (later acquired by Microsoft)
  ○ Facebook and Yahoo use it extensively (~1000 machines)
● Goals
  ○ Random R/W access
  ○ Tables with billions of rows x millions of columns
  ○ Often referred to as a "NoSQL" data store
  ○ High-speed ingest rate, e.g. FB == ~a billion msgs+chats per day
  ○ A good consistency model
HBase - Key Components
[Diagram: master nodes (active and backup) running the NameNode, JobTracker and HMaster; many slave nodes, each running a DataNode, TaskTracker and HRegionServer; a client coordinating through a ZK cluster.]
HBase - Data Model
● The Google Bigtable paper, in section 2, says:

  A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

Let's break that down over the next few slides...
HBase - Data Model
● Data is stored in tables
● Tables have rows and columns
● That's where the similarity to an RDBMS ends
  ○ Columns are grouped into column families
● Rows are stored in sorted (increasing) order
  ○ Implies there is only one primary key
● Rows can be sparsely populated
  ○ Variable-length rows are common
● The same row can be updated multiple times
  ○ Each update is stored as a versioned update
HBase - Data Model: Conceptual View

Row Key       | Time Stamp | ColumnFamily contents        | ColumnFamily anchor
"com.cnn.www" | t9         |                              | anchor:cnnsi.com = "CNN"
"com.cnn.www" | t8         |                              | anchor:my.look.ca = "CNN.com"
"com.cnn.www" | t5         | contents:html = "<html>..."  |
"com.cnn.www" | t3         | contents:html = "<html>..."  |

● Row key: a byte array, sorted by byte order
● Versions: timestamps, e.g. currentTimeMillis()
● Column => ColumnFamily:Qualifier, e.g. two columns in the "anchor" family, a single column in "contents"
● Each stored value is a byte array
HBase - Data Model: Physical View

ColumnFamily "anchor":
Row Key       | Time Stamp | ColumnFamily anchor
"com.cnn.www" | t9         | anchor:cnnsi.com = "CNN"
"com.cnn.www" | t8         | anchor:my.look.ca = "CNN.com"

ColumnFamily "contents":
Row Key       | Time Stamp | ColumnFamily contents
"com.cnn.www" | t5         | contents:html = "<html>..."
"com.cnn.www" | t3         | contents:html = "<html>..."
HBase - Table Objects
[Diagram: a logical table holding rows R1-R40 is sharded into regions, e.g. Region1 = R1-R10, Region2 = R11-R20; each region on a region server has a MemStore plus HFiles and an HLog/WAL, all persisted as HDFS blocks; a region server hosts ~200 regions.]
HBase - Data Model Operations
● The HTable class offers 4 techniques: get, put, delete and scan
● The first 3 have single and batch modes available

// Scan example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public static final byte[] CF1 = "empData1".getBytes();
public static final byte[] ATTR1 = "empId".getBytes();
// Create an instance of HTable; the slide elides the constructor args,
// so the table name here is a placeholder.
Configuration conf = HBaseConfiguration.create();
HTable htable = new HTable(conf, "employees");

Scan scan = new Scan();
scan.addColumn(CF1, ATTR1);
scan.setStartRow(Bytes.toBytes("200"));
scan.setStopRow(Bytes.toBytes("500"));
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // do something with r...
  }
} finally {
  rs.close();
}
HBase - Data Versioning
● By default a put() uses the current timestamp, but you can override it
● Get.setMaxVersions() or Get.setTimeRange() select versions (sketched below)
● By default a get() returns the latest version, but you can ask for any
● All data model operations return data in sorted order: Row:CF:Col:Version (versions newest first)
● Delete flavors: delete col+ver, delete col, delete col family, delete row
● Deletes work by creating tombstone markers
● LIMITATIONS:
  ○ A delete() masks a put() until a major compaction takes place
  ○ Major compactions can change get() results
● All operations are ATOMIC within a row
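A short sketch of the versioned reads and tombstone deletes described above, reusing the htable instance and old-style API from the scan example; the row and column names are illustrative.

// Fetch up to 3 versions of one cell instead of only the latest.
Get get = new Get(Bytes.toBytes("com.cnn.www"));
get.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
get.setMaxVersions(3);
Result result = htable.get(get);

// Delete the latest version of that cell: this writes a tombstone marker,
// which masks the put until a major compaction removes both.
Delete del = new Delete(Bytes.toBytes("com.cnn.www"));
del.deleteColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
htable.delete(del);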
HBase - Read Path
[Diagram, steps 1-6: (1) the client asks the ZK cluster "Where is -ROOT-?" (A: RegionServer1); (2) it reads -ROOT-, the table that tracks the .META. table (.META.,region,key -> regionInfo, server), to answer "Where is .META.?" (A: RegionServer2); (3) it reads .META., the table of all regions in the system (table, startKey, id -> regionInfo, server), which never splits; (4) it sends HTable.get() to the owning region server; (5) the region server consults its MemStore and HFiles; (6) the row is returned.]
HBase - Write Path
[Diagram, steps 1-6: the client locates -ROOT- via ZK, then .META., then the owning region server, exactly as in the read path; it sends HTable.put(); the region server appends to the HLog/WAL (HDFS blocks) and updates the MemStore, then returns a return code; the MemStore is flushed to HFiles offline.]
HBase - Shell
● Table metadata: e.g. create/alter/drop/describe table
● Table data: e.g. put/scan/delete/count row(s)
● Admin: e.g. flush/rebalance/compact regions, split tables
● Replication tools: e.g. add/enable/list/start/stop replication
● Security: e.g. grant/revoke/list user permissions

Shell interaction example:
hbase(main):001:0> create 'myTable', 'myColFam1'
0 row(s) in 3.8890 seconds

hbase(main):002:0> put 'myTable', 'row-1', 'myColFam1:col1', 'value-1'
0 row(s) in 0.1840 seconds

hbase(main):003:0> scan 'myTable'
ROW                COLUMN+CELL
 row-1             column=myColFam1:col1, timestamp=1457381922312, value=value-1
1 row(s) in 0.1160 seconds
HBase - Advanced Topics
○ Bulk loading
○ Cluster replication
○ Merging and splitting of regions
○ Predicate pushdown using server-side filters
○ Bloom filters
○ Co-processors
○ Snapshots
○ Performance tuning
HBase - What it's not
○ HBase is not for everyone
○ It has no support for
  ■ SQL
  ■ Joins
  ■ Secondary indexes
  ■ Transactions
  ■ A JDBC driver
○ Works well in large deployments
○ Requires a good working knowledge of the Hadoop ecosystem
HBase - What it's good at
● Strongly consistent reads/writes
● Automatic sharding
● Automatic RegionServer failover
● Supports MapReduce, with HBase as both source and sink (see the sketch below)
● Works on top of HDFS
● Provides a Java client API and a REST/Thrift API
● Block cache and Bloom filter support
● Web UI and JMX support, for operational management
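To illustrate the MapReduce integration, a hedged sketch of a trivial row counter that reads HBase as a source; the class and table names are illustrative, while TableMapReduceUtil and TableMapper are the standard HBase MapReduce helpers.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseRowCounter {
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.getCounter("hbase", "rows").increment(1); // one tick per row scanned
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "row-counter");
    job.setJarByClass(HBaseRowCounter.class);
    Scan scan = new Scan(); // full-table scan; narrow with start/stop rows in practice
    TableMapReduceUtil.initTableMapperJob("myTable", scan,
        CountMapper.class, NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class); // counter only, no output files
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}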
Storage Hierarchy
- POSIX file system (*nix): general-purpose FS
- HDFS: large distributed storage, high aggregate throughput
- HBase: large indexed tables, fast random access, consistent
- Redis: in-memory KV store, extremely fast access
- Other KV store(s)
Redis
● Redis is an open source, in-memory key-value store with disk persistence
● Originally written at LLOOGG by Salvatore Sanfilippo, ~2009
● Written in ANSI C; works on most Linux systems
● No external dependencies
● Very small: ~1 MB of memory per instance
● Datatypes can be data structures: String, Hash, Set, Sorted Set
● Compressed in-memory representation of data
● Clients are available in lots of languages: C, C#, Clojure, Scala, Lua... (see the Java sketch below)
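Java is one of the languages with a mature client; a minimal sketch using the Jedis client (the host, port and key names are assumptions, not from the talk), touching each of the datatypes listed above.

import redis.clients.jedis.Jedis;

public class RedisHello {
  public static void main(String[] args) {
    Jedis jedis = new Jedis("localhost", 6379);
    jedis.set("user:1000:name", "sameer");          // String value
    jedis.hset("user:1000", "role", "architect");   // Hash field
    jedis.sadd("talks", "hdfs", "hbase", "redis");  // Set members
    jedis.zadd("scores", 42, "hdfs");               // Sorted Set member
    System.out.println(jedis.get("user:1000:name"));
    jedis.close();
  }
}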
Redis Key Components
[Diagram: N single-threaded server instances, one per CPU (CPU-1 ... CPU-N), each pairing a highly optimized network layer with highly optimized memory storage; all instances share the machine's network and memory.]
Redis Network Layer
[Diagram: a client and the TCP server exchanging requests 1, 2, 3, 4 ... 10000 through a response queue.]
● A typical request/response system: 10K requests cost 20K network calls
● If each call takes 1 ms, 20 secs are lost
● Use batching, called pipelining: queue requests and send one batched response for 10K requests (sketched below)
● Saves 10 seconds for 10K calls
● Bypasses the OS socket-layer abstraction
  ○ Uses low-level epoll(), kqueue(), select() calls
● Low overhead from waiting threads
● Allows handling of close to 10K concurrent clients
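A minimal sketch of the pipelining idea using the Jedis client introduced earlier; the key names and counts are illustrative.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class RedisPipelineExample {
  public static void main(String[] args) {
    Jedis jedis = new Jedis("localhost", 6379);
    Pipeline p = jedis.pipelined();
    for (int i = 1; i <= 10000; i++) {
      p.set("key:" + i, "value:" + i); // queued client-side, not sent yet
    }
    p.sync(); // flushes all 10K commands and reads the batched replies
    jedis.close();
  }
}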
Redis Memory Optimizations
● Integer encoding for small values
● Small hashes are converted to arrays (tunable; see below)
  ○ Leverages CPU caching
● Uses 32-bit versions when possible
● Leads to 5X to 10X memory savings
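These compact encodings are tunable; a hedged sketch of the relevant redis.conf thresholds (the directive names are standard for Redis of this era; the values shown are the usual defaults).

# Hashes stay in the compact array (ziplist) encoding while they are small.
hash-max-ziplist-entries 128
hash-max-ziplist-value 64
# Sets of integers stay in the compact intset encoding up to this size.
set-max-intset-entries 512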
Redis Enterprise Features
[Diagram: two clusters, each split into Shard 1 and Shard 2; within a shard, a Redis master asynchronously replicates to Slave1 and Slave2, and clients talk to the master.]
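Setting up the async replication shown above is a one-line command per replica; a sketch using redis-cli and the classic SLAVEOF command (hostnames assumed).

redis-cli -h slave1 SLAVEOF master.example.com 6379   # start replicating from the master
redis-cli -h slave1 SLAVEOF NO ONE                    # later: promote the replica to a master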
Redis Wrap-Up
● A super-fast in-memory KV store
● Provides a CLI
● Typical apps will require client-side coding
● Spills to disk for large data sets, with reduced performance
● The upcoming "cluster" feature will keep 3 copies for HA
Storage Hierarchy
- POSIX file system (*nix): general-purpose FS
- HDFS: large distributed storage, high aggregate throughput
- HBase: large indexed tables, fast random access, consistent
- Redis: in-memory KV store, extremely fast access
- Other KV store(s)
Questions?
Storage Systems for Big Data
Sameer Tiwari, Hadoop Storage Architect, Pivotal Inc.
[email protected], @sameertech