HopsFS & ePipe
Mahmoud Ismail <[email protected]>
Gautier Berthou <[email protected]>
www.hops.io @hopshadoop
github.com/hopshadoop

Agenda
• From HDFS to HopsFS
• ePipe to enable a searchable HopsFS
• Tutorial

A File System with a million ops/sec and searchable in sub-seconds?!

Bill Gates' biggest product regret?

WinFS
*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/
•“WinFS was an attempt to bring the benefits of schema and relational databases to the Windows file system. …The WinFS effort was started around 1999 as the successor to the planned storage layer of Cairo and died in 2006 after consuming many thousands of hours of efforts from really smart engineers.” - [Brian Welcker]*

HDFS vs HopsFS
[Architecture diagram. HDFS: Active NameNode (ANN) and Standby NameNode (SbNN) with in-memory metadata management, Journal Nodes, ZooKeeper Nodes, Datanodes, and HDFS clients. HopsFS: multiple stateless NameNodes (one Leader) that store file/block metadata in NDB through a DAL Driver, Datanodes, and HopsFS/HDFS clients.]
Namespace metadata mappings:
• directory -> {f1, f2, ..}
• file -> {b1, b2, ..}
• block -> {r1, r2, ..}
• ...

JVM Heap is the limit
• Storing NameNode metadata in the JVM heap
• Very efficient, yet:
• The number of files/directories is limited
• Garbage collection pause times

One Lock to rule them all
[Diagram: namespace tree rooted at /root with subdirectories dir1, dir2, ..., dir n]
• Multi-reader, single-writer concurrency semantics
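To make the single-lock bottleneck concrete, here is a minimal sketch (not the actual FSNamesystem code) of a namespace-wide read/write lock: many readers proceed in parallel, but every mutation serializes behind one writer. All names are illustrative.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of HDFS-style namespace locking: one global lock guards
// the whole tree, so write-heavy workloads serialize on it.
class NamespaceLockSketch {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

    String getBlockLocations(String path) {
        fsLock.readLock().lock();           // readers can run concurrently
        try {
            return lookupBlocks(path);
        } finally {
            fsLock.readLock().unlock();
        }
    }

    void mkdir(String path) {
        fsLock.writeLock().lock();          // a single writer blocks every other operation
        try {
            createDirectory(path);
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    private String lookupBlocks(String path) { return "blk_0 @ dn1"; }  // placeholder
    private void createDirectory(String path) { /* placeholder */ }
}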

Move the NameNode metadata off the JVM Heap
• Stateless NameNodes
• Multiple NameNodes to increase throughput
• Throughput depends on the chosen data store
• Choosing a data store?
• An in-memory storage system that can be efficiently queried and managed; preferably open source
• Row-level locking
• Efficient cross-partition transactions

MySQL Cluster (NDB) to the rescue
• NewSQL (relational) DB
• User-defined partitioning
• Row-level locking
• Distribution-aware transactions
• Partition-pruned index scans
• Real-time, 2-phase commit
• 1.2 sec TransactionInactive timeouts
• Commodity hardware
• Scales to 48 nodes
• Supports on-disk columns
• SQL API
• C++/Java Native API
• C++ Event API
200 million NoSQL ops/sec
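As a hedged illustration of the Java Native API bullet: ClusterJ is the Java connector for NDB, and a primary-key read through it maps to a single-row, distribution-aware operation. The table name, columns, connect string, and database name below are assumptions for illustration, not the actual HopsFS schema.

import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

public class NdbPrimaryKeyRead {

    // Hypothetical mapping of an "inodes" table; the real HopsFS schema differs.
    @PersistenceCapable(table = "inodes")
    public interface InodeDTO {
        @PrimaryKey
        int getId();
        void setId(int id);

        String getName();
        void setName(String name);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("com.mysql.clusterj.connectstring", "ndb-mgmd:1186"); // placeholder host
        props.setProperty("com.mysql.clusterj.database", "hops");               // placeholder schema

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // Primary-key read: served by the single NDB data node that owns the row.
        InodeDTO inode = session.find(InodeDTO.class, 1);
        System.out.println(inode == null ? "not found" : inode.getName());

        session.close();
        factory.close();
    }
}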

HopsFS
[Architecture diagram: stateless HopsFS NameNodes (one elected Leader) store file and block metadata in NDB through the DAL Driver; Datanodes and HopsFS/HDFS clients connect to the NameNodes.]

From Memory to Database
[Diagram: the in-memory namespace tree (/root, 2014, us, ny, dir1, b.log, c.log, ...) becomes rows in database tables such as INode and Quota.]

Apache NameNode Internals
Client operations: mkdir, getBlockLocations, createFile, ...
[Diagram of the RPC path inside the NameNode: a Listener (NIO thread) accepts client connections into a ConnectionList; Reader threads (ipc.server.read.threadpool.size, default 1) enqueue calls in the Call Queue; Handler threads (dfs.namenode.service.handler.count, default 10) acquire the FSNamesystem lock, update the in-memory metadata and the EditLog buffer, flush the buffer to the Journal Nodes (EditLog1-3), and on receiving the ackIds hand the completed RPCs to the Responder (NIO thread).]

HopsFS NameNode Internals
Client operations: mkdir, getBlockLocations, createFile, ...
[Diagram: the same RPC path (Listener, Readers, Call Queue, Handlers, Responder) as in Apache HDFS, but the Handlers go through the DAL API and its DAL implementation to NDB tables (inodes, block_infos, replicas, leases, ...) instead of the in-memory namespace, EditLog, and Journal Nodes.]

Fine-grained Locking
[Diagram: a large namespace tree (/root, 2014, eu, us, se, de, archive, dept, dir1, dir2, ..., a.log, b.log, c.log, d0, d1, ..., dn)]
• Hierarchical Locking (Implicit Locking)
• Subtree Locking
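As a rough, in-memory sketch of the hierarchical locking idea (HopsFS does this with database row locks inside a transaction, not with JVM locks; all names here are illustrative): read-lock the ancestors of the path and write-lock only the directory being modified, so operations in disjoint subtrees never contend.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class FineGrainedLockingSketch {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
    }

    void mkdir(String path) {                        // e.g. "/2014/us/ny"
        String[] parts = path.split("/");
        List<ReentrantReadWriteLock.ReadLock> held = new ArrayList<>();
        String prefix = "";
        // Shared locks on the ancestors above the parent directory.
        for (int i = 1; i < parts.length - 2; i++) {
            prefix = prefix + "/" + parts[i];
            ReentrantReadWriteLock.ReadLock rl = lockFor(prefix).readLock();
            rl.lock();
            held.add(rl);
        }
        String parent = parts.length > 2 ? prefix + "/" + parts[parts.length - 2] : "/";
        ReentrantReadWriteLock.WriteLock wl = lockFor(parent).writeLock();
        wl.lock();                                   // exclusive lock only on the parent directory
        try {
            // ... insert the new directory under `parent` here ...
        } finally {
            wl.unlock();
            for (int i = held.size() - 1; i >= 0; i--) {
                held.get(i).unlock();                // release in reverse (leaf-to-root) order
            }
        }
    }
}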

Implicit Locking
[Diagram: namespace tree (/root, 2014, us, ny, dir1, b.log, c.log, ...) with the associated INode and Quota rows covered by an inode lock.]

Pluggable DBs: Data Abstraction Layer
[Diagram: the NameNode (Apache v2) programs against the DAL API (Apache v2); backends plug in underneath it, e.g. the NDB-DAL-Impl (GPL v2) or another DB (other license).]
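To give a feel for the abstraction, here is a hypothetical sketch of a DAL-style data-access interface; the interface and method names are illustrative and not the actual hops-metadata-dal API. The NameNode codes against the interface, and a backend (e.g. the NDB implementation) supplies the concrete class.

import java.util.Collection;
import java.util.List;

// Hypothetical DAL-style interface (the Apache v2 side of the boundary).
interface InodeDataAccess<T> {
    // Point lookup used by path resolution.
    T findByNameAndParentId(String name, long parentId) throws StorageException;

    // Children of a directory, e.g. for listStatus.
    List<T> findByParentId(long parentId) throws StorageException;

    // Persist the changes collected by one file system operation.
    void prepare(Collection<T> removed, Collection<T> added, Collection<T> modified)
            throws StorageException;
}

// Checked exception thrown by any backend behind the DAL.
class StorageException extends Exception {
    public StorageException(String message) { super(message); }
}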

Spotify Workload
Op Name            Percentage         Op Name            Percentage
append file        0.0%               content summary    0.01%
mkdirs             0.02%              set permissions    0.03% [26.3%*]
set replication    0.14%              set owner          0.32% [100%*]
delete             0.75% [3.5%*]      create file        1.2%
rename             1.3% [0.03%*]      add blocks         1.5%
list (listStatus)  9% [94.5%*]        stat (fileInfo)    17% [23.3%*]
read (getBlkLoc)   68.73%

Table 1: Relative frequency of operations on a Spotify HDFS cluster. List, read, and stat operations account for ≈95% of the metadata operations in the cluster. *Of which, the relative percentage is on directories.
Excerpt from the HopsFS paper (see citation below):

… NameNode on which to execute file system operations. HopsFS clients periodically refresh the NameNode list, enabling new NameNodes to join an operational cluster. HDFS v2.x clients are fully compatible with HopsFS, although they do not distribute operations over NameNodes, as they assume there is a single active NameNode.

MySQL Cluster: NDB is the storage engine for MySQL Cluster and it is a shared-nothing, in-memory, auto-sharding, consistent, distributed, relational database [38]. NDB frequently checkpoints the data and supports both NDB datanode and cluster-level recovery.

NDB horizontally partitions the tables among the NDB datanodes. It also supports application defined partitioning (ADP) for the tables. The transaction coordinators are located on all the NDB datanodes, enabling high performance transactions between data shards. Distribution aware transactions (DAT) are possible by providing a hint, based on the application defined partitioning scheme, to start a transaction on the NDB datanode containing the data read/updated by the transaction. In particular, single row read operations and partition pruned index scans (scan operations in which a single data shard participates) benefit from distribution aware transactions as they can read all their data locally [77]. Incorrect hints result in additional network traffic being incurred but otherwise correct system operation.

NDB only supports read-committed transaction isolation, which does not allow dirty reads, but phantom and fuzzy (non-repeatable) reads can happen in a transaction [7]. NDB supports row level locks, exclusive (write) locks, shared (read) locks, and read-committed locks. HopsFS uses locks to serialize conflicting file system operations.

3 Partitioning Scheme and Transactions
Metadata for hierarchical distributed file systems typically contains information on inodes, blocks, replicas, quotas, leases and mappings (directories to files, files to blocks, and blocks to replicas). When metadata is distributed, an application defined partitioning scheme is needed to shard the metadata and a distributed consensus protocol is required to ensure metadata integrity for operations that cross shards. Quorum-based consensus protocols, such as Paxos, provide high performance within a single shard, but are typically combined with transactions, implemented using the two-phase commit protocol, for operations that cross shards, as in Megastore [6] and Spanner [11]. File system operations in HopsFS are implemented primarily using transactions and row-level locks in MySQL Cluster to provide serializability [23] for metadata operations.

Figure 2: (a) Shows the relative cost of different operations in a NewSQL database. (b) HopsFS avoids FTS and IS operations, as the cost of these operations is relatively higher than PPIS, B, and PK operations.

The choice of partitioning scheme for the hierarchical namespace is a key design decision for distributed metadata architectures. We base our partitioning scheme on the expected relative frequency of HDFS operations and the cost of different database operations that can be used to implement the file system operations. Table 1 shows the relative frequency of selected HDFS operations in a workload generated by Hadoop applications (Pig, Hive, HBase, MapReduce, Tez, Spark, and Giraph) at Spotify. List, stat and file read operations alone account for ≈95% of the operations in the HDFS cluster. These statistics are similar to the published workloads for Hadoop clusters at Yahoo [1], LinkedIn [53], and Facebook [66]. Figure 2a shows the relative cost of different database operations. We can see that the cost of a full table scan or an index scan, in which all database shards participate, is much higher than a partition pruned index scan in which only a single database shard participates. HopsFS implements common file system operations using only the low cost database operations, that is, primary key reads, batched primary key reads, and partition pruned index scans. For example, the read and directory listing operations (see Table 3) are implemented using only (batched) primary key lookups and partition pruned index scans. Index scans and full table scans were avoided, where possible, as they touch all database shards and scale poorly.
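As a hedged illustration of the partition-pruned index scan idea: if the inodes table is partitioned on parent_id, then a directory listing that filters on parent_id only touches the one shard holding that directory's children. The table, columns, host, and credentials below are simplified assumptions, not the exact HopsFS schema; the query goes through a MySQL server front-end of the NDB cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PartitionPrunedListing {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://mysqld-host:3306/hops", "hops", "secret")) {   // placeholders

            // With inodes partitioned by parent_id, this predicate lets NDB prune the
            // index scan to the single partition that stores the directory's children.
            String sql = "SELECT id, name FROM inodes WHERE parent_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, 42L);                  // inode id of the directory being listed
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + "\t" + rs.getString("name"));
                    }
                }
            }
        }
    }
}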
3.1 HopsFS Partitioned Metadata
In HopsFS, the file system metadata is stored in tables where a directory inode is represented by a single row in the Inode table. File inodes, however, have more associated metadata, such as a set of blocks, block locations, and checksums, that are stored in separate tables, with …

[Niazi, Salman, et al. "HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases." arXiv preprint (2016)]

To Infinity and Beyond
NDB setup: 8 nodes with Xeon E5-2620 2.40GHz processors and 10GbE. NameNodes: Xeon E5-2620 2.40GHz machines and 10GbE.
[Plot: throughput (ops/sec, up to ~700k) vs. number of NameNodes (1-46) for the Spotify workload; HopsFS-Spotify reaches roughly 8.5x the throughput of HDFS-Spotify.]

Bigger Clusters
[Figure 7: Performance of rename and delete operations on large directories (0.25-1.00 million files per directory). Note the different time scales for HDFS (ms) and HopsFS (sec). HopsFS is an order of magnitude slower as it reads all the descendant inodes of the subtree from an external database.]
… the memory. However, due to the low frequency of such operations in typical industrial workloads (see Table 1), we think it is an acceptable trade-off for the higher performance of common file system operations in HopsFS.

7.3 Metadata (Namespace) Scalability
In HDFS, as the entire namespace metadata must fit on the heap of a single JVM, the data structures are highly optimized to reduce the memory footprint [62]. In HDFS, a file with two blocks that are replicated three ways requires 448 + L bytes of metadata¹, where L represents the filename length. If the file names are 10 characters long, then a 1 GB JVM heap can store 2.3 million files. In reality the JVM heap size has to be significantly larger to accommodate secondary metadata, thousands of concurrent RPC requests, block reports that can each be tens of megabytes in size, as well as other temporary objects.
          Number of Files
Memory    HDFS             HopsFS
1 GB      2.3 million      0.44 million
50 GB     115 million      22 million
100 GB    230 million      44 million
200 GB    460 million      88 million
500 GB    Does Not Scale   220 million
1 TB      Does Not Scale   440 million
24 TB     Does Not Scale   10.8 billion

Table 2: HDFS and HopsFS Metadata Scalability.
Migrating the metadata to a database causes an expansion in the amount of memory required to accommodate indexes, primary/foreign keys and padding. In order to calculate the size of each entity we use a tool called sizer [40]. In HopsFS the same file described above takes 2420 bytes if the metadata is replicated twice. For a highly available deployment with active and standby namenodes for HDFS, you will need twice the amount of memory; thus, HopsFS requires ≈2 times more memory than HDFS to store metadata that is highly available.

Table 2 shows the metadata scalability of HDFS and HopsFS. NDB supports up to 48 datanodes, which allows it to scale up to 24 TB of data in a cluster with 512 GB RAM on each NDB datanode. HopsFS can store up to 10.8 billion files using 24 TB of metadata, which is an order of magnitude higher (24 times) than HDFS.

¹These size estimates are for HDFS version 2.0.4, from which HopsFS was forked. Newer versions of HDFS require additional memory for new features such as snapshots and extended attributes.
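The first row of Table 2 can be reproduced from the per-file estimates quoted above (448 + L bytes for HDFS with 10-character names, and roughly 2420 bytes for HopsFS with the metadata replicated twice); a quick back-of-the-envelope check:

public class MetadataFootprint {
    public static void main(String[] args) {
        long oneGiB = 1L << 30;

        // HDFS: a two-block file replicated three ways needs 448 + L bytes;
        // with L = 10 that is 458 bytes per file.
        long hdfsBytesPerFile = 448 + 10;
        System.out.printf("HDFS   files per 1 GB heap      : %.2f million%n",
                oneGiB / (double) hdfsBytesPerFile / 1e6);   // ~2.34 million

        // HopsFS: the same file takes about 2420 bytes in NDB (indexes, keys and
        // padding included) with the metadata replicated twice.
        long hopsBytesPerFile = 2420;
        System.out.printf("HopsFS files per 1 GB of memory : %.2f million%n",
                oneGiB / (double) hopsBytesPerFile / 1e6);   // ~0.44 million
    }
}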
7.4 Industrial Workload Experiments
We benchmarked HopsFS using workloads based on operational traces from Spotify, which operates a Hadoop cluster consisting of 1600+ nodes containing 60 petabytes of data. The namespace contains 13 million directories and 218 million files, where each file on average contains 1.3 blocks. The Hadoop cluster at Spotify runs on average forty thousand jobs of different applications, such as Pig, Hive, HBase, MapReduce, Tez, Spark, and Giraph, every day. The file system workload generated by these applications is summarized in Table 1, which shows the relative frequency of HDFS operations. At Spotify the average file path depth is 7 and the average inode name length is 34 characters. On average each directory contains 16 files and 2 sub-directories. There are 289 million blocks stored on the datanodes. We use these statistics to generate file system workloads that approximate HDFS usage in production at Spotify.
7.4.1 Scalability Argument
Figure 8 shows that, for our industrial workload, HopsFS delivers 2.6 times the throughput of HDFS with 12 namenodes and 8 NDB nodes. Even for 2-node NDB deployments, HopsFS can outperform HDFS and scale linearly up to 8 namenodes.

As discussed before, in medium to large Hadoop clusters 5 to 8 servers are required to provide high availability for HDFS. With 6 servers, HopsFS delivers higher throughput than HDFS, increasing further as more namenodes are added to the system.

For higher numbers of namenodes, HopsFS' throughput levels off because NDB becomes overloaded. By increasing the number of NDB datanode instances to 4, HopsFS increases throughput up to 11 namenodes, and by increasing the number of NDB datanode instances to 8, HopsFS scales up to at least 12 namenodes. The 2-node NDB cluster has a similar performance as the 4-node NDB cluster because each NDB datanode in the 2-node NDB cluster holds a complete copy of the entire metadata, which improves the performance of transactions as all the …
[Figure 8: HopsFS and HDFS throughput (ops/sec, 20K-140K) vs. number of NameNodes (1-12) for our industrial workload; HopsFS with 2, 4, and 8 NDB nodes compared to HDFS (using 5 servers).]

Tinker Friendly Metadata
• The database (NDB) is the single source of truth
• Extending INodes (files/directories)
• Adding a new table with a foreign key to the inodes table
• Attaching metadata to a file/directory
• Schema-less
• Schema-based

Schema-less Metadata

inodeID  Name       parentId
1        /          0
2        Users      1
3        alice.txt  2

attach /Users/alice.txt '{"age": 20, "gender": "female", "about": "I am alice"}'

inodeID  metadata
3        {"age": 20, "gender": "female", "about": "I am alice"}
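Because the attached metadata lives in the same database as the namespace, attaching a schema-less blob can be as simple as inserting one extra row keyed by the inode id. The table name (inode_metadata), host, and credentials below are illustrative assumptions, not the actual Hopsworks schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AttachSchemalessMetadata {
    public static void main(String[] args) throws Exception {
        String json = "{\"age\": 20, \"gender\": \"female\", \"about\": \"I am alice\"}";

        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://mysqld-host:3306/hops", "hops", "secret");     // placeholders
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO inode_metadata (inodeID, metadata) VALUES (?, ?)")) {
            ps.setInt(1, 3);              // inode id of /Users/alice.txt from the table above
            ps.setString(2, json);
            ps.executeUpdate();
        }
    }
}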

Schema-based Metadata
[Diagram: a Template is made up of Tables, each Table of Columns; an INode is associated with a Template, and attached values are stored as Metadata Rows under those Columns.]
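Reading the diagram as a small data model: a template groups tables, a table groups columns, and each attached value becomes a metadata row linking an inode to a column. A hypothetical sketch of that model follows (not the actual Hopsworks classes; the template and values are invented for illustration).

import java.util.ArrayList;
import java.util.List;

public class SchemaBasedMetadataSketch {

    // Hypothetical model classes; names and fields are illustrative only.
    static class Template {
        final String name;
        final List<Table> tables = new ArrayList<>();
        Template(String name) { this.name = name; }
    }

    static class Table {
        final String name;
        final List<String> columns = new ArrayList<>();
        Table(String name) { this.name = name; }
    }

    // One value of one column attached to one inode.
    record MetadataRow(long inodeId, String table, String column, String value) {}

    public static void main(String[] args) {
        Template genomics = new Template("genomics");      // invented example template
        Table sample = new Table("sample");
        sample.columns.add("species");
        sample.columns.add("tissue");
        genomics.tables.add(sample);

        // Attach values for inode 3 (/Users/alice.txt in the earlier example).
        List<MetadataRow> rows = List.of(
                new MetadataRow(3, "sample", "species", "human"),
                new MetadataRow(3, "sample", "tissue", "liver"));
        rows.forEach(System.out::println);
    }
}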

Searching through the Namespace
• hdfs -find
• limited
• inefficient by design
• Pig/MapReduce
• The NameNode is a critical point; avoid overloading it
• SQL query on NDB
• One size does not fit all
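To see why client-side search is inefficient by design: a find-style scan walks the namespace from the client, issuing one listStatus RPC against the NameNode per directory visited. A minimal sketch with the standard Hadoop FileSystem API (the search term is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NaiveFind {

    // Recursively look for files whose name contains `needle`. Every directory
    // visited costs one listStatus RPC, so a namespace-wide search hammers the
    // metadata service instead of using a search index.
    static void find(FileSystem fs, Path dir, String needle) throws Exception {
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                find(fs, status.getPath(), needle);
            } else if (status.getPath().getName().contains(needle)) {
                System.out.println(status.getPath());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        find(fs, new Path("/"), "alice");
    }
}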

Polyglot Persistence: the Right Tool for the Job

Monolithic vs Polyglot Persistence
[http://martinfowler.com/]

Asynchronous Update of the Metadata
Eventual consistency for metadata. Metadata integrity is maintained by asynchronous replication and metadata immutability.
[Diagram: files, directories, and extended metadata live in the database; ePipe replicates them one-way (immutable data) into search indexes in Elasticsearch.]

HopsFS | Elasticsearch
[Diagram: ePipe tails metadata changes from MySQL Cluster (Reader/Tailer per table unit), batches them (Batcher), and pushes them through the ElasticHandler to the Elasticsearch cluster.]
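Once ePipe has replicated the metadata into Elasticsearch, free-text search becomes an index query instead of a namespace walk. A hedged sketch against the plain Elasticsearch REST search API; the index name (hopsfs_metadata), field name (metadata), and host are assumptions for illustration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchMetadata {
    public static void main(String[] args) throws Exception {
        // Match inodes whose attached metadata mentions "alice".
        String query = "{\"query\": {\"match\": {\"metadata\": \"alice\"}}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://elastic-host:9200/hopsfs_metadata/_search")) // placeholders
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());         // JSON hits listing the matching inodes
    }
}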

Supporting Project-Level Multi-Tenancy
How can we introduce GitHub-style projects to Hadoop?

Tutorial
• Karamel Automated Installation
• http://www.hops.io/?q=boss
• HopsWorks Clusters
• http://52.210.205.125:8080/hopsworks/
• http://52.209.143.68:8080/hopsworks/
• http://52.18.186.135:8080/hopsworks/

Conclusions
• Hops is a next-generation distribution of Hadoop.
• HopsWorks is a frontend to Hops that supports multi-tenancy, free-text search, interactive analytics with Zeppelin/Flink/Spark, and batch jobs.
• Looking for contributors/committers
• Check out our GitHub (github.com/hopshadoop)

Questions

Karamel

Automated Installation
• Vagrant/Chef to spin up on a single host
• Karamel/Chef to deploy on AWS/GCE/OpenStack or on-premises
name: HopsWorks
ec2:
  type: m3.medium
cookbooks:
  hadoop:
    github: "hopshadoop/hopsworks-chef"
    version: "v0.1"
groups:
  ui:
    size: 1
    recipes:
      - hopsworks
  metadata:
    size: 2
    recipes:
      - hops::nn
      - hops::rm
  datanodes:
    size: 50
    recipes:
      - hops::dn
      - hops::nm

HopsWorks

Problem: Sensitive Data needs its own Cluster
[Diagram: Alice has access to the NSA DataSet and is then given access to the User DataSet.]
• Alice can copy/cross-link between data sets
• Alice has only one Kerberos identity; neither attribute-based access control nor dynamic roles are supported in Hadoop

Solution: Project-Specific UserIDs
[Diagram: Alice is a member of Project NSA as NSA__Alice and of Project Users as Users__Alice; HDFS enforces access control on these project-specific users.]
How can we share DataSets between Projects?

Sharing DataSets between Projects
[Diagram: a project owns a DataSet; Alice is a member of Project NSA (NSA__Alice) and of Project Users (Users__Alice).]
• Add members of Project NSA to the DataSet group

HopsWorks enforces dynamic Roles
[Diagram: Alice authenticates once to HopsWorks; depending on the active project, HopsWorks securely impersonates her as NSA__Alice or Users__Alice towards HopsFS, HopsYARN, and Kafka, using X.509 certificates.]

X.509 Certificate per project-specific user
[Diagram: a Project Manager adds/removes users; certificate signing requests go to the Root CA; certificates are inserted into/removed from the distributed database and used to authenticate to services (Hadoop, Spark, Kafka, etc.).]

Project
• A project is a collection of
• Members
• HDFS DataSets
• Kafka Topics
• Notebooks, Jobs
• A project has an owner
• A project has quotas
[Diagram: a project groups HDFS datasets (dataset 1 ... dataset N) and Kafka topics (Topic 1 ... Topic N).]

Project Roles
• Data Owner privileges
  - Import/export data
  - Manage membership
  - Share DataSets, Topics
• Data Scientist privileges
  - Write and run code
We delegate administration of privileges to users