HopsFS & ePipe
Mahmoud Ismail <[email protected]>
Gautier Berthou <[email protected]>
www.hops.io @hopshadoop
github.com/hopshadoop

Agenda
• From HDFS to HopsFS
• ePipe to enable a searchable HopsFS
• Tutorial

A File System with a million ops/sec and searchable in sub-seconds?!

Bill Gates' biggest product regret?

WinFS
*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/
•“WinFS was an attempt to bring the benefits of schema and relational databases to the Windows file system. …The WinFS effort was started around 1999 as the successor to the planned storage layer of Cairo and died in 2006 after consuming many thousands of hours of efforts from really smart engineers.” - [Brian Welcker]*

HDFS vs HopsFS
[Architecture diagram. HDFS: Active NameNode (ANN) and Standby NameNode (SbNN) with in-memory metadata management, Journal Nodes, ZooKeeper Nodes, Datanodes, and HDFS clients. HopsFS: multiple stateless NameNodes (one Leader) that store file/block metadata in NDB through a DAL Driver, Datanodes, and HopsFS/HDFS clients.]
Namespace metadata mappings:
• directory -> {f1, f2, ..}
• file -> {b1, b2, ..}
• block -> {r1, r2, ..}
• ...

JVM Heap is the limit
• Storing NameNode metadata in the JVM heap
• Very efficient, yet:
• The number of files/directories is limited
• Garbage collection pause times

One Lock to rule them all
[Diagram: namespace tree rooted at /root with subdirectories dir1, dir2, ..., dir n]
• Multi-reader, single-writer concurrency semantics
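To make the single-lock bottleneck concrete, here is a minimal sketch (not the actual FSNamesystem code) of a namespace-wide read/write lock: many readers proceed in parallel, but every mutation serializes behind one writer. All names are illustrative.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of HDFS-style namespace locking: one global lock guards
// the whole tree, so write-heavy workloads serialize on it.
class NamespaceLockSketch {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

    String getBlockLocations(String path) {
        fsLock.readLock().lock();           // readers can run concurrently
        try {
            return lookupBlocks(path);
        } finally {
            fsLock.readLock().unlock();
        }
    }

    void mkdir(String path) {
        fsLock.writeLock().lock();          // a single writer blocks every other operation
        try {
            createDirectory(path);
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    private String lookupBlocks(String path) { return "blk_0 @ dn1"; }  // placeholder
    private void createDirectory(String path) { /* placeholder */ }
}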

Move the NameNode metadata off the JVM Heap
• Stateless NameNodes
• Multiple NameNodes to increase throughput
• Throughput depends on the chosen data store
• Choosing a data store?
• An in-memory storage system that can be efficiently queried and managed; preferably open source
• Row-level locking
• Efficient cross-partition transactions

MySQL Cluster (NDB) to the rescue
• NewSQL (relational) DB
• User-defined partitioning
• Row-level locking
• Distribution-aware transactions
• Partition-pruned index scans
• Real-time, 2-phase commit
• 1.2 sec TransactionInactive timeouts
• Commodity hardware
• Scales to 48 nodes
• Supports on-disk columns
• SQL API
• C++/Java Native API
• C++ Event API
200 million NoSQL ops/sec
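As a hedged illustration of the Java Native API bullet: ClusterJ is the Java connector for NDB, and a primary-key read through it maps to a single-row, distribution-aware operation. The table name, columns, connect string, and database name below are assumptions for illustration, not the actual HopsFS schema.

import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

public class NdbPrimaryKeyRead {

    // Hypothetical mapping of an "inodes" table; the real HopsFS schema differs.
    @PersistenceCapable(table = "inodes")
    public interface InodeDTO {
        @PrimaryKey
        int getId();
        void setId(int id);

        String getName();
        void setName(String name);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("com.mysql.clusterj.connectstring", "ndb-mgmd:1186"); // placeholder host
        props.setProperty("com.mysql.clusterj.database", "hops");               // placeholder schema

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // Primary-key read: served by the single NDB data node that owns the row.
        InodeDTO inode = session.find(InodeDTO.class, 1);
        System.out.println(inode == null ? "not found" : inode.getName());

        session.close();
        factory.close();
    }
}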

HopsFS
[Architecture diagram: stateless HopsFS NameNodes (one elected Leader) store file and block metadata in NDB through the DAL Driver; Datanodes and HopsFS/HDFS clients connect to the NameNodes.]

From Memory to Database
[Diagram: the in-memory namespace tree (/root, 2014, us, ny, dir1, b.log, c.log, ...) becomes rows in database tables such as INode and Quota.]

Apache NameNode Internals
Client operations: mkdir, getBlockLocations, createFile, ...
[Diagram of the RPC path inside the NameNode: a Listener (NIO thread) accepts client connections into a ConnectionList; Reader threads (ipc.server.read.threadpool.size, default 1) enqueue calls in the Call Queue; Handler threads (dfs.namenode.service.handler.count, default 10) acquire the FSNamesystem lock, update the in-memory metadata and the EditLog buffer, flush the buffer to the Journal Nodes (EditLog1-3), and on receiving the ackIds hand the completed RPCs to the Responder (NIO thread).]

HopsFS NameNode Internals
Client operations: mkdir, getBlockLocations, createFile, ...
[Diagram: the same RPC path (Listener, Readers, Call Queue, Handlers, Responder) as in Apache HDFS, but the Handlers go through the DAL API and its DAL implementation to NDB tables (inodes, block_infos, replicas, leases, ...) instead of the in-memory namespace, EditLog, and Journal Nodes.]

Fine-grained Locking
[Diagram: a large namespace tree (/root, 2014, eu, us, se, de, archive, dept, dir1, dir2, ..., a.log, b.log, c.log, d0, d1, ..., dn)]
• Hierarchical Locking (Implicit Locking)
• Subtree Locking
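As a rough, in-memory sketch of the hierarchical locking idea (HopsFS does this with database row locks inside a transaction, not with JVM locks; all names here are illustrative): read-lock the ancestors of the path and write-lock only the directory being modified, so operations in disjoint subtrees never contend.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class FineGrainedLockingSketch {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
    }

    void mkdir(String path) {                        // e.g. "/2014/us/ny"
        String[] parts = path.split("/");
        List<ReentrantReadWriteLock.ReadLock> held = new ArrayList<>();
        String prefix = "";
        // Shared locks on the ancestors above the parent directory.
        for (int i = 1; i < parts.length - 2; i++) {
            prefix = prefix + "/" + parts[i];
            ReentrantReadWriteLock.ReadLock rl = lockFor(prefix).readLock();
            rl.lock();
            held.add(rl);
        }
        String parent = parts.length > 2 ? prefix + "/" + parts[parts.length - 2] : "/";
        ReentrantReadWriteLock.WriteLock wl = lockFor(parent).writeLock();
        wl.lock();                                   // exclusive lock only on the parent directory
        try {
            // ... insert the new directory under `parent` here ...
        } finally {
            wl.unlock();
            for (int i = held.size() - 1; i >= 0; i--) {
                held.get(i).unlock();                // release in reverse (leaf-to-root) order
            }
        }
    }
}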

Implicit Locking
[Diagram: namespace tree (/root, 2014, us, ny, dir1, b.log, c.log, ...) with the associated INode and Quota rows covered by an inode lock.]

Pluggable DBs: Data Abstraction Layer
[Diagram: the NameNode (Apache v2) programs against the DAL API (Apache v2); backends plug in underneath it, e.g. the NDB-DAL-Impl (GPL v2) or another DB (other license).]
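To give a feel for the abstraction, here is a hypothetical sketch of a DAL-style data-access interface; the interface and method names are illustrative and not the actual hops-metadata-dal API. The NameNode codes against the interface, and a backend (e.g. the NDB implementation) supplies the concrete class.

import java.util.Collection;
import java.util.List;

// Hypothetical DAL-style interface (the Apache v2 side of the boundary).
interface InodeDataAccess<T> {
    // Point lookup used by path resolution.
    T findByNameAndParentId(String name, long parentId) throws StorageException;

    // Children of a directory, e.g. for listStatus.
    List<T> findByParentId(long parentId) throws StorageException;

    // Persist the changes collected by one file system operation.
    void prepare(Collection<T> removed, Collection<T> added, Collection<T> modified)
            throws StorageException;
}

// Checked exception thrown by any backend behind the DAL.
class StorageException extends Exception {
    public StorageException(String message) { super(message); }
}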

Spotify Workload
Op Name            Percentage         Op Name            Percentage
append file        0.0%               content summary    0.01%
mkdirs             0.02%              set permissions    0.03% [26.3%*]
set replication    0.14%              set owner          0.32% [100%*]
delete             0.75% [3.5%*]      create file        1.2%
rename             1.3% [0.03%*]      add blocks         1.5%
list (listStatus)  9% [94.5%*]        stat (fileInfo)    17% [23.3%*]
read (getBlkLoc)   68.73%

Table 1: Relative frequency of operations on a Spotify HDFS cluster. List, read, and stat operations account for ≈95% of the metadata operations in the cluster. *Of which, the relative percentage is on directories.
Excerpt from the HopsFS paper (see citation below):

… NameNode on which to execute file system operations. HopsFS clients periodically refresh the NameNode list, enabling new NameNodes to join an operational cluster. HDFS v2.x clients are fully compatible with HopsFS, although they do not distribute operations over NameNodes, as they assume there is a single active NameNode.

MySQL Cluster: NDB is the storage engine for MySQL Cluster and it is a shared-nothing, in-memory, auto-sharding, consistent, distributed, relational database [38]. NDB frequently checkpoints the data and supports both NDB datanode and cluster-level recovery.

NDB horizontally partitions the tables among the NDB datanodes. It also supports application defined partitioning (ADP) for the tables. The transaction coordinators are located on all the NDB datanodes, enabling high performance transactions between data shards. Distribution aware transactions (DAT) are possible by providing a hint, based on the application defined partitioning scheme, to start a transaction on the NDB datanode containing the data read/updated by the transaction. In particular, single row read operations and partition pruned index scans (scan operations in which a single data shard participates) benefit from distribution aware transactions as they can read all their data locally [77]. Incorrect hints result in additional network traffic being incurred but otherwise correct system operation.

NDB only supports read-committed transaction isolation, which does not allow dirty reads, but phantom and fuzzy (non-repeatable) reads can happen in a transaction [7]. NDB supports row level locks, exclusive (write) locks, shared (read) locks, and read-committed locks. HopsFS uses locks to serialize conflicting file system operations.

3 Partitioning Scheme and Transactions
Metadata for hierarchical distributed file systems typically contains information on inodes, blocks, replicas, quotas, leases and mappings (directories to files, files to blocks, and blocks to replicas). When metadata is distributed, an application defined partitioning scheme is needed to shard the metadata and a distributed consensus protocol is required to ensure metadata integrity for operations that cross shards. Quorum-based consensus protocols, such as Paxos, provide high performance within a single shard, but are typically combined with transactions, implemented using the two-phase commit protocol, for operations that cross shards, as in Megastore [6] and Spanner [11]. File system operations in HopsFS are implemented primarily using transactions and row-level locks in MySQL Cluster to provide serializability [23] for metadata operations.

Figure 2: (a) Shows the relative cost of different operations in a NewSQL database. (b) HopsFS avoids FTS and IS operations, as the cost of these operations is relatively higher than PPIS, B, and PK operations.

The choice of partitioning scheme for the hierarchical namespace is a key design decision for distributed metadata architectures. We base our partitioning scheme on the expected relative frequency of HDFS operations and the cost of different database operations that can be used to implement the file system operations. Table 1 shows the relative frequency of selected HDFS operations in a workload generated by Hadoop applications (Pig, Hive, HBase, MapReduce, Tez, Spark, and Giraph) at Spotify. List, stat and file read operations alone account for ≈95% of the operations in the HDFS cluster. These statistics are similar to the published workloads for Hadoop clusters at Yahoo [1], LinkedIn [53], and Facebook [66]. Figure 2a shows the relative cost of different database operations. We can see that the cost of a full table scan or an index scan, in which all database shards participate, is much higher than a partition pruned index scan in which only a single database shard participates. HopsFS implements common file system operations using only the low cost database operations, that is, primary key reads, batched primary key reads, and partition pruned index scans. For example, the read and directory listing operations (see Table 3) are implemented using only (batched) primary key lookups and partition pruned index scans. Index scans and full table scans were avoided, where possible, as they touch all database shards and scale poorly.
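As a hedged illustration of the partition-pruned index scan idea: if the inodes table is partitioned on parent_id, then a directory listing that filters on parent_id only touches the one shard holding that directory's children. The table, columns, host, and credentials below are simplified assumptions, not the exact HopsFS schema; the query goes through a MySQL server front-end of the NDB cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PartitionPrunedListing {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://mysqld-host:3306/hops", "hops", "secret")) {   // placeholders

            // With inodes partitioned by parent_id, this predicate lets NDB prune the
            // index scan to the single partition that stores the directory's children.
            String sql = "SELECT id, name FROM inodes WHERE parent_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, 42L);                  // inode id of the directory being listed
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + "\t" + rs.getString("name"));
                    }
                }
            }
        }
    }
}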
3.1 HopsFS Partitioned Metadata
In HopsFS, the file system metadata is stored in tables where a directory inode is represented by a single row in the Inode table. File inodes, however, have more associated metadata, such as a set of blocks, block locations, and checksums, that are stored in separate tables, with …

[Niazi, Salman, et al. "HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases." arXiv preprint (2016)]

To Infinity and Beyond
NDB setup: 8 nodes with Xeon E5-2620 2.40GHz processors and 10GbE. NameNodes: Xeon E5-2620 2.40GHz machines and 10GbE.
[Plot: throughput (ops/sec, up to ~700k) vs. number of NameNodes (1-46) for the Spotify workload; HopsFS-Spotify reaches roughly 8.5x the throughput of HDFS-Spotify.]

Bigger Clusters
[Figure 7: Performance of rename and delete operations on large directories (0.25-1.00 million files per directory). Note the different time scales for HDFS (ms) and HopsFS (sec). HopsFS is an order of magnitude slower as it reads all the descendant inodes of the subtree from an external database.]
… the memory. However, due to the low frequency of such operations in typical industrial workloads (see Table 1), we think it is an acceptable trade-off for the higher performance of common file system operations in HopsFS.

7.3 Metadata (Namespace) Scalability
In HDFS, as the entire namespace metadata must fit on the heap of a single JVM, the data structures are highly optimized to reduce the memory footprint [62]. In HDFS, a file with two blocks that are replicated three ways requires 448 + L bytes of metadata¹, where L represents the filename length. If the file names are 10 characters long, then a 1 GB JVM heap can store 2.3 million files. In reality the JVM heap size has to be significantly larger to accommodate secondary metadata, thousands of concurrent RPC requests, block reports that can each be tens of megabytes in size, as well as other temporary objects.
          Number of Files
Memory    HDFS             HopsFS
1 GB      2.3 million      0.44 million
50 GB     115 million      22 million
100 GB    230 million      44 million
200 GB    460 million      88 million
500 GB    Does Not Scale   220 million
1 TB      Does Not Scale   440 million
24 TB     Does Not Scale   10.8 billion

Table 2: HDFS and HopsFS Metadata Scalability.
Migrating the metadata to a database causes an expansion in the amount of memory required to accommodate indexes, primary/foreign keys and padding. In order to calculate the size of each entity we use a tool called sizer [40]. In HopsFS the same file described above takes 2420 bytes if the metadata is replicated twice. For a highly available deployment with active and standby namenodes for HDFS, you will need twice the amount of memory; thus, HopsFS requires ≈2 times more memory than HDFS to store metadata that is highly available.

Table 2 shows the metadata scalability of HDFS and HopsFS. NDB supports up to 48 datanodes, which allows it to scale up to 24 TB of data in a cluster with 512 GB RAM on each NDB datanode. HopsFS can store up to 10.8 billion files using 24 TB of metadata, which is an order of magnitude higher (24 times) than HDFS.

¹These size estimates are for HDFS version 2.0.4, from which HopsFS was forked. Newer versions of HDFS require additional memory for new features such as snapshots and extended attributes.
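The first row of Table 2 can be reproduced from the per-file estimates quoted above (448 + L bytes for HDFS with 10-character names, and roughly 2420 bytes for HopsFS with the metadata replicated twice); a quick back-of-the-envelope check:

public class MetadataFootprint {
    public static void main(String[] args) {
        long oneGiB = 1L << 30;

        // HDFS: a two-block file replicated three ways needs 448 + L bytes;
        // with L = 10 that is 458 bytes per file.
        long hdfsBytesPerFile = 448 + 10;
        System.out.printf("HDFS   files per 1 GB heap      : %.2f million%n",
                oneGiB / (double) hdfsBytesPerFile / 1e6);   // ~2.34 million

        // HopsFS: the same file takes about 2420 bytes in NDB (indexes, keys and
        // padding included) with the metadata replicated twice.
        long hopsBytesPerFile = 2420;
        System.out.printf("HopsFS files per 1 GB of memory : %.2f million%n",
                oneGiB / (double) hopsBytesPerFile / 1e6);   // ~0.44 million
    }
}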
7.4 Industrial Workload Experiments
We benchmarked HopsFS using workloads based on operational traces from Spotify, which operates a Hadoop cluster consisting of 1600+ nodes containing 60 petabytes of data. The namespace contains 13 million directories and 218 million files, where each file on average contains 1.3 blocks. The Hadoop cluster at Spotify runs on average forty thousand jobs of different applications, such as Pig, Hive, HBase, MapReduce, Tez, Spark, and Giraph, every day. The file system workload generated by these applications is summarized in Table 1, which shows the relative frequency of HDFS operations. At Spotify the average file path depth is 7 and the average inode name length is 34 characters. On average each directory contains 16 files and 2 sub-directories. There are 289 million blocks stored on the datanodes. We use these statistics to generate file system workloads that approximate HDFS usage in production at Spotify.
7.4.1 Scalability Argument
Figure 8 shows that, for our industrial workload, HopsFS delivers 2.6 times the throughput of HDFS with 12 namenodes and 8 NDB nodes. Even for 2-node NDB deployments, HopsFS can outperform HDFS and scale linearly up to 8 namenodes.

As discussed before, in medium to large Hadoop clusters 5 to 8 servers are required to provide high availability for HDFS. With 6 servers, HopsFS delivers higher throughput than HDFS, increasing further as more namenodes are added to the system.

For higher numbers of namenodes, HopsFS' throughput levels off because NDB becomes overloaded. By increasing the number of NDB datanode instances to 4, HopsFS increases throughput up to 11 namenodes, and by increasing the number of NDB datanode instances to 8, HopsFS scales up to at least 12 namenodes. The 2-node NDB cluster has a similar performance as the 4-node NDB cluster because each NDB datanode in the 2-node NDB cluster holds a complete copy of the entire metadata, which improves the performance of transactions as all the …
[Figure 8: HopsFS and HDFS throughput (ops/sec, 20K-140K) vs. number of NameNodes (1-12) for our industrial workload; HopsFS with 2, 4, and 8 NDB nodes compared to HDFS (using 5 servers).]

Tinker Friendly Metadata
• The database (NDB) is the single source of truth
• Extending INodes (files/directories)
• Adding a new table with a foreign key to the inodes table
• Attaching metadata to a file/directory
• Schema-less
• Schema-based

Schema-less Metadata

inodeID  Name       parentId
1        /          0
2        Users      1
3        alice.txt  2

attach /Users/alice.txt '{"age": 20, "gender": "female", "about": "I am alice"}'

inodeID  metadata
3        {"age": 20, "gender": "female", "about": "I am alice"}
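Because the attached metadata lives in the same database as the namespace, attaching a schema-less blob can be as simple as inserting one extra row keyed by the inode id. The table name (inode_metadata), host, and credentials below are illustrative assumptions, not the actual Hopsworks schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AttachSchemalessMetadata {
    public static void main(String[] args) throws Exception {
        String json = "{\"age\": 20, \"gender\": \"female\", \"about\": \"I am alice\"}";

        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://mysqld-host:3306/hops", "hops", "secret");     // placeholders
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO inode_metadata (inodeID, metadata) VALUES (?, ?)")) {
            ps.setInt(1, 3);              // inode id of /Users/alice.txt from the table above
            ps.setString(2, json);
            ps.executeUpdate();
        }
    }
}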

Schema-based Metadata
[Diagram: a Template is made up of Tables, each Table of Columns; an INode is associated with a Template, and attached values are stored as Metadata Rows under those Columns.]
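Reading the diagram as a small data model: a template groups tables, a table groups columns, and each attached value becomes a metadata row linking an inode to a column. A hypothetical sketch of that model follows (not the actual Hopsworks classes; the template and values are invented for illustration).

import java.util.ArrayList;
import java.util.List;

public class SchemaBasedMetadataSketch {

    // Hypothetical model classes; names and fields are illustrative only.
    static class Template {
        final String name;
        final List<Table> tables = new ArrayList<>();
        Template(String name) { this.name = name; }
    }

    static class Table {
        final String name;
        final List<String> columns = new ArrayList<>();
        Table(String name) { this.name = name; }
    }

    // One value of one column attached to one inode.
    record MetadataRow(long inodeId, String table, String column, String value) {}

    public static void main(String[] args) {
        Template genomics = new Template("genomics");      // invented example template
        Table sample = new Table("sample");
        sample.columns.add("species");
        sample.columns.add("tissue");
        genomics.tables.add(sample);

        // Attach values for inode 3 (/Users/alice.txt in the earlier example).
        List<MetadataRow> rows = List.of(
                new MetadataRow(3, "sample", "species", "human"),
                new MetadataRow(3, "sample", "tissue", "liver"));
        rows.forEach(System.out::println);
    }
}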

Searching through the Namespace
• hdfs -find
• limited
• inefficient by design
• Pig/MapReduce
• The NameNode is a critical point; avoid overloading it
• SQL query on NDB
• One size does not fit all
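To see why client-side search is inefficient by design: a find-style scan walks the namespace from the client, issuing one listStatus RPC against the NameNode per directory visited. A minimal sketch with the standard Hadoop FileSystem API (the search term is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NaiveFind {

    // Recursively look for files whose name contains `needle`. Every directory
    // visited costs one listStatus RPC, so a namespace-wide search hammers the
    // metadata service instead of using a search index.
    static void find(FileSystem fs, Path dir, String needle) throws Exception {
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                find(fs, status.getPath(), needle);
            } else if (status.getPath().getName().contains(needle)) {
                System.out.println(status.getPath());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        find(fs, new Path("/"), "alice");
    }
}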

Polyglot Persistence: the Right Tool for the Job

Monolithic vs Polyglot Persistence
[http://martinfowler.com/]

Asynchronous Update of the Metadata
Eventual consistency for metadata. Metadata integrity is maintained by asynchronous replication and metadata immutability.
[Diagram: files, directories, and extended metadata live in the database; ePipe replicates them one-way (immutable data) into search indexes in Elasticsearch.]

HopsFS | Elasticsearch
[Diagram: ePipe tails metadata changes from MySQL Cluster (Reader/Tailer per table unit), batches them (Batcher), and pushes them through the ElasticHandler to the Elasticsearch cluster.]
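Once ePipe has replicated the metadata into Elasticsearch, free-text search becomes an index query instead of a namespace walk. A hedged sketch against the plain Elasticsearch REST search API; the index name (hopsfs_metadata), field name (metadata), and host are assumptions for illustration:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchMetadata {
    public static void main(String[] args) throws Exception {
        // Match inodes whose attached metadata mentions "alice".
        String query = "{\"query\": {\"match\": {\"metadata\": \"alice\"}}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://elastic-host:9200/hopsfs_metadata/_search")) // placeholders
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());         // JSON hits listing the matching inodes
    }
}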

Supporting Project-Level Multi-Tenancy
How can we introduce GitHub-style projects to Hadoop?

Tutorial
• Karamel Automated Installation
• http://www.hops.io/?q=boss
• HopsWorks Clusters
• http://52.210.205.125:8080/hopsworks/
• http://52.209.143.68:8080/hopsworks/
• http://52.18.186.135:8080/hopsworks/

Conclusions
• Hops is a next-generation distribution of Hadoop.
• HopsWorks is a frontend to Hops that supports multi-tenancy, free-text search, interactive analytics with Zeppelin/Flink/Spark, and batch jobs.
• Looking for contributors/committers
• Check out our GitHub (github.com/hopshadoop)

Questions

Karamel

Automated Installation
• Vagrant/Chef to spin up on a single host
• Karamel/Chef to deploy on AWS/GCE/OpenStack or on-premises
name: HopsWorks
ec2:
  type: m3.medium
cookbooks:
  hadoop:
    github: "hopshadoop/hopsworks-chef"
    version: "v0.1"
groups:
  ui:
    size: 1
    recipes:
      - hopsworks
  metadata:
    size: 2
    recipes:
      - hops::nn
      - hops::rm
  datanodes:
    size: 50
    recipes:
      - hops::dn
      - hops::nm

HopsWorks

Problem: Sensitive Data needs its own Cluster
[Diagram: Alice has access to the NSA DataSet and is then given access to the User DataSet.]
• Alice can copy/cross-link between data sets
• Alice has only one Kerberos identity; neither attribute-based access control nor dynamic roles are supported in Hadoop

Solution: Project-Specific UserIDs
[Diagram: Alice is a member of Project NSA as NSA__Alice and of Project Users as Users__Alice; HDFS enforces access control on these project-specific users.]
How can we share DataSets between Projects?

Sharing DataSets between Projects
[Diagram: a project owns a DataSet; Alice is a member of Project NSA (NSA__Alice) and of Project Users (Users__Alice).]
• Add members of Project NSA to the DataSet group

HopsWorks enforces dynamic Roles
[Diagram: Alice authenticates once to HopsWorks; depending on the active project, HopsWorks securely impersonates her as NSA__Alice or Users__Alice towards HopsFS, HopsYARN, and Kafka, using X.509 certificates.]

X.509 Certificate per project-specific user
[Diagram: a Project Manager adds/removes users; certificate signing requests go to the Root CA; certificates are inserted into/removed from the distributed database and used to authenticate to services (Hadoop, Spark, Kafka, etc.).]

Project
• A project is a collection of
• Members
• HDFS DataSets
• Kafka Topics
• Notebooks, Jobs
• A project has an owner
• A project has quotas
[Diagram: a project groups HDFS datasets (dataset 1 ... dataset N) and Kafka topics (Topic 1 ... Topic N).]

Project Roles
• Data Owner privileges
  - Import/export data
  - Manage membership
  - Share DataSets, Topics
• Data Scientist privileges
  - Write and run code
We delegate administration of privileges to users