Let’s Make Parallel File System More Parallel [LA-UR-15-25811]
Qing Zheng¹, Kai Ren¹, Garth Gibson¹, Bradley W. Settlemyer²
¹Carnegie Mellon University  ²Los Alamos National Laboratory
HPC defined by …
LANL Summer School / Parallel Data Lab - http://www.pdl.cmu.edu/
• parallel scientific apps
• low-latency network for message passing
• tiered cluster deployments
• PFS for highly scalable storage I/O

App1, App2, App3 on compute nodes (10,000+) -> Parallel File System [Lustre] on storage nodes (100+)
Failure Handling …
App1, App2, App3 on compute nodes (10,000+) -> Parallel File System [Lustre] on storage nodes (100+)
• nodes/network will fail
• apps use checkpoints to avoid complete re-execution
• each proc dumps its memory to a file
• when a failure happens, an app is simply re-scheduled and resumes execution from the latest checkpoint
Checkpointing …
• assuming 20,000 nodes and 32 CPUs per node:
  640K open()/close() calls, N * 640K write() calls against the PFS

    if (proc_id == 0) {
        mkdir("/proj/a/chk/001", 0755);
    }
    sync();
    int fd = open("/proj/a/chk/001/<proc_id>",
                  O_CREAT | O_EXCL | O_WRONLY, 0644);
    write(fd, "<…..>");
    write(fd, "<…..>");
    close(fd);
Will existing PFS deliver sufficient perf?
[ DATA ] YES?    [ METADATA ] NO?
[ METADATA ]
open(), close(), unlink(), mkdir(), rmdir(), rename(), getattr(), chmod(), readdir(), …

Metadata covers:
1. Namespace Tree - hierarchical directory structure
2. File Attributes - file name, file size, last modification time, …
3. Data Location - where to find file/directory data?

Decoupled PFS (Parallel File System)
• metadata service [a single (or a few) machines], e.g. Lustre MDS
• data service [a large collection of machines], e.g. Lustre OSS
Allows data to scale without scaling metadata
Isn’t Metadata a Problem?
Common objections:
• NO - FS only stores large files
• NO - metadata is small in size
• NO - 90% of ops are I/O

But median file size is actually tiny/small:
• < 64KB in cloud computing data centers
• < 64MB in supercomputing environments
(64MB is the default block size for the Google File System)

And HPC is growing fast - bigger & bigger clusters mean more:
• app processes
• metadata size
• metadata ops
Tomorrow we will have EXASCALE computing facilities,
and with them far more intensive METADATA WORKLOADS.
Metadata: eventually a huge problem!!

Will existing PFS deliver sufficient perf?
[ METADATA ] NO!!
Middleware Design
Parallel Scientific Application
• each Client Proc issues metadata operations to a Private Server over the fast interconnect
• Private Servers (plus a Primary Server) sit between the app and the
  Underlying Storage Infrastructure [Object Storage/Parallel File System],
  which provides data/metadata storage
Enables metadata to be potentially served from compute nodes
Agenda
Client-funded File System Metadata Architecture
1. Metadata Representation
2. Bulk Insertion
Block-based Metadata (UNIX Model)
On disk: superblock, inode map, data block map, inode blocks, data blocks

file inode:      id=161, size=64,   type=[file],      time=2015-07-27, …
directory inode: id=157, size=4096, type=[directory], time=2015-07-27, …

directory entry list:
[..]   -> 132
[.]    -> 157
zhengq -> 158
kair   -> 159
garth  -> 160
bws    -> 161

file creates -> disk seeks, linear directory entry search cost,
zero per-directory concurrency
Table-based Metadata
KEY = parent_id + hash(fname), VALUE = an embedded inode + fname

Namespace: ROOT (id=0) contains proj (id=1) and src (id=2);
proj contains batchfs (id=5); src contains fs.h and fs.c

ordered KV pairs:
key           value
0,h(proj)     id=1, type=dir,  fname=proj      <- readdir “[ROOT]”
0,h(src)      id=2, type=dir,  fname=src       <- readdir “[ROOT]”
1,h(batchfs)  id=5, type=dir,  fname=batchfs
2,h(fs.c)     id=4, type=file, fname=fs.c      <- readdir “/src”
2,h(fs.h)     id=3, type=file, fname=fs.h      <- readdir “/src”

A large distributed sorted directory entry table with embedded inodes
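The key scheme above makes each directory’s entries contiguous in key order, so readdir becomes a single range scan. A minimal Python sketch (a toy, not BatchFS code; `MetadataTable` and `h()` are hypothetical names, and any stable name hash would do):

```python
import hashlib
from bisect import bisect_left

def h(name: str) -> int:
    # stand-in for the slide's hash(fname): any stable 64-bit hash works
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "big")

class MetadataTable:
    """Sorted table of (parent_id, hash(name)) -> embedded inode + fname."""
    def __init__(self):
        self.rows = []  # kept sorted by key: list of ((parent_id, name_hash), inode)

    def insert(self, parent_id, name, inode):
        key = (parent_id, h(name))
        inode = dict(inode, fname=name)          # embed the file name in the value
        i = bisect_left(self.rows, (key,))
        self.rows.insert(i, (key, inode))

    def readdir(self, dir_id):
        # all entries of a directory share the parent_id prefix: one range scan
        i = bisect_left(self.rows, ((dir_id, 0),))
        names = []
        while i < len(self.rows) and self.rows[i][0][0] == dir_id:
            names.append(self.rows[i][1]["fname"])
            i += 1
        return names
```

Because the rows are ordered, lookups and directory listings need no per-directory data structure, which is what lets the whole namespace live in one big sorted table.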
Table Representation: Log-structured Merge Trees [LSM]
A collection of B-trees at different levels (Level-0, Level-1, Level-2, …)
• create file/directory -> insert into the in-mem B-tree (level-0 always sits in memory)
• when level-0 is FULL, merge level-0 into level-1
• when level-1 is FULL, merge part of level-1 into level-2
• converts random disk I/O into sequential I/O and avoids disk seeks
  (optimized for K/V insertion)
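The level-0/level-1 flow above can be sketched in a few lines of Python. This is a deliberately tiny model (two levels, no on-disk format, `TinyLSM` is a hypothetical name), just to show how random inserts become sorted sequential runs:

```python
class TinyLSM:
    """Toy LSM: a small in-memory level-0; when full, it is flushed as a
    sorted run and merged into the next level. Newer values win."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}      # level-0: always in memory, absorbs random inserts
        self.levels = []        # levels[0]: one sorted run of (key, value) pairs
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # level-0 goes out as one sorted, sequential run (no seeks)
        run = sorted(self.memtable.items())
        self.memtable = {}
        if self.levels:
            merged = dict(self.levels[0])
            merged.update(run)              # the newer run overrides older values
            self.levels[0] = sorted(merged.items())
        else:
            self.levels.append(run)

    def get(self, key):
        if key in self.memtable:            # newest data first
            return self.memtable[key]
        for run in self.levels:
            d = dict(run)
            if key in d:
                return d[key]
        return None
```

A real LSM tree (e.g. LevelDB, which BatchFS-style systems build on) keeps multiple levels, does partial merges, and writes runs to files; the ordering-and-merge idea is the same.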
LSM - Updates
Convert K/V updates into K/V insertion operations
chmod(“/proj/batchfs”, …)
1,h(batchfs)  perm=xxx, fname=batchfs, seq=245
1,h(batchfs)  perm=yyy, fname=batchfs, seq=361
seq 361 > 245 -> no write-in-place
LSM - Deletions
Convert K/V deletions into K/V insertion operations
rmdir(“/proj/batchfs”, …)
1,h(batchfs)  live=true,  fname=batchfs, seq=245
1,h(batchfs)  live=false, fname=batchfs, seq=361
seq 361 > 245 -> no explicit deletion
1. immutable data structure
2. snapshotting a file system image is trivial
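Both slides reduce to one rule: keep every version, and resolve reads by sequence number, where a `live=false` record (a tombstone) hides the key. A minimal sketch of that read-side resolution (hypothetical record layout, not the on-disk format):

```python
def latest_view(records):
    """Resolve multiple versions per key: the highest seq wins, and a
    live=False record (tombstone) makes the key disappear.
    Nothing is ever rewritten or deleted in place."""
    best = {}
    for key, seq, live, attrs in records:
        if key not in best or seq > best[key][0]:
            best[key] = (seq, live, attrs)
    return {k: attrs for k, (seq, live, attrs) in best.items() if live}

# chmod then rmdir on /proj/batchfs, expressed purely as insertions
log = [
    (("1", "h(batchfs)"), 245, True,  {"perm": "xxx"}),  # original entry
    (("1", "h(batchfs)"), 361, True,  {"perm": "yyy"}),  # chmod -> newer version
    (("1", "h(batchfs)"), 402, False, {}),               # rmdir -> tombstone
]
```

Since old records are never mutated, a snapshot is just "the log up to seq N", which is why the slide calls snapshotting trivial.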
LSM - Storage
The LSM-tree namespace is formatted into table files T1, T2, T3, T4 (e.g. 32MB each)
and stored on the Underlying Storage Infrastructure [Object Storage/Parallel File System]
Pack metadata into large files:
reuse the data path to deliver scalable metadata
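The packing step is simple: chop a sorted stream of KV pairs into a few large blobs and write each as an ordinary file on the PFS. A sketch under assumed names (`pack_runs` is hypothetical, and the tab-separated record format is illustrative only):

```python
def pack_runs(kv_pairs, table_bytes=32 * 1024 * 1024):
    """Split a sorted stream of (key, value) pairs into table blobs of at most
    ~table_bytes each, so metadata rides the PFS data path as a few large
    sequential writes instead of many tiny ones."""
    tables, current, size = [], [], 0
    for key, value in kv_pairs:
        rec = f"{key}\t{value}\n".encode()
        if size + len(rec) > table_bytes and current:
            tables.append(b"".join(current))   # close the current table file
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        tables.append(b"".join(current))
    return tables  # write each blob out as one file: T1, T2, ...
```

Because the input stream is already sorted, each table covers a contiguous key range, which keeps lookups to one file per level.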
Experiments (CHECKPOINT WORKLOAD)
Each client process creates 1 private directory and inserts a set of empty files into that directory

Hadoop File System (HDFS) Cluster: 1 NameNode [metadata node] + 8 DataNodes
Each node has two CPUs, 8GB RAM, one SATA HDD, and one 1Gb Ethernet port
The original Hadoop file system gives 600 op/s
Experiment Settings
• 1 HDFS NameNode + 1 BatchFS Server
• HDFS DataNodes (one DISK each), each hosting 1-8 BatchFS clients
• 1 million files inserted, without bulk insertion
HDFS Baseline vs. BatchFS
Throughput (K op/s):
  8 client processes:  HDFS 0.6 | BatchFS 11  (~20X)
  16 client processes: HDFS 0.6 | BatchFS 13  (~20X)
  32 client processes: HDFS 0.6 | BatchFS 13  (~20X)
  64 client processes: HDFS 0.6 | BatchFS 12  (~20X)
Efficient Metadata Representation
Traditional Model
Parallel Scientific Application -> mkdir(), create() -> Dedicated Metadata Server
-> write tree files T1 T2 T3 T4 -> on-disk namespace storage
(Shared Underlying Storage Infrastructure)
• synchronous interface, strongly consistent
1. A dedicated service doesn’t work at exascale:
   320K client processes make the server a bottleneck
2. The traditional model is overkill for scientific applications
Bulk Insertion
(1) clients pre-execute mkdir()/create() via private servers,
    writing their metadata mutations as tree files (T1 T2 T3 T4)
(2) bulk submit: the Dedicated Metadata Server finishes execution
    simply by picking up all submitted tree files (T5 T6 -> on-disk namespace storage)
Similar to database pre-loading:
data is inserted via a low-level protocol instead of SQL
1. More efficient h/w utilization
2. Fewer calls to dedicated servers: more scalable metadata
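The two-step flow above can be sketched as two small classes: the client batches mutations locally, and the server adopts the whole batch in a single call. All names here (`PrivateServer`, `PrimaryServer`, `bulk_submit`) are illustrative, not the BatchFS API:

```python
class PrivateServer:
    """Client side: pre-execute metadata ops into a local batch of kv mutations
    (standing in for a tree file written to shared storage)."""
    def __init__(self):
        self.batch = []

    def mkdir(self, parent_id, name):
        self.batch.append(((parent_id, name), {"type": "dir"}))

    def create(self, parent_id, name):
        self.batch.append(((parent_id, name), {"type": "file"}))

class PrimaryServer:
    """Server side: one bulk_submit adopts an entire batch of mutations."""
    def __init__(self):
        self.namespace = {}
        self.rpc_count = 0

    def bulk_submit(self, batch):
        self.rpc_count += 1              # one RPC, regardless of batch size
        self.namespace.update(batch)     # pick up the whole tree file at once
```

The point of the sketch: N creates cost the dedicated server one call instead of N, which is where the "fewer calls to dedicated servers" scaling comes from.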
Concurrency Control
Total ordering of mutations from different clients, e.g.:
• client1 chmod(“/proj”, …)      vs  client2 rmdir(“/proj”, …)
• client1 chmod(“/proj”, …)      vs  client2 chmod(“/proj”, …)
• client1 mkdir(“/proj”, …)      vs  client2 mkdir(“/proj”, …)
• client1 rename(“/proj”, “/a”)  vs  client2 rename(“/proj”, “/b”)
Optimistic Locking
BatchFS client lifecycle: SNAPSHOT -> BOOTSTRAP -> SUBMIT -> CHECK/MERGE
The client branches from a checkpoint (ck1) of the global namespace,
mutates its private copy (e.g. /proj/batchfs), then submits the changes back.
Similar to source code control (github/svn),
except there is no data copying (we do copy-by-ref)
Fundamental Assumption: scientific applications rarely produce conflicts
Phase 1: Branching
Client instantiates a private namespace from a global snapshot
• global branch: the global namespace as ordered KV pairs in tables T1 T2 T3 (later T1 … T5)
• client: snapshot(…), then mkdir(…), chmod(…) against its private branch (table T)
• bulk_insert(…) submits the private branch
Phase 2: Merging
Server picks up and schedules a check on the client’s metadata mutations
• the client’s table T joins the global namespace (T1 … T5)
• tentatively accepted, subject to future rejection
• meanwhile another client (Client2) may already open(…) against the merged view
Phase 3: Verification
• an SST Interpreter replays the metadata operation log view (soft re-execution)
  over the client’s metadata mutations (T5 T6 T7)
• concurrent updates mostly don’t produce conflicts
• after conflict resolution, COMMIT: the global namespace advances
  from T1 T2 T3 T4 to T5 T6 T7 T8
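The three phases boil down to: remember where you branched, and at merge time check your mutations against everything committed since. A toy sketch of that check (the same-key conflict rule here is a stand-in for BatchFS's real verification logic):

```python
def try_commit(global_log, base_snapshot_len, mutations):
    """Optimistic check at merge time: compare the client's mutations against
    whatever was committed since its snapshot; accept unless they collide.
    'Collide' here means two writers touched the same key (a simplification)."""
    concurrent = global_log[base_snapshot_len:]        # ops since the branch point
    touched = {key for key, _ in concurrent}
    if any(key in touched for key, _ in mutations):
        return False                  # conflict: hand off to conflict resolution
    global_log.extend(mutations)      # soft re-execution succeeded: COMMIT
    return True
```

Under the talk's fundamental assumption (scientific apps rarely conflict), the check almost always succeeds, so no locks are held during the batch.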
Previous Setting
Same cluster: 1 HDFS NameNode + 1 BatchFS Server;
HDFS DataNodes (one DISK each), each hosting 1-8 BatchFS clients
1 million files inserted, without bulk insertion
New Setting
Same cluster: 1 HDFS NameNode + 1 BatchFS Server;
HDFS DataNodes (one DISK each), each hosting 1-8 BatchFS clients
8 million files inserted, with bulk insertion
No vs. w/ Bulk Insertion
Throughput (K op/s):
  8 client processes:  HDFS 0.6 | no bulk 11 | w/ bulk 139  (8X)
  16 client processes: HDFS 0.6 | no bulk 13 | w/ bulk 188  (15X)
  32 client processes: HDFS 0.6 | no bulk 13 | w/ bulk 203  (15X)
  64 client processes: HDFS 0.6 | no bulk 12 | w/ bulk 216  (18X)
Bulk insertion: 20X * 18X = 360X faster than HDFS
Agenda
Client-funded File System Metadata Architecture
1. Metadata Representation
2. Bulk Insertion
Why is FS slow?
• inefficient metadata representation
• at least one RPC per operation
• synchronous metadata interface
• pessimistic concurrency control
• dedicated authorization service
Client-funded HPC: an Exascale PFS architecture
• compute nodes (App1, App2, App3) pre-execute metadata ops privately
• the Primary Metadata Server is not in the critical path: per-batch synchronization
• results land on the Underlying Storage
• move metadata computation from servers to apps
• better h/w utilization
• FS scales w/ # of clients
Apps have long had rich h/w resources;
now they can buy themselves scalable metadata
References
• Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW ’14)
• Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC ’14)