Distributed File Systems
http://idc.hust.edu.cn/~rxli/
Xinhua Dong, Ruixuan Li
School of Computer Science and Technology
Huazhong University of Science and Technology
Oct. 15, 2012
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
Introduction
A network file system is any computer file system that supports sharing of files, printers
and other resources as persistent storage over a computer network.
Distribution
A DFS is a network file system whose clients, servers, and storage devices are dispersed among the machines.
Transparency: a DFS should appear to its users to be a conventional, centralized file system.
Performance: the amount of time needed to satisfy service requests. Besides disk-access time and a small amount of CPU-processing time, a DFS incurs additional overhead: the time to deliver the request to the server, the time to deliver the response to the client, and, for each direction, the CPU overhead of running the communication protocol software.
Concurrent file updates: multiple clients may access and update the same files.
Outline
Introduction
File systems overview
SUN Network File System
Th A d Fil S t
5
The Andrew File System
Google File System MapReduce
Hadoop File System
File Systems Overview
System that permanently stores data
Usually layered on top of a lower-level physical storage medium
Divided into logical units called “files”
Addressable by a filename ("foo.txt")
Usually supports hierarchical nesting (directories)
File Paths
A file path joins file and directory names into a relative or absolute address that identifies a file
Absolute: /home/aaron/foo.txt
Relative: docs/someFile.doc
The shortest absolute path to a file is called its canonical path
The set of all canonical paths establishes the namespace for the file system
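For example (illustrative paths): /home/aaron/docs/../foo.txt and /home/aaron/./foo.txt both name the same file; its canonical path is /home/aaron/foo.txt.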
What Gets Stored
User data itself is the bulk of the file system's contents
Also includes meta-data on a drive-wide and per-file basis:
Drive-wide:
available space
formatting info
character set
...
Per-file:
name
owner
modification date
...
High-Level Organization
Files are organized in a “tree” structure made of nested directories
One directory acts as the "root"
"Links" (symlinks, shortcuts, etc.) provide a simple means of providing multiple access paths to one file
Other file systems can be “mounted” and dropped in as sub-hierarchies (other drives, network shares)
Low-Level Organization
File data and meta-data stored separately
File descriptors + meta-data stored in inodes
Large tree or table at a designated location on disk
Tells how to look up file contents
Meta-data may be replicated to increase system reliability
“Standard” read-write medium is a hard drive (other media: CDROM, tape, ...)
Viewed as a sequential array of blocks
Must address ~1 KB chunk at a time
Tree structure is "flattened" into blocks
Overlapping reads/writes/deletes can cause fragmentation: files are often not stored with a linear layout
Inodes store all block ids related to a file
Fragmentation
Design Considerations
Smaller inode size reduces amount of wasted space
Larger inode size increases speed of sequential reads (may not help random access)
Should the file system be faster or more reliable?
But faster at what: Large files? Small files? Lots of reading? Frequent writers, occasional readers?
File system Security
File systems in multi-user environments need to secure private data
Notion of username is heavily built into the FS
Different users have different access rights to files
UNIX Permission Bits
World is divided into three scopes:
User – The person who owns (usually created) the file
Group – A list of particular users who have "group ownership" of the file
Other – Everyone else
"Read," "write" and "execute" permissions applicable at each level
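A quick illustration (standard UNIX notation, not from the original slides): a simplified listing entry such as -rwxr-x--- aaron staff report.sh means owner aaron may read/write/execute, members of group staff may read/execute, and everyone else has no access; the same bits in octal are 750 (7 = rwx, 5 = r-x, 0 = ---), as set by chmod 750 report.sh.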
Access Control Lists
More general permissions mechanism
Implemented in Windows
Richer notion of privileges than r/w/x
e.g., SetPrivilege, Delete, Copy…
Allow for inheritance as well as deny lists
Can be complicated to reason about and can lead to security gaps
Process Permissions
Important note: processes running on behalf of user X have permissions associated with X, not process file owner Y
So if root owns ls, user aaron cannot use ls to peek at other users' files
Exception: the special permission "setuid" sets the user-id associated with a running process to the owner of the program file (for example, /usr/bin/passwd is typically setuid root so it can update the password database)
Disk Encryption
Data storage medium is another security concern
Most file systems store data in the clear and rely on runtime security to deny access
Assumes the physical disk won't be stolen
The disk itself can be encrypted
Hopefully by using separate passkeys for each user's files
(Challenge: how do you implement read access for group members?)
Metadata encryption may be a separate concern
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
Distributed File Systems
Support access to files on remote servers
Must support concurrency
Make varying guarantees about locking, who "wins" with concurrent writes, etc.
Must gracefully handle dropped connections
Can offer support for replication and local caching
Different implementations sit in different places on complexity/feature scale
General goal: Try to make a file system transparently available to remote clients.
(a) The remote access model. (b) The upload/download model.
Network File System (NFS)
First developed in the 1980s by Sun
Presented with standard UNIX FS interface
Network drives are mounted into local directory hierarchy
Type 'man mount' or 'mount' some time at the prompt if curious
NFS Protocol
Initially completely stateless
Operated over UDP; did not use TCP streams
File locking, etc, implemented in higher-level protocols
Modern implementations use TCP/IP & stateful protocols
NFS Architecture for UNIX Systems
NFS is implemented using the Virtual File System (VFS) abstraction, which is now used for lots of different operating systems:
Essence: VFS provides a standard file system interface and hides the difference between accessing a local or a remote file system.
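A rough sketch of the idea (illustrative only; this is not NFS's or any OS's actual VFS interface): one interface behind which a local implementation and an NFS-client implementation can both sit.

public interface VirtualFileSystem {
    // Uniform operations, regardless of where the data actually lives
    byte[] read(String path, long offset, int length) throws java.io.IOException;
    void write(String path, long offset, byte[] data) throws java.io.IOException;
    String[] list(String directory) throws java.io.IOException;
}

// class LocalFileSystem implements VirtualFileSystem { ... }      // talks to the local disk
// class NfsClientFileSystem implements VirtualFileSystem { ... }  // sends requests to an NFS server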
Server-side Implementation
NFS defines a virtual file system
Does not actually manage local disk layout on the server
Server instantiates NFS volume on top of local file system
Local hard drives managed by concrete file systems (EXT, ReiserFS, ...)
Typical implementation
Assuming a Unix-style scenario in which one machine requires access to data stored on another machine:
1. The server implements NFS daemon processes in order to make its data generically available to clients.
2. The server administrator determines what to make available, exporting the names and parameters of directories.
3. The server security administration ensures that it can recognize and approve validated clients.
4. The server network configuration ensures that appropriate clients can negotiate with it through any firewall system.
5. The client machine requests access to exported data, typically by issuing a mount command.
6. If all goes well, users on the client machine can then view and interact with mounted file systems on the server within the parameters permitted.
[webg@index1 ~]$ vi /etc/exports
# the file /etc/exports serves as the access control list for
# file systems which may be exported to NFS clients.
/home/infomall/udata 192.168.100.0/24(rw,async) 222.29.154.11(rw,async)
/home/infomall/hist 192.168.100.0/24(rw,async)
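On the client side, step 5 above is typically just a mount command. A sketch (the client prompt, mount point, and server address 192.168.100.1 are made up for illustration):

[webg@client1 ~]$ mkdir -p /mnt/udata
[webg@client1 ~]$ mount -t nfs 192.168.100.1:/home/infomall/udata /mnt/udata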
NFS Locking
NFS v4 supports stateful locking of files
Clients inform server of intent to lock
Server can notify clients of outstanding lock requests
Locking is lease-based: clients must continually renew locks before a timeout
Loss of contact with server abandons locks
NFS Client Caching
NFS Clients are allowed to cache copies of remote files for subsequent accesses
Supports close-to-open cache consistency
When client A closes a file, its contents are synchronized with the master, and the timestamp is changed
When client B opens the file, it checks that local timestamp agrees with server timestamp. If not, it discards local copy.
Concurrent reader/writers must use flags to disable caching
NFS: Tradeoffs
NFS volume managed by single server
Higher load on central server
Simplifies coherency protocols
Full POSIX system means it "drops in" very easily, but isn't "great" for any specific need
Distributed FS Security
Security is a concern at several levels throughout the DFS stack:
Authentication
Data transfer
Privilege escalation
How are these applied in NFS?
Authentication in NFS
Initial NFS system trusted client programs
User login credentials were passed to the OS kernel, which forwarded them to the NFS server
… A malicious client could easily subvert this
Modern implementations use more sophisticated systems (e.g., Kerberos)
Data Privacy
Early NFS implementations sent data in "plaintext" over the network
Modern versions tunnel through SSH
Double problem with the UDP (connectionless) protocol: observers could watch which files were being opened and then insert "write" requests with fake credentials to corrupt data
Privilege Escalation
Local file system username is used as the NFS username
Implication: being "root" on the local machine gives you root access to the entire NFS cluster
Solution: “root squash” – NFS hard-codes a privilege de-escalation from “root” down to “nobody” for all accesses.
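In export terms this is the root_squash option (the default on Linux NFS servers); extending the earlier /etc/exports sample for illustration, no_root_squash would disable the de-escalation:

/home/infomall/udata 192.168.100.0/24(rw,async,root_squash)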
RPCs in File System
Observation: Many (traditional) distributed file systems deploy remote procedure calls to access files. When wide-area networks need to be crossed, alternatives need to be exploited:
File Sharing Semantics
Problem: When dealing with distributed file systems, we need to take into account the ordering of concurrent read/write operations, and expected semantics (=consistency).
UNIX semantics: a read operation returns the effect of the last write operation; this can only be implemented for remote access models in which there is only a single copy of the file
Transaction semantics: the file system supports transactions on a single file; the issue is how to allow concurrent access to a physically distributed file
Session semantics: the effects of read and write operations are seen only by the client that has opened (a local copy of) the file; the question is what happens when the file is closed (only one client may actually win)
Consistency and Replication
Observation: In modern distributed file systems, client-side caching is the preferred technique for attaining performance; server-side replication is done for fault tolerance.
Observation: Clients are allowed to keep (large parts of) a file and will be notified when control is withdrawn; as a consequence, servers are now generally stateful.
Fault Tolerance
Observation: FT is handled by simply replicating file servers, generally using a standard primary-backup protocol:
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
AFS (The Andrew File System)
Developed at Carnegie Mellon
Strong security, high scalability
Supports 50,000+ clients at enterprise level
AFS heavily influenced Version 4 of NFS.
Security in AFS
Uses Kerberos authentication
Supports richer set of access control bits than UNIX
Separate "administer", "delete" bits
Allows application-specific bits
Local Caching
File reads/writes operate on locally cached copy
Local copy sent back to master when file is closed
Open local copies are notified of external updates through callbacks
Local Caching - Tradeoffs
Shared database files do not work well on this system
Does not support write-through to shared medium
Replication
AFS allows read-only copies of filesystem volumes
Copies are guaranteed to be atomic checkpoints of entire FS at time of read-only copy generation
Modifying data requires access to the sole r/w volume
Changes do not propagate to read-only copies
AFS Conclusions
Not quite POSIX
Stronger security/permissions
No file write-through
High availability through replicas, local caching
Not appropriate for all file types
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
Why Does Google Need GFS?
Motivation
Google stores dozens of copies of the entire Web!
More than 15,000 commodity-class PC’s
Multiple clusters distributed worldwide
Thousands of queries served per second
One query reads 100’s of MB of data
One query consumes 10’s of billions of CPU cycles
Conclusion: Need a large, distributed, highly fault-tolerant file system
Key Assumptions
All commodity hardware
Cheap but unreliable
Constantly failing, 100 disk failures/day
"Modest" number of large files
Few million
Each 100 MB to multi-GB
Some small files, but few
Read-mostly workload
Large streaming reads (multi-MB at a time)
Large sequential append operations
Must provide atomic consistency for parallel writes
Design Decisions in GFS
Files stored as chunks
Fixed size (immutable), each with a handle
Reliability through replication
Each chunk replicated across 3+ chunkservers
Stored as local files on the Linux file system
Single master to coordinate access, keep metadata
Simple centralized master per GFS cluster
Periodic heartbeat messages to check up on servers
No caching
Large data set / streaming reads render caching useless
Rely on the Linux buffer cache to keep data in memory
Architecture of GFS
What is a master?
A single process running on a separate machine
Stores all metadata:
File namespace
File to chunk mapping
Chunk location information
Access control information
Chunk version numbers
Etc.
File chunks: 64 MB, a large chunk size
Advantages:
Lower metadata overhead, can be stored in memory
Cache more index in client
Fewer interactions with the master: many operations fall within one chunk, so network traffic is reduced
Disadvantage: a small file may reside on a single chunkserver, which may become a hot spot
Chunk location information
Not stored in the master persistently
Obtained by querying chunkservers at startup or whenever a chunkserver joins the cluster
The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages
In-Memory Data Structures
Operations are fast; it is easy and efficient to periodically scan through the entire state:
Re-replication in the presence of chunkserver failure
Chunk migration for load balancing
Garbage collection
Memory is cheap. The capacity of the system is limited by the master's memory: roughly 64 bytes of metadata per 64 MB chunk
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory
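As a rough worked example of that ratio (illustrative arithmetic, not a figure from the paper): at 64 bytes of metadata per 64 MB chunk, 1 PB of data corresponds to about 16 million chunks, which needs only about 16M × 64 B ≈ 1 GB of master memory.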
Master <-> Chunkserver
Communicate regularly to obtain state:
Is chunkserver down?
Are there disk failures on chunkserver?
Are any replicas corrupted?
Which chunk replicas does chunkserver store?
Master sends instructions to the chunkserver:
Delete existing chunk
Create new chunk
Operation log for persistent logging
The operation log contains a historical record of critical metadata changes, replicated on multiple remote machines.
Checkpoints for fast recovery
Not only is it the only persistent record of metadata, but it also serves as a logical timeline that defines the order of concurrent operations.
Files and chunks, as well as their versions are all uniquely and eternally identified by the logical times at which they were created.
The master recovers its file system state by replaying the operation log.
Serving Requests:
Client retrieves metadata for operation from master.
Read/Write data flows between client and chunkserver.
Single master is not bottleneck, because its involvement with read/write operations is minimized.
Mutation Operations in GFS
Mutation: any write or append operation. Primary leases:
The data needs to be written to all replicas
Guarantee the same order when multiple users request mutation operations
(Figure: write control flow, steps ① to ③ — the application hands (file name, data) to the GFS client; the client sends (file name, chunk index) to the master; the master replies with the chunk handle and the primary and secondary replica locations.)
(Figures: write control flow, steps ④ to ⑨ — the client pushes the data to the buffers of the primary and the secondary chunkservers; the client then sends the write command to the primary, which assigns a serial order and forwards the command and order to the secondaries; the secondaries respond to the primary, and the primary responds to the client.)
Application originates write request.
1. GFS client translates request from (filename, data) -> (filename, chunk index), and sends it to master.
2. Master responds with chunk handle and (primary + secondary) replica locations.
3. Client pushes write data to all locations.
4. Data is stored in chunkservers' internal buffers.
5. Client sends write command to primary.
6. Primary determines serial order for data instances stored in its buffer and writes the instances in that order to the chunk.
7. Primary sends serial order to the secondaries and tells them to perform the write.
8. Secondaries respond to the primary.
9. Primary responds back to client.
Note: If the write fails at one of the chunkservers, the client is informed and retries the write.
A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times.
Record Append Algorithm
Important operation at Google:
Merging results from multiple machines in one file.
Using file as producer - consumer queue.
Clients can read in parallel.
Clients can write in parallel.
Clients can append records in parallel.
1. Application originates record append request.
2. GFS client translates request and sends it to master.
3. Master responds with chunk handle and (primary + secondary) replica locations.
4. Client pushes write data to all locations.
5. Primary checks if record fits in specified chunk.
6. If record does not fit, then the primary: pads the chunk, tells secondaries to do the same, and informs the client. Client then retries the append with the next chunk.
7. If record fits, then the primary: appends the record, tells secondaries to do the same, receives responses from secondaries, and sends final response to the client.
Consistency Model
The consistency problem arises from concurrent modification: a set of data modifications all executed by different clients.
Consistent: all clients see the same thing.
Defined: all clients see the modification in its entirety (atomic).
GFS can distinguish defined regions from undefined ones.
68
                     Write                       Record Append
Serial success       Defined                     Defined, but interspersed with inconsistent
Concurrent success   Consistent but undefined    Defined, but interspersed with inconsistent
Failure              Inconsistent                Inconsistent
GFS achieves this by:
applying mutations to a chunk in the same order on all its replicas
using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down
The record append operation lets multiple mutations be individually added to the end of a file.
Reading Concurrently
Apparently all bets are off.
Clients cache chunk locations.
Seems to not be a problem since most of their modifications are record appends.
Fault Tolerance
Recovery: master and chunkservers are designed to restart and restore state in a few seconds.
Chunk Replication: across multiple machines, across multiple racks.
Master Mechanisms:
Log of all changes made to metadata.
Periodic checkpoints of the log.
Log and checkpoints replicated on multiple machines.
Master state is replicated on multiple machines.
"Shadow" masters for reading data if "real" master is down.
Data integrity: each chunk has an associated checksum.
Data Integrity
Each chunkserver must independently verify the integrity of its own copy by maintaining checksums.
A chunk is broken up into 64 KB blocks. Each has a corresponding 32 bit checksum.
For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester.
On read error, error is reported. Master will rereplicate the chunk.
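A minimal sketch of this per-block checksumming scheme (illustrative only; the slides do not name GFS's checksum function, so CRC32 is assumed here):

import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int BLOCK_SIZE = 64 * 1024;  // 64 KB blocks, as in GFS

    // Compute one 32-bit checksum per 64 KB block of a chunk.
    static long[] checksum(byte[] chunk) {
        int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        long[] sums = new long[blocks];
        for (int i = 0; i < blocks; i++) {
            CRC32 crc = new CRC32();
            int off = i * BLOCK_SIZE;
            int len = Math.min(BLOCK_SIZE, chunk.length - off);
            crc.update(chunk, off, len);
            sums[i] = crc.getValue();         // low 32 bits hold the checksum
        }
        return sums;
    }

    // Before serving a read, verify every block that overlaps [offset, offset + length).
    static boolean verifyRange(byte[] chunk, long[] sums, int offset, int length) {
        int first = offset / BLOCK_SIZE;
        int last = (offset + length - 1) / BLOCK_SIZE;
        for (int i = first; i <= last; i++) {
            CRC32 crc = new CRC32();
            int off = i * BLOCK_SIZE;
            int len = Math.min(BLOCK_SIZE, chunk.length - off);
            crc.update(chunk, off, len);
            if (crc.getValue() != sums[i]) return false;  // corrupted block: report error
        }
        return true;
    }
}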
Creation, Re-replication, and Rebalancing
Replicas are created for three reasons: chunk creation, re-replication, and rebalancing
When the master creates a chunk, it chooses where to place the initially empty replicas.
To place new replicas on chunkservers with below-average disk space utilization. Over time this will equalize disk utilization across chunkservers.
To limit the number of “recent” creations on each chunkserver.
To spread replicas of a chunk across racks.
Garbage Collection
Distributed garbage collection is a hard problem, but GFS does it in an effective way.
Easily identify all references to chunks: they are in the file-to-chunk mappings maintained exclusively by the master.
We can also easily identify all the chunk replicas: they are Linux files under designated directories on each chunkserver.
Any such replica not known to the master is “garbage.”
After a file is deleted, GFS does not immediately reclaim the available physical storage.
It does so only lazily during regular garbage collection at both the file and chunk levels.
It is found that this approach makes the system much simpler and more reliable.
Security
… Basically none
Relies on Google’s network being private
File permissions not mentioned in the paper
Individual users / applications must cooperate
MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets
Put forward by Google
Map function: input key/value pairs → intermediate key/value pairs
Reduce function: intermediate key K and all intermediate values associated with K → output value(s)
This model can express much real-world work
More than 1,000 MapReduce programs are run on Google's clusters every day
MapReduce Model
Two user-defined functions:
Map: input key/value pairs → intermediate output key/value pairs
Reduce: intermediate key K and all intermediate values associated with K → output value(s)
MapReduce Example
Inverted Index
Map: parses a document and emits <word, docId> pairs
Reduce: takes all pairs for a given word, sorts the docId values, and emits a <word, list(docId)> pair
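A minimal Hadoop-style sketch of this inverted index (illustrative only: class names are made up, the input format is assumed to be one "docId<TAB>document text" record per line, and it uses the same old org.apache.hadoop.mapred API as the WordCount example later in these slides):

import java.io.IOException;
import java.util.Iterator;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndex {
  // Map: for each word in the document, emit <word, docId>
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] parts = value.toString().split("\t", 2);   // "docId<TAB>text" (assumed layout)
      if (parts.length < 2) return;
      Text docId = new Text(parts[0]);
      for (String w : parts[1].trim().split("\\s+")) {
        if (!w.isEmpty()) output.collect(new Text(w), docId);
      }
    }
  }

  // Reduce: collect and sort all docIds for a word, emit <word, list(docId)>
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> docIds,
                       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      TreeSet<String> ids = new TreeSet<String>();        // sorted, de-duplicated docIds
      while (docIds.hasNext()) {
        ids.add(docIds.next().toString());
      }
      output.collect(word, new Text(ids.toString()));
    }
  }
}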
MapReduce ExecutionExecution
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
MapReduce Job Processing
Experience: Rewrite of Production Indexing System
Rewrote Google's production indexing system using MapReduce
New code is simpler, easier to understand
MapReduce takes care of failures, slow machines
Easy to make indexing faster by adding more machines
Usage: MapReduce jobs run in August 2004
Number of jobs: 29,423
Average job completion time: 634 secs
Machine days used: 79,186 days
Input data read: 3,288 TB
Intermediate data produced: 758 TB
Output data written: 193 TB
Average worker machines per job: 157
Average worker deaths per job: 1.2
Average map tasks per job: 3,351
Average reduce tasks per job: 55
Unique map implementations: 395
Unique reduce implementations: 269
Unique map/reduce combinations: 426
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
Hadoop
Hadoop: a distributed file system framework, led by Apache
Instance of GFS and MapReduce
Hadoop Distributed File System, HDFS
MapReduce programming model
Hadoop was named after its creator's (Doug Cutting's) child's stuffed elephant.
Challenges in Hadoop
Scalable
Hadoop can reliably store and process petabytes.
Economical
It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient
By distributing the data, Hadoop can process it in parallel on the nodes where the data is located.
Reliable
Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
HDFS Architecture
Hadoop MapReduce
MapReduce Component
Mapper and Reducer
A basic MapReduce job consists of a Mapper, a Reducer and a JobConf
JobTracker and TaskTracker
The JobTracker on the master and the TaskTrackers on the slaves provide the scheduling service
The master schedules and monitors every task on the slaves; each slave runs its assigned tasks
TaskTrackers run on the DataNodes of HDFS
JobClient
Submits the job to the master
JobInProgress
Created by the JobTracker to trace and monitor TaskInProgress
Launches tasks
MapTask and ReduceTask
A job includes a mapper, a combiner and a reducer; the mapper and combiner are executed by MapTask, and the reducer is executed by ReduceTask
Install
Preconditions
Linux
Java
Install Hadoop
Decompress; modify the configuration profile
Configure SSH (Secure Shell)
Delivers data by encryption
Node communication tool (protocol)
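A typical way to set up the passwordless SSH that Hadoop's start/stop scripts rely on (a sketch; exact steps depend on the cluster, and the key paths are the OpenSSH defaults):

$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost    # should now log in without a password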
Using
Start Hadoop
First, format the namenode: hadoop namenode -format
start-all.sh: starts all Hadoop processes, including namenode, datanode, jobtracker, tasktracker
Stop Hadoop
stop-all.sh: stops all Hadoop processes
HDFS operations
Create directory: hadoop dfs -mkdir testdir
Copy file: hadoop dfs -put README.txt readme
List files: hadoop dfs -ls
Observe node state
The following web interfaces can be used while Hadoop is running:
NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/
Using MapReduce
hadoop-xxx-core.jar includes all Hadoop classes
Design your MapReduce method (Java)
Run the MapReduce application
Nutch 1.2: a distributed information retrieval system, using HDFS as its file system
Example

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    // ... (remaining job configuration elided in the original slide)
    JobClient.runJob(conf);
  }
}
Thanks!
Appendix:
Installing Hadoop and Nutch on Windows
1. Install Cygwin
2. Install Hadoop
3. Install Nutch