Distributed File Systems
http://idc.hust.edu.cn/~rxli/
Xinhua Dong, Ruixuan Li
School of Computer Science and Technology
Huazhong University of Science and Technology
Oct. 15, 2012
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
Introduction
A network file system is any computer file system that supports sharing of files, printers
and other resources as persistent storage over a computer network.
Distribution
A DFS is a network file system whose clients, servers, and storage devices are dispersed among the machines.
Transparency: a DFS should appear to its users to be a conventional, centralized file system.
Performance: the amount of time needed to satisfy service requests. Besides disk-access time and a small amount of CPU-processing time, a DFS incurs additional overhead: the time to deliver the request to the server, the time to deliver the response to the client, and, for each direction, the CPU overhead of running the communication protocol software.
Concurrent file updates: multiple clients may access and update the same files.
Outline
Introduction
File systems overview
SUN Network File System
Th A d Fil S t
5
The Andrew File System
Google File System MapReduce
Hadoop File System
File Systems Overview
System that permanently stores data
Usually layered on top of a lower-level physical storage medium
Divided into logical units called “files”
Addressable by a filename ("foo.txt")
Usually supports hierarchical nesting (directories)
File Paths
A file path joins file and directory names into a relative or absolute address that identifies a file
Absolute: /home/aaron/foo.txt
Relative: docs/someFile.doc
The shortest absolute path to a file is called its canonical path
The set of all canonical paths establishes the namespace for the file system
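For example (illustrative paths): /home/aaron/docs/../foo.txt and /home/aaron/./foo.txt both name the same file; its canonical path is /home/aaron/foo.txt.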
What Gets Stored
User data itself is the bulk of the file system's contents
Also includes meta-data on a drive-wide and per-file basis:
Drive-wide:
available space
formatting info
character set
...
Per-file:
name
owner
modification date
...
High-Level Organization
Files are organized in a “tree” structure made of nested directories
One directory acts as the "root"
"Links" (symlinks, shortcuts, etc.) provide a simple means of providing multiple access paths to one file
Other file systems can be “mounted” and dropped in as sub-hierarchies (other drives, network shares)
Low-Level Organization
File data and meta-data stored separately
File descriptors + meta-data stored in inodes
Large tree or table at a designated location on disk
Tells how to look up file contents
Meta-data may be replicated to increase system reliability
“Standard” read-write medium is a hard drive (other media: CDROM, tape, ...)
Viewed as a sequential array of blocks
Must address ~1 KB chunk at a time
Tree structure is "flattened" into blocks
Overlapping reads/writes/deletes can cause fragmentation: files are often not stored with a linear layout
Inodes store all block ids related to a file
Fragmentation
Design Considerations
Smaller inode size reduces amount of wasted space
Larger inode size increases speed of sequential reads (may not help random access)
Should the file system be faster or more reliable?
But faster at what: Large files? Small files? Lots of reading? Frequent writers, occasional readers?
File system Security
File systems in multi-user environments need to secure private data
Notion of username is heavily built into the FS
Different users have different access rights to files
UNIX Permission Bits
World is divided into three scopes:
User – The person who owns (usually created) the file
Group – A list of particular users who have "group ownership" of the file
Other – Everyone else
"Read," "write" and "execute" permissions applicable at each level
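A quick illustration (standard UNIX notation, not from the original slides): a simplified listing entry such as -rwxr-x--- aaron staff report.sh means owner aaron may read/write/execute, members of group staff may read/execute, and everyone else has no access; the same bits in octal are 750 (7 = rwx, 5 = r-x, 0 = ---), as set by chmod 750 report.sh.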
Access Control Lists
More general permissions mechanism
Implemented in Windows
Richer notion of privileges than r/w/x
e.g., SetPrivilege, Delete, Copy…
Allow for inheritance as well as deny lists
Can be complicated to reason about and can lead to security gaps
Process Permissions
Important note: processes running on behalf of user X have permissions associated with X, not process file owner Y
So if root owns ls, user aaron cannot use ls to peek at other users' files
Exception: the special permission "setuid" sets the user-id associated with a running process to the owner of the program file (for example, /usr/bin/passwd is typically setuid root so it can update the password database)
Disk Encryption
Data storage medium is another security concern
Most file systems store data in the clear and rely on runtime security to deny access
Assumes the physical disk won't be stolen
The disk itself can be encrypted
Hopefully by using separate passkeys for each user's files
(Challenge: how do you implement read access for group members?)
Metadata encryption may be a separate concern
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
Distributed File Systems
Support access to files on remote servers
Must support concurrency
Make varying guarantees about locking, who "wins" with concurrent writes, etc.
Must gracefully handle dropped connections
Can offer support for replication and local caching
Different implementations sit in different places on complexity/feature scale
General goal: Try to make a file system transparently available to remote clients.
(a) The remote access model. (b) The upload/download model.
Network File System (NFS)
First developed in the 1980s by Sun
Presented with standard UNIX FS interface
Network drives are mounted into local directory hierarchy
Type 'man mount' or 'mount' some time at the prompt if curious
NFS Protocol
Initially completely stateless
Operated over UDP; did not use TCP streams
File locking, etc, implemented in higher-level protocols
Modern implementations use TCP/IP & stateful protocols
NFS Architecture for UNIX Systems
NFS is implemented using the Virtual File System (VFS) abstraction, which is now used for lots of different operating systems:
Essence: VFS provides a standard file system interface and hides the difference between accessing a local or a remote file system.
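A rough sketch of the idea (illustrative only; this is not NFS's or any OS's actual VFS interface): one interface behind which a local implementation and an NFS-client implementation can both sit.

public interface VirtualFileSystem {
    // Uniform operations, regardless of where the data actually lives
    byte[] read(String path, long offset, int length) throws java.io.IOException;
    void write(String path, long offset, byte[] data) throws java.io.IOException;
    String[] list(String directory) throws java.io.IOException;
}

// class LocalFileSystem implements VirtualFileSystem { ... }      // talks to the local disk
// class NfsClientFileSystem implements VirtualFileSystem { ... }  // sends requests to an NFS server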
Server-side Implementation
NFS defines a virtual file system
Does not actually manage local disk layout on the server
Server instantiates NFS volume on top of local file system
Local hard drives managed by concrete file systems (EXT, ReiserFS, ...)
Typical implementation
Assuming a Unix-style scenario in which one machine requires access to data stored on another machine:
1. The server implements NFS daemon processes in order to make its data generically available to clients.
2. The server administrator determines what to make available, exporting the names and parameters of directories.
3. The server security administration ensures that it can recognize and approve validated clients.
4. The server network configuration ensures that appropriate clients can negotiate with it through any firewall system.
5. The client machine requests access to exported data, typically by issuing a mount command.
6. If all goes well, users on the client machine can then view and interact with mounted file systems on the server within the parameters permitted.
[webg@index1 ~]$ vi /etc/exports
# the file /etc/exports serves as the access control list for
# file systems which may be exported to NFS clients.
/home/infomall/udata 192.168.100.0/24(rw,async) 222.29.154.11(rw,async)
/home/infomall/hist 192.168.100.0/24(rw,async)
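On the client side, step 5 above is typically just a mount command. A sketch (the client prompt, mount point, and server address 192.168.100.1 are made up for illustration):

[webg@client1 ~]$ mkdir -p /mnt/udata
[webg@client1 ~]$ mount -t nfs 192.168.100.1:/home/infomall/udata /mnt/udata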
NFS Locking
NFS v4 supports stateful locking of files
Clients inform server of intent to lock
Server can notify clients of outstanding lock requests
Locking is lease-based: clients must continually renew locks before a timeout
Loss of contact with server abandons locks
NFS Client Caching
NFS Clients are allowed to cache copies of remote files for subsequent accesses
Supports close-to-open cache consistency
When client A closes a file, its contents are synchronized with the master, and the timestamp is changed
When client B opens the file, it checks that local timestamp agrees with server timestamp. If not, it discards local copy.
Concurrent reader/writers must use flags to disable caching
NFS: Tradeoffs
NFS volume managed by single server
Higher load on central server
Simplifies coherency protocols
Full POSIX system means it "drops in" very easily, but isn't "great" for any specific need
Distributed FS Security
Security is a concern at several levels throughout the DFS stack:
Authentication
Data transfer
Privilege escalation
How are these applied in NFS?
Authentication in NFS
Initial NFS system trusted client programs
User login credentials were passed to the OS kernel, which forwarded them to the NFS server
… A malicious client could easily subvert this
Modern implementations use more sophisticated systems (e.g., Kerberos)
Data Privacy
Early NFS implementations sent data in "plaintext" over the network
Modern versions tunnel through SSH
Double problem with the UDP (connectionless) protocol: observers could watch which files were being opened and then insert "write" requests with fake credentials to corrupt data
Privilege Escalation
Local file system username is used as the NFS username
Implication: being "root" on the local machine gives you root access to the entire NFS cluster
Solution: “root squash” – NFS hard-codes a privilege de-escalation from “root” down to “nobody” for all accesses.
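In export terms this is the root_squash option (the default on Linux NFS servers); extending the earlier /etc/exports sample for illustration, no_root_squash would disable the de-escalation:

/home/infomall/udata 192.168.100.0/24(rw,async,root_squash)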
RPCs in File System
Observation: Many (traditional) distributed file systems deploy remote procedure calls to access files. When wide-area networks need to be crossed, alternatives need to be exploited:
File Sharing Semantics
Problem: When dealing with distributed file systems, we need to take into account the ordering of concurrent read/write operations, and expected semantics (=consistency).
UNIX semantics: a read operation returns the effect of the last write operation; this can only be implemented for remote access models in which there is only a single copy of the file
Transaction semantics: the file system supports transactions on a single file; the issue is how to allow concurrent access to a physically distributed file
Session semantics: the effects of read and write operations are seen only by the client that has opened (a local copy of) the file; the question is what happens when the file is closed (only one client may actually win)
Consistency and Replication
Observation: In modern distributed file systems, client-side caching is the preferred technique for attaining performance; server-side replication is done for fault tolerance.
Observation: Clients are allowed to keep (large parts of) a file and will be notified when control is withdrawn; as a consequence, servers are now generally stateful.
Fault Tolerance
Observation: FT is handled by simply replicating file servers, generally using a standard primary-backup protocol:
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
AFS (The Andrew File System)
Developed at Carnegie Mellon
Strong security, high scalability
Supports 50,000+ clients at enterprise level
AFS heavily influenced Version 4 of NFS.
Security in AFS
Uses Kerberos authentication
Supports richer set of access control bits than UNIX
Separate "administer", "delete" bits
Allows application-specific bits
Local Caching
File reads/writes operate on locally cached copy
Local copy sent back to master when file is closed
Open local copies are notified of external updates through callbacks
Local Caching - Tradeoffs
Shared database files do not work well on this system
Does not support write-through to shared medium
Replication
AFS allows read-only copies of filesystem volumes
Copies are guaranteed to be atomic checkpoints of entire FS at time of read-only copy generation
Modifying data requires access to the sole r/w volume
Changes do not propagate to read-only copies
AFS Conclusions
Not quite POSIX
Stronger security/permissions
No file write-through
High availability through replicas, local caching
Not appropriate for all file types
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
Why Does Google Need GFS?
Motivation
Google stores dozens of copies of the entire Web!
More than 15,000 commodity-class PC’s
Multiple clusters distributed worldwide
Thousands of queries served per second
One query reads 100’s of MB of data
One query consumes 10’s of billions of CPU cycles
Conclusion: Need a large, distributed, highly fault-tolerant file system
Key Assumptions
All commodity hardware
Cheap but unreliable
Constantly failing, 100 disk failures/day
"Modest" number of large files
Few million
Each 100 MB to multi-GB
Some small files, but few
Read-mostly workload
Large streaming reads (multi-MB at a time)
Large sequential append operations
Must provide atomic consistency for parallel writes
Design Decisions in GFS
Files stored as chunks
Fixed size (immutable), each with a handle
Reliability through replication
Each chunk replicated across 3+ chunkservers
Stored as local files on the Linux file system
Single master to coordinate access, keep metadata
Simple centralized master per GFS cluster
Periodic heartbeat messages to check up on servers
No caching
Large data set / streaming reads render caching useless
Rely on the Linux buffer cache to keep data in memory
Architecture of GFS
What is a master?
A single process running on a separate machine
Stores all metadata:
File namespace
File to chunk mapping
Chunk location information
Access control information
Chunk version numbers
Etc.
File chunks: 64 MB, a large chunk size
Advantages:
Lower metadata overhead, can be stored in memory
Cache more index in client
Fewer interactions with the master: many operations fall within one chunk, so network traffic is reduced
Disadvantage: a small file may reside on a single chunkserver, which may become a hot spot
Chunk location information
Not stored in the master persistently
Obtained by querying chunkservers at startup or whenever a chunkserver joins the cluster
The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages
In-Memory Data Structures
Operations are fast; it is easy and efficient to periodically scan through the entire state:
Re-replication in the presence of chunkserver failure
Chunk migration for load balancing
Garbage collection
Memory is cheap. The capacity of the system is limited by the master's memory: roughly 64 bytes of metadata per 64 MB chunk
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory
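As a rough worked example of that ratio (illustrative arithmetic, not a figure from the paper): at 64 bytes of metadata per 64 MB chunk, 1 PB of data corresponds to about 16 million chunks, which needs only about 16M × 64 B ≈ 1 GB of master memory.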
Master <-> Chunkserver
Communicate regularly to obtain state:
Is chunkserver down?
Are there disk failures on chunkserver?
Are any replicas corrupted?
Which chunk replicas does chunkserver store?
Master sends instructions to the chunkserver:
Delete existing chunk
Create new chunk
Operation log for persistent logging
The operation log contains a historical record of critical metadata changes, replicated on multiple remote machines.
Checkpoints for fast recovery
Not only is it the only persistent record of metadata, but it also serves as a logical timeline that defines the order of concurrent operations.
Files and chunks, as well as their versions are all uniquely and eternally identified by the logical times at which they were created.
The master recovers its file system state by replaying the operation log.
Serving Requests:
Client retrieves metadata for operation from master.
Read/Write data flows between client and chunkserver.
Single master is not bottleneck, because its involvement with read/write operations is minimized.
Mutation Operations in GFS
Mutation: any write or append operation. Primary leases:
The data needs to be written to all replicas
Guarantee the same order when multiple users request mutation operations
(Figure: write control flow, steps ① to ③ — the application hands (file name, data) to the GFS client; the client sends (file name, chunk index) to the master; the master replies with the chunk handle and the primary and secondary replica locations.)
(Figures: write control flow, steps ④ to ⑨ — the client pushes the data to the buffers of the primary and the secondary chunkservers; the client then sends the write command to the primary, which assigns a serial order and forwards the command and order to the secondaries; the secondaries respond to the primary, and the primary responds to the client.)
Application originates write request.
1. GFS client translates request from (filename, data) -> (filename, chunk index), and sends it to master.
2. Master responds with chunk handle and (primary + secondary) replica locations.
3. Client pushes write data to all locations.
4. Data is stored in chunkservers' internal buffers.
5. Client sends write command to primary.
6. Primary determines serial order for data instances stored in its buffer and writes the instances in that order to the chunk.
7. Primary sends serial order to the secondaries and tells them to perform the write.
8. Secondaries respond to the primary.
9. Primary responds back to client.
Note: If the write fails at one of the chunkservers, the client is informed and retries the write.
A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times.
Record Append Algorithm
Important operation at Google:
Merging results from multiple machines in one file.
Using file as producer - consumer queue.
Clients can read in parallel.
Clients can write in parallel.
Clients can append records in parallel.
1. Application originates record append request.
2. GFS client translates request and sends it to master.
3. Master responds with chunk handle and (primary + secondary) replica locations.
4. Client pushes write data to all locations.
5. Primary checks if record fits in specified chunk.
6. If record does not fit, then the primary: pads the chunk, tells secondaries to do the same, and informs the client. Client then retries the append with the next chunk.
7. If record fits, then the primary: appends the record, tells secondaries to do the same, receives responses from secondaries, and sends final response to the client.
Consistency Model
The consistency problem arises from concurrent modification: a set of data modifications all executed by different clients.
Consistent: all clients see the same thing.
Defined: all clients see the modification in its entirety (atomic).
GFS can distinguish defined regions from undefined ones.
68
                     Write                       Record Append
Serial success       Defined                     Defined, but interspersed with inconsistent
Concurrent success   Consistent but undefined    Defined, but interspersed with inconsistent
Failure              Inconsistent                Inconsistent
GFS achieves this by:
applying mutations to a chunk in the same order on all its replicas
using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down
The record append operation lets multiple mutations be individually added to the end of a file.
Reading Concurrently
Apparently all bets are off.
Clients cache chunk locations.
Seems to not be a problem since most of their modifications are record appends.
Fault Tolerance
Recovery: master and chunkservers are designed to restart and restore state in a few seconds.
Chunk Replication: across multiple machines, across multiple racks.
Master Mechanisms:
Log of all changes made to metadata.
Periodic checkpoints of the log.
Log and checkpoints replicated on multiple machines.
Master state is replicated on multiple machines.
"Shadow" masters for reading data if "real" master is down.
Data integrity: each chunk has an associated checksum.
Data Integrity
Each chunkserver must independently verify the integrity of its own copy by maintaining checksums.
A chunk is broken up into 64 KB blocks. Each has a corresponding 32 bit checksum.
For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester.
On read error, error is reported. Master will rereplicate the chunk.
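A minimal sketch of this per-block checksumming scheme (illustrative only; the slides do not name GFS's checksum function, so CRC32 is assumed here):

import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int BLOCK_SIZE = 64 * 1024;  // 64 KB blocks, as in GFS

    // Compute one 32-bit checksum per 64 KB block of a chunk.
    static long[] checksum(byte[] chunk) {
        int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        long[] sums = new long[blocks];
        for (int i = 0; i < blocks; i++) {
            CRC32 crc = new CRC32();
            int off = i * BLOCK_SIZE;
            int len = Math.min(BLOCK_SIZE, chunk.length - off);
            crc.update(chunk, off, len);
            sums[i] = crc.getValue();         // low 32 bits hold the checksum
        }
        return sums;
    }

    // Before serving a read, verify every block that overlaps [offset, offset + length).
    static boolean verifyRange(byte[] chunk, long[] sums, int offset, int length) {
        int first = offset / BLOCK_SIZE;
        int last = (offset + length - 1) / BLOCK_SIZE;
        for (int i = first; i <= last; i++) {
            CRC32 crc = new CRC32();
            int off = i * BLOCK_SIZE;
            int len = Math.min(BLOCK_SIZE, chunk.length - off);
            crc.update(chunk, off, len);
            if (crc.getValue() != sums[i]) return false;  // corrupted block: report error
        }
        return true;
    }
}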
Creation, Re-replication, and Rebalancing
Replicas are created for three reasons: chunk creation, re-replication, and rebalancing
When the master creates a chunk, it chooses where to place the initially empty replicas.
To place new replicas on chunkservers with below-average disk space utilization. Over time this will equalize disk utilization across chunkservers.
To limit the number of “recent” creations on each chunkserver.
To spread replicas of a chunk across racks.
Garbage Collection
Distributed garbage collection is a hard problem, but GFS does it in an effective way.
Easily identify all references to chunks: they are in the file-to-chunk mappings maintained exclusively by the master.
We can also easily identify all the chunk replicas: they are Linux files under designated directories on each chunkserver.
Any such replica not known to the master is “garbage.”
After a file is deleted, GFS does not immediately reclaim the available physical storage.
It does so only lazily during regular garbage collection at both the file and chunk levels.
It is found that this approach makes the system much simpler and more reliable.
Security
… Basically none
Relies on Google’s network being private
File permissions not mentioned in the paper
Individual users / applications must cooperate
MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets
Put forward by Google
Map function: input key/value pairs → intermediate key/value pairs
Reduce function: intermediate key K and all intermediate values associated with K → output value(s)
This model can express much real-world work
More than 1,000 MapReduce programs are run on Google's clusters every day
MapReduce Model
Two user-defined functions:
Map: input key/value pairs → intermediate output key/value pairs
Reduce: intermediate key K and all intermediate values associated with K → output value(s)
MapReduce Example
Inverted Index
Map: parses a document and emits <word, docId> pairs
Reduce: takes all pairs for a given word, sorts the docId values, and emits a <word, list(docId)> pair
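A minimal Hadoop-style sketch of this inverted index (illustrative only: class names are made up, the input format is assumed to be one "docId<TAB>document text" record per line, and it uses the same old org.apache.hadoop.mapred API as the WordCount example later in these slides):

import java.io.IOException;
import java.util.Iterator;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndex {
  // Map: for each word in the document, emit <word, docId>
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] parts = value.toString().split("\t", 2);   // "docId<TAB>text" (assumed layout)
      if (parts.length < 2) return;
      Text docId = new Text(parts[0]);
      for (String w : parts[1].trim().split("\\s+")) {
        if (!w.isEmpty()) output.collect(new Text(w), docId);
      }
    }
  }

  // Reduce: collect and sort all docIds for a word, emit <word, list(docId)>
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> docIds,
                       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      TreeSet<String> ids = new TreeSet<String>();        // sorted, de-duplicated docIds
      while (docIds.hasNext()) {
        ids.add(docIds.next().toString());
      }
      output.collect(word, new Text(ids.toString()));
    }
  }
}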
MapReduce ExecutionExecution
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
MapReduce Job Processing
Experience: Rewrite of Production Indexing System
Rewrote Google's production indexing system using MapReduce
New code is simpler, easier to understand
MapReduce takes care of failures, slow machines
Easy to make indexing faster by adding more machines
Usage: MapReduce jobs run in August 2004
Number of jobs: 29,423
Average job completion time: 634 secs
Machine days used: 79,186 days
Input data read: 3,288 TB
Intermediate data produced: 758 TB
Output data written: 193 TB
Average worker machines per job: 157
Average worker deaths per job: 1.2
Average map tasks per job: 3,351
Average reduce tasks per job: 55
Unique map implementations: 395
Unique reduce implementations: 269
Unique map/reduce combinations: 426
Outline
Introduction
File systems overview
SUN Network File System
The Andrew File System
Google File System and MapReduce
Hadoop File System
Hadoop
Hadoop: a distributed file system framework, led by Apache
Instance of GFS and MapReduce
Hadoop Distributed File System, HDFS
MapReduce programming model
Hadoop was named after its creator's (Doug Cutting's) child's stuffed elephant.
Challenges in Hadoop
Scalable
Hadoop can reliably store and process petabytes.
Economical
It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient
By distributing the data, Hadoop can process it in parallel on the nodes where the data is located.
Reliable
Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
HDFS Architecture
Hadoop MapReduce
MapReduce Component
Mapper and Reducer
A basic MapReduce job consists of a Mapper, a Reducer and a JobConf
JobTracker and TaskTracker
The JobTracker on the master and the TaskTrackers on the slaves provide the scheduling service
The master schedules and monitors every task on the slaves; each slave runs its assigned tasks
TaskTrackers run on the DataNodes of HDFS
JobClient
Submits the job to the master
JobInProgress
Created by the JobTracker to trace and monitor TaskInProgress
Launches tasks
MapTask and ReduceTask
A job includes a mapper, a combiner and a reducer; the mapper and combiner are executed by MapTask, and the reducer is executed by ReduceTask
Install
Preconditions
Linux
Java
Install Hadoop
Decompress; modify the configuration profile
Configure SSH (Secure Shell)
Delivers data by encryption
Node communication tool (protocol)
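A typical way to set up the passwordless SSH that Hadoop's start/stop scripts rely on (a sketch; exact steps depend on the cluster, and the key paths are the OpenSSH defaults):

$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost    # should now log in without a password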
Using
Start Hadoop
First, format the namenode: hadoop namenode -format
start-all.sh: starts all Hadoop processes, including namenode, datanode, jobtracker, tasktracker
Stop Hadoop
stop-all.sh: stops all Hadoop processes
HDFS operations
Create directory: hadoop dfs -mkdir testdir
Copy file: hadoop dfs -put README.txt readme
List files: hadoop dfs -ls
Observe node state
The following web interfaces can be used while Hadoop is running:
NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/
Using MapReduce
hadoop-xxx-core.jar includes all Hadoop classes
Design your MapReduce method (Java)
Run the MapReduce application
Nutch 1.2: a distributed information retrieval system, using HDFS as its file system
Example

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    // ... (remaining job configuration elided in the original slide)
    JobClient.runJob(conf);
  }
}
Thanks!
Appendix:
Installing Hadoop and Nutch on Windows
1. Install Cygwin
2. Install Hadoop
3. Install Nutch