The Evolution of Storage on Linux Lenz Grimmer <[email protected]> FrOSCON 2015, Sankt Augustin 22. August 2015


Agenda

A trip down memory lane (pun intended)

Overview of how storage on Linux has evolved

Local file systems and related concepts/technologies

Network Services

Distributed / Cluster filesystems


Introduction

40+ file systems in /fs/

Focus on the most popular/widely used systems

Primary focus on the software side

High-level Descriptions only


Noteworthy Observations / Conclusions

The role of SourceForge.net today

Distribution kernels vs. mainline Linux

Honorable mention: Christoph Hellwig

Don't miss his talk about the Linux Storage Stack tomorrow (14:00, HS6)

Big thanks to: LWN, Kernelnewbies.org, Thorsten Leemhuis (Heise) and Wikipedia

The early days


MINIX file system

While developing Linux in 1991, Linus required some form of persistent storage

A Minix-compatible file system was the canonical choice:

Well-documented, robust

Exchange data with the host OS (and vice versa)

Severely limited

Max. file/filesystem size: 64 MB (16-bit block addresses)

14 char file names

Only one time stamp (mtime)
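The 64 MB limit follows from simple arithmetic. A minimal sketch, assuming the 1 KiB block size that the MINIX file system used (the slide itself only states the 16-bit block addresses); the function name is hypothetical:

```python
def max_filesystem_bytes(address_bits: int, block_size: int) -> int:
    """Largest size addressable with fixed-width block addresses."""
    return (2 ** address_bits) * block_size


# Assumption: 1 KiB blocks. The 16-bit address width comes from the slide.
minix_limit = max_filesystem_bytes(address_bits=16, block_size=1024)
print(minix_limit // 2**20, "MiB")
```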


Virtual File System Switch (VFS)

Abstraction/indirection layer that routes file-oriented system calls to the functions in the physical filesystem code that perform the I/O

Eased the addition of new file systems

Initially written by Chris Provenzano

Integrated into Linux 0.96

Defines a set of functions that every filesystem has to implement

Three kinds of objects: filesystems, inodes, and open files
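The indirection idea can be sketched in a few lines. This is only an illustration in Python, not the kernel's C interface: the class and method names (`FileSystem`, `MemFS`, `VFS`, `mount`) are hypothetical, and the real VFS defines far more operations than the two shown here.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """The set of functions every (hypothetical) filesystem must implement."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class MemFS(FileSystem):
    """Toy in-memory filesystem standing in for minix, ext, etc."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = bytes(data)


class VFS:
    """Routes a file-oriented call to the filesystem mounted at the path."""

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, fs):
        self._mounts[mount_point.rstrip("/") or "/"] = fs

    def _resolve(self, path):
        best = None
        for mp in self._mounts:
            prefix = mp if mp == "/" else mp + "/"
            if path == mp or path.startswith(prefix):
                # The longest matching mount point wins.
                if best is None or len(mp) > len(best):
                    best = mp
        if best is None:
            raise FileNotFoundError(path)
        rel = path if best == "/" else path[len(best):]
        return self._mounts[best], rel or "/"

    def read(self, path):
        fs, rel = self._resolve(path)
        return fs.read(rel)

    def write(self, path, data):
        fs, rel = self._resolve(path)
        fs.write(rel, data)
```

Adding a new filesystem then only means writing another `FileSystem` subclass; the callers of `VFS.read` and `VFS.write` stay unchanged.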


Extended File System (ext)

Designed by Rémy Card

Max. file/filesystem size: 2 GB, max. file name size was 255 chars

Metadata structure inspired by the traditional Unix File System (UFS)

Added to Linux 0.96c in April 1992

Issues remained (bad performance, missing time stamps, fragmentation)


Second Extended File System (ext2)

Also implemented by Rémy Card

Introduced in Linux Kernel 0.99 (January 1993)

Designed with extensibility in mind

Adopted advanced ideas from other file systems (e.g. the BSD Fast File System), such as mtime/ctime/atime, file attributes, BSD/SysV semantics, different block sizes, immutable/append-only files

Initially supported file and file system sizes up to 2 TB (limitation of the block device layer)

Kernel version 2.6.17 (March 2006) extended max. file system size to 32 TB (using 8 kB blocks)


FAT/MSDOS

Added to Linux in 1992/1993 by Werner Almesberger

VFAT support was later developed by Gordon Chaffee

The VFAT filesystem is compatible with Windows 95/NT long filenames on the FAT filesystem

Initially called xmsdos

Patches for Linux 1.2.x and 1.3.x.

As of Linux 1.3.60, the vfat filesystem is part of the Linux kernel distribution

Mtools as a userland-only alternative


NTFS

NTFS driver for Linux by Martin von Löwis (started around 1996)

Legato Systems sponsored Anton Altaparmakov's further development of NTFS on Linux from June 2001 onward

Read-only mode only, with no fault-tolerance supported

NTFS-TNG replaced the old NTFS driver in Linux 2.5.11 (April 29th, 2002)

NTFS-3G (FUSE-based) by Tuxera (read-write support)

The Age of Journaling Filesystems


Fsck vs. Journaling

Unclean unmounts, too many mount counts, or remounts after a long time period triggered file system checks

Disk drives got bigger

A journaling file system keeps track of changes not yet committed to the file system's main part in a journal

Keep track of just metadata changes or data as well

Several file systems were developed in parallel to alleviate this shortcoming of ext2, namely ext3, XFS, JFS and ReiserFS.
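Why a journal removes the need for a full file system check can be shown with a toy model. The sketch below is an assumption-laden illustration, not the on-disk format of any of the file systems named above: `JournaledStore`, `log` and `replay` are hypothetical names, and a missing commit record stands in for a crash in the middle of a transaction.

```python
class JournaledStore:
    """Toy write-ahead journal: changes are logged before they reach 'main'."""

    def __init__(self):
        self.main = {}      # the file system's main part (block -> contents)
        self.journal = []   # changes not yet committed to the main part

    def log(self, txid, updates, committed=True):
        """Append a transaction; committed=False simulates a crash mid-write."""
        for block, data in updates.items():
            self.journal.append(("update", txid, block, data))
        if committed:
            self.journal.append(("commit", txid))

    def replay(self):
        """Recovery after an unclean unmount: apply only complete transactions."""
        committed = {rec[1] for rec in self.journal if rec[0] == "commit"}
        for rec in self.journal:
            if rec[0] == "update" and rec[1] in committed:
                _, _, block, data = rec
                self.main[block] = data
        self.journal.clear()


store = JournaledStore()
store.log(1, {"inode 7": "size=10"})
store.log(2, {"inode 9": "size=99"}, committed=False)  # interrupted transaction
store.replay()
print(store.main)
```

Recovery only has to scan the journal, whose size is independent of the disk size, which is why growing disk drives made this approach attractive.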


Journaling Block Device layer (JBD)

JBD established as a filesystem-independent service, to be used by any file system

First incarnation of JBD developed by Stephen C. Tweedie together with the ext3 file system

OCFS2 and later ext4 also used JBD and its successor JBD2


Third extended filesystem (ext3)

Originally released in September 1999

Written by Stephen Tweedie for the 2.2 branch

Ported to 2.4 kernels by Peter Braam, Andreas Dilger, Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie

Merged with the mainline Linux kernel 2.4.15 (November 2001)

Basically ext2 with journaling capabilities, easy conversion

Max filesystem size: 8TB, Max 32k subdirs/directory


IBM JFS

Rooted in AIX and OS/2 Warp Server (new design in 1995)

Port to Linux started in December 1999 (Dave Kleikamp, Steve Best)

Uses own journaling implementation (metadata only)

Max volume size: 32PB, Max file size: 4PB

Later ported to AIX 5L as JFS2 (April 2001)

JFS 0.0.1 released in Feb. 2000, 0.1.0 (Beta) in August 2000

Version 1.0.0 was released in June 2001

Kernel module since 2.4.18pre9-ac4, Version 1.1.0 was included by Marcelo Tosatti in Linux 2.4.20.


ReiserFS

Supported early on by SuSE; introduced in Linux 2.4.1 (2001)

The first journaling file system to be included in mainline

Max volume size: 16TB

Based on B+ trees

Metadata-only journaling (block journaling since 2.6.8)

Online resizing

Tail packing block suballocation

Reiser4 still under active development (Edward Shishkin)


SGI XFS

64-bit journaling file system created by Silicon Graphics

SGI IRIX since 1994, GPLed in 2000

Version 1.0 for Linux in May 2001 as Patch against 2.4.2

Merged in 2.6.x and 2.4.25 (Feb 2004)

Steve Lord, Russell Cattelan, Nathan Scott, Jim Mostek

Advanced features, high performance

Max volume size: 16EB

Volume Management


The need for Logical Volume Management

Initially, Linux could only address disks/partitions

Changes to the layout required downtime and shuffling of data

Logical Volume Management abstracts physical disk drives

First incarnation of Linux LVM was introduced in Kernel version 2.4

Heinz Mauelshagen wrote the original LVM code in 1998, inspired by HP-UX's volume manager.


Device Mapper (DM)

A kernel framework for mapping physical block devices onto higher-level virtual block devices

Added in Linux 2.6

Passes data from a virtual block device, which is provided by the device mapper itself, to another block device

Pluggable design

Data can also be modified in transit

Forms the foundation of LVM2/EVMS, RAID and dm-crypt disk encryption and many other useful features
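The core mapping idea can be illustrated with a simple linear mapping, in which ranges of sectors on the virtual device are redirected to offsets on underlying devices. This is a hedged sketch in Python, not the kernel implementation: `Segment`, `LinearMap` and the device names are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple


class Segment(NamedTuple):
    start: int    # first sector on the virtual device
    length: int   # number of sectors covered
    device: str   # underlying block device (name is illustrative only)
    offset: int   # first sector on the underlying device


class LinearMap:
    """Maps sectors of a virtual block device onto underlying devices."""

    def __init__(self, segments):
        self.segments = sorted(segments)
        self.starts = [s.start for s in self.segments]

    def map_sector(self, sector):
        i = bisect_right(self.starts, sector) - 1
        if i < 0:
            raise ValueError("sector is before the first segment")
        seg = self.segments[i]
        if sector >= seg.start + seg.length:
            raise ValueError("sector is not covered by the mapping")
        return seg.device, seg.offset + (sector - seg.start)


# One virtual volume spanning two (hypothetical) disks.
volume = LinearMap([
    Segment(start=0, length=1000, device="sda", offset=2048),
    Segment(start=1000, length=500, device="sdb", offset=0),
])
print(volume.map_sector(1200))
```

A volume that spans several physical devices, as LVM2 offers, is essentially a table of such segments; other targets transform the data instead of just redirecting it.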


DM Multipath (DM-MPIO)

Consists of kernel components and user-space components

Provides input-output (I/O) fail-over and load-balancing within Linux for block devices

Handles the rerouting of block I/O to an alternate path in the event of a path failure

Can also balance the I/O load across all of the available paths in Fibre Channel (FC) or iSCSI SAN environments

Started as part of a patchset created by Joe Thornber, later maintained by Alasdair G Kergon at Red Hat. Christophe Varoqui maintains the userland multipath tools
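Fail-over and load-balancing across paths can be sketched as a round-robin selector that skips failed paths. This is a simplified illustration under stated assumptions, not the DM-MPIO code: the `Multipath` class and its methods are hypothetical, and real path selectors also weigh queue depth or service time.

```python
class Multipath:
    """Round-robin path selection that skips paths marked as failed."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.failed = set()
        self._next = 0

    def fail(self, path):
        self.failed.add(path)

    def reinstate(self, path):
        self.failed.discard(path)

    def select(self):
        """Return the path for the next I/O, rerouting around failures."""
        for _ in range(len(self.paths)):
            path = self.paths[self._next % len(self.paths)]
            self._next += 1
            if path not in self.failed:
                return path
        raise IOError("no usable path to the device")


mp = Multipath(["path-A", "path-B"])   # path names are illustrative only
print([mp.select() for _ in range(4)]) # load is spread over both paths
mp.fail("path-A")
print([mp.select() for _ in range(2)]) # all I/O is rerouted to the other path
```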


DM-Cache

Allows a fast device (e.g. an SSD) to be used as a cache for a slower device (e.g. a rotating disk)

Different policy plugins can be used to change the algorithms used to select which blocks are promoted, demoted, cleaned etc.

Supports writeback and writethrough modes

Requires three physical storage devices to separately store actual data, cache data and required metadata

Joe Thornber, Heinz Mauelshagen and Mike Snitzer

Included in mainline Linux kernel version 3.9, released on April 28, 2013
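The difference between the two write modes can be shown with a toy model. The sketch below assumes the simplest possible policy (every block that is read gets promoted); the `CachedDevice` class is hypothetical, and the actual policy plugins decide promotion and demotion far more selectively.

```python
class CachedDevice:
    """Toy fast-cache-over-slow-origin model with two write modes."""

    def __init__(self, mode="writethrough"):
        if mode not in ("writethrough", "writeback"):
            raise ValueError(mode)
        self.mode = mode
        self.origin = {}    # slow device (e.g. a rotating disk)
        self.cache = {}     # fast device (e.g. an SSD)
        self.dirty = set()  # cached blocks not yet written to the origin

    def write(self, block, data):
        self.cache[block] = data
        if self.mode == "writethrough":
            self.origin[block] = data   # origin is always up to date
        else:
            self.dirty.add(block)       # origin is updated later

    def read(self, block):
        if block in self.cache:
            return self.cache[block]
        data = self.origin[block]
        self.cache[block] = data        # simplistic promotion on first read
        return data

    def clean(self):
        """Write dirty blocks back to the origin device."""
        for block in sorted(self.dirty):
            self.origin[block] = self.cache[block]
        self.dirty.clear()
```

In writeback mode a write completes as soon as it reaches the fast device, at the price of the origin being stale until `clean` runs; writethrough trades that speed for an origin that is always consistent.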


LVM2

Based on DM

Flexible storage management

Add/remove disks

Resize/move logical volumes

Move LVs between PVs

Span volumes across multiple physical devices

RAID

Thin provisioning

Cluster Volume Manager


IBM EVMS

IBM-sponsored effort to provide volume management services for Linux

A single, unified system for handling all storage management tasks

Despite the many features and GUI management tools found in EVMS, LVM2 was preferred

As a result, IBM dropped their kernel driver and reworked their tools to work with LVM2 instead

Development stopped in 2006

Storage Services


NFS

Rick Sladkey was the original author of the NFS client and also ported the NFS server and the RPC library code. Doug Quale helped extend the kernel to support networking filesystems

NFS Version 2 since 1.2 kernel series

Kernel 2.2.18 was a major milestone: mixing Linux NFS with other operating systems' NFS, reliable file locking over NFS, and NFS Version 3.

NFS Versions 2, 3, and 4 are supported on 2.6 and later kernels. The Version 4.1 client requires at least kernel 2.6.31

NFSv4 for Linux has been under development at CITI and NetApp since 2001


Samba

A free-software re-implementation of the SMB/CIFS networking protocol

Andrew Tridgell started development of Samba in 1992, Jeremy Allison joined early on

Volker Lendecke founded SerNet in 1997, to provide commercial support

Version 3 (2003): provides file and print services for Microsoft Windows clients and can integrate with a Windows NT 4.0 server domain, either as a Primary Domain Controller (PDC) or as a domain member

Samba4 installations can act as an Active Directory domain controller or member server, at Windows 2008 domain and forest functional levels.


SMB vs. CIFS

SMB ("Server Message Block") and CIFS ("Common Internet File System") are protocols. CIFS is an extension of the SMB protocol

“smbfs” was an older file system that originated from the Samba project, heavily coupled with the Samba tools (smb.conf, smbmount, etc.). Removed in Linux 2.6.27

CIFS VFS was added to mainline Linux kernels in 2.5.42. Supports advanced network file system features such as locking, Unicode (advanced internationalization), hardlinks, DFS (hierarchical, replicated name space) and distributed caching, and uses native TCP names. All key network functions are implemented in the kernel

Current Filesystems


Fourth Extended Filesystem (ext4)

Advanced version of ext3, led by Ted Ts'o et al.

Incorporated scalability and reliability enhancements for supporting large filesystems up to 1 EB.

First experimental support for ext4 was merged into Linux 2.6.19, which was released on 29 November 2006.

Ext4 was marked as experimental until Linux 2.6.27

Starting with 2.6.28 (December 2008), ext4 was marked as stable

New extent format reduced metadata overhead (RAM, IO for access, transactions)
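Why extents reduce metadata can be seen by collapsing a per-block list into (start, length) runs. This is an illustrative sketch only, not the ext4 on-disk extent format; `to_extents` is a hypothetical helper.

```python
def to_extents(blocks):
    """Collapse a list of block numbers into (start, length) extents."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # extend the current run
        else:
            extents.append((block, 1))          # start a new run
    return extents


# A file stored in two contiguous runs: six block pointers become two extents.
print(to_extents([100, 101, 102, 103, 500, 501]))
```

For a large, mostly contiguous file, the number of extents stays small while a per-block map grows with the file size.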


Btrfs

Chris Mason (Oracle) in 2007

COW (Snapshots)

Checksums, Compression

RAID, Volume management

Conversion of ext3/4 file systems

Merged into mainline Linux 2.6.29 (March 2009)

Florian Winkler talks about Btrfs today (11:15, HS7)


ZFS

Filesystem and logical volume manager combined

Designed and implemented at Sun Microsystems (Jeff Bonwick, Matthew Ahrens)

Development started in 2001, officially announced in 2004

128bit, COW, Snapshots, Deduplication, RAID

OpenSolaris (CDDL)

Early port based on FUSE

Kernel modules based on OpenZFS (2013)

Not included in mainline Linux due to license incompatibilities

Network Storage


Network Block Device (NBD)

Remotely access a block device attached to another system

Userspace Server/Client, Client kernel module

Issues arise if network goes down or server crashes

Markus Pargmann talks about NBD on Sunday (16:30, HS6)


Distributed Replicated Block Device (DRBD)

A shared-nothing, synchronously replicated block device

“RAID1 over Network”

Writes to the primary node are transferred to the lower-level block device and simultaneously propagated to the secondary node

The secondary node then transfers data to its corresponding lower-level block device. All read I/O is performed locally

Fail-over capabilities (Secondary/Primary)

Lars Ellenberg and Philipp Reisner originally submitted code in July 2007

DRBD was merged on 8 December 2009 during the "merge window" for Linux kernel version 2.6.33
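The write and read paths described above can be modeled in a few lines. The sketch below is a deliberately simplified illustration, not DRBD's implementation: the `Node` and `ReplicatedDevice` classes are hypothetical, dictionaries stand in for the lower-level block devices, and the network transfer is reduced to a direct assignment.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.disk = {}   # stands in for the node's lower-level block device


class ReplicatedDevice:
    """Toy 'RAID1 over network': every write lands on both nodes."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, block, data):
        # Synchronous replication: the write is only considered complete
        # once both the local and the remote copy have been updated.
        self.primary.disk[block] = data
        self.secondary.disk[block] = data

    def read(self, block):
        # Reads never touch the network; they are served locally.
        return self.primary.disk[block]

    def failover(self):
        """Promote the secondary, e.g. after the primary has failed."""
        self.primary, self.secondary = self.secondary, self.primary


alpha, beta = Node("alpha"), Node("beta")   # node names are illustrative
device = ReplicatedDevice(alpha, beta)
device.write(0, b"data")
device.failover()
print(device.primary.name, device.read(0))
```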

Cluster Filesystems


OCFS/OCFS2

Shared disk file system by Oracle

Main focus of OCFS was to accommodate Oracle clustered databases; it was not POSIX-compliant

OCFS2 designed as a Linux filesystem from scratch

On-disk filesystem implementation heavily inspired by ext3, uses JBD for journaling

OCFS2 integrated into version 2.6.16 of mainline Linux

Max Volume/File Size 4PB (currently limited to 16TB)

Trivia question: what feature do OCFS2 and Btrfs have in common?


GFS/GFS2

Shared disk filesystem, allows concurrent access to the same block storage

Development of GFS began in 1995; it was originally developed by University of Minnesota professor Matthew O'Keefe and a group of students

Originally for SGI IRIX, ported to Linux in 1998

Acquired by Sistina in 2000, turned into proprietary product

OpenGFS fork

Red Hat acquired Sistina in 2003 and released GFS2 under GPL in June 2004

GFS2 and the DLM merged into Linux 2.6.19 (29 November 2006)


Storage Requirements and Challenges

Amount of data to be stored grows exponentially

Today, Storage has to be:

Fault tolerant, reliable

Scalable without limitations or service interruptions

Distributable

Easy to manage / automate

Previous approaches do not address these requirements

Distributed Filesystems


GlusterFS

Aggregates various storage servers over Ethernet or Infiniband RDMA interconnect into one large parallel network file system

Storage bricks export local file systems as volumes

GlusterFS clients create composite virtual volumes from multiple remote servers using stackable "translators"

Translators provide Mirroring, Replication, Striping, etc.

Final volume mounted by the client host either using its own native protocol via FUSE or using the NFS v3 protocol (via built-in server translator)

Originally developed by Gluster, Inc., which was acquired by Red Hat in 2011


Ceph

Initially created by Sage Weil, who founded Inktank in 2012

First release in July 2012

Object, block, and file storage from a single distributed computer cluster

Reliable autonomic distributed object store (RADOS)

RADOS Block Device (RBD), Snapshots

RadosGW provides REST API (Amazon S3/OpenStack Swift)

Completely distributed without a single point of failure

Replicates data for fault tolerance (CRUSH)

Ceph client code was merged into mainline Linux version 2.6.34

Red Hat acquired Inktank in April 2014


Lustre

Parallel distributed file system, generally used for large-scale cluster computing

Widely used in TOP500 supercomputers

Max. volume size: 100 PB (production), over 16 EB (theoretical)

Max. file size: 2.5 PB (ext4), 16 EB (ZFS)

Started as a research project in 1999 by Peter Braam at CMU, who founded Cluster File Systems Inc. in 2001 to work on InterMezzo, Coda and Lustre

First installed in March 2003 on the MCR Linux Cluster (Lawrence Livermore National Laboratory).

Lustre 1.0.0 was released in December 2003.

Acquired by Sun Microsystems in 2007

Oracle acquired Sun in 2010 and discontinued the development

Whamcloud->Intel, Open Scalable File Systems, Inc. (OpenSFS), Xyratex Inc.


Shameless plug: openATTIC

Unified Storage: manage XFS, ZFS, Btrfs, NFS, Samba

Modern GUI (AngularJS/Bootstrap)

REST API

Built-in Monitoring

Clustering (Pacemaker/Corosync, DRBD)

http://www.openattic.org/

Find us in the exhibition hall


PHP DEVELOPER (M/F) with Linux know-how

Are you passionate about development and at home in the open source world?

Then we should get to know each other!

These tasks await you with us…

• Development of our system monitoring tool openITCOCKPIT for frontend and/or backend

• Design and implementation of projects as part of a team

• Testing of the developed applications

• Maintenance and expansion of the existing development and test environment

For more information, see:

www.it-novum.com/karriere

Wanted: PHP developer (m/f) with Linux know-how

Thank you!