
A BigData Tour – HDFS, Ceph and MapReduce

These slides draw on the following sources: Jonathan Dursi – SciNet Toronto, Hadoop Tutorial; Amir Payberah – Course in Data Intensive Computing, SICS; Yahoo! Developer Network MapReduce Tutorial

EXTRA MATERIAL


CEPH – A HDFS REPLACEMENT

What is Ceph?

•  Ceph is a distributed, highly available, unified object, block and file storage system with no single point of failure (SPOF), running on commodity hardware

ARCHITECTURAL COMPONENTS

[Diagram: Ceph's architectural components as seen by an APP, a HOST/VM and a CLIENT]


Ceph Architecture – Host Level

•  At the host level…
•  We have Object Storage Devices (OSDs) and Monitors
•  Monitors keep track of the components of the Ceph cluster (i.e. where the OSDs are)
•  The device, host, rack, row, and room are stored by the Monitors and used to compute a failure domain
•  OSDs store the Ceph data objects
•  A host can run multiple OSDs, but it needs to be appropriately provisioned

OBJECT STORAGE DAEMONS

[Diagram: OSD daemons running on top of local filesystems – btrfs, xfs, ext4]

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

Ceph Architecture – Block Level

•  At the block device level…
•  An Object Storage Device (OSD) can be an entire drive, a partition, or a folder
•  OSDs must be formatted with ext4, XFS, or btrfs (experimental)

connect.linaro.org

[Diagram (Lightning Introduction to Ceph Architecture): several drives/partitions, each carrying a filesystem and an OSD, grouped into pools]

https://hkg15.pathable.com/static/attachments/112267/1423597913.pdf?1423597913


Ceph Architecture – Data Organization Level

•  At the data organization level…
•  Data are partitioned into pools
•  Pools contain a number of Placement Groups (PGs)
•  Ceph data objects map to PGs (via the hash of the object name, modulo the number of PGs – a toy sketch follows this slide)
•  PGs then map to multiple OSDs

connect.linaro.org

[Diagram (Lightning Introduction to Ceph Architecture): pool "mydata" containing objects grouped into PG #1 and PG #2, each PG mapped to several OSDs]

https://hkg15.pathable.com/static/attachments/112267/1423597913.pdf?1423597913
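To make the object-to-PG step concrete, here is a minimal Python sketch. It is not Ceph's actual code: Ceph uses its own rjenkins-based hash and a "stable mod", and the PG-to-OSD step is done by CRUSH. The pool size (pg_num) and object names are made up for illustration.

    import hashlib

    PG_NUM = 64  # placement groups in a hypothetical pool "mydata"

    def object_to_pg(object_name: str, pg_num: int = PG_NUM) -> int:
        """Map an object name to a placement group id via hash-then-modulo."""
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        return h % pg_num

    for name in ("NYAN", "foo", "bar"):
        print(name, "-> PG", object_to_pg(name))
    # The PG id (together with the pool id) is then fed to CRUSH, which
    # deterministically selects the set of OSDs holding that PG.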

Ceph Placement Groups

•  Ceph shards a pool into placement groups distributed evenly and pseudo-randomly across the cluster
•  The CRUSH algorithm dynamically assigns each object to a placement group, and assigns each placement group to a set of OSDs – creating a layer of indirection between the Ceph client and the OSDs storing the copies of an object
•  This layer of indirection allows the Ceph storage cluster to re-balance dynamically when new Ceph OSDs come online or when Ceph OSDs fail, so that the storage cluster can grow, shrink and recover from failure efficiently

[Diagram: CRUSH assigning objects to placement groups, and placement groups to OSDs]

If a pool has too few placement groups relative to the overall cluster size, Ceph will have too much data per placement group and won't perform well. If a pool has too many placement groups relative to the overall cluster, Ceph OSDs will use too much RAM and CPU and won't perform well. Setting an appropriate number of placement groups per pool, and an upper limit on the number of placement groups assigned to each OSD in the cluster, is critical to Ceph performance.
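As a rough illustration of "an appropriate number of placement groups", a commonly quoted rule of thumb is roughly (number of OSDs × 100) / replica count, rounded up to a power of two. This heuristic is not stated in these slides, so treat the target of ~100 PGs per OSD as an assumption:

    def suggested_pg_num(num_osds: int, replica_count: int, target_pgs_per_osd: int = 100) -> int:
        """Rule-of-thumb pg_num: ~100 PGs per OSD divided by the pool's replica
        count, rounded up to the next power of two (a convention, not a hard rule)."""
        raw = num_osds * target_pgs_per_osd / replica_count
        power = 1
        while power < raw:
            power *= 2
        return power

    print(suggested_pg_num(num_osds=12, replica_count=3))   # -> 512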

4. CRUSH

Ceph assigns a CRUSH ruleset to a pool. When a Ceph client stores or retrieves data in a pool, Ceph identifies the CRUSH ruleset, a rule, and the top-level bucket in the rule for storing and retrieving data. As Ceph processes the CRUSH rule, it identifies the primary OSD that contains the placement group for an object. That enables the client to connect directly to the OSD and read or write data.

To map placement groups to OSDs, a CRUSH map defines a hierarchical list of bucket types (i.e., under "types" in the generated CRUSH map). The purpose of creating a bucket hierarchy is to segregate the leaf nodes by their failure domains and/or performance domains, such as drive type, hosts, chassis, racks, power distribution units, pods, rows, rooms, and data centers.

With the exception of the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and you may define it according to your own needs if the default types don't suit your requirements. CRUSH supports a directed acyclic graph that models your Ceph OSD nodes, typically in a hierarchy, so you can support multiple hierarchies with multiple root nodes in a single CRUSH map. For example, you can create a hierarchy of SSDs for a cache tier, a hierarchy of hard drives with SSD journals, etc.

RedHat Ceph Architecture v1.2.3
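As an illustration of how a bucket hierarchy lets placement respect failure domains, here is a toy Python model of a CRUSH-like rule that picks replicas on distinct hosts. This is not the real CRUSH algorithm (which uses weighted straw buckets and its own hash); the hierarchy, PG id and replica count are invented:

    import hashlib

    # Toy cluster map: root -> racks -> hosts -> OSDs (failure domain = host)
    CLUSTER = {
        "rack1": {"host1": ["osd.0", "osd.1"], "host2": ["osd.2", "osd.3"]},
        "rack2": {"host3": ["osd.4", "osd.5"], "host4": ["osd.6", "osd.7"]},
    }

    def _score(key: str, item: str) -> int:
        return int(hashlib.sha1(f"{key}:{item}".encode()).hexdigest(), 16)

    def choose_osds(pg_id: str, replicas: int = 3) -> list[str]:
        """Deterministically pick `replicas` OSDs, at most one per host."""
        hosts = [(rack, host) for rack, hs in CLUSTER.items() for host in hs]
        # Rank hosts by a hash of (pg_id, host): repeatable, pseudo-random.
        ranked = sorted(hosts, key=lambda rh: _score(pg_id, rh[1]), reverse=True)
        chosen = ranked[:replicas]
        # Within each chosen host, pick the highest-scoring OSD.
        return [max(CLUSTER[r][h], key=lambda o: _score(pg_id, o)) for r, h in chosen]

    print(choose_osds("pool1.pg42"))   # the same input always yields the same OSD set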


Ceph Architecture – Overall View

[Diagram ("Software: Ceph"): APPs, HOSTs/VMs and Clients access the RADOS cluster through LibRados, RadosGW (S3/Swift), RBD and CephFS; RADOS itself comprises Monitors (MON.1 … MON.n), Metadata Servers (MDS.1 … MDS.n) and cluster nodes running OSDs, with pools (Pool 1 … Pool n) divided into placement groups (PG 1 … PG n) placed via the CRUSH map]

https://www.terena.org/activities/tf-storage/ws16/slides/140210-low_cost_storage_ceph-openstack_swift.pdf

Ceph Architecture – RADOS

•  An application interacts with a RADOS cluster
•  RADOS (Reliable Autonomic Distributed Object Store) is a distributed object service that manages the distribution, replication, and migration of objects
•  On top of that reliable storage abstraction Ceph builds a range of services, including a block storage abstraction (RBD, or Rados Block Device) and a cache-coherent distributed file system (CephFS)

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf


Ceph Architecture – RADOS Components

OSDs:
•  10s to 10,000s in a cluster
•  One per disk (or one per SSD, RAID group…)
•  Serve stored objects to clients
•  Intelligently peer for replication & recovery

Monitors:
•  Maintain cluster membership and state
•  Provide consensus for distributed decision-making
•  Deployed as a small, odd number
•  These do not serve stored objects to clients

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

Ceph Architecture – Where Do Objects Live?

[Diagram: an application object and a cluster of OSDs, with the placement question left open]

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf


Ceph Architecture – Where Do Objects Live?

•  Contact a Metadata server?

[Diagram ("A metadata server?"): the client first asks a metadata server where the object lives (step 1), then reads/writes the storage node (step 2)]

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

Ceph Architecture – Where Do Objects Live?

• Or calculate the placement via static mapping?

[Diagram ("Calculated placement"): a static mapping of object-name ranges (A-G, H-N, O-T, U-Z) to storage nodes]

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf


Ceph Architecture – CRUSH Maps

[Diagram ("Even better: CRUSH!"): the client computes object placement itself and talks directly to the RADOS cluster, with no lookup service involved]

*) CRUSH = Controlled Replication Under Scalable Hashing

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

Ceph Architecture – CRUSH Maps

•  Data objects are distributed across Object Storage Devices (OSDs), which refer to either physical or logical storage units, using CRUSH (Controlled Replication Under Scalable Hashing)

•  CRUSH is a deterministic hashing function that allows administrators to define flexible placement policies over a hierarchical cluster structure (e.g., disks, hosts, racks, rows, datacenters)

•  The location of objects can be calculated based on the object identifier and cluster layout (similar to consistent hashing), thus there is no need for a metadata index or server for the RADOS object store


http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf


Ceph Architecture – CRUSH – 1/2

[Diagram ("CRUSH is a quick calculation"): the client computes the location of an object directly and contacts the right OSDs in the RADOS cluster]

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

Ceph Architecture – CRUSH – 2/2

CRUSH: dynamic data placement
•  Pseudo-random placement algorithm
•  Fast calculation, no lookup
•  Repeatable, deterministic
•  Statistically uniform distribution
•  Stable mapping
•  Limited data migration on change (illustrated in the sketch below)
•  Rule-based configuration
•  Infrastructure topology aware
•  Adjustable replication
•  Weighting

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
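The "repeatable, deterministic" and "limited data migration on change" properties can be illustrated with highest-random-weight (rendezvous) hashing, which shares those properties with CRUSH. This is not Ceph's actual algorithm; the OSD names and PG ids are invented:

    import hashlib

    def place(pg_id: str, osds: list[str], replicas: int = 2) -> list[str]:
        """Rank OSDs by hash(pg_id, osd); the top `replicas` hold the PG."""
        score = lambda osd: int(hashlib.sha1(f"{pg_id}:{osd}".encode()).hexdigest(), 16)
        return sorted(osds, key=score, reverse=True)[:replicas]

    osds = [f"osd.{i}" for i in range(6)]
    before = {f"pg.{i}": place(f"pg.{i}", osds) for i in range(32)}

    osds.append("osd.6")                        # add one OSD to the cluster
    after = {pg: place(pg, osds) for pg in before}

    moved = sum(before[pg] != after[pg] for pg in before)
    print(f"{moved}/32 PGs changed their mapping")   # only a fraction of PGs move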


Ceph Architecture – librados

LIBRADOS: RADOS access for apps
•  Direct access to RADOS for applications (see the example below)
•  Bindings for C, C++, Python, PHP, Java, Erlang
•  Direct access to storage nodes
•  No HTTP overhead

[Diagram ("Accessing a RADOS cluster"): an application links against librados and talks to the RADOS cluster directly over native sockets]

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
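A minimal sketch of application access through the Python librados bindings, assuming the python-rados package is installed, a ceph.conf with monitor addresses exists at the default path, and a pool named "mydata" has already been created (pool and object names are illustrative):

    import rados

    # Connect to the cluster described in ceph.conf (monitors, keyring, ...)
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # An I/O context is bound to one pool; all objects below live in "mydata"
    ioctx = cluster.open_ioctx("mydata")
    ioctx.write_full("hello-object", b"stored directly in RADOS, no HTTP")
    print(ioctx.read("hello-object"))

    ioctx.close()
    cluster.shutdown()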

Ceph Architecture – RADOS Gateway

[Diagram ("The RADOS Gateway"): clients speak REST/HTTP to RADOSGW, which talks to the RADOS cluster over native sockets]

RADOSGW makes RADOS "webby"
•  REST-based object storage proxy
•  Uses RADOS to store objects
•  API supports buckets, accounts
•  Usage accounting for billing
•  Compatible with S3 and Swift applications (S3 example below)

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
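Because RADOSGW speaks the S3 API, a stock S3 client can be pointed at it. A sketch using boto3, where the endpoint URL, port and credentials are placeholders for whatever your gateway and user actually use:

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",   # hypothetical RGW endpoint
        aws_access_key_id="RGW_ACCESS_KEY",           # user created on the gateway
        aws_secret_access_key="RGW_SECRET_KEY",
    )

    s3.create_bucket(Bucket="demo-bucket")
    s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"stored via RADOSGW")
    for obj in s3.list_objects_v2(Bucket="demo-bucket").get("Contents", []):
        print(obj["Key"], obj["Size"])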


Ceph Architecture – RADOS Block Device (RBD) – 1/3

RBD stores virtual disks
•  Storage of disk images in RADOS (a usage sketch follows below)
•  Decouples VMs from host
•  Images are striped across the cluster (pool)
•  Snapshots
•  Copy-on-write clones
•  Support in:
   –  Mainline Linux kernel (2.6.39+)
   –  Qemu/KVM, native Xen coming soon
   –  OpenStack, CloudStack, Nebula, Proxmox

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
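A minimal sketch of creating and writing an RBD image through the Python rbd bindings, assuming python-rbd/python-rados are installed and a pool named "rbd" exists; the pool, image name and size are illustrative:

    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")                    # pool that will hold the image

    rbd.RBD().create(ioctx, "vm-disk-1", 4 * 1024**3)    # 4 GiB virtual disk

    # Write a few bytes at offset 0, as a hypervisor or guest would
    with rbd.Image(ioctx, "vm-disk-1") as image:
        image.write(b"\x00" * 512, 0)

    ioctx.close()
    cluster.shutdown()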

Ceph Architecture – RADOS Block Device (RBD) – 2/3

•  Virtual machine storage using RBD

•  Live migration using RBD

[Diagrams ("Storing virtual disks", "Separate compute from storage"): VM disk images live in the RADOS cluster rather than on the hypervisor host]

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf


Ceph Architecture – RADOS Block Device (RBD) – 3/3

•  Direct host access from Linux

[Diagram ("Kernel module for maximum flexibility"): a Linux host maps an RBD image through the in-kernel rbd module and uses it like a local block device backed by the RADOS cluster]

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

Ceph Architecture – CephFS – POSIX F/S

[Diagrams ("Separate metadata server", "Scalable metadata servers"): CephFS clients send metadata operations to the Metadata Server (MDS) and file data directly to the RADOS cluster]

Metadata Server (MDS)
•  Manages metadata for a POSIX-compliant shared filesystem (POSIX access example below)
   –  Directory hierarchy
   –  File metadata (owner, timestamps, mode, etc.)
•  Stores metadata in RADOS
•  Does not serve file data to clients
•  Only required for the shared filesystem

http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
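Since CephFS presents a POSIX filesystem, applications use it through ordinary file I/O once it is mounted (via the kernel client or ceph-fuse). A sketch assuming a hypothetical mount point /mnt/cephfs:

    import os

    path = "/mnt/cephfs/projects/report.txt"           # ordinary POSIX path on the mount

    os.makedirs(os.path.dirname(path), exist_ok=True)  # directory metadata handled by the MDS
    with open(path, "w") as f:                         # file data is striped over RADOS objects
        f.write("Hello CephFS\n")

    st = os.stat(path)                                 # owner, timestamps, mode come from the MDS
    print(st.st_size, oct(st.st_mode))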


Ceph – Read/Write Flows

https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction

Ceph Replicated I/O

With the ability to perform data replication on behalf of Ceph clients, Ceph OSD Daemons relieve Ceph clients from that duty, while ensuring high data availability and data safety.

Note: the primary OSD and the secondary OSDs are typically configured to be in separate failure domains (i.e., rows, racks, nodes, etc.). CRUSH computes the ID(s) of the secondary OSD(s) with consideration for the failure domains.

5.2. Erasure-coded I/O

Like replicated pools, in an erasure-coded pool the primary OSD in the up set receives all write operations. In replicated pools, Ceph makes a deep copy of each object in the placement group on the secondary OSD(s) in the set. For erasure coding, the process is a bit different. An erasure-coded pool stores each object as K+M chunks: it is divided into K data chunks and M coding chunks. The pool is configured to have a size of K+M so that each chunk is stored in an OSD in the acting set. The rank of the chunk is stored as an attribute of the object. The primary OSD is responsible for encoding the payload into K+M chunks and sending them to the other OSDs. It is also responsible for maintaining an authoritative version of the placement group logs.

For instance, an erasure-coded pool is created to use five OSDs (K+M = 5) and sustain the loss of two of them (M = 2). When the object NYAN containing ABCDEFGHI is written to the pool, the erasure encoding function splits the content into three data chunks simply by dividing the content in three: the first contains ABC, the second DEF and the last GHI. The content will be padded if the content length is not a multiple of K.

RedHat Ceph Architecture v1.2.3


Ceph – Erasure Coding – 1/5

•  Erasure coding is a theory that dates back to the 1960s. The most famous algorithm is Reed-Solomon; many variations followed, such as Fountain Codes, Pyramid Codes and Locally Repairable Codes.

•  An erasure code is usually defined by the total number of disks (N) and the number of data disks (K); it can tolerate N – K failures with a storage overhead of N/K.

•  E.g., a typical Reed-Solomon scheme is (8, 5), where 8 is the total number of disks and 5 is the number of data disks, with the data spread across the disks accordingly.

•  RS(8, 5) can tolerate 3 arbitrary failures. If some data chunks are missing, the remaining available chunks can be used to restore the original content. A worked overhead comparison follows below.

https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
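A small arithmetic sketch of the trade-off described above: the failure tolerance and raw-storage overhead of RS(8, 5). The comparison with 3-way replication is an added illustration, not from the slide:

    def erasure_profile(n_total: int, k_data: int):
        """Failures tolerated and raw-storage overhead for an (N, K) erasure code."""
        return {"tolerates": n_total - k_data, "overhead": n_total / k_data}

    print(erasure_profile(8, 5))   # {'tolerates': 3, 'overhead': 1.6}
    # For comparison: 3-way replication tolerates 2 failures at 3.0x overhead,
    # while RS(8, 5) tolerates 3 failures at only 1.6x overhead.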

Ceph – Erasure Coding – 2/5

•  Like replicated pools, in an erasure-coded pool the primary OSD in the up set receives all write operations

•  In replicated pools, Ceph makes a deep copy of each object in the placement group on the secondary OSD(s) in the set

•  For erasure coding, the process is a bit different. An erasure coded pool stores each object as K+M chunks. It is divided into K data chunks and M coding chunks. The pool is configured to have a size of K+M so that each chunk is stored in an OSD in the acting set.

•  The rank of the chunk is stored as an attribute of the object. The primary OSD is responsible for encoding the payload into K+M chunks and sends them to the other OSDs. It is also responsible for maintaining an authoritative version of the placement group logs.

https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction


Ceph – Erasure Coding – 3/5

•  5 OSDs (K+M=5); sustain loss of 2 (M=2)

•  Object NYAN with data “ABCDEFGHI” is split into 3 data chunks (ABC, DEF, GHI); the content is padded if its length is not a multiple of K

•  Coding blocks are YXY and QGC

The function also creates two coding chunks: the fourth with YXY and the fifth with GQC. Each chunk is stored in an OSD in the acting set. The chunks are stored in objects that have the same name (NYAN) but reside on different OSDs. The order in which the chunks were created must be preserved and is stored as an attribute of the object (shard_t), in addition to its name. Chunk 1 contains ABC and is stored on OSD5, while chunk 4 contains YXY and is stored on OSD3.

When the object NYAN is read from the erasure-coded pool, the decoding function reads three chunks: chunk 1 containing ABC, chunk 3 containing GHI and chunk 4 containing YXY. Then, it rebuilds the original content of the object, ABCDEFGHI. The decoding function is informed that chunks 2 and 5 are missing (they are called erasures). Chunk 5 could not be read because OSD4 is out. The decoding function can be called as soon as three chunks are read: OSD2 was the slowest and its chunk was not taken into account.

5.3. Cache-Tier I/O

A cache tier provides Ceph clients with better I/O performance for a subset of the data stored in a backing storage tier. Cache tiering involves creating a pool of relatively fast/expensive storage devices (e.g., solid state drives) configured to act as a cache tier, and a backing pool of either erasure-coded or relatively slower/cheaper devices configured to act as an economical storage tier.

RedHat Ceph Architecture v1.2.3
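To make the K=3, M=2 example concrete, here is a toy sketch of the splitting-and-padding step for NYAN. The coding chunks in Ceph come from a real erasure code (e.g. Reed-Solomon via the jerasure plugin); the XOR "parity" below is only a stand-in to show that extra chunks are derived from the data chunks, not Ceph's actual encoding:

    K = 3  # data chunks

    def split_into_chunks(payload: bytes, k: int = K) -> list[bytes]:
        """Pad to a multiple of k, then cut into k equal data chunks."""
        if len(payload) % k:
            payload += b"\x00" * (k - len(payload) % k)
        size = len(payload) // k
        return [payload[i * size:(i + 1) * size] for i in range(k)]

    data_chunks = split_into_chunks(b"ABCDEFGHI")      # [b'ABC', b'DEF', b'GHI']

    # Stand-in "coding" chunk: XOR of the data chunks. Real Ceph uses Reed-Solomon
    # and produces M independent coding chunks (YXY and GQC in the example above).
    parity = bytes(a ^ b ^ c for a, b, c in zip(*data_chunks))

    print(data_chunks, parity)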

Ceph – Erasure Coding – 4/5

•  On reading object NYAN from the erasure-coded pool, the decoding function reads three chunks – chunk 1 (ABC), chunk 3 (GHI) and chunk 4 (YXY) – and rebuilds the original content ABCDEFGHI

•  With M = 2, up to two missing chunks (the "erasures", here chunks 2 and 5) can be tolerated; the decoding function reconstructs the original content from any three available chunks

RedHat Ceph Architecture v1.2.3
