Introduction to Ceph, an open-source, massively scalable distributed file system. This document explains the architecture of Ceph and integration with OpenStack.
What you need to know about Ceph
Gluster Community Day, 20 May 2014
Haruka Iwao
Index
What is Ceph? Ceph architecture Ceph and OpenStack Wrap-up
What is Ceph?
Ceph
The name "Ceph" is a common nickname given to pet octopuses, short for cephalopod.
Cephalopod?
Ceph is...
Open-source, massively scalable, software-defined object storage and file system
History of Ceph
2003 Project born at UCSC
2006 Open sourced Papers published
2012 Inktank founded “Argonaut” released
In April 2014, Red Hat announced the acquisition of Inktank
Yesterday
Red Hat acquires me
I joined Red Hat as an architect of storage systems
This is just a coincidence.
Ceph releases
Major release every 3 months: Argonaut, Bobtail, Cuttlefish, Dumpling, Emperor, Firefly, Giant (coming in July)
Ceph architecture
Ceph at a glance
Layers in Ceph
RADOS is to Ceph FS what /dev/sda is to ext4: Ceph FS sits on top of RADOS just as ext4 sits on top of a block device
RADOS
Reliable: replicated to avoid data loss
Autonomic: OSDs communicate with each other to detect failures; replication is done transparently
Distributed Object Store
RADOS (2)
Fundamentals of Ceph:
Everything is stored in RADOS, including Ceph FS metadata
Two components: mon and osd
The CRUSH algorithm
OSD
Object storage daemon, one OSD per disk
Uses xfs/btrfs as the backend (btrfs is experimental!)
Write-ahead journal for integrity and performance
3 to 10,000s of OSDs in a cluster
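As a rough sketch, the OSDs of a running cluster can be inspected with the standard ceph CLI (output omitted here):
  ceph osd stat    # how many OSDs exist and how many are up/in
  ceph osd tree    # OSDs arranged in the CRUSH hierarchy (host, rack, ...)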
OSD (2)
[Diagram: each OSD runs on top of a local file system (btrfs, xfs, or ext4) on its own disk]
MON
Monitoring daemon
Maintains the cluster map and state
Deployed as a small, odd number of daemons
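A quick way to check the monitors and their quorum, as a sketch with the standard ceph CLI:
  ceph mon stat        # list the monitors and the current quorum
  ceph quorum_status   # detailed quorum state as JSON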
Locating objects
RADOS uses the CRUSH algorithm to locate objects: the location is decided by pure calculation
No central metadata server, hence no SPoF
Massive scalability
CRUSH
1. Assign a placement group: pg = Hash(object name) % num_pg
2. CRUSH(pg, cluster map, rule) → OSDs
Cluster map
Hierarchical map of OSDs
Used to replicate across failure domains and avoid network congestion
Object locations are computed from it
Example: object "abc" in pool "test"
Hash("abc") % 256 = 0x23; pool "test" has ID 3
→ Placement Group 3.23
PG to OSD
Placement Group: 3.23
CRUSH(PG 3.23, Cluster Map, Rule) → osd.1, osd.5, osd.9
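The same calculation can also be requested from the cluster; the pool and object names below simply reuse the example above:
  ceph osd map test abc   # prints the placement group and the OSDs selected by CRUSH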
Synchronous Replication
Replication is synchronous to maintain strong consistency
When an OSD fails
The OSD is marked "down"; 5 minutes later, it is marked "out"
Cluster map updated
CRUSH(PG 3.23, Cluster Map #1, Rule) → osd.1, osd.5, osd.9
CRUSH(PG 3.23, Cluster Map #2, Rule) → osd.1, osd.3, osd.9
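The 5-minute down-to-out delay corresponds to a monitor setting; a minimal ceph.conf sketch (the value shown is the usual default, in seconds):
  [mon]
      mon osd down out interval = 300   # seconds before a "down" OSD is also marked "out"
  ceph osd out 12                       # an OSD can also be marked out manually by id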
Wrap-up: CRUSH
Object name + cluster map → object locations
Deterministic, with no metadata at all
Calculation is done on the clients
The cluster map reflects the network hierarchy
RADOSGW
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
RADOSGW
S3 / Swift compatible gateway to RADOS
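For example, an S3-style user can be created on the gateway with radosgw-admin (uid and display name are placeholders); the access and secret keys it prints can then be used with any S3 client pointed at the gateway host:
  radosgw-admin user create --uid=demo --display-name="Demo User"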
RBD
RADOS Block Devices
RBD
Directly mountable:
rbd map foo --pool rbd
mkfs -t ext4 /dev/rbd/rbd/foo
OpenStack integration (Cinder & Glance), explained later
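A slightly fuller, hypothetical workflow around the commands above; image name, size, and mount point are assumptions:
  rbd create foo --size 4096 --pool rbd   # 4 GB image (size is in MB)
  rbd map foo --pool rbd                  # kernel client exposes /dev/rbd/rbd/foo
  mkfs -t ext4 /dev/rbd/rbd/foo
  mount /dev/rbd/rbd/foo /mnt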
Ceph FS
POSIX-compliant file system built on top of RADOS
Can be mounted with the native Linux kernel driver (cephfs) or via FUSE
Metadata servers (mds) manage the metadata of the file system tree
Ceph FS is reliable
The MDS writes its journal to RADOS so that metadata is not lost when an MDS fails
Multiple MDSes can run for HA and load balancing
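Both mount paths look roughly like this; the monitor address, credentials, and mount point are placeholders:
  # kernel client
  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
  # FUSE client
  ceph-fuse /mnt/cephfs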
Ceph FS and OSD
[Diagram: clients perform data I/O directly against the OSDs; the MDS holds POSIX metadata (directory, time, owner, etc.) in memory and writes its metadata journal to RADOS]
DYNAMIC SUBTREE PARTITIONING
Ceph FS is experimental
Other features
Rolling upgrades, erasure coding, cache tiering, key-value OSD backend, separate backend network
Rolling upgrades
No interruption to the service when upgrading
Stop/start daemons one by one: mon → osd → mds → radosgw
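On each node this is roughly a package upgrade followed by per-daemon restarts; the exact package and init commands depend on the distribution and release, so treat this as a sketch:
  apt-get install ceph           # or: yum update ceph
  service ceph restart mon.a     # monitors first
  service ceph restart osd.0     # then OSDs, one at a time
  service ceph restart mds.a     # then MDSes, then radosgw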
Erasure coding
Use erasure coding instead of replication for data durability
Suitable for rarely modified or accessed objects
                                         Erasure coding   Replication
Space overhead (to survive 2 failures)   approx. 40%      200%
CPU                                      High             Low
Latency                                  High             Low
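In Firefly an erasure-coded pool is created from a profile; with k=5 data chunks and m=2 coding chunks the pool survives 2 failures with roughly 40% space overhead, matching the table above (pool name and PG counts are placeholders):
  ceph osd erasure-code-profile set myprofile k=5 m=2
  ceph osd pool create ecpool 128 128 erasure myprofile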
Cache tiering
[Diagram: clients read and write through librados; a cache tier (e.g. SSD) sits transparently in front of a base tier (e.g. HDD, erasure coded), serving reads and writes, fetching objects from the base tier on a miss and flushing them back to it]
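Setting up a cache tier is roughly three commands; pool names are placeholders, and this is only a sketch of the Firefly-era tiering commands:
  ceph osd tier add cold-pool hot-pool           # attach the cache pool to the base pool
  ceph osd tier cache-mode hot-pool writeback    # cache absorbs writes and flushes later
  ceph osd tier set-overlay cold-pool hot-pool   # redirect client I/O through the cache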
Key-value OSD backend
Use LevelDB as the OSD backend (instead of xfs)
Better performance, especially for small objects
Plans to support RocksDB, NVMKV, etc.
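The experimental backend is selected per OSD in ceph.conf; the exact option value below is an assumption based on Firefly-era naming and may differ by release:
  [osd]
      osd objectstore = keyvaluestore-dev   # assumption: experimental LevelDB-backed store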
Separate backend network
A dedicated backend network carries replication traffic between OSDs, separate from the frontend network used for client service
[Diagram: clients write over the frontend network; OSDs replicate the write to each other over the backend network]
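The split is configured in ceph.conf with two subnets (the addresses below are placeholders):
  [global]
      public network  = 192.168.1.0/24   # frontend: client/service traffic
      cluster network = 192.168.2.0/24   # backend: replication and recovery between OSDs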
OpenStack Integration
OpenStack with Ceph
RADOSGW and Keystone
[Diagram: Keystone grants and revokes user access; a client queries the Keystone server for a token and accesses the RADOSGW RESTful object store with that token; RADOSGW validates tokens against Keystone]
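On the gateway side, Keystone integration is configured in ceph.conf; a sketch with placeholder host and token, using the option names documented for Firefly-era radosgw:
  [client.radosgw.gateway]
      rgw keystone url = http://keystone-host:35357
      rgw keystone admin token = KEYSTONE_ADMIN_TOKEN
      rgw keystone accepted roles = Member, admin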
Glance Integration
The Glance server stores and downloads images in RBD
/etc/glance/glance-api.conf:
default_store=rbd
rbd_store_user=glance
rbd_store_pool=images
Need just 3 lines!
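With that in place, images uploaded through the normal Glance CLI land in the RBD pool; the image name and file below are placeholders (raw format keeps copy-on-write cloning possible later):
  glance image-create --name cirros --disk-format raw --container-format bare --file cirros.img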
Cinder/Nova Integration
[Diagram: the Cinder server manages volumes stored in RBD; nova-compute boots VMs from those volumes through qemu/librbd; a volume is created as a copy-on-write clone of an image]
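On the Cinder side, the RBD driver is enabled in cinder.conf; a sketch with placeholder values (the secret UUID refers to the libvirt secret holding the cinder user's key):
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  rbd_pool = volumes
  rbd_user = cinder
  rbd_secret_uuid = <libvirt secret uuid>
  glance_api_version = 2   # needed for copy-on-write clones from Glance images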
Benefits of using Ceph with OpenStack
Unified storage for both images and volumes
Copy-on-write cloning and snapshot support
Native qemu / KVM support for better performance
Wrap-up
Ceph is
Massively scalable storage
Unified architecture for object / block / POSIX FS
OpenStack integration is ready to use & awesome
Ceph and GlusterFS
                 Ceph                                  GlusterFS
Distribution     Object based                          File based
File location    Deterministic algorithm (CRUSH)       Distributed hash table, stored in xattr
Replication      Server side                           Client side
Primary usage    Object / block storage                POSIX-like file system
Challenge        POSIX file system needs improvement   Object / block storage needs improvement
Further readings
Ceph Documents
https://ceph.com/docs/master/
Well documented.
Sébastien Han
http://www.sebastien-han.fr/blog/
An awesome blog.
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
http://ceph.com/papers/weil-crush-sc06.pdf
CRUSH algorithm paper
Ceph: A Scalable, High-Performance Distributed File System
http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
Ceph paper
Index of "Ceph の覚え書き" (Ceph notes)
http://www.nminoru.jp/~nminoru/unix/ceph/
Well-written introduction in Japanese
One more thing
Calamari will be open sourced
“Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced.”
http://ceph.com/community/red-hat-to-acquire-inktank/
Calamari screens
Thank you!