What you need to know about Ceph
Gluster Community Day, 20 May 2014
Haruka Iwao


Introduction to Ceph, an open-source, massively scalable distributed file system. This document explains the architecture of Ceph and integration with OpenStack.


Page 1: What you need to know about ceph

What you need to know about Ceph
Gluster Community Day, 20 May 2014

Haruka Iwao

Page 2: What you need to know about ceph

Index

What is Ceph?
Ceph architecture
Ceph and OpenStack
Wrap-up

Page 3: What you need to know about ceph

What is Ceph?

Page 4: What you need to know about ceph

Ceph

The name "Ceph" is a common nickname given to pet octopuses, short for cephalopod.

Page 5: What you need to know about ceph

Cephalopod?

Page 6: What you need to know about ceph
Page 7: What you need to know about ceph

Ceph is...

Open-source, massively scalable, software-defined object storage and file system

Page 8: What you need to know about ceph

History of Ceph

2003: Project born at UCSC
2006: Open sourced; papers published
2012: Inktank founded; "Argonaut" released

Page 9: What you need to know about ceph

In April 2014, Red Hat announced its acquisition of Inktank, the company behind Ceph.

Page 10: What you need to know about ceph

Yesterday

Red Hat acquires me: I joined Red Hat as an architect of storage systems.

This is just a coincidence.

Page 11: What you need to know about ceph

Ceph releases

Major release every 3 months:
Argonaut, Bobtail, Cuttlefish, Dumpling, Emperor, Firefly, Giant (coming in July)

Page 12: What you need to know about ceph

Ceph architecture

Page 13: What you need to know about ceph

Ceph at a glance

Page 14: What you need to know about ceph

Layers in Ceph

RADOS is to Ceph FS what /dev/sda is to ext4: RADOS is the underlying storage layer, and Ceph FS is the file system built on top of it.

Page 15: What you need to know about ceph

RADOS

Reliable: data is replicated to avoid data loss

Autonomic: OSDs communicate with each other to detect failures; replication is done transparently

Distributed Object Store

Page 16: What you need to know about ceph

RADOS (2)

The foundation of Ceph: everything is stored in RADOS, including Ceph FS metadata

Two daemon types: mon and osd

Data placement via the CRUSH algorithm

Page 17: What you need to know about ceph

OSD

Object storage daemon
One OSD per disk
Uses xfs or btrfs as the backend (btrfs is experimental!)
Write-ahead journal for integrity and performance
3 to 10,000s of OSDs in a cluster

Page 18: What you need to know about ceph

OSD (2)

[Diagram: each OSD daemon runs on top of a local file system (btrfs, xfs, or ext4) on its own disk.]

Page 19: What you need to know about ceph

MON

Monitoring daemon
Maintains the cluster map and cluster state
Deploy a small, odd number of them (e.g. 3 or 5)

Page 20: What you need to know about ceph

Locating objects

RADOS uses the CRUSH algorithm to locate objects
The location is decided through pure calculation
No central metadata server: no SPoF, massive scalability

Page 21: What you need to know about ceph

CRUSH

1. Assign a placement group: pg = hash(object name) % num_pg
2. CRUSH(pg, cluster map, rule) → a set of OSDs

Page 22: What you need to know about ceph

Cluster map

Hierarchical OSD map
Replication across failure domains
Avoids network congestion

Page 23: What you need to know about ceph

Object locations computed

Object name: abc, Pool: test

Hash("abc") % 256 = 0x23
Pool "test" = pool id 3

Placement Group: 3.23
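A minimal sketch of that placement-group step in Python, not Ceph's actual code: Ceph uses its own rjenkins hash and a "stable mod", so a generic CRC stands in here purely to show that the mapping is deterministic, client-side calculation.

    import zlib

    def placement_group(object_name, pool_id, pg_num):
        # pg = hash(object name) % num_pg, prefixed with the pool id
        pg = zlib.crc32(object_name.encode()) % pg_num
        return "{}.{:x}".format(pool_id, pg)

    # Pool "test" is assumed to have pool id 3 and 256 PGs, as on the slide.
    print(placement_group("abc", pool_id=3, pg_num=256))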

Page 24: What you need to know about ceph

PG to OSD

Placement Group: 3.23

CRUSH(PG 3.23, Cluster Map, Rule) → osd.1, osd.5, osd.9


Page 25: What you need to know about ceph

Synchronous Replication

Replication is synchronous to maintain strong consistency.

Page 26: What you need to know about ceph

When OSD fails

The OSD is marked "down"; 5 minutes later, it is marked "out"

The cluster map is then updated:

CRUSH(PG 3.23, Cluster Map #1, Rule) → osd.1, osd.5, osd.9

CRUSH(PG 3.23, Cluster Map #2, Rule) → osd.1, osd.3, osd.9

Page 27: What you need to know about ceph

Wrap-up: CRUSH

Object name + cluster map → object locations
Deterministic: no metadata at all
Calculation is done on the clients
The cluster map reflects the network hierarchy

Page 28: What you need to know about ceph

RADOSGW

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (see the Python sketch after this list)

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift
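As an illustration of the LIBRADOS layer above, storing and reading back an object with the Python binding looks roughly like this. A minimal sketch: the pool name "test" and the ceph.conf path are assumptions, and error handling is omitted.

    import rados

    # Connect to the cluster using the local ceph.conf and default keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a pool (assumed to exist) and write an object.
    ioctx = cluster.open_ioctx('test')
    ioctx.write_full('abc', b'hello rados')
    print(ioctx.read('abc'))

    ioctx.close()
    cluster.shutdown()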

Page 29: What you need to know about ceph

RADOSGW

S3 / Swift compatible gateway to RADOS
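For example, talking to RADOSGW through its S3-compatible API with the classic boto library might look like this. A sketch only: the host name and credentials are placeholders to replace with your own gateway and keys.

    import boto
    import boto.s3.connection

    # Placeholder endpoint and credentials for a RADOSGW instance.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Buckets and objects behave as they would against S3 itself.
    bucket = conn.create_bucket('my-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('Hello from RADOSGW')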

Page 30: What you need to know about ceph

RBD

[The same architecture diagram as above, with RBD highlighted.]

Page 31: What you need to know about ceph

RBD

RADOS Block Devices

Page 32: What you need to know about ceph

RBD

Directly mountable:
  rbd map foo --pool rbd
  mkfs -t ext4 /dev/rbd/rbd/foo

OpenStack integration via Cinder & Glance (explained later)
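Besides the kernel client, RBD images can also be created and written programmatically. A minimal sketch with the Python rbd binding, using the pool "rbd" and image name "foo" from the commands above; error handling is omitted.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # Create a 10 GiB image named "foo" and write to its first bytes.
    rbd.RBD().create(ioctx, 'foo', 10 * 1024 ** 3)
    image = rbd.Image(ioctx, 'foo')
    image.write(b'hello rbd', 0)
    image.close()

    ioctx.close()
    cluster.shutdown()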

Page 33: What you need to know about ceph

Ceph FS

[The same architecture diagram as above, with CEPH FS highlighted.]

Page 34: What you need to know about ceph

Ceph FS

A POSIX-compliant file system built on top of RADOS

Can be mounted with the native Linux kernel driver (cephfs) or via FUSE

Metadata servers (mds) manage the metadata of the file system tree

Page 35: What you need to know about ceph

Ceph FS is reliable

The MDS writes its journal to RADOS so that metadata is not lost when an MDS fails

Multiple MDSs can run for HA and load balancing

Page 36: What you need to know about ceph

Ceph FS and OSD

[Diagram: clients perform data I/O directly against the OSDs; the MDS holds the POSIX metadata (directory, time, owner, etc.) in memory and writes its metadata journal to the OSDs.]

Page 37: What you need to know about ceph
Page 38: What you need to know about ceph
Page 39: What you need to know about ceph
Page 40: What you need to know about ceph
Page 41: What you need to know about ceph

DYNAMIC SUBTREE PARTITIONING

Page 42: What you need to know about ceph

Ceph FS is experimental

Page 43: What you need to know about ceph

Other features

Rolling upgrades
Erasure coding
Cache tiering
Key-value OSD backend
Separate backend network

Page 44: What you need to know about ceph

Rolling upgrades

No interruption to the service when upgrading

Stop/start the daemons one by one: mon → osd → mds → radosgw

Page 45: What you need to know about ceph

Erasure coding

Use erasure coding instead of replication for data durability

Suitable for rarely modified or accessed objects

                       Erasure coding    Replication
Space overhead
(survive 2 failures)   approx. 40%       200%
CPU                    High              Low
Latency                High              Low
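The space-overhead figures follow from simple arithmetic. A small sketch: the k=5 / m=2 erasure-coding profile and 3-way replication are my own illustrative assumptions chosen to reproduce the numbers in the table above.

    # Erasure coding: k data chunks + m coding chunks survive m failures.
    k, m = 5, 2
    ec_overhead = float(m) / k          # 2/5 = 0.40, i.e. approx. 40%

    # Replication: n copies survive n-1 failures.
    replicas = 3
    rep_overhead = replicas - 1         # 2 extra copies, i.e. 200%

    print("EC overhead: {:.0%}, replication overhead: {:.0%}".format(
        ec_overhead, rep_overhead))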

Page 46: What you need to know about ceph

Cache tiering

A cache tier (e.g. SSD) sits in front of a base tier (e.g. HDD, possibly erasure coded)

Transparent to librados clients: reads and writes go to the cache tier

On a miss, data is read from the base tier or fetched into the cache tier; objects are flushed back to the base tier

Page 47: What you need to know about ceph

Key-value OSD backend

Use LevelDB as the OSD backend (instead of xfs)

Better performance, especially for small objects

Plans to support RocksDB, NVMKV, etc.

Page 48: What you need to know about ceph

Separate backend network

Clients write to the OSDs over a frontend network for service traffic (1. write); the OSDs replicate to each other over a separate backend network (2. replicate)

Page 49: What you need to know about ceph

OpenStack Integration

Page 50: What you need to know about ceph

OpenStack with Ceph

Page 51: What you need to know about ceph

RADOSGW and Keystone

Clients access RADOSGW (a RESTful object store) with a token; RADOSGW queries the Keystone server to validate the token, and Keystone grants or revokes access

Page 52: What you need to know about ceph

Glance Integration

The Glance server stores and downloads images in RBD.

/etc/glance/glance-api.conf:
  default_store=rbd
  rbd_store_user=glance
  rbd_store_pool=images

Just 3 lines of configuration are needed!

Page 53: What you need to know about ceph

Cinder/Nova Integration

[Diagram: the Cinder server manages volumes stored in RBD; volumes are copy-on-write clones of images. On the compute node, nova-compute boots the VM from the volume, which qemu accesses through librbd.]

Page 54: What you need to know about ceph

Benefits of using OpenStack with Ceph

Unified storage for both images and volumes

Copy-on-write cloning and snapshot support

Native qemu / KVM support for better performance

Page 55: What you need to know about ceph
Page 56: What you need to know about ceph

Wrap-up

Page 57: What you need to know about ceph

Ceph is

Massively scalable storage

Unified architecture for object / block / POSIX FS

OpenStack integration is ready to use & awesome

Page 58: What you need to know about ceph

Ceph and GlusterFS

                Ceph                                  GlusterFS
Distribution    Object based                          File based
File location   Deterministic algorithm (CRUSH)       Distributed hash table, stored in xattr
Replication     Server side                           Client side
Primary usage   Object / block storage                POSIX-like file system
Challenge       POSIX file system needs improvement   Object / block storage needs improvement

Page 59: What you need to know about ceph

Further readings

Ceph documentation

https://ceph.com/docs/master/

Well documented.

Sébastien Han

http://www.sebastien-han.fr/blog/

An awesome blog.

CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data

http://ceph.com/papers/weil-crush-sc06.pdf

CRUSH algorithm paper

Ceph: A Scalable, High-Performance Distributed File System

http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf

Ceph paper

An index of notes on Ceph (Ceph の覚え書きのインデックス)

http://www.nminoru.jp/~nminoru/unix/ceph/

A well-written introduction in Japanese

Page 60: What you need to know about ceph

One more thing

Page 61: What you need to know about ceph

Calamari will be open sourced

“Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced.”

http://ceph.com/community/red-hat-to-acquire-inktank/

Page 62: What you need to know about ceph

Calamari screens

Page 63: What you need to know about ceph

Thank you!