Ceph: Massively Scalable Distributed Storage Patrick McGarry Director Community, Inktank

DEVIEW 2013


Slides from my talk at DEVIEW 2013


Page 1: DEVIEW 2013

Ceph: Massively Scalable Distributed Storage
Patrick McGarry
Director Community, Inktank

Page 2: DEVIEW 2013

Who am I?

● Patrick McGarry

● Dir Community, Inktank

● /. → ALU → P4 → Inktank

● @scuttlemonkey

● patrick@inktank.com

Page 3: DEVIEW 2013

Outline

● What is Ceph?

● Ceph History

● How does Ceph work?

● CRUSH Fundamentals

● Applications

● Testing

● Performance

● Moving forward

● Questions

Page 4: DEVIEW 2013

What is Ceph?

Page 5: DEVIEW 2013

Ceph at a glance

● Software: on commodity hardware. Ceph can run on any infrastructure, metal or virtualized, to provide a cheap and powerful storage cluster.

● All-in-1: object, block, and file. Low overhead doesn't mean just hardware, it means people too!

● CRUSH: awesome sauce. The infrastructure-aware placement algorithm allows you to do really cool stuff.

● Scale: huge and beyond. Designed for exabytes; current implementations are in the multi-petabyte range. HPC, Big Data, Cloud, raw storage.

Page 6: DEVIEW 2013

Unified storage system

● objects: native, RESTful

● block: thin provisioning, snapshots, cloning

● file: strong consistency, snapshots

Page 7: DEVIEW 2013

Distributed storage system

● data center scale: 10s to 10,000s of machines; terabytes to exabytes

● fault tolerant: no single point of failure; commodity hardware

● self-managing, self-healing

Page 8: DEVIEW 2013

Where did Ceph come from?

Pages 9–14: DEVIEW 2013 (image-only slides)
Page 15: DEVIEW 2013

UCSC research grant

● “Petascale object storage”: DOE (LANL, LLNL, Sandia)

● Scalability

● Reliability

● Performance: raw IO bandwidth, metadata ops/sec

● HPC file system workloads: thousands of clients writing to the same file or directory

Page 16: DEVIEW 2013

Distributed metadata management

● Innovative design: subtree-based partitioning for locality and efficiency

● Dynamically adapt to current workload

● Embedded inodes

● Prototype simulator in Java (2004)

● First line of Ceph code: summer internship at LLNL

● High security national lab environment

● Could write anything, as long as it was OSS

Page 17: DEVIEW 2013

The rest of Ceph

● RADOS – distributed object storage cluster (2005)

● EBOFS – local object storage (2004/2006)

● CRUSH – hashing for the real world (2005)

● Paxos monitors – cluster consensus (2006)

→ emphasis on consistent, reliable storage

→ scale by pushing intelligence to the edges

→ a different but compelling architecture

Page 18: DEVIEW 2013

Industry black hole

● Many large storage vendors: proprietary solutions that don't scale well

● Few open source alternatives (2006): very limited scale, or

● Limited community and architecture (Lustre)

● No enterprise feature sets (snapshots, quotas)

● PhD grads all built interesting systems... and then went to work for NetApp, DDN, EMC, or Veritas.

● They want you, not your project

Page 19: DEVIEW 2013

A different path

● Change the world with open source: do what Linux did to Solaris, Irix, Ultrix, etc.

● What could go wrong?

● License: GPL, BSD...

● LGPL: share changes, okay to link to proprietary code

● Avoid unsavory practices: dual licensing, copyright assignment

Page 20: DEVIEW 2013

How does Ceph work?

Page 21: DEVIEW 2013

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

[Diagram: APPs sit on LIBRADOS and RADOSGW, HOST/VM on RBD, CLIENT on CEPH FS; everything is built on RADOS]

Page 22: DEVIEW 2013

Why start with objects?

● more useful than (disk) blocks: names in a single flat namespace

● variable size

● simple API with rich semantics

● more scalable than files: no hard-to-distribute hierarchy

● update semantics do not span objects

● workload is trivially parallel

Page 23: DEVIEW 2013

Ceph object model

● pools: 1s to 100s

● independent namespaces or object collections

● replication level, placement policy

● objects: bazillions

● blob of data (bytes to gigabytes)

● attributes (e.g., “version=12”; bytes to kilobytes)

● key/value bundle (bytes to gigabytes)
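As a quick illustration of this object model, below is a minimal sketch using Ceph's python-rados bindings. The pool name 'data' and the config path are placeholders for the example, not something the slides prescribe.

    # Minimal python-rados sketch: pools, objects, and attributes.
    # The pool name 'data' and config path are placeholders.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('data')            # a pool = an object namespace

    ioctx.write_full('greeting', b'hello ceph')   # object data: a blob of bytes
    ioctx.set_xattr('greeting', 'version', b'12') # object attribute

    print(ioctx.read('greeting'))
    print(ioctx.get_xattr('greeting', 'version'))

    ioctx.close()
    cluster.shutdown()

The replication level and placement policy live with the pool on the cluster side; the client only names the pool it wants.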

Page 24: DEVIEW 2013

[Diagram: each OSD runs on top of a local file system (btrfs, xfs, or ext4) on its disk; monitors (M) run alongside the OSDs]

Page 25: DEVIEW 2013


Object Storage Daemons (OSDs)

● 10s to 10000s in a cluster

● one per disk, SSD, or RAID group, or ...

● hardware agnostic

● serve stored objects to clients

● intelligently peer to perform replication and recovery tasks

Monitors

● maintain cluster membership and state

● provide consensus for distributed decision-making

● small, odd number

● they do not serve stored objects to clients

Page 26: DEVIEW 2013

Data distribution

● all objects are replicated N times

● objects are automatically placed, balanced, migrated in a dynamic cluster

● must consider physical infrastructure: ceph-osds on hosts in racks in rows in data centers

● three approaches:

  ● pick a spot; remember where you put it

  ● pick a spot; write down where you put it

  ● calculate where to put it, where to find it

Page 27: DEVIEW 2013

CRUSH fundamentals

Page 28: DEVIEW 2013

CRUSH

● pseudo-random placement algorithm

● fast calculation, no lookup

● repeatable, deterministic

● statistically uniform distribution

● stable mapping

● limited data migration on change

● rule-based configuration

● infrastructure topology aware

● adjustable replication

● allows weighting

Page 29: DEVIEW 2013

[Diagram: CRUSH combines buckets (the storage hierarchy), rules, and the cluster map to produce the mapping]

Page 30: DEVIEW 2013

[Diagram: objects are hashed into placement groups (PGs), and each placement group is mapped onto a set of OSDs]

hash(object name) % num_pg  →  placement group

CRUSH(pg, cluster state, policy)  →  ordered set of OSDs
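The two-step mapping can be sketched in a few lines of Python. This toy is not the real CRUSH algorithm (which descends a weighted, topology-aware hierarchy of buckets); it only shows that placement is a repeatable calculation over the object name, the cluster map, and the policy rather than a lookup in a table.

    # Toy placement sketch -- NOT the real CRUSH algorithm, just the idea
    # that "where does this object live?" is a deterministic calculation.
    import hashlib

    def object_to_pg(obj_name, num_pgs):
        """hash(object name) % num_pg"""
        h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
        return h % num_pgs

    def pg_to_osds(pg, osds, replicas):
        """Repeatable, pseudo-random choice of 'replicas' distinct OSDs."""
        chosen, attempt = [], 0
        while len(chosen) < replicas:
            h = int(hashlib.md5(('%d:%d' % (pg, attempt)).encode()).hexdigest(), 16)
            osd = osds[h % len(osds)]
            if osd not in chosen:          # re-hash on collision
                chosen.append(osd)
            attempt += 1
        return chosen

    osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']
    pg = object_to_pg('myobject', num_pgs=128)
    print(pg, pg_to_osds(pg, osds, replicas=3))    # same answer on every node

Because every client and OSD computes the same answer from the same cluster map, there is no central lookup service in the data path.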

Page 31: DEVIEW 2013


Page 32: DEVIEW 2013

[Diagram, slides 32–35: the client computes where its data lives directly with CRUSH; there is no lookup service to ask]

Page 36: DEVIEW 2013

Applications

Page 37: DEVIEW 2013

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

[Diagram: APPs sit on LIBRADOS and RADOSGW, HOST/VM on RBD, CLIENT on CEPH FS; everything is built on RADOS]

Page 38: DEVIEW 2013

RADOS Gateway

● REST-based object storage proxy

● uses RADOS to store objects

● API supports buckets, accounting

● usage accounting for billing purposes

● compatible with S3, Swift APIs
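Because the gateway is S3-compatible, a stock S3 client can talk to it directly. A minimal sketch with boto3 follows; the endpoint, port, and credentials are placeholders you would replace with your own RGW user's values.

    # Talking to RADOS Gateway through its S3-compatible API with boto3.
    # Endpoint, region, and credentials below are placeholders.
    import boto3

    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',
        region_name='us-east-1',                  # arbitrary; RGW ignores it
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo-bucket')
    s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'hello from rgw')
    print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())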

Page 39: DEVIEW 2013

RADOS Block Device

● storage of disk images in RADOS

● decouple VM from host

● images striped across entire cluster (pool)

● snapshots

● copy-on-write clones

● support in

● Qemu/KVM, Xen

● mainline Linux kernel (2.6.34+)

● OpenStack, CloudStack

● Ganeti, OpenNebula
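For completeness, here is a minimal sketch of creating and writing an RBD image through the python-rbd bindings; the pool name 'rbd' and the image name are placeholders.

    # Creating and writing an RBD image via python-rbd.
    # Pool 'rbd' and image name 'vm-disk-0' are placeholders.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    rbd.RBD().create(ioctx, 'vm-disk-0', 10 * 1024 ** 3)   # 10 GiB image

    with rbd.Image(ioctx, 'vm-disk-0') as image:
        image.write(b'boot data', 0)                       # write at offset 0
        print(image.size())

    ioctx.close()
    cluster.shutdown()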

Page 40: DEVIEW 2013

Metadata Server (MDS)

● manages metadata for POSIX shared file system

● directory hierarchy

● file metadata (size, owner, timestamps)

● stores metadata in RADOS

● does not serve file data to clients

● only required for the shared file system

Page 41: DEVIEW 2013

Multiple protocols, implementations

● Linux kernel client: mount -t ceph 1.2.3.4:/ /mnt

● export (NFS), Samba (CIFS)

● ceph-fuse

● libcephfs.so: your app, Samba (CIFS), Ganesha (NFS), Hadoop (map/reduce)

[Diagram: the kernel client, ceph-fuse, and libcephfs-based gateways (your app, Samba for SMB/CIFS, Ganesha for NFS, Hadoop) all access the same file system]
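If your app links libcephfs directly, the python-cephfs bindings expose roughly the same calls. The sketch below is an approximation; exact signatures vary between releases, and the paths are placeholders.

    # Rough sketch of libcephfs access via the python-cephfs bindings.
    # Treat the exact signatures as illustrative; paths are placeholders.
    import cephfs

    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount()

    fs.mkdir('/demo', 0o755)
    fd = fs.open('/demo/hello.txt', 'w', 0o644)   # POSIX-style open
    fs.write(fd, b'hello cephfs', 0)              # write at offset 0
    fs.close(fd)

    print(fs.stat('/demo/hello.txt'))
    fs.shutdown()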

Page 42: DEVIEW 2013

Testing

Page 43: DEVIEW 2013

Package Building

● GitBuilder: builds packages / tarballs

● Very basic tool, works well though

● Push to git → .deb && .rpm for all platforms

● autosigned

● Pbuilder: used for major releases

● Clean room type of build

● Signed using different key

● As many distros as possible (All Debian, Ubuntu, Fedora 17&18, CentOS, SLES, OpenSUSE, tarball)

Page 44: DEVIEW 2013

Teuthology

● Our own custom test framework

● Allocates the machines you define to the cluster you define

● Runs tasks against that environment: set up Ceph, run workloads

● Pull from SAMBA gitbuilder → mount Ceph → run NFS client → mount it → run workload

● Makes it easy to stack things up, map them to hosts, run

● Suite of tests on our github

● Active community development underway!

Page 45: DEVIEW 2013

Performance

Page 46: DEVIEW 2013

4K Relative Performance

All Workloads: QEMU/KVM, BTRFS, or XFS
Read-Heavy Only Workloads: Kernel RBD, EXT4, or XFS

Page 47: DEVIEW 2013

128K Relative Performance

All Workloads: QEMU/KVM, or XFS
Read-Heavy Only Workloads: Kernel RBD, BTRFS, or EXT4

Page 48: DEVIEW 2013

4M Relative Performance

All Workloads: QEMU/KVM, or XFS
Read-Heavy Only Workloads: Kernel RBD, BTRFS, or EXT4

Page 49: DEVIEW 2013

What does this mean?

● Anecdotal results are great!

● Concrete evidence coming soon

● Performance is complicated: workload, hardware (especially the network!), and lots and lots of tunables

● Mark Nelson (nhm on #ceph) is the performance geek

● Lots of reading on ceph.com blog

● We welcome testing and comparison help: anyone want to help with an EMC / NetApp showdown?

Page 50: DEVIEW 2013

Moving forward

Page 51: DEVIEW 2013

Future plans

● Geo-Replication: an ongoing process. While the first pass for disaster recovery is done, we want to get to built-in, world-wide replication.

● Erasure Coding: reception efficiency, currently underway in the community!

● Tiering: headed to dynamic. We can already do this in a static, pool-based setup; looking to get to use-based migration.

● Governance: making it open-er. Been talking about it forever; the time is coming!

Page 52: DEVIEW 2013

Get Involved!

● CDS: our quarterly online summit puts the core devs together with the Ceph community.

● Ceph Day: not just for NYC. More planned, including Santa Clara and London. Keep an eye out: http://inktank.com/cephdays/

● IRC: geek-on-duty. During the week there are times when Ceph experts are available to help. Stop by oftc.net/ceph

● Lists: email makes the world go. Our mailing lists are very active; check out ceph.com for details on how to join in!

Page 53: DEVIEW 2013

Questions?

E-MAIL: patrick@inktank.com

WEBSITE: Ceph.com

SOCIAL: @scuttlemonkey, @ceph, Facebook.com/cephstorage