Ceph: Massively Scalable Distributed Storage Patrick McGarry Director Community, Inktank

DEVIEW 2013


Slides from my talk at DEVIEW 2013


Page 1: DEVIEW 2013

Ceph: Massively Scalable Distributed Storage
Patrick McGarry
Director Community, Inktank

Page 2: DEVIEW 2013

Who am I?

● Patrick McGarry

● Dir Community, Inktank

● /. → ALU → P4 → Inktank

● @scuttlemonkey

● patrick@inktank.com

Page 3: DEVIEW 2013

Outline

● What is Ceph?

● Ceph History

● How does Ceph work?

● CRUSH Fundamentals

● Applications

● Testing

● Performance

● Moving forward

● Questions

Page 4: DEVIEW 2013

What is Ceph?

Page 5: DEVIEW 2013

Ceph at a glance

● Software: on commodity hardware. Ceph can run on any infrastructure, metal or virtualized, to provide a cheap and powerful storage cluster.

● All-in-1: object, block, and file. Low overhead doesn't mean just hardware, it means people too!

● CRUSH: awesome sauce. The infrastructure-aware placement algorithm allows you to do really cool stuff.

● Scale: huge and beyond. Designed for exabytes; current implementations are in the multi-petabyte range. HPC, Big Data, Cloud, raw storage.

Page 6: DEVIEW 2013

Unified storage system

● objects: native, RESTful

● block: thin provisioning, snapshots, cloning

● file: strong consistency, snapshots

Page 7: DEVIEW 2013

Distributed storage system

● data center scale: 10s to 10,000s of machines; terabytes to exabytes

● fault tolerant: no single point of failure; commodity hardware

● self-managing, self-healing

Page 8: DEVIEW 2013

Where did Ceph come from?

Pages 9–14: DEVIEW 2013 (image-only slides)
Page 15: DEVIEW 2013

UCSC research grant

● “Petascale object storage”: DOE (LANL, LLNL, Sandia)

● Scalability

● Reliability

● Performance: raw IO bandwidth, metadata ops/sec

● HPC file system workloads: thousands of clients writing to the same file or directory

Page 16: DEVIEW 2013

Distributed metadata management

● Innovative design: subtree-based partitioning for locality and efficiency

● Dynamically adapt to current workload

● Embedded inodes

● Prototype simulator in Java (2004)

● First line of Ceph code: summer internship at LLNL

● High security national lab environment

● Could write anything, as long as it was OSS

Page 17: DEVIEW 2013

The rest of Ceph

● RADOS – distributed object storage cluster (2005)

● EBOFS – local object storage (2004/2006)

● CRUSH – hashing for the real world (2005)

● Paxos monitors – cluster consensus (2006)

→ emphasis on consistent, reliable storage

→ scale by pushing intelligence to the edges

→ a different but compelling architecture

Page 18: DEVIEW 2013

Industry black hole

● Many large storage vendors: proprietary solutions that don't scale well

● Few open source alternatives (2006): very limited scale, or

● Limited community and architecture (Lustre)

● No enterprise feature sets (snapshots, quotas)

● PhD grads all built interesting systems... and then went to work for NetApp, DDN, EMC, or Veritas.

● They want you, not your project

Page 19: DEVIEW 2013

A different path

● Change the world with open source: do what Linux did to Solaris, Irix, Ultrix, etc.

● What could go wrong?

● License: GPL, BSD...

● LGPL: share changes, okay to link to proprietary code

● Avoid unsavory practices: dual licensing, copyright assignment

Page 20: DEVIEW 2013

How does Ceph work?

Page 21: DEVIEW 2013

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

[Diagram: APPs sit on LIBRADOS and RADOSGW, HOST/VM on RBD, CLIENT on CEPH FS; everything is built on RADOS]

Page 22: DEVIEW 2013

Why start with objects?

● more useful than (disk) blocks: names in a single flat namespace

● variable size

● simple API with rich semantics

● more scalable than files: no hard-to-distribute hierarchy

● update semantics do not span objects

● workload is trivially parallel

Page 23: DEVIEW 2013

Ceph object model

● pools: 1s to 100s

● independent namespaces or object collections

● replication level, placement policy

● objects: bazillions

● blob of data (bytes to gigabytes)

● attributes (e.g., “version=12”; bytes to kilobytes)

● key/value bundle (bytes to gigabytes)
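As a quick illustration of this object model, below is a minimal sketch using Ceph's python-rados bindings. The pool name 'data' and the config path are placeholders for the example, not something the slides prescribe.

    # Minimal python-rados sketch: pools, objects, and attributes.
    # The pool name 'data' and config path are placeholders.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('data')            # a pool = an object namespace

    ioctx.write_full('greeting', b'hello ceph')   # object data: a blob of bytes
    ioctx.set_xattr('greeting', 'version', b'12') # object attribute

    print(ioctx.read('greeting'))
    print(ioctx.get_xattr('greeting', 'version'))

    ioctx.close()
    cluster.shutdown()

The replication level and placement policy live with the pool on the cluster side; the client only names the pool it wants.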

Page 24: DEVIEW 2013

[Diagram: each OSD runs on top of a local file system (btrfs, xfs, or ext4) on its disk; monitors (M) run alongside the OSDs]

Page 25: DEVIEW 2013


Object Storage Daemons (OSDs)

● 10s to 10000s in a cluster

● one per disk, SSD, or RAID group, or ...

● hardware agnostic

● serve stored objects to clients

● intelligently peer to perform replication and recovery tasks

Monitors

● maintain cluster membership and state

● provide consensus for distributed decision-making

● small, odd number

● they do not serve stored objects to clients

Page 26: DEVIEW 2013

Data distribution

● all objects are replicated N times

● objects are automatically placed, balanced, migrated in a dynamic cluster

● must consider physical infrastructure: ceph-osds on hosts in racks in rows in data centers

● three approaches:

  ● pick a spot; remember where you put it

  ● pick a spot; write down where you put it

  ● calculate where to put it, where to find it

Page 27: DEVIEW 2013

CRUSH fundamentals

Page 28: DEVIEW 2013

CRUSH

● pseudo-random placement algorithm

● fast calculation, no lookup

● repeatable, deterministic

● statistically uniform distribution

● stable mapping

● limited data migration on change

● rule-based configuration

● infrastructure topology aware

● adjustable replication

● allows weighting

Page 29: DEVIEW 2013

[Diagram: CRUSH combines buckets (the storage hierarchy), rules, and the cluster map to produce the mapping]

Page 30: DEVIEW 2013

[Diagram: objects are hashed into placement groups (PGs), and each placement group is mapped onto a set of OSDs]

hash(object name) % num_pg  →  placement group

CRUSH(pg, cluster state, policy)  →  ordered set of OSDs
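The two-step mapping can be sketched in a few lines of Python. This toy is not the real CRUSH algorithm (which descends a weighted, topology-aware hierarchy of buckets); it only shows that placement is a repeatable calculation over the object name, the cluster map, and the policy rather than a lookup in a table.

    # Toy placement sketch -- NOT the real CRUSH algorithm, just the idea
    # that "where does this object live?" is a deterministic calculation.
    import hashlib

    def object_to_pg(obj_name, num_pgs):
        """hash(object name) % num_pg"""
        h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
        return h % num_pgs

    def pg_to_osds(pg, osds, replicas):
        """Repeatable, pseudo-random choice of 'replicas' distinct OSDs."""
        chosen, attempt = [], 0
        while len(chosen) < replicas:
            h = int(hashlib.md5(('%d:%d' % (pg, attempt)).encode()).hexdigest(), 16)
            osd = osds[h % len(osds)]
            if osd not in chosen:          # re-hash on collision
                chosen.append(osd)
            attempt += 1
        return chosen

    osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']
    pg = object_to_pg('myobject', num_pgs=128)
    print(pg, pg_to_osds(pg, osds, replicas=3))    # same answer on every node

Because every client and OSD computes the same answer from the same cluster map, there is no central lookup service in the data path.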

Page 31: DEVIEW 2013


Page 32: DEVIEW 2013

[Diagram, slides 32–35: the client computes where its data lives directly with CRUSH; there is no lookup service to ask]

Page 36: DEVIEW 2013

Applications

Page 37: DEVIEW 2013

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

[Diagram: APPs sit on LIBRADOS and RADOSGW, HOST/VM on RBD, CLIENT on CEPH FS; everything is built on RADOS]

Page 38: DEVIEW 2013

RADOS Gateway

● REST-based object storage proxy

● uses RADOS to store objects

● API supports buckets, accounting

● usage accounting for billing purposes

● compatible with S3, Swift APIs
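Because the gateway is S3-compatible, a stock S3 client can talk to it directly. A minimal sketch with boto3 follows; the endpoint, port, and credentials are placeholders you would replace with your own RGW user's values.

    # Talking to RADOS Gateway through its S3-compatible API with boto3.
    # Endpoint, region, and credentials below are placeholders.
    import boto3

    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',
        region_name='us-east-1',                  # arbitrary; RGW ignores it
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo-bucket')
    s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'hello from rgw')
    print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())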

Page 39: DEVIEW 2013

RADOS Block Device

● storage of disk images in RADOS

● decouple VM from host

● images striped across entire cluster (pool)

● snapshots

● copy-on-write clones

● support in

● Qemu/KVM, Xen

● mainline Linux kernel (2.6.34+)

● OpenStack, CloudStack

● Ganeti, OpenNebula
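For completeness, here is a minimal sketch of creating and writing an RBD image through the python-rbd bindings; the pool name 'rbd' and the image name are placeholders.

    # Creating and writing an RBD image via python-rbd.
    # Pool 'rbd' and image name 'vm-disk-0' are placeholders.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    rbd.RBD().create(ioctx, 'vm-disk-0', 10 * 1024 ** 3)   # 10 GiB image

    with rbd.Image(ioctx, 'vm-disk-0') as image:
        image.write(b'boot data', 0)                       # write at offset 0
        print(image.size())

    ioctx.close()
    cluster.shutdown()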

Page 40: DEVIEW 2013

Metadata Server (MDS)

● manages metadata for POSIX shared file system

● directory hierarchy

● file metadata (size, owner, timestamps)

● stores metadata in RADOS

● does not serve file data to clients

● only required for the shared file system

Page 41: DEVIEW 2013

Multiple protocols, implementations

● Linux kernel client: mount -t ceph 1.2.3.4:/ /mnt

● export (NFS), Samba (CIFS)

● ceph-fuse

● libcephfs.so: your app, Samba (CIFS), Ganesha (NFS), Hadoop (map/reduce)

[Diagram: the kernel client, ceph-fuse, and libcephfs-based gateways (your app, Samba for SMB/CIFS, Ganesha for NFS, Hadoop) all access the same file system]
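If your app links libcephfs directly, the python-cephfs bindings expose roughly the same calls. The sketch below is an approximation; exact signatures vary between releases, and the paths are placeholders.

    # Rough sketch of libcephfs access via the python-cephfs bindings.
    # Treat the exact signatures as illustrative; paths are placeholders.
    import cephfs

    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount()

    fs.mkdir('/demo', 0o755)
    fd = fs.open('/demo/hello.txt', 'w', 0o644)   # POSIX-style open
    fs.write(fd, b'hello cephfs', 0)              # write at offset 0
    fs.close(fd)

    print(fs.stat('/demo/hello.txt'))
    fs.shutdown()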

Page 42: DEVIEW 2013

Testing

Page 43: DEVIEW 2013

Package Building

● GitBuilder: builds packages / tarballs

● Very basic tool, works well though

● Push to git → .deb && .rpm for all platforms

● autosigned

● Pbuilder: used for major releases

● Clean room type of build

● Signed using different key

● As many distros as possible (All Debian, Ubuntu, Fedora 17&18, CentOS, SLES, OpenSUSE, tarball)

Page 44: DEVIEW 2013

Teuthology

● Our own custom test framework

● Allocates the machines you define to the cluster you define

● Runs tasks against that environment: set up Ceph, run workloads

● Pull from SAMBA gitbuilder → mount Ceph → run NFS client → mount it → run workload

● Makes it easy to stack things up, map them to hosts, run

● Suite of tests on our github

● Active community development underway!

Page 45: DEVIEW 2013

Performance

Page 46: DEVIEW 2013

4K Relative Performance

All Workloads: QEMU/KVM, BTRFS, or XFS
Read-Heavy Only Workloads: Kernel RBD, EXT4, or XFS

Page 47: DEVIEW 2013

128K Relative Performance

All Workloads: QEMU/KVM, or XFS
Read-Heavy Only Workloads: Kernel RBD, BTRFS, or EXT4

Page 48: DEVIEW 2013

4M Relative Performance

All Workloads: QEMU/KVM, or XFS
Read-Heavy Only Workloads: Kernel RBD, BTRFS, or EXT4

Page 49: DEVIEW 2013

What does this mean?

● Anecdotal results are great!

● Concrete evidence coming soon

● Performance is complicated: workload, hardware (especially the network!), and lots and lots of tunables

● Mark Nelson (nhm on #ceph) is the performance geek

● Lots of reading on ceph.com blog

● We welcome testing and comparison help: anyone want to help with an EMC / NetApp showdown?

Page 50: DEVIEW 2013

Moving forward

Page 51: DEVIEW 2013

Future plans

● Geo-Replication: an ongoing process. While the first pass for disaster recovery is done, we want to get to built-in, world-wide replication.

● Erasure Coding: reception efficiency, currently underway in the community!

● Tiering: headed to dynamic. We can already do this in a static, pool-based setup; looking to get to use-based migration.

● Governance: making it open-er. Been talking about it forever; the time is coming!

Page 52: DEVIEW 2013

Get Involved!

● CDS: our quarterly online summit puts the core devs together with the Ceph community.

● Ceph Day: not just for NYC. More planned, including Santa Clara and London. Keep an eye out: http://inktank.com/cephdays/

● IRC: geek-on-duty. During the week there are times when Ceph experts are available to help. Stop by oftc.net/ceph

● Lists: email makes the world go. Our mailing lists are very active; check out ceph.com for details on how to join in!

Page 53: DEVIEW 2013

Questions?

E-MAIL: patrick@inktank.com

WEBSITE: Ceph.com

SOCIAL: @scuttlemonkey, @ceph, Facebook.com/cephstorage