Slides from my talk at DEVIEW 2013
Ceph: Massively Scalable Distributed Storage
Patrick McGarry
Director Community, Inktank
Who am I?
● Patrick McGarry
● Dir Community, Inktank
● /. → ALU → P4 → Inktank
● @scuttlemonkey
Outline
● What is Ceph?
● Ceph History
● How does Ceph work?
● CRUSH Fundamentals
● Applications
● Testing
● Performance
● Moving forward
● Questions
What is Ceph?
Ceph at a glance
On commodity hardware
Ceph can run on any infrastructure, metal or virtualized, to provide a cheap and powerful storage cluster.
Object, block, and file
Low overhead doesn't mean just hardware, it means people too!
Awesome sauce
Infrastructure-aware placement algorithm allows you to do really cool stuff.
Huge and beyond
Designed for exabyte, current implementations in the multi-petabyte. HPC, Big Data, Cloud, raw storage.
Software · All-in-1 · CRUSH · Scale
Unified storage system
● objects
  ● native
  ● RESTful
● block
  ● thin provisioning, snapshots, cloning
● file
  ● strong consistency, snapshots
Distributed storage system
● data center scale
  ● 10s to 10,000s of machines
  ● terabytes to exabytes
● fault tolerant
  ● no single point of failure
● commodity hardware
● self-managing, self-healing
Where did Ceph come from?
UCSC research grant
● “Petascale object storage”
  ● DOE: LANL, LLNL, Sandia
● Scalability
● Reliability
● Performance
  ● Raw IO bandwidth, metadata ops/sec
● HPC file system workloads
  ● Thousands of clients writing to same file, directory
Distributed metadata management
● Innovative design
  ● Subtree-based partitioning for locality, efficiency
  ● Dynamically adapt to current workload
  ● Embedded inodes
● Prototype simulator in Java (2004)
● First line of Ceph code
  ● Summer internship at LLNL
  ● High security national lab environment
  ● Could write anything, as long as it was OSS
The rest of Ceph
● RADOS – distributed object storage cluster (2005)
● EBOFS – local object storage (2004/2006)
● CRUSH – hashing for the real world (2005)
● Paxos monitors – cluster consensus (2006)
→ emphasis on consistent, reliable storage
→ scale by pushing intelligence to the edges
→ a different but compelling architecture
Industry black hole
● Many large storage vendors
  ● Proprietary solutions that don't scale well
● Few open source alternatives (2006)
  ● Very limited scale, or
  ● Limited community and architecture (Lustre)
  ● No enterprise feature sets (snapshots, quotas)
● PhD grads all built interesting systems...
  ● ...and then went to work for Netapp, DDN, EMC, Veritas.
  ● They want you, not your project
A different path
● Change the world with open source
  ● Do what Linux did to Solaris, Irix, Ultrix, etc.
  ● What could go wrong?
● License
  ● GPL, BSD...
  ● LGPL: share changes, okay to link to proprietary code
● Avoid unsavory practices
  ● Dual licensing
  ● Copyright assignment
How does Ceph work?
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
[Diagram: apps, hosts/VMs, and clients consume these interfaces, all of which sit on top of RADOS]
Why start with objects?
● more useful than (disk) blocks
  ● names in a single flat namespace
  ● variable size
  ● simple API with rich semantics
● more scalable than files
  ● no hard-to-distribute hierarchy
  ● update semantics do not span objects
  ● workload is trivially parallel
Ceph object model
● pools
  ● 1s to 100s
  ● independent namespaces or object collections
  ● replication level, placement policy
● objects
  ● bazillions
  ● blob of data (bytes to gigabytes)
  ● attributes (e.g., “version=12”; bytes to kilobytes)
  ● key/value bundle (bytes to gigabytes)
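The pool/object model above can be sketched with a toy in-memory structure. This is purely illustrative (the `Pool` class and field names are invented for this sketch, not part of any Ceph API); it just shows the shape of the data: a pool with a replication level, holding named objects that each carry a data blob, attributes, and a key/value bundle.

```python
# Toy sketch of the RADOS object model (illustrative only, not Ceph code).
class Pool:
    def __init__(self, name, replicas=3):
        self.name = name
        self.replicas = replicas   # replication level is a per-pool policy
        self.objects = {}          # flat namespace: object name -> object

    def write(self, oid, data=b"", attrs=None, omap=None):
        self.objects[oid] = {
            "data": data,          # blob of data (bytes to gigabytes)
            "attrs": attrs or {},  # attributes, e.g. {"version": "12"}
            "omap": omap or {},    # key/value bundle
        }

pool = Pool("rbd", replicas=3)
pool.write("img.0001", data=b"\x00" * 16, attrs={"version": "12"})
```

The point is that an object is just a name plus three independent payloads; there is no hierarchy to distribute, which is what makes the namespace easy to scale.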
[Diagram: each OSD runs on top of a local file system (btrfs, xfs, or ext4) on its own disk; a small set of monitors (M) oversees the cluster]
Object Storage Daemons (OSDs)
● 10s to 10000s in a cluster
● one per disk, SSD, or RAID group, or ...
● hardware agnostic
● serve stored objects to clients
● intelligently peer to perform replication and recovery tasks
Monitors
● maintain cluster membership and state
● provide consensus for distributed decision-making
● small, odd number
● do not serve stored objects to clients
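The "small, odd number" guidance follows from how consensus works: monitors agree via Paxos, which requires a strict majority. A quick sketch (hypothetical helper names, not Ceph code) shows why adding a fourth monitor buys nothing over three:

```python
# Paxos-style consensus needs a strict majority of monitors to agree.
def quorum(n_monitors: int) -> int:
    return n_monitors // 2 + 1

def failures_tolerated(n_monitors: int) -> int:
    return n_monitors - quorum(n_monitors)

for n in (3, 4, 5):
    print(f"{n} monitors: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Three monitors tolerate one failure, and so do four; you only gain fault tolerance at five. Even-sized sets add cost and failure surface without raising the number of survivable failures, hence the odd-number rule.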
Data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  ● ceph-osds on hosts in racks in rows in data centers
● three approaches
  ● pick a spot; remember where you put it
  ● pick a spot; write down where you put it
  ● calculate where to put it, where to find it
CRUSH fundamentals
CRUSH
● pseudo-random placement algorithm
  ● fast calculation, no lookup
  ● repeatable, deterministic
● statistically uniform distribution
● stable mapping
  ● limited data migration on change
● rule-based configuration
  ● infrastructure topology aware
  ● adjustable replication
  ● allows weighting
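The "stable mapping" and "limited data migration" properties can be demonstrated with a much simpler stand-in for CRUSH: rendezvous (highest-random-weight) hashing. This is not the actual CRUSH algorithm, and the `osd.*` names are invented for the sketch, but it exhibits the same key behavior: when an OSD is added, only the data that must move to the new OSD moves, and nothing shuffles between the survivors.

```python
import hashlib

def h(s: str) -> int:
    # md5 gives a stable hash across runs (Python's built-in hash() is salted)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def primary(pg: int, osds):
    # Rendezvous hashing: each OSD scores the PG; the highest score wins.
    return max(osds, key=lambda o: h(f"{pg}:{o}"))

before = ["osd.0", "osd.1", "osd.2", "osd.3"]
after = before + ["osd.4"]          # grow the cluster by one OSD

moved = [pg for pg in range(1000) if primary(pg, before) != primary(pg, after)]
print(f"{len(moved)} of 1000 PGs moved")   # roughly 1/5 in expectation

# Every PG that moved, moved TO the new OSD -- survivors keep their data.
assert all(primary(pg, after) == "osd.4" for pg in moved)
```

A naive `hash(pg) % num_osds` scheme would instead remap about 4/5 of the PGs on the same change, which is why modulo placement does not scale to dynamic clusters.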
[Diagram: clients compute object placement in two steps, using rules, buckets, and the cluster map]
hash(object name) % num_pg
CRUSH(pg, cluster state, policy)
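The two-step mapping above can be sketched end to end: an object name hashes to a placement group, and the placement group maps deterministically to a set of OSDs. Rendezvous hashing again stands in for the real CRUSH function, and `NUM_PGS`, `OSDS`, and the function names are assumptions of this sketch, not Ceph internals.

```python
import hashlib

def stable_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

NUM_PGS = 64
OSDS = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]

def object_to_pg(name: str) -> int:
    # Step 1: hash(object name) % num_pg
    return stable_hash(name) % NUM_PGS

def pg_to_osds(pg: int, replicas: int = 3):
    # Step 2: stand-in for CRUSH(pg, cluster state, policy) --
    # rank all OSDs by a per-(pg, osd) hash and take the top `replicas`.
    ranked = sorted(OSDS, key=lambda o: stable_hash(f"{pg}:{o}"), reverse=True)
    return ranked[:replicas]

pg = object_to_pg("img.0001")
print(pg, pg_to_osds(pg))
```

Because both steps are pure functions of the name and the cluster state, any client can compute where an object lives; there is no central lookup table, which is the point of "calculate where to put it, where to find it."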
Applications
RADOS Gateway
● REST-based object storage proxy
● uses RADOS to store objects
● API supports buckets, accounting
● usage accounting for billing purposes
● compatible with S3, Swift APIs
RADOS Block Device
● storage of disk images in RADOS
● decouple VM from host
● images striped across entire cluster (pool)
● snapshots
● copy-on-write clones
● support in
  ● Qemu/KVM, Xen
  ● mainline Linux kernel (2.6.34+)
  ● OpenStack, CloudStack
  ● Ganeti, OpenNebula
Metadata Server (MDS)
● manages metadata for POSIX shared file system
  ● directory hierarchy
  ● file metadata (size, owner, timestamps)
● stores metadata in RADOS
● does not serve file data to clients
● only required for the shared file system
Multiple protocols, implementations
● Linux kernel client
  ● mount -t ceph 1.2.3.4:/ /mnt
  ● export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  ● your app
  ● Samba (CIFS)
  ● Ganesha (NFS)
  ● Hadoop (map/reduce)
[Diagram: apps reach CephFS through the kernel client, ceph-fuse, or libcephfs directly — your app, Samba (SMB/CIFS), Ganesha (NFS), Hadoop]
Testing
Package Building
● GitBuilder
  ● Builds packages / tarballs
  ● Very basic tool, but it works well
  ● Push to git → .deb && .rpm for all platforms
  ● autosigned
● Pbuilder
  ● Used for major releases
  ● Clean-room style of build
  ● Signed using a different key
  ● As many distros as possible (all Debian, Ubuntu, Fedora 17 & 18, CentOS, SLES, OpenSUSE, tarball)
Teuthology
● Our own custom test framework
● Allocates the machines you define to the cluster you define
● Runs tasks against that environment
  ● Set up Ceph
  ● Run workloads
  ● e.g. pull from the Samba gitbuilder → mount Ceph → run NFS client → mount it → run workload
● Makes it easy to stack things up, map them to hosts, and run
● Suite of tests on our GitHub
● Active community development underway!
Performance
4K Relative Performance
All Workloads: QEMU/KVM, BTRFS, or XFS
Read Heavy Only Workloads: Kernel RBD, EXT4, or XFS
128K Relative Performance
All Workloads: QEMU/KVM, or XFS
Read Heavy Only Workloads: Kernel RBD, BTRFS, or EXT4
4M Relative Performance
All Workloads: QEMU/KVM, or XFS
Read Heavy Only Workloads: Kernel RBD, BTRFS, or EXT4
What does this mean?
● Anecdotal results are great!
● Concrete evidence coming soon
● Performance is complicated
  ● Workload
  ● Hardware (especially network!)
  ● Lots and lots of tunables
● Mark Nelson (nhm on #ceph) is the performance geek
● Lots of reading on the ceph.com blog
● We welcome testing and comparison help
  ● Anyone want to help with an EMC / NetApp showdown?
Moving forward
An ongoing process
While the first pass for disaster recovery is done, we want to get to built-in, world-wide replication.
Reception efficiency
Currently underway in the community!
Headed to dynamic
Can already do this in a static pool-based setup. Looking to get to a use-based migration.
Making it open-er
Been talking about it forever. The time is coming!
Geo-Replication · Erasure Coding · Tiering · Governance
Future plans
Quarterly Online Summit
Online summit puts the core devs together with the Ceph community.
Not just for NYC
More planned, including Santa Clara and London. Keep an eye out: http://inktank.com/cephdays/
Geek-on-duty
During the week there are times when Ceph experts are available to help. Stop by oftc.net/ceph
Email makes the world go
Our mailing lists are very active, check out ceph.com for details on how to join in!
CDS · Ceph Day · IRC · Lists
Get Involved!
Questions?
E-MAIL: patrick@inktank.com
WEBSITE: Ceph.com
SOCIAL: @scuttlemonkey, @ceph, Facebook.com/cephstorage