Introduction to Ceph, an open-source, massively scalable distributed file system. This document explains the architecture of Ceph and integration with OpenStack.
What you need to know about Ceph
Gluster Community Day, 20 May 2014
Haruka Iwao
Index
What is Ceph? Ceph architecture Ceph and OpenStack Wrap-up
What is Ceph?
Ceph
The name "Ceph" is a common nickname given to pet octopuses, short for cephalopod.
Cephalopod?
Ceph is...
Open-source, massively scalable, software-defined object storage and file system
History of Ceph
2003 Project born at UCSC
2006 Open sourced Papers published
2012 Inktank founded “Argonaut” released
In April 2014, Red Hat announced the acquisition of Inktank
Yesterday
Red Hat acquires me
I joined Red Hat as an architect of storage systems
This is just a coincidence.
Ceph releases
Major release every 3 months: Argonaut, Bobtail, Cuttlefish, Dumpling, Emperor, Firefly, Giant (coming in July)
Ceph architecture
Ceph at a glance
Layers in Ceph
RADOS is to Ceph FS what /dev/sda is to ext4: Ceph FS sits on top of RADOS just as ext4 sits on top of a block device
RADOS
Reliable: replicated to avoid data loss
Autonomic: OSDs communicate with each other to detect failures; replication is done transparently
Distributed Object Store
RADOS (2)
Fundamentals of Ceph:
Everything is stored in RADOS, including Ceph FS metadata
Two components: mon and osd
The CRUSH algorithm
OSD
Object storage daemon, one OSD per disk
Uses xfs/btrfs as the backend (btrfs is experimental!)
Write-ahead journal for integrity and performance
3 to 10,000s of OSDs in a cluster
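As a rough sketch, the OSDs of a running cluster can be inspected with the standard ceph CLI (output omitted here):
  ceph osd stat    # how many OSDs exist and how many are up/in
  ceph osd tree    # OSDs arranged in the CRUSH hierarchy (host, rack, ...)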
OSD (2)
[Diagram: each OSD runs on top of a local file system (btrfs, xfs, or ext4) on its own disk]
MON
Monitoring daemon
Maintains the cluster map and state
Deployed as a small, odd number of daemons
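A quick way to check the monitors and their quorum, as a sketch with the standard ceph CLI:
  ceph mon stat        # list the monitors and the current quorum
  ceph quorum_status   # detailed quorum state as JSON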
Locating objects
RADOS uses the CRUSH algorithm to locate objects: the location is decided by pure calculation
No central metadata server, hence no SPoF
Massive scalability
CRUSH
1. Assign a placement group: pg = Hash(object name) % num_pg
2. CRUSH(pg, cluster map, rule) → OSDs
Cluster map
Hierarchical map of OSDs
Used to replicate across failure domains and avoid network congestion
Object locations are computed from it
Example: object "abc" in pool "test"
Hash("abc") % 256 = 0x23; pool "test" has ID 3
→ Placement Group 3.23
PG to OSD
Placement Group: 3.23
CRUSH(PG 3.23, Cluster Map, Rule) → osd.1, osd.5, osd.9
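The same calculation can also be requested from the cluster; the pool and object names below simply reuse the example above:
  ceph osd map test abc   # prints the placement group and the OSDs selected by CRUSH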
Synchronous Replication
Replication is synchronous to maintain strong consistency
When an OSD fails
The OSD is marked "down"; 5 minutes later, it is marked "out"
Cluster map updated
CRUSH(PG 3.23, Cluster Map #1, Rule) → osd.1, osd.5, osd.9
CRUSH(PG 3.23, Cluster Map #2, Rule) → osd.1, osd.3, osd.9
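The 5-minute down-to-out delay corresponds to a monitor setting; a minimal ceph.conf sketch (the value shown is the usual default, in seconds):
  [mon]
      mon osd down out interval = 300   # seconds before a "down" OSD is also marked "out"
  ceph osd out 12                       # an OSD can also be marked out manually by id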
Wrap-up: CRUSH
Object name + cluster map → object locations
Deterministic, with no metadata at all
Calculation is done on the clients
The cluster map reflects the network hierarchy
RADOSGW
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
RADOSGW
S3 / Swift compatible gateway to RADOS
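For example, an S3-style user can be created on the gateway with radosgw-admin (uid and display name are placeholders); the access and secret keys it prints can then be used with any S3 client pointed at the gateway host:
  radosgw-admin user create --uid=demo --display-name="Demo User"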
RBD
RADOS Block Devices
RBD
Directly mountable:
rbd map foo --pool rbd
mkfs -t ext4 /dev/rbd/rbd/foo
OpenStack integration (Cinder & Glance), explained later
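A slightly fuller, hypothetical workflow around the commands above; image name, size, and mount point are assumptions:
  rbd create foo --size 4096 --pool rbd   # 4 GB image (size is in MB)
  rbd map foo --pool rbd                  # kernel client exposes /dev/rbd/rbd/foo
  mkfs -t ext4 /dev/rbd/rbd/foo
  mount /dev/rbd/rbd/foo /mnt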
Ceph FS
POSIX-compliant file system built on top of RADOS
Can be mounted with the native Linux kernel driver (cephfs) or via FUSE
Metadata servers (mds) manage the metadata of the file system tree
Ceph FS is reliable
The MDS writes its journal to RADOS so that metadata is not lost when an MDS fails
Multiple MDSes can run for HA and load balancing
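Both mount paths look roughly like this; the monitor address, credentials, and mount point are placeholders:
  # kernel client
  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
  # FUSE client
  ceph-fuse /mnt/cephfs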
Ceph FS and OSD
[Diagram: clients perform data I/O directly against the OSDs; the MDS holds POSIX metadata (directory, time, owner, etc.) in memory and writes its metadata journal to RADOS]
DYNAMIC SUBTREE PARTITIONING
Ceph FS is experimental
Other features
Rolling upgrades, erasure coding, cache tiering, key-value OSD backend, separate backend network
Rolling upgrades
No interruption to the service when upgrading
Stop/start daemons one by one: mon → osd → mds → radosgw
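On each node this is roughly a package upgrade followed by per-daemon restarts; the exact package and init commands depend on the distribution and release, so treat this as a sketch:
  apt-get install ceph           # or: yum update ceph
  service ceph restart mon.a     # monitors first
  service ceph restart osd.0     # then OSDs, one at a time
  service ceph restart mds.a     # then MDSes, then radosgw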
Erasure coding
Use erasure coding instead of replication for data durability
Suitable for rarely modified or accessed objects
                                         Erasure coding   Replication
Space overhead (to survive 2 failures)   approx. 40%      200%
CPU                                      High             Low
Latency                                  High             Low
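In Firefly an erasure-coded pool is created from a profile; with k=5 data chunks and m=2 coding chunks the pool survives 2 failures with roughly 40% space overhead, matching the table above (pool name and PG counts are placeholders):
  ceph osd erasure-code-profile set myprofile k=5 m=2
  ceph osd pool create ecpool 128 128 erasure myprofile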
Cache tiering
[Diagram: clients read and write through librados; a cache tier (e.g. SSD) sits transparently in front of a base tier (e.g. HDD, erasure coded), serving reads and writes, fetching objects from the base tier on a miss and flushing them back to it]
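Setting up a cache tier is roughly three commands; pool names are placeholders, and this is only a sketch of the Firefly-era tiering commands:
  ceph osd tier add cold-pool hot-pool           # attach the cache pool to the base pool
  ceph osd tier cache-mode hot-pool writeback    # cache absorbs writes and flushes later
  ceph osd tier set-overlay cold-pool hot-pool   # redirect client I/O through the cache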
Key-value OSD backend
Use LevelDB as the OSD backend (instead of xfs)
Better performance, especially for small objects
Plans to support RocksDB, NVMKV, etc.
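The experimental backend is selected per OSD in ceph.conf; the exact option value below is an assumption based on Firefly-era naming and may differ by release:
  [osd]
      osd objectstore = keyvaluestore-dev   # assumption: experimental LevelDB-backed store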
Separate backend network
A dedicated backend network carries replication traffic between OSDs, separate from the frontend network used for client service
[Diagram: clients write over the frontend network; OSDs replicate the write to each other over the backend network]
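The split is configured in ceph.conf with two subnets (the addresses below are placeholders):
  [global]
      public network  = 192.168.1.0/24   # frontend: client/service traffic
      cluster network = 192.168.2.0/24   # backend: replication and recovery between OSDs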
OpenStack Integration
OpenStack with Ceph
RADOSGW and Keystone
[Diagram: Keystone grants and revokes user access; a client queries the Keystone server for a token and accesses the RADOSGW RESTful object store with that token; RADOSGW validates tokens against Keystone]
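On the gateway side, Keystone integration is configured in ceph.conf; a sketch with placeholder host and token, using the option names documented for Firefly-era radosgw:
  [client.radosgw.gateway]
      rgw keystone url = http://keystone-host:35357
      rgw keystone admin token = KEYSTONE_ADMIN_TOKEN
      rgw keystone accepted roles = Member, admin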
Glance Integration
The Glance server stores and downloads images in RBD
/etc/glance/glance-api.conf:
default_store=rbd
rbd_store_user=glance
rbd_store_pool=images
Need just 3 lines!
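With that in place, images uploaded through the normal Glance CLI land in the RBD pool; the image name and file below are placeholders (raw format keeps copy-on-write cloning possible later):
  glance image-create --name cirros --disk-format raw --container-format bare --file cirros.img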
Cinder/Nova Integration
[Diagram: the Cinder server manages volumes stored in RBD; nova-compute boots VMs from those volumes through qemu/librbd; a volume is created as a copy-on-write clone of an image]
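On the Cinder side, the RBD driver is enabled in cinder.conf; a sketch with placeholder values (the secret UUID refers to the libvirt secret holding the cinder user's key):
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  rbd_pool = volumes
  rbd_user = cinder
  rbd_secret_uuid = <libvirt secret uuid>
  glance_api_version = 2   # needed for copy-on-write clones from Glance images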
Benefits of using Ceph with OpenStack
Unified storage for both images and volumes
Copy-on-write cloning and snapshot support
Native qemu / KVM support for better performance
Wrap-up
Ceph is
Massively scalable storage
Unified architecture for object / block / POSIX FS
OpenStack integration is ready to use & awesome
Ceph and GlusterFS
                 Ceph                                  GlusterFS
Distribution     Object based                          File based
File location    Deterministic algorithm (CRUSH)       Distributed hash table, stored in xattr
Replication      Server side                           Client side
Primary usage    Object / block storage                POSIX-like file system
Challenge        POSIX file system needs improvement   Object / block storage needs improvement
Further readings
Ceph Documents
https://ceph.com/docs/master/
Well documented.
Sébastien Han
http://www.sebastien-han.fr/blog/
An awesome blog.
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
http://ceph.com/papers/weil-crush-sc06.pdf
CRUSH algorithm paper
Ceph: A Scalable, High-Performance Distributed File System
http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
Ceph paper
Index of "Ceph の覚え書き" (Ceph notes)
http://www.nminoru.jp/~nminoru/unix/ceph/
Well-written introduction in Japanese
One more thing
Calamari will be open sourced
“Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced.”
http://ceph.com/community/red-hat-to-acquire-inktank/
Calamari screens
Thank you!