Ceph: a distributed storage system
whoami
● Italo Santos
● @ Locaweb since 2007
● Sysadmin @ Storage Team
Introduction
● Single Storage System
● Scalable
● Reliable
● Self-healing
● Fault Tolerant
● NO single point of failure
Architecture
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
(Diagram: apps access LIBRADOS and RADOSGW, hosts/VMs use RBD, and clients mount CEPH FS)
Ceph Storage Cluster
(Daemons: OSDs, Monitors, and MDS)
OSDs
● One per disk
● Store data
● Replication
● Recovery
● Backfilling
● Rebalancing
● Heartbeat with peer OSDs
(Diagram: each OSD runs on top of a filesystem (btrfs, xfs, or ext4) on its disk)
Ceph Monitors
Monitors
● Maintain the cluster map:
○ Monitor map
○ OSD map
○ Placement Group map
○ CRUSH map
Metadata Server (MDS)
MDS
● Used only by CephFS
● POSIX-compliant shared filesystem
● Manage metadata
○ Directory hierarchy
○ File metadata
● Stores metadata on RADOS
CRUSH
CRUSH
● Pseudo-random placement algorithm
○ Fast calculation
○ Deterministic
● Statistically uniform distribution
● Limited data migration on change
● Rule-based configuration
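The "limited data migration" property can be illustrated with rendezvous (highest-random-weight) hashing, a relative of CRUSH but not CRUSH itself; the OSD names and PG counts below are made up for the sketch:

```python
import hashlib

# Illustrative only: rendezvous hashing shows why a deterministic
# pseudo-random placement function limits data movement when the
# cluster changes. Existing OSDs keep their scores, so only PGs the
# new OSD "wins" get remapped.
def hrw_choose(pg, osds, replicas=3):
    """Deterministically pick `replicas` OSDs for a PG: each OSD gets
    a hash-based score for this PG, and the top scorers win."""
    def score(osd):
        return int(hashlib.md5(f"{pg}:{osd}".encode()).hexdigest(), 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

before = {pg: hrw_choose(pg, [f"osd.{i}" for i in range(5)]) for pg in range(128)}
after  = {pg: hrw_choose(pg, [f"osd.{i}" for i in range(6)]) for pg in range(128)}
moved = sum(before[pg] != after[pg] for pg in range(128))
print(f"{moved} of 128 PGs remapped after adding a sixth OSD")
```

A naive `hash % num_osds` scheme would remap nearly every PG on the same change; here only the PGs that gain the new OSD move.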
(Placement diagram: the client hashes an object to a PG with hash(object name) % num_pg, then CRUSH(pg, cluster state, rule set) maps the PG to OSDs)
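The two-step mapping above (object → PG via hashing, PG → OSDs via CRUSH) can be sketched in Python; everything here is illustrative, not real Ceph code, and the second step is only a deterministic stand-in for CRUSH:

```python
import hashlib

# Hypothetical cluster parameters for the sketch.
NUM_PGS = 128
OSDS = [f"osd.{i}" for i in range(6)]
REPLICAS = 3

def object_to_pg(name):
    """Step 1: pg = hash(object name) % num_pgs."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_PGS

def pg_to_osds(pg):
    """Step 2: stand-in for CRUSH(pg, cluster state, rule set) --
    deterministically derive REPLICAS distinct OSDs from the PG id."""
    chosen, seed = [], str(pg)
    while len(chosen) < REPLICAS:
        seed = hashlib.md5(seed.encode()).hexdigest()
        osd = OSDS[int(seed, 16) % len(OSDS)]
        if osd not in chosen:
            chosen.append(osd)
    return chosen

pg = object_to_pg("my-object")
print(pg, pg_to_osds(pg))  # identical inputs always yield identical placement
```

Because both steps are pure functions of the object name and cluster state, any client can compute an object's location without asking a central lookup service, which is how Ceph avoids a single point of failure on the data path.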
Placement Groups (PGs)
Placement Groups
● Logical collections of objects
● Mapped to OSDs dynamically
● Computationally less expensive
○ Fewer processes to track
○ Less per-object metadata
● Rebalance dynamically
Placement Groups
Placement Groups
● Increasing the PG count reduces per-OSD load variance
● Aim for ~100 PGs per OSD
(i.e., OSDs per object = number of replicas)
● Defined at pool creation
● With multiple pools:
○ Balance PGs per pool against PGs per OSD
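The ~100-PGs-per-OSD guideline is often turned into a quick calculation; this is a sketch of that rule of thumb, not an official Ceph sizing tool:

```python
# Rule of thumb: target ~100 PGs per OSD, divide by the replica
# count (each PG occupies one OSD per replica), and round up to the
# next power of two.
def suggested_pg_num(num_osds, replicas=3, pgs_per_osd=100):
    target = num_osds * pgs_per_osd / replicas
    pg_num = 1
    while pg_num < target:
        pg_num *= 2
    return pg_num

print(suggested_pg_num(9))  # 9 OSDs, 3 replicas -> 512
```

With multiple pools sharing the same OSDs, the per-pool pg_num should be reduced so the per-OSD total stays near the target.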
Pools
Pools
● Replicated
○ Each object replicated N times (default size = 3)
○ i.e., the object plus 2 protection replicas
● Erasure coded
○ Objects stored as K+M chunks (size = K+M)
○ K data chunks and M coding chunks
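The space trade-off between the two pool types is simple arithmetic; the K and M values below are illustrative:

```python
# Raw bytes written per logical byte of data, for each pool type.
def replicated_ratio(size=3):
    """A replicated pool stores `size` full copies of every object."""
    return float(size)

def erasure_ratio(k, m):
    """A K+M erasure-coded pool stores K data chunks plus M coding chunks."""
    return (k + m) / k

print(replicated_ratio(3))  # 3.0 -> tolerates 2 lost copies
print(erasure_ratio(4, 2))  # 1.5 -> tolerates 2 lost chunks
```

Both configurations survive two failures, but erasure coding halves the raw-space cost at the price of extra computation on reads and writes.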
Ceph Clients
RadosGW
Ceph Object Gateway
RadosGW
● Object storage interface
● Apache + FastCGI
● S3-compatible API
● Swift-compatible API
● Common namespace across both APIs
● Stores data on the Ceph cluster
RBD
Rados Block Device
RBD
● Block device interface
● Data striped across the Ceph cluster
● Thin-provisioned
● Snapshot support
● Linux kernel client, plus a userspace library (librbd)
● Cloud platform support (QEMU/KVM)
CephFS
Ceph File System
CephFS
● POSIX-compliant filesystem
● Shared filesystem
● Directory hierarchy
● File metadata (owner, timestamps, mode, etc.)
● Requires a Ceph MDS
● NOT production ready!
Thanks
Italo Santos @ Storage Team