Petabyte Scale Out Rocks – CEPH as replacement for Openstack's Swift & Co.
Udo Seidel

DESCRIPTION

Presentation about how CEPH can be used as storage infrastructure for Openstack. Part of ADP's internal presentation initiative.


Page 1: adp.ceph.openstack.talk

Petabyte Scale Out Rocks

CEPH as replacement for Openstack's Swift & Co.

Udo Seidel

Page 2: adp.ceph.openstack.talk

Agenda

● Background
● Architecture of CEPH
● CEPH Storage
● Openstack
● Dream team: CEPH and Openstack
● Summary

Page 3: adp.ceph.openstack.talk

Background

Page 4: adp.ceph.openstack.talk

CEPH – what?

● So-called parallel distributed cluster file system
● Started as part of PhD studies at UCSC
● Public announcement in 2006 at 7th OSDI
● File system shipped with Linux kernel since 2.6.34
● Name derived from pet octopus - Cephalopods

Page 5: adp.ceph.openstack.talk

Shared file systems – short intro

● Multiple servers access the same data
● Different approaches
  ● Network based, e.g. NFS, CIFS
  ● Clustered
    – Shared disk, e.g. CXFS, CFS, GFS(2), OCFS2
    – Distributed parallel, e.g. Lustre, GlusterFS and CEPHFS

Page 6: adp.ceph.openstack.talk

Architecture of CEPH

Page 7: adp.ceph.openstack.talk

CEPH and storage

● Distributed file system => distributed storage
● Does not use traditional disks or RAID arrays
● Does use so-called OSDs
  – Object based Storage Devices
  – Intelligent disks

Page 8: adp.ceph.openstack.talk

Object Based Storage I

● Objects of quite general nature
  ● Files
  ● Partitions
● ID for each storage object
● Separation of metadata operations and file data storage
● HA not covered at all
● Object based Storage Devices

Page 9: adp.ceph.openstack.talk

Object Based Storage II

● OSD software implementation
  ● Usually an additional layer between computer and storage
  ● Presents an object-based file system to the computer
  ● Uses a “normal” file system to store data on the storage
  ● Delivered as part of CEPH
● File systems: LUSTRE, EXOFS

Page 10: adp.ceph.openstack.talk

CEPH – the full architecture I

● 4 components
● Object based Storage Devices
  – Any computer
  – Form a cluster (redundancy and load balancing)
● Meta Data Servers
  – Any computer
  – Form a cluster (redundancy and load balancing)
● Cluster Monitors
  – Any computer
● Clients ;-)

Page 11: adp.ceph.openstack.talk

CEPH – the full architecture II

Page 12: adp.ceph.openstack.talk

CEPH and Storage

Page 13: adp.ceph.openstack.talk

CEPH and OSD

● User land implementation
● Any computer can act as OSD
● Uses BTRFS as native file system
  ● Since 2009
  ● Before: self-developed EBOFS
  ● Provides functions of OSD-2 standard
    – Copy-on-write
    – Snapshots
● No redundancy on disk or even computer level

Page 14: adp.ceph.openstack.talk

CEPH and OSD – file systems

● BTRFS preferred
  ● Non-default configuration for mkfs
● XFS and EXT4 possible
  ● XATTR (size) is key -> EXT4 less recommended
  ● XFS recommended for Production

Page 15: adp.ceph.openstack.talk

OSD failure approach

● Any OSD expected to fail
● New OSD dynamically added/integrated
● Data distributed and replicated
● Redistribution of data after change in OSD landscape

Page 16: adp.ceph.openstack.talk

Data distribution

● File striped
● File pieces mapped to object IDs
● Assignment of so-called placement group to object ID
  ● Via hash function
  ● Placement group (PG): logical container of storage objects
● Calculation of list of OSDs out of PG
  ● CRUSH algorithm
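
To make the mapping concrete, here is a minimal Python sketch under stated assumptions: md5 stands in for Ceph's actual hash and pg_to_osds() is a toy stand-in for CRUSH, so the code is illustrative only.

    import hashlib
    import random

    def object_to_pg(object_id, pg_num):
        # Object ID -> placement group via a hash function
        # (Ceph uses the rjenkins hash; md5 is only for illustration).
        digest = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
        return digest % pg_num

    def pg_to_osds(pg, osds, replicas=3):
        # Toy stand-in for CRUSH: a deterministic, pseudo-random choice of
        # 'replicas' distinct OSDs seeded by the PG number.  Real CRUSH also
        # honours the cluster map hierarchy and the placement rules.
        rng = random.Random(pg)
        return rng.sample(sorted(osds), replicas)

    # Any client can compute the location itself - no central lookup needed.
    osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4', 'osd.5']
    pg = object_to_pg('file.part.0001', pg_num=128)
    print(pg, pg_to_osds(pg, osds))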

Page 17: adp.ceph.openstack.talk

CRUSH I

● Controlled Replication Under Scalable Hashing
● Considers several pieces of information
  ● Cluster setup/design
  ● Actual cluster landscape/map
  ● Placement rules
● Pseudo-random -> quasi-statistical distribution
● Cannot cope with hot spots
● Clients, MDS and OSDs can calculate object location

Page 18: adp.ceph.openstack.talk

CRUSH II

Page 19: adp.ceph.openstack.talk

Data replication

● N-way replication
  ● N OSDs per placement group
  ● OSDs in different failure domains
  ● First non-failed OSD in PG -> primary
● Read and write to primary only
● Writes forwarded by primary to replica OSDs
● Final write commit after all writes on replica OSDs
● Replication traffic within OSD network
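
A conceptual sketch of that primary-copy write path (plain Python, not Ceph code; the class and method names are made up for illustration):

    class ReplicaOSD:
        def __init__(self, name):
            self.name = name
            self.store = {}

        def replicate(self, oid, data):
            # Replica stores the forwarded write and acknowledges it.
            self.store[oid] = data
            return True

    class PrimaryOSD(ReplicaOSD):
        def __init__(self, name, replicas):
            super().__init__(name)
            self.replicas = replicas  # replica OSDs of the same PG

        def write(self, oid, data):
            # Clients read from and write to the primary only.
            self.store[oid] = data
            # The primary forwards the write over the OSD network ...
            acks = [r.replicate(oid, data) for r in self.replicas]
            # ... and commits to the client only after all replicas acked.
            return all(acks)

    pg_primary = PrimaryOSD('osd.3', [ReplicaOSD('osd.7'), ReplicaOSD('osd.12')])
    assert pg_primary.write('obj-1', b'payload')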

Page 20: adp.ceph.openstack.talk

CEPH cluster monitors

● Status information of CEPH components is critical
● First contact point for new clients
● Monitors track changes of the cluster landscape
  ● Update cluster map
  ● Propagate information to OSDs

Page 21: adp.ceph.openstack.talk

CEPH cluster map I

● Objects: computers and containers
● Container: bucket for computers or containers
● Each object has ID and weight
● Maps physical conditions
  ● Rack location
  ● Fire cells

Page 22: adp.ceph.openstack.talk

CEPH cluster map II

● Reflects data rules
  ● Number of copies
  ● Placement of copies
● Updated version sent to OSDs
● OSDs distribute cluster map within OSD cluster
● OSD re-calculates PG membership via CRUSH
  – Data responsibilities
  – Order: primary or replica
● New I/O accepted after information sync

Page 23: adp.ceph.openstack.talk

CEPH - RADOS

● Reliable Autonomic Distributed Object Storage
● Direct access to OSD cluster via librados
  ● Support: C, C++, Java, Python, Ruby, PHP
● Drop/skip of POSIX layer (CEPHFS) on top
● Visible to all CEPH cluster members => shared storage
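
A short example with the librados Python binding; the pool name 'data' and the ceph.conf path are assumptions:

    import rados

    # Connect using the local cluster configuration (path is an assumption).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a pool; 'data' is a placeholder pool name.
    ioctx = cluster.open_ioctx('data')
    ioctx.write_full('hello-object', b'Hello RADOS')   # store an object
    print(ioctx.read('hello-object'))                  # read it back
    ioctx.close()
    cluster.shutdown()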

Page 24: adp.ceph.openstack.talk

CEPH Block Device

● Aka RADOS block device (RBD)
● Upstream since kernel 2.6.37
● RADOS storage exposed as block device
  ● /dev/rbd
  ● qemu/KVM storage driver via librados/librbd
● Alternative to
  ● Shared SAN/iSCSI for HA environments
  ● Storage HA solutions for qemu/KVM
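
A small sketch with the Python rbd binding; the pool name 'rbd' and the image name 'vm-disk' are placeholders:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')            # placeholder pool name

    # Create a 10 GiB image; qemu/KVM attaches such images via librbd.
    rbd.RBD().create(ioctx, 'vm-disk', 10 * 1024 ** 3)

    image = rbd.Image(ioctx, 'vm-disk')
    image.write(b'bootstrapping...', 0)          # block-style I/O at offset 0
    image.close()
    ioctx.close()
    cluster.shutdown()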

Page 25: adp.ceph.openstack.talk

The RADOS picture

Page 26: adp.ceph.openstack.talk

CEPH Object Gateway

● Aka RADOS Gateway (RGW)
● RESTful API
  ● Amazon S3 -> s3 tools work
  ● Swift APIs
● Proxy HTTP to RADOS
● Tested with Apache and lighttpd
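
Because RGW exposes the S3 dialect, standard S3 clients work against it. A hedged example with the boto library; endpoint, credentials and bucket name are placeholders:

    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='RGW_ACCESS_KEY',        # placeholder credentials
        aws_secret_access_key='RGW_SECRET_KEY',
        host='rgw.example.com',                    # placeholder RGW endpoint
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('Hello RGW')
    print(key.get_contents_as_string())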

Page 27: adp.ceph.openstack.talk

CEPH Take Aways

● Scalable
● Flexible configuration
● No SPOF, but built on commodity hardware
● Accessible via different interfaces

Page 28: adp.ceph.openstack.talk

Openstack

Page 29: adp.ceph.openstack.talk

What?

● Infrastructure as a Service (IaaS)
● 'Open source' version of AWS
● New versions every 6 months
  ● Current called Havana
  ● Previous called Grizzly
● Managed by Openstack Foundation

Page 30: adp.ceph.openstack.talk

Openstack architecture

Page 31: adp.ceph.openstack.talk

Openstack Components

● Keystone – identity
● Glance – image
● Nova – compute
● Cinder – block
● Swift – object
● Neutron – network
● Horizon – dashboard

Page 32: adp.ceph.openstack.talk

About Glance

● There since almost the beginning
● Image store
  ● Server
  ● Disk
● Several formats
  ● raw
  ● qcow2
  ● ...
● Shared/non-shared

Page 33: adp.ceph.openstack.talk

About Cinder

● Kind of new
  ● Part of Nova before
  ● Separate since Folsom
● Block storage
  ● For compute instances
  ● Managed by Openstack users
● Several storage back-ends possible

Page 34: adp.ceph.openstack.talk

About Swift

● Since the beginning
● Replaces Amazon S3 -> cloud storage
  ● Scalable
  ● Redundant
● Object store

Page 35: adp.ceph.openstack.talk

Openstack Object Store

● Proxy
● Object
● Container
● Account
● Auth

Page 36: adp.ceph.openstack.talk

Dream Team: CEPH and Openstack

Page 37: adp.ceph.openstack.talk

Why CEPH in the first place?

● Full blown storage solution
  ● Support
  ● Operational model
● Cloud'ish
  ● Separation of duties
  ● One solution for different storage needs

Page 38: adp.ceph.openstack.talk

Integration

● Full integration with RADOS
  ● Since Folsom
● Two parts
  ● Authentication
  ● Technical access
● Both parties must be aware
● Independent for each of the storage components
  ● Different RADOS pools

Page 39: adp.ceph.openstack.talk

Authentication

● CEPH part
  ● Key rings
  ● Configuration
  ● For Cinder and Glance
● Openstack part
  ● Keystone
    – Only for Swift
    – Needs RGW

Page 40: adp.ceph.openstack.talk

Access to RADOS/RBD I

● Via API/libraries => no CEPHFS needed
● Easy for Glance/Cinder
  ● CEPH keyring configuration
  ● Update of CEPH.conf
  ● Update of Glance/Cinder API configuration
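
For illustration, connecting to RADOS as a dedicated cephx user via the Python binding; the client name 'cinder', the keyring path and the pool name 'volumes' are assumptions, not prescribed values:

    import rados

    # Authenticate as client.cinder instead of client.admin;
    # names and paths below are placeholders.
    cluster = rados.Rados(
        conffile='/etc/ceph/ceph.conf',
        rados_id='cinder',
        conf={'keyring': '/etc/ceph/ceph.client.cinder.keyring'},
    )
    cluster.connect()

    # Each Openstack storage component would use its own RADOS pool.
    ioctx = cluster.open_ioctx('volumes')
    print(cluster.get_fsid(), ioctx.get_stats())
    ioctx.close()
    cluster.shutdown()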

Page 41: adp.ceph.openstack.talk

Access to RADOS/RBD II

● Swift => more work needed
● CEPHFS => again: not needed
● CEPH Object Gateway
  – Web server
  – RGW software
  – Keystone certificates
  – No support for v2 Swift authentication
● Keystone authentication
  – Endpoint configuration -> RGW

Page 42: adp.ceph.openstack.talk

Integration – the full picture

[Diagram: Swift, Keystone, Glance, Cinder and Nova access RADOS through the CEPH Object Gateway, the CEPH Block Device and qemu/KVM]

Page 43: adp.ceph.openstack.talk

Integration pitfalls

● CEPH versions not in sync
● Authentication
● CEPH Object Gateway setup

Page 44: adp.ceph.openstack.talk

Why CEPH - reviewed

● Previous arguments still valid :-)
● Modular usage
● High integration
● Built-in HA
● No need for POSIX compatible interface
● Works even with other IaaS implementations

Page 45: adp.ceph.openstack.talk

Summary

Page 46: adp.ceph.openstack.talk

Take Aways

● Sophisticated storage engine
● Mature OSS distributed storage solution
  ● Other use cases
  ● Can be used elsewhere
● Full integration in Openstack

Page 47: adp.ceph.openstack.talk

References

● http://ceph.com
● http://www.openstack.org

Page 48: adp.ceph.openstack.talk

Thank you!

Page 49: adp.ceph.openstack.talk

Petabyte Scale Out Rocks

CEPH as replacement for Openstack's Swift & Co.

Udo Seidel