
SUSE Enterprise Storage

Your First Ceph Cluster

Michal Jura, Senior Software Engineer

Linux HA/Cloud Developer

mjura@suse.com


Agenda

● Ceph overview through an engineer's eyes
● Hardware planning
● Deploy Ceph
● Deploy SUSE OpenStack Cloud and SUSE Storage

Brief intro to SUSE Storage / Ceph


SUSE Storage

● SUSE Storage is based upon Ceph
● SUSE Storage 1.0 has already been released
  ‒ Based upon the Ceph Firefly v0.80 release
  ‒ The latest upstream releases are Ceph Giant v0.87 and Ceph Hammer v0.94


SUSE Storage architectural benefits

● Exabyte scalability
● No bottlenecks or single points of failure
● Industry-leading functionality
  ‒ Remote replication, erasure coding
  ‒ Cache tiering
  ‒ Unified block, file and object interface
  ‒ Thin provisioning, copy on write
● 100% software based; can use commodity hardware
● Automated management
  ‒ Self-managing, self-healing


Expected use cases

● Scalable cloud storage
  ‒ Provide block storage for the cloud
  ‒ Allowing host migration
● Cheap archival storage
  ‒ Using erasure coding (like RAID 5/6)
● Scalable object store
  ‒ This is what Ceph is built upon


More exciting things about Ceph

● Tunable for multiple use cases:
  ‒ for performance
  ‒ for price
  ‒ for recovery
● Configurable redundancy:
  ‒ at the disk level
  ‒ at the host level
  ‒ at the rack level
  ‒ at the room level
  ‒ ...


Let's do it again

● Only two main components:
  ‒ MON (Cluster Monitor Daemon) for cluster state
  ‒ OSD (Object Storage Daemon) for storing data
● Hash-based data distribution (CRUSH)
  ‒ (Usually) no need to ask where data is
  ‒ Simplifies data balancing
● Ceph clients communicate with OSDs directly
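Because clients compute placement with CRUSH and talk to OSDs directly, a thin library client is all that is needed to store an object. A minimal sketch using the python-rados bindings, assuming a reachable cluster with a client.admin keyring and an existing pool named "rbd" (both assumptions, not part of the slides):

# Minimal librados sketch: the client gets the cluster map from a MON,
# computes object placement with CRUSH and talks to the primary OSD directly.
# Assumes /etc/ceph/ceph.conf, a client.admin keyring and an existing pool "rbd".
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()                          # contacts a MON, fetches the cluster map
try:
    ioctx = cluster.open_ioctx('rbd')      # I/O context bound to one pool
    try:
        ioctx.write_full('hello-object', b'Hello from your first Ceph cluster')
        print(ioctx.read('hello-object'))  # the read also goes straight to the OSD
    finally:
        ioctx.close()
finally:
    cluster.shutdown()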

Hardware planning


Questions to answer

• How much net storage, in what tiers?

• How many IOPS?
  ‒ Aggregated

‒ Per VM (average)

‒ Per VM (peak)

• What to optimize for?
  ‒ Cost

‒ Performance


Network

• Choose the fastest network you can afford

• Switches should be low latency with fully meshed backplane

• Separate public and cluster network

• Cluster network should typically be twice the public bandwidth
  ‒ Incoming writes are replicated over the cluster network
  ‒ Re-balancing and re-mirroring also utilize the cluster network
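The public/cluster split is configured in ceph.conf. A sketch with illustrative subnets (the addresses are placeholders, not part of the original slides):

[global]
# client and MON traffic
public network = 10.0.0.0/24
# replication, recovery and re-balancing traffic
cluster network = 10.0.1.0/24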


Networking (public and private)

• Ethernet (1, 10, 40 GbE)
  ‒ Reasonably inexpensive (except for 40 GbE)

‒ Can easily be bonded for availability

‒ Use jumbo frames

• Infiniband
  ‒ High bandwidth

‒ Low latency

‒ Typically more expensive

‒ No support for RDMA yet in Ceph, need to use IPoIB


Storage node

• CPU
  ‒ Number and speed of cores
• Memory
• Storage controller
  ‒ Bandwidth, performance, cache size
• SSDs for OSD journal
  ‒ SSD-to-HDD ratio
• HDDs
  ‒ Count, capacity, performance


Best practices and magic numbers

• Ceph-OSD sizing
  ‒ Disks
    ‒ 8-10 SAS HDDs per 1 x 10 GbE NIC
    ‒ ~12 SATA HDDs per 1 x 10 GbE NIC
    ‒ 1 x SSD for the write journal per 4-6 OSD drives
  ‒ RAM
    ‒ 1 GB of RAM per 1 TB of OSD storage space
  ‒ CPU
    ‒ 0.5 CPU cores / 1 GHz of a core per OSD disk (1-2 CPU cores for SSD drives)
• Ceph-MON sizing
  ‒ 1 ceph-mon node per 15-20 OSD nodes (minimum 3 per cluster)
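A back-of-the-envelope sizing sketch in Python that applies the rules of thumb above; the input values are illustrative, not a recommendation:

# Apply the per-node magic numbers: 1 GB RAM per TB of OSD space,
# ~0.5 core per OSD disk, one journal SSD per 4-6 OSDs (5 used as the midpoint).
def size_osd_node(osd_count, osd_capacity_tb, osds_per_journal_ssd=5):
    ram_gb = osd_count * osd_capacity_tb * 1.0
    cpu_cores = osd_count * 0.5
    journal_ssds = -(-osd_count // osds_per_journal_ssd)   # ceiling division
    return ram_gb, cpu_cores, journal_ssds

# 1 ceph-mon node per 15-20 OSD nodes (17 used as the midpoint), never fewer than 3.
def mon_nodes(osd_node_count, osd_nodes_per_mon=17):
    return max(3, -(-osd_node_count // osd_nodes_per_mon))

print(size_osd_node(osd_count=20, osd_capacity_tb=1.8))  # (36.0, 10.0, 4)
print(mon_nodes(98))                                     # 6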


IOPS calculation: more magic numbers

• For a typical “unpredictable” workload we usually assume:
  ‒ 70/30 read/write IOPS proportion
  ‒ ~4-8 KB random read pattern
• To estimate Ceph IOPS efficiency we usually take:
  ‒ 4-8K random read – 0.88
  ‒ 4-8K random write – 0.64

Based on benchmark data and semi-empirical evidence


Performance optimized Ceph

• Ceph-MON
  ‒ CPU: 6 cores
  ‒ RAM: 65 GB
  ‒ HDD: 1 x 1 TB SATA
  ‒ NIC: 1 x 1 Gb/s, 1 x 10 Gb/s
• Ceph-OSD
  ‒ CPU: 2 x 6 cores
  ‒ RAM: 64 GB
  ‒ HDD (OS drives): 2 x SATA 500 GB
  ‒ HDD (OSD drives): 20 x SAS 1.8 TB (10k RPM, 2.5 inch)
  ‒ SSD (write journals): 4 x SSD 128 GB
  ‒ NIC: 1 x 1 Gb/s, 4 x 10 Gb/s (2 bonds – 1 for ceph-public and 1 for ceph-replication)
• Computing the size of the Ceph cluster
  ‒ 1000 TB / (20 * 1.8 TB * 0.85) * 3 ≈ 98 servers to serve 1 petabyte net
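The 98-server figure can be re-derived directly from those numbers; a quick check in Python:

# 1 PB net / (usable TB per node) * 3-way replication
net_capacity_tb = 1000                # 1 petabyte net
per_node_tb = 20 * 1.8 * 0.85         # 20 OSDs x 1.8 TB at an 85% fill level, ~30.6 TB
replication = 3

print(round(net_capacity_tb / per_node_tb * replication))   # -> 98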


Expected performance level

• 500 VMs in the data center
• Conservative drive rating: 250 IOPS
• Cluster read rating:
  ‒ 250 * 20 * 98 * 0.88 = 431,200 IOPS
  ‒ Approx. 800 IOPS of capacity per VM is available
• Cluster write rating:
  ‒ 250 * 20 * 98 * 0.64 = 313,600 IOPS
  ‒ Approx. 600 IOPS of capacity per VM is available
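These ratings are simply the drive rating multiplied by drives per node, node count and the efficiency factors from the previous slide (0.88 read, 0.64 write); a quick check in Python:

drive_iops = 250          # conservative per-HDD rating
drives_per_node = 20
nodes = 98
vms = 500

read_iops = drive_iops * drives_per_node * nodes * 0.88    # 431,200
write_iops = drive_iops * drives_per_node * nodes * 0.64   # 313,600
print(read_iops / vms, write_iops / vms)                   # ~862 and ~627 IOPS per VM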


Adding more nodes

• Capacity increases

• Total throughput increases

• IOPS increase

• Redundancy increases

• Latency unchanged

• Eventually: network topology limitations

• Temporary impact during re-balancing


Adding more disks to a node

• Capacity increases

• Redundancy increases

• Throughput might increase

• IOPS might increase

• Internal node bandwidth is consumed

• Higher CPU and memory load

• Cache contention

• Latency unchanged


OSD file system

• btrfs
  ‒ Typically better write throughput performance

‒ Higher CPU utilization

‒ Feature rich

‒ Compression, checksums, copy on write

‒ The choice for the future!

• XFS
  ‒ Good all-around choice

‒ Very mature for data partitions

‒ Typically lower CPU utilization

‒ The choice for today!


Impact of caches

• Cache on the client side
  ‒ Typically, biggest impact on performance
  ‒ Does not help with write performance
• Server OS cache
  ‒ Low impact: reads have already been cached on the client
  ‒ Still, helps with readahead
• Caching controller, battery backed:
  ‒ Significant impact for writes


Impact of SSD journals

• SSD journals accelerate bursts and random write IO

• For sustained writes that overflow the journal, performance degrades to HDD levels

• SSDs help very little with read performance

• SSDs are very costly
  ‒ ... and consume storage slots -> lower density

• A large battery-backed cache on the storage controller is highly recommended if not using SSD journals


Hard disk parameters

• Capacity matters
  ‒ Often, the highest density is not the most cost-effective
• On-disk cache matters less
• Reliability advantage of enterprise drives is typically marginal compared to the cost
  ‒ Buy more drives instead

• RPM:
  ‒ Increases IOPS & throughput
  ‒ Increases power consumption
  ‒ 15k drives are still quite expensive


Impact of redundancy choices

• Replication:
  ‒ n exact, full-size copies
  ‒ Potentially increased read performance due to striping
  ‒ Increased cluster network utilization for writes
  ‒ Rebuilds can leverage multiple sources
  ‒ Significant capacity impact
• Erasure coding:
  ‒ Data split into k parts plus m redundancy codes
  ‒ Better space efficiency
  ‒ Higher CPU overhead
  ‒ Significant CPU and cluster network impact, especially during rebuild
  ‒ Cannot be used directly with block devices (see next slide)
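The space-efficiency difference is easy to quantify: replication stores n full copies, while erasure coding stores (k + m) / k times the data. A small comparison in Python, using a 3-replica pool and an illustrative 4+2 erasure-code profile:

def replication_raw(net_tb, n=3):
    return net_tb * n                  # n full copies

def erasure_raw(net_tb, k=4, m=2):
    return net_tb * (k + m) / k        # k data parts + m redundancy parts

print(replication_raw(1000))           # 3000 TB raw for 1 PB net with 3 replicas
print(erasure_raw(1000, k=4, m=2))     # 1500.0 TB raw for 1 PB net with a 4+2 profile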


Cache tiering

• Multi-tier storage architecture:
  ‒ Pool acts as a transparent write-back overlay for another

‒ e.g., SSD 3-way replication over HDDs with erasure coding

‒ Can flush either on relative or absolute dirty levels, or age

‒ Additional configuration complexity and requires workload-specific tuning

‒ Also available: read-only mode (no write acceleration)

‒ Some downsides (no snapshots)

• A good way to combine the advantages of replication and erasure coding


Federated gateways


SUSE Enterprise Storage 2.0

• The new and faster civetweb integration for RADOS Gateway

• iSCSI Ceph Gateway

• Network-based key server for encryption of OSD disks and journals

• This release is based on the stable Ceph 0.94 Hammer release

Deploy SUSE Enterprise Storage


Deploying SUSE Enterprise Storage

• Ceph-deploy method
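As a rough illustration only, a minimal ceph-deploy session for a small Hammer-era cluster could look like the commands below; the host names and device paths are placeholders, and the exact OSD syntax depends on the ceph-deploy version:

# Run from an admin node with passwordless SSH to all cluster nodes.
ceph-deploy new mon1 mon2 mon3             # generate initial ceph.conf and monmap
ceph-deploy install mon1 mon2 mon3 osd1 osd2 osd3
ceph-deploy mon create-initial             # form the MON quorum, gather keys
ceph-deploy osd create osd1:sdb:/dev/sdc   # data disk : SSD journal (placeholders)
ceph-deploy osd create osd2:sdb:/dev/sdc
ceph-deploy osd create osd3:sdb:/dev/sdc
ceph-deploy admin mon1 osd1 osd2 osd3      # push ceph.conf and the admin keyring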


Deploying SUSE Enterprise Storage

• Crowbar provisioning and orchestration system


About Ceph layout

● Ceph needs 1 or more MON nodes
  ‒ In production, 3 nodes are the minimum
● Ceph needs 3 or more OSD nodes
  ‒ Can be fewer in testing
● Each OSD should manage a minimum of 15 GB
  ‒ Smaller is possible

Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany

+49 911 740 53 0 (Worldwide)
www.suse.com

Join us on:
www.opensuse.org


Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
