
SUSE Enterprise Storage

Your First Ceph Cluster

Michal Jura, Senior Software Engineer

Linux HA/Cloud Developer

mjura@suse.com


Agenda

● Ceph overview through an engineer's eyes
● Hardware planning
● Deploy Ceph
● Deploy SUSE OpenStack Cloud and SUSE Storage

Brief intro to SUSE Storage / Ceph


SUSE Storage

● SUSE Storage is based upon Ceph
● SUSE Storage 1.0 has already been released
  ‒ Based upon the Ceph Firefly v0.80 release
  ‒ The latest upstream releases are Ceph Giant v0.87 and Ceph Hammer v0.94


SUSE Storage architectural benefits

● Exabyte scalability
● No bottlenecks or single points of failure
● Industry-leading functionality
  ‒ Remote replication, erasure coding
  ‒ Cache tiering
  ‒ Unified block, file and object interface
  ‒ Thin provisioning, copy on write
● 100% software based; can use commodity hardware
● Automated management
  ‒ Self-managing, self-healing


Expected use cases

● Scalable cloud storage
  ‒ Provide block storage for the cloud
  ‒ Allowing host migration
● Cheap archival storage
  ‒ Using erasure coding (like RAID 5/6)
● Scalable object store
  ‒ This is what Ceph is built upon


More exciting things about Ceph

● Tunable for multiple use cases:
  ‒ for performance
  ‒ for price
  ‒ for recovery
● Configurable redundancy:
  ‒ at the disk level
  ‒ at the host level
  ‒ at the rack level
  ‒ at the room level
  ‒ ...


Let's do it again

● Only two main components:
  ‒ MON (Cluster Monitor Daemon) for cluster state
  ‒ OSD (Object Storage Daemon) for storing data
● Hash-based data distribution (CRUSH)
  ‒ (Usually) no need to ask where data is
  ‒ Simplifies data balancing
● Ceph clients communicate with OSDs directly
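Because clients compute placement with CRUSH and talk to OSDs directly, a thin library client is all that is needed to store an object. A minimal sketch using the python-rados bindings, assuming a reachable cluster with a client.admin keyring and an existing pool named "rbd" (both assumptions, not part of the slides):

# Minimal librados sketch: the client gets the cluster map from a MON,
# computes object placement with CRUSH and talks to the primary OSD directly.
# Assumes /etc/ceph/ceph.conf, a client.admin keyring and an existing pool "rbd".
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()                          # contacts a MON, fetches the cluster map
try:
    ioctx = cluster.open_ioctx('rbd')      # I/O context bound to one pool
    try:
        ioctx.write_full('hello-object', b'Hello from your first Ceph cluster')
        print(ioctx.read('hello-object'))  # the read also goes straight to the OSD
    finally:
        ioctx.close()
finally:
    cluster.shutdown()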

Hardware planning


Questions to answer

• How much net storage, in what tiers?

• How many IOPS?
  ‒ Aggregated

‒ Per VM (average)

‒ Per VM (peak)

• What to optimize for?
  ‒ Cost

‒ Performance


Network

• Choose the fastest network you can afford

• Switches should be low latency with fully meshed backplane

• Separate public and cluster network

• Cluster network should typically be twice the public bandwidth
  ‒ Incoming writes are replicated over the cluster network
  ‒ Re-balancing and re-mirroring also utilize the cluster network
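The public/cluster split is configured in ceph.conf. A sketch with illustrative subnets (the addresses are placeholders, not part of the original slides):

[global]
# client and MON traffic
public network = 10.0.0.0/24
# replication, recovery and re-balancing traffic
cluster network = 10.0.1.0/24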


Networking (public and private)

• Ethernet (1, 10, 40 GbE)
  ‒ Reasonably inexpensive (except for 40 GbE)

‒ Can easily be bonded for availability

‒ Use jumbo frames

• Infiniband
  ‒ High bandwidth

‒ Low latency

‒ Typically more expensive

‒ No support for RDMA yet in Ceph, need to use IPoIB


Storage node

• CPU
  ‒ Number and speed of cores
• Memory
• Storage controller
  ‒ Bandwidth, performance, cache size
• SSDs for OSD journal
  ‒ SSD-to-HDD ratio
• HDDs
  ‒ Count, capacity, performance


Best practices and magic numbers

• Ceph-OSD sizing
  ‒ Disks
    ‒ 8-10 SAS HDDs per 1 x 10 GbE NIC
    ‒ ~12 SATA HDDs per 1 x 10 GbE NIC
    ‒ 1 x SSD for the write journal per 4-6 OSD drives
  ‒ RAM
    ‒ 1 GB of RAM per 1 TB of OSD storage space
  ‒ CPU
    ‒ 0.5 CPU cores / 1 GHz of a core per OSD disk (1-2 CPU cores for SSD drives)
• Ceph-MON sizing
  ‒ 1 ceph-mon node per 15-20 OSD nodes (minimum 3 per cluster)
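A back-of-the-envelope sizing sketch in Python that applies the rules of thumb above; the input values are illustrative, not a recommendation:

# Apply the per-node magic numbers: 1 GB RAM per TB of OSD space,
# ~0.5 core per OSD disk, one journal SSD per 4-6 OSDs (5 used as the midpoint).
def size_osd_node(osd_count, osd_capacity_tb, osds_per_journal_ssd=5):
    ram_gb = osd_count * osd_capacity_tb * 1.0
    cpu_cores = osd_count * 0.5
    journal_ssds = -(-osd_count // osds_per_journal_ssd)   # ceiling division
    return ram_gb, cpu_cores, journal_ssds

# 1 ceph-mon node per 15-20 OSD nodes (17 used as the midpoint), never fewer than 3.
def mon_nodes(osd_node_count, osd_nodes_per_mon=17):
    return max(3, -(-osd_node_count // osd_nodes_per_mon))

print(size_osd_node(osd_count=20, osd_capacity_tb=1.8))  # (36.0, 10.0, 4)
print(mon_nodes(98))                                     # 6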


IOPS calculation: more magic numbers

• For a typical “unpredictable” workload we usually assume:
  ‒ 70/30 read/write IOPS proportion
  ‒ ~4-8 KB random read pattern
• To estimate Ceph IOPS efficiency we usually take:
  ‒ 4-8K random read – 0.88
  ‒ 4-8K random write – 0.64

Based on benchmark data and semi-empirical evidence


Performance optimized Ceph

• Ceph-MON
  ‒ CPU: 6 cores
  ‒ RAM: 65 GB
  ‒ HDD: 1 x 1 TB SATA
  ‒ NIC: 1 x 1 Gb/s, 1 x 10 Gb/s
• Ceph-OSD
  ‒ CPU: 2 x 6 cores
  ‒ RAM: 64 GB
  ‒ HDD (OS drives): 2 x SATA 500 GB
  ‒ HDD (OSD drives): 20 x SAS 1.8 TB (10k RPM, 2.5 inch)
  ‒ SSD (write journals): 4 x SSD 128 GB
  ‒ NIC: 1 x 1 Gb/s, 4 x 10 Gb/s (2 bonds – 1 for ceph-public and 1 for ceph-replication)
• Computing the size of the Ceph cluster
  ‒ 1000 TB / (20 * 1.8 TB * 0.85) * 3 ≈ 98 servers to serve 1 petabyte net
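The 98-server figure can be re-derived directly from those numbers; a quick check in Python:

# 1 PB net / (usable TB per node) * 3-way replication
net_capacity_tb = 1000                # 1 petabyte net
per_node_tb = 20 * 1.8 * 0.85         # 20 OSDs x 1.8 TB at an 85% fill level, ~30.6 TB
replication = 3

print(round(net_capacity_tb / per_node_tb * replication))   # -> 98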


Expected performance level

• 500 VMs in the data center
• Conservative drive rating: 250 IOPS
• Cluster read rating:
  ‒ 250 * 20 * 98 * 0.88 = 431,200 IOPS
  ‒ Approx. 800 IOPS of capacity per VM is available
• Cluster write rating:
  ‒ 250 * 20 * 98 * 0.64 = 313,600 IOPS
  ‒ Approx. 600 IOPS of capacity per VM is available
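These ratings are simply the drive rating multiplied by drives per node, node count and the efficiency factors from the previous slide (0.88 read, 0.64 write); a quick check in Python:

drive_iops = 250          # conservative per-HDD rating
drives_per_node = 20
nodes = 98
vms = 500

read_iops = drive_iops * drives_per_node * nodes * 0.88    # 431,200
write_iops = drive_iops * drives_per_node * nodes * 0.64   # 313,600
print(read_iops / vms, write_iops / vms)                   # ~862 and ~627 IOPS per VM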


Adding more nodes

• Capacity increases

• Total throughput increases

• IOPS increase

• Redundancy increases

• Latency unchanged

• Eventually: network topology limitations

• Temporary impact during re-balancing


Adding more disks to a node

• Capacity increases

• Redundancy increases

• Throughput might increase

• IOPS might increase

• Internal node bandwidth is consumed

• Higher CPU and memory load

• Cache contention

• Latency unchanged


OSD file system

• btrfs
  ‒ Typically better write throughput performance

‒ Higher CPU utilization

‒ Feature rich

‒ Compression, checksums, copy on write

‒ The choice for the future!

• XFS
  ‒ Good all-around choice

‒ Very mature for data partitions

‒ Typically lower CPU utilization

‒ The choice for today!


Impact of caches

• Cache on the client side
  ‒ Typically, biggest impact on performance
  ‒ Does not help with write performance
• Server OS cache
  ‒ Low impact: reads have already been cached on the client
  ‒ Still, helps with readahead
• Caching controller, battery backed:
  ‒ Significant impact for writes


Impact of SSD journals

• SSD journals accelerate bursts and random write IO

• For sustained writes that overflow the journal, performance degrades to HDD levels

• SSDs help very little with read performance

• SSDs are very costly
  ‒ ... and consume storage slots -> lower density

• A large battery-backed cache on the storage controller is highly recommended if not using SSD journals


Hard disk parameters

• Capacity matters
  ‒ Often, the highest density is not the most cost-effective
• On-disk cache matters less
• Reliability advantage of enterprise drives is typically marginal compared to the cost
  ‒ Buy more drives instead

• RPM:
  ‒ Increases IOPS & throughput
  ‒ Increases power consumption
  ‒ 15k drives are still quite expensive


Impact of redundancy choices

• Replication:
  ‒ n exact, full-size copies
  ‒ Potentially increased read performance due to striping
  ‒ Increased cluster network utilization for writes
  ‒ Rebuilds can leverage multiple sources
  ‒ Significant capacity impact
• Erasure coding:
  ‒ Data split into k parts plus m redundancy codes
  ‒ Better space efficiency
  ‒ Higher CPU overhead
  ‒ Significant CPU and cluster network impact, especially during rebuild
  ‒ Cannot be used directly with block devices (see next slide)
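The space-efficiency difference is easy to quantify: replication stores n full copies, while erasure coding stores (k + m) / k times the data. A small comparison in Python, using a 3-replica pool and an illustrative 4+2 erasure-code profile:

def replication_raw(net_tb, n=3):
    return net_tb * n                  # n full copies

def erasure_raw(net_tb, k=4, m=2):
    return net_tb * (k + m) / k        # k data parts + m redundancy parts

print(replication_raw(1000))           # 3000 TB raw for 1 PB net with 3 replicas
print(erasure_raw(1000, k=4, m=2))     # 1500.0 TB raw for 1 PB net with a 4+2 profile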


Cache tiering

• Multi-tier storage architecture:
  ‒ Pool acts as a transparent write-back overlay for another

‒ e.g., SSD 3-way replication over HDDs with erasure coding

‒ Can flush either on relative or absolute dirty levels, or age

‒ Additional configuration complexity and requires workload-specific tuning

‒ Also available: read-only mode (no write acceleration)

‒ Some downsides (no snapshots)

• A good way to combine the advantages of replication and erasure coding


Federated gateways


SUSE Enterprise Storage 2.0

• The new and faster civetweb integration for RADOS Gateway

• iSCSI Ceph Gateway

• Network-based key server for encryption of OSD disks and journals

• This release is based on the stable Ceph 0.94 Hammer release

Deploy SUSE Enterprise Storage


Deploying SUSE Enterprise Storage

• Ceph-deploy method
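As a rough illustration only, a minimal ceph-deploy session for a small Hammer-era cluster could look like the commands below; the host names and device paths are placeholders, and the exact OSD syntax depends on the ceph-deploy version:

# Run from an admin node with passwordless SSH to all cluster nodes.
ceph-deploy new mon1 mon2 mon3             # generate initial ceph.conf and monmap
ceph-deploy install mon1 mon2 mon3 osd1 osd2 osd3
ceph-deploy mon create-initial             # form the MON quorum, gather keys
ceph-deploy osd create osd1:sdb:/dev/sdc   # data disk : SSD journal (placeholders)
ceph-deploy osd create osd2:sdb:/dev/sdc
ceph-deploy osd create osd3:sdb:/dev/sdc
ceph-deploy admin mon1 osd1 osd2 osd3      # push ceph.conf and the admin keyring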


Deploying SUSE Enterprise Storage

• Crowbar provisioning and orchestration system


About Ceph layout

● Ceph needs 1 or more MON nodes
  ‒ In production, 3 nodes are the minimum
● Ceph needs 3 or more OSD nodes
  ‒ Can be fewer in testing
● Each OSD should manage a minimum of 15 GB
  ‒ Smaller is possible

Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany

+49 911 740 53 0 (Worldwide)
www.suse.com

Join us on:
www.opensuse.org


Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
