CEPH @ Deutsche Telekom: A 2+ Years Production Liaison
Ievgen Nelen, Gerd Prüßmann - Deutsche Telekom AG, DBU Cloud Services, P&I
07.05.2015 2
Speakers: Ievgen Nelen & Gerd Prüßmann
Ievgen Nelen
• Cloud Operations Engineer
• Ceph since Cuttlefish
• OpenStack since Diablo
• @eugene_nelen
Gerd Prüßmann
• Head of Platform Engineering
• Ceph since Argonaut
• OpenStack since Cactus
• @2digitsLeft
Overview: Business Marketplace
• https://portal.telekomcloud.com/
• SaaS Applications from Software Partners (ISVs) and DT offered to SME customers
• e.g. Saperion, Sage, PadCloud, Teamlike, Fastbill, Imeet, Weclapp, SilverERP, Teamdisk ...
• Complements other cloud offerings from Deutsche Telekom (Enterprise cloud from T-Systems, Cisco Intercloud, Mediencenter etc.)
• IaaS platform based only on Open Source technologies like OpenStack, CEPH and Linux
• Project started in 2012 with OpenStack Essex; CEPH in production since 03/2013 (Bobtail)
Overview: Why Open Source? Why Ceph?
• no vendor lock-in!
• easier to change and adopt new technologies and concepts; more independent of vendor priorities
• low cost of ownership and operation, utilizing commodity hardware and Open Source
• no license fees, but professional support
• modular and horizontally scalable platform
• automation and flexibility allow for faster deployment cycles than in traditional hosting
• control over open source code allows faster bug fixing and feature delivery
Details: Ceph Basics
• Bobtail > Cuttlefish > Dumpling > Firefly (0.80.9)
• Multiple CEPH clusters
• overall raw capacity 4.8 PB
• one S3 cluster (~810 TB raw capacity, 15 storage nodes, 3 MONs)
• multiple smaller RBD clusters for REF, LIFE and DEV
• S3 storage for cloud-native apps (Teamdisk, Teamlike) and for backups (e.g. of RBD volumes)
• RBD for persistent volumes / data via OpenStack Cinder (e.g. DB volumes)
• Supermicro: 2x Intel Xeon E5-2640 v2 @ 2.00 GHz, 64 GB RAM, 7x SSDs, 18x HDDs
• Seagate Terascale ST4000NC000 4 TB HDDs
• LSI MegaRAID SAS 9271-8i
• 18 OSDs per node: RAID1 with 2 SSDs for /, 3x RAID0 with 1 SSD each for journals, 18x RAID0 with 1 HDD each for OSDs
• 2x10Gb network adapters
Details: Hardware
• Supermicro: 1x Intel Xeon E5-2650L @ 1.80 GHz, 64 GB RAM, 36x HDDs
• Seagate Barracuda ST3000DM001 3 TB HDDs
• LSI MegaRAID SAS 9271-8i
• 10 OSDs per node: RAID1 for /, 10x RAID0 with 1 HDD each for journals, 10x RAID0 with 2 HDDs each for OSDs
• 2x10Gb network adapters
Details: Configuration & Deployment
• Razor
• Puppet
• https://github.com/TelekomCloud/puppet-ceph
• dm-crypt disk encryption
• OSD location
• XFS
• 3 replicas
• OMD/Check_mk http://omdistro.org/
• ceph-dash https://github.com/TelekomCloud/ceph-dash for dashboard and API
• check_mk plugins (Cluster health, OSDs, S3)
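The check_mk plugins above essentially wrap Ceph's CLI status output; as a sketch, these are the kinds of standard Ceph commands such checks typically poll (the plugin internals are an assumption, the commands themselves are stock Ceph CLI):

```shell
# Status calls a monitoring check might parse
# (assumption: not the actual plugin code, requires a running cluster).
ceph health detail         # overall cluster health, per-PG problem details
ceph osd stat              # OSD counts: total / up / in
ceph df                    # raw and per-pool capacity usage
radosgw-admin bucket list  # S3 check: verifies the gateway answers admin requests
```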
Details: Performance Tuning
• Problem: low IOPS and IOPS drops
• benchmarked with fio
• enable the RAID0 writeback cache
• use separate disks for Ceph journals (better: SSDs; see the scale-out project)
• Problem: recovery/backfilling consumes a lot of CPU and degrades client performance
• osd_recovery_max_active = 1 (number of active recovery requests per OSD at one time)
• osd_max_backfills = 1 (maximum number of backfills allowed to or from a single OSD)
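The two throttles above can be set persistently in ceph.conf or injected into running OSDs; a minimal sketch, assuming a running cluster (values as on the slide):

```shell
# Persistent setting in ceph.conf, [osd] section:
#   osd recovery max active = 1
#   osd max backfills = 1

# Runtime injection without restarting the OSDs (takes effect immediately):
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1'
```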
Lessons Learned: Operational Experience
• Choose your hardware well!
• e.g. RAID controllers and hard disks: use enterprise-grade disks (desktop HDDs lack important features like TLER/ERC)
• CPU/RAM planning: calculate 1 GHz of CPU and 2 GB of RAM per OSD
• pick nodes with low storage capacity density for smaller clusters
• at least 5 nodes for a 3-replica cluster (e.g. for PoC, testing and development purposes)
• Cluster configuration “adjustments”:
• increasing pg_num has a big impact on the cluster because of massive data migration
• Rolling software updates / upgrades worked perfectly
• CEPH has character, but it is highly reliable: we never lost data
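One reason PG-count changes hit the cluster so hard is that they are usually large jumps. The common rule of thumb (about 100 PGs per OSD, divided by the replica count, rounded up to a power of two) illustrates the orders of magnitude; a sketch with a hypothetical OSD count:

```shell
# Rule-of-thumb PG count: (OSDs * 100) / replicas, rounded up to a power of two.
# The OSD count below is hypothetical, for illustration only.
osds=270
replicas=3
target=$(( osds * 100 / replicas ))   # 9000
pg_num=1
while [ "$pg_num" -lt "$target" ]; do
  pg_num=$(( pg_num * 2 ))
done
echo "$pg_num"                        # prints 16384
```

Doubling from, say, 8192 to 16384 PGs rebalances a large fraction of all objects at once, which is exactly the "massive data migration" observed above.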
Lessons Learned: Operational Experience
• Failed / ”Slow” disks
• Inconsistent PGs
• Incomplete PGs
• RBD pool configured with min_size=2
• blocks IO operations to the pool / cluster when a PG drops below 2 replicas
• fixed in Hammer (allows PG recovery while the replica count is below the pool's min_size)
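min_size is a per-pool setting; a minimal sketch of inspecting and changing it (the pool name "rbd" is an assumption):

```shell
# Show the current min_size of a pool (pool name assumed):
ceph osd pool get rbd min_size

# With min_size=2, IO blocks as soon as a PG has fewer than 2 live replicas.
# Lowering it to 1 lets IO continue on a single replica, trading safety for
# availability during recovery:
ceph osd pool set rbd min_size 1
```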
/var/log/syslog.log:
Apr 12 04:59:47 cephosd5 kernel: [12473860.669262] sd 6:2:10:0: [sdk] Unhandled error code

root@cephosd5:/var/log# mount | grep sdk
/dev/mapper/cephosd5-journal-sdk on /var/lib/ceph/osd/journal-disk9

root@cephosd5:/var/log# grep journal-disk9 /etc/ceph/ceph.conf
osd journal = /var/lib/ceph/osd/journal-disk9/osd.151-journal

/var/log/ceph/ceph-osd.151.log.1.gz:
2015-04-12 04:59:47.891284 7f8a10c76700 -1 journal FileJournal::do_write: pwrite(fd=25, hbp.length=4096) failed: (5) Input/output error
Lessons Learned: Operational Experience
Lessons Learned: Incomplete PGs. What Happened?
[Diagram: three OSD nodes, each with OSDs, their journals, and the PGs placed on them]
Overview: Scale-Out Project
+40%
Current overall capacity:
~60 storage nodes
5.4 PB storage gross
~0.5 PB S3 storage net
Planned capacity for 2015:
~90 storage nodes
7.5 PB storage gross
~1.5 PB S3 storage net
Future Setup: Scale-Out Project
• 2 physically separated rooms
• data distributed according to the rule:
• not more than 2 replicas in one room, not more than 1 replica in one rack
Future Setup: New CRUSH Map Rules
rule myrule {
ruleset 3
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type rack
step emit
}
crushtool -i real7 --test --show-statistics --rule 3 --min-x 1 --max-x 1024 --num-rep 3 --show-mappings
CRUSH rule 3 x 1 [12,19,15]
CRUSH rule 3 x 2 [14,16,13]
CRUSH rule 3 x 3 [3,0,7]
…
Listing 1: crushmap rule. Listing 2: simulate 1024 objects.
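After simulating the rule with crushtool as in Listing 2, the edited map still has to be compiled and injected into the cluster. A sketch of the standard round trip (file names are placeholders):

```shell
# Extract and decompile the live CRUSH map:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# ... edit crushmap.txt, e.g. add the rule from Listing 1 ...

# Recompile, re-test the mappings, then inject:
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 3 --num-rep 3 --show-mappings
ceph osd setcrushmap -i crushmap.new
```

Injecting a new map triggers data movement for every PG whose placement changes, so this is typically done during a maintenance window.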
Future Setup: Dreams
• cache tiering
• make use of shiny new SSDs in a hot zone / cache pool
• SSD pools
• OpenStack live migration for VMs (boot from RBD volume)
Questions & Answers
• Ievgen Nelen
• @eugene_nelen
• Gerd Prüßmann
• @2digitsLeft