Copyright 2015 Bit-isle Inc. All Rights Reserved
Approaching
Open-Source Hyper-Converged OpenStack using 40Gbit Ethernet Network
Ikuo Kumagai – Bit-isle Inc.
Yuki Kitajima – Altima Corp.
Special thanks: Masayoshi Oka – Net One Systems
Background
The Services of Bit-isle
‣ IDC service
▪ iDCs
- 5 iDCs in the Tokyo metropolitan area and 1 iDC in Osaka.
▪ Network connectivity
- Providing Internet and private network connectivity
▪ Rental service
- Server and network equipment rentals are available for colocation.
▪ Managed services
- A fully managed environment across on-premises and colocated equipment in the data center.
‣ We also offer cloud services
▪ Not covered in today's talk.
Hyper-Converged Infrastructure – Our Needs
Elements of hyper-convergence
‣ Structure as simple as possible
‣ Deployment as rapid as possible
‣ Integrated management
‣ Scalability as flexible as possible
Our Concept
‣ No special appliances
‣ No commercial products
Our Goals ① Easy Provisioning
【Goal】
‣ Short lead time at low cost
▪ Supply as fast as possible
▪ Stock as little as possible
▪ Cost as low as possible
‣ Scale easily
▪ Easy to deploy physical machines
▪ Easy to deploy logical components
【Method】
‣ Physical systems as simple as possible
▪ Simple: 1U servers (our service base)
▪ Only two switch systems
▪ A Ceph cluster for storage
Basic Structure
Network devices
‣ 1 × 40G network for all services
‣ 1 × 1G network for IPMI
OpenStack nodes
‣ 1 control and NW node
‣ 5 compute and storage nodes
Deployment node
‣ Juju/MAAS server
[Diagram: a router and the CTRL/NW node, the MAAS/Juju deployment node, and five Compute/OSD nodes, connected to the OpenStack segment and the IPMI segment]
Our Goals ② Performance
【Goal】
‣ Much higher performance
‣ The base is provided as open source.
‣ Options for specific applications are offered for profit.
【Method】
▪ Basic servers (specs will upgrade linearly)
▪ 40Gbit/56Gbit switches
▪ PCIe SSDs (for Ceph journal & OSD)
Resource Servers
[Diagram: four compute & storage servers, each running KVM with multiple VMs, connected over 40Gb Ethernet; each server contributes an SSD OSD and an SSD journal to the shared Ceph cluster]
Server fundamentals
Server      HP ProLiant DL360 Gen9
CPU         E5-2690 v3 2.60GHz (1P/12C) × 2
HDD         SAS 1TB HDD × 2 (RAID1 for OS)
PCIe SSD    Fusion-io ioDrive Duo 320GB × 1 (one module for journal, one for OSD)
40Gbps NIC  Mellanox ConnectX-3 Pro
Server & Storage
Server (HP DL360 Gen9)
PCIe SSD
‣ Fusion-io ioDrive Duo 320GB
Network Devices
HW selection
‣ Adapter (NIC)
▪ 10 / 40 / 56GbE
▪ RDMA supported
▪ VXLAN offload supported
‣ Switch
▪ 36 × QSFP ports, or 48 × SFP+ and 12 × QSFP ports
▪ 12 × QSFP ports can serve as 48 × SFP+ via breakout cables
▪ 10 / 40 / 56GbE
▪ 220ns low latency
▪ Best suited for an SDS network
Our Goals ③ Knowledge Sharing
【Goal】
‣ Easy to customize
‣ Deploy servers more easily
‣ Share knowledge
【Method】
‣ Use Juju/MAAS (open-source deployment tools)
Deploy
Using Juju/MAAS
‣ Node setup (by local charm)
▪ Install the OS
▪ Install device drivers & network settings
- For the 40G NIC & PCIe SSD drivers
‣ Deploy the Ceph and OpenStack components with the charms below (a driver script is sketched after the list)
▪ cs:trusty/ntp, cs:trusty/ceph, cs:trusty/ceph-osd, cs:trusty/rsyslog, cs:trusty/rsyslog-forwarder-ha
▪ local:trusty/nova-compute, cs:trusty/percona-cluster, cs:trusty/rabbitmq-server-32
▪ cs:trusty/keystone, local:trusty/openstack-dashboard, local:trusty/nova-cloud-controller
▪ cs:trusty/neutron-api, cs:trusty/neutron-gateway, cs:trusty/neutron-openvswitch
▪ cs:trusty/cinder, cs:trusty/cinder-ceph, cs:trusty/glance, cs:trusty/hacluster
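For illustration, a minimal sketch of driving this deployment from Python through the Juju 1.x CLI. The charm names come from the list above; the unit counts, repository path, and relation set are hypothetical and depend on the real bundle.

    # Sketch: deploy part of the stack through the Juju 1.x CLI.
    # Unit counts, the repository path and the relations are
    # hypothetical; only the charm names come from the slide above.
    import subprocess

    def juju(*args):
        """Run one juju CLI command, failing loudly on error."""
        subprocess.run(["juju"] + list(args), check=True)

    # Charm-store charms deploy directly; local charms need --repository.
    juju("deploy", "cs:trusty/ceph", "-n", "3")
    juju("deploy", "cs:trusty/ceph-osd")
    juju("deploy", "--repository=/opt/charms", "local:trusty/nova-compute")

    # Wire the services together.
    juju("add-relation", "ceph-osd", "ceph")
    juju("add-relation", "nova-compute", "ceph")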
Test Items (Network)
[Diagram: two compute nodes, each running KVM with OVS and four VMs, connected by a VXLAN tunnel]
‣ VM to VM between physical nodes
‣ 1 – 16 VMs per physical node
‣ Metering with iperf3, TCP & UDP (a sweep sketch follows)
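A minimal sketch of how such a sweep could be scripted; the VM addresses are placeholders, iperf3 servers are assumed to be listening on the target VMs already, and for brevity the pairs here run one after another rather than in parallel as in the real test.

    # Sketch: measure TCP bandwidth to a set of target VMs with iperf3
    # and report the total and average, as in the charts that follow.
    import json
    import subprocess

    def tcp_gbits(server_ip):
        """Run one iperf3 TCP client; return received Gbit/s."""
        out = subprocess.run(["iperf3", "-c", server_ip, "-J"],  # -J = JSON
                             capture_output=True, text=True, check=True).stdout
        return json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e9

    vms = ["10.0.0.%d" % i for i in range(11, 19)]   # placeholder addresses
    results = [tcp_gbits(ip) for ip in vms]
    print("total %.2f Gbit/s, average %.2f Gbit/s"
          % (sum(results), sum(results) / len(results)))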
Basic Performance: TCP Bandwidth
Total & average bandwidth in Gbit/s (iperf3 defaults):

VM pairs   1-1    2-2    4-4    8-8    16-16
Total      2.05   3.18   5.74   7.18   10.53
Average    2.05   1.59   1.43   0.90   0.66

[Chart: total and average TCP bandwidth per VM-pair count]
Basic Performance: UDP Bandwidth by Packet Size
[Charts: total and average UDP bandwidth as a function of packet size]
Basic Performance: UDP Latency by Packet Size
[Charts: latency (jitter) and lost packets as a function of packet size]
Test Items (IOPS)
[Diagram: three compute nodes, each running KVM with four VMs, connected over the network to the Ceph cluster; each Ceph node has an SSD journal and HDD OSDs]
FIO (8k, 100 jobs)
‣ 1 – 16 VMs (1, 2 or 4 VMs per host; host count: 1 – 4), each running fio as sketched below
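A minimal sketch of the per-VM load generator, using the fio parameters quoted on the result slides; the access pattern, I/O engine, and test-file path on the Ceph-backed disk are assumptions.

    # Sketch: run fio with the parameters quoted on the result slides
    # (bs=8k, size=10M, runtime=60, iodepth=32, numjobs=80).
    import subprocess

    subprocess.run([
        "fio",
        "--name=ceph-8k",
        "--filename=/mnt/vdb/testfile",  # placeholder path on the Ceph-backed disk
        "--rw=randrw",                   # assumption: mixed random read/write
        "--ioengine=libaio",             # assumption: async I/O engine
        "--bs=8k", "--size=10M",
        "--runtime=60", "--time_based",
        "--iodepth=32",
        "--numjobs=80", "--group_reporting",
    ], check=True)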
Basic Performance of Storage (Bandwidth)
Bandwidth (8k blocks, MByte/sec)
‣ Total
‣ Average
[Charts: total and average bandwidth per VM count]
【FYI】 fio parameters: bs=8k size=10M runtime=60 iodepth=32 numjobs=80 group_reporting
Basic Performance of Storage (IOPS)
IOPS (8k blocks)
‣ Total
‣ Average
[Charts: total and average IOPS per VM count]
【FYI】 fio parameters: bs=8k size=10M runtime=60 iodepth=32 numjobs=80 group_reporting
Counter Plan
To use the 40Gbit network more effectively:
‣ Network performance improvements
▪ VXLAN offload
- Offload the CPU workload of VXLAN processing to the NIC
▪ DPDK
- Reduce the networking cost of the Linux kernel
‣ Ceph IO performance improvements
▪ Ceph RDMA
- Enable direct memory access over Ethernet for the storage cluster
VXLAN Offload
OVS + normal NIC (general understanding)
‣ VXLAN processing is handled by OVS.
‣ This means the CPU does the packet processing for VXLAN packets.
‣ A normal NIC cannot offload the encapsulated traffic:
▪ no checksum offload, TSO, RSS, etc.
VXLAN Offload
What is VXLAN offload?
‣ Offloads the VXLAN protocol at the edge point (the NIC)
‣ The VXLAN offload engine re-enables the TCP/IP offloads:
▪ checksum, TSO, RSS, GRO
‣ More throughput, lower latency, and less CPU usage (a quick check is sketched below)
(The VM generates the inner packet; OVS generates the outer packet.)
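A minimal sketch of checking whether a NIC advertises UDP-tunnel (VXLAN) offloads; the interface name is a placeholder, and the exact feature names vary with the kernel version.

    # Sketch: list a NIC's UDP-tunnel offload flags via "ethtool -k".
    # "tx-udp_tnl-segmentation" is the kernel's feature name for
    # VXLAN-aware TSO; the interface name below is a placeholder.
    import subprocess

    iface = "eth4"
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "udp_tnl" in line:
            print(line.strip())   # e.g. "tx-udp_tnl-segmentation: on"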
VXLAN Offload
HW selection

Model      MCX311A-XCCT   MCX312B-XCCT   MCX313A-BCCT         MCX314A-BCCT
Ports      Single 10GbE   Dual 10GbE     Single 10/40/56GbE   Dual 10/40/56GbE
Port type  SFP+           SFP+           QSFP                 QSFP

Common to all models:
‣ Cable: copper, optical
‣ Host bus: PCIe 3.0 x8
‣ Features: VXLAN/NVGRE offload, RDMA, SR-IOV, etc.
‣ OS: RHEL, SLES, Microsoft Windows Server, FreeBSD, Ubuntu, VMware ESXi
VXLAN Offload Result (TCP Bandwidth)
Comparing VXLAN offload against normal

Bandwidth with VXLAN offload (Gbit/s):
VM pairs   1-1     2-2     4-4     8-8     16-16
Total      14.40   21.70   30.00   31.43   24.63
Average    14.40   10.85   7.50    3.93    1.54

Bandwidth, normal (Gbit/s):
VM pairs   1-1    2-2    4-4    8-8    16-16
Total      2.05   3.18   5.74   7.18   10.53
Average    2.05   1.59   1.43   0.90   0.66

[Charts: total and average bandwidth for both cases]
Virtualization Bottleneck
Bandwidth with VXLAN offload (Gbit/s):
VM pairs   1-1     2-2     4-4     8-8     16-16
Total      14.40   21.70   30.00   31.43   24.63
Average    14.40   10.85   7.50    3.93    1.54

Why does the total stop scaling and drop at 16-16?
How to Use the CPU
Allocate CPU cores to DPDK explicitly (see the sketch below).
[Diagram: two processors with four cores each; selected cores are dedicated to the DPDK data plane, the remaining cores run Linux for the control plane]
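A minimal sketch of such explicit core allocation via DPDK's EAL options, using testpmd as the example application; the core numbers are illustrative and would follow the NUMA layout of the two processors shown above.

    # Sketch: pin DPDK's testpmd to explicit cores via the EAL options
    # ("-l" = list of cores, "-n" = memory channels). The core numbers
    # are illustrative, not the configuration used in these tests.
    import subprocess

    subprocess.run([
        "testpmd",
        "-l", "2,4,6",         # dedicate cores 2, 4 and 6 to the data plane
        "-n", "4",             # memory channels
        "--",                  # end of EAL options, start of app options
        "--forward-mode=io",   # simple RX->TX forwarding
    ], check=True)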
Network Bottleneck in the Linux Kernel Stack
‣ DPDK: Data Plane Development Kit
[Diagram: general process — application → system call → kernel packet copy → device driver → network device, through the hypervisor; DPDK process — the application linked against the DPDK library drives the device directly, bypassing the kernel stack]
DPDK
‣ 1-to-1 performance
‣ N-to-N performance
To be verified next month.
Ceph RDMA
What is RDMA?
‣ Remote DMA
‣ Zero-copy technology
‣ Protocols
▪ iSER, RoCE, iWARP
[Diagram: general process — application → socket API → kernel socket/TCP stack → device driver → network device; RDMA process — application → RDMA verbs API → device driver, bypassing the kernel TCP stack]
Ceph RDMA
An RDMA network suits flash storage
RDMA advantages for Ceph
‣ Reduces the hypervisors' CPU workload for IO transactions
‣ Much faster IO for east-west traffic and fail-over (fail-back)
‣ Higher throughput and IOPS
[Diagram: latency comparison — 45 µsec total on the normal path vs. 25.7 µsec total with RoCE]
Ceph RDMA
Ceph supports RDMA
‣ v0.94 "Hammer" released
https://ceph.com/releases/v0-94-hammer-released/
http://tracker.ceph.com/projects/ceph/wiki/Accelio_RDMA_Messenger
Function: XioMessenger / Library: Accelio (enabled as sketched below)
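For illustration, a minimal ceph.conf sketch of enabling the experimental Accelio messenger in Hammer; the option name follows the XioMessenger wiki linked above and may differ between builds.

    # ceph.conf sketch: switch the Ceph messenger from the default
    # SimpleMessenger to the experimental Accelio/RDMA XioMessenger.
    # Must be set consistently on all daemons and clients.
    [global]
    ms_type = xio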
Ceph RDMA Result
For reference purposes only
‣ 3-node Ceph cluster; fio accessing rbd directly
[Charts: bandwidth and IOPS]
Summary
VXLAN offload is one of the effective solutions.
The other solutions require continued verification.
To be continued.
Next Plan
More performance
‣ Network workload offload
‣ Increased memory (NAND flash DIMMs)
‣ NVMe SSDs and DIMM storage
Flexible scaling
‣ Scale the Internet gateway (SDN or NFV)
‣ Multi-region scaling