Analyzing Performance in the Cloud
Solving an elastic problem with a scientific approach

Nicholas Wakou (Dell EMC), Alex Krzos (Red Hat)
Thursday, October 27, 2016
Barcelona OpenStack Summit 2016
Presenters
Alex Krzos is a Senior Performance Engineer at Red Hat working on OpenStack.
[email protected]
https://www.openstack.org/summit/barcelona-2016/summit-schedule/events/16204/analyzing-performance-in-the-cloud-solving-an-elastic-problem-with-a-scientific-approach
Nicholas Wakou is a Principal Performance Engineer with the Dell EMC Open Source team.
[email protected]
Agenda
➢ CLOUD DEFINITION & CHARACTERISTICS
➢ PERFORMANCE MEASURING TOOLS
➢ SPEC CLOUD IaaS 2016 BENCHMARK
➢ PERFORMANCE MONITORING TOOLS
➢ PERFORMANCE CHARACTERIZATION
➢ TUNING TIPS
CLOUD DEFINITION & CHARACTERISTICS
DEFINING A CLOUD
NIST SPECIAL PUBLICATION 800-145
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf
CLOUD CHARACTERISTICS
PERFORMANCE MEASURING TOOLS
RALLY
OpenStack Benchmarking Tool

➢ as-an-App and as-a-Service
➢ Verification
➢ Benchmarking
➢ Profiling
➢ Reports
➢ SLAs for Benchmarks
➢ Many plugins
Source: What is Rally?, https://rally.readthedocs.io/en/latest/
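Rally scenarios are driven by task files. The sketch below generates a minimal task file for Rally's NovaServers.boot_and_delete_server scenario, sketched from memory of the Rally task format; the flavor and image names are placeholders you would replace with ones from your cloud.

```python
import json

# Minimal Rally task (JSON form): boot 50 servers, 5 at a time, delete each,
# and fail the run if any iteration errors. Flavor/image names are placeholders.
task = {
    "NovaServers.boot_and_delete_server": [{
        "args": {"flavor": {"name": "m1.small"},
                 "image": {"name": "cirros"}},
        "runner": {"type": "constant", "times": 50, "concurrency": 5},
        "sla": {"failure_rate": {"max": 0}},
    }]
}

with open("boot-and-delete.json", "w") as f:
    json.dump(task, f, indent=2)
```

Run it with `rally task start boot-and-delete.json`, then `rally task report` for an HTML report.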
PERFKIT BENCHMARKER
Source: Introduction to Perfkit Benchmark and How to Extend it, https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/wiki/Tech-Talks
Open-source living benchmark framework containing a set of benchmarks used to compare cloud offerings/environments

➢ 10+ Cloud Providers/Environments
➢ 34+ Benchmarks
➢ Large Community Involvement
➢ Captures Cloud Elasticity with Benchmark Results
➢ Uses Cloud/Environment CLI Tooling
➢ Publishes Results to BigQuery for Comparison
PERFKIT EXPLORER
Source: https://github.com/GoogleCloudPlatform/PerfKitExplorer
Dashboarding and Performance Analysis Tool for PerfKitBenchmarker Results

➢ Multiple Chart Options
➢ Uses BigQuery as backend data-store
➢ Hosted in Google App Engine
CLOUDBENCH
➢ Framework that automates cloud-scale evaluation and benchmarking
➢ Benchmark Harness
▪ Requests the Cloud Manager to create instance(s)
▪ Submits a configuration plan and steps to the Cloud Manager on how the test will be performed
▪ At the end of the test, collects and logs applicable performance data and logs
▪ Destroys instances no longer needed for the test
HARNESS AND WORKLOAD CONTROL
Benchmark Harness Cloud SUT
Group of boxes represents an application instance
The Benchmark Harness comprises CloudBench (CBTOOL), baseline/elasticity drivers, and report generators.
For white-box clouds the benchmark harness is outside the SUT. For black-box clouds, it can be in the same location or campus.
BROWBEAT
Orchestration tool for existing OpenStack Workloads
➢ Combines Workloads, Metrics, and Results into a single tool
➢ Runs Performance Workloads:
▪ Rally - Control Plane
▪ Rally Plugins & Rally+pBench Plugins - Control+Data Plane
▪ Shaker - Network Data Plane
▪ PerfKitBenchmarker - Data Plane + Cloud Elasticity
➢ Provides Performance Infrastructure Installation and Configuration for:
▪ Carbon/Graphite/Grafana
▪ Collectd
▪ ELK
▪ FluentD
➢ Provides dashboards for Visualizing and Comparing Results and System Performance Metrics
BROWBEAT - RESULTS
BROWBEAT - Metrics
SPEC CLOUD IAAS 2016 BENCHMARK
➢ Measures performance of Infrastructure-as-a-Service (IaaS) Clouds
➢ Measures both control and data plane
▪ Control: management operations, e.g., instance provisioning time
▪ Data: virtualization, network performance, runtime performance
➢ Uses workloads that
▪ resemble “real” customer applications
▪ benchmark the cloud, not the application
➢ Produces metrics (“elasticity”, “scalability”, “provisioning time”) which allow comparison
SPEC Cloud IaaS Benchmarking: Dell Leads the Way
http://en.community.dell.com/techcenter/cloud/b/dell-cloud-blog/archive/2016/06/24/spec-cloud-iaas-benchmarking-dell-leads-the-way
Scalability and Elasticity Analogy
Climbing a mountain

[Figure: a climber on a mountain. Scalability – conquering an infinitely high mountain. Elasticity – time for each step.]

IDEAL
Scalability
• Mountain: Keep on climbing
• Cloud: Keep on adding load without errors
Elasticity
• Mountain: Each step takes identical time
• Cloud: Performance within limits as load increases
WHAT IS MEASURED?
➢ Measures the number of AIs that can be loaded onto a Cluster before SLA violations occur
➢ Measures the scalability and elasticity of the Cloud under Test (CuT)
➢ Not a measure of Instance density
➢ SPEC Cloud workloads can individually be used to stress the CuT:
▪ KMeans – CPU/Memory ▪ YCSB - IO
SPEC CLOUD BENCHMARK PHASES

Baseline Phase
▪ Determine the results for a single application instance of a workload
▪ AI = stream of 5 runs
KMeans baseline AI
YCSB baseline AI
Elasticity Phase
Determine cloud elasticity and scalability results when multiple workloads are run
BENCHMARK STOPPING CONDITIONS
➢ 20% AIs fail to provision
➢ 10% AIs have errors in any run
➢ Max number of AIs set by Cloud Provider
➢ 50% AIs have QoS violations
▪ KMeans completion time ≤ 3.33 x Baseline phase
▪ YCSB Throughput ≥ Baseline throughput / 3
▪ YCSB Read Response Time ≤ 20 x Baseline Read Response Time
▪ YCSB Insert Response Time ≤ 20 x Baseline Insert Response Time
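The stopping conditions above translate directly into plain comparisons. A sketch with my own variable names, assuming per-AI results and baseline averages are available as dictionaries:

```python
def ai_within_qos(run, baseline):
    """True if one AI run meets all four QoS conditions listed above."""
    return (run["kmeans_completion_s"] <= 3.33 * baseline["kmeans_completion_s"]
            and run["ycsb_throughput"] >= baseline["ycsb_throughput"] / 3
            and run["ycsb_read_ms"] <= 20 * baseline["ycsb_read_ms"]
            and run["ycsb_insert_ms"] <= 20 * baseline["ycsb_insert_ms"])

def stop_benchmark(ais_failed, ais_errored, ais_qos_violating, total_ais):
    """Stopping conditions: 20% provisioning failures, 10% run errors,
    or 50% QoS violations (the provider-set max-AI limit is not modeled)."""
    return (ais_failed > 0.20 * total_ais
            or ais_errored > 0.10 * total_ais
            or ais_qos_violating > 0.50 * total_ais)
```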
HIGH LEVEL REPORT SUMMARY
PUBLISHED RESULTS WEBSITE
https://www.spec.org/cloud_iaas2016/results/cloudiaas2016.html
PERFORMANCE MONITORING TOOLS
CEILOMETER
Source: http://docs.openstack.org/developer/ceilometer/architecture.html
Another familiar OpenStack project
➢ https://wiki.openstack.org/wiki/Telemetry
➢ Goal is to efficiently collect, normalize and transform data produced by OpenStack services
➢ Interacts directly with the OpenStack services through defined interfaces
➢ Applications can leverage Ceilometer to gather OpenStack performance data
COLLECTD/GRAPHITE/GRAFANA
➢ Collectd
▪ Daemon to collect System Performance Statistics
▪ Plugins for CPU, Memory, Disk, Network, Process, …
➢ Graphite/Carbon
▪ Carbon receives metrics and flushes them to whisper database files
▪ Graphite is the webapp frontend to Carbon
➢ Grafana
▪ Visualizes metrics from multiple backends
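Besides collectd's write_graphite plugin, any script can push ad-hoc readings into this stack: Carbon's plaintext listener (TCP port 2003 by default) accepts one metric per line. A sketch; the metric path shown is made up.

```python
import socket
import time

def carbon_line(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext protocol:
    '<metric.path> <value> <unix-timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(host, path, value, port=2003):
    # Carbon's plaintext listener defaults to TCP port 2003.
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(carbon_line(path, value).encode())

# e.g. send_metric("graphite.example.com", "cloud.node1.cpu.user", 12.5)
```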
GANGLIA

Ganglia is a scalable, distributed monitoring system for high-performance computing systems such as server nodes, clusters, and grids.
- Relatively easy to set up
- Tracks many hardware-centric metrics
- Low operational burden
PERFORMANCE CHARACTERIZATION
PROVISIONING TIME: SPEC CLOUD
➢ The time needed to bring up a new instance, or add more resources (like CPU or storage) to an existing instance
➢ Instance: time FROM request to create a new instance TO the time when the instance responds to a netcat probe on port 22
➢ Application instance: time FROM request to create a new instance TO the time when the AI reports readiness to accept client requests

➢ Provisioning Time Characterization using the Baseline phase
➢ Increase the number of VMs (vary YCSB seeds, KMeans and/or Hadoop slaves) and note the impact on provisioning time
➢ Vary instance configuration (flavor)
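The instance-provisioning measurement is easy to reproduce outside the benchmark: timestamp the create request, then poll until the guest answers on port 22 (the same signal as the netcat probe). A minimal sketch:

```python
import socket
import time

def wait_for_port(host, port, timeout=300.0, interval=1.0):
    """Poll until TCP `port` accepts a connection; return elapsed seconds,
    or None if the port never came up within `timeout`."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            # Success means the guest's sshd (or any listener) is answering.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            time.sleep(interval)
    return None
```

Call it right after the `nova boot` request returns; the value it yields is the instance provisioning time as defined above.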
PCI-E Limits
For PCI-E Gen-3 capable slots. http://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/

SAS Limit
An LSI whitepaper, Switched SAS: Sharable, Scalable SAS Infrastructure, http://www.abacus.cz/prilohy/_5025/5025548/SAS_Switch_White%20Paper_US-EN_092210.pdf
IO LIMITS
NETWORK/IO CHARACTERIZATION
➢ Understand network utilization under load
➢ Management networks
➢ Data networks (Neutron tenant)
➢ Monitor with Ganglia, collectd, Linux tools (vmstat, iostat etc)
➢ SPEC Cloud YCSB Baseline tests – Throughput (ops/s)
➢ Vary number of Seeds
➢ Increase number of YCSB records and operations
➢ Increase number of YCSB threads
➢ CloudBench fio
➢ CloudBench Netperf
CPU CHARACTERIZATION
➢ Use SPEC Cloud Baseline tests for CPU Characterization
➢ Vary number of Hadoop slaves
➢ Increase sample size, number of dimensions, number of clusters
➢ Understand CPU utilization under load
➢ Monitor with Ganglia, collectd, Grafana
➢ Linux tools (top, vmstat), SPEC Cloud, KMeans

Note:
✓ CPU user time
✓ CPU system time
✓ CPU iowait time
✓ CPU irq time
SCALABILITY/ELASTICITY
➢ Understand Scalability/Elasticity of the CuT
➢ SPEC Cloud Elasticity phase
➢ Vary number of AIs
➢ Monitor with FDR html report
TUNING TIPS
HARDWARE/OS TUNING
➢ Latest BIOS and Firmware revs
➢ Appropriate BIOS settings
➢ RAID/JBOD
➢ Disk controller
➢ NIC driver - interrupt coalescing and affinitization
➢ NIC bonding
➢ NIC jumbo frames
➢ OS configuration settings
CLOUD TUNING
▪ HW/OS Tuning
▪ Cloud Configs/Settings
▪ Workload tuning
INSTANCE CONFIGURATION
Performance is impacted by
▪ Instance type (flavor)
▪ Number of Instances
OVER-SUBSCRIPTION
Beware of over-subscription !!!
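In Nova, over-subscription is governed by the allocation-ratio settings; cpu_allocation_ratio has long defaulted to 16.0 and ram_allocation_ratio to 1.5. A quick sketch of what those ratios mean for scheduling capacity:

```python
def schedulable_vcpus(physical_cores, cpu_allocation_ratio=16.0):
    """vCPUs the Nova scheduler will place on a host
    (default cpu_allocation_ratio is 16.0)."""
    return int(physical_cores * cpu_allocation_ratio)

def schedulable_ram_mb(physical_ram_mb, ram_allocation_ratio=1.5):
    """RAM the scheduler will place on a host (default ratio 1.5)."""
    return int(physical_ram_mb * ram_allocation_ratio)
```

A 24-core compute node therefore advertises 384 schedulable vCPUs by default: harmless for mostly idle instances, painful for CPU-bound benchmark workloads. Tune the ratios to match your workload mix.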
LOCAL STORAGE
Use of local storage instead of shared storage (like Ceph) could improve performance by over 50% ... depending on Ceph replication.

Source: OpenStack: Install and configure a storage node - OpenStack Kilo. http://docs.openstack.org/kilo/install-guide/install/yum/content/cinder-install-storage-node.html (2015)
NUMA NODES
Pinning instance vCPUs to physical CPUs (respecting NUMA topology) on local storage further improves performance.
Source: Red Hat: CPU pinning and NUMA topology awareness in OpenStack compute. http://redhatstackblog.redhat.com/2015/05/05/cpu-pinning-and-numa-topology-awareness-in-openstack-compute/ (2015)
DISK PINNING
Disk Pinning shows a 15% performance improvement
Source: OpenStack: OpenStack Cinder multi-backend. https://wiki.openstack.org/wiki/Cinder-multi-backend (2015)
WORKER COUNT CONFIGURATION
Allow services to use available resources with higher concurrency

Examples:
Keystone Process Count
Neutron Workers
Glance Workers
Gnocchi API Workers
UNEVEN CONTROLLER USAGE
One controller had more cores available than the other two and ended up with all the jobs. This scenario was identified easily because the correct dashboarding was in place.
HEAT MEMORY USAGE
About 1GB of memory used by Heat for every 10 compute nodes deployed. Size your controller memory appropriately.
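The rule of thumb above reduces to simple arithmetic when budgeting controller memory for Heat:

```python
def heat_memory_gb(compute_nodes, gb_per_10_nodes=1.0):
    """~1 GB of Heat memory per 10 compute nodes deployed (rule of thumb
    from observed deployments; adjust the ratio for your environment)."""
    return compute_nodes / 10.0 * gb_per_10_nodes
```

So a 250-node deployment implies roughly 25 GB for Heat alone, on top of everything else the controllers run.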
DEPLOYMENT TIMINGS
OSPD 9 Overcloud Deployment
CONCLUSION
➢ Define what you are trying to measure
▪ Define a cloud
▪ Define what metrics are important
➢ Use the correct tools
▪ Rally
▪ PerfKitBenchmarker
▪ Cloudbench
▪ SPEC Cloud IaaS 2016 Benchmark
▪ Ceilometer
▪ Collectd/Graphite/Grafana
▪ Ganglia
▪ Browbeat
➢ Gather and analyze data
▪ Apply tuning tips based on the data
PARTICIPATE!
ADDITIONAL INFORMATION
➢ Guidelines and Considerations for Performance and Scaling your Red Hat Enterprise Linux OpenStack Platform 6 Cloud
▪ https://access.redhat.com/articles/1507893
➢ Guidelines and Considerations for Performance and Scaling your Red Hat Enterprise Linux OpenStack Platform 7 Cloud
▪ https://access.redhat.com/articles/2165131
➢ Red Hat OpenStack Blog
▪ http://redhatstackblog.redhat.com/
➢ Red Hat Developer Blog
▪ http://developerblog.redhat.com/
➢ Red Hat Enterprise Linux Blog
▪ http://rhelblog.redhat.com/
Rally
Source: https://github.com/openstack/rally/blob/master/doc/source/images/Rally-Actions.png
Rally
Rally is a familiar OpenStack project
▪ https://github.com/openstack/rally
▪ An automated benchmark tool for OpenStack

Benchmarking
▪ Multiple use cases
• Development and QA
• DevOps
• CI/CD
BROWBEAT
➢ Scale and Performance automation
➢ Ansible playbooks for automation
➢ Provides automation wrapper around existing tooling
▪ Rally - Control plane tests
▪ Shaker - Data plane network tests
▪ Perfkit - Data plane tests
➢ Leverages existing upstream test frameworks rather than replacing them
➢ Performance Monitoring
▪ Collectd/Graphite/Grafana
➢ Results Capture/Storage/Analytics
▪ ELK stack
▪ Allows for results comparison
COLLECTD/GRAPHITE/GRAFANA
Example Grafana dashboards
DEPLOYMENT TIMINGS
Saw many instance reschedules with default scheduler. Deployment time dropped dramatically by setting up assignments via ironic.
DEFINING A CLOUD
Cloud = Private
Cloud = OpenStack
Cloud = Rain
Cloud = Cumulus
Cloud = Public
Cloud = Funnel
Cloud = OpenShift
Cloud = Community
Cloud = Cirrus
Ten different people will probably give you ten different answers
DEFINING A CLOUD
http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf
NIST SPECIAL PUBLICATION 800-145
Private cloud
The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.
Public cloud
The cloud infrastructure is provisioned for open use by the general public. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them. It exists on the premises of the cloud provider.
Hybrid cloud
The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
DEFINING A CLOUD
SPEC OSG Cloud Subcommittee Glossary
Blackbox Cloud
A cloud provider provides a general specification of the SUT, usually in terms of how the cloud consumer may be billed. The exact hardware details corresponding to these compute units may not be known. This will typically be the case if the entity benchmarking the cloud is different from a cloud provider.
Whitebox Cloud
The SUT’s exact engineering specifications including all hardware and software are known and under the control of the tester. This will typically be the case for private clouds.
Source: https://www.spec.org/cloud_iaas2016/docs/faq/html/glossary.html
DEFINING A CLOUD
➢ The focus of this presentation will be predominantly on white box private cloud environments
➢ Primary example is OpenStack
➢ Many of the tools and methodologies are usable in the other cloud environments as well
CLOUD CHARACTERISTICS
SPEC RESEARCH GROUP - CLOUD WORKING GROUP
https://research.spec.org/working-groups/rg-cloud-working-group.html
READY FOR RAIN? A VIEW FROM SPEC RESEARCH ON THE FUTURE OF CLOUD METRICS
https://research.spec.org/fileadmin/user_upload/documents/rg_cloud/endorsed_publications/SPEC-RG-2016-01_CloudMetrics.pdf
ELASTICITY

The degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible.

Source: Ready for Rain? A View from SPEC Research on the Future of Cloud Metrics, SPEC RG Cloud Working Group
ELASTICITY
Source: http://www.today.com/news/remember-stretch-armstrong-how-buy-your-favorite-retro-toys-your-1D80377927
HOW FAR WILL HE STRETCH? WILL HE BREAK WHEN STRETCHED?
AS YOU STRETCH HIM DOES IT GET HARDER TO STRETCH HIM MORE?
WHEN I LET GO DOES HE RETURN TO HIS ORIGINAL SHAPE?
HOW LONG DOES HE TAKE TO RETURN TO HIS NORMAL SHAPE?
RESULTS COMPARED
# Submissions: 2

Submission ID            Dell_12g               Dell_13g               Comment
Cloud Type               Private / White box    Private / White box
Hardware Platform        12g, 7x R720 Compute   13g, 9x R630 Compute
Job Date                 03/05/2016             06/08/2016

Metrics
Scalability@AIs          10.3@10                29.5@20                Higher is better
Scalability per AI       1.03                   1.45                   Higher is better
Elasticity               63.0%                  71.9%                  Higher is better
Inst. Prov. Time (s)     163                    135                    Lower is better
AI Prov. Success         100%                   86.96%                 Higher is better
AI Run Success           100%                   100%                   Higher is better
Total Instances          65                     131                    Higher is better

Baseline Phase
YCSB Throughput          13,082.6               17,742.0               Higher is better
KMeans Job time (s)      115.7                  109.7                  Lower is better

Elasticity Phase
YCSB Throughput          9,480.9                14,890.8               Higher is better
KMeans Job time (s)      211.5                  186.2                  Lower is better
BENCHMARK HARNESS
SUPPORTED WORKLOADS
BROWBEAT
REPEATABLE AUTOMATED TESTING
PROVISIONING TIME: RALLY
Automated VM provisioning
Nova Success rate

Source: Measuring the Cloud Using Rally & CloudBench, Douglas Shakshober, Red Hat Inc.
PCI-E Limits
For PCI-E Gen-3 capable slots. (http://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/)

▪ Gen-3 signals at 8 Gbps per lane; with scrambling and 128b/130b encoding (instead of 8b/10b encoding) nearly all of that is usable bandwidth per lane, so for example a PCI-E Gen-3 x8 link delivers an aggregate bandwidth of about 8 GB/s

SAS Limit
An LSI whitepaper, Switched SAS: Sharable, Scalable SAS Infrastructure (http://www.abacus.cz/prilohy/_5025/5025548/SAS_Switch_White%20Paper_US-EN_092210.pdf), shows how to calculate the SAS limit of an 8-lane controller port with a SAS bandwidth of 6 Gbps:

▪ 6 Gb/s x 8 lanes = 48 Gb/s per x8 port
▪ 48 Gb/s (8b/10b encoding) = 4.8 GB/s per port (per node)
▪ 4.8 GB/s per port x 88.33% (arbitration delays and additional framing) = 4320 MB/s per port
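Both limits are straightforward to recompute. The sketch below reproduces the arithmetic: PCIe Gen-3's 128b/130b encoding, and the SAS port calculation with its 8b/10b encoding plus a framing/arbitration derate factor (taken from the figures above).

```python
def pcie_gen3_gBps(lanes=8):
    """PCIe Gen-3: 8 GT/s per lane; 128b/130b encoding leaves 128/130
    of the line rate as payload. Returns payload GB/s for the link."""
    return 8.0 * lanes * (128.0 / 130.0) / 8.0  # Gb/s line rate -> GB/s payload

def sas_port_mBps(lane_gbps=6.0, lanes=8, efficiency=0.8833):
    """x8 SAS port: raw Gb/s, then 8b/10b encoding (10 line bits per
    payload byte), then framing/arbitration efficiency. Returns MB/s."""
    payload_gBps = lane_gbps * lanes / 10.0  # e.g. 48 Gb/s -> 4.8 GB/s
    return payload_gBps * 1000.0 * efficiency
```

An x8 Gen-3 slot therefore tops out just under 8 GB/s of payload bandwidth, comfortably above the ~4.2-4.3 GB/s of a 6 Gb/s x8 SAS port.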
IO LIMITS
CEILOMETER: High-level Architecture
Source: http://docs.openstack.org/developer/ceilometer/architecture.html
COLLECTD/GRAPHITE/GRAFANA
➢ Collectd
▪ Daemon to collect System Performance Statistics
▪ CPU, Memory, Disk, Network, Process, MariaDB, Load, Logged errors and more
➢ Graphite/Carbon
▪ Carbon receives metrics and flushes them to whisper database files
▪ Graphite is the webapp frontend to Carbon
➢ Grafana
▪ Visualizes metrics from multiple backends
SPEC CLOUD WORKLOADS
YCSB
Framework with a common set of workloads for evaluating the performance of different key-value and cloud serving stores.
KMeans
- Hadoop-based, CPU-intensive workload
- Chose the Intel HiBench implementation