Sajag Chaturvedi, IT Infra Architect
Iwan ‘e1’ Rahabok, Product Manager
MGT1440BU
VMworld 2018 Content: Not for publication or distribution
Disclaimer
©2018 VMware, Inc.
This presentation may contain product features or functionality that are currently under development.
This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new features/functionality/technology discussed or presented, have not been determined.
Ek-Chai Distribution System Co. Ltd. “TESCO Lotus”
Our Business
• One of the largest retailers in Thailand, with >12M customers/week and >2,000 stores in 73 provinces
• Online service with >20,000 groceries, fresh food and non-food products for home delivery
• Available on Lazada, Southeast Asia’s leading e-commerce portal
Our People
• >5,000 full-time employees
• >500 IT staff (outsourced model) supporting the constant requests of internal customers in TESCO Lotus
Our Mission
• Provide innovative products and services to create a good shopping experience, such as the TESCO Lotus mobile application with e-coupons
• >11 million Clubcard members on our Thank You programmes
VMware Engagement
• In 2016: migrate from the existing TESCO Private Cloud (on XenServer) to address the concerns of TESCO and to transform TESCO to be the first TESCO Cloud Service (TCS) in Asia
• In 2017: adopt Operationalize Your World
6-month Performance
Taken from Tier 2 cluster.
Benefits over 6 months
• Help Desk Tickets: reduced by 40% (mostly VM Owner complaints on performance)
• IaaS Uptime: 100%! Yes! (achievement unlocked)
• Compute Saving: 14 blades + 1 chassis
• No. of Datastores: 204 → 42
• Storage Saving: 110 TB
• $ Saving: US$ 657K
The Rise and Fall of Infrastructure Architect
[Diagram labels: Commodity (x86), Utility (Cloud), Services (Virtualization); Business Architect, Service Architect, Systems Architect; Standardized vs. Bespoke; Relevance to Business; Impact on your work; Impact on you]
Your World. Not operationalized.
Your Customers: “Need more HW! Your infra is slow”
You
Your Boss: “Your infra is not full. Add more VM!”
A day in the life of the Operations Team…
A VM Owner complains to the IaaS Team that her VM is slow.
Her application architect has verified that:
• Windows CPU and RAM utilization is good.
• Disk latency is good. No dropped network packets.
A. Check ESXi utilization. If it’s low, tell her to doubt no more.
B. Buy her a nice lunch + flower. Ask her to forget about it.
C. Call your VMware TAM & MCS. That’s why you pay them, right?
D. Roll up your sleeve! You are born for this!
The answer you want
E. Help Desk can answer within 60 seconds
• Is the problem caused by IaaS not serving the VM well?
• If yes…
  • Which part of the Infra: CPU, RAM, Disk, Network?
  • How bad is the problem?
  • Alert already triggered, and IT is already working on it.
• If not, how to prove it convincingly?
List of VMs with respective Tier, with info on how IaaS is serving them
• Performance
• Utilization
Introducing…
Where to Measure Performance
• Guest OS: VM Utilization, VM Capacity
• VM: VM Contention, Performance SLA
• ESXi + physical infrastructure: Infrastructure Utilization, Infrastructure Capacity
Tracking how well a VM is served
Provisioned
0 vCPU or 0 GB
4 vCPU or 16 GB
Contention is when the VM is not getting all it wants.
What the VM wants
What the VM uses
Performance SLA: Example
Service Tier | CPU | RAM | Network | Storage
1 (highest) | <1% CPU Contention | <0.1% RAM Contention | 0 dropped packets | 10 ms latency
2 | <5% CPU Contention | <3% RAM Contention | 0 dropped packets | 20 ms latency
3 (lowest) | <10% CPU Contention | <6% RAM Contention | 0 dropped packets | 40 ms latency
Thresholds apply to each VM.
CPU Contention = CPU Ready.
VM Network: based on TX only.
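A minimal sketch of how one 5-minute sample could be checked against the tier thresholds above. The function name, the dictionary layout and the sample values are illustrative assumptions, not part of the presenters' tooling. It also assumes the storage latency figure is an upper bound that may be met exactly, while the contention figures are strict ("<").

```python
# Thresholds transcribed from the example SLA table (units: %, %, ms).
SLA_BY_TIER = {
    1: {"cpu": 1.0, "ram": 0.1, "disk_ms": 10.0},
    2: {"cpu": 5.0, "ram": 3.0, "disk_ms": 20.0},
    3: {"cpu": 10.0, "ram": 6.0, "disk_ms": 40.0},
}


def sla_breaches(tier, cpu_contention_pct, ram_contention_pct,
                 tx_dropped_packets, disk_latency_ms):
    """Return the SLA types that one 5-minute sample of a VM fails."""
    limits = SLA_BY_TIER[tier]
    breaches = []
    # The table states "<N%", so a value equal to N already counts as a breach.
    if cpu_contention_pct >= limits["cpu"]:
        breaches.append("CPU")
    if ram_contention_pct >= limits["ram"]:
        breaches.append("RAM")
    # Every tier expects zero dropped TX packets.
    if tx_dropped_packets > 0:
        breaches.append("Network")
    # Assumption: the latency figure is a limit that may be met exactly.
    if disk_latency_ms > limits["disk_ms"]:
        breaches.append("Storage")
    return breaches


# Hypothetical sample: fine for Tier 3, but too slow on disk for Tier 1.
sample = dict(cpu_contention_pct=0.4, ram_contention_pct=0.0,
              tx_dropped_packets=0, disk_latency_ms=12.0)
print(sla_breaches(1, **sample))  # ['Storage']
print(sla_breaches(3, **sample))  # []
```

The same sample passes or fails depending on the tier, which is why each VM needs exactly one tier assigned.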
The Ask from Your CIO
Prove that your IaaS is serving all the VMs well.
• Every single VM. No one left out.
• CPU. RAM. Network. Disk.
• Every 5 minutes. Without fail.
You have 1,000 VMs. How to prove?
Proof that your IaaS is performing
• 12 data points in 1 hour
• 288 data points in 1 day
• 8,766 data points in 1 month
1,000 VMs × 4 metrics (CPU, RAM, Network, Storage) × 8,766 data points = 35,064,000 data points per month
35K per VM per month
How to show?!
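The arithmetic behind these figures can be reproduced in a few lines. This is a sketch: the constant names are made up for illustration, and the month length of 365.25 / 12 days is an assumption chosen because it reproduces the 8,766 figure.

```python
SAMPLE_INTERVAL_MINUTES = 5
METRICS_PER_VM = 4           # CPU, RAM, Network, Storage
VM_COUNT = 1_000
# Assumption: an average month of 365.25 / 12 days, which yields 8,766 samples.
DAYS_PER_MONTH = 365.25 / 12

points_per_hour = 60 // SAMPLE_INTERVAL_MINUTES
points_per_day = points_per_hour * 24
points_per_month = round(points_per_day * DAYS_PER_MONTH)
points_per_vm = METRICS_PER_VM * points_per_month
total_points = VM_COUNT * points_per_vm

print(points_per_hour, points_per_day, points_per_month)  # 12 288 8766
print(points_per_vm)   # 35064, the "35K per VM per month"
print(total_points)    # 35064000
```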
Be Proactive, Not Reactive
If you do not have a performance threshold, you cannot be proactive.
Proactive Window: from the time you know it’s a problem to the time your customers know it’s a problem if you do nothing.
Performance SLA: Your defense line
35,000x with no complaint
If there was no complaint in the past 35,000 measurements, why complain now?
No agreement, Unset Expectation, Variable Expectation, Relationship Driven
Performance Monitoring: Are we serving all the VMs well?
Performance Monitoring: Do Tier 1 Apps need more Infrastructure?
• On the big screens. Transparency.
• All Critical Apps/Systems
• Drill down to actual counter.
3 Levels of Monitoring
Layer: Business
Sub layers: Business Results, Business Transaction
Sample metrics:
• How much in sales did we make today?
• How many customers bought our products this week?
• On average, how long did the XYZ transaction take this hour?
• How many customers logged in yesterday?
• On average, how long do they stay?
Layer: Application
Sub layers: “The System”, Individual Node
Sample metrics:
• How long did SQL Query ABCD take in the past 7 days?
• What was the SQL Server free memory value 1 hour ago?
• What’s the overall application uptime?
• Are my apps configured for performance?
Layer: Infrastructure
Sub layers: VM or Container, Virtual Infra, Physical Infra
Sample metrics:
• What’s the Windows CPU Run Queue?
• What’s the peak VM CPU Contention in the past 24 hours?
• What’s the total IO hitting vSAN from 9 – 6 pm yesterday?
• What’s the buffer in the physical switch right now?
KPI for Multi-tier Application
• Load balancer × 2
• Web tier: Web Server × 4
• Load balancer × 2
• App tier: App Server × 4
• DB tier: DB Server (Active), DB Server (Passive)
App KPI = Worst (Tier Performance)
App Tier KPI = Average (VM Performance)
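The two roll-up rules can be sketched as follows. The function names and the per-VM scores are hypothetical. The sketch assumes each VM already has a 0 – 100 score where higher is better, so "worst" becomes the minimum.

```python
from statistics import mean


def tier_kpi(vm_scores):
    """Tier KPI = average of the VM scores in that tier."""
    return mean(vm_scores)


def app_kpi(tiers):
    """App KPI = worst tier KPI (lowest score, assuming higher is better)."""
    return min(tier_kpi(scores) for scores in tiers.values())


# Hypothetical per-VM scores for one application.
tiers = {
    "web": [87.5, 87.5, 62.5, 87.5],
    "app": [87.5, 37.5, 87.5, 87.5],
    "db": [87.5, 87.5],
}
print({name: tier_kpi(scores) for name, scores in tiers.items()})
print(app_kpi(tiers))  # 75.0, set by the app tier
```

Averaging within a tier smooths out one struggling VM among identical peers, while taking the worst across tiers keeps a single bad tier from being hidden by healthy ones.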
KPI of a single VM
How do we define the performance of a single VM, from an infrastructure viewpoint?
Component | Metric | Green | Yellow | Orange | Red
CPU | Guest OS CPU Context Switch | 0 – 1K | < 10K | < 100K | > 100K
CPU | VM CPU Usage (%) | 0 – 70% | > 70% | > 80% | > 90%
CPU | VM CPU Co-Stop (%) | 0 – 2.5% | > 2.5% | > 5% | > 7.5%
CPU | VM CPU Ready (%) | 0 – 2.5% | > 2.5% | > 5% | > 7.5%
RAM | VM RAM Contention (%) | 0 – 1% | > 1% | > 3% | > 5%
RAM | Guest OS RAM Free (MB) | > 512 | > 256 | > 128 | ≤ 128
RAM | Guest OS RAM Page-in Rate as a % of Used | 0 – 1% | > 1% | > 3% | > 5%
Disk | VM Disk Latency (ms) | 0 – 10 | > 10 | > 20 | > 30
Network | VM Network TX Dropped Packet | 0% | > 0% | > 1% | > 2%
Proactive, not alert based.
Red does not mean emergency. It means you need to take a look within a few days.
Logic
Metric | Green | Yellow | Orange | Red
VM Disk Latency (ms) | 0 – 10 | > 10 | > 20 | > 30
Guest OS RAM Free (MB) | > 512 | > 256 | > 128 | ≤ 128
Score range | 75 – 100 | 50 – 75 | 25 – 50 | 0 – 25
Mid-point | 87.5 | 62.5 | 37.5 | 12.5
“Flip” & translate into 0 – 100 range.
Take Mid-Point to represent the range.
If CPU Usage > 90 Then RED Elseif CPU Usage > 80 Then ORANGE Elseif CPU Usage > 70 Then YELLOW Else GREEN
If CPU Usage > 90 Then 12.5 Elseif CPU Usage > 80 Then 37.5 Elseif CPU Usage > 70 Then 62.5 Else 87.5
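The "flip and take the mid-point" step can be illustrated in Python. This is a sketch of the logic only, with illustrative function names. It splits the 0 – 100 range into four equal bands, uses each band's mid-point as the score, and then applies the CPU Usage rule from the slide.

```python
def band_midpoints(bands=4, scale=100.0):
    """Mid-points of equal-width bands, ordered from worst (red) to best (green)."""
    width = scale / bands
    return [width * i + width / 2 for i in range(bands)]


# Red is the lowest band (0 - 25) and green the highest (75 - 100).
RED, ORANGE, YELLOW, GREEN = band_midpoints()


def cpu_usage_score(cpu_usage_pct):
    """Translate VM CPU Usage (%) into a 0 - 100 score; higher is healthier."""
    if cpu_usage_pct > 90:
        return RED
    elif cpu_usage_pct > 80:
        return ORANGE
    elif cpu_usage_pct > 70:
        return YELLOW
    return GREEN


print(band_midpoints())     # [12.5, 37.5, 62.5, 87.5]
print(cpu_usage_score(85))  # 37.5
```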
VM KPI: Implementation
• Super metric used to create VM KPI metric.
• Zooming into the code for visibility.
• Limitation:
  – Suitable only for Tier 1 VM.
  – For Tier 2, clone this metric and adjust the threshold to map your Class of Service.
Avg ([
Max([0,${this, metric=guest|contextSwapRate_latest} as CPU_Context]) > 100000 ? 12.5 : (Max([0,CPU_Context]) > 10000 ? 37.5 : (Max([0,CPU_Context]) > 1000 ? 62.5 : 87.5)),
${this, metric=cpu|usage_average} as CPU_Usage > 90 ? 12.5 : (CPU_Usage > 80 ? 37.5 : (CPU_Usage > 70 ? 62.5 : 87.5)),
${this, metric=cpu|readyPct} as CPU_Ready > 7.5 ? 12.5 : (CPU_Ready > 5 ? 37.5 : (CPU_Ready > 2.5 ? 62.5 : 87.5)),
${this, metric=cpu|costopPct} as CPU_CoStop > 7.5 ? 12.5 : (CPU_CoStop > 5 ? 37.5 : (CPU_CoStop > 2.5 ? 62.5 : 87.5)),
${this, metric=mem|host_contentionPct} as RAM_Contention > 5 ? 12.5 : (RAM_Contention > 3 ? 37.5 : (RAM_Contention > 1 ? 62.5 : 87.5)),
${this, metric=Super Metric|sm_9c6d1a2e-ae93-49d9-957b-d93643f5e3e4} as RAM_Free < 128 ? 12.5 : (RAM_Free < 256 ? 37.5 : (RAM_Free < 512 ? 62.5 : 87.5)),
${this, metric=Super Metric|sm_01b91453-1722-4d93-b2b1-066be0596dd0} as RAM_PageIn > 5 ? 12.5 : (RAM_PageIn > 3 ? 37.5 : (RAM_PageIn > 1 ? 62.5 : 87.5)),
${this, metric=Super Metric|sm_e5980700-3d2c-428e-996f-78346304b723} as Net_DroppedTX > 2 ? 12.5 : (Net_DroppedTX > 1 ? 37.5 : (Net_DroppedTX > 0 ? 62.5 : 87.5)),
${this, metric=virtualDisk:Aggregate of all instances|totalLatency} as Disk_Latency > 30 ? 12.5 : (Disk_Latency > 20 ? 37.5 : (Disk_Latency > 10 ? 62.5 : 87.5))
])
Zoomed-in view (cut off on the slide):
Avg ([${this, metric=cpu|usage_average} as CPU_Usage > 90 ? 12.5 : ( CPU_Usage > 80 ? 37.5 : ( C…
${this, metric=mem|host_contentionPct} as RAM_Contention > 5 ? 12.5 : ( RAM_Contention > 3 ? 37.5 : ( R…
])
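For readers who want to test the thresholds outside vRealize Operations, the same averaging idea can be written as a data-driven Python sketch. The dictionary keys are hypothetical names loosely based on the aliases in the super metric, and the sample values are invented. The thresholds are the Tier 1 values from the expression.

```python
from statistics import mean

RED, ORANGE, YELLOW, GREEN = 12.5, 37.5, 62.5, 87.5

# key: (red, orange, yellow thresholds, True when higher values are worse)
RULES = {
    "cpu_context_switch": (100_000, 10_000, 1_000, True),
    "cpu_usage_pct": (90, 80, 70, True),
    "cpu_ready_pct": (7.5, 5, 2.5, True),
    "cpu_costop_pct": (7.5, 5, 2.5, True),
    "ram_contention_pct": (5, 3, 1, True),
    "ram_free_mb": (128, 256, 512, False),
    "ram_page_in_pct": (5, 3, 1, True),
    "net_tx_dropped_pct": (2, 1, 0, True),
    "disk_latency_ms": (30, 20, 10, True),
}


def band_score(value, red, orange, yellow, higher_is_worse=True):
    """Map one metric value to the mid-point score of its color band."""
    if not higher_is_worse:
        # Negate so that "below the threshold" becomes "above the threshold".
        value, red, orange, yellow = -value, -red, -orange, -yellow
    if value > red:
        return RED
    if value > orange:
        return ORANGE
    if value > yellow:
        return YELLOW
    return GREEN


def vm_kpi(sample):
    """VM KPI = average of the per-metric scores, as in the Avg([...]) call."""
    return mean(band_score(sample[name], *rule) for name, rule in RULES.items())


# Hypothetical healthy VM, then the same VM with CPU usage at 95%.
healthy = {
    "cpu_context_switch": 500, "cpu_usage_pct": 40, "cpu_ready_pct": 0.5,
    "cpu_costop_pct": 0.0, "ram_contention_pct": 0.0, "ram_free_mb": 2048,
    "ram_page_in_pct": 0.2, "net_tx_dropped_pct": 0, "disk_latency_ms": 3,
}
busy = dict(healthy, cpu_usage_pct=95)
print(vm_kpi(healthy))  # 87.5
print(vm_kpi(busy))
```

Keeping the thresholds in one table makes the Tier 2 variant a matter of swapping the numbers, which mirrors the "clone this metric and adjust the threshold" advice.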
Tier 1 Apps: From Apps to VM Metric
Do you serve everyone well?
Cluster Performance = 100%: it is serving all its VMs well. 100% of VMs are served in all 4 elements of IaaS (CPU, RAM, Disk, Network) as per SLA.
Cluster Performance = 0%: it fails to serve any of its VMs well. 0% of VMs are served.
CPU, RAM, Disk and Network refer to Latency or Contention, not Utilization.
How to standardize, since clusters aren’t equal?
• Different number of VMs
• Different SLA
Performance
Quantifying Cluster Performance
Cluster SLA Failure (%) = ( Σ (SLA Type Failure per VM) / No of running VMs in the cluster ) / 4 SLA types × 100%
Cluster Performance (%) = 100% − Cluster SLA Failure (%)
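A short worked example of the formula in Python. The function name and the failure counts are illustrative assumptions. Each entry is the number of SLA types (0 to 4) that one running VM currently fails.

```python
def cluster_performance_pct(failures_per_vm, sla_types=4):
    """Cluster Performance (%) = 100% - Cluster SLA Failure (%)."""
    if not failures_per_vm:
        raise ValueError("the cluster has no running VMs")
    avg_failures = sum(failures_per_vm) / len(failures_per_vm)
    failure_pct = avg_failures / sla_types * 100.0
    return 100.0 - failure_pct


print(cluster_performance_pct([0, 0, 0, 0]))  # 100.0, every VM served on all 4
print(cluster_performance_pct([4, 4, 4, 4]))  # 0.0, every VM failing on all 4
print(cluster_performance_pct([1, 0, 0, 0]))  # 93.75, one VM failing one SLA type
```

Dividing by the number of running VMs and by the 4 SLA types is what makes clusters of different sizes comparable on the same 0 – 100% scale.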
Implementation: super metric
1 set for each Tier: CPU, RAM, Disk. Network is expected to be 0.
Assign 1 set to every VM. Each VM belongs to 1 tier only.
Workaround as we need to consolidate into 1 supermetric
Sum ([
${this, metric=net|droppedTx_summation} > 0 ? 1 : 0,
${this, metric=mem|host_contentionPct} > ${this, metric=Super Metric|sm_4a3bd0c0-c897-4baf-a60e-4bea139e537b} ? 1 : 0,
${this, metric=cpu|capacity_contentionPct} > ${this, metric=Super Metric|sm_20ff3c62-0185-47a8-9bdc-a96f3081a2a8} ? 1 : 0,
${this, metric=virtualDisk:Aggregate of all instances|totalLatency} > ${this, metric=Super Metric|sm_cc38300b-d116-4040-bdd2-76b8ba3cd360} ? 1 : 0
])
Performance Monitoring: Compute performance
Key Takeaways
• If you don’t have SLA, you are a System Builder, not a Service Provider.
• Capacity is defined by Availability & Performance.
• Availability SLA needs to be accompanied by Performance SLA.
Using Older Version of vRealize? You are Missing out!
Get free help to deploy or upgrade to the latest version today!
Quicksilver: For a limited time, the VMware cloud management BU is offering help AT NO ADDITIONAL COST to bring qualified customers up to date with their vRealize deployment.
If you are behind on your version (or never fully deployed), we can bring you to the latest of vRealize Automation, vRealize Operations, and Lifecycle Manager (LCM).
Email [email protected] to qualify and for next steps.
vROps integrated w/ Wavefront
The Future of Cloud Monitoring and Analytics
• Instant analytics. Full stack correlation. Intelligent alerting.
• Cloud-native with unrivalled scale and performance. 100K containers. No aggregation or lost answers.
• 1-second resolution, 18 month retention.
• Faster business, faster software development: are your APM tools providing enough coverage and problem resolution?
• Enough granularity, retention, correlations?
• Would a simple, cost-effective way to monitor apps & middleware (SAP, MongoDB, etc.) and infrastructure help?
• Are you using Hyperic or EPOps?
• Do you have new plans/tools to monitor your modern application stacks?
• Do you have the tools to monitor custom applications, containers, PKS/PAS, AWS/Azure/GCP, Serverless?
Customers Love Wavefront!!!! Zero Churn!! Zero Shelfware!!
Jing Zhao, DevOps Engineer
“Wavefront’s powerful query language allows us to easily visualize and debug our time series data. Its alerting is flexible enough to notify us through email, Slack, and PagerDuty based on severity. Well-tuned alerts helps us lower MTTD . We encourage our engineers to customize their own metrics to fully monitor their system’s health and performance.”
John Beatty, Cofounder and CEO
“Before Wavefront, we would have no clue about what is going on. Now we can fix software bugs and validate that the changes had exactly the effects we intended.”
Pierre-Alexandre Masse, Engineering Director
“Wavefront gives us very quick insights and the best query language we could find to explore and understand our data. Since we rely on data and metrics to make our decisions, Wavefront is an essential and indispensable part of our day-to-day operations.”
Bob Muglia, CEO
“Wavefront has made a huge impact in helping us identify and reduce operational performance issues. Its intuitive, but advanced representation of our data delivers time and actionable insights, help us meet our customers SLA. Our operations team starts their day checking Wavefront, and uses it all day long. It’s an awesome product.”
SaaS Leaders Use Wavefront
Monitoring Cloud-native Applications and Infrastructure
Free Trial: wavefront.com/sign-up
Contact: [email protected]
PLEASE FILL OUT YOUR SURVEY.
Take a survey and enter a drawing for a VMware company store gift card.
#vmworld #MGT1440BU
THANK YOU!