
Sajag Chaturvedi, IT Infra Architect

Iwan ‘e1’ Rahabok, Product Manager


MGT1440BU

VMworld 2018 Content: Not for publication or distribution

Disclaimer

©2018 VMware, Inc.

This presentation may contain product features or functionality that are currently under development.

This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new features/functionality/technology discussed or presented have not been determined.



• One of the largest retailers in Thailand, with >12M customers/week and >2,000 stores in 73 provinces
• Online service with >20,000 groceries, fresh food and non-food products for home delivery
• Available on Lazada, Southeast Asia’s leading e-commerce portal

• >5,000 full-time employees
• >500 IT staff (outsourced model) supporting the constant requests of internal customers in TESCO Lotus

Our People

• Provide innovative products and services to create good shopping experience such as TESCO Lotus Mobile application with e-coupons

• >11 million Clubcard members on our Thank You programmes.

Our Mission

VMware Engagement
• In 2016: migrated from the existing TESCO Private Cloud (on XenServer) to address the concerns of TESCO and to transform it into the first TESCO Cloud Service (TCS) in Asia.
• In 2017: adopted Operationalize Your World.

Our Business

Ek-Chai Distribution System Co. Ltd. “TESCO Lotus”



6-month Performance

Taken from Tier 2 cluster.



Benefits over 6 months

Help Desk Tickets: reduced by 40% (mostly VM Owner complaints about performance)
IaaS Uptime: 100%! Yes! (achievement unlocked)
Compute Saving: 14 blades + 1 chassis
No. of Datastores: 204 → 42
Storage Saving: 110 TB
$ Saving: US$ 657K




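The Rise and Fall of Infrastructure Architect

[Figure: architect roles (Business Architect, Service Architect, Systems Architect) plotted by Relevance to Business as infrastructure moves from Bespoke to Standardized. Stage labels: X86, Virtualization, Cloud; Commodity, Utility, Services. Annotations: "Impact on your work" and "Impact on you".]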



Your World. Not operationalized.

Your Customers: “Need more HW! Your infra is slow”

Your Boss: “Your infra is not full. Add more VM!”

You



A day in the life of the Operations Team…

A VM Owner complains to the IaaS Team that her VM is slow.

Her application architect has verified that:
• Windows CPU and RAM utilization is good.
• Disk latency is good. No dropped network packets.

A. Check ESXi utilization. If it’s low, tell her to doubt no more.
B. Buy her a nice lunch + flowers. Ask her to forget about it.
C. Call your VMware TAM & MCS. That’s why you pay them, right?
D. Roll up your sleeves! You were born for this!



The answer you want

E. Help Desk can answer within 60 seconds

• Is the problem caused by IaaS not serving the VM well?
• If yes…
  – Which part of the infra: CPU, RAM, Disk, Network?
  – How bad is the problem?
  – Alert already triggered, and IT is already working on it.
• If not, how to prove it convincingly?


List of VMs with their respective Tier, with info on how IaaS is serving them

Performance

Utilization


Introducing…




Where to Measure Performance

[Figure: the layers where metrics are measured.
• Guest OS
• VM: VM Utilization, VM Capacity, VM Contention (Performance SLA)
• ESXi + physical infrastructure: Infrastructure Utilization, Infrastructure Capacity]



Tracking how well a VM is served

[Figure: a bar running from 0 vCPU or 0 GB up to the Provisioned 4 vCPU or 16 GB, marking what the VM uses and what the VM wants.]

Contention is when the VM is not getting all it wants.



Performance SLA: Example

For each VM:

Service Tier   CPU                    RAM                     Network             Storage
1 (highest)    <1% CPU Contention     <0.1% RAM Contention    0 dropped packets   10 ms latency
2              <5% CPU Contention     <3% RAM Contention      0 dropped packets   20 ms latency
3 (lowest)     <10% CPU Contention    <6% RAM Contention      0 dropped packets   40 ms latency

CPU Contention = CPU Ready. VM Network: based on TX only.
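As an illustration only, the example SLA table can be expressed as a lookup plus a per-VM check. This is a minimal Python sketch, not how vRealize Operations implements it. The names `TierSLA`, `SLA_BY_TIER` and `sla_breaches` are hypothetical, and it assumes that a storage latency strictly above the tier's figure counts as a breach (the slide does not say whether the limit is inclusive).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSLA:
    """Per-VM limits for one service tier (values from the example table)."""
    cpu_contention_pct: float   # SLA is met when contention is below this
    ram_contention_pct: float   # SLA is met when contention is below this
    dropped_tx_packets: int     # SLA is met when drops do not exceed this
    disk_latency_ms: float      # assumption: met when latency is at or below this


SLA_BY_TIER = {
    1: TierSLA(1.0, 0.1, 0, 10.0),
    2: TierSLA(5.0, 3.0, 0, 20.0),
    3: TierSLA(10.0, 6.0, 0, 40.0),
}


def sla_breaches(tier, cpu_contention_pct, ram_contention_pct,
                 dropped_tx_packets, disk_latency_ms):
    """Return the IaaS elements that breach the tier's SLA for one sample."""
    sla = SLA_BY_TIER[tier]
    breaches = []
    if cpu_contention_pct >= sla.cpu_contention_pct:
        breaches.append("CPU")
    if ram_contention_pct >= sla.ram_contention_pct:
        breaches.append("RAM")
    if dropped_tx_packets > sla.dropped_tx_packets:
        breaches.append("Network")
    if disk_latency_ms > sla.disk_latency_ms:
        breaches.append("Storage")
    return breaches
```

The same 5-minute sample can pass in one tier and fail in another: 6% CPU contention breaches Tier 2 but not Tier 3.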

The Ask from Your CIO

Prove that your IaaS is serving all the VMs well.
• Every 5 minutes. Without fail.
• CPU. RAM. Network. Disk.
• Every single VM. No one left out.

You have 1000 VMs. How to prove?


Proof that your IaaS is performing

• 12 data points in 1 hour
• 288 data points in 1 day
• 8,766 data points in 1 month

1,000 VMs × 4 metrics (CPU, RAM, Network, Storage) × 8,766 data points = 35,064,000

That is 35K per VM per month. How to show?!
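The arithmetic behind these figures can be reproduced with a short Python sketch. The function name `data_points` is hypothetical. The month length of 365.25 / 12 days is an assumption, chosen because it reproduces the 8,766 figure on the slide.

```python
def data_points(vm_count=1000, metrics_per_vm=4, interval_minutes=5):
    """Count the samples collected at a fixed interval.

    Assumption: one month is 365.25 / 12 days, which yields the
    slide's 8,766 samples per metric per month.
    """
    per_hour = 60 // interval_minutes
    per_day = per_hour * 24
    per_month = round(per_day * 365.25 / 12)
    per_vm_per_month = per_month * metrics_per_vm
    return {
        "per_hour": per_hour,
        "per_day": per_day,
        "per_month": per_month,
        "per_vm_per_month": per_vm_per_month,
        "total": per_vm_per_month * vm_count,
    }
```

Calling `data_points()` with the defaults gives 35,064 samples per VM per month, which is the "35K" quoted on the slide.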



Be Proactive, Not Reactive

If you do not have a performance threshold, you cannot be proactive.

[Figure: timeline. The Proactive Window runs from the time you know it’s a problem to the time your customers know it’s a problem if you do nothing.]



Performance SLA: Your defense line

35,000x with no complaint

If there was no complaint in the past 35,000 measurements, why complain now?

No agreement • Unset Expectation • Variable Expectation • Relationship Driven


Performance Monitoring: Are we serving all the VMs well?



On the big screens. Transparency

All Critical Apps/Systems

Drill down to actual counter.

Performance Monitoring: Do Tier 1 Apps need more Infrastructure?



3 Levels of Monitoring

Layers, Sub Layers and Sample Metrics

Business (sub layers: Business Results, Business Transaction)
• How much sales did we make today?
• How many customers bought our products this week?
• On average, how long did the XYZ transaction take this hour?
• How many customers logged in yesterday?
• On average, how long do they stay?

Application (sub layers: “The System”, Individual Node)
• How long did SQL Query ABCD take in the past 7 days?
• What was the SQL Server free memory value 1 hour ago?
• What’s the overall application uptime?
• Are my apps configured for performance?

Infrastructure (sub layers: VM or Container, Virtual Infra, Physical Infra)
• What’s the Windows CPU Run Queue?
• What’s the peak VM CPU Contention in the past 24 hours?
• What’s the total IO hitting vSAN from 9 – 6 pm yesterday?
• What’s the buffer in the physical switch right now?



KPI for Multi-tier Application

[Figure: a multi-tier application. A pair of load balancers fronts four Web Servers, a second pair of load balancers fronts four App Servers, and the database tier is DB Server (Active) plus DB Server (Passive).]

App Tier KPI = Average (VM Performance)

App KPI = Worst (Tier Performance)
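The two roll-up rules can be sketched in a few lines of Python. This is illustrative only. The function names and the sample scores are made up, and it assumes each VM already has a 0 to 100 performance score where higher is better, so "worst" means the minimum.

```python
from statistics import mean


def tier_kpi(vm_scores):
    """App Tier KPI = average of the VM performance scores in that tier."""
    return mean(vm_scores)


def app_kpi(scores_by_tier):
    """App KPI = the worst (lowest) tier KPI across all tiers."""
    return min(tier_kpi(scores) for scores in scores_by_tier.values())


# Hypothetical scores for the tiers in the figure.
example = {
    "web": [87.5, 87.5, 62.5, 87.5],
    "app": [87.5, 87.5, 87.5, 87.5],
    "db": [37.5, 87.5],
}
```

With these sample scores the web tier averages 81.25 and the database tier 62.5, so the application as a whole is reported at 62.5. Averaging within a tier reflects that the VMs share the load, while taking the worst across tiers reflects that one struggling tier slows the whole application.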



KPI of a single VM

How do we define the performance of a single VM, from the infrastructure viewpoint?

Component  Metric                                      Green       Yellow     Orange     Red
CPU        Guest OS CPU Context Switch                 0 – 1K      < 10K      < 100K     > 100K
CPU        VM CPU Usage (%)                            0 – 70%     > 70%      > 80%      > 90%
CPU        VM CPU Co-Stop (%)                          0 – 2.5%    > 2.5%     > 5%       > 7.5%
CPU        VM CPU Ready (%)                            0 – 2.5%    > 2.5%     > 5%       > 7.5%
RAM        VM RAM Contention (%)                       0 – 1%      > 1%       > 3%       > 5%
RAM        Guest OS RAM Free (MB)                      > 512 MB    > 256 MB   > 128 MB   ≤ 128 MB
RAM        Guest OS RAM Page-in Rate as a % of Used    0 – 1%      > 1%       > 3%       > 5%
Disk       VM Disk Latency (ms)                        0 – 10 ms   > 10 ms    > 20 ms    > 30 ms
Network    VM Network TX Dropped Packet (%)            0%          > 0%       > 1%       > 2%

Proactive, not alert based

Red does not mean emergency. It means you need to take a look within a few days.


Logic

Metric                    Green       Yellow     Orange     Red
VM Disk Latency (ms)      0 – 10 ms   > 10 ms    > 20 ms    > 30 ms
Guest OS RAM Free (MB)    > 512 MB    > 256 MB   > 128 MB   ≤ 128 MB
Range                     75 – 100    50 – 75    25 – 50    0 – 25
Mid-point                 87.5        62.5       37.5       12.5

“Flip” & translate into the 0 – 100 range. Take the mid-point to represent the range.

If CPU Usage > 90 Then RED Elseif CPU Usage > 80 Then ORANGE Elseif CPU Usage > 70 Then YELLOW Else GREEN

If CPU Usage > 90 Then 12.5 Elseif CPU Usage > 80 Then 37.5 Elseif CPU Usage > 70 Then 62.5 Else 87.5
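The colour-to-number step can be sketched as follows. This is a minimal Python illustration with hypothetical names (`BANDS`, `band_midpoint`, `cpu_usage_color`). It assumes four equal bands across 0 to 100, ordered from worst to best, which is what the ranges on the slide imply.

```python
# Bands ordered from worst to best, so a higher score means better service.
BANDS = ["RED", "ORANGE", "YELLOW", "GREEN"]


def band_range(color):
    """Return the (low, high) slice of the 0-100 scale for a colour."""
    width = 100 / len(BANDS)
    index = BANDS.index(color)
    return (index * width, (index + 1) * width)


def band_midpoint(color):
    """Represent a colour band by the mid-point of its range."""
    low, high = band_range(color)
    return (low + high) / 2


def cpu_usage_color(cpu_usage_pct):
    """Classify VM CPU Usage (%) with the thresholds from the slide."""
    if cpu_usage_pct > 90:
        return "RED"
    if cpu_usage_pct > 80:
        return "ORANGE"
    if cpu_usage_pct > 70:
        return "YELLOW"
    return "GREEN"
```

For example, `band_midpoint(cpu_usage_color(85))` classifies 85% usage as ORANGE and maps it to the middle of the 25 to 50 range.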


VM KPI: Implementation
• A super metric is used to create the VM KPI metric.
• Limitation:
  – Suitable only for Tier 1 VMs.
  – For Tier 2, clone this metric and adjust the thresholds to map to your Class of Service.

Avg ([
Max([0,${this, metric=guest|contextSwapRate_latest} as CPU_Context]) > 100000 ? 12.5 : (Max([0,CPU_Context]) > 10000 ? 37.5 : (Max([0,CPU_Context]) > 1000 ? 62.5 : 87.5)),
${this, metric=cpu|usage_average} as CPU_Usage > 90 ? 12.5 : (CPU_Usage > 80 ? 37.5 : (CPU_Usage > 70 ? 62.5 : 87.5)),
${this, metric=cpu|readyPct} as CPU_Ready > 7.5 ? 12.5 : (CPU_Ready > 5 ? 37.5 : (CPU_Ready > 2.5 ? 62.5 : 87.5)),
${this, metric=cpu|costopPct} as CPU_CoStop > 7.5 ? 12.5 : (CPU_CoStop > 5 ? 37.5 : (CPU_CoStop > 2.5 ? 62.5 : 87.5)),
${this, metric=mem|host_contentionPct} as RAM_Contention > 5 ? 12.5 : (RAM_Contention > 3 ? 37.5 : (RAM_Contention > 1 ? 62.5 : 87.5)),
${this, metric=Super Metric|sm_9c6d1a2e-ae93-49d9-957b-d93643f5e3e4} as RAM_Free < 128 ? 12.5 : (RAM_Free < 256 ? 37.5 : (RAM_Free < 512 ? 62.5 : 87.5)),
${this, metric=Super Metric|sm_01b91453-1722-4d93-b2b1-066be0596dd0} as RAM_PageIn > 5 ? 12.5 : (RAM_PageIn > 3 ? 37.5 : (RAM_PageIn > 1 ? 62.5 : 87.5)),
${this, metric=Super Metric|sm_e5980700-3d2c-428e-996f-78346304b723} as Net_DroppedTX > 2 ? 12.5 : (Net_DroppedTX > 1 ? 37.5 : (Net_DroppedTX > 0 ? 62.5 : 87.5)),
${this, metric=virtualDisk:Aggregate of all instances|totalLatency} as Disk_Latency > 30 ? 12.5 : (Disk_Latency > 20 ? 37.5 : (Disk_Latency > 10 ? 62.5 : 87.5))
])

Zooming into the code for visibility (excerpt, cut off on the slide):

Avg ([
${this, metric=cpu|usage_average} as CPU_Usage > 90 ? 12.5 : ( CPU_Usage > 80 ? 37.5 : ( C
${this, metric=mem|host_contentionPct} as RAM_Contention > 5 ? 12.5 : ( RAM_Contention > 3 ? 37.5 : ( R
])
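To make the structure of the super metric easier to follow, here is a rough Python equivalent. It is a sketch for reading purposes, not something vRealize Operations runs. The dictionary keys and function names are hypothetical, and the thresholds are the Tier 1 values from the expression above (including the strict `< 128` test for free RAM).

```python
from statistics import mean


def score_high_is_bad(value, yellow, orange, red):
    """Banded score for metrics where a larger value is worse."""
    if value > red:
        return 12.5
    if value > orange:
        return 37.5
    if value > yellow:
        return 62.5
    return 87.5


def score_low_is_bad(value, yellow, orange, red):
    """Banded score for metrics where a smaller value is worse (RAM free)."""
    if value < red:
        return 12.5
    if value < orange:
        return 37.5
    if value < yellow:
        return 62.5
    return 87.5


def vm_kpi(m):
    """Average the nine banded scores, mirroring the Avg([...]) expression."""
    return mean([
        score_high_is_bad(max(0, m["cpu_context_switch"]), 1000, 10000, 100000),
        score_high_is_bad(m["cpu_usage_pct"], 70, 80, 90),
        score_high_is_bad(m["cpu_ready_pct"], 2.5, 5, 7.5),
        score_high_is_bad(m["cpu_costop_pct"], 2.5, 5, 7.5),
        score_high_is_bad(m["ram_contention_pct"], 1, 3, 5),
        score_low_is_bad(m["ram_free_mb"], 512, 256, 128),
        score_high_is_bad(m["ram_page_in_pct"], 1, 3, 5),
        score_high_is_bad(m["net_dropped_tx_pct"], 0, 1, 2),
        score_high_is_bad(m["disk_latency_ms"], 10, 20, 30),
    ])


# Hypothetical sample for a healthy VM: every metric lands in the green band.
healthy = {
    "cpu_context_switch": 500, "cpu_usage_pct": 40, "cpu_ready_pct": 0.5,
    "cpu_costop_pct": 0.0, "ram_contention_pct": 0.0, "ram_free_mb": 2048,
    "ram_page_in_pct": 0.2, "net_dropped_tx_pct": 0.0, "disk_latency_ms": 4,
}
```

A VM that is green on all nine metrics scores 87.5, the highest possible value. Because the result is an average, a single red metric lowers the score but does not drag it to the bottom on its own.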



Tier 1 Apps: From Apps to VM Metric



Do you serve everyone well?

Cluster Performance = 100%: it’s serving all its VMs well. 100% of VMs are served in all 4 elements of IaaS (CPU, RAM, Disk, Network) as per the SLA.

Cluster Performance = 0%: it fails to serve any of its VMs well. 0% of VMs are served.

How to standardize, since clusters aren’t equal?
• Different number of VMs
• Different SLA

CPU, RAM, Disk and Network refer to Latency or Contention, not Utilization.



Quantifying Cluster Performance

$$\text{Cluster SLA Failure (\%)} = \frac{\sum (\text{SLA Type Failure per VM})}{\text{No. of running VMs in the cluster}} \div 4\ \text{SLA types} \times 100\%$$

$$\text{Cluster Performance (\%)} = 100\% - \text{Cluster SLA Failure (\%)}$$
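A small Python sketch of this formula, for illustration only. The function names are hypothetical. The input is assumed to be, for each running VM, the number of SLA types (0 to 4) it is currently failing.

```python
def cluster_sla_failure_pct(failures_per_vm, sla_types=4):
    """Cluster SLA Failure (%) from the per-VM count of failed SLA types."""
    if not failures_per_vm:
        raise ValueError("the cluster has no running VMs")
    average_failures = sum(failures_per_vm) / len(failures_per_vm)
    return average_failures / sla_types * 100.0


def cluster_performance_pct(failures_per_vm, sla_types=4):
    """Cluster Performance (%) = 100% minus Cluster SLA Failure (%)."""
    return 100.0 - cluster_sla_failure_pct(failures_per_vm, sla_types)
```

For a cluster of four VMs where one VM fails a single SLA type, `cluster_performance_pct([1, 0, 0, 0])` gives 93.75. Dividing by both the VM count and the number of SLA types is what makes clusters of different sizes comparable.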



Implementation: super metric


1 set for each Tier: CPU, RAM, Disk. Network is expected to be 0.

Assign 1 set to every VM. Each VM belongs to 1 tier only.

This is a workaround, as we need to consolidate into 1 super metric.


Sum ([
${this, metric=net|droppedTx_summation} > 0 ? 1 : 0,
${this, metric=mem|host_contentionPct} > ${this, metric=Super Metric|sm_4a3bd0c0-c897-4baf-a60e-4bea139e537b} ? 1 : 0,
${this, metric=cpu|capacity_contentionPct} > ${this, metric=Super Metric|sm_20ff3c62-0185-47a8-9bdc-a96f3081a2a8} ? 1 : 0,
${this, metric=virtualDisk:Aggregate of all instances|totalLatency} > ${this, metric=Super Metric|sm_cc38300b-d116-4040-bdd2-76b8ba3cd360} ? 1 : 0
])



Performance Monitoring: Compute performance


Key Takeaways

• If you don’t have an SLA, you are a System Builder, not a Service Provider.
• Capacity is defined by Availability & Performance.
• An Availability SLA needs to be accompanied by a Performance SLA.


Using Older Version of vRealize? You are Missing out!

Get free help to deploy or upgrade to the latest version today!


Quicksilver: For a limited time, the VMware cloud management BU is offering help AT NO ADDITIONAL COST to bring qualified customers up to date with their vRealize deployment.

If you are behind on your version (or never fully deployed), we can bring you to the latest of vRealize Automation, vRealize Operations, and Lifecycle Manager (LCM).

Email [email protected] to qualify and for next steps.


The Future of Cloud Monitoring and Analytics

vROps integrated w/

Instant analytics. Full stack correlation. Intelligent alerting.
Cloud-native with unrivalled scale and performance. 100K containers.
No aggregation or lost answers. 1-second resolution, 18-month retention.

• Faster business, faster software development: are your APM tools providing enough coverage and problem resolution?
• Enough granularity, retention, correlations?
• Would a simple, cost-effective way to monitor apps & middleware (SAP, MongoDB, etc.) and infrastructure help?
• Are you using Hyperic or EPOps?
• Do you have new plans/tools to monitor your modern application stacks?
• Do you have the tools to monitor custom applications, containers, PKS/PAS, AWS/Azure/GCP, Serverless?



Customers Love Wavefront!!!! Zero Churn!! Zero Shelfware!!

Jing Zhao, DevOps Engineer

“Wavefront’s powerful query language allows us to easily visualize and debug our time series data. Its alerting is flexible enough to notify us through email, Slack, and PagerDuty based on severity. Well-tuned alerts helps us lower MTTD . We encourage our engineers to customize their own metrics to fully monitor their system’s health and performance.”

John Beatty, Cofounder and CEO

“Before Wavefront, we would have no clue about what is going on. Now we can fix software bugs and validate that the changes had exactly the effects we intended.”

Pierre-Alexandre Masse, Engineering Director

“Wavefront gives us very quick insights and the best query language we could find to explore and understand our data. Since we rely on data and metrics to make our decisions, Wavefront is an essential and indispensable part of our day-to-day operations.”

Bob Muglia, CEO

“Wavefront has made a huge impact in helping us identify and reduce operational performance issues. Its intuitive, but advanced representation of our data delivers time and actionable insights, help us meet our customers SLA. Our operations team starts their day checking Wavefront, and uses it all day long. It’s an awesome product.”






SaaS Leaders Use Wavefront

Monitoring Cloud-native Applications and Infrastructure

Free Trial: wavefront.com/sign-up

Contact: [email protected]

PLEASE FILL OUT YOUR SURVEY. Take a survey and enter a drawing for a VMware company store gift card.

#vmworld #MGT1440BU


THANK YOU!
