Confidential

Workload Forecasting and Reporting

Damian Ward

NonStop Solutions Architect / BITUG Vice Chairman

Introduction: What am I going to talk about today

• About Me

• About VocaLink

• Part 1 – Some Theory

• Part 2 – Forecasts & Models

− Part 2a – Transaction Volume Forecast

− Part 2b – Improved Transaction Volume Forecast

− Part 2c – Workload Models

− Part 2d – Combining Forecast & Workload models

• Part 3 – Case Study

• Summary

• Questions? Please feel free to ask as we go through the presentation.

Introduction: About your presenter

• Damian Ward

• 20 years of HP NonStop and payments experience

• Career spanning:

− Operations, Application Programming, System Management, Programme Management, Technical Specialist, Solutions Architect, Enterprise Architect, Infrastructure Architect

• Specialities:

− HP NonStop systems and architecture, Enterprise Architecture, Encryption, Availability Management, ATM Systems, Payments Processing, Capacity Planning, System modelling, Fraud, Mobile and Internet technologies, Programming, Emerging Technologies and Robotics

• BITUG Vice Chairman 2011

• BITUG Chairman 2012

Introduction: VocaLink History

Introduction: Card processing landscape

• FIS Connex Advantage switch with resilient telecommunication connections to each customer

• Direct connection to in-house processing system

• Indirect ATM acquirer and card issuer connection (via VocaLink CSB)

• ATM and POS international acquiring and issuing connections via gateway connections to international schemes

• Connections to mobile operators

• Direct connection to Post Office systems

• Connections to overseas schemes and banks

• Indirect ATM connection (via third party processor), via TNS CSB

Introduction: Transaction Processing Peaks

PART 1 – SOME THEORY

Some Theory: Peak TPS vs Throughput?

• The third slide indicates a 482 tps peak.

• The bell curve, arrival rate, how measurements are taken and the averaging period could all account for the gap between peak TPS and sustained throughput.

• HOWEVER – I am using fictional transaction summary data based on real world observations.

• All transaction summary data used in this presentation is made up for the sole purpose of illustrating the models within this presentation.
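To show how much the averaging period alone moves the reported figure, here is a minimal sketch that computes the peak average TPS over a sliding window. The per-second counts are invented for illustration and the function name `peak_tps` is hypothetical; it is not how NonStop Measure itself samples.

```python
def peak_tps(per_second_counts, window_seconds):
    """Highest average TPS over any contiguous window of `window_seconds` samples."""
    if not 1 <= window_seconds <= len(per_second_counts):
        raise ValueError("window must be between 1 and the number of samples")
    window_sum = sum(per_second_counts[:window_seconds])
    best = window_sum
    for i in range(window_seconds, len(per_second_counts)):
        # Slide the window forward one second: add the newest sample, drop the oldest.
        window_sum += per_second_counts[i] - per_second_counts[i - window_seconds]
        best = max(best, window_sum)
    return best / window_seconds


if __name__ == "__main__":
    # One made-up minute of traffic: steady 300 tps with a three-second burst.
    samples = [300] * 25 + [480, 470, 460] + [300] * 32
    for window in (1, 10, 60):
        print(f"{window:>2}s averaging window: peak {peak_tps(samples, window):.1f} tps")
```

The same minute reads as 480, 351 or 308.5 tps depending on whether it is averaged over 1, 10 or 60 seconds, so any peak figure should be quoted together with its averaging period.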

Some Theory: Maximum recommended CPU utilisation?

• System response time increases sharply (non-linearly) as utilisation rises

• Switch time measurements reflect this

• 80% maximum metric used by VocaLink

• Remember that normal switch time is of the order of 0.1 second

• < 1 second is probably acceptable (ATMs time out at 30 seconds).
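One textbook way to see why response time climbs so steeply is the single-server M/M/1 queue, where mean response time is R = S / (1 − U) for service time S and utilisation U. This is a simplifying assumption for illustration, not a model of a NonStop system, and the 0.1 s service time below is hypothetical.

```python
def mm1_response_time(service_time, utilisation):
    """Mean response time R = S / (1 - U) for a single M/M/1 server."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time / (1.0 - utilisation)


if __name__ == "__main__":
    service_time = 0.1  # hypothetical seconds of work per request
    for u in (0.5, 0.8, 0.9, 0.95):
        r = mm1_response_time(service_time, u)
        print(f"U={u:.0%}: R={r:.2f}s ({r / service_time:.0f}x service time)")
```

Under this model response time is 5x the service time at 80% utilisation and 10x at 90%, which is one way to motivate keeping a margin below saturation.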

The Theory: Average CPU utilisation vs Actual CPU utilisation

• When performing “what-if?” type analysis, CPU utilisation is generally assumed to be uniform across CPUs

• Application Support teams need to ensure a good balance of load across CPUs
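A minimal sketch of why balance matters, assuming CPU utilisation scales linearly with transaction volume: the uniform view scales the average CPU, but the busiest CPU reaches the threshold first. The per-CPU figures and the function name are hypothetical.

```python
def scaling_headroom(cpu_utilisations, threshold=0.8):
    """Return (uniform_view, busiest_cpu_view): how many times the current workload
    can grow before reaching `threshold`, assuming load scales linearly with volume."""
    average = sum(cpu_utilisations) / len(cpu_utilisations)
    busiest = max(cpu_utilisations)
    # A uniform what-if model scales the average; in reality the busiest CPU hits the limit first.
    return threshold / average, threshold / busiest


if __name__ == "__main__":
    cpus = [0.40, 0.55, 0.35, 0.30]  # hypothetical per-CPU utilisation at peak
    uniform, actual = scaling_headroom(cpus)
    print(f"uniform model: x{uniform:.2f} growth; busiest CPU: x{actual:.2f} growth")
```

Here the uniform model suggests the workload can double, while the busiest CPU allows only about 1.45x.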

The Theory: Priority based OS will save us?

• Some would argue that the priority based scheduling of the NonStop OS makes this work redundant.

• DP2 (the NonStop disk process) is a particular issue here.

• Our application is a collection of high priority processes

• Gets busier as a whole

• A single CPU can become saturated with high priority processes

• Negative impact on the rest of the application.

• Application function becomes unstable

• CPU imbalance means some CPUs get saturated before others

• Cross switch transaction time goes up.

• Remember that normal switch time is of the order of 0.1 second

• < 1 second is probably acceptable (ATMs time out at 30 seconds).

PART 2 – FORECASTS & MODELS

Forecasts and models: The information we can / should use

• Actual data from running system: NonStop Measure

• Business unit volume forecasts: monthly volumes by service

• SLA volume commitments: where appropriate (i.e. FPS)

• Application vendor data: not available / reliable

• Hardware vendor information: for what-if scenarios

• Other models: profile data, ratios

• Availability policy: scheme and processing model dependent

• Capacity policy: 80% CPU threshold, driven by cross switch time

Forecasts and models: The end result

• Peak second for every hour

• Rolling 24 month planning horizon.

Transaction volume forecasting: Daily volume (txnsyyyy.xlsx) spreadsheet

• Transaction summary data dating back to 1998

• Actual daily volumes

• Forecast daily volumes

• Tracks actual vs forecast

• Traditionally used to predict volumes, before the business took on this role

• This model can only look backwards

• Used to derive annual to peak month and month to peak day transaction ratios

• Used to derive the distribution of daily transaction volumes within each month

• Model tuned annually
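The ratio derivation can be sketched in a few lines, assuming one year of history held as a mapping from `datetime.date` to daily transaction count. The function names and the tiny data set are hypothetical; the real model lives in the txnsyyyy.xlsx spreadsheet.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily_volumes):
    """Sum a {date: transactions} mapping into {month_number: transactions}."""
    totals = defaultdict(int)
    for day, volume in daily_volumes.items():
        totals[day.month] += volume
    return totals


def volume_ratios(daily_volumes):
    """Annual-to-peak-month and month-to-peak-day ratios for one calendar year."""
    months = monthly_totals(daily_volumes)
    peak_month = max(months, key=months.get)
    peak_day = max(v for d, v in daily_volumes.items() if d.month == peak_month)
    return {
        "peak_month": peak_month,
        "annual_to_peak_month": months[peak_month] / sum(months.values()),
        "month_to_peak_day": peak_day / months[peak_month],
    }


def daily_distribution(daily_volumes):
    """Each day's share of its month's volume."""
    months = monthly_totals(daily_volumes)
    return {d: v / months[d.month] for d, v in daily_volumes.items()}


if __name__ == "__main__":
    history = {  # hypothetical, heavily truncated year of daily volumes
        date(2011, 1, 7): 100, date(2011, 1, 14): 150,
        date(2011, 12, 16): 300, date(2011, 12, 23): 500,
    }
    print(volume_ratios(history))
    print(daily_distribution(history))
```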

Transaction volume forecasting: Friday analysis (fridays.xlsx) spreadsheet

• Transaction summary data dating back to 1998

• Analysis of Friday daily volumes

• Actual peak day, hour, minute, second data

• Used to derive peak day to hour, peak hour to minute and peak minute to second ratios

• Model tuned annually
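A minimal sketch of deriving the intra-day drill-down ratios, assuming each observed Friday is recorded as its day total plus the volumes in its busiest hour, minute and second. Averaging across Fridays is an assumption; taking the maximum instead would give a more conservative ratio. Names and figures are hypothetical.

```python
from statistics import mean


def intraday_ratios(fridays):
    """fridays: (day_total, peak_hour, peak_minute, peak_second) volumes per Friday.
    Returns the average drill-down ratio at each level."""
    return {
        "day_to_hour": mean(hour / day for day, hour, minute, second in fridays),
        "hour_to_minute": mean(minute / hour for day, hour, minute, second in fridays),
        "minute_to_second": mean(second / minute for day, hour, minute, second in fridays),
    }


if __name__ == "__main__":
    observed = [  # hypothetical Friday peaks
        (1_000_000, 90_000, 1_800, 40),
        (1_200_000, 108_000, 2_160, 48),
    ]
    for name, ratio in intraday_ratios(observed).items():
        print(f"{name}: {ratio:.4f}")
```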

Transaction volume forecasting: Derived transaction ratios

• Peak period transaction ratios

• Derived from:

− txnsyyyy.xlsx

− fridays.xlsx

• Tuned annually

Transaction volume forecasting: Business Unit volume forecast

• The business unit provides future volumes.

• The business unit is responsible for these; it has sight of new business and industry trends, so IT doesn’t need to.

• Forms part of the contract between IT and the business.

• Removes volume prediction responsibility from IT.

• Based on calendar month.

Transaction volume forecasting: Business Unit volume inserted in txnsyyyy.xlsx

• Business forecast volumes plug into the transaction (txnsyyyy.xlsx) model unchanged

• Month to peak Friday ratio used to predict peak Friday volume
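As a sketch of this step, assuming one month-to-peak-Friday ratio per calendar month (so seasonal months can differ), the business forecast is simply multiplied through. The months, volumes and ratios are hypothetical.

```python
def forecast_peak_fridays(monthly_forecast, month_to_peak_friday):
    """Apply a per-month ratio to the business unit's monthly volume forecast."""
    return {
        month: round(volume * month_to_peak_friday[month])
        for month, volume in monthly_forecast.items()
    }


if __name__ == "__main__":
    business_forecast = {"2012-02": 60_000_000, "2012-03": 66_000_000}  # hypothetical
    ratios = {"2012-02": 0.060, "2012-03": 0.055}  # hypothetical, from txnsyyyy.xlsx history
    print(forecast_peak_fridays(business_forecast, ratios))
```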

Transaction volume forecasting: Daily transactions worksheet takes values from business forecast

• Peak Friday volume plugged into daily volume prediction worksheet.

Transaction volume forecasting: Remaining Fridays populated using ratios

• Remaining Friday volumes predicted using Friday ratios

Transaction volume forecasting: Remaining weekdays populated using ratios

• Remaining daily volumes predicted using weekday ratios
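A sketch of populating the remaining days, assuming each day's historical volume is expressed as a ratio to the month's peak Friday. The reconciliation against the monthly business forecast is an added sanity check of our own, not something the slides describe; all dates and ratios are hypothetical.

```python
from datetime import date


def populate_days(peak_friday_volume, day_ratios):
    """day_ratios maps each date to its historical volume relative to the peak Friday."""
    return {day: round(peak_friday_volume * ratio) for day, ratio in day_ratios.items()}


def forecast_drift(daily_volumes, monthly_forecast):
    """Fractional gap between the populated days and the business monthly forecast."""
    return sum(daily_volumes.values()) / monthly_forecast - 1.0


if __name__ == "__main__":
    ratios = {  # hypothetical: the peak Friday is 1.0, other days relative to it
        date(2012, 2, 3): 0.85, date(2012, 2, 6): 0.60,
        date(2012, 2, 10): 1.00, date(2012, 2, 13): 0.62,
    }
    days = populate_days(1_000_000, ratios)
    print(days)
    print(f"drift vs forecast: {forecast_drift(days, 3_000_000):+.1%}")
```

A persistent drift in one direction would suggest the day ratios need retuning.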

Transaction volume forecasting: Ratios used to calculate hour, minute, second volumes

• Peak hour, minute and second volumes calculated using ratios.
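The drill-down is a chain of multiplications. This sketch uses hypothetical ratios; the function name is ours.

```python
def drill_down(peak_day_volume, day_to_hour, hour_to_minute, minute_to_second):
    """Cascade a peak-day forecast down to peak hour, minute and second volumes."""
    peak_hour = peak_day_volume * day_to_hour
    peak_minute = peak_hour * hour_to_minute
    peak_second = peak_minute * minute_to_second
    return peak_hour, peak_minute, peak_second


if __name__ == "__main__":
    # Hypothetical ratios; a minute-to-second ratio above 1/60 means the busiest
    # second carries more than an even share of the busiest minute.
    hour, minute, second = drill_down(3_600_000, 0.09, 0.02, 1 / 45)
    print(f"peak hour {hour:,.0f}, peak minute {minute:,.0f}, peak second {second:.0f} tps")
```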

Transaction volume forecasting: Ratios recap

• A brief example showing the ratios at work

PART 2B – IMPROVED TRANSACTION VOLUME FORECAST

Improved transaction volume forecasting: Peak day transaction distribution profile

Improved transaction volume forecasting: Profile used to generate hourly volumes

• Daily volumes now distributed according to daily profile.

• Derives max tpm per hour
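A minimal sketch of the profile step, assuming the profile is 24 hourly shares of the day's volume and that one peak hour-to-minute ratio is applied to every hour to estimate its busiest minute. The profile shape and function names are hypothetical.

```python
def hourly_volumes(daily_volume, profile):
    """Spread a day's forecast volume across 24 hourly shares that sum to 1."""
    if len(profile) != 24 or abs(sum(profile) - 1.0) > 1e-6:
        raise ValueError("profile needs 24 hourly shares summing to 1")
    return [daily_volume * share for share in profile]


def max_tpm_per_hour(hourly, hour_to_minute):
    """Busiest minute in each hour, applying one peak hour-to-minute ratio to every hour."""
    return [volume * hour_to_minute for volume in hourly]


if __name__ == "__main__":
    # Hypothetical peak-day profile: quiet overnight, busiest 12:00-15:59.
    profile = [0.01] * 8 + [0.05] * 4 + [0.10] * 4 + [0.06] * 4 + [0.02] * 4
    hours = hourly_volumes(3_600_000, profile)
    tpm = max_tpm_per_hour(hours, 0.02)
    busiest = max(range(24), key=tpm.__getitem__)
    print(f"busiest hour starts {busiest:02d}:00 with max {tpm[busiest]:,.0f} tpm")
```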

Improved transaction volume forecasting: Ratios drill down to peak second per hour

• Peak second per hour derived from peak minute per hour.

• The two models validate each other.
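A sketch of the per-hour drill-down and a simple cross-check between the two models. The 10% agreement tolerance is an assumption, as are the figures and function names.

```python
def peak_tps_per_hour(max_tpm, minute_to_second):
    """Peak second in each hour, derived from that hour's busiest minute."""
    return [tpm * minute_to_second for tpm in max_tpm]


def models_agree(profile_tps, ratio_peak_tps, tolerance=0.10):
    """Cross-check: the profile model's busiest hour should land near the ratio model's peak second."""
    return abs(max(profile_tps) - ratio_peak_tps) <= tolerance * ratio_peak_tps


if __name__ == "__main__":
    tps = peak_tps_per_hour([3_600, 7_200, 4_320], 1 / 45)  # hypothetical max tpm per hour
    print([round(x) for x in tps])
    print(models_agree(tps, 150))
```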

PART 2C – WORKLOAD MODELS

Workload Models: Why create a workload model

• Gives the business the ability to predict future machine utilisation.

• Allows adequate time to prepare for known volume growth,

− i.e. following new business take-on

− New product launch

• Allows the business to perform what-if analysis.

• Allows for application benchmarking and comparison pre / post changes.

Forecasts and models: Raw NonStop Measure Report

```
?dictionary perfdict
?assign process to process
open process;
list by volume noprint, by subvol noprint, by filename noprint
by volume nohead as a8
by subvol nohead as a8
by filename nohead as a8
count (subvol over filename) nohead AS "M<ZZ9>"
sum (cpu-busy-time over filename) nohead AS "M<ZZZZZZZZZ9>"
sum (messages-sent over filename) nohead AS "M<ZZZZZZZ9>"
sum (messages-received over filename) nohead AS "M<ZZZZZZZ9>"
sum (recv-qtime over filename) nohead AS "M<ZZZZZZZZZZZ9>"
;
```

```
VOLUME  SUBVOL    FILENAME  COUNT  CPU-BUSY-TIME  MESSAGES-SENT  MESSAGES-RECEIVED  RECV-QTIME
$AOS10  ZYQ00000  Z00006BX      2       12214910          30987                  0           0
$AOS11  AT67POBJ  N50Q         15       12237880          44086               7968     2538348
$AOS11  AT67POBJ  SETLQ         5              0              0                  0           0
$AOS11  AT67POBJ  TIDELQ        3         306420            770                230       16163
$AOS11  AT67POBJ  TRITON1Q      1         119767            314                 99        1606
$AOS11  AT67POBJ  TRITONQ       7        2155984           5812               2004     2472860
$AOS11  BA67POBJ  EXTRQ         3              0              0                  0           0
$AOS11  BA67POBJ  HISO1Q       10         194172            280                184       16003
$AOS11  BA67POBJ  HISO5Q        1         113841             85                107       29178
$AOS11  BA67POBJ  INSHISO       2          14847              0                 30        4881
$AOS11  BA67POBJ  RIP           1              0              0                  0           0
$AOS11  BA67POBJ  T24HISO       1         204139            414                336       68292
$AOS11  SW67POBJ  LINKQ         3         412824            755                441       46470
```

Forecasts and models: Measure report imported into Excel

• Imported Measure data can be quite large.

• Summarised by object subvol and/or object name
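As an alternative to summarising in Excel, a short script can do the same aggregation. This sketch assumes the report rows are whitespace separated in the column order of the query above; the column names, `parse_report` and `summarise` are our own, and CPU busy time is left in whatever units the report uses.

```python
from collections import defaultdict

# Column order follows the list clause of the query above (names are ours).
COLUMNS = ("volume", "subvol", "filename", "count", "cpu_busy_time",
           "messages_sent", "messages_received", "recv_qtime")
NUMERIC = COLUMNS[3:]

REPORT = """\
$AOS10 ZYQ00000 Z00006BX 2 12214910 30987 0 0
$AOS11 AT67POBJ N50Q 15 12237880 44086 7968 2538348
$AOS11 AT67POBJ SETLQ 5 0 0 0 0
$AOS11 AT67POBJ TIDELQ 3 306420 770 230 16163
$AOS11 AT67POBJ TRITON1Q 1 119767 314 99 1606
$AOS11 AT67POBJ TRITONQ 7 2155984 5812 2004 2472860
$AOS11 BA67POBJ EXTRQ 3 0 0 0 0
$AOS11 BA67POBJ HISO1Q 10 194172 280 184 16003
$AOS11 BA67POBJ HISO5Q 1 113841 85 107 29178
$AOS11 BA67POBJ INSHISO 2 14847 0 30 4881
$AOS11 BA67POBJ RIP 1 0 0 0 0
$AOS11 BA67POBJ T24HISO 1 204139 414 336 68292
$AOS11 SW67POBJ LINKQ 3 412824 755 441 46470
"""


def parse_report(text):
    """Turn report lines into dicts, skipping anything that is not a data row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not fields[0].startswith("$"):
            continue
        row = dict(zip(COLUMNS, fields))
        for name in NUMERIC:
            row[name] = int(row[name])
        rows.append(row)
    return rows


def summarise(rows, key="subvol"):
    """Total every numeric column per object subvolume (or per filename)."""
    totals = defaultdict(lambda: dict.fromkeys(NUMERIC, 0))
    for row in rows:
        for name in NUMERIC:
            totals[row[key]][name] += row[name]
    return dict(totals)


if __name__ == "__main__":
    by_cpu = sorted(summarise(parse_report(REPORT)).items(),
                    key=lambda item: -item[1]["cpu_busy_time"])
    for subvol, t in by_cpu:
        print(f"{subvol:<9} cpu {t['cpu_busy_time']:>10}  msgs in {t['messages_received']:>6}")
```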

Forecasts and models: Measure report imported into Excel

• Measure data summary

• Measure data used to benchmark system.

• Collected each Friday.

• Collected during V&P testing

• CPU cost per transaction established.

• Default non-core application “noise” established.

• Safe tps ascertained and used to feed into other models.
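The arithmetic linking cost per transaction, noise and safe tps can be sketched as follows. It assumes CPU busy time and noise are totals across all CPUs for one interval and that load is evenly balanced, which (as the earlier theory slides note) flatters the real figure; dividing by an imbalance factor would be more cautious. All numbers and names are hypothetical.

```python
def cpu_cost_per_txn(busy_cpu_seconds, noise_cpu_seconds, transactions):
    """CPU-seconds per transaction once background 'noise' is removed."""
    return (busy_cpu_seconds - noise_cpu_seconds) / transactions


def safe_tps(cost_per_txn, noise_cpu_seconds, interval_seconds, cpu_count, threshold=0.8):
    """Highest tps keeping average utilisation at or below `threshold` (assumes even balance)."""
    noise_per_second = noise_cpu_seconds / interval_seconds
    return (threshold * cpu_count - noise_per_second) / cost_per_txn


if __name__ == "__main__":
    # Hypothetical one-hour Friday sample on a 16-CPU system.
    cost = cpu_cost_per_txn(busy_cpu_seconds=30_000, noise_cpu_seconds=7_200, transactions=1_140_000)
    rate = safe_tps(cost, 7_200, 3_600, 16)
    print(f"cost {cost * 1000:.1f} ms CPU per txn, safe rate {rate:.0f} tps")
```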

PART 2D – COMBINING FORECAST & WORKLOAD MODELS

Combined forecast and workload: Excel conditional formatting used to good effect

• Max tps of 376 used with Excel “conditional formatting”

• Danger times are obvious.
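Outside Excel, the same red/amber/green rule is a couple of comparisons per hourly slot. The 376 tps limit comes from the slide; the 90% amber band, the slot keys and the forecast values are assumptions.

```python
def rag_status(tps, max_tps, amber_fraction=0.9):
    """Red at or above capacity, amber within `amber_fraction` of it, else green."""
    if tps >= max_tps:
        return "RED"
    if tps >= amber_fraction * max_tps:
        return "AMBER"
    return "GREEN"


def rag_grid(forecast, max_tps):
    """forecast maps (date, hour) -> peak tps; returns the same keys with a RAG status."""
    return {slot: rag_status(tps, max_tps) for slot, tps in forecast.items()}


if __name__ == "__main__":
    forecast = {("2012-03-30", 11): 301, ("2012-03-30", 12): 352, ("2012-03-30", 13): 381}  # hypothetical
    print(rag_grid(forecast, max_tps=376))
```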

Combined forecast and workload (n-1): Seeing into the future

• Model can be rolled forward as far as the business can predict.

• Typically 24 months.

Combined forecast and workload (n-1): What about failure scenarios?

• Simple maths can be used to ascertain n-1 system capacity.
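The simplest version of that maths, assuming the lost CPU's share spreads evenly over the survivors (the same uniform assumption used in what-if analysis), scales capacity by (N − 1)/N. The 16-CPU count is hypothetical; in practice the result depends on which CPU fails and where its processes move, so treat this as a first approximation.

```python
def n_minus_1_capacity(n_capacity_tps, cpu_count):
    """Capacity with one CPU down, if the lost CPU's share spreads evenly over the survivors."""
    if cpu_count < 2:
        raise ValueError("need at least two CPUs")
    return n_capacity_tps * (cpu_count - 1) / cpu_count


if __name__ == "__main__":
    # 376 tps is the (n) figure from the slides; the 16-CPU count is hypothetical.
    print(n_minus_1_capacity(376, 16))  # 352.5
```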

Combined forecast and workload (n-1): What about failure scenarios?

• Max (n-1) tps of 345 used with Excel “conditional formatting”

• Danger times are obvious.

Combined forecast and workload (n-1): CPU down capacity, by CPU

• Impact of process relocations modelled in Excel

• Resultant n-1 impact shown.
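A sketch of modelling relocation per failed CPU, assuming each process pair's primary load moves wholesale to its backup CPU when the primary's CPU fails, and that utilisation scales linearly with transaction rate. It ignores DP2 and other system load; the process list, rates and names are hypothetical.

```python
from collections import defaultdict


def utilisation_after_failure(processes, failed_cpu):
    """processes: (name, primary_cpu, backup_cpu, utilisation) tuples for process pairs.
    Returns per-CPU utilisation after primaries in `failed_cpu` take over on their backups."""
    load = defaultdict(float)
    for name, primary, backup, utilisation in processes:
        cpu = backup if primary == failed_cpu else primary
        if cpu == failed_cpu:
            raise ValueError(f"{name} has no surviving CPU")
        load[cpu] += utilisation
    return dict(load)


def n_minus_1_tps(processes, current_tps, failed_cpu, threshold=0.8):
    """Rate at which the busiest surviving CPU reaches `threshold`, scaling linearly."""
    busiest = max(utilisation_after_failure(processes, failed_cpu).values())
    return current_tps * threshold / busiest


if __name__ == "__main__":
    pairs = [("A", 0, 1, 0.30), ("B", 1, 2, 0.20), ("C", 2, 0, 0.25), ("D", 1, 0, 0.10)]  # hypothetical
    for cpu in (0, 1, 2):
        print(f"CPU {cpu} down: capacity {n_minus_1_tps(pairs, 200, cpu):.0f} tps")
```

The spread between the best and worst failed CPU is what a per-CPU (n-1) view exposes and a single uniform figure hides.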

PART 3 – EXAMPLE USE CASE

Example use case: S Series capacity evaluation

• Assumptions

− High level capacity with first CPU @ 90% CPU utilisation (n) = 357 tps

− High level capacity @ 90% CPU utilisation (n-1) between 317 and 352 tps

− Average capacity of 333 tps (n-1) used in following illustrations

− CPU fail to fix time 6 hours.

(n) and (n-1) illustration: capacity vs workload, February 2012 – April 2012 (max tps + RAG for each hour)

Panels: (n) and (n-1)

(n) and (n-1) illustration: capacity vs workload, May 2012 – July 2012 (max tps + RAG for each hour)

Panels: (n) and (n-1)

Probability of failure? How to quantify the risk

• Can depend upon your sizing philosophy

• Size for < 80% with one CPU down? Or for 95% with all CPUs up?

• The impact of an incident at a quiet time is not the same as at a busy time.

• Deviation from provisioning policy

− (i.e. > 80% (n-1) utilisation forecast in next 12 months)

• System is 12 months from retirement

• Thought exercise performed and presented to management, attempting to quantify the risk.

• When communicating risk, I recommend you don’t use the phrase “imagine you’re in a casino!” when talking to management...

Probability of failure? Options considered

• S Series Upgrade Options

− Option 1 – Stay as is

− Option 2 – 2 x CPU upgrade

− Option 3 – Add 2 x CPU

− Option 4 – Migrate to NB50000

Probability of failure? CPU down capacity by failed CPU

• Upgrade option comparison.

Probability of failure? Lots of Maths (special thanks to Ian Murphy, VocaLink)

Probability of failure? Number of danger CPUs in each hour including fix time

Probability of failure? Probability calculations

Probability of failure? Number of danger CPUs in each hour

Probability of failure? Number of danger CPUs in each hour including fix time
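The calculations on these slides are not reproduced in the text, so the following is only one plausible formulation, not the method used. It assumes independent CPU failures at a constant hourly rate, and treats a failure at hour t as service impacting if that CPU is a "danger CPU" in any hour of its fix window. The danger sets, the 1-in-50,000-hours failure rate and the function names are hypothetical; the 6-hour fix time comes from the case study assumptions.

```python
import math


def danger_including_fix_time(danger_by_hour, fix_hours):
    """danger_by_hour[t] is the set of CPUs whose loss at hour t would overload the system.
    A CPU failing at hour t stays down for `fix_hours`, so the failure is service
    impacting if that CPU is dangerous in any hour of the outage window."""
    horizon = len(danger_by_hour)
    return [set().union(*danger_by_hour[t:min(t + fix_hours, horizon)]) for t in range(horizon)]


def probability_of_impact(danger_by_hour, fix_hours, failures_per_cpu_hour):
    """Chance of at least one service-impacting CPU failure over the horizon."""
    windows = danger_including_fix_time(danger_by_hour, fix_hours)
    danger_cpu_hours = sum(len(cpus) for cpus in windows)
    # Independent failures at a constant rate: P(none) = exp(-rate * exposure).
    return 1.0 - math.exp(-failures_per_cpu_hour * danger_cpu_hours)


if __name__ == "__main__":
    # Hypothetical day: CPU 3 is dangerous at 12:00, CPUs 3 and 5 at 13:00.
    danger = [set() for _ in range(24)]
    danger[12] = {3}
    danger[13] = {3, 5}
    p = probability_of_impact(danger, fix_hours=6, failures_per_cpu_hour=1 / 50_000)
    print(f"P(service impacting failure today) = {p:.2e}")
```

Including the fix time widens the exposure: in this example, two danger CPU-hours become thirteen.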

Probability of failure? Probability of service impacting failure (option 1)

Probability of failure? Probability of service impacting failure (option 2)

Probability of failure? Probability of service impacting failure (option 3)

Summary

• Transaction volume forecasting can be as simple as some ratios, or more complex with profiles.

• Workload and capacity can be modelled with Measure data

• Combine volume and workload models to great effect

• Don't forget the failure scenarios

• The cheapest route to additional capacity is good n and n-1 CPU balance

• Use workload models in what-if scenarios

• Probability of failure can be calculated but is mostly academic

• Most of us are in the zero tolerance business: the service cannot fail.

• Especially true once a risk has been identified.

• Many thanks. Questions?

Summary

• Thank you for your attention.

• Questions?
