Confidential
Workload Forecasting and Reporting
Damian Ward
NonStop Solutions Architect / BITUG Vice Chairman
2 Unclassified
Introduction What am I going to talk about today
• About Me
• About VocaLink
• Part 1 – Some Theory
• Part 2 – Forecasts & Models
− Part 2a – Transaction Volume Forecast
− Part 2b – Improved Transaction Volume Forecast
− Part 2c – Workload Models
− Part 2d – Combining Forecast & Workload models
• Part 3 – Case Study
• Summary
• Questions? Please feel free to ask as we go through the presentation.
Introduction About your presenter
• Damian Ward
• 20 years HP NonStop and Payments experience
• Career spanning:
− Operations, Application Programming, System Management, Programme
Management, Technical Specialist, Solutions Architect, Enterprise Architect,
Infrastructure Architect
• Specialities:
− HP NonStop systems and architecture, Enterprise Architecture, Encryption,
Availability Management, ATM Systems, Payments Processing, Capacity Planning,
System modelling, Fraud, Mobile and Internet technologies, Programming,
Emerging Technologies and Robotics
• BITUG Vice Chairman 2011
• BITUG Chairman 2012
Introduction VocaLink History
Introduction Card processing landscape
• Direct connection to in house processing system
• FIS Connex Advantage Switch with resilient telecommunication connections to each customer
• Indirect ATM acquirer and card issuer connection (via VocaLink CSB)
• ATM and POS international acquiring and issuing connections via gateway connections to international schemes
• Connections to Mobile Operators
• Direct connection to Post Office systems
• Connections to overseas schemes and banks
• Indirect ATM connection (via third party processor)
• via TNS CSB
Introduction Transaction Processing Peaks
PART 1 – SOME THEORY
Some Theory Peak TPS vs Throughput?
• Third slide indicates 482 tps peak
• Bell curve, arrival rate, measurements and averaging periods could all account for this.
• HOWEVER – I am using fictional transaction summary data based on real world observations.
• All transaction summary data used in this presentation is made up for the sole purpose of illustrating the models within this presentation
Some Theory Maximum recommended CPU utilisation?
• System response time rises steeply and non-linearly as utilisation approaches 100%
• Switch time measurements reflect this
• 80% maximum metric used by VocaLink
• Remember normal switch time is in the order of 0.1 second
• < 1 second is probably acceptable (ATMs time out at 30 seconds).
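One textbook way to illustrate why response time climbs so sharply near saturation is the single-server M/M/1 queue, where mean response time is the service time divided by (1 - utilisation). The slides do not say which model VocaLink uses, so the M/M/1 choice is an assumption, and the 0.1 second service time is simply borrowed from the normal switch time quoted above as a stand-in.

```python
def mm1_response_time(service_time_s: float, utilisation: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in the range [0, 1)")
    return service_time_s / (1.0 - utilisation)


# Assumed unloaded switch time (illustrative, not a measured value).
SERVICE_TIME_S = 0.1

for rho in (0.5, 0.8, 0.9, 0.95):
    response = mm1_response_time(SERVICE_TIME_S, rho)
    print(f"{rho:.0%} busy -> {response:.2f} s")
```

Under this simplified model the response time doubles between 80% and 90% utilisation, which is one way to motivate a planning threshold around 80%.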
The Theory Average CPU utilisation vs Actual CPU utilisation
• When performing “what-if?” type analysis CPU utilisation is generally
considered uniform
• Application Support teams need to ensure a good balance
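A minimal sketch of why balance matters: if utilisation is assumed to scale linearly with volume (an assumption for illustration), the growth the system can absorb is set by the busiest CPU, not by the average. The per-CPU figures below are made up.

```python
from statistics import mean


def growth_headroom(cpu_busy_pct, threshold_pct=80.0):
    """Return growth multipliers implied by the average and by the busiest CPU."""
    average = mean(cpu_busy_pct)
    busiest = max(cpu_busy_pct)
    return {
        "average_pct": average,
        "busiest_pct": busiest,
        # Growth a "what-if" analysis would predict if load were uniform.
        "growth_if_uniform": threshold_pct / average,
        # Growth available before the busiest CPU reaches the threshold.
        "growth_in_practice": threshold_pct / busiest,
    }


# Made-up busy percentages for a hypothetical 4-CPU system.
headroom = growth_headroom([40.0, 42.0, 38.0, 60.0])
print(headroom)
```

Here the uniform assumption suggests roughly 78% growth headroom, while the busiest CPU allows only about 33%.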
The Theory Priority based OS will save us
• Some would argue that the NonStop OS priority based scheduling makes
this work redundant?
• DP2 is a particular issue here.
• Our application is a collection of high priority processes
• Gets busier as a whole
• Single CPU can become saturated with high priority processes
• Negative impact on the rest of the application.
• Application function becomes unstable
• CPU imbalance means some CPUs get saturated before others
• Cross switch transaction time goes up.
• Remember normal switch time is in the order of 0.1 second
• < 1 second is probably acceptable (ATMs time out at 30 seconds).
PART 2 – FORECASTS & MODELS
Forecasts and models What information can / should we use
• Actual data from running system: NS MEASURE
• Business unit volume forecasts: monthly volumes by service
• SLA volume commitments: where appropriate (i.e. FPS)
• Application vendor data: not available / reliable
• Hardware vendor information: for what-if scenarios
• Other models: profile data, ratios
• Availability policy: scheme and processing model dependent
• Capacity policy: 80% CPU threshold, cross switch time driven
Forecasts and models The end result
• Peak second for every hour
• Rolling 24 month planning horizon.
Transaction volume forecasting Daily volume (txnsyyyy.xlsx) spreadsheet
• Transaction summary data dating back to 1998
• Actual daily volumes
• Forecast daily volumes
• Tracks actual vs forecast
• Traditionally used to predict volumes, prior to the business taking on this role
• This model can only look backwards
• Used to derive annual to peak month and month to peak day transaction ratios
• Used to derive the distribution of daily transaction volumes within each month
• Model tuned annually
Transaction volume forecasting Friday analysis (fridays.xlsx) spreadsheet
• Transaction summary data dating back to 1998
• Analysis of Friday daily volumes
• Actual peak day, hour, minute, second data
• Used to derive peak day to hour, peak hour to minute and peak minute to second ratios
• Model tuned annually
Transaction volume forecasting Derived transaction ratios
• Peak period transaction ratios
• Derived from:
− txnsyyyy.xlsx
− fridays.xlsx
• Tuned annually
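A minimal sketch of how such ratios could be derived from historical peaks. The field names and all volumes are hypothetical; the spreadsheets themselves are not reproduced in these slides, and each ratio is assumed here to be the smaller period's peak divided by the larger period's total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeakObservation:
    """Fictional transaction counts for one month and its peak periods."""
    month_total: int
    peak_day: int
    peak_hour: int
    peak_minute: int
    peak_second: int


def derive_ratios(obs: PeakObservation) -> dict:
    """Express each peak as a fraction of the period that contains it."""
    return {
        "month_to_peak_day": obs.peak_day / obs.month_total,
        "day_to_peak_hour": obs.peak_hour / obs.peak_day,
        "hour_to_peak_minute": obs.peak_minute / obs.peak_hour,
        "minute_to_peak_second": obs.peak_second / obs.peak_minute,
    }


# Made-up figures, for illustration only.
observed = PeakObservation(
    month_total=300_000_000,
    peak_day=12_000_000,
    peak_hour=1_200_000,
    peak_minute=24_000,
    peak_second=480,
)
ratios = derive_ratios(observed)
print(ratios)
```

Annual tuning would then amount to recomputing these fractions from the latest year of observations.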
Transaction volume forecasting Business Unit volume forecast
• Business unit provides future volumes
• Business unit is responsible for these; it has sight of new business and industry trends, so IT does not need to.
• Forms part of the contract between IT and the business.
• Removes volume prediction responsibility from IT.
• Based on calendar month.
Transaction volume forecasting Business Unit volume inserted in txnsyyyy.xlsx
• Business forecast volumes plug into transaction (txnsyyyy.xlsx) model
unchanged
• Month to peak Friday ratio used to predict peak Friday volume
Transaction volume forecasting Daily transactions worksheet takes values from business forecast
• Peak Friday volume plugged into daily
volume prediction worksheet.
Transaction volume forecasting Remaining Fridays populated using ratios
• Remaining Friday volumes predicted
using Friday ratios
Transaction volume forecasting Remaining weekdays populated using ratios
• Remaining daily volumes predicted using weekday ratios
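The steps on the last few slides (monthly business forecast, then peak Friday, then remaining Fridays, then remaining weekdays) can be sketched as below. How the ratios are expressed is an assumption: Friday ratios are taken relative to the peak Friday and weekday ratios relative to the Friday of the same week. All ratio values and volumes are made up.

```python
# Hypothetical ratios; the real ones come from txnsyyyy.xlsx and fridays.xlsx.
FRIDAY_RATIOS = [0.92, 0.95, 0.97, 1.00]  # each Friday relative to the peak Friday
WEEKDAY_RATIOS = {
    "Mon": 0.80, "Tue": 0.78, "Wed": 0.80, "Thu": 0.85,
    "Fri": 1.00, "Sat": 0.90, "Sun": 0.60,
}  # each day relative to that week's Friday


def forecast_daily_volumes(month_volume, month_to_peak_friday,
                           friday_ratios, weekday_ratios):
    """Spread a monthly business forecast across the days of each week."""
    peak_friday = month_volume * month_to_peak_friday
    weeks = []
    for friday_ratio in friday_ratios:
        friday = peak_friday * friday_ratio
        weeks.append({day: round(friday * ratio)
                      for day, ratio in weekday_ratios.items()})
    return weeks


weeks = forecast_daily_volumes(300_000_000, 0.04, FRIDAY_RATIOS, WEEKDAY_RATIOS)
for number, week in enumerate(weeks, start=1):
    print(number, week)
```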
Transaction volume forecasting Ratios used to calculate hour, minute, second volumes
• Peak hour, minute and second calculated using ratios.
Transaction volume forecasting Ratios recap
• A brief example showing the ratios at work
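The worked example on this slide was an image and is not recoverable here, so the following is a stand-in with made-up numbers showing the ratio chain from peak day down to peak second.

```python
def drill_down(peak_day_volume, day_to_hour, hour_to_minute, minute_to_second):
    """Apply the peak ratios in sequence: day -> hour -> minute -> second."""
    peak_hour = peak_day_volume * day_to_hour
    peak_minute = peak_hour * hour_to_minute
    peak_second = peak_minute * minute_to_second
    return peak_hour, peak_minute, peak_second


# Illustrative ratios only; real values are derived from historical data.
peak_hour, peak_minute, peak_second = drill_down(
    peak_day_volume=12_000_000,
    day_to_hour=0.10,
    hour_to_minute=0.02,
    minute_to_second=0.02,
)
print(f"peak hour   : {peak_hour:,.0f} txns")
print(f"peak minute : {peak_minute:,.0f} txns")
print(f"peak second : {peak_second:,.0f} tps")
```

Note that a ratio of 0.02 is deliberately higher than the 1/60 a perfectly even spread would give; the difference is how burstiness within the hour or minute shows up in this kind of model.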
PART 2B – IMPROVED TRANSACTION VOLUME FORECAST
Improved transaction volume forecasting Peak day transaction distribution profile
Improved transaction volume forecasting Profile used to generate hourly volumes
• Daily volumes now distributed
according to daily profile.
• Derives max tpm per hour
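A minimal sketch of distributing a daily volume across 24 hours using a profile, then deriving a maximum transactions-per-minute figure for each hour. The profile weights and the hour-to-minute ratio are invented for illustration; the real profile comes from observed peak days.

```python
# Hypothetical relative weights for hours 00 to 23 (they need not sum to 1).
PROFILE = [1, 1, 1, 1, 1, 2, 3, 5, 6, 7, 8, 9,
           10, 9, 8, 7, 7, 6, 5, 4, 3, 2, 2, 2]


def hourly_volumes(day_volume, profile):
    """Split a daily volume across the hours in proportion to the profile."""
    if len(profile) != 24:
        raise ValueError("profile must contain 24 hourly weights")
    total = sum(profile)
    return [day_volume * weight / total for weight in profile]


def max_tpm_per_hour(hourly, hour_to_minute_ratio):
    """Estimate the busiest minute in each hour from the hourly volume."""
    return [volume * hour_to_minute_ratio for volume in hourly]


hourly = hourly_volumes(12_000_000, PROFILE)
tpm = max_tpm_per_hour(hourly, hour_to_minute_ratio=0.02)
busiest_hour = hourly.index(max(hourly))
print(f"busiest hour {busiest_hour:02d}:00, max tpm {tpm[busiest_hour]:,.0f}")
```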
Improved transaction volume forecasting Ratios drill down to peak second per hour
• Peak second per hour derived from peak minute per hour.
• The 2 models validate each other.
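One way to read "the 2 models validate each other" is as a cross-check between the profile-based peak second and the ratio-only peak second. The sketch below assumes that reading; the 5% tolerance and all figures are hypothetical.

```python
def peak_second_per_hour(peak_minute_per_hour, minute_to_second_ratio):
    """Derive a peak tps figure for each hour from that hour's peak minute."""
    return [minute * minute_to_second_ratio for minute in peak_minute_per_hour]


def models_agree(profile_peak_tps, ratio_peak_tps, tolerance=0.05):
    """True if the two peak estimates differ by no more than the tolerance."""
    return abs(profile_peak_tps - ratio_peak_tps) <= tolerance * ratio_peak_tps


# Made-up max tpm values for four hours of a day.
peak_minutes = [2_000, 9_000, 23_500, 18_000]
per_hour_tps = peak_second_per_hour(peak_minutes, minute_to_second_ratio=0.02)

# 480.0 stands in for the peak second produced by the simple ratio model.
agree = models_agree(max(per_hour_tps), ratio_peak_tps=480.0)
print(per_hour_tps, agree)
```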
PART 2C – WORKLOAD MODELS
Workload Models Why create a workload model
• Gives the business the ability to predict future machine utilisation.
• Allows adequate time to prepare for known volume growth, e.g.
− following new business take-on
− new product launch
• Allows the business to perform what-if analysis.
• Allows for application benchmarking and comparison pre / post changes.
Forecasts and models Raw NonStop Measure Report
?dictionary perfdict
?assign process to process
open process;
list by volume noprint, by subvol noprint, by filename noprint
by volume nohead as a8
by subvol nohead as a8
by filename nohead as a8
count (subvol over filename) nohead AS "M<ZZ9>"
sum (cpu-busy-time over filename) nohead AS "M<ZZZZZZZZZ9>"
sum (messages-sent over filename) nohead AS "M<ZZZZZZZ9>"
sum (messages-received over filename) nohead AS "M<ZZZZZZZ9>"
sum (recv-qtime over filename) nohead AS "M<ZZZZZZZZZZZ9>"
;
$AOS10 ZYQ00000 Z00006BX 2 12214910 30987 0 0
$AOS11 AT67POBJ N50Q 15 12237880 44086 7968 2538348
$AOS11 AT67POBJ SETLQ 5 0 0 0 0
$AOS11 AT67POBJ TIDELQ 3 306420 770 230 16163
$AOS11 AT67POBJ TRITON1Q 1 119767 314 99 1606
$AOS11 AT67POBJ TRITONQ 7 2155984 5812 2004 2472860
$AOS11 BA67POBJ EXTRQ 3 0 0 0 0
$AOS11 BA67POBJ HISO1Q 10 194172 280 184 16003
$AOS11 BA67POBJ HISO5Q 1 113841 85 107 29178
$AOS11 BA67POBJ INSHISO 2 14847 0 30 4881
$AOS11 BA67POBJ RIP 1 0 0 0 0
$AOS11 BA67POBJ T24HISO 1 204139 414 336 68292
$AOS11 SW67POBJ LINKQ 3 412824 755 441 46470
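The presentation imports this report into Excel. As an alternative sketch, the same rows could be parsed and summarised by object subvol in Python. The column names follow the order of the query above; `process_count` is my own label for the count column, and the sample rows are a subset copied from the report.

```python
from collections import defaultdict

REPORT = """\
$AOS11 AT67POBJ N50Q 15 12237880 44086 7968 2538348
$AOS11 AT67POBJ TIDELQ 3 306420 770 230 16163
$AOS11 BA67POBJ HISO1Q 10 194172 280 184 16003
$AOS11 SW67POBJ LINKQ 3 412824 755 441 46470
"""

FIELDS = ("volume", "subvol", "filename", "process_count",
          "cpu_busy_time", "messages_sent", "messages_received", "recv_qtime")


def parse_report(text):
    """Turn whitespace-separated report rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS[:3], parts[:3]))
        row.update((name, int(value))
                   for name, value in zip(FIELDS[3:], parts[3:]))
        rows.append(row)
    return rows


def cpu_busy_by_subvol(rows):
    """Sum CPU busy time per object subvol, in the units Measure reported."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["subvol"]] += row["cpu_busy_time"]
    return dict(totals)


rows = parse_report(REPORT)
totals = cpu_busy_by_subvol(rows)
print(totals)
```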
Forecasts and models Measure report imported into Excel
• Imported Measure data can be quite large.
• Summarised by object subvol and/or object name
Forecasts and models Measure report imported into Excel
• Measure data summary
• Measure data used to benchmark system.
• Collected each Friday.
• Collected during V&P testing
• CPU cost per transaction established.
• Default non core application “noise” established.
• Safe tps ascertained and used to feed into other models.
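A minimal sketch of how a CPU cost per transaction and a "safe tps" figure might be derived from such a benchmark. It assumes perfectly balanced CPUs, a cost per transaction that stays constant with volume, and background "noise" expressed as CPU-seconds consumed per elapsed second. All numbers are invented.

```python
def cpu_cost_per_txn(app_cpu_busy_s, txn_count):
    """CPU-seconds consumed by the core application per transaction."""
    return app_cpu_busy_s / txn_count


def safe_tps(cpu_count, threshold, noise_cpu_s_per_s, cost_per_txn_s):
    """Highest tps that keeps total utilisation at or below the threshold."""
    usable_cpu_s_per_s = cpu_count * threshold - noise_cpu_s_per_s
    return usable_cpu_s_per_s / cost_per_txn_s


# Hypothetical one-hour benchmark on a 4-CPU system.
cost = cpu_cost_per_txn(app_cpu_busy_s=7_200.0, txn_count=900_000)
limit = safe_tps(cpu_count=4, threshold=0.80,
                 noise_cpu_s_per_s=0.2, cost_per_txn_s=cost)
print(f"cost per txn: {cost * 1000:.1f} ms CPU, safe tps: {limit:.0f}")
```

Because real systems are rarely perfectly balanced, a figure produced this way is best treated as an upper bound.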
PART 2D – COMBINING FORECAST & WORKLOAD MODELS
Combined forecast and workload Excel conditional formatting used to good effect
• Max tps of 376 used with Excel “conditional formatting”
• Danger times are obvious.
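The conditional formatting rule can be sketched as a red / amber / green classification of each hour's forecast peak tps. The 376 tps limit is from the slide; the amber band starting at 90% of the limit is an assumption, as are the sample hourly values.

```python
def rag_status(peak_tps, max_tps=376.0, amber_fraction=0.9):
    """Classify a forecast peak against the capacity limit."""
    if peak_tps >= max_tps:
        return "RED"
    if peak_tps >= amber_fraction * max_tps:
        return "AMBER"
    return "GREEN"


# Made-up forecast peak tps for a few hours of a peak Friday.
forecast = {"10:00": 300.0, "11:00": 350.0, "12:00": 380.0, "13:00": 340.0}
statuses = {hour: rag_status(tps) for hour, tps in forecast.items()}
for hour, status in statuses.items():
    print(hour, forecast[hour], status)
```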
Combined forecast and workload (n-1) Seeing into the future
• Model can be rolled forward for as far as the business can predict.
• Typically 24 months.
Combined forecast and workload (n-1) What about failure scenarios?
• Simple maths can be used to ascertain n-1 system capacity.
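The slide does not spell out the maths. A simple version, assuming load is evenly balanced and that the failed CPU's work is shared equally by the survivors, scales the full capacity by (n-1)/n. The CPU count of 12 below is a hypothetical value, not a figure from the slides.

```python
def n_minus_1_capacity(n_capacity_tps, cpu_count):
    """Capacity with one CPU down, assuming evenly balanced load."""
    if cpu_count < 2:
        raise ValueError("need at least two CPUs to survive a failure")
    return n_capacity_tps * (cpu_count - 1) / cpu_count


# 376 tps is the (n) limit used earlier; 12 CPUs is an assumption.
degraded = n_minus_1_capacity(376.0, cpu_count=12)
print(f"n-1 capacity: {degraded:.1f} tps")
```

Where CPUs are not evenly balanced, the result depends on which CPU fails and where its processes restart, so a per-CPU model is more realistic.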
Combined forecast and workload (n-1) What about failure scenarios?
• Max (n-1) tps of 345 used with Excel “conditional formatting”
• Danger times are obvious.
Combined forecast and workload (n-1) CPU down capacity, by CPU
• Impact of process relocations modelled in Excel
• Resultant n-1 impact shown.
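A minimal sketch of a per-CPU failure model: when a CPU fails, its load moves to the CPUs named in a relocation map, and capacity is limited by whichever surviving CPU is then busiest. The loads, reference tps, relocation map and threshold are all hypothetical, and utilisation is assumed to scale linearly with tps.

```python
def capacity_tps(load_pct, reference_tps, threshold_pct=80.0):
    """tps at which the busiest CPU reaches the threshold."""
    return reference_tps * threshold_pct / max(load_pct.values())


def load_after_failure(load_pct, failed_cpu, relocation):
    """Move the failed CPU's load to its backup CPUs per the relocation map."""
    survivors = {cpu: pct for cpu, pct in load_pct.items() if cpu != failed_cpu}
    for target, share in relocation[failed_cpu].items():
        survivors[target] += load_pct[failed_cpu] * share
    return survivors


# Made-up busy percentages measured at a reference load of 200 tps.
LOAD = {0: 40.0, 1: 40.0, 2: 40.0, 3: 40.0}
# Made-up map: where each CPU's processes restart, as fractions of its load.
RELOCATION = {0: {1: 1.0}, 1: {0: 1.0}, 2: {3: 0.5, 0: 0.5}, 3: {2: 1.0}}

n_capacity = capacity_tps(LOAD, reference_tps=200.0)
n1_capacity = {
    cpu: capacity_tps(load_after_failure(LOAD, cpu, RELOCATION), 200.0)
    for cpu in LOAD
}
print(n_capacity, n1_capacity)
```

In this toy example, spreading CPU 2's processes across two backups preserves noticeably more capacity than moving everything to a single backup, which is why n-1 capacity comes out as a range that depends on the failed CPU.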
PART 3 – EXAMPLE USE CASE
Example use case S Series capacity evaluation
• Assumptions
• High level capacity with first CPU @ 90% utilisation (n) = 357 tps
• High level capacity @ 90% CPU utilisation (n-1) between 317 and 352 tps
• Average capacity of 333 tps (n-1) used in following illustrations
• CPU fail to fix time 6 hours.
(n) and (n-1) illustration, capacity vs workload February 2012 – April 2012 (max tps + RAG for each hour)
(n) (n-1)
(n) and (n-1) illustration, capacity vs workload May 2012 – July 2012 (max tps + RAG for each hour)
(n) (n-1)
Probability of failure? How to quantify the risk
• Can depend upon your sizing philosophy
• Size for < 80% with 1 CPU down..? Or 95% with all CPUs up..?
• Impact of an incident at a quiet time is not the same as at a busy time.
• Deviation from provisioning policy
− (i.e. >80% (n-1) utilisation forecast in next 12 months)
• System is 12 months from retirement
• Thought exercise performed and presented to management; attempted to quantify risk.
• When communicating risk, I recommend you don’t use the phrase “imagine you’re in a casino..!” when talking to management.
Probability of failure? Options considered
• S Series Upgrade Options
− Option 1 – Stay as is
− Option 2 – 2 x CPU upgrade
− Option 3 – Add 2 x CPU
− Option 4 – Migrate to NB50000
Probability of failure? CPU down capacity by failed CPU
• Upgrade option comparison.
Probability of failure? Lots of Maths (special thanks to Ian Murphy, VocaLink)
Probability of failure? Number of danger CPUs in each hour including fix time
Probability of failure? Probability calculations
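The calculations on these slides were images and did not survive as text, so the sketch below is not the method credited to Ian Murphy. It is one plausible formulation under stated assumptions: CPUs fail independently at a constant rate, a CPU is a "danger CPU" in a given hour if its failure then would leave the system over its n-1 capacity at any point during the fix window, and the failure rate and workload figures are invented.

```python
import math


def danger_cpus_per_hour(hourly_peak_tps, n1_capacity_by_cpu, fix_hours=6):
    """Count, per hour, the CPUs whose failure would breach n-1 capacity
    at some point before the fix completes."""
    counts = []
    for start in range(len(hourly_peak_tps)):
        worst = max(hourly_peak_tps[start:start + fix_hours])
        counts.append(sum(1 for capacity in n1_capacity_by_cpu.values()
                          if worst > capacity))
    return counts


def probability_of_impact(danger_counts, failures_per_cpu_hour):
    """Chance of at least one service impacting failure, assuming
    independent failures at a constant rate (a Poisson process)."""
    exposure_cpu_hours = sum(danger_counts)
    return 1.0 - math.exp(-failures_per_cpu_hour * exposure_cpu_hours)


# Made-up forecast peaks and per-CPU n-1 capacities (tps).
peaks = [300, 310, 340, 330, 300, 290, 280, 270]
n1_caps = {0: 317, 1: 335, 2: 352}

counts = danger_cpus_per_hour(peaks, n1_caps, fix_hours=2)
risk = probability_of_impact(counts, failures_per_cpu_hour=1e-4)
print(counts, f"{risk:.6f}")
```

Running the same calculation for each upgrade option, with that option's n-1 capacities, would give comparable risk figures for the options.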
Probability of failure? Number of danger CPUs in each hour
Probability of failure? Number of danger CPUs in each hour including fix time
Probability of failure? Probability of service impacting failure (option 1)
Probability of failure? Probability of service impacting failure (option 2)
Probability of failure? Probability of service impacting failure (option 3)
Summary
• Transaction volume forecasting can be as simple as some ratios, or more complex with profiles.
• Workload and capacity can be modelled with Measure data
• Combine Volume and Workload to great effect
• Don't forget the failure scenarios
• Cheapest route to additional capacity is good n and n-1 CPU balance
• Use workload models in what-if scenarios
• Probability of failure can be calculated, but is mostly academic
• Most of us are in the zero tolerance business, the service cannot fail.
• Especially true once risk identified.
• Many Thanks, Questions..?