Workloads 02 Tutorial


  • Workload Modeling
    and its Effect on
    Performance Evaluation

    Dror Feitelson

    Hebrew University

    Thanks to participants and program committee; thanks to Monien; abuse hospitality to talk about agenda

  • Performance Evaluation

    In system design: selection of algorithms, setting parameter values
    In procurement decisions: value for money, meet usage goals
    For capacity planning

    Important and basic activity

  • The Good Old Days

    The skies were blue
    The simulation results were conclusive
    Our scheme was better than theirs

    Feitelson & Jette, JSSPP 1997

    Focus on system design. Widely different designs lead to conclusive results.

  • But in their papers,

    Their scheme was better than ours!

    But literature is full of contradictory results.

  • How could they be so wrong?

    Leads to question of what is the cause for contradictions.

  • Performance evaluation depends on:

    The system's design

    (What we teach in algorithms and data structures)

    Its implementation

    (What we teach in programming courses)

    The workload to which it is subjected
    The metric used in the evaluation
    Interactions between these factors

    Next: our focus is the workloads.

  • Outline for Today

    Three examples of how workloads affect performance evaluation
    Workload modeling
    Getting data
    Fitting, correlations, stationarity
    Heavy tails, self similarity
    Research agenda

    In the context of parallel job scheduling

    Job scheduling, not task scheduling

  • Example #1

    Gang Scheduling and

    Job Size Distribution

  • Gang What?!?

    Time slicing parallel jobs with coordinated context switching

    Ousterhout matrix

    Ousterhout, ICDCS 1982

  • Gang What?!?

    Time slicing parallel jobs with coordinated context switching

    Ousterhout matrix

    Optimization: alternative scheduling

    Ousterhout, ICDCS 1982
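
    A minimal sketch of the Ousterhout-matrix idea (illustrative only, not
    Ousterhout's original algorithm): rows are time slots, columns are
    processors, and all threads of a job share a row, so a whole row is
    dispatched at once and jobs are context-switched together. The class and
    method names are assumptions made for this sketch.

        # Sketch of an Ousterhout matrix for gang scheduling (illustrative).
        # Rows = time slots, columns = processors; all threads of a job share
        # a row, so they are dispatched and context-switched together.
        class OusterhoutMatrix:
            def __init__(self, processors, slots):
                self.processors = processors
                # matrix[slot][cpu] holds a job id or None
                self.matrix = [[None] * processors for _ in range(slots)]

            def allocate(self, job_id, size):
                """Place all `size` threads of a job in the first time slot
                with enough free processors; return the slot or None."""
                for slot, row in enumerate(self.matrix):
                    free = [cpu for cpu, owner in enumerate(row) if owner is None]
                    if len(free) >= size:
                        for cpu in free[:size]:
                            row[cpu] = job_id
                        return slot
                return None

            def schedule_round(self):
                """One round of coordinated time slicing: dispatch row by row."""
                for slot, row in enumerate(self.matrix):
                    running = sorted({j for j in row if j is not None})
                    print(f"slot {slot}: run jobs {running}")

        m = OusterhoutMatrix(processors=8, slots=3)
        for job_id, size in [("A", 4), ("B", 4), ("C", 6)]:
            m.allocate(job_id, size)
        m.schedule_round()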

  • Packing Jobs

    Use a buddy system for allocating processors

    Feitelson & Rudolph, Computer 1990

  • Packing Jobs

    Use a buddy system for allocating processors

    Start with full system in one block

  • Packing Jobs

    Use a buddy system for allocating processors

    To allocate, repeatedly partition in two to get the desired size

  • Packing Jobs

    Use a buddy system for allocating processors

    Or use existing partition
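
    A minimal sketch of buddy allocation of processors, assuming a
    power-of-two machine size; the recursive halving mirrors the "repeatedly
    partition in two" step above. The class and its bookkeeping are
    illustrative, not taken from the paper, and freeing/merging of buddies
    is omitted for brevity.

        # Buddy allocation of processors: the machine starts as one block and
        # blocks are split in two until the requested size (rounded up to a
        # power of two) is reached. Rounding up is the internal fragmentation.
        def next_pow2(n):
            p = 1
            while p < n:
                p *= 2
            return p

        class BuddyAllocator:
            def __init__(self, total):
                # free_blocks[size] = starting indices of free blocks of that size
                self.free_blocks = {total: [0]}

            def allocate(self, requested):
                size = next_pow2(requested)
                # smallest free block that is large enough
                fits = [s for s, lst in self.free_blocks.items() if s >= size and lst]
                if not fits:
                    return None
                block = min(fits)
                start = self.free_blocks[block].pop()
                while block > size:            # split, keeping the upper half free
                    block //= 2
                    self.free_blocks.setdefault(block, []).append(start + block)
                return (start, size)

        alloc = BuddyAllocator(total=16)
        print(alloc.allocate(5))   # 8-processor block: (0, 8)
        print(alloc.allocate(4))   # 4-processor block from the remainder: (8, 4)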

  • The Question:

    The buddy system leads to internal fragmentation
    But it also improves the chances of alternative scheduling, because processors are allocated in predefined groups

    Which effect dominates the other?

  • The Answer (part 1):

    Feitelson & Rudolph, JPDC 1996

    The answer is a function of the workload, but it is not a full answer because the actual workload is unknown. Dashed lines: provable bounds.

  • The Answer (part 2):

    Note logarithmic Y axis


    Many small jobs
    Many sequential jobs
    Many power of two jobs
    Practically no jobs use full machine

    Conclusion: buddy system should work well

  • Verification

    Feitelson, JSSPP 1996

    Using Feitelson workload

  • Example #2

    Parallel Job Scheduling

    and Job Scaling

  • Variable Partitioning

    Each job gets a dedicated partition for the duration of its execution
    Resembles 2D bin packing
    Packing large jobs first should lead to better performance
    But what about correlation of size and runtime?

    First-fit decreasing is a near-optimal packing heuristic
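
    A minimal sketch of the packing intuition, on a deliberately simplified
    1D version of the problem (processor counts only, ignoring runtime):
    first-fit decreasing sorts jobs by size and places each in the first
    scheduling "slot" with enough free processors. The job sizes and the
    slot abstraction are assumptions for illustration.

        # First-fit decreasing on a 1D simplification: each "slot" is one
        # scheduling round of a P-processor machine; jobs are packed by
        # processor count only.
        def first_fit_decreasing(job_sizes, processors):
            slots = []                                # free processors per slot
            placement = {}
            for job, size in sorted(enumerate(job_sizes),
                                    key=lambda x: x[1], reverse=True):
                for i, free in enumerate(slots):
                    if size <= free:                  # first slot with room
                        slots[i] -= size
                        placement[job] = i
                        break
                else:                                 # no slot fits: open a new one
                    slots.append(processors - size)
                    placement[job] = len(slots) - 1
            return placement, len(slots)

        placement, rounds = first_fit_decreasing([6, 2, 5, 3, 7, 1], processors=8)
        print(placement, rounds)                      # packs into 3 rounds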

  • Scaling Models

    Constant work: parallelism for speedup (Amdahl's Law); large first ≈ SJF
    Constant time: size and runtime are uncorrelated
    Memory bound: large first ≈ LJF; full-size jobs lead to blockout

    Worley, SIAM JSSC 1990

    Question is which model applies within the context of a single machine
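
    To make the three models concrete, a hedged sketch of how runtime might
    behave as job size grows under each one. The functional forms and
    constants below are simple placeholders chosen to show the trends, not
    Worley's formulations.

        # Illustrative runtime-vs-size trends under the three scaling models.
        def constant_work(n, work=1000.0):
            # fixed total work split across n processors: big jobs finish
            # faster, so "large first" behaves like SJF
            return work / n

        def constant_time(n, time=100.0):
            # the problem grows with the machine, runtime stays flat:
            # size and runtime are uncorrelated
            return time

        def memory_bound(n, per_node_work=100.0):
            # the problem grows to fill per-node memory, so runtime grows
            # with size and "large first" behaves like LJF
            return per_node_work * n ** 0.5

        for n in (1, 4, 16, 64):
            print(f"{n:3d} procs: "
                  f"constant work {constant_work(n):7.1f}  "
                  f"constant time {constant_time(n):6.1f}  "
                  f"memory bound {memory_bound(n):7.1f}")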

  • Scan Algorithm

    Keep jobs in separate queues according to size (sizes are powers of 2)
    Serve the queues Round Robin, scheduling all jobs from each queue (they pack perfectly)
    Assuming constant work model, large jobs only block the machine for a short time
    But the memory bound model would lead to excessive queueing of small jobs

    Krueger et al., IEEE TPDS 1994

    Important point: schedule order determined by size
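
    A minimal sketch of the Scan idea, under simplifying assumptions: jobs
    are queued by power-of-two size class and the classes are visited in
    turn, draining one class at a time so equal-size jobs pack the machine
    perfectly. Names and details (batching, tie handling) are illustrative.

        from collections import defaultdict, deque
        from math import ceil, log2

        class ScanScheduler:
            def __init__(self, processors):
                self.processors = processors
                self.queues = defaultdict(deque)    # size class -> waiting jobs

            def submit(self, job_id, size):
                size_class = 2 ** ceil(log2(size))  # round up to a power of two
                self.queues[size_class].append(job_id)

            def scan(self):
                """One scan: visit size classes in order and drain each queue,
                dispatching jobs in machine-filling batches."""
                for size_class in sorted(self.queues):
                    queue = self.queues[size_class]
                    per_batch = max(1, self.processors // size_class)
                    while queue:
                        batch = [queue.popleft()
                                 for _ in range(min(per_batch, len(queue)))]
                        print(f"run {batch} together (size class {size_class})")

        s = ScanScheduler(processors=16)
        for job_id, size in [("a", 3), ("b", 4), ("c", 4), ("d", 16), ("e", 1)]:
            s.submit(job_id, size)
        s.scan()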

  • The Data

    Data: SDSC Paragon, 1995/6

  • The Data

    Data: SDSC Paragon, 1995/6

    Partitions with equal numbers of jobs; many more small jobs.

  • The Data

    Data: SDSC Paragon, 1995/6

    Similar range, different shape; 80th percentile moves from

  • Conclusion

    Parallelism used for better results, not for faster results
    Constant work model is unrealistic
    Memory bound model is reasonable
    Scan algorithm will probably not perform well in practice
  • Example #3

    Backfilling and

    User Runtime Estimation

  • Backfilling

    Variable partitioning can suffer from external fragmentation
    Backfilling optimization: move jobs forward to fill in holes in the schedule
    Requires knowledge of expected job runtimes
  • Variants

    EASY backfilling

    Make reservation for first queued job

    Conservative backfilling

    Make reservation for all queued jobs
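
    A minimal sketch of the difference between the two variants: a waiting
    job may be backfilled now if, using its user-supplied runtime estimate,
    it does not delay the protected reservation(s). EASY protects only the
    reservation of the first queued job; conservative protects all of them.
    The free-processor bookkeeping and the sample numbers are simplifying
    assumptions.

        # Backfilling decision rule under EASY vs. conservative (simplified).
        # A reservation is (start_time, processors) held for a queued job.
        def can_backfill(now, free_procs, candidate, reservations):
            """candidate = (procs, runtime_estimate)."""
            procs, estimate = candidate
            if procs > free_procs:
                return False
            finish = now + estimate
            for start, reserved in reservations:
                # would the candidate still hold processors that the reserved
                # job needs when its reservation comes due?
                if finish > start and free_procs - procs < reserved:
                    return False
            return True

        now, free = 0, 10
        candidates = [(4, 30), (6, 80)]          # (procs, estimate)
        easy = [(100, 8)]                        # protect only the queue head
        conservative = [(100, 8), (60, 6)]       # protect every queued job

        for job in candidates:
            print(job,
                  "EASY:", can_backfill(now, free, job, easy),
                  "conservative:", can_backfill(now, free, job, conservative))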

  • User Runtime Estimates

    Lower estimates improve chance of backfilling and better response time
    Too low estimates run the risk of having the job killed
    So estimates should be accurate, right?
  • They Aren't

    Mualem & Feitelson, IEEE TPDS 2001

    Short=failed; killed typically exceeded runtime estimate, ~15%

  • Surprising Consequences

    Inaccurate estimates actually lead to improved performance
    Performance evaluation results may depend on the accuracy of runtime estimates
    Example: EASY vs. conservative
    Using different workloads
    And different metrics

    Will focus on second bullet

  • EASY vs. Conservative

    Using CTC SP2 workload

  • EASY vs. Conservative

    Using Jann workload model

    Note: Jann model of CTC

  • EASY vs. Conservative

    Using Feitelson workload model

  • Conflicting Results Explained

    Jann uses accurate runtime estimates
    This leads to a tighter schedule
    EASY is not affected too much
    Conservative manages less backfilling of long jobs, because it respects more reservations

    Relative measure: more by EASY = less by conservative

  • Conservative is bad for the long jobs
    Good for short ones, whose reservations are respected



  • Conflicting Results Explained

    Response time sensitive to long jobs, which favor EASY
    Slowdown sensitive to short jobs, which favor conservative
    All this does not happen at CTC, because estimates are so loose that backfill can occur even under conservative
  • Verification

    Run CTC workload with accurate estimates

  • But What About My Model?

    Simply does not have such small long jobs

  • Workload Data Sources

  • No Data

    Innovative, unprecedented systems: wireless, hand-held
    Use an educated guess: self similarity, heavy tails, Zipf distribution
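
    A minimal sketch of turning such educated guesses into a synthetic
    workload: heavy-tailed (Pareto) runtimes and Zipf-like user activity.
    The distributions are the ones named above, but the parameter values
    are arbitrary assumptions, not fitted to any system.

        import random

        def pareto_runtime(alpha=1.5, minimum=10.0):
            # heavy tail: Pr[X > x] ~ (minimum / x) ** alpha
            return minimum * random.paretovariate(alpha)

        def zipf_user(num_users=100, s=1.0):
            # user i is chosen with probability proportional to 1 / i**s
            weights = [1.0 / (i ** s) for i in range(1, num_users + 1)]
            return random.choices(range(1, num_users + 1), weights=weights)[0]

        random.seed(0)
        for _ in range(5):
            print(f"user{zipf_user():03d} runtime {pareto_runtime():8.1f} s")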
  • Serendipitous Data

    Data may be collected for various reasons: accounting logs, audit logs, debugging logs, just-so logs
    Can lead to wealth of information
  • NASA Ames iPSC/860 log

    42050 jobs from Oct-Dec 1993

    user      job     nodes  runtime  date      time
    user4     cmd8       32       70  11/10/93  10:13:17
    user4     cmd8       32       70  11/10/93  10:19:30
    user42    nqs450     32     3300  11/10/93  10:22:07
    user41    cmd342      4       54  11/10/93  10:22:37
    sysadmin  pwd         1        6  11/10/93  10:22:42
    user4     cmd8       32       60  11/10/93  10:25:42
    sysadmin  pwd         1        3  11/10/93  10:30:43
    user41    cmd342      4      126  11/10/93  10:31:32

    Feitelson & Nitzberg, JSSPP 1995
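
    A small parsing sketch, assuming the whitespace-separated fields shown
    above (user, job, nodes, runtime in seconds, date, time). The record
    class and field names are this sketch's own, not part of the log.

        from dataclasses import dataclass
        from datetime import datetime

        @dataclass
        class JobRecord:
            user: str
            job: str
            nodes: int
            runtime: int              # seconds
            start: datetime

        def parse_line(line):
            user, job, nodes, runtime, date, time = line.split()
            start = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
            return JobRecord(user, job, int(nodes), int(runtime), start)

        rec = parse_line("user42 nqs450 32 3300 11/10/93 10:22:07")
        print(rec.user, rec.nodes, rec.runtime, rec.start)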

  • Distribution of Job Sizes


  • Distribution of Resource Use


  • Degree of Multiprogramming

  • System Utilization

  • Job Arrivals

  • Arriving Job Sizes

  • Distribution of Interarrival Times

  • Distribution of Runtimes

  • User Activity

  • Repeated Execution

  • Application Moldability

    Of jobs run more than once

  • Distribution of Run Lengths

  • Predictability in Repeated Runs

    For jobs run more than 5 times

  • Recurring Findings

    Many small and serial jobs
    Many power-of-two jobs
    Weak correlation of job size and duration
    Job runtimes are bounded but have CV > 1
    Inaccurate user runtime estimates
    Non-stationary arrivals (daily/weekly cycle)
    Power-law user activity, run lengths
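
    Two of these checks as a small sketch, applied to the handful of sample
    log lines shown earlier (so the numbers are illustrative, not the full
    log): the coefficient of variation of runtimes (CV > 1 means more
    variable than an exponential) and the fraction of power-of-two sizes.

        from statistics import mean, pstdev

        def coefficient_of_variation(values):
            return pstdev(values) / mean(values)

        def power_of_two_fraction(sizes):
            return sum(1 for n in sizes if n & (n - 1) == 0) / len(sizes)

        # runtimes and node counts from the NASA iPSC/860 sample lines above
        runtimes = [70, 70, 3300, 54, 6, 60, 3, 126]
        sizes = [32, 32, 32, 4, 1, 32, 1, 4]

        print(f"runtime CV = {coefficient_of_variation(runtimes):.2f}")
        print(f"power-of-two jobs = {power_of_two_fraction(sizes):.0%}")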
  • Instrumentation

    Passive: snoop without interfering
    Active: modify the system
    Collecting the data interferes with system behavior
    Saving or downloading the data causes additional interference
    Partial solution: model the interference
  • Data Sanitation

    Strange things happen
    Leaving them in is safe and faithful to the real data
    But it risks a non-representative situation dominating the evaluation results
  • Arrivals to SDSC SP2

  • Arrivals to LANL CM-5

  • Arrivals to CTC SP2