Towards Adaptive Caching for Parallel and Distributed Simulation



Abhishek Chugh & Maria Hybinette

Computer Science Department

The University of Georgia

WSC-2004

Simulation Model Assumptions

[Figure: example airspace simulation spanning Atlanta and Munich, modeled as LPs exchanging messages]

Collection of Logical Processes (LPs)
LPs do not share state variables
LPs communicate by exchanging time-stamped messages
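As a small illustration of these assumptions, a time-stamped LP message might look like the following; the type and field names are hypothetical, not from the paper:

    /* Hypothetical time-stamped message exchanged between LPs. */
    typedef struct {
        double timestamp;   /* simulation time at which the event occurs */
        int    src_lp;      /* sending logical process                   */
        int    dst_lp;      /* receiving logical process                 */
        void  *payload;     /* application-specific event data           */
    } Message;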


Problem & Goal

Problem: inefficiency in PDES due to redundant computations

Observation: computations repeat in:
» long runs of simulations
» cyclic systems
» communication network simulations

Goal: increase efficiency by reusing computations


Approach

Cache computations and, when they repeat, re-use them instead of re-computing.

[Figure: LPs exchanging time-stamped messages, with a cache serving repeated computations]


Approach: Adaptive Caching

Cache computations and, when they repeat, re-use them instead of re-computing.

Generic caching mechanism independent of simulation engine and application

Caveat: several factors affect the effectiveness of caching
» Proposal: an adaptive approach


Factors Affecting Caching Effectiveness

Cache size
Cost of looking up into the cache and updating the cache
Execution time of the computation
Probability of a hit (hit rate)


Effective Caching Cost

E(Cost_use_cache) = hit_rate * Cost_lookup_hit
                  + (1 - hit_rate) * (Cost_lookup_miss + Cost_computation + Cost_insert)


Caching is Not Always a Good Idea

E(Cost_use_cache) = hit_rate * Cost_lookup_hit
                  + (1 - hit_rate) * (Cost_lookup_miss + Cost_computation + Cost_insert)

Caching hurts when the hit rate is low or the computation is very fast: it is worthwhile only when E(Cost_use_cache) < Cost_computation.
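Rearranging the inequality makes the break-even point explicit: assuming Cost_computation > Cost_lookup_hit, caching pays off exactly when

hit_rate > (Cost_lookup_miss + Cost_insert) / (Cost_lookup_miss + Cost_insert + Cost_computation - Cost_lookup_hit)

so the more expensive the computation, the lower the hit rate needed to justify the cache.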


How Much Speedup is Possible?

Neglecting cache warm-up and fixed costs:

Expected Speedup = Cost_computation / Cost_use_cache

Upper bound (hit_rate = 1):

Expected Speedup = Cost_computation / Cost_lookup

In our experiments, Cost_computation / Cost_lookup ≈ 3.5
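As a worked example using the figures above (ignoring Cost_insert and taking Cost_lookup_hit ≈ Cost_lookup_miss, both simplifying assumptions): with hit_rate = 0.7 and Cost_lookup = Cost_computation / 3.5, E(Cost_use_cache) = 0.7 * Cost_lookup + 0.3 * (Cost_lookup + Cost_computation) ≈ 0.59 * Cost_computation, for an expected speedup of about 1.7.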


Related Work

Function caching: replace application-level function calls with cache queries
» Introduced by Bellman (1957); Michie (1968)
» Incremental computations:
– Pugh & Teitelbaum (1989); Liu & Teitelbaum (1995)
» Sequential discrete event simulation:
– Staged Simulation: Walsh & Sirer (2003): function caching plus currying (breaking up computations), re-ordering, and pre-computation

Decision tool techniques for PADS: multiple runs of similar simulations
» Simulation Cloning: Hybinette & Fujimoto (1998); Chen, Turner et al. (2002); Strassburger (2000)
» Updateable Simulations: Ferenci et al. (2002)

Related optimization techniques
» Lazy re-evaluation: West (1988)


Overview of Adaptive Caching

Execution time:

1. Warm-up execution phase, for each function:
   a) Monitor: hit rate, query time, function run time
   b) Determine the utility of using the cache
2. Main execution phase, for each function:
   a) Use the cache (or not), depending on the results from phase 1
   b) Randomly sample: hit rate, query time, function run time
      » Revise the decision if conditions change

A sketch of this decision logic follows the list.
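A minimal sketch of how the per-function decision might be computed from the monitored quantities, using the cost model above; the struct and names are illustrative assumptions, not the paper's API:

    #include <stdbool.h>

    /* Illustrative per-function statistics gathered during warm-up
     * and refreshed by random sampling. */
    typedef struct {
        double hit_rate;      /* observed fraction of cache hits          */
        double lookup_hit;    /* avg. cost of a lookup that hits          */
        double lookup_miss;   /* avg. cost of a lookup that misses        */
        double insert_cost;   /* avg. cost of inserting a result          */
        double compute_cost;  /* avg. cost of running the function itself */
        bool   use_cache;     /* current per-function decision            */
    } FuncStats;

    /* E(Cost_use_cache), exactly as in the cost model above. */
    static double expected_cache_cost(const FuncStats *s)
    {
        return s->hit_rate * s->lookup_hit
             + (1.0 - s->hit_rate)
               * (s->lookup_miss + s->compute_cost + s->insert_cost);
    }

    /* Called at the end of warm-up, and again whenever a random
     * sample suggests conditions have changed. */
    static void revise_decision(FuncStats *s)
    {
        s->use_cache = expected_cache_cost(s) < s->compute_cost;
    }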


What’s New

The decision to use the cache is made dynamically
» in response to unpredictable local conditions, for each LP, at execution time

Relieves the user of having to know whether something is worth caching
» the adaptive method automatically identifies caching opportunities and rejects poor caching choices

Easy-to-use caching API
» independent of the application and the simulation kernel
» cache middleware

Distributed cache
» each LP maintains its own independent cache


Pseudo-Code Example

// LP CODE WITH CACHING

LP_init()
{
    cacheInitialize( argc, argv );
}

Proc( state, msg, MyPE )
{
    retval = cacheCheckStart( currentstate, event );
    if ( retval == NULL )
    {
        /* original LP code: compute the new state and
           the events to be scheduled */

        /* allow the cache to save the results */
        cacheCheckEnd( newstate, newevents );
    }
    else
    {
        newstate  = retval.state;
        newevents = retval.events;
    }
    schedule( newevents );
}
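Bracketing the event handler between cacheCheckStart() and cacheCheckEnd() is what lets the middleware capture the computation's outputs on a miss and replay them on a hit; the body of the LP code itself needs no structural revision.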


Implementation


Caching Middleware

Simulation Application

Cache Middleware

Simulation Kernel


Caching Middleware (Hit)

Simulation Application

Cache Middleware

Simulation Kernel

Check the cache with the current state/message: on a cache hit, the cached result is returned without invoking the computation.


Caching Middleware (Miss)

Simulation Application

Cache Middleware

Simulation Kernel

Check the cache with the current state/message: cache miss
On a miss (or when the cache lookup is too expensive), the original computation runs
The new state & messages are then inserted into the cache


Cache Implementation

Hash table with separate chaining
Input: current state & message
Output: new state and output message(s)
Hash function: djb2 (by Dan Bernstein; a variant is used in Perl)
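A minimal sketch of such a table, assuming the key is the serialized bytes of the (state, message) pair; the layout and names are illustrative, not the paper's implementation:

    #include <string.h>

    #define NUM_BUCKETS 4096

    /* One cached computation: key = (state, message), value = results. */
    typedef struct CacheEntry {
        void  *key;               /* serialized (state, message) bytes */
        size_t key_len;
        void  *value;             /* saved new state + output messages */
        struct CacheEntry *next;  /* separate chaining on collision    */
    } CacheEntry;

    static CacheEntry *buckets[NUM_BUCKETS];

    /* djb2 by Dan Bernstein: hash = hash * 33 + c */
    static unsigned long djb2(const unsigned char *p, size_t n)
    {
        unsigned long h = 5381;
        while (n--)
            h = ((h << 5) + h) + *p++;
        return h;
    }

    /* Return the cached value for a key, or NULL on a miss. */
    static void *cache_lookup(const void *key, size_t key_len)
    {
        unsigned long b = djb2(key, key_len) % NUM_BUCKETS;
        for (CacheEntry *e = buckets[b]; e != NULL; e = e->next)
            if (e->key_len == key_len && memcmp(e->key, key, key_len) == 0)
                return e->value;
        return NULL;
    }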


Memory Management

Distributed cache: one for each LP
Pre-allocate a memory pool for the cache in each LP during the initialization phase
Upper limit is parameterized
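A sketch of the pre-allocated pool, assuming a simple bump allocator capped at the parameterized limit; the interface is an assumption for illustration:

    #include <stddef.h>
    #include <stdlib.h>

    /* Per-LP memory pool, reserved once during initialization. */
    typedef struct {
        char  *base;    /* start of the pre-allocated block   */
        size_t limit;   /* parameterized upper bound in bytes */
        size_t used;    /* bump-pointer offset                */
    } CachePool;

    static int pool_init(CachePool *p, size_t limit)
    {
        p->base  = malloc(limit);   /* reserve the whole pool up front */
        p->limit = limit;
        p->used  = 0;
        return p->base != NULL;
    }

    /* Hand out cache entries from the pool; returns NULL once the
     * limit is reached, at which point the cache stops growing. */
    static void *pool_alloc(CachePool *p, size_t n)
    {
        n = (n + 15) & ~(size_t)15;   /* keep allocations 16-byte aligned */
        if (p->used + n > p->limit)
            return NULL;
        void *out = p->base + p->used;
        p->used += n;
        return out;
    }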


Experiments

3 sets of experiments with P-Hold:
» Proof of concept (no adaptive caching): hit rate
» Impact of cache size and simulation running time on speedup (no caching / caching)
» Adaptive caching with regard to the cost of event computation

16-processor SGI Origin 2000
» 4 processors used

Time stamps "curried" out

Hit Rate versus Progress

[Figure: hit rate (%) vs. progress in simulated time, for three cache sizes up to 10000 KB (100%)]

As expected, the hit rate increases as the cache size increases; the largest cache reaches the maximum hit rate. The hit rate sets an upper bound on the achievable speedup.


Speedup vs Cache Size

[Figure: speedup (no caching / caching) vs. size of cache (KB), for event computation times of 5 msec and 3 msec]

Speedup improves as the size of the cache increases. Beyond roughly 9,000 KB the speedup declines and then levels off. Performance is better for simulations whose computations have higher latency.

Speedup vs Cost_computation

Non-adaptive caching suffers a slowdown (speedup of 0.82) for low-latency computations, improving to a speedup of 1 as the computational latency approaches 1.5 msec.

[Figure: speedup (caching / no caching) vs. computational latency (msec), non-adaptive caching]

Speedup vs Cost_computation

Adaptive caching tracks the cost of consulting the cache against the cost of running the actual computation.

Adaptive caching holds the speedup at 1 for small computational latencies (it selects performing the computation instead of consulting the cache).

[Figure: speedup (caching / no caching) vs. computational latency (msec), non-adaptive vs. adaptive caching]


Summary & Future Work

Summary:
Middleware implementation that requires no major structural revision of application code
Best-case speedup approaches 3.5; worst-case speedup is 1 (speedup is limited by a hit rate of 70%)
With randomly generated information (such as time stamps), caching may become ineffective unless precautions are taken

Future Work:
Function caching instead of LP caching
Look at series of functions to jump forward
Adaptive replacement strategies


Closing

“A sword wielded poorly will kill its owner”

-- Ancient Proverb


Pseudo-Code Example

// ORIGINAL LP CODE

LP_init()
{
    //
    //
}

Proc( state, msg, MyPE )
{
    val1 = fancy_function( msg->param1, state->key_part );
    val2 = fancier_function( msg->param3 );
    state->key_part = val1 + val2;
}

// LP CODE WITH CACHING

LP_init()
{
    cache_init( FF1, SIZE1, 2, fancy_function );
    cache_init( FF2, SIZE2, 1, fancier_function );
}

Proc( state, msg, MyPE )
{
    val1 = cache_query( FF1, msg->param1, state->key_part );
    val2 = cache_query( FF2, msg->param3 );
    state->key_part = val1 + val2;
}
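A reading of this example (an inference, not stated in the deck): the third argument to cache_init() appears to be the arity of the cached function (2 for fancy_function, 1 for fancier_function), while FF1/FF2 and SIZE1/SIZE2 name the per-function cache handles and their size limits. Each original call site then becomes a cache_query() on the corresponding handle.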
