Modeling shared cache and bus in multi-core platforms for timing analysis
Sudipta Chattopadhyay, Abhik Roychoudhury, Tulika Mitra
Timing analysis (basics)
- Hard real-time systems need to meet certain deadlines
- System-level (schedulability) analysis builds on single-task analysis, i.e., Worst-Case Execution Time (WCET) analysis
- WCET: an upper bound on the execution time of a program, over all possible inputs, for a given hardware platform; usually obtained by static analysis

Usage of WCET
- Schedulability analysis of hard real-time systems
- Worst-case-oriented optimization
WCET and BCET
[Figure: distribution of execution times. Estimated BCET <= Observed BCET <= Actual BCET <= Actual WCET <= Observed WCET <= Estimated WCET; the gap between actual and estimated bounds is the over-estimation. WCET = Worst-Case Execution Time, BCET = Best-Case Execution Time.]
Timing analysis for multi-cores
- Modeling the shared cache and shared bus, the most common forms of resource sharing in multi-cores
- Difficulties: conflicts in the shared cache arising from other cores; contention on the shared bus introduced by other cores; interaction between the shared cache and the shared bus
Commercial multi-core (Intel Core 2 Duo)
[Figure: two processors, each with cores (private L1 caches) connected through a crossbar to a shared L2 cache; both processors sit on a shared off-chip bus to off-chip memory.]
- Presence of both a shared cache and a shared bus
Modeled architecture
- The shared cache is accessed through a shared bus
[Figure: Architecture A: Core 0 ... Core N, each with a private L1, connected by a shared bus to a shared L2. Architecture B: each core with private L1 and L2, connected by a shared bus to a shared L3.]
Assumptions
- Perfect data cache; currently we model only the shared instruction cache
- The shared bus is TDMA (Time Division Multiple Access), with slots assigned in round-robin fashion; TDMA is chosen for predictability
- Separate instruction and data buses; bus traffic arising from data memory accesses is ignored
- No self-modifying code, so cache coherence need not be modeled
- Non-preemptive scheduling
Overview of the framework
[Flowchart: L1 cache analysis and L2 cache analysis produce a cache access classification; L2 conflict analysis (seeded with an initial interference estimate) feeds a bus-aware analysis and WCRT computation; if the interference changes, the analysis repeats, otherwise the estimated WCRT is reported.]
- Iterative fix-point analysis
- Termination of our analysis is guaranteed
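The iterative fix-point loop can be sketched as follows. This is an illustrative model, not the paper's implementation: `analyze_interference` and `compute_wcrt` are hypothetical stand-ins for the cache/bus analysis and the WCRT computation.

```python
def fixpoint_wcrt(tasks, analyze_interference, compute_wcrt):
    """Iterate conflict analysis and WCRT computation until the task
    interference estimate stops changing (the fix point)."""
    # Initially, assume every task's lifetime spans the whole schedule,
    # i.e., all tasks may interfere with each other.
    lifetimes = {t: (0, float("inf")) for t in tasks}
    interference = analyze_interference(lifetimes)
    while True:
        wcrt, lifetimes = compute_wcrt(interference)
        new_interference = analyze_interference(lifetimes)
        if new_interference == interference:  # no change: fix point reached
            return wcrt
        interference = new_interference
```

In the paper's framework, termination follows because the interference estimate can only shrink monotonically across iterations.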
Framework components: L1 cache analysis
L1 cache analysis (Ferdinand et al., RTS '97)
- Abstract cache set: maps memory blocks to ages, from low (most recently used) to high; blocks aged beyond the associativity are evicted
[Figure: abstract cache sets such as {a}, {b,c}, {c} being joined; evicted blocks fall off the high-age end.]
- Must join: intersection, maximum age; finds All Hit (AH) cache blocks
- May join: union, minimum age; finds All Miss (AM) cache blocks
- Persistence join: union, maximum age; finds Persistence (PS), i.e., never-evicted, cache blocks
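The three join operators can be sketched on abstract cache sets represented as block-to-age maps (age 0 = youngest). This is an illustrative model of the abstract domain, not the analyzer's code.

```python
def must_join(a, b):
    # Intersection with maximum age: a block survives only if it is in
    # both states, at its oldest possible age -> All Hit (AH) blocks.
    return {m: max(a[m], b[m]) for m in a.keys() & b.keys()}

def may_join(a, b):
    # Union with minimum age: a block is kept if it may be in either
    # state; blocks absent from the may state are All Miss (AM).
    out = dict(a)
    for m, age in b.items():
        out[m] = min(out[m], age) if m in out else age
    return out

def persistence_join(a, b):
    # Union with maximum age: tracks blocks that, once loaded, are
    # never evicted -> Persistence (PS) blocks.
    out = dict(a)
    for m, age in b.items():
        out[m] = max(out[m], age) if m in out else age
    return out
```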
Framework components: L2 cache analysis
L2 cache analysis: per-core L2 cache analysis (Puaut et al., RTSS 2008)
[Figure: a memory reference first passes through L1 cache analysis; only references that may miss in L1 are fed to the L2 analysis.]
- Always accessed (A): the reference is All Miss in L1; the abstract L2 cache state is updated, ACS_out = U(ACS_in)
- Never accessed (N): the reference is All Hit in L1; the L2 state is unchanged, ACS_out = ACS_in
- Unknown (U): the reference is Persistence or NC in L1; both outcomes are possible, ACS_out = Join(ACS_in, U(ACS_in))
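The per-core filter can be sketched as a small classification function; the string labels are illustrative, not the paper's notation.

```python
def l2_access(l1_class):
    """Classify how a reference reaches the shared L2 cache, given its
    L1 classification (AH = all hit, AM = all miss, PS/NC otherwise)."""
    if l1_class == "AH":
        return "N"   # Never accessed: L2 state unchanged, ACS_out = ACS_in
    if l1_class == "AM":
        return "A"   # Always accessed: ACS_out = U(ACS_in)
    return "U"       # Unknown: ACS_out = Join(ACS_in, U(ACS_in))
```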
Framework components: L2 conflict analysis
Shared cache conflict analysis
- Our past work (RTSS 2009): exploit task lifetimes to refine the shared cache analysis
- Task interference graph: there is an edge between two task nodes if they have overlapping lifetimes
- Analyze each cache set C individually
Task interference graph
[Figure: a timeline of tasks T1, T2, T3; overlapping lifetimes induce edges in the task interference graph.]
Cache conflict analysis
[Figure: task interference graph over T1, T2, T3 for a cache set C with associativity 4; the tasks map M(C) = 1, 2, 1 conflicting blocks (m1, m2, m3) to set C. Shifting ages by the interfering blocks still fits within the associativity.]
- After conflict analysis: m1: AH -> AH, m2: AH -> AH, m3: AH -> AH
- All memory blocks remain all hits
Cache conflict analysis (contd.)
[Figure: same setting, but now the tasks map M(C) = 1, 3, 1 conflicting blocks (m0 and m1, m2, m3) to set C; the shifted age of m2 exceeds the associativity.]
- After conflict analysis: m1: AH -> AH, m2: AH -> NC, m3: AH -> AH
- m2 may be replaced from the cache due to conflicts from other cores
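The age-shifting step of the conflict analysis can be sketched as below. The must-age of 2 for m2 is an assumed value for illustration; the M(C) values match the two examples.

```python
def shifted_class(must_age, conflicts, associativity):
    """Re-classify a shared-cache block after conflict analysis.

    must_age      : the block's age in the must analysis (1 = youngest)
    conflicts     : M(C) values of interfering tasks on other cores,
                    i.e., their numbers of conflicting blocks in set C
    associativity : number of ways in the shared cache set
    """
    # Each interfering block can age this block by one position.
    if must_age + sum(conflicts) <= associativity:
        return "AH"   # still guaranteed to be in the cache
    return "NC"       # may be evicted by other cores: not classified
```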
Framework components: bus-aware analysis
Example: variable bus delay
- Bus slot: 50 cycles, L2 hit: 10 cycles, L2 miss: 20 cycles
[Figure: code executing on Core 0 with a common path and left/right branches; computation segments C1 = 20, C2 = 10, C3 = 20, C4 = 20/30, C5 = 10 and memory accesses M1 = 10 (L2 hit), M2 = 20 (L2 miss); the timeline alternates Core 0 and Core 1 bus slots at t = 0, 50, 100, 150.]
- First iteration: no bus delay
Example: variable bus delay (contd.)
[Figure: same code and parameters; in the second iteration, M1 falls outside Core 0's bus slot.]
- Second iteration: M1 suffers a 20-cycle bus delay
- Conclusion: the WCET of different iterations could be different
Possible solutions
- Source of the problem: each iteration of a loop may start at a different offset relative to its bus slot
- Virtually unroll all loop iterations: too expensive
- Do not model the bus, or take the maximum possible bus delay: imprecise result
- Our solution: assume each loop iteration starts at the same offset relative to its bus slot and add the necessary alignment cost
Key observation
[Figure: bus schedule as a timeline of alternating Core 0 and Core 1 slots; a task T starts at offset Δ into a Core 0 slot.]
- A round-robin schedule follows a repeating pattern
- T must follow the same execution pattern if the offset Δ is the same
Revisiting the example
- Bus slot: 50 cycles, L2 hit: 10 cycles, L2 miss: 20 cycles
[Figure: the second iteration is aligned to start at the same offset in Core 0's bus slot; alignment cost = 20 cycles.]
- With this alignment, all iterations follow the same execution pattern
- WCET of one iteration <= 100 cycles
- No need to virtually unroll the loop
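Under the round-robin TDMA assumption, the bus delay and alignment cost can be sketched as below. These are hypothetical helper functions, using the example's parameters (2 cores, 50-cycle slots).

```python
def bus_delay(t, core, n_cores, slot, latency):
    """Cycles a request issued at time t by `core` must wait until a
    transfer of `latency` cycles fits entirely in one of its TDMA
    slots (round-robin schedule, slot length `slot`)."""
    period = n_cores * slot
    pos = (t - core * slot) % period   # offset into this core's round
    if pos + latency <= slot:          # fits in the current slot
        return 0
    return period - pos                # wait for the core's next slot

def alignment_cost(t, core, n_cores, slot):
    """Cycles needed to align time t to the start of the core's next slot."""
    period = n_cores * slot
    return (core * slot - t) % period
```

With two cores and 50-cycle slots, an iteration that finishes at t = 80 pays an alignment cost of 20 cycles to restart at t = 100, matching the example.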
Partial unrolling
[Figure: a small loop body (C1 = 10, M2 = 10 (L2 hit), C2 = 10) on Core 0; with no unrolling, each iteration pays the alignment cost; with partial unrolling, several iterations are packed into one Core 0 bus slot (t = 0 to t = 100).]
- The alignment cost is high if the loop is very small compared to the length of the bus slot
- Partially unroll such loops until one bus slot is filled
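The unroll factor can be sketched as a simple computation; `unroll_factor` is a hypothetical helper, and the 30-cycle iteration length matches the figure's small loop body.

```python
import math

def unroll_factor(iteration_len, slot_len):
    """How many iterations of a small loop to unroll so that one bus
    slot is filled; loops longer than a slot are left as-is."""
    return max(1, math.ceil(slot_len / iteration_len))
```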
Extension to full programs
[Figure: the WCET of an inner loop is computed first and then composed into the WCET of the outer loop.]
Framework components: bus-aware WCET/BCET computation and WCRT computation
WCRT computation
[Figure: task graph t1 -> {t2, t3} -> t4; assigned cores in parentheses: t1(1), t2(2), t3(2), t4(1). Peers are tasks assigned to the same core with overlapping lifetimes.]
- Task lifetime: [EarliestReady, LatestFinish]
- Earliest time computation:
  EarliestReady(t1) = 0
  EarliestReady(t4) >= EarliestFinish(t2)
  EarliestReady(t4) >= EarliestFinish(t3)
  EarliestFinish = EarliestReady + BCET
- Latest time computation:
  LatestReady(t4) >= LatestFinish(t2)
  LatestReady(t4) >= LatestFinish(t3)
  t2 has peers: LatestFinish(t2) = LatestReady(t2) + WCET(t2) + WCET(t3)
  t4 has no peers: LatestFinish(t4) = LatestReady(t4) + WCET(t4)
- Computed WCRT = LatestFinish(t4)
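The earliest/latest time computation can be sketched over a task DAG. This is a simplified model: the tasks are assumed to be given in topological order, and the peer sets are supplied directly rather than derived from lifetimes as in the full analysis.

```python
def compute_wcrt(preds, bcet, wcet, peers):
    """Worst-case response time of a task graph.

    preds[t] : predecessors of t; keys iterate in topological order
    peers[t] : tasks on t's core with overlapping lifetimes (their WCET
               is added because scheduling is non-preemptive)
    """
    earliest_finish, latest_finish = {}, {}
    for t in preds:
        earliest_ready = max((earliest_finish[p] for p in preds[t]), default=0)
        earliest_finish[t] = earliest_ready + bcet[t]
        latest_ready = max((latest_finish[p] for p in preds[t]), default=0)
        latest_finish[t] = (latest_ready + wcet[t]
                            + sum(wcet[p] for p in peers.get(t, ())))
    return max(latest_finish.values())  # WCRT = latest finish of a sink
```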
An example
- L2 hit: 10 cycles, L2 miss: 20 cycles, bus slot: 50 cycles
- M2.2 and M3.2 conflict in L2: both L2 miss; M4.2 is an L2 hit
[Figure: schedule on Core 0 and Core 1 with alternating bus slots; Core 0 runs T1.1 = 90, then T4.1 = 20, M4.2 = 10, T4.2 = 10; Core 1 runs T2.1 = 10, M2.2 = 20, T2.2 = 20, then T3.1 = 20, M3.2 = 20, T3.2 = 10, with bus waits.]
- With the bus schedule based on M2.2 and M3.2 being L2 misses: WCRT = 170 cycles
- But T2 and T3 have disjoint lifetimes, so M2.2 and M3.2 cannot conflict: both are L2 hits
Example (contd.)
[Figure: revised schedule with M2.2 = 10 and M3.2 = 10, both L2 hits, compared against the earlier schedule.]
- With the bus schedule based on M2.2 and M3.2 being L2 hits, the second bus wait for Core 1 is eliminated
- WCRT = 130 cycles
Experimental evaluation
- Tasks are compiled into SimpleScalar PISA-compliant binaries
- CMP_SIM is used for simulation; it is extended with shared-bus modeling and support for PISA-compliant binaries
- Two setups: independent tasks running on different cores, and task dependencies specified through a task graph
Overestimation ratio (2-core)
- One core runs statemate; the other core runs the program under evaluation
- Configuration: L1 cache: direct mapped, 1 KB; L2 cache: 4-way, 2 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles
- Average overestimation = 40%
Overestimation ratio (4-core)
- The four cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdcint, statemate)
- Configuration: L1 cache: direct mapped, 1 KB; L2 cache: 4-way, 2 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles
- Average overestimation = 40%
Sensitivity to bus slot length (2-core): average overestimation ratio for the program statemate
Sensitivity to bus slot length (4-core): average overestimation ratio for the program statemate
Debie is an online space-debris monitoring program developed by Space Systems Finland Ltd.
Extracted task graph (Debie-test)
[Figure: task graph used for WCRT analysis, with tasks main-tc(1), main-hm(1), main-tm(1), main-hit(1), main-aq(1), main-su(1) and tc-test(3), hm-test(4), tm-test(1), hit-test(2), aq-test(4), su-test(2); the number in parentheses is the assigned core.]
Experimental evaluation of Debie-test
- Configuration: L1 cache: 2-way, 2 KB; L2 cache: 4-way, 8 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles
- Overestimation ratio ~ 20%
- This clearly shows that for real-life applications, bus modeling is essential
Extension to a different multi-core architecture (e.g., Intel Core 2 Duo)
[Figure: the commercial multi-core diagram from before; two processors, each with private L1 caches and a shared L2, connected by crossbars to a shared off-chip bus and off-chip memory.]
- Only L2 cache misses appear on the shared bus
- The overall framework remains the same; shared-bus waiting time is computed only for L2 cache misses