
Page 1: Modeling shared cache and bus in multi-core platforms for timing analysis

Modeling shared cache and bus in multi-core platforms for timing analysis

Sudipta Chattopadhyay, Abhik Roychoudhury, Tulika Mitra

Page 2: Modeling shared cache and bus in multi-core platforms for timing analysis

Timing analysis (basics)

Hard real-time systems need to meet certain deadlines
  System-level (schedulability) analysis
  Single-task analysis (Worst-Case Execution Time analysis)

WCET: an upper bound on the execution time of a program, over all possible inputs, for a given hardware platform
  Usually obtained by static analysis

Usage of WCET
  Schedulability analysis of hard real-time systems
  Worst-case oriented optimization

Page 3: Modeling shared cache and bus in multi-core platforms for timing analysis

WCET and BCET

[Figure: distribution of execution times. Estimated BCET <= Observed BCET <= Actual BCET, and Actual WCET <= Observed WCET <= Estimated WCET; the gap between actual and estimated WCET is the over-estimation.]

WCET = Worst-Case Execution Time, BCET = Best-Case Execution Time

Page 4: Modeling shared cache and bus in multi-core platforms for timing analysis

Timing analysis for multi-cores

Modeling shared cache and shared bus
  The most common forms of resource sharing in multi-cores

Difficulties
  Conflicts in the shared cache arising from other cores
  Contention on the shared bus introduced by other cores
  Interaction between the shared cache and the shared bus

Page 5: Modeling shared cache and bus in multi-core platforms for timing analysis

Commercial multi-core (Intel Core 2 Duo)

[Figure: two processors, each containing cores with private L1 caches and a shared L2 cache, connected through crossbars to a shared off-chip bus and off-chip memory.]

Presence of both shared cache and shared bus

Page 6: Modeling shared cache and bus in multi-core platforms for timing analysis

Modeled architecture

The shared cache is accessed through a shared bus.

[Figure: Architecture A: cores with private L1 caches access a shared L2 over the shared bus. Architecture B: cores with private L1 and L2 caches access a shared L3 over the shared bus.]

Page 7: Modeling shared cache and bus in multi-core platforms for timing analysis

Assumptions

Perfect data cache; currently we model only the shared instruction cache
Shared bus is TDMA (Time Division Multiple Access) and TDMA slots are assigned in a round-robin fashion; TDMA is chosen for predictability
Separate instruction and data buses; bus traffic arising from data memory accesses is ignored
No self-modifying code; cache coherence need not be modeled
Non-preemptive scheduling

Page 8: Modeling shared cache and bus in multi-core platforms for timing analysis

Overview of the framework

[Flowchart: per-core L1 cache analysis and L2 cache analysis produce a cache access classification; the shared L2 conflict analysis, seeded with an initial interference estimate, refines this classification; the bus-aware analysis and the WCRT computation follow. If the interference estimate changes, the analysis repeats from the L2 conflict analysis; otherwise the estimated WCRT is reported.]

Iterative fix-point analysis; termination of our analysis is guaranteed. A sketch of the driver loop follows.
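A minimal sketch of the iterative fix-point driver, assuming the three analysis passes are supplied as callables; all names and signatures here are hypothetical, not the authors' API.

```python
from typing import Callable, Dict, Tuple

def analyze_wcrt(
    initial_interference: Dict,
    l2_conflict_analysis: Callable[[Dict], Dict],         # interference -> AH/AM/NC classification
    bus_aware_analysis: Callable[[Dict], Dict],            # classification -> per-task WCET/BCET
    wcrt_computation: Callable[[Dict], Tuple[int, Dict]],  # WCET/BCET -> (WCRT, new interference)
) -> int:
    interference = initial_interference
    while True:
        classification = l2_conflict_analysis(interference)
        wcet_bcet = bus_aware_analysis(classification)
        wcrt, new_interference = wcrt_computation(wcet_bcet)
        if new_interference == interference:  # fix-point reached: report the estimated WCRT
            return wcrt
        # the interference estimate changes monotonically across rounds,
        # which is why the slides can claim guaranteed termination
        interference = new_interference
```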

Page 9: Modeling shared cache and bus in multi-core platforms for timing analysis

Framework components

[Flowchart repeated; the intra-core steps, i.e. L1 cache analysis, L2 cache analysis and cache access classification, are highlighted.]

Page 10: Modeling shared cache and bus in multi-core platforms for timing analysis

L1 cache analysis (Ferdinand et al., RTS '97)

Abstract cache sets record the possible ages of memory blocks in each cache set; blocks whose age reaches the associativity are evicted.

[Figure: example abstract cache sets such as {a}, {b,c}, {c} being joined, with evicted blocks falling off at the highest age.]

Must join: intersection, maximum age; finds All Hit (AH) cache blocks
May join: union, minimum age; finds All Miss (AM) cache blocks
Persistence join: union, maximum age; finds Persistence (PS), i.e. never-evicted, cache blocks

A sketch of the must join follows.
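A minimal sketch of the must-analysis join (intersection, maximum age), assuming an abstract cache set is represented as a {block: age} mapping with smaller age meaning more recently used; this representation and the function name are illustrative assumptions.

```python
def must_join(acs_a: dict, acs_b: dict, associativity: int) -> dict:
    """Join two abstract cache sets for must analysis."""
    joined = {}
    for block in acs_a.keys() & acs_b.keys():   # intersection: block must be present on both paths
        age = max(acs_a[block], acs_b[block])    # keep the pessimistic (oldest) age
        if age < associativity:                  # an age >= associativity means the block is evicted
            joined[block] = age
    return joined

# Block 'a' is a guaranteed hit only if it is guaranteed in the cache on both incoming paths.
print(must_join({'a': 0, 'b': 1}, {'a': 2, 'c': 0}, associativity=4))  # {'a': 2}
```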

Page 11: Modeling shared cache and bus in multi-core platforms for timing analysis

Framework components

[Flowchart repeated; the per-core L2 cache analysis is highlighted.]

Page 12: Modeling shared cache and bus in multi-core platforms for timing analysis

Per-core L2 cache analysis (Puaut et al., RTSS 2008)

A memory reference is filtered by its L1 classification before it updates the L2 abstract cache state (ACS):

  L1 all hit: never accessed at L2 (N), ACS_out = ACS_in
  L1 all miss: always accessed at L2 (A), ACS_out = U(ACS_in), i.e. the ACS updated with the access
  L1 persistence or NC: unknown at L2 (U), ACS_out = Join(ACS_in, U(ACS_in))

A sketch of this update follows.
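A minimal sketch of the L1-filtered L2 update, assuming hypothetical helpers update_acs (applying the access to the abstract cache state) and join (the analysis-specific join operator).

```python
def l2_update(acs_in, block, l1_class, update_acs, join):
    """Propagate one memory reference into the L2 abstract cache state."""
    if l1_class == "AH":                        # filtered out: the reference never reaches L2
        return acs_in
    if l1_class == "AM":                        # the reference always reaches L2
        return update_acs(acs_in, block)
    # Persistence / NC: the reference may or may not reach L2, so keep both possibilities
    return join(acs_in, update_acs(acs_in, block))
```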

Page 13: Modeling shared cache and bus in multi-core platforms for timing analysis

Framework components

[Flowchart repeated; the shared L2 conflict analysis, seeded with the initial interference, is highlighted.]

Page 14: Modeling shared cache and bus in multi-core platforms for timing analysis

Shared cache conflict analysis

Our past work (RTSS 2009): exploit task lifetimes to refine the shared cache analysis.

Task interference graph
  There is an edge between two task nodes if and only if their lifetimes overlap

Analyze each cache set C individually

Page 15: Modeling shared cache and bus in multi-core platforms for timing analysis

Task interference graph

[Figure: a timeline of the lifetimes of tasks T1, T2 and T3, and the resulting task interference graph connecting tasks whose lifetimes overlap.]

Page 16: Modeling shared cache and bus in multi-core platforms for timing analysis

Cache conflict analysis (example 1)

[Figure: task interference graph over T1, T2 and T3; cache set C, associativity = 4.]

The per-core analysis classifies the blocks m1, m2 and m3, all mapped to cache set C, as All Hit, with conflict counts M(C) = 1, M(C) = 2 and M(C) = 1 from interfering cores. Shifting each block's age by its M(C) keeps every block within the associativity of 4, so all memory blocks remain All Hit (m1: AH -> AH, m2: AH -> AH, m3: AH -> AH).

Page 17: Modeling shared cache and bus in multi-core platforms for timing analysis

Cache conflict analysis (example 2)

[Figure: task interference graph over T1, T2 and T3; cache set C, associativity = 4.]

Here the per-core analysis classifies m0 and m1, m2, and m3, all mapped to cache set C, as All Hit, with conflict counts M(C) = 1, M(C) = 3 and M(C) = 1 from interfering cores. After shifting by M(C), m1 and m3 remain All Hit (AH -> AH), but m2 is downgraded (AH -> NC): m2 may be replaced from the cache due to conflicts from other cores. A sketch of this age-shift adjustment follows.
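A minimal sketch of the age-shift adjustment used by the shared-cache conflict analysis, assuming a cache set is represented as a {block: age} mapping; the representation and names are illustrative assumptions.

```python
def shift_by_conflicts(acs_set: dict, m_c: int, associativity: int) -> dict:
    """Shift each block's age in shared L2 set C by M(C), the number of
    conflicting blocks contributed by interfering tasks on other cores."""
    adjusted = {}
    for block, age in acs_set.items():
        new_age = age + m_c
        if new_age < associativity:
            adjusted[block] = new_age   # still guaranteed in the cache: remains All Hit
        # otherwise the block may be evicted by other cores: downgraded AH -> NC
    return adjusted

# With associativity 4 and M(C) = 3: a block of age 0 stays AH, a block of age 1 becomes NC.
print(shift_by_conflicts({'m2': 0}, m_c=3, associativity=4))  # {'m2': 3}
print(shift_by_conflicts({'m2': 1}, m_c=3, associativity=4))  # {}
```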

Page 18: Modeling shared cache and bus in multi-core platforms for timing analysis

Framework components

[Flowchart repeated; the bus-aware analysis is highlighted.]

Page 19: Modeling shared cache and bus in multi-core platforms for timing analysis

Example: variable bus delay (first iteration)

Bus slot: 50 cycles, L2 hit: 10 cycles, L2 miss: 20 cycles. Code executing on Core 0.

[Figure: a loop body with a common path and a left and right branch; basic-block costs C1 = 20, C2 = 10, C3 = 20, C4 = 20 or 30, C5 = 10, an L2 hit M1 = 10 and an L2 miss M2 = 20, laid out against the TDMA bus schedule (Core 0 slot at t = 0, Core 1 slot at t = 50, Core 0 slot at t = 100, and so on).]

First iteration: no bus delay, since every shared-cache access falls inside a Core 0 bus slot.

Page 20: Modeling shared cache and bus in multi-core platforms for timing analysis

Example: variable bus delay (second iteration)

Same block costs and bus schedule as on the previous slide.

[Figure: in the second iteration the shared-cache access M1 falls outside Core 0's bus slot and must wait for the next one, suffering a 20-cycle bus delay.]

Conclusion: the WCET of different iterations of the same loop could be different. A sketch of the underlying TDMA bus-delay computation follows.
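A minimal sketch of the bus delay seen by a request under a round-robin TDMA schedule with equal slots; the function name, the assumption that an access must fit entirely inside its core's slot, and the example timings are illustrative, not taken verbatim from the paper.

```python
def bus_delay(t: int, core: int, n_cores: int, slot_len: int, access_len: int) -> int:
    """Cycles a bus request issued at time t by `core` must wait before it can start."""
    period = n_cores * slot_len
    k = max(0, (t - core * slot_len) // period)   # index of the current/next slot owned by `core`
    while True:
        slot_start = core * slot_len + k * period
        start = max(t, slot_start)                # earliest possible start within this slot
        if start + access_len <= slot_start + slot_len:
            return start - t                      # the access fits: wait (if at all) until the slot
        k += 1                                    # otherwise wait for the core's next slot

# With 2 cores and 50-cycle slots, a 10-cycle L2 hit issued by core 0 at t = 80
# cannot start before t = 100, i.e. it suffers a 20-cycle bus delay.
print(bus_delay(t=80, core=0, n_cores=2, slot_len=50, access_len=10))  # 20
```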

Page 21: Modeling shared cache and bus in multi-core platforms for timing analysis

Possible solutions

Source of the problem
  Each iteration of a loop may start at a different offset relative to its bus slot

Possible solutions
  Virtually unroll all loop iterations – too expensive
  Do not model the bus, or assume the maximum possible bus delay – imprecise results

Our solution
  Assume each loop iteration starts at the same offset relative to its bus slot, and add the necessary alignment cost (see the sketch after the revisited example)

Page 22: Modeling shared cache and bus in multi-core platforms for timing analysis

Key observation

[Figure: the bus schedule repeats the pattern "Core 0 slot, Core 1 slot" along the timeline; a task T starts on core 0 at offset Δ within the slot.]

The round-robin schedule follows a repeating pattern, so T must follow the same execution pattern whenever it starts at the same offset Δ within its core's bus slot.

Page 23: Modeling shared cache and bus in multi-core platforms for timing analysis

Revisiting the example

Bus slot: 50 cycles, L2 hit: 10 cycles, L2 miss: 20 cycles. Code executing on Core 0; same common path and left/right branches as before.

[Figure: the loop iteration is aligned so that it always starts at the beginning of a Core 0 bus slot; an alignment cost of 20 cycles is added at the end of the iteration.]

With this alignment, all iterations follow the same execution pattern, the WCET of one iteration is at most 100 cycles, and there is no need to virtually unroll the loop. A sketch of the alignment computation follows.
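A minimal sketch of the per-iteration alignment, assuming every iteration is made to start at the beginning of its core's own bus slot and is charged the cost of waiting for the next such slot boundary; names and the exact alignment rule are illustrative assumptions.

```python
def aligned_iteration_wcet(body_wcet: int, core: int, n_cores: int, slot_len: int) -> int:
    """WCET of one loop iteration including the alignment cost back to the core's slot start."""
    period = n_cores * slot_len
    slot_start_offset = core * slot_len
    end_offset = (slot_start_offset + body_wcet) % period
    alignment = (slot_start_offset - end_offset) % period   # wait until the core's next slot start
    return body_wcet + alignment

# With 2 cores and 50-cycle slots, an 80-cycle iteration body is charged a
# 20-cycle alignment cost, bounding one iteration by 100 cycles (consistent
# with the numbers on this slide).
print(aligned_iteration_wcet(80, core=0, n_cores=2, slot_len=50))  # 100
```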

Page 24: Modeling shared cache and bus in multi-core platforms for timing analysis

Partial unrolling

[Figure: a small loop body (C1 = 10, M2 = 10, C2 = 10, with the access an L2 hit) executing on Core 0, shown without unrolling and with partial unrolling across a Core 0 bus slot between t = 0 and t = 100.]

The alignment cost is disproportionately high if the loop body is very small compared to the length of a bus slot. Such loops are partially unrolled until one bus slot is filled (see the sketch below).
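One plausible reading of the unrolling rule, as a sketch: unroll just enough iterations to fill one bus slot, so the alignment cost is paid once per unrolled chunk rather than once per iteration. The function and the numbers in the usage line are illustrative assumptions, not taken from the paper.

```python
import math

def unroll_factor(body_wcet: int, slot_len: int) -> int:
    """Number of iterations to unroll so that one chunk fills at least one bus slot."""
    return max(1, math.ceil(slot_len / body_wcet))

# A 30-cycle loop body and a 100-cycle bus slot would be unrolled 4 times.
print(unroll_factor(30, 100))  # 4
```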

Page 25: Modeling shared cache and bus in multi-core platforms for timing analysis

Extension to full programs

[Figure: the per-iteration WCET of an inner loop is composed into the WCET of its enclosing outer loop, and so on up to the whole program.]

Page 26: Modeling shared cache and bus in multi-core platforms for timing analysis

Framework components

[Flowchart repeated; the bus-aware WCET/BCET computation and the WCRT computation are highlighted.]

Page 27: Modeling shared cache and bus in multi-core platforms for timing analysis

WCRT computation

[Figure: a task graph t1 -> {t2, t3} -> t4 with assigned cores t1: core 1, t2: core 2, t3: core 2, t4: core 1; t2 and t3 (both on core 2) are peers.]

Task lifetime: [EarliestReady, LatestFinish]

Earliest time computation
  EarliestReady(t1) = 0
  EarliestReady(t4) >= EarliestFinish(t2), EarliestReady(t4) >= EarliestFinish(t3)
  EarliestFinish = EarliestReady + BCET

Latest time computation
  LatestReady(t4) >= LatestFinish(t2), LatestReady(t4) >= LatestFinish(t3)
  t2 has a peer: LatestFinish(t2) = LatestReady(t2) + WCET(t2) + WCET(t3)
  t4 has no peers: LatestFinish(t4) = LatestReady(t4) + WCET(t4)

Computed WCRT = LatestFinish(t4); a sketch of the computation follows.
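A minimal sketch of the earliest/latest time computation following the rules on this slide, assuming tasks are processed in topological order of the task graph; the data-structure choices, the peers relation given as input, and all names are illustrative assumptions.

```python
def compute_wcrt(tasks, preds, bcet, wcet, peers):
    """tasks: topologically ordered task ids; preds/peers: task -> iterable of tasks;
    bcet/wcet: task -> cycles. Returns the computed WCRT."""
    earliest_ready, earliest_finish = {}, {}
    latest_ready, latest_finish = {}, {}
    for t in tasks:
        earliest_ready[t] = max((earliest_finish[p] for p in preds[t]), default=0)
        earliest_finish[t] = earliest_ready[t] + bcet[t]
        latest_ready[t] = max((latest_finish[p] for p in preds[t]), default=0)
        # under non-preemptive scheduling a task may additionally wait for its peers on the same core
        latest_finish[t] = latest_ready[t] + wcet[t] + sum(wcet[p] for p in peers[t])
    return max(latest_finish.values())   # computed WCRT, e.g. LatestFinish(t4) in the example
```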

Page 28: Modeling shared cache and bus in multi-core platforms for timing analysis

An example

L2 hit: 10 cycles, L2 miss: 20 cycles, bus slot: 50 cycles.

[Figure: Gantt chart of Core 0, Core 1 and the bus. In the first analysis round M2.2 and M3.2 are assumed to conflict in the shared L2 and are both treated as L2 misses, while M4.2 is an L2 hit; with the resulting bus schedule and waits, the WCRT is 170 cycles.]

The computed lifetimes, however, show that T2 and T3 are disjoint, so M2.2 and M3.2 cannot conflict in the shared L2: both become L2 hits.

Page 29: Modeling shared cache and bus in multi-core platforms for timing analysis

Example (continued)

[Figure: Gantt chart of the refined analysis round next to the first one. With M2.2 and M3.2 classified as L2 hits, the bus schedule changes, the second bus wait for Core 1 is eliminated, and the WCRT drops from 170 to 130 cycles.]

Page 30: Modeling shared cache and bus in multi-core platforms for timing analysis

Experimental evaluation

Tasks are compiled into SimpleScalar PISA-compliant binaries.

CMP_SIM is used for simulation; it is extended with shared-bus modeling and with support for PISA-compliant binaries.

Two setups
  Independent tasks running on different cores
  Task dependencies specified through a task graph

Page 31: Modeling shared cache and bus in multi-core platforms for timing analysis

Overestimation ratio (2-core)

One core runs statemate; the other core runs the program under evaluation.

L1 cache: direct-mapped, 1 KB; L2 cache: 4-way, 2 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles

Average overestimation = 40%

Page 32: Modeling shared cache and bus in multi-core platforms for timing analysis

Overestimation ratio (4-core)

Either (edn, adpcm, compress, statemate) or (matmult, fir, jfdcint, statemate) run on the 4 different cores.

L1 cache: direct-mapped, 1 KB; L2 cache: 4-way, 2 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles

Average overestimation = 40%

Page 33: Modeling shared cache and bus in multi-core platforms for timing analysis

Sensitivity to bus slot length (2-core)

[Chart: average overestimation ratio for the program statemate across different bus slot lengths.]

Page 34: Modeling shared cache and bus in multi-core platforms for timing analysis

Sensitivity to bus slot length (4-core)

[Chart: average overestimation ratio for the program statemate across different bus slot lengths.]

Page 35: Modeling shared cache and bus in multi-core platforms for timing analysis

WCRT analysis of an extracted task graph (Debie-test)

Debie is an online space debris monitoring program developed by Space Systems Finland Ltd.

[Figure: extracted task graph with nodes main-tc(1), main-hm(1), main-tm(1), main-hit(1), main-aq(1), main-su(1), tc-test(3), hm-test(4), tm-test(1), hit-test(2), aq-test(4), su-test(2); the number in parentheses is the assigned core number.]

Page 36: Modeling shared cache and bus in multi-core platforms for timing analysis

Experimental evaluation of Debie-test

L1 cache: 2-way, 2 KB; L2 cache: 4-way, 8 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles

Overestimation ratio ~ 20%

This clearly shows that, for a real-life application, bus modeling is essential.

Page 37: Modeling shared cache and bus in multi-core platforms for timing analysis

Extension to a different multi-core architecture (e.g. Intel Core 2 Duo)

[Figure: the Intel Core 2 Duo style architecture shown earlier: processors with per-core private L1 caches and a per-processor shared L2, connected through crossbars to a shared off-chip bus and off-chip memory.]

In this architecture only L2 cache misses appear on the shared bus. The overall framework remains the same; only the shared-bus waiting time is now computed for L2 cache misses.