Modeling shared cache and bus in multi-core platforms for timing analysis
Sudipta Chattopadhyay, Abhik Roychoudhury, Tulika Mitra
Timing analysis (basics)
- Hard real-time systems need to meet certain deadlines
- System-level (schedulability) analysis builds on single-task analysis, i.e., Worst-Case Execution Time (WCET) analysis
- WCET: an upper bound on the execution time of a program, over all possible inputs, for a given hardware platform; usually obtained by static analysis

Usage of WCET
- Schedulability analysis of hard real-time systems
- Worst-case-oriented optimization
WCET and BCET
[Figure: distribution of execution times. Estimated BCET <= Observed BCET <= Actual BCET <= Actual WCET <= Observed WCET <= Estimated WCET; the gap between actual and estimated bounds is the over-estimation. WCET = Worst-Case Execution Time, BCET = Best-Case Execution Time.]
Timing analysis for multi-cores
- Modeling the shared cache and shared bus, the most common forms of resource sharing in multi-cores
- Difficulties: conflicts in the shared cache arising from other cores; contention on the shared bus introduced by other cores; interaction between the shared cache and the shared bus
Commercial multi-core (Intel Core 2 Duo)
[Figure: two processors, each with cores (private L1 caches) connected through a crossbar to a shared L2 cache; both processors sit on a shared off-chip bus to off-chip memory.]
- Presence of both a shared cache and a shared bus
Modeled architecture
- The shared cache is accessed through a shared bus
[Figure: Architecture A: Core 0 ... Core N, each with a private L1, connected by a shared bus to a shared L2. Architecture B: each core with private L1 and L2, connected by a shared bus to a shared L3.]
Assumptions
- Perfect data cache; currently we model only the shared instruction cache
- The shared bus is TDMA (Time Division Multiple Access), with slots assigned in round-robin fashion; TDMA is chosen for predictability
- Separate instruction and data buses; bus traffic arising from data memory accesses is ignored
- No self-modifying code, so cache coherence need not be modeled
- Non-preemptive scheduling
Overview of the framework
[Flowchart: L1 cache analysis and L2 cache analysis produce a cache access classification; L2 conflict analysis (seeded with an initial interference estimate) feeds a bus-aware analysis and WCRT computation; if the interference changes, the analysis repeats, otherwise the estimated WCRT is reported.]
- Iterative fix-point analysis
- Termination of our analysis is guaranteed
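The iterative fix-point loop can be sketched as follows. This is an illustrative model, not the paper's implementation: `analyze_interference` and `compute_wcrt` are hypothetical stand-ins for the cache/bus analysis and the WCRT computation.

```python
def fixpoint_wcrt(tasks, analyze_interference, compute_wcrt):
    """Iterate conflict analysis and WCRT computation until the task
    interference estimate stops changing (the fix point)."""
    # Initially, assume every task's lifetime spans the whole schedule,
    # i.e., all tasks may interfere with each other.
    lifetimes = {t: (0, float("inf")) for t in tasks}
    interference = analyze_interference(lifetimes)
    while True:
        wcrt, lifetimes = compute_wcrt(interference)
        new_interference = analyze_interference(lifetimes)
        if new_interference == interference:  # no change: fix point reached
            return wcrt
        interference = new_interference
```

In the paper's framework, termination follows because the interference estimate can only shrink monotonically across iterations.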
Framework components: L1 cache analysis
L1 cache analysis (Ferdinand et al., RTS '97)
- Abstract cache set: maps memory blocks to ages, from low (most recently used) to high; blocks aged beyond the associativity are evicted
[Figure: abstract cache sets such as {a}, {b,c}, {c} being joined; evicted blocks fall off the high-age end.]
- Must join: intersection, maximum age; finds All Hit (AH) cache blocks
- May join: union, minimum age; finds All Miss (AM) cache blocks
- Persistence join: union, maximum age; finds Persistence (PS), i.e., never-evicted, cache blocks
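The three join operators can be sketched on abstract cache sets represented as block-to-age maps (age 0 = youngest). This is an illustrative model of the abstract domain, not the analyzer's code.

```python
def must_join(a, b):
    # Intersection with maximum age: a block survives only if it is in
    # both states, at its oldest possible age -> All Hit (AH) blocks.
    return {m: max(a[m], b[m]) for m in a.keys() & b.keys()}

def may_join(a, b):
    # Union with minimum age: a block is kept if it may be in either
    # state; blocks absent from the may state are All Miss (AM).
    out = dict(a)
    for m, age in b.items():
        out[m] = min(out[m], age) if m in out else age
    return out

def persistence_join(a, b):
    # Union with maximum age: tracks blocks that, once loaded, are
    # never evicted -> Persistence (PS) blocks.
    out = dict(a)
    for m, age in b.items():
        out[m] = max(out[m], age) if m in out else age
    return out
```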
Framework components: L2 cache analysis
L2 cache analysis: per-core L2 cache analysis (Puaut et al., RTSS 2008)
[Figure: a memory reference first passes through L1 cache analysis; only references that may miss in L1 are fed to the L2 analysis.]
- Always accessed (A): the reference is All Miss in L1; the abstract L2 cache state is updated, ACS_out = U(ACS_in)
- Never accessed (N): the reference is All Hit in L1; the L2 state is unchanged, ACS_out = ACS_in
- Unknown (U): the reference is Persistence or NC in L1; both outcomes are possible, ACS_out = Join(ACS_in, U(ACS_in))
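The per-core filter can be sketched as a small classification function; the string labels are illustrative, not the paper's notation.

```python
def l2_access(l1_class):
    """Classify how a reference reaches the shared L2 cache, given its
    L1 classification (AH = all hit, AM = all miss, PS/NC otherwise)."""
    if l1_class == "AH":
        return "N"   # Never accessed: L2 state unchanged, ACS_out = ACS_in
    if l1_class == "AM":
        return "A"   # Always accessed: ACS_out = U(ACS_in)
    return "U"       # Unknown: ACS_out = Join(ACS_in, U(ACS_in))
```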
Framework components: L2 conflict analysis
Shared cache conflict analysis
- Our past work (RTSS 2009): exploit task lifetimes to refine the shared cache analysis
- Task interference graph: there is an edge between two task nodes if they have overlapping lifetimes
- Analyze each cache set C individually
Task interference graph
[Figure: a timeline of tasks T1, T2, T3; overlapping lifetimes induce edges in the task interference graph.]
Cache conflict analysis
[Figure: task interference graph over T1, T2, T3 for a cache set C with associativity 4; the tasks map M(C) = 1, 2, 1 conflicting blocks (m1, m2, m3) to set C. Shifting ages by the interfering blocks still fits within the associativity.]
- After conflict analysis: m1: AH -> AH, m2: AH -> AH, m3: AH -> AH
- All memory blocks remain all hits
Cache conflict analysis (contd.)
[Figure: same setting, but now the tasks map M(C) = 1, 3, 1 conflicting blocks (m0 and m1, m2, m3) to set C; the shifted age of m2 exceeds the associativity.]
- After conflict analysis: m1: AH -> AH, m2: AH -> NC, m3: AH -> AH
- m2 may be replaced from the cache due to conflicts from other cores
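The age-shifting step of the conflict analysis can be sketched as below. The must-age of 2 for m2 is an assumed value for illustration; the M(C) values match the two examples.

```python
def shifted_class(must_age, conflicts, associativity):
    """Re-classify a shared-cache block after conflict analysis.

    must_age      : the block's age in the must analysis (1 = youngest)
    conflicts     : M(C) values of interfering tasks on other cores,
                    i.e., their numbers of conflicting blocks in set C
    associativity : number of ways in the shared cache set
    """
    # Each interfering block can age this block by one position.
    if must_age + sum(conflicts) <= associativity:
        return "AH"   # still guaranteed to be in the cache
    return "NC"       # may be evicted by other cores: not classified
```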
Framework components: bus-aware analysis
Example: variable bus delay
- Bus slot: 50 cycles, L2 hit: 10 cycles, L2 miss: 20 cycles
[Figure: code executing on Core 0 with a common path and left/right branches; computation segments C1 = 20, C2 = 10, C3 = 20, C4 = 20/30, C5 = 10 and memory accesses M1 = 10 (L2 hit), M2 = 20 (L2 miss); the timeline alternates Core 0 and Core 1 bus slots at t = 0, 50, 100, 150.]
- First iteration: no bus delay
Example: variable bus delay (contd.)
[Figure: same code and parameters; in the second iteration, M1 falls outside Core 0's bus slot.]
- Second iteration: M1 suffers a 20-cycle bus delay
- Conclusion: the WCET of different iterations could be different
Possible solutions
- Source of the problem: each iteration of a loop may start at a different offset relative to its bus slot
- Virtually unroll all loop iterations: too expensive
- Do not model the bus, or take the maximum possible bus delay: imprecise result
- Our solution: assume each loop iteration starts at the same offset relative to its bus slot and add the necessary alignment cost
Key observation
[Figure: bus schedule as a timeline of alternating Core 0 and Core 1 slots; a task T starts at offset Δ into a Core 0 slot.]
- A round-robin schedule follows a repeating pattern
- T must follow the same execution pattern if the offset Δ is the same
Revisiting the example
- Bus slot: 50 cycles, L2 hit: 10 cycles, L2 miss: 20 cycles
[Figure: the second iteration is aligned to start at the same offset in Core 0's bus slot; alignment cost = 20 cycles.]
- With this alignment, all iterations follow the same execution pattern
- WCET of one iteration <= 100 cycles
- No need to virtually unroll the loop
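Under the round-robin TDMA assumption, the bus delay and alignment cost can be sketched as below. These are hypothetical helper functions, using the example's parameters (2 cores, 50-cycle slots).

```python
def bus_delay(t, core, n_cores, slot, latency):
    """Cycles a request issued at time t by `core` must wait until a
    transfer of `latency` cycles fits entirely in one of its TDMA
    slots (round-robin schedule, slot length `slot`)."""
    period = n_cores * slot
    pos = (t - core * slot) % period   # offset into this core's round
    if pos + latency <= slot:          # fits in the current slot
        return 0
    return period - pos                # wait for the core's next slot

def alignment_cost(t, core, n_cores, slot):
    """Cycles needed to align time t to the start of the core's next slot."""
    period = n_cores * slot
    return (core * slot - t) % period
```

With two cores and 50-cycle slots, an iteration that finishes at t = 80 pays an alignment cost of 20 cycles to restart at t = 100, matching the example.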
Partial unrolling
[Figure: a small loop body (C1 = 10, M2 = 10 (L2 hit), C2 = 10) on Core 0; with no unrolling, each iteration pays the alignment cost; with partial unrolling, several iterations are packed into one Core 0 bus slot (t = 0 to t = 100).]
- The alignment cost is high if the loop is very small compared to the length of the bus slot
- Partially unroll such loops until one bus slot is filled
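The unroll factor can be sketched as a simple computation; `unroll_factor` is a hypothetical helper, and the 30-cycle iteration length matches the figure's small loop body.

```python
import math

def unroll_factor(iteration_len, slot_len):
    """How many iterations of a small loop to unroll so that one bus
    slot is filled; loops longer than a slot are left as-is."""
    return max(1, math.ceil(slot_len / iteration_len))
```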
Extension to full programs
[Figure: the WCET of an inner loop is computed first and then composed into the WCET of the outer loop.]
Framework components: bus-aware WCET/BCET computation and WCRT computation
WCRT computation
[Figure: task graph t1 -> {t2, t3} -> t4; assigned cores in parentheses: t1(1), t2(2), t3(2), t4(1). Peers are tasks assigned to the same core with overlapping lifetimes.]
- Task lifetime: [EarliestReady, LatestFinish]
- Earliest time computation:
  EarliestReady(t1) = 0
  EarliestReady(t4) >= EarliestFinish(t2)
  EarliestReady(t4) >= EarliestFinish(t3)
  EarliestFinish = EarliestReady + BCET
- Latest time computation:
  LatestReady(t4) >= LatestFinish(t2)
  LatestReady(t4) >= LatestFinish(t3)
  t2 has peers: LatestFinish(t2) = LatestReady(t2) + WCET(t2) + WCET(t3)
  t4 has no peers: LatestFinish(t4) = LatestReady(t4) + WCET(t4)
- Computed WCRT = LatestFinish(t4)
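The earliest/latest time computation can be sketched over a task DAG. This is a simplified model: the tasks are assumed to be given in topological order, and the peer sets are supplied directly rather than derived from lifetimes as in the full analysis.

```python
def compute_wcrt(preds, bcet, wcet, peers):
    """Worst-case response time of a task graph.

    preds[t] : predecessors of t; keys iterate in topological order
    peers[t] : tasks on t's core with overlapping lifetimes (their WCET
               is added because scheduling is non-preemptive)
    """
    earliest_finish, latest_finish = {}, {}
    for t in preds:
        earliest_ready = max((earliest_finish[p] for p in preds[t]), default=0)
        earliest_finish[t] = earliest_ready + bcet[t]
        latest_ready = max((latest_finish[p] for p in preds[t]), default=0)
        latest_finish[t] = (latest_ready + wcet[t]
                            + sum(wcet[p] for p in peers.get(t, ())))
    return max(latest_finish.values())  # WCRT = latest finish of a sink
```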
An example
- L2 hit: 10 cycles, L2 miss: 20 cycles, bus slot: 50 cycles
- M2.2 and M3.2 conflict in L2: both L2 miss; M4.2 is an L2 hit
[Figure: schedule on Core 0 and Core 1 with alternating bus slots; Core 0 runs T1.1 = 90, then T4.1 = 20, M4.2 = 10, T4.2 = 10; Core 1 runs T2.1 = 10, M2.2 = 20, T2.2 = 20, then T3.1 = 20, M3.2 = 20, T3.2 = 10, with bus waits.]
- With the bus schedule based on M2.2 and M3.2 being L2 misses: WCRT = 170 cycles
- But T2 and T3 have disjoint lifetimes, so M2.2 and M3.2 cannot conflict: both are L2 hits
Example (contd.)
[Figure: revised schedule with M2.2 = 10 and M3.2 = 10, both L2 hits, compared against the earlier schedule.]
- With the bus schedule based on M2.2 and M3.2 being L2 hits, the second bus wait for Core 1 is eliminated
- WCRT = 130 cycles
Experimental evaluation
- Tasks are compiled into SimpleScalar PISA-compliant binaries
- CMP_SIM is used for simulation; it is extended with shared-bus modeling and support for PISA-compliant binaries
- Two setups: independent tasks running on different cores, and task dependencies specified through a task graph
Overestimation ratio (2-core)
- One core runs statemate; the other core runs the program under evaluation
- Configuration: L1 cache: direct mapped, 1 KB; L2 cache: 4-way, 2 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles
- Average overestimation = 40%
Overestimation ratio (4-core)
- The four cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdcint, statemate)
- Configuration: L1 cache: direct mapped, 1 KB; L2 cache: 4-way, 2 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles
- Average overestimation = 40%
Sensitivity to bus slot length (2-core): average overestimation ratio for the program statemate
Sensitivity to bus slot length (4-core): average overestimation ratio for the program statemate
Debie is an online space-debris monitoring program developed by Space Systems Finland Ltd.
Extracted task graph (Debie-test)
[Figure: task graph used for WCRT analysis, with tasks main-tc(1), main-hm(1), main-tm(1), main-hit(1), main-aq(1), main-su(1) and tc-test(3), hm-test(4), tm-test(1), hit-test(2), aq-test(4), su-test(2); the number in parentheses is the assigned core.]
Experimental evaluation of Debie-test
- Configuration: L1 cache: 2-way, 2 KB; L2 cache: 4-way, 8 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles
- Overestimation ratio ~ 20%
- This clearly shows that for real-life applications, bus modeling is essential
Extension to a different multi-core architecture (e.g., Intel Core 2 Duo)
[Figure: the commercial multi-core diagram from before; two processors, each with private L1 caches and a shared L2, connected by crossbars to a shared off-chip bus and off-chip memory.]
- Only L2 cache misses appear on the shared bus
- The overall framework remains the same; shared-bus waiting time is computed only for L2 cache misses