26
Shared Cache Aware Task Mapping for WCRT Minimization Huping Ding & Tulika Mitra School of Computing, National University of Singapore Yun Liang Center for Energy-efficient Computing and Applications, School of EECS, Peking University

Shared Cache Aware Task Mapping for WCRT Minimization · 2013. 4. 24. · t4 t5 . Previous Task ... –State-of-the-art works on shared cache usually fix task mapping before shared

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

  • Shared Cache Aware Task Mapping for WCRT Minimization

    Huping Ding & Tulika Mitra School of Computing, National University of Singapore

    Yun Liang Center for Energy-efficient Computing and Applications, School of EECS, Peking University

  • Multi-core Processor in Embedded Systems

    Embedded systems adopt multi-core processors due to performance, thermal and power constraints

    Embedded systems have real-time constraints, e.g., time deadline of tasks

    Shared Resources, like shared cache, make the timing difficult to estimate, e.g. Worst-Case Response Time (WCRT) analysis

    Core 1 Core 2 Core n ….

    Shared Resources

  • Worst-Case Response Time

    Tasks can be mapped to any core in a multi-core systems

    The maximal latest finish time among all these tasks is the Worst-Case Response Time (WCRT) of the application

    An important metric for schedulability analysis

    Our purpose is to minimize WCRT in multi-core systems with shared cache via task mapping

    A multitasking application

    t1

    t2 t3

    t4 t5

  • Previous Task Mapping Approaches

    They usually focus on balancing the workload

    They are agnostic to the shared cache behavior – State-of-the-art works on shared cache usually fix task

    mapping before shared L2 cache analysis

  • Motivating Example

    adpcm

    ndes minver

    crc compress

    Task graph Configuration: 2-core, private L1 cache: 256 bytes, shared L2 cache: 2KB w/o L2 cache modeling: Assume cache miss (hit) for each L2 access in the worst (best) case

    0

    2

    4

    6

    8

    10

    12

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

    WC

    RT

    (mill

    ion

    cyc

    les)

    Task mappings

    w/o L2 cache modeling with L2 cache modeling

    It is imperative to design a shared cache aware task mapping solution!

  • Our Architecture

    A private L1 cache for each core

    All cores share a L2 cache

    Tasks execute in a non-preemptive fashion

    Shared L2 cache

    CPU

    L1 cache

    Core1 CPU

    L1 cache

    Core 2 CPU

    L1 cache

    Core n

    Main memory

  • Private Cache Systems

    Usually only the intra-task analysis is required to estimate the WCET (Worst-Case Execution Time)

    No inter-core cache conflicts

    Cache

    Off-chip

    Main Memory

    CPU

  • Shared Cache Multi-core Systems

    Inter-core cache conflicts

    – Lead to extra L2 cache misses

    This makes the timing analysis complex

    m1 m2

    young old

    Access m2

    Hit

    LRU replacement policy

    Shared L2 cache

    Core 2 Core 1

    Task T

    m1 m2

    young old

    Access m2

    Miss

    LRU replacement policy

    Shared L2 cache

    Core 2 Core 1

    Task T

    m1’ m1

    Access m1’

    Task T’

  • Task Interference

    Task lifetime : [Earliest Ready, Latest Finish]

    Interfering tasks : Two tasks mapped to different cores with lifetime overlapped

    Two tasks with dependence between them can never interfere with each other

    Time

    t1

    t2 t3

    t4

    Interfering tasks: t2 and t3, t2 and t4

    Core 1 Core 2 t1

    t2

    t3

    t4

  • Current Efforts on Shared Cache

    They focus on modeling cache and inter-core cache conflicts (Yan and Zhang RTAS 2008, Hardy and Puaut RTSS 2008 and Liang et al. RTS 2012)

    Task mapping is fixed (known) before analysis, and agnostic to the shared cache conflicts

  • Impact of Task Mapping

    Task interference – Two tasks can interfere with each other only when they are

    mapped to different cores

    Workload balance among cores – Workload balance influences the total execution time of a

    multitasking application, thus impacts its WCRT

    Task interference

    Inter-core cache conflict

    Execution time of task

    WCRT

  • Our Aims

    We target the task mapping problem in multi-core systems with shared cache, in order to optimize the WCRT

    We not only consider workload balance, but also minimize inter-core cache conflicts

  • Challenges

    Exhaustive enumeration of all mappings is impossible as # of cores and # of tasks increase

    Significant complexity due to inter-dependency between task mapping and task execution time

    Task mapping

    Task execution time

    Task interference

    Cache conflict

    Workload balancing

  • Our Approach

    An approach based on ILP (integer linear programming) formulation

    Intra-task cache

    analysis

    Task mapping modeling

    WCET computation

    modeling

    Shared cache

    modeling

    WCRT computation

    modeling

    ILP problem Objective: minimize WCRT

    ILP solver Result: a task mapping

    Accurate WCRT analysis framework

    Final WCRT

  • Intra-task Analysis

    Must analysis – Captures the memory blocks that are guaranteed in the

    cache

    May analysis – Captures the memory blocks that are never in the cache

    Memory block access classification – Always hit

    – Always miss

    – Non-classified

  • Task Mapping Modeling

    Suppose there are N tasks and M cores in the multi-core system. 𝑴𝑨𝑷𝒊𝑘 is a 0-1 decision variable to indicate if task i is mapped to core k. Then for any task i (0

  • Shared Cache Modeling

    a

    b

    m

    c

    d

    a

    b

    m

    Cache state before considering cache conflicts

    d

    e

    a

    b

    Possible cache states after considering cache conflicts

    (𝑎𝑔𝑒𝑚+ 𝑐𝑜𝑛𝑓𝑙𝑖𝑐𝑡𝑗𝑠

    𝑗 ∈ 𝑖𝑛𝑡𝑓(𝑖)

    ) ≥ 𝐴

    {m, a, b, c, d, e} are mapped to the same set s in shared L2 cache. m, a, b and c are from task i, while d and e are tasks u and v respectively

    Hit

    Miss

    young

    old

    LRU replacement policy

  • WCET and WCRT Computation

    WCET – Intra-task analysis and shared cache modeling

    WCRT – WCRT is the summation of WCET value on the critical path

  • Approximations in ILP

    Interfering tasks – Two tasks i and j have no dependence and are mapped to

    different cores

    WCET = Original WCET + Cache conflict penalty – Intra-task cache analysis Original WCET

    – Shared cache modeling Cache conflict penalty

    Why approximations ? – Make the problem easy to solve

    Estimate accurate WCRT with the mapping released by ILP formulation by a WCRT analysis framework

    𝑖𝑛𝑡𝑒𝑟𝑓𝑒𝑟𝑒𝑖𝑗 = 𝐷𝐶𝑖𝑗

  • Iterative Analysis Framework

    Yun Liang, Huping Ding, Tulika Mitra, Abhik Roychoudhury, Yan Li, Vivy Suhendra Timing Analysis of Concurrent Programs Running on Shared Cache Multi-Cores. In Real-Time System Journal 2012

    L1 cache analysis

    L1 cache analysis

    Filter Filter

    L2 cache analysis

    L2 cache analysis

    L2 cache Conflict analysis

    WCRT analysis

    Interference changes?

    yes no

    Estimated WCRT

    Core 1 Core 2

    Initial task interference

    Modified task interference

  • Experimental Evaluation

    Real-world benchmark: DEBIE benchmark

    Synthetic benchmarks: MRTC benchmark suite

    Comparison among different approaches

    Enumerate all Task mappings

    Evaluate WCRT with L2 cache modeling

    Select the best mapping

    ILP formulation

    Solve ILP problem

    Iterative analysis framework

    Enumerate all Task mappings

    Evaluate WCRT w/o L2 cache modeling

    Task mapping released

    Select the best mapping

    Optimal mapping Our mapping Mapping w/o L2 cache modeling

  • DEBIE Benchmark Results

    0

    200

    400

    600

    800

    1000

    1200

    1400

    1600

    1800

    2000

    L1:1KB L2:16KB

    WC

    RT

    (mill

    ion

    cyc

    les)

    Configuration: 4-core

    Optimal

    Our approach

    w/o L2 cache modeling

  • Synthetic Benchmarks Results

    0.000.200.400.600.801.001.201.401.601.802.00

    Taskgraph 1

    Taskgraph 2

    Taskgraph 3

    Taskgraph 4

    Taskgraph 5

    Taskgraph 6

    Taskgraph 7

    Taskgraph 8

    Taskgraph 9

    Re

    lati

    ve W

    CR

    T

    Configurations: 4-core, L1:512B, L2:4KB

    Optimal Our approach w/o L2 cache modeling

  • Runtime for Synthetic Benchmarks

    Task graph # of tasks Our approach (seconds)

    Optimal (seconds)

    1 6 6.59 3.98

    2 7 33.37 1.42

    3 8 8.77 4.93

    4 9 4.22 15.05

    5 10 17.20 67.54

    6 11 9.27 269.29

    7 12 396.33 1089.35

    8 13 480.62 3,883.00

    9 14 539.31 22,794.00

  • Conclusions

    Tasking mapping has a great impact on WCRT

    An ILP formulation approach that is aware of the shared cache conflict

    Our approach not only considers the workload balance but also minimizes inter-core cache conflict

    Significant reduction in WCRT compared to traditional approach agnostic to shared cache conflicts

  • Q & A