
  • COMP25212 CPU Multi Threading
    Learning Outcomes: to be able to:
    – Describe the motivation for multithread support in CPU hardware
    – Distinguish the benefits and implementations of coarse-grain, fine-grain and simultaneous multithreading
    – Explain when multithreading is inappropriate
    – Describe a multithreading implementation
    – Estimate the performance of these implementations
    – State the important assumptions of this performance model

  • Revision: Increasing CPU Performance
    [Pipeline diagram: Fetch, Decode, Execute, Memory and Write logic stages, with an instruction cache feeding Fetch and a data cache serving Memory; instructions a–f flow through the pipeline, one stage per clock.]
    How can throughput be increased?

  • Increasing CPU Performance
    – By increasing clock frequency
    – By increasing Instructions per Clock:
      – Minimising memory access impact: data cache
      – Maximising instruction issue rate: branch prediction
      – Maximising instruction issue rate: superscalar
      – Maximising pipeline utilisation: avoid instruction dependencies, out-of-order execution
    (What does lengthening the pipeline do?)

  • Increasing Program Parallelism
    – Keep issuing instructions after a branch?
    – Keep processing instructions after a cache miss?
    – Process instructions in parallel?
    – Write a register while a previous write is pending?

    Where can we find additional independent instructions? In a different program!

  • Revision: Process States
    [State diagram: New → Ready waiting for a CPU; Ready → Running on a CPU (dispatch, by the scheduler); Running → Blocked waiting for event (needs to wait, e.g. I/O); Blocked → Ready (I/O occurs); Running → Ready (pre-empted, e.g. timer); Running → Terminated.]
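    A minimal sketch of this state machine in Python; the states follow the diagram above, while the event names ("admit", "dispatch", "wait", "io_done", "preempt", "exit") are my own labels for its arrows:

```python
from enum import Enum, auto

class ProcState(Enum):
    NEW = auto()
    READY = auto()        # waiting for a CPU
    RUNNING = auto()      # running on a CPU
    BLOCKED = auto()      # waiting for an event
    TERMINATED = auto()

# Legal transitions, keyed by (current state, event).
TRANSITIONS = {
    (ProcState.NEW,     "admit"):    ProcState.READY,
    (ProcState.READY,   "dispatch"): ProcState.RUNNING,   # scheduler
    (ProcState.RUNNING, "wait"):     ProcState.BLOCKED,   # e.g. I/O request
    (ProcState.BLOCKED, "io_done"):  ProcState.READY,     # I/O occurs
    (ProcState.RUNNING, "preempt"):  ProcState.READY,     # e.g. timer
    (ProcState.RUNNING, "exit"):     ProcState.TERMINATED,
}

def step(state: ProcState, event: str) -> ProcState:
    """Apply one event, rejecting transitions the diagram does not allow."""
    if (state, event) not in TRANSITIONS:
        raise ValueError(f"illegal transition: {state.name} on {event!r}")
    return TRANSITIONS[(state, event)]
```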

  • Revision: Process Control Block
    – Process ID
    – Process State
    – PC
    – Stack Pointer
    – General Registers
    – Memory Management Info
    – Open File List, with positions
    – Network Connections
    – CPU time used
    – Parent Process ID
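    Since the PCB is just a per-process record, a minimal sketch as a Python dataclass may help. Field names follow the list above; the 32-entry register file and the dict/list representations are assumptions of the sketch, not slide-given details:

```python
from dataclasses import dataclass, field

@dataclass
class PCB:
    """Process Control Block: per-process state the OS must keep."""
    pid: int
    state: str                       # "new", "ready", "running", "blocked" or "terminated"
    pc: int                          # saved program counter
    stack_pointer: int
    registers: list[int] = field(default_factory=lambda: [0] * 32)  # width is an assumption
    mem_mgmt: dict = field(default_factory=dict)    # e.g. page-table base, limits
    open_files: list = field(default_factory=list)  # (file, position) pairs
    net_conns: list = field(default_factory=list)
    cpu_time_used: float = 0.0
    parent_pid: int | None = None
```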

  • Revision: CPU Switch
    [Diagram: Process P0 is executing; on an interrupt or system call the operating system saves state into PCB0 and loads state from PCB1; Process P1 executes; later the OS saves state into PCB1 and loads state from PCB0, and P0 resumes.]

  • What does the CPU load on dispatch?
    – Process ID
    – Process State
    – PC
    – Stack Pointer
    – General Registers
    – Memory Management Info
    – Open File List, with positions
    – Network Connections
    – CPU time used
    – Parent Process ID

  • What does the CPU need to store on deschedule?
    – Process ID
    – Process State
    – PC
    – Stack Pointer
    – General Registers
    – Memory Management Info
    – Open File List, with positions
    – Network Connections
    – CPU time used
    – Parent Process ID
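    The point of both questions is that only a few PCB fields are hardware-visible: the PC, stack pointer, general registers and memory-management state live in CPU registers and must be saved/restored on a switch; the rest (open files, CPU time used, parent PID, ...) is OS bookkeeping that never leaves memory. A minimal sketch, assuming a hypothetical CPUState record for the hardware registers (the names, including pt_base, are illustrative, not from the slides):

```python
from dataclasses import dataclass, field

@dataclass
class CPUState:
    """Hypothetical hardware-visible register state of one CPU."""
    pc: int = 0
    sp: int = 0
    regs: list[int] = field(default_factory=lambda: [0] * 32)
    pt_base: int = 0                 # e.g. page-table base register (assumption)

def deschedule(cpu: CPUState, pcb) -> None:
    """Save the hardware-visible state of the running process into its PCB."""
    pcb.pc = cpu.pc
    pcb.stack_pointer = cpu.sp
    pcb.registers = list(cpu.regs)   # snapshot the general register file
    pcb.mem_mgmt["pt_base"] = cpu.pt_base

def dispatch(cpu: CPUState, pcb) -> None:
    """Load the hardware-visible state of the chosen process into the CPU."""
    cpu.pc = pcb.pc
    cpu.sp = pcb.stack_pointer
    cpu.regs = list(pcb.registers)
    cpu.pt_base = pcb.mem_mgmt.get("pt_base", 0)
```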

  • CPU Support for Multithreading

  • How Should OS View Extra Hardware Thread?
    – A variety of solutions
    – Simplest is probably to declare an extra CPU
    – Need a multiprocessor-aware OS

  • CPU Support for Multithreading
    Design issue: when to switch threads

  • Coarse-Grain Multithreading
    Switch thread on an expensive operation:
    – e.g. I-cache miss
    – e.g. D-cache miss
    Some are easier than others!

  • Switch Threads on I-cache Miss

    Cycle:    1    2    3        4    5    6    7
    Inst a:   IF   ID   EX       MEM  WB
    Inst b:        IF   ID       EX   MEM  WB
    Inst c:             IF miss  --   --   --   --
    Inst X:                      IF   ID   EX   MEM
    Inst Y:                           IF   ID   EX
    Inst Z:                                IF   ID

    (Instructions X, Y, Z come from a different thread; thread 1's instructions d, e, f are simply not fetched while it waits.)

  • Performance of Coarse Grain
    Assume (conservatively): 1 GHz clock (1 ns clock tick!), 20 ns memory access (= 20 clocks), 1 I-cache miss per 100 instructions, 1 instruction per clock otherwise.
    Time to execute 100 instructions without multithreading: 100 + 20 clock cycles.
    Instructions per Clock = 100 / 120 = 0.83.
    With multithreading, time to execute 100 instructions: 100 [+ 1] cycles (the 20-cycle miss is hidden by the other thread; [+ 1] if the switch itself costs a cycle).
    Instructions per Clock = 100 / 101 = 0.99.
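    This model is easy to check numerically. A minimal sketch in Python, using the slide's figures as defaults (the parameter names and the switch_cost knob are mine):

```python
def ipc_coarse_grain(n_inst=100, miss_rate=0.01, miss_penalty=20,
                     multithreaded=False, switch_cost=1):
    """IPC under the slide's model: 1 instruction per clock, except that
    each I-cache miss either stalls the pipeline for miss_penalty cycles
    (single thread) or is hidden by another thread at switch_cost cycles."""
    misses = n_inst * miss_rate
    if multithreaded:
        cycles = n_inst + misses * switch_cost   # memory latency overlapped
    else:
        cycles = n_inst + misses * miss_penalty  # pipeline stalls on miss
    return n_inst / cycles

print(ipc_coarse_grain())                    # 100/120 ≈ 0.83
print(ipc_coarse_grain(multithreaded=True))  # 100/101 ≈ 0.99
```

    Note the model's key assumption: the other thread always has work ready and never misses itself.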

  • Switch Threads on D-cache Miss
    Performance: similar calculation (STATE ASSUMPTIONS!)
    Where to restart after the memory cycle? I suggest instruction a. Why? (Its load missed, so it has not completed and must re-execute when the thread resumes.) Abort the instructions behind it:

    Cycle:    1    2    3    4         5    6    7
    Inst a:   IF   ID   EX   MEM miss  --   --   --
    Inst b:        IF   ID   EX        (abort)
    Inst c:             IF   ID        (abort)
    Inst d:                  IF        (abort)
    Inst X:                            IF   ID   EX
    Inst Y:                                 IF   ID

    (The miss is only detected at instruction a's MEM stage, one stage later than an I-cache miss, so b, c and d are already in the pipeline and must be squashed before the switch.)
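    A sketch of the "similar calculation", with the assumptions stated explicitly: the miss rate and memory latency reuse the coarse-grain figures, while the number of squashed instructions re-executed per switch (aborted=3, matching b, c, d above) and the one-cycle switch cost are my assumptions:

```python
def ipc_dcache_switch(n_inst=100, miss_rate=0.01, miss_penalty=20,
                      multithreaded=False, aborted=3, switch_cost=1):
    """Like the I-cache model, but a D-cache miss is detected at MEM,
    so each switch also squashes and later re-executes 'aborted'
    younger instructions."""
    misses = n_inst * miss_rate
    if multithreaded:
        cycles = n_inst + misses * (switch_cost + aborted)
    else:
        cycles = n_inst + misses * miss_penalty
    return n_inst / cycles

print(ipc_dcache_switch())                    # single thread: 100/120 ≈ 0.83
print(ipc_dcache_switch(multithreaded=True))  # 100/104 ≈ 0.96
```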
