Hyper-Threading, Chip multiprocessors and both
Zoran Jovanovic
To Be Tackled in Multithreading
• Review of Threading Algorithms
• Hyper-Threading Concepts
• Hyper-Threading Architecture
• Advantages/Disadvantages
Threading Algorithms
• Time-slicing: the processor switches between threads at fixed time intervals. High overhead, especially if one of the threads is in a wait state. (Fine grain)
• Switch-on-event: tasks are switched on long pauses; while a thread waits for data from a relatively slow source, CPU resources are given to other threads. (Coarse grain)
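To make the overhead difference concrete, here is a minimal sketch in Python; the quantum, switch cost, and the two hypothetical threads are made-up numbers for illustration, not a model of any real scheduler.

```python
# Toy comparison of the two switching policies described above.
def simulate(threads, policy, quantum=2, switch_cost=1):
    """threads: list of [work_remaining, work_until_block]; returns total cycles."""
    total, i = 0, 0
    while any(t[0] > 0 for t in threads):
        t = threads[i % len(threads)]
        i += 1
        if t[0] <= 0:
            continue                        # this thread already finished
        if policy == "time-slice":
            run = min(quantum, t[0])        # run for a fixed quantum
        else:                               # "switch-on-event"
            run = min(t[1], t[0])           # run until the thread blocks (long wait)
        t[0] -= run
        total += run + switch_cost          # every switch costs extra cycles
    return total

make = lambda: [[10, 5], [10, 5]]           # two threads: 10 units of work, block every 5
print("time-slicing   :", simulate(make(), "time-slice"))       # many switches
print("switch-on-event:", simulate(make(), "switch-on-event"))  # fewer switches
```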
Threading Algorithms (cont.)
• Multiprocessing: distribute the load over many processors. Adds extra cost.
• Simultaneous multi-threading: multiple threads execute on a single processor without switching. The basis of Intel's Hyper-Threading technology.
Hyper-Threading Concept
• At each point in time, only a part of the processor's resources is used to execute the program code of a thread.
• Unused resources can also be loaded, for example, by executing another thread/application in parallel.
• Extremely useful in desktop and server applications where many threads are used.
Quick Recall: Many Resources IDLE!
From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.
For an 8-way superscalar.
Slide source: John Kubiatowicz
(a) A superscalar processor with no multithreading
(b) A superscalar processor with coarse-grain multithreading
(c) A superscalar processor with fine-grain multithreading
(d) A superscalar processor with simultaneous multithreading (SMT)
Simultaneous Multithreading (SMT)
Example: the new Pentium with "Hyperthreading"
Key idea: exploit ILP across multiple threads, i.e., convert thread-level parallelism into more ILP. Exploit the following features of modern processors:
• Multiple functional units: modern processors typically have more functional units available than a single thread can utilize.
• Register renaming and dynamic scheduling: multiple instructions from independent threads can co-exist and co-execute!
Hyper-Threading Architecture
• First used in the Intel Xeon MP processor.
• Makes a single physical processor appear as multiple logical processors.
• Each logical processor has its own copy of the architecture state.
• Logical processors share a single set of physical execution resources.
Hyper-Threading Architecture (cont.)
• Operating systems and user programs can schedule processes or threads to logical processors as if they were physical processors in a multiprocessing system.
• From an architecture perspective, we have to worry about the logical processors sharing resources: caches, execution units, branch predictors, control logic, and buses.
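From software, the split between logical and physical processors can be seen with a small sketch; it assumes CPython, and psutil is a third-party package that may not be installed.

```python
# How many processors does the OS schedule onto, vs. how many physical cores exist?
import os

print("logical processors seen by the OS:", os.cpu_count())

try:
    import psutil                                        # optional third-party package
    print("physical cores:", psutil.cpu_count(logical=False))
except ImportError:
    pass                                                 # on Linux, `lscpu` shows the same
```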
Power 5 dataflow ...
Why only two threads? With four, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.
Cost: the Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.
Advantages
• The extra architecture adds only about 5% to the total die area.
• No performance loss if only one thread is active.
• Increased performance with multiple threads.
• Better resource utilization.
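One rough way to observe the multi-thread benefit is to time the same CPU-bound work on one logical processor and then on two. This sketch assumes Linux (for os.sched_setaffinity) and CPython, and it simply guesses at logical CPUs 0 and 1; whether they are SMT siblings of one physical core depends on the machine (check `lscpu -e`).

```python
# Time identical busy-work on 1 vs. 2 pinned worker processes.
import multiprocessing as mp
import os
import time

def spin(args):
    n, cpu = args
    os.sched_setaffinity(0, {cpu})       # pin this worker to one logical CPU (Linux only)
    x = 0
    for i in range(n):
        x += i * i                       # CPU-bound busy work
    return x

def timed(cpus, n=20_000_000):
    start = time.perf_counter()
    with mp.Pool(len(cpus)) as pool:
        pool.map(spin, [(n, c) for c in cpus])
    return time.perf_counter() - start

if __name__ == "__main__":
    print("1 worker :", round(timed([0]), 2), "s")
    print("2 workers:", round(timed([0, 1]), 2), "s (same work per worker)")
```

If CPUs 0 and 1 really are SMT siblings, the two-worker run typically takes somewhat longer than the one-worker run but noticeably less than twice as long, which is the resource-sharing effect described above.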
Disadvantages
• To take advantage of hyper-threading performance, serial execution cannot be used.
• Threads are non-deterministic and involve extra design effort.
• Threads have increased overhead.
• Shared resource conflicts.
Multicore
Multiprocessors on a single chip
Basic Shared Memory Architecture
• Processors are all connected to a large shared memory. Where are the caches?
• Now take a closer look at structure, costs, limits, programming.
[Figure: processors P1 ... Pn connected through an interconnect to a shared memory.]
Slide source: John Kubiatowicz
What About Caching???
• We want high performance for shared memory: use caches! Each processor has its own cache (or multiple caches); data from memory is placed into the cache; with a write-back cache, not every write is sent over the bus to memory.
• Caches reduce average latency: automatic replication of data closer to the processor. This matters even more for a multiprocessor than for a uniprocessor, because latencies are longer.
• Normal uniprocessor mechanisms are used to access data: loads and stores form a very low-overhead communication primitive.
• Problem: cache coherence!
[Figure: processors P1 ... Pn, each with a cache ($), on a shared bus with memory and I/O devices.]
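As a minimal illustration of "loads and stores as the communication primitive," the sketch below has two threads communicate through one shared Python dictionary; the names and the Event used for ordering are illustrative, and real hardware adds caches and a coherence protocol underneath this simple picture.

```python
# Two threads communicating through ordinary loads and stores to shared memory.
import threading

shared = {"u": 5}                 # one object visible to both threads
ready = threading.Event()

def producer():
    shared["u"] = 7               # an ordinary store into shared memory
    ready.set()

def consumer():
    ready.wait()                  # wait until the producer has stored
    print("consumer loads u =", shared["u"])   # an ordinary load sees the new value

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t2.start(); t1.start()
t1.join(); t2.join()
```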
Example Cache Coherence Problem
[Figure: P1, P2, and P3, each with a private cache ($), share a bus with memory and I/O devices. Events: (1) P1 reads u from memory (u = 5); (2) P3 reads u (u = 5); (3) P3 writes u = 7 into its cache; (4) P1 reads u again; (5) P2 reads u.]
Things to note:
• Processors could see different values for u after event 3.
• With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when.
How to fix it with a bus: a coherence protocol.
• Use the bus to broadcast writes or invalidations.
• Simple protocols rely on the presence of a broadcast medium.
• The bus is not scalable beyond about 64 processors (max): capacity and bandwidth limitations.
Slide source: John Kubiatowicz
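The sequence in the figure can be replayed with a toy model of per-processor write-back caches and no coherence protocol; the dictionary "caches" and the helper names are illustrative only.

```python
# Toy replay of the figure's events with write-back caches and no coherence.
memory = {"u": 5}
caches = {"P1": {}, "P2": {}, "P3": {}}   # one private cache per processor

def read(p, var):
    if var not in caches[p]:              # miss: fetch from memory, ignoring other caches
        caches[p][var] = memory[var]
    return caches[p][var]

def write(p, var, value):
    caches[p][var] = value                # write-back: memory is NOT updated yet

read("P1", "u")                           # event 1: P1 reads u -> 5
read("P3", "u")                           # event 2: P3 reads u -> 5
write("P3", "u", 7)                       # event 3: P3 writes u = 7 (only in P3's cache)
print("event 4, P1 reads u:", read("P1", "u"))   # stale 5 from P1's cache
print("event 5, P2 reads u:", read("P2", "u"))   # stale 5 from memory
# A bus-based protocol would broadcast the write (or an invalidation) at event 3
# and supply the dirty block from P3, so events 4 and 5 would see u = 7.
```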
Limits of Bus-Based Shared Memory
[Figure: processors, each with a cache, on a shared bus with multiple memory modules and I/O.]
Assume a 1 GHz processor without a cache:
• => 4 GB/s instruction bandwidth per processor (32-bit instructions)
• => 1.2 GB/s data bandwidth at a 30% load-store frequency
Suppose a 98% instruction hit rate and a 95% data hit rate:
• => 80 MB/s instruction bandwidth per processor
• => 60 MB/s data bandwidth per processor, 140 MB/s combined bandwidth
Assuming 1 GB/s bus bandwidth:
• => 8 processors will saturate the bus
(Per processor: 5.2 GB/s of demand without caches vs. 140 MB/s with caches.)
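The slide's arithmetic, spelled out as a small Python calculation using the same assumed rates:

```python
# Bus saturation estimate from the slide's assumptions.
clock_hz   = 1e9            # 1 GHz processor
word_bytes = 4              # 32-bit instructions and data
ls_frac    = 0.30           # 30% of instructions are loads/stores
inst_hit, data_hit = 0.98, 0.95
bus_bw     = 1e9            # 1 GB/s bus

inst_bw = clock_hz * word_bytes             # 4 GB/s instruction demand without a cache
data_bw = clock_hz * ls_frac * word_bytes   # 1.2 GB/s data demand without a cache

# With caches, only misses reach the bus.
per_cpu = inst_bw * (1 - inst_hit) + data_bw * (1 - data_hit)
print(f"per-processor bus demand: {per_cpu / 1e6:.0f} MB/s")      # 140 MB/s
print(f"processors to saturate the bus: {bus_bw / per_cpu:.1f}")  # ~7.1, so 8 saturate it
```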
Cache Organizations for Multi-cores
• L1 caches are always private to a core
• L2 caches can be private or shared
• Advantages of a shared L2 cache: efficient dynamic allocation of space to each core; data shared by multiple cores is not replicated; every block has a fixed "home", so it is easy to find the latest copy.
• Advantages of a private L2 cache: quick access to the private L2, which is good for small working sets; a private bus to the private L2 means less contention.
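A toy sketch of the dynamic-allocation point, using made-up working-set sizes and a fully associative LRU model (no real cache behaves exactly like this): a core with a large working set benefits when the two cores share one larger L2 instead of splitting it into two halves.

```python
# Compare misses for two private L2s vs. one shared L2 of the same total capacity.
from collections import OrderedDict

def misses(accesses, capacity):
    """Miss count for a fully associative LRU cache holding `capacity` blocks."""
    cache, miss = OrderedDict(), 0
    for addr in accesses:
        if addr in cache:
            cache.move_to_end(addr)             # hit: mark as most recently used
        else:
            miss += 1
            cache[addr] = None
            if len(cache) > capacity:
                cache.popitem(last=False)       # evict the least recently used block
    return miss

core_a = list(range(96)) * 10                   # large working set: 96 blocks
core_b = list(range(1000, 1016)) * 60           # small working set: 16 blocks

private = misses(core_a, 64) + misses(core_b, 64)            # two private 64-block L2s
interleaved = [x for pair in zip(core_a, core_b) for x in pair]
shared = misses(interleaved, 128)                            # one shared 128-block L2

print("private L2 misses:", private)            # core A thrashes its 64-block half
print("shared  L2 misses:", shared)             # only cold misses for both cores
```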
A Reminder: SMT (Simultaneous Multi Threading)
SMT vs. CMP
A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer, 1997
• For the same area (roughly a billion-transistor, DRAM-sized die):
Superscalar and SMT: very complex
• Wide issue
• Advanced branch prediction
• Register renaming
• Out-of-order instruction issue
• Non-blocking data caches
[Figure: the superscalar (SS), SMT, and CMP designs compared.]
SS and SMT vs. CMP
CPU cores: three main hardware design problems of SS and SMT:
• Area increases quadratically with core complexity: the number of registers grows with the instruction-window size, and the number of register ports grows with the issue width. CMP solves this problem (area grows roughly linearly with total issue width).
• Longer cycle times: long wires, many MUXes and crossbars, and large buffers, queues, and register files. The alternatives are clustering (which decreases ILP) or deep pipelining (larger branch-misprediction penalties). CMP allows a small cycle time with little effort: each core is small and fast, but relies on software to schedule work and so has poor per-core ILP.
• Complex design and verification.
SS and SMT vs. CMP: Memory
• A 12-issue SS or SMT requires a multiported data cache (4-6 ports), e.g. 2 x 128 KB with 2-cycle latency.
• CMP uses 16 x 16 KB primary caches (single-cycle latency), but its secondary cache is slower (multiported).
• Shared memory: write-through caches.
Performance comparison
• Compress (integer application): low ILP and no TLP.
• Mpeg-2 (multimedia application): high ILP and TLP, moderate memory requirements (parallelized by hand).
  + SMT utilizes core resources better, but CMP has 16 issue slots instead of 12.
• Tomcatv (FP application): large loop-level parallelism and large memory bandwidth (TLP extracted by the compiler).
  + CMP has large memory bandwidth on the primary caches.
  - SMT's fundamental problem: a unified and slow cache.
• Multiprogram: integer multiprogramming workload, all computation-intensive (low ILP, high PLP).
CMP Motivation
How to utilize the available silicon?
• Speculation (aggressive superscalar)
• Simultaneous multithreading (SMT, Hyper-Threading)
• Several processors on a single chip
What is a CMP (Chip MultiProcessor)?
• Several processors (several masters)
• Both shared- and distributed-memory architectures
• Both homogeneous and heterogeneous processor types
Why?
• Wire delays
• Diminishing returns of uniprocessors
• Very long design and verification times for modern processors
A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer, 1997
• TLP and PLP will become widespread in future applications (various multimedia applications, compilers, and the OS), which favours CMP.
CMP:
• Better performance with simple hardware
• Higher clock rates, better memory bandwidth
• Shorter pipelines
SMT has better utilization, but CMP has more resources (no wide-issue logic).
Although CMP does poorly when there is neither TLP nor ILP (compress), SMT and SS are not much better.
A Reminder: SMT (Simultaneous Multi Threading)
SMT:
• A pool of execution units (a wide machine)
• Several logical processors, with a copy of the architecture state for each
• Multiple threads run concurrently
• Better utilization and latency tolerance
CMP:
• Simple cores
• A moderate amount of parallelism
• Threads run concurrently on different cores
SMT Dual-core: all four threads can run concurrently
[Figure: two SMT cores side by side, each with its own BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc logic, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, plus BTB, L2 cache and control, and bus interface. The four threads run concurrently, two per core.]