Hier wird Wissen Wirklichkeit Computer Architecture – Part 10 – page 1 of 31 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt
Part 10
Thread- and Task-Level Parallelism
Computer Architecture
Slide Sets
WS 2010/2011
Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt
Basic concepts
Thread:
Threads are lightweight processes consisting of several instructions. All threads share a common (virtual) address space, through which they can communicate.
Task:
Tasks are heavyweight processes. Each task has its own address space.
Tasks can only communicate via inter-task communication channels such as
shared memory, pipes, message queues, or sockets. A task can contain
several threads.
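The difference can be illustrated with a short sketch (a minimal Python illustration, not from the slides; the counter and function names are my own): several threads update one variable that all of them see, whereas separate tasks would each work on a private copy and need an inter-task channel.

```python
import threading

counter = 0                      # one variable in the common address space
lock = threading.Lock()          # threads must still synchronize updates

def work(n):
    global counter
    for _ in range(n):
        with lock:               # avoid lost updates on the shared counter
            counter += 1

threads = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All four threads incremented the same counter: 4 * 1000 updates.
```

Separate tasks (processes) would each increment a private copy of `counter`; the parent would see no change unless the result were sent back through shared memory, a pipe, a message queue, or a socket.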
Basic concepts
Instruction-level parallelism is limited. To further exploit parallel processing, thread- or task-level parallelism can be used.
Two major architectures are known:
• Multithreaded processors exploit thread-level parallelism
• Chip multiprocessors (multi-core processors, many-core processors) exploit task-level parallelism
Both concepts are also used in combination.
Basic concepts
In a multithreaded processor, instructions of several threads of the program are candidates for concurrent issue.
This can be done in a classical scalar pipeline to hide the latencies of memory accesses: instructions from several threads can then be in flight in the different pipeline stages.
It can also be combined with a superscalar pipeline to raise the level of exploitable parallelism from the intra-thread to the inter-thread level.
This is called SMT (Simultaneous Multithreading).
Basic concepts
Chip multiprocessors combine multiple processor cores on a single chip; therefore, these processors are also called multi-core processors.
Today's multi-core processors integrate 2-8 cores on a chip.
For future processors with a much larger number of cores (e.g. > 100), the term many-core processor is used.
These cores can execute several tasks in parallel.
Cores can be homogeneous or heterogeneous.
With multithreaded cores, multithreading and chip multiprocessing can be combined.
Multithreaded Architectures
Multithreaded processor:
Supports the execution of multiple threads by hardware
It can store the context information of several threads in separate register sets and execute instructions of different threads at the same time in the processor pipeline
Different stages of the processor pipeline can contain instructions from different threads
This exploits thread-level parallelism on the basis of parallelism in time (pipelining)
Multithreaded Architectures
Goal:
Reduction of latencies caused by memory accesses or dependencies
Such latencies can be bridged by switching to another thread
During the latency, instructions from other threads are fed into the pipeline
=> processor utilization is raised and the throughput of a load consisting of multiple threads increases (while the throughput of a single thread remains the same)
• Explicit multithreaded processors: each thread is a real thread of the application program
• Implicit multithreaded processors: speculative parallel threads are created dynamically out of a sequential program
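The latency-bridging idea can be made concrete with a toy utilization model (my own sketch, not from the slides): every instruction is followed by a fixed memory latency, and a fine-grain multithreaded core fills the stall cycles with instructions from other threads.

```python
def utilization(n_threads, ops_per_thread, mem_latency):
    """Fraction of cycles in which the single issue slot is busy."""
    remaining = [ops_per_thread] * n_threads
    ready_at = [0] * n_threads        # cycle at which a thread may issue again
    cycle = busy = 0
    while any(r > 0 for r in remaining):
        for t in range(n_threads):    # pick the first ready thread
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + 1 + mem_latency  # stall after each op
                busy += 1
                break                 # only one issue slot per cycle
        cycle += 1
    return busy / cycle

# With a 3-cycle latency after every instruction, a single thread keeps
# the pipeline busy only about a quarter of the time; four interleaved
# threads hide the latency completely.
single = utilization(1, 10, 3)
interleaved = utilization(4, 10, 3)
```

The single-thread throughput is unchanged (each thread still waits out its own latencies); only the overall pipeline utilization rises, exactly as stated above.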
Basic multithreading techniques
(a) Single-threaded processor
(b) Cycle-by-cycle interleaving technique (fine-grain multithreading): the context is switched every clock cycle
(c) Block interleaving technique (coarse-grain multithreading): instructions of a thread are executed until an event causes a latency; then the context is switched
[Figure: pipeline occupancy over time (processor cycles) for (a)-(c); numbers (1)-(4) denote the threads, arrows mark the context switches.]
Comparing multithreading to superscalar and VLIW
a: four-way superscalar processor
b: four-way VLIW processor
c: four-way superscalar processor with cycle-by-cycle interleaving
d: four-way VLIW processor with cycle-by-cycle interleaving
[Figure: issue-slot occupancy over time (processor cycles) for (a)-(d); numbers (1)-(4) denote the threads, N marks empty issue slots, arrows mark the context switches.]
Classification of block interleaving techniques
Block interleaving
• static: explicit-switch, implicit-switch (switch-on-load, switch-on-store, switch-on-branch, ...)
• dynamic: switch-on-cache-miss, switch-on-use, conditional-switch, switch-on-signal
Simultaneous multithreading (SMT)
[Figure: SMT pipeline stages: Instruction Fetch, Instruction Decode and Rename, Instruction Window, Issue, Reservation Stations, Execution, Retire and Write Back; instructions of threads 1-4 proceed through the pipeline simultaneously.]
A simultaneous multithreaded processor is able to issue instructions of multiple threads to multiple execution units in a single clock cycle.
This exploits thread-level and instruction-level parallelism in time and space.
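This combination can be sketched with a toy model (my own illustration, not from the slides): an SMT core has several issue slots per cycle and fills them with instructions from any ready threads, so adding threads raises utilization even when each thread stalls frequently.

```python
def smt_utilization(n_threads, ops_per_thread, mem_latency, issue_width):
    """Fraction of issue slots filled; each op is followed by a stall."""
    remaining = [ops_per_thread] * n_threads
    ready_at = [0] * n_threads        # cycle at which a thread may issue again
    cycle = busy = 0
    while any(r > 0 for r in remaining):
        slots = issue_width
        for t in range(n_threads):    # fill the slots from any ready threads
            if slots > 0 and remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + 1 + mem_latency
                busy += 1
                slots -= 1
        cycle += 1
    return busy / (cycle * issue_width)

# With a 3-cycle stall after every instruction, four threads cannot keep
# a 4-wide core busy; eight threads fill noticeably more issue slots.
few = smt_utilization(4, 10, 3, 4)
many = smt_utilization(8, 10, 3, 4)
```

Issuing from multiple threads in the same cycle is the "parallelism in space" part; interleaving threads across cycles is the "parallelism in time" part.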
Comparing SMT to chip multiprocessing
Simultaneous multithreading (a) and chip multiprocessing (b)
[Figure: issue-slot occupancy over time (processor cycles); in (a) instructions of threads (1)-(4) share the issue slots of one SMT pipeline, in (b) each thread runs on its own core of a chip multiprocessor.]
Other applications of multithreading
Resulting from the ability of fast context switching, more application fields for multithreading arise:
• Reduction of energy consumption
Mispredictions in superscalar processors cost energy.
Multithreaded processors can execute instructions from other threads instead.
• Event handling
Helper threads handle special events (e.g. garbage collection)
• Real-time processing
Allows efficient real-time scheduling policies like LLF (least laxity first) or GP (guaranteed percentage)
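A least-laxity-first decision can be sketched in a few lines (my own illustration; the thread tuples are invented example values): the laxity of a thread is the time to its deadline minus its remaining execution time, and the thread with the least laxity runs next.

```python
def least_laxity_first(now, threads):
    # threads: list of (name, absolute_deadline, remaining_exec_time).
    # Laxity = (deadline - now) - remaining work; the thread that can
    # least afford to wait is scheduled next.
    return min(threads, key=lambda t: (t[1] - now) - t[2])[0]

# Thread B has laxity (6 - 0) - 4 = 2, the smallest, so it runs first.
chosen = least_laxity_first(0, [("A", 10, 3), ("B", 6, 4), ("C", 8, 2)])
```

On a multithreaded processor, re-evaluating this choice is cheap because the contexts of all candidate threads are already held in hardware register sets.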
Chip multiprocessing architectures
A chip multiprocessor (CMP) combines several processors on a single chip.
Instead of chip multiprocessor, today this is also called a multi-core processor, where a core denotes a single processor on the multi-core processor chip.
Each core can have the complexity of today's microprocessors and holds its own primary cache for instructions and data.
Usually, the cores are organized as memory-coupled multiprocessors with a shared address space.
Furthermore, a secondary cache is contained on the chip.
For future multi-core processors containing a large number of cores (> 100), the term many-core processor is used.
Possible multi-core-configurations (1)
[Figure: two configurations. Shared main memory: each of the four processors has a private primary and secondary cache; only the global memory is shared. Shared secondary cache: each processor has a private primary cache; the secondary cache and the global memory are shared.]
Possible multi-core-configurations (2)
[Figure: shared primary cache: all four processors share the primary cache, the secondary cache, and the global memory.]
Chip-Multiprocessor / Multi-Core
Simulations show the shared secondary cache architecture to be superior to shared primary cache and shared main memory.
Therefore, mostly a large shared secondary cache is implemented on the processor chip.
Cache coherence protocols known from symmetric multiprocessor architectures (e.g. the MESI protocol) guarantee correct access to the shared memory cells from inside and outside the processor chip.
Today, chip multiprocessing is often combined with simultaneous multithreading.
There, each core is an SMT core, giving the advantages of both approaches.
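The MESI protocol mentioned above can be sketched as a state-transition table (a simplified sketch; the event names are my own, and write-backs and bus signalling are reduced to comments): each cache line is Modified, Exclusive, Shared, or Invalid, and local accesses and observed bus events move it between these states.

```python
# Simplified MESI transitions for one cache line.
MESI = {
    ("I", "read_miss_exclusive"): "E",  # no other cache holds the line
    ("I", "read_miss_shared"):    "S",  # another cache supplies a copy
    ("I", "write_miss"):          "M",  # read-for-ownership on the bus
    ("E", "local_write"):         "M",  # silent upgrade, no bus traffic
    ("E", "bus_read"):            "S",  # another core reads the line
    ("E", "bus_read_exclusive"):  "I",
    ("S", "local_write"):         "M",  # invalidates the other copies
    ("S", "bus_read_exclusive"):  "I",
    ("M", "bus_read"):            "S",  # dirty line is written back
    ("M", "bus_read_exclusive"):  "I",  # dirty line is written back
}

def next_state(state, event):
    # Events not listed (e.g. a local read hit) leave the state unchanged.
    return MESI.get((state, event), state)
```

For example, a line written by one core (state M) drops to S when another core reads it, so both cores subsequently observe a coherent value.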
An early single-chip multiprocessor proposal: Hydra
[Figure: the Hydra chip combines four CPUs, each with its own primary I-cache, primary D-cache, and memory controller, on a single chip together with an on-chip secondary cache (SRAM array), centralized bus arbitration mechanisms, a Rambus memory interface to the DRAM main memory, an off-chip L3 interface, an I/O bus interface, and DMA.]
Multi-Core examples
IBM Power5
Symmetric multi-core processor with two 64-bit two-way SMT cores, each having a 64 KByte instruction cache and a 32 KByte data cache
Both cores share a 1.41 MByte on-chip secondary cache
The controller for the third-level cache is on chip as well
Four Power5 chips and four L3 cache chips are combined in a multi-chip module
Multi-Core examples
IBM Power6
Similar to Power5, but superscalar in-order execution
Level-1 cache size raised to 64 KBytes for instructions and for data on each core
65 nm process
5 GHz clock frequency
Multi-Core examples
IBM Power7
Released in 2010
4, 6, or 8 cores
Turbo mode deactivates 4 of the 8 cores but gives the remaining 4 cores access to all memory controllers => improves single-core performance
Each core supports four-way SMT
45 nm process
4 GHz clock frequency
Multi-Core examples
Intel Core 2 Duo (Wolfdale)
2 processor cores of the Intel Core 2 architecture
32 KBytes data and instruction cache for each core
6 MBytes L2 cache
45 nm process
3 GHz clock frequency
[Figure: die photo showing Core 1, Core 2, and the L2 cache shared by both cores.]
Microarchitecture of the Intel Core 2 family (a single core)
Source: c’t 16/2006
Multi-Core examples
Multi-Core examples
Intel Core 2 Quad (Yorkfield)
2 Wolfdale dies in a multi-chip module
=> 4 processor cores of the Intel Core 2 architecture
32 KBytes data and instruction cache for each core
6 MBytes L2 cache per die
45 nm process
3 GHz clock frequency
While homogeneous multi-core processors are commonly used for general-purpose computing, heterogeneous multi-core processors are seen as a future trend for embedded systems.
A first member of this technology is the IBM Cell processor, containing a Power processor (Power Processor Element, PPE) and 8 dependent processors (Synergistic Processing Elements, SPE).
PPE: based on the Power architecture, two-way SMT, controls the 8 SPEs
SPE: contains a RISC processor with 128-bit SIMD (multimedia) instructions, a memory flow controller, and a bus controller
Originally designed for the Sony PlayStation 3, the Cell processor is now used in various application domains.
Heterogeneous multi-cores
Cell Processor Die
Multi-Core discussion: performance
Due to multithreading in PC and server operating systems, two to four cores significantly increase the processor throughput.
Exploiting eight or more cores requires parallel application programs.
Hence, software development is challenged to deliver the necessary number of parallel threads by either parallelizing compilers or parallel applications.
Experience with multiprocessors shows that a moderate number of parallel threads yields a high performance improvement, but this does not scale to a higher degree of parallelism.
Beginning with 4 to 8 threads, the performance improvement drops dramatically.
With 8 cores, except for very compute-intensive applications, some cores will be temporarily idle.
Furthermore, memory bandwidth can become a bottleneck.
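The limited scaling described above is what Amdahl's law predicts: the serial fraction of a program caps the achievable speedup regardless of the core count. A small numeric check (the 90 % parallel fraction is an assumed example value):

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's law: the serial part (1 - p) runs on one core, while the
    # parallel part p is divided among all cores.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# 90 % parallel code: 4 cores already give ~3.1x, but even an enormous
# number of cores can never exceed 1 / 0.1 = 10x.
s4 = amdahl_speedup(0.9, 4)
s8 = amdahl_speedup(0.9, 8)
s_many = amdahl_speedup(0.9, 10**6)
```

Doubling the cores from 4 to 8 gives well under twice the speedup, matching the observation that improvement drops beyond 4 to 8 threads.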
While current multi-core processors use cache-coupled interconnection, future processors might rely on grid structures (network-on-chip) to improve performance.
Adaptive and reconfigurable MPSoCs (Multi-Processor Systems-on-Chip) will gain importance for embedded systems and general-purpose computing.
Reconfigurable cache memories might allow variable connections to different cores.
Available input/output bandwidth is still an open problem for throughput-oriented programs.
Multi-Core discussion: hardware
For data access, transactional memory might be a model for future multi-core processors:
• Similar to database systems, memory access is organized as a transaction that is executed completely or not at all
• Hardware support for checkpointing and rollback is necessary
• As an advantage, concurrent access is simplified (no locks)
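The all-or-nothing semantics can be sketched in software (a minimal optimistic-concurrency sketch of the idea, not a real hardware transactional-memory interface; the class and method names are invented): a transaction reads a snapshot, computes speculatively, and commits only if no conflicting commit happened in between; otherwise its effects are discarded and it is re-executed.

```python
import threading

class TxVar:
    """One transactional variable with commit/rollback by versioning."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._commit_lock = threading.Lock()

    def transact(self, fn):
        while True:
            snap_value, snap_version = self._value, self._version
            new_value = fn(snap_value)             # speculative execution
            with self._commit_lock:
                if self._version == snap_version:  # no conflict: commit
                    self._value = new_value
                    self._version = snap_version + 1
                    return new_value
            # conflict detected: roll back (discard new_value) and retry

x = TxVar(10)
x.transact(lambda v: v + 5)       # commits atomically: x now holds 15
```

The user code in `fn` runs without holding any lock, matching the "no locks" advantage; only the commit step itself is serialized (in hardware, by checkpointing and rollback).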
Furthermore, fault-tolerance and dependability techniques will become more important, as the error probability increases with shrinking transistor dimensions.
On-chip power management will keep the importance it already has today.
Multi-Core discussion: hardware
Currently, operating system concepts known from memory-coupled multiprocessor systems are used: the operating system scheduler assigns independent processes to the available processors.
In contrast to these concepts, the closer coupling of the cores in a multi-core processor leads to a different "computation versus synchronization" ratio, allowing more fine-grained parallelism to be used.
Parallel computing will become the future standard programming model.
Most of the currently existing software is sequential and can thus run only on one core.
Programming languages and tools to exploit the fine-grained parallelism of multi-core processors need to be developed.
Furthermore, software engineering techniques are needed to allow the development of safe parallel programs.
Multi-Core discussion: software
The application development for multi-core processors will become one of the main future markets for computer scientists.
Today's applications have to be reworked with the goal of exploiting parallelism, gaining performance, and increasing comfort.
New applications that are currently not realizable due to a lack of processor performance will arise.
These are hard to predict.
Possible applications must have a need for high computational performance that is reachable by parallelism.
Such applications might come from speech recognition, image recognition, data mining, learning technologies, or hardware synthesis.
Multi-Core discussion: software