Hier wird Wissen Wirklichkeit Computer Architecture – Part 10 – page 1 of 31 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt
Part 10
Thread- and Task-Level Parallelism
Computer Architecture
Slide Sets
WS 2010/2011
Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt
Basic concepts
Thread:
Threads are lightweight processes consisting of several instructions. All threads share a common (virtual) address space, through which they can communicate.
Task:
Tasks are heavyweight processes. Each task has its own address space.
Tasks can only communicate via inter-task communication channels such as
shared memory, pipes, message queues, or sockets. A task can contain
several threads.
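The difference can be illustrated with a short sketch (a minimal Python illustration, not from the slides; the counter and function names are my own): several threads update one variable that all of them see, whereas separate tasks would each work on a private copy and need an inter-task channel.

```python
import threading

counter = 0                      # one variable in the common address space
lock = threading.Lock()          # threads must still synchronize updates

def work(n):
    global counter
    for _ in range(n):
        with lock:               # avoid lost updates on the shared counter
            counter += 1

threads = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All four threads incremented the same counter: 4 * 1000 updates.
```

Separate tasks (processes) would each increment a private copy of `counter`; the parent would see no change unless the result were sent back through shared memory, a pipe, a message queue, or a socket.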
Basic concepts
Instruction-level parallelism is limited. To further exploit parallel processing, thread- or task-level parallelism can be used.
Two major architectures are known:
• Multithreaded processors exploit thread-level parallelism
• Chip multiprocessors (multi-core processors, many-core processors) exploit task-level parallelism
Both concepts are also used in combination.
Basic concepts
In a multithreaded processor, instructions of several threads of the program are candidates for concurrent issue.
This can be done in a classical scalar pipeline to hide the latencies of memory accesses: instructions from several threads can then be in flight in the different pipeline stages.
It can also be combined with a superscalar pipeline to raise the level of exploitable parallelism from the intra-thread to the inter-thread level.
This is called SMT (Simultaneous Multithreading).
Basic concepts
Chip multiprocessors combine multiple processor cores on a single chip; therefore, these processors are also called multi-core processors.
Today's multi-core processors integrate 2-8 cores on a chip.
For future processors with a much larger number of cores (e.g. > 100), the term many-core processor is used.
These cores can execute several tasks in parallel.
Cores can be homogeneous or heterogeneous.
With multithreaded cores, multithreading and chip multiprocessing can be combined.
Multithreaded Architectures
Multithreaded processor:
Supports the execution of multiple threads by hardware
It can store the context information of several threads in separate register sets and execute instructions of different threads at the same time in the processor pipeline
Different stages of the processor pipeline can contain instructions from different threads
This exploits thread-level parallelism on the basis of parallelism in time (pipelining)
Multithreaded Architectures
Goal:
Reduction of latencies caused by memory accesses or dependencies
Such latencies can be bridged by switching to another thread
During the latency, instructions from other threads are fed into the pipeline
=> processor utilization is raised and the throughput of a load consisting of multiple threads increases (while the throughput of a single thread remains the same)
• Explicit multithreaded processors: each thread is a real thread of the application program
• Implicit multithreaded processors: speculative parallel threads are created dynamically out of a sequential program
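The latency-bridging idea can be made concrete with a toy utilization model (my own sketch, not from the slides): every instruction is followed by a fixed memory latency, and a fine-grain multithreaded core fills the stall cycles with instructions from other threads.

```python
def utilization(n_threads, ops_per_thread, mem_latency):
    """Fraction of cycles in which the single issue slot is busy."""
    remaining = [ops_per_thread] * n_threads
    ready_at = [0] * n_threads        # cycle at which a thread may issue again
    cycle = busy = 0
    while any(r > 0 for r in remaining):
        for t in range(n_threads):    # pick the first ready thread
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + 1 + mem_latency  # stall after each op
                busy += 1
                break                 # only one issue slot per cycle
        cycle += 1
    return busy / cycle

# With a 3-cycle latency after every instruction, a single thread keeps
# the pipeline busy only about a quarter of the time; four interleaved
# threads hide the latency completely.
single = utilization(1, 10, 3)
interleaved = utilization(4, 10, 3)
```

The single-thread throughput is unchanged (each thread still waits out its own latencies); only the overall pipeline utilization rises, exactly as stated above.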
Basic multithreading techniques
(a) Single-threaded processor
(b) Cycle-by-cycle interleaving technique (fine-grain multithreading): the context is switched every clock cycle
(c) Block interleaving technique (coarse-grain multithreading): instructions of a thread are executed until an event causes a latency; then the context is switched
[Figure: pipeline occupancy over time (processor cycles) for (a)-(c); numbers (1)-(4) denote the threads, arrows mark the context switches.]
Comparing multithreading to superscalar and VLIW
a: four-way superscalar processor
b: four-way VLIW processor
c: four-way superscalar processor with cycle-by-cycle interleaving
d: four-way VLIW processor with cycle-by-cycle interleaving
[Figure: issue-slot occupancy over time (processor cycles) for (a)-(d); numbers (1)-(4) denote the threads, N marks empty issue slots, arrows mark the context switches.]
Classification of block interleaving techniques
Block interleaving
• static: explicit-switch, implicit-switch (switch-on-load, switch-on-store, switch-on-branch, ...)
• dynamic: switch-on-cache-miss, switch-on-use, conditional-switch, switch-on-signal
Simultaneous multithreading (SMT)
[Figure: SMT pipeline stages: Instruction Fetch, Instruction Decode and Rename, Instruction Window, Issue, Reservation Stations, Execution, Retire and Write Back; instructions of threads 1-4 proceed through the pipeline simultaneously.]
A simultaneous multithreaded processor is able to issue instructions of multiple threads to multiple execution units in a single clock cycle.
This exploits thread-level and instruction-level parallelism in time and space.
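This combination can be sketched with a toy model (my own illustration, not from the slides): an SMT core has several issue slots per cycle and fills them with instructions from any ready threads, so adding threads raises utilization even when each thread stalls frequently.

```python
def smt_utilization(n_threads, ops_per_thread, mem_latency, issue_width):
    """Fraction of issue slots filled; each op is followed by a stall."""
    remaining = [ops_per_thread] * n_threads
    ready_at = [0] * n_threads        # cycle at which a thread may issue again
    cycle = busy = 0
    while any(r > 0 for r in remaining):
        slots = issue_width
        for t in range(n_threads):    # fill the slots from any ready threads
            if slots > 0 and remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + 1 + mem_latency
                busy += 1
                slots -= 1
        cycle += 1
    return busy / (cycle * issue_width)

# With a 3-cycle stall after every instruction, four threads cannot keep
# a 4-wide core busy; eight threads fill noticeably more issue slots.
few = smt_utilization(4, 10, 3, 4)
many = smt_utilization(8, 10, 3, 4)
```

Issuing from multiple threads in the same cycle is the "parallelism in space" part; interleaving threads across cycles is the "parallelism in time" part.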
Comparing SMT to chip multiprocessing
Simultaneous multithreading (a) and chip multiprocessing (b)
[Figure: issue-slot occupancy over time (processor cycles); in (a) instructions of threads (1)-(4) share the issue slots of one SMT pipeline, in (b) each thread runs on its own core of a chip multiprocessor.]
Other applications of multithreading
Resulting from the ability of fast context switching, more application fields for multithreading arise:
• Reduction of energy consumption
Mispredictions in superscalar processors cost energy.
Multithreaded processors can execute instructions from other threads instead.
• Event handling
Helper threads handle special events (e.g. garbage collection)
• Real-time processing
Allows efficient real-time scheduling policies like LLF (least laxity first) or GP (guaranteed percentage)
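A least-laxity-first decision can be sketched in a few lines (my own illustration; the thread tuples are invented example values): the laxity of a thread is the time to its deadline minus its remaining execution time, and the thread with the least laxity runs next.

```python
def least_laxity_first(now, threads):
    # threads: list of (name, absolute_deadline, remaining_exec_time).
    # Laxity = (deadline - now) - remaining work; the thread that can
    # least afford to wait is scheduled next.
    return min(threads, key=lambda t: (t[1] - now) - t[2])[0]

# Thread B has laxity (6 - 0) - 4 = 2, the smallest, so it runs first.
chosen = least_laxity_first(0, [("A", 10, 3), ("B", 6, 4), ("C", 8, 2)])
```

On a multithreaded processor, re-evaluating this choice is cheap because the contexts of all candidate threads are already held in hardware register sets.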
Chip multiprocessing architectures
A chip multiprocessor (CMP) combines several processors on a single chip.
Instead of chip multiprocessor, today this is also called a multi-core processor, where a core denotes a single processor on the multi-core processor chip.
Each core can have the complexity of today's microprocessors and holds its own primary cache for instructions and data.
Usually, the cores are organized as memory-coupled multiprocessors with a shared address space.
Furthermore, a secondary cache is contained on the chip.
For future multi-core processors containing a large number of cores (> 100), the term many-core processor is used.
Possible multi-core-configurations (1)
[Figure: two configurations. Shared main memory: each of the four processors has a private primary and secondary cache; only the global memory is shared. Shared secondary cache: each processor has a private primary cache; the secondary cache and the global memory are shared.]
Possible multi-core-configurations (2)
[Figure: shared primary cache: all four processors share the primary cache, the secondary cache, and the global memory.]
Chip-Multiprocessor / Multi-Core
Simulations show the shared secondary cache architecture to be superior to shared primary cache and shared main memory.
Therefore, mostly a large shared secondary cache is implemented on the processor chip.
Cache coherence protocols known from symmetric multiprocessor architectures (e.g. the MESI protocol) guarantee correct access to the shared memory cells from inside and outside the processor chip.
Today, chip multiprocessing is often combined with simultaneous multithreading.
There, each core is an SMT core, giving the advantages of both approaches.
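The MESI protocol mentioned above can be sketched as a state-transition table (a simplified sketch; the event names are my own, and write-backs and bus signalling are reduced to comments): each cache line is Modified, Exclusive, Shared, or Invalid, and local accesses and observed bus events move it between these states.

```python
# Simplified MESI transitions for one cache line.
MESI = {
    ("I", "read_miss_exclusive"): "E",  # no other cache holds the line
    ("I", "read_miss_shared"):    "S",  # another cache supplies a copy
    ("I", "write_miss"):          "M",  # read-for-ownership on the bus
    ("E", "local_write"):         "M",  # silent upgrade, no bus traffic
    ("E", "bus_read"):            "S",  # another core reads the line
    ("E", "bus_read_exclusive"):  "I",
    ("S", "local_write"):         "M",  # invalidates the other copies
    ("S", "bus_read_exclusive"):  "I",
    ("M", "bus_read"):            "S",  # dirty line is written back
    ("M", "bus_read_exclusive"):  "I",  # dirty line is written back
}

def next_state(state, event):
    # Events not listed (e.g. a local read hit) leave the state unchanged.
    return MESI.get((state, event), state)
```

For example, a line written by one core (state M) drops to S when another core reads it, so both cores subsequently observe a coherent value.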
An early single-chip multiprocessor proposal: Hydra
[Figure: the Hydra chip combines four CPUs, each with its own primary I-cache, primary D-cache, and memory controller, on a single chip together with an on-chip secondary cache (SRAM array), centralized bus arbitration mechanisms, a Rambus memory interface to the DRAM main memory, an off-chip L3 interface, an I/O bus interface, and DMA.]
Multi-Core examples
IBM Power5
Symmetric multi-core processor with two 64-bit two-way SMT cores, each having a 64 KByte instruction cache and a 32 KByte data cache
Both cores share a 1.41 MByte on-chip secondary cache
The controller for the third-level cache is on chip as well
Four Power5 chips and four L3 cache chips are combined in a multi-chip module
Multi-Core examples
IBM Power6
Similar to Power5, but superscalar in-order execution
Level-1 cache size raised to 64 KBytes for instructions and for data on each core
65 nm process
5 GHz clock frequency
Multi-Core examples
IBM Power7
Released in 2010
4, 6, or 8 cores
Turbo mode deactivates 4 of the 8 cores but gives the remaining 4 cores access to all memory controllers => improves single-core performance
Each core supports four-way SMT
45 nm process
4 GHz clock frequency
Multi-Core examples
Intel Core 2 Duo (Wolfdale)
2 processor cores of the Intel Core 2 architecture
32 KBytes data and instruction cache for each core
6 MBytes L2 cache
45 nm process
3 GHz clock frequency
[Figure: die photo showing Core 1, Core 2, and the L2 cache shared by both cores.]
Microarchitecture of the Intel Core 2 family (a single core)
Source: c’t 16/2006
Multi-Core examples
Multi-Core examples
Intel Core 2 Quad (Yorkfield)
2 Wolfdale dies in a multi-chip module
=> 4 processor cores of the Intel Core 2 architecture
32 KBytes data and instruction cache for each core
6 MBytes L2 cache per die
45 nm process
3 GHz clock frequency
While homogeneous multi-core processors are commonly used for general-purpose computing, heterogeneous multi-core processors are seen as a future trend for embedded systems.
A first member of this technology is the IBM Cell processor, containing a Power processor (Power Processor Element, PPE) and 8 dependent processors (Synergistic Processing Elements, SPE).
PPE: based on the Power architecture, two-way SMT, controls the 8 SPEs
SPE: contains a RISC processor with 128-bit SIMD (multimedia) instructions, a memory flow controller, and a bus controller
Originally designed for the Sony PlayStation 3, the Cell processor is now used in various application domains.
Heterogeneous multi-cores
Cell Processor Die
Multi-Core discussion: performance
Due to multithreading in PC and server operating systems, two to four cores significantly increase the processor throughput.
Exploiting eight or more cores requires parallel application programs.
Hence, software development is challenged to deliver the necessary number of parallel threads by either parallelizing compilers or parallel applications.
Experience with multiprocessors shows that a moderate number of parallel threads yields a high performance improvement, but this does not scale to a higher degree of parallelism.
Beginning with 4 to 8 threads, the performance improvement drops dramatically.
With 8 cores, except for very compute-intensive applications, some cores will be temporarily idle.
Furthermore, memory bandwidth can become a bottleneck.
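The limited scaling described above is what Amdahl's law predicts: the serial fraction of a program caps the achievable speedup regardless of the core count. A small numeric check (the 90 % parallel fraction is an assumed example value):

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's law: the serial part (1 - p) runs on one core, while the
    # parallel part p is divided among all cores.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# 90 % parallel code: 4 cores already give ~3.1x, but even an enormous
# number of cores can never exceed 1 / 0.1 = 10x.
s4 = amdahl_speedup(0.9, 4)
s8 = amdahl_speedup(0.9, 8)
s_many = amdahl_speedup(0.9, 10**6)
```

Doubling the cores from 4 to 8 gives well under twice the speedup, matching the observation that improvement drops beyond 4 to 8 threads.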
While current multi-core processors use cache-coupled interconnection, future processors might rely on grid structures (network-on-chip) to improve performance.
Adaptive and reconfigurable MPSoCs (Multi-Processor Systems-on-Chip) will gain importance for embedded systems and general-purpose computing.
Reconfigurable cache memories might allow variable connections to different cores.
Available input/output bandwidth is still an open problem for throughput-oriented programs.
Multi-Core discussion: hardware
For data access, transactional memory might be a model for future multi-core processors:
• Similar to database systems, memory access is organized as a transaction that is executed completely or not at all
• Hardware support for checkpointing and rollback is necessary
• As an advantage, concurrent access is simplified (no locks)
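The all-or-nothing semantics can be sketched in software (a minimal optimistic-concurrency sketch of the idea, not a real hardware transactional-memory interface; the class and method names are invented): a transaction reads a snapshot, computes speculatively, and commits only if no conflicting commit happened in between; otherwise its effects are discarded and it is re-executed.

```python
import threading

class TxVar:
    """One transactional variable with commit/rollback by versioning."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._commit_lock = threading.Lock()

    def transact(self, fn):
        while True:
            snap_value, snap_version = self._value, self._version
            new_value = fn(snap_value)             # speculative execution
            with self._commit_lock:
                if self._version == snap_version:  # no conflict: commit
                    self._value = new_value
                    self._version = snap_version + 1
                    return new_value
            # conflict detected: roll back (discard new_value) and retry

x = TxVar(10)
x.transact(lambda v: v + 5)       # commits atomically: x now holds 15
```

The user code in `fn` runs without holding any lock, matching the "no locks" advantage; only the commit step itself is serialized (in hardware, by checkpointing and rollback).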
Furthermore, fault-tolerance and dependability techniques will become more important, as the error probability increases with shrinking transistor dimensions.
On-chip power management will keep the importance it already has today.
Multi-Core discussion: hardware
Currently, operating system concepts known from memory-coupled multiprocessor systems are used: the operating system scheduler assigns independent processes to the available processors.
In contrast to these concepts, the closer coupling of the cores in a multi-core processor leads to a different "computation versus synchronization" ratio, allowing more fine-grained parallelism to be used.
Parallel computing will become the future standard programming model.
Most of the currently existing software is sequential and can thus run only on one core.
Programming languages and tools to exploit the fine-grained parallelism of multi-core processors need to be developed.
Furthermore, software engineering techniques are needed to allow the development of safe parallel programs.
Multi-Core discussion: software
The application development for multi-core processors will become one of the main future markets for computer scientists.
Today's applications have to be reworked with the goal of exploiting parallelism, gaining performance, and increasing comfort.
New applications that are currently not realizable due to a lack of processor performance will arise.
These are hard to predict.
Possible applications must have a need for high computational performance that is reachable by parallelism.
Such applications might come from speech recognition, image recognition, data mining, learning technologies, or hardware synthesis.
Multi-Core discussion: software