
Page 1: Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput

Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput

Seminar on :

Advanced Computer Architecture

Page 2

Outline

Multithreading

Multithreading approaches

How Are Resources Shared?

Effectiveness of Fine MT on the Sun T1

Effectiveness of SMT on Superscalar Processors

References

Page 3

Multithreading

Multithreading is a primary technique for exposing more parallelism to the hardware.

In a strict sense, multithreading uses thread-level parallelism, but its role both in improving pipeline utilization and in GPUs motivates introducing the concept here, alongside techniques that increase performance by exploiting ILP.

Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. In contrast, a more general method to exploit thread-level parallelism (TLP) is a multiprocessor, which has multiple independent threads operating at once and in parallel.

Multithreading does not duplicate the entire processor as a multiprocessor does. Instead, it shares most of the processor core among a set of threads, duplicating only private state, such as the registers and the program counter.

Page 4

Fine-grained multithreading switches between threads on each clock, causing the execution of instructions from multiple threads to be interleaved. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that time.

The advantage of this approach is that it can hide throughput losses (latency) that arise from both short and long stalls.

The primary disadvantage of this approach is that it slows down the execution of an individual thread.

Processors that use this approach:

The Sun Niagara.

NVIDIA GPUs.
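The round-robin-with-skip policy described above can be sketched as a toy scheduler. This is an illustrative model only (the thread names and the `stalled` set are hypothetical inputs), not how any real processor implements thread selection:

```python
def next_thread(threads, stalled, last):
    """Round-robin thread selection that skips stalled threads.

    threads: list of thread IDs; stalled: set of stalled thread IDs;
    last: thread that issued on the previous clock (or None).
    Returns the thread to issue this cycle, or None if all are stalled.
    """
    n = len(threads)
    start = 0 if last is None else (threads.index(last) + 1) % n
    for i in range(n):
        t = threads[(start + i) % n]
        if t not in stalled:
            return t
    return None  # every thread is stalled: the core idles this cycle

# Four threads; T1 is stalled, so the rotation skips it each pass.
order = []
last = None
for _ in range(6):
    last = next_thread(["T0", "T1", "T2", "T3"], {"T1"}, last)
    order.append(last)
# order is T0, T2, T3, T0, T2, T3 -- T1 never issues while stalled
```

The point of the sketch is that selection stays strictly rotational, so no single thread can monopolize the pipeline, which is exactly why an individual thread runs slower under fine-grained multithreading.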

Multithreading approaches

Page 5

Coarse-grained multithreading was invented as an alternative to fine-grained multithreading.

Coarse-grained multithreading switches threads only on costly stalls, such as level-two or level-three cache misses.

This change relieves the need to make thread switching essentially free, and it is much less likely to slow down the execution of any one thread, since instructions from other threads will only be issued when a thread encounters a costly stall.

Coarse-grained multithreading suffers from a major drawback: it is limited in its ability to overcome throughput losses, especially from shorter stalls, because the pipeline must be refilled from the new thread after each switch.

No major processors use this technique.
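That drawback can be made concrete with a toy cost model: switching threads costs a pipeline refill, so stalls shorter than that start-up cost cannot be hidden. The refill cost and stall lengths below are hypothetical round numbers, not measurements from any processor:

```python
def idle_cycles(stall, refill, coarse):
    """Idle issue cycles caused by one stall, under a toy cost model.

    Coarse-grained: switching pays `refill` idle cycles to restart the
    pipeline with the new thread, so a stall costs min(stall, refill)
    (for a stall shorter than the refill, switching never pays off).
    Fine-grained: another ready thread issues immediately (cost 0).
    """
    if not coarse:
        return 0  # assume at least one other thread is ready
    return min(stall, refill)

SHORT_STALL, LONG_STALL = 3, 200  # e.g. L1 miss vs. L2/L3 miss, in cycles
REFILL = 6                        # hypothetical pipeline start-up cost

coarse_short = idle_cycles(SHORT_STALL, REFILL, coarse=True)  # 3: nothing hidden
coarse_long = idle_cycles(LONG_STALL, REFILL, coarse=True)    # 6: 194 cycles hidden
fine_long = idle_cycles(LONG_STALL, REFILL, coarse=False)     # 0: fully hidden
```

Under this model, coarse-grained multithreading hides almost all of a long stall but none of a short one, which matches the slide's claim about where its throughput losses come from.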

Multithreading approaches

Page 6

Simultaneous multithreading (SMT) is the most common implementation of multithreading; it is a variation on fine-grained multithreading.

It arises naturally when fine-grained multithreading is implemented on top of a multiple-issue, dynamically scheduled processor.

SMT exploits thread-level parallelism at the same time it exploits ILP, using TLP to hide long-latency events in the processor.

The key insight in SMT is that register renaming and dynamic scheduling allow multiple instructions from independent threads to be executed without regard to the dependences among them.

The resolution of the dependences can be handled by the dynamic scheduling capability.

The Intel Core i7 and IBM Power7 use SMT.

Multithreading approaches

Page 7

How Are Resources Shared?

The following figure shows the differences in a processor's ability to exploit the resources of a superscalar for the following configurations:

A superscalar with no multithreading support

A superscalar with coarse-grained multithreading

A superscalar with fine-grained multithreading

A superscalar with simultaneous multithreading

In the superscalar without multithreading support, the use of issue slots is limited by a lack of ILP, including the ILP needed to hide memory latency. Because of the length of L2 and L3 cache misses, much of the processor can be left idle.

Page 8

How Are Resources Shared?

Figure 1 How four different approaches use the functional-unit execution slots of a superscalar processor.

The horizontal dimension represents the instruction execution capability in each clock.

The vertical dimension represents a sequence of clock cycles.

An empty (white) box indicates that the corresponding execution slot is unused.

The shades of gray and black correspond to four different threads in the multithreaded processors.

Page 9

How Are Resources Shared?

In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by switching to another thread. This switching reduces the number of completely idle clock cycles. However, thread switching only occurs when there is a stall, and the new thread has a start-up period, so there are likely to be some fully idle cycles remaining.

Fine-grained multithreading can only issue instructions from a single thread in a cycle, so it cannot always fill every issue slot, but cache misses can be tolerated.

Simultaneous multithreading can issue instructions from any thread every cycle, so it has the highest probability of finding work for every issue slot.
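The difference between the last two points can be illustrated by counting filled issue slots in a single clock. The per-thread ready counts and the issue width below are hypothetical:

```python
# Hypothetical: issuable instructions each thread has ready this cycle.
ready = {"T0": 1, "T1": 0, "T2": 3, "T3": 2}
WIDTH = 4  # issue slots per clock in this toy superscalar

# Fine-grained MT: every slot in a cycle must come from ONE thread,
# so the best this cycle can do is the single richest thread.
fine_filled = min(WIDTH, max(ready.values()))  # 3 slots used, 1 idle

# SMT: slots may be filled from ANY ready thread, so instructions
# from different threads pack the full issue width together.
smt_filled = min(WIDTH, sum(ready.values()))   # all 4 slots used
```

With these numbers, fine-grained multithreading leaves one slot idle because no single thread has four ready instructions, while SMT fills the whole issue width by mixing threads.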

Page 10

Sun T1 Processor Overview

The T1 is a fine-grained multithreaded (Fine MT), multicore microprocessor introduced by Sun in 2005.

It is totally focused on exploiting thread-level parallelism (TLP) rather than instruction-level parallelism (ILP).

The T1 returned to a simple pipeline strategy and focused on exploiting TLP, using multiple cores and multithreading to produce throughput.

It has 8 processor cores, each supporting 4 threads; each core consists of a 6-stage single-issue pipeline (a standard five-stage RISC pipeline with one stage added for thread switching).

The Sun T1 processor had the best performance on integer applications with extensive TLP and demanding memory performance, such as SPECJBB and transaction-processing workloads.

Effectiveness of Fine MT on the Sun T1

Page 11

Figure 2 A summary of the T1 processor.

Effectiveness of Fine MT on the Sun T1

Page 12

T1 Multithreading Unicore Performance

To examine the performance of the T1, we use three server-oriented benchmarks:

TPC-C

SPECJBB

SPECWeb99

Since multiple threads increase the memory demand from a single processor, they could overload the memory system, leading to reductions in the potential gain from multithreading.

The next figures show the effectiveness of fine-grained multithreading on the Sun T1.

Effectiveness of Fine MT on the Sun T1

Page 13

Figure 3 The relative change in the miss rates and miss latencies when executing with one thread per core versus four threads per core on the TPC-C benchmark.

Effectiveness of Fine MT on the Sun T1

Page 14

Figure 4 Breakdown of the status on an average thread.

Remember that "not ready" does not imply that the core with that thread is stalled; it is only when all four threads are not ready that the core will stall.

A thread can be not ready due to cache misses or pipeline delays.

Effectiveness of Fine MT on the Sun T1

Page 15

Figure 5 The breakdown of causes for a thread being not ready

A thread can be not ready due to cache misses or pipeline delays.

The figure above shows the relative frequency of the various causes.

Effectiveness of Fine MT on the Sun T1

Page 16

Figure 6 The per-thread CPI, the per-core CPI, the effective eight-core CPI, and the effective IPC (inverse of CPI) for the eight-core T1 processor.
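The four quantities in the caption are related by simple arithmetic. The per-thread CPI below is a hypothetical round number chosen for illustration, not a measured T1 value:

```python
THREADS_PER_CORE = 4
CORES = 8

per_thread_cpi = 6.0  # hypothetical CPI observed by any single thread

# Four interleaved threads retire instructions 4x as often as one
# thread alone, so the core-level CPI is 4x lower:
per_core_cpi = per_thread_cpi / THREADS_PER_CORE  # 1.5

# Eight cores retire in parallel, dropping the chip-wide CPI 8x more:
effective_cpi = per_core_cpi / CORES              # 0.1875

# Effective IPC is just the inverse of the effective CPI:
effective_ipc = 1 / effective_cpi                 # ~5.3 instructions/clock
```

This shows how a chip built from slow single-issue threads can still sustain a high chip-wide instruction rate, which is the T1's whole design point.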

Effectiveness of Fine MT on the Sun T1

Page 17

Early simulation-based research results for SMT were unrealistically optimistic.

In practice, existing implementations show that the gain from SMT is more modest.

The Intel Core i7 supports SMT with two threads. The following figures show the performance ratio and the energy-efficiency ratio.


Effectiveness of SMT on Superscalar Processors

Page 18


Figure 7 The speedup from using multithreading on one core on an i7 processor averages 1.28 for the Java benchmarks and 1.31 for the PARSEC benchmarks.

Effectiveness of SMT on Superscalar Processors

Page 19

In the PARSEC benchmarks, SMT reduces energy by 7%. These results clearly show that SMT in an aggressive speculative processor with extensive support for SMT can improve performance in an energy-efficient fashion, which the more aggressive ILP approaches have failed to do.

Indeed, Esmaeilzadeh et al. [2011] show that the energy improvements from SMT are even larger on the Intel i5 (a processor similar to the i7, but with smaller caches and a lower clock rate) and the Intel Atom (an 80×86 processor designed for the netbook market).
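Because energy = power × time, the speedup and energy figures above together imply how much extra power SMT draws. A back-of-envelope check using the PARSEC numbers from these slides (1.31 speedup, 7% energy reduction):

```python
speedup = 1.31       # PARSEC speedup with SMT, from Figure 7
energy_ratio = 0.93  # 7% less energy with SMT enabled

# energy = power * time, and SMT cuts time by a factor of `speedup`, so:
#   energy_ratio = power_ratio / speedup
# which rearranges to:
power_ratio = energy_ratio * speedup  # ~1.22
```

So under these numbers SMT draws roughly 22% more power, yet the job still finishes enough sooner that total energy drops, which is the sense in which the slides call SMT energy-efficient.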

Effectiveness of SMT on Superscalar Processors

Page 20

John L. Hennessy and David A. Patterson, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann (an imprint of Elsevier), Waltham, MA, © 2012 Elsevier, Inc., pp. 223-232.

References

Page 21

Thank You…