19
High Performance Computin g 1 High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors * Jeremy R. Johnson *This lecture was derived from material in the text (HPC Chap. 1-2).

High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors *

  • Upload
    lilia

  • View
    22

  • Download
    0

Embed Size (px)

DESCRIPTION

High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors *. Jeremy R. Johnson. *This lecture was derived from material in the text (HPC Chap. 1-2). Introduction. - PowerPoint PPT Presentation

Citation preview

Page 1: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 1

High Performance Computing (CS 680)

Lecture 2a: Overview of High Performance Processors*

Jeremy R. Johnson

*This lecture was derived from material in the text (HPC Chap. 1-2).

Page 2: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 2

Introduction• Objective: To review recent developments in the design of high

performance microprocessors. To indicate how these features effect program performance.

• An example program will be used to illustrate benchmarking techniques and the effect of compiler optimizations and code organization on performance. We will indicate how changes in software can improve performance by better utilizing the underlying hardware. Our goal for the course is to understand this behavior.

• Topics– pipelining

– instruction level parallelism, superscalar and out of order execution

– Memory Hierarchy: cache, virtual memory

Page 3: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 3

RISC vs. CISC •CISC: instruction set made up of powerful instructions close to primitives in a high-level language such as C or FORTRAN

•RISC: low level instructions are emphasized. RISC is a label most commonly used for a set of instruction set architecture characteristics chosen to ease the use of aggressive implementation techniques found in high-performance processors (John Mashey)

•Prevalence began in mid-1980s (earlier example CDC 6600) when more transistors and better compilers became available.

•Trade complex instructions for faster clock rate and more room for extra registers, cache and advanced performance techniques.

Page 4: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 4

Characterizing RISC

• Instruction pipelining

• Pipelining floating point execution

• Uniform instruction length

• Delayed branching

• Load/Store architecture

• Simple addressing modes

Page 5: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 5

Pipelining

• Instruction pipelining– Instruction Fetch– Instruction Decode– Operand Fetch– Execute– Writeback

IF ID F E W

IF ID F E W

IF ID F E W

Page 6: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 6

Branches and Hazards

• If a branch is executed the pipeline may need to be flushed since the wrong instructions may have been started.

IF ID F E W

IF ID F E W

IF ID F E W

IF ID F E W

IF ID F E

guess

guess

guess

sure

Page 7: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 7

Advanced Techniques

• Superscalar Processors– issue more than one instruction per cycle– can’t have dependencies or hardware conflict– for example can execute an add simultaneously with a mult

• Superpipeling– more stages in the pipeline

• Out of order and speculative execution– maintain semantics but allow instructions to be computed in different

order – may need to guess which instruction to execute– depends on difference between computation and execution

Page 8: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 8

Post-RISC Pipeline

IF ID

IRB

E

RR

R

Instruction Reorder Buffer

Rename Registers

Branch Prediction

Page 9: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 9

Memory Hierarchy

• SRAM vs. DRAM– small fast memory vs. large slow memory– principle of locality

• Registers• Cache (level 1)• Cache (level 2)• Main memory• Disk

Page 10: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 10

Memory Access Speed on DEC 21164 Alpha

• Clock Speed 500 MHz (= 2 ns clock rate)

• Registers (2 ns)

• L1 On-Chip (4 ns)

• L2 On-Chip (5 ns)

• L3 Off-Chip (30 ns)

• Memory (220 ns)

Page 11: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 11

Common Framework for Memory Hierarchies

• Question 1: Where can a block be placed?– One place (direct mapped), a few places (set associative), or any place

(fully associative)

• Question 2: How is a block found?– There are four methods: indexing, limited search, full search, or table

lookup

• Question 3: Which block should be replaced on a cache miss?

– Typically least recently used or random block

• Question 4: What happens on writes?– Write-through or write-back

Page 12: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 12

Mapping to Cache

• Cache - the level of the memory hierarchy between the CPU and main memory.

• Direct-Mapped Cache - memory mapped to one location in cache

(Block address) mod (Number of block in cache)• Number of blocks is typically a power of two cache

location obtained from low-order bits of address.

...

0123

28293031

0123

Page 13: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 13

Locating an Data in the Cache

• Compare cache index (mapping) to Tag (high-order bits) to see if element is currently in cache

• Valid bit used to indicate whether data in cache is valid

• A hit occurs when the data is in cache, otherwise it is a miss

• The extra time required when a cache miss occurs is called the miss penalty

Address (showing bit positions)

20 10

Byteoffset

Valid Tag DataIndex

0

1

2

1021

1022

1023

Tag

Index

Hit Data

20 32

31 30 13 12 11 2 1 0

Page 14: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 14

Example

• 32-word memory• 8-word cache

Address Binary Cache block Hit or miss22 10110 11026 11010 01022 10110 11026 11010 01016 10000 000

3 00011 01116 10000 00018 10010 010

Index Valid Tag Data000001010011100101110

Page 15: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 15

Cache Organization

• Since cache is smaller than memory more than one address must map to same line in cache

• Direct-Mapped Cache– address mod cache size (only one location when memory address gets

mapped to)

• Fully Associative Cache– address can be mapped anywhere in cache– need tag and associative search to find if element in cache

• Set-Associative Cache– compromise between two extremes– element can map to several locations

Page 16: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 16

Model for Cache Misses

• Compulsory misses– These are cache misses caused by the first access to a block that has

never been in the cache

• Capacity misses (cold-start)– These are cache misses caused when the cache cannot contain all the

blocks needed during execution of a program.

• Conflict misses (collision)– These are cache misses that occur in a set associative or direct

mapped cache when multiple blocks compete for the same set. These misses are eliminated with a fully associative cache.

Page 17: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 17

Measuring Cache Performance

• CPU time = (CPU execution clock cycles +

Memory stall clock cycles) Clock-cycle time• Memory stall clock cycles =

Read-stall cycles + Write-stall cycles• Read-stall cycles = Reads/program Read miss rate Read miss penalty• Write-stall cycles = (Writes/program Write miss rate Write miss penalty) + Write buffer stalls

(assumes write-through cache)• Write buffer stalls should be negligible and write and read miss

penalties equal (cost to fetch block from memory)• Memory stall clock cycles = Mem access/program miss rate

miss penalty

Page 18: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 18

Virtual Memory

• Decouple physical addresses (memory locations) from addresses used by a program. Programmer sees a large memory with the same virtual addresses independent of where the program is actually placed in memory.

– Virtual to physical mapping performed via a page table– Since page tables can be in virtual memory, there could be several table

lookups for a single memory reference. – TLB (translation lookaside buffer) is a cache to store commonly used

virtual to physical maps.

• Page Fault– when page is not in memory it must be brought in (from disk)– very slow (usually occurs with OS intervention)

Page 19: High Performance Computing  (CS 680) Lecture 2a:  Overview of High Performance Processors *

High Performance Computing 19

Improving Memory Performance

• Larger and wider caches

• Cache bypass

• Interleaved and pipelined memory systems

• Prefetching

• Post-RISC effects on memory

• New memory trends