
  • 15-740/18-740 Computer Architecture

    Lecture 3: SIMD, MIMD, and ISA Principles

    Prof. Onur Mutlu, Carnegie Mellon University

    Fall 2011, 9/14/2011

  • Review of Last Lecture

    - Vector processors
      - Advantages and disadvantages
      - Amortization of control overhead over multiple data elements
      - Vector vs. scalar code execution time
    - SIMD vs. MIMD
    - Concept of on-chip networks

    2

  • Today

    - Wrap up intro to on-chip networks
    - ISA-level tradeoffs

    3

  • Review: SIMD vs. MIMD

    - SIMD: Concurrency arises from performing the same operations on different pieces of data
    - MIMD: Concurrency arises from performing different operations on different pieces of data
      - Control/thread parallelism: execute different threads of control in parallel → multithreading, multiprocessing
      - Idea: Use multiple processors to solve a problem

    4
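The SIMD/MIMD contrast above can be sketched in a few lines of Python (a toy illustration, not from the slides; the arrays, thread tasks, and names are hypothetical):

```python
import threading

# SIMD-style: one operation applied element-wise across many data elements.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
simd_result = [x + y for x, y in zip(a, b)]  # same op, different data

# MIMD-style: different operations on different data, via separate threads.
results = {}

def task_sum(data):      # one thread of control runs a reduction...
    results["sum"] = sum(data)

def task_scale(data):    # ...while another does element-wise scaling
    results["scaled"] = [2 * x for x in data]

t1 = threading.Thread(target=task_sum, args=(a,))
t2 = threading.Thread(target=task_scale, args=(b,))
t1.start(); t2.start()
t1.join(); t2.join()
```

In the SIMD case there is a single instruction stream whose control overhead is amortized over all elements; in the MIMD case each thread carries its own instruction stream.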

  • Review: Flynn’s Taxonomy of Computers

    - Mike Flynn, “Very High-Speed Computing Systems,” Proc. of the IEEE, 1966
    - SISD: Single instruction operates on a single data element
    - SIMD: Single instruction operates on multiple data elements
      - Array processor
      - Vector processor
    - MISD: Multiple instructions operate on a single data element
      - Closest forms: systolic array processor, streaming processor
    - MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
      - Multiprocessor
      - Multithreaded processor

    5

  • Review: On-Chip Network Based Multi-Core Systems

    - A scalable multi-core is a distributed system on a chip

    6

    [Figure: 4x4 mesh of routers (R), each attached to a processing element (PE: cores, L2 banks, memory controllers, accelerators, etc.). Router detail: buffered input ports (from East, West, North, South, and the local PE), each with virtual channels (VC 0, VC 1, VC 2); control logic containing the routing unit (RC), VC allocator (VA), and switch allocator (SA); and a crossbar driving the output ports (to East, West, North, South, and the local PE).]

  • Review: Idea of On-Chip Networks

    - Problem: Connecting many cores with a single bus is not scalable
      - A single point of connection limits communication bandwidth
        - What if multiple core pairs want to communicate with each other at the same time?
      - Electrical loading on the single bus limits bus frequency
    - Idea: Use a network to connect cores
      - Connect neighboring cores via short links
      - Communicate between cores by routing packets over the network
    - Dally and Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” DAC 2001.

    7

  • Review: Bus

    + Simple
    + Cost-effective for a small number of nodes
    + Easy to implement coherence (snooping)
    - Not scalable to a large number of nodes (limited bandwidth, electrical loading → reduced frequency)
    - High contention

    8

    [Figure: four processors, each with a private cache, attached to a single shared bus to memory.]

  • Review: Crossbar

    - Every node connected to every other node
    - Good for a small number of nodes
    + Least contention in the network: high bandwidth
    - Expensive
    - Not scalable due to quadratic cost

    Used in core-to-cache-bank networks in:
      - IBM POWER5
      - Sun Niagara I/II

    9

    [Figure: 8x8 crossbar connecting input ports 0-7 to output ports 0-7.]
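The quadratic-cost point can be made concrete with a back-of-the-envelope count (an illustration, not a real area model; it simply counts one crosspoint per input/output pair):

```python
# Simplified wiring-cost model: a bus needs roughly one tap per node,
# while an n x n crossbar needs a crosspoint for every (input, output) pair.
def bus_cost(n):
    return n          # one connection per node onto the shared bus

def crossbar_cost(n):
    return n * n      # quadratic: fine for 8 ports, painful for 1024

for n in (8, 64, 1024):
    print(f"n={n}: bus={bus_cost(n)}, crossbar={crossbar_cost(n)}")
```

This is why crossbars show up in small core-to-cache-bank networks but not as the global interconnect of a many-core chip.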

  • Review: Mesh

    - O(N) cost
    - Average latency: O(sqrt(N))
    - Easy to lay out on-chip: regular and equal-length links
    - Path diversity: many ways to get from one node to another
    - Used in the Tilera 100-core processor and many on-chip network prototypes

    10
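The O(sqrt(N)) average-latency claim can be checked numerically: the average hop distance over all node pairs of a k x k mesh grows linearly in k = sqrt(N). A small sketch (illustrative, not from the lecture):

```python
from itertools import product

def mesh_avg_hops(k):
    """Average hop count between distinct node pairs of a k x k mesh,
    assuming minimal (Manhattan-distance) routing."""
    nodes = list(product(range(k), range(k)))
    total = pairs = 0
    for (x1, y1), (x2, y2) in product(nodes, repeat=2):
        if (x1, y1) != (x2, y2):
            total += abs(x1 - x2) + abs(y1 - y2)
            pairs += 1
    return total / pairs

# Quadrupling N (doubling k) doubles the average distance: O(sqrt(N)).
for k in (2, 4, 8):
    print(k * k, mesh_avg_hops(k))
```

The brute-force average comes out to exactly 2k/3 hops, confirming the square-root scaling in N.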

  • Review: Torus

    - Mesh is not symmetric on edges: performance is very sensitive to placement of a task on an edge vs. in the middle
    - Torus avoids this problem
    + Higher path diversity than mesh
    - Higher cost
    - Harder to lay out on-chip
    - Unequal link lengths

    11

  • Review: Torus, continued

    - Weave nodes to make inter-node latencies ~constant

    12

    [Figure: folded torus layout of eight S-M-P nodes.]

  • Review: Example NoC: 100-Core Tilera Processor

    - Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007.

    13

  • The Need for QoS in the On-Chip Network

    - One can create malicious applications that continuously access the same resource → denial of service to less aggressive applications

    14

  • n  Need to provide packet scheduling mechanisms that ensure applications’ service requirements (bandwidth/latency) are satisfied

    n  Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost- effective QOS Scheme for Networks-on-Chip,” MICRO 2009.

    The Need for QoS in the On-Chip Network

    15
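To make "packet scheduling mechanisms" concrete, here is a minimal virtual-clock-style sketch. This is not the Preemptive Virtual Clock design of Grot et al.; it only illustrates the underlying idea of stamping packets with virtual finish times so a bandwidth hog cannot starve lighter flows. Flow names and weights are hypothetical.

```python
def fair_order(queued, weights):
    """queued: flow -> number of waiting packets.
    weights: flow -> provisioned bandwidth share.
    Serve packets in order of virtual finish time: each packet advances
    its flow's virtual clock by 1/weight, so a heavy flow is interleaved
    with light ones instead of monopolizing the link."""
    stamps = []
    vclock = {f: 0.0 for f in queued}
    for flow, count in queued.items():
        for _ in range(count):
            vclock[flow] += 1.0 / weights[flow]   # virtual finish time
            stamps.append((vclock[flow], flow))
    return [flow for _, flow in sorted(stamps)]

# An aggressive flow with 4 queued packets cannot starve a 2-packet flow
# of equal weight: service alternates until the light flow drains.
print(fair_order({"hog": 4, "app": 2}, {"hog": 1, "app": 1}))
```

With a plain FIFO, all four "hog" packets would be served before "app" ever got through; the virtual-clock stamps restore the provisioned shares.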

  • On-Chip Networks: Some Questions

    - Is mesh/torus the best topology?
    - How do you design the router?
      - Energy-efficient, low latency
      - What is the routing algorithm? Is it adaptive or deterministic?
    - How does the router prioritize between different threads’/applications’ packets?
      - How does the OS/application communicate the importance of applications to the routers?
      - How does the router provide bandwidth/latency guarantees to applications that need them?
    - Where do you place different resources (e.g., memory controllers)?
    - How do you maintain cache coherence?
    - How does the OS scheduler place tasks?
    - How is data placed in distributed caches?

    16
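For the routing-algorithm question, the standard deterministic baseline on a mesh is XY (dimension-order) routing: correct the X coordinate fully, then the Y coordinate, which avoids deadlock on a mesh. A minimal sketch (illustrative coordinates, not any specific router's implementation):

```python
def xy_route(src, dst):
    """Return the sequence of router coordinates visited after src,
    using deterministic XY (dimension-order) routing on a mesh."""
    x, y = src
    dx, dy = dst
    hops = []
    while x != dx:                    # travel along X first...
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                    # ...then along Y
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

# From (0, 0) to (2, 1): two X hops, then one Y hop.
print(xy_route((0, 0), (2, 1)))
```

An adaptive algorithm would instead choose among the minimal paths based on congestion, trading the simplicity and in-order delivery of XY for better load balance.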

  • What is Computer Architecture?

    n  The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.

    n  We will soon distinguish between the terms architecture, microarchitecture, and implementation.

    17

  • Why Study Computer Architecture?

    18

  • Moore’s Law

    19

    Moore, “Cramming more components onto integrated circuits,” Electronics Magazine, 1965.

  • Why Study Computer Architecture?

    - Make computers faster, cheaper, smaller, and more reliable
      - By exploiting advances and changes in underlying technology/circuits
    - Enable new applications
      - Life-like 3D visualization 20 years ago?
      - Virtual reality?
      - Personal genomics?
    - Adapt the computing stack to technology trends
      - Innovation in software is built on trends and changes in computer architecture
        - > 50% performance improvement per year
    - Understand why computers work the way they do

    20

  • An Example: Multi-Core Systems

    21

    [Die photo: a multi-core chip (AMD Barcelona) with Cores 0-3, private L2 caches 0-3, a shared L3 cache, the DRAM memory controller, and a DRAM interface to the DRAM banks. Die photo credit: AMD.]

  • Unexpected Slowdowns in Multi-Core

    22

    [Figure: a low-priority application on Core 0 acts as a memory performance hog, slowing down a high-priority application on Core 1.]

    Moscibroda and Mutlu, “Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,” USENIX Security 2007.

  • 23

    Why the Disparity in Slowdowns?

    CORE 1 CORE 2

    L2 CACHE

    L2 CACHE

    DRAM MEMORY CONTROLLER

    DRAM Bank 0

    DRAM Bank 1

    DRAM Bank 2

    Shared DRAM Memory System

    Multi-Core Chip

    unfairness INTERCONNECT

    matlab gcc

    DRAM Bank 3

  • DRAM Bank Operation

    24

    [Figure: a DRAM bank with a row decoder, rows, a row buffer, and a column mux. Access sequence: (Row 0, Column 0) finds the bank empty and activates Row 0 into the row buffer; (Row 0, Column 1) and (Row 0, Column 85) are row-buffer HITs; (Row 1, Column 0) is a row-buffer CONFLICT, requiring Row 0 to be closed and Row 1 to be activated.]
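The hit/conflict behavior in the bank-operation figure can be reproduced with a toy model of one bank's row buffer (a simplification that ignores all timing; the class and method names are made up):

```python
class Bank:
    """Toy model of a single DRAM bank's row buffer (no timing)."""
    def __init__(self):
        self.open_row = None            # no row latched: bank is empty

    def access(self, row, col):
        # col selects data through the column mux but does not affect
        # hit/miss status in this model.
        if self.open_row is None:       # empty bank: just activate the row
            self.open_row = row
            return "EMPTY"
        if self.open_row == row:        # row already open: column access only
            return "HIT"
        self.open_row = row             # close the old row, activate the new
        return "CONFLICT"

# The access sequence from the figure:
bank = Bank()
sequence = [(0, 0), (0, 1), (0, 85), (1, 0)]
print([bank.access(r, c) for r, c in sequence])
```

The final access is the expensive case: the controller must precharge the open row before activating the new one, which is why controllers try to schedule row-buffer hits together.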

  • DRAM Controllers

    - A row-conflict memory ac

    25