52
CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/ ~kubitron/cs252 http://www-inst.eecs.berkeley.edu/

CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

Embed Size (px)

Citation preview

Page 1: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

CS252Graduate Computer Architecture

Lecture 11

Vector Processing

John KubiatowiczElectrical Engineering and Computer Sciences

University of California, Berkeley

http://www.eecs.berkeley.edu/~kubitron/cs252

http://www-inst.eecs.berkeley.edu/~cs252

Page 2: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 2

Review: Simultaneous Multi-threading ...

1

2

3

4

5

6

7

8

9

M M FX FX FP FP BR CCCycleOne thread, 8 units

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

1

2

3

4

5

6

7

8

9

M M FX FX FP FP BR CCCycleTwo threads, 8 units

Page 3: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 3

Review: Multithreaded CategoriesTi

me

(pro

cess

or

cycle

)Superscalar Fine-Grained Coarse-Grained Multiprocessing

SimultaneousMultithreading

Thread 1

Thread 2Thread 3Thread 4

Thread 5Idle slot

Page 4: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 4

Design Challenges in SMT• Since SMT makes sense only with fine-grained

implementation, impact of fine-grained scheduling on single thread performance?

– A preferred thread approach sacrifices neither throughput nor single-thread performance?

– Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput, when preferred thread stalls

• Larger register file needed to hold multiple contexts

• Clock cycle time, especially in:– Instruction issue - more candidate instructions need to be

considered– Instruction completion - choosing which instructions to commit

may be challenging

• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance

Page 5: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 5

Power 4

Single-threaded predecessor to Power 5. 8 execution units inout-of-order engine, each mayissue an instruction each cycle.

Page 6: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 6

Power 4Power 4

Power 5Power 5

2 fetch (PC),2 initial decodes

2 commits (architected register sets)

Page 7: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 7

Power 5 data flow ...

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Page 8: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 8

Power 5 thread performance ...

Relative priority of each thread controllable in hardware.

For balanced operation, both threads run slower than if they “owned” the machine.

Page 9: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 9

Changes in Power 5 to support SMT• Increased associativity of L1 instruction cache

and the instruction address translation buffers • Added per thread load and store queues • Increased size of the L2 (1.92 vs. 1.44 MB) and L3

caches• Added separate instruction prefetch and

buffering per thread• Increased the number of virtual registers from

152 to 240• Increased the size of several issue queues• The Power5 core is about 24% larger than the

Power4 core because of the addition of SMT support

Page 10: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 10

Initial Performance of SMT• Pentium 4 Extreme SMT yields 1.01 speedup for

SPECint_rate benchmark and 1.07 for SPECfp_rate– Pentium 4 is dual threaded SMT

– SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark

• Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (262 runs) speed-ups from 0.90 to 1.58; average was 1.20

• Power 5, 8 processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate

• Power 5 running 2 copies of each app speedup between 0.89 and 1.41

– Most gained some

– Fl.Pt. apps had most cache conflicts and least gains

Page 11: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 11

Processor Micro architecture Fetch / Issue /

Execute

FU Clock Rate (GHz)

Transis-tors

Die size

Power

Intel Pentium

4 Extreme

Speculative dynamically

scheduled; deeply pipelined; SMT

3/3/4 7 int. 1 FP

3.8 125 M 122 mm2

115 W

AMD Athlon 64

FX-57

Speculative dynamically scheduled

3/3/4 6 int. 3 FP

2.8 114 M 115 mm2

104 W

IBM Power5 (1 CPU only)

Speculative dynamically

scheduled; SMT; 2 CPU cores/chip

8/4/8 6 int. 2 FP

1.9 200 M 300 mm2 (est.)

80W (est.)

Intel Itanium 2

Statically scheduled VLIW-style

6/5/11 9 int. 2 FP

1.6 592 M 423 mm2

130 W

Head to Head ILP competition

Page 12: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 12

Performance on SPECint2000

0

500

1000

1500

2000

2500

3000

3500

gzip vpr gcc mcf crafty parser eon perlbmk gap vortex bzip2 twolf

SP

EC

Rat

io

Itanium 2 Pentium 4 AMD Athlon 64 Power 5

Page 13: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 13

Performance on SPECfp2000

0

2000

4000

6000

8000

10000

12000

14000

w upw ise sw im mgrid applu mesa galgel art equake facerec ammp lucas fma3d sixtrack apsi

SP

EC

Ra

tio

Itanium 2 Pentium 4 AMD Athlon 64 Power 5

Page 14: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 14

Normalized Performance: Efficiency

0

5

10

15

20

25

30

35

SPECInt / MTransistors

SPECFP / MTransistors

SPECInt /mm^2

SPECFP /mm^2

SPECInt /Watt

SPECFP /Watt

I tanium 2 Pentium 4 AMD Athlon 64 POWER 5

Rank

Itanium2

PentIum4

Athlon

Power5

Int/Trans 4 2 1 3

FP/Trans 4 2 1 3

Int/area 4 2 1 3

FP/area 4 2 1 3

Int/Watt 4 3 1 2

FP/Watt 2 4 3 1

Page 15: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 15

No Silver Bullet for ILP • No obvious over all leader in performance

• The AMD Athlon leads on SPECInt performance followed by the Pentium 4, Itanium 2, and Power5

• Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP

• Itanium 2 is the most inefficient processor both for Fl. Pt. and integer code for all but one efficiency measure (SPECFP/Watt)

• Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency,

• IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT

Page 16: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 16

Limits to ILP• Doubling issue rates above today’s 3-6

instructions per clock, say to 6 to 12 instructions, probably requires a processor to

– issue 3 or 4 data memory accesses per cycle,

– resolve 2 or 3 branches per cycle,

– rename and access more than 20 registers per cycle, and

– fetch 12 to 24 instructions per cycle.

• The complexities of implementing these capabilities is likely to mean sacrifices in the maximum clock rate

– E.g, widest issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

Page 17: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 17

Limits to ILP• Most techniques for increasing performance increase

power consumption • The key question is whether a technique is energy

efficient: does it increase power consumption faster than it increases performance?

• Multiple issue processors techniques all are energy inefficient:1. Issuing multiple instructions incurs some overhead

in logic that grows faster than the issue rate grows2. Growing gap between peak issue rates and sustained

performance• Number of transistors switching = f(peak issue rate),

and performance = f( sustained rate), growing gap between peak and sustained performance increasing energy per unit of performance

Page 18: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 18

Administrivia

• Exam: Wednesday 3/14Location: TBATIME: 5:30 - 8:30

• This info is on the Lecture page (has been)

• Meet at LaVal’s afterwards for Pizza and Beverages

• CS252 Project proposal due by Monday 3/5– Need two people/project (although can justify three for right

project)

– Complete Research project in 8 weeks

» Typically investigate hypothesis by building an artifact and measuring it against a “base case”

» Generate conference-length paper/give oral presentation

» Often, can lead to an actual publication.

Page 19: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 19

Supercomputers

Definition of a supercomputer:

• Fastest machine in world at given task

• A device to turn a compute-bound problem into an I/O bound problem

• Any machine costing $30M+

• Any machine designed by Seymour Cray

CDC6600 (Cray, 1964) regarded as first supercomputer

Page 20: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 20

Supercomputer Applications

Typical application areas• Military research (nuclear weapons, cryptography)• Scientific research• Weather forecasting• Oil exploration• Industrial design (car crash simulation)

All involve huge computations on large data sets

In 70s-80s, Supercomputer Vector Machine

Page 21: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 21

Vector Supercomputers

Epitomized by Cray-1, 1976:

Scalar Unit + Vector Extensions• Load/Store Architecture

• Vector Registers

• Vector Instructions

• Hardwired Control

• Highly Pipelined Functional Units

• Interleaved Memory System

• No Data Caches

• No Virtual Memory

Page 22: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 22

Cray-1 (1976)

Page 23: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 23

Cray-1 (1976)

Single PortMemory

16 banks of 64-bit words

+ 8-bit SECDED

80MW/sec data load/store

320MW/sec instructionbuffer refill

4 Instruction Buffers

64-bitx16 NIP

LIP

CIP

(A0)

( (Ah) + j k m )

64T Regs

(A0)

( (Ah) + j k m )

64 B Regs

S0S1S2S3S4S5S6S7

A0A1A2A3A4A5A6A7

Si

Tjk

Ai

Bjk

FP Add

FP Mul

FP Recip

Int Add

Int Logic

Int Shift

Pop Cnt

Sj

Si

Sk

Addr Add

Addr Mul

Aj

Ai

Ak

memory bank cycle 50 ns processor cycle 12.5 ns (80MHz)

V0V1V2V3V4V5V6V7

Vk

Vj

Vi V. Mask

V. Length64 Element Vector Registers

Page 24: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 24

+ + + + + +

[0] [1] [VLR-1]

Vector Arithmetic Instructions

ADDV v3, v1, v2 v3

v2v1

Scalar Registers

r0

r15Vector Registers

v0

v15

[0] [1] [2] [VLRMAX-1]

VLRVector Length Register

v1Vector Load and

Store Instructions

LV v1, r1, r2

Base, r1 Stride, r2Memory

Vector Register

Vector Programming Model

Page 25: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 25

Vector Code Example

# Scalar Code

LI R4, 64

loop:

L.D F0, 0(R1)

L.D F2, 0(R2)

ADD.D F4, F2, F0

S.D F4, 0(R3)

DADDIU R1, 8

DADDIU R2, 8

DADDIU R3, 8

DSUBIU R4, 1

BNEZ R4, loop

# Vector Code

LI VLR, 64

LV V1, R1

LV V2, R2

ADDV.D V3, V1, V2

SV V3, R3

# C code

for (i=0; i<64; i++)

C[i] = A[i] + B[i];

Page 26: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 26

Vector Instruction Set Advantages

• Compact– one short instruction encodes N operations

• Expressive, tells hardware that these N operations:– are independent

– use the same functional unit

– access disjoint registers

– access registers in the same pattern as previous instructions

– access a contiguous block of memory (unit-stride load/store)

– access memory in a known pattern (strided load/store)

• Scalable– can run same object code on more parallel pipelines or lanes

Page 27: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 27

V1

V2

V3

V3 <- v1 * v2

Six stage multiply pipeline

• Use deep pipeline (=> fast clock) to execute element operations

• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

Vector Arithmetic Execution

Page 28: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 28

0 1 2 3 4 5 6 7 8 9 A B C D E F

+

Base StrideVector Registers

Memory Banks

Address Generator

Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency• Bank busy time: Cycles between accesses to same bank

Vector memory Subsystem

Page 29: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 29

Vector Instruction ExecutionADDV C,A,B

C[1]

C[2]

C[0]

A[3] B[3]

A[4] B[4]

A[5] B[5]

A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]

A[16] B[16]

A[20] B[20]

A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]

A[17] B[17]

A[21] B[21]

A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]

A[18] B[18]

A[22] B[22]

A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]

A[19] B[19]

A[23] B[23]

A[27] B[27]

Execution using four pipelined

functional units

Page 30: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 30

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0, 4, 8, …

Elements 1, 5, 9, …

Elements 2, 6, 10, …

Elements 3, 7, 11, …

Page 31: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 31

T0 Vector Microprocessor (1995)

LaneVector register elements striped

over lanes

[0][8]

[16][24]

[1][9]

[17][25]

[2][10][18][26]

[3][11][19][27]

[4][12][20][28]

[5][13][21][29]

[6][14][22][30]

[7][15][23][31]

Page 32: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 32

Vector Memory-Memory versus Vector Register Machines

• Vector memory-memory instructions hold all vector operands in main memory

• The first vector machines, CDC Star-100 (‘73) and TI ASC (‘71), were memory-memory machines

• Cray-1 (’76) was first vector register machine

for (i=0; i<N; i++)

{

C[i] = A[i] + B[i];

D[i] = A[i] - B[i];

}

Example Source Code ADDV C, A, B

SUBV D, A, B

Vector Memory-Memory Code

LV V1, A

LV V2, B

ADDV V3, V1, V2

SV V3, C

SUBV V4, V1, V2

SV V4, D

Vector Register Code

Page 33: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 33

Vector Memory-Memory vs. Vector Register Machines

• Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?

– All operands must be read in and out of memory

• VMMAs make if difficult to overlap execution of multiple vector operations, why?

– Must check dependencies on memory addresses

• VMMAs incur greater startup latency– Scalar code was faster on CDC Star-100 for vectors < 100 elements

– For Cray-1, vector/scalar breakeven point was around 2 elements

Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures

(we ignore vector memory-memory from now on)

Page 34: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 34

Automatic Code Vectorizationfor (i=0; i < N; i++) C[i] = A[i] + B[i];

load

load

add

store

load

load

add

store

Iter. 1

Iter. 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing

requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter. 1

Iter. 2

Vectorized Code

Tim

e

Page 35: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 35

Vector StripminingProblem: Vector registers have finite lengthSolution: Break loops into pieces that fit into vector

registers, “Stripmining” ANDI R1, N, 63 # N mod 64 MTC1 VLR, R1 # Do remainderloop: LV V1, RA DSLL R2, R1, 3 # Multiply by 8 DADDU RA, RA, R2 # Bump pointer LV V2, RB DADDU RB, RB, R2 ADDV.D V3, V1, V2 SV V3, RC DADDU RC, RC, R2 DSUBU N, N, R1 # Subtract elements LI R1, 64 MTC1 VLR, R1 # Reset full length BGTZ N, loop # Any more to do?

for (i=0; i<N; i++) C[i] = A[i]+B[i];

+

+

+

A B C

64 elements

Remainder

Page 36: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 36

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

– example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Instruction issue

Complete 24 operations/cycle while issuing 1 short instruction/cycle

Page 37: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 37

Vector Chaining• Vector version of register

bypassing– introduced with Cray-1

Memory

V1

Load Unit

Mult.

V2

V3

Chain

Add

V4

V5

Chain

LV v1

MULV v3,v1,v2

ADDV v5, v3, v4

Page 38: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 38

Vector Chaining Advantage

• With chaining, can start dependent instruction as soon as first result appears

Load

Mul

Add

Load

Mul

AddTime

• Without chaining, must wait for last element of result to be written before starting dependent instruction

Page 39: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 39

Vector StartupTwo components of vector startup penalty

– functional unit latency (time through pipeline)

– dead time or recovery time (time before another vector instruction can start down pipeline)

R X X X W

R X X X W

R X X X W

R X X X W

R X X X W

R X X X W

R X X X W

R X X X W

R X X X W

R X X X W

Functional Unit Latency

Dead Time

First Vector Instruction

Second Vector Instruction

Dead Time

Page 40: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 40

Dead Time and Short Vectors

Cray C90, Two lanes

4 cycle dead time

Maximum efficiency 94% with 128 element vectors

4 cycles dead time T0, Eight lanes

No dead time

100% efficiency with 8 element vectors

No dead time

64 cycles active

Page 41: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 41

Vector Scatter/Gather

Want to vectorize loops with indirect accesses:for (i=0; i<N; i++)

A[i] = B[i] + C[D[i]]

Indexed load instruction (Gather)LV vD, rD # Load indices in D vector

LVI vC, rC, vD # Load indirect from rC base

LV vB, rB # Load B vector

ADDV.D vA, vB, vC # Do add

SV vA, rA # Store result

Page 42: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 42

Vector Scatter/Gather

Scatter example:

for (i=0; i<N; i++)

A[B[i]]++;

Is following a correct translation?LV vB, rB # Load indices in B vector

LVI vA, rA, vB # Gather initial A values

ADDV vA, vA, 1 # Increment

SVI vA, rA, vB # Scatter incremented values

Page 43: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 43

Vector Conditional ExecutionProblem: Want to vectorize loops with conditional

code:for (i=0; i<N; i++) if (A[i]>0) then A[i] = B[i];

Solution: Add vector mask (or flag) registers– vector version of predicate registers, 1 bit per element

…and maskable vector instructions– vector operation becomes NOP at elements where mask bit is clear

Code example:CVM # Turn on all elements

LV vA, rA # Load entire A vector

SGTVS.D vA, F0 # Set bits in mask register where A>0

LV vA, rB # Load B vector into A under mask

SV vA, rA # Store A back to memory under mask

Page 44: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 44

Masked Vector Instructions

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0

M[4]=1

M[5]=1

M[6]=0

M[2]=0

M[1]=1

M[0]=0

M[7]=1

Density-Time Implementation– scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]

A[4] B[4]

A[5] B[5]

A[6] B[6]

M[3]=0

M[4]=1

M[5]=1

M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Enable

A[7] B[7]M[7]=1

Simple Implementation– execute all N operations, turn off

result writeback according to mask

Page 45: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 45

Compress/Expand Operations• Compress packs non-masked elements from one

vector register contiguously at start of destination vector register

– population count of mask vector gives packed vector length

• Expand performs inverse operation

M[3]=0

M[4]=1

M[5]=1

M[6]=0

M[2]=0

M[1]=1

M[0]=0

M[7]=1

A[3]

A[4]

A[5]

A[6]

A[7]

A[0]

A[1]

A[2]

M[3]=0

M[4]=1

M[5]=1

M[6]=0

M[2]=0

M[1]=1

M[0]=0

M[7]=1

B[3]

A[4]

A[5]

B[6]

A[7]

B[0]

A[1]

B[2]

Expand

A[7]

A[1]

A[4]

A[5]

Compress

A[7]

A[1]

A[4]

A[5]

Used for density-time conditionals and also for general selection operations

Page 46: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 46

Vector Reductions

Problem: Loop-carried dependence on reduction variablessum = 0;

for (i=0; i<N; i++)

sum += A[i]; # Loop-carried dependence on sum

Solution: Re-associate operations if possible, use binary tree to perform reduction# Rearrange as:

sum[0:VL-1] = 0 # Vector of VL partial sums

for(i=0; i<N; i+=VL) # Stripmine VL-sized chunks

sum[0:VL-1] += A[i:i+VL-1]; # Vector sum

# Now have VL partial sums in one vector register

do {

VL = VL/2; # Halve vector length

sum[0:VL-1] += sum[VL:2*VL-1] # Halve no. of partials

} while (VL>1)

Page 47: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 47

Novel Matrix Multiply Solution• Consider the following:

/* Multiply a[m][k] * b[k][n] to get c[m][n] */for (i=1; i<m; i++) { for (j=1; j<n; j++) {

sum = 0; for (t=1; t<k; t++)

sum += a[i][t] * b[t][j]; c[i][j] = sum; }}

• Do you need to do a bunch of reductions? NO!– Calculate multiple independent sums within one vector register– You can vectorize the j loop to perform 32 dot-products at the same

time (Assume Max Vector Length is 32)

• Show it in C source code, but can imagine the assembly vector instructions from it

Page 48: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 48

Optimized Vector Example/* Multiply a[m][k] * b[k][n] to get c[m][n] */for (i=1; i<m; i++) { for (j=1; j<n; j+=32) {/* Step j 32 at a time. */

sum[0:31] = 0; /* Init vector reg to zeros. */ for (t=1; t<k; t++) { a_scalar = a[i][t]; /* Get scalar */ b_vector[0:31] = b[t][j:j+31]; /* Get vector */

/* Do a vector-scalar multiply. */prod[0:31] = b_vector[0:31]*a_scalar;

/* Vector-vector add into results. */ sum[0:31] += prod[0:31];

}/* Unit-stride store of vector of results. */

c[i][j:j+31] = sum[0:31];}

}

Page 49: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 49

Multimedia Extensions

• Very short vectors added to existing ISAs for micros

• Usually 64-bit registers split into 2x32b or 4x16b or 8x8b

• Newer designs have 128-bit registers (Altivec, SSE2)

• Limited instruction set:– no vector length control

– no strided load/store or scatter/gather

– unit-stride loads must be aligned to 64/128-bit boundary

• Limited vector register length:– requires superscalar dispatch to keep multiply/add/load units busy

– loop unrolling to hide latencies increases register pressure

• Trend towards fuller vector support in microprocessors

Page 50: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 50

“Vector” for Multimedia?

• Intel MMX: 57 additional 80x86 instructions (1st since 386)

– similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC

• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64bits– reuse 8 FP registers (FP and MMX cannot mix)

• short vector: load, add, store 8 8-bit operands

• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...

– use in drivers or added to library routines; no compiler

+

Page 51: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 51

MMX Instructions

• Move 32b, 64b

• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b– opt. signed/unsigned saturate (set to max) if overflow

• Shifts (sll,srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b

• Multiply, Multiply-Add in parallel: 4 16b

• Compare = , > in parallel: 8 8b, 4 16b, 2 32b– sets field to 0s (false) or 1s (true); removes branches

• Pack/Unpack– Convert 32b<–> 16b, 16b <–> 8b

– Pack saturates (set to max) if number is too large

Page 52: CS252 Graduate Computer Architecture Lecture 11 Vector Processing John Kubiatowicz Electrical Engineering and Computer Sciences University of California,

2/28/2007 cs252-S07, Lecture 11 52

Vector Summary

• Vector is alternative model for exploiting ILP• If code is vectorizable, then simpler hardware,

more energy efficient, and better real-time model than Out-of-order machines

• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations

• Fundamental design issue is memory bandwidth– With virtual address translation and caching

• Will multimedia popularity revive vector architectures?