
CS 61C: Great Ideas in Computer Architecture (Machine Structures)

SIMD II

Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11

Spring 2011 -- Lecture #14, 3/3/11

New-School Machine Structures (It's a bit more complicated!)

•  Parallel Requests: assigned to computer, e.g., search "Katz"

•  Parallel Threads: assigned to core, e.g., lookup, ads

•  Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions

•  Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words

•  Hardware descriptions: all gates @ one time

[Figure: software/hardware levels, from warehouse scale computer and smart phone, through computer (cores, memory (cache), input/output, main memory) and core (instruction unit(s), functional unit(s) computing A0+B0 through A3+B3), down to logic gates. Theme: harness parallelism & achieve high performance. One level is marked "Today's Lecture".]

Review

•  Flynn Taxonomy of Parallel Architectures
   –  SIMD: Single Instruction Multiple Data
   –  MIMD: Multiple Instruction Multiple Data
   –  SISD: Single Instruction Single Data
   –  MISD: Multiple Instruction Single Data (unused)

•  Intel SSE SIMD Instructions
   –  One instruction fetch that operates on multiple operands simultaneously
   –  128/64 bit XMM registers

•  SSE Instructions in C
   –  Embed the SSE machine instructions directly into C programs through use of intrinsics
   –  Achieve efficiency beyond that of optimizing compiler

Agenda

•  Amdahl's Law
•  Administrivia
•  SIMD and Loop Unrolling
•  Technology Break
•  Memory Performance for Caches
•  Review of 1st Half of 61C

Big Idea: Amdahl's (Heartbreaking) Law

•  Speedup due to enhancement E is

   Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)

•  Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected

   Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S ]

   Speedup w/ E = 1 / [ (1-F) + F/S ]

Big Idea: Amdahl's Law

   Speedup = 1 / [ (1-F) + F/S ]

   where (1-F) is the non-speed-up part and F/S is the speed-up part

Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?

   Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33

If the portion of the program that can be parallelized is small, then the speedup is limited

The non-parallel portion limits the performance
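The formula is easy to evaluate directly. The following is a minimal C sketch; the helper name `amdahl_speedup` is hypothetical (it does not come from the lecture), `f` is the accelerated fraction F, and `s` is the factor S.

```c
#include <assert.h>

/* Overall speedup when a fraction f (0 <= f <= 1) of the original
 * execution time is accelerated by a factor s (s > 0).
 * amdahl_speedup is a hypothetical helper name, not from the lecture. */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

Calling `amdahl_speedup(0.5, 2.0)` evaluates the example above, 1/(0.5 + 0.25), which is about 1.33. Letting `s` grow without bound shows the limit 1/(1-F): with F = 0.5 the overall speedup never reaches 2.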

Example #1: Amdahl's Law

•  Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
   Speedup w/ E = 1/(.75 + .25/20) = 1.31

•  What if it's usable only 15% of the time?
   Speedup w/ E = 1/(.85 + .15/20) = 1.17

•  Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!

•  To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
   Speedup w/ E = 1/(.001 + .999/100) = 90.99

   Speedup w/ E = 1 / [ (1-F) + F/S ]

Parallel Speed-up Example

   Z0 + Z1 + … + Z10   (non-parallel part)

   [X1,1 … X1,10; … ; X10,1 … X10,10] + [Y1,1 … Y1,10; … ; Y10,1 … Y10,10]   (parallel part: partition 10 ways and perform on 10 parallel processing units)

•  10 "scalar" operations (non-parallelizable)
•  100 parallelizable operations
•  110 operations: 100/110 = .909 parallelizable, 10/110 = .091 scalar


•  Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors
   Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5

•  What if there are 100 processors?
   Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0

•  What if the matrices are 33 by 33 (or 1099 adds in total) on 10 processors? (increase parallel data by about 10x)
   Speedup w/ E = 1/(.009 + .991/10) = 1/0.108 = 9.2

•  What if there are 100 processors?
   Speedup w/ E = 1/(.009 + .991/100) = 1/0.019 = 52.6

   Speedup w/ E = 1 / [ (1-F) + F/S ]
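The same numbers can be reproduced from operation counts rather than fractions. This is a small sketch in C, assuming every add takes one time unit and the parallel adds divide evenly across processors; the function name `speedup_from_counts` is illustrative.

```c
/* Speedup for a job with `scalar_ops` serial operations and `parallel_ops`
 * operations split evenly across `procs` processors, assuming every
 * operation takes one time unit (an illustrative simplification). */
double speedup_from_counts(long scalar_ops, long parallel_ops, int procs)
{
    double t_before = (double)(scalar_ops + parallel_ops);
    double t_after  = (double)scalar_ops + (double)parallel_ops / procs;
    return t_before / t_after;
}
```

With 10 scalar adds and 100 matrix adds this gives 5.5 on 10 processors and 10.0 on 100. With 1089 matrix adds (33 by 33) it gives about 9.2 and 52.6, in line with the rounded figures above.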

Strong and Weak Scaling

•  Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
   –  Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem (e.g., 10x10 matrix on 10 processors to 100)
   –  Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors (e.g., 10x10 matrix on 10 processors => 33x33 matrix on 100)

•  Load balancing is another important factor: every processor doing same amount of work
   –  Just 1 unit with twice the load of others cuts speedup almost in half
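The load-balancing claim can be checked with a simple model, sketched here in C under the assumption that the work is perfectly parallel and the job finishes when the most heavily loaded processor finishes. The function name `speedup_with_loads` is hypothetical.

```c
#include <stddef.h>

/* Speedup over one processor when processor i is assigned load[i] units of
 * perfectly parallel work: total work divided by the largest single load,
 * since the job finishes only when the most heavily loaded unit does.
 * Assumes n >= 1 and at least one positive load. */
double speedup_with_loads(const double *load, size_t n)
{
    double total = 0.0, max = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += load[i];
        if (load[i] > max)
            max = load[i];
    }
    return total / max;
}
```

Ten equal loads give a speedup of 10. Doubling just one of them gives 11/2 = 5.5, which is close to half.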

Peer Review

•  Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?

   Speedup w/ E = 1 / [ (1-F) + F/S ]

   A red) 4
   B orange) 5
   C green) 10
   D) 20
   E pink) None of the above
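Questions of this kind can be answered by solving the formula for S. The sketch below does that in C; the name `required_factor`, the -1.0 sentinel, and the tolerance are illustrative choices rather than anything from the lecture.

```c
/* Factor s by which the fraction f must be accelerated so that the whole
 * program runs `target` times faster, from target = 1 / ((1-f) + f/s).
 * Returns -1.0 when no finite s suffices, i.e. when target >= 1/(1-f).
 * required_factor and the tolerance EPS are illustrative choices. */
double required_factor(double f, double target)
{
    const double EPS = 1e-12;                  /* absorbs rounding error */
    double budget = 1.0 / target - (1.0 - f);  /* time left for the fast part */
    if (budget <= EPS)
        return -1.0;
    return f / budget;
}
```

For the question above, F = 0.8 caps the speedup at 1/(1 - 0.8) = 5, a value reached only in the limit of an infinitely fast square root, so `required_factor(0.8, 5.0)` returns -1.0. A target of 4 would need a routine 16 times faster.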

Administrivia

•  Lab #7 posted
•  No homework, no project this week!
•  TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB
•  Midterm Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle
   –  Split: A-Lew in 145, Li-Z in 155
   –  Small number of special consideration cases, due to class conflicts, etc.: contact Dave or Randy
•  No discussion during exam week; no lecture that day
•  Sent (anonymous) 61C midway survey before Midterm: please fill out! (Only 1/3 so far; have your voice heard!)
•  https://www.surveymonkey.com/s/QS3ZLW7

61C in the News

Hewlett-Packard researchers have proposed a fundamental rethinking of the modern computer for the coming era of nanoelectronics, a marriage of memory and computing power that could drastically limit the energy used by computers. Today the microprocessor is in the center of the computing universe, and information is moved, at heavy energy cost, first to be used in computation and then stored. The new approach would be to marry processing to memory to cut down transportation of data and reduce energy use. The semiconductor industry has long warned about a set of impending bottlenecks described as "the wall," a point in time where more than five decades of progress in continuously shrinking the size of transistors used in computation will end. … systems will be based on memory chips he calls "nanostores" as distinct from today's microprocessors. They will be hybrids, three-dimensional systems in which lower-level circuits will be based on a nanoelectronic technology called the memristor, which Hewlett-Packard is developing to store data. The nanostore chips will have a multistory design, and computing circuits made with conventional silicon will sit directly on top of the memory to process the data, with minimal energy costs.

"Remapping Computer Circuitry to Avert Impending Bottlenecks," John Markoff, NY Times, Feb 28, 2011

Getting to Know Profs

•  Ride with sons in MS Charity Bike Ride every September since 2002
•  "Waves to Wine"
•  150 miles over 2 days from SF to Sonoma
•  Team: "Berkeley Anti-MS Crew"
•  If you want to join the team, let me know
•  Always a Top 10 fundraising team despite small size
•  I was top fundraiser 2006, 2007, 2008, 2009, 2010 due to computing
   –  Can offer fundraising advice: order of sending, when to send during week, who to send to, …


Data Level Parallelism and SIMD

•  SIMD wants adjacent values in memory that can be operated on in parallel

•  Usually specified in programs as loops

   for(i=1000; i>0; i=i-1)
       x[i] = x[i] + s;

•  How can we reveal more data level parallelism than is available in a single iteration of a loop?

•  Unroll loop and adjust iteration rate

Looping in MIPS

Assumptions:

-  R1 is initially the address of the element in the array with the highest address
-  F2 contains the scalar value s
-  8(R2) is the address of the last element to operate on

CODE:

Loop: l.d   F0,0(R1)      ; F0 = array element
      add.d F4,F0,F2      ; add s to F0
      s.d   F4,0(R1)      ; store result
      addiu R1,R1,#-8     ; decrement pointer by 8 bytes
      bne   R1,R2,Loop    ; repeat loop if R1 != R2

Loop Unrolled

Loop: l.d   F0,0(R1)
      add.d F4,F0,F2
      s.d   F4,0(R1)
      l.d   F6,-8(R1)
      add.d F8,F6,F2
      s.d   F8,-8(R1)
      l.d   F10,-16(R1)
      add.d F12,F10,F2
      s.d   F12,-16(R1)
      l.d   F14,-24(R1)
      add.d F16,F14,F2
      s.d   F16,-24(R1)
      addiu R1,R1,#-32
      bne   R1,R2,Loop

NOTE:
1.  Different registers eliminate stalls
2.  Only 1 loop overhead every 4 iterations
3.  This unrolling works if loop_limit (mod 4) = 0

Loop Unrolled Scheduled

Loop: l.d   F0,0(R1)
      l.d   F6,-8(R1)
      l.d   F10,-16(R1)
      l.d   F14,-24(R1)
      add.d F4,F0,F2
      add.d F8,F6,F2
      add.d F12,F10,F2
      add.d F16,F14,F2
      s.d   F4,0(R1)
      s.d   F8,-8(R1)
      s.d   F12,-16(R1)
      s.d   F16,-24(R1)
      addiu R1,R1,#-32
      bne   R1,R2,Loop

4 loads side-by-side: could replace with 4-wide SIMD load

4 adds side-by-side: could replace with 4-wide SIMD add

4 stores side-by-side: could replace with 4-wide SIMD store
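The replacement of four scalar loads, adds, and stores by one SIMD operation each can be sketched in C with SSE intrinsics. This sketch makes several assumptions that differ from the MIPS code: it uses single-precision `float` (four values fit in one 128-bit XMM register, whereas `l.d` moves doubles), it walks the array upward, and it requires the element count to be a multiple of 4. The function name `add_scalar_simd` is hypothetical, and a scalar fallback is compiled when SSE is not available.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Adds s to every element of x[0..n-1]. Assumes n is a multiple of 4,
 * mirroring the loop_limit (mod 4) = 0 condition of the unrolled loop. */
void add_scalar_simd(float *x, size_t n, float s)
{
#if defined(__SSE__)
    __m128 vs = _mm_set1_ps(s);              /* s in all four lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&x[i]);      /* one 4-wide load  */
        v = _mm_add_ps(v, vs);               /* one 4-wide add   */
        _mm_storeu_ps(&x[i], v);             /* one 4-wide store */
    }
#else
    for (size_t i = 0; i < n; i += 4) {      /* scalar fallback, unrolled by 4 */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
#endif
}
```

The unaligned load and store forms are used so the sketch does not depend on how the caller allocated the array.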

Loop Unrolling in C

•  Instead of the compiler doing loop unrolling, you could do it yourself in C

   for(i=1000; i>0; i=i-1)
       x[i] = x[i] + s;

•  Could be rewritten

   for(i=1000; i>0; i=i-4) {
       x[i]   = x[i]   + s;
       x[i-1] = x[i-1] + s;
       x[i-2] = x[i-2] + s;
       x[i-3] = x[i-3] + s;
   }

•  What is the downside of doing it in C?

Generalizing Loop Unrolling

•  A loop of n iterations
•  k copies of the body of the loop

Then we will run the loop with 1 copy of the body n (mod k) times and with k copies of the body floor(n/k) times

•  (Will revisit loop unrolling again when we get to pipelining later in the semester)
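The general rule can be written out for k = 4 as a short C sketch. The function name `add_scalar_unrolled` is illustrative, and the array is indexed from 0 upward rather than from 1000 downward as in the earlier loop.

```c
#include <stddef.h>

/* x[i] += s for i in [0, n), unrolled k = 4 ways.
 * The first loop runs the single-copy body n mod 4 times; the second runs
 * the four-copy body floor(n/4) times, so any n is handled. */
void add_scalar_unrolled(double *x, size_t n, double s)
{
    size_t i = 0;
    size_t leftover = n % 4;

    for (; i < leftover; i++)        /* 1 copy of the body, n mod k times */
        x[i] += s;

    for (; i < n; i += 4) {          /* k copies of the body, floor(n/k) times */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
}
```

For n = 10 the first loop handles 2 elements and the second runs twice, covering the remaining 8, so the restriction loop_limit (mod 4) = 0 is no longer needed.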


Reading Miss Penalty: Memory Systems that Support Caches

•  The off-chip interconnect and memory architecture affect overall system performance in dramatic ways

[Figure: on-chip CPU and cache connected by a bus (32-bit data & 32-bit addr per cycle) to DRAM memory]

One word wide organization (one word wide bus and one word wide memory)

Assume
•  1 memory bus clock cycle to send address
•  15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for 2nd, 3rd, 4th words (subsequent column access time); note effect of latency!
•  1 memory bus clock cycle to return a word of data

Memory-Bus to Cache bandwidth
•  Number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle

(DDR) SDRAM Operation

•  After a row is read into the SRAM register
   –  Input CAS as the starting "burst" address along with a burst length
   –  Transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
   –  Memory bus clock controls transfer of successive words in the burst

[Figure: DRAM array of N rows x N cols with M bit planes; the row address selects a row into an N x M SRAM register, and the column address (incremented by +1 during a burst) selects the M-bit output. Timing diagram: RAS with row address, then CAS with column address, followed by the 1st, 2nd, 3rd, and 4th M-bit accesses within one cycle time.]

One Word Wide Bus, One Word Blocks

•  If block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory
       1   memory bus clock cycle to send address
      15   memory bus clock cycles to read DRAM
       1   memory bus clock cycle to return data
      17   total clock cycles miss penalty

•  Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/17 = 0.235 bytes per memory bus clock cycle

[Figure: on-chip CPU and cache connected by a bus to DRAM memory]

One Word Wide Bus, Four Word Blocks

•  What if the block size is four words and each word is in a different DRAM row?
       1            cycle to send 1st address
       4 x 15 = 60  cycles to read DRAM
       1            cycle to return last data word
      62            total clock cycles miss penalty

•  Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/62 = 0.258 bytes per clock

[Figure: on-chip CPU and cache connected by a bus to DRAM memory; timing shows four back-to-back 15-cycle DRAM reads]


•  What if the block size is four words and all words are in the same DRAM row?
       1              cycle to send 1st address
      15 + 3*5 = 30   cycles to read DRAM
       1              cycle to return last data word
      32              total clock cycles miss penalty

•  Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/32 = 0.5 bytes per clock

[Figure: on-chip CPU and cache connected by a bus to DRAM memory; timing shows one 15-cycle read followed by three 5-cycle reads]
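The three one-word-wide-bus tallies follow one pattern, sketched here in C. The cycle counts are the lecture's stated assumptions. The function names and the choice to count only the final word's transfer separately (treating earlier transfers as overlapped with the next DRAM read) are assumptions made to match the way the slides add up the cycles.

```c
/* Timing assumptions from the lecture, in memory bus clock cycles. */
enum {
    ADDR_CYCLES     = 1,   /* send the address                          */
    ROW_CYCLES      = 15,  /* first word of a DRAM row (row cycle time) */
    COL_CYCLES      = 5,   /* each later word in the same row           */
    TRANSFER_CYCLES = 1,   /* return one word over the bus              */
    BYTES_PER_WORD  = 4
};

/* Miss penalty for a block of `words` words on a one-word-wide bus.
 * same_row != 0: words after the first use the faster column access.
 * As in the lecture's tally, only the final word's transfer is counted
 * separately; earlier transfers are assumed to overlap the next DRAM read. */
int miss_penalty(int words, int same_row)
{
    int later = same_row ? COL_CYCLES : ROW_CYCLES;
    int dram  = ROW_CYCLES + (words - 1) * later;
    return ADDR_CYCLES + dram + TRANSFER_CYCLES;
}

/* Bytes delivered per memory bus clock cycle for a single miss. */
double miss_bandwidth(int words, int penalty_cycles)
{
    return (double)(words * BYTES_PER_WORD) / penalty_cycles;
}
```

`miss_penalty(1, 0)` is 17 cycles, `miss_penalty(4, 0)` is 62, and `miss_penalty(4, 1)` is 32, giving bandwidths of about 0.235, 0.258, and 0.5 bytes per clock.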

Interleaved Memory, One Word Wide Bus

•  For a block size of four words
       1         cycle to send 1st address
      15         cycles to read DRAM banks
       4*1 = 4   cycles to return last data word
      20         total clock cycles miss penalty

•  Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/20 = 0.8 bytes per clock

[Figure: on-chip CPU and cache connected by a one-word-wide bus to DRAM memory banks 0, 1, 2, and 3; timing shows the four 15-cycle bank reads overlapping]
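The interleaved case changes the tally because the bank reads overlap and only the bus transfers are serialized. A minimal C sketch, assuming one word of the block per bank and using the lecture's cycle counts; the function name `interleaved_miss_penalty` is illustrative.

```c
/* Miss penalty with `banks` interleaved DRAM banks on a one-word-wide bus,
 * assuming the block has exactly one word per bank so that all banks are
 * read in parallel and the words then return one per bus cycle.
 * Cycle counts are the lecture's; the function name is illustrative. */
int interleaved_miss_penalty(int banks)
{
    const int addr_cycles = 1;      /* send the address             */
    const int read_cycles = 15;     /* all banks read concurrently  */
    const int xfer_cycles = 1;      /* per word returned on the bus */
    return addr_cycles + read_cycles + banks * xfer_cycles;
}
```

For four banks this gives 1 + 15 + 4 = 20 cycles, so a 16-byte block is delivered at 16/20 = 0.8 bytes per clock.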

DRAM Memory System Observations

•  It's important to match the cache characteristics
   –  Caches access one block at a time (usually more than one word)

1) With the DRAM characteristics
   –  Use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache

2) With the memory-bus characteristics
   –  Make sure the memory-bus can support the DRAM access rates and patterns
   –  With the goal of increasing the Memory-Bus to Cache bandwidth


New-School Machine Structures (It's a bit more complicated!)

•  Parallel Requests: assigned to computer, e.g., search "Katz"

•  Parallel Threads: assigned to core, e.g., lookup, ads

•  Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions

•  Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words

•  Hardware descriptions: all gates functioning in parallel at same time

[Figure: software/hardware levels from warehouse scale computer and smart phone down to logic gates, annotated with Project 1, Project 2, Project 3, and Project 4]

6 Great Ideas in Computer Architecture

1.  Layers of Representation/Interpretation
2.  Moore's Law
3.  Principle of Locality/Memory Hierarchy
4.  Parallelism
5.  Performance Measurement & Improvement
6.  Dependability via Redundancy

Great Idea #1: Levels of Representation/Interpretation

High Level Language Program (e.g., C)

   temp = v[k];
   v[k] = v[k+1];
   v[k+1] = temp;

   | Compiler
   v

Assembly Language Program (e.g., MIPS)

   lw $t0, 0($2)
   lw $t1, 4($2)
   sw $t1, 0($2)
   sw $t0, 4($2)

   | Assembler
   v

Machine Language Program (MIPS)

   0000 1001 1100 0110 1010 1111 0101 1000
   1010 1111 0101 1000 0000 1001 1100 0110
   1100 0110 1010 1111 0101 1000 0000 1001
   0101 1000 0000 1001 1100 0110 1010 1111

   | Machine Interpretation
   v

Hardware Architecture Description (e.g., block diagrams)

   | Architecture Implementation
   v

Logic Circuit Description (Circuit Schematic Diagrams)

Anything can be represented as a number, i.e., data or instructions

[Figure annotation: "First half 61C"]

Great Idea #2: Moore's Law

•  Predicts: 2X transistors / chip every 2 years
•  Gordon Moore, Intel cofounder, B.S. Cal 1950!

[Figure: # of transistors on an integrated circuit (IC) vs. year]

Great Idea #3: Principle of Locality/Memory Hierarchy

[Figure annotation: "First half 61C"]

Great Idea #4: Parallelism

•  Data Level Parallelism in 1st half 61C
   –  Lots of data in memory that can be operated on in parallel (e.g., adding together 2 arrays)
   –  Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)

•  1st project: DLP across 10s of servers and disks using MapReduce

•  Next week's lab, 3rd project: DLP in memory

Summary

•  Amdahl's Cruel Law: Law of Diminishing Returns
•  Loop Unrolling to Expose Parallelism
•  Optimize Miss Penalty via Memory system
•  As the field changes, cs61c has to change too!
•  Still about the software-hardware interface
   –  Programming for performance via measurement!
   –  Understanding the memory hierarchy and its impact on application performance
   –  Unlocking the capabilities of the architecture for performance: SIMD
