CS 61C: Great Ideas in Computer Architecture (Machine Structures)
SIMD II
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11
Spring 2011, Lecture #14, 3/3/11
New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to a computer, e.g., search "Katz"
• Parallel Threads: assigned to a core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
[Figure: "Harness Parallelism & Achieve High Performance" hardware/software stack, from warehouse-scale computers and smart phones down to a computer's cores, memory (cache), main memory, input/output, instruction unit(s) and functional unit(s) computing A0+B0 … A3+B3, and logic gates; today's lecture is about parallel data.]
Review
• Flynn Taxonomy of Parallel Architectures
– SIMD: Single Instruction Multiple Data
– MIMD: Multiple Instruction Multiple Data
– SISD: Single Instruction Single Data
– MISD: Multiple Instruction Single Data (unused)
• Intel SSE SIMD Instructions
– One instruction fetch that operates on multiple operands simultaneously
– 128/64-bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C programs through use of intrinsics
– Achieve efficiency beyond that of the optimizing compiler
Agenda
• Amdahl's Law
• Administrivia
• SIMD and Loop Unrolling
• Technology Break
• Memory Performance for Caches
• Review of 1st Half of 61C
Big Idea: Amdahl's (Heartbreaking) Law
• Speedup due to enhancement E is:
  Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:
  Execution Time w/ E = Execution Time w/o E x [(1 - F) + F/S]
  Speedup w/ E = 1 / [(1 - F) + F/S]
Big Idea: Amdahl's Law
Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speedup?

  Speedup = 1 / [(1 - F) + F/S]   (non-speed-up part plus speed-up part)
          = 1 / (0.5 + 0.5/2)
          = 1 / (0.5 + 0.25)
          = 1 / 0.75 = 1.33
Big Idea: Amdahl's Law
If the portion of the program that can be parallelized is small, then the speedup is limited: the non-parallel portion limits the performance.
Example #1: Amdahl's Law
Speedup w/ E = 1 / [(1 - F) + F/S]
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time:
  Speedup w/ E = 1 / (0.75 + 0.25/20) = 1.31
• What if it's usable only 15% of the time?
  Speedup w/ E = 1 / (0.85 + 0.15/20) = 1.17
• Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less:
  Speedup w/ E = 1 / (0.001 + 0.999/100) = 90.99
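The closed-form speedup above is easy to check numerically. Below is a minimal C sketch (not from the slides; the function name amdahl_speedup is just for illustration) that evaluates Amdahl's formula for the cases on this slide:

  #include <stdio.h>

  /* Amdahl's Law: speedup = 1 / ((1 - F) + F/S)
     F = fraction of the task that is enhanced, S = speedup of that fraction. */
  double amdahl_speedup(double F, double S) {
      return 1.0 / ((1.0 - F) + F / S);
  }

  int main(void) {
      printf("25%% usable, 20x faster:      %.2f\n", amdahl_speedup(0.25, 20.0));   /* ~1.31  */
      printf("15%% usable, 20x faster:      %.2f\n", amdahl_speedup(0.15, 20.0));   /* ~1.17  */
      printf("99.9%% parallel, 100 procs:   %.2f\n", amdahl_speedup(0.999, 100.0)); /* ~90.99 */
      return 0;
  }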
Speedup w/ E = 1 / [(1 - F) + F/S]
Parallel Speed-up Example
• 10 "scalar" operations (non-parallelizable)
• 100 parallelizable operations
• 110 operations total: 100/110 = 0.909 parallelizable, 10/110 = 0.091 scalar
[Figure: the non-parallel part is the scalar sum Z1 + Z2 + … + Z10; the parallel part is the element-wise sum of two 10x10 matrices X and Y, which can be partitioned 10 ways and performed on 10 parallel processing units.]
Example #2: Amdahl's Law
Speedup w/ E = 1 / [(1 - F) + F/S]
• Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors:
  Speedup w/ E = 1 / (0.091 + 0.909/10) = 1 / 0.1819 = 5.5
• What if there are 100 processors?
  Speedup w/ E = 1 / (0.091 + 0.909/100) = 1 / 0.10009 = 10.0
• What if the matrices are 33 by 33 (1,099 adds in total, about 10x the parallel data) on 10 processors?
  Speedup w/ E = 1 / (0.009 + 0.991/10) = 1 / 0.108 = 9.2
• What if there are 100 processors?
  Speedup w/ E = 1 / (0.009 + 0.991/100) = 1 / 0.019 = 52.6
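Here the parallel fraction F comes straight from the operation counts. A short usage sketch (illustrative only, reusing the hypothetical amdahl_speedup helper sketched earlier):

  /* Derive F from operation counts, then apply Amdahl's Law.
     scalar_ops are serialized; parallel_ops can be spread over procs processors. */
  double matrix_sum_speedup(double scalar_ops, double parallel_ops, double procs) {
      double F = parallel_ops / (scalar_ops + parallel_ops);
      return amdahl_speedup(F, procs);   /* e.g., (10, 100, 10)  -> ~5.5 */
  }                                      /*       (10, 1089, 10) -> ~9.2 */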
Strong and Weak Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
– Strong scaling: speedup achieved on a parallel processor without increasing the size of the problem (e.g., 10x10 matrix going from 10 processors to 100)
– Weak scaling: speedup achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors (e.g., 10x10 matrix on 10 processors => 33x33 matrix on 100)
• Load balancing is another important factor: every processor should do the same amount of work.
– Just 1 unit with twice the load of the others cuts the speedup almost in half
Peer Review
• Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?
Speedup w/ E = 1 / [(1 - F) + F/S]
A (red): 4
B (orange): 5
C (green): 10
D: 20
E (pink): None of the above
Administrivia
• Lab #7 posted
• No homework, no project this week!
• TA Review: Sun, Mar 6, 2-5 PM, 2050 VLSB
• Midterm Exam: Tue, Mar 8, 6-9 PM, 145/155 Dwinelle
– Split: A-Lew in 145, Li-Z in 155
– Small number of special consideration cases, due to class conflicts, etc.; contact Dave or Randy
• No discussion during exam week; no lecture that day
• Sent the (anonymous) 61C midway survey before the midterm: please fill it out! (Only 1/3 responses so far; have your voice heard!)
• https://www.surveymonkey.com/s/QS3ZLW7
61C in the News
Hewlett-Packard researchers have proposed a fundamental rethinking of the modern computer for the coming era of nanoelectronics: a marriage of memory and computing power that could drastically limit the energy used by computers. Today the microprocessor is at the center of the computing universe, and information is moved, at heavy energy cost, first to be used in computation and then stored. The new approach would be to marry processing to memory to cut down transportation of data and reduce energy use. The semiconductor industry has long warned about a set of impending bottlenecks described as "the wall," a point in time where more than five decades of progress in continuously shrinking the size of transistors used in computation will end. … Systems will be based on memory chips he calls "nanostores," as distinct from today's microprocessors. They will be hybrids, three-dimensional systems in which lower-level circuits will be based on a nanoelectronic technology called the memristor, which Hewlett-Packard is developing to store data. The nanostore chips will have a multistory design, and computing circuits made with conventional silicon will sit directly on top of the memory to process the data, with minimal energy costs.
"Remapping Computer Circuitry to Avert Impending Bottlenecks," John Markoff, NY Times, Feb 28, 2011
Getting to Know Profs
• Ride with sons in the MS Charity Bike Ride every September since 2002
• "Waves to Wine": 150 miles over 2 days from SF to Sonoma
• Team: "Berkeley Anti-MS Crew"; if you want to join the team, let me know
• Always a Top 10 fundraising team despite its small size
• I was the top fundraiser in 2006, 2007, 2008, 2009, and 2010, thanks to computing
– Can offer fundraising advice: order of sending, when to send during the week, who to send to, …
Agenda
• Amdahl's Law
• Administrivia
• SIMD and Loop Unrolling
• Technology Break
• Memory Performance for Caches
• Review of 1st Half of 61C
Data Level Parallelism and SIMD
• SIMD wants adjacent values in memory that can be operated on in parallel
• Usually specified in programs as loops:
  for (i = 1000; i > 0; i = i - 1)
      x[i] = x[i] + s;
• How can we reveal more data-level parallelism than is available in a single iteration of a loop?
• Unroll the loop and adjust the iteration rate
Looping in MIPS
Assumptions:
- R1 is initially the address of the element in the array with the highest address
- F2 contains the scalar value s
- 8(R2) is the address of the last element to operate on
Code:
  Loop: l.d    F0, 0(R1)     ; F0 = array element
        add.d  F4, F0, F2    ; add s to F0
        s.d    F4, 0(R1)     ; store result
        addiu  R1, R1, #-8   ; decrement pointer by 8 bytes (one double)
        bne    R1, R2, Loop  ; repeat loop if R1 != R2
Loop Unrolled
  Loop: l.d    F0, 0(R1)
        add.d  F4, F0, F2
        s.d    F4, 0(R1)
        l.d    F6, -8(R1)
        add.d  F8, F6, F2
        s.d    F8, -8(R1)
        l.d    F10, -16(R1)
        add.d  F12, F10, F2
        s.d    F12, -16(R1)
        l.d    F14, -24(R1)
        add.d  F16, F14, F2
        s.d    F16, -24(R1)
        addiu  R1, R1, #-32
        bne    R1, R2, Loop
NOTE:
1. Using different registers eliminates stalls
2. Only 1 loop overhead every 4 iterations
3. This unrolling works if loop_limit mod 4 = 0
Loop Unrolled Scheduled
  Loop: l.d    F0, 0(R1)
        l.d    F6, -8(R1)
        l.d    F10, -16(R1)
        l.d    F14, -24(R1)    <- 4 loads side-by-side: could replace with a 4-wide SIMD load
        add.d  F4, F0, F2
        add.d  F8, F6, F2
        add.d  F12, F10, F2
        add.d  F16, F14, F2    <- 4 adds side-by-side: could replace with a 4-wide SIMD add
        s.d    F4, 0(R1)
        s.d    F8, -8(R1)
        s.d    F12, -16(R1)
        s.d    F16, -24(R1)    <- 4 stores side-by-side: could replace with a 4-wide SIMD store
        addiu  R1, R1, #-32
        bne    R1, R2, Loop
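The same pattern maps directly onto the SSE intrinsics from the previous lecture. A minimal C sketch (illustrative, not the slides' code) that processes two doubles per instruction using 128-bit XMM registers; with single-precision data or AVX it would be 4 or more elements at a time:

  #include <emmintrin.h>   /* SSE2 intrinsics */

  /* x[1..n] += s, two doubles at a time; assumes n is even
     (otherwise handle the leftover element with a scalar iteration). */
  void add_scalar_sse(double *x, int n, double s) {
      __m128d vs = _mm_set1_pd(s);                 /* {s, s}          */
      for (int i = 1; i + 1 <= n; i += 2) {
          __m128d v = _mm_loadu_pd(&x[i]);         /* load 2 doubles  */
          v = _mm_add_pd(v, vs);                   /* add s to both   */
          _mm_storeu_pd(&x[i], v);                 /* store 2 doubles */
      }
  }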
Loop Unrolling in C
• Instead of the compiler doing the loop unrolling, you could do it yourself in C:
  for (i = 1000; i > 0; i = i - 1)
      x[i] = x[i] + s;
• Could be rewritten as:
  for (i = 1000; i > 0; i = i - 4) {
      x[i]     = x[i]     + s;
      x[i - 1] = x[i - 1] + s;
      x[i - 2] = x[i - 2] + s;
      x[i - 3] = x[i - 3] + s;
  }
• What is the downside of doing it in C?
Generalizing Loop Unrolling
• Take a loop of n iterations and make k copies of the body. Then run the loop with 1 copy of the body n mod k times, and with k copies of the body floor(n/k) times (see the sketch below).
• (Will revisit loop unrolling again when we get to pipelining later in the semester)
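A minimal C sketch of that recipe (illustrative, with k = 4): a short cleanup loop handles the n mod k leftover iterations with one copy of the body, then the main loop runs the 4-copy body floor(n/4) times.

  /* x[1..n] += s, unrolled by k = 4, for any n >= 0. */
  void add_scalar_unrolled(double *x, int n, double s) {
      int i = n;
      while (i % 4 != 0) {          /* n mod k leftover iterations, one copy of the body */
          x[i] = x[i] + s;
          i = i - 1;
      }
      for (; i > 0; i = i - 4) {    /* floor(n/k) iterations, k copies of the body */
          x[i]     = x[i]     + s;
          x[i - 1] = x[i - 1] + s;
          x[i - 2] = x[i - 2] + s;
          x[i - 3] = x[i - 3] + s;
      }
  }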
Agenda
• Amdahl's Law
• Administrivia
• SIMD and Loop Unrolling
• Memory Performance for Caches
• Technology Break
• Review of 1st Half of 61C
Reading Miss Penalty: Memory Systems that Support Caches
• The off-chip interconnect and memory architecture affect overall system performance in dramatic ways
[Figure: on-chip CPU and cache connected over a bus (32-bit data and 32-bit address per cycle) to off-chip DRAM memory.]
One-word-wide organization (one-word-wide bus and one-word-wide memory). Assume:
• 1 memory bus clock cycle to send the address
• 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, and 4th words (subsequent column access time); note the effect of latency!
• 1 memory bus clock cycle to return a word of data
Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per memory bus clock cycle
(DDR) SDRAM Operation
• After a row is read into the N x M SRAM row register:
– Input CAS as the starting "burst" address along with a burst length
– The DRAM transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
– The memory bus clock controls the transfer of successive words in the burst
[Figure: DRAM organized as N rows x N columns with M bit planes; the row address (RAS) reads one row into the N x M SRAM register, then the column address (CAS) selects successive M-bit outputs: 1st, 2nd, 3rd, and 4th M-bit accesses within one row cycle time.]
One Word Wide Bus, One Word Blocks
• If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
  1 memory bus clock cycle to send the address
  15 memory bus clock cycles to read DRAM
  1 memory bus clock cycle to return the data
  17 total clock cycles miss penalty
• Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/17 = 0.235 bytes per memory bus clock cycle
One Word Wide Bus, Four Word Blocks
• What if the block size is four words and each word is in a different DRAM row?
  1 cycle to send the 1st address
  4 x 15 = 60 cycles to read DRAM
  1 cycle to return the last data word
  62 total clock cycles miss penalty
• Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/62 = 0.258 bytes per clock
One Word Wide Bus, Four Word Blocks
• What if the block size is four words and all words are in the same DRAM row?
  1 cycle to send the 1st address
  15 + 3 x 5 = 30 cycles to read DRAM
  1 cycle to return the last data word
  32 total clock cycles miss penalty
• Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/32 = 0.5 bytes per clock
Interleaved Memory, One Word Wide Bus
• For a block size of four words, with the words interleaved across four DRAM banks:
  1 cycle to send the 1st address
  15 cycles to read the DRAM banks (all four banks read in parallel)
  4 x 1 = 4 cycles to return the data words
  20 total clock cycles miss penalty
• Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/20 = 0.8 bytes per clock
[Figure: on-chip CPU and cache connected over a one-word-wide bus to four DRAM memory banks (bank 0 through bank 3).]
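The pattern in all four cases is the same arithmetic: address cycles + DRAM access cycles + data transfer cycles. A small C sketch (illustrative only; the helper name and parameters are not from the slides) that reproduces the numbers above:

  #include <stdio.h>

  /* Miss penalty in memory bus clock cycles: 1 cycle to send the address,
     dram_cycles to read DRAM, transfer_cycles to return the data over the bus. */
  int miss_penalty(int dram_cycles, int transfer_cycles) {
      return 1 + dram_cycles + transfer_cycles;
  }

  int main(void) {
      int p1 = miss_penalty(15, 1);          /* 1-word block                      -> 17 */
      int p2 = miss_penalty(4 * 15, 1);      /* 4-word block, 4 different rows    -> 62 */
      int p3 = miss_penalty(15 + 3 * 5, 1);  /* 4-word block, same row            -> 32 */
      int p4 = miss_penalty(15, 4);          /* 4-word block, 4 interleaved banks -> 20 */
      printf("miss penalties: %d %d %d %d\n", p1, p2, p3, p4);
      printf("bandwidths (bytes/cycle): %.3f %.3f %.3f %.3f\n",
             4.0 / p1, 16.0 / p2, 16.0 / p3, 16.0 / p4);
      return 0;
  }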
DRAM Memory System Observations
• It's important to match the cache characteristics (caches access one block at a time, usually more than one word):
1) With the DRAM characteristics
– Use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
2) With the memory-bus characteristics
– Make sure the memory bus can support the DRAM access rates and patterns
– With the goal of increasing the memory-bus-to-cache bandwidth
Agenda
• Amdahl's Law
• Administrivia
• SIMD and Loop Unrolling
• Memory Performance for Caches
• Technology Break
• Review of 1st Half of 61C
New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to a computer, e.g., search "Katz"
• Parallel Threads: assigned to a core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates functioning in parallel at the same time
[Figure: the same "Harness Parallelism & Achieve High Performance" hardware/software stack as before, now annotated with where Projects 1-4 fit, from warehouse-scale computers and smart phones down through cores, memory (cache), main memory, input/output, instruction and functional units (A0+B0 … A3+B3), and logic gates.]
6 Great Ideas in Computer Architecture
1. Layers of Representation/Interpretation
2. Moore's Law
3. Principle of Locality/Memory Hierarchy
4. Parallelism
5. Performance Measurement & Improvement
6. Dependability via Redundancy
Great Idea #1: Levels of Representation/Interpretation
High Level Language Program (e.g., C):
  temp = v[k]; v[k] = v[k+1]; v[k+1] = temp;
-> Compiler ->
Assembly Language Program (e.g., MIPS):
  lw $t0, 0($2)
  lw $t1, 4($2)
  sw $t1, 0($2)
  sw $t0, 4($2)
-> Assembler ->
Machine Language Program (MIPS):
  0000 1001 1100 0110 1010 1111 0101 1000
  1010 1111 0101 1000 0000 1001 1100 0110
  1100 0110 1010 1111 0101 1000 0000 1001
  0101 1000 0000 1001 1100 0110 1010 1111
-> Machine Interpretation ->
Hardware Architecture Description (e.g., block diagrams)
-> Architecture Implementation ->
Logic Circuit Description (circuit schematic diagrams)
Anything can be represented as a number, i.e., data or instructions.
#2: Moore's Law (first half of 61C)
Predicts: 2X transistors per chip every 2 years
Gordon Moore, Intel Cofounder, B.S. Cal 1950
[Figure: number of transistors on an integrated circuit (IC) versus year.]
Great Idea #3: Principle of Locality/Memory Hierarchy (first half of 61C)
Great Idea #4: Parallelism
• Data Level Parallelism in the 1st half of 61C
– Lots of data in memory that can be operated on in parallel (e.g., adding together 2 arrays)
– Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
• 1st project: DLP across 10s of servers and disks using MapReduce
• Next week's lab, 3rd project: DLP in memory
Summary
• Amdahl's Cruel Law: the law of diminishing returns
• Loop unrolling to expose parallelism
• Optimize miss penalty via the memory system
• As the field changes, CS61C has to change too!
• Still about the software-hardware interface
– Programming for performance via measurement!
– Understanding the memory hierarchy and its impact on application performance
– Unlocking the capabilities of the architecture for performance: SIMD