PPC 2012 - Intro, Syllabus & Prelims 1
CSCI-4320/6360: Parallel Programming & Computing
West Hall, Tues./Fri. 12-1:20 p.m.
Introduction, Syllabus & Prelims
Prof. Chris Carothers
Computer Science Department, MRC 309a
Office Hrs: Tuesdays, 1:30 – 3:30 p.m.
[email protected]
www.rpi.edu/~carotc/COURSES/PARALLEL/SPRING-2012
Let’s Look at the Syllabus…
• See the syllabus on the course webpage.
To Make A Fast Parallel Computer You Need a Faster Serial Computer…well sorta…
• Review of…
  – Instructions…
  – Instruction processing…
• Put it together… why the heck do we care about or need a parallel computer?
  – i.e., they are really cool pieces of technology, but can they really do anything useful besides computing Pi to a few billion more digits…
Processor Instruction Sets
• In general, a computer needs a few different kinds of instructions:
  – mathematical and logical operations
  – data movement (access memory)
  – jumping to new places in memory
    • if the right conditions hold
  – I/O (sometimes treated as data movement)
• All these instructions involve using registers to store data as close as possible to the CPU
  – e.g., $t0, $s0 in MIPS or %eax, %ebx in x86
a=(b+c)-(d+e);
add $t0, $s1, $s2   # t0 = b+c
add $t1, $s3, $s4   # t1 = d+e
sub $s0, $t0, $t1   # a = t0 - t1
(a, b, c, d, e live in $s0, $s1, $s2, $s3, $s4)
lw destreg, const(addrreg)    # “Load Word”
  destreg: name of register to put the value in
  const:   a number (the offset)
  addrreg: name of register to get the base address from
  effective address = (contents of addrreg) + const
Array Example: a=b+c[8];
lw  $t0, 8($s2)      # $t0 = c[8]
add $s0, $s1, $t0    # $s0 = $s1 + $t0
(yeah, this is not quite right … MIPS memory is byte-addressed, so c[8] really sits at offset 32)
(a, b in $s0, $s1; c’s base address in $s2)
sw srcreg, const(addrreg)    # “Store Word”
  srcreg:  name of register to get the value from
  const:   a number (the offset)
  addrreg: name of register to get the base address from
  effective address = (contents of addrreg) + const
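The lw/sw semantics above can be sketched with a toy Python model (dictionary registers and memory). Word addressing is simplified here, so this is illustrative rather than MIPS-accurate:

```python
# Toy model of MIPS base+offset addressing for lw/sw.
# Memory is flat-addressed for simplicity; real MIPS addresses
# bytes, so word offsets would be multiples of 4.

def lw(regs, mem, destreg, const, addrreg):
    """destreg = mem[regs[addrreg] + const]  ('load word')"""
    regs[destreg] = mem[regs[addrreg] + const]

def sw(regs, mem, srcreg, const, addrreg):
    """mem[regs[addrreg] + const] = regs[srcreg]  ('store word')"""
    mem[regs[addrreg] + const] = regs[srcreg]

regs = {"$s2": 100, "$t0": 0}
mem = {108: 42}                  # c[8] lives at base 100 + offset 8

lw(regs, mem, "$t0", 8, "$s2")   # $t0 = c[8]
sw(regs, mem, "$t0", 12, "$s2")  # store it one slot further on
```

Both instructions share the same effective-address computation; only the direction of the data movement differs.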
How are instructions processed?
• In the simple case…
  – Fetch the instruction from memory
  – Decode it (read the op code, and select registers based on what the op code says)
  – Execute the instruction
  – Write back any results to registers or memory
• Complex case…
  – Pipeline – overlap instruction processing…
  – Superscalar – multi-instruction issue per clock cycle…
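The simple-case steps can be sketched as a toy fetch-decode-execute loop; the mini-ISA and its encoding below are invented for illustration, not real MIPS:

```python
# Minimal fetch-decode-execute loop for a made-up register ISA.

program = [
    ("add", "t0", "s1", "s2"),   # t0 = s1 + s2
    ("sub", "s0", "t0", "s3"),   # s0 = t0 - s3
    ("halt",),
]

regs = {"s0": 0, "s1": 5, "s2": 7, "s3": 2, "t0": 0}
pc = 0
while True:
    instr = program[pc]          # fetch instruction from memory
    op = instr[0]                # decode: read the op code
    if op == "halt":
        break
    dest, a, b = instr[1:]       # decode: select registers
    if op == "add":              # execute
        result = regs[a] + regs[b]
    elif op == "sub":
        result = regs[a] - regs[b]
    regs[dest] = result          # write back the result
    pc += 1                      # advance to the next instruction
```

Pipelining (next slides) overlaps these four phases across several in-flight instructions instead of finishing one instruction before fetching the next.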
Simple (relative term) CPU Multicycle Datapath & Control
Simple (yeah right!) Instruction Processing FSM!
Pipeline Processing w/ Laundry
• While the first load is drying, put the second load in the washing machine.
• When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.
• NOTE: unrealistic scenario for CS students, as most only own 1 load of clothes…
[Figure: laundry timeline from 6 PM to 2 AM, loads A–D in task order – sequential vs. pipelined wash/dry/fold]
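The laundry arithmetic works out as follows; the 30/40/20-minute stage times are the classic textbook numbers, assumed here rather than taken from the slide:

```python
# Pipelining arithmetic for the laundry analogy.
WASH, DRY, FOLD = 30, 40, 20          # minutes per stage (assumed)
stages = [WASH, DRY, FOLD]

def sequential_time(n_loads):
    # finish each load completely before starting the next
    return n_loads * sum(stages)

def pipelined_time(n_loads):
    # the first load flows through every stage; each later load
    # finishes one slowest-stage time after the previous one
    return sum(stages) + (n_loads - 1) * max(stages)

print(sequential_time(4))   # 360 minutes
print(pipelined_time(4))    # 210 minutes
```

Note the pipelined rate is limited by the slowest stage (the dryer), which is exactly why balanced pipeline stages matter in hardware.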
Pipelined Datapath w/ signals
Pipelined Instruction.. But wait, we’ve got dependencies!
Pipeline w/ Forwarding Values
Where Forwarding Fails…must stall
How Stalls Are Inserted
What about those crazy branches?
Problem: if the branch is taken, the PC goes to addr 72, but we don’t know that until after 3 other instructions are processed
Dynamic Branch Prediction
• From the phrase “There is no such thing as a typical program”, it follows that programs will branch in different ways, so there is no “one size fits all” branch algorithm.
• Alt approach: keep a history (1 bit) on each branch instruction recording whether it was last taken or not.
• Implementation: branch prediction buffer or branch history table.
  – Indexed by the lower bits of the branch address
  – A single bit indicates whether the branch at that address was last taken or not (1 or 0)
  – But single-bit predictors tend to lack sufficient history…
Solution: 2-bit Branch Predictor
• Must be wrong twice before changing the prediction
• Learns whether the branch is biased towards “taken” or “not taken”
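A minimal sketch of the 2-bit saturating-counter scheme; the state encoding below is one common convention, not the only one:

```python
# 2-bit saturating-counter branch predictor sketch.
# States 0-1 predict "not taken", states 2-3 predict "taken";
# the prediction flips only after two consecutive mispredictions.

class TwoBitPredictor:
    def __init__(self, state=2):
        self.state = state            # start weakly "taken"

    def predict(self):
        return self.state >= 2        # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]   # a biased-taken branch
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)   # mispredicts only the single not-taken outcome
```

A 1-bit predictor on the same stream would mispredict twice (on the not-taken outcome and again on the following taken one), which is the extra history the slide is after.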
Even more performance…
• Ultimately we want greater and greater Instruction Level Parallelism (ILP)
• How? Multiple instruction issue.
  – Results in CPIs less than one.
  – Here, instructions are grouped into “issue slots”.
  – So, we usually talk about IPC (instructions per cycle).
  – Static: uses the compiler to assist with grouping instructions and hazard resolution. The compiler MUST remove ALL hazards.
  – Dynamic: (i.e., superscalar) hardware creates the instruction schedule based on dynamically detected hazards
Example Static 2-issue Datapath
Additions:
• 32 more bits from instr. mem
• Two read ports and 1 write port on the reg file
• 1 more ALU (the top one handles address calc)
Ex. 2-Issue Code Schedule
Loop: lw   $t0, 0($s1)        # t0 = array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # dec pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

Cycle   ALU/Branch              Data Xfer Inst.
1       Loop:                   lw  $t0, 0($s1)
2       addi $s1, $s1, -4
3       addu $t0, $t0, $s2
4       bne  $s1, $zero, Loop   sw  $t0, 4($s1)

It takes 4 clock cycles for 5 instructions, an IPC of 1.25
More Performance: Loop Unrolling
• Technique where multiple copies of the loop body are made.
• Makes more ILP available by removing dependencies.
• How? The compiler introduces additional registers via “register renaming”.
• This removes “name” or “anti” dependences
  – where an instruction ordering is purely a consequence of the reuse of a register and not a real data dependence.
  – No data values flow between one pair and the next pair.
  – Let’s assume we unroll a block of 4 iterations of the loop…
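A sketch of the unrolling idea in Python rather than MIPS; the temporaries t0..t3 stand in for the renamed registers:

```python
# Loop-unrolling sketch: the rolled loop adds a scalar to each
# array element; the unrolled version does 4 elements per trip,
# using distinct temporaries (t0..t3) the way a compiler uses
# register renaming to remove name dependences.

def rolled(a, s):
    for i in range(len(a)):
        a[i] = a[i] + s

def unrolled4(a, s):
    assert len(a) % 4 == 0            # assume a multiple of 4
    for i in range(0, len(a), 4):
        t0 = a[i]     + s             # four independent adds:
        t1 = a[i + 1] + s             # no value flows between
        t2 = a[i + 2] + s             # them, so hardware can
        t3 = a[i + 3] + s             # issue them in parallel
        a[i], a[i + 1], a[i + 2], a[i + 3] = t0, t1, t2, t3

x, y = [1, 2, 3, 4], [1, 2, 3, 4]
rolled(x, 10)
unrolled4(y, 10)
print(x == y)   # same result, more ILP exposed per trip
```

With a single temporary reused four times, the four adds would be serialized by a name dependence even though no data actually flows between them; renaming is what breaks that false ordering.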
Dynamic Scheduled Pipeline
Intel P4 Dynamic Pipeline – Looks like a cluster .. Just much much smaller…
Summary of Pipeline Technology
We’ve exhausted this!!
IPC just won’t go much higher… Why??
More Speed til it Hertz!
• So, if no more ILP is available, why not increase the clock frequency?
  – E.g., why don’t we have 100 GHz processors today?
• ANSWER: POWER & HEAT!!
  – With current CMOS technology, power needs a polynomial (and worse) increase for a linear increase in clock speed.
  – Power leads to heat, which will ultimately turn your CPU into a heap of melted silicon!
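The power claim can be made concrete with the standard dynamic-power relation P ≈ C·V²·f. The capacitance and voltage figures below are illustrative assumptions, not measurements of any real chip:

```python
# Why clock speed costs power: dynamic CMOS power is roughly
#   P = C * V^2 * f
# and the supply voltage V must itself rise roughly with f,
# so P grows close to cubically in clock rate.

def dynamic_power(cap, volts, freq_hz):
    return cap * volts**2 * freq_hz

C = 1e-9                                 # switched capacitance (F), assumed
p1 = dynamic_power(C, 1.0, 3e9)          # a 3 GHz part at 1.0 V
p2 = dynamic_power(C, 2.0, 6e9)          # 2x clock needing ~2x voltage
print(p2 / p1)                           # doubling f costs ~8x power
```

That cubic-ish scaling is the “polynomial++” on the slide: a 100 GHz part would need orders of magnitude more power than a 3 GHz one, hence the wall.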
CPU Power Consumption…
Typically, 100 watts is the limit..
Where do we go from here? (actually, we’ve arrived @ “here”!)
• Current Industry Trend: Multi-core CPUs
  – Typically lower clock rate (i.e., < 3 GHz)
  – 2, 4 and now 8 cores in a single “socket” package
  – Because smaller VLSI design processes (e.g., < 45 nm) can reduce power & heat…
• Potential for large, lucrative contracts in turning old dusty sequential codes into multi-core capable ones
  – Salesman: here’s your new $200 CPU, & oh, BTW, you’ll need this million-$ consulting contract to port your code to take advantage of those extra cores!
• Best business model since the mainframe!
  – More cores require greater and greater exploitation of available parallelism in an application, which gets harder and harder as you scale to more processors…
• Due to cost, companies will be forced to develop an in-house talent pool…
  – You could be that talent pool…
Examples: Multicore CPUs
• Brief listing of the recently released 45 nm processors, based on Intel’s site
  (Processor Model - Cache - Clock Speed - Front Side Bus)
• Desktop Dual Core:
  – E8500 - 6 MB L2 - 3.16 GHz - 1333 MHz
  – E8400 - 6 MB L2 - 3.00 GHz - 1333 MHz
  – E8300 - 6 MB L2 - 2.66 GHz - 1333 MHz
• Laptop Dual Core:
  – T9500 - 6 MB L2 - 2.60 GHz - 800 MHz
  – T9300 - 6 MB L2 - 2.50 GHz - 800 MHz
  – T8300 - 3 MB L2 - 2.40 GHz - 800 MHz
  – T8100 - 3 MB L2 - 2.10 GHz - 800 MHz
• Desktop Quad Core:
  – Q9550 - 12 MB L2 - 2.83 GHz - 1333 MHz
  – Q9450 - 12 MB L2 - 2.66 GHz - 1333 MHz
  – Q9300 - 6 MB L2 - 2.50 GHz - 1333 MHz
• Desktop Extreme Series:
  – QX9650 - 12 MB L2 - 3 GHz - 1333 MHz
• Note: Intel’s 45 nm Penryn-based Core 2 Duo and Core 2 Extreme processors were released on January 6, 2008. The new processors launch within a 35W thermal envelope.
These are becoming the building block of today’s SCs
Getting large amounts of speed requires lots of processors…
Amdahl’s Law
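Amdahl’s Law quantifies why lots of processors only help as far as the serial fraction allows; a quick sketch:

```python
# Amdahl's Law: if a fraction p of a program parallelizes
# perfectly over n processors, overall speedup is
#   S(n) = 1 / ((1 - p) + p / n)

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even 95%-parallel code tops out near 20x, no matter how
# many processors you throw at it:
print(round(amdahl_speedup(0.95, 1024), 1))   # ~19.6
print(round(1 / (1 - 0.95), 1))               # 20.0 asymptote
```

The asymptote 1/(1-p) is why the rest of the course cares so much about squeezing the serial fraction out of real applications.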
K Computer - #1 on Top500 @ 8.1 PF
• ~548K cores over 672 racks
• Consumes 9.89 Mwatts of power
• Efficiency: 0.825 Gflops/watt
• #6 on Green 500 list
• 1 PB of RAM
• Located at RIKEN in Japan
NSF MRI “Balanced” Cyberinstrument @ CCNI
• Blue Gene/Q
  – 104 Tflops @ 2+ GF/watt
  – #1 on Green 500 list
  – 10 PF and 20 PF systems by 2013
  – 32K threads / 8K cores
  – 8 TB RAM
• RAM Storage Accelerator
  – 4 TB @ 40+ GB/sec
  – 32 servers @ 128 GB each
• Disk storage
  – 32 servers @ 24 TB disk
  – 4 meta-data servers w/ SSD
  – Bandwidth: 5 to 24 GB/sec
• Viz systems
  – CCNI: 16 servers w/ dual GPUs
  – EMACS: display wall + servers
Disruptive Exascale Challenges…
• 1 billion-way parallelism
• Cost budget of O($200M) and O(20M watts)
  – Note: 1 Mwatt per year == $1 million US dollars
• Power
  – 1K–2K pJ/op today (according to Bill Harrod @ DOE)
    • Really @ 500 pJ/op using Blue Gene/Q data
  – Need 20 pJ/op (~50 GF/watt) to meet the 20 Mwatt power ceiling
  – Dominated by data movement & overhead
• Programmability
  – Writing an efficient parallel program is hard!
  – Locality required for efficiency
  – System complexity is a BARRIER to programmability
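A quick check of the power arithmetic above: at exascale, the energy budget per operation fixes the machine’s total power draw.

```python
# At 1 exaflop (1e18 ops/sec), power = op rate * energy per op.

def power_watts(ops_per_sec, joules_per_op):
    return ops_per_sec * joules_per_op

EXAFLOP = 1e18
print(power_watts(EXAFLOP, 20e-12))    # 20 pJ/op -> 2e7 W = 20 MW
print(power_watts(EXAFLOP, 1000e-12))  # ~1 nJ/op today -> 1 GW (!)

# 20 pJ/op is equivalently 50 Gflops per watt:
print(1 / 20e-12 / 1e9)                # 50.0
```

So hitting the 20 Mwatt ceiling means cutting today’s energy per operation by roughly 25-50x, which is why the slide pins the blame on data movement rather than arithmetic.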
Power Drives Radically New Hardware and Software
Compute is FREE, cost is moving data
All software will have to be radically redesigned to be locality aware
Bill Dally – All CS complexity theory will need to be re-done!
Note: IBM Blue Gene/Q today @ 45 nm!!
Reliable Exabyte Storage is HARD!
The gap between computation and I/O performance continues to increase.
• Current Data:
  – Intrepid: 478 Tflops but 60 GB/sec storage
  – Jaguar: 2.2 Pflops but ~200 GB/sec storage
  – Storage BW/Flop is shrinking!!
• In practice…
  – 1/3 of app exec time consumed by I/O
  – Checkpointing & a downward spiral of I/O
  – Kernel panic @ 600K files…
40
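The shrinking BW/Flop claim, checked against the two data points on the slide:

```python
# Storage bandwidth per flop for the two systems cited:
# the ratio shrinks as machines get faster.

intrepid_flops = 478e12        # 478 Tflops
intrepid_bw    = 60e9          # 60 GB/sec storage
jaguar_flops   = 2.2e15        # 2.2 Pflops
jaguar_bw      = 200e9         # ~200 GB/sec storage

intrepid_ratio = intrepid_bw / intrepid_flops   # bytes/sec per flop/sec
jaguar_ratio   = jaguar_bw / jaguar_flops
print(intrepid_ratio > jaguar_ratio)            # BW/Flop is shrinking
```

Jaguar has roughly 4.6x the compute of Intrepid but only about 3.3x the storage bandwidth, so every checkpoint takes proportionally longer relative to the compute it protects.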
What are SCs used for??
• Can you say “fever for the flavor”…
• Yes, Pringles used an SC to model the airflow of chips as they entered “The Can”…
• Improved the overall yield of “good” chips in “The Can” and left fewer chips on the floor…
• P&G has also used SCs to improve other products like Tide, Pampers, Dawn, Downy and Mr. Clean
Patient Specific Vascular Surgical Planning
– Virtual flow facility for patient-specific surgical planning
– High quality patient-specific flow simulations needed quickly
– Simulation on massively parallel computers
– Cost only $600 on 32K Blue Gene/L vs. $50K for a repeat open heart surgery…
– At exascale this will cost more like $6
Disruptive Opportunities @ 2^60
• A radical new way to think about science and engineering
  – Extreme time compression on very large-scale complex applications
  – Materials, Drug Discovery, Finance, Defense, and Disaster Planning & Recovery…
• Technology enabler for…
  – Smartphone “supercomputers” w/ 25 GFlop and 100’s of GB RAM
  – Petascale “supercomputer” in all major universities @ $200K
  – IBM Watson “desk-side” edition
  – Home users have 100 GB networks and Terascale+ “home” supercomputers…
By 2020, we will have not only unprecedented access to vast amounts of data but also the potentially ubiquitous, disruptive-scale computing power to use that data in our everyday lives.