CSCI-4320/6360: Parallel Programming & Computing
West Hall, Tues./Fri. 12-1:20 p.m.
Introduction, Syllabus & Prelims (PPC 2012)

Prof. Chris Carothers
Computer Science Department
MRC 309a
Office Hrs: Tuesdays, 1:30 – 3:30 p.m.
[email protected]
www.rpi.edu/~carotc/COURSES/PARALLEL/SPRING-2012



Let’s Look at the Syllabus…

• See the syllabus on the course webpage.


To Make A Fast Parallel Computer You Need a Faster Serial Computer…well sorta…

• Review of…
  – Instructions…
  – Instruction processing…
• Put it together…why the heck do we care about or need a parallel computer?
  – i.e., they are really cool pieces of technology, but can they really do anything useful besides compute Pi to a few billion more digits…


Processor Instruction Sets

• In general, a computer needs a few different kinds of instructions:
  – mathematical and logical operations
  – data movement (access memory)
  – jumping to new places in memory
    • if the right conditions hold.
  – I/O (sometimes treated as data movement)
• All these instructions involve using registers to store data as close as possible to the CPU
  – E.g. $t0, $s0 in MIPS or %eax, %ebx in x86


a=(b+c)-(d+e);

add $t0, $s1, $s2   # t0 = b+c
add $t1, $s3, $s4   # t1 = d+e
sub $s0, $t0, $t1   # a = $t0–$t1

Registers: a = $s0, b = $s1, c = $s2, d = $s3, e = $s4


lw destreg, const(addrreg)

• lw: “Load Word”
• destreg: name of register to put value in
• const: a number
• addrreg: name of register to get base address from
• address = (contents of addrreg) + const


Array Example: a=b+c[8];

lw $t0, 8($s2)      # $t0 = c[8]
add $s0, $s1, $t0   # $s0 = $s1+$t0

(yeah, this is not quite right …)

Registers: a = $s0, b = $s1, base address of c = $s2
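The "not quite right" caveat presumably refers to the offset: `lw` adds a byte offset to the base address, so with 4-byte words element c[8] sits 32 bytes past the base, not 8. Here is a minimal Python sketch of that address calculation. The dict-based memory, the base address 1000, the array contents, and the names `load_word` and `WORD_SIZE` are illustrative assumptions, not part of the slides.

```python
WORD_SIZE = 4  # bytes per word on 32-bit MIPS


def load_word(memory, regs, addrreg, const):
    """Model lw: address = (contents of addrreg) + const, measured in bytes."""
    address = regs[addrreg] + const
    return memory[address]


# Hypothetical layout: array c starts at byte address 1000 and c[i] == 10 * i.
base = 1000
memory = {base + WORD_SIZE * i: 10 * i for i in range(16)}
regs = {"$s1": 5, "$s2": base}  # b = 5, $s2 holds the address of c[0]

wrong = load_word(memory, regs, "$s2", 8)              # 8 bytes in: this is c[2]
right = load_word(memory, regs, "$s2", 8 * WORD_SIZE)  # 32 bytes in: this is c[8]
a = regs["$s1"] + right                                # a = b + c[8]
```

Under these assumptions the offset 8 fetches c[2], while 32 fetches the intended c[8].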


sw srcreg, const(addrreg)

• sw: “Store Word”
• srcreg: name of register to get value from
• const: a number
• addrreg: name of register to get base address from
• address = (contents of addrreg) + const


How are instructions processed?

• In the simple case…
  – Fetch instruction from memory
  – Decode it (read op code, and use registers based on what instruction the op code says)
  – Execute the instruction
  – Write back any results to register or memory
• Complex case…
  – Pipeline – overlap instruction processing…
  – Superscalar – multi-instruction issue per clock cycle…
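The simple case can be sketched as a tiny interpreter loop. This is an illustrative Python model, not a real MIPS simulator: it assumes instructions are `(op, dest, src1, src2)` tuples, supports only `add` and `sub`, and the function name `run` is made up for this example. It executes the earlier a=(b+c)-(d+e) program.

```python
def run(program, regs):
    """Run a list of (op, dest, src1, src2) tuples against a register dict."""
    pc = 0
    while pc < len(program):
        instr = program[pc]               # fetch the instruction at the PC
        op, dest, src1, src2 = instr      # decode: op code and register names
        x, y = regs[src1], regs[src2]     # decode: read the source registers
        if op == "add":                   # execute
            result = x + y
        elif op == "sub":
            result = x - y
        else:
            raise ValueError(f"unknown op code: {op}")
        regs[dest] = result               # write back to the register file
        pc += 1                           # one instruction at a time, in order
    return regs


# a = (b + c) - (d + e), with b, c, d, e in $s1..$s4 and a in $s0
program = [
    ("add", "$t0", "$s1", "$s2"),
    ("add", "$t1", "$s3", "$s4"),
    ("sub", "$s0", "$t0", "$t1"),
]
regs = run(program, {"$s1": 7, "$s2": 5, "$s3": 2, "$s4": 1})
```

Each loop iteration finishes all four steps before the next instruction starts, which is exactly the serialization that pipelining and superscalar issue try to remove.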


Simple (relative term) CPU Multicycle Datapath & Control


Simple (yeah right!) Instruction Processing FSM!


Pipeline Processing w/ Laundry

• While the first load is drying, put the second load in the washing machine.

• When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.

• NOTE: unrealistic scenario for CS students, as most only own 1 load of clothes…
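A rough way to quantify the laundry analogy is to compare total time with and without overlap. The sketch below assumes every stage (wash, dry, fold) takes the same time and that each stage handles one load at a time. The 30-minute stage time, the four loads, and the function names are illustrative assumptions.

```python
def sequential_time(n_loads, n_stages, stage_time):
    """Each load finishes every stage before the next load starts."""
    return n_loads * n_stages * stage_time


def pipelined_time(n_loads, n_stages, stage_time):
    """The first load fills the pipeline; each later load adds one stage time."""
    return (n_stages + n_loads - 1) * stage_time


# Hypothetical numbers: 4 loads, 3 stages (wash, dry, fold), 30 minutes each.
seq = sequential_time(4, 3, 30)
pipe = pipelined_time(4, 3, 30)
speedup = seq / pipe
```

With these numbers the sequential schedule takes 360 minutes and the pipelined one 180. As the number of loads grows, the speedup approaches the number of stages.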


[Figure: two laundry timelines, each plotting loads A, B, C, D in task order against a time axis running from 6 PM to 2 AM]


Pipelined DP w/ signals


Pipelined Instruction.. But wait, we’ve got dependencies!


Pipeline w/ Forwarding Values


Where Forwarding Fails…must stall


How Stalls Are Inserted


What about those crazy branches?

Problem: if the branch is taken, the PC goes to addr 72, but we don’t know that until after 3 other instructions have entered the pipeline


Dynamic Branch Prediction

• From the phrase “There is no such thing as a typical program”, this implies that programs will branch in different ways and so there is no “one size fits all” branch algorithm.
• Alt approach: keep a history (1 bit) on each branch instruction and see if it was last taken or not.
• Implementation: branch prediction buffer or branch history table.
  – Index based on lower part of branch address
  – Single bit indicates if branch at address was last taken or not. (1 or 0)
  – But single-bit predictors tend to lack sufficient history…


Solution: 2-bit Branch Predictor

• Must be wrong twice before changing prediction
• Learns if the branch is more biased towards “taken” or “not taken”
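To see why the extra bit helps, the sketch below counts mispredictions for a single branch under both schemes. It models the 2-bit scheme as a saturating counter, which is one common realization; the state diagram on the slide may use slightly different transitions. The branch trace (a loop branch taken 9 times and then not taken, repeated 10 times) and the function names are made up for illustration.

```python
def one_bit_mispredictions(outcomes, state=False):
    """Predict whatever the branch did last time; state is the last outcome."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken                      # remember only the latest outcome
    return misses


def two_bit_mispredictions(outcomes, counter=0):
    """2-bit saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted = counter >= 2
        if predicted != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Hypothetical loop branch: taken 9 times, then falls through; loop run 10 times.
outcomes = ([True] * 9 + [False]) * 10
one_bit = one_bit_mispredictions(outcomes)
two_bit = two_bit_mispredictions(outcomes)
```

On this trace the 1-bit predictor misses twice per loop execution (once at loop exit and again on re-entry), while the 2-bit counter, once warmed up, misses only at loop exit.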


Even more performance…

• Ultimately we want greater and greater Instruction Level Parallelism (ILP)
• How?
• Multiple instruction issue.
  – Results in CPI’s less than one.
  – Here, instructions are grouped into “issue slots”.
  – So, we usually talk about IPC (instructions per cycle)
  – Static: uses the compiler to assist with grouping instructions and hazard resolution. Compiler MUST remove ALL hazards.
  – Dynamic: (i.e., superscalar) hardware creates the instruction schedule based on dynamically detected hazards


Example Static 2-issue Datapath

Additions:
• 32 bits from instr. mem
• Two read, 1 write ports on reg file
• 1 more ALU (top handles address calc)


Ex. 2-Issue Code Schedule

Loop: lw   $t0, 0($s1)        # t0 = array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # dec pointer
      bne  $s1, $zero, Loop   # branch $s1 != 0

        ALU/Branch              Data Xfer Inst.    Cycles
Loop:                           lw $t0, 0($s1)     1
        addi $s1, $s1, -4                          2
        addu $t0, $t0, $s2                         3
        bne $s1, $zero, Loop    sw $t0, 4($s1)     4

It takes 4 clock cycles for 5 instructions, or an IPC of 1.25


More Performance: Loop Unrolling

• Technique where multiple copies of the loop body are made.

• Make more ILP available by removing dependencies.

• How? Compiler introduces additional registers via “register renaming”.
• This removes “name” or “anti” dependence
  – where an instruction order is purely a consequence of the reuse of a register and not a real data dependence.
  – No data values flow between one pair and the next pair
  – Let’s assume we unroll a block of 4 iterations of the loop…
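The effect of unrolling plus renaming can be illustrated at the source level. The Python sketch below is only an analogy: Python does not issue these statements in parallel, and the distinct temporaries `t0` to `t3` merely stand in for renamed registers. It assumes the array length is a multiple of 4, and the function names are hypothetical.

```python
def add_scalar(array, s):
    """Original loop: one element per iteration, one temporary reused each time."""
    for i in range(len(array)):
        t0 = array[i]        # load
        t0 = t0 + s          # add scalar
        array[i] = t0        # store
    return array


def add_scalar_unrolled4(array, s):
    """Loop unrolled by 4; assumes len(array) is a multiple of 4."""
    assert len(array) % 4 == 0
    for i in range(0, len(array), 4):
        # Four independent temporaries: no name dependence between the copies.
        t0 = array[i]
        t1 = array[i + 1]
        t2 = array[i + 2]
        t3 = array[i + 3]
        t0, t1, t2, t3 = t0 + s, t1 + s, t2 + s, t3 + s
        array[i], array[i + 1], array[i + 2], array[i + 3] = t0, t1, t2, t3
    return array


original = add_scalar(list(range(8)), 3)
unrolled = add_scalar_unrolled4(list(range(8)), 3)
```

Because no value flows from one copy of the body to the next, a multiple-issue machine is free to overlap the four load/add/store chains, and the loop overhead (pointer update and branch) is paid once per four elements.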


Dynamic Scheduled Pipeline


Intel P4 Dynamic Pipeline – Looks like a cluster .. Just much much smaller…


Summary of Pipeline Technology

We’ve exhausted this!!

IPC just won’t go much higher…Why??


More Speed til it Hertz!

• So, if no ILP is available, why not increase the clock frequency?
  – E.g. why don’t we have 100 GHz processors today?
• ANSWER: POWER & HEAT!!
  – With current CMOS technology, power needs a polynomial++ increase for a linear increase in clock speed.
  – Power leads to heat, which will ultimately turn your CPU into a heap of melted silicon!
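One common first-order model behind this claim is the CMOS switching-power relation P = activity × C × V² × f. The sketch below uses that model and additionally assumes that the supply voltage has to rise in proportion to frequency, which is a simplification; leakage power is ignored entirely. The function name and the normalized units are illustrative.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order CMOS switching power: activity * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


# Normalized baseline: C = V = f = 1.
base = dynamic_power(1.0, 1.0, 1.0)

# Assumption: doubling the clock also requires doubling the supply voltage.
doubled = dynamic_power(1.0, 2.0, 2.0)
ratio = doubled / base
```

Under these assumptions, doubling the clock costs about 8 times the power, i.e. power grows roughly with the cube of frequency.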


CPU Power Consumption…

Typically, 100 watts is the limit..


Where do we go from here? (actually, we’ve arrived @ “here”!)

• Current Industry Trend: Multi-core CPUs
  – Typically lower clock rate (i.e., < 3 GHz)
  – 2, 4 and now 8 cores in single “socket” package
  – Because of smaller VLSI design processes (e.g. < 45 nm) can reduce power & heat…
• Potential for large, lucrative contracts in turning old dusty sequential codes to multi-core capable
  – Salesman: here’s your new $200 CPU, & oh, BTW, you’ll need this million $ consulting contract to port your code to take advantage of those extra cores!
• Best business model since the mainframe!
  – More cores require greater and greater exploitation of available parallelism in an application, which gets harder and harder as you scale to more processors…
• Due to cost, we’ll force in-house development of talent pool…
  – You could be that talent pool…


Examples: Multicore CPUs

• Brief listing of the recently released new 45 nm processors, based on Intel site (Processor Model - Cache - Clock Speed - Front Side Bus)
• Desktop Dual Core:
  – E8500 - 6 MB L2 - 3.16 GHz - 1333 MHz
  – E8400 - 6 MB L2 - 3.00 GHz - 1333 MHz
  – E8300 - 6 MB L2 - 2.66 GHz - 1333 MHz
• Laptop Dual Core:
  – T9500 - 6 MB L2 - 2.60 GHz - 800 MHz
  – T9300 - 6 MB L2 - 2.50 GHz - 800 MHz
  – T8300 - 3 MB L2 - 2.40 GHz - 800 MHz
  – T8100 - 3 MB L2 - 2.10 GHz - 800 MHz
• Desktop Quad Core:
  – Q9550 - 12 MB L2 - 2.83 GHz - 1333 MHz
  – Q9450 - 12 MB L2 - 2.66 GHz - 1333 MHz
  – Q9300 - 6 MB L2 - 2.50 GHz - 1333 MHz
• Desktop Extreme Series:
  – QX9650 - 12 MB L2 - 3 GHz - 1333 MHz

• Note: Intel's new 45nm Penryn-based Core 2 Duo and Core 2 Extreme processors were released on January 6, 2008. The new processors launch within a 35W thermal envelope.

These are becoming the building block of today’s SCs

Getting large amounts of speed requires lots of processors…


Amdahl’s Law
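Only the slide title survives here, so as a reminder: in its usual form, Amdahl's Law says that if a fraction p of a program's run time can be parallelized across N processors, the speedup is 1 / ((1 - p) + p / N). A minimal Python sketch follows; the function name and the 95% example are illustrative, not taken from the slide.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup = 1 / (serial fraction + parallel fraction / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Hypothetical program that is 95% parallelizable.
on_8_cores = amdahl_speedup(0.95, 8)
on_a_billion = amdahl_speedup(0.95, 10**9)
```

Even with a billion processors the 95% parallel program cannot exceed a 20x speedup, because the 5% serial part never shrinks.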



K Computer - #1 on Top500 @ 8.1 PF

• ~548K cores over 672 racks
• Consumes 9.89 Mwatts of power
• Efficiency: .825 Gflops/watt
• #6 on Green 500 list
• 1 PB of RAM
• Located at RIKEN in Japan


NSF MRI “Balanced” Cyberinstrument @ CCNI

• Blue Gene/Q
  – 104 Tflops @ 2+ GF/watt
  – #1 on Green 500 list
  – 10PF and 20PF systems by 2013
  – 32K threads/8K cores
  – 8 TB RAM
• RAM Storage Accelerator
  – 4 TB @ 40+ GB/sec
  – 32 servers @ 128 GB each
• Disk storage
  – 32 servers @ 24 TB disk
  – 4 meta-data servers w/ SSD
  – Bandwidth: 5 to 24 GB/sec
• Viz systems
  – CCNI: 16 servers w/ dual GPUs
  – EMACS: display wall + servers


Disruptive Exascale Challenges…

• 1 billion-way parallelism
• Cost budget of O($200M) and O(20M watts)
  – Note: 1M watt per year == $1 million US dollars
• Power
  – 1K-2K pJ/op today (according to Bill Harrod @ DOE)
    • Really @ 500 pJ/op using Blue Gene/Q data
  – Need 20 pJ/op (~50 GF/watt) to meet 20 Mwatt power ceiling
  – Dominated by data movement & overhead
• Programmability
  – Writing an efficient parallel program is hard!
  – Locality required for efficiency
  – System complexity is BARRIER to programmability
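The power figures above follow from simple unit conversions, sketched here in Python. The sketch assumes "exascale" means 10^18 operations per second sustained and treats energy per operation as a single average value; the function names are illustrative.

```python
EXAFLOP = 1e18      # operations per second (assumed meaning of "exascale")
PICOJOULE = 1e-12   # joules


def power_watts(ops_per_second, picojoules_per_op):
    """Power = operation rate * energy per operation."""
    return ops_per_second * picojoules_per_op * PICOJOULE


def gflops_per_watt(picojoules_per_op):
    """Operations per joule, expressed in units of 10^9."""
    return 1.0 / (picojoules_per_op * PICOJOULE) / 1e9


target_power = power_watts(EXAFLOP, 20)       # at the 20 pJ/op goal
bgq_style_power = power_watts(EXAFLOP, 500)   # at the 500 pJ/op figure
target_efficiency = gflops_per_watt(20)
```

At 20 pJ/op an exaflop machine draws about 20 MW, which matches the stated ceiling and corresponds to about 50 GF/watt. At 500 pJ/op the same machine would draw about 500 MW.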



Power Drives Radically New Hardware and Software


Compute is FREE, cost is moving data

All software will have to be radically redesigned to be locality aware

Bill Dally – All CS complexity theory will need to be re-done!

Note: IBM Blue Gene/Q today @ 45 nm!!


Reliable Exabyte Storage is HARD!

The gap between computation and I/O performance continues to increase.

• Current Data:
  – Intrepid: 478 Tflops but 60 GB/sec storage
  – Jaguar: 2.2 Pflops but ~200 GB/sec storage
  – Storage BW/Flop is shrinking!!
• In practice…
  – 1/3 of app exec time consumed by I/O
  – Checkpointing & downward spiral of I/O
  – Kernel panic @ 600K files…
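The "storage BW/Flop is shrinking" point can be checked directly from the two data points on the slide. This short Python sketch divides storage bandwidth by peak compute rate; the function name is illustrative, and the ~200 GB/sec Jaguar figure is treated as exactly 200 GB/sec.

```python
def bytes_per_flop(storage_bytes_per_sec, flops):
    """Storage bandwidth available per floating-point operation."""
    return storage_bytes_per_sec / flops


intrepid = bytes_per_flop(60e9, 478e12)   # 60 GB/sec vs 478 Tflops
jaguar = bytes_per_flop(200e9, 2.2e15)    # ~200 GB/sec vs 2.2 Pflops
```

The larger machine has more absolute storage bandwidth but fewer bytes of bandwidth per flop, so I/O takes a growing share of run time as systems scale up.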



What are SC’s used for??

• Can you say “fever for the flavor”…
• Yes, Pringles used an SC to model airflow of chips as they entered “The Can”…
• Improved overall yield of “good” chips in “The Can” and fewer chips on the floor…
• P&G has also used SCs to improve other products like: Tide, Pampers, Dawn, Downy and Mr. Clean


Patient Specific Vascular Surgical Planning

  – Virtual flow facility for patient specific surgical planning
  – High quality patient specific flow simulations needed quickly
  – Simulation on massively parallel computers
  – Cost only $600 on 32K Blue Gene/L vs. $50K for a repeat open heart surgery…
  – At exascale this will cost more like $6


Disruptive Opportunities @ 2^60

• A radical new way to think about science and engineering
  – Extreme time compression on very large-scale complex applications
  – Materials, Drug Discovery, Finance, Defense, and Disaster Planning & Recovery…
• Technology enabler for …
  – Smartphone “supercomputers” w/ 25 GFlop and 100’s GB RAM
  – Petascale “supercomputer” in all major universities @ $200K
  – IBM Watson “desk-side” edition
  – Home users have 100 GB network and Terascale+ “home” supercomputers…

By 2020, we will have unprecedented access to vast amounts of data and, potentially, the ubiquitous disruptive-scale computing power to use that data in our everyday lives