CPS 303 High Performance Computing
Wensheng ShenDepartment of Computational Science
SUNY Brockport
Chapter 1: Introduction to High Performance Computing
von Neumann Architecture
CPU and Memory Speed
Motivation of Parallel Computing
Applications of Parallel Computing
1.1 von Neumann Architecture
A fixed-program computer vs. a stored-program computer
The dominant computer model for more than 40 years
The CPU executes a stored program
The operation is sequential: a sequence of read and write operations on the memory
von Neumann proposed the use of ROM: a read-only stored program
John von Neumann
Born on December 28, 1903; died on February 8, 1957
Mastered calculus at the age of 8
Graduate-level math at the age of 12
Obtained his Ph.D. at the age of 23
Proposed the stored-program concept
A Typical Example of von Neumann Architecture
[Diagram: CPU (Control Unit + Arithmetic Logic Unit) connected to Memory, Input Devices, Output Devices, and External Storage]
Modern Personal Computers
1. Monitor
2. Motherboard
3. CPU (Microprocessor)
4. Primary storage (RAM)
5. Expansion cards
6. Power supply
7. Optical disc drive
8. Secondary storage (Hard disk)
9. Keyboard
10. Mouse
http://en.wikipedia.org/wiki/Personal_computer
Peripheral Component Interconnect
Graphics cards
Sound cards
Network cards
Modems
CISC and RISC machines
CISC stands for complex instruction set computer; a single-bus system.
CISC: each individual instruction can execute several low-level operations, such as a memory load, an arithmetic operation, and a memory store.
RISC stands for reduced instruction set computer; a two-bus system, with a data bus and an address bus.
Both are SISD machines: Single Instruction stream on Single Data stream.
1.2 CPU and memory speed
Cray 1: 12 ns (1975)
Cray 2: 6 ns (1986)
Cray T-90: 2 ns (1997)
Intel PC: 1 ns (2000)
Today's PC (P4): 0.3 ns (2006)
Moore’s Law
Moore’s law (1965): the number of transistors per square inch on integrated circuits has doubled roughly every two years since the integrated circuit was invented.
How about the future? (Will the price of computers with the same computing power fall by half every two years?)
In a 2008 article in InfoWorld, Randall C. Kennedy, formerly of Intel, illustrated a countervailing trend using successive versions of Microsoft Office between 2000 and 2007 as his premise: despite the gains in computational performance during this period predicted by Moore's law, Office 2007 performed the same task at half the speed on a prototypical year-2007 computer as Office 2000 did on a year-2000 computer.
CPU and memory speed comparison
In 20 years, CPU speed (clock rate) has increased by a factor of 1000
DRAM speed has increased only by a factor of less than 4
CPU speed: 1-2 ns
Cache speed: 10 ns
DRAM speed: 50-60 ns
How can we feed data fast enough to keep the CPU busy?
Possible Solutions
A hierarchy of successively faster memory devices (multilevel caches)
Locality of data reference
Efficient programming can be an issue
Parallel systems may provide:
(1) large aggregate cache
(2) high aggregate bandwidth to the memory system
1.3 Price and Performance Comparison
Price for high-end CPU rises sharply
Intel processor price/performance
1.4 Computation for special purpose
Weather forecasting
Information retrieval
Car and aircraft design
NASA space discovery
Problem:
Insufficient memory
Slow in speed
Example: predicting the weather of the US and Canada for the next two days
Domain: 20 million square kilometers = 2.0×10⁷ square kilometers (5,000 kilometers × 4,000 kilometers), with an atmosphere 20 kilometers high
Mesh size: Δx = Δy = Δz = 0.1 kilometer
Number of cells: n = (5000/0.1) × (4000/0.1) × (20/0.1) = 4×10¹¹
Assuming it takes 100 calculations to determine the weather at a typical grid point, and we want to predict the weather condition at each hour for the next 48 hours, the total number of calculations is:
4×10¹¹ × 100 × 48 ≈ 2×10¹⁵
Assuming that our computer can execute one billion (10⁹) calculations per second, it will take
2×10¹⁵ / 10⁹ = 2×10⁶ seconds, or about 23 days
Increase the CPU speed to one trillion calculations per second? We would still need more than half an hour. What happens if we want to predict the weather for the whole earth, or if we want to use a smaller grid size, Δx=Δy=Δz=0.05 kilometer, for better accuracy?
The memory requirement
If we need 7 variables (u, v, w, p, T, ρ, ω) at each location, then at 4 bytes per word the memory cost is
7 × 4×10¹¹ words = 2.8×10¹² words = 1.12×10¹³ bytes = 11,200 Gbytes
Data transfer latency among CPU, registers, and memory
Possible solution: to build a processor executing 1 trillion operations per second
for (i = 0; i < ONE_TRILLION; i++)
    z[i] = x[i] + y[i];
For each i, the CPU must:
fetch x[i], y[i]
add z[i], x[i], y[i]
store z[i]
At least 3×10¹² copies of data must be transferred between registers and memory per second
Data are transferred at, at most, the speed of light, 3×10⁸ m/s.
We assume that r is the average distance of a word of memory from the CPU; then r must satisfy
3×10¹² × r meters = 3×10⁸ meters/second × 1 second
r = 10⁻⁴ meters
We need at least three trillion words of memory to store x, y, and z. Memory words are typically arranged in a rectangular grid in hardware. If we use a square grid with side length s and connect the CPU to the center of the square, then the average distance from a memory location to the CPU is about s/2 = r, so s = 2×10⁻⁴ meters. For a square grid, a typical row of memory words will contain
√(3×10¹²) ≈ √3 × 10⁶ words
Therefore, we need to fit a single word of memory into a square with a side length of
2×10⁻⁴ / (√3 × 10⁶) ≈ 10⁻¹⁰ meters, the size of an atom
That is to say, we need to figure out how to represent a 32-bit word with a single atom.
Building a computer that performs one trillion operations per second is thus extremely difficult. Other solutions?
To invite one hundred people for a dinner, should we build one big table to seat everyone? Or set up more tables?
How can we perform the task of 10¹⁵ calculations in minutes? More computers.
We divide one big problem into many small-sized subproblems.
1.5 Challenges:
Communication. At a dinner, people sitting at different tables can walk around to talk to each other.
How about data in different processors?
How do processors communicate?
Tasks:
Decide on and implement an interconnection network for the processors and memory modules
Design and implement system software for the hardware
Devise algorithms and data structures for solving our problem
Divide the algorithms and data structures up into subproblems
Identify the communications that will be needed among the subproblems
Assign subproblems to processors and memory modules
1.6 Topics covered in the course
The architecture, interconnection network, and system software for parallel computing
Message Passing Interface (MPI) libraries for parallel computing
Basic communication in MPI
Applications of MPI in numerical computation
Collective communication in MPI
Design and coding issues in parallel computing
The performance of parallel computing