CPS 303 High Performance Computing
Wensheng ShenDepartment of Computational Science
SUNY Brockport
Chapter 1: Introduction to High Performance Computing
von Neumann Architecture
CPU and Memory Speed
Motivation of Parallel Computing
Applications of Parallel Computing
1.1 von Neumann Architecture
A fixed-program computer vs. a stored-program computer
The dominant computer model for more than 40 years
The CPU executes a stored program
The operation is sequential: a sequence of read and write operations on the memory
von Neumann proposed the use of ROM: a read-only stored program
John von Neumann
Born on December 28, 1903; died on February 8, 1957
Mastered calculus at the age of 8
Graduate-level math at the age of 12
Obtained his Ph.D. at the age of 23
Proposed the stored-program concept
A Typical Example of von Neumann Architecture
[Diagram: CPU (Control Unit + Arithmetic Logic Unit) connected to Memory, Input Devices, Output Devices, and External Storage]
Modern Personal Computers
1. Monitor
2. Motherboard
3. CPU (Microprocessor)
4. Primary storage (RAM)
5. Expansion cards
6. Power supply
7. Optical disc drive
8. Secondary storage (Hard disk)
9. Keyboard
10. Mouse
http://en.wikipedia.org/wiki/Personal_computer
Peripheral Component Interconnect
Graphics cards
Sound cards
Network cards
Modems
CISC and RISC machines
CISC stands for complex instruction set computer; a single-bus system.
CISC: each individual instruction can execute several low-level operations, such as a memory load, an arithmetic operation, and a memory store.
RISC stands for reduced instruction set computer; a two-bus system, with a data bus and an address bus.
Both are SISD machines: Single Instruction stream on Single Data stream.
1.2 CPU and memory speed
Cray 1: 12 ns (1975)
Cray 2: 6 ns (1986)
Cray T-90: 2 ns (1997)
Intel PC: 1 ns (2000)
Today's PC (P4): 0.3 ns (2006)
Moore’s Law
Moore’s law (1965): the number of transistors per square inch on integrated circuits has doubled roughly every two years since the integrated circuit was invented.
How about the future? (Will the price of computers with the same computing power fall by half every two years?)
In a 2008 article in InfoWorld, Randall C. Kennedy, formerly of Intel, illustrated a countervailing trend using successive versions of Microsoft Office between 2000 and 2007 as his premise: despite the gains in computational performance during this period predicted by Moore's law, Office 2007 performed the same task at half the speed on a prototypical year-2007 computer as Office 2000 did on a year-2000 computer.
CPU and memory speed comparison
In 20 years, CPU speed (clock rate) has increased by a factor of 1000
DRAM speed has increased only by a factor of less than 4
CPU speed: 1-2 ns
Cache speed: 10 ns
DRAM speed: 50-60 ns
How can we feed data fast enough to keep the CPU busy?
Possible Solutions
A hierarchy of successively faster memory devices (multilevel caches)
Locality of data reference
Efficient programming can be an issue
Parallel systems may provide:
(1) large aggregate cache
(2) high aggregate bandwidth to the memory system
1.3 Price and Performance Comparison
Price for high-end CPU rises sharply
Intel processor price/performance
1.4 Computation for special purpose
Weather forecasting
Information retrieval
Car and aircraft design
NASA space discovery
Problem:
Insufficient memory
Slow in speed
Example: predicting the weather of the US and Canada for the next two days
Domain: 20 million square kilometers = 2.0×10⁷ square kilometers (5,000 kilometers × 4,000 kilometers), with an atmosphere 20 kilometers high
Mesh size: Δx = Δy = Δz = 0.1 kilometer
Number of cells: n = (5000/0.1) × (4000/0.1) × (20/0.1) = 4×10¹¹
Assuming it takes 100 calculations to determine the weather at a typical grid point, and we want to predict the weather condition at each hour for the next 48 hours, the total number of calculations is:
4×10¹¹ × 100 × 48 ≈ 2×10¹⁵
Assuming that our computer can execute one billion (10⁹) calculations per second, it will take
2×10¹⁵ / 10⁹ = 2×10⁶ seconds, or about 23 days
Increase the CPU speed to one trillion calculations per second? We would still need more than half an hour. What happens if we want to predict the weather for the whole earth, or if we want to use a smaller grid size, Δx=Δy=Δz=0.05 kilometer, for better accuracy?
The memory requirement
If we need 7 variables (u, v, w, p, T, ρ, ω) at each location, then at 4 bytes per word the memory cost is
7 × 4×10¹¹ words = 2.8×10¹² words = 1.12×10¹³ bytes = 11,200 Gbytes
Data transfer latency among CPU, registers, and memory
Possible solution: to build a processor executing 1 trillion operations per second
for (i = 0; i < ONE_TRILLION; i++)
    z[i] = x[i] + y[i];
For each i, the CPU must:
fetch x[i], y[i]
add z[i], x[i], y[i]
store z[i]
At least 3×10¹² copies of data must be transferred between registers and memory per second
Data are transferred at, at most, the speed of light, 3×10⁸ m/s.
We assume that r is the average distance of a word of memory from the CPU; then r must satisfy
3×10¹² × r meters = 3×10⁸ meters/second × 1 second
r = 10⁻⁴ meters
We need at least three trillion words of memory to store x, y, and z. Memory words are typically arranged in a rectangular grid in hardware. If we use a square grid with side length s and connect the CPU to the center of the square, then the average distance from a memory location to the CPU is about s/2 = r, so s = 2×10⁻⁴ meters. For a square grid, a typical row of memory words will contain
√(3×10¹²) ≈ √3 × 10⁶ words
Therefore, we need to fit a single word of memory into a square with a side length of
2×10⁻⁴ / (√3 × 10⁶) ≈ 10⁻¹⁰ meters, the size of an atom
That is to say, we need to figure out how to represent a 32-bit word with a single atom.
Building a computer that performs one trillion operations per second is thus extremely difficult. Other solutions?
To invite one hundred people for a dinner, should we build one big table to seat everyone? Or set up more tables?
How can we perform the task of 10¹⁵ calculations in minutes? More computers.
We divide one big problem into many small-sized subproblems.
1.5 Challenges:
Communication. At a dinner, people sitting at different tables can walk around to talk to each other.
How about data in different processors?
How do processors communicate?
Tasks:
Decide on and implement an interconnection network for the processors and memory modules
Design and implement system software for the hardware
Devise algorithms and data structures for solving our problem
Divide the algorithms and data structures up into subproblems
Identify the communications that will be needed among the subproblems
Assign subproblems to processors and memory modules
1.6 Topics covered in the course
The architecture, interconnection network, and system software for parallel computing
Message Passing Interface (MPI) libraries for parallel computing
Basic communication in MPI
Applications of MPI in numerical computation
Collective communication in MPI
Design and coding issues in parallel computing
The performance of parallel computing