Upload
sunawar-khan-ahsan
View
143
Download
3
Tags:
Embed Size (px)
Citation preview
Report on
Matrix Multiplication using Parallel Processing
Subject
Advance Computer Architecture
Submitted To
Mr. Asim Munir
Class
MSCS-F14
Prepared by: Sunawar Khan
Reg No: 813-MSCS-F14
Matrix Multiplication using Parallel Processing
Parallel Processing
Introduction to parallel processing
There are many application in day to day life that demand real time solution to problem. For
example weather forecasting has to done in a timely fashion etc.. If an expert system is used to
aid a physician in surgical procedures ,decisions have to be made within seconds. And so on.
Programs written for such applications have to perform an enormous amount of computation.
Even the fastest single-processor machine may not be able to come up with solutions within
tolerable time limits. Parallel Random Access Machines (PRAM) offer the potential of
decreasing the solution times enormously. E.g.
Say there are 100 numbers to be added and there are two persons A and B. Person A can add the
first 50 numbers. At the same time B can add the next 50 numbers. When they are done, one of
them can add two individual sums to get the final answer. So two people can add the 100
numbers in almost half the time required by one.
Here is another example of Computing Convex Hull using parallel processing
Take the set of points and divide the set into two halves Assume that recursive call computes the
convex hull of the two halves Conquer stage: take the two convex hulls and merge it to obtain
the convex hull for the entire set.
What is Computational Model
“A computational model is a mathematical model in computational science that requires
extensive computational resources to study the behavior of a complex system by computer
simulation.”
Random access machine model
Algorithms can be measured in a machine-independent way using the Random Access Machine
(RAM) model. This model assumes a single processor. In the RAM model, instructions are
executed one after the other, with no concurrent operations. This model of computation is an
abstraction that allows us to compare algorithms on the basis of performance. The assumptions
made in the RAM model to accomplish this are:
Each simple operation takes 1 time step.
Loops and subroutines are not simple operations.
Each memory access takes one time step, and there is no shortage of memory.
For any given problem the running time of algorithms is assumed to be the number of time steps.
The space used by an algorithm is assumed to be the number of RAM memory cells.
A model of computation whose memory consists of an unbounded sequence of registers, each of
which may hold an integer. In this model, arithmetic operations are allowed to compute the
address of a memory register.
In the RAM (Random Access Machine) we assume that any of the following operation can be
done in one unit of time: addition, subtraction, multiplication, division, comparison, memory
access, assignment, and so on. This model widely accepted as a valid sequential model.
An important feature of parallel computing that is absent in sequential computing is need for
interprocessor communication. For example given any problem, the processor have to
communicate among themselves and agree on the subproblems each will work on. Also they
need to communicate to see whether everyone has finished its task so on.
PRAM Model for Single Processor Machine. The designer of sequential algorithm typically formulates the algorithm using an abstract model
of computation called random access machine(RAM). In this model, the machine consist of a
single processor connected to a memory system. Each CPU include arithmetic operation, logical
operation, and memory access require one step access.
Standard Random Access Machine
Each Operation load, store, jump, add, etc takes one unit of time.
Simply general one model.
Basic model for sequential algorithm
Unbounded number of local memory cells
Each memory cell can hold an integer of unbounded size
Instruction set included –simple operations, data operations, comparator, branches
All operations take unit time
Time complexity = number of instructions executed
Space complexity = number of memory cells used
Parallel Machine or multiprocessor Model
We begin this discussion with an ideal parallel machine called Parallel Random Access Machine,
or PRAM. A multiprocessor model is a generalization of the sequenatial RAM Model in which
there is more than one processor. Multiprocessor model can be classified into three basic types,
local memory machine, modular memory machine, and Parallel Random-Access Machien. Each
describe figure below
“The Parallel Random Access Machine (PRAM) is an abstract model for parallel computation
which assumes that all the processors operate synchronously under a single clock and are able to
randomly access a large shared memory.”
A natural extension of the Random Access Machine(RAM), serial architecture is the
Parallel Random Access Machine, or PRAM
Pram Consist of p Processor and a global memory of unbounded size that is uniformly
accessible to all processors
Processors share a common clock but may execute different instruction in each cycle
PRAM Architecture
The parallel version of RAM constitutes an abstract model of the class of global-memory parallel
processors. The
abstraction consists of ignoring the details of the processor-to-memory interconnection network
and taking the view that each processor can access any memory location in each machine cycle,
independent of what other processors are doing.
Processor i can do the following in three phases of one cycle:
Fetch a value from address si in shared memory
Perform computations on data held in local registers
3. Store a value into address di in shared memory
Is an abstract machine for designing the algorithms applicable to parallel computers
Unbounded collection of RAM processors P0, P1, …,
Processors don’t have tape
Each processor has unbounded registers
Unbounded collection of share memory cells
All processors can access all memory cells in unit time
All communication via shared memory
Two or more processors may read simultaneously from the same cell
A write conflict occurs when two or more processors try to write simultaneously into
the same cell
Classification of PRAM Model
The classification in is reminiscent of Flynn’s classification and offers yet another example of
the quest to invent four-letter abbreviations/acronyms in computer architecture! Note that here,
too, one of the categories is not very useful, because if concurrent writes are allowed, there is no
logical reason for excluding the less problematic concurrent reads. EREW PRAM is the most
realistic of the four submodels (to the extent that thousands of processors concurrently accessing
thousands of memory locations within a shared memory address space of millions or even
billions of locations can be considered realistic!). CRCW PRAM is the least restrictive
submodel, but has the disadvantage of requiring a conflict resolution mechanism to define the
effect of concurrent writes (more on this below). The default submodel, which is assumed when
nothing is said about the submodel, is CREW PRAM. For most computations, it is fairly easy to
organize the algorithm steps so that concurrent writes to the same location are never attempted.
CRCW PRAM is further classified according to how concurrent writes are handled. Here are a
few example submodels based on the semantics of concurrent writes in CRCW PRAM:
EREW Least “powerful”, most “realistic”
CREW Default
ERCW Not useful
CRCW Most “powerful”,
further subdivided
Reads from same location
Wri
tes t
o s
am
e lo
ca
tio
n
Exclusive
Co
ncurr
ent
Concurrent
Exclu
siv
e
CRCW SubModel
• Undefined: In case of multiple writes, the value written is undefined (CRCW-U).
• Detecting: A special code representing “detected collision” is written (CRCW-D).
Common: Multiple writes allowed only if all store the same value (CRCW-C). This is
sometimes called the consistent-write submodel.
• Random: The value written is randomly chosen from among those offered (CRCWR).
This is sometimes called the arbitrary-write submodel.
• Priority: The processor with the lowest index succeeds in writing its value (CRCW-P).
• Max/Min: The largest/smallest of the multiple values is written (CRCW-M).
• Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR
(CRCW-X), or some other combination of the multiple values is written.
Physical Complexity
Processors and memories are connected via switches.
Since these switches must operate in O(1) time at the level of words, for a system of p
Processors and m words, the switch is O(mp).
Clearly, for meaningful values of p and m, a true PRAM is not realizable
Relationship between different PRAM Submodel
The following relationships have been established between some of the PRAM submodels:
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
Even though all CRCW submodels are strictly more powerful than the EREW submodel, the
latter can simulate the most powerful CRCW submodel listed above with at most logarithmic
slowdown.
A p-processor CRCW-P (priority) PRAM can be simulated by a p-processor EREW PRAM with
a slowdown factor of Θ(log p). EREW PRAM to sort or find the smallest of p values in Θ(log p)
time, as we shall see later. To avoid concurrent writes, each processor writes an ID-address-value
triple into its corresponding element of a scratch list of size p, with the p processors then
cooperating to sort the list by the destination addresses, partition the list into segments
corresponding to common addresses (which are now adjacent in the sorted list), do a reduction
operation within each segment to remove all writes to the same location except the one with the
smallest processor ID, and finally write the surviving address-value pairs into memory. This final
write operation will clearly be of the exclusive variety.
Matrix Multiplication
Introduction to Matrix Multiplication.
We will show how to implement matrix multiplication C=C+A*B on several of the
communication networks discussed in the last lecture, and develop performance models to
predict how long they take. We will see that the performance depends on several factors.
we discuss PRAM matrix multiplication algorithms as representative examples of the class of
numerical problems. Matrix multiplication is quite important in its own right and is also used as
a building block in many other parallel algorithms. For example, matrix multiplication is useful
in solving graph problems when the graphs are represented by their adjacency or weight
matrices. Given m × m matrices A and B, with elements a i jand b ij,, their product C is defined
as
The following O(m³)-step sequential algorithm can be used for multiplying m × m matrices:
Sequential matrix multiplication
for i = 0 to m – 1 do for j = 0 to m – 1 do
t := 0 for k = 0 to m – 1 do
t := t + aikbkj endfor cij := t
endfor endfor
Consider n*n matric multiplication with n3 processors
Each cij=∑ k=1..n aik bkj be computed on the CREW PRAM in parallel using n processors n
O(log n) time.
On the EREW PRAM exclusive read of aij and bij values can be satisfied by making n
copies of a and b, which takes O(log n) time with n Processors
Total time is still O(log n)
Memory requirement is ofcourse much higher for the EREW PRAM
Complexity: Θ(n3)
Better Algorithm that improve slightly
Multiplication by block
PRAM matrix multiplication using m2 processors
Proc (i, j), 0 i, j < m, do begin
t := 0 for k = 0 to m – 1 do
t := t + aikbkj endfor cij := t
end
Because multiple processors will be reading the same row of A or the same column of B, the
above naive implementation of the algorithm would require the CREW submodel. However, it is
possible to convert the algorithm to an EREW PRAM algorithm by skewing the memory
accesses (how?).
matrix multiplication can be done in Θ(m²) time by using Processor i to compute the m elements
in Row i of the product matrix C in turn.
PRAM Matrix Multiplication with m Processors
for j = 0 to m – 1 Proc i, 0 i < m, do t := 0
for k = 0 to m – 1 do t := t + aikbkj
endfor cij := t
endfor
m processors read A at once (no concurrent)
- All m processors read same column of B at same time (concurrent read should be allowed)
- if not then, Brent’s theorem states – we can convert CREW -> EREW by skewing memory
access
More Efficient Matrix Multiplication (for NUMA)
On the Cm* NUMA-type shared-memory multiprocessor, a research prototype machine built at
Carnegie-Mellon University in the 1980s, this block matrix multiplication algorithm exhibited
good, but sublinear, speed-up. With 16 processors, the speed-up was only 5 in multiplying 24 ×
24 matrices. However, the speed-up improved to about 9 when larger 36 × 36 (48 × 48) matrices
were multiplied. It is interesting to note that improved locality of the block matrix multiplication
algorithm can also improve the running time on a uniprocessor, or distributed shared-memory
multiprocessor with caches, in view of higher cache hit rates.
Detail of Block Matrix Multiplication
A multiply-add computation on q q blocks needs 2q 2 = 2m 2/p memory accesses and 2q 3
arithmetic operations So, q arithmetic operations are done per memory access.
iq + q - 1
iq + a
iq + 1
iq
jq jq + b jq + q - 1
kq + c
kq + c
iq + q - 1
iq + a
iq + 1
iq
jq jq + 1 jq + b jq + q - 1
Multiply
Add Elements of
block (i, j)
in matrix C
Elements of
block (k, j)
in matrix B
Element of
block (i, k)
in matrix A
jq + 1
A Simple Parallel Algorithm
Example for n numbers addition:
1. We start with 4 processors and each of them adds 2 items in the first step. 2. The number of items is halved at every subsequent step. Hence logn steps are required
for adding n numbers.
3. The processor requirement is O(n).
CREW Cost
Let P(n) = O(n2)
Read n2 processors Aij all cells at once in = O(1)
Read n2 processors Bij all cells in = O(1)
Each processor multiply Aij* Bij in = O(1)
Parallel Sum to get Cij = O(logn)
Store Value Cij = O(1)
T(n) = O(log n)
P(n) = O(n2)
W(n) = O(n2log n) = total # of all operations
EREW Cost
Let P(n) = O(n2)
Read n2 processors Aij all cells at once in = O(1)
Cannot read n2 processors Bij all cells in = O(1)
Concurrent read is not allowed
Skew the memory – replicate – or - Parallel processor read
O(logn)
Each processor multiply Aij* Bij in = O(1)
Parallel Sum to get Cij = O(logn)
Store Value Cij = O(1)
T(n) = O(log n)
P(n) = O(n2)
W(n) = O(n2log n) = total # of all operations
Advantages of PRAM Model
PRAM removes algorithmic details concerning synchronization and communicating,
allowing the algorithm designer to focus on problem properties
A PRAM algorithm includes an explicit understanding of the operation performed at each
time unit and an explicit allocation of processors to jobs at each time unit.
PRAM designer paradigm have turned out to be robust and have been mapped efficiently
onto many other parallel models and even network models
Disadvantage of PRAM Model
Model Inaccuracies unbounded local memory(register)
All operation takes unit time processors run in lock steps
Unaccounted costs Non-local memory access
Latency Bandwidth Memory Access Contention
Conclusion
PRAM algorithm is the source of most fundamental ideas
It’s a source of inspiration for algorithms
PRAM is simple and easy to understand.
The improved locality of block matrix multiplication can also improve the running time on a uniprocessor, or distributed shared-memory multiprocessor with caches.
Reason: Higher Cache Higher Hit Rates