
1  CS 770G - Parallel Algorithms in Scientific Computing
   Lecture 14: Parallel Sorting
   July 18, 2001

2  References

• Introduction to Parallel Computing, Kumar, Grama, Gupta, Karypis, Benjamin Cummings.
• A portion of the notes comes from Prof. J. Demmel's CS267 course at UC Berkeley.

3  Issues in Sorting on Parallel Computers

• Where the input and output sequences are stored:
  – Input: unsorted sequence distributed uniformly among the processors.
  – Output: sorted sequence distributed across the processors.
• Global ordering is defined by the processor enumeration.
• Compare-exchange or compare-split operations on nonlocal elements.

4  Compare-Exchange

• One element per processor.
• Pi & Pj compare their elements ai & aj:
  – They send their elements to each other.
  – Pi keeps min(ai, aj).
  – Pj keeps max(ai, aj).
• Communication time: Tcomm = ts + tw.

  [Figure: Pi holds ai and Pj holds aj; after the exchange both hold (ai, aj); Pi retains min{ai, aj} and Pj retains max{ai, aj}.]


5  Compare-Split

• More than one element per processor.
• Pi & Pj compare their blocks Ai & Aj:
  – Sort their blocks locally.
  – Send their blocks to each other.
  – Each proc merges the two sorted blocks and retains the appropriate half (see the sketch below).
• Communication time: Tcomm = ts + tw (n/p).

  [Figure: Pi holds {1, 3} and Pj holds {2, 4}; after the exchange both hold {1, 3, 2, 4} and merge them to {1, 2, 3, 4}; Pi retains {1, 2} and Pj retains {3, 4}.]
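A minimal Python sketch of a single compare-split step (the message passing is elided; both input blocks are assumed to be locally sorted, and the function returns the halves that the lower- and higher-ranked processors would keep):

    import heapq

    def compare_split(block_i, block_j):
        # Each processor merges the two sorted blocks and keeps one half:
        # the lower-ranked processor keeps the smaller elements.
        merged = list(heapq.merge(block_i, block_j))
        half = len(merged) // 2
        return merged[:half], merged[half:]

    # The example above: Pi holds {1, 3}, Pj holds {2, 4}.
    print(compare_split([1, 3], [2, 4]))   # ([1, 2], [3, 4])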

6  Bubble Sort

• Algorithm:

    procedure BUBBLE_SORT(n)
    begin
       for i := n - 1 downto 1 do
          for j := 1 to i do
             compare-exchange(a_j, a_{j+1});
    end BUBBLE_SORT

• O(n) per iteration, n iterations ⇒ complexity = O(n²).
• Inherently sequential: compares adjacent pairs in order.

7  Odd-Even Transposition

• Bubble sort variant.
• Sorts n elements in n phases.
• Each phase requires n/2 compare-exchange operations.
• Alternates between 2 kinds of phases: odd & even.
• Let {a1, a2, …, an} be the sequence to be sorted.
  – Odd phase: compare-exchange the pairs (a1, a2), (a3, a4), …, (an-1, an).
  – Even phase: compare-exchange the pairs (a2, a3), (a4, a5), …, (an-2, an-1).
• n comparisons per phase, n phases ⇒ complexity = O(n²).

8  Odd-Even Transposition (cont.)

• Sequential algorithm:

    procedure ODD_EVEN(n)
    begin
       for i := 1 to n do
          if i is odd then
             for j := 0 to n/2 - 1 do
                compare-exchange(a_{2j+1}, a_{2j+2});
          if i is even then
             for j := 1 to n/2 - 1 do
                compare-exchange(a_{2j}, a_{2j+1});
    end ODD_EVEN
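A runnable Python version of the sequential algorithm above (0-based indexing, so the odd phase compares the pairs starting at index 0):

    def odd_even_transposition(a):
        # n phases: odd phases compare-exchange (a[0],a[1]), (a[2],a[3]), ...
        #           even phases compare-exchange (a[1],a[2]), (a[3],a[4]), ...
        n = len(a)
        for phase in range(1, n + 1):
            start = 0 if phase % 2 == 1 else 1
            for j in range(start, n - 1, 2):
                if a[j] > a[j + 1]:              # compare-exchange
                    a[j], a[j + 1] = a[j + 1], a[j]
        return a

    print(odd_even_transposition([3, 2, 3, 8, 5, 6, 4, 1]))
    # [1, 2, 3, 3, 4, 5, 6, 8]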


9  Example

    Start of phase 1 (odd):  3 2 3 8 5 6 4 1
    Start of phase 2 (even): 2 3 3 8 5 6 1 4
    Start of phase 3 (odd):  2 3 3 5 8 1 6 4
    Start of phase 4 (even): 2 3 3 5 1 8 4 6

10  Example (cont.)

    Start of phase 5 (odd):  2 3 3 1 5 4 8 6
    Start of phase 6 (even): 2 3 1 3 4 5 6 8
    Start of phase 7 (odd):  2 1 3 3 4 5 6 8
    Start of phase 8 (even): 1 2 3 3 4 5 6 8

  Phase 8 makes no further exchanges; the sequence is sorted.

11  Parallel Implementation

• One element per processor.
• Compare-exchange operations on pairs of elements are done simultaneously.
• Odd phase: proc 2i-1 compare-exchanges its element with proc 2i.
• Even phase: proc 2i compare-exchanges its element with proc 2i+1.

12  Parallel Complexity

• In each phase, the complexity of a compare-exchange = O(1).
• A total of n phases ⇒ complexity = O(n).
• Sequential complexity of the best sorting algorithm = O(n log n).
• Hence, odd-even transposition sort is not cost-optimal: the processor-time product is n · O(n) = O(n²).


13  Parallel Implementation (cont.)

• More than one element per processor: p < n, each processor holds a block of n/p elements.
• Complexity of the local sort = O((n/p) log(n/p)).
• p phases (p/2 odd & p/2 even), as sketched below.
• Odd phase: proc 2i-1 compare-splits its block with proc 2i.
• Even phase: proc 2i compare-splits its block with proc 2i+1.
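A single-process Python simulation of this block version (an illustration only; a real implementation would run one process per block, e.g. under MPI, and exchange blocks with the neighbouring process in each phase):

    def compare_split(low, high):
        # Merge the two sorted blocks; the lower-ranked processor keeps the
        # smaller half, the higher-ranked one keeps the larger half.
        merged = sorted(low + high)
        return merged[:len(low)], merged[len(low):]

    def block_odd_even_sort(blocks):
        # blocks[i] is the (unsorted) block held by simulated processor i.
        p = len(blocks)
        blocks = [sorted(b) for b in blocks]        # local sort: O((n/p) log(n/p))
        for phase in range(1, p + 1):
            start = 0 if phase % 2 == 1 else 1      # odd phase pairs (P1,P2), (P3,P4), ...
            for i in range(start, p - 1, 2):
                blocks[i], blocks[i + 1] = compare_split(blocks[i], blocks[i + 1])
        return blocks

    print(block_odd_even_sort([[8, 3], [5, 6], [2, 4], [7, 1]]))
    # [[1, 2], [3, 4], [5, 6], [7, 8]]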

14  Parallel Performance

• Parallel run-time (local sort + comparisons + communication):
    Tp = O((n/p) log(n/p)) + O(n) + O(n)
• Speedup:
    S = Ts / Tp = O(n log n) / ( O((n/p) log(n/p)) + O(n) )
• Efficiency:
    E = S / p = 1 / ( 1 - O(log p / log n) + O(p / log n) )
• Cost-optimal ⇒ p = O(log n).
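To make these formulas concrete, the following sketch plugs illustrative unit constants (not measured values; ts, tw, tc are assumptions made only for this example, in the spirit of the Tcomm model above) into the three terms of Tp and prints run-time, speedup, and efficiency:

    import math

    def odd_even_cost_model(n, p, ts=1.0, tw=1.0, tc=1.0):
        # tc: cost of one comparison; ts, tw: message startup and per-word costs.
        local_sort = tc * (n / p) * math.log2(n / p)   # O((n/p) log(n/p))
        comparisons = tc * n                           # p phases of O(n/p) merging work
        communication = p * ts + tw * n                # p phases of ts + tw*(n/p)
        tp = local_sort + comparisons + communication
        serial = tc * n * math.log2(n)                 # O(n log n) sequential sort
        return tp, serial / tp, serial / (p * tp)      # run-time, speedup, efficiency

    for p in (2, 8, 32):
        tp, speedup, eff = odd_even_cost_model(n=1 << 20, p=p)
        print(f"p={p:2d}  Tp={tp:12.0f}  S={speedup:5.2f}  E={eff:4.2f}")

As p grows with n fixed, the printed efficiency drops, which is what the efficiency formula predicts unless p grows no faster than O(log n).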

15  Quicksort

• Divide-and-conquer.
• (Average) complexity = O(n log n).
• Let the sequence be A[1..n].
• Two steps:
  – Divide: given A[q..r], divide it into two subarrays A[q..s] & A[s+1..r] such that each element of A[q..s] ≤ each element of A[s+1..r].
  – Conquer: apply Quicksort recursively to the subsequences.
• Partitioning:
  – Select a pivot x.
  – One subsequence contains the elements ≤ x; the other subsequence contains the elements > x.

16  Quicksort Algorithm

    procedure QUICKSORT(A, q, r)
    begin
       if q < r then
          x := A[q];  s := q;
          for i := q + 1 to r do
             if A[i] ≤ x then
                s := s + 1;
                swap(A[s], A[i]);
             end if
          end loop-i
          swap(A[q], A[s]);
          QUICKSORT(A, q, s);
          QUICKSORT(A, s + 1, r);
       end if
    end QUICKSORT
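A runnable Python version of this partitioning scheme. The one deviation from the pseudocode above is that the first recursive call excludes the pivot position (A[q..s-1] instead of A[q..s]), which keeps the recursion well-behaved when the input contains duplicate keys:

    def quicksort(A, q, r):
        # Pivot is A[q]; s tracks the end of the "<= pivot" region.
        if q < r:
            x = A[q]
            s = q
            for i in range(q + 1, r + 1):
                if A[i] <= x:
                    s += 1
                    A[s], A[i] = A[i], A[s]
            A[q], A[s] = A[s], A[q]          # place the pivot between the two parts
            quicksort(A, q, s - 1)
            quicksort(A, s + 1, r)

    data = [3, 2, 3, 8, 5, 6, 4, 1]
    quicksort(data, 0, len(data) - 1)
    print(data)                              # [1, 2, 3, 3, 4, 5, 6, 8]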


17  Parallel Quicksort

• Perform quicksort on the subsequences in parallel.
• Start with a single processor:
  – Assign one of the two subproblems to another processor (see the sketch below).
  – Each of these processors sorts its array using quicksort and assigns one of its subproblems to other processors.
  – The algorithm terminates when the arrays cannot be partitioned further.
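A minimal sketch of this scheme that uses Python threads to stand in for the extra processors (an assumption for illustration only; CPython's GIL means this yields no real speedup, and a production version would use separate processes or MPI). After each partition, the right subproblem is handed to a new worker while the current worker keeps the left one:

    import threading

    def parallel_quicksort(A, q, r, depth=3):
        # Same partitioning as in the sequential version; 'depth' limits how many
        # levels of new workers are spawned before falling back to plain recursion.
        if q >= r:
            return
        x, s = A[q], q
        for i in range(q + 1, r + 1):
            if A[i] <= x:
                s += 1
                A[s], A[i] = A[i], A[s]
        A[q], A[s] = A[s], A[q]
        if depth > 0:
            # "Assign one of the subproblems to another processor."
            worker = threading.Thread(target=parallel_quicksort,
                                      args=(A, s + 1, r, depth - 1))
            worker.start()
            parallel_quicksort(A, q, s - 1, depth - 1)   # keep the other subproblem
            worker.join()
        else:
            parallel_quicksort(A, q, s - 1, 0)
            parallel_quicksort(A, s + 1, r, 0)

    data = [3, 2, 3, 8, 5, 6, 4, 1]
    parallel_quicksort(data, 0, len(data) - 1)
    print(data)                                          # [1, 2, 3, 3, 4, 5, 6, 8]

The two workers touch disjoint index ranges of A, so no locking is needed.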

18  Parallel Complexity

• Problem: each partitioning step is done by a single processor.
• The very first partition already costs O(n), so the parallel run-time is at least O(n).
• Processor-time product = O(n²) ⇒ not cost-optimal.
• A parallel partitioning step is needed:
  – PRAM model.

19  Parallel CRCW PRAM Model

• Concurrent-read, concurrent-write parallel random-access machine.
• Write conflicts are resolved arbitrarily.
• Quicksort can be interpreted as constructing a binary tree:
  – The pivot is the root.
  – Elements ≤ the pivot go to the left subtree.
  – Elements > the pivot go to the right subtree.
• The sorted sequence is obtained by an inorder traversal.

20  Parallel PRAM Algorithm

• Select a pivot.
• Partition into 2 parts.
• Subsequent pivot elements, one for each new subtree, are then selected in parallel.
• In each iteration, a level of the tree is constructed in O(1) time.
• Thus, the average complexity = depth of the tree = O(log n).
• The sorted sequence is obtained by an inorder traversal in O(1) time.
• Thus, the algorithm is cost-optimal.


21  BuildTree Algorithm

    procedure BUILD_TREE(A[1..n])
    begin
       for each process i do
          root := i;
          parent_i := root;
          leftchild[i] := rightchild[i] := n + 1;
       end for
       repeat for each process i ≠ root do
          if (A[i] < A[parent_i]) or (A[i] = A[parent_i] and i < parent_i) then
             leftchild[parent_i] := i;
             if i = leftchild[parent_i] then exit
             else parent_i := leftchild[parent_i];
          else
             rightchild[parent_i] := i;
             if i = rightchild[parent_i] then exit
             else parent_i := rightchild[parent_i];
          end if
       end repeat
    end BUILD_TREE
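A sequential Python simulation of the idea behind BUILD_TREE (a sketch, not a PRAM implementation): every element repeatedly tries to attach itself below its current parent, and the arbitrary CRCW write resolution is mimicked by letting the last writer in each round win; taking process 0 as the winner of the root write is an assumption made here.

    def build_tree_sort(A):
        n = len(A)
        root = 0                                  # pretend process 0 won the root write
        parent = [root] * n
        left = [None] * n                         # None plays the role of the n+1 sentinel
        right = [None] * n
        active = [i for i in range(n) if i != root]

        def goes_left(i, p):
            # Tie-break equal keys by index, as in the pseudocode.
            return A[i] < A[p] or (A[i] == A[p] and i < p)

        while active:
            for i in active:                      # one CRCW round: everybody writes,
                p = parent[i]                     # one arbitrary writer per pointer wins
                if goes_left(i, p):
                    left[p] = i
                else:
                    right[p] = i
            survivors = []
            for i in active:
                p = parent[i]
                winner = left[p] if goes_left(i, p) else right[p]
                if winner != i:                   # lost the write: descend to the winner
                    parent[i] = winner
                    survivors.append(i)
            active = survivors

        def inorder(i):                           # inorder traversal gives the sorted order
            return [] if i is None else inorder(left[i]) + [A[i]] + inorder(right[i])

        return inorder(root)

    print(build_tree_sort([3, 2, 3, 8, 5, 6, 4, 1]))   # [1, 2, 3, 3, 4, 5, 6, 8]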

22  Bitonic Sort

• A bitonic sorting network sorts n elements in O(log² n) time.
• The key operation is rearranging a bitonic sequence into a sorted sequence.
• Bitonic sequence: {a0, a1, …, an-1} with the property that either
  (1) there exists an index i such that {a0, …, ai} is monotonically increasing and {ai+1, …, an-1} is monotonically decreasing, or
  (2) there exists a cyclic shift of the indices such that (1) holds.

23  Sorting on Different Networks

  [Figure: three settings for parallel sorting. PRAM sorts: processors p sharing a common memory (MEM). Sorting networks: fixed networks of comparators. LogP sorts, i.e. sorting on machine X over network Y: processor/memory (P, M) nodes connected by a network.]

24  Bitonic Seq. to Increasing Seq.

• Let s = {a0, a1, …, an-1} be a bitonic sequence such that a0 ≤ a1 ≤ … ≤ an/2-1 and an/2 ≥ an/2+1 ≥ … ≥ an-1.
• Define 2 subsequences of length n/2:
    s1 = ( min{a0, an/2}, min{a1, an/2+1}, …, min{an/2-1, an-1} )
    s2 = ( max{a0, an/2}, max{a1, an/2+1}, …, max{an/2-1, an-1} )
• In sequence s1, there is an element bi = min{ai, an/2+i} such that all the elements before bi come from the increasing part of the original sequence, and all those after it come from the decreasing part.


25  Bitonic Seq. to Increasing Seq. (cont.)

• In sequence s2, the element bi' = max{ai, an/2+i} is such that all the elements before bi' come from the decreasing part of the original sequence, and all those after it come from the increasing part.
• Every element of the first sequence s1 ≤ every element of the second sequence s2.
• Both s1 and s2 are bitonic sequences.
• Thus, the initial problem of rearranging a bitonic sequence of size n is reduced to rearranging 2 smaller bitonic sequences of size n/2 and concatenating the results.

26  Bitonic Seq. to Increasing Seq. (cont.)

• Repeat the process recursively until subsequences of size 1 are obtained.
• At that point, the output is sorted in monotonically increasing order.
• Since each bitonic split halves the size of the problem, the number of split levels is log n.
• Sorting a bitonic sequence using bitonic splits is called a bitonic merge (see the sketch below).
• It can be implemented easily on a network of comparators.
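A Python sketch of the bitonic merge for sequences whose length is a power of two; the real comparator network performs the min/max compare-exchanges of each level in parallel, while this version simply computes them in list comprehensions:

    def bitonic_merge(seq, ascending=True):
        # One bitonic split (the min/max compare-exchanges of one level),
        # then merge each bitonic half recursively.
        n = len(seq)
        if n == 1:
            return seq
        half = n // 2
        lo = [min(seq[i], seq[i + half]) for i in range(half)]
        hi = [max(seq[i], seq[i + half]) for i in range(half)]
        if not ascending:
            lo, hi = hi, lo        # for a decreasing merge, keep the maxima first
        return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

    # {3, 5, 8, 9, 7, 4, 2, 1} is bitonic: increasing, then decreasing.
    print(bitonic_merge([3, 5, 8, 9, 7, 4, 2, 1]))      # [1, 2, 3, 4, 5, 7, 8, 9]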

27  Sorting Unordered Elements

• Any sequence of 2 elements forms a bitonic sequence.
• Hence, any unsorted sequence is a concatenation of bitonic sequences of size 2.
• Merge adjacent bitonic sequences, alternately into increasing and decreasing order.
• By definition, the sequence obtained by concatenating an increasing and a decreasing sequence is bitonic.
• By merging larger and larger bitonic sequences, we eventually obtain a single bitonic sequence of size n, which a final bitonic merge turns into the sorted sequence (see the sketch below).
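Building on the bitonic_merge sketch above (assumed to be in scope), a complete sequential bitonic sort for power-of-two input sizes: sort the first half into increasing order and the second half into decreasing order so that their concatenation is bitonic, then merge:

    def bitonic_sort(seq, ascending=True):
        # Assumes len(seq) is a power of two and that bitonic_merge from the
        # previous sketch is defined.
        if len(seq) <= 1:
            return seq
        half = len(seq) // 2
        increasing = bitonic_sort(seq[:half], True)
        decreasing = bitonic_sort(seq[half:], False)
        return bitonic_merge(increasing + decreasing, ascending)

    print(bitonic_sort([3, 2, 3, 8, 5, 6, 4, 1]))       # [1, 2, 3, 3, 4, 5, 6, 8]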

28  Bitonic Sort Algorithm

29  Parallel Bitonic Sort

30  Parallel Bitonic Sort (cont.)

31  Parallel Performance

32  Parallel Sorting Comparison