Upload
askbilladdmicrosoft
View
408
Download
3
Embed Size (px)
Citation preview
John Mellor-Crummey
Department of Computer ScienceRice University
Parallel Sorting
COMP 422 Lecture 19 5 April 2011
Topics for Today
• Introduction
• Issues in parallel sorting
• Sorting networks and Batcher’s bitonic sort
• Bubble sort and odd-even transposition sort
• Parallel quicksort
2
3
Why Study Parallel Sorting?
• One of the most common operations performed
• Close relation to task of routing on parallel computers—e.g. HPC Challenge RandomAccess benchmark
4
Sorting Algorithm Attributes
• Internal vs. external—internal: data fits in memory—external: uses tape or disk
• Comparison-based or not—comparison sort
– basic operation: compare elements and exchange as necessary– Θ(n log n) comparisons to sort n numbers
—non-comparison-based sort– e.g. radix sort based on the binary representation of data– Θ(n) operations to sort n numbers
• Parallel vs. sequential
5
Parallel Sorting Is Intrinsically Interesting
Different algorithms for different architecture variants
• Abstract parallel architecture—PRAM
• Network topology—hypercube—mesh
• Communication mechanism—shared address space—message passing
Today’s focus: parallel comparison-based sorting
6
Parallel Sorting Basics
• Where are the input and output lists stored? —we assume that both input and output lists are distributed
• What is a parallel sorted sequence? —sequence partitioned among the processors—each processor’s sub-sequence is sorted —all in Pj's sub-sequence < all in Pk's sub-sequence if j < k
– the best process numbering can depend on network topology
7
When partitioning is one element per process
1. Processes Pj and Pk send their elements to each other
Each process now has both elements
2. Process Pj keeps min(aj,ak), and Pk keeps max(aj, ak)
Pj Pk
aj ak
Element-wise Parallel Compare-Exchange
Pj Pk
aj, ak ak, aj
Pj Pk
min(aj, ak) max(ak, aj)
[communication step]
[comparison step]
8
Bulk Parallel Compare-Split
1. Send block of size n/p to partner
2. Each partner now has both blocks
3. Merge received block with own block
4. Retain only the appropriate half of the merged block Pi retains smaller values; process Pj retains larger values
9
Basic Analysis
• Assumptions—Pi and Pj are neighbors—communication channels are bi-directional
• Elementwise compare-exchange: 1 element per processor— time = ts + tw
• Bulk compare-split: n/p elements per processor —after compare-split on pair of processors Pi and Pj, i < j
– smaller n/p elements are at processor Pi – larger n/p elements at Pj
— time = ts+ twn/p– merge in O(n/p) time, as long as partial lists are sorted
10
Sorting Network
• Network of comparators designed for sorting
• Comparator : two inputs x and y; two outputs x' and y’—types
– increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)
– decreasing (denoted Ө) : x' = max(x,y) and y' = min(x,y)
• Sorting network speed is proportional to its depth
x min(x,y)
y max(x,y)
⊕
⊕
x max(x,y)
y min(x,y)
Ө
Ө
11
Sorting Networks
• Network structure: a series of columns
• Each column consists of a vector of comparators (in parallel)
• Sorting network organization:
12
Example: Bitonic Sorting Network
• Bitonic sequence—two parts: increasing and decreasing
– 〈1,2,4,7,6,0〉: first increases and then decreases (or vice versa)—cyclic rotation of a bitonic sequence is also considered bitonic
– 〈8,9,2,1,0,4〉: cyclic rotation of 〈0,4,8,9,2,1〉
• Bitonic sorting network—sorts n elements in Θ(log2 n) time—network kernel: rearranges a bitonic sequence into a sorted one
13
Bitonic Split
• Let s = 〈a0,a1,…,an-1〉 be a bitonic sequence such that—a0 ≤ a1 ≤ ··· ≤ an/2-1 , and
—an/2 ≥ an/2+1 ≥ ··· ≥ an-1
• Consider the following subsequences of s
s1 = 〈min(a0,an/2),min(a1,an/2+1),…,min(an/2-1,an-1)〉
s2 = 〈max(a0,an/2),max(a1,an/2+1),…,max(an/2-1,an-1)〉
• Sequence properties—s1 and s2 are both bitonic —∀x ∀y x ∈ s1, y ∈ s2 , x < y
• Apply recursively on s1 and s2 to produce a sorted sequence
• Works for any bitonic sequence, even if |s1| ≠ |s2|
Splitting Bitonic Sequences - I
14min max
Sequence propertiess1 and s2 are both bitonic ∀x ∀y x ∈ s1, y ∈ s2 , x < y
Splitting Bitonic Sequences - II
15min max
Sequence propertiess1 and s2 are both bitonic ∀x ∀y x ∈ s1, y ∈ s2 , x < y
16
Bitonic Merge
Sort a bitonic sequence through a series of bitonic splits
Example: use bitonic merge to sort 16-element bitonic sequence
How: perform a series of log2 16 = 4 bitonic splits
17
Sorting via Bitonic Merging Network
• Sorting network can implement bitonic merge algorithm —bitonic merging network
• Network structure—log2 n columns—each column
– n/2 comparators – performs one step of the bitonic merge
• Bitonic merging network with n inputs: ⊕BM[n]—yields increasing output sequence
• Replacing ⊕ comparators by Ө comparators: ӨBM[n]—yields decreasing output sequence
18
Bitonic Merging Network, ⊕ BM[16]
• Input: bitonic sequence— input wires are numbered 0,1,…, n - 1 (shown in binary)
• Output: sequence in sorted order
• Each column of comparators is drawn separately
19
Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
Two steps
• Build a bitonic sequence
• Sort it using a bitonic merging network
20
Building a Bitonic Sequence
• Build a single bitonic sequence from the given sequence —any sequence of length 2 is a bitonic sequence. —build bitonic sequence of length 4
– sort first two elements using ⊕BM[2] – sort next two using ӨBM[2]
• Repeatedly merge to generate larger bitonic sequences—⊕BM[k] & ӨBM[k]: bitonic merging networks of size k
21
Building a Bitonic Sequence
Input: sequence of 16 unordered numbers
Output: a bitonic sequence of 16 numbers
22
Bitonic Sort, n = 16
• First 3 stages create bitonic sequence input to stage 4
• Last stage (⊕BM[16]) yields sorted sequence
23
Complexity of Bitonic Sorting Networks
• Depth of the network is Θ(log2 n)—log2 n merge stages—jth merge stage is log2 2j = j
—depth =
• Each stage of the network contains n/2 comparators
• Complexity of serial implementation = Θ(n log2 n)€
log2 2j
j=1
log2 n
∑ = ji=1
log2 n
∑ = (log2 n +1)(log2 n) /2 = θ(log2 n)
24
Mapping Bitonic Sort to a Hypercube
Consider one item per processor
• How do we map wires in bitonic network onto a hypercube?
• In earlier examples—compare-exchange between two wires when labels differ in 1 bit
• Direct mapping of wires to processors—all communication is nearest neighbor
25
Mapping Bitonic Merge to a Hypercube
Communication during the last merge stage of bitonic sort
• Each number is mapped to a hypercube node
• Each connection represents a compare-exchange
26
Mapping Bitonic Sort to Hypercubes
Communication in bitonic sort on a hypercube
• Processes communicate along dims shown in each stage
• Algorithm is cost optimal w.r.t. its serial counterpart
• Not cost optimal w.r.t. the best sorting algorithm
Batcher’s Bitonic Sort in NESL
function merge(a) =if (#a == 1) then aelse let halves = bottop(a); mins = {min(x, y) : x in halves[0]; y in halves[1]}; maxs = {max(x, y) : x in halves[0]; y in halves[1]}; in flatten({merge(x) : x in [mins,maxs]});
function bitonic_sort(a) =if (#a == 1) then aelse let b = {bitonic_sort(x) : x in bottop(a)}; in merge(b[0]++reverse(b[1]));bitonic_sort([2, 3, -7, 6, 5, 22, -8, 12]);
27Try it at: http://www.cs.rice.edu/~johnmc/nesl.html
28
Bubble Sort and Variants
Sequential bubble sort algorithm
Compares and exchanges adjacent elements sequence
29
Bubble Sort and Variants
• Bubble sort complexity: Θ(n2)
• Difficult to parallelize—algorithm has no concurrency
• A simple variant uncovers concurrency
30
Sequential Odd-Even Transposition Sort
31
Odd-Even Transposition Sort, n = 8
In each phase, n = 8 elements are compared
32
Odd-Even Transposition Sort
• After n phases of odd-even exchanges, sequence is sorted
• Each phase of algorithm requires Θ(n) comparisons
• Serial complexity is Θ(n2)
33
Parallel Odd-Even Transposition
Consider one item per processor
• n iterations—in each iteration, each processor does one compare-exchange
• Parallel run time of this formulation is Θ(n)
• Cost optimal with respect to the base serial algorithm —but not the optimal one!
34
Parallel Odd-Even Transposition Sort
note: if partner id < 1 or > n, then skip compare
35
Quicksort
• Popular sequential sorting algorithm —simplicity, low overhead, optimal average complexity
• Operation—select an entry in the sequence to be the pivot —divide the sequence into two halves
– one with all elements less than the pivot – other greater
• Apply process recursively to each of sublist
36
Parallelizing Quicksort
• First, recursive decomposition —partition the list serially —handle each subproblems on a different processor
• Time for this algorithm is lower-bounded by Ω(n)!
• Can we parallelize the partitioning step? —can we use n processors to partition a list of length n around a
pivot in O(1) time?
• Tricky on real machines
37
Practical Parallel Quicksort
Each processor initially responsible for n/p elements
• Shared memory formulation—select first pivot & broadcast—each processor partitions own data—globally rearrange data into smaller and larger parts (in place)—recurse with proportional # processors on each part
• Message passing formulation—partitioning
– each processor first partitions local portion of array– determine which processes will be responsible for each partition– (based on size of smaller than pivot and larger than pivot groups)– divide up the data among the processor subsets responsible for each part
—continue recursively
Data Parallel Quicksort in NESL
• Total work is O(n log n)• Recursion depth is O(log n)• Depth of each operation is constant 38
12345678
• Total depth is O(log n) as well
39
Other Sorting Algorithms• Shellsort - another variant of bubble sort
—two stage process– log p rounds of long distance exchanges– followed by rounds of odd-even transposition sort until done
—key idea: long distance moves of first stage reduce number of rounds necessary in second stage
• Radix sort : in a series of rounds, sort elements into buckets by digit
• Bucket and sample sort—assumes evenly distributed items in an interval—buckets represent evenly-sized subintervals
• Enumeration sort: —determine rank of each element—place it in the correct position—CRCW PRAM algorithm: n2 processors, sort in Θ(1) time
– assumes that all concurrent writes to a location deposit sum n processes in column j test element j against the rest; write 1 into C[j]
– place A[j] into A[C[j]]
Other Sorting Algorithms (Cont)
• Histogram Sorting —goal: divide keys into p evenly sized pieces—use an iterative approach to do so—initiating processor broadcasts k > p-1 splitter guesses—each processor determines how many keys fall in each bin—sum histogram with global reduction—one processor examines guesses to see which are satisfactory—broadcast finalized splitters and number of keys for each
processor—each processor sends local data to appropriate processors using
all-to-all communication—each processor merges chunks it receives— Kale and Solomonik improved this (IPDPS 2010)
40
41
References
• Adapted from slides “Sorting” by Ananth Grama
• Based on Chapter 9 of “Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003
• “Programming Parallel Algorithms.” Guy Blelloch. Communications of the ACM, volume 39, number 3, March 1996.
• http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#sort
• Edgar Solomonik and Laxmikant V. Kale. Highly Scalable Parallel Sorting. Proceedings of IPDPS 2010.