Comp422 2011 Lecture19 Sorting

John Mellor-Crummey

Department of Computer ScienceRice University

[email protected]

Parallel Sorting

COMP 422 Lecture 19 5 April 2011

Topics for Today

• Introduction

• Issues in parallel sorting

• Sorting networks and Batcher’s bitonic sort

• Bubble sort and odd-even transposition sort

• Parallel quicksort

2

3

Why Study Parallel Sorting?

• One of the most common operations performed

• Close relation to task of routing on parallel computers—e.g. HPC Challenge RandomAccess benchmark

4

Sorting Algorithm Attributes

• Internal vs. external—internal: data fits in memory—external: uses tape or disk

• Comparison-based or not—comparison sort

– basic operation: compare elements and exchange as necessary– Θ(n log n) comparisons to sort n numbers

—non-comparison-based sort– e.g. radix sort based on the binary representation of data– Θ(n) operations to sort n numbers

• Parallel vs. sequential

5

Parallel Sorting Is Intrinsically Interesting

Different algorithms for different architecture variants

• Abstract parallel architecture—PRAM

• Network topology—hypercube—mesh

• Communication mechanism—shared address space—message passing

Today’s focus: parallel comparison-based sorting

6

Parallel Sorting Basics

• Where are the input and output lists stored? —we assume that both input and output lists are distributed

• What is a parallel sorted sequence? —sequence partitioned among the processors—each processor’s sub-sequence is sorted —all in Pj's sub-sequence < all in Pk's sub-sequence if j < k

– the best process numbering can depend on network topology

7

When partitioning is one element per process

1. Processes Pj and Pk send their elements to each other

Each process now has both elements

2. Process Pj keeps min(aj,ak), and Pk keeps max(aj, ak)

Pj Pk

aj ak

Element-wise Parallel Compare-Exchange

Pj Pk

aj, ak ak, aj

Pj Pk

min(aj, ak) max(ak, aj)

[communication step]

[comparison step]

8

Bulk Parallel Compare-Split

1. Send block of size n/p to partner

2. Each partner now has both blocks

3. Merge received block with own block

4. Retain only the appropriate half of the merged block Pi retains smaller values; process Pj retains larger values

9

Basic Analysis

• Assumptions—Pi and Pj are neighbors—communication channels are bi-directional

• Elementwise compare-exchange: 1 element per processor— time = ts + tw

• Bulk compare-split: n/p elements per processor —after compare-split on pair of processors Pi and Pj, i < j

– smaller n/p elements are at processor Pi – larger n/p elements at Pj

— time = ts+ twn/p– merge in O(n/p) time, as long as partial lists are sorted

10

Sorting Network

• Network of comparators designed for sorting

• Comparator : two inputs x and y; two outputs x' and y’—types

– increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)

– decreasing (denoted Ө) : x' = max(x,y) and y' = min(x,y)

• Sorting network speed is proportional to its depth

x min(x,y)

y max(x,y)

⊕

⊕

x max(x,y)

y min(x,y)

Ө

Ө

11

Sorting Networks

• Network structure: a series of columns

• Each column consists of a vector of comparators (in parallel)

• Sorting network organization:

12

Example: Bitonic Sorting Network

• Bitonic sequence—two parts: increasing and decreasing

– 〈1,2,4,7,6,0〉: first increases and then decreases (or vice versa)—cyclic rotation of a bitonic sequence is also considered bitonic

– 〈8,9,2,1,0,4〉: cyclic rotation of 〈0,4,8,9,2,1〉

• Bitonic sorting network—sorts n elements in Θ(log2 n) time—network kernel: rearranges a bitonic sequence into a sorted one

13

Bitonic Split

• Let s = 〈a0,a1,…,an-1〉 be a bitonic sequence such that—a0 ≤ a1 ≤ ··· ≤ an/2-1 , and

—an/2 ≥ an/2+1 ≥ ··· ≥ an-1

• Consider the following subsequences of s

s1 = 〈min(a0,an/2),min(a1,an/2+1),…,min(an/2-1,an-1)〉

s2 = 〈max(a0,an/2),max(a1,an/2+1),…,max(an/2-1,an-1)〉

• Sequence properties—s1 and s2 are both bitonic —∀x ∀y x ∈ s1, y ∈ s2 , x < y

• Apply recursively on s1 and s2 to produce a sorted sequence

• Works for any bitonic sequence, even if |s1| ≠ |s2|

Splitting Bitonic Sequences - I

14min max

Sequence propertiess1 and s2 are both bitonic ∀x ∀y x ∈ s1, y ∈ s2 , x < y

Splitting Bitonic Sequences - II

15min max

Sequence propertiess1 and s2 are both bitonic ∀x ∀y x ∈ s1, y ∈ s2 , x < y

16

Bitonic Merge

Sort a bitonic sequence through a series of bitonic splits

Example: use bitonic merge to sort 16-element bitonic sequence

How: perform a series of log2 16 = 4 bitonic splits

17

Sorting via Bitonic Merging Network

• Sorting network can implement bitonic merge algorithm —bitonic merging network

• Network structure—log2 n columns—each column

– n/2 comparators – performs one step of the bitonic merge

• Bitonic merging network with n inputs: ⊕BM[n]—yields increasing output sequence

• Replacing ⊕ comparators by Ө comparators: ӨBM[n]—yields decreasing output sequence

18

Bitonic Merging Network, ⊕ BM[16]

• Input: bitonic sequence— input wires are numbered 0,1,…, n - 1 (shown in binary)

• Output: sequence in sorted order

• Each column of comparators is drawn separately

19

Bitonic Sort

How do we sort an unsorted sequence using a bitonic merge?

Two steps

• Build a bitonic sequence

• Sort it using a bitonic merging network

20

Building a Bitonic Sequence

• Build a single bitonic sequence from the given sequence —any sequence of length 2 is a bitonic sequence. —build bitonic sequence of length 4

– sort first two elements using ⊕BM[2] – sort next two using ӨBM[2]

• Repeatedly merge to generate larger bitonic sequences—⊕BM[k] & ӨBM[k]: bitonic merging networks of size k

21

Building a Bitonic Sequence

Input: sequence of 16 unordered numbers

Output: a bitonic sequence of 16 numbers

22

Bitonic Sort, n = 16

• First 3 stages create bitonic sequence input to stage 4

• Last stage (⊕BM[16]) yields sorted sequence

23

Complexity of Bitonic Sorting Networks

• Depth of the network is Θ(log2 n)—log2 n merge stages—jth merge stage is log2 2j = j

—depth =

• Each stage of the network contains n/2 comparators

• Complexity of serial implementation = Θ(n log2 n)€

log2 2j

j=1

log2 n

∑ = ji=1

log2 n

∑ = (log2 n +1)(log2 n) /2 = θ(log2 n)

24

Mapping Bitonic Sort to a Hypercube

Consider one item per processor

• How do we map wires in bitonic network onto a hypercube?

• In earlier examples—compare-exchange between two wires when labels differ in 1 bit

• Direct mapping of wires to processors—all communication is nearest neighbor

25

Mapping Bitonic Merge to a Hypercube

Communication during the last merge stage of bitonic sort

• Each number is mapped to a hypercube node

• Each connection represents a compare-exchange

26

Mapping Bitonic Sort to Hypercubes

Communication in bitonic sort on a hypercube

• Processes communicate along dims shown in each stage

• Algorithm is cost optimal w.r.t. its serial counterpart

• Not cost optimal w.r.t. the best sorting algorithm

Batcher’s Bitonic Sort in NESL

function merge(a) =if (#a == 1) then aelse let halves = bottop(a); mins = {min(x, y) : x in halves[0]; y in halves[1]}; maxs = {max(x, y) : x in halves[0]; y in halves[1]}; in flatten({merge(x) : x in [mins,maxs]});

function bitonic_sort(a) =if (#a == 1) then aelse let b = {bitonic_sort(x) : x in bottop(a)}; in merge(b[0]++reverse(b[1]));bitonic_sort([2, 3, -7, 6, 5, 22, -8, 12]);

27Try it at: http://www.cs.rice.edu/~johnmc/nesl.html

28

Bubble Sort and Variants

Sequential bubble sort algorithm

Compares and exchanges adjacent elements sequence

29

Bubble Sort and Variants

• Bubble sort complexity: Θ(n2)

• Difficult to parallelize—algorithm has no concurrency

• A simple variant uncovers concurrency

30

Sequential Odd-Even Transposition Sort

31

Odd-Even Transposition Sort, n = 8

In each phase, n = 8 elements are compared

32

Odd-Even Transposition Sort

• After n phases of odd-even exchanges, sequence is sorted

• Each phase of algorithm requires Θ(n) comparisons

• Serial complexity is Θ(n2)

33

Parallel Odd-Even Transposition

Consider one item per processor

• n iterations—in each iteration, each processor does one compare-exchange

• Parallel run time of this formulation is Θ(n)

• Cost optimal with respect to the base serial algorithm —but not the optimal one!

34

Parallel Odd-Even Transposition Sort

note: if partner id < 1 or > n, then skip compare

35

Quicksort

• Popular sequential sorting algorithm —simplicity, low overhead, optimal average complexity

• Operation—select an entry in the sequence to be the pivot —divide the sequence into two halves

– one with all elements less than the pivot – other greater

• Apply process recursively to each of sublist

36

Parallelizing Quicksort

• First, recursive decomposition —partition the list serially —handle each subproblems on a different processor

• Time for this algorithm is lower-bounded by Ω(n)!

• Can we parallelize the partitioning step? —can we use n processors to partition a list of length n around a

pivot in O(1) time?

• Tricky on real machines

37

Practical Parallel Quicksort

Each processor initially responsible for n/p elements

• Shared memory formulation—select first pivot & broadcast—each processor partitions own data—globally rearrange data into smaller and larger parts (in place)—recurse with proportional # processors on each part

• Message passing formulation—partitioning

– each processor first partitions local portion of array– determine which processes will be responsible for each partition– (based on size of smaller than pivot and larger than pivot groups)– divide up the data among the processor subsets responsible for each part

—continue recursively

Data Parallel Quicksort in NESL

• Total work is O(n log n)• Recursion depth is O(log n)• Depth of each operation is constant 38

12345678

• Total depth is O(log n) as well

39

Other Sorting Algorithms• Shellsort - another variant of bubble sort

—two stage process– log p rounds of long distance exchanges– followed by rounds of odd-even transposition sort until done

—key idea: long distance moves of first stage reduce number of rounds necessary in second stage

• Radix sort : in a series of rounds, sort elements into buckets by digit

• Bucket and sample sort—assumes evenly distributed items in an interval—buckets represent evenly-sized subintervals

• Enumeration sort: —determine rank of each element—place it in the correct position—CRCW PRAM algorithm: n2 processors, sort in Θ(1) time

– assumes that all concurrent writes to a location deposit sum n processes in column j test element j against the rest; write 1 into C[j]

– place A[j] into A[C[j]]

Other Sorting Algorithms (Cont)

• Histogram Sorting —goal: divide keys into p evenly sized pieces—use an iterative approach to do so—initiating processor broadcasts k > p-1 splitter guesses—each processor determines how many keys fall in each bin—sum histogram with global reduction—one processor examines guesses to see which are satisfactory—broadcast finalized splitters and number of keys for each

processor—each processor sends local data to appropriate processors using

all-to-all communication—each processor merges chunks it receives— Kale and Solomonik improved this (IPDPS 2010)

40

41

References

• Adapted from slides “Sorting” by Ananth Grama

• Based on Chapter 9 of “Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003

• “Programming Parallel Algorithms.” Guy Blelloch. Communications of the ACM, volume 39, number 3, March 1996.

• http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#sort

• Edgar Solomonik and Laxmikant V. Kale. Highly Scalable Parallel Sorting. Proceedings of IPDPS 2010.

Documents

Comp422 2011 Lecture19 Sorting