
Page 1: Parallel Sorting


Parallel Sorting

James Demmel

www.cs.berkeley.edu/~demmel/cs267_Spr09

Page 2: Parallel Sorting

Some Sorting algorithms

• Choice of algorithm depends on:
  • Is the data all in memory, or on disk/tape?
  • Do we compare operands, or just use their values?
  • Sequential or parallel? Shared or distributed memory? SIMD or not?

• We will consider all data in memory and parallel:
  • Bitonic Sort
    • Naturally parallel, suitable for SIMD architectures
  • Sample Sort
    • Good generalization of Quick Sort to the parallel case
  • Radix Sort

• Data measured on CM-5
  • Written in Split-C (precursor of UPC)
  • www.cs.berkeley.edu/~culler/papers/sort.ps

Page 3: Parallel Sorting

LogP – model to predict, understand performance

[Diagram: P processor/memory modules connected by an interconnection network, with parameters o (overhead), L (latency), g (gap), and limited volume (L/g messages to or from a proc)]

• L (latency): latency in sending a (small) message between modules
• o (overhead): overhead felt by the processor on sending or receiving a message
• g (gap): gap between successive sends or receives (1/BW)
• P: number of processors

Page 4: Parallel Sorting

Bottom Line on CM-5 using Split-C (Preview)

[Chart: µs/key vs. N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]

• Good fit between predicted (using LogP model) and measured (10%)
• No single algorithm always best
  • scaling by processor, input size, sensitivity
• All are global / local hybrids
  • the local part is hard to implement and model

Page 5: Parallel Sorting

Bitonic Sort (1/2)

• A bitonic sequence is one that is:
  1. Monotonically increasing and then monotonically decreasing
  2. Or can be circularly shifted to satisfy 1
• A half-cleaner takes a bitonic sequence and produces one where:
  1. Every element of the first half is smaller than the smallest element in the 2nd half
  2. Both halves are bitonic
• Apply recursively to each half to complete sorting
• Where do we get a bitonic sequence to start with?

[Figure: half-cleaner applied to example 0/1 sequences (e.g., 00111000 and 00001011)]

Page 6: Parallel Sorting

Bitonic Sort (2/2)

• A bitonic sequence is one that is:
  1. Monotonically increasing and then monotonically decreasing
  2. Or can be circularly shifted to satisfy 1
• Any sequence of length 2 is bitonic
  • So we can sort it using the bitonic sorter on the last slide
• Sort all length-2 sequences in alternating increasing/decreasing order
  • Two such length-2 sequences form a length-4 bitonic sequence
• Recursively:
  • Sort length-2^k sequences in alternating increasing/decreasing order
  • Two such length-2^k sequences form a length-2^(k+1) bitonic sequence (that can be sorted)
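As a concrete illustration of the recursion above, here is a minimal sequential bitonic sort in C (an illustrative sketch, not the Split-C code measured on the CM-5; it assumes the array length is a power of two):

    #include <stdio.h>

    /* Compare-exchange so that a[i] <= a[j] when dir is ascending (dir != 0). */
    static void cmp_exch(int a[], int i, int j, int dir) {
        if ((dir && a[i] > a[j]) || (!dir && a[i] < a[j])) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    /* Half-cleaner plus recursion: turn a bitonic sequence of length n into a sorted one. */
    static void bitonic_merge(int a[], int lo, int n, int dir) {
        if (n > 1) {
            int m = n / 2;
            for (int i = lo; i < lo + m; i++)
                cmp_exch(a, i, i + m, dir);          /* half-cleaner */
            bitonic_merge(a, lo, m, dir);            /* both halves are bitonic */
            bitonic_merge(a, lo + m, m, dir);
        }
    }

    /* Sort the two halves in opposite directions (making the whole bitonic), then merge. */
    static void bitonic_sort(int a[], int lo, int n, int dir) {
        if (n > 1) {
            int m = n / 2;
            bitonic_sort(a, lo, m, 1);               /* ascending half  */
            bitonic_sort(a, lo + m, m, 0);           /* descending half */
            bitonic_merge(a, lo, n, dir);
        }
    }

    int main(void) {
        int a[8] = {3, 7, 4, 8, 6, 2, 1, 5};
        bitonic_sort(a, 0, 8, 1);
        for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 1 2 3 4 5 6 7 8 */
        printf("\n");
        return 0;
    }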

Page 7: Parallel Sorting

Bitonic Sort for n=16 – all dependencies shown

Block Layout

• lg N/P stages are local sort – use best local sort
• remaining stages involve:
  • block-to-cyclic remap, local merges (i - lg N/P columns)
  • cyclic-to-block remap, local merges (lg N/P columns within a stage)
• Similar pattern as FFT: similar optimizations possible

Page 8: Parallel Sorting

Bitonic Sort: time per key

[Charts: predicted (LogP model) and measured µs/key vs. N/P (16384 to 1048576), for 32 to 512 processors]

Page 9: Parallel Sorting

Sample Sort

1. Compute P-1 values of keys that would split the input into roughly equal pieces
   – take S ~ 64 samples per processor
   – sort the P·S sampled keys (on processor 0)
   – let keys S, 2·S, ..., (P-1)·S be the "Splitters"
   – broadcast the Splitters
2. Distribute keys based on the Splitters
   – keys with Splitter(i-1) < key <= Splitter(i) are all sent to proc i
3. Local sort of keys on each proc
4. [optionally] reshift, so each proc has N/P keys

If the samples represent the total population, then the Splitters should divide the population into P roughly equal pieces.
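A sequential sketch in C of the splitter-selection and bucket-counting idea (P, S, and the random keys are illustrative values; the local sorts and the actual key redistribution of the measured Split-C code are not shown):

    #include <stdlib.h>
    #include <stdio.h>

    static int cmp_int(const void *x, const void *y) {
        int a = *(const int *)x, b = *(const int *)y;
        return (a > b) - (a < b);
    }

    int main(void) {
        enum { P = 4, S = 64, N = P * 1000 };     /* illustrative sizes */
        int keys[N], samples[P * S], splitters[P - 1], count[P] = {0};
        srand(1);
        for (int i = 0; i < N; i++) keys[i] = rand();

        /* 1. Each "processor" contributes S samples; sort all P*S samples together. */
        for (int i = 0; i < P * S; i++) samples[i] = keys[rand() % N];
        qsort(samples, P * S, sizeof(int), cmp_int);

        /* Splitters are the samples at positions S, 2S, ..., (P-1)S. */
        for (int i = 1; i < P; i++) splitters[i - 1] = samples[i * S];

        /* 2. Count how many keys each processor would receive. */
        for (int i = 0; i < N; i++) {
            int p = 0;
            while (p < P - 1 && keys[i] > splitters[p]) p++;
            count[p]++;
        }
        for (int p = 0; p < P; p++)
            printf("proc %d would receive %d of %d keys\n", p, count[p], N);
        return 0;
    }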

Page 10: Parallel Sorting

Sample Sort: Times

[Charts: predicted and measured µs/key vs. N/P (16384 to 1048576), for 32 to 512 processors]

Page 11: Parallel Sorting

Sample Sort Timing Breakdown

[Chart: predicted and measured (-m) µs/key vs. N/P, broken down into Split, Sort, and Dist phases]

Page 12: Parallel Sorting


Sequential Radix Sort: Counting Sort

• Idea: build a histogram of the keys and compute position in answer array for each element

A = [3, 5, 4, 1, 3, 4, 1, 4]

• Make temp array B, and write values into position B = [1, 1, 3, 3, 4, 4, 4, 5]

• Cost = O(#keys + size of histogram)

• What if the histogram is too large (e.g., all 32-bit ints? All words?)

[Figure: histogram of the key values and the running positions used to place each element]
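A minimal C version of this counting sort (it assumes the keys lie in 0..R-1 with R <= 256; the example reproduces the A and B arrays above):

    #include <stdio.h>

    /* Counting sort for keys in 0..R-1: cost O(#keys + size of histogram). */
    static void counting_sort(const int *a, int *b, int n, int R) {
        int count[256] = {0};                  /* histogram of the keys */
        for (int i = 0; i < n; i++) count[a[i]]++;
        int pos = 0;                           /* turn counts into starting positions */
        for (int k = 0; k < R; k++) {
            int c = count[k];
            count[k] = pos;
            pos += c;
        }
        for (int i = 0; i < n; i++)            /* write each element into its slot */
            b[count[a[i]]++] = a[i];
    }

    int main(void) {
        int a[] = {3, 5, 4, 1, 3, 4, 1, 4}, b[8];
        counting_sort(a, b, 8, 6);
        for (int i = 0; i < 8; i++) printf("%d ", b[i]);   /* 1 1 3 3 4 4 4 5 */
        printf("\n");
        return 0;
    }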

Page 13: Parallel Sorting


Radix Sort: Separate Key Into Parts

• Divide keys into parts, e.g., by digits (radix)

• Use counting sort on each part (radix):
  • Start with the least-significant part

  start    sort on 3rd char   sort on 2nd char   sort on 1st char
  sat      run                sat                pin
  saw      pin                saw                run
  tip      tip                pin                sat
  run      sat                tip                saw
  pin      saw                run                tip

• Cost = O(#keys * #characters)
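A minimal least-significant-digit radix sort in C built from a stable counting sort per digit (the 8-bit digit width for 32-bit keys is an illustrative choice, analogous to the character-by-character example above):

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* LSD radix sort of 32-bit keys: 4 passes of 8 bits, 256 bins per pass. */
    static void radix_sort(uint32_t *a, uint32_t *tmp, int n) {
        for (int shift = 0; shift < 32; shift += 8) {
            int count[256] = {0};
            for (int i = 0; i < n; i++) count[(a[i] >> shift) & 0xFF]++;
            int pos = 0;
            for (int k = 0; k < 256; k++) { int c = count[k]; count[k] = pos; pos += c; }
            for (int i = 0; i < n; i++)            /* stable distribution pass */
                tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
            memcpy(a, tmp, n * sizeof(uint32_t));
        }
    }

    int main(void) {
        uint32_t a[] = {390, 12, 7, 1000000, 42, 7, 99}, tmp[7];
        radix_sort(a, tmp, 7);
        for (int i = 0; i < 7; i++) printf("%u ", a[i]);
        printf("\n");
        return 0;
    }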

Page 14: Parallel Sorting

Histo-radix sort

[Diagram: P processors, n = N/P keys each]

Per pass:
1. Compute local histogram
   – r-bit keys, 2^r bins
2. Compute position of the 1st member of each bucket in the global array
   – 2^r scans with end-around
3. Distribute all the keys

Only r = 4, 8, 11, 16 make sense for sorting 32-bit numbers.
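A sequential C simulation of one such pass, with the per-processor histograms and the cross-processor position computation made explicit (P, the digit width, and the keys are illustrative; a real implementation distributes the keys with communication rather than writing into one shared array):

    #include <stdio.h>

    int main(void) {
        enum { P = 4, KEYS = 8, BITS = 2, BINS = 1 << BITS };
        int keys[P][KEYS], hist[P][BINS] = {{0}}, offset[P][BINS], next[P][BINS];
        unsigned seed = 1;
        for (int p = 0; p < P; p++)
            for (int i = 0; i < KEYS; i++) {
                seed = seed * 1103515245u + 12345u;        /* small sample keys */
                keys[p][i] = (seed >> 16) & 0xF;
            }

        /* 1. Local histogram of the current r-bit digit on each "processor". */
        for (int p = 0; p < P; p++)
            for (int i = 0; i < KEYS; i++)
                hist[p][keys[p][i] & (BINS - 1)]++;

        /* 2. Position of each (bin, proc) bucket in the global array:
              all smaller bins everywhere, plus this bin on lower-ranked procs. */
        int pos = 0;
        for (int b = 0; b < BINS; b++)
            for (int p = 0; p < P; p++) { offset[p][b] = pos; pos += hist[p][b]; }

        /* 3. Distribute: every key's global destination is now known. */
        int out[P * KEYS];
        for (int p = 0; p < P; p++)
            for (int b = 0; b < BINS; b++) next[p][b] = offset[p][b];
        for (int p = 0; p < P; p++)
            for (int i = 0; i < KEYS; i++)
                out[next[p][keys[p][i] & (BINS - 1)]++] = keys[p][i];

        for (int i = 0; i < P * KEYS; i++) printf("%d ", out[i]);
        printf("\n");
        return 0;
    }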

Page 15: Parallel Sorting

Histo-Radix Sort (again)

[Diagram: per-processor local data and local histograms]

Each pass:
• form local histograms
• form global histogram
• globally distribute data

Page 16: Parallel Sorting

Radix Sort: Times

[Charts: predicted and measured µs/key vs. N/P (16384 to 1048576), for 32 to 512 processors]

Page 17: Parallel Sorting

Radix Sort: Timing Breakdown

[Chart: predicted and measured (-m) µs/key vs. N/P, broken down into LocalHist, GlobalHist, and Dist phases]

Page 18: Parallel Sorting

Local Sort Performance on CM-5

[Chart: local sort µs/key vs. log N/P for key-value entropies of 6.2, 10.4, 16.9, 25.1, and 31 bits; the rise at large N/P is due to TLB misses (11-bit radix sort of 32-bit numbers)]

Entropy = -Σi pi log2(pi), where pi = probability of key i; ranges from 0 to log2(#different keys)

Page 19: Parallel Sorting

Radix Sort: Timing dependence on Key distribution

[Chart: µs/key vs. key entropy (0 to 31 bits; a "Cyclic" key pattern reaches about 118 µs/key), broken down into Local Hist, Global Hist, and Dist phases]

Slowdown due to contention in redistribution.

Entropy = -Σi pi log2(pi), where pi = probability of key i; ranges from 0 to log2(#different keys)

Page 20: Parallel Sorting

Bottom Line on CM-5 using Split-C

[Chart: µs/key vs. N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]

• Good fit between predicted (using LogP model) and measured (10%)
• No single algorithm always best
  • scaling by processor, input size, sensitivity
• All are global / local hybrids
  • the local part is hard to implement and model

Page 21: Parallel Sorting

Sorting Conclusions

• Distributed memory model leads to a hybrid global / local algorithm
  • Use the best local algorithm combined with a global part
• LogP model is good enough to model the global part
  • bandwidth (g) or overhead (o) matter most
  • including end-point contention
  • latency (L) only matters when bandwidth doesn't
• Modeling local computational performance is harder
  • dominated by effects of the storage hierarchy (e.g., TLBs)
  • depends on entropy of the key distribution
• See http://www.cs.berkeley.edu/~culler/papers/sort.ps
• See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps
  • disk-to-disk parallel sorting

Page 22: Parallel Sorting

EXTRA SLIDES


Page 23: Parallel Sorting

Radix: Stream Broadcast Problem

(P-1) (2o + L + (n-1) g) ?  Need to slow the first processor to pipeline well

• Processor 0 does only sends
• Others receive, then send
• Receives prioritized over sends, so Processor 0 needs to be delayed

Page 24: Parallel Sorting

What’s the right communication mechanism?

• Permutation via writes
  • consistency model?
  • false sharing?
• Reads?
• Bulk Transfers?
  • what do you need to change in the algorithm?
• Network scheduling?

Page 25: Parallel Sorting

Comparison

• Good fit between predicted and measured (10%)
• Different sorts for different sorts
  • scaling by processor, input size, sensitivity
• All are global / local hybrids
  • the local part is hard to implement and model

[Chart: µs/key vs. N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample sort on 32 and 1024 processors]

Page 26: Parallel Sorting


Outline

• Some Performance Laws

• Performance analysis

• Performance modeling

• Parallel Sorting: combining models with measurements

• Reading:
  • Chapter 3 of Foster's "Designing and Building Parallel Programs" (online text)
    • http://www-unix.mcs.anl.gov/dbpp/text/node26.html
    • Abbreviated as DBPP in this lecture
  • David Bailey's "Twelve Ways to Fool the Masses"

Page 27: Parallel Sorting


Measuring Performance

• Performance criterion may vary with domain

• There may be limits on acceptable running time
  • E.g., a climate model must run 1000x faster than real time
• Any performance improvement may be acceptable
  • E.g., faster on 4 cores than on 1
• Throughput may be more critical than latency
  • E.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system
• Execution time per unit cost
  • E.g., GFlop/sec, GFlop/s/$ or GFlop/s/Watt
• Parallel scalability (speedup or parallel efficiency)
• Percent relative to best possible (some kind of peak)

Page 28: Parallel Sorting


Amdahl’s Law (review)

• Suppose only part of an application seems parallel

• Amdahl's law:
  • let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  • P = number of processors

      Speedup(P) = Time(1)/Time(P)
                 <= 1/(s + (1-s)/P)
                 <= 1/s

• Even if the parallel part speeds up perfectly performance is limited by the sequential part
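A quick numeric check of the bound in C (the sequential fraction s = 0.01 is just an illustrative value):

    #include <stdio.h>

    int main(void) {
        double s = 0.01;                        /* illustrative sequential fraction */
        for (int P = 1; P <= 1024; P *= 4) {
            double bound = 1.0 / (s + (1.0 - s) / P);
            printf("P = %4d   speedup <= %6.1f   (limit 1/s = %.0f)\n", P, bound, 1.0 / s);
        }
        return 0;
    }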

Page 29: Parallel Sorting


Amdahl’s Law (for 1024 processors)

Does this mean parallel computing is a hopeless enterprise?

Source: Gustafson, Montry, Benner

Page 30: Parallel Sorting


Scaled Speedup

See: Gustafson, Montry, Benner, “Development of Parallel Methods for a 1024 Processor Hypercube”, SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, pp.609.

Page 31: Parallel Sorting


Scaled Speedup (background)

Page 32: Parallel Sorting


Little’s Law

• Latency vs. Bandwidth
  • Latency is physics (wire length)
    • e.g., the network latency on the Earth Simulator is only about 2x the time it takes light to cross the machine room
  • Bandwidth is cost:
    • add more cables to increase bandwidth (over-simplification)

• Principle (Little's Law): for a production system in steady state:

      Inventory = Throughput × Flow Time

• For parallel computing, Little's Law gives the concurrency required to be limited by bandwidth rather than latency:

      Required concurrency = Bandwidth × Latency   (the bandwidth-delay product)

Page 33: Parallel Sorting


Little’s Law

• Example 1: a single processor
  • If the latency to memory is 50 ns, and the bandwidth is 5 GB/s (0.2 ns/byte = 12.8 ns per 64-byte cache line)
  • The system must support 50/12.8 ~= 4 outstanding cache-line misses to keep things balanced (run at bandwidth speed)
  • An application must be able to prefetch 4 cache-line misses in parallel (without dependencies between them)

• Example 2: a 1000-processor system
  • 1 GHz clock, 100 ns memory latency, 100 words of memory in the data paths between CPU and memory
  • Main memory bandwidth is ~ 1000 x 100 words x 10^9/s = 10^14 words/sec
  • To achieve full performance, an application needs ~ 10^-7 x 10^14 = 10^7-way concurrency (some of that may be hidden in the instruction stream)
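The same two calculations in a few lines of C (the latency, bandwidth, and line-size numbers are the ones quoted above):

    #include <stdio.h>

    int main(void) {
        /* Example 1: 50 ns memory latency, 5 GB/s bandwidth, 64-byte cache lines. */
        double latency_s   = 50e-9;
        double line_time_s = 64.0 / 5e9;                 /* 12.8 ns per line */
        printf("outstanding cache-line misses needed: %.1f\n", latency_s / line_time_s);

        /* Example 2: 1000 procs, 100 ns latency, ~10^14 words/s aggregate bandwidth. */
        double bw_words = 1000.0 * 100.0 * 1e9;
        printf("required concurrency: %.0e words in flight\n", 100e-9 * bw_words);
        return 0;
    }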

Page 34: Parallel Sorting


In Performance Analysis: Use more Data

• Whenever possible, use a large set of data rather than one or a few isolated points. A single point has little information.

• E.g., from DBPP:
  • Serial algorithm scales as N + N^2
  • Speedup of 10.8 on 12 processors with problem size N=100
  • Case 1: T = N + N^2/P
  • Case 2: T = (N + N^2)/P + 100
  • Case 3: T = (N + N^2)/P + 0.6*P^2
  • All have speedup ~10.8 on 12 procs
  • Performance graphs (N = 100, 1000) show differences in scaling

Page 35: Parallel Sorting


Example: Immersed Boundary Simulation

Joint work with Ed Givelberg, Armando Solar-Lezama

[Chart: time per timestep (secs) vs. # procs for Power3/SP at 256^3 and 512^3, and P4/Myrinet at 512^2 x 256]

• Using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC
• How useful is this data? What are ways to make it more useful/interesting?

Page 36: Parallel Sorting


Performance Analysis

[Chart: time breakdown (percent of timestep) for 256^3 on 2 to 64 procs and 512^3 on 32 to 128 procs, by phase: Compute F, Spread F, Pack F, Send F, Set F = 0, Unpack F, Exchange ghost, Upwind, Forward FFTs, SolveXformXEqns, Inverse FFTs, Copy fluid U, Pack U, Send velocities, Unpack U, move]

Page 37: Parallel Sorting


Building a Performance Model

• Based on measurements/scaling of components
• FFT time is:
  • 5*n*log(n) flops / (flops/sec measured for the FFT)
• Other costs are linear in either material or fluid points
• Measure constants:
  a) # flops/point (independent of machine or problem size)
  b) flops/sec (measured per machine, per phase)
  • Time is: a * b * #points
• Communication done similarly:
  • Find a formula for message size as a function of problem size
  • Check the formula using tracing of some kind
  • Use the model to predict running time as a function of problem size
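A sketch in C of evaluating the FFT term of such a model (the 5 n log n flop count is from the slide; the 1 GFlop/s rate is a stand-in for the measured per-machine, per-phase constant):

    #include <stdio.h>
    #include <math.h>

    /* Predicted FFT time: 5 n log2(n) flops divided by the measured flop rate. */
    static double fft_time(double n, double flops_per_sec) {
        return 5.0 * n * log2(n) / flops_per_sec;
    }

    int main(void) {
        double measured_rate = 1.0e9;                    /* illustrative: 1 GFlop/s */
        for (double n = 65536.0; n <= 4194304.0; n *= 8.0)
            printf("n = %8.0f   predicted FFT time = %.4f s\n", n, fft_time(n, measured_rate));
        return 0;
    }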

Page 38: Parallel Sorting


A Performance Model

• 512^3 in < 1 second per timestep not possible
• Primarily limited by bisection bandwidth

[Chart: Performance Model Validation; time (secs) vs. # procs, model vs. actual total time for the 256^3 and 512^3 problems]

Page 39: Parallel Sorting


Model Success and Failure

Page 40: Parallel Sorting


OSKI SPMV: What We Expect

• Assume
  • Cost(SpMV) = time to read the matrix
  • 1 double-word = 2 integers
  • r, c in {1, 2, 4, 8}
  • CSR: 1 int / non-zero
  • BCSR(r x c): 1 int / (r*c non-zeros)
• As r*c increases, speedup should
  • Increase smoothly
  • Approach 1.5 (eliminate all index overhead)

      Speedup(r,c) = T_CSR / T_BCSR(r,c) = (1 + 0.5) / (1 + 0.5/(r*c))  ->  1.5
      (per non-zero, CSR reads 1 double-word of value plus 0.5 of index; BCSR reads 1 plus 0.5/(r*c))
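Evaluating that expected-speedup expression for the block sizes above (this is only the matrix-reading cost model, not measured SpMV performance):

    #include <stdio.h>

    /* Expected BCSR(r x c) speedup over CSR when SpMV time = time to read the matrix:
       per non-zero, CSR reads 1 + 0.5 double-words, BCSR reads 1 + 0.5/(r*c). */
    static double expected_speedup(int r, int c) {
        return 1.5 / (1.0 + 0.5 / (double)(r * c));
    }

    int main(void) {
        int sizes[] = {1, 2, 4, 8};
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                printf("r=%d c=%d   expected speedup = %.3f\n",
                       sizes[i], sizes[j], expected_speedup(sizes[i], sizes[j]));
        return 0;
    }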

Page 41: Parallel Sorting


What We Get (The Need for Search)

[Figures: measured Mflop/s for each register block size; reference vs. best (4x2)]

Page 42: Parallel Sorting


Using Multiple Models

Page 43: Parallel Sorting


Multiple Models

Page 44: Parallel Sorting


Multiple Models

Page 45: Parallel Sorting

Extended Example: Using Performance Modeling (LogP) to Explain Data

Application to Sorting

Page 46: Parallel Sorting

Deriving the LogP Model

° Processing
  – powerful microprocessor, large DRAM, cache  =>  P

° Communication
  + significant latency (100's to 1000's of cycles)  =>  L
  + limited bandwidth (1% to 5% of memory bandwidth)  =>  g
  + significant overhead (10's to 100's of cycles), on both ends  =>  o
  – no consensus on topology
    => should not exploit structure
  + limited network capacity
  – no consensus on programming model
    => should not enforce one

Page 47: Parallel Sorting

LogP

[Diagram: P processor/memory modules connected by an interconnection network, with parameters o (overhead), L (latency), g (gap), and limited volume (L/g messages to or from a proc)]

• L (latency): latency in sending a (small) message between modules
• o (overhead): overhead felt by the processor on sending or receiving a message
• g (gap): gap between successive sends or receives (1/BW)
• P: number of processors

Page 48: Parallel Sorting

Using the LogP Model

° Send n messages from proc to proc in time

2o + L + g (n-1)

– each processor does o*n cycles of overhead

– has (g-o)(n-1) + L available compute cycles

° Send n total messages from one to many

in same time

° Send n messages from many to one

in same time

– all but L/g processors block

so fewer available cycles, unless scheduled carefully

[Diagram: LogP timeline for two processors, showing o, L, o per message and the gap g between successive messages]
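A small C helper that evaluates the point-to-point formula above (the CM-5 parameter values quoted on a later slide are used as illustrative defaults):

    #include <stdio.h>

    /* LogP time for n back-to-back small messages from one processor to another. */
    static double logp_send_time(int n, double L, double o, double g) {
        return 2.0 * o + L + g * (n - 1);
    }

    int main(void) {
        double L = 6.0, o = 2.2, g = 4.0;                /* CM-5 values, microseconds */
        for (int n = 1; n <= 1000; n *= 10)
            printf("n = %4d messages: %7.1f us total, %6.1f us of send overhead\n",
                   n, logp_send_time(n, L, o, g), o * n);
        return 0;
    }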

Page 49: Parallel Sorting

Use of the LogP Model (cont)

° Two processors sending n words to each other (i.e., exchange)

2o + L + max(g,2o) (n-1)

° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts , e.g., transpose, fft, cyclic-to-block, block-to-cyclic

exercise: what’s wrong with the formula above?

Assumes optimal pattern of send/receive, so could underestimate time


Page 50: Parallel Sorting

LogP "philosophy"

• Think about:
  – mapping of N words onto P processors
  – computation within a processor, its cost, and balance
  – communication between processors, its cost, and balance
  given a characterization of processor and network performance
• Do not think about what happens within the network

This should be good enough!

Page 51: Parallel Sorting

Typical Sort

Exploits the n = N/P grouping

° Significant local computation

° Very general global communication / transformation

° Computation of the transformation

Page 52: Parallel Sorting

Costs Split-C (UPC predecessor) Operations

• Read, Write    x = *G,  *G = x     2 (L + 2o)
• Store          *G :- x             L + 2o
• Get            x := *G             o  ....  2L + 2o
                 sync();             o
  • with interval g
• Bulk store (n words with ... words/message):   2o + (n-1) g + L
• Exchange:   2o + 2L + (n - L/g) max(g, 2o)
• One to many
• Many to one

Page 53: Parallel Sorting

LogP model

• CM-5:
  • L = 6 µs, o = 2.2 µs, g = 4 µs
  • P varies from 32 to 1024
• NOW:
  • L = 8.9 µs, o = 3.8 µs, g = 12.8 µs
  • P varies up to 100
• What is the processor performance?
  • Application-specific
  • 10s of Mflops for these machines

Page 54: Parallel Sorting


LogP Parameters Today

[Chart: 4*o + 2*L and g (µs) for: Millennium (MPICH/M2K), SP Loopback (MPI), SP (MPI), Quadrics loopback (MPI, ORNL), Quadrics (MPI, ORNL), Quadrics (Shmem, ORNL), Quadrics (UPC, ORNL), Myrinet (GM), MVICH/Giganet, VIPL/Giganet/VIA, MVICH/M-VIA/SysKonnect, VIPL/M-VIA/SysKonnect, T3E Loopback (MPI, NERSC), T3E (MPI, NERSC), T3E UPC (NERSC)]

Page 55: Parallel Sorting

Local Computation Parameters - Empirical

  Parameter    Operation                           µs per key                      Sort
  swap         simulate cycle butterfly per key    0.025 lg N                      Bitonic
  mergesort    sort bitonic sequence               1.0
  scatter      move key for cyclic-to-block        0.46
  gather       move key for block-to-cyclic        0.52 if n <= 64K or P <= 64,    Bitonic & Column
                                                   1.1 otherwise
  local sort   local radix sort (11 bit)           4.5 if n < 64K,
                                                   9.0 - (281000/n) otherwise
  merge        merge sorted lists                  1.5                             Column
  copy         shift key                           0.5
  zero         clear histogram bin                 0.2                             Radix
  hist         produce histogram                   1.2
  add          produce scan value                  1.0
  bsum         adjust scan of bins                 2.5
  address      determine destination               4.7
  compare      compare key to splitter             0.9                             Sample
  localsort8   local radix sort of samples         5.0

Page 56: Parallel Sorting

Odd-Even Merge - classic parallel sort

• N values to be sorted: A0 A1 A2 A3 ... AM-1  B0 B1 B2 B3 ... BM-1
  • Treat as two lists of M = N/2; sort each separately
• Redistribute into even and odd sublists:
  • A0 A2 ... AM-2  B0 B2 ... BM-2   and   A1 A3 ... AM-1  B1 B3 ... BM-1
• Merge into two sorted lists: E0 E1 E2 E3 ... EM-1  and  O0 O1 O2 O3 ... OM-1
• Pairwise swaps of Ei and Oi will put it in order

Page 57: Parallel Sorting

Where’s the Parallelism?

[Diagram: dependence structure of odd-even merge on E0 ... EM-1 and O0 ... OM-1, with stages of 1 x N, 2 x N/2, and 4 x N/4 independent operations]

Page 58: Parallel Sorting

Mapping to a Butterfly (or Hypercube)

[Diagram: two sorted sublists A0 A1 A2 A3 and B0 B1 B2 B3 mapped to a butterfly; reverse the order of one list via the cross edges, then do pairwise swaps on the way back (example: 2 3 4 8 / 7 6 5 1 ends as 1 2 3 4 5 6 7 8)]

Page 59: Parallel Sorting

Bitonic Sort with N/P per node

    all_bitonic(int A[PROCS]::[n])
        sort(tolocal(&A[ME][0]), n, 0)
        for (d = 1; d <= logProcs; d++)
            for (i = d-1; i >= 0; i--) {
                swap(A, T, n, pair(i));
                merge(A, T, n, mode(d, i));
            }
        sort(tolocal(&A[ME][0]), n, mask(i));

A bitonic sequence decreases and then increases (or vice versa).
Bitonic sequences can be merged like monotonic sequences.

Page 60: Parallel Sorting

Bitonic: Breakdown

[Charts: predicted and measured µs/key vs. N/P (16384 to 1048576), broken down into Localsort, Swap, Mergesort, Remap C-B, and Remap B-C; P = 512, random keys]

Page 61: Parallel Sorting

Bitonic: Effect of Key Distributions

[Chart: µs/key vs. key entropy (0 to 31 bits), broken down into Local Sort, Merge Sort, Swap, Remap C-B, and Remap B-C; P = 64, N/P = 1M]

Page 62: Parallel Sorting


Sequential Radix Sort: Counting Sort

• Idea: build a histogram of the keys and compute position in answer array for each element

A = [3, 5, 4, 1, 3, 4, 1, 4]

• Make temp array B, and write values into position

[Figure: step-by-step counting sort of A, showing the histogram of key values and the running positions used to place each element]