43
Sanders: Algorithm Engineering – Parallel Sorting 1 Algorithm Engineering An Attempt at a Definition Using Parallel (External) Sorting as an Example Peter Sanders

Sanders: Algorithm Engineering – Parallel Sorting

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 1

Algorithm EngineeringAn Attempt at a Definition

Using Parallel (External) Sortingas an Example

Peter Sanders

Page 2: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 2

Overview

� a general definition

[with Kurt Mehlhorn, Rolf Möhring, Petra Mutzel, Dorothea Wagner]

� main challenges

� Parallel (external) sorting as an example

[with Andreas Beckmann, Roman Dementiev, David Hutchinson,

Kanela Kaligosi, Nicolai Leischner, Ulrich Meyer, Vitaly Osipov,

Mirko Rahn, Johannes Singler, Jeff Vitter, Sebastian Winkel]

Page 3: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 3

Algorithmics

= the systematic design of efficient software and hardware

computer science

logics correct

efficient

soft− & hardware

the

ore

tic

al p

rac

tica

l

algorithmics

Page 4: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 4

(Caricatured) Traditional View: Algorithm Theory

models

design

analysis

perf. guarantees applications

implementation

deduction

Theory Practice

Page 5: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 5

Gaps Between Theory & Practice

Theory ←→ Practice

simple appl. model complex

simple machine model real

complex algorithms FOR simple

advanced data structures arrays,. . .

worst case max complexity measure inputs

asympt. O(·) efficiency 42% constant factors

Page 6: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 6

Algorithmics as Algorithm Engineering

models

design

perf.−guarantees

deduction

analysis experiments

algorithmengineering

??

?

Page 7: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 7

Algorithmics as Algorithm Engineering

models

design

implementationperf.−guarantees

deduction

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering

Page 8: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 8

Algorithmics as Algorithm Engineering

realisticmodels

implementationperf.−guarantees

deduction

falsifiable

inductionhypotheses experiments

algorithmengineering

design

analysis

real.

real.

Page 9: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 9

Algorithmics as Algorithm Engineering

realisticmodels

design

implementation

librariesalgorithm−

perf.−guarantees

deduction

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering real

Inputs

Page 10: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 10

Algorithmics as Algorithm Engineering

appl. engin.

realisticmodels

design

implementation

librariesalgorithm−

perf.−guarantees

applications

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering real

Inputs

deduction

Page 11: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 11

Goals

� bridge gaps between theory and practice

� accelerate transfer of algorithmic results into applications

� keep the advantages of theoretical treatment:

generality of solutions and

reliability, predictability from performance guarantees

appl. engin.

realisticmodels

design

implementation

librariesalgorithm−

perf.−guarantees

applications

falsifiable

inductionhypothesesanalysis experiments

algorithmengineering real

Inputs

deduction

Page 12: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 12

Bits of History1843– Algorithms in theory and practice

1950s,1960s Still infancy

1970s,1980s Paper and pencil algorithm theory.

Exceptions exist, e.g., [J. Bentley, D. Johnson]

1986 Term used by [T. Beth],

lecture “Algorithmentechnik” in Karlsruhe.

1988– Library of Efficient Data Types and Algorithms

(LEDA) [K. Mehlhorn, S. Näher]

1990– DIMACS Implementation Challenges [D. Johnson]

1997– Workshop on Algorithm Engineering

ESA applied track [G. Italiano]

1997 Term used in US policy paper [Aho, Johnson, Karp, et. al]

1998 Alex workshop in Italy ALENEX

Page 13: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 13

Commercial Break [Bader, Sanders, Wagner]

10th DIMACS Implementation Challenge

Two related challenges:

� (Balanced) Graph Partitioning (cut minimization) and

� Clustering (modularity, others)

Variants welcome for the workshop

Atlanta February 13/14, 2012

June 1 2011: testbed creation

Oct 21 2011: paper deadline

http://www.cc.gatech.edu/dimacs10/

Page 14: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 14

Realistic Models – The Beauty and the Beast

Theory ←→ Practice

simple appl. model complex

simple machine model real

� Careful refinements

� Try to preserve (partial) analyzability / simple results

Page 15: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 15

Sorting – Model

basedComparison

<e.g. integer

arbitrary

full informationtrue/false

Page 16: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 16

Advanced Machine Models

Parallel Disks

M

B... D1 2

von Neumann

Page 17: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 17

Advanced Machine Models

SSDs

MM

B... ...

B B’

D1 2

Page 18: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 18

Set Associative Caches

MB

...cache lines of the memory

cache setsa=2

B cache

main memory

...

Page 19: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 19

Branch Prediction

Page 20: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 20

Hierarchical Parallel External Memory

multicore

disks...

MPI

Page 21: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 21

Our Cost Model for Parallel External Sorting

“sequential” aspects: Comparisons, cache faults, branch

mispredictions, ILP

shared memory: remote cache accesses (not here:

synchronizations,. . . )

disk: I/Os, overlapping, tune block size

distributed memory: communication volume (with alltoallv)

(not here latency, collectives)

overall: time, energy

partly plug-and-play of previous results.

Mix of formal and informal consideration

Page 22: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 22

Graphics Processing Units

not here

Page 23: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 23

Design

of algorithms that work well in practice

� simplicity

� reuse

� constant factors

� exploit easy instances

Page 24: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 24

Design – Sorting

� simplicity

...

...

...

...

...

...

...

...

...

...

...

...

merge distribute

� reuse disk scheduling, prefetching,

load balancing, sequence partitioning

� constant factors detailed machine model·

(caches, TLBs, registers, branch prediction, ILP, communication)

� instances randomization for difficult instances

Page 25: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 25

Design – Parallel External Multiway Mergesort

� run formation: internal parallel sorting

(multi-core parallel subroutines).

shuffle blocks between runs randomly

� data redistribution by

external inplace all-to-all

� node-local multi-core-parallel merging

Page 26: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 26

Analysis

� Constant factors matter

� Beyond worst case analysis

� Practical algorithms might be difficult to analyze

(randomization, meta heuristics,. . . )

Page 27: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 27

Analysis – Sorting

.

.

......

merge

write

prefetch

overlap

merge

� Constant factors matter (1 + o(1))sort(n)

I/Os for parallel (disk) external sorting

Open Problem: optimal I/O AND communication volume

� Beyond worst case analysis Open Problem:

quicksort: avg. case analysis of branch mispredictions

� Practical algorithms might be difficult to analyze Open Problem:

greedy algorithm for parallel disk prefetching [Knuth@48]

Page 28: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 28

Analysis – Parallel External Sorting

Best case: 2 I/O passes, 1×communication (optimal)

Worst case: one additional I/O, communication pass

Expected case: much less extra I/O/comm. even for arbitrarily skewed

inputs

Page 29: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 29

Implementation

sanity check for algorithms !

Challenges

Semantic gaps:

Abstract algorithm

C++, OpenMP, MPI,. . .

hardware

Small constant factors:

compare highly tuned competitors

.

.

......

merge

write

prefetch

overlap

merge

Page 30: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 30

Example: Inner Loops Sample Sort

template <class T>

void findOraclesAndCount(const T* const a,

const int n, const int k, const T* const s,

Oracle* const oracle, int* const bucket) {

{ for (int i = 0; i < n; i++)

int j = 1;

while (j < k) {

j = j*2 + (a[i] > s[j]);

}

int b = j-k;

bucket[b]++;

oracle[i] = b;

}

}

6s

splitter array index

3s

<

< <

<<<< 1s 5s 7s6 7

buckets

4

3s 2 2

1 4s

5

decisions

decisions

decisions

> >

>>>>

*2

*2 *2

>+1

+1 +1

Page 31: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 31

Example: Inner Loops Sample Sort

template <class T>

void findOraclesAndCountUnrolled([...]){

for (int i = 0; i < n; i++)

int j = 1;

j = j*2 + (a[i] > s[j]);

j = j*2 + (a[i] > s[j]);

j = j*2 + (a[i] > s[j]);

int b = j-k;

bucket[b]++;

oracle[i] = b;

} }

6s

splitter array index

3s

<

< <

<<<< 1s 5s 7s6 7

buckets

4

3s 2 2

1 4s

5

decisions

decisions

decisions

> >

>>>>

*2

*2 *2

>+1

+1 +1

Page 32: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 32

Example: Inner Loops Sample Sort

template <class T>

void findOraclesAndCountUnrolled2([...]){

for (int i = n & 1; i < n; i+=2) {\

int j0 = 1; int j1 = 1;

T ai0 = a[i]; T ai1 = a[i+1];

j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);

j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);

j0=j0*2+(ai0>s[j0]); j1=j1*2+(ai1>s[j1]);

int b0 = j0-k; int b1 = j1-k;

bucket[b0]++; bucket[b1]++;

oracle[i] = b0; oracle[i+1] = b1;

} }

Page 33: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 33

Implementation – Parallel External Sorting

shared memory: g++ STL parallel mode (parallel multiway mergesort)

(developed by Johannes Singler)

disk: STXXL by Roman Dementiev et al. overlapping, disk scheduling

distributed memory: MPI + fix 32 bit problems + inplace external

alltoallv

Page 34: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 34

Experiments

� sometimes a good surrogate for analysis

� too much rather than too little output data

� reproducibility (10 years!)

� software engineering

� reliable parallel running time measurements

Page 35: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 35

Example, Parallel External Sorting

0

1000

2000

3000

4000

1 2 4 8 16 32 64

sec

nodes

sort 100GiB per node

worst case inputworst case input, randomized

random inputrandom input, randomized

Page 36: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 36

Algorithm Libraries — Challenges

� software engineering , e.g. CGAL

� standardization, e.g. java.util, C++ STL and BOOST

� performance ↔ generality ↔ simplicity

� applications are a priori unknown

� result checking, verification

MC

ST

L

������������������������������������������������������������������������������������������������������������������������������������������������������

������������������������������������������������������������������������������������������������������������������������������������������������������

������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

�������������������������������������������������������������������������������������������������������������������������������������������������������������������������

�������������������������������������������������������������������������������������������������������������������������������������������������������������������������

�����������������������������������������������������������������������������������������������������������������������������������������������������������

�����������������������������������������������������������������������������������������������������������������������������������������������������������

TX

XL

S

files, I/O requests, disk queues,

block prefetcher, buffered block writer

completion handlers

Asynchronous I/O primitives layer

Block management layertyped block, block manager, buffered streams,

Containers:

STL−user layervector, stack, set

priority_queue, mapsort, for_each, merge

Pipelined sorting,zero−I/O scanning

Streaming layer

Algorithms:

Operating System

Applications

Page 37: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 37

Problem Instances

Benchmark instances for NP-hard problems

� TSP

� Steiner-Tree

� SAT

� set covering

� graph partitioning

� . . .

have proved essential for development of practical algorithms

Strange: much less real world instances for polynomial problems

(MST, shortest path, max flow, matching. . . )

Page 38: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 38

Example: Sorting Benchmark

100 byte records, 10 byte random keys, with file I/O

Category data volume performance improvement

GraySort 100 000 GB 564 GB / min 17×

MinuteSort 955 GB 955 GB / min > 10×

JouleSort 100 000 GB 3 400 Recs/Joule ???×

JouleSort 1 000 GB 17 500 Recs/Joule 5.1×

JouleSort 100 GB 39 800 Recs/Joule 3.4×

JouleSort 10 GB 43 500 Recs/Joule 5.7×

Also: PennySort

Page 39: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 39

GraySort: inplace multiway mergesort, exact splitting

16 GBRAM

240GB

Infiniband switch400 MB / s node all−all

XeonXeon

Page 40: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 40

JouleSort

� Intel Atom N330

� 4 GB RAM

� 4×256 GB

SSD (SuperTalent)

Algorithm similar to

GraySort

Page 41: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 41

Applications that “Change the World”

Algorithmics has the potential to SHAPE applications

(not just the other way round) [G. Myers]

Bioinformatics: sequencing, proteomics, phylogenetic trees,. . .

Information Retrieval: Searching, ranking,. . .

Traffic Planning: navigation, flow optimization,

adaptive toll, disruption management

Energy Grid: virtual powerplants (sun, wind, water, heat, negawatt),

disruption management,. . .

Communication Networks: mobile, cloud, selfish users,. . .

Page 42: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 42

Conclusion:

Algorithm Engineering ↔ Algorithm Theory

� algorithm engineering is a wider view on algorithmics

(but no revolution. None of the ingredients is really new)

� rich methodology

� better coupling to applications

� experimental algorithmics≪ algorithm engineering

� algorithm theory⊂ algorithm engineering

� sometimes different theoretical questions

� algorithm theory may still yield the strongest, deepest and most

persistent results within algorithm engineering

Page 43: Sanders: Algorithm Engineering – Parallel Sorting

Sanders: Algorithm Engineering – Parallel Sorting 43

Interactions with other (Sub)disciplines

models

design

implementation

librariesalgorithm−

perf.−guarantees

applicatio

ns

analysis experiments

realInputs

realistic

SE

OS

CompilersOSOR SE

ORC. Architecture

OR

SE