Loop fusion in Haskell
Roman Leshchinskiy
Programming Languages and SystemsUniversity of New South Wales
What is this about?
What I do
Data Parallel Haskell
compiles nested data-parallel programs to flat data-parallel ones
lots of arrays and collective operations involved
What other people do
array programs with lots of collective operations
What is this about?
What I do
Data Parallel Haskell
compiles nested data-parallel programs to flat data-parallel ones
lots of arrays and collective operations involved
What other people do
array programs with lots of collective operations
zipWith (-)(zipWith (*)
(zipWith (-) (replicate_s segd as1) xs)(zipWith (-) (replicate_s segd bs1) ys))
(zipWith (*)(zipWith (-) (replicate_s segd bs2) ys)(zipWith (-) (replicate_s segd as2) xs))
What is this about?
What I do
Data Parallel Haskell
compiles nested data-parallel programs to flat data-parallel ones
lots of arrays and collective operations involved
What other people do
array programs with lots of collective operations
return . foldl’ hash 5381. map toLower. filter isAlpha =<< readFile f
What is this about?
What I do
Data Parallel Haskell
compiles nested data-parallel programs to flat data-parallel ones
lots of arrays and collective operations involved
What other people do
array programs with lots of collective operations
What everybody wants
no temporary arrays
fused loops
C-like speed
Loop fusion is easy!
foo xs = map (*5) (map (+3) xs)
Loop fusion is easy!
foo xs = map (*5) (map (+3) xs)
RULES
"map/map" map f (map g xs) = map (f . g) xs
Loop fusion is easy!
foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)
RULES
"map/map" map f (map g xs) = map (f . g) xs
Loop fusion is easy!
foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)
RULES
"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs
Loop fusion is easy!
foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)baz zs = map (+1) (filter even zs)
RULES
"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs
Loop fusion is easy!
foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)baz zs = map (+1) (filter even zs)
RULES
"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs"map/filter" map f (filter g xs) = mapFilter f g xs
Loop fusion is easy!
foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)baz zs = map (+1) (filter even zs)
RULES
"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs"map/filter" map f (filter g xs) = mapFilter f g xs"map/mapFilter" map f (mapFilter g h xs)
= mapFilter (f . g) h xs"mapFilter/filter" mapFilter f g (filter h xs)
= mapFilter (f λ x → g x && h x) xs...
Loop fusion is easy!
foo xs = map (*5) (map (+3) xs)bar ys = filter even (filter (<42) ys)baz zs = map (+1) (filter even zs)
RULES
"map/map" map f (map g xs) = map (f . g) xs"filter/filter" filter f (filter g xs)
= filter (λ x → f x && g x) xs"map/filter" map f (filter g xs) = mapFilter f g xs"map/mapFilter" map f (mapFilter g h xs)
= mapFilter (f . g) h xs"mapFilter/filter" mapFilter f g (filter h xs)
= mapFilter (f λ x → g x && h x) xs... BAD
IDEA
The challenge
use a constant number of rewrite rules
don’t require new rules for new combinators
make adding new combinators easy
fuse everything!
don’t require specialised compiler support
handle both sequential and parallel loops
Sequential loops
Streams
data Step s a = Yield a s| Done
data Stream a = ∃s. Stream (s → Step s a) s
stepper produces next element and state from current state
similar to an iterator
actually encodes an anamorphism (unfold)
Streams
data Step s a = Yield a s| Done
data Stream a = ∃s. Stream (s → Step s a) s
stepper produces next element and state from current state
similar to an iterator
actually encodes an anamorphism (unfold)
stepper
state
Streams
data Step s a = Yield a s| Done
data Stream a = ∃s. Stream (s → Step s a) s
stepper produces next element and state from current state
similar to an iterator
actually encodes an anamorphism (unfold)
sumS :: Num a ⇒ Stream a → asumS (Stream step s) = go 0 swhere go z s = case step s of
Yield x s’ → go (z+x) s’Done → z
Streams
data Step s a = Yield a s| Done
data Stream a = ∃s. Stream (s → Step s a) s
stepper produces next element and state from current state
similar to an iterator
actually encodes an anamorphism (unfold)
stream :: Array a → Stream astream arr = Stream step 0where step i | i < length arr = Yield (arr ! i) (i+1)
| otherwise = Done
Streams
data Step s a = Yield a s| Done
data Stream a = ∃s. Stream (s → Step s a) s
stepper produces next element and state from current state
similar to an iterator
actually encodes an anamorphism (unfold)
mapS :: (a → b) → Stream a → Stream bmapS f (Stream step s) = Stream step’ swhere step’ s = case step s of
Yield x s’ → Yield (f x) s’Done → Done
Streams
data Step s a = Yield a s| Done
data Stream a = ∃s. Stream (s → Step s a) s
stepper produces next element and state from current state
similar to an iterator
actually encodes an anamorphism (unfold)
unstream :: Stream a → Array aunstream (Stream step s) = <allocate, fill and freeze>
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams
sum :: Num a ⇒ Array a → asum = sumS . stream
map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream
Step 2: inline them
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams
sum :: Num a ⇒ Array a → asum = sumS . stream
map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream
Step 2: inline them
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams
sum :: Num a ⇒ Array a → asum = sumS . stream
map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream
Step 2: inline them
sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams
sum :: Num a ⇒ Array a → asum = sumS . stream
map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream
Step 2: inline them
sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)
= sumS . stream . unstream . mapS f . stream
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams
sum :: Num a ⇒ Array a → asum = sumS . stream
map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream
Step 2: inline them
sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)
= sumS . stream . unstream . mapS f . stream
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams
sum :: Num a ⇒ Array a → asum = sumS . stream
map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream
Step 2: inline them
sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)
= sumS . stream . unstream . mapS f . stream
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams
sum :: Num a ⇒ Array a → asum = sumS . stream
map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream
Step 2: inline them
sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)
= sumS . mapS f . stream
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
Stream fusion in three easy steps
Step 1: implement array operations in terms of streams
sum :: Num a ⇒ Array a → asum = sumS . stream
map :: (a → b) → Array a → Array bmap f = unstream . mapS f . stream
Step 2: inline them
sumsq :: Num a ⇒ Array a → asumsq = sum . map (λx -> x*x)
= sumS . mapS f . stream
Step 3: eliminate stream/unstream pairs
"stream/unstream" stream (unstream s) = s
LetGHC
doth
ere
st
Optimising stream operations
sumsq xs = sumS (mapS square ( stream xs))
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Optimising stream operations
sumsq xs = sumS (mapS square ( stream xs))
stream :: Array a → Stream astream arr = Stream step 0where step i | i < length arr = Yield (arr ! i) (i+1)
| otherwise = Done
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
inline
Optimising stream operations
sumsq xs = sumS ( mapS square (Stream step1 0))wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Optimising stream operations
sumsq xs = sumS ( mapS square (Stream step1 0))wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
mapS :: (a → b) → Stream a → bmapS f (Stream step s) = Stream step’ swhere step’ s = case step s of
Yield x s’ → Yield (f x) s’Done → Done
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
inline
Optimising stream operations
sumsq xs = sumS (Stream step2 0)wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Optimising stream operations
sumsq xs = sumS (Stream step2 0)wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done
sumS :: Num a ⇒ Stream a → asumS (Stream step s) = go 0 swhere go z s = case step s of
Yield x s’ → go (z+x) s’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
inline
Optimising stream operations
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done
go z i = case step2 i of
Yield x i’ → go (z+x) i’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Optimising stream operations
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done
go z i = case step2 i of
Yield x i’ → go (z+x) i’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
inline
Optimising stream operations
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
go z i = case (case step1 i ofYield x i’ → Yield (square x) i’Done → Done) of
Yield x i’ → go (z+x) i’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Optimising stream operations
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
go z i = case (case step1 i of
Yield x i’ → Yield (square x) i’
Done → Done) ofYield x i’ → go (z+x) i’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
case of case
Optimising stream operations
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
go z i = case step1 i of
Yield x i’ → go (z + square x) i’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Optimising stream operations
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
go z i = case step1 i of
Yield x i’ → go (z + square x) i’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
inline
Optimising stream operations
sumsq xs = go 0 0wherego z i = case (case i < length xs of
True → Yield (xs ! i) (i+1)False → Done) of
Yield x i’ → go (z + square x) i’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Optimising stream operations
sumsq xs = go 0 0wherego z i = case (case i < length xs of
True → Yield (xs ! i) (i+1)False → Done) of
Yield x i’ → go (z + square x) i’Done → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
case of case
Optimising stream operations
sumsq xs = go 0 0wherego z i = case i < length xs of
True → go (z + square (xs ! i)) (i+1)False → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Optimising stream operations
sumsq xs = go 0 0wherego z i = case i < length xs of
True → go (z + square (xs ! i)) (i+1)False → z
optimal loop
no Stream or Step values ever created
only general-purpose optimisations
will be optimised further (unboxing etc.)
requires a great compiler (thanks GHC team!)
Why does it work?
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done
go z i = case step2 i of
Yield x i’ → go (z+x) i’Done → z
Why does it work?
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done
go z i = case step2 i of
Yield x i’ → go (z+x) i’Done → z
non-recursive
Why does it work?
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done
go z i = case step2 i of
Yield x i’ → go (z+x) i’Done → z
non-recursive
non-recursive
Why does it work?
sumsq xs = go 0 0wherestep1 i = case i < length xs of
True → Yield (xs ! i) (i+1)False → Done
step2 i = case step1 i ofYield x i’ → Yield (square x) i’Done → Done
go z i = case step2 i of
Yield x i’ → go (z+x) i’Done → z
non-recursive
non-recursive
recursive
A slight problem
filterS :: (a → Bool) → Stream a → Stream afilterS f (Stream step s) = Stream step’ swherestep’ s = case step s of
Yield x s’| f x → Yield x s’| otherwise → step s’
Done → Done
A slight problem
filterS :: (a → Bool) → Stream a → Stream afilterS f (Stream step s) = Stream step’ swherestep’ s = case step s of
Yield x s’| f x → Yield x s’| otherwise → step s’
Done → Done
recursive
Extending streams
Idea: allow a loop iteration not to produce an element
data Step s a = Yield a s| Skip s| Done
Extending streams
Idea: allow a loop iteration not to produce an element
data Step s a = Yield a s| Skip s| Done
Extending streams
Idea: allow a loop iteration not to produce an element
data Step s a = Yield a s| Skip s| Done
filterS :: (a → Bool) → Stream a → Stream afilterS f (Stream step s) = Stream step’ swherestep’ s = case step s of
Yield x s’| f x → Yield x s’| otherwise → Skip s’
Skip s’ → Skip s’Done → Done
Extending streams
Idea: allow a loop iteration not to produce an element
data Step s a = Yield a s| Skip s| Done
filterS :: (a → Bool) → Stream a → Stream afilterS f (Stream step s) = Stream step’ swherestep’ s = case step s of
Yield x s’| f x → Yield x s’| otherwise → Skip s’
Skip s’ → Skip s’Done → Done
non-recursive
Stream fusion - summary
encode loops by streams
implement array operations in terms of streams
eliminate stream/unstream pairs (temporaries)
stream producers are non-recursive
standard optimisations remove overhead (loop fusion)
Standard optimisations: inlining, case-of-case, worker/wrappertransformation, SpecConstr, LiberateCase, specialisation ...
Stream fusion - summary
encode loops by streams
implement array operations in terms of streams
eliminate stream/unstream pairs (temporaries)
stream producers are non-recursive
standard optimisations remove overhead (loop fusion)
Standard optimisations: inlining, case-of-case, worker/wrappertransformation, SpecConstr, LiberateCase, specialisation ...
Parallel loops
DPH on multicores
Evaluation strategy after vectorisation
operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
DPH on multicores
Evaluation strategy after vectorisation
operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
mapP :: (a → b) → Array a → Array bmapP f xs = <split xs across workers>
<map f over each chunk><collect local results>
DPH on multicores
Evaluation strategy after vectorisation
operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
mapP :: (a → b) → Array a → Array bmapP f xs = <split xs across workers>
<map f over each chunk><collect local results>
f is sequential
DPH on multicores
Evaluation strategy after vectorisation
operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
mapP :: (a → b) → Array a → Array bmapP f xs = <split xs across workers>
<map f over each chunk><collect local results>
sumP :: Num a ⇒ Array a → asumP xs = <split xs across workers>
<sum each chunk><reduce local sums>
DPH on multicores
Evaluation strategy after vectorisation
operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
sumsqP = sumP . mapP square
DPH on multicores
Evaluation strategy after vectorisation
operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
sumsqP xs = <split xs across workers><map square over each chunk><collect local results><split results across workers><sum each chunk><reduce local sums>
DPH on multicores
Evaluation strategy after vectorisation
operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
sumsqP xs = <split xs across workers><map square over each chunk><collect local results><split results across workers><sum each chunk><reduce local sums>
DPH on multicores
Evaluation strategy after vectorisation
operations are data parallel and flat
executed by a gang of worker threads
essentially fork-join parallelism
sumsqP xs = <split xs across workers><map square over each chunk><collect local results><split results across workers><sum each chunk><reduce local sums>
Distributed types
Idea: let’s make the evaluation strategy explicit! (Keller 1999)
data Dist a a is distributed across threadsDist (Array a) each thread has a local array (chunk)Dist Double each thread has a local Double
splitD distribute an array across threadsjoinD collect thread-local chunks
mapD execute a sequential operation in each threadsumD compute sum of local values
Distributed types
Idea: let’s make the evaluation strategy explicit! (Keller 1999)
data Dist a a is distributed across threadsDist (Array a) each thread has a local array (chunk)Dist Double each thread has a local Double
splitD distribute an array across threadsjoinD collect thread-local chunks
mapD execute a sequential operation in each threadsumD compute sum of local values
Distributed types
Idea: let’s make the evaluation strategy explicit! (Keller 1999)
data Dist a a is distributed across threadsDist (Array a) each thread has a local array (chunk)Dist Double each thread has a local Double
splitD distribute an array across threadsjoinD collect thread-local chunks
mapD execute a sequential operation in each threadsumD compute sum of local values
splitD :: Array a → Dist (Array a)joinD :: Dist (Array a) → Array a
Distributed types
Idea: let’s make the evaluation strategy explicit! (Keller 1999)
data Dist a a is distributed across threadsDist (Array a) each thread has a local array (chunk)Dist Double each thread has a local Double
splitD distribute an array across threadsjoinD collect thread-local chunks
mapD execute a sequential operation in each threadsumD compute sum of local values
splitD :: Array a → Dist (Array a)joinD :: Dist (Array a) → Array amapD :: (a → b) → Dist a → Dist bsumD :: Num a ⇒ Dist a → a
Programming with distributed types
mapP f xs = <split xs across workers><map f over each chunk><collect local results>
Programming with distributed types
mapP f = joinD -- collect. mapD (map f) -- map f over chunks. splitD -- split
Programming with distributed types
mapP f = joinD -- collect. mapD (map f) -- map f over chunks. splitD -- split
sumP xs = <split xs across workers><sum each chunk><reduce local sums>
Programming with distributed types
mapP f = joinD -- collect. mapD (map f) -- map f over chunks. splitD -- split
sumP = sumD -- reduce. mapD sum -- sum each chunk. splitD -- split
Fusing distributed types
sumsqP = sumP . mapP square
Fusing distributed types
sumsqP = sumD -- reduce. mapD sum -- sum each chunk. splitD -- split. joinD -- collect. mapD (map square) -- map square over chunks. splitD -- split
Fusing distributed types
sumsqP = sumD -- reduce. mapD sum -- sum each chunk. splitD -- split. joinD -- collect. mapD (map square) -- map square over chunks. splitD -- split
RULES
splitD (joinD xs) = xs
Fusing distributed types
sumsqP = sumD -- reduce. mapD sum -- sum each chunk. mapD (map square) -- map square over chunks. splitD -- split
RULES
splitD (joinD xs) = xs
Fusing distributed types
sumsqP = sumD -- reduce. mapD sum -- sum each chunk. mapD (map square) -- map square over chunks. splitD -- split
RULES
splitD (joinD xs) = xsmapD f (mapD g xs) = mapD (f . g) xs
Fusing distributed types
sumsqP = sumD -- reduce
. mapD (sum . map square) -- work
. splitD -- split
RULES
splitD (joinD xs) = xsmapD f (mapD g xs) = mapD (f . g) xs
Fusing distributed types
sumsqP = sumD -- reduce
. mapD (sum . map square) -- work
. splitD -- split
RULES
splitD (joinD xs) = xsmapD f (mapD g xs) = mapD (f . g) xs
stream fusion
Distributed types on multicores
data Dist a a is distributed across threads
splitD distribute xs across threadsjoinD collect thread-local chunksmapD execute a sequential operation in each thread
splitD/joinD eliminate communicationmapD/mapD eliminate synchronisation
Distributed types on clusters
data Dist a a is distributed across nodes
splitD scatterjoinD gathermapD execute operation on each node
splitD/joinD eliminate communicationmapD/mapD eliminate synchronisation
Distributed types on GPUs
data Dist a a is in GPU memory
splitD CPU −→ GPU transferjoinD GPU −→ CPU transfermapD execute kernel on the GPU
splitD/joinD eliminate memory transfers (communication)mapD/mapD fuse kernels (synchronisation)
Distribured types – summary
encode parallel loops as split/work/join
eliminate unnecessary split/join pairs
fuse sequential work (stream fusion)
very general mechanism for fusing parallel computations
applicable to a wide range of architectures
again, no specialised compiler support
Obligatory benchmark1
10
100
1000
1 2 4 8
sumsq, Haskell sumsq, C dotp, Haskelldotp, C smvm, Haskell smvm, C
1
10
100
1000
10000
1 2 4 8 16 32 64
Runtime @ greyarea
sumsq, Haskell sumsq, C dotp, Haskelldotp, C smvm, Haskell smvm, C
Parting thoughts
it’s nice, it’s easy to use, it works
high-level functional programs compiled to highly efficient code
even parallel ones!
rewrite rules + great optimiser = win
DPH doesn’t require any special-purpose optimisations
try this in an imperative language...
Stream fusion: dph, bytestring, vector, uvector
Distributed types: dph
Parting thoughts
it’s nice, it’s easy to use, it works
high-level functional programs compiled to highly efficient code
even parallel ones!
rewrite rules + great optimiser = win
DPH doesn’t require any special-purpose optimisations
try this in an imperative language...
Stream fusion: dph, bytestring, vector, uvector
Distributed types: dph
don’t
Parting thoughts
it’s nice, it’s easy to use, it works
high-level functional programs compiled to highly efficient code
even parallel ones!
rewrite rules + great optimiser = win
DPH doesn’t require any special-purpose optimisations
try this in an imperative language...
Stream fusion: dph, bytestring, vector, uvector
Distributed types: dph
don’t