Transcript
Page 1: 1/21 Lattice Boltzmann for Blood Flow: A Software Engineering Approach for a DataFlow SuperComputer Nenad Korolija, nenadko@etf.rs Tijana Djukic, tijana@kg.ac.rs

Lattice Boltzmann for Blood Flow: A Software Engineering Approach for a DataFlow SuperComputer

Nenad Korolija, nenadko@etf.rs

Tijana Djukic, tijana@kg.ac.rs

Nenad Filipovic, [email protected]

Milutinovic, [email protected]

Page 2:

MyWork in a NutShell

1. Introduction: Synergy of Physics and Logics

2. Problem: Moving LB to Maxeler

3. Existing Solutions: None :)

4. Essence: Map+Opt(PACT)

5. Details: MyPhD

6. Analysis: BaU

7. Conclusions: 1000 (SPC)

Page 3:

Cooperation between BioIRC, UniKG, and the School of Electrical Engineering, UniBG

Page 4:

Lattice Boltzmann for Blood Flow: A Software Engineering Approach

Expensive

Quiet

Fast

Electrical

20m cord

Environment-friendly

Big-pack

Wide-track

Easy handling

Repair manual

Repair kit

5Y warranty

Service in your town

New-technology high-quality non-rusting heavy-duty precise-cutting recyclable blades streaming grass only to bag ...

Page 5:

Lattice Boltzmann for Blood Flow: A Software Engineering Approach

Expensive

Quiet

Electrical

20m cord

Environment-friendly

Big-pack

Wide-track

Easy handling

Repair manual

Repair kit

5Y warranty

Service in your town

New-technology high-quality non-rusting heavy-duty precise-cutting recyclable blades streaming grass only to bag ...

Page 6:

Lattice Boltzmann for Blood Flow: A Software Engineering Approach

Page 7:

Structure of the Existing C Code for a MultiCore Computer

LS1 LS2 LS3 LS4 LS5

Statically: P / T = 100 / 400 = 25% => Only 100 lines to “kernelize”

Dynamically: P / T = 99% => Potential speed-up factor is at most 100

LS – Looping structure

LS1 and LS5 – Nested loops

LS2, LS3, and LS4 – Simple loops

P – lines to parallelize

T – total number of lines
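
The bound of 100 follows from Amdahl's law: if a fraction p = 0.99 of the run time is spent in the looping structures and only that part is accelerated by a factor s, the overall speed-up is 1 / ((1 - p) + p / s), which approaches 1 / (1 - p) = 100 as s grows. A minimal sketch in plain Java (the class and method names are illustrative and not taken from the original code):

```java
// Amdahl's law for the 99% / 1% split quoted on this slide.
// Class and method names are illustrative assumptions.
class AmdahlSketch {
    /** Overall speed-up when a fraction p of the run time is accelerated s times. */
    static double speedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    /** Upper bound on the speed-up as s grows without limit. */
    static double limit(double p) {
        return 1.0 / (1.0 - p);
    }

    public static void main(String[] args) {
        double p = 0.99; // dynamic share of run time spent in the looping structures
        for (double s : new double[] {10, 100, 1000, 1e6}) {
            System.out.printf("s = %.0f -> overall speed-up %.2f%n", s, speedup(p, s));
        }
        System.out.printf("limit for s -> infinity: %.1f%n", limit(p));
    }
}
```

Even a 100-fold acceleration of the loops yields only about half of the theoretical bound, because the remaining 1% of serial code then weighs as much as the accelerated part.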

Page 8:

What Looping Structures to “Kernelize”

All, because we would like all data to reside on the MAX3 prior to the start of execution

[Figure: six MAX/CPU pairs]

Page 9:

What Looping Structures Bring What Benefits?

LS1: moderate

LS2, LS3, LS4: negligible, but must “kernelize”

LS5: major

[Figure: the same loop body, OP1 … OPk, executed on the MAX and on the CPU]

MAX (DFE doing k operations): FOR i = 1, 2, 3, 4, 5, …, k, …, n DO. Iterations start at T0, T1, T2, T3, T4, …; results arrive at Tk, Tk+1, Tk+2, … => 1 result/clock.

CPU (doing only one): FOR i = 1, 2, 3, 4, 5, …, n DO. Iterations start at T0, Tk, T2k, T3k, …; results arrive at Tk, T2k, T3k, T4k, … => 1 result/(k*clock).
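
The contrast drawn on this slide (one result per clock on the MAX, one result per k clocks on the CPU) can be put into numbers with an idealized cycle-count model. The sketch below assumes a loop body of k operations that each take one clock, no stalls, and a pipeline that delivers its first result after k clocks; the class and method names are illustrative.

```java
// Idealized cycle counts for a loop of n iterations with k one-clock operations each.
class PipelineSketch {
    /** CPU-style execution: an iteration starts only after the previous one has finished. */
    static long sequentialClocks(long n, long k) {
        return n * k;
    }

    /** Dataflow-style execution: a new iteration enters every clock; the first result appears after k clocks. */
    static long pipelinedClocks(long n, long k) {
        return k + (n - 1);
    }

    /** Ratio of the two clock counts. */
    static double speedup(long n, long k) {
        return (double) sequentialClocks(n, k) / pipelinedClocks(n, k);
    }

    public static void main(String[] args) {
        long k = 50;
        for (long n : new long[] {10, 1_000, 1_000_000}) {
            System.out.printf("n = %d: CPU %d clocks, pipeline %d clocks, ratio %.2f%n",
                    n, sequentialClocks(n, k), pipelinedClocks(n, k), speedup(n, k));
        }
    }
}
```

For n much larger than k the ratio approaches k, the number of operations in the loop body; for short loops the pipeline fill time dominates and the gain is small.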

Page 10:

Why “Kernelizing” the Looping Structures?

Conditions for “Kernelizing” Revisited

Why?                                  LS1     LS2/3/4   LS5
1. BigData                            O(n^2)  O(n^2)    O(n^2)
2. WORM                               +       +         +
3. Tolerance to latency               +       +         +
4. Over 95% of run time in loops      ++      ++        ++
5. Reusability of the data            ++      ++        ++
6. Skills                             +       +         ++

Page 11:

Programming: Iteration #1 What to do with LS1..5?

Direct MultiCore Data Choreography

1, 2, 3, 4, ...

Direct MultiCore Algorithm Execution

∑∑ + ∑ + ∑ + ∑ + ∑∑

Direct MultiCore Computational Precision: Double Precision Floating Point (64 bits)

Page 12:

Programming: Iteration #1 Potentials of Direct “Kernelization”

Amdahl's Law: lim (DFE potential → ∞) = 100

Reality estimate: lim (work → 30.6.2013) = N

[Figure: three run-time bars, split 99% / 1%, 0% / 1%, and x% / 1%]

Page 13:

Pipelining the Inner Loops

[Figure: lattice with axes i and j (tick labels: 0, 3200 112); a Kernel with inputs and output; blocks labeled Kernel(s) Stream, MiddleFunctions Kernels, Kernel(s) Collide, and Manager]

Page 14:

The Kernel for LS1: Direct Migration

public class LS1Kernel extends Kernel {
    public LS1Kernel(KernelParameters parameters) {
        super(parameters);

        // Input
        HWVar f1new = io.scalarInput("f1new", hwFloat(8, 24));
        HWVar f5new = io.scalarInput("f5new", hwFloat(8, 24));
        HWVar f8new = io.scalarInput("f8new", hwFloat(8, 24));

        HWVar f1  = io.input("f1",  hwFloat(8, 24)); // j
        HWVar f2m = io.input("f2m", hwFloat(8, 24)); // j-1
        HWVar f3  = io.input("f3",  hwFloat(8, 24)); // j
        HWVar f4p = io.input("f4p", hwFloat(8, 24)); // j+1
        HWVar f5m = io.input("f5m", hwFloat(8, 24)); // j-1
        HWVar f6m = io.input("f6m", hwFloat(8, 24)); // j-1
        HWVar f7p = io.input("f7p", hwFloat(8, 24)); // j+1
        HWVar f8p = io.input("f8p", hwFloat(8, 24)); // j+1

        // ... (the slide shows only the start of the kernel)
    }
}

Page 15:

The Kernel for LS5: Direct Migration

// Do the summations needed to evaluate the density and components of velocity
HWVar ro = f0 + f1 + f2 + f3 + f4 + f5 + f6 + f7 + f8;
HWVar rovx = f1 - f3 + f5 - f6 - f7 + f8;
HWVar rovy = f2 - f4 + f5 + f6 - f7 - f8;
HWVar vx = rovx/ro;
HWVar vy = rovy/ro;

// Also load the velocity magnitude into plotvar - this is what we will
// display using OpenGL later
HWVar v2x = vx * vx;
HWVar v2y = vy * vy;
HWVar plotvar = KernelMath.sqrt(v2x + v2y);
HWVar v_sq_term = 1.5f*(v2x + v2y);

// Evaluate the local equilibrium f values in all directions
HWVar vxmvy = vx - vy;
HWVar vxpvy = vx + vy;
HWVar rortau = ro * rtau;
HWVar rortaufaceq2 = rortau * faceq2;
HWVar rortaufaceq3 = rortau * faceq3;
HWVar vxpvyp3 = 3.f*vxpvy;
HWVar vxmvyp3 = 3.f*vxmvy;
HWVar vxp3 = 3.f*vx;
HWVar vyp3 = 3.f*vy;
HWVar v2xp45 = 4.5f*v2x;
HWVar v2yp45 = 4.5f*v2y;
HWVar mv_sq_term = 1.f - v_sq_term;
HWVar mv_sq_termpv2xp45 = mv_sq_term + v2xp45;
HWVar mv_sq_termpv2yp45 = mv_sq_term + v2yp45;
HWVar vxpvyp45vxpvy = 4.5f*vxpvy*vxpvy;
HWVar vxmvyp45vxmvy = 4.5f*vxmvy*vxmvy;
HWVar mv_sq_termpvxpvyp45vxpvy = mv_sq_term + vxpvyp45vxpvy;
// The 4.5*(e.u)^2 term is added for every direction, including the vx - vy diagonals
HWVar mv_sq_termpvxmvyp45vxmvy = mv_sq_term + vxmvyp45vxmvy;

HWVar f0eq = rortau * faceq1 * mv_sq_term;
HWVar f1eq = rortaufaceq2 * (mv_sq_termpv2xp45 + vxp3);
HWVar f2eq = rortaufaceq2 * (mv_sq_termpv2yp45 + vyp3);
HWVar f3eq = rortaufaceq2 * (mv_sq_termpv2xp45 - vxp3);
HWVar f4eq = rortaufaceq2 * (mv_sq_termpv2yp45 - vyp3);
HWVar f5eq = rortaufaceq3 * (mv_sq_termpvxpvyp45vxpvy + vxpvyp3);
HWVar f6eq = rortaufaceq3 * (mv_sq_termpvxmvyp45vxmvy - vxmvyp3);
HWVar f7eq = rortaufaceq3 * (mv_sq_termpvxpvyp45vxpvy - vxpvyp3);
HWVar f8eq = rortaufaceq3 * (mv_sq_termpvxmvyp45vxmvy + vxmvyp3);
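
As a cross-check of the arithmetic in this kernel, the same moments and equilibrium values can be computed in plain Java for a single lattice node. This is a sketch, not the authors' code: it assumes the standard D2Q9 weights 4/9, 1/9, and 1/36 for faceq1, faceq2, and faceq3 (the slide does not show their values), it uses the direction numbering implied by the rovx and rovy sums, and it leaves out the rtau factor that the kernel folds into the equilibrium. All class and method names are illustrative.

```java
// Plain-Java sketch of the D2Q9 moments and equilibrium for one lattice node.
// Weights and class/method names are assumptions; the slide does not show them.
class EquilibriumSketch {
    static final double FACEQ1 = 4.0 / 9.0;   // rest particle (direction 0)
    static final double FACEQ2 = 1.0 / 9.0;   // axis directions 1..4
    static final double FACEQ3 = 1.0 / 36.0;  // diagonal directions 5..8

    // Direction vectors matching rovx = f1 - f3 + f5 - f6 - f7 + f8
    // and rovy = f2 - f4 + f5 + f6 - f7 - f8.
    static final int[] EX = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static final int[] EY = {0, 0, 1, 0, -1, 1, 1, -1, -1};

    /** Returns {ro, vx, vy} computed from the nine distribution values. */
    static double[] moments(double[] f) {
        double ro = 0, rovx = 0, rovy = 0;
        for (int d = 0; d < 9; d++) {
            ro += f[d];
            rovx += EX[d] * f[d];
            rovy += EY[d] * f[d];
        }
        return new double[] {ro, rovx / ro, rovy / ro};
    }

    /** Equilibrium: w * ro * (1 + 3 e.u + 4.5 (e.u)^2 - 1.5 u^2) for each direction. */
    static double[] equilibrium(double ro, double vx, double vy) {
        double vSqTerm = 1.5 * (vx * vx + vy * vy);
        double[] feq = new double[9];
        for (int d = 0; d < 9; d++) {
            double w = (d == 0) ? FACEQ1 : (d <= 4 ? FACEQ2 : FACEQ3);
            double eu = EX[d] * vx + EY[d] * vy;
            feq[d] = w * ro * (1.0 + 3.0 * eu + 4.5 * eu * eu - vSqTerm);
        }
        return feq;
    }

    public static void main(String[] args) {
        double[] feq = equilibrium(1.0, 0.05, -0.02);
        double[] m = moments(feq);
        // The equilibrium should reproduce the density and velocity it was built from.
        System.out.printf("ro = %.6f, vx = %.6f, vy = %.6f%n", m[0], m[1], m[2]);
    }
}
```

Feeding the equilibrium back into the moment sums and recovering the original density and velocity is a quick way to catch a sign error in any of the nine directions.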

Page 16:

Programming: Iteration #2 Ideas for Additional Speedup (a)

Better Data Choreography

5x x 5x

Estimate: 1.2x speed-up (as seen from the drawing above)

Page 17:

Programming: Iteration #3 Ideas for Additional Speedup (b)

Algorithmic Changes: ∑∑ + ∑ + ∑ + ∑ + ∑∑ → ∑∑ + ∑ + ∑∑

Explanation: As seen from the previous drawing, LS2 and LS3 can be integrated with LS1

Estimate: 1.6x speed-up
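
Integrating LS2 and LS3 into LS1 is a form of loop fusion: work that used to run in separate simple loops is moved into the body of the nested loop, so the data is traversed once instead of three times. The slides do not show what the three loops compute, so the sketch below uses placeholder operations (a doubling for LS1 and edge increments for LS2 and LS3) purely to illustrate the transformation; all names are hypothetical.

```java
import java.util.Arrays;

// Loop fusion with placeholder loop bodies; not the actual LS1..LS3 computations.
class FusionSketch {
    /** Original structure: one nested loop followed by two simple loops. */
    static double[][] separate(double[][] in) {
        int ni = in.length, nj = in[0].length;
        double[][] out = new double[ni][nj];
        for (int i = 0; i < ni; i++) {          // LS1 (nested): placeholder update
            for (int j = 0; j < nj; j++) {
                out[i][j] = 2.0 * in[i][j];
            }
        }
        for (int j = 0; j < nj; j++) {          // LS2 (simple): placeholder for the i = 0 edge
            out[0][j] += 1.0;
        }
        for (int j = 0; j < nj; j++) {          // LS3 (simple): placeholder for the i = ni - 1 edge
            out[ni - 1][j] += 1.0;
        }
        return out;
    }

    /** Fused structure: the edge handling is folded into the nested loop. */
    static double[][] fused(double[][] in) {
        int ni = in.length, nj = in[0].length;
        double[][] out = new double[ni][nj];
        for (int i = 0; i < ni; i++) {
            for (int j = 0; j < nj; j++) {
                double v = 2.0 * in[i][j];
                if (i == 0) v += 1.0;           // former LS2
                if (i == ni - 1) v += 1.0;      // former LS3
                out[i][j] = v;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] in = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        // Both structures must produce identical grids.
        System.out.println(Arrays.deepEquals(separate(in), fused(in)));
    }
}
```

The fused version is only valid when the simple loops depend on nothing but values the nested loop has already produced for the same element, which is the property the slide's drawing is cited for.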

Page 18:

Programming: Iteration #4 Ideas for Additional Speedup (c)

Precision Changes:

LUT (Double-precision floating point, 64) = 500

LUT (Maxeler-precision floating point, 24) = 24

Explanation: With less precision, hardware complexity can be reduced by a factor of about 20. Increasing the number of iterations 4 times brings approximately similar precision, much faster.

Estimate: Factor = (500/24)/4 ≈ 5

This is the only action before which a topic expert has to be consulted!
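
What the reduced precision costs numerically can be gauged on an ordinary CPU before touching the kernel. The sketch below compares Java's float, whose 24-bit significand corresponds to hwFloat(8, 24), with double for the density sum of nine distribution values. The sample values and names are made up for illustration, and the result says nothing about the accuracy of the full simulation.

```java
// Single vs. double precision for the nine-term density sum (illustrative values only).
class PrecisionSketch {
    /** Density accumulated in double precision (53-bit significand). */
    static double densityDouble(double[] f) {
        double ro = 0.0;
        for (double v : f) ro += v;
        return ro;
    }

    /** Density accumulated in single precision (24-bit significand). */
    static float densityFloat(double[] f) {
        float ro = 0.0f;
        for (double v : f) ro += (float) v;
        return ro;
    }

    /** Relative difference between the single- and double-precision sums. */
    static double relativeError(double[] f) {
        double reference = densityDouble(f);
        return Math.abs(densityFloat(f) - reference) / reference;
    }

    public static void main(String[] args) {
        double[] f = {0.4444, 0.1111, 0.1112, 0.1110, 0.1109, 0.0278, 0.0277, 0.0279, 0.0280};
        System.out.println("float spacing at 1.0:  " + Math.ulp(1.0f));
        System.out.println("double spacing at 1.0: " + Math.ulp(1.0));
        System.out.println("relative error of float density: " + relativeError(f));
    }
}
```

A check like this covers a single node and a single step; whether rounding errors accumulate over many time steps is the question for the topic expert.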

Page 19:

Lattice Boltzmann

http://www.youtube.com/watch?v=vXpCC3q0tXQ

Page 20:

Results: SPTC ≈ 1000x

“Maxeler’s technology enables organizations to speed up processing times by 20-50x, with over 90% reduction in energy usage and over 95% reduction in data centre space”.

Speedup factor: 1.2 x 1.6 x 5 x N ≈ 10N
- Precisely: 30.6.2013.

Power reduction factor (i7/MAX3) = 17.6 / (MAX2 / MAX3) ≈ 10
- Precisely: the WallCord method

Transistor count reduction factor = i7 / MAX3
- Precisely: about 20

Cost reduction factor: x
- Precisely: depends on production volumes
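
The speed-up estimates from the individual iterations combine multiplicatively. A short sketch of the arithmetic, using only the numbers quoted on the slides; N, the factor from direct kernelization, is left as a parameter because the slides give no value for it, and the names are illustrative:

```java
// Combines the per-iteration speed-up estimates quoted on the slides.
class EstimateSketch {
    static final double CHOREOGRAPHY = 1.2;  // Iteration #2: better data choreography
    static final double FUSION = 1.6;        // Iteration #3: LS2 and LS3 integrated with LS1

    /** Iteration #4: LUT ratio divided by the factor of extra iterations. */
    static double precisionFactor(double lutsDouble, double lutsReduced, double extraIterations) {
        return (lutsDouble / lutsReduced) / extraIterations;
    }

    /** Overall speed-up for a direct-kernelization factor n and a given precision factor. */
    static double total(double n, double precision) {
        return CHOREOGRAPHY * FUSION * precision * n;
    }

    public static void main(String[] args) {
        double unrounded = precisionFactor(500, 24, 4); // the slide rounds this to 5
        System.out.printf("precision factor: %.2f%n", unrounded);
        System.out.printf("total with factor 5:         %.2f x N%n", total(1.0, 5.0));
        System.out.printf("total with unrounded factor: %.2f x N%n", total(1.0, unrounded));
    }
}
```

With the rounded factor of 5 the product is 9.6 N, and with the unrounded precision factor it is 10 N, so the slide's figure of about 10N holds either way.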

Page 21:

Q&A: [email protected]

[Figure labels: Hawaii, Tahiti; 10 km/h !, 30 km/h !!!]