

Korali: a High-Performance Multi-Intrusive Bayesian Inference Software for Large-Scale Scientific Models

G. Arampatzis, S. Martin, D. Wälchli, and P. Koumoutsakos

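[Figure: hierarchical Bayesian model. Parameter nodes $\vartheta_1, \dots, \vartheta_7$ and $\vartheta_{\mathrm{new}}$. Data nodes $y^{\mathrm{ax}}_i$, $y^{\mathrm{tr}}_i$, $y^{\mathrm{ax,new}}_i$, $y^{\mathrm{tr,new}}_i$ for $i = 1, 2$; $y^{\mathrm{sh}}_i$, $y^{\mathrm{sh,new}}_i$ for $i = 3, \dots, 7$; and $y_{\mathrm{new}}$. Labels: stretching, shear.]

$$\vartheta_i = \left( Q_{1,i},\, Q_{2,i},\, \mu_{0,i},\, \sigma_{\mathrm{st},i} \right), \quad i = 1, 2$$

$$\vartheta_i = \left( Q_{1,i},\, Q_{2,i},\, \mu_{0,i},\, Q_{3,i},\, Q_{4,i},\, \sigma_{\mathrm{sh},i} \right), \quad i = 3, \dots, 7$$

$$\vartheta_{\mathrm{new}} = \left( Q_1,\, Q_2,\, \mu_0,\, Q_3,\, Q_4,\, \sigma_{\mathrm{sh}},\, \sigma_{\mathrm{st}} \right)$$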

[Figure: two panels with points numbered 1 to 20.]

• Optimal Sensor Placement

• Hierarchical Bayesian Inference

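$$p(\vartheta_1 \mid d_1, \mathcal{M}_1)$$

$$U(s) = \int_{\mathcal{Y}} \int_{\mathcal{R}} \log \frac{p(y \mid r, s)}{p(y \mid s)} \, p(r) \, p(y \mid r, s) \, dr \, dy$$

$$s^\star = \arg\max_s \hat{U}(s)$$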
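The expected utility U(s) is an expected log ratio of likelihood to evidence, so it can be approximated by nested Monte Carlo sampling. The sketch below is only an illustration under stated assumptions, not the poster's sensor-placement model: the linear model y = s·r + noise, the standard normal prior on r, the noise level, and the names `log_normal_pdf`, `estimate_utility`, and `candidates` are all hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, sigma):
    """Log density of Normal(mean, sigma^2) evaluated at x."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def estimate_utility(s, n_outer=200, n_inner=200, sigma=0.5, seed=0):
    """Nested Monte Carlo estimate of U(s) for the assumed toy model y = s * r + noise."""
    rng = random.Random(seed)
    inner = [rng.gauss(0.0, 1.0) for _ in range(n_inner)]  # r_j ~ p(r), reused for every y
    total = 0.0
    for _ in range(n_outer):
        r = rng.gauss(0.0, 1.0)                # r ~ p(r)
        y = s * r + rng.gauss(0.0, sigma)      # y ~ p(y | r, s)
        log_like = log_normal_pdf(y, s * r, sigma)
        # log p(y | s) is approximated by a log-mean-exp over the inner prior samples
        logs = [log_normal_pdf(y, s * rj, sigma) for rj in inner]
        m = max(logs)
        log_evidence = m + math.log(sum(math.exp(l - m) for l in logs) / n_inner)
        total += log_like - log_evidence
    return total / n_outer


candidates = [0.0, 0.5, 1.0, 2.0]              # hypothetical design values s
utilities = {s: estimate_utility(s) for s in candidates}
s_star = max(utilities, key=utilities.get)     # argmax over the estimated utility
```

In this toy model a design with s = 0 makes the observation independent of r, so its estimated utility is zero, and larger |s| is more informative.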

[Figure: Korali module tree]

• Problem (node labels in figure order): Evaluation, Execution, GaussianProcess, Model, GaussianProcess, Bayesian, Direct, Inference, Hierarchical, Reference, Custom, Psi, Theta
• Solver: Executor, Optimiser (CMA-ES, CCMA-ES, LM-CMA-ES, DEA, Rprop), Sampler (MCMC, DRAM, TMCMC)
• Conduit: Concurrent, Distributed, Sequential

Extensible

Ease of Use

Motivation

Workflow

Worker 0: Rank 0 (Core 0), Rank 1 (Core 1)
Worker 1: Rank 0 (Core 2), Rank 1 (Core 3)
Worker 2: Rank 0 (Core 4), Rank 1 (Core 5)
Worker 3: Rank 0 (Core 6), Rank 1 (Core 7)
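The worker layout in the figure (eight cores, two ranks per worker) follows a simple contiguous mapping, which can be written down as a short sketch. The function name `core_layout` and its parameters are hypothetical and are not part of Korali's interface.

```python
def core_layout(n_cores, ranks_per_worker):
    """Group cores into workers: worker w, rank k runs on core w * ranks_per_worker + k."""
    if n_cores % ranks_per_worker != 0:
        raise ValueError("n_cores must be a multiple of ranks_per_worker")
    n_workers = n_cores // ranks_per_worker
    return {
        worker: [worker * ranks_per_worker + rank for rank in range(ranks_per_worker)]
        for worker in range(n_workers)
    }


layout = core_layout(n_cores=8, ranks_per_worker=2)
```

With these arguments the mapping gives four workers, and Worker 3 holds cores 6 and 7, as in the figure.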

[Figure: Korali workflow]

• Engine: Start Generation, Run Experiment, Check Termination
• Experiment I and Experiment II (each with a Solver and a Problem): Generate Samples, Preprocess Samples, Postprocess Results, Update State
• Conduit: Distribute Samples, Collect Results
• Worker states: waiting for sample, busy, finished
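The generation loop in the workflow figure (generate samples, distribute them to workers, collect results, update the solver state, check termination) can be sketched in a few lines. This is a minimal illustration under assumptions, not Korali code: a thread pool stands in for the conduit, a simple Gaussian resampling search stands in for the solver (it is not CMA-ES or any other Korali solver), and `model` and `run_experiment` are hypothetical names.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def model(x):
    """Hypothetical computational model: a cheap objective to maximise."""
    return -(x - 3.0) ** 2


def run_experiment(generations=30, population=8, workers=4, seed=1):
    rng = random.Random(seed)
    state = {"mean": 0.0, "sigma": 2.0, "best_x": None, "best_f": float("-inf")}
    with ThreadPoolExecutor(max_workers=workers) as conduit:
        for _ in range(generations):                        # Start Generation
            samples = [rng.gauss(state["mean"], state["sigma"])
                       for _ in range(population)]          # Generate Samples
            results = list(conduit.map(model, samples))     # Distribute Samples, Collect Results
            ranked = sorted(zip(results, samples), reverse=True)
            elite = [x for _, x in ranked[: population // 2]]
            state["mean"] = sum(elite) / len(elite)         # Update State
            state["sigma"] = max(0.9 * state["sigma"], 1e-3)
            if ranked[0][0] > state["best_f"]:
                state["best_f"], state["best_x"] = ranked[0]
            if state["sigma"] < 0.05:                       # Check Termination
                break
    return state


final_state = run_experiment()
```

Keeping the sample generation and state update outside the pool mirrors the split in the figure: only model evaluations are sent to workers.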

Korali Supervisor Supercomputer

$$p(\vartheta \mid d) = \frac{p(d \mid \vartheta)\, p(\vartheta)}{p(d)}$$

• Bayesian Inference
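Bayes' rule states that the posterior is the likelihood times the prior, divided by the evidence. For a single parameter this can be evaluated directly on a grid, as in the sketch below. The Gaussian likelihood, the flat prior, the data values, and the name `grid_posterior` are assumptions made for illustration and do not come from the poster.

```python
import math


def grid_posterior(data, thetas, sigma=1.0):
    """Posterior p(theta | d) on a grid, assuming d_k ~ Normal(theta, sigma^2) and a flat prior."""
    prior = [1.0 / len(thetas)] * len(thetas)                 # p(theta)
    log_like = [
        sum(-0.5 * ((d - t) / sigma) ** 2 for d in data)      # log p(d | theta) up to a constant
        for t in thetas
    ]
    shift = max(log_like)                                     # guards against underflow
    unnorm = [math.exp(l - shift) * p for l, p in zip(log_like, prior)]
    evidence = sum(unnorm)                                    # p(d) up to the same constant
    return [u / evidence for u in unnorm]


thetas = [i / 100 for i in range(-300, 301)]                  # grid over theta
data = [0.9, 1.1, 1.0]                                        # hypothetical observations
post = grid_posterior(data, thetas)
theta_map = thetas[post.index(max(post))]                     # most probable grid point
```

Grid evaluation only works in very low dimensions. Sampling methods such as the MCMC and TMCMC samplers listed on the poster address the general case.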

Load Balance

[Figure panels: Unbalanced, Balanced]
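One common way to move from an unbalanced to a balanced schedule is to hand each sample to whichever worker becomes idle first, instead of fixing the assignment up front. The sketch below compares the two strategies on hypothetical model run times. It illustrates the general idea and is not a description of Korali's scheduler; the function names and durations are assumptions.

```python
import heapq


def static_makespan(durations, n_workers):
    """Total time when sample i is bound to worker i mod n_workers before the run."""
    loads = [0.0] * n_workers
    for i, d in enumerate(durations):
        loads[i % n_workers] += d
    return max(loads)


def dynamic_makespan(durations, n_workers):
    """Total time when the first idle worker always takes the next sample."""
    free_at = [0.0] * n_workers          # min-heap of times at which workers become idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)   # earliest idle worker
        heapq.heappush(free_at, start + d)
    return max(free_at)


durations = [8, 1, 1, 1, 8, 1, 1, 1]     # hypothetical run times with two slow samples
unbalanced = static_makespan(durations, n_workers=4)
balanced = dynamic_makespan(durations, n_workers=4)
```

Here the static assignment puts both slow samples on the same worker, while the dynamic one spreads them across two workers.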

Scalable

Computational Model

Relaxation

[Figure: checkpoint timeline on O(1000) nodes, axis: time (hours). Labels: Job 1, CheckPoint 1, Job 2, Resume, CheckPoint 2, CheckPoint N, system error.]

Fault Tolerant
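The checkpoint and resume pattern in the timeline can be sketched as follows: state is written after every generation, so a job that starts after a system error continues from the last checkpoint instead of from the beginning. The JSON file format, the toy workload, and the names `run_job` and `fail_at` are assumptions for illustration and do not reflect Korali's actual checkpoint format.

```python
import json
import os
import tempfile


def run_job(checkpoint_path, total_generations, fail_at=None):
    """Advance a toy computation, saving its state after every generation."""
    if os.path.exists(checkpoint_path):                  # Resume from the last checkpoint
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"generation": 0, "value": 0}
    while state["generation"] < total_generations:
        if fail_at is not None and state["generation"] == fail_at:
            raise RuntimeError("simulated system error")
        state["value"] += state["generation"]            # stand-in for one generation of work
        state["generation"] += 1
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)            # atomic swap keeps the checkpoint valid
    return state


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    try:
        run_job(path, total_generations=10, fail_at=6)   # Job 1 stops with an error
    except RuntimeError:
        pass
    resumed = run_job(path, total_generations=10)        # Job 2 resumes and finishes
```

Writing to a temporary file and renaming it means an error during the write cannot leave a half-written checkpoint behind.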