Massively Parallel Depth-First Search and Its Applications
Kazuki Yoshizoe, Search and Parallel Computing Unit, RIKEN AIP
Jul. 29th, 2019, 2nd DOML Workshop @ RIKEN AIP Nihombashi
Who am I?
- Parallel computing lab at graduate school
- Search algorithms: game AI algorithms, parallel search algorithms
- Computer Go book (in Japanese) describing MCTS
K. Yoshizoe et al., "Scalable Distributed Monte-Carlo Tree Search", Symposium on Combinatorial Search (SoCS), 2011.
AlphaGo (by DeepMind) uses Deep Learning with Monte-Carlo Tree Search (MCTS).
Our distributed MCTS scales up to a 3000+ fold speedup.
What is Search?
Here, search means finding node(s) or path(s) from a given graph.
- Explicitly given graphs: road maps, social networks, trains
- Rule-based graphs: combinatorial optimization, games, SAT, CSP
A node or path represents a "shortest path", an "optimal combination", or the "best play" in a game.
(Image credit: 2003 Nicolas P. Rougier, CC BY-SA 4.0)
Parallelizing Search Algorithms
(Figure: an example search tree with nodes r, a-o split across 4 workers.)
Practical search algorithms "prune" search spaces to focus on the promising parts.
Therefore, simply splitting the search space causes highly unbalanced workloads.
Parallel Depth-First Search (DFS) and Applications
DFS is simple but has many important applications:
- Frequent Itemset Mining / Statistical Pattern Mining [Yoshizoe, Terada, Tsuda 2018]
- Constraint Satisfaction [Ishii, Yoshizoe, Suzumura 2014]
- Continuous Optimization [Izumi, Yoshizoe, Ishii 2018]
Frequent Itemset Mining
Database: items × transactions.
- Ex. 1, Market Basket Analysis: items are products, transactions are customers, an x marks a purchased item.
- Ex. 2, Genomics (GWAS): items are SNPs, transactions are humans, an x marks a present SNP.
Task: counting / enumerating frequent itemsets from a given database.
Ex., itemsets with frequency 3 or higher:
{1}, {2}, {3}, {4}, {5}, {6}, {1,4}, {1,6}, {2,4}, {2,5}, {4,6}, {1,4,6}
…GTCTAAAACATGATT…
…GTCTGAATCATGATT…
…GTCTGAAACATGATT…
…GTCTGAATCATCATT…
SNP: Single Nucleotide Polymorphism
(Table: transactions A-G over items 1-6, with +/- class labels.)
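The counting/enumeration task can be sketched in a few lines of Python. The transaction database below is hypothetical (the slide's table layout was not preserved here), so its frequent itemsets differ from the example list above; the DFS structure, however, is the one the rest of the talk builds on.

```python
# Hypothetical transaction database: each transaction is a set of item ids.
DB = [
    {1, 4, 6}, {1, 4, 6}, {1, 2, 4, 6}, {4, 6},
    {2, 5}, {2, 5}, {2, 3, 5},
]

def freq(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in DB if itemset <= t)

def frequent_itemsets(min_freq):
    """Enumerate all itemsets with frequency >= min_freq by DFS.

    Children of itemset I are I + {j} for items j greater than max(I),
    so each itemset is visited exactly once.  Frequency is anti-monotone,
    so an infrequent node prunes its whole subtree."""
    items = sorted(set().union(*DB))
    found = []

    def recur(itemset, last):
        for j in items:
            if j <= last:
                continue
            child = itemset | {j}
            if freq(child) >= min_freq:   # prune infrequent subtrees
                found.append(child)
                recur(child, j)

    recur(set(), 0)
    return found

result = frequent_itemsets(3)
```

The "children are only larger items" rule is what turns the subset lattice into a tree, which is the search space drawn on the later slides.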
Statistical Pattern Mining (or Association Rule Mining)
[Yoshizoe, Terada, Tsuda 2018] Bioinformatics
A fundamental problem in data mining.
Example: finding the cause of a genetic disease from DNA.
Finding Statistically Significant SNPs
(Table: SNPs × patients, with +/- disease labels.)
Genome-Wide Association Studies (GWAS): a genetic disease can be caused by a combination of multiple SNPs (not by just one or two SNPs).
Improving GWAS. Example: early-onset Alzheimer SNP database [Alzheimer]: 380,157 SNPs, 364 patients, 176 positive.
Existing GWAS finds one or two SNPs; our tool finds any number of SNPs.
A naïve approach requires testing all 2^380,157 possible combinations of SNPs. We make it possible with a smart pruning of the search space and large-scale parallelization.
[Alzheimer] J. A. Webster, J. R. Gibbs, J. Clarke et al., "Genetic control of human brain transcript expression in Alzheimer disease." Am J Hum Genet., vol. 84, no. 4, pp. 445-58, 2009.
Depth-First Search (without threshold)
  DFS() {
    Recur(r)
  }
  Recur(node n) {
    foreach (child c of n) {
      // do something for c
      Recur(c)
    }
  }
Backtracking DFS: backtracking can be naturally implemented with a recursive call.
Memory usage is O(d): only the current path is needed.
It simply traverses all nodes in the tree.
Frequent Itemset Mining can be solved using DFS without a threshold.
Depth-First Search is:
- suitable for trees
- requires little memory
- parallelization friendly
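The pseudocode above can be sketched as runnable Python. The tree shape is assumed for illustration (the slide's figure labels nodes r, a, b, ... but its exact edges are not recoverable), and the "do something" step is simply recording the visit order.

```python
# A small example tree, structure assumed for illustration.
TREE = {
    "r": ["a", "b", "c"],
    "a": ["d", "e"],
    "d": ["f", "g"],
    "b": ["h"],
    "h": ["i", "j"],
    "c": ["k"],
}

def dfs(root):
    """Backtracking DFS via recursion: only the current root-to-node
    path lives on the call stack, so memory use is O(depth)."""
    order = []

    def recur(node):
        order.append(node)            # "do something" for the node
        for child in TREE.get(node, []):
            recur(child)

    recur(root)
    return order
```

`dfs("r")` visits r, a, d, f, g, e, b, h, i, j, c, k: each child subtree is fully explored before backtracking to the next sibling.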
Depth-First Search with threshold update
  DFS() {
    Recur(r)
  }
  Recur(node n) {
    foreach (child c of n) {
      // do something for c
      if (c is within threshold) Recur(c)
      UpdateThreshold()
    }
  }
DFS with threshold: prune the search space by dynamically updating the threshold. Ex.: finding the top-k nodes.
Statistical pattern mining can be implemented as DFS with a threshold: the significance threshold (a p-value lower bound) is updated and propagated.
The threshold is updated during the search; as the search progresses from left to right, more branches on the right are pruned.
With a threshold, parallelization is more difficult, but possible without large overhead.
[Yoshizoe, Terada, Tsuda 2018]
[Ishii, Yoshizoe, Suzumura 2014]
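The "top-k" flavor of threshold DFS can be sketched as follows. The tree, its scores, and the assumption that scores never increase from parent to child (which is what makes the pruning sound) are all made up for illustration.

```python
import heapq

# Hypothetical tree whose node scores are anti-monotone (a child never
# scores higher than its parent), so a subtree can be pruned as soon
# as its root cannot beat the current k-th best score.
TREE = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e", "f"]}
SCORE = {"r": 10, "a": 9, "b": 4, "c": 8, "d": 2, "e": 3, "f": 1}

def top_k(root, k):
    best = []          # min-heap holding the current top-k scores
    visited = []       # records how much of the tree was explored

    def threshold():
        # Once k scores are collected, the smallest of them is the bar.
        return best[0] if len(best) == k else float("-inf")

    def recur(node):
        visited.append(node)
        heapq.heappush(best, SCORE[node])
        if len(best) > k:
            heapq.heappop(best)             # UpdateThreshold()
        for child in TREE.get(node, []):
            if SCORE[child] > threshold():  # prune below threshold
                recur(child)

    recur(root)
    return sorted(best, reverse=True), visited

scores, visited = top_k("r", 3)
```

With k = 3 the search visits only r, a, and c: once {10, 9, 8} is collected, the subtrees under d, b, e, and f can never contribute and are skipped, which is exactly the effect the slide describes.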
Depth-First Search Example for Data Mining
(Figure: the lattice of all itemsets over g1-g6, the original search space of pattern mining.)
Needs pruning!
Closed Itemset
def. (Closed Itemset): an itemset I is a closed itemset if there is no item j such that j ∉ I and freq(I ∪ {j}) = freq(I).
(The frequency decreases when any other item is added to a closed itemset.)
[Pasquier, Bastide, Taouil, Lakhal 1999]
Enumeration of itemsets (freq. >= 3):
{1}, {2}, {3}, {4}, {5}, {6}, {1,4}, {1,6}, {2,4}, {2,5}, {4,6}, {1,4,6}
Enumeration of closed itemsets (freq. >= 3):
{2}, {3}, {4}, {2,4}, {2,5}, {4,6}, {1,4,6}
A closed itemset is a compressed way of enumeration: {1,4,6} is a closed itemset that implicitly includes {1}, {1,4}, {1,6}, because freq{1,4,6} = freq{1,4} = freq{1,6} = freq{1} = 3.
Note: freq{4,6} = 4, so {4,6} is not subsumed by {1,4,6}.
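The definition can be checked mechanically. The database below is hypothetical (it is not the slide's table, so its closed itemsets differ from the list above), but the closedness test is exactly the definition.

```python
# Hypothetical transaction database for illustration.
DB = [
    {1, 4, 6}, {1, 4, 6}, {1, 2, 4, 6}, {4, 6},
    {2, 5}, {2, 5}, {2, 3, 5},
]
ITEMS = sorted(set().union(*DB))

def freq(itemset):
    return sum(1 for t in DB if itemset <= t)

def is_closed(itemset):
    """I is closed iff no item j outside I satisfies
    freq(I ∪ {j}) == freq(I), i.e. the frequency strictly
    decreases whenever any other item is added."""
    return all(
        freq(itemset | {j}) < freq(itemset)
        for j in ITEMS if j not in itemset
    )
```

In this database {1} is not closed (adding 4 or 6 keeps the frequency at 3) while {1,4,6} is, mirroring the compression argument on the slide.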
Pruned Search Space Using Closed Itemsets
(Figure: the same itemset lattice annotated with node frequencies; closed itemsets are marked in red.)
Closed itemsets (red nodes) describe all frequent itemsets. [Pasquier, Bastide, Taouil, Lakhal 1999]
Depth-First Search Example for Data Mining
Pruned Pattern Mining with a Statistical Significance Threshold
(Figure: the itemset lattice with +/- labels and node frequencies; a line marks the significance bound.)
Enumerate statistically significant nodes from the search space.
[Terada, Okada-Hatakeyama, Tsuda, Sese 2013] [Minato, Uno, Tsuda, Terada, Sese 2014]
The search space can be pruned using the lower bound of the Fisher p-value: nodes below this line can never be significant.
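The pruning bound can be sketched concretely. The formula below is the standard minimum attainable one-tailed Fisher exact p-value used by LAMP-style pruning; the margins are taken from the Alzheimer dataset mentioned earlier (364 patients, 176 positive), but the function names are illustrative, not from the authors' code.

```python
from math import comb

N = 364        # transactions (patients), from the slide
N_POS = 176    # positive labels, from the slide

def min_pvalue(support):
    """Smallest one-tailed Fisher exact p-value any pattern with this
    support can achieve: the most extreme 2x2 table puts every
    occurrence of the pattern in the positive class, giving
    C(N_POS, x) / C(N, x)."""
    x = min(support, N_POS)
    return comb(N_POS, x) / comb(N, x)

def can_prune(support, alpha):
    """Support only shrinks as the DFS adds items, and min_pvalue
    grows as support shrinks, so once the bound exceeds the
    significance level the whole subtree can be pruned."""
    return min_pvalue(support) > alpha
```

For example, a pattern matching a single patient can never reach p < 0.05 here (its best case is 176/364), so its entire subtree is skipped, while high-support patterns survive the test.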
Massively Parallel Statistical Pattern Mining
Lineage of the algorithm:
- Frequent itemset mining based on closed itemsets [Pasquier, Bastide, Taouil, Lakhal 1999]
- Apply the reverse search technique [Avis, Fukuda 1996]: the LCM algorithm [Uno, Kiyomi, Arimura 2004]
- Applied to statistical pattern mining: the LAMP algorithm [Terada, Okada-Hatakeyama, Tsuda, Sese 2013]
- Faster LAMP using DFS with threshold [Minato, Uno, Tsuda, Terada, Sese 2014]
- Massively parallel LAMP (MP-LAMP) [Yoshizoe, Terada, Tsuda 2018]
Parallelization method (for solving GWAS and others):
- Work stealing and broadcast/reduce
- Reformulate the algorithm from recursive calls to a stack + loop
- Hardware/middleware-aware algorithm and implementation
Preparation for Parallelization: Reformulate the Algorithm
Recursive function call (original):
  DFS() {
    Recur(r)
  }
  Recur(node n) {
    foreach (child c of n) {
      // do something for c
      if (c is within threshold) Recur(c)
      UpdateThreshold()
    }
  }
Pros: O(d) memory. Cons: difficult to parallelize.
Reformulated implementation using a stack and a loop:
  StackDFS() {
    push(r)
    Loop()
  }
  Loop() {
    while (stack not empty) {
      pop n from stack
      foreach (child c of n) {
        // do something for c
        if (c is within threshold) push(c)
        UpdateThreshold()
      }
    }
  }
Cons: O(d·b) memory (depth d, branching factor b). Pros: easy to parallelize.
Recursion to stack + loop
(Figure: step-by-step traversal of the example tree, showing the stack contents as nodes are pushed and popped.)
In the stack version, child nodes are pushed in reverse order (the foreach runs in reverse) so that the leftmost child is popped first, reproducing the recursive DFS visiting order.
Note: pushing in the normal order still gives a depth-first traversal, only with siblings visited right to left; it is replacing the stack with a FIFO queue that turns the traversal into breadth-first search, which requires a large amount of memory.
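The equivalence of the two formulations can be demonstrated directly. The tree shape is assumed for illustration; the point is that the stack version with reversed pushes visits nodes in exactly the recursive order.

```python
# Example tree, structure assumed for illustration.
TREE = {
    "r": ["a", "b", "c"],
    "a": ["d", "e"],
    "d": ["f", "g"],
    "b": ["h"],
    "h": ["i", "j"],
    "c": ["k"],
}

def recursive_dfs(root):
    """Original formulation: recursion, O(depth) memory."""
    order = []
    def recur(node):
        order.append(node)
        for child in TREE.get(node, []):
            recur(child)
    recur(root)
    return order

def stack_dfs(root):
    """Reformulation: explicit stack + loop, easy to parallelize."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is popped
        # first, reproducing the recursive visiting order.
        for child in reversed(TREE.get(node, [])):
            stack.append(child)
    return order
```

Unlike the recursive version, the explicit stack can be split and handed to another process, which is the hook the work-stealing scheme below relies on.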
Parallelization Method
Work stealing: if the stack is empty, find a victim and steal workloads from it.
[Saraswat et al. 2011] Hypercube + random edges: workload requests are sent to neighbor processes in a virtual hypercube-based communication graph.
Broadcast: threshold info is sent on a spanning tree.
Termination detection: not straightforward in a distributed environment; the termination of all processes must be detected correctly.
[Mattern 1990] distributed termination detection on a spanning tree.
Implementation based on C++ and the MPI library.
Work Stealing Based Parallelization
Receiver-initiated work stealing: a compute node with an empty stack (no jobs)
1. selects a victim node, and
2. sends a job request to the victim.
The victim will either
1. give half of its stack (if it has enough jobs), or
2. reject the request.
Lifeline graph [Saraswat et al. 2011]: victims are first selected randomly, then selected from neighbors in the communication graph, a virtual graph based on a hypercube plus random edges.
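The request/give protocol can be sketched as a single-threaded toy simulation with two workers (the real system uses MPI processes and the lifeline graph; the tree and the "steal half, from the bottom" policy here are illustrative, taking the oldest stack entries since those tend to carry larger subtrees).

```python
from collections import deque

# Hypothetical tree for illustration.
TREE = {
    "r": ["a", "b"], "a": ["c", "d"], "b": ["e", "f"],
    "c": ["g"], "e": ["h", "i"],
}

def run_two_workers():
    stacks = [deque(["r"]), deque()]   # worker 0 starts with the root
    visited = [[], []]
    while any(stacks):
        for w in range(2):
            if not stacks[w]:
                victim = stacks[1 - w]
                if len(victim) >= 2:          # victim has enough work:
                    for _ in range(len(victim) // 2):
                        # steal from the bottom (oldest, near the root)
                        stacks[w].append(victim.popleft())
                # else: the request is "rejected"; the worker idles
            if stacks[w]:
                node = stacks[w].pop()
                visited[w].append(node)
                for child in reversed(TREE.get(node, [])):
                    stacks[w].append(child)
    return visited

visited = run_two_workers()
```

Every node is processed exactly once and both workers end up with real work, even though worker 1 starts empty.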
DTD: Distributed Termination Detection
If all stacks are empty, terminate the algorithm. But there might be unfinished communication, so messages are also counted: if the numbers of sends and receives are equal, terminate; if any node receives a message, restart the counting.
Counting is not synchronized, so a single snapshot is unreliable: matching send and receive counts can be observed even while a message is still in flight. The algorithm accounts for this and is proved to be correct.
A famous token-based algorithm is [Dijkstra and Scholten 1980]; a much improved DTD on a spanning tree is [Mattern 1990] (not well known).
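To see why one snapshot is not enough, here is a deliberately simplified toy (not Mattern's actual wave algorithm, which runs over a spanning tree): termination is declared only if two consecutive waves both see every process idle with matching, unchanged global send/receive counters.

```python
def all_idle(snap):
    return all(not p["active"] for p in snap)

def counts(snap):
    return [(p["sent"], p["recv"]) for p in snap]

def safe_to_terminate(prev, curr):
    """Terminate only if two consecutive waves both saw every process
    idle, global sends == global recvs, and the per-process counters
    did not change between the waves (so nothing was in flight)."""
    for snap in (prev, curr):
        if not all_idle(snap):
            return False
        if sum(s for s, _ in counts(snap)) != sum(r for _, r in counts(snap)):
            return False
    return counts(prev) == counts(curr)

# Wave 1: process 0 sent a message that process 1 has not yet received.
wave1 = [{"active": False, "sent": 1, "recv": 0},
         {"active": False, "sent": 0, "recv": 0}]
# Wave 2: the message has arrived; counters now balance.
wave2 = [{"active": False, "sent": 1, "recv": 0},
         {"active": False, "sent": 0, "recv": 1}]
# Wave 3: still idle, counters unchanged: now it is safe to stop.
wave3 = [{"active": False, "sent": 1, "recv": 0},
         {"active": False, "sent": 0, "recv": 1}]
```

After wave 1 the counters do not balance, so the in-flight message blocks termination; only once two stable, balanced waves agree does the detector fire.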
Using Two Communication Patterns
- Work stealing between stacks (requests, rejects, and gives on the lifeline graph)
- DTD and threshold broadcast/reduce (on a spanning tree rooted at the master)
(Figure: panels (a)-(c) illustrating the two communication patterns on 8 processes.)
Additional Technique: Combine with Static Load Balancing
The original algorithm starts with only the root node pushed on one process. If the depth-1 nodes are instead distributed across the processes at initialization, load balancing becomes easier.
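The static initialization step can be sketched as a round-robin deal of the root's children (the tree and the worker count below are assumed for illustration):

```python
# Hypothetical: a root with 8 children, dealt to 4 workers.
TREE = {"r": ["a", "b", "c", "d", "e", "f", "g", "h"]}

def init_stacks(root, n_workers):
    """Expand the root once and distribute the depth-1 nodes
    round-robin, so every worker starts with real work instead of
    waiting to steal from the single process holding the root."""
    stacks = [[] for _ in range(n_workers)]
    for i, child in enumerate(TREE[root]):
        stacks[i % n_workers].append(child)
    return stacks

stacks = init_stacks("r", 4)
```

Work stealing still runs afterwards; the static deal only removes the initial ramp-up in which all but one process would be idle.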
Statistical Pattern Mining: Speedup
Speedup for finding combinations of SNPs related to Alzheimer or to the Japanese population, from a subset of SNPs.
(Figure: wall-clock time (log scale) and speedup versus number of processes, up to 1200 processes, for five datasets.)
- JPN dom. 10% (items: 11253, trans: 697, density: 1.0%): 126 s → 0.444 s
- JPN dom. 20% (items: 11914, trans: 697, density: 1.9%): 258 s → 0.409 s
- Alz. dom. 5% (items: 44052, trans: 364, density: 5.4%): 4361 s → 9.58 s
- Alz. dom. 10% (items: 91126, trans: 364, density: 9.8%): 48285 s → 41.1 s
- Alz. rec. 30% (items: 250120, trans: 364, density: 2.9%): 17646 s → 16.0 s
(Figure: one additional panel, up to 72 processes: 1,000 Genomes JPN dom. 15% (SNPs: 11676, density: 1.5%): 1800 s → 31.0 s.)
Environment: EC2 c4.8xlarge, Xeon E5-2666 v3 (2.9 GHz, 18 cores), 10G Ethernet, using cfncluster.
Breakdown of Execution Time (Search, Communication, Idle)
(Figure: stacked bars of total time (main, preprocess, probe, idle) versus number of processes, from 1 to 1200, for five datasets: JPN dom. 10%, HapMap dom. 20%, Alz. dom. 5%, Alz. dom. 10%, Alz. rec. 30%.)
For more difficult problems, the communication delay is hidden.
Efficient Speedup on the K Computer
Used up to 140K CPU cores on the K computer and achieved an estimated speedup of 110-120K fold (collaboration with RIKEN R-CCS, unpublished). In other existing work on parallel pattern mining, speedups are limited to 30-fold or less.
We are also working on:
- a parallel itemset mining library
- algorithms for other mining problems (including continuous feature data)
One of the discovered SNP combinations from a subset of SNPs for Alzheimer (data from [Alzheimer]); the p-value after correction for these 4 SNPs was 0.00031777:
- rs1057079: mTOR, mTOR-AS1
- rs1064261: mTOR
- rs2075800: HSPA1L (Hsp70)
- rs429358: APOE
Statistical Pattern Mining and Feature Selection
Database: number of features roughly 10K to 1M; number of data points roughly 1000.
…GTCTAAAACATGATT…
…GTCTGAATCATGATT…
…GTCTGAAACATGATT…
…GTCTGAATCATCATT…
(Table: transactions A-G plus an unseen sample "?" over items 1-6, with +/- labels.)
High-dimensional (binary) features, with a relatively small number of samples.
Our method finds statistically significant combinations of features.
We have methods for both binary and continuous features (with some limitations).
Conclusions
- Parallel search algorithms are not straightforward, but possible. Depth-first search algorithms are relatively easy to parallelize.
- We parallelized DFS-based pattern mining and observed highly efficient parallel speedup: measured on a real application for GWAS, and runs on the K computer (at RIKEN R-CCS) achieved a 110K+ fold speedup.
- We can apply this approach to other DFS-based mining algorithms (and hopefully to other DFS in general). We implemented a parallel pattern mining library (preparing to make it public).
- Looking for collaboration with ML people on high-dimensional feature selection.