Massively Parallel Depth-First Search and Its Applications
Kazuki Yoshizoe, Search and Parallel Computing Unit, RIKEN AIP
Jul. 29th, 2019, 2nd DOML Workshop @ RIKEN AIP Nihombashi
Who am I?
- Parallel computing lab at graduate school
- Search algorithms: game AI algorithms, parallel search algorithms
- Computer Go book (in Japanese) describing MCTS
K. Yoshizoe et al., "Scalable Distributed Monte-Carlo Tree Search", Symposium on Combinatorial Search (SoCS), 2011.
AlphaGo (by DeepMind) uses Deep Learning with Monte-Carlo Tree Search (MCTS).
Our distributed MCTS scales up to a 3000+ fold speedup.
What is Search?
Here, search means finding node(s) or path(s) from a given graph.
- Explicitly given graphs: road maps, social networks, trains
- Rule-based graphs: combinatorial optimization, games, SAT, CSP
A node or path represents a "shortest path", an "optimal combination", or the "best play" in a game.
(Image credit: 2003 Nicolas P. Rougier, CC BY-SA 4.0)
Parallelizing Search Algorithms
(Figure: an example search tree with nodes r, a-o split across 4 workers.)
Practical search algorithms "prune" search spaces to focus on the promising parts.
Therefore, simply splitting the search space causes highly unbalanced workloads.
Parallel Depth-First Search (DFS) and Applications
DFS is simple but has many important applications:
- Frequent Itemset Mining / Statistical Pattern Mining [Yoshizoe, Terada, Tsuda 2018]
- Constraint Satisfaction [Ishii, Yoshizoe, Suzumura 2014]
- Continuous Optimization [Izumi, Yoshizoe, Ishii 2018]
Frequent Itemset Mining
Database: items × transactions.
- Ex. 1, Market Basket Analysis: items are products, transactions are customers, an x marks a purchased item.
- Ex. 2, Genomics (GWAS): items are SNPs, transactions are humans, an x marks a present SNP.
Task: counting / enumerating frequent itemsets from a given database.
Ex., itemsets with frequency 3 or higher:
{1}, {2}, {3}, {4}, {5}, {6}, {1,4}, {1,6}, {2,4}, {2,5}, {4,6}, {1,4,6}
…GTCTAAAACATGATT…
…GTCTGAATCATGATT…
…GTCTGAAACATGATT…
…GTCTGAATCATCATT…
SNP: Single Nucleotide Polymorphism
(Table: transactions A-G over items 1-6, with +/- class labels.)
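The counting/enumeration task can be sketched in a few lines of Python. The transaction database below is hypothetical (the slide's table layout was not preserved here), so its frequent itemsets differ from the example list above; the DFS structure, however, is the one the rest of the talk builds on.

```python
# Hypothetical transaction database: each transaction is a set of item ids.
DB = [
    {1, 4, 6}, {1, 4, 6}, {1, 2, 4, 6}, {4, 6},
    {2, 5}, {2, 5}, {2, 3, 5},
]

def freq(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in DB if itemset <= t)

def frequent_itemsets(min_freq):
    """Enumerate all itemsets with frequency >= min_freq by DFS.

    Children of itemset I are I + {j} for items j greater than max(I),
    so each itemset is visited exactly once.  Frequency is anti-monotone,
    so an infrequent node prunes its whole subtree."""
    items = sorted(set().union(*DB))
    found = []

    def recur(itemset, last):
        for j in items:
            if j <= last:
                continue
            child = itemset | {j}
            if freq(child) >= min_freq:   # prune infrequent subtrees
                found.append(child)
                recur(child, j)

    recur(set(), 0)
    return found

result = frequent_itemsets(3)
```

The "children are only larger items" rule is what turns the subset lattice into a tree, which is the search space drawn on the later slides.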
Statistical Pattern Mining (or Association Rule Mining)
[Yoshizoe, Terada, Tsuda 2018] Bioinformatics
A fundamental problem in data mining.
Example: finding the cause of a genetic disease from DNA.
Finding Statistically Significant SNPs
(Table: SNPs × patients, with +/- disease labels.)
Genome-Wide Association Studies (GWAS): a genetic disease can be caused by a combination of multiple SNPs (not by just one or two SNPs).
Improving GWAS. Example: early-onset Alzheimer SNP database [Alzheimer]: 380,157 SNPs, 364 patients, 176 positive.
Existing GWAS finds one or two SNPs; our tool finds any number of SNPs.
A naïve approach requires testing all 2^380,157 possible combinations of SNPs. We make it possible with a smart pruning of the search space and large-scale parallelization.
[Alzheimer] J. A. Webster, J. R. Gibbs, J. Clarke et al., "Genetic control of human brain transcript expression in Alzheimer disease." Am J Hum Genet., vol. 84, no. 4, pp. 445-58, 2009.
Depth-First Search (without threshold)
  DFS() {
    Recur(r)
  }
  Recur(node n) {
    foreach (child c of n) {
      // do something for c
      Recur(c)
    }
  }
Backtracking DFS: backtracking can be naturally implemented with a recursive call.
Memory usage is O(d): only the current path is needed.
It simply traverses all nodes in the tree.
Frequent Itemset Mining can be solved using DFS without a threshold.
Depth-First Search is:
- suitable for trees
- requires little memory
- parallelization friendly
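The pseudocode above can be sketched as runnable Python. The tree shape is assumed for illustration (the slide's figure labels nodes r, a, b, ... but its exact edges are not recoverable), and the "do something" step is simply recording the visit order.

```python
# A small example tree, structure assumed for illustration.
TREE = {
    "r": ["a", "b", "c"],
    "a": ["d", "e"],
    "d": ["f", "g"],
    "b": ["h"],
    "h": ["i", "j"],
    "c": ["k"],
}

def dfs(root):
    """Backtracking DFS via recursion: only the current root-to-node
    path lives on the call stack, so memory use is O(depth)."""
    order = []

    def recur(node):
        order.append(node)            # "do something" for the node
        for child in TREE.get(node, []):
            recur(child)

    recur(root)
    return order
```

`dfs("r")` visits r, a, d, f, g, e, b, h, i, j, c, k: each child subtree is fully explored before backtracking to the next sibling.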
Depth-First Search with threshold update
  DFS() {
    Recur(r)
  }
  Recur(node n) {
    foreach (child c of n) {
      // do something for c
      if (c is within threshold) Recur(c)
      UpdateThreshold()
    }
  }
DFS with threshold: prune the search space by dynamically updating the threshold. Ex.: finding the top-k nodes.
Statistical pattern mining can be implemented as DFS with a threshold: the significance threshold (a p-value lower bound) is updated and propagated.
The threshold is updated during the search; as the search progresses from left to right, more branches on the right are pruned.
With a threshold, parallelization is more difficult, but possible without large overhead.
[Yoshizoe, Terada, Tsuda 2018]
[Ishii, Yoshizoe, Suzumura 2014]
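The "top-k" flavor of threshold DFS can be sketched as follows. The tree, its scores, and the assumption that scores never increase from parent to child (which is what makes the pruning sound) are all made up for illustration.

```python
import heapq

# Hypothetical tree whose node scores are anti-monotone (a child never
# scores higher than its parent), so a subtree can be pruned as soon
# as its root cannot beat the current k-th best score.
TREE = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e", "f"]}
SCORE = {"r": 10, "a": 9, "b": 4, "c": 8, "d": 2, "e": 3, "f": 1}

def top_k(root, k):
    best = []          # min-heap holding the current top-k scores
    visited = []       # records how much of the tree was explored

    def threshold():
        # Once k scores are collected, the smallest of them is the bar.
        return best[0] if len(best) == k else float("-inf")

    def recur(node):
        visited.append(node)
        heapq.heappush(best, SCORE[node])
        if len(best) > k:
            heapq.heappop(best)             # UpdateThreshold()
        for child in TREE.get(node, []):
            if SCORE[child] > threshold():  # prune below threshold
                recur(child)

    recur(root)
    return sorted(best, reverse=True), visited

scores, visited = top_k("r", 3)
```

With k = 3 the search visits only r, a, and c: once {10, 9, 8} is collected, the subtrees under d, b, e, and f can never contribute and are skipped, which is exactly the effect the slide describes.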
Depth-First Search Example for Data Mining
(Figure: the lattice of all itemsets over g1-g6, the original search space of pattern mining.)
Needs pruning!
Closed Itemset
def. (Closed Itemset): an itemset I is a closed itemset if there is no item j such that j ∉ I and freq(I ∪ {j}) = freq(I).
(The frequency decreases when any other item is added to a closed itemset.)
[Pasquier, Bastide, Taouil, Lakhal 1999]
Enumeration of itemsets (freq. >= 3):
{1}, {2}, {3}, {4}, {5}, {6}, {1,4}, {1,6}, {2,4}, {2,5}, {4,6}, {1,4,6}
Enumeration of closed itemsets (freq. >= 3):
{2}, {3}, {4}, {2,4}, {2,5}, {4,6}, {1,4,6}
A closed itemset is a compressed way of enumeration: {1,4,6} is a closed itemset that implicitly includes {1}, {1,4}, {1,6}, because freq{1,4,6} = freq{1,4} = freq{1,6} = freq{1} = 3.
Note: freq{4,6} = 4, so {4,6} is not subsumed by {1,4,6}.
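The definition can be checked mechanically. The database below is hypothetical (it is not the slide's table, so its closed itemsets differ from the list above), but the closedness test is exactly the definition.

```python
# Hypothetical transaction database for illustration.
DB = [
    {1, 4, 6}, {1, 4, 6}, {1, 2, 4, 6}, {4, 6},
    {2, 5}, {2, 5}, {2, 3, 5},
]
ITEMS = sorted(set().union(*DB))

def freq(itemset):
    return sum(1 for t in DB if itemset <= t)

def is_closed(itemset):
    """I is closed iff no item j outside I satisfies
    freq(I ∪ {j}) == freq(I), i.e. the frequency strictly
    decreases whenever any other item is added."""
    return all(
        freq(itemset | {j}) < freq(itemset)
        for j in ITEMS if j not in itemset
    )
```

In this database {1} is not closed (adding 4 or 6 keeps the frequency at 3) while {1,4,6} is, mirroring the compression argument on the slide.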
Pruned Search Space Using Closed Itemsets
(Figure: the same itemset lattice annotated with node frequencies; closed itemsets are marked in red.)
Closed itemsets (red nodes) describe all frequent itemsets. [Pasquier, Bastide, Taouil, Lakhal 1999]
Depth-First Search Example for Data Mining
Pruned Pattern Mining with a Statistical Significance Threshold
(Figure: the itemset lattice with +/- labels and node frequencies; a line marks the significance bound.)
Enumerate statistically significant nodes from the search space.
[Terada, Okada-Hatakeyama, Tsuda, Sese 2013] [Minato, Uno, Tsuda, Terada, Sese 2014]
The search space can be pruned using the lower bound of the Fisher p-value: nodes below this line can never be significant.
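The pruning bound can be sketched concretely. The formula below is the standard minimum attainable one-tailed Fisher exact p-value used by LAMP-style pruning; the margins are taken from the Alzheimer dataset mentioned earlier (364 patients, 176 positive), but the function names are illustrative, not from the authors' code.

```python
from math import comb

N = 364        # transactions (patients), from the slide
N_POS = 176    # positive labels, from the slide

def min_pvalue(support):
    """Smallest one-tailed Fisher exact p-value any pattern with this
    support can achieve: the most extreme 2x2 table puts every
    occurrence of the pattern in the positive class, giving
    C(N_POS, x) / C(N, x)."""
    x = min(support, N_POS)
    return comb(N_POS, x) / comb(N, x)

def can_prune(support, alpha):
    """Support only shrinks as the DFS adds items, and min_pvalue
    grows as support shrinks, so once the bound exceeds the
    significance level the whole subtree can be pruned."""
    return min_pvalue(support) > alpha
```

For example, a pattern matching a single patient can never reach p < 0.05 here (its best case is 176/364), so its entire subtree is skipped, while high-support patterns survive the test.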
Massively Parallel Statistical Pattern Mining
Lineage of the algorithm:
- Frequent itemset mining based on closed itemsets [Pasquier, Bastide, Taouil, Lakhal 1999]
- Apply the reverse search technique [Avis, Fukuda 1996]: the LCM algorithm [Uno, Kiyomi, Arimura 2004]
- Applied to statistical pattern mining: the LAMP algorithm [Terada, Okada-Hatakeyama, Tsuda, Sese 2013]
- Faster LAMP using DFS with threshold [Minato, Uno, Tsuda, Terada, Sese 2014]
- Massively parallel LAMP (MP-LAMP) [Yoshizoe, Terada, Tsuda 2018]
Parallelization method (for solving GWAS and others):
- Work stealing and broadcast/reduce
- Reformulate the algorithm from recursive calls to a stack + loop
- Hardware/middleware-aware algorithm and implementation
Preparation for Parallelization: Reformulate the Algorithm
Recursive function call (original):
  DFS() {
    Recur(r)
  }
  Recur(node n) {
    foreach (child c of n) {
      // do something for c
      if (c is within threshold) Recur(c)
      UpdateThreshold()
    }
  }
Pros: O(d) memory. Cons: difficult to parallelize.
Reformulated implementation using a stack and a loop:
  StackDFS() {
    push(r)
    Loop()
  }
  Loop() {
    while (stack not empty) {
      pop n from stack
      foreach (child c of n) {
        // do something for c
        if (c is within threshold) push(c)
        UpdateThreshold()
      }
    }
  }
Cons: O(d·b) memory (depth d, branching factor b). Pros: easy to parallelize.
Recursion to stack + loop
(Figure: step-by-step traversal of the example tree, showing the stack contents as nodes are pushed and popped.)
In the stack version, child nodes are pushed in reverse order (the foreach runs in reverse) so that the leftmost child is popped first, reproducing the recursive DFS visiting order.
Note: pushing in the normal order still gives a depth-first traversal, only with siblings visited right to left; it is replacing the stack with a FIFO queue that turns the traversal into breadth-first search, which requires a large amount of memory.
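The equivalence of the two formulations can be demonstrated directly. The tree shape is assumed for illustration; the point is that the stack version with reversed pushes visits nodes in exactly the recursive order.

```python
# Example tree, structure assumed for illustration.
TREE = {
    "r": ["a", "b", "c"],
    "a": ["d", "e"],
    "d": ["f", "g"],
    "b": ["h"],
    "h": ["i", "j"],
    "c": ["k"],
}

def recursive_dfs(root):
    """Original formulation: recursion, O(depth) memory."""
    order = []
    def recur(node):
        order.append(node)
        for child in TREE.get(node, []):
            recur(child)
    recur(root)
    return order

def stack_dfs(root):
    """Reformulation: explicit stack + loop, easy to parallelize."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is popped
        # first, reproducing the recursive visiting order.
        for child in reversed(TREE.get(node, [])):
            stack.append(child)
    return order
```

Unlike the recursive version, the explicit stack can be split and handed to another process, which is the hook the work-stealing scheme below relies on.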
Parallelization Method
Work stealing: if the stack is empty, find a victim and steal workloads from it.
[Saraswat et al. 2011] Hypercube + random edges: workload requests are sent to neighbor processes in a virtual hypercube-based communication graph.
Broadcast: threshold info is sent on a spanning tree.
Termination detection: not straightforward in a distributed environment; the termination of all processes must be detected correctly.
[Mattern 1990] distributed termination detection on a spanning tree.
Implementation based on C++ and the MPI library.
Work Stealing Based Parallelization
Receiver-initiated work stealing: a compute node with an empty stack (no jobs)
1. selects a victim node, and
2. sends a job request to the victim.
The victim will either
1. give half of its stack (if it has enough jobs), or
2. reject the request.
Lifeline graph [Saraswat et al. 2011]: victims are first selected randomly, then selected from neighbors in the communication graph, a virtual graph based on a hypercube plus random edges.
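The request/give protocol can be sketched as a single-threaded toy simulation with two workers (the real system uses MPI processes and the lifeline graph; the tree and the "steal half, from the bottom" policy here are illustrative, taking the oldest stack entries since those tend to carry larger subtrees).

```python
from collections import deque

# Hypothetical tree for illustration.
TREE = {
    "r": ["a", "b"], "a": ["c", "d"], "b": ["e", "f"],
    "c": ["g"], "e": ["h", "i"],
}

def run_two_workers():
    stacks = [deque(["r"]), deque()]   # worker 0 starts with the root
    visited = [[], []]
    while any(stacks):
        for w in range(2):
            if not stacks[w]:
                victim = stacks[1 - w]
                if len(victim) >= 2:          # victim has enough work:
                    for _ in range(len(victim) // 2):
                        # steal from the bottom (oldest, near the root)
                        stacks[w].append(victim.popleft())
                # else: the request is "rejected"; the worker idles
            if stacks[w]:
                node = stacks[w].pop()
                visited[w].append(node)
                for child in reversed(TREE.get(node, [])):
                    stacks[w].append(child)
    return visited

visited = run_two_workers()
```

Every node is processed exactly once and both workers end up with real work, even though worker 1 starts empty.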
DTD: Distributed Termination Detection
If all stacks are empty, terminate the algorithm. But there might be unfinished communication, so messages are also counted: if the numbers of sends and receives are equal, terminate; if any node receives a message, restart the counting.
Counting is not synchronized, so a single snapshot is unreliable: matching send and receive counts can be observed even while a message is still in flight. The algorithm accounts for this and is proved to be correct.
A famous token-based algorithm is [Dijkstra and Scholten 1980]; a much improved DTD on a spanning tree is [Mattern 1990] (not well known).
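To see why one snapshot is not enough, here is a deliberately simplified toy (not Mattern's actual wave algorithm, which runs over a spanning tree): termination is declared only if two consecutive waves both see every process idle with matching, unchanged global send/receive counters.

```python
def all_idle(snap):
    return all(not p["active"] for p in snap)

def counts(snap):
    return [(p["sent"], p["recv"]) for p in snap]

def safe_to_terminate(prev, curr):
    """Terminate only if two consecutive waves both saw every process
    idle, global sends == global recvs, and the per-process counters
    did not change between the waves (so nothing was in flight)."""
    for snap in (prev, curr):
        if not all_idle(snap):
            return False
        if sum(s for s, _ in counts(snap)) != sum(r for _, r in counts(snap)):
            return False
    return counts(prev) == counts(curr)

# Wave 1: process 0 sent a message that process 1 has not yet received.
wave1 = [{"active": False, "sent": 1, "recv": 0},
         {"active": False, "sent": 0, "recv": 0}]
# Wave 2: the message has arrived; counters now balance.
wave2 = [{"active": False, "sent": 1, "recv": 0},
         {"active": False, "sent": 0, "recv": 1}]
# Wave 3: still idle, counters unchanged: now it is safe to stop.
wave3 = [{"active": False, "sent": 1, "recv": 0},
         {"active": False, "sent": 0, "recv": 1}]
```

After wave 1 the counters do not balance, so the in-flight message blocks termination; only once two stable, balanced waves agree does the detector fire.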
Using Two Communication Patterns
- Work stealing between stacks (requests, rejects, and gives on the lifeline graph)
- DTD and threshold broadcast/reduce (on a spanning tree rooted at the master)
(Figure: panels (a)-(c) illustrating the two communication patterns on 8 processes.)
Additional Technique: Combine with Static Load Balancing
The original algorithm starts with only the root node pushed on one process. If the depth-1 nodes are instead distributed across the processes at initialization, load balancing becomes easier.
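The static initialization step can be sketched as a round-robin deal of the root's children (the tree and the worker count below are assumed for illustration):

```python
# Hypothetical: a root with 8 children, dealt to 4 workers.
TREE = {"r": ["a", "b", "c", "d", "e", "f", "g", "h"]}

def init_stacks(root, n_workers):
    """Expand the root once and distribute the depth-1 nodes
    round-robin, so every worker starts with real work instead of
    waiting to steal from the single process holding the root."""
    stacks = [[] for _ in range(n_workers)]
    for i, child in enumerate(TREE[root]):
        stacks[i % n_workers].append(child)
    return stacks

stacks = init_stacks("r", 4)
```

Work stealing still runs afterwards; the static deal only removes the initial ramp-up in which all but one process would be idle.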
Statistical Pattern Mining: Speedup
Speedup for finding combinations of SNPs related to Alzheimer or to the Japanese population, from a subset of SNPs.
(Figure: wall-clock time (log scale) and speedup versus number of processes, up to 1200 processes, for five datasets.)
- JPN dom. 10% (items: 11253, trans: 697, density: 1.0%): 126 s → 0.444 s
- JPN dom. 20% (items: 11914, trans: 697, density: 1.9%): 258 s → 0.409 s
- Alz. dom. 5% (items: 44052, trans: 364, density: 5.4%): 4361 s → 9.58 s
- Alz. dom. 10% (items: 91126, trans: 364, density: 9.8%): 48285 s → 41.1 s
- Alz. rec. 30% (items: 250120, trans: 364, density: 2.9%): 17646 s → 16.0 s
(Figure: one additional panel, up to 72 processes: 1,000 Genomes JPN dom. 15% (SNPs: 11676, density: 1.5%): 1800 s → 31.0 s.)
Environment: EC2 c4.8xlarge, Xeon E5-2666 v3 (2.9 GHz, 18 cores), 10G Ethernet, using cfncluster.
Breakdown of Execution Time (Search, Communication, Idle)
(Figure: stacked bars of total time (main, preprocess, probe, idle) versus number of processes, from 1 to 1200, for five datasets: JPN dom. 10%, HapMap dom. 20%, Alz. dom. 5%, Alz. dom. 10%, Alz. rec. 30%.)
For more difficult problems, the communication delay is hidden.
Efficient Speedup on the K Computer
Used up to 140K CPU cores on the K computer and achieved an estimated speedup of 110-120K fold (collaboration with RIKEN R-CCS, unpublished). In other existing work on parallel pattern mining, speedups are limited to 30-fold or less.
We are also working on:
- a parallel itemset mining library
- algorithms for other mining problems (including continuous feature data)
One of the discovered SNP combinations from a subset of SNPs for Alzheimer (data from [Alzheimer]); the p-value after correction for these 4 SNPs was 0.00031777:
- rs1057079: mTOR, mTOR-AS1
- rs1064261: mTOR
- rs2075800: HSPA1L (Hsp70)
- rs429358: APOE
Statistical Pattern Mining and Feature Selection
Database: number of features roughly 10K to 1M; number of data points roughly 1000.
…GTCTAAAACATGATT…
…GTCTGAATCATGATT…
…GTCTGAAACATGATT…
…GTCTGAATCATCATT…
(Table: transactions A-G plus an unseen sample "?" over items 1-6, with +/- labels.)
High-dimensional (binary) features, with a relatively small number of samples.
Our method finds statistically significant combinations of features.
We have methods for both binary and continuous features (with some limitations).
Conclusions
- Parallel search algorithms are not straightforward, but possible. Depth-first search algorithms are relatively easy to parallelize.
- We parallelized DFS-based pattern mining and observed highly efficient parallel speedup: measured on a real application for GWAS, and runs on the K computer (at RIKEN R-CCS) achieved a 110K+ fold speedup.
- We can apply this approach to other DFS-based mining algorithms (and hopefully to other DFS in general). We implemented a parallel pattern mining library (preparing to make it public).
- Looking for collaboration with ML people on high-dimensional feature selection.