Data Mining, Parallelism, and Grids

David Skillicorn
Queen's University, Kingston
skill@cs.queensu.ca

Data mining builds models from data, in the hope that these models reveal some knowledge about the underlying data.

Think of data as a matrix: the rows are objects and the columns are attributes.

Models of data are used for:
-- prediction (supervised learning; one of the attributes is the target attribute)
   - neural networks
   - decision trees
   - support vector machines
-- understanding (unsupervised learning; learn relationships among objects via their attributes)
   - clustering
   - neural networks

There are typically three phases to the data mining process:
1. A model is built on a training dataset
2. The model is tested on a test dataset
3. The model is deployed on new data

The quality of a model is measured by the prediction error rate on the test dataset (supervised), or by some measure of the consistency and tightness of the relationships (unsupervised).

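The three phases and the supervised quality measure can be illustrated with a toy example. This sketch is not from the slides: the synthetic dataset, the one-attribute threshold model (a decision stump), and the names make_dataset, train_stump, error_rate and predict are all assumptions chosen to keep it short.

```python
import random

def make_dataset(n, seed):
    """Synthetic rows (x, label): label is 1 when x > 0.5, flipped with 10% noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:
            label = 1 - label
        rows.append((x, label))
    return rows

def train_stump(rows):
    """Phase 1: choose the observed x value that makes the fewest training errors
    when used as a threshold."""
    best_t, best_err = 0.0, len(rows) + 1
    for t, _ in rows:
        err = sum(int(x > t) != y for x, y in rows)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def error_rate(threshold, rows):
    """Phase 2: fraction of test rows the model predicts wrongly."""
    return sum(int(x > threshold) != y for x, y in rows) / len(rows)

def predict(threshold, x):
    """Phase 3: apply the model to a new object."""
    return int(x > threshold)

train = make_dataset(200, seed=1)   # training dataset
test = make_dataset(200, seed=2)    # separate test dataset
model = train_stump(train)
test_error = error_rate(model, test)
```

Here test_error plays the role of the prediction error rate on the test dataset; the unsupervised quality measures mentioned above are not covered by this sketch.
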
Applications exist in:
* commercial (customer relationship management);
* industrial (component maintenance prediction);
* scientific (unusual particle interactions);
* engineering (turbulence in fluid flow);
but the commercial sector is by far the largest, and there's still lots of runway. (The major limitation on further growth is the shortage of skilled people.)

Parallel Data Mining

Since data mining algorithms are both compute-bound and data-access-bound, it's natural to use parallelism.

Multiple processors help with the compute part.

Parallel computers have flatter memory hierarchies (more memory is closer to a processor), which helps with data access.

Most DM techniques are approximating in this sense: the quality of the model improves as more examples are seen.

This creates some interesting possibilities:
-- quick and dirty modelling based on small samples
-- steerable modelling, where early feedback helps a user select interesting questions
-- parallel and distributed modelling, where each processor models its part of a larger dataset

Surprise 1: The rate at which model 'quality' improves is much greater than known error bounds suggest.

[Figure: quality against number of examples seen, with a region of very fast improvement, a region that is still improving, and a region of very slow improvement; marks at 1% and 5%]

Often 95% of the objects provide the final 2% of model improvement.

[Figure: quality against number of examples seen]

But the asymptote depends on the total number of examples, so more is better. Sampling isn't the answer.

What's going on?

There's a lot of repetition in typical data mining datasets. Every early example reveals a new aspect of the model. After a while, new examples repeat much of the 'knowledge' from examples seen earlier.

A big sample contains more examples that differ from one another. They force the model to consider richer representations.

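The repetition argument can be made concrete with a small simulation. This is only a sketch under stated assumptions: examples are reduced to abstract 'patterns' drawn from a skewed (Zipf-like) distribution, and novelty_curve and its parameters are hypothetical names, not anything from the slides.

```python
import random

def novelty_curve(n_examples, n_patterns=50, seed=0):
    """Return, for each example seen, how many distinct patterns have appeared so far."""
    rng = random.Random(seed)
    # Assumed skew: pattern k is drawn with weight 1/(k+1), so a few are very common.
    weights = [1.0 / (k + 1) for k in range(n_patterns)]
    seen = set()
    curve = []
    for _ in range(n_examples):
        pattern = rng.choices(range(n_patterns), weights=weights)[0]
        seen.add(pattern)
        curve.append(len(seen))
    return curve

curve = novelty_curve(2000)
new_in_first_100 = curve[99]                 # distinct patterns found in the first 100 examples
new_in_last_1000 = curve[-1] - curve[999]    # patterns first seen in the last 1000 examples
```

Under these assumptions the first hundred examples uncover far more new patterns than the last thousand do, which is the same shape as the quality curve described above: steep at first, then nearly flat.
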
Strategy for parallelism:
1. Partition the dataset 'by rows' and allocate a partition to each processor;
2. Learn a model locally at each processor;
3. (Somehow) merge the local models into a single, global model that would have been produced by a sequential data mining learner.

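The three steps above can be sketched for a model whose merge happens to be exact. The per-class centroid model used here is an assumption picked for simplicity (it is not one of the techniques named in the slides), and learn, merge and centroids are illustrative names.

```python
def learn(rows):
    """Learn a local model: for each class label, keep (count, per-attribute sums)."""
    model = {}
    for attrs, label in rows:
        count, sums = model.get(label, (0, [0.0] * len(attrs)))
        model[label] = (count + 1, [s + a for s, a in zip(sums, attrs)])
    return model

def merge(models):
    """Merge local models by adding their counts and sums class by class."""
    merged = {}
    for m in models:
        for label, (count, sums) in m.items():
            c0, s0 = merged.get(label, (0, [0.0] * len(sums)))
            merged[label] = (c0 + count, [a + b for a, b in zip(s0, sums)])
    return merged

def centroids(model):
    """Turn the sufficient statistics into class centroids."""
    return {label: [s / count for s in sums] for label, (count, sums) in model.items()}

rows = [([1.0, 2.0], "a"), ([3.0, 4.0], "a"), ([10.0, 0.0], "b"),
        ([12.0, 2.0], "b"), ([5.0, 6.0], "a"), ([14.0, 4.0], "b")]

p = 3
partitions = [rows[i::p] for i in range(p)]            # step 1: partition by rows
local_models = [learn(part) for part in partitions]    # step 2: learn locally
global_model = merge(local_models)                     # step 3: merge
sequential_model = learn(rows)                         # what a sequential learner produces
```

For this model the merged result equals the sequential one. Most real techniques do not decompose this neatly, which is why the merge step is marked '(Somehow)'.
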
There's no magic bullet: a new merging technique has to be discovered for each data mining technique.

Merging techniques are known for:
1. Neural networks (supervised and unsupervised)
2. Inductive logic programming
3. Frequent sets (and so association rules)
4. Bagging
5. Boosting/arcing
and probably for other techniques too.

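As one hedged illustration of a merge, consider frequent sets. Support counts are additive over row partitions, so a naive merge just sums local counts. The sketch below shows only that additive property for pairs of items; it is not the merging technique the slides refer to, and the data and names are invented for the example.

```python
from collections import Counter
from itertools import combinations

def count_itemsets(transactions, size):
    """Count how many transactions contain each itemset of the given size."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(set(t)), size):
            counts[itemset] += 1
    return counts

transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["eggs", "milk"],
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["milk"],
]

p = 2
partitions = [transactions[i::p] for i in range(p)]          # partition by rows
local_counts = [count_itemsets(part, 2) for part in partitions]
global_counts = sum(local_counts, Counter())                 # merge: add the counts

min_support = 3  # assumed threshold: an itemset is frequent if in at least 3 transactions
frequent_pairs = sorted(s for s, c in global_counts.items() if c >= min_support)
```

A practical parallel algorithm would also prune candidates locally to limit how many counts are exchanged; that refinement is left out here.
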
Using p processors gives an immediate speedup of almost p (less merging overhead).
(Speedup Factor 1)

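As a rough worked form of Speedup Factor 1, assume a simple cost model that is not in the slides: n examples, a learning cost c per example, and a merging overhead m.

```latex
S_1 \;=\; \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}}
    \;=\; \frac{n c}{\dfrac{n c}{p} + m}
    \;=\; \frac{p}{1 + \dfrac{p\,m}{n c}} \;<\; p
```

Under this model the speedup approaches p when the merging overhead m is small compared with the per-processor work nc/p.
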
But... each processor is working with early examples. The merge step improves model quality without seeing fresh examples.
(Speedup Factor 2)

[Figure: quality against examples per processor]

So there's an extra speedup (real superlinear speedup) because every cycle is being spent learning in a productive range of the dataset: convergence happens more quickly.

[Of course, this means that sequential implementations should use a sequentialisation of this parallel strategy: a bitewise strategy. This is one of the few examples of how a parallel mindset leads to new sequential algorithms.]

Surprise 2: Exchanging local models with other processors tends to create even faster convergence, a third source of speedup.
(Speedup Factor 3)

The mechanism of this 'extra' speedup depends on the underlying data mining technique.

For some datasets, it's because of the shape of the quality-example curve.

If the dataset is big enough, each processor gets enough data in its partition that it moves beyond the early examples, where learning improves steeply, and starts to spend time in the next region.

So exchange models at the end of the steepest region.

Example: Neural networks. Choosing the correct batch size is critical.

[Figure: quality improvement for each processor against examples seen by a processor; each rapid improvement region is followed by a gain from exchanging models, giving much faster convergence overall]

For other datasets, the 'extra' speedup comes because some objects can be ignored once they are accounted for by the model.

Example: Inductive logic programming. Find a disjunction of concepts that explains all of the objects.

Once an object is accounted for, it does not need to be considered further. Getting p concepts in each round reduces the remaining examples quickly.

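The effect of getting several concepts per round can be sketched with a greedy covering loop. This is a simplification under stated assumptions: concepts are represented directly as the sets of objects they explain, a single loop stands in for p processors each contributing a concept, and covering_rounds and the toy data are hypothetical.

```python
def covering_rounds(objects, concepts, per_round):
    """Adopt up to per_round concepts each round and drop the objects they explain.

    Returns the number of objects still unexplained after each round.
    """
    remaining = set(objects)
    unused = dict(concepts)
    history = []
    while remaining and unused:
        for _ in range(per_round):
            if not unused:
                break
            # Pick the unused concept that explains the most remaining objects.
            name = max(sorted(unused), key=lambda c: len(unused[c] & remaining))
            if not unused[name] & remaining:
                unused = {}          # no concept explains anything new; stop
                break
            remaining -= unused.pop(name)
        history.append(len(remaining))
    return history

concepts = {
    "A": {0, 1, 2, 3},
    "B": {4, 5, 6},
    "C": {7, 8},
    "D": {9, 10, 11},
    "E": {3, 4},
}
one_per_round = covering_rounds(range(12), concepts, per_round=1)
two_per_round = covering_rounds(range(12), concepts, per_round=2)
```

With one concept per round the remaining objects shrink as 8, 5, 2, 0; with two per round they shrink as 5, 0, so later rounds have fewer objects to consider.
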
The overall program structure is:

Partition the dataset across p processors
Forall processors (in parallel)
    Set base model to be empty
    For q rounds
        Improve the base model using n/pq new data
            (choose n/pq to get optimal speedup behaviour)
        Total exchange of models among processors
        Produce a new base model merging models received

N.B. fit with BSP!

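The program structure above can be simulated on one machine to show the control flow. This is a sketch under several assumptions that are not in the slides: n is taken to be the dataset size and to be divisible by p*q, the 'model' is just a (count, sum) pair, the processors are simulated by list comprehensions, and the merge adds each processor's change to the shared base so that the base is not counted more than once.

```python
def improve(base, chunk):
    """Improve a base model (count, total) with a chunk of new data."""
    count, total = base
    return (count + len(chunk), total + sum(chunk))

def merge(base, received):
    """New base = old base plus every processor's change relative to that base."""
    count = base[0] + sum(m[0] - base[0] for m in received)
    total = base[1] + sum(m[1] - base[1] for m in received)
    return (count, total)

def bsp_mine(data, p, q):
    partitions = [data[i::p] for i in range(p)]   # partition the dataset across p processors
    chunk = len(data) // (p * q)                  # n/pq new examples per processor per round
    bases = [(0, 0.0)] * p                        # every processor starts with an empty model
    for r in range(q):
        # Computation phase: each processor improves its base model on fresh data.
        local = [improve(bases[i], partitions[i][r * chunk:(r + 1) * chunk])
                 for i in range(p)]
        # Communication phase: total exchange, so every processor receives every model.
        received = [list(local) for _ in range(p)]
        # After the barrier, each processor builds the same new base model.
        bases = [merge(bases[i], received[i]) for i in range(p)]
    return bases[0]

data = [float(x) for x in range(24)]
final_model = bsp_mine(data, p=3, q=2)
```

Each round is one compute phase, one total exchange, and one merge, which is the superstep shape that makes the structure a natural fit for BSP.
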
N.B. the structure of these algorithms goes well beyond the reductive structure assumed by other data-intensive approaches, e.g. DataCutter.

There remain interesting problems around storage management, e.g. extracting a sample without fetching every page to memory.

Distributed Data Mining

Distributed data mining is also becoming important.

Here the attributes of an object are located in different places. Perhaps they were collected via different touchpoints (store, 800 number, website), or different channels (roaming cell phone use).

This corresponds to partitioning the dataset by columns.

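Partitioning by columns can be pictured as follows. The customer records, attribute names and site names are invented for illustration, and partition_by_columns and rejoin are hypothetical helpers.

```python
def partition_by_columns(dataset, site_attributes):
    """Give each site every object, but only that site's attributes."""
    return {
        site: {obj: {a: row[a] for a in attrs} for obj, row in dataset.items()}
        for site, attrs in site_attributes.items()
    }

def rejoin(sites):
    """Rebuild full rows by object id; often infeasible for real distributed data."""
    full = {}
    for columns in sites.values():
        for obj, attrs in columns.items():
            full.setdefault(obj, {}).update(attrs)
    return full

dataset = {
    "cust1": {"store_spend": 120.0, "calls_to_800": 2, "web_visits": 14},
    "cust2": {"store_spend": 15.5, "calls_to_800": 0, "web_visits": 3},
}
site_attributes = {
    "store": ["store_spend"],
    "call_centre": ["calls_to_800"],
    "website": ["web_visits"],
}
sites = partition_by_columns(dataset, site_attributes)
```

Every site sees every customer but only a slice of the attributes, in contrast to the row partitioning used for parallel data mining, where each processor sees whole objects.
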
It is often not possible to collect the attributes in one place, because the dataset is too big or there are jurisdictional boundaries.

Solutions require learning useful information locally in such a way that it can be combined to give a globally accurate model.

For example, a customer may seem to fit the profile of a good customer by her attributes at one site, but not at the others. How can we tell the true state of affairs (i.e. what the sequential algorithm would have said)?

There's only very preliminary work, e.g. Kargupta (Fourier bases, wavelets), my group (SVD).

Distributed DM is the first obvious example of an increasingly important class of applications: those that use large, immovable datasets and large computations on them.

Other examples include: on-the-fly application construction from components ('cloud computing'); and mobile agent applications.

Data has inertia: it's easy to keep it in a fixed place, and it's easy to move it around, but transitions between these two states are complex, messy, and slow.

Data grids don't seem very scalable. Moving a petabyte of data is problematic, no matter what your beliefs about network cost and bandwidth. Finding a petabyte of temporary disk space for every application running on a compute server seems unrealistic.

And yet petabyte datasets are very close.

Increasingly, moving data to computations is the wrong thing to do; better to move computations to data. This is the premise of the datacentric grid project.

The main new architectural requirement is that data repositories need to be fronted by large compute servers to process their data.

[Figure: data connected to a cluster by a thick pipe]

Find the average value of galaxy brightness in the X-ray spectrum.

There are 100 gigagalaxies known; the number is increasing rapidly (Hubble). Partially overlapped data about them is kept in ~30 big repositories.

Galaxies have about a kiloattribute: each repository holds some, but often scaled differently (e.g. to account for red shift, or not).

Some datasets can be downloaded; some have SQL interfaces; some have homegrown query interfaces.

The same object has different names (30+).

Today's solution:

Huge amount of figuring out dataset contents and properties up front.

Messy combination of downloading, generating queries, and postprocessing.

Poor solutions, and a lot of work to get any useful results (about 4 grad-student-months per result).

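By contrast with the workflow above, a datacentric version of the galaxy-brightness query might ship a small task to each repository and bring back only partial results. The sketch below assumes a great deal that the slides leave open: the Repository class and its run method are hypothetical, each repository is assumed to know the factor that converts its values to a common scale, and the repositories are assumed to hold disjoint sets of galaxies (so the overlap and naming problems are ignored).

```python
class Repository:
    """Stand-in for a data repository fronted by its own compute server."""

    def __init__(self, brightness_by_galaxy, scale):
        self._data = brightness_by_galaxy   # stays inside the repository
        self.scale = scale                  # assumed factor to a common brightness scale

    def run(self, task):
        """Execute a small task next to the data and return only its result."""
        return task(self._data, self.scale)

def local_sum_and_count(data, scale):
    """Task shipped to each repository: rescale locally, return (sum, count)."""
    values = [v * scale for v in data.values()]
    return sum(values), len(values)

def average_brightness(repositories):
    """Combine the small partial results; no raw data crosses the network."""
    partials = [repo.run(local_sum_and_count) for repo in repositories]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

repositories = [
    Repository({"g1": 2.0, "g2": 4.0}, scale=1.0),
    Repository({"g3": 6.0, "g4": 10.0}, scale=0.5),
]
mean = average_brightness(repositories)
```

Only two numbers per repository are moved, however large the repositories are; handling overlapping data and the many names for the same object would need machinery this sketch does not attempt.
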
The requirements for datacentric grids are quite different from those of computational grids. Some of the interesting issues are:
* infrastructure for application description
* building programs (perhaps from queries)
* execution planning, esp. as deltas are common
* keeping results for reuse
* describing the contents of repositories (contents and types (cf. constructor calculus))

Summary

1. Data mining is a major application area, with huge demands for resources, and a large potential pool of users.
2. Parallelism and data mining fit well together because the local computation requirements are large, and the global communication requirements are small.
3. Distributed, grid-scale computing and data mining fit well together, but moving large datasets is too expensive; so a new datacentric approach is needed.

Credits: Sabine McConnell, Freeman Huang, Owen Rogers, Ali Roumani, Ricky Wang, Carol Yu

www.cs.queensu.ca/home/skill
?