Data Mining, Parallelism, and Grids
David Skillicorn
Queen's University, Kingston
skill@cs.queensu.ca

research.cs.queensu.ca/home/skill/dm.pdf

Data mining builds models from data, in the hope that these models reveal some knowledge about the underlying data.

Think of data as a matrix: rows are objects, columns are attributes.

Models of data are used for:

-- prediction (supervised learning; one of the attributes is the target attribute)
   - neural networks
   - decision trees
   - support vector machines

-- understanding (unsupervised learning; learn relationships among objects via their attributes)
   - clustering
   - neural networks

There are typically three phases to the data mining process:

1. A model is built on a training dataset
2. The model is tested on a test dataset
3. The model is deployed on new data

The quality of a model is measured by the prediction error rate on the test dataset (supervised), or some measure of the consistency and tightness of the relationships (unsupervised).
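The three phases and the supervised quality measure can be sketched in a few lines. This is only an illustration: the nearest-centroid classifier, the class labels and the toy data are assumptions chosen for brevity, not something the slides prescribe.

```python
from statistics import mean

def build_model(training):
    """Phase 1: build a model (one centroid per class) from (attributes, target) pairs."""
    by_class = {}
    for attrs, target in training:
        by_class.setdefault(target, []).append(attrs)
    return {t: tuple(mean(col) for col in zip(*rows)) for t, rows in by_class.items()}

def predict(model, attrs):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    return min(model, key=lambda t: sum((a - c) ** 2 for a, c in zip(attrs, model[t])))

def error_rate(model, test):
    """Phase 2: fraction of test objects whose target attribute is predicted wrongly."""
    wrong = sum(1 for attrs, target in test if predict(model, attrs) != target)
    return wrong / len(test)

# Hypothetical toy data: two attributes per object, target is "good" or "bad".
training = [((1.0, 1.0), "good"), ((1.2, 0.8), "good"),
            ((4.0, 4.2), "bad"), ((3.8, 4.0), "bad")]
test = [((0.9, 1.1), "good"), ((4.1, 3.9), "bad"),
        ((2.0, 2.0), "bad"), ((1.1, 0.9), "good")]

model = build_model(training)                # phase 1: build
rate = error_rate(model, test)               # phase 2: test
new_prediction = predict(model, (3.5, 3.9))  # phase 3: deploy on new data
```

On this toy data one of the four test objects is misclassified, so the prediction error rate is 0.25.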

Applications exist in:

- commercial (customer relationship management);
- industrial (component maintenance prediction);
- scientific (unusual particle interactions);
- engineering (turbulence in fluid flow);

but the commercial sector is by far the largest, and there's still lots of runway.

(The major limitation on further growth is the shortage of skilled people.)

Parallel Data Mining

Since data mining algorithms are both compute-bound and data-access bound, it's natural to use parallelism.

Multiple processors help with the compute part; parallel computers have flatter memory hierarchies (more memory is closer to a processor), which helps with data access.

Most DM techniques are approximating in this sense:

The quality of the model improves as more examples are seen.

This creates some interesting possibilities:

-- quick and dirty modelling based on small samples
-- steerable modelling, where early feedback helps a user select interesting questions
-- parallel and distributed modelling, where each processor models its part of a larger dataset

Surprise 1: The rate at which model 'quality' improves is much greater than known error bounds suggest.

[Figure: model quality plotted against number of examples seen. Labelled regions: very fast improvement, still improving, very slow improvement. Axis marks at 1% and 5%.]

Often 95% of the objects provide the final 2% of model improvement.

[Figure: quality plotted against number of examples seen.]

But the asymptote depends on the total number of examples, so more is better. Sampling isn't the answer.

What's going on?

There's a lot of repetition in typical data mining datasets.

Every early example reveals a new aspect of the model. After a while, new examples repeat much of the 'knowledge' from examples seen earlier.

A big sample contains more examples that differ from one another. They force the model to consider richer representations.

Strategy for parallelism:

1. Partition the dataset 'by rows' and allocate a partition to each processor;
2. Learn a model locally at each processor;
3. (Somehow) merge the local models into a single, global model that would have been produced by a sequential data mining learner.

There's no magic bullet: a new merging technique has to be discovered for each data mining technique.
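A minimal sketch of the three steps, for a model where merging happens to be easy. The "model" here is an assumption made for illustration: per-class object counts and attribute sums, which merge exactly by addition. Real techniques need their own merge rule, as the slide says.

```python
def partition_by_rows(dataset, p):
    """Step 1: deal the objects (rows) out to p processors."""
    return [dataset[i::p] for i in range(p)]

def learn_local(rows):
    """Step 2: local model = per-class object count and per-attribute sums."""
    model = {}
    for attrs, target in rows:
        count, sums = model.get(target, (0, [0] * len(attrs)))
        model[target] = (count + 1, [s + a for s, a in zip(sums, attrs)])
    return model

def merge(models):
    """Step 3: for this particular model, merging is just adding the statistics."""
    merged = {}
    for m in models:
        for target, (count, sums) in m.items():
            c0, s0 = merged.get(target, (0, [0] * len(sums)))
            merged[target] = (c0 + count, [x + y for x, y in zip(s0, sums)])
    return merged

# Hypothetical dataset: (attributes, target attribute) per object.
dataset = [((1, 2), "a"), ((3, 4), "b"), ((5, 6), "a"), ((7, 8), "b"), ((9, 10), "a")]

parts = partition_by_rows(dataset, 3)
global_model = merge(learn_local(part) for part in parts)
sequential_model = learn_local(dataset)  # what a sequential learner would have produced
```

Here `global_model` equals `sequential_model`, which is the property step 3 asks for. For most learners the merge is not a simple sum and equality holds only approximately, if at all.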

Merging techniques are known for:

1. Neural networks (supervised and unsupervised)
2. Inductive logic programming
3. Frequent sets (and so association rules)
4. Bagging
5. Boosting/arcing

and probably for other techniques too.

Using p processors gives an immediate speedup of almost p (less merging overhead).

(Speedup Factor 1)
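As a rough piece of arithmetic for Speedup Factor 1, assume (this cost model is not from the slides) that the learning work divides evenly across p processors and that merging adds a fixed overhead `t_merge`:

```python
def speedup(p, t_seq, t_merge):
    """Speedup when learning time divides evenly by p and merging adds t_merge."""
    t_par = t_seq / p + t_merge
    return t_seq / t_par

# Hypothetical timings: 100 time units sequentially, 1 unit of merging overhead.
s = speedup(p=10, t_seq=100.0, t_merge=1.0)  # 100 / 11, a little under 10
```

With these made-up numbers the speedup is about 9.1 on 10 processors: "almost p", with the shortfall due entirely to the merge.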

But... each processor is working with early examples.

The merge step improves model quality without seeing fresh examples.

(Speedup Factor 2)

[Figure: quality plotted against examples per processor.]

So there's an extra speedup (real superlinear speedup) because every cycle is being spent learning in a productive range of the dataset: convergence happens more quickly.

[Of course, this means that sequential implementations should use a sequentialisation of this parallel strategy: a bitewise strategy. This is one of the few examples of how a parallel mindset leads to new sequential algorithms.]

Surprise 2: Exchanging local models with other processors tends to create even faster convergence, a third source of speedup.

(Speedup Factor 3)

The mechanism of this 'extra' speedup depends on the underlying data mining technique.

For some datasets, it's because of the shape of the quality-example curve.

If the dataset is big enough, each processor gets enough data in its partition that it moves beyond the early examples, where learning improves steeply, and starts to spend time in the next region.

So exchange models at the end of the steepest region.

Example: Neural networks. Choosing the correct batch size is critical.

[Figure: quality improvement for each processor, plotted against examples seen by a processor. Each rapid improvement region is followed by a gain from exchanging models, giving much faster convergence overall.]

For other datasets, the 'extra' speedup comes because some objects can be ignored once they are accounted for by the model.

Example: Inductive logic programming. Find a disjunction of concepts that explains all of the objects.

Once an object is accounted for, it does not need to be considered further. Getting p concepts in each round reduces the remaining examples quickly.
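The effect can be illustrated with a deliberately simplified stand-in for inductive logic programming. In this sketch the candidate concepts are given up front as sets of objects (a real ILP system searches for them), and each simulated processor picks the candidate that explains most of its own share of the remaining objects. All names and data are hypothetical.

```python
def partition(items, p):
    """Split a list into at most p contiguous chunks."""
    size = -(-len(items) // p)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_cover(objects, concepts, p):
    """Return (chosen concepts, number of rounds, objects left unexplained)."""
    remaining = set(objects)
    chosen, rounds = [], 0
    while remaining:
        round_concepts = []
        for part in partition(sorted(remaining), p):
            part_set = set(part)
            # Each processor proposes the concept covering most of its local objects.
            best = max(concepts, key=lambda name: len(concepts[name] & part_set))
            if concepts[best] & part_set and best not in round_concepts:
                round_concepts.append(best)
        if not round_concepts:
            break  # no candidate explains any remaining object
        # Exchange: every object explained by any proposed concept is dropped.
        for name in round_concepts:
            remaining -= concepts[name]
        chosen.extend(round_concepts)
        rounds += 1
    return chosen, rounds, remaining

objects = list(range(12))
concepts = {"low": {0, 1, 2, 3, 4}, "mid": {4, 5, 6, 7}, "high": {8, 9, 10, 11}}

_, rounds_1, left_1 = parallel_cover(objects, concepts, p=1)
_, rounds_3, left_3 = parallel_cover(objects, concepts, p=3)
```

On this toy input one processor needs three rounds to explain every object, while three processors need one, because each round removes the objects covered by up to p concepts at once.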

The overall program structure is:

Partition the dataset across p processors
Forall processors (in parallel)
    Set base model to be empty
    For q rounds
        Improve the base model using n/pq new data
            (choose n/pq to get optimal speedup behaviour)
        Total exchange of models among processors
        Produce a new base model merging models received

N.B. This fits with BSP!
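The structure above can be simulated sequentially to see how the pieces fit. Everything specific in this sketch is an assumption for illustration: the model is a single weight w for y = w * x, local improvement is plain stochastic gradient descent, and the merge rule is averaging the exchanged weights. The slides do not commit to any of these.

```python
def improve(w, batch, lr=0.1):
    """Local learning step: SGD on y = w * x over one batch of new data."""
    for x, y in batch:
        w += lr * (y - w * x) * x
    return w

def merge(models):
    """Assumed merge rule: average the weights received in the exchange."""
    return sum(models) / len(models)

def bsp_learn(dataset, p, q):
    n = len(dataset)
    batch = n // (p * q)                        # n/pq new examples per processor per round
    parts = [dataset[i::p] for i in range(p)]   # partition the dataset across p processors
    base = 0.0                                  # the "empty" base model
    for r in range(q):
        # Local computation: one call per processor (these would run in parallel).
        local = [improve(base, parts[i][r * batch:(r + 1) * batch]) for i in range(p)]
        # Total exchange, then merge: every processor ends up with the same base model.
        base = merge(local)
    return base

# Hypothetical noise-free data generated from y = 2x.
xs = [0.5, 1.0, 1.5, 2.0]
dataset = [(xs[i % 4], 2.0 * xs[i % 4]) for i in range(240)]

w = bsp_learn(dataset, p=4, q=6)
```

Each round is one BSP superstep: local computation, a total exchange, and a barrier before the next round. On this toy data `w` ends within 0.01 of the true value 2.0; how the batch size n/pq trades off against the number of rounds is exactly the tuning question the slide raises.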

N.B. The structure of these algorithms goes well beyond the reductive structure assumed by other data-intensive approaches, e.g. DataCutter.

There remain interesting problems around storage management, e.g. extracting a sample without fetching every page to memory.

Distributed Data Mining

Distributed data mining is also becoming important.

Here the attributes of an object are located in different places. Perhaps they were collected via different touchpoints (store, 800 number, website), or different channels (roaming cellphone use).

This corresponds to partitioning the dataset by columns.

It is often not possible to collect the attributes in one place because the dataset is too big, or there are jurisdictional boundaries.
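To make 'by columns' concrete, here is a small sketch of the data layout. The site names and attribute names are hypothetical, loosely modelled on the touchpoints mentioned above; each site sees every object but only some of its attributes.

```python
def partition_by_columns(rows, site_attributes):
    """Split each object's attributes across sites; every site keeps the object id."""
    return {
        site: {oid: {a: attrs[a] for a in wanted} for oid, attrs in rows.items()}
        for site, wanted in site_attributes.items()
    }

# Hypothetical customers (objects) with three attributes each.
rows = {
    "c1": {"store_visits": 3, "calls": 1, "web_clicks": 40},
    "c2": {"store_visits": 0, "calls": 5, "web_clicks": 2},
}
# Hypothetical assignment of attributes (columns) to sites.
site_attributes = {
    "store": ["store_visits"],
    "phone": ["calls"],
    "web": ["web_clicks"],
}

sites = partition_by_columns(rows, site_attributes)
```

Contrast this with the row partitioning used for parallel data mining: there each processor held complete objects, whereas here no single site holds a complete object.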

Solutions require learning useful information locally in such a way that it can be combined to give a globally accurate model.

For example, a customer may seem to fit the profile of a good customer by her attributes at 1 site, but not at the others. How can we tell the true state of affairs (i.e. what the sequential algorithm would have said)?

There's only very preliminary work, e.g. Kargupta (Fourier bases, wavelets), my group (SVD).

Distributed DM is the first obvious example of an increasingly important class of applications: those that use large, immovable datasets and large computations on them.

Other examples include: on-the-fly application construction from components ('cloud computing'); and mobile agent applications.

Data has inertia: it's easy to keep it in a fixed place; and it's easy to move it around; but transitions between these two states are complex, messy, and slow.

Data grids don't seem very scalable. Moving a petabyte of data is problematic, no matter what your beliefs about network cost and bandwidth. Finding a petabyte of temporary disk space for every application running on a compute server seems unrealistic. And yet petabyte datasets are very close.

Increasingly, moving data to computations is the wrong thing to do; better to move computations to data. This is the premise of the datacentric grid project.

The main new architectural requirement is that data repositories need to be fronted by large compute servers to process their data.

[Figure: a data repository connected to a cluster by a thick pipe.]
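A toy sketch of "move computations to data": each repository runs a small function next to its own data and ships back only the result. The `Repository` class and its `run` method are hypothetical, and the sketch assumes the repositories hold disjoint, identically scaled values, which real repositories often do not.

```python
class Repository:
    """Hypothetical data repository fronted by its own compute server."""

    def __init__(self, values):
        self._values = list(values)  # the data stays here and is never shipped

    def run(self, computation):
        """Execute a computation next to the data and return only its result."""
        return computation(self._values)

def local_sum_and_count(values):
    """The computation that travels: two numbers come back, not the dataset."""
    return (sum(values), len(values))

def global_mean(repositories):
    partials = [repo.run(local_sum_and_count) for repo in repositories]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

repos = [Repository([1.0, 2.0, 3.0]), Repository([4.0, 5.0]), Repository([6.0])]
mean_value = global_mean(repos)
```

Only one (sum, count) pair crosses the network per repository, however large the repository is. Overlapping or differently scaled data would need extra reconciliation that this sketch does not attempt.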

Find the average value of galaxy brightness in the X-ray spectrum.

There are 100 gigagalaxies known; the number is increasing rapidly (Hubble). Partially overlapped data about them is kept in ~30 big repositories.

Galaxies have about a kiloattribute: each repository holds some; but often scaled differently (e.g. to account for red shift, or not).

Some datasets can be downloaded; some have SQL interfaces; some have homegrown query interfaces.

The same object has different names (30+).

Today's solution:

Huge amount of figuring out dataset contents and properties up front.

Messy combination of downloading, generating queries, and postprocessing.

Poor solutions, and a lot of work to get any useful results (about 4 grad-student-months per result).

The requirements for datacentric grids are quite different from those of computational grids. Some of the interesting issues are:

* infrastructure for application description
* building programs (perhaps from queries)
* execution planning, esp. as deltas are common
* keeping results for reuse
* describing the contents of repositories (contents and types (cf constructor calculus))

Summary

1. Data mining is a major application area, with huge demands for resources, and a large potential pool of users.
2. Parallelism and data mining fit well together because the local computation requirements are large, and the global communication requirements are small.
3. Distributed, grid-scale computing and data mining fit well together but moving large datasets is too expensive; so a new datacentric approach is needed.

Credits:

Sabine McConnell
Freeman Huang
Owen Rogers
Ali Roumani
Ricky Wang
Carol Yu

www.cs.queensu.ca/home/skill


?