8/9/2019 datamining-lecture 2
DATA MINING
LECTURE 2
Data Preprocessing
Exploratory Analysis
Post-processing
What is Data Mining?
• Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data.
• "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst" (Hand, Mannila, Smyth)
• "Data mining is the discovery of models for data" (Rajaraman, Ullman)
  • We can have the following types of models
    • Models that explain the data (e.g., a single function)
    • Models that predict the future data instances.
    • Models that summarize the data
    • Models that extract the most prominent features of the data.
Why do we need data mining?
• Really huge amounts of complex data generated from multiple sources and interconnected in different ways
  • Scientific data from different disciplines
    • Weather, astronomy, physics, biological microarrays, genomics
  • Huge text collections
    • The Web, scientific articles, news, tweets, facebook postings.
  • Transaction data
    • Retail store records, credit card records
  • Behavioral data
    • Mobile phone data, query logs, browsing behavior, ad clicks
  • Networked data
    • The Web, Social Networks, IM networks, email network, biological networks.
  • All these types of data can be combined in many ways
    • Facebook has a network, text, images, user behavior, ad transactions.
• We need to analyze this data to extract knowledge
  • Knowledge can be used for commercial or scientific purposes.
• Our solutions should scale to the size of the data
The data analysis pipeline
• Mining is not the only step in the analysis process
• Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning is required to make sense of the data
  • Techniques: Sampling, Dimensionality Reduction, Feature selection.
  • It is dirty work, but it is often the most important step for the analysis.
• Post-Processing: Make the data actionable and useful to the user
  • Statistical analysis of importance
  • Visualization.
• Pre- and Post-processing are often data mining tasks as well
[Figure: pipeline diagram: Data Preprocessing -> Data Mining -> Result Post-processing]
Sampling
• The key principle for effective sampling is the following:
  • Using a sample will work almost as well as using the entire data set, if the sample is representative
  • A sample is representative if it has approximately the same property (of interest) as the original set of data
  • Otherwise we say that the sample introduces some bias
• What happens if we take a sample from the university campus to compute the average height of a person at Ioannina?
Types of Sampling
• Simple Random Sampling
  • There is an equal probability of selecting any particular item
• Sampling without replacement
  • As each item is selected, it is removed from the population
• Sampling with replacement
  • Objects are not removed from the population as they are selected for the sample.
  • In sampling with replacement the same object can be picked up more than once. This makes analytical computation of probabilities easier
  • E.g., we have 100 people, 51 are women, P(W) = 0.51, and 49 are men, P(M) = 0.49. If I pick two persons, what is the probability P(W,W) that both are women?
    • Sampling with replacement: P(W,W) = 0.51^2
    • Sampling without replacement: P(W,W) = 51/100 * 50/99
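The two probabilities in the 100-people example can be checked numerically. The following is a minimal sketch; the function name `prob_all_from_group` is an assumption made for illustration and does not come from the slides.

```python
from fractions import Fraction


def prob_all_from_group(group_size, population_size, picks, replacement):
    """Probability that every one of `picks` draws comes from the group."""
    prob = Fraction(1)
    for i in range(picks):
        if replacement:
            # The population is unchanged between draws.
            prob *= Fraction(group_size, population_size)
        else:
            # Each earlier pick removed one group member from the population.
            prob *= Fraction(group_size - i, population_size - i)
    return prob


# 51 women among 100 people, two picks.
with_repl = prob_all_from_group(51, 100, 2, replacement=True)
without_repl = prob_all_from_group(51, 100, 2, replacement=False)
print(float(with_repl))     # 0.2601
print(float(without_repl))  # slightly smaller, about 0.2576
```

Exact fractions are used so that the two results can be compared without rounding noise.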
Types of Sampling
• Stratified sampling
  • Split the data into several groups; then draw random samples from each group.
  • Ensures that both groups are represented.
• Example 1. I want to understand the differences between legitimate and fraudulent credit card transactions. 0.1% of transactions are fraudulent. What happens if I select 1000 transactions at random?
  • I get 1 fraudulent transaction (in expectation). Not enough to draw any conclusions. Solution: sample 1000 legitimate and 1000 fraudulent transactions
• Example 2. I want to answer the question: Do web pages that are linked have on average more words in common than those that are not? I have 1M pages and 1M links; what happens if I select 10K pairs of pages at random?
  • Most likely I will not get any links. Solution: sample 10K random pairs and 10K links

Probability Reminder: If an event has probability p of happening and I do N trials, the expected number of times the event occurs is pN
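The fraud example can be sketched in Python as follows. Everything here is hypothetical: the data is a scaled-down stand-in (200,000 transaction ids with 0.1% marked fraudulent, 100 sampled per group instead of 1000), and `stratified_sample` and `is_fraud` are illustrative names, not part of the lecture.

```python
import random


def stratified_sample(records, group_of, per_group, seed=0):
    """Draw up to `per_group` records uniformly at random from each group."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(group_of(record), []).append(record)
    return {
        label: rng.sample(members, min(per_group, len(members)))
        for label, members in groups.items()
    }


def is_fraud(transaction_id):
    # Hypothetical labelling rule: every 1000th transaction is fraudulent (0.1%).
    return transaction_id % 1000 == 0


transaction_ids = range(200_000)

# Probability reminder: expected fraudulent transactions among 1000 random picks is p * N.
expected_in_simple_sample = 0.001 * 1000

sample = stratified_sample(
    transaction_ids,
    lambda t: "fraud" if is_fraud(t) else "legit",
    per_group=100,
)
print(expected_in_simple_sample, len(sample["fraud"]), len(sample["legit"]))
```

A simple random sample would contain about one fraudulent transaction, while the stratified sample contains the same number from each group by construction.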
Sample Size
• What sample size is necessary to get at least one object from each of 10 groups?
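One way to explore this question is to compute the probability that a sample covers every group. The sketch below assumes equal-sized groups and sampling with replacement, which the slide does not state; the function names are illustrative.

```python
from math import comb


def prob_all_groups_covered(sample_size, num_groups=10):
    """P(a sample of this size hits every group), by inclusion-exclusion.

    Assumes equal-sized groups and sampling with replacement.
    """
    return sum(
        (-1) ** k * comb(num_groups, k) * (1 - k / num_groups) ** sample_size
        for k in range(num_groups + 1)
    )


def smallest_sample_size(target_prob, num_groups=10):
    """Smallest sample size whose coverage probability reaches target_prob."""
    n = num_groups  # fewer samples than groups can never cover all groups
    while prob_all_groups_covered(n, num_groups) < target_prob:
        n += 1
    return n


for size in (10, 20, 40, 60):
    print(size, prob_all_groups_covered(size))
```

With exactly 10 samples the coverage probability is 10!/10^10, which is tiny, so a sample several times larger than the number of groups is needed before coverage becomes likely.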
A data mining challenge
• You have N integers and you want to sample one integer uniformly at random. How do you do that?
• The integers are coming in a stream: you do not know the size of the stream in advance, and there is not enough memory to store the stream. You can only keep a constant number of integers in memory
• How do you sample?
  • Hint: if the stream ends after reading n integers, the last integer in the stream should have probability 1/n to be selected.
• Reservoir Sampling:
  • Standard interview question for many companies
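Reservoir sampling
• Algorithm: With probability 1/n select the n-th item of the stream and replace the previous choice.
• Claim: Every item has probability 1/N to be selected after N items have been read.
• Proof
  • What is the probability of the n-th item to be selected? It is $\frac{1}{n}$.
  • What is the probability of the n-th item to survive for N-n rounds? It is $\prod_{k=n+1}^{N}\left(1-\frac{1}{k}\right) = \frac{n}{N}$.
  • Multiplying the two gives $\frac{1}{n}\cdot\frac{n}{N} = \frac{1}{N}$.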
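The reservoir sampling rule for a single item can be written in a few lines. This is a minimal sketch of the algorithm as stated on the slide; the function name and the `rng` parameter are assumptions added for illustration.

```python
import random


def reservoir_sample_one(stream, rng=random):
    """Return one item chosen uniformly at random from an iterable of unknown length.

    Only the current choice is kept in memory. Returns None for an empty stream.
    """
    choice = None
    for n, item in enumerate(stream, start=1):
        # Keep the n-th item with probability 1/n, replacing the previous choice.
        if rng.random() < 1.0 / n:
            choice = item
    return choice


# The stream is consumed once, so a generator works as input.
print(reservoir_sample_one(x * x for x in range(1000)))
```

The first item is always selected (probability 1/1), and each later item displaces the current choice with the probability given in the algorithm.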
Mining Task
• Collect all reviews for the top-10 most reviewed restaurants in NY in Yelp
  • (thanks to Hady Law)
• Find a few terms that best describe the restaurants.
• Algorithm?
Example data
• I heard so many good things about this place so I was pretty juiced to try it. I'm from Cali and I heard Shake Shack is comparable to IN-N-OUT and I gotta say, Shake Shake wins hands down. Surprisingly, the line was short and we waited about 10 MIN. to order. I ordered a regular cheeseburger, fries and a black&white shake. So yummerz. I love the location too! It's in the middle of the city and the view is breathtaking. Definitely one of my favorite places to eat in NYC.
• I'm from California and I must say, Shake Shack is better than IN-N-OUT, all day, err'day.
• Would I pay [amount illegible] for a burger here? No. But for the price point they are asking for, this is a definite bang for your buck (though for some, the opportunity cost of waiting in line might outweigh the cost savings) Thankfully, I came in before the lunch swarm descended and I ordered a shake shack (the special burger with the patty + fried cheese & portabella topping) and a coffee milk shake. The beef patty was very juicy and snugly packed within a soft potato roll. On the downside, I could do without the fried portabella-thingy, as the crispy taste conflicted with the juicy, tender burger. How does shake shack compare with in-and-out or 5-guys? I say a very close tie, and I think it comes down to personal affliations. On the shake side, true to its name, the shake was well churned and very thick and luscious. The coffee flavor added a tangy taste and complemented the vanilla shake well. Situated in an open space in NYC, the open air sitting allows you to munch on your burger while watching people zoom by around the city. It's an oddly calming experience, or perhaps it was the food coma I was slowly falling into. Great place with food at a great price.
First cut
• Do simple processing to "normalize" the data (remove punctuation, make into lower case, clear white spaces, other?)
• Break into words, keep the most popular words
[Table: most frequent words in rank order; counts are legible for three words only]
the, and, with, to, a, it, of, is, sauce (4020), in, this, was, for, you, that, but, food, on, my, cart (2236), chicken (2220), with, rice, so, so, have
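The first cut can be sketched as a normalize, split, and count pipeline. The choices below (which characters count as punctuation, keeping apostrophes inside words) are assumptions, since the slide leaves the details open; the sample review is made up.

```python
import re
from collections import Counter


def normalize(text):
    """Lower-case the text, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    # Keep letters, digits and apostrophes so that "it's" stays one token.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def top_words(reviews, k=25):
    """Return the k most frequent words over all reviews as (word, count) pairs."""
    counts = Counter()
    for review in reviews:
        counts.update(normalize(review).split())
    return counts.most_common(k)


# Hypothetical review text, for illustration only.
reviews = ["The burger was great, and the shake was GREAT too!"]
print(top_words(reviews, k=3))
```

As on the slide, the top of such a ranking is dominated by words like "the" and "was", which motivates the next step.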
Second cut
• Remove stop words
  • Stop-word lists can be found online.
a, about, above, after, again, against, all, am, an, and, any, are, aren't, as, at, be, because, been, before, being, below, between, both, but, by, can't, cannot, could, couldn't, did, didn't, do, does, doesn't, doing, don't, down, during, each, few, for, from, further, had, hadn't, has, hasn't, have, haven't, having, he, he'd, he'll, he's, her, here, here's, hers, herself, him, himself, his, how, how's, i, i'd, i'll, i'm, i've, if, in, into, is, isn't, it, it's, its, itself, let's, me, more, most, mustn't, my, myself, no, nor, not, of, off, on, once, only, or, other, ought, our, ours, ourselves, out, over, own, same, shan't, she, she'd, she'll, she's, should, shouldn't, so, some, such, than, that, that's, the, their, theirs, them, themselves, then, there, there's, these, they, they'd, they'll, they're, they've, this, those, through, to, too, under, until, up, very, was, wasn't, we, we'd, we'll, we're, we've, were, weren't, what, what's, when, when's, where, where's, which, while, who, who's, whom, why, why's, with, won't, would, wouldn't, you, you'd, you'll, you're, you've, your, yours, yourself, yourselves
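Removing stop words amounts to filtering tokens against a set before counting. In this sketch `STOP_WORDS` is only a small subset of the list above, and the token sequence is made up for illustration.

```python
from collections import Counter

# A small subset of the stop-word list above, for illustration only.
STOP_WORDS = {"a", "and", "i", "in", "is", "it", "of", "the", "this", "to", "was", "with"}


def count_content_words(tokens, stop_words=STOP_WORDS):
    """Count tokens, skipping any that appear in the stop-word set."""
    return Counter(token for token in tokens if token not in stop_words)


tokens = "the ramen was great and the broth is rich".split()
content_counts = count_content_words(tokens)
print(content_counts.most_common())
```

A set is used for the stop words so that each membership test takes constant time on average.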
Second cut
• Remove stop words
  • Stop-word lists can be found online.
[Table: most frequent words after stop-word removal, in rank order; counts are not recoverable]
First list: ramen, noodles, ippudo, buns, broth, like, just, get, time, one, really, go, food, bowl, can, great, best
Second list: burger, shack, shake, line, fries, good, burgers, wait, place, one
[Table fragment: terms with scores; values are not recoverable]
patty, ss, patties, 6th, 4am, yellow, deli's, carver, brown's
Third cut
• TF-IDF takes care of stop words as well
• We do not need to remove the stopwords since they will get IDF(w) = 0
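The claim that stop words get IDF(w) = 0 can be seen in a small sketch. The definitions used here are an assumption, since the slides that define them are not part of this text: TF(w, d) is the count of w in document d, and IDF(w) = log(D / DF(w)), where D is the number of documents and DF(w) the number of documents containing w. The mini-corpus is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document.

    Assumed definitions: TF(w, d) = count of w in d,
    IDF(w) = log(D / DF(w)) with D documents, DF(w) of them containing w.
    """
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    num_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores.append({
            word: count * math.log(num_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return scores


# Hypothetical mini-corpus: one "document" per restaurant.
docs = [
    "the ramen and the broth",
    "the burger and the shake",
    "the chicken and the rice",
]
scores = tf_idf(docs)
print(scores[0])
```

Words that occur in every document, such as "the" and "and" here, have DF(w) = D, so their IDF is log(1) = 0 and they drop out of the ranking without an explicit stop-word list.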
Frequency and Mode
• The frequency of an attribute value is the percentage of time the value occurs in the data set
  • For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
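Frequency and mode for a categorical attribute can be computed with a counter. This is a minimal sketch on made-up data; the function names are illustrative.

```python
from collections import Counter


def frequencies(values):
    """Map each distinct value to the fraction of records in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def mode(values):
    """Most frequent value (ties are broken by first occurrence)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical attribute values: 4 x "female", 6 x "male".
genders = ["female", "male", "female", "male", "female",
           "male", "male", "female", "male", "male"]
print(frequencies(genders))
print(mode(genders))
```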
Percentiles
• For continuous data, the notion of a percentile is more useful.
  Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is a value $x_p$ of x such that p% of the observed values of x are less than $x_p$.
• For instance, the 50th percentile is the value $x_{50\%}$ such that 50% of all values of x are less than $x_{50\%}$.
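A percentile can be computed from sorted data. The sketch below uses the nearest-rank convention, which is an assumption: several conventions exist (they differ in how they treat ties and interpolation), and this one counts values less than or equal to the result rather than strictly less.

```python
import math


def percentile(values, p):
    """Nearest-rank p-th percentile for 0 < p <= 100.

    Returns the smallest observed value such that at least p% of the
    observations are less than or equal to it.
    """
    if not values or not 0 < p <= 100:
        raise ValueError("need a non-empty list and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


data = [7, 1, 4, 10, 2, 9, 3, 8, 5, 6]  # hypothetical observations
print(percentile(data, 50))  # 5
print(percentile(data, 90))  # 9
```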
Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.
Example
Mean: 1090K
Trimmed mean (remove min, max): 105K
Median: (90+100)/2 = 95K
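The effect of an outlier on the three measures can be reproduced with a short sketch. The underlying data of the example is not available in this text, so the values below are hypothetical, chosen only so that the three summaries come out as 1090K, 105K and 95K.

```python
from statistics import mean, median


def trimmed_mean(values):
    """Mean after dropping one minimum and one maximum value."""
    ordered = sorted(values)
    return mean(ordered[1:-1])


# Hypothetical values in thousands; the single large value is the outlier.
values_k = [60, 70, 80, 90, 90, 100, 110, 140, 160, 10000]

print(mean(values_k))
print(trimmed_mean(values_k))
print(median(values_k))
```

One extreme value pulls the mean far above every other observation, while the trimmed mean and the median stay close to the bulk of the data.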
Measures of Spread: Range and Variance
• Range is the difference between the max and min
• The variance or standard deviation is the most common measure of the spread of a set of points.
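Range, variance and standard deviation can be computed with the standard library. The slide's formula is not preserved in this text, so the sketch assumes the population form (dividing by the number of points); `statistics.variance` and `statistics.stdev` give the sample form (dividing by one less). The data is made up.

```python
from statistics import pstdev, pvariance


def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

print(value_range(values))
print(pvariance(values))  # population variance (divide by the number of points)
print(pstdev(values))     # population standard deviation
```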
Normal Distribution
• $\phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• An important distribution that characterizes many quantities and has a central role in probabilities and statistics.
  • Appears also in the central limit theorem
• Fully characterized by the mean $\mu$ and standard deviation $\sigma$
This is a value histogram
Not everything is normally distributed
• Plot of number of words with x number of occurrences
• If this was a normal distribution we would not have a frequency as large as 28K
[Figure: histogram; x-axis from 0 to 35000, y-axis from 0 to 8000]
Power-law distribution
• We can understand the distribution of words if we take the log-log plot
• Linear relationship in the log-log space
[Figure: the same data on log-log axes; x-axis from 1 to 100000, y-axis from 1 to 10000]
Zipf's law
• Power laws can be detected by a linear relationship in the log-log space for the rank-frequency plot
• Frequency of the r-th most frequent word
[Figure: rank-frequency plot on log-log axes; both axes from 1 to 100000]
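Detecting a linear relationship in log-log space can be sketched as a least-squares line fit on (log rank, log frequency). The synthetic counts below follow f(r) = 10000 / r by construction and are an assumption for illustration; they are not the lecture's word counts, and a straight-line fit is only a rough diagnostic on real data.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies must be positive; they are ranked from largest to smallest.
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts that follow f(r) = 10000 / r exactly.
synthetic = [10000 / r for r in range(1, 1001)]
print(round(loglog_slope(synthetic), 3))  # -1.0
```

A slope close to a constant negative value over several orders of magnitude is the straight line the slide refers to.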
Power-laws are everywhere
• Incoming and outgoing links of web pages, number of friends in social networks, number of occurrences of words, file sizes, city sizes, income distribution, popularity of products and movies
  • Signature of human activity?
  • A mechanism that explains everything?
  • Rich get richer process
The Long Tail
Source: Chris Anderson (2004)
http://www.wired.com/wired/archive/12.10/tail.html
Scatter Plot Array of Iris Attributes
Contour Plot Example: SST Dec 1998
[Figure: contour plot; scale in Celsius]
Meaningfulness of Answers
• A big data-mining risk is that you will "discover" patterns that are meaningless.
• Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
• The Rhine Paradox: a great example of how not to conduct scientific research.
CS345A Data Mining on the Web: Anand Rajaraman, Jeff Ullman
Rhine Paradox (1)
• Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
• He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, red or blue.
• He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
Rhine Paradox (2)
• He told these people they had ESP and called them in for another test of the same type.
• Alas, he discovered that almost all of them had lost their ESP.
  • What did he conclude?
• Answer on next slide.
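The numbers in the experiment can be compared with what pure chance predicts. This is a minimal sketch assuming each guess is independent and each color is equally likely; the function names and the subject count of 1000 are illustrative.

```python
def prob_all_correct(num_cards=10, num_colors=2):
    """Probability of guessing every card right by pure chance."""
    return (1 / num_colors) ** num_cards


def expected_lucky_subjects(num_subjects, num_cards=10):
    """Expected number of perfect scores, assuming independent random guessing."""
    return num_subjects * prob_all_correct(num_cards)


p = prob_all_correct()
print(p)                              # 0.0009765625, i.e. 1/1024
print(expected_lucky_subjects(1000))  # 0.9765625
```

Under these assumptions about 1 subject in 1000 gets all 10 cards right by luck alone, and the chance that such a subject repeats the feat in a second test is again 1/1024.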