Data Clustering Project (Do an Phan Cum Du Lieu)


Digitized by the Learning Resource Center, Thai Nguyen University http://www.lrc-tnu.edu.vn

THAI NGUYEN UNIVERSITY
FACULTY OF INFORMATION TECHNOLOGY

Nguyen Trung Son

CLUSTERING METHODS AND APPLICATIONS

Major: COMPUTER SCIENCE
Code: 60.48.01

MASTER'S THESIS IN COMPUTER SCIENCE

SCIENTIFIC SUPERVISOR: 1. Assoc. Prof. Dr. Vu Duc Thi

Thai Nguyen 2009

TABLE OF CONTENTS

ACKNOWLEDGMENTS
PREFACE
CHAPTER I: THEORETICAL OVERVIEW OF DATA CLUSTERING
1. Data clustering
1.1 Definition of data clustering
1.2 Some examples of data clustering
2. Some data types
2.1 Categorical data
2.2 Binary data
2.3 Transaction data
2.4 Symbolic data
2.5 Time series
3. Data transformation and normalization
3.1 Data normalization
3.2 Data transformation
3.2.1 Principal component analysis
3.2.2 SVD
3.2.3 The Karhunen-Loève transform
CHAPTER II: DATA CLUSTERING ALGORITHMS
1. Hierarchical clustering algorithms
1.1 The BIRCH algorithm
1.2 The CURE algorithm
1.3 The AGNES algorithm
1.4 The DIANA algorithm
1.5 The ROCK algorithm
1.6 The Chameleon algorithm
2. Fuzzy clustering algorithms
2.1 The FCM algorithm
2.2 The FCM algorithm
3. Center-based clustering algorithms
3.1 The K-means algorithm
3.2 The PAM algorithm
3.3 The CLARA algorithm
3.4 The CLARANS algorithm
4. Search-based clustering algorithms
4.1 Genetic algorithms (GAs)
4.2 J-means
5. Grid-based clustering algorithms
5.1 STING
5.2 The CLIQUE algorithm
5.3 The WaveCluster algorithm
6. Density-based clustering algorithms
6.1 The DBSCAN algorithm
6.2 The OPTICS algorithm
6.3 The DENCLUE algorithm
7. Model-based clustering algorithms
7.1 The EM algorithm
7.2 The COBWEB algorithm
CHAPTER III: APPLICATIONS OF DATA CLUSTERING
1. Image segmentation
1.1 Definition of image segmentation
1.2 Image segmentation based on data clustering
2. Object and character recognition
2.1 Object recognition
2.2 Character recognition
3. Information retrieval
3.1 Pattern representation
3.2 Similarity measures
3.3 An algorithm for clustering book data
4. Data mining
4.1 Data mining approaches
4.2 Mining large structured data
4.3 Data mining in geological databases
4.4 Summary
CONCLUSION AND FUTURE DIRECTIONS
APPENDIX
REFERENCES

ACKNOWLEDGMENTS

I sincerely thank Assoc. Prof. Dr. Vu Duc Thi for his dedicated scientific guidance, which helped me complete this thesis. I also thank the teachers who taught me and passed on their knowledge throughout my studies and research.

STUDENT
NGUYEN TRUNG SON

PREFACE

In recent years, the strong development of information technology has rapidly increased the capacity of information systems to collect and store information. In addition, the fast computerization of production, business, and many other activities has created a huge volume of stored data. Millions of databases are used in production, business, and management, many of them very large, on the order of gigabytes or even terabytes.

This explosion has created an urgent need for new techniques and tools that automatically turn that huge volume of data into useful knowledge. As a result, data mining has become a topical field of information technology worldwide, and in Vietnam in particular. Data mining is being applied widely in many areas of business and life: marketing, finance, banking and insurance, science, health care, security, the Internet, and others. Many large organizations and companies around the world have applied data mining to their production and business activities and have obtained great benefits.

Data mining techniques are usually divided into two main groups:
- Descriptive data mining: describes the general properties or characteristics of the data in an existing database.
- Predictive data mining: makes predictions based on inferences from current data.

This thesis presents some issues of data clustering, one of the basic techniques of data mining. It is a promising research direction that outlines how to understand and exploit huge databases, discover useful information hidden in the data, and understand the practical meaning of the data.

The thesis consists of three chapters and an appendix:

Chapter 1: Presents a theoretical overview of data clustering, data types, and data transformation and normalization.
Chapter 2: Introduces, analyzes, and evaluates the algorithms used for data clustering.

Chapter 3: Presents some typical applications of data clustering.

Conclusion: Summarizes the issues studied in the thesis and related issues, and proposes directions for further research.

CHAPTER I: THEORETICAL OVERVIEW OF DATA CLUSTERING

1. Data clustering

1.1 Definition of data clustering

Data clustering (or simply clustering), also called cluster analysis, segmentation analysis, or taxonomy analysis, is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. In many applications a cluster of data objects can be treated as a single group.

1.2 Some examples of data clustering

1.2.1 Clustering for gene expression data

Clustering is one of the most frequently used analyses of gene expression data (Yeung et al., 2003; Eisen et al., 1998). Gene expression data are a set of measurements taken from a DNA microarray (also called a DNA chip or gene chip), a glass or plastic slide onto which DNA segments are attached in microscopic arrays. Researchers use such chips to screen biological samples for the presence of many sequences at once. The DNA segments attached to the chip are called probes. Each spot on the chip holds thousands of probe molecules with the same sequence. A gene expression data set can be represented as a real-valued matrix:

$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix},$$

where
- $n$ is the number of genes,
- $d$ is the number of samples or experimental conditions,
- $x_{ij}$ is the measured expression level of gene $i$ in sample $j$.

Because the original matrix contains noise, missing values, and systematic variations, preprocessing is required before clustering is performed.

[Figure 1. Data mining tasks. Direct data mining: classification, estimation, prediction. Indirect data mining: clustering, association rules, description and visualization.]

Gene expression data can be clustered in two ways. The first is to group genes with similar expression patterns, that is, to cluster the rows of the matrix $D$. The second is to group different samples on the basis of their corresponding expression profiles, that is, to cluster the columns of $D$.

1.2.2 Clustering in health psychology

Clustering is applied in many areas of health psychology, including the promotion and maintenance of health, improvement of the health care system, and the prevention of illness and disability (Clatworthy et al., 2005). In health care system development, clustering is used to identify groups of people who may benefit from specific services (Hodges and Wotring, 2000). In health promotion, cluster analysis is used to select the target groups most likely to benefit from specific health promotion campaigns and to facilitate the development of promotional material. Clustering is also used to identify groups of people at risk of developing medical conditions and those at risk of poor outcomes.

1.2.3 Clustering in market research

In market research, clustering is used to segment the market and determine target markets (Christopher, 1969; Saunders, 1980; Frank and Green, 1968). In market segmentation, clustering is typically used to divide the market into meaningful segments, for example men aged 21 to 30 and men over 51, where men over 51 tend not to buy new products.

1.2.4 Clustering in image segmentation

Image segmentation is the decomposition of a gray-level or color image into homogeneous tiles (Comaniciu and Meer, 2002). In image segmentation, clustering is usually used to detect the borders of objects in an image.

Clustering is an essential tool of data mining, which is the process of exploring and analyzing large quantities of data in order to extract useful information (Berry and Linoff, 2000). Clustering is also a fundamental problem in pattern recognition. Figure 1 gives a schematic list of the various data mining tasks and shows the role of clustering in data mining.
In general, useful information can be discovered from a large amount of data through automatic or semiautomatic means (Berry and Linoff, 2000). In indirect data mining, no variable is singled out as a target, and the goal is to discover some relationships among all the variables, whereas in direct data mining some variables are singled out as targets. Data clustering is indirect data mining, because in clustering we are not exactly sure what clusters we are looking for, what plays a role in forming these clusters, and how it does so.

The clustering problem has received wide attention, although there is no uniform definition of data clustering and there may never be one (Estivill-Castro, 2002; Dubes, 1987; Fraley and Raftery, 1998). Roughly speaking, data clustering means that, given a data set and a similarity measure, we regroup the data so that data points in the same group are similar and data points in different groups are dissimilar. This problem clearly arises in many applications, such as text mining, gene expression analysis, customer segmentation, and image processing.

2. Some data types

Clustering algorithms are closely tied to data types. An understanding of scale, normalization, and proximity is therefore very important for interpreting the results of clustering algorithms. Data type refers to the degree of quantization in the data (Jain and Dubes, 1988; Anderberg, 1973): a single attribute can be typed as binary, discrete, or continuous. A binary attribute has exactly two values, such as true or false. A discrete attribute has a finite number of possible values, so binary types are a special case of discrete types (see Figure 2).

Data scales, which indicate the relative significance of numbers, are also an important issue in clustering. Data scales can be divided into qualitative scales and quantitative scales. Qualitative scales include nominal scales and ordinal scales; quantitative scales include interval scales and ratio scales (Figure 3). The data types are examined in this section.

2.1 Categorical data

Categorical attributes are also called nominal attributes. They are simply used as names, such as car brands or the names of bank branches. Since we consider data sets with a finite number of data points, a nominal attribute of the data points in a data set can take only a finite number of values; nominal types are therefore also a special case of discrete types.

Figure 2. Diagram of data types.
Figure 3.
Diagram of data scales.

[Figure 2 content: data types are discrete or continuous; discrete types include nominal and binary; binary types are symmetric or asymmetric. Figure 3 content: data scales are qualitative (nominal, ordinal) or quantitative (interval, ratio).]

In this section we introduce symbol tables and frequency tables, together with some notation for categorical data sets.

Table 1. A sample categorical data set

Record   Values
x1       (A, A, A, A, B, B)
x2       (A, A, A, A, C, D)
x3       (A, A, A, A, D, C)
x4       (B, B, C, C, D, C)
x5       (B, B, D, D, C, D)

Let $D = \{x_1, x_2, \ldots, x_n\}$ be a categorical data set with $n$ instances, each described by $d$ categorical attributes $v_1, v_2, \ldots, v_d$. Let $DOM(v_j)$ denote the domain of attribute $v_j$. In the categorical data set of Table 1, for example, the domains of $v_1$ and $v_4$ are $DOM(v_1) = \{A, B\}$ and $DOM(v_4) = \{A, C, D\}$, respectively.

For a given categorical data set $D$, suppose that $DOM(v_j) = \{A_{j1}, A_{j2}, \ldots, A_{jn_j}\}$ for $j = 1, 2, \ldots, d$. We call $A_{jl}$ $(1 \le l \le n_j)$ a state of the categorical attribute $v_j$ in the data set $D$. A symbol table $T_s$ of the data set is defined as

$$T_s = (s_1, s_2, \ldots, s_d), \tag{2.1}$$

where $s_j$ $(1 \le j \le d)$ is the vector $s_j = (A_{j1}, A_{j2}, \ldots, A_{jn_j})^T$.

Because a variable can have several states (or values) that can be listed in different orders, the symbol table of a data set is usually not unique. For example, Table 2 and Table 3 are both symbol tables of the data set in Table 1.

The frequency table is computed from a symbol table and has exactly the same dimensions as the symbol table. Let $C$ be a cluster. The frequency table $T_f(C)$ of the cluster $C$ is defined as

$$T_f(C) = (f_1(C), f_2(C), \ldots, f_d(C)), \tag{2.2}$$

where $f_j(C)$ is the vector

$$f_j(C) = (f_{j1}(C), f_{j2}(C), \ldots, f_{jn_j}(C))^T, \tag{2.3}$$

and $f_{jr}(C)$ $(1 \le j \le d,\ 1 \le r \le n_j)$ is the number of data points in cluster $C$ that take the value $A_{jr}$ in the $j$th dimension, i.e.,

$$f_{jr}(C) = \lvert \{x \in C : x_j = A_{jr}\} \rvert, \tag{2.4}$$

where $x_j$ is the $j$th component value of $x$.

Table 2. One of the symbol tables of the data set in Table 1

$$\begin{pmatrix} A & A & A & A & B & B \\ B & B & C & C & C & C \\ & & D & D & D & D \end{pmatrix}$$

Table 3. Another symbol table of the data set in Table 1

$$\begin{pmatrix} A & B & D & A & B & C \\ B & A & C & C & C & B \\ & & A & D & D & D \end{pmatrix}$$

For a given symbol table of the data set, the frequency table of each cluster is unique up to that symbol table. For example, for the data set in Table 1, let $C$ be the cluster $C = \{x_1, x_2, x_3\}$. If we use the symbol table in Table 2, the corresponding frequency table of $C$ is given in Table 4. If we use the symbol table in Table 3, the frequency table of $C$ is given in Table 5.

For a given categorical data set $D$, $T_f(D)$ is the frequency table computed on the whole data set. Suppose $D$ is partitioned into $k$ nonoverlapping clusters $C_1, C_2, \ldots, C_k$. Then

$$f_{jr}(D) = \sum_{i=1}^{k} f_{jr}(C_i) \tag{2.5}$$

for all $r = 1, 2, \ldots, n_j$ and $j = 1, 2, \ldots, d$.

2.2 Binary data

A binary attribute is an attribute that has exactly two possible values, such as "true" or "false". Binary variables can be divided into two types: symmetric binary variables and asymmetric binary variables. In a symmetric binary variable, the two values are equally important; an example is "male-female". A symmetric binary variable is a nominal variable. In an asymmetric variable, one of the values carries more importance than the other. For example, "yes" stands for the presence of a certain attribute and "no" stands for its absence.

A binary vector $x$ of dimension $d$ is defined as $(x_1, x_2, \ldots, x_d)$ (Zhang and Srihari, 2003), where $x_i \in \{0, 1\}$ $(1 \le i \le d)$ is the $i$th component value of $x$. The unit binary vector $I$ of dimension $d$ is the binary vector whose entries are all equal to 1. The complement of a binary vector $x$ is defined as $\bar{x} = I - x$, where $I$ is the unit binary vector of the same dimension as $x$.

Consider two binary vectors $x$ and $y$ in $d$-dimensional space, and let $S_{ij}(x, y)$ $(i, j \in \{0, 1\})$ denote the number of positions at which $x$ has value $i$ and $y$ has value $j$, i.e.,

$$S_{ij}(x, y) = \lvert \{k : x_k = i \text{ and } y_k = j,\ k = 1, 2, \ldots, d\} \rvert. \tag{2.6}$$

Then we clearly have the following equalities:

$$S_{11}(x, y) = x \cdot y = \sum_{i=1}^{d} x_i y_i, \tag{2.7a}$$

$$S_{00}(x, y) = \bar{x} \cdot \bar{y} = \sum_{i=1}^{d} (1 - x_i)(1 - y_i), \tag{2.7b}$$

$$S_{01}(x, y) = \bar{x} \cdot y = \sum_{i=1}^{d} (1 - x_i)\, y_i, \tag{2.7c}$$

$$S_{10}(x, y) = x \cdot \bar{y} = \sum_{i=1}^{d} x_i (1 - y_i). \tag{2.7d}$$

We also have

$$d = S_{00}(x, y) + S_{01}(x, y) + S_{10}(x, y) + S_{11}(x, y). \tag{2.8}$$

Table 4. The frequency table computed from the symbol table in Table 2

$$\begin{pmatrix} 3 & 3 & 3 & 3 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ & & 0 & 0 & 1 & 1 \end{pmatrix}$$

Table 5. The frequency table computed from the symbol table in Table 3

$$\begin{pmatrix} 3 & 0 & 0 & 3 & 1 & 1 \\ 0 & 3 & 0 & 0 & 1 & 1 \\ & & 3 & 0 & 1 & 1 \end{pmatrix}$$

2.3 Transaction data

Given a set of items $I = \{I_1, I_2, \ldots, I_m\}$, a transaction is a subset of $I$ (Yang et al., 2002b; Wang et al., 1999a; Xiao and Dunham, 2001). A transaction data set is a set of transactions, i.e., $D = \{t_i : t_i \subseteq I,\ i = 1, 2, \ldots, n\}$. A transaction can be represented by a binary vector in which each entry denotes the presence or absence of the corresponding item. For example, we can represent a transaction $t_i$ by the binary vector $(b_{i1}, b_{i2}, \ldots, b_{im})$, where $b_{ij} = 1$ if $I_j \in t_i$ and $b_{ij} = 0$ if $I_j \notin t_i$. From this point of view, transaction data are a special case of binary data.

The most common example of transaction data is market basket data. In a market basket data set, a transaction contains a subset of the total set of items that could be purchased. For example, the following are two transactions: {apple, cake} and {apple, dish, egg, fish}. In general, many transactions are made of sparsely distributed items. For example, a customer may buy only a few items from a store that stocks thousands of items. As pointed out by Wang et al. (1999a), for transactions made of sparsely distributed items, pairwise similarity is neither necessary nor sufficient for judging whether a cluster of transactions is similar.

2.4 Symbolic data

Categorical data and binary data are classical data types, and symbolic data are an extension of the classical data types. In conventional data sets, the objects are treated as individuals (first-order objects) (Malerba et al., 2001), whereas in symbolic data sets the objects are more "unified" by means of relationships. As such, symbolic data are more or less homogeneous, or groups of individuals (second-order objects) (Malerba et al., 2001).

Malerba et al. (2001) defined a symbolic data set as a class or group of individuals described by a number of set-valued or modal variables. A variable is called set valued if it takes its values in the power set of its domain. A modal variable is a set-valued variable with a measure or a distribution (frequency, probability, or weight) associated with each object.

Gowda and Diday (1992) summarized the differences between symbolic data and conventional data as follows:
- The objects in a symbolic data set may not all be defined on the same variables.
- Each variable may take more than one value or even an interval of values.
- The variables in a complex symbolic data set may take values that include one or more elementary objects.
- The description of a symbolic object may depend on the relations existing between other objects.
- The values the variables take may indicate frequency of occurrence, relative likelihood, level of importance of the values, and so on.

Symbolic data may be aggregated from other conventional data for reasons such as privacy. In census data, for example, the data are made available in aggregated form to guarantee that data analysts cannot identify an individual or a single business establishment.

2.5 Time series

Time series are the simplest form of temporal data. Precisely, a time series is a sequence of real numbers that represent the measurements of a real variable at equal time intervals (Gunopulos and Das, 2000). For example, stock price movements, the temperature at a given place, and the volume of sales over time are all time series.

A time series is discrete if the variable is defined on a finite set of time points. Most time series encountered in cluster analysis are discrete time series. When a variable is defined at all points in time, the time series is continuous. In general, a time series can be regarded as a mixture of the following four components (Kendall and Ord, 1990):
1. a trend, i.e., the long-term movement;
2. fluctuations about the trend of greater or less regularity;
3. a seasonal component;
4. a residual or random effect.

3. Data transformation and normalization

In many applications of data clustering, the raw data, or actual measurements, are not used directly unless a probabilistic model for pattern generation is available (Jain and Dubes, 1988). Preparing data for clustering requires some kind of conversion, such as data transformation and normalization. Some data transformation methods commonly used in clustering are discussed in this section. Some data normalization methods are presented in Section 3.1.

For convenience, let $D^* = \{x_1^*, x_2^*, \ldots, x_n^*\}$ denote the $d$-dimensional raw data set. The data matrix is then an $n \times d$ matrix given by

$$(x_1^*, x_2^*, \ldots, x_n^*)^T = \begin{pmatrix} x_{11}^* & x_{12}^* & \cdots & x_{1d}^* \\ x_{21}^* & x_{22}^* & \cdots & x_{2d}^* \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1}^* & x_{n2}^* & \cdots & x_{nd}^* \end{pmatrix}. \tag{4.1}$$

3.1 Data normalization

Normalization makes data dimensionless. It is useful for defining standardized indices. After normalization, all knowledge of the location and scale of the original data may be lost. It is necessary to normalize variables when the dissimilarity measure, such as the Euclidean distance, is sensitive to differences in the magnitudes or scales of the input variables (Milligan and Cooper, 1988). Approaches to normalizing variables are essentially of two types: global normalization and within-cluster normalization.

Global normalization normalizes the variables across all elements in the data set. Within-cluster normalization refers to normalization that occurs within clusters on each variable. Some forms of normalization can be used both globally and within clusters, but some forms can be used only in global normalization.

It is impossible to normalize variables within clusters directly in cluster analysis, because the clusters are not known before normalization. To overcome this difficulty, other approaches must be taken. Overall and Klett (1972) proposed an iterative approach that first obtains clusters based on overall estimates and then uses these clusters to help determine the within-group variances for normalization in a second cluster analysis.

To normalize the raw data given in equation (4.1), we can subtract a location measure and divide by a scale measure for each variable. That is,

$$x_{ij} = \frac{x_{ij}^* - L_j}{M_j}, \tag{4.2}$$

where $x_{ij}$ denotes the normalized value, $L_j$ is the location measure, and $M_j$ is the scale measure. We obtain different normalization methods by choosing different $L_j$ and $M_j$ in equation (4.2). Some well-known normalization methods are the mean, median, standard deviation, range, Huber's estimate, Tukey's biweight estimate, and Andrew's wave estimate.

Table 4.1 gives some forms of normalization, where $\bar{x}_j^*$, $R_j^*$, and $\sigma_j^*$ are the mean, range, and standard deviation of the $j$th variable, respectively, i.e.,

$$\bar{x}_j^* = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^*, \tag{4.3a}$$

$$R_j^* = \max_{1 \le i \le n} x_{ij}^* - \min_{1 \le i \le n} x_{ij}^*, \tag{4.3b}$$

$$\sigma_j^* = \left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij}^* - \bar{x}_j^*)^2 \right]^{1/2}. \tag{4.3c}$$

We now discuss in detail several common forms of normalization and their properties. The z-score is a form of normalization used to transform normal variates to standard score form. Given a set of raw data $D^*$, the z-score normalization formula is defined as

$$x_{ij} = Z_1(x_{ij}^*) = \frac{x_{ij}^* - \bar{x}_j^*}{\sigma_j^*}, \tag{4.4}$$

where $\bar{x}_j^*$ and $\sigma_j^*$ are the sample mean and standard deviation of the $j$th attribute, respectively.

The transformed variable has a mean of 0 and a variance of 1. The location and scale information of the original variable is lost. This transformation is also presented in (Jain and Dubes, 1988, p. 24). An important restriction of the normalization $Z_1$ is that it must be applied in global normalization and not in within-cluster normalization (Milligan and Cooper, 1988). Indeed, consider the case where two well-separated clusters exist in the data. If a sample is located at each of the two cluster centers, then within-cluster normalization would map the samples located at the cluster centers to zero vectors. Any clustering algorithm will group the two zero vectors together, which means that the two original samples will be assigned to one cluster. This gives a very misleading clustering result.

Table 4.1. Some data normalizations, where $\bar{x}_j^*$, $R_j^*$, and $\sigma_j^*$ are defined in equation (4.3)

Name      | $L_j$ | $M_j$
z-score   | $\bar{x}_j^*$ | $\sigma_j^*$
USTD      | 0 | $\sigma_j^*$
Maximum   | 0 | $\max_{1 \le i \le n} x_{ij}^*$
Mean      | $\bar{x}_j^*$ | 1
Median    | $x_{\frac{n+1}{2} j}^*$ if $n$ is odd; $\frac{1}{2}\left(x_{\frac{n}{2} j}^* + x_{\frac{n+2}{2} j}^*\right)$ if $n$ is even | 1
Sum       | 0 | $\sum_{i=1}^{n} x_{ij}^*$
Range     | $\min_{1 \le i \le n} x_{ij}^*$ | $R_j^*$

The USTD (the weighted uncorrected standard deviation) normalization is similar to the z-score normalization and is defined as

$$x_{ij} = Z_2(x_{ij}^*) = \frac{x_{ij}^*}{\sigma_j^*}, \tag{4.5}$$

where $\sigma_j^*$ is defined in equation (4.3c).

A variable transformed by $Z_2$ has a variance of 1. Since the scores have not been centered by subtracting the mean, the location information between the scores remains. Thus, the normalization $Z_2$ does not suffer from the loss of information about the cluster centroids.

The third normalization method presented in Milligan and Cooper (1988) uses the maximum score on the variable:

$$x_{ij} = Z_3(x_{ij}^*) = \frac{x_{ij}^*}{\max_{1 \le i \le n} x_{ij}^*}. \tag{4.6}$$

A variable $X$ transformed by $Z_3$ has mean $\frac{\bar{X}}{\max(X)}$ and standard deviation $\frac{\sigma_X}{\max(X)}$, where $\bar{X}$ and $\sigma_X$ are the mean and standard deviation of the original variable. $Z_3$ is sensitive to the presence of outliers (Milligan and Cooper, 1988). If a single large observation on a variable is present, $Z_3$ will normalize the remaining values to near 0. $Z_3$ seems to be meaningful only when the variable is measured on a ratio scale (Milligan and Cooper, 1988).

Two normalizations that use the range of the variable are presented in (Milligan and Cooper, 1988):

$$x_{ij} = Z_4(x_{ij}^*) = \frac{x_{ij}^*}{R_j^*}, \tag{4.7a}$$

$$x_{ij} = Z_5(x_{ij}^*) = \frac{x_{ij}^* - \min_{1 \le i \le n} x_{ij}^*}{R_j^*}, \tag{4.7b}$$

where $R_j^*$ is the range of the $j$th attribute defined in equation (4.3b).

A variable $X$ transformed by $Z_4$ and $Z_5$ has mean $\frac{\bar{X}}{\max(X) - \min(X)}$ and $\frac{\bar{X} - \min(X)}{\max(X) - \min(X)}$, respectively, and the same standard deviation $\frac{\sigma_X}{\max(X) - \min(X)}$. Both $Z_4$ and $Z_5$ are sensitive to the presence of outliers.

A normalization based on the sum of the observations, presented in (Milligan and Cooper, 1988), is defined as

$$x_{ij} = Z_6(x_{ij}^*) = \frac{x_{ij}^*}{\sum_{i=1}^{n} x_{ij}^*}. \tag{4.8}$$

The transformation $Z_6$ normalizes the sum of the transformed values to unity, and the transformed mean is $\frac{1}{n}$. Thus, the mean is constant across all variables.

A very different approach to normalization, which converts the scores to ranks, is presented in (Milligan and Cooper, 1988) and is defined as

$$x_{ij} = Z_7(x_{ij}^*) = \mathrm{Rank}(x_{ij}^*), \tag{4.9}$$

where $\mathrm{Rank}(X)$ is the rank assigned to $X$.

A variable transformed by $Z_7$ has a mean of $\frac{n+1}{2}$ and a variance of $(n+1)\left(\frac{2n+1}{6} - \frac{n+1}{4}\right)$. The rank transformation reduces the impact of outliers in the data.

Conover and Iman (1981) proposed four types of rank transformation. The first rank transformation presented ranks the scores from smallest to largest, with the smallest score having rank one, the second smallest score having rank two, and so on. Average ranks are assigned in the case of ties.

3.2 Data transformation

Data transformation is related to data normalization, but it is more complicated. Data normalization concentrates on the variables, whereas data transformation concentrates on the whole data set. Data normalization can therefore be viewed as a special case of data transformation. In this section, we present some data transformation techniques that can be used in data clustering.

3.2.1 Principal component analysis

The main purpose of principal component analysis (PCA) (Ding and He, 2004; Jolliffe, 2002) is to reduce the dimensionality of a high-dimensional data set consisting of a large number of interrelated variables while retaining as much as possible of the variation present in the data set. The principal components (PCs) are new variables that are uncorrelated and ordered such that the first few retain most of the variation present in all of the original variables.

The PCs are defined as follows. Let $v = (v_1, v_2, \ldots, v_d)'$ be a vector of $d$ random variables, where $'$ is the transpose operation. The first step is to find a linear function $a_1'v$ of the elements of $v$ that has maximum variance, where $a_1$ is a $d$-dimensional vector $(a_{11}, a_{12}, \ldots, a_{1d})'$, so

$$a_1'v = \sum_{i=1}^{d} a_{1i} v_i.$$

After finding $a_1'v, a_2'v, \ldots, a_{j-1}'v$, we look for a linear function $a_j'v$ that is uncorrelated with $a_1'v, a_2'v, \ldots, a_{j-1}'v$ and has maximum variance. We thus find $d$ such linear functions after $d$ steps. The $j$th derived variable $a_j'v$ is the $j$th PC. In general, most of the variation in $v$ is accounted for by the first few PCs.

To find the form of the PCs, we need to know the covariance matrix $\Sigma$ of $v$. In most practical cases, the covariance matrix $\Sigma$ is unknown, and it is replaced by a sample covariance matrix. For $j = 1, 2, \ldots, d$, it can be shown that the $j$th PC is given by $z_j = a_j'v$, where $a_j$ is an eigenvector of $\Sigma$ corresponding to the $j$th largest eigenvalue $\lambda_j$.

In fact, in the first step, $z_1 = a_1'v$ can be found by solving the following optimization problem:

maximize $\mathrm{var}(a_1'v)$ subject to $a_1'a_1 = 1$,

where $\mathrm{var}(a_1'v)$ is computed as $\mathrm{var}(a_1'v) = a_1' \Sigma a_1$.

To solve the above optimization problem, the technique of Lagrange multipliers can be used. Let $\lambda$ be a Lagrange multiplier. We want to maximize
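$$a_1' \Sigma a_1 - \lambda (a_1'a_1 - 1). \tag{4.10}$$

Differentiating equation (4.10) with respect to $a_1$, we have

$$\Sigma a_1 - \lambda a_1 = 0, \quad \text{or} \quad (\Sigma - \lambda I_d)\, a_1 = 0,$$

where $I_d$ is the $d \times d$ identity matrix. Thus $\lambda$ is an eigenvalue of $\Sigma$ and $a_1$ is the corresponding eigenvector. Since

$$a_1' \Sigma a_1 = a_1' \lambda a_1 = \lambda,$$

$a_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$. In fact, it can be shown that the $j$th PC is $a_j'v$, where $a_j$ is an eigenvector of $\Sigma$ corresponding to its $j$th largest eigenvalue $\lambda_j$ (Jolliffe, 2002).

In (Ding and He, 2004), PCA is used to reduce the dimensionality of the data set, and the K-means algorithm is then applied in the PCA subspace. Other examples of PCA applied in cluster analysis can be found in (Yeung and Ruzzo, 2001). Performing PCA is equivalent to performing singular value decomposition (SVD) on the covariance matrix of the data. ORCLUS uses the SVD technique (Kanth et al., 1998) to find arbitrarily oriented subspaces with good clustering.

3.2.2 SVD

SVD is a powerful technique in matrix computation and analysis, for example in solving systems of linear equations and in matrix approximation. SVD is also a well-known linear projection technique and has been widely used in data compression and visualization (Andrews and Patterson, 1976a, b). In this subsection, the SVD method is briefly reviewed.

Let $D = \{x_1, x_2, \ldots, x_n\}$ be a numerical data set in a $d$-dimensional space. Then $D$ can be represented by an $n \times d$ matrix $X$ as

$$X = (x_{ij})_{n \times d},$$

where $x_{ij}$ is the $j$th component value of $x_i$.

Let $\bar{\mu} = (\bar{\mu}_1, \bar{\mu}_2, \ldots, \bar{\mu}_d)$ be the column mean of $X$,

$$\bar{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \quad j = 1, 2, \ldots, d,$$

and let $e_n$ be a column vector of length $n$ with all elements equal to one. Then SVD expresses $X - e_n \bar{\mu}$ as

$$X - e_n \bar{\mu} = U S V^T, \tag{4.11}$$

where $U$ is an $n \times n$ column-orthonormal matrix, i.e., $U^T U = I$ is an identity matrix; $S$ is an $n \times d$ diagonal matrix containing the singular values; and $V$ is a $d \times d$ unitary matrix, i.e., $V^H V = I$, where $V^H$ is the conjugate transpose of $V$.

The columns of the matrix $V$ are the eigenvectors of the covariance matrix $C$ of $X$; precisely,

$$C = \frac{1}{n} X^T X - \bar{\mu}^T \bar{\mu} = V \Lambda V^T. \tag{4.12}$$

Since $C$ is a $d \times d$ positive semidefinite matrix, it has $d$ nonnegative eigenvalues and $d$ orthonormal eigenvectors. Without loss of generality, let the eigenvalues of $C$ be ordered in decreasing order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Let $\sigma_j$ $(j = 1, 2, \ldots, d)$ be the standard deviation of the $j$th column of $X$, i.e.,

$$\sigma_j = \left( \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{\mu}_j)^2 \right)^{1/2}.$$

The trace of $C$ is invariant under rotation, i.e.,

$$\sum_{j=1}^{d} \sigma_j^2 = \sum_{j=1}^{d} \lambda_j.$$

Noting that $e_n^T X = n \bar{\mu}$ and $e_n^T e_n = n$, from equations (4.11) and (4.12) we have

$$\begin{aligned} V S^T S V^T &= V S^T U^T U S V^T \\ &= (X - e_n \bar{\mu})^T (X - e_n \bar{\mu}) \\ &= X^T X - \bar{\mu}^T e_n^T X - X^T e_n \bar{\mu} + \bar{\mu}^T e_n^T e_n \bar{\mu} \\ &= X^T X - n \bar{\mu}^T \bar{\mu} \\ &= n V \Lambda V^T. \end{aligned} \tag{4.13}$$

Since $V$ is an orthonormal matrix, it follows from equation (4.13) that the singular values are related to the eigenvalues by

$$s_j^2 = n \lambda_j, \quad j = 1, 2, \ldots, d.$$

The eigenvectors constitute the PCs of $X$, and uncorrelated features are obtained by the transformation $Y = (X - e_n \bar{\mu}) V$. PCA selects the features with the highest eigenvalues.

3.2.3 The Karhunen-Loève transform

The Karhunen-Loève (K-L) transform is concerned with explaining the data structure through a few linear combinations of the variables. Like PCA, the K-L transform is also the optimal way to project $d$-dimensional points onto lower-dimensional points such that the projection error (i.e., the sum of squared distances (SSD)) is minimal (Fukunaga, 1990).

Let $D = \{x_1, x_2, \ldots, x_n\}$ be a data set in a $d$-dimensional space, and let $X$ be the corresponding $n \times d$ matrix, i.e., $X = (x_{ij})_{n \times d}$ with $x_{ij}$ the $j$th component value of $x_i$.

The $x_i$ $(i = 1, 2, \ldots, n)$ are $d$-dimensional vectors. They can be represented without error as a sum of $d$ linearly independent vectors:

$$x_i = \sum_{j=1}^{d} y_{ij} \phi_j^T = y_i \Phi^T, \quad \text{or} \quad X = Y \Phi^T, \tag{4.14}$$

where $y_i = (y_{i1}, y_{i2}, \ldots, y_{id})$ and

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad \Phi = (\phi_1, \phi_2, \ldots, \phi_d).$$

The $d \times d$ matrix $\Phi$ is the basis matrix, and we may further assume that the vectors $\phi_j$ form an orthonormal set, i.e.,

$$\phi_i^T \phi_j = \begin{cases} 1 & \text{for } i = j, \\ 0 & \text{for } i \ne j, \end{cases} \quad \text{or} \quad \Phi^T \Phi = I_{d \times d},$$

where $I_{d \times d}$ is the $d \times d$ identity matrix.

Then, from equation (4.14), the components of $y_i$ can be computed by

$$y_i = x_i \Phi, \quad i = 1, 2, \ldots, n, \quad \text{or} \quad Y = X \Phi.$$

Thus, $Y$ is simply an orthonormal transformation of $X$. $\phi_j$ is called the $j$th feature vector, and $y_{ij}$ is the $j$th component of the sample $x_i$ in the feature space.

To reduce dimensionality, we choose only $m$ ($m < d$) [...]

[...]

4. Outlier removal: First, clusters are formed until the number of clusters drops to a fraction of the initial number of clusters. Then, in case outliers were sampled along with the data during the initial sampling phase, the algorithm automatically removes the small groups.
5. Clustering of the partial clusters: the objects that represent the clusters are moved toward the cluster center, that is, they are replaced by objects closer to the center.
6. Labeling the data with the corresponding cluster labels.

The computational complexity of the CURE algorithm is $O(n^2 \log n)$. CURE is a reliable algorithm for discovering clusters of arbitrary shape, and it works well on data with outliers and on two-dimensional data sets. However, it is very sensitive to parameters such as the number of representative objects and the shrink factor of the representatives.

1.3 The AGNES algorithm

AGNES is an agglomerative technique. AGNES starts with each data object in its own cluster. Clusters are merged according to some rule-based criterion until only one cluster remains at the top of the hierarchy, or until a stopping condition is met. This form of hierarchical clustering is also known as the bottom-up approach: it starts at the bottom with the leaf nodes, each in its own cluster, and moves up the hierarchy to the root node, where the final single cluster containing all the data objects is found.

1.4 The DIANA algorithm

DIANA does the opposite of AGNES. DIANA starts with all data objects contained in one large cluster and splits it repeatedly, according to a rule-based similarity criterion, until every data object of the large cluster has been separated. This form of hierarchical clustering is also known as the top-down approach: it starts at the root node at the top level, with all data objects in one cluster, and moves down to the leaf nodes at the bottom, where each data object is contained in its own cluster.

In either of the two methods, different numbers of clusters can be obtained at different levels of the hierarchy by moving up or down the tree. Each level may have a different number of clusters, and the results naturally differ as well.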
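The bottom-up merging performed by AGNES can be sketched as follows. This is only a minimal illustration under stated assumptions: the merge rule is taken to be single linkage (the smallest Euclidean distance between members of two clusters) and the stopping condition is a target number of clusters k. Neither choice is fixed by the text above, and the function name agnes_single_link and the sample points are hypothetical.

```python
import math

def agnes_single_link(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]

    def link(a, b):
        # Assumed merge rule: single linkage, the smallest distance
        # between any member of a and any member of b.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = link(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Merge cluster q into cluster p; a merge is never undone.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agnes_single_link(points, 3)
```

Calling the function with k = 1 runs the merging all the way to the root of the hierarchy; recording each merge along the way would give the full tree from which any level can be read off.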
A major limitation of these approaches is that once clusters have been merged or split, the decision cannot be undone, even if the merge or split was not appropriate at that level.

1.5 The ROCK algorithm

--------Main module---------
Procedure cluster(S, k)
Begin
    link := compute_links(S)
    for each s ∈ S do
        q[s] := build_local_heap(link, s)      // local heap of the clusters linked to s
    Q := build_global_heap(S, q)               // global heap of all clusters
    while size(Q) > k do {
        u := extract_max(Q)                    // best cluster to merge
        v := max(q[u])                         // its best merge partner
        delete(Q, v)
        w := merge(u, v)
        for each x ∈ q[u] ∪ q[v] do {
            link[x, w] := link[x, u] + link[x, v]
            delete(q[x], u); delete(q[x], v)
            insert(q[x], w, g(x, w)); insert(q[w], x, g(x, w))
            update(Q, x, q[x])
        }
        insert(Q, w, q[w])
        deallocate(q[u]); deallocate(q[v])
    }
End

--------Compute_links procedure---------
Procedure compute_links(S)
Begin
    Compute nbrlist[i] for every point i in S
    Set link[i, j] to zero for all i, j
    for i := 1 to n do {
        N := nbrlist[i]
        for j := 1 to |N| - 1 do
            for l := j + 1 to |N| do
                link[N[j], N[l]] := link[N[j], N[l]] + 1   // N[j] and N[l] share neighbor i
    }
End

1.6 The Chameleon algorithm

Chameleon takes a different approach, using a dynamic model to determine which clusters are formed. The first step of Chameleon is to build a sparse graph and then apply a graph-partitioning algorithm to cluster the data into a large number of subclusters. Next, Chameleon performs agglomerative hierarchical clustering, like AGNES, merging the small subclusters according to two measures: the relative interconnectivity and the relative closeness of the subclusters. The algorithm therefore does not depend on user-supplied parameters, as K-means does, and it can adapt.

The algorithm explores dynamic modeling in hierarchical clustering. Two clusters are merged if the interconnectivity and closeness between the two clusters are highly related to the internal interconnectivity and closeness of the objects within the clusters. The merging process makes it easy to discover natural and homogeneous clusters, and it applies to all types of data as long as a similarity function is defined.

Chameleon overcomes weaknesses of the CURE and ROCK methods. The reason is that CURE and related schemes ignore information about the interconnectivity of the objects in two different clusters, whereas ROCK and related schemes ignore information about the closeness of two clusters and place too much emphasis on interconnectivity.

Chameleon first uses a graph-partitioning algorithm to cluster the data objects into a large number of relatively small subclusters. It then uses a hierarchical clustering algorithm to find the genuine clusters by repeatedly combining or merging the subclusters.
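To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and the closeness of the clusters, and in particular the internal characteristics of the clusters being merged. It therefore does not depend on a static model and can automatically adapt to the internal characteristics of the clusters being merged. It is more capable than CURE and DBSCAN of discovering high-quality clusters of arbitrary shape, but the cost of processing high-dimensional data may require $O(n^2)$ time for $n$ objects in the worst case.

2. Fuzzy clustering algorithms

Fuzzy clustering is a clustering method that allows each data point to belong to two or more clusters through degrees of membership. Ruspini (1969) introduced the concept of a fuzzy partition to describe the cluster structure of a data set and proposed an algorithm for computing an optimal fuzzy partition. Dunn (1973) extended the clustering method and developed a fuzzy clustering algorithm. The idea of the algorithm is to build a fuzzy clustering method based on minimizing an objective function. Bezdek (1981) improved and generalized the fuzzy objective function by introducing a weighting exponent, which gave the Fuzzy C-means (FCM) clustering algorithm, and it was proved that the algorithm converges to a local minimum.

The FCM algorithm has been applied successfully to a large number of clustering problems, for example in pattern recognition, image processing, and medicine. However, the greatest weakness of FCM is that it is sensitive to noise and outliers in the data, which means that the cluster centers may lie far from the true centers of the clusters. Many methods have been proposed to address this weakness of FCM, including probability-based clustering (Kellet, 1993), fuzzy noise clustering (Dave, 1991), clustering based on the $L_p$ norm operator (Kerten, 1999), and the ε-Insensitive Fuzzy C-means algorithm (εFCM).

2.1 The FCM algorithm

The FCM algorithm consists of a sequence of iterations alternating between equations (5) and (6). FCM thus uses iteration to optimize the objective function, based on the weighted similarity measure between $x_k$ and the cluster center $V_i$. After each iteration, the algorithm computes and updates the elements $u_{jk}$ of the partition matrix $U$. The iteration stops when

$$\max_{ij} \left\{ \left| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right| \right\} \le \varepsilon,$$

where $\varepsilon$ is a termination criterion between 0 and 1 and $k$ is the iteration step. This procedure converges to a local minimum or a saddle point of $J_m(u, V)$. The FCM algorithm computes the partition matrix $U$ and the size of the clusters in order to obtain the fuzzy models from this matrix.

The steps of the FCM algorithm are as follows:

Input: the number of clusters $c$ and the exponent parameter $m$ of the objective function $J$;
Output: $c$ data clusters such that the objective function in (1) attains its minimum;
Begin
1. Enter the parameter $c$ ($1 <$ [...]

[...] where $\sigma$ is a threshold.

- Gaussian influence function:

$$f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$

The density function at a point $x \in F^d$ is defined as the sum of the influence functions of all data points.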
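Given $n$ data objects described by a set of vectors $D = \{x_1, \ldots, x_n\} \subset F^d$, the density function is defined as

$$f_B^D(x) = \sum_{i=1}^{n} f_B^{x_i}(x).$$

The density function built from the Gaussian influence function is

$$f_{Gauss}^D(x) = \sum_{i=1}^{n} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}.$$

DENCLUE depends heavily on the noise threshold and the density parameter, but it has the following main advantages over other clustering algorithms:
- It has a solid mathematical foundation and generalizes other clustering methods, including hierarchical and partitioning-based methods.
- It has good clustering properties for large data sets with a lot of noise.
- It allows clusters of arbitrary shape in high-dimensional data sets to be described by a mathematical formula.

The computational complexity of DENCLUE is $O(n \log n)$. Density-based algorithms do not apply sampling to the data set as partitioning algorithms do, because sampling may add complexity when the density of the objects in the sample differs from the density of the whole data set.

7. Model-based clustering algorithms

7.1 The EM algorithm

The EM algorithm is regarded as a model-based algorithm or as an extension of the K-means algorithm. EM assigns objects to the given clusters according to the probability distribution of the components of each object. The probability distribution most often used is the Gaussian distribution, and the aim is to iteratively discover good values for its parameters using the log-likelihood of the data objects as the criterion function, which is a good function for fitting a probability model to the data objects.

EM can discover clusters of many different shapes. However, because the algorithm needs many iterations to determine good parameters, its computational cost is rather high. Several improvements have been proposed for EM based on properties of the data: objects can be compressed, retained in memory, or discarded. In these improvements, an object is discarded when its cluster label is known with certainty; it is compressed when it is not discarded and belongs to a cluster too large for memory; and it is retained in the remaining cases.

The algorithm is divided into two steps, and the process is repeated until the problem is solved:
- E step: [...]
- M step: [...]

1. Initialize the parameters:

$$\lambda_0 = \{\mu_1^{(0)}, \mu_2^{(0)}, \ldots, \mu_K^{(0)}, p_1^{(0)}, p_2^{(0)}, \ldots, p_K^{(0)}\}$$

2. E step:

$$P(\omega_i \mid x_k, \lambda_t) = \frac{P(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{P(x_k \mid \lambda_t)} = \frac{P(x_k \mid \omega_i, \mu_i^{(t)}, \sigma^2)\, p_i^{(t)}}{\sum_{j} P(x_k \mid \omega_j, \mu_j^{(t)}, \sigma^2)\, p_j^{(t)}}$$

3. M step:

$$\mu_i^{(t+1)} = \frac{\sum_{k} P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_{k} P(\omega_i \mid x_k, \lambda_t)}, \qquad p_i^{(t+1)} = \frac{\sum_{k} P(\omega_i \mid x_k, \lambda_t)}{R},$$

where $R$ is the number of data objects.

4. Repeat steps 2 and 3 until the result is obtained.

7.2 The COBWEB algorithm

COBWEB represents data objects as attribute-value pairs. COBWEB works by building a classification tree, similar in concept to BIRCH, although the tree structure is different. Each node of the classification tree represents a concept of the data objects, and all points below that level belong to the same node. COBWEB uses the category utility measure to manage the tree structure. Clusters are formed based on a similarity measure that distinguishes similarity from dissimilarity, both of which can describe how attribute values are divided among the nodes of a class. The tree structure can also be merged or split when a new node is inserted into the tree. There are two improved methods for COBWEB: CLASSIT and AutoClass.

CHAPTER III: APPLICATIONS OF DATA CLUSTERING

1. Image segmentation

Image segmentation is a fundamental component in many computer vision applications and can be treated as a clustering problem (Rosenfeld and Kak 1982). The segmentation of the images presented to an image analysis system depends critically on the scene to be sensed, the imaging geometry, the configuration, the sensor used to convert the scene into a digital image, and ultimately the desired output (goal) of the system.

The applicability of clustering methods to the image segmentation problem was recognized more than three decades ago, and the paradigms underlying the pioneering efforts are still in use today. A recurring theme is to define a feature vector at every pixel, composed of functions of the image intensity and functions of the pixel location itself. This idea is depicted in Figure 25. It has been used successfully for intensity images (with or without texture), range (depth) images, and multispectral images.

Figure 25. Feature representation for clustering. Image measurements and positions are transformed to features. Clusters in the feature space correspond to image segments.

1.1. Definition of image segmentation

Image segmentation is usually understood as the decomposition of an input image into regions (separate object classes); each object is called a subimage. To distinguish one object from another and to facilitate the subsequent analysis steps, each object is assigned a label. In essence, image segmentation is pattern matching. Each segmented subimage carries attributes (intensity, color, texture).

If $\mathcal{I} = \{x_{ij},\ i = 1, \ldots, N_r,\ j = 1, \ldots, N_c\}$ is an input image with $N_r$ rows and $N_c$ columns and measurement value $x_{ij}$ at pixel $(i, j)$, then the segmentation can be expressed as $S = \{S_1, \ldots, S_k\}$, where the $l$th segment $S_l$ consists of a connected subset of the pixel coordinates. No two segments share any pixel location ($S_i \cap S_j = \emptyset\ \ \forall i \ne j$), and the union of all segments covers the entire image ($\bigcup_{i=1}^{k} S_i = \{1, \ldots, N_r\} \times \{1, \ldots, N_c\}$). Jain and Dubes [1988], after Fu and Mui [1981], identify three techniques for producing a segmentation from an input image: region-based segmentation, edge-based segmentation, and segmentation by data clustering.

Consider the use of simple gray-level thresholding to segment a high-contrast intensity image. Figure 26(a) shows a grayscale image of a textbook's bar code scanned on a flatbed scanner. Part (b) shows the result of a simple thresholding operation designed to separate the dark and light regions in the bar code area. Binarization steps like this are often performed in character recognition systems. Thresholding in effect clusters the image pixels into two groups based on the one-dimensional intensity measurement [Rosenfeld 1969; Dunn et al. 1974]. A postprocessing step separates the classes into connected regions. While simple gray-level thresholding is adequate in some carefully controlled image acquisition environments, and many researchers have devoted effort to appropriate thresholding methods [Weszka 1978; Trier and Jain 1995], complex images require more elaborate segmentation techniques.

Many segmenters use measurements that are both spectral (for example, the multispectral scanner used in remote sensing) and spatial (based on the pixel's location in the image plane). The measurement at each pixel therefore corresponds directly to the content of a pattern.

Figure 26. Binarization via thresholding. (a): Original grayscale image. (b): Gray-level histogram. (c): Result of thresholding.

1.2 Image segmentation based on data clustering

The application of local feature clustering to segment gray-scale images was documented in Schachter et al. [1979]. That paper emphasized the appropriate selection of features at each pixel rather than the clustering method, and it proposed using the image plane coordinates (spatial information) as additional features in clustering-based segmentation. The goal of clustering was to obtain a sequence of hyperellipsoidal clusters, starting with cluster centers positioned at maximum density locations in the pattern space and growing clusters about these centers until a $\chi^2$ test for goodness of fit was violated. A variety of features were discussed and applied to both grayscale and color images.

An agglomerative clustering algorithm was applied by Silverman and Cooper [1988] to the problem of unsupervised learning of clusters of coefficient vectors for two image models that correspond to image segments.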
11 c r ikiN N S U == Jain v Dubes[1981],sau khi Fu v Mui[1981] pht hin ra 3 k thut s dng phn on nh tmt nh u vol:Kthut phnonnhdatrnmin,Kthut phnonnhda trn bin,v k thut phn on nh bng phn cm d liu. Hy xem xt s hu dng ca vic to ngngmt mc xm n gin phnonmtnhcngtngphncao.Hnh26(a)biudinmt nhthang-o-sngcamvchcamtschgiokhoacscantrnmt my qut hnh phng. Phn b biu din kt qu ca mt tc vto ngng c bnc thit k chia tch min ti v sang trn vng m vch. Ccbc nhphnhanhvy thng c s dng trong cch thngnhn din k t. S to ngng nh hng n phn cm d liuim nh thnh hai nhm datrnphpocngmtchiu[Rosenfeld1969;Dunnetal.1974].Mt bcxlsauchiatchcclpthnhccvngclinkt. Trong khi ngng mc xm n gin l mi trng nh c kim sot c tip nhn v nhiu nh khoa hc cng hin cc phng php thch hp cho -64- victongng[Weszka1978;TriervJain1995],ccnhphctpi hi nhiu k thut phn on chi tit hn. Nhiuphnonsdngchaiphpoquangph(vdnhMy qut a quang ph c s dng trong vin thm) v khng gian (da trnv trimnhtrnmtnhphng).Phpomiimnhttngng trc tip ti ni dung ca mt mu. (a) (b) (c) Hnh26.Nhphnhathngquangng.(a):nhthangoxmgc.(b) Biu mc xm. (c) Kt qu ca vic to ngng -65- 1.2 Phn on nh da vo phn cm d liu Vicpdngcctnhnngcaaphngphnkhcclustering quy m hnh nh mu xm-c ti liu trong Schachter et al. [1979]. Ti liu nhnmnhnviclachnthchhpcatnhnngmiimnhch khng phi l phng php phn cm, v xut vic s dng cc mt phng ta hnh nh (khng gian thng tin) l tnh nng b sung c lm vic ti cc phn nhm da trn phn khc. Mc tiu ca phn cm l c c mt chui cc cm hyperellipsoidal bt u vi cc trung tm cm v tr ti cc v tr mt ti a trong khng gian mu, v cc cm pht trin v cc trung tm cho nkhimt thnghim tt p ca 2_chophhp bvi phm. Mt lot cc tnh nng c tho lun v p dng cho c hai mu xm v mu sc hnh nh. MtthuttonphncmkttcpdngbiSilvermanv Cooper[1988]chovncahckhnggimstcacmvecthscho hai m hnh nh tng ng vi cc phn on hnh nh. 
The first model is polynomial in the observed image measurements; the assumption is that the image is a collection of several adjoining surfaces, each a polynomial function of the image plane coordinates, sampled on the raster grid to produce the observed image. The algorithm proceeds by taking the coefficient vectors of least-squares fits to the data in M disjoint image windows. An agglomerative clustering algorithm then merges (at each step) the two clusters that have the minimum global between-cluster Mahalanobis distance. The same framework was applied to the segmentation of textured images, but for such images the polynomial model was inappropriate, and a parameterized Markov random field model was assumed instead.

Wu and Leahy [1993] describe the application of the principles of network flow to unsupervised classification, yielding a novel hierarchical algorithm for clustering. In essence, the technique views the unlabeled patterns as nodes in a graph, where the weight of an edge (that is, its capacity) is a measure of similarity between the corresponding nodes. Clusters are identified by removing edges from the graph to produce connected, disjoint subgraphs. In image segmentation, pixels that are 4-neighbors or 8-neighbors in the image plane share edges in the constructed adjacency graph, and the weight of a graph edge is based on the strength of a hypothesized image edge between the pixels involved (this strength is computed with simple derivative masks). This segmenter therefore works by finding closed contours in the image and is best labeled edge-based rather than region-based.

In Vinod et al. [1994], two neural networks are designed to perform pattern clustering when combined. A two-layer network operates on a multidimensional histogram of the data to identify "prototypes", which are used to classify the input patterns into clusters. These prototypes are passed to the classification network, another two-layer network that operates on the histogram of the input data but is trained to have weights different from those of the prototype selection network. In both networks, the histogram of the image is used to weight the contributions of patterns neighboring the one under consideration to the location of the prototypes or to the final classification; as a result, it is likely to be more robust than techniques that assume an underlying parametric density function for the pattern classes. This architecture was tested on gray-scale and color segmentation problems.

Jolion et al. [1991] describe a process for extracting clusters sequentially from the input pattern set by identifying hyperellipsoidal regions that contain a specified fraction of the unclassified points in the set. The extracted regions are compared against the best-fitting multivariate Gaussian density through a Kolmogorov-Smirnov test, and the fit quality is used as a figure of merit for selecting the best region at each iteration.
The process continues until a stopping criterion is satisfied. This procedure was applied to threshold selection for multithreshold segmentation of intensity images and to the segmentation of range images.

Clustering techniques have also been used successfully for the segmentation of range images, which are a popular source of input data for three-dimensional object recognition systems [Jain and Flynn 1993]. Range sensors typically return raster images in which the measured value at each pixel is the coordinates of a 3D location in space. These 3D positions can be understood as the locations where rays emerging from the image plane locations in a bundle intersect the objects in front of the sensor.

The local feature clustering concept is particularly attractive for range image segmentation because, unlike intensity measurements, the measurements at each pixel have the same units (length); ad hoc transformations or normalizations of the image features are then unnecessary if their goal is to impose equal scaling on those features. However, range image segmenters often add further measurements to the feature space, which removes this advantage.

The range image segmentation system described in Hoffman and Jain [1987] uses squared-error clustering in a six-dimensional feature space as the source of an "initial" segmentation, which is refined (typically by merging segments) into the output segmentation. The technique was enhanced in Flynn and Jain [1991] and used in a recent systematic comparison of range image segmenters [Hoover et al. 1996]; as such, it is probably one of the longest-lived range segmenters that performs well on a wide variety of range images.

This segmenter works as follows. At each pixel (i, j) of the input range image, the corresponding 3D measurement is denoted $(x_{ij}, y_{ij}, z_{ij})$, where $x_{ij}$ is a linear function of j (the column number) and $y_{ij}$ is a linear function of i (the row number). A k × k neighborhood of (i, j) is used to estimate the 3D surface normal $n_{ij} = (n_{ij}^x, n_{ij}^y, n_{ij}^z)$ at (i, j), typically by finding the least-squares planar fit to the 3D points in the neighborhood. The feature vector for the pixel at (i, j) is the six-dimensional measurement $(x_{ij}, y_{ij}, z_{ij}, n_{ij}^x, n_{ij}^y, n_{ij}^z)$, and a candidate segmentation is found by clustering these feature vectors. For practical reasons, not every pixel's feature vector is used in the clustering procedure; typically 1,000 feature vectors are chosen by subsampling.

The CLUSTER algorithm [Jain and Dubes 1988] was used to obtain segment labels for each pixel. CLUSTER is an enhancement of the k-means algorithm; it can identify several clusterings of a data set, each with a different number of clusters.
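The construction of the six-dimensional feature vector can be sketched as follows. This is only an illustration under stated assumptions: the plane is fitted in the form z = a·x + b·y + c (which cannot represent surfaces parallel to the viewing direction), the image is a nested list of (x, y, z) tuples, and the helper names `solve3`, `surface_normal` and `feature_vector` are hypothetical, not taken from the cited systems.

```python
import math

def solve3(M, v):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    A = [row[:] + [rhs] for row, rhs in zip(M, v)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][3] / A[r][r] for r in range(3)]

def surface_normal(points):
    """Least-squares fit of z = a*x + b*y + c; returns the unit normal."""
    # Accumulate the normal equations (A^T A) p = A^T z, rows of A = [x, y, 1].
    S = [[0.0] * 3 for _ in range(3)]
    t = [0.0] * 3
    for x, y, z in points:
        row = (x, y, 1.0)
        for r in range(3):
            t[r] += row[r] * z
            for c in range(3):
                S[r][c] += row[r] * row[c]
    a, b, _ = solve3(S, t)
    norm = math.sqrt(a * a + b * b + 1.0)
    return (-a / norm, -b / norm, 1.0 / norm)

def feature_vector(image, i, j, k=3):
    """image[i][j] = (x, y, z); returns (x, y, z, nx, ny, nz) at pixel (i, j)."""
    h = k // 2
    patch = [image[r][c] for r in range(i - h, i + h + 1)
                         for c in range(j - h, j + h + 1)]
    return tuple(image[i][j]) + surface_normal(patch)

# Synthetic range image of the plane z = 2x + 3, with x = column and y = row.
image = [[(float(c), float(r), 2.0 * c + 3.0) for c in range(5)]
         for r in range(5)]
features = feature_vector(image, 2, 2)
```

For this synthetic plane the estimated normal is proportional to (-2, 0, 1). In a real segmenter the feature vectors of a subsample of pixels would then be passed to the clustering step.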
Hoffman and Jain [1987] also experimented with other clustering techniques (for example, complete-link, single-link, graph-theoretic, and other squared-error algorithms) and found that CLUSTER provided the best combination of performance and accuracy. An additional advantage of CLUSTER is that it produces a sequence of output clusterings (that is, a 2-cluster solution up through a Kmax-cluster solution, where Kmax is specified by the user and is typically 20 or so); each clustering in this sequence yields a clustering statistic that combines between-cluster separation and within-cluster scatter. The clustering that optimizes this statistic is chosen as the best one. Each pixel in the range image is assigned the segment label of the nearest cluster center. This minimum-distance classification step is not guaranteed to produce segments that are connected in the image plane; therefore, a connected-components labeling algorithm allocates new labels to disjoint regions that were placed in the same cluster. Subsequent operations include surface type tests, merging of adjacent patches using a test for the presence of crease or jump edges between adjacent segments, and surface parameter estimation.
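The two labeling steps described above, minimum-distance classification followed by connected-components relabeling, can be sketched as below. The sketch assumes squared Euclidean distance and 4-connectivity in the image plane; neither choice is stated in the text, and the function names are hypothetical.

```python
from collections import deque

def nearest_center_labels(features, centers):
    """Assign each pixel the index of the nearest cluster center."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [[min(range(len(centers)), key=lambda c: d2(f, centers[c]))
             for f in row] for row in features]

def connected_components(labels):
    """Relabel so that every 4-connected region gets its own label."""
    rows, cols = len(labels), len(labels[0])
    out = [[-1] * cols for _ in range(rows)]
    next_label = 0
    for i in range(rows):
        for j in range(cols):
            if out[i][j] != -1:
                continue
            # Breadth-first flood fill over pixels with the same cluster label.
            out[i][j] = next_label
            queue = deque([(i, j)])
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and out[rr][cc] == -1
                            and labels[rr][cc] == labels[r][c]):
                        out[rr][cc] = next_label
                        queue.append((rr, cc))
            next_label += 1
    return out, next_label

# One image row with 1-D features: two pixels of cluster 0 separated by cluster 1.
features = [[(0.1,), (5.0,), (0.2,)]]
centers = [(0.0,), (5.0,)]
cluster_labels = nearest_center_labels(features, centers)
segments, count = connected_components(cluster_labels)
```

In this toy row the two outer pixels fall into the same cluster but are not adjacent, so the relabeling step splits them into separate segments.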

Figure 27. Image segmentation by clustering. (a): Input range image. (b): Surface normals for selected image pixels. (c): Initial segmentation (19-cluster solution) returned by CLUSTER using 1,000 six-dimensional samples from the image as a pattern set. (d): Final segmentation (8 segments) produced by postprocessing.

Figure 27 shows this processing applied to a range image. Part a of the figure shows the input range image; part b shows the distribution of surface normals. Part c shows the initial segmentation returned by CLUSTER and modified to guarantee connected segments. Part d shows the final segmentation produced by merging adjacent patches that do not have a significant crease edge between them. The final clusters reasonably represent the distinct surfaces present in this complex object.

The analysis of textured images has been of interest to researchers for several years. Texture segmentation techniques have been developed using a variety of texture models and image operations. In Nguyen and Cohen [1993], texture image segmentation was addressed by modeling the image as a hierarchy of two Markov random fields, obtaining some simple statistics from each image block to form a feature vector, and clustering these blocks with a fuzzy K-means method. The clustering procedure was modified to jointly estimate the number of clusters and the fuzzy membership of each feature vector in the various clusters.

A system for segmenting texture images was described by Jain and Farrokhnia [1991]; there, Gabor filters were used to obtain a set of 28 orientation- and scale-selective features that characterize the texture in the neighborhood of each pixel. These 28 features are reduced to a smaller number through a feature selection procedure, and the resulting features are preprocessed and then clustered using the CLUSTER program. An index statistic [Dubes 1987] is used to select the best clustering. Minimum-distance classification is used to label each pixel of the original image. The technique was tested on several texture mosaics, including natural Brodatz textures and synthetic images. Figure 28(a) shows an input texture mosaic consisting of four of the popular Brodatz textures [Brodatz 1966]. Part b shows the segmentation produced when the Gabor filter features are augmented with spatial information (pixel coordinates). This Gabor filter based technique has proven very powerful and has been extended to the automatic segmentation of text in documents [Jain and Bhattacharjee 1992] and to the segmentation of objects in complex backgrounds [Jain et al. 1997].

Figure 28. Results of texture image segmentation. (a): Four-class texture mosaic. (b): Four-cluster solution produced by the CLUSTER algorithm with pixel coordinates included in the feature set.
Clustering can be used as a preprocessing stage to identify pattern classes for subsequent supervised classification. Taxt and Lundervold [1994] and Lundervold et al. [1996] describe a partitional clustering algorithm and a manual labeling technique used to identify material classes (for example, cerebrospinal fluid, white matter, striated muscle, tumor) in registered images of a human head obtained from five different magnetic resonance imaging channels (yielding a five-dimensional feature vector at each pixel). A number of clusterings were obtained and combined with domain knowledge (human expertise) to identify the different classes. Decision rules for supervised classification were then based on the classes obtained. Figure 29(a) shows one channel of an input multispectral image; part b shows the 9-cluster result.

The K-means algorithm was applied to the segmentation of LANDSAT images in Solberg et al. [1996]. The initial cluster centers were chosen interactively by a trained operator and correspond to land-use classes such as urban areas, soil (vegetation-free) areas, forest, grassland, and water. Figure 30(a) shows the input image rendered as grayscale; part b shows the result of the clustering procedure.
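K-means with operator-supplied starting centers, as in the LANDSAT example, can be sketched in a few lines. This is a generic K-means sketch, not the procedure of Solberg et al.; the two-band pixel values and the seed centers below are invented for illustration.

```python
def kmeans(points, seeds, iterations=20):
    """Plain K-means started from user-chosen seed centers."""
    def nearest(p, centers):
        return min(range(len(centers)),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(p, centers[c])))

    centers = [tuple(float(v) for v in s) for s in seeds]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # An empty cluster keeps its previous center.
        updated = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Hypothetical two-band pixel values and two operator-chosen seeds.
pixels = [(10, 12), (11, 13), (12, 11), (80, 90), (82, 88), (79, 91)]
seeds = [(0, 0), (100, 100)]
centers, labels = kmeans(pixels, seeds)
```

Seeding the centers by hand ties each resulting cluster to a class the operator already has in mind, at the cost of making the result depend on the operator's choice.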

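Figure 29. Multispectral medical image segmentation. (a): A single channel of the input image. (b): 9-cluster segmentation.

Figure 30. LANDSAT image segmentation. (a): Original image (ESA / EURIMAGE / Sattelitbild). (b): Clustered scene.

2. Object and character recognition

2.1 Object recognition

The use of clustering to group views of 3D objects for the purpose of object recognition in range data was described in Dorai and Jain [1995]. The term view refers to a range image of an object obtained from an arbitrary viewpoint. The system under consideration uses a viewpoint-dependent (or view-centered) approach to the object recognition problem; each object to be recognized is represented by a library of range images of that object.

There are many possible views of a 3D object, and one goal of that work was to avoid matching an unknown input view against every image of every object. A common theme in the object recognition literature is indexing, in which the unknown view is used to select a subset of views of a subset of the objects in the database for further comparison, and all other views are rejected. One approach to indexing uses the notion of view classes; a view class is a set of qualitatively similar views of an object. In that work the view classes were identified by clustering; the rest of this subsection outlines the technique.

Object views were grouped into classes based on the similarity of shape spectral features. Each input image of an object viewed in isolation yields a feature vector that characterizes it. The feature vector contains the first ten central moments of a normalized shape spectral distribution, $\bar{H}(h)$, of an object view. The shape spectrum of an object view is obtained from its range data by constructing a histogram of shape index values (which are related to surface curvature values) and accumulating all the object pixels that fall into each bin. Normalizing the spectrum with respect to the total object area removes the scale (size) differences that may exist between different objects. The first moment $m_1$ is computed as the weighted mean of $\bar{H}(h)$:

$$m_1 = \sum_h h\,\bar{H}(h). \qquad (1)$$

The other central moments $m_p$, $2 \le p \le 10$, are defined as:

$$m_p = \sum_h (h - m_1)^p\,\bar{H}(h). \qquad (2)$$

The feature vector is then denoted $R = (m_1, m_2, \ldots, m_{10})$, with each of these moments lying in the range [-1, 1].

Let $O = \{O^1, O^2, \ldots, O^n\}$ be a collection of n 3D objects whose views are present in the model database $M_D$. The i-th view of the j-th object, $O_i^j$, is represented in the database by $\langle L_i^j, R_i^j \rangle$, where $L_i^j$ is the object label and $R_i^j$ is the feature vector. Given a set of object representations $\mathcal{R}^i = \{\langle L_1^i, R_1^i \rangle, \ldots, \langle L_m^i, R_m^i \rangle\}$ that describes m views of the i-th object, the goal is to derive a partition of the views, $P^i = \{C_1^i, C_2^i, \ldots, C_{k_i}^i\}$.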
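The moment features of equations (1) and (2) can be computed as in the following sketch. It assumes the shape spectrum is given as raw bin counts with known bin centers and that dividing by the total count stands in for normalization by object area; the function name and the toy histogram are hypothetical.

```python
def shape_spectrum_features(counts, bin_centers, order=10):
    """Return (m1, m2, ..., m_order) of a normalized shape spectrum."""
    total = float(sum(counts))
    H = [c / total for c in counts]                     # normalized spectrum
    m1 = sum(h * w for h, w in zip(bin_centers, H))     # equation (1)
    feats = [m1]
    for p in range(2, order + 1):                       # equation (2)
        feats.append(sum((h - m1) ** p * w for h, w in zip(bin_centers, H)))
    return feats

# Toy symmetric spectrum with three bins.
R = shape_spectrum_features([1, 2, 1], [-0.5, 0.0, 0.5])
```

For this symmetric toy spectrum the mean and all odd central moments vanish, and the second moment is 0.125.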
Each cluster in $P^i$ contains those views of the i-th object that have been judged similar on the basis of the dissimilarity between the corresponding moment features of the shape spectra of the views. The measure of dissimilarity between $R_j^i$ and $R_k^i$ is defined as:

$$D(R_j^i, R_k^i) = \sum_{l=1}^{10} \left(R_{jl}^i - R_{kl}^i\right)^2. \qquad (3)$$

Clustering views

A database containing about 3,200 range images of 10 different sculpted objects, with 320 views each, was used [Dorai and Jain 1995]. The range images from 320 possible viewpoints (determined by a tessellation of the view sphere using the icosahedron) were synthesized. Figure 31 shows a subset of the collection of views of the Cobra object used in the experiment. The shape spectrum of each view is computed and its feature vector is then determined. The views of each object are clustered, based on the dissimilarity measure D between their moment vectors, using a hierarchical clustering scheme [Jain and Dubes 1988]. The hierarchical grouping obtained with 320 views of the Cobra object is shown in Figure 32. The view grouping hierarchies of the other nine objects are similar to the dendrogram in Figure 32. The dendrogram is cut at a dissimilarity level of 0.1 or less to obtain compact and well-separated clusters. The clusterings obtained in this way show that the views of each object fall into a few clearly distinguishable clusters. The centroid of each of these clusters was determined by computing the mean of the moment vectors of the views falling into the cluster.

Figure 31. A subset of views of Cobra chosen from a set of 320 views.

Dorai and Jain [1995] demonstrated that this clustering-based view grouping procedure facilitates object matching in terms of classification accuracy and the number of matches necessary for the correct classification of test views. Object views are grouped into compact and homogeneous view clusters, which demonstrates the power of the cluster-based scheme for view organization and efficient object matching.

Figure 32. Hierarchical grouping of 320 views of a cobra sculpture.

2.2 Character recognition

A clustering-based character recognition technique was developed by Connell and Jain [1998] to identify lexemes in handwritten text for the purpose of writer-independent handwriting recognition. The success of a handwriting recognition system depends vitally on its acceptance by potential users. Writer-dependent systems provide a higher level of recognition accuracy than writer-independent systems, but require a large amount of training data. A writer-independent system, on the other hand, must be able to recognize a wide variety of writing styles in order to satisfy an individual user. As the variability of the writing styles that the system must capture increases, it becomes more and more difficult to discriminate between different classes because of the amount of overlap in the feature space. One solution to this problem is to separate the data from these disparate writing styles for each class into different subclasses, known as lexemes. These lexemes represent portions of the data that are more easily separated from the data of classes other than the one to which the lexeme belongs.

In this system, handwriting is captured by digitizing the (x, y) position of the pen and the state of the pen point (up or down) at a constant sampling rate. After some resampling, normalization, and smoothing, each stroke of the pen is represented as a variable-length string of points. A metric based on elastic template matching and dynamic programming is defined so that the distance between two strokes can be calculated.

Using the distances calculated in this manner, a proximity matrix is constructed for each class of digits (that is, 0 through 9). Each matrix measures the intraclass distances for a particular digit class. Digits in a particular class are clustered in an attempt to find a small number of prototypes. Clustering is done using the CLUSTER program described above [Jain and Dubes 1988], in which the feature vector of a digit consists of its N proximities to the digits of the same class. CLUSTER attempts to produce the best clustering for each value of K over some range, where K is the number of clusters into which the data is to be partitioned. As expected, the mean squared error (MSE) decreases monotonically as a function of K. The "optimal" value of K is chosen by identifying a knee in the plot of MSE versus K.

When a cluster of digits is represented by a single prototype, the best on-line recognition results were obtained by using the digit closest to the center of that cluster. Using this scheme, a correct recognition rate of 99.33% was obtained.

3. Information retrieval

Information retrieval (IR) is concerned with the automatic storage and retrieval of documents [Rasmussen 1992]. Many university libraries use IR systems to provide access to books, journals, and other documents. Libraries use the Library of Congress Classification (LCC) scheme for efficient storage and retrieval of books. The LCC scheme consists of classes labeled A to Z [LC Classification Outline 1990], which are used to characterize books belonging to different subjects. For example, label Q corresponds to books in the area of science, and the subclass QA is assigned to mathematics. Labels QA76 to QA76.8 are used for classifying books related to computers and other areas of computer science.

There are several problems associated with the classification of books using the LCC scheme. Some of these are listed below:

(1) When a user is searching for books in a library that deal with a topic of interest, the LCC number alone may not be able to retrieve all the relevant books.
This is because the classification number assigned to the books, or the subject categories that are typically entered in the database, do not contain sufficient information about all the topics covered in a book. To illustrate this point, consider the book Algorithms for Clustering Data by Jain and Dubes [1988]. Its LCC number is 'QA 278.J35'. In this LCC number, QA 278 corresponds to the topic 'cluster analysis', J corresponds to the first author's name, and 35 is the serial number assigned by the Library of Congress. The subject categories for this book provided by the publisher (which are typically entered in a database to facilitate search) are cluster analysis, data processing, and algorithms. There is a chapter in this book [Jain and Dubes 1988] that deals with computer vision, image processing, and image segmentation. A user looking for literature on computer vision and, in particular, image segmentation will therefore not be able to access this book by searching the database with the help of either the LCC number or the subject categories provided in the database. The LCC number for computer vision books is TA 1632 [LC Classification 1990], which is very different from the number QA 278.J35 assigned to this book.

(2) There is an inherent problem in assigning LCC numbers to books in a rapidly developing area. Consider, for example, the area of neural networks. Initially, category 'QP' in the LCC scheme was used to label books and conference proceedings in this area. For example, Proceedings of the Joint International Conference on Neural Networks [IJCNN'91] was assigned the number 'QP 363.3'. However, most of the recent books on neural networks are given a number using the category label 'QA'; Proceedings of the IJCNN'92 [IJCNN'92] is assigned the number 'QA 76.87'. Multiple labels for books dealing with the same topic force them to be placed on different stacks in a library. Hence, there is a need to update the classification labels from time to time in an emerging discipline.

(3) Assigning a number to a new book is a difficult problem. A book may deal with topics corresponding to two or more LCC numbers, and therefore assigning a unique number to such a book is difficult.

Murty and Jain [1995] describe a knowledge-based clustering scheme to group representations of books, which are obtained using the ACM CR (Association for Computing Machinery Computing Reviews) classification tree [ACM CR Classifications 1994]. This tree is used by the authors contributing to various ACM publications to provide keywords in the form of ACM CR category labels. The tree has 11 nodes at the first level, labeled A to K. Each node in the tree has a label that is a string of one or more symbols. These symbols are alphanumeric characters. For example, I515 is the label of a fourth-level node in the tree.
3.1 Pattern representation

Each book is represented as a generalized list [Sangal 1991] of these strings using the ACM CR classification tree. For brevity of representation, the fourth-level nodes in the ACM CR classification tree are labeled using the numerals 1 to 9 and the characters A to Z. For example, the children of I.5.1 (models) are labeled I.5.1.1 to I.5.1.6. Here, I.5.1.1 corresponds to the node labeled deterministic, and I.5.1.6 stands for the node labeled structural. In a similar fashion, all the fourth-level nodes in the tree can be labeled as necessary. From now on, the dots between successive symbols are omitted to simplify the representation. For example, I.5.1.1 is denoted I511.

This process of representation is illustrated with the help of the book by Jain and Dubes [1988]. There are five chapters in this book. For simplicity of processing, only the information in the chapter contents is considered. There is a single entry in the table of contents for chapter 1, 'Introduction', and so no keywords are extracted from it. Chapter 2, labeled 'Data Representation', has section titles that correspond to the labels of nodes in the ACM CR classification tree [ACM CR Classifications 1994], which are given below:
(1a) I522 (feature evaluation and selection),
(2b) I532 (similarity measures), and
(3c) I515 (statistical).

Based on this analysis, Chapter 2 of Jain and Dubes [1988] can be characterized by the weighted disjunction ((I522 ∨ I532 ∨ I515)(1,4)). The weights (1,4) denote that it is one of the four chapters that play a role in the representation of the book. Based on the table of contents, one or more of the strings I522, I532, and I515 can be used to represent Chapter 2. In a similar manner, the other chapters of this book can be represented as weighted disjunctions based on the table of contents and the ACM CR classification tree. The representation of the entire book, the conjunction of all these chapter representations, is given by (((I522 ∨ I532 ∨ I515)(1,4)) ∧ ((I515 ∨ I531)(2,4)) ∧ ((I541 ∨ I46 ∨ I434)(1,4))).

Currently, these representations are generated manually by scanning the tables of contents of books in the computer science area, since the ACM CR classification tree provides knowledge about computer science books only. The details of the collection of books used in this study are available in Murty and Jain [1995].

3.2 Similarity measure

The similarity between two books is based on the similarity between the corresponding strings. Two well-known distance functions between a pair of strings are [Baeza-Yates 1992] the Hamming distance and the edit distance. Neither of these two distance functions can be meaningfully used in this application. The following example illustrates the point. Consider the three strings I242, I233, and H242.
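The two classical distances can be checked directly on these three strings with a short sketch. The implementations below are the textbook definitions (Hamming distance for equal-length strings, Levenshtein edit distance with unit costs); the function names are chosen here for illustration only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Levenshtein distance with unit insert, delete and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]

pairs = [("I242", "I233"), ("I242", "H242")]
distances = {p: (hamming(*p), edit_distance(*p)) for p in pairs}
```

Both measures count character differences without regard to position in the classification hierarchy, so a mismatch in the leading (top-level) symbol costs no more than a mismatch in the last one.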
These strings are the labels (predicate logic for knowledge representation, logic programming, and distributed database systems) of three fourth-level nodes in the ACM CR classification tree. Nodes I242 and I233 are grandchildren of the node labeled I2 (artificial intelligence), and H242 is a grandchild of the node labeled H2 (database management). The distance between I242 and I233 should therefore be smaller than that between I242 and H242. However, the Hamming distance and the edit distance [Baeza-Yates 1992] both have a value of 2 between I242 and I233 and a value of 1 between I242 and H242. This limitation motivates the definition of a new similarity measure that correctly captures the similarity between the above strings. The similarity between two strings is defined as the ratio of the length of the largest common prefix [Murty and Jain 1995] of the two strings to the length of the first string. For example, the similarity between the strings I522 and I51 is 0.5. The proposed similarity measure is not symmetric, because the similarity between I51 and I522 is 0.67. The minimum and maximum values of this similarity measure are 0.0 and 1.0, respectively. The knowledge of the relationships between nodes in the ACM CR classification tree is captured by the representation in the form of strings. For example, the node labeled pattern recognition is represented by the string I5, whereas the string I53 corresponds to the node labeled clustering. The similarity between these two nodes (I5 and I53) is 1.0. A symmetric measure of similarity [Murty and Jain 1995] is used to construct a similarity matrix of size 100 × 100 corresponding to the 100 books used in the experiments.

3.3 An algorithm for clustering books

The clustering problem can be stated as follows. Given a collection B of books, we need to obtain a set C of clusters. A proximity dendrogram [Jain and Dubes 1988], obtained with the complete-link agglomerative clustering algorithm for the collection of 100 books, is shown in Figure 33. Seven clusters are obtained by choosing a threshold ($\tau$) value of 0.12. It is well known that different values of $\tau$ may give different clusterings. This threshold value is chosen because the "gap" in the dendrogram between the levels at which six and seven clusters are formed is the largest. An examination of the subject areas of the books [Murty and Jain 1995] in these clusters revealed that the clusters obtained are indeed meaningful. Each cluster is represented using a list of pairs of a string s and a frequency $s_f$, where $s_f$ is the number of books in the cluster in which s is present. For example, cluster C1 contains 43 books belonging to pattern recognition, neural networks, artificial intelligence, and computer vision; a part of its representation R(C1) is given below.

R(C1) = ((B718,1), (C12,1), (D0,2), (D311,1), (D312,2), (D321,1), (D322,1), (D329,1), ... (I46,3), (I461,2), (I462,1), (I463,3), ... (J26,1), (J6,1), (J61,7), (J71,1))

These clusters of books and the corresponding cluster descriptions can be used as follows. If a user is searching for books on, say, image segmentation (I46), then cluster C1 is selected, because only its representation contains the string I46. Books B2 (Neurocomputing) and B18 (Neural Networks: Lateral Inhibition) are both members of cluster C1 even though their LCC numbers are quite different (B2 is QA76.5.H4442, B18 is QP363.3.N33).

Four additional books labeled B101, B102, B103, and B104 were used to study the problem of assigning classification numbers to new books. The LCC numbers of these books are: (B101) Q335.T39, (B102) QA76.73.P356C57, (B103) QA76.5.B76C.2, and (B104) QA76.9D5W44. These books are assigned to clusters based on nearest neighbor classification. The nearest neighbor of B101, a book on artificial intelligence, is B23, and so B101 is assigned to cluster C1. It is observed that the assignment of these four books to the respective clusters is meaningful, which demonstrates that knowledge-based clustering is useful in solving problems associated with document retrieval.

4. Data mining

In recent years we have seen ever-increasing volumes of collected data of all sorts. With so much data available, it is necessary to develop algorithms that can extract meaningful information from these vast stores. Searching for useful nuggets of information among huge amounts of data has become known as the field of data mining.

Data mining can be applied to relational, transaction, and spatial databases, as well as to large stores of unstructured data such as the World Wide Web. There are many data mining systems in use today, and applications include the U.S. Treasury detecting money laundering, National Basketball Association coaches detecting trends and patterns of play for individual players and teams, and categorizing patterns of children in the foster care system [Hedberg 1996]. Several journals have recently published special issues on data mining [Cohen 1996, Cross 1996, Wah 1996].

4.1 Data mining approaches

Data mining, like clustering, is an exploratory activity, so clustering methods are well suited to data mining. Clustering is often an important initial step of several in the data mining process [Fayyad 1996]. Some of the data mining approaches that use clustering are database segmentation, predictive modeling, and visualization of large databases.

Segmentation. Clustering methods are used in data mining to segment databases into homogeneous groups. This can serve the purpose of data compression (working with the clusters rather than with individuals), or of identifying characteristics of subpopulations that can be targeted for specific purposes (for example, marketing aimed at senior citizens).
A K-means clustering algorithm [Faber 1994] has been used to cluster pixels in Landsat images [Faber et al. 1994]. Each pixel originally has 7 values from different satellite bands, including infrared. These 7 values are difficult for humans to assimilate and analyze without assistance. Pixels with the 7 feature values are clustered into 256 groups, and each pixel is then assigned the value of the cluster centroid. The image can then be displayed with the spatial information intact. Human viewers can look at a single picture, identify a region of interest (for example, a highway or a forest) and label it as a concept. The system then identifies other pixels in the same cluster as instances of that concept.

Predictive modeling. Statistical methods of data analysis usually involve hypothesis testing of a model the analyst already has in mind. Data mining can help the user discover potential hypotheses before using statistical tools. Predictive modeling uses clustering to group items, then infers rules to characterize the groups and suggest models. For example, magazine subscribers can be clustered on a number of factors (age, sex, income, and so on), and the resulting groups are then characterized in an attempt to find a model that distinguishes subscribers who will renew their subscriptions from those who will not [Simoudis 1996].

Visualization. Clusters in large databases can be used for visualization, to help human analysts identify groups and subgroups that have similar characteristics. WinViz [Lee and Ong 1996] is a data mining visualization tool in which derived clusters can be exported as new attributes, which can then be characterized by the system. For example, breakfast cereals are clustered according to calories, protein, fat, sodium, fiber, carbohydrate, sugar, potassium, and vitamin content per serving. On seeing the resulting clusters, the user can export them to WinViz as attributes. The system shows that one of the clusters is characterized by high potassium content, and the human analyst recognizes the members of the cluster as belonging to the "bran" cereal family, leading to the generalization that "bran cereals are high in potassium."

4.2 Mining large unstructured databases
Data mining has often been performed on transaction and relational databases, which have well-defined fields that can be used as features, but there has been recent research on large unstructured databases such as the World Wide Web [Etzioni 1996].

Examples of recent attempts to classify Web documents using words or functions of words as features include Maarek and Shaul [1996] and Chekuri et al. [1999]. However, relatively small sets of labeled training samples and very large dimensionality limit the ultimate success of automatic Web document categorization based on words as features.

Rather than grouping documents in a word feature space, Wulfekuhler and Punch [1997] cluster the words from a small collection of World Wide Web documents in the document space. The sample data set consisted of 85 documents from the manufacturing domain in 4 different user-defined categories (labor, legal, government, and design). These 85 documents contained 5,190 distinct word stems after common words (the, and, of) were removed. Since the words are certainly not uncorrelated, they should fall into clusters in which words used consistently across the document set have similar frequency values in each document.

K-means clustering was used to group the 5,190 words into 10 groups. One surprising result was that on average 92% of the words fell into a single cluster, which could then be discarded for data mining purposes. The smallest clusters contained terms that seem semantically related to a human. The 7 smallest clusters from a typical run are shown in Figure 34.

Terms that are used in ordinary contexts, or unique terms that do not occur often across the training document set, tend to cluster into the large group of about 4,000 members. This takes care of spelling errors, infrequent proper names, and terms that are used in the same manner throughout the entire document set. Terms used in specific contexts (such as file in the context of filing a patent, rather than a computer file) appear in the documents consistently with other terms appropriate to that context (patent, invent) and thus tend to cluster together. Among the groups of words, unique contexts stand out from the crowd.

After the largest cluster is discarded, the smaller set of features can be used to construct queries for seeking out other relevant documents on the Web using standard Web search tools (for example, Lycos, Alta Vista, Open Text). Searching the Web with terms taken from the word clusters allows the discovery of finer-grained topics (for example, family medical leave) within the broadly defined categories (for example, labor).

4.3 Data mining in geological databases
Database mining is a critical resource in oil exploration and production. It is common knowledge in the oil industry that the typical cost of drilling a new offshore well is in the range of $30-40 million, but the chance of that site being an economic success is 1 in 10. More informed and systematic drilling decisions can significantly reduce overall production costs.

Advances in drilling technology and data collection methods have led oil companies and their ancillaries to collect large amounts of geophysical and geological data from production wells and exploration sites, and then to organize them into large databases. Data mining techniques have recently been used to derive precise analytic relations between observed phenomena and parameters. These relations can then be used to quantify oil and gas reserves.

In qualitative terms, good recoverable reserves have high hydrocarbon saturation, are trapped by highly porous sediments (reservoir porosity), and are surrounded by hard bulk rocks that prevent the hydrocarbon from leaking away. A large volume of porous sediments is crucial to finding good recoverable reserves, so developing reliable and accurate methods for estimating sediment porosities from the collected data is key to estimating hydrocarbon potential. The general rule of thumb that experts use for porosity computation is that it is a quasi-exponential function of depth:

$$\text{Porosity} = K \cdot e^{-F(x_1, x_2, \ldots, x_m) \cdot \text{Depth}}. \qquad (4)$$

A number of factors, such as rock types, structure, and cementation, which appear as parameters of the function F, confound this relationship. This makes it necessary to define proper contexts in which to attempt the discovery of porosity formulas. Geological contexts are expressed in terms of geological phenomena, such as geometry, lithology, compaction, and subsidence, associated with a region. It is well known that the geological context changes from basin to basin (different geographical areas in the world) and also from region to region within a basin [Allen and Allen 1990; Biswas 1995]. Furthermore, the underlying features of contexts may vary greatly. Simple model matching techniques, which work in engineering domains where behavior is constrained by man-made systems and well-established laws of physics, may not apply in the hydrocarbon exploration domain. To address this, data clustering was used to identify the relevant contexts, and equation discovery was then carried out within each context.
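Equation (4) becomes a straight line in log space once F is held fixed, which is what makes regression within a single context workable. The sketch below rests on a strong simplifying assumption that is not part of the cited method: inside one context, F is treated as a single constant, so that ln(Porosity) = ln(K) - F · Depth can be fitted by ordinary least squares. The depth and porosity values are synthetic.

```python
import math

def fit_porosity(depths, porosities):
    """Least-squares fit of ln(porosity) = ln(K) - F * depth; returns (K, F)."""
    n = len(depths)
    ys = [math.log(p) for p in porosities]
    mx = sum(depths) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in depths)
    sxy = sum((x - mx) * (y - my) for x, y in zip(depths, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope

# Synthetic samples generated from K = 0.45 and F = 0.0004 (assumed values).
depths = [500.0, 1000.0, 1500.0, 2000.0]
porosities = [0.45 * math.exp(-0.0004 * d) for d in depths]
K, F = fit_porosity(depths, porosities)
```

On noise-free synthetic data the fit recovers the generating constants; with real well-log samples the residuals of such a fit would indicate how well a single context explains the data.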
The goal was to derive the subset $x_1, x_2, \ldots, x_m$ from a larger set of geological features, together with the functional relationship F that best defines the porosity function in a region.

The overall methodology, illustrated in Figure 35, consists of two primary steps: (i) context definition using unsupervised clustering techniques, and (ii) equation discovery by regression analysis [Li and Biswas 1995]. Real exploration data collected from a region in the Alaska basin were analyzed using the methodology developed. The data objects (patterns) are described in terms of 37 geological features, such as porosity, permeability, grain size, density, and sorting, the amount of different mineral fragments present (for example, quartz, chert, feldspar), the nature of the rock fragments, pore characteristics, and cementation. All these feature values are numeric measurements made on samples obtained from well logs during exploratory drilling.

The K-means clustering algorithm was used to identify a set of homogeneous primitive geological structures $(g_1, g_2, \ldots, g_m)$. These primitives were then mapped onto the unit code versus stratigraphic unit map. Figure 36 depicts a partial mapping for a set of wells and four primitive structures. The next step in the discovery process identified sections of well regions that were made up of the same sequence of geological primitives. Each sequence defined a context $C_i$. From the partial mapping of Figure 36, the context $C_1 = g_2 \circ g_1 \circ g_2 \circ g_3$ was identified in two well regions (the 300 and 600 series). After the contexts were defined, the data points belonging to each context were grouped together for equation derivation. The derivation procedure used regression analysis [Sen and Srivastava 1990].

This method was applied to a data set of about 2,600 objects corresponding to sample measurements collected from wells in the Alaska basin. K-means clustered this data set into seven groups. As an illustration, a set of 138 objects representing one context was selected for further analysis. The features that best defined this cluster were selected, and experts surmised that the context represented a low-porosity region, which was modeled using the regression procedure.

4.4 Summary

There are many applications in which decision making and exploratory pattern analysis have to be performed on large data sets. For example, in document retrieval, a set of relevant documents has to be found among several million documents with a dimensionality of more than 1,000. Such problems can be handled if a useful abstraction of the data is obtained and used in decision making, rather than the entire data set being used directly. By data abstraction we mean a simple and compact representation of the data. This simplicity helps the machine to process the data efficiently, or a human to comprehend the structure in the data easily.
Thut ton phn cm d liurt l tng cho vic t c cc d liu tru tng. -87- Trongbiny,chngtakimtraccbckhcnhautrongphn nhm: (1) m hnh i din, (2) tnh ton tng t, (3) nhm quy trnh, v (4)i din cm. Ngoi ra, cng cp nn thng k, m, thn kinh, tin ha, v kin thc da trn phng php tip cn phn cm d liu. Chng ta c bn m t cc ng dng ca phn nhm: (1) Phn on nh, (2) nhn din i tng, (3) truy hi ti liu, v (4) khai ph d liu.
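The four steps listed above can be sketched end to end on a toy data set. This is only a minimal illustration under stated assumptions: min-max scaling stands in for pattern representation, Euclidean distance for similarity, a plain K-means loop for grouping, and the final centroids for cluster representation. The helper names, the sample points, and the choice of initial centroids are all hypothetical.

```python
import math

def normalize(points):
    """Step 1 (pattern representation): rescale each feature to [0, 1]."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return [
        tuple((p[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0 for d in dims)
        for p in points
    ]

def distance(a, b):
    """Step 2 (similarity computation): Euclidean distance."""
    return math.dist(a, b)

def k_means(points, initial_centroids, max_iter=100):
    """Step 3 (grouping): assign points to the nearest centroid until stable."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    # Step 4 (cluster representation): each cluster is summarized by its centroid.
    return labels, centroids

raw = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
pts = normalize(raw)
labels, centers = k_means(pts, [pts[0], pts[3]])
```

The centroids returned at the end are the compact abstraction of the data that the summary refers to: six points are reduced to two representative positions.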

Figure 36. Area code versus stratigraphic unit map for part of the studied region.

Clustering is the process of grouping data items based on a measure of similarity. Clustering is a subjective process: the same set of data items often needs to be partitioned differently for different applications. This subjectivity makes clustering difficult, because no single algorithm or approach is adequate to solve every clustering problem. A possible solution is to reflect this subjectivity in the form of knowledge. This knowledge is used, either implicitly or explicitly, in one or more phases of clustering. Knowledge-based clustering algorithms use knowledge explicitly.

The most difficult step in clustering is feature extraction, or pattern representation. Pattern recognition researchers conveniently avoid this step by assuming that the pattern representations are available as input to the clustering algorithm. For small data sets, pattern representations can be obtained from the user's previous experience with the problem. For large data sets, however, it is difficult for the user to keep track of the importance of each feature in clustering. One solution is to make as many measurements on the patterns as possible and use them in the pattern representation. But a large collection of measurements cannot be used directly in clustering because of the computational cost. Several feature extraction and selection approaches have therefore been designed to obtain linear or nonlinear combinations of these measurements that can be used to represent the patterns. Most of the schemes proposed for feature extraction and selection are iterative in nature and cannot be used on large data sets because of their computational cost.

The second step in clustering is similarity computation. A variety of schemes have been used to compute the similarity between two patterns. They use knowledge either implicitly or explicitly. Most knowledge-based clustering algorithms use explicit knowledge in similarity computation. However, if the patterns are not represented using proper features, a meaningful partition cannot be obtained, regardless of the quality and quantity of the knowledge used in similarity computation. There is no universally accepted scheme for computing the similarity between patterns represented by a mixture of qualitative and quantitative features. The dissimilarity between a pair of patterns is represented by a distance measure that may or may not be a metric.

The next step in clustering is the grouping step.
There are two broad families of grouping schemes: hierarchical and partitional. Hierarchical schemes are more versatile, and partitional schemes are less expensive. Partitional algorithms aim to minimize the squared-error criterion function. Motivated by the failure of squared-error partitional clustering algorithms to find the optimal solution to this problem, a large collection of approaches has been proposed and used to obtain a globally optimal solution. However, these schemes are computationally prohibitive on large data sets. Clustering schemes based on artificial neural networks (ANNs) are neural implementations of the clustering algorithms, and they share the undesirable properties of those algorithms. ANNs do, however, have the ability to normalize the data and extract features automatically. An important observation is that even if a scheme can find the optimal solution to the squared-error partitioning problem, it may still fall short of the requirements because the clusters may be non-isotropic.

In some applications, for example document retrieval, it may be useful to have a clustering that is not a partition, which means that the clusters overlap. Fuzzy clustering and functional clustering are ideally suited to this purpose. Fuzzy clustering algorithms can also handle mixed data types. However, a major problem with fuzzy clustering is that the membership values are difficult to obtain. A general approach may not work because of the subjective nature of clustering. The clusters obtained must be represented in a suitable form to help the decision maker. Knowledge-based clustering schemes produce intuitively appealing descriptions of clusters. They can be used even when the patterns are represented by a combination of qualitative and quantitative features, provided that knowledge linking a concept to the mixed features is available. However, implementations of conceptual clustering schemes are computationally expensive and are not suitable for grouping large data sets.

The K-means algorithm and its neural implementation, the Kohonen net, are the schemes most successfully used on large data sets. This is because K-means is simple to implement and computationally attractive owing to its linear time complexity. Even this linear-time algorithm, however, is not feasible on large data sets. Incremental algorithms such as the leader algorithm and its neural implementation, the ART network, can be used to cluster large data sets, but they tend to be order-dependent. Divide and conquer is a heuristic that has been rightly exploited in computer algorithm design to reduce computational cost. It should, however, be used judiciously in clustering in order to obtain meaningful results.

In summary, clustering is an interesting, useful, and challenging problem.
It has great potential in applications such as object recognition, image segmentation, and information filtering and retrieval. However, this potential can be exploited only after several design choices have been made carefully.

CONCLUSION

Topics studied in the thesis

The thesis surveys and studies the basic theory and practical applications of data clustering. Information technology is developing at a tremendous pace, and information databases are growing with it. The study, refinement, and application of data clustering methods and techniques is therefore both necessary and highly significant.

Chapter 1 presents an overview of the theory of data clustering, together with some theory directly related to data mining. Chapter 2 gives a general introduction to data clustering algorithms. There are a great many such algorithms, and the thesis covers only some of the popular, commonly used ones. Chapter 3 discusses some typical applications of data clustering, such as image segmentation, character and object recognition, information retrieval, and data mining.

FUTURE DEVELOPMENT OF THE TOPIC

Data clustering and its applications are a necessary and important research direction. It is also a very broad field that covers many methods and techniques and has formed many different groups. Although every effort was made during the research and writing of this thesis to focus on the topic and to consult many documents, papers, and scientific journals from Vietnam and abroad, the author's limited expertise means that shortcomings and limitations are unavoidable. The author would be very grateful for further guidance and contributions from teachers and scientists.

DIRECTIONS FOR FURTHER RESEARCH

- Continue to study the theory of data clustering.
- Build and develop further techniques and applications of data clustering.

APPENDIX: BUILDING A DATA CLUSTERING PROGRAM WITH THE K-MEANS ALGORITHM IN VISUAL BASIC 6.0

Program interface:

* The user chooses the number of clusters and then clicks at random inside the frame to enter (X, Y) data. The program forms clusters by minimizing the squared distance between each data point and the corresponding cluster centroid. Each point represents one object, and its (X, Y) coordinates describe two attributes of that object. The color of the point and the label number indicate the cluster it belongs to.

* The K-means clustering algorithm works as follows:

If the number of data points is smaller than the number of clusters, each data point is assigned as the centroid of a cluster, and each centroid receives a cluster number. If the number of data points is larger than the number of clusters, then for each data point we compute the distance to every centroid and take the minimum. The data point is said to belong to the cluster whose centroid is at minimum distance from it.

Because we are not sure of the position of the centroids, we need to adjust each centroid position based on the data currently assigned to it. We then reassign all the data to the new centroids. This process is repeated until no data point moves to another cluster. Mathematically, this loop can be proved to converge.

Example after running the program with the number of clusters = 9

Program source code

Option Explicit

Private Data()                 ' Row 0 = cluster, 1 = X, 2 = Y; data points are stored in columns
Private Centroid() As Single   ' Centroids (X and Y) of the clusters; number of clusters = number of columns
Private totalData As Integer   ' Total number of data points (total number of columns)
Private numCluster As Integer  ' Total number of clusters

' ##############################################################
' Form controls
' + Form_Load
' + cmdReset_Click
' + txtNumCluster_Change
' + Picture1_MouseDown
' + Picture1_MouseMove
' ##############################################################

Private Sub Form_Load()
    Dim i As Integer
    Picture1.BackColor = &HFFFFFF   ' Set the background color to white
    Picture1.DrawWidth = 10         ' Size of each point
    Picture1.ScaleMode = 3          ' Pixels
    ' Read the number of clusters
    numCluster = Int(txtNumCluster)
    ReDim Centroid(1 To 2, 1 To numCluster)
    For i = 0 To numCluster - 1
        ' Create the labels
        If i > 0 Then Load lblCentroid(i)
        lblCentroid(i).Caption = i + 1
        lblCentroid(i).Visible = False
    Next i
End Sub

Private Sub cmdReset_Click()
    ' Refresh the data
    Dim i As Integer
    Picture1.Cls        ' Clear the picture
    Erase Data          ' Delete the data
    totalData = 0
    For i = 0 To numCluster - 1
        lblCentroid(i).Visible = False   ' Hide the labels
    Next i
    ' Allow the number of clusters to be changed
    txtNumCluster.Enabled = True
End Sub

Private Sub txtNumCluster_Change()
    ' Change the number of clusters and reset the data
    Dim i As Integer
    For i = 1 To numCluster - 1
        Unload lblCentroid(i)
    Next i
    numCluster = Int(txtNumCluster)
    ReDim Centroid(1 To 2, 1 To numCluster)
    ' Call the cmdReset_Click event
    For i = 0 To numCluster - 1
        If i > 0 Then Load lblCentroid(i)
        lblCentroid(i).Caption = i + 1
        lblCentroid(i).Visible = False
    Next i
End Sub

Private Sub Picture1_MouseDown(Button As Integer, Shift As Integer, X As Single, Y As Single)
    ' Collect the data and display the result
    Dim colorCluster As Integer
    Dim i As Integer
    ' Disable changing the number of clusters
    txtNumCluster.Enabled = False
    ' Add the new data point
    totalData = totalData + 1
    ReDim Preserve Data(0 To 2, 1 To totalData)   ' Note: rows start at 0
    Data(1, totalData) = X
    Data(2, totalData) = Y
    ' Run k-means clustering
    Call kMeanCluster(Data, numCluster)
    ' Display the result
    Picture1.Cls
    For i = 1 To totalData
        colorCluster = Data(0, i) - 1
        ' If the color is white (same as the background), switch to another color
        If colorCluster = 7 Then colorCluster = 12
        X = Data(1, i)
        Y = Data(2, i)
        Picture1.PSet (X, Y), QBColor(colorCluster)
    Next i
    ' Show the centroids
    For i = 1 To min2(numCluster, totalData)
        lblCentroid(i - 1).Left = Centroid(1, i)
        lblCentroid(i - 1).Top = Centroid(2, i)
        lblCentroid(i - 1).Visible = True
    Next i
End Sub

Private Sub Picture1_MouseMove(Button As Integer, Shift As Integer, X As Single, Y As Single)
    lblXYValue.Caption = X & "," & Y
End Sub

' ##############################################################
' FUNCTIONS
' + kMeanCluster:
' + dist: distance calculation
' + min2: returns the smaller of two numbers
' ##############################################################

Sub kMeanCluster(Data() As Variant, numCluster As Integer)
    ' Main function: cluster the data into k clusters
    ' input:  + Data matrix (0 to 2, 1 to TotalData); Row 0 = cluster, 1 = X, 2 = Y; data in columns
    '         + numCluster: number of clusters the user wants the data split into
    '         + Module-level variables: Centroid, TotalData
    ' output: o) The centroids are updated
    '         o) The cluster number is assigned to each data point (= row 0 of Data)
    Dim i As Integer
    Dim j As Integer
    Dim X As Single
    Dim Y As Single
    Dim min As Single
    Dim cluster As Integer
    Dim d As Single
    Dim sumXY()
    Dim isStillMoving As Boolean
    isStillMoving = True
    If totalData