Phanloai Van Ban TRINH QUOC SON CH0401047

Data Mining Course Essay: Research on Vietnamese Text Classification
Student: Trịnh Quốc Sơn - CH0401047

CHAPTER 1: OVERVIEW

Today, the information explosion driven by hypermedia and the World Wide Web (WWW) makes the data space grow continuously, which challenges information retrieval systems to remain effective. One difficulty these systems commonly face is the very high rate at which information is updated. Paper-based transactions are gradually being digitized because of the clear advantages of digital documents: they can be stored for a long time and can be updated, edited, and searched quickly. As a result, the number of digital documents is growing exponentially, and the demand for document search grows with it, so automatic text classification has become an urgent requirement.

Text classification helps us find information faster than scanning documents one by one. With the number of documents increasing rapidly, sequential scanning costs a great deal of time and effort, is tedious, and is not feasible. Automatic text classification is therefore a real necessity.

Many studies on text classification have already produced encouraging results, using methods such as Support Vector Machine, K-Nearest Neighbor, Linear Least Squares Fit, Neural Network, Naive Bayes, and Centroid-based classification. What these methods have in common is that they rely on statistical probability or on the weights of words and phrases in the document. Each method has its own computation, but all of them go through some common steps. First, information about the occurrence of words in a document (for example, frequency in the document collection) is used to represent the document as a vector. Then, depending on the specific problem, we decide which method and which formula to apply in order to classify the collection from the vectors built in the first step, with the aim of obtaining the best classification result.

CHAPTER 2: APPROACHES TO TEXT CLASSIFICATION

Along with other research directions in text processing and information extraction, such as clustering and text summarization, automatic text classification has attracted attention for many years. To classify text, studies usually rely on keywords, on word semantics, on rough sets, or on other models.

I. Document representation

As described above, the first step of the text classification process is to convert a document, given as a sequence of words, into another model that suits the classification algorithms. Documents are usually represented with the vector model. The idea is that each document $D_i$ is represented as a pair $(\vec{d_i}, i)$, where $i$ is the index identifying the document and $\vec{d_i}$ is its feature vector:

$$\vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{in})$$

where $n$ is the number of features of the document vector and $w_{ij}$ is the weight of the $j$-th feature, $j \in \{1, 2, \ldots, n\}$.

An important issue when representing documents as feature vectors is the choice of features and the dimensionality of the vector space. How many words should be selected, which words, and by what method? These questions must be answered when converting documents into vectors. Several approaches exist, notably Information Gain, DF Thresholding, and Term Strength. The Information Gain method uses the MI (Mutual Information) measure to select the set of feature keywords with the highest MI values.
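The vector model and DF Thresholding described above can be sketched in a few lines. This is an illustration only, not the essay's implementation: it assumes the documents are already tokenized, uses raw term frequency as the weight $w_{ij}$, and keeps a term when it occurs in at least `min_df` documents. The function names, the cutoff `min_df`, and the sample tokens are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, min_df=2):
    """Keep terms found in at least `min_df` documents (DF thresholding)."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))  # count a term once per document
    return sorted(t for t, df in document_frequency.items() if df >= min_df)


def to_vector(tokens, vocabulary):
    """Build d_i = (w_i1, ..., w_in); here w_ij is the raw term frequency."""
    term_frequency = Counter(tokens)
    return [term_frequency[term] for term in vocabulary]


# Hypothetical, already-tokenized documents (diacritics omitted for brevity).
docs = [
    ["bong", "da", "viet", "nam"],
    ["bong", "da", "the", "gioi"],
    ["kinh", "te", "viet", "nam"],
]
vocabulary = build_vocabulary(docs, min_df=2)
vectors = [to_vector(tokens, vocabulary) for tokens in docs]
```

With this cutoff, only terms shared by at least two documents become dimensions, so rare terms do not inflate the vector space.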
Which feature selection method to choose depends on how well the method, and the measure it uses, suit the problem being solved. For example, if the documents are web pages, the feature selection method will differ from the one used for other kinds of documents.

Characteristics of documents represented as vectors:
- The number of dimensions of the feature space is usually large.
- The features are independent of one another.
- The features are sparse: the feature vector $\vec{d_i}$ may have many zero components because many features do not occur in document $d_i$ (when binary values 1 and 0 are used to mark whether a feature occurs in the document). A purely binary representation limits classification quality, because a feature may be absent from a document that nevertheless contains a different keyword with the same meaning. Another approach therefore uses real-valued weights instead of binary 0/1 values, which reduces the sparsity of the document vector somewhat.

II. Text classification methods

II.1. SVM (Support Vector Machine) method

SVM is a highly effective classification method introduced by Vapnik in 1995. The training set is represented in a vector space in which each document is a point. The method finds the best decision hyperplane h that separates the points into two classes, called class + (positive) and class - (negative). The quality of the hyperplane is determined by the distance (called the margin) from the nearest data point of each class to the hyperplane. The larger the margin, the better the points are separated into two classes, and the better the classification result.
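To make the margin concrete, the following sketch compares two candidate hyperplanes on the same data. It is an illustration under stated assumptions, not an SVM trainer: the hyperplane is assumed to be written as w·d + b = 0, labels are +1 or -1, and the function names and sample points are hypothetical.

```python
import math


def signed_distance(w, b, d):
    """Signed distance from point d to the hyperplane w.d + b = 0."""
    dot = sum(wi * di for wi, di in zip(w, d))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))


def margin(w, b, points, labels):
    """Smallest label-signed distance over all points.

    The value is positive only if every point lies on the side of the
    hyperplane that matches its label (+1 or -1).
    """
    return min(y * signed_distance(w, b, d) for d, y in zip(points, labels))


# Hypothetical 2-D training points and labels.
points = [(2.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (-4.0, 2.0)]
labels = [+1, +1, -1, -1]

margin_a = margin((1.0, 0.0), 0.0, points, labels)   # hyperplane x = 0
margin_b = margin((1.0, 0.0), -0.5, points, labels)  # hyperplane x = 0.5
```

Both hyperplanes separate the two classes, but the second leaves a larger gap to the nearest point, which is the criterion SVM optimizes.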
The goal of the SVM algorithm is to find the largest margin in order to produce a good classification result.

SVM is in essence an optimization problem: the goal is to find a space H and a decision hyperplane h on H such that the classification error is as low as possible.

The equation of the hyperplane containing a vector $\vec{d_i}$ in the space is:

$$\vec{w} \cdot \vec{d_i} + b = 0$$

The decision function is:

$$h(\vec{d_i}) = \operatorname{sign}(\vec{w} \cdot \vec{d_i} + b) = \begin{cases} +1, & \vec{w} \cdot \vec{d_i} + b > 0 \\ -1, & \vec{w} \cdot \vec{d_i} + b < 0 \end{cases}$$

Thus $h(\vec{d_i})$ expresses the assignment of the vector $\vec{d_i}$ to one of the two classes. Let $y_i$ take the value +1 or -1: when $y_i = +1$ the document corresponding to $\vec{d_i}$ belongs to class +, otherwise it belongs to class -. To obtain the hyperplane h we solve the following problem: minimize $\|\vec{w}\|$ over $\vec{w}$ and $b$ subject to

$$\forall i \in \{1, \ldots, n\}: \quad y_i \, (\vec{w} \cdot \vec{d_i} + b) \ge 1$$

The SVM decision surface depends only on the support vectors, which lie at distance $1 / \|\vec{w}\|$ from the decision hyperplane. If all other points are removed, the algorithm still gives the same result. This property distinguishes SVM from algorithms such as kNN, LLSF, NNet, and NB, in which all the data in the training set are used to optimize the result.

II.2. K-Nearest Neighbor (kNN) method

kNN is a well-known traditional method in the statistical approach and has been studied for many years. It is regarded as one of the best methods used since the early days of text classification research.

The idea is that, to classify a new document, the algorithm computes the distance (Euclidean, cosine, Manhattan, and so on) from every document in the training set to the new document and finds the k closest documents, called the k nearest neighbors. These values are then used to weight the topics. The weight of a topic is the sum of the values of the neighbors that belong to that topic; a topic that does not appear among the k neighbors has weight 0. The topics are then sorted by decreasing weight, and the topics with the highest weights are chosen as the topics of the document.

The weight of topic $c_j$ for document $\vec{x}$ is computed as follows:

$$W(\vec{x}, c_j) = \sum_{\vec{d_i} \in \{kNN\}} \operatorname{sim}(\vec{x}, \vec{d_i}) \cdot y(\vec{d_i}, c_j) - b_j$$

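In the weight formula $W(\vec{x}, c_j)$, $y(\vec{d_i}, c_j) \in \{0, 1\}$, with:
- y = 0: document $\vec{d_i}$ does not belong to topic $c_j$
- y = 1: document $\vec{d_i}$ belongs to topic $c_j$

$\operatorname{sim}(\vec{x}, \vec{d_i})$ is the similarity between the document to classify $\vec{x}$ and the document $\vec{d_i}$. The cosine measure can be used:

$$\operatorname{sim}(\vec{x}, \vec{d_i}) = \cos(\vec{x}, \vec{d_i}) = \frac{\vec{x} \cdot \vec{d_i}}{\|\vec{x}\| \, \|\vec{d_i}\|}$$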
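The kNN topic weighting with cosine similarity can be sketched as follows. This is a minimal illustration, not the essay's implementation: it assumes dense vectors of equal length, represents each training document as a (vector, set of topics) pair, and treats the per-topic thresholds $b_j$ as an optional dictionary that defaults to 0. All names and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def cosine(x, d):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(x, d))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_x * norm_d) if norm_x and norm_d else 0.0


def knn_topic_weights(x, training, k, thresholds=None):
    """Return {topic: W(x, c_j)} for topics found among the k nearest neighbors.

    Topics that do not occur among the neighbors are simply absent,
    which corresponds to a weight of 0 before thresholding.
    """
    thresholds = thresholds or {}
    neighbors = sorted(training, key=lambda item: cosine(x, item[0]),
                       reverse=True)[:k]
    weights = defaultdict(float)
    for vector, topics in neighbors:
        similarity = cosine(x, vector)
        for topic in topics:          # y(d_i, c_j) = 1 for these topics
            weights[topic] += similarity
    return {t: w - thresholds.get(t, 0.0) for t, w in weights.items()}


# Hypothetical training set: (feature vector, topics of the document).
training = [
    ([1, 1, 0, 0], {"sport"}),
    ([1, 0, 0, 0], {"sport"}),
    ([0, 0, 1, 1], {"economy"}),
    ([0, 0, 1, 0], {"economy"}),
]
weights = knn_topic_weights([1, 1, 0, 0], training, k=3)
best_topic = max(weights, key=weights.get)
```

Sorting the whole training set is the simplest choice for a sketch; it makes visible that kNN defers all work to classification time.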

The threshold $b_j$ of topic $c_j$ is learned automatically using a validation set of documents selected from the training set.

To choose the best value of k, the algorithm must be run experimentally with many different values of k. In general, the larger k is, the more stable the algorithm and the lower the error.

II.3. Naive Bayes (NB) method

NB is a probability-based classification method widely used in machine learning and in other areas such as search engines and mail filters.

The basic idea is to use the conditional probabilities between words or phrases and topics to predict the topic probability of the document to classify. The key point of the method is the assumption that the occurrences of all words in a document are independent of one another. As a consequence, NB cannot exploit the dependence of several words on a specific topic. This same assumption makes NB more efficient and faster than methods with exponential complexity, because it does not use combinations of words to predict the topic.

The main task is to compute $\Pr(C_j, d')$, the probability that document $d'$ belongs to class $C_j$. By Bayes' rule, document $d'$ is assigned to the class $C_j$ with the highest probability $\Pr(C_j, d')$, computed as follows:

$$H_{BAYES}(d') = \arg\max_{C_j \in C} \frac{\Pr(C_j) \cdot \prod_{i=1}^{|d'|} \Pr(w_i \mid C_j)}{\sum_{C' \in C} \Pr(C') \cdot \prod_{i=1}^{|d'|} \Pr(w_i \mid C')}$$

where:
- $TF(w_i, d')$ is the number of occurrences of word $w_i$ in document $d'$
- $|d'|$ is the number of words in document $d'$
- $w_i$ is a word in the feature space $F$, whose dimensionality is $|F|$
- $\Pr(C_j)$ is computed from the proportion of documents of each class in the training data:

$$\Pr(C_j) = \frac{\|C_j\|}{\|C\|} = \frac{\|C_j\|}{\sum_{C' \in C} \|C'\|}$$

- $\Pr(w_i \mid C_j)$ is estimated as:

$$\Pr(w_i \mid C_j) = \frac{1 + TF(w_i, C_j)}{|F| + \sum_{w' \in F} TF(w', C_j)}$$

There are also other NB variants, such as ML Naive Bayes, MAP Naive Bayes, and Expected Naive Bayes. Naive Bayes is a very effective tool in some cases, but the results can be very poor if the training data are scarce and the prediction parameters (such as the feature space) are of low quality.

Overall, NB is a linear classification algorithm suited to multi-topic text classification. Its advantages are that it is simple to implement, runs fast, is easy to update with new training data, and is highly independent of the training set.

II.4. Linear Least Square Fit (LLSF) method

LLSF is a mapping approach developed by Yang and Chute in 1992. It was first tested on synonym identification and then applied to classification in 1994. Experiments show that the classification performance of LLSF can match that of the classic kNN method.

The idea of LLSF is to use regression to learn from the training set and its existing topics.
The training set is represented as pairs of input and output vectors:
- The input vector is a document, consisting of words and their weights.
- The output vector consists of the topics, with binary weights, of the document that corresponds to the input vector.

Solving the equation over the input/output vector pairs yields a word-topic matrix of regression coefficients. The method uses the formula:

$$F_{LS} = \arg\min_{F} \|FA - B\|^2$$

where:
- A and B are the matrices representing the training data (their columns are the input and output vectors, respectively).
- $F_{LS}$ is the resulting matrix, which defines a mapping from an arbitrary document to a vector of weighted topics.

Sorting the topics by weight gives a ranked list of topics that can be assigned to the document. Applying a threshold to the topic weights then selects the suitable topics for the input document. The system learns the optimal threshold for each topic automatically, as in kNN. Although LLSF and kNN differ statistically, the two methods share this step of learning optimal thresholds.

II.5. Centroid-based vector method

This is a simple classification method that is easy to implement and fast, with linear complexity O(n). The idea is that each class in the training data is represented by a centroid vector. The class of an arbitrary document is determined by finding the centroid vector closest to the vector of the test document. The class of the document is the class that this centroid represents, and distance is measured with the cosine measure.

The centroid vector of class i is computed as:

$$\vec{C_i} = \frac{1}{|\{i\}|} \sum_{\vec{d_j} \in \{i\}} \vec{d_j}$$

The distance measure between a vector $\vec{x}$ and $\vec{C_i}$ is:

$$\cos(\vec{x}, \vec{C_i}) = \frac{\vec{x} \cdot \vec{C_i}}{\|\vec{x}\| \, \|\vec{C_i}\|}$$
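The centroid-based classifier can be sketched directly from the two centroid formulas. This is a minimal illustration, assuming dense vectors of equal length and training documents grouped by topic in a dictionary. The names `centroid`, `classify`, and the sample data are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of the vectors of one class (the vector C_i)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(x, c):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(x, c))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_c = math.sqrt(sum(b * b for b in c))
    return dot / (norm_x * norm_c) if norm_x and norm_c else 0.0


def classify(x, docs_by_topic):
    """Return the topic whose centroid has the highest cosine with x."""
    centroids = {topic: centroid(vs) for topic, vs in docs_by_topic.items()}
    return max(centroids, key=lambda topic: cosine(x, centroids[topic]))


# Hypothetical training vectors grouped by topic.
docs_by_topic = {
    "sport": [[1, 1, 0, 0], [1, 0, 0, 0]],
    "economy": [[0, 0, 1, 1], [0, 0, 1, 0]],
}
predicted = classify([1, 0, 0, 1], docs_by_topic)
```

In practice the centroids would be computed once after training rather than on every call; they are recomputed here only to keep the sketch short.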

In the centroid formulas:
- $\vec{x}$ is the vector of the document to classify
- $\{i\}$ is the set of documents belonging to topic $C_i$
- The topic of $\vec{x}$ is the $C_x$ that satisfies $\cos(\vec{x}, \vec{C_x}) = \max_i \cos(\vec{x}, \vec{C_i})$.

III. Conclusion

The classification algorithms above, from the two-class algorithm (SVM) to the multi-class algorithms (such as kNN), all require documents to be represented as feature vectors. In addition, algorithms such as kNN, NB, and LLSF must estimate parameters and optimal thresholds when classifying documents, whereas SVM can determine these optimal parameters itself while the algorithm runs.

In terms of time, the methods have different training times. kNN, NB, and LLSF train and classify faster than the other algorithms and are also easier to implement.

What is needed to obtain a good classification result? Three important factors affect the result of text classification:

1) A standard and sufficiently large training set for the algorithm to learn from. With such a data set, training goes well and the classifier gives good results afterwards.

2) Most of the methods above use the vector model to represent documents, so the word segmentation method plays an important role in building the vectors. In some languages, such as English, word segmentation simply relies on whitespace. In languages whose words may span several syllables, such as Vietnamese, splitting on whitespace is not accurate, so the segmentation method is an important factor.

3) The classification algorithm must have a reasonable processing time, covering both learning time and classification time. It should also be incremental: when new documents are added to the data set it should classify only the new documents instead of reclassifying the whole collection. The algorithm must also be able to reduce noise when classifying documents.

CHAPTER 3: APPROACHES TO WORD SEGMENTATION

I. WORD-BASED APPROACHES

The word-based approach aims to extract complete words from a sentence.
The word-based approach can be divided into three directions: statistics-based, dictionary-based, and hybrid (combining several methods in the hope of gaining the advantages of each).

Statistics-based approach: This approach relies on information such as the frequency of words in the initial training set. Because it depends specifically on the training corpus, it is flexible and useful in many different domains.

Dictionary-based approach: The idea is that the phrases extracted from the text must be matched against the words in a dictionary. This approach therefore requires a separate dictionary for each domain of interest. The "full word / phrase" approach needs a complete dictionary in order to extract all the words or phrases in the text, while the "component" approach uses a component dictionary. A component dictionary contains only the components of words and phrases, such as morphemes and simple words.

The dictionary-based approach still has limitations because segmentation depends entirely on the dictionary. With a complete dictionary, the practical difficulty is that building one takes a great deal of time and effort. A component dictionary reduces this difficulty, because complete words and phrases are formed from morphemes, simple words, and other words.

Hybrid approach: The purpose is to combine different approaches in order to inherit the advantages of several techniques and improve the results. This approach usually combines the statistics-based and dictionary-based directions to exploit the strengths of both. However, the hybrid approach costs more processing time and disk space and is more expensive overall.

II. CHARACTER-BASED APPROACHES

In Vietnamese, the smallest morpheme is the syllable ("tiếng"), which is formed from several characters of the alphabet.
The character-based approach simply extracts a fixed number of syllables from the text, for example one unit (unigram) or several units (n-gram). It has produced results demonstrated in several published studies. Lê An Hà [2003] built a 10MB raw corpus and used dynamic programming to maximize the occurrence probability of phrases. H. Nguyễn [2005] took a different direction: instead of a raw corpus, the work treats the Internet as a huge corpus, collects statistics from it, and uses a genetic algorithm to find the best segmentation. Other authors have published related work. Comparing the results of Lê An Hà and H. Nguyễn shows that H. Nguyễn's work segments better but takes longer to process.

The main advantages of the n-gram approach are its simplicity and ease of application. It also costs little for indexing and for processing many queries. Across published studies, n-gram segmentation, and specifically two-character (bigram) segmentation, is considered a suitable choice.

III. CURRENT VIETNAMESE WORD SEGMENTATION METHODS

III.1. Maximum Matching method: Forward / Backward

Maximum Matching (MM) is also called LRMM (Left Right Maximum Matching). In this method we scan a phrase or sentence from left to right, choose the word with the most syllables that appears in the dictionary, and repeat until the end of the sentence.

Simple form: this form resolves single-word ambiguity. Suppose we have a string of characters C1, C2, ..., Cn. We apply the method from the start of the string. First we check whether C1 is a word, then whether C1C2 is a word, and we continue in this way until the longest word is found.

Complex form: the rule of this form is chunking, and a chunk of three words with maximum length is usually chosen. The algorithm starts from the simple form. When it detects an ambiguous segmentation, for example when C1 is a word and C1C2 is also a word, it looks ahead in the string C1, C2, ..., Cn to find all three-word chunks that begin with C1 or with C1C2.

Example: suppose we obtain the following chunks:
- C1 C2C3 C4
- C1C2 C3C4 C5
- C1C2 C3C4 C5C6

The longest chunk is the third, so the first word of the third chunk (C1C2) is chosen. The steps are repeated until the complete word sequence is obtained.

Remarks: this method is simple and fast and needs only a dictionary. Its weakness is also the dictionary: segmentation accuracy depends entirely on the completeness and correctness of the dictionary.

III.2. Transformation-based Learning (TBL) method

This method relies on an annotated corpus. For the computer to recognize word boundaries and segment correctly, it learns from sample sentences in a corpus in which the correct word boundaries have been marked.

The method is simple in principle: the machine learns from the sample sentences, derives the rules of the language by itself, and then applies them to sentences that follow those rules. To segment correctly in every case, however, it requires a truly complete Vietnamese corpus and long training in order to derive a complete rule set.

III.3. Word segmentation with WFST and neural networks

The Weighted Finite State Transducer (WFST) model has been applied to word segmentation since 1996. The basic idea is to use a WFST whose weights are the occurrence probabilities of each word in the corpus. The WFST scans the sentences under consideration, and the word with the best weight is chosen for segmentation. This method was also used in the published work of Đinh Điền [2001], who combined a WFST with a neural network to resolve ambiguity in segmentation. The system has a WFST layer, which segments words and handles features specific to Vietnamese such as reduplicative words and proper names, and a neural network layer, which resolves semantic ambiguity after segmentation (if any).

The two layers are as follows.

3.1 The WFST layer has three steps:

o Step 1: Build the weighted dictionary. In the WFST model, word segmentation is treated as a probabilistic state transition. The dictionary D is described as a weighted finite state transducer graph. Let H be the set of Vietnamese orthographic syllables ("tiếng") and P the part of speech of a word. Each arc of D goes from an element of H to an element of H. The labels in D represent a cost estimated by the formula:

Cost = -log(f/N)

where f is the frequency of the word and N is the size of the sample set.

o Step 2: Build the candidate segmentations. To limit the combinatorial explosion when generating the possible word sequences from the sequence of syllables in a sentence, the author also uses a dictionary. If a segmentation is found to be unsuitable (not in the dictionary, not a reduplicative word, not a proper noun, and so on), the branches that start from that segmentation are discarded.

o Step 3: Choose the optimal segmentation. Given the list of possible segmentations of the sentence, the author chooses the segmentation with the smallest weight (cost).

3.2 The neural network layer: this model resolves ambiguity in segmentation by combining its decision with a dictionary lookup.

Remarks: according to the author's published work, this model achieves an accuracy above 97%. It uses the neural network together with the dictionary to resolve ambiguities that arise when several word sequences can be extracted from one sentence; the neural network layer removes the unsuitable words. As with TBL, the key requirement of this model is a complete training corpus.

III.4. Vietnamese word segmentation based on Internet statistics and a genetic algorithm

IGATEC (Internet and Genetics Algorithm based Text Categorization for Documents in Vietnamese) was proposed by H. Nguyễn in 2005 as a new approach to word segmentation for text classification that needs neither a dictionary nor a training corpus.
In this approach, the author combines a genetic algorithm with statistical data taken from the Internet. The segmentation system consists of the following components.

a. Online Extractor: This component obtains the occurrence frequencies of the words in the text by using a well-known search engine such as Google or Yahoo. The author then uses the formulas below to compute mutual information, which is the basis of the fitness value for the GA engine.

Probability of words occurring on the Internet:

$$p(w) = \frac{count(w)}{MAX}, \qquad p(w_1 \,\&\, w_2) = \frac{count(w_1 \,\&\, w_2)}{MAX}$$

where $MAX = 4 \times 10^9$ and $count(w)$ is the number of documents found on the Internet that contain the word $w$, or that contain both $w_1$ and $w_2$ in the case of $count(w_1 \,\&\, w_2)$.

Dependence probability of one word on another:

$$p(w_1 \mid w_2) = \frac{p(w_1 \,\&\, w_2)}{p(w_1)}$$

Mutual information of a compound word made of n syllables ($cw = w_1 w_2 \ldots w_n$):

$$MI(cw) = \frac{p(w_1 \,\&\, w_2 \,\&\, \ldots \,\&\, w_n)}{\sum_{j=1}^{n} p(w_j) - p(w_1 \,\&\, w_2 \,\&\, \ldots \,\&\, w_n)}$$

b. GA Engine for Text Segmentation: Each individual in the population is represented by a string of 0/1 bits, where each bit stands for one syllable in the text and each group of identical bits stands for one segment. The individuals are initialized randomly, with the length of each segment limited to 5. The GA engine then applies mutation and crossover to increase the fitness of the individuals and reach the best possible segmentation.

IV. CONCLUSION

Having reviewed several approaches to Vietnamese word segmentation, we find that published studies all indicate that word-based segmentation gives fairly high accuracy. This comes from a large training set with accurately marked word boundaries, which makes it possible to learn good segmentation rules for other documents. However, the performance of the method clearly depends entirely on the training corpus.
Therefore, to overcome the dependence on a dictionary, we propose using the approach of H. Nguyễn (presented in detail in a later section) for word segmentation.

The character-based approach is easy to implement and relatively fast, but it is less accurate than the word-based approach. It generally suits applications that do not need absolute segmentation accuracy, such as spam mail filtering or firewalls. If its accuracy can be improved, this approach is entirely feasible and could replace word-based segmentation, because it does not require building a corpus, a task that demands much effort and time and the support of experts in different fields.

CHAPTER 4: VIETNAMESE TEXT CLASSIFICATION

To classify text in general, we carry out the following steps:

Step 1: Extract the document features and represent the documents with the vector model.

Step 2: Apply a text classification algorithm. For this step we propose the Naive Bayes algorithm, because it is a linear classification algorithm suited to multi-topic text classification. NB is simple to implement, runs fast, is easy to update with new training data, and is highly independent of the training set.

I. FEATURE EXTRACTION AND REPRESENTATION WITH THE VECTOR MODEL

To extract the features of a document, we segment the words in the document, determine the part of speech of each word, and then represent the documents with the vector model.

I.1. Word segmentation

IGATEC (Internet and Genetics Algorithm based Text Categorization for Documents in Vietnamese), the Vietnamese word segmentation method based on Internet statistics and a genetic algorithm, was proposed by H. Nguyễn in 2005 as a new approach to word segmentation for text classification that needs neither a dictionary nor a training corpus.
In this approach, the author combines a genetic algorithm with statistical data taken from the Internet. The segmentation system consists of the following components.

1.1 Online Extractor: This component obtains the occurrence frequencies of the words in the text by using a well-known search engine such as Google or Yahoo. The author then uses the formulas below to compute mutual information, which is the basis of the fitness value for the GA engine.

Probability of words occurring on the Internet:

$$p(w) = \frac{count(w)}{MAX}, \qquad p(w_1 \,\&\, w_2) = \frac{count(w_1 \,\&\, w_2)}{MAX}$$

where $MAX = 4 \times 10^9$ and $count(w)$ is the number of documents found on the Internet that contain the word $w$, or that contain both $w_1$ and $w_2$ in the case of $count(w_1 \,\&\, w_2)$.

Dependence probability of one word on another:

$$p(w_1 \mid w_2) = \frac{p(w_1 \,\&\, w_2)}{p(w_1)}$$

Mutual information of a compound word made of n syllables ($cw = w_1 w_2 \ldots w_n$):

$$MI(cw) = \frac{p(w_1 \,\&\, w_2 \,\&\, \ldots \,\&\, w_n)}{\sum_{j=1}^{n} p(w_j) - p(w_1 \,\&\, w_2 \,\&\, \ldots \,\&\, w_n)}$$

1.2 GA Engine for Text Segmentation: Each individual in the population is represented by a string of 0/1 bits, where each bit stands for one syllable in the text and each group of identical bits stands for one segment. The individuals are initialized randomly, with the length of each segment limited to 5. The GA engine then applies mutation and crossover to increase the fitness of the individuals and reach the best possible segmentation.

1.2.1 Population initialization

a. Representation of an individual: Suppose the input text t consists of n syllables: $t = s_1 s_2 \ldots s_n$. The purpose of the GA is to find the segmentation into words with the highest fitness: $t = w_1 w_2 \ldots w_m$ with $w_k = s_i \ldots s_j$ ( 1