VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY

SEMINAR REPORT: KNOWLEDGE BASES AND APPLICATIONS

DATA MINING WITH THE ID3 DECISION TREE

MAJOR: COMPUTER SCIENCE, CODE: 60 48 01

SUPERVISOR: Prof. Dr.Sc. HOÀNG VĂN KIẾM

STUDENT: ĐỖ THIỆN VŨ, STUDENT ID: CH1301072, COHORT: 2013

HO CHI MINH CITY, 2013

Machine learning: data mining with the ID3 decision tree


This report presents the ID3 algorithm for classifying data, extracting rules from the resulting tree, and reducing those rules.


CONTENTS

I. OVERVIEW OF DECISION TREES
1. General introduction
2. Types of decision trees
3. Advantages of decision trees
II. BUILDING A DECISION TREE
1. Choosing the splitting attribute
2. Tests for choosing the best split
3. Converting a decision tree into rules
III. THE ID3 DECISION TREE LEARNING ALGORITHM
1. Model
2. The ID3 algorithm
3. Remarks
3.1. Search space
3.2. The algorithm
4. Optimizing the final decision tree
5. When to use ID3
IV. BUILDING A DATA MINING PROGRAM BASED ON THE ID3 ALGORITHM

I. OVERVIEW OF DECISION TREES

1. General introduction

A decision tree is a structure drawn as a tree of nodes, branches, and leaves. It is a data mining method used for classification and prediction. Each internal node represents one data attribute, each branch represents a value of that attribute, and each leaf represents a decision class. The topmost node is called the root. A decision tree classifies a record by starting at the root and following the matching branches until it reaches a leaf. Each such path can then be rewritten as a decision rule of the form IF-THEN.

Building a decision tree is the process of analyzing a data set, partitioning it into classes, and producing predictions. The tree is built by repeatedly (recursively) dividing a data set into subsets, each made up mostly of elements of a single class. The attribute used to create the branches is chosen through Entropy and Gain.

Decision tree learning is also a common method in data mining. Here the decision tree describes a tree structure in which the leaves represent classifications and the branches represent the conjunctions of attributes that lead to those classifications. A decision tree can be learned by splitting the source set into subsets according to an attribute-value test. This process is repeated recursively for each derived subset. The recursion ends when no further split is possible, or when a single classification applies to every element of the derived subset.

A decision tree can be described as a combination of mathematical and computational techniques that support the description, classification, and generalization of a given data set. The data are given as records of the form $(x, y) = (x_1, x_2, x_3, \ldots, x_k, y)$. The dependent variable $y$ is the variable we want to understand, classify, or generalize. The variables $x_1, x_2, x_3, \ldots$ are the variables that help us do so.

2. Types of decision trees

Decision trees come in two kinds, each with its own name:

- Regression tree: estimates a function whose values are real numbers instead of performing a classification task (for example, estimating the price of a house or the length of a patient's hospital stay).
- Classification tree: used when y is a categorical variable, such as gender (male or female) or the result of a match (win or lose).

3. Advantages of decision trees

Compared with other data mining methods, decision trees have several advantages:

- Decision trees are easy to understand. People can understand a decision tree model after a brief explanation.
- Data preparation for a decision tree is basic or unnecessary. Other techniques often require data normalization, the creation of dummy variables, and the removal of empty values.
- Decision trees can handle both numeric data and categorical data. Other techniques usually specialize in data sets that contain only one type of variable. For example, relation rules can be used only with nominal variables, while neural networks can be used only with numeric variables.
- A decision tree is a white-box model. A neural network is an example of a black-box model, because the explanation of its results is too complex to understand.
- A model can be validated with statistical tests, which makes it possible to trust the model.
- Decision trees handle large amounts of data well in a short time. A personal computer can analyze a large data set quickly enough for strategists to make decisions based on the tree's analysis.

II. BUILDING A DECISION TREE

Creating a decision tree involves two phases: building the tree and pruning the tree.

- To build the tree, all training examples start at the root and are then divided recursively according to the selected attribute.
- Pruning identifies and removes branches that reflect noisy elements or outliers (elements that cannot be assigned to any class).

There are many variations on the core of the decision tree algorithm, but they all follow these basic steps:

- The tree is built top-down, in a divide-and-conquer manner.
- At the start, all training samples are at the root of the tree.
- Attributes are categorical (continuous-valued attributes are discretized beforehand).
- An attribute is chosen to split the samples into branches. The attribute is selected using a statistical or heuristic measure.
- The construction is repeated for each branch.

Splitting stops when one of these conditions holds:

+ All samples at a node belong to the same class (the node becomes a leaf).
+ No attribute is left that can be used to split the samples further.
+ No samples are left at the node.

1. Choosing the splitting attribute

A decision tree is built by splitting the records at each node on one input attribute. The first task is therefore to decide which attribute gives the best split at that node.

The measure used to evaluate a split is purity. There are several specific methods for computing purity in detail, but they all aim at the same effect. The best split is the one that increases the purity of the record sets by the largest amount. A good split should also create nodes of similar size, or at least should not create nodes that contain very few records.

[Figure: the original data, two poor splits, and one good split]

Decision tree building algorithms are exhaustive. They start by trying every input variable that has not been used yet and measuring the increase in purity produced by each one. The best split is then used as the initial split and creates two or more child nodes. If no split is possible (for example, because there are too few records) or no split increases purity, the algorithm stops at that node and the node becomes a leaf.

Splits on numeric input variables: for a binary split on a numeric input variable, every value that the variable takes can serve as a candidate threshold. A binary split on a numeric input variable has the form X < N. To improve performance, some algorithms do not test every value of the variable and test only a sample of its values.
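The threshold search described above can be sketched as follows. This is a minimal illustration, not the program from section IV: the class name `NumericSplitSketch`, the method `BestThreshold`, and the sample readings are assumptions made for the example, and weighted entropy is used as the purity measure.

```csharp
using System;
using System.Linq;

static class NumericSplitSketch
{
    // Binary entropy of a node holding p positive and n negative records.
    public static double Info(int p, int n)
    {
        if (p == 0 || n == 0) return 0.0; // convention: 0 * log2(0) = 0
        double total = p + n;
        return -(p / total) * Math.Log(p / total, 2) - (n / total) * Math.Log(n / total, 2);
    }

    // Tries every distinct value N of the variable as a threshold for the test X < N
    // and returns the threshold with the lowest weighted entropy.
    public static double BestThreshold(double[] x, bool[] positive)
    {
        double best = double.NaN, bestEntropy = double.MaxValue;
        foreach (double candidate in x.Distinct().OrderBy(v => v))
        {
            int lp = 0, ln = 0, rp = 0, rn = 0;
            for (int i = 0; i < x.Length; i++)
            {
                if (x[i] < candidate) { if (positive[i]) lp++; else ln++; }
                else { if (positive[i]) rp++; else rn++; }
            }
            if (lp + ln == 0) continue; // nothing falls below the smallest value
            double weighted = ((lp + ln) * Info(lp, ln) + (rp + rn) * Info(rp, rn)) / x.Length;
            if (weighted < bestEntropy) { bestEntropy = weighted; best = candidate; }
        }
        return best;
    }

    public static void Main()
    {
        // Hypothetical data: numeric readings and whether the outcome was positive.
        double[] temperature = { 64, 65, 68, 69, 70, 71, 72, 75 };
        bool[] play = { true, true, true, true, false, false, false, false };
        // The test X < 70 separates the two classes completely.
        Console.WriteLine(BestThreshold(temperature, play));
    }
}
```

An algorithm that samples candidate values, as mentioned above, would iterate over a subset of `x.Distinct()` instead of all of it.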

Splits on categorical input variables: the simplest algorithm for splitting on a categorical variable creates one branch for each value of the variable. This method is actually used in some software, but it gives poor results. A more common method is to group together the values that predict the same outcome. Specifically, if the distributions of the target variable for two values of the input variable differ only within an allowed limit, the two values can be merged.

Splits in the presence of missing values: one of the most attractive features of decision trees is that they can handle missing values by treating the empty value (NULL) as a branch of its own. This is preferred to discarding the records with missing values or trying to assign some value to them, because an empty value often carries a meaning of its own. Although treating the empty value as a separate class is quite meaningful, another solution is often proposed. In data mining, each node holds several splitting rules that could be applied at that node, each based on a different input variable. When an empty value appears in the input variable of the best split, the surrogate split on the input variable with the second-best split is used instead.
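The first approach, treating NULL as a branch of its own, can be sketched as below. The class name `MissingValueSketch`, the branch name `"NULL"`, and the sample records are assumptions for this illustration. Surrogate splits are not covered by this sketch.

```csharp
using System;
using System.Collections.Generic;

static class MissingValueSketch
{
    // Name of the extra branch that collects records whose value is missing (assumed name).
    public const string NullBranch = "NULL";

    // Partitions records on one column; missing values get a branch of their own.
    public static Dictionary<string, List<string[]>> Partition(List<string[]> records, int column)
    {
        var branches = new Dictionary<string, List<string[]>>();
        foreach (string[] record in records)
        {
            string key = string.IsNullOrEmpty(record[column]) ? NullBranch : record[column];
            if (!branches.ContainsKey(key)) branches[key] = new List<string[]>();
            branches[key].Add(record);
        }
        return branches;
    }

    public static void Main()
    {
        // Hypothetical records of the form { Wind, Play }; the second one has no Wind value.
        var records = new List<string[]>
        {
            new[] { "Weak", "Yes" },
            new string[] { null, "No" },
            new[] { "Strong", "No" },
            new[] { "Weak", "Yes" }
        };
        foreach (var branch in Partition(records, 0))
            Console.WriteLine(branch.Key + ": " + branch.Value.Count + " record(s)");
    }
}
```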

2. Tests for choosing the best split

- Information gain

Information gain is the measure used to select an attribute: the attribute with the largest information gain is chosen.

Let P and N be two classes, and let S be a data set that contains p elements of class P and n elements of class N.

The amount of information needed to decide whether an arbitrary sample in S belongs to P or to N is:

$$\mathrm{Info}(p, n) = \mathrm{Entropy}\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Let $\{S_1, S_2, \ldots, S_v\}$ be the partition of S obtained by using attribute A, where each $S_i$ contains $p_i$ samples of class P and $n_i$ samples of class N.

The entropy, that is, the expected information needed to classify the objects in all the subtrees $S_i$, is:

$$\mathrm{Entropy}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\,\mathrm{Info}(p_i, n_i)$$

The information gained by branching on attribute A is:

$$\mathrm{Gain}(A) = \mathrm{Info}(p, n) - \mathrm{Entropy}(A)$$
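The three formulas translate almost directly into code. The sketch below is illustrative only: the class name `GainFromCountsSketch` and the counts in `Main` are assumptions. Each subset $S_i$ is passed as a pair of counts { p_i, n_i }.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class GainFromCountsSketch
{
    // Info(p, n): information needed to decide between class P and class N.
    public static double Info(int p, int n)
    {
        if (p == 0 || n == 0) return 0.0; // convention: 0 * log2(0) = 0
        double total = p + n;
        return -(p / total) * Math.Log(p / total, 2) - (n / total) * Math.Log(n / total, 2);
    }

    // Entropy(A): weighted average of Info(p_i, n_i) over the subsets S_1..S_v.
    // Each element of 'subsets' is the pair { p_i, n_i }.
    public static double Entropy(int[][] subsets)
    {
        double total = subsets.Sum(s => s[0] + s[1]);
        return subsets.Sum(s => (s[0] + s[1]) / total * Info(s[0], s[1]));
    }

    // Gain(A) = Info(p, n) - Entropy(A), with p and n summed over the subsets.
    public static double Gain(int[][] subsets)
    {
        int p = subsets.Sum(s => s[0]);
        int n = subsets.Sum(s => s[1]);
        return Info(p, n) - Entropy(subsets);
    }

    public static void Main()
    {
        // Hypothetical attribute that splits 8 samples into two subsets: (3, 1) and (1, 3).
        int[][] subsets = { new[] { 3, 1 }, new[] { 1, 3 } };
        Console.WriteLine("Gain = " + Gain(subsets).ToString("F3", CultureInfo.InvariantCulture));
    }
}
```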

Example: consider the following data table.

Play-tennis data

Independent variables                           Dependent variable
Outlook     Temperature  Humidity  Wind         Play
Sunny       Hot          High      Weak         No
Sunny       Hot          High      Strong       No
Overcast    Hot          High      Weak         Yes
Rain        Mild         High      Weak         Yes
Rain        Cool         Normal    Weak         Yes
Rain        Cool         Normal    Strong       No
Overcast    Cool         Normal    Strong       Yes
Sunny       Mild         High      Weak         No
Sunny       Cool         Normal    Weak         Yes
Rain        Mild         Normal    Weak         Yes
Sunny       Mild         Normal    Strong       Yes
Overcast    Mild         High      Strong       Yes
Overcast    Hot          Normal    Weak         Yes
Rain        Mild         High      Strong       No

Class P: Play_tennis = Yes

Class N: Play_tennis = No

The information needed to classify a given sample is:

$$\mathrm{Info}(p, n) = \mathrm{Info}(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

Consider the attribute Outlook:

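Outlook = Sunny:

$$\mathrm{Info}([2,3]) = \mathrm{entropy}\left(\tfrac{2}{5}, \tfrac{3}{5}\right) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$$

Outlook = Overcast:

$$\mathrm{Info}([4,0]) = \mathrm{entropy}(1, 0) = -1\log_2 1 - 0\log_2 0 = 0$$

Since $\log_2 0$ is undefined, we adopt the convention $0\log_2 0 = 0$.

Outlook = Rain:

$$\mathrm{Info}([3,2]) = \mathrm{entropy}\left(\tfrac{3}{5}, \tfrac{2}{5}\right) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$$

The entropy of the split on the attribute Outlook is:

$$\mathrm{Entropy}(\text{Outlook}) = \tfrac{5}{14}\mathrm{Info}(2,3) + \tfrac{4}{14}\mathrm{Info}(4,0) + \tfrac{5}{14}\mathrm{Info}(3,2) = \tfrac{5}{14}\cdot 0.971 + \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971 = 0.694$$

Therefore:

$$\mathrm{Gain}(\text{Outlook}) = \mathrm{Info}(9,5) - \mathrm{Entropy}(\text{Outlook}) = 0.940 - 0.694 = 0.246$$

Consider the attribute Humidity:

Humidity = High:

$$\mathrm{Info}([3,4]) = \mathrm{entropy}\left(\tfrac{3}{7}, \tfrac{4}{7}\right) = -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} = 0.985$$

Humidity = Normal:

$$\mathrm{Info}([6,1]) = \mathrm{entropy}\left(\tfrac{6}{7}, \tfrac{1}{7}\right) = -\tfrac{6}{7}\log_2\tfrac{6}{7} - \tfrac{1}{7}\log_2\tfrac{1}{7} = 0.592$$

$$\mathrm{Entropy}(\text{Humidity}) = \tfrac{7}{14}\mathrm{Info}(3,4) + \tfrac{7}{14}\mathrm{Info}(6,1) = \tfrac{7}{14}\cdot 0.985 + \tfrac{7}{14}\cdot 0.592 = 0.789$$

$$\mathrm{Gain}(\text{Humidity}) = \mathrm{Info}(9,5) - \mathrm{Entropy}(\text{Humidity}) = 0.940 - 0.789 = 0.151$$

Proceeding in the same way for the remaining attributes gives:

$$\mathrm{Gain}(\text{Outlook}) = 0.246$$
$$\mathrm{Gain}(\text{Humidity}) = 0.151$$
$$\mathrm{Gain}(\text{Wind}) = 0.048$$
$$\mathrm{Gain}(\text{Temperature}) = 0.029$$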
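The gains above can be checked by computing them directly from the 14 records. The sketch below is an illustration with assumed names (`PlayTableGainSketch`, `Names`, `Rows`) and English value labels. At full precision the gains for Outlook and Humidity are about 0.2467 and 0.1518, so rounding to three decimals can differ in the last digit from the values quoted in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class PlayTableGainSketch
{
    public static readonly string[] Names = { "Outlook", "Temperature", "Humidity", "Wind" };

    // The 14 records of the example; the last column is the class (Play).
    public static readonly string[][] Rows =
    {
        new[] { "Sunny", "Hot", "High", "Weak", "No" },
        new[] { "Sunny", "Hot", "High", "Strong", "No" },
        new[] { "Overcast", "Hot", "High", "Weak", "Yes" },
        new[] { "Rain", "Mild", "High", "Weak", "Yes" },
        new[] { "Rain", "Cool", "Normal", "Weak", "Yes" },
        new[] { "Rain", "Cool", "Normal", "Strong", "No" },
        new[] { "Overcast", "Cool", "Normal", "Strong", "Yes" },
        new[] { "Sunny", "Mild", "High", "Weak", "No" },
        new[] { "Sunny", "Cool", "Normal", "Weak", "Yes" },
        new[] { "Rain", "Mild", "Normal", "Weak", "Yes" },
        new[] { "Sunny", "Mild", "Normal", "Strong", "Yes" },
        new[] { "Overcast", "Mild", "High", "Strong", "Yes" },
        new[] { "Overcast", "Hot", "Normal", "Weak", "Yes" },
        new[] { "Rain", "Mild", "High", "Strong", "No" }
    };

    // Info(p, n) for a set of rows, counting Play = Yes as class P.
    static double Info(IEnumerable<string[]> rows)
    {
        List<string[]> list = rows.ToList();
        double p = list.Count(r => r[4] == "Yes");
        double n = list.Count - p;
        if (p == 0 || n == 0) return 0.0; // convention: 0 * log2(0) = 0
        double total = p + n;
        return -(p / total) * Math.Log(p / total, 2) - (n / total) * Math.Log(n / total, 2);
    }

    // Gain(A) = Info(p, n) - Entropy(A) for the attribute stored in the given column.
    public static double Gain(int column)
    {
        double entropy = Rows.GroupBy(r => r[column])
                             .Sum(g => (double)g.Count() / Rows.Length * Info(g));
        return Info(Rows) - entropy;
    }

    public static void Main()
    {
        for (int c = 0; c < Names.Length; c++)
            Console.WriteLine("Gain(" + Names[c] + ") = "
                + Gain(c).ToString("F4", CultureInfo.InvariantCulture));
    }
}
```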

Clearly, the attribute Outlook is chosen for the first split. Repeating the same procedure on each branch gives the final decision tree:

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

Figure: the final decision tree

3. Converting a decision tree into rules

- Knowledge is represented as IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- The attribute-value pairs along a path form a conjunction (an AND, written with a comma).
- The leaf nodes carry the name of the class.

For the example above we obtain the following rules:

R0: IF {Outlook=Sunny, Humidity=High} THEN {No}

R1: IF {Outlook=Sunny, Humidity=Normal} THEN {Yes}

R2: IF {Outlook=Overcast} THEN {Yes}

R3: IF {Outlook=Rain, Wind=Strong} THEN {No}

R4: IF {Outlook=Rain, Wind=Weak} THEN {Yes}
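Producing one rule per root-to-leaf path can be sketched as a depth-first walk. The node type `RuleNode` and the class `RuleExtractionSketch` below are hypothetical and exist only for this illustration. They are not the `TreeNode` class used by the program in section IV.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this sketch: a leaf carries a class label,
// an internal node carries an attribute name and one child per attribute value.
class RuleNode
{
    public string Attribute;
    public string Label;
    public List<KeyValuePair<string, RuleNode>> Children = new List<KeyValuePair<string, RuleNode>>();

    public static RuleNode Leaf(string label) { return new RuleNode { Label = label }; }
    public static RuleNode Test(string attribute) { return new RuleNode { Attribute = attribute }; }

    public RuleNode Branch(string value, RuleNode child)
    {
        Children.Add(new KeyValuePair<string, RuleNode>(value, child));
        return this;
    }
}

static class RuleExtractionSketch
{
    // The final tree of the play example.
    public static RuleNode BuildPlayTree()
    {
        RuleNode humidity = RuleNode.Test("Humidity")
            .Branch("High", RuleNode.Leaf("No")).Branch("Normal", RuleNode.Leaf("Yes"));
        RuleNode wind = RuleNode.Test("Wind")
            .Branch("Strong", RuleNode.Leaf("No")).Branch("Weak", RuleNode.Leaf("Yes"));
        return RuleNode.Test("Outlook")
            .Branch("Sunny", humidity).Branch("Overcast", RuleNode.Leaf("Yes")).Branch("Rain", wind);
    }

    public static List<string> ExtractRules(RuleNode root)
    {
        var rules = new List<string>();
        Walk(root, new List<string>(), rules);
        return rules;
    }

    // Depth-first walk: 'path' holds the attribute=value pairs from the root to 'node'.
    static void Walk(RuleNode node, List<string> path, List<string> rules)
    {
        if (node.Label != null)
        {
            rules.Add("IF {" + string.Join(", ", path) + "} THEN {" + node.Label + "}");
            return;
        }
        foreach (KeyValuePair<string, RuleNode> branch in node.Children)
        {
            path.Add(node.Attribute + "=" + branch.Key);
            Walk(branch.Value, path, rules);
            path.RemoveAt(path.Count - 1); // backtrack before trying the next value
        }
    }

    public static void Main()
    {
        List<string> rules = ExtractRules(BuildPlayTree());
        for (int i = 0; i < rules.Count; i++)
            Console.WriteLine("R" + i + ": " + rules[i]);
    }
}
```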

III. THE ID3 DECISION TREE LEARNING ALGORITHM

1. Model

The ID3 tree induction algorithm (ID3 for short) is a simple learning algorithm that has been, and still is, applied successfully in many fields.

- Input: a set of examples. Each example consists of attributes that describe a situation or an object, together with its classification value.
- Output: a decision tree that classifies the examples in the training set correctly and, hopefully, also classifies examples not yet seen.

2. The ID3 algorithm

ID3 builds the decision tree top-down. It selects an attribute to test at the current node of the tree and uses this test to partition the set of examples. The algorithm then recursively builds a subtree for each partition. This continues until all members of a partition belong to the same class, and that class becomes a leaf of the tree.

Because the order of the tests matters a great deal for obtaining a simple decision tree, ID3 depends heavily on the criterion used to select the test placed at the root of each subtree.

* ID3 builds the decision tree with the following algorithm:

Function TreeNode(example_set, attribute_set)
begin
    Step 1:
    if all examples in example_set belong to the same class then
        return a leaf node labeled with that class
    else if attribute_set is empty then
        return a leaf node labeled with the disjunction of all classes in example_set
    else
    Step 2:
    begin
        2.1. Select an attribute P and make it the root of the current tree;
        2.2. Remove P from attribute_set;
        2.3. For each value V of P
        begin
            2.3.1. Create a branch of the tree labeled V;
            2.3.2. Put into partition_V the examples of example_set that have value V for attribute P;
            2.3.3. Call TreeNode(partition_V, attribute_set) and attach the result to branch V (recursion)
        end
    end
end
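The pseudocode can be sketched compactly as below. This is an illustration under stated assumptions, not the program from section IV: the types `Id3Node` and `Id3Sketch` are hypothetical, attributes are selected by information gain, a leaf reached with no attributes left is labeled with the majority class (a simplification of the disjunction in Step 1), and branches are created only for values that occur in the current examples.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical node type for this sketch.
class Id3Node
{
    public string Attribute; // null for a leaf
    public string Label;     // class label for a leaf
    public Dictionary<string, Id3Node> Branches = new Dictionary<string, Id3Node>();
}

static class Id3Sketch
{
    public static readonly string[] Names = { "Outlook", "Temperature", "Humidity", "Wind" };
    public static readonly string[][] Rows =
    {
        new[] { "Sunny", "Hot", "High", "Weak", "No" },
        new[] { "Sunny", "Hot", "High", "Strong", "No" },
        new[] { "Overcast", "Hot", "High", "Weak", "Yes" },
        new[] { "Rain", "Mild", "High", "Weak", "Yes" },
        new[] { "Rain", "Cool", "Normal", "Weak", "Yes" },
        new[] { "Rain", "Cool", "Normal", "Strong", "No" },
        new[] { "Overcast", "Cool", "Normal", "Strong", "Yes" },
        new[] { "Sunny", "Mild", "High", "Weak", "No" },
        new[] { "Sunny", "Cool", "Normal", "Weak", "Yes" },
        new[] { "Rain", "Mild", "Normal", "Weak", "Yes" },
        new[] { "Sunny", "Mild", "Normal", "Strong", "Yes" },
        new[] { "Overcast", "Mild", "High", "Strong", "Yes" },
        new[] { "Overcast", "Hot", "Normal", "Weak", "Yes" },
        new[] { "Rain", "Mild", "High", "Strong", "No" }
    };

    // Entropy of the class column (the last column) over the given rows.
    static double Entropy(List<string[]> rows)
    {
        int last = rows[0].Length - 1;
        return rows.GroupBy(r => r[last])
                   .Select(g => (double)g.Count() / rows.Count)
                   .Sum(f => -f * Math.Log(f, 2));
    }

    static double Gain(List<string[]> rows, int column)
    {
        return Entropy(rows) - rows.GroupBy(r => r[column])
            .Sum(g => (double)g.Count() / rows.Count * Entropy(g.ToList()));
    }

    static string MajorityClass(List<string[]> rows)
    {
        int last = rows[0].Length - 1;
        return rows.GroupBy(r => r[last]).OrderByDescending(g => g.Count()).First().Key;
    }

    // rows: examples whose last column is the class; columns: attribute indices still available.
    public static Id3Node Build(List<string[]> rows, List<int> columns)
    {
        int last = rows[0].Length - 1;
        // Step 1: stop when all examples share one class or no attribute is left.
        if (rows.All(r => r[last] == rows[0][last]) || columns.Count == 0)
            return new Id3Node { Label = MajorityClass(rows) };

        // Step 2.1: choose the attribute with the highest information gain.
        int best = columns.OrderByDescending(c => Gain(rows, c)).First();
        // Step 2.2: remove it from a copy, so sibling branches keep their own attribute lists.
        List<int> remaining = columns.Where(c => c != best).ToList();
        var node = new Id3Node { Attribute = Names[best] };
        // Step 2.3: one branch per value that occurs in the current examples.
        foreach (var group in rows.GroupBy(r => r[best]))
            node.Branches[group.Key] = Build(group.ToList(), remaining);
        return node;
    }

    public static void Main()
    {
        Id3Node tree = Build(Rows.ToList(), new List<int> { 0, 1, 2, 3 });
        Console.WriteLine("Root attribute: " + tree.Attribute);
    }
}
```

On the play data this sketch selects Outlook at the root, which agrees with the gains computed in section II.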

3. Remarks

3.1. Search space

Like other inductive learning methods, ID3 searches a space of hypotheses for one that fits the training data. The hypothesis space that ID3 searches is the set of possible decision trees. ID3 performs a simple-to-complex hill-climbing search: it starts with the empty tree and then considers progressively more complex hypotheses that can classify the training examples correctly. The evaluation function that guides this hill-climbing search is the information gain measure.

Viewing ID3 as a search through a space of hypotheses leads to the following remarks:

- The hypothesis space of ID3 is a complete space of decision trees over the attributes given in the training set. This means that the space searched by ID3 certainly contains the decision tree being sought.
- While searching, ID3 maintains only a single current hypothesis. The algorithm therefore cannot represent all the different decision trees that classify the available data correctly.
- Because ID3 uses all the examples at each step to make statistically based decisions, its search result is much less affected by a few erroneous (noisy) examples.
- During the search, ID3 tends to prefer shorter decision trees over longer ones.

3.2. The algorithm

With the ID3 algorithm above, the tree is formed by progressively filtering the values of the example set. In a few rare cases, however, the resulting tree contains a node that has no influence on the decision: whichever branch is taken below it, the class is the same. Consider the following example, a survey of 8 people described by the attributes job, age, and gender, used to decide whether a person is married or not.

The Gain values are, in order:

Consider All Attributes
* Gain(All Attributes, job) = 0.0487949406953985
* Gain(All Attributes, age) = 0.311278124459133
* Gain(All Attributes, gender) = 0
---> Best Gain: age (chosen as the root P of the current tree, step 2.1)
Consider 18
* Gain(18, job) = 0
* Gain(18, gender) = 0.251629167387823
---> Best Gain: gender (chosen as the root P of the current tree, step 2.1)
Consider male
All samples are negative => return a root node labeled No (step 1)
Consider female
* Gain(female, job) = 0
---> Best Gain: job
Consider No is No
The attributes are empty => return a root node with the most common value, which is No (step 1)
Consider 22 is No (step 1)
Consider 30
All samples are positive => return a root node labeled Yes (step 1)
Consider 25
All samples are positive => return a root node labeled Yes (step 1)

The ID3 algorithm above produces the following tree:

[Figure: decision tree for the marriage survey example]

In the resulting tree, the attribute gender is an independent node, because all of its values (branches) lead to the same class. The tree can therefore be shortened (optimized) by deleting this redundant node and assigning its label (leaf) to its parent node. There are two ways to shorten the final decision tree: add a function that checks for and deletes redundant nodes as the last step of the ID3 algorithm, or build a separate optimization function that runs after the final decision tree has been formed.

4. Optimizing the final decision tree

Optimizing the final decision tree means shortening the tree by removing the redundant nodes described in the previous section. Here we build a separate function that checks every branch of the final decision tree in turn, from the root to the leaves. If the last node of a branch is redundant, its label (leaf) is assigned to the parent node directly above it. The function is then called again until the tree can no longer be shortened (the tree no longer changes).

- Input: the final decision tree produced by the ID3 algorithm.
- Output: the tree with the redundant nodes removed.
- Code: presented in the next section.
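The removal of redundant nodes can be sketched as follows. The node type `PruneNode` and the tree built in `BuildExample` are hypothetical simplifications inspired by the marriage example and are not taken from the program. Unlike the repeated passes described above, this sketch processes the children before their parent, so one pass is enough for a collapse to propagate towards the root.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical node type for this sketch (not the TreeNode class of the program).
class PruneNode
{
    public string Attribute; // null for a leaf
    public string Label;     // class label for a leaf
    public Dictionary<string, PruneNode> Branches = new Dictionary<string, PruneNode>();
    public bool IsLeaf { get { return Branches.Count == 0; } }

    public static PruneNode Leaf(string label) { return new PruneNode { Label = label }; }
}

static class RedundantNodeSketch
{
    // Collapses every node whose branches are all leaves carrying one common label.
    public static void Collapse(PruneNode node)
    {
        if (node.IsLeaf) return;
        foreach (PruneNode child in node.Branches.Values) Collapse(child);

        bool allLeaves = node.Branches.Values.All(c => c.IsLeaf);
        List<string> labels = node.Branches.Values.Select(c => c.Label).Distinct().ToList();
        if (allLeaves && labels.Count == 1)
        {
            node.Label = labels[0]; // the parent takes over the common label
            node.Attribute = null;
            node.Branches.Clear();
        }
    }

    // Hypothetical tree: below age = 18, every path ends in the class No.
    public static PruneNode BuildExample()
    {
        var job = new PruneNode { Attribute = "job" };
        job.Branches["yes"] = PruneNode.Leaf("No");
        job.Branches["no"] = PruneNode.Leaf("No");

        var gender = new PruneNode { Attribute = "gender" };
        gender.Branches["male"] = PruneNode.Leaf("No");
        gender.Branches["female"] = job;

        var age = new PruneNode { Attribute = "age" };
        age.Branches["18"] = gender;
        age.Branches["22"] = PruneNode.Leaf("No");
        age.Branches["25"] = PruneNode.Leaf("Yes");
        age.Branches["30"] = PruneNode.Leaf("Yes");
        return age;
    }

    public static void Main()
    {
        PruneNode tree = BuildExample();
        Collapse(tree);
        // The job node collapses first, which then lets the gender node collapse as well.
        Console.WriteLine("age = 18 is a leaf: " + tree.Branches["18"].IsLeaf
            + ", label: " + tree.Branches["18"].Label);
    }
}
```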

5. When to use ID3

ID3 is a simple learning algorithm, but it is suitable only for the class of problems that can be represented with symbols. For this reason the algorithm belongs to the symbol-based approach to problem solving.

The training set consists of examples described by attribute-value pairs, such as "Wind - strong" or "Wind - weak" in the play-tennis example used throughout this chapter. Every example also has a classification attribute, such as play_tennis, and this attribute must take discrete values, such as yes and no.

Unlike some other algorithms of the same approach, however, ID3 uses the training examples in a statistical way, so it has the advantage of being little affected by a few noisy examples. The training set may therefore contain errors or lack some values for some attributes. Such problems can be handled in the preprocessing phase.

IV. BUILDING A DATA MINING PROGRAM BASED ON THE ID3 ALGORITHM

The program is written in C# with Visual Studio 2010 and .NET Framework 4.0.

[Figure: program window after loading the marriage survey example data and running the ID3 algorithm]

[Figure: program window after loading the weather forecast example data and running the ID3 algorithm]

[Figure: program window while running the optimization task]

The source code project of the program is stored at:

http://www.mediafire.com/folder/yyxek7yc9s25f/CSTT_UD

The main class of the program:

    using System;
    using System.Collections.Generic;

    // TreeNode and Attribute are classes of the project that are not listed here.
    // The generic type arguments (List<...>) were lost in this listing and are
    // restored from the way the collections are used.
    namespace TieuLuanCoSoTriThuc
    {
        class ID3_ALG
        {
            List<List<string>> Examples;
            List<Attribute> Attributes;
            public List<string> RuleID3 = new List<string>();
            TreeNode _tree;
            int _depth;
            public int RuleCount;
            public int temp;
            string _solution;
            string _solution1;
            string _Rule;

            internal TreeNode Tree { get { return _tree; } set { _tree = value; } }
            public int Depth { get { return _depth; } set { _depth = value; } }
            public string Solution { get { return _solution; } set { _solution = value; } }
            public string Solution1 { get { return _solution1; } set { _solution1 = value; } }
            public string Rule { get { return _Rule; } set { _Rule = value; } }

            public ID3_ALG(List<List<string>> Examples, List<Attribute> Attributes)
            {
                this.Examples = Examples;
                this.Attributes = Attributes;
                this.Tree = null;
                Depth = 0;
            }

            // Compute the entropy of a set with the given class counts.
            private double GetEntropy(int Positives, int Negatives)
            {
                if (Positives == 0) return 0;
                if (Negatives == 0) return 0;
                int total = Negatives + Positives;
                double RatePositives = (double)Positives / total;
                double RateNegatives = (double)Negatives / total;
                return -RatePositives * Math.Log(RatePositives, 2)
                       - RateNegatives * Math.Log(RateNegatives, 2);
            }

            // Compute Gain(bestat, A).
            private double Gain(List<List<string>> Examples, Attribute A, string bestat)
            {
                double result;
                int CountPositives = 0;
                int[] CountPositivesA = new int[A.Value.Count];
                int[] CountNegativeA = new int[A.Value.Count];
                int Col = Attributes.IndexOf(A);
                for (int i = 0; i < Examples.Count; i++)
                {
                    int j = A.Value.IndexOf(Examples[i][Col].ToString());
                    if (Examples[i][Examples[0].Count - 1] == "yes")
                    {
                        CountPositives++;
                        CountPositivesA[j]++;
                    }
                    else
                    {
                        CountNegativeA[j]++;
                    }
                }
                result = GetEntropy(CountPositives, Examples.Count - CountPositives);
                for (int i = 0; i < A.Value.Count; i++)
                {
                    double RateValue = (double)(CountPositivesA[i] + CountNegativeA[i]) / Examples.Count;
                    result = result - RateValue * GetEntropy(CountPositivesA[i], CountNegativeA[i]);
                }
                Solution = Solution + "\n * Gain(" + bestat + "," + A.Name + ") = " + result.ToString();
                return result;
            }

            // The ID3 algorithm.
            private TreeNode ID3(List<List<string>> Examples, List<Attribute> candidates, string bestat)
            {
                Solution = Solution + " Consider " + bestat + " ";
                if (CheckAllPositive(Examples))
                {
                    Solution += "\n All samples are positive => return a root node labeled Yes";
                    return new TreeNode(new Attribute("Yes"));
                }
                if (CheckAllNegative(Examples))
                {
                    Solution += "\n All samples are negative => return a root node labeled No";
                    return new TreeNode(new Attribute("No"));
                }
                if (candidates.Count == 0)
                {
                    return new TreeNode(new Attribute(GetMostCommonValue(Examples)));
                }
                Attribute BestAttribute = GetBestAttribute(Examples, candidates, bestat);
                int LocationBA = Attributes.IndexOf(BestAttribute);
                TreeNode Root = new TreeNode(BestAttribute);
                // Remove the chosen attribute from a copy of the list, so that sibling
                // branches do not lose attributes removed deeper in the recursion.
                List<Attribute> remaining = new List<Attribute>(candidates);
                remaining.Remove(BestAttribute);
                for (int i = 0; i < BestAttribute.Value.Count; i++)
                {
                    List<List<string>> Examplesvi = new List<List<string>>();
                    for (int j = 0; j < Examples.Count; j++)
                    {
                        if (Examples[j][LocationBA].ToString() == BestAttribute.Value[i].ToString())
                            Examplesvi.Add(Examples[j]);
                    }
                    if (Examplesvi.Count == 0)
                    {
                        // No example has this value: attach a leaf labeled with the most common
                        // class of the current examples instead of returning from the method.
                        Solution += "\n The partition is empty => return a root node with the most common value ";
                        Root.AddNode(new TreeNode(new Attribute(GetMostCommonValue(Examples))));
                    }
                    else
                    {
                        Solution += "\n";
                        Root.AddNode(ID3(Examplesvi, remaining, BestAttribute.Value[i]));
                    }
                }
                return Root;
            }

            // Get the attribute with the highest Gain (bestat).
            private Attribute GetBestAttribute(List<List<string>> Examples, List<Attribute> Attributes, string bestat)
            {
                double MaxGain = Gain(Examples, Attributes[0], bestat);
                int Max = 0;
                for (int i = 1; i < Attributes.Count; i++)
                {
                    double GainCurrent = Gain(Examples, Attributes[i], bestat);
                    if (MaxGain < GainCurrent)
                    {
                        MaxGain = GainCurrent;
                        Max = i;
                    }
                }
                Solution = Solution + "\n\t ---> Best Gain : " + Attributes[Max].Name;
                return Attributes[Max];
            }

            // Get the most common value of the target attribute.
            private string GetMostCommonValue(List<List<string>> Examples)
            {
                int CountPositive = 0;
                for (int i = 0; i < Examples.Count; i++)
                {
                    if (Examples[i][Examples[0].Count - 1] == "yes")
                        CountPositive++;
                }
                int CountNegative = Examples.Count - CountPositive;
                string Label;
                if (CountPositive > CountNegative)
                    Label = "Yes";
                else
                    Label = "No";
                Solution = Solution + " is " + Label;
                return Label;
            }

            // Check whether every example in the set is positive.
            private bool CheckAllPositive(List<List<string>> Examples)
            {
                for (int i = 0; i < Examples.Count; i++)
                {
                    if (Examples[i][Examples[0].Count - 1].ToString() == "no")
                        return false;
                }
                return true;
            }

            // Check whether every example in the set is negative.
            private bool CheckAllNegative(List<List<string>> Examples)
            {
                for (int i = 0; i < Examples.Count; i++)
                {
                    if (Examples[i][Examples[0].Count - 1] == "yes")
                        return false;
                }
                return true;
            }

            // Build the tree.
            public void GetTree()
            {
                Solution = "";
                List<Attribute> at = new List<Attribute>();
                for (int i = 0; i < Attributes.Count; i++)
                {
                    at.Add(Attributes[i]);
                }
                Tree = ID3(Examples, at, "All Attributes");
                Depth = GetDepth(Tree);
            }

            // Find the rules: one rule per path from the root to a leaf.
            public void SearchRule(TreeNode Rule)
            {
                if (Rule.Attributes.Value.Count == 0) return; // a leaf has no rule of its own
                Solution1 += Rule.Attributes.Name + " = ";
                string temp1 = Solution1 + " ";
                for (int i = 0; i < Rule.Attributes.Value.Count; i++)
                {
                    string temp2 = temp1 + Rule.Attributes.Value[i] + ", ";
                    if (Rule.Childs[i].Attributes.Value.Count == 0)
                    {
                        RuleCount++;
                        Solution1 = temp2 + "} THEN {" + Rule.Childs[i].Attributes.Label + "}";
                        RuleID3.Add(Solution1);
                    }
                    else
                    {
                        Solution1 = temp2;
                        SearchRule(Rule.Childs[i]);
                    }
                }
            }

            public void GetRule(TreeNode tree)
            {
                Solution1 = "";
                Rule += " RULES EXTRACTED FROM THE ID3 DECISION TREE \n\n";
                SearchRule(tree);
                for (int i = 0; i < RuleCount; i++)
                    Rule += " Rule [" + i + "]: IF {" + RuleID3[i] + "\n";
                Rule += "\n Total number of rules: " + RuleCount;
                RuleCount = 0;
            }

            // Optimize the decision tree.
            public bool CheckAllLabelNegative(TreeNode tree)
            {
                int test = 0;
                for (int i = 0; i < tree.Attributes.Value.Count; i++)
                {
                    if (tree.Childs[i].Attributes.Label == "No") test++;
                }
                return (test > 1) && (test == tree.Attributes.Value.Count);
            }

            public bool CheckAllLabelPositive(TreeNode tree)
            {
                int test = 0;
                for (int i = 0; i < tree.Attributes.Value.Count; i++)
                {
                    if (tree.Childs[i].Attributes.Label == "Yes") test++;
                }
                return (test > 1) && (test == tree.Attributes.Value.Count);
            }

            // Empty a node so that it becomes a leaf.
            public void DeleteTree(TreeNode tree)
            {
                tree.Attributes.Name = "";
                //tree.Attributes.Label=null;
                tree.Attributes.Value.Clear();
            }

            // One optimization pass; call it again until the tree no longer changes.
            public void OptimizeTree(TreeNode tree)
            {
                for (int i = 0; i < tree.Attributes.Value.Count; i++)
                {
                    if (tree.Attributes.Value.Count > 1)
                    {
                        if (CheckAllLabelPositive(tree))
                        {
                            tree.Attributes.Label = "Yes";
                            DeleteTree(tree);
                        }
                        else if (CheckAllLabelNegative(tree))
                        {
                            tree.Attributes.Label = "No";
                            DeleteTree(tree);
                        }
                        else
                        {
                            OptimizeTree(tree.Childs[i]);
                        }
                    }
                }
            }

            // Get the depth of the tree.
            private int GetDepth(TreeNode tree)
            {
                int depth;
                if (tree.Childs.Length == 0)
                    return 1;
                else
                {
                    depth = GetDepth(tree.Childs[0]);
                    for (int i = 1; i < tree.Childs.Length; i++)
                    {
                        int depthchild = GetDepth(tree.Childs[i]);
                        if (depth < depthchild)
                            depth = depthchild;
                    }
                    depth++;
                }
                return depth;
            }
        }
    }
