Presents the ID3 algorithm for classifying data, extracting rules, and simplifying those rules.
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
SEMINAR TOPIC
KNOWLEDGE BASES AND APPLICATIONS
DATA MINING WITH THE ID3 DECISION TREE
MAJOR: COMPUTER SCIENCE. CODE: 60 48 01
ADVISOR: Prof. Dr.Sc. HOÀNG VĂN KIẾM
STUDENT: THIN V. STUDENT ID: CH1301072. COHORT: 2013
HO CHI MINH CITY, 2013
CONTENTS
I. OVERVIEW OF DECISION TREES
1. General introduction
2. Types of decision trees
3. Advantages of decision trees
II. BUILDING A DECISION TREE
1. Choosing the splitting attribute
2. Tests for choosing the best split
3. Converting a decision tree into rules
III. THE ID3 DECISION TREE LEARNING ALGORITHM
1. Model
2. The ID3 algorithm
3. Remarks
3.1. Search space
3.2. The algorithm
4. Optimizing the final decision tree
5. When to use ID3
IV. BUILDING A DATA MINING PROGRAM BASED ON THE ID3 ALGORITHM
I. OVERVIEW OF DECISION TREES
1. General introduction
A decision tree is a structure represented as a tree made up of nodes, branches, and leaves. Decision trees are a data mining method used for classification and prediction. Each internal node represents one attribute of the data, each branch represents a value of that attribute, and each leaf represents a decision class. The topmost node is called the root. A decision tree classifies a record by starting at the root and following the branches until a leaf is reached. The tree can then be converted into decision rules of the form IF-THEN.
Building a decision tree is the process of analyzing a data set, classifying it, and making predictions. The tree is formed by repeatedly (recursively) dividing a data set into subsets, each consisting mainly of elements of a single class. The attribute used to create each branch is selected by means of entropy and gain.
Decision tree learning is also a common method in data mining. Here the decision tree describes a tree structure in which the leaves represent classifications and the branches represent the conjunctions of attributes that lead to those classifications. A decision tree can be learned by splitting the source set into subsets according to a test on an attribute value. This process is repeated recursively on each derived subset. The recursion ends when no further split is possible, or when a single classification applies to every element of the derived subset.
A decision tree can be described as a combination of mathematical and computational techniques that help describe, classify, and generalize a given data set. The data are given as records of the form $(\mathbf{x}, y) = (x_1, x_2, x_3, \ldots, x_k, y)$. The dependent variable $y$ is the variable we want to understand, classify, or generalize. The variables $x_1, x_2, x_3, \ldots$ are the variables that help us do so.
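The root-to-leaf walk described above can be sketched in a few lines. This is only an illustration under stated assumptions: the nested-dictionary layout, the `classify` helper, and the loan-style attributes `has_job` and `has_guarantor` are hypothetical and do not come from the report.

```python
# Assumed layout: an internal node is {"attribute": name, "branches": {value: subtree}};
# a leaf is simply the class label (a string). The tree itself is made up.
tree = {
    "attribute": "has_job",
    "branches": {
        "yes": "approve",
        "no": {
            "attribute": "has_guarantor",
            "branches": {"yes": "approve", "no": "reject"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node

record = {"has_job": "no", "has_guarantor": "yes"}
print(classify(tree, record))  # prints "approve"
```

Only the attributes that lie on the path actually taken are read, so a record does not need a value for attributes tested on other branches.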
2. Types of decision trees
There are two main types of decision tree:
- Regression tree: estimates a function whose values are real numbers, rather than being used for classification tasks (for example, estimating the price of a house or the length of a patient's hospital stay).
- Classification tree: used when y is a categorical variable, such as gender (male or female) or the result of a match (win or lose).
3. Advantages of decision trees
Compared with other data mining methods, decision trees have several advantages:
- Decision trees are easy to understand. People can understand a decision tree model after a brief explanation.
- Data preparation for a decision tree is basic or unnecessary. Other techniques often require data normalization, the creation of dummy variables, and the removal of empty values.
- Decision trees can handle both numerical data and categorical data. Other techniques are usually specialized for data sets with only one type of variable. For example, relation rules can be used only with nominal variables, while neural networks can be used only with numerical variables.
- A decision tree is a white-box model. A neural network is an example of a black-box model, because the explanation of its results is too complex to be understood.
- A model can be validated with statistical tests, which makes it possible to trust the model.
- Decision trees can process large amounts of data in a short time. A personal computer can analyze large data sets quickly enough for strategists to make decisions based on the decision tree analysis.
II. BUILDING A DECISION TREE
Creating a decision tree consists of two phases: building the tree and pruning the tree.
- To build the tree, all training examples start at the root; they are then partitioned recursively according to the selected attributes.
- Pruning the tree means identifying and removing branches that contain noisy elements or outliers (elements that cannot be assigned to any class).
There are many variations on the core decision tree algorithm, but they all follow these basic steps:
- The tree is constructed top-down, in a divide-and-conquer manner.
- At the start, all training samples are at the root of the tree.
- Attributes are categorical (continuous-valued attributes are discretized beforehand).
- An attribute is selected to split the samples into branches. The attribute is chosen on the basis of a statistical or heuristic measure.
- The construction is repeated for each branch.
Conditions for stopping the partitioning:
+ All samples at a node belong to the same class (the node becomes a leaf).
+ There are no remaining attributes on which the samples can be split.
+ There are no samples left at the node.
1. Choosing the splitting attribute
A decision tree is built by splitting the records at each node on the basis of one input attribute. The first task is therefore to decide which attribute gives the best split at that node.
The measure used to evaluate a split is purity. There are several specific methods for computing purity, but they all aim at the same effect. The best split is the one that increases the purity of the record sets by the largest amount. A good split should also produce nodes of similar size, or at least should not produce nodes with very few records.
[Figure: the original data, two poor splits, and one good split]
Decision tree construction algorithms are exhaustive. They start by taking each input variable that has not yet been used and measuring the increase in purity that a split on that variable would produce. The best split found is then used as the initial split, creating two or more child nodes. If no split is possible (for example because there are too few records), or if no split increases purity, the algorithm stops for that node and the node becomes a leaf.
Splitting on numeric input variables: for a binary split on a numeric input variable, every value the variable takes can serve as a candidate value. A binary split on a numeric input variable has the form X < N. To improve performance, some algorithms do not test every value of the variable but only a sample of its values.
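A minimal sketch of the candidate search for a binary test X < N follows. It assumes entropy as the purity measure and tries every distinct value as N; the function names and the temperature data are hypothetical and serve only as an illustration.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * log2(p)
    return result

def best_threshold(values, labels):
    """Try every distinct value N as a candidate for the test X < N and return
    the (N, weighted_entropy) pair with the lowest weighted entropy."""
    best = None
    for n in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v < n]
        right = [lab for v, lab in zip(values, labels) if v >= n]
        if not left or not right:
            continue  # a split that leaves one side empty separates nothing
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        if best is None or weighted < best[1]:
            best = (n, weighted)
    return best

# Hypothetical data: temperature readings and whether a game was played.
temps = [18, 20, 22, 25, 28, 30]
played = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(temps, played))  # the test X < 25 separates the classes perfectly
```

An algorithm that samples candidate values, as mentioned above, would simply iterate over a subset of `sorted(set(values))` instead of all of it.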
Splitting on categorical input variables: the simplest algorithm for splitting on a categorical variable creates one branch for each value of the variable. This method is actually used in some software packages, but it gives poor results. A more common approach is to group together the values that predict the same outcome. Specifically, if two values of the input variable have distributions over the target variable that differ only within an allowed tolerance, the two values can be merged.
Splitting in the presence of missing values: one of the best features of decision trees is that they can handle missing values by treating the empty value (NULL) as a branch of its own. This is preferred to discarding records with missing values or trying to assign some value to them, because an empty value often carries a meaning of its own. Although treating the empty value as a separate class is quite meaningful, another solution is often proposed. In data mining, each node holds several candidate splitting rules, each based on a different input variable. When an empty value occurs in the input variable of the best split, the surrogate split on the input variable with the second-best split is used instead.
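The first of the two approaches above, treating NULL as a branch of its own, might look like the following sketch. The `partition` helper, the `"NULL"` branch key, and the sample records are assumptions made for illustration; surrogate splits are not shown.

```python
from collections import defaultdict

def partition(records, attribute):
    """Group records by their value for `attribute`; records whose value is
    missing (None or absent) go to a dedicated "NULL" branch instead of being dropped."""
    branches = defaultdict(list)
    for record in records:
        value = record.get(attribute)
        branches["NULL" if value is None else value].append(record)
    return dict(branches)

# Hypothetical records; the second and fourth have no value for "wind".
records = [
    {"wind": "weak", "play": "yes"},
    {"wind": None, "play": "no"},
    {"wind": "strong", "play": "no"},
    {"play": "yes"},
]
for value, group in partition(records, "wind").items():
    print(value, len(group))
```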
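2. Tests for choosing the best split
- Information gain
Information gain is the quantity used to select the splitting attribute: the attribute with the largest information gain is chosen.
Let P and N be two classes, and let S be a data set containing p elements of class P and n elements of class N.
The amount of information needed to decide whether an arbitrary sample belongs to class P or to class N is:
$$\mathrm{Info}(p, n) = \mathrm{Entropy}\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
Let the sets $\{S_1, S_2, \ldots, S_v\}$ be the partition of S obtained by using attribute A.
Let each $S_i$ contain $p_i$ samples of class P and $n_i$ samples of class N.
The entropy, that is, the expected information needed to classify the objects in all subtrees $S_i$, is:
$$\mathrm{Entropy}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\,\mathrm{Info}(p_i, n_i)$$
The information gained by branching on attribute A is:
$$\mathrm{Gain}(A) = \mathrm{Info}(p, n) - \mathrm{Entropy}(A)$$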
Example: consider the following data table.
Golf-playing data (Outlook, Temperature, Humidity, and Wind are the independent variables; Play is the dependent variable)
| Outlook  | Temperature | Humidity | Wind   | Play |
| Sunny    | Hot         | High     | Weak   | No   |
| Sunny    | Hot         | High     | Strong | No   |
| Overcast | Hot         | High     | Weak   | Yes  |
| Rain     | Mild        | High     | Weak   | Yes  |
| Rain     | Cool        | Normal   | Weak   | Yes  |
| Rain     | Cool        | Normal   | Strong | No   |
| Overcast | Cool        | Normal   | Strong | Yes  |
| Sunny    | Mild        | High     | Weak   | No   |
| Sunny    | Cool        | Normal   | Weak   | Yes  |
| Rain     | Mild        | Normal   | Weak   | Yes  |
| Sunny    | Mild        | Normal   | Strong | Yes  |
| Overcast | Mild        | High     | Strong | Yes  |
| Overcast | Hot         | Normal   | Weak   | Yes  |
| Rain     | Mild        | High     | Strong | No   |
Class P: Play_tennis = Yes (the Play column)
Class N: Play_tennis = No
The information needed to classify a given sample is:
$$\mathrm{Info}(p, n) = \mathrm{Info}(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
Consider the attribute Outlook:
Outlook = Sunny:
Info([2,3]) = entropy(2/5, 3/5) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971
Outlook = Overcast:
Info([4,0]) = entropy(1, 0) = -1log2(1) - 0log2(0) = 0
Since log2(0) is undefined, the term 0log2(0) is taken to be 0 by convention.
Outlook = Rain:
Info([3,2]) = entropy(3/5, 2/5) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
The entropy of the split on the attribute Outlook is:
$$\mathrm{Entropy}(\text{Outlook}) = \frac{5}{14}\mathrm{Info}(2,3) + \frac{4}{14}\mathrm{Info}(4,0) + \frac{5}{14}\mathrm{Info}(3,2)$$
= (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.694
Hence:
$$\mathrm{Gain}(\text{Outlook}) = \mathrm{Info}(9,5) - \mathrm{Entropy}(\text{Outlook})$$
= 0.940 - 0.694 = 0.246
Consider the attribute Humidity:
Humidity = High:
Info([3,4]) = entropy(3/7, 4/7) = -(3/7)log2(3/7) - (4/7)log2(4/7) = 0.985
Humidity = Normal:
Info([6,1]) = entropy(6/7, 1/7) = -(6/7)log2(6/7) - (1/7)log2(1/7) = 0.592
Entropy(Humidity) = (7/14) Info(3,4) + (7/14) Info(6,1)
= (7/14) * 0.985 + (7/14) * 0.592 = 0.789
Gain(Humidity) = Info(9,5) - Entropy(Humidity)
= 0.940 - 0.789 = 0.151
Proceeding in the same way for the remaining attributes gives:
Gain(Outlook) = 0.246
Gain(Humidity) = 0.151
Gain(Wind) = 0.048
Gain(Temperature) = 0.029
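The hand calculation can be cross-checked with a short script. This is a sketch, not the report's program: the English value names, the tuple layout, and the helper names `info` and `gain` are choices made here for illustration.

```python
from collections import Counter
from math import log2

# The 14 records of the golf table: (outlook, temperature, humidity, wind, play).
DATA = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]
ATTRIBUTES = ["outlook", "temperature", "humidity", "wind"]

def info(labels):
    """Info(p, n) generalized to a list of class labels (entropy in bits)."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(rows, column):
    """Gain(A) = Info(S) - sum over values of |S_i|/|S| * Info(S_i)."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[column] for row in rows):
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * info(subset)
    return info(labels) - remainder

for index, name in enumerate(ATTRIBUTES):
    print(f"Gain({name}) = {gain(DATA, index):.3f}")
```

Without intermediate rounding the script reports about 0.247 for Outlook and 0.152 for Humidity; the hand calculation rounds each entropy to three decimals first, which is why it arrives at 0.246 and 0.151. The ranking of the attributes is the same either way.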
Clearly, the attribute Outlook is chosen for the first split. Proceeding in the same way for the subsets gives the final decision tree:
[Figure: final decision tree. Root: Outlook. Sunny -> Humidity (High -> No, Normal -> Yes); Overcast -> Yes; Rain -> Wind (Strong -> No, Weak -> Yes)]
3. Converting a decision tree into rules
- Knowledge is represented in the form of IF-THEN rules.
- One rule is generated for each path from the root to a leaf.
- The attribute-value pairs along a path form a conjunction (the AND operation, written with a comma).
- The leaf nodes carry the class names.
For the example above we obtain the following rules:
R0: If {Outlook=Sunny, Humidity=High} Then {No}
R1: If {Outlook=Sunny, Humidity=Normal} Then {Yes}
R2: If {Outlook=Overcast} Then {Yes}
R3: If {Outlook=Rain, Wind=Strong} Then {No}
R4: If {Outlook=Rain, Wind=Weak} Then {Yes}
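The path-to-rule conversion can be sketched as a recursive traversal. The tuple-and-dictionary tree layout and the `extract_rules` helper are assumptions made for this illustration; the tree itself is the weather tree from the example.

```python
# Assumed layout: an internal node is (attribute, {value: subtree});
# a leaf is a class label string.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):
        # Leaf: the accumulated attribute=value pairs form the conjunction.
        return ["If {" + ", ".join(conditions) + "} Then {" + node + "}"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + (f"{attribute}={value}",)))
    return rules

for number, rule in enumerate(extract_rules(TREE)):
    print(f"R{number}: {rule}")
```

Because Python dictionaries preserve insertion order, the rules come out in the same order as R0 to R4.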
III. THE ID3 DECISION TREE LEARNING ALGORITHM
1. Model
The ID3 tree induction algorithm (ID3 for short) is a simple learning algorithm that has been, and continues to be, applied successfully in many fields.
- Input: a set of examples. Each example consists of attributes describing a situation or an object, together with its classification value.
- Output: a decision tree that correctly classifies the examples in the training data set and, it is hoped, also classifies correctly examples not yet seen.
2. The ID3 algorithm
ID3 builds the decision tree top-down. ID3 selects an attribute to test at the current node of the tree and uses this test to partition the set of examples; the algorithm then recursively builds a subtree for each partition. This continues until all members of a partition belong to the same class; that class becomes a leaf node of the tree.
Because the order of the tests is very important for building a simple decision tree, ID3 depends heavily on the criterion used to select the test placed at the root of the tree.
* ID3 builds the decision tree according to the following algorithm:
Function TreeNode(example_set, attribute_set)
begin
  Step 1:
  if every example in example_set belongs to the same class then
    return a leaf node labeled with that class
  else if attribute_set is empty then
    return a leaf node labeled with the disjunction of all classes in example_set
  else
  Step 2:
  begin
    2.1. Select an attribute P and make it the root of the current tree;
    2.2. Remove P from attribute_set;
    2.3. For each value V of P
    begin
      2.3.1. Create a branch of the tree labeled V;
      2.3.2. Put into partition_V the examples of example_set that have value V for attribute P;
      2.3.3. Call TreeNode(partition_V, attribute_set) and attach the result to branch V (recursion)
    end
  end
end
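The steps above can be sketched compactly in Python. Several choices here are assumptions rather than part of the pseudocode: examples are dictionaries with a `"class"` key, information gain is the selection criterion in step 2.1, a leaf reached with no attributes left is labeled with the majority class instead of a disjunction of classes, and branches are created only for values that actually occur in the examples.

```python
from collections import Counter
from math import log2

def entropy(rows):
    counts = Counter(row["class"] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attribute):
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attributes):
    classes = [row["class"] for row in rows]
    # Step 1: all examples in one class, or no attribute left to test.
    if len(set(classes)) == 1:
        return classes[0]
    if not attributes:
        return Counter(classes).most_common(1)[0][0]  # majority label (assumption)
    # Step 2.1: choose the attribute P with the highest information gain.
    best = max(attributes, key=lambda a: gain(rows, a))
    # Step 2.2: remove P; a new list is built so sibling branches are unaffected.
    remaining = [a for a in attributes if a != best]
    # Step 2.3: one branch per value V of P observed in the examples.
    branches = {}
    for value in sorted({row[best] for row in rows}):
        partition = [row for row in rows if row[best] == value]
        branches[value] = id3(partition, remaining)
    return (best, branches)

# Hypothetical training set in which wind alone determines the class.
rows = [
    {"wind": "weak", "humidity": "high", "class": "yes"},
    {"wind": "strong", "humidity": "high", "class": "no"},
    {"wind": "weak", "humidity": "normal", "class": "yes"},
    {"wind": "strong", "humidity": "normal", "class": "no"},
]
print(id3(rows, ["wind", "humidity"]))
```

Passing a fresh `remaining` list to each recursive call keeps the attribute set of one branch independent of what was removed in another, which is what step 2.2 intends.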
3. Remarks
3.1. Search space
Like other inductive learning methods, ID3 searches a space of hypotheses for one that fits the training data. The hypothesis space that ID3 searches is the set of possible decision trees. ID3 performs a simple-to-complex search using hill climbing: it starts from the empty tree and then progressively considers more complex hypotheses that classify the training examples correctly. The evaluation function that guides this hill-climbing search is the information gain measure.
Viewing ID3 as a search algorithm over a space of hypotheses leads to the following observations:
- ID3's hypothesis space of decision trees is a complete space of decision trees over the attributes given in the training set. This means that the space ID3 searches is certain to contain the decision tree being sought.
- While searching, ID3 maintains only a single current hypothesis. The algorithm therefore cannot represent all the different decision trees that are able to classify the available data correctly.
- Because ID3 uses all the examples at every step to make statistically based decisions, its search result is affected very little by a few erroneous (noisy) data items.
- During the search, ID3 tends to prefer shorter decision trees over longer ones.
3.2. The algorithm
Under the ID3 algorithm above, the tree is formed by progressively filtering the example set on attribute values. In a few rare cases, however, the resulting tree contains a node that has no effect on the decision. Consider the following example: a survey of 8 people, described by the attributes employment, age, and gender, used to reach a decision on marriage (yes or no).
The Gain values are, in order:
Consider All Attributes
  * Gain(All Attributes, employment) = 0.0487949406953985
  * Gain(All Attributes, age) = 0.311278124459133
  * Gain(All Attributes, gender) = 0
  ---> Best Gain : age (chosen as root P of the current tree, step 2.1)
Consider 18
  * Gain(18, employment) = 0
  * Gain(18, gender) = 0.251629167387823
  ---> Best Gain : gender (chosen as root P of the current tree, step 2.1)
Consider male
  All samples are negative => return root node labeled No (step 1)
Consider female
  * Gain(female, employment) = 0
  ---> Best Gain : employment
Consider no: is No
  Attributes are empty => return root node with the most common value, which is No (step 1)
Consider 22: is No (step 1)
Consider 30
  All samples are positive => return root node labeled Yes (step 1)
Consider 25
  All samples are positive => return root node labeled Yes (step 1)
Applying the ID3 algorithm above gives the following tree:
[Figure: decision tree produced by ID3 for the marriage survey example]
In the resulting tree, the attribute gender is a redundant node, because all of its values (branches) fall into the same class. The tree can therefore be shortened (optimized) by deleting this redundant node and attaching its label, as a leaf, to the branch of its parent. There are two ways to shorten the final decision tree: add a function that checks for and deletes redundant nodes as the last step of the ID3 algorithm, or build a separate optimization function that runs after the final decision tree has been formed.
4. Optimizing the final decision tree
- Optimizing the final decision tree is the process of shortening the tree by removing redundant nodes, as described in the previous section. Here a separate function is built that checks, in turn, every branch of the final decision tree from the root to the leaves. If the last node of a branch is redundant, its (leaf) label is assigned to its immediate parent. The function is then called again until the tree can no longer be shortened (the tree no longer changes).
- Input: the final decision tree produced by the ID3 algorithm.
- Output: the tree with redundant nodes removed.
- Code: presented in a later section.
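The idea of removing redundant nodes can be sketched as follows. This is not the program's own routine: the tuple-and-dictionary tree layout and the `optimize` helper are assumptions, and the sample tree is only a guess at the shape of the marriage-survey tree, based on the trace in section 3.2.

```python
def optimize(node):
    """Replace every node whose children are all leaves with the same label
    by that label, working from the leaves upward."""
    if isinstance(node, str):
        return node  # a leaf is already as small as it can be
    attribute, branches = node
    children = {value: optimize(child) for value, child in branches.items()}
    leaves = list(children.values())
    if all(isinstance(c, str) for c in leaves) and len(set(leaves)) == 1:
        return leaves[0]  # redundant node: every branch leads to the same class
    return (attribute, children)

# Assumed shape: under age 18 the gender test leads to "No" on both branches.
tree = ("age", {
    "18": ("gender", {"male": "No", "female": "No"}),
    "22": "No",
    "25": "Yes",
    "30": "Yes",
})
print(optimize(tree))
```

Because the children are simplified before their parent is examined, a single call reaches the same unchanged tree that the repeat-until-stable procedure described above converges to.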
5. When to use ID3
ID3 is a simple learning algorithm, but it is suitable only for the class of problems that can be represented symbolically. For this reason the algorithm belongs to the symbol-based approach to problem solving.
The training data set consists of examples described by attribute-value pairs, such as "Wind: strong" or "Wind: weak" in the "Play tennis" example used throughout this chapter. Each example also has a classification attribute, such as play_tennis, which must take discrete values, such as yes and no.
However, unlike some other algorithms in this approach, ID3 uses the training examples in a statistical way, so it has the advantage of being little affected by a few noisy data items. The training data set may therefore contain errors, or may lack values for some attributes. This problem can be handled in the preprocessing stage.
IV. BUILDING A DATA MINING PROGRAM BASED ON THE ID3 ALGORITHM
The program is written in C# with Visual Studio 2010 and .NET Framework 4.0.
[Screenshot: program interface after loading the marriage survey example data and running the ID3 algorithm]
[Screenshot: program interface after loading the weather forecast example data and running the ID3 algorithm]
[Screenshot: interface for running the optimization task]
The source code project of the program is stored at:
http://www.mediafire.com/folder/yyxek7yc9s25f/CSTT_UD
Code of the program's main class:
using System;
using System.Collections.Generic;

namespace TieuLuanCoSoTriThuc
{
    // TreeNode and Attribute are defined elsewhere in the project.
    class ID3_ALG
    {
        // Each example is one row of attribute values; the last column is the class ("yes" or "no").
        List<List<string>> Examples;
        List<Attribute> Attributes;
        public List<string> RuleID3 = new List<string>();
        TreeNode _tree;
        int _depth;
        public int RuleCount;
        public int temp;
        string _solution;
        string _solution1;
        string _Rule;

        internal TreeNode Tree { get { return _tree; } set { _tree = value; } }
        public int Depth { get { return _depth; } set { _depth = value; } }
        public string Solution { get { return _solution; } set { _solution = value; } }
        public string Solution1 { get { return _solution1; } set { _solution1 = value; } }
        public string Rule { get { return _Rule; } set { _Rule = value; } }

        public ID3_ALG(List<List<string>> Examples, List<Attribute> Attributes)
        {
            this.Examples = Examples;
            this.Attributes = Attributes;
            this.Tree = null;
            Depth = 0;
        }

        // Compute the entropy of a set with the given positive and negative counts.
        private double GetEntropy(int Positives, int Negatives)
        {
            if (Positives == 0) return 0;
            if (Negatives == 0) return 0;
            int total = Negatives + Positives;
            double RatePositives = (double)Positives / total;
            double RateNegatives = (double)Negatives / total;
            return -RatePositives * Math.Log(RatePositives, 2) - RateNegatives * Math.Log(RateNegatives, 2);
        }

        // Compute Gain(bestat, A); bestat names the branch being examined and is used only in the trace.
        private double Gain(List<List<string>> Examples, Attribute A, string bestat)
        {
            int CountPositives = 0;
            int[] CountPositivesA = new int[A.Value.Count];
            int[] CountNegativeA = new int[A.Value.Count];
            int Col = Attributes.IndexOf(A);
            for (int i = 0; i < Examples.Count; i++)
            {
                int j = A.Value.IndexOf(Examples[i][Col].ToString());
                if (Examples[i][Examples[0].Count - 1] == "yes")
                {
                    CountPositives++;
                    CountPositivesA[j]++;
                }
                else
                {
                    CountNegativeA[j]++;
                }
            }
            double result = GetEntropy(CountPositives, Examples.Count - CountPositives);
            for (int i = 0; i < A.Value.Count; i++)
            {
                double RateValue = (double)(CountPositivesA[i] + CountNegativeA[i]) / Examples.Count;
                result = result - RateValue * GetEntropy(CountPositivesA[i], CountNegativeA[i]);
            }
            Solution = Solution + "\n * Gain(" + bestat + "," + A.Name + ") = " + result.ToString();
            return result;
        }

        // The ID3 algorithm.
        private TreeNode ID3(List<List<string>> Examples, List<Attribute> candidates, string bestat)
        {
            Solution = Solution + " Consider " + bestat + " ";
            if (CheckAllPositive(Examples))
            {
                Solution += "\n All samples are positive => return root node labeled Yes";
                return new TreeNode(new Attribute("Yes"));
            }
            if (CheckAllNegative(Examples))
            {
                Solution += "\n All samples are negative => return root node labeled No";
                return new TreeNode(new Attribute("No"));
            }
            if (candidates.Count == 0)
            {
                return new TreeNode(new Attribute(GetMostCommonValue(Examples)));
            }
            Attribute BestAttribute = GetBestAttribute(Examples, candidates, bestat);
            int LocationBA = Attributes.IndexOf(BestAttribute);
            TreeNode Root = new TreeNode(BestAttribute);
            // Work on a copy so that removing the chosen attribute does not affect sibling branches.
            List<Attribute> remaining = new List<Attribute>(candidates);
            remaining.Remove(BestAttribute);
            for (int i = 0; i < BestAttribute.Value.Count; i++)
            {
                List<List<string>> Examplesvi = new List<List<string>>();
                for (int j = 0; j < Examples.Count; j++)
                {
                    if (Examples[j][LocationBA].ToString() == BestAttribute.Value[i].ToString())
                        Examplesvi.Add(Examples[j]);
                }
                if (Examplesvi.Count == 0)
                {
                    // No example has this value: attach a leaf with the majority label of the
                    // parent set and continue with the remaining values.
                    Solution += "\n Attributes are empty => return root node with the most common value ";
                    Root.AddNode(new TreeNode(new Attribute(GetMostCommonValue(Examples))));
                }
                else
                {
                    Solution += "\n";
                    Root.AddNode(ID3(Examplesvi, remaining, BestAttribute.Value[i]));
                }
            }
            return Root;
        }

        // Return the attribute with the highest Gain (bestat).
        private Attribute GetBestAttribute(List<List<string>> Examples, List<Attribute> candidates, string bestat)
        {
            double MaxGain = Gain(Examples, candidates[0], bestat);
            int Max = 0;
            for (int i = 1; i < candidates.Count; i++)
            {
                double GainCurrent = Gain(Examples, candidates[i], bestat);
                if (MaxGain < GainCurrent)
                {
                    MaxGain = GainCurrent;
                    Max = i;
                }
            }
            Solution = Solution + "\n\t ---> Best Gain : " + candidates[Max].Name;
            return candidates[Max];
        }

        // Return the most common value of the target column.
        private string GetMostCommonValue(List<List<string>> Examples)
        {
            int CountPositive = 0;
            for (int i = 0; i < Examples.Count; i++)
            {
                if (Examples[i][Examples[0].Count - 1] == "yes")
                    CountPositive++;
            }
            int CountNegative = Examples.Count - CountPositive;
            string Label = (CountPositive > CountNegative) ? "Yes" : "No";
            Solution = Solution + " is " + Label;
            return Label;
        }

        // Check whether every example in the set is positive.
        private bool CheckAllPositive(List<List<string>> Examples)
        {
            for (int i = 0; i < Examples.Count; i++)
            {
                if (Examples[i][Examples[0].Count - 1].ToString() == "no")
                    return false;
            }
            return true;
        }

        // Check whether every example in the set is negative.
        private bool CheckAllNegative(List<List<string>> Examples)
        {
            for (int i = 0; i < Examples.Count; i++)
            {
                if (Examples[i][Examples[0].Count - 1] == "yes")
                    return false;
            }
            return true;
        }

        // Build the tree.
        public void GetTree()
        {
            Solution = "";
            List<Attribute> at = new List<Attribute>();
            for (int i = 0; i < Attributes.Count; i++)
            {
                at.Add(Attributes[i]);
            }
            Tree = ID3(Examples, at, "All Attributes");
            Depth = GetDepth(Tree);
        }

        // Find the rules: one rule per path from the root to a leaf.
        public void SearchRule(TreeNode Rule)
        {
            if (Rule.Attributes.Value.Count == 0)
                return; // a leaf contributes no further conditions
            Solution1 += Rule.Attributes.Name + " = ";
            string temp1 = Solution1 + " ";
            for (int i = 0; i < Rule.Attributes.Value.Count; i++)
            {
                string temp2 = temp1 + Rule.Attributes.Value[i] + ", ";
                if (Rule.Childs[i].Attributes.Value.Count == 0)
                {
                    RuleCount++;
                    Solution1 = temp2 + "} THEN {" + Rule.Childs[i].Attributes.Label + "}";
                    RuleID3.Add(Solution1);
                }
                else
                {
                    Solution1 = temp2;
                    SearchRule(Rule.Childs[i]);
                }
            }
        }

        public void GetRule(TreeNode tree)
        {
            Solution1 = "";
            Rule += " RULES EXTRACTED FROM THE ID3 DECISION TREE \n\n";
            SearchRule(tree);
            for (int i = 0; i < RuleCount; i++)
                Rule += " Rule [" + i + "]: IF {" + RuleID3[i] + "\n";
            Rule += "\n Total number of rules: " + RuleCount;
            RuleCount = 0;
        }

        // Optimize the decision tree.
        public bool CheckAllLabelNegative(TreeNode tree)
        {
            int test = 0;
            for (int i = 0; i < tree.Attributes.Value.Count; i++)
            {
                if (tree.Childs[i].Attributes.Label == "No")
                    test++;
            }
            return (test > 1) && (test == tree.Attributes.Value.Count);
        }

        public bool CheckAllLabelPositive(TreeNode tree)
        {
            int test = 0;
            for (int i = 0; i < tree.Attributes.Value.Count; i++)
            {
                if (tree.Childs[i].Attributes.Label == "Yes")
                    test++;
            }
            return (test > 1) && (test == tree.Attributes.Value.Count);
        }

        // Empty a node so that it becomes a leaf.
        public void DeleteTree(TreeNode tree)
        {
            tree.Attributes.Name = "";
            //tree.Attributes.Label=null;
            tree.Attributes.Value.Clear();
        }

        public void OptimizeTree(TreeNode tree)
        {
            for (int i = 0; i < tree.Attributes.Value.Count; i++)
            {
                if (tree.Attributes.Value.Count > 1)
                {
                    if (CheckAllLabelPositive(tree))
                    {
                        tree.Attributes.Label = "Yes";
                        DeleteTree(tree);
                    }
                    else
                        OptimizeTree(tree.Childs[i]);
                    if (CheckAllLabelNegative(tree))
                    {
                        tree.Attributes.Label = "No";
                        DeleteTree(tree);
                    }
                    else
                        OptimizeTree(tree.Childs[i]);
                }
            }
        }

        // Return the depth of the tree.
        private int GetDepth(TreeNode tree)
        {
            int depth;
            if (tree.Childs.Length == 0)
                return 1;
            else
            {
                depth = GetDepth(tree.Childs[0]);
                for (int i = 1; i < tree.Childs.Length; i++)
                {
                    int depthchild = GetDepth(tree.Childs[i]);
                    if (depth < depthchild)
                        depth = depthchild;
                }
                depth++;
            }
            return depth;
        }
    }
}