VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
PHAN THI THUAN
EVENT EXTRACTION FROM VIETNAMESE NEWS TEXT
Major: Information Technology
Specialization: Information Systems
Code: 60480104
MASTER'S THESIS IN INFORMATION TECHNOLOGY
SCIENTIFIC SUPERVISOR: Dr. NGUYEN TRI THANH
HANOI - 2014
ACKNOWLEDGEMENTS
First of all, I would like to express my deepest thanks and gratitude to my teacher, Dr. Nguyen Tri Thanh, who devotedly instructed, guided, encouraged and helped me throughout the work on this graduation thesis.
I would like to thank my teacher, Assoc. Prof. Dr. Ha Quang Thuy, who wholeheartedly helped, encouraged and advised me during the time I studied and worked at the Knowledge Technology Laboratory (KTLab).
I would like to thank the senior colleagues and students at the Knowledge Technology Laboratory (KTLab), University of Engineering and Technology, who supported me greatly while I carried out this thesis.
Finally, I want to thank my family and friends, the loved ones who were always by my side, caring for and encouraging me throughout my studies and the work on this graduation thesis.
My sincere thanks!
Hanoi, 20 June 2014
Student
Phan Thi Thuan
DECLARATION
I declare that the solution for event extraction from Vietnamese news text presented in this thesis was carried out by me under the supervision of Dr. Nguyen Tri Thanh.
I have fully cited the reference materials and the related domestic and international research. All material taken from related studies is clearly attributed in the list of references of this thesis.
Hanoi, June 2014
Author of the thesis
Phan Thi Thuan
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
INTRODUCTION
Chapter 1. INTRODUCTION TO THE TOPIC
1.1. THE PROBLEM OF INFORMATION EXTRACTION FROM TEXT
1.2. OVERVIEW OF EVENTS
1.2.1. Definition of an event
1.2.2. Event extraction
1.3. EVENT EXTRACTION FROM VIETNAMESE NEWS TEXT
1.3.1. The accident event extraction problem
1.3.2. Event detection
1.3.3. Event extraction
1.4. SIGNIFICANCE OF THE ACCIDENT EVENT EXTRACTION PROBLEM
1.4.1. Scientific significance
1.4.2. Practical significance
1.5. CONCLUSION
Chapter 2. SOME APPROACHES
2.1. THE RULE-BASED APPROACH
2.1.1. Lexico-syntactic patterns
2.1.2. Lexico-semantic patterns
2.1.3. Form and representation of rules
2.2. THE MACHINE-LEARNING-BASED APPROACH
2.3. THE HYBRID APPROACH COMBINING RULES AND MACHINE LEARNING
2.5. SUMMARY
Chapter 3. PROPOSED MODEL FOR ACCIDENT EVENT EXTRACTION
3.1. CHARACTERISTICS OF ACCIDENT EVENTS
3.2. PROBLEM STATEMENT
3.3. MODEL FOR DETECTING AND EXTRACTING ACCIDENT EVENTS
3.3.1. Proposed method
3.3.2. Model for detecting and extracting accident events
3.4. SOLVING THE EVENT DETECTION PROBLEM AND THE ACCIDENT EVENT EXTRACTION PROBLEM
3.4.1. Problem 1: Accident event detection (phase 1)
3.4.1.1. Problem statement
3.4.1.2. Building the rule set
3.4.1.3. Building the classification model
3.4.2. Problem 2: Accident event extraction (phase 2)
3.4.2.1. Problem statement
3.4.2.2. Time extraction
3.4.2.3. Location extraction
3.4.2.4. Casualty extraction
3.4.2.5. Extraction of the vehicle causing the accident
3.5. SUMMARY
Chapter 4. EXPERIMENTS AND EVALUATION
4.1. ENVIRONMENT AND TOOLS USED IN THE EXPERIMENTS
4.2. BUILDING THE DATASET
4.2.1. Data collection
4.2.2. Data preprocessing
4.3. EVALUATION OF THE EVENT DETECTION PROCESS
4.3.1. Evaluation of the data filter
4.3.2. Evaluation of the classification process
4.4. EVALUATION OF THE EVENT EXTRACTION PROCESS
4.4.1. Experiment without the classifier
4.4.2. Experiment with the classifier
4.4.3. Remarks
4.5. ERROR ANALYSIS
4.5.1. Error analysis of the event detection process
4.5.2. Error analysis of the event extraction process
4.6. SOME RESULTS OF ANALYZING THE EVENTS
Chart 4.3. Number of accidents by province
4.7. SUMMARY
REFERENCES
LIST OF FIGURES
Figure 3.1. The process of detecting and extracting accident events
Figure 3.2. The event detection component
Figure 3.3. A news headline containing words related to vehicles
Figure 3.4. A headline containing no words related to vehicles
Figure 3.5. The event extraction component
Figure 4.1. Filter error when the data is outside the traffic accident domain
LIST OF TABLES
Table 3.1. Vehicles
Table 4.1. Hardware configuration
Table 4.2. Software tools used
Table 4.3. Components of a news item
Table 4.4. Error rate of the data filtering process
Table 4.5. Evaluation of the classification results
Table 4.6. Evaluation of the extraction process, data not passed through the classifier
Table 4.7. Evaluation of the extraction process, data passed through the classifier
Table 4.8. Some errors in the extraction process
INTRODUCTION
Information extraction (Information Extraction - IE), and event extraction (Event Extraction - EE) in particular, is a subfield of data mining (Data Mining - DM). In recent years, event extraction has attracted considerable attention from researchers around the world and has produced many practical results. Event extraction can be applied to many data domains, such as economics, culture, health care, society (for example, information about traffic accidents), politics, and so on.
Online newspapers regularly publish annual accident statistics. For example, http://binhduong.gov.vn reported that on the morning of 3 January 2013 the Government held an online conference, chaired by Deputy Prime Minister Nguyen Xuan Phuc, to review traffic order and safety work in 2012 and to roll out the tasks for 2013. At the conference, the National Traffic Safety Committee reported that in 2012 there were 36,376 traffic accidents nationwide, killing 9,838 people and injuring 38,060. Likewise, according to http://hanoimoi.com.vn, on 31 December 2013 Deputy Prime Minister Nguyen Xuan Phuc, Chairman of the National Traffic Safety Committee, chaired an online conference with ministries, agencies and localities to review traffic order and safety work in 2013 and to roll out the tasks for 2014. According to the Committee's statistics, in 2013 there were 29,385 traffic accidents nationwide, killing 9,369 people and injuring 29,500.
These annual statistics show that the number of accidents is still very high, and so are the numbers of deaths and injuries. At the same time, news about accidents is reported fairly completely and promptly in online newspapers. Moreover, event extraction is developing rapidly, so it can be used to extract useful information from accident news; the results can then be aggregated into statistics that help administrators manage traffic and help people travel safely. For these reasons the author chose to study the topic "Event extraction from Vietnamese news text", with accident events as the target data domain. The thesis is divided into four chapters:
Chapter 1. Introduction to the topic
This chapter presents the basics of the event extraction problem in the context of the information explosion on the Internet. It also states the scientific significance, the practical significance and the applications of traffic accident event extraction for Vietnamese text.
Chapter 2. Some approaches
This chapter presents the approaches to the event extraction problem: the rule-based approach, the machine-learning-based approach, and the hybrid approach that combines rules and machine learning, with remarks on each. From these, the thesis identifies the approach suited to accident event extraction.
Chapter 3. Proposed model for accident event extraction
This chapter states and describes the overall model for accident event extraction. It then states and describes in detail the models and solutions for the two subproblems: event detection and event extraction.
Chapter 4. Experiments and evaluation
This chapter describes the experiments and evaluates the proposed results on the two subproblems, event detection and event extraction. Three measures are used for the event detection phase: precision (P), recall (R) and the F1 score. For the event extraction phase, the results are compared against a manual evaluation. The chapter also gives statistics and charts for the extracted attributes.
Conclusion: presents the results achieved by the thesis, its limitations, and directions for future work.
Chapter 1. INTRODUCTION TO THE TOPIC
This chapter addresses the following: an introduction to the information extraction problem, an overview of events, event extraction from Vietnamese news text (the news in question concerns accidents), and the scientific and practical significance of the accident event extraction problem.
1.1. THE PROBLEM OF INFORMATION EXTRACTION FROM TEXT
According to Douglas E. Appelt, information extraction (Information Extraction - IE) can be regarded as lying between information retrieval (Information Retrieval - IR) and text understanding (Text Understanding - UT) [2]. Unlike information retrieval, which focuses only on the relevant pieces of information in a text without attempting to understand it, information extraction is also concerned with the relevant events in the text and represents them as templates. Unlike text understanding, which focuses on only a small part of the text (a sentence or paragraph), information extraction is concerned with the whole content of the text.
According to Peshkin and Pfeffer [11], information extraction can be defined as the task of filling templates with information from previously unseen data in a predefined domain. The goal of information extraction is to obtain from text the salient information about events, entities and relations. Information extraction can therefore be seen as a technique for retrieving knowledge from the large data sources on the Internet and representing it as formatted, useful information.
The problem of information extraction from text can be stated as follows:
- Input: arbitrary text data
- Output: useful information in structured form.
1.2. OVERVIEW OF EVENTS
Event extraction, whose role is to extract meaningful information from large data sets, has received a great deal of attention and research investment from the scientific community.
In 1987 the Message Understanding Conference (MUC) was held with the support of the research funding agency of the US Department of Defense, and the concept of an event was mentioned for the first time. Many further conferences followed, forming the MUC series. The information of interest differed from conference to conference, but in every case it was extracted from data about crises. Typical topics in the data were crime, terrorism and bombings. One of the major contributions of MUC was template-based information extraction (scenario template). The organizers defined the templates and the participating teams had to fill them in automatically. The extracted events comprised information such as organizations, participants (people, objects, incidents), time, location and quantity. The precision and recall of the systems taking part in MUC were in the range of 50% to 60% [5].
The Topic Detection and Tracking (TDT) program, held from 1997, attracted many research groups from universities. The program was coordinated by the US National Institute of Standards and Technology (NIST) and DARPA and addressed the problems of detecting, tracking and linking events. Participating groups included the CMU group of Carnegie Mellon University, the BBN group from BBN Technologies, the DRAGON group of the Dragon company, and the UPENN group of the University of Pennsylvania. The main TDT tasks were Story Segmentation, Topic Tracking, Topic Detection, First Story Detection and Link Detection.
The Automatic Content Extraction (ACE) program of the University of Pennsylvania also attracted much interest from the information extraction and event extraction research communities. The program focused on English, Chinese and Arabic. The extracted information comprised entities, relations between entities, and the events in which they take part.
Information extraction in general, and event extraction in particular, is thus an important and timely problem that has received a great deal of attention from the scientific community. The following sections clarify the definition of an event [1.2.1] and event extraction [1.2.2].
1.2.1. Definition of an event
Event extraction was first introduced as an important topic at the Message Understanding Conference (MUC) in 1987 [21]. In MUC, an event is defined as follows: "an event has an actor, a time, a place, and an impact on the surrounding environment".
In the ACE program, George R. Doddington and colleagues define an event as follows: "an event is an action carried out by participants" [22]. ACE divides events into 8 types: LIFE, MOVEMENT, TRANSACTION, BUSINESS, CONFLICT, CONTACT, PERSONNEL (hiring and leaving jobs) and JUSTICE. Each event type is further divided into subtypes. For example, LIFE has subtypes such as BE-BORN, INJURE and DIE, while PERSONNEL has START-POSITION, END-POSITION, NOMINATE, ELECT, and so on.
The studies listed above all agree that an event can be regarded as a template consisting of several attributes (elements). Event extraction is concerned with how to fill each attribute with the appropriate information from the source text.
1.2.2. Event extraction
What do event extraction and information extraction have in common? Event extraction can be said to be a subfield of information extraction. Whereas information extraction is concerned only with discrete data (person names, locations, numbers, and so on), event extraction pays more attention to the structure and relatedness of the information within an event, which makes it easy for the reader to infer meaningful information. Consider the example: "on the morning of 30/4, on Xuan Thuy Street in the capital Hanoi, a serious accident left 2 people on a motorbike badly injured. The initial cause is believed to be a taxi driver speeding up and crashing straight into the motorbike traveling in the same direction". Here information extraction returns discrete results such as 30/4, Hanoi, 2 or taxi, whereas event extraction is concerned with a tuple of attributes that represents the event: {30/4, Hanoi, 2 people injured, taxi}. Clearly the tuple is more useful and more complete than the discrete pieces of information.
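The contrast between discrete results and an event tuple can be sketched in a few lines. The record below is only an illustration under stated assumptions: the class name `AccidentEvent` and its field names are hypothetical and are not taken from the thesis; the values are those of the example sentence.

```python
from dataclasses import dataclass, asdict

# Plain information extraction: a flat list of unrelated values.
scattered_items = ["30/4", "Hanoi", "2", "taxi"]


@dataclass(frozen=True)
class AccidentEvent:
    """Hypothetical structured record for one accident event."""
    date: str
    location: str
    casualties: str
    vehicle: str


# Event extraction: the same values bound to named attributes of one event.
event = AccidentEvent(
    date="30/4",
    location="Hanoi",
    casualties="2 people injured",
    vehicle="taxi",
)

print(asdict(event))
```

Binding each value to a named attribute is what makes later aggregation (for example, counting accidents per location) straightforward.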
In general, event extraction from text takes unstructured text as input and produces knowledge represented as structured information as output. This information is very useful for data exploitation such as statistics, monitoring systems and decision support systems. Event extraction can be applied to a specific data domain such as traffic accidents, tourism packages or epidemics, and it yields the information surrounding an event, which typically includes time, location, quantity, and so on.
According to Grishman and colleagues, event extraction is a difficult problem because of natural language processing (Natural Language Processing - NLP) issues and the characteristics of the data [21]. Event extraction clearly depends heavily on NLP, in particular on named entity recognition (Named Entity Recognition - NER). In addition, the input data for event extraction is very diverse, which affects the effectiveness of the extraction process.
1.3. EVENT EXTRACTION FROM VIETNAMESE NEWS TEXT
1.3.1. The accident event extraction problem
Information extraction (Information Extraction - IE), and event extraction (Event Extraction - EE) in particular, is a subfield of data mining (Data Mining - DM). In recent years event extraction has attracted considerable attention from researchers. It is a good step toward mining knowledge from text.
The information to extract about an accident event includes: time of day, date (dd/mm/yyyy), day of the week, month and year, the location of the accident, the number of casualties, the vehicles involved, the vehicle that caused the accident, the age of the driver who caused the accident, occupation, the terrain where the accident happened, the cause of the accident, and so on. The results of the extraction process serve as input to downstream systems, for example to compute statistics and to visualize on a map of Vietnam the hot spots where accidents often occur, the times of day with a higher accident risk, the months or seasons of the year with a higher risk of traffic accidents, and the age groups at risk. This helps administrators take measures to reduce the number of accidents, place warning signs where the accident risk is high, and educate people about road safety. It also helps people learn how to protect themselves so that they do not become victims of accidents.
The accident event extraction problem is stated as follows:
Input: an arbitrary news item from an online newspaper
Output: the extracted information of the accident event (if any).
The accident event extraction problem is divided into two subproblems. The first, accident event detection, takes an arbitrary news item from an online newspaper as input and must decide whether it is an accident event. The result of event detection is the input to the extraction subproblem. The information extracted from an accident event can include the time, the location of the accident, the number of casualties, the vehicle causing the accident, the hour of the day at which the accident happened, the age and gender of the driver involved, the terrain where the accident happened, and so on. Within the scope of this topic, the author focuses on extracting the following tuple of attributes: (time, location of the accident, number of casualties, vehicle causing the accident).
1.3.2. Event detection
The event detection problem answers the question "how can we detect that a text contains an accident event?" That is, given a text as input, how do we decide whether it contains an accident event? According to Grishman and colleagues [13], event detection is an unsupervised learning process; the authors use keywords to decide whether a text contains a disease outbreak event. The two keywords they use are "outbreak of" and "died from". According to Doan and colleagues [14], event detection can be treated as a supervised learning process. In their study the authors use document classification. The classifier is based on a set of labeled data. After training, the classifier decides whether an input text contains a disease outbreak event.
The studies of Grishman and colleagues and of Doan and colleagues show that there are different ways to solve the disease event detection problem. These methods can therefore be applied to the detection of traffic accident events, together with building a keyword set or a set of labeled data suited to traffic accident events.
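The keyword-based detection described above can be sketched as follows. This is a minimal illustration, not the method of the cited studies or of this thesis: the function name `contains_event` is hypothetical, and only the two keywords quoted in the text ("outbreak of", "died from") are used. For accident news, a keyword set built from surveyed Vietnamese data would be substituted.

```python
# The two disease-outbreak keywords quoted in the text.
DISEASE_KEYWORDS = ("outbreak of", "died from")


def contains_event(text, keywords=DISEASE_KEYWORDS):
    """Return True if any keyword occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


print(contains_event("An outbreak of cholera was reported in the region."))
print(contains_event("Gold prices rose slightly today."))
```

A keyword filter of this kind needs no labeled data, which is its main attraction; the supervised alternative replaces `contains_event` with a trained classifier.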
1.3.3. Event extraction
The task of event extraction is to answer the question "how can we extract the attributes of an event?" There are many methods for event extraction. Among them is the rule-based method (unsupervised learning), which was used very early to solve this problem [13]. Extraction with this method typically uses rules derived from a survey of the data to pull out the attributes of an event.
Another method uses machine learning and NLP techniques to solve the event extraction problem. This process usually uses Named Entity Recognition (NER) to obtain the basic attributes of the event (time, location, person names, and so on) and then combines these attributes into an event [14].
Thus event extraction in general, and accident event extraction in particular, can be divided into two subproblems: event detection and event extraction. The techniques applied to these two subproblems are described in detail in Chapter 3.
1.4. SIGNIFICANCE OF THE ACCIDENT EVENT EXTRACTION PROBLEM
1.4.1. Scientific significance
The event extraction problem is of interest to many researchers. The results of accident event extraction form the basis for data exploitation such as statistics, trend prediction, monitoring systems and decision support.
1.4.2. Practical significance
The results of accident event extraction are input data for exploitation: statistics on the times of day at which accidents often occur (in the morning, at the start of office hours, at noon, at rush hour, or at night), the months of the year in which accidents often occur (festival season, holiday season or rainy season), the vehicles often involved in accidents (buses, trucks, taxis, coaches, and so on), the age of the drivers (18-20, over 60, or other ages), the occupation of the drivers (self-employed, motorbike taxi driver, civil servant, and so on), and the terrain where accidents happen (bends, intersections, slopes, slippery roads, rough roads, expressways, and so on). From these statistics, the sensitive locations where accidents often occur can be visualized on a map.
This gives people more knowledge for traveling safely, such as the time periods and road sections in which accidents often occur, and helps them avoid situations that could lead to an accident.
It also helps users who want to search for information related to traffic accidents.
Furthermore, the results can give administrators an objective view of the traffic accident situation and support preventive measures such as repairing and upgrading infrastructure, raising public awareness of road safety, and placing warning signs where the accident risk is high so that drivers slow down and watch the road carefully.
In addition, the statistics obtained from accident event extraction help administrators compare the scale and severity of accidents across time periods and thus assess in which direction the accident situation is developing.
1.5. CONCLUSION
This chapter presented the basics of the event extraction problem, focusing on the basic concepts of event extraction in general and of accident event extraction in particular. It also covered the two basic subproblems of accident event extraction, namely event detection and event extraction, and stated the scientific significance, the practical significance and the difficulties of solving the accident event extraction problem. Chapter 2 presents the approaches to solving the detection and extraction of accident events.
Chapter 2. SOME APPROACHES
Hogenboom F. and colleagues [4] provide a survey of three basic approaches suited to event extraction from text: the rule-based approach, also called the knowledge-driven approach; the machine learning approach, also called the data-driven approach; and the combination of the two, also called the hybrid approach.
The first approach is knowledge-driven. It typically uses the knowledge of domain experts (usually language experts and data domain experts) to produce a rule set, which requires reading and understanding the data before the rules are written. The resulting rule set is typically used to extract the attributes of an event. The second approach is data-driven. It derives knowledge from a large data set to solve the problem of extracting the information in an event, usually with statistical methods and mathematical models. A typical example of this approach is named entity recognition (NER). The last approach combines the two.
This chapter presents the approaches to the problem "Traffic accident event extraction": the rule-based approach, the machine-learning-based approach, and the hybrid approach combining rules and machine learning. At the end, the author gives remarks and selects the approach used to solve the problem in Chapter 3. Each approach is described in detail in sections [2.1], [2.2] and [2.3].
2.1. THE RULE-BASED APPROACH
The rule-based approach is also called the knowledge-driven approach. It typically uses the knowledge of domain experts (usually language experts and data domain experts) to produce a rule set, which requires reading and understanding the data before the rules are written.
2.1.1. Lexico-syntactic patterns
Syntactic rules, sometimes called lexico-syntactic patterns, are among the earliest methods used for event extraction. The patterns are produced from expert knowledge in the form of rules [4]. Typical of this method are rules expressed as regular expressions.
Syntactic rules combine character-level representations and syntactic information with regular expressions. Once the regular expressions have been built, they are matched against the input text to extract the information corresponding to each attribute. Sometimes syntactic rules take a simpler form, namely keywords. Syntactic rule sets have been used for event extraction in [7], [5], [6]. Nishihara and colleagues use three keywords, place, object and action, to represent an event extracted from blogs [10]. In the biomedical domain, Yakushiji and colleagues use a parser combined with a grammar to identify relations and events [16]. In the domain of political news, Aone and colleagues use syntactic rules to extract event information [24]. Syntactic rules identify arguments within the text; they do not identify the meaning of the text.
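A minimal sketch of matching regular-expression rules against input text is given below. It is an illustration under stated assumptions, not the rule set of the thesis: the two patterns, the function name `extract_attributes` and the English rendering of the example sentence are all assumptions; rules for Vietnamese news would use Vietnamese keywords instead.

```python
import re

# Hypothetical rules: a dd/mm or dd/mm/yyyy date, and "<number> people ... injured".
DATE_RULE = re.compile(r"\b(\d{1,2})/(\d{1,2})(?:/(\d{4}))?\b")
CASUALTY_RULE = re.compile(
    r"(\d+)\s+(?:people|person)\b[^.;]*?\b(injured|killed|dead)\b"
)


def extract_attributes(text):
    """Match each rule against the text and collect the attributes found."""
    attributes = {}
    date_match = DATE_RULE.search(text)
    if date_match:
        attributes["date"] = date_match.group(0)
    casualty_match = CASUALTY_RULE.search(text)
    if casualty_match:
        attributes["casualties"] = int(casualty_match.group(1))
        attributes["outcome"] = casualty_match.group(2)
    return attributes


sentence = (
    "On the morning of 30/4, on Xuan Thuy Street in Hanoi, a serious accident "
    "left 2 people on a motorbike badly injured."
)
print(extract_attributes(sentence))
```

Each rule fills one attribute; text that matches no rule simply leaves that attribute missing, which reflects the high-precision, lower-recall behavior usually attributed to rule sets.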
When rules are used to extract events, it is sometimes necessary to extract concepts with a particular meaning or the relations between the extracted components. Syntactic rules cannot do this. The usual rule-based solution is to use semantic rules (lexico-semantic patterns), which are described in section [2.1.2].
2.1.2. Lexico-semantic patterns
Event extraction sometimes has to extract concepts with a particular meaning or the relations between the extracted components. The usual rule-based solution is to use semantic rules. Semantic rules are not simply words expressed as regular expressions; they are words together with the relations between them.
Semantic rules are used for many purposes and in many domains. For example, Li Fang and colleagues use semantic rules to extract information from the stock market [25]; Cohen and colleagues [17] use the concept of a recognizer in the biomedical domain to extract biomedical information from a data set; Capet and colleagues use semantic patterns to extract events for an early warning system [27]; and Vargas-Vera and Celjuska propose a framework for recognizing events, focused on the news of the Knowledge Media Institute (KMI) [26].
Event extraction from unstructured text can be applied in many domains, such as finance, the stock market, biomedicine and legal news. The discussion would be incomplete without a closer look at the form and representation of rule sets in entity extraction, which is presented in section [2.1.3].
2.1.3. Form and representation of rules
According to Information Extraction by Sunita Sarawagi [1], a basic rule has the form "Contextual Pattern → Action". A contextual pattern consists of one or more labeled patterns that capture properties of one or more entities and the context in which they appear in the text. A labeled pattern consists of a pattern, roughly a regular expression defined over features of the tokens in the text, and an optional label. The features can be any property of the token, of its context, or of the document in which the token appears.
Most rule-based systems are cascaded: rules are applied in several phases, and each phase attaches to the input document an annotation that serves as an input feature for the next phase. For example, an extractor for people's contact addresses can be built from two phases of rules. The first phase labels tokens with entity labels such as person names, geographic locations such as street names and city names, and email addresses. The second phase locates address blocks, using the output of the first phase as additional features.
1/. Features of tokens
Each token in a sentence is typically associated with a set of features obtained through one or more of the following criteria:
- The string representing the token.
- The orthography type of the token, which can take values such as capitalized word, lowercase word, mixed-case word, number, special symbol, space, punctuation, and so on.
- The part of speech of the token.
- The list of dictionaries in which the token appears. This can often be refined further to indicate whether the token matches the start, end or middle word of a dictionary entry. For example, a token such as "New" that matches the first word of an entry in a dictionary of city names is associated with the feature "Dictionary-Lookup = start of city".
- Annotations attached by earlier processing steps.
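The token features listed above can be sketched as a small feature function. This is an illustrative assumption, not code from [1] or from the thesis: the function names, the feature keys and the tiny city dictionary are hypothetical, and the part-of-speech and prior-annotation features are left out because they would need an external tagger.

```python
import string

# Hypothetical dictionary of city names; each entry is a tuple of words.
CITY_NAMES = [("New", "York"), ("Los", "Angeles"), ("Hanoi",)]


def orthography_type(token):
    """Classify the surface form of a non-empty token."""
    if token.isdigit():
        return "number"
    if token.isspace():
        return "space"
    if all(ch in string.punctuation for ch in token):
        return "punctuation"
    if token.isalpha():
        if token.islower():
            return "lowercase word"
        if token[0].isupper() and token[1:].islower():
            return "capitalized word"
        if token.isupper():
            return "all capitalized"
        return "mixed case word"
    return "special symbol"


def dictionary_lookup(token, entries=CITY_NAMES):
    """Report whether the token starts, ends or sits inside a dictionary entry."""
    for entry in entries:
        if token == entry[0]:
            return "start of city"
        if token == entry[-1]:
            return "end of city"
        if token in entry[1:-1]:
            return "middle of city"
    return None


def token_features(token):
    """Collect the features of one token into a dictionary."""
    return {
        "String": token,
        "Orthography type": orthography_type(token),
        "Dictionary-Lookup": dictionary_lookup(token),
    }


print(token_features("New"))
```

Rules are then written as conditions over these feature dictionaries rather than over raw characters.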
Rules to identify a single entity:
A rule that recognizes a complete single entity consists of three kinds of patterns.
- An optional pattern capturing the context before the start of the entity.
- A pattern matching the tokens in the entity.
- An optional pattern capturing the context after the end of the entity.
An example of a pattern for identifying person names of the form "Dr. Yair Weiss", consisting of a title token listed in a dictionary of titles (containing entries such as "Prof", "Dr", "Mr"), a dot, and two capitalized words, is
({Dictionary-Lookup = Titles}{String = "."}{Orthography type = capitalized word}{2}) → Person Names.
Each condition within curly braces is a condition on one token, optionally followed by a number giving the repetition count of the token. An example of a rule that marks every number following the prepositions "by" and "in" as a Year entity is:
({String = "by" | String = "in"})({Orthography type = Number}):y → Year = :y.
There are two patterns in this rule: the first captures the context in which the Year entity occurs, and the second captures the properties of the tokens that form the "Year" field. Another example, for finding company names of the form "The XYZ Corp." or "ABC Ltd.", is given by:
({String = "The"}? {Orthography type = All capitalized}{Orthography type = Capitalized word, DictionaryType = Company end}) → Company name
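The person-name rule and the year rule above can be approximated at token level as in the sketch below. This is a hedged illustration, not the rule engine of [1] or of any real system: the tokenizer, the helper names and the title dictionary contents beyond "Prof", "Dr", "Mr" are assumptions.

```python
import re

# Dictionary of titles with the entries mentioned in the text.
TITLES = {"Prof", "Dr", "Mr"}


def tokenize(text):
    """Split text into words/numbers and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def is_capitalized(token):
    """True for words such as 'Yair': first letter upper, rest lower."""
    return token[:1].isupper() and token[1:].islower()


def find_person_names(tokens):
    """Title token, a dot, then two capitalized words -> Person Name."""
    names = []
    for i in range(len(tokens) - 3):
        title, dot, first, last = tokens[i:i + 4]
        if (title in TITLES and dot == "."
                and is_capitalized(first) and is_capitalized(last)):
            names.append(f"{title}. {first} {last}")
    return names


def find_years(tokens):
    """Any number that directly follows 'by' or 'in' -> Year."""
    return [
        tokens[i + 1]
        for i in range(len(tokens) - 1)
        if tokens[i] in {"by", "in"} and tokens[i + 1].isdigit()
    ]


tokens = tokenize("Dr. Yair Weiss published the paper in 2001.")
print(find_person_names(tokens))
print(find_years(tokens))
```

In `find_years` the context pattern ("by" or "in") is checked but not returned, mirroring the way the rule labels only the part bound to `:y`.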
2/. Rules to mark entity boundaries
For some entity types, in particular long entities such as book titles, it is more effective to define separate rules that mark the start and the end of an entity boundary. These rules fire independently, and all tokens between a start marker and an end marker are taken to be the entity. Viewed another way, each rule essentially leads to the insertion of a single SGML tag into the text, where the tag can be either a begin tag or an end tag. Inconsistencies, such as two begin markers before a single end marker, need a separate resolution policy. For example, a rule that inserts a <journal> tag to mark the start of a journal name in a citation is:
({String = "to"} {String = "appear"} {String = "in"}):jstart
({Orthography type = Capitalized word}{2-5}) → insert <journal> after :jstart.
Many successful rule-based extraction systems are based on such rules, for example (LP)2 [60], STALKER [156], Rapier [43] and WEIN [121, 23].
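The boundary rule above can be approximated with a single regular expression, as in the following sketch. It assumes that the inserted begin tag is `<journal>` and that "capitalized word" means an initial capital followed by lowercase letters; the function name `mark_journal_start` and the sample citation are hypothetical.

```python
import re

# Context "to appear in", then a lookahead for 2 to 5 capitalized words.
# The tag is inserted right after the context, at the position the rule calls :jstart.
START_RULE = re.compile(
    r"(\bto appear in )(?=[A-Z][a-z]+(?: [A-Z][a-z]+){1,4}\b)"
)


def mark_journal_start(citation):
    """Insert a <journal> begin tag where the boundary rule fires."""
    return START_RULE.sub(r"\1<journal>", citation)


citation = "J. Smith. Event rules. to appear in Machine Learning Journal, 2008."
print(mark_journal_start(citation))
```

Only the begin tag is inserted here; a separate end rule, and a policy for unmatched markers, would be needed to close the entity.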
3/. Rules for multiple entities
Some rules take the form of regular expressions with several slots, each slot representing a different entity, so that one rule recognizes several entities at the same time. Such rules are most useful for record-oriented data. For example, the rule-based system WHISK [18] targets extraction from structured records such as medical records, equipment maintenance logs and classified ads. The following rule, rephrased from [18], extracts two entities, the number of bedrooms and the rent, from an apartment rental ad.
({Orthography type = Digit}):Bedrooms ({String = "BR"})({}*)
({String = "$"})({Orthography type = Number}):Price → Number of Bedrooms = :Bedrooms, Rent = :Price
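The two-slot rule can be sketched as one regular expression with two named groups. This is an approximation for illustration, not WHISK itself: the function name `extract_rental` and the sample ad are hypothetical, and `\d+` is used for the bedroom slot where the rule says "Digit".

```python
import re

# Slot 1: a number before "BR"; skip any tokens; slot 2: a number after "$".
RENTAL_RULE = re.compile(r"(?P<bedrooms>\d+)\s*BR\b.*?\$\s*(?P<price>\d+)")


def extract_rental(ad):
    """Fill both slots from one ad, or return None if the rule does not match."""
    match = RENTAL_RULE.search(ad)
    if match is None:
        return None
    return {
        "Number of Bedrooms": int(match.group("bedrooms")),
        "Rent": int(match.group("price")),
    }


ad = "Sunny 2 BR apartment near downtown, parking included, $850 per month"
print(extract_rental(ad))
```

Because both slots belong to one pattern, the rule either fills both attributes or none, which suits record-like text such as ads.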
4/. Alternative forms of rules
Many state-of-the-art rule-based systems allow arbitrary programs written in procedural languages such as Java and C++ in place of both the pattern part and the action part of a rule. For example, GATE [19] supports Java programs in the action of a rule, in place of its own rule scripting language called JAPE. This is a powerful capability because it lets the action part of a rule access the features that were used in the pattern part and use them to insert new fields for the annotated string. For example, the action part could insert the standardized form of a string taken from a dictionary. The new fields can then serve as additional features for a later rule in the pipeline. Similarly, in the Prolog-based formulation of [20], any procedural code can be substituted as the pattern matcher for any subset of entity types.
In general, knowledge systems tend to start with a rule-based approach. This approach has several advantages. First, it needs less training data than data-driven approaches. Second, it makes it possible to build good regular expressions for extraction, typically based on syntactic, lexical, and semantic elements. The rule-based approach suits the extraction of time expressions (for example "rạng sáng hôm qua", early yesterday morning, or "giữa trưa hôm nay", at noon today). It gives very high precision, because the rules are written to pick out specific information, but low recall. It is therefore well suited to problems where precision is the main concern.
Alongside these advantages, the rule-based approach has drawbacks. The rule builder has to act as a domain expert who understands the data thoroughly, and must also know the language, its vocabulary, and its syntax. Moreover, a rule set is usually built to extract specific information, so moving to a different data domain means building a new rule set. Building a rule set can be time-consuming and costly.
2.2. MACHINE LEARNING-BASED APPROACH
This approach is sometimes called the data-driven approach. It is commonly used in natural language processing applications and needs a large training set to fit the linguistic phenomena [9]. It usually relies on probabilistic models, information theory, and linear algebra. Common basic techniques include Term Frequency - Inverse Document Frequency (TF-IDF), n-grams, and clustering.
There are many examples of data-driven event extraction. In 2009, Okamoto et al. [9] used a framework for detecting local events, based on hierarchical clustering. Clustering can also give good results for event extraction: Liu M. et al. [10] combined weighted undirected bipartite graphs with clustering to extract key entities and significant events from daily news. Clustering techniques were also used by Tanev et al. [13] to extract violence and disaster events for a monitoring system.
The data-driven approach does not require the builder to have linguistic or domain expertise, but it does require a large amount of training data, because it has to estimate probabilities that fit the model to the data. It has two advantages. First, no linguistic or domain experts need to be involved. Second, once trained, the models can be used across different data domains.
The data-driven approach also has drawbacks. First, in event extraction it does not handle meaning: it discovers relations in the data set without addressing their semantics. Second, it needs a large amount of data to train the model, and in some cases labeling the data is time-consuming and costly. Third, because it is built on statistical probabilistic models, poorly prepared training data leads to poor extraction results.
2.3. HYBRID APPROACH COMBINING RULES AND MACHINE LEARNING
The hybrid approach, which combines rules and machine learning, is commonly used in event extraction. Most knowledge-driven systems are supplemented with data-driven methods, which compensates for the weaknesses of the knowledge-driven approach. For example, Piskorski et al. [12] used bootstrapping techniques in a system that extracts violence-related events from online news with high precision and recall.
Morik [8] combined semantic rules with Conditional Random Fields (CRFs), represented as undirected graphs, to extract events from plenary sessions of the German parliament; here the authors addressed the limitations of supervised learning by using clusters. Lee et al. [8] used a fuzzy ontology to extract events from Chinese news, relying on grammar-based statistics and part-of-speech tagging. Chun et al. [3] extracted biomedical events using syntactic rules combined with co-occurrences. All of these methods can therefore be regarded as hybrid.
In this thesis, the author uses the hybrid approach for the following reasons. First, when classifying traffic accident data with a large input, it is best to filter with syntactic rules first; this step considerably reduces the amount of input to event detection. Second, an accident event carries four pieces of information: time, location, number of casualties, and the vehicle that caused the accident. The time, the number of casualties, and the vehicle in particular are sometimes stated vaguely and without detail, for example "vào giữa trưa" (at midday), "đúng lúc tan tầm" (right at rush hour), "2 người thiệt mạng" (2 people killed), "làm chết 1 người" (killing 1 person), or "xe khách đâm vào xe tải" (a coach hit a truck); the author therefore uses semantic rules to extract them. Third, the system includes classification and entity recognition, tasks that statistical, data-driven methods perform well.
2.5. SUMMARY
This chapter presented several approaches to the problem and pointed out the strengths and weaknesses of each. The author concludes that the hybrid approach combining rules and machine learning is suitable for extracting accident events. The problem statement, the model, and the solution method are presented in detail in Chapter 3.
Chapter 3. PROPOSED MODEL FOR ACCIDENT EVENT EXTRACTION
This chapter analyzes the accident event extraction problem. It examines the characteristics of accident events, states the problem, proposes a model, and details the solution to the two main problems of the thesis: accident event detection and accident event extraction.
3.1. CHARACTERISTICS OF ACCIDENT EVENTS
A survey of the accident data shows that accident event detection must distinguish clearly between reports of a traffic accident and general information on the topic of traffic accidents. Reports of an accident are what the thesis targets, for example "sáng ngày 25/5 một vụ tai nạn thảm khốc xảy ra trên quốc lộ 1A" (on the morning of 25/5 a horrific accident occurred on National Highway 1A). General information covers article titles such as "làm thế nào để giảm thiểu số vụ tai nạn giao thông" (how to reduce the number of traffic accidents) or "sốc vì con số thiệt mạng do tai nạn trong nửa đầu năm 2014" (shock at the death toll from accidents in the first half of 2014); these discuss traffic accidents but do not report one.
The survey also shows that an accident event may include the time of the accident, the location, the number of casualties, the vehicle that caused it, the cause, the age of the driver, and the time of day at which it happened. Of these, the time, location, number of casualties, and vehicle are of particular interest and are the fields extracted from an accident event.
3.2. PROBLEM STATEMENT
The problem is event extraction from Vietnamese news text. The thesis focuses on extracting events from traffic accident news (hereafter, accident event extraction). The author stresses that an accident event must be distinguished from items that mention traffic accidents without reporting one (for example, a news item about a seminar on how to reduce traffic accidents). This chapter addresses the extraction of traffic accident information from Vietnamese news text taken from Vietnamese online newspapers. The information to extract from unstructured text includes the time of the accident, the location, the number of casualties (deaths and injuries), the vehicle that caused the accident, the age of the person who caused it, the terrain, and the cause. The problem is stated as follows:
Input: a news item from an online newspaper.
Output: whether the input news item is a traffic accident event and, if so, the extracted information about the accident.
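The information in a traffic accident news item (hereafter, accident news item) is defined as a tuple E of four components: time, location, number of casualties, and the vehicle that caused the accident. Formally:
E = <Time, Location, Casualties, Vehicle> (3.1)
Time: the time at which the accident occurred.
Location: the place where the accident occurred.
Casualties: the number of people killed and the number injured, represented as a list with two fields, deaths and injuries. For example, from "xế hộp do say rượu đâm trực tiếp vào nhà người dân, làm cho 2 người bị thương nặng, tài xế chết ngay tại chỗ" (a car driven by a drunk driver crashed straight into a house, seriously injuring 2 people; the driver died at the scene), the casualties are extracted as:
Deaths    Injuries
1         2
As another example, from "xe khách đâm thẳng vào xe tải bên đường, làm 3 hành khách bị thương" (a coach crashed into a truck at the roadside, injuring 3 passengers), the casualties are extracted as:
Deaths    Injuries
0         3
Vehicle: only the type of vehicle that caused the accident is extracted.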
For example, an accident event E might be: E = <12/7/2013, Quốc lộ 1A, (0, 3), xe máy>. From these four basic pieces of information one can readily infer that on 12 July 2013 an accident on National Highway 1A injured 3 people travelling by motorbike.
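The tuple E maps naturally onto a small record type. The sketch below is one possible representation, not the data structure used in the thesis; the class name `AccidentEvent`, the field names, and the `is_complete` helper are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AccidentEvent:
    """Hypothetical container for the four components of E."""
    time: Optional[str]           # e.g. "12/7/2013"
    location: Optional[str]       # e.g. "Quốc lộ 1A"
    casualties: Tuple[int, int]   # (deaths, injuries)
    vehicle: Optional[str]        # type of vehicle that caused the accident

    def is_complete(self) -> bool:
        """True when time, location, and vehicle were all extracted."""
        return None not in (self.time, self.location, self.vehicle)


event = AccidentEvent("12/7/2013", "Quốc lộ 1A", (0, 3), "xe máy")
print(event.casualties, event.is_complete())
```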
As defined above, the input to the model is news items from online newspapers. The author chose online newspapers as the data source for three reasons: first, the information on these sites is very rich; second, it is reliable and up to date; third, collecting data from them is fairly easy. The data is therefore always diverse and current.
The model splits the problem into two subproblems. The first, phase 1 (accident event detection), decides whether an article reports an accident. The second, phase 2 (accident event extraction), takes the items that phase 1 has confirmed as accident events and extracts the accident information from them.
3.3. MODEL FOR DETECTING AND EXTRACTING ACCIDENT EVENTS
3.3.1. Proposed method
Chapter 2 presented three approaches: rule-based, machine learning-based, and the hybrid of the two. This section develops the idea of combining rules and machine learning for accident event extraction.
Phase 1 - Accident event detection: The input to this phase is news items from online newspapers, which are numerous and cover many different fields. The author therefore splits this phase into two steps: step 1 uses rules to filter out the items in the traffic accident domain, and step 2 uses a machine learning classifier to identify the items that contain an accident event. Accident event detection thus combines rules and machine learning.
Phase 2 - Accident event extraction: This phase extracts the time of the accident, the location, the number of casualties, and the vehicle that caused it. The location is extracted with named entity recognition (NER) together with an ontology or a dictionary. The time may appear in a standard form (dd/mm/yyyy) or a non-standard form ("giữa trưa", midday; "nửa đêm", midnight; "giờ tan tầm", rush hour; ...), so rules are used to extract it. The number of casualties (deaths and injuries) is extracted with entity recognition and rules. For the vehicle, the author builds a dictionary of means of transport and uses rules to match text against it.
Solving both phases therefore combines rules and machine learning (here, classification and entity recognition). The model for both phases is presented in Section 3.3.2, and the detailed solutions to the two problems in Section 3.4.
3.3.2. Model for detecting and extracting accident events
To address the two phases described in Section 3.3.1, the author proposes a model with four main components:
Figure 3.1: The accident event detection and extraction process
Data collection: automatically collects news items from online newspapers on the Internet and passes them to data preprocessing.
Data preprocessing: processes the collected data by removing HTML tags to obtain raw text, then passes it to accident event detection.
Event detection: detects accident events. Taking the output of preprocessing, it uses rules to select the items in the traffic accident domain, then uses machine learning to classify each item as an accident news item or not. Items that are not are discarded; the rest are passed to accident event extraction.
Event extraction: extracts the characteristic information of the accident: time, location, number of casualties, and the vehicle that caused it.
3.4. SOLVING THE EVENT DETECTION AND ACCIDENT EVENT EXTRACTION PROBLEMS
Problem 1 takes the raw text from the preprocessing step, uses rules to select the data in the traffic accident domain, and then uses a classifier to check whether each item is an accident news item; if it is, the item is passed to problem 2, accident event extraction. The model and detailed solution for each problem are presented in Sections 3.4.1 and 3.4.2.
3.4.1. Problem 1 - Accident event detection (phase 1)
3.4.1.1. Problem statement
Problem 1 takes the raw text produced by preprocessing and decides whether it contains an accident event. Formally:
Input: a news item from a news site, as raw text.
Output: whether the news item contains an accident event.
Phase 1 has two functions: a data filter and a classifier. The filter takes the preprocessed data (raw text obtained by stripping the HTML tags from the collected news items) and keeps the items in the traffic accident domain; the classifier checks whether an item contains an accident event. The detection process is shown in Figure 3.2.
Figure 3.2. The event detection component
3.4.1.2. Building the rule set
As described in Section 3.4.1.1, the event detection phase has two functions: filtering the data (keeping news items in the traffic accident domain) and then classifying whether each item contains an accident event. This section details the first function, filtering data in the traffic accident domain.
The survey shows that the title of a news item usually conveys its content fairly completely. The author therefore filters on the title instead of the body.
The data filter works as follows: (1) a rule set is built from a survey of the domain, using keywords related to traffic accidents; (2) the filter matches these rules against the title of each item. If the title matches a rule, the item belongs to the traffic accident domain; otherwise it does not.
The survey shows that most titles of articles in the traffic accident domain contain words for means of transport, for example "Tp.HCM: Xe khách kéo lê xe máy trên đường", "Xe bus rơi xuống hẻm núi, 56 người thương vong", "Ô tô đi trái đường, 1 người thiệt mạng", and "TP.HCM: Nam thanh niên tử vong dưới gầm xe ben". A small number of accident news titles contain no vehicle, for example "Nghệ An: Hai thí sinh không thể thi tốt nghiệp vì TNGT"; these instead contain words such as "tai nạn", "tai nạn giao thông", "TNGT", or "tai nạn bi thảm". Figures 3.3 and 3.4 show examples.
Figure 3.3. News titles containing words for means of transport
Figure 3.4. Titles containing no words for means of transport
From the data survey and practical knowledge, the author built a set of means of transport, called the vehicle dictionary. The vehicle names are listed in Table 3.1.
Table 3.1. Means of transport
No.  Vehicle name      No.  Vehicle name
1    Xe                29   Xe lu
2    Ô tô              30   Máy tuốt
3    Mô tô             31   Xe cần cẩu
4    Xe máy            32   Máy xúc
5    Xe khách          33   tắc-xi
6    Xe buýt           34   Xe thồ
7    Xe hơi            35   Xe hàng
8    Xe bốn chỗ        36   Xe
9    xế hộp            37   Xe bò
10   Xe trâu           38   Xe ngựa
11   Xe điện           39   Công-te-nơ
12   Tàu hoả           40   cần cẩu
13   Máy bay           41   Xe ba gác
14   tàu lửa           42   Xe đua
15   Xe tải            43   Xe phân khối lớn
16   Xe ôm             44   Xe ga
17   Xe đạp            45   Xích-lô
18   Xe đạp điện       46   Trực thăng
19   Công nông         47   Xe bus
20   Máy kéo           48   Xe ben
21   Xe lu             49   Xe 3 bánh
22   Ô tô 4 chỗ        50   Xe ba bánh
23   Xe đầu kéo        51   Xe 3 gác
24   Xe 7 chỗ          52   Thuyền
25   Ô tô 7 chỗ        53
26   Xe 16 chỗ         54   Xuồng máy
27   Xe 24 chỗ         55   Tàu
28   Xe 29 chỗ         56   Ghe
From this, the author builds rules for two cases. In the first case, pattern 1 is used: the title is matched against the vehicle dictionary, and if it matches, the item is kept. Otherwise pattern 2 is used. The patterns are given in formulas (3.1) and (3.2).
Pattern 1 = means of transport (3.1)
Examples for pattern 1:
The word "xe" is found in the title "Xe chở bia đâm cột điện, 2 người mắc kẹt trong cabin".
The word "xe buýt" is found in the title "Tp.HCM: Xe buýt cán nát chân người bộ hành".
As another example, the title "Đưa em đi thi đại học, chị bị tai nạn giao thông" contains no means of transport, so pattern 1 does not apply and pattern 2 is used instead.
Pattern 2 = verb # noun (3.2)
where:
Verbs include: tai nạn, TNGT, ...
Nouns include: giao thông, thương tâm, ...
Examples for pattern 2:
tai nạn # thương tâm
tai nạn # giao thông
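The two title patterns can be sketched as a small keyword filter. This is a minimal illustration, not the thesis implementation: the term lists are short subsets, the function names are hypothetical, and the "#" in pattern 2 is interpreted here as "followed later in the title by", which is an assumption.

```python
import re
import unicodedata


def norm(s: str) -> str:
    """Lowercase and NFC-normalize so accented letters compare reliably."""
    return unicodedata.normalize("NFC", s).lower()


# Small illustrative subsets of the dictionaries described above.
VEHICLE_TERMS = [norm(t) for t in ("xe", "ô tô", "mô tô", "tàu", "thuyền", "ghe", "máy bay")]
ACCIDENT_WORDS = [norm(t) for t in ("tai nạn", "TNGT")]
CONTEXT_WORDS = [norm(t) for t in ("giao thông", "thương tâm")]


def find_term(term: str, text: str):
    """Search for a whole-word occurrence of term (so "xe" does not match "xem")."""
    return re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text)


def matches_pattern1(title: str) -> bool:
    t = norm(title)
    return any(find_term(v, t) for v in VEHICLE_TERMS)


def matches_pattern2(title: str) -> bool:
    t = norm(title)
    for word in ACCIDENT_WORDS:
        m = find_term(word, t)
        # Assumed reading of "#": a context word appears after the accident word.
        if m and any(find_term(c, t[m.end():]) for c in CONTEXT_WORDS):
            return True
    return False


def in_traffic_accident_domain(title: str) -> bool:
    return matches_pattern1(title) or matches_pattern2(title)


print(in_traffic_accident_domain("Đưa em đi thi đại học, chị bị tai nạn giao thông"))
```

Because pattern 1 fires on any vehicle word, a title that merely mentions a vehicle is also accepted; the classifier in the next step is what separates real accident reports from such items.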
3.4.1.3. Building the classification model
The classifier decides whether an article contains an event. It assigns one of two classes: items that contain an accident event are labeled EVENT, and items that do not are labeled NOT_EVENT. The survey shows that the title and summary of a news item carry the main content of the whole item, so the author uses them to build the feature vector that represents the text. The features used in training are 2-grams, 3-grams, and 4-grams. The author builds a training set and uses it to train a model that identifies texts containing an event.
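To make the feature description concrete, the sketch below builds 2-, 3-, and 4-gram counts from a piece of text. It assumes word-level n-grams over whitespace tokens, which the source does not state; the function name `ngram_features` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def ngram_features(text: str, sizes: Iterable[int] = (2, 3, 4)) -> Counter:
    """Count word n-grams of the given sizes in a title or summary."""
    tokens = text.lower().split()
    features = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


feats = ngram_features("xe tải đâm xe máy")
print(sorted(feats))
```

Most n-grams occur in only a few documents, so the resulting vectors are sparse, which is the property the next paragraph relies on.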
This section uses the Maximum Entropy (ME) model because: (1) the training data is text, so its feature-vector representation is sparse, and ME works well on sparse data; (2) ME trains reasonably fast, and experiments show that it performs well on text; (3) the ME implementation is open source and can be customized. ME is a conditional probability model that can integrate a wide variety of features from the training set for classification. The idea of ME is to find the most uniform distribution that satisfies the constraints derived from the training data, without making any further assumptions. In other words, the model's distribution must satisfy the constraints of the observed data and otherwise be as close to the uniform distribution as possible.
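For reference, the conditional maximum entropy model is usually written in the log-linear form below. The notation is the common textbook one and is not taken from the thesis: $f_i$ are feature functions (here, n-gram indicators paired with a label), $\lambda_i$ are the learned weights, and $y$ ranges over the two labels EVENT and NOT_EVENT.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)
```

Among all distributions whose expected feature values match those observed in the training data, this form is the one with maximum entropy, which is the sense in which the model adds no further assumptions.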
After training, all data that passed the filter is fed to the model. Texts labeled EVENT become the input to extraction; texts labeled NOT_EVENT are discarded.
3.4.2. Problem 2 - Accident event extraction (phase 2)
3.4.2.1. Problem statement
The event extractor is the central component of the model: this is where the information about a traffic accident event is extracted. Formally:
Input: a news item containing an accident event.
Output: the information about the accident: time, location, number of casualties, and the vehicle that caused it. The number of casualties covers both the number of people killed and the number injured, and is given as a list with two fields (deaths, injuries) and one record holding the corresponding counts.
The extraction problem is illustrated in Figure 3.5.
Figure 3.5. The event extraction component
The extractor has four parts: time extraction, location extraction, casualty extraction, and vehicle extraction. The first uses rules to obtain the time of the accident (the date of the accident, not the hour of the day). The second uses a location dictionary to obtain the location. The third and fourth use rules to obtain the number of casualties and the vehicle that caused the accident.
3.4.2.2. Time extraction
The survey shows that time is usually expressed in one of two forms: absolute or relative. Absolute time is usually written as DD/MM/YYYY or DD/MM (DD is the day, MM the month, and YYYY the year of the accident). For example: "vào ngày 8/5 trên quốc lộ 5 xe máy va quẹt vào ô tô làm hai người bị thương" (on 8/5 on National Highway 5 a motorbike sideswiped a car, injuring two people). Another example: "vào ngày 09/7/2014, vì lái xe trong tình trạng say rượu mà xế hộp đâm thẳng xuống hồ nước, nạn nhân chết tại chỗ" (on 09/7/2014, a car driven by a drunk driver plunged into a lake; the victim died at the scene). In many cases, however, the time is stated vaguely and indirectly. For example: "ngay sáng sớm ngày 5/5, một vụ tai nạn thảm khốc xảy ra, tài xế xe tải đâm thẳng vào xe khách" (early in the morning of 5/5 a horrific accident occurred when a truck driver crashed straight into a coach); fortunately no one was killed, but all the passengers were injured and taken to emergency care. Here the time of the accident is not precise; it is only "sáng sớm" (early morning). The phrase "sáng sớm" therefore has to be combined with the exact date to give the time.
Because time is expressed in these two ways, the author uses prebuilt rules to extract it. Absolute time can easily be extracted with a regular expression (RE). Relative time has two parts: a prefix and a date. The prefix is a word or phrase for a relative time (rạng sáng, nửa đêm, chiều, ...), and the date is written as DD/MM/YYYY. The time extraction rule is given in formula (3.2):
Time = <Prefix> + <Date> (3.2)
where Prefix includes the words: vào, ngày, sáng, trưa, chiều, tối, nửa đêm, trưa nay, sáng nay, chiều nay, vào giờ tan tầm, hôm qua, hôm nay, tối qua, đêm qua, rạng sáng nay, tháng.
Date has the format DD/MM/YYYY or DD/MM.
If the news item does not mention a date, the time defaults to the publication date of the item.
Some examples of using the regular expression and the time extraction rule:
Example 1: On 23/5 ("Ngày 23/5"), a coach travelling from south to north reached the area in front of a market in Đức Thạnh commune, Mộ Đức district, when it suddenly swerved to the left side of the road and rammed a truck travelling in the opposite direction. Because it was going too fast, the coach carried on for about 100 m and hit another truck parked at the roadside. The force of the collision crushed the fronts of all 3 vehicles and scattered glass across the road. The driver, the assistant driver, and 3 passengers on the coach were seriously injured. Local people had to break the windows to rescue the victims and take them to hospital. The accident injured five people on the coach and crushed the fronts of the coach and the truck.
Example 2: On the afternoon of 24/8 ("Chiều ngày 24/8"), a man carrying his wife and their one-year-old child by motorbike from Tịnh Hiệp commune, Sơn Tịnh district (Quảng Ngãi), to the mountain district of Trà Bồng skidded and fell, and the three were dragged under a truck. At the scene, the motorbike and its 3 riders were trapped under the truck. Sơn Tịnh district police recovered the bodies of the family and sealed off the scene to investigate the cause. According to the authorities' initial assessment, the motorbike skidded because the road was slippery. The truck, travelling in the same direction, could not brake in time, pulled the motorbike under it, and dragged the couple and their child along, killing all three at the scene.
In Example 1 the time is extracted with the regular expression, whereas in Example 2 it is extracted with the time rule. The result for Example 1 is "23/5"; the result for Example 2 is "chiều ngày 24/8".
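The two cases above can be sketched as two regular expressions tried in order. This is an assumption-laden illustration, not the thesis rule set: it uses only a few part-of-day prefixes drawn from the list and examples above, treats "ngày" as an optional connector, and the name `extract_time` and its `published` parameter are hypothetical.

```python
import re
import unicodedata
from typing import Optional


def norm(s: str) -> str:
    """Lowercase and NFC-normalize so accented letters compare reliably."""
    return unicodedata.normalize("NFC", s).lower()


DATE = r"\d{1,2}/\d{1,2}(?:/\d{4})?"      # DD/MM or DD/MM/YYYY
# A subset of part-of-day prefixes, longest alternatives first.
PART_OF_DAY = [norm(p) for p in
               ("rạng sáng", "sáng sớm", "nửa đêm", "sáng", "trưa", "chiều", "tối")]
RELATIVE = re.compile(
    r"(?<!\w)((?:" + "|".join(PART_OF_DAY) + r")"
    r"(?:\s+" + norm("ngày") + r")?\s+" + DATE + r")"
)
ABSOLUTE = re.compile(DATE)


def extract_time(text: str, published: Optional[str] = None) -> Optional[str]:
    """Prefer prefix + date, fall back to a bare date, then to the publication date."""
    t = norm(text)
    m = RELATIVE.search(t)
    if m:
        return m.group(1)
    m = ABSOLUTE.search(t)
    return m.group(0) if m else published


print(extract_time("Ngày 23/5, ô tô khách chạy từ Nam ra Bắc"))
print(extract_time("Chiều ngày 24/8, anh chở vợ và con bằng xe máy"))
```

Dates written with other separators, such as 27-5, are not covered by this sketch.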
3.4.2.3. Location extraction
Location extraction uses NER and a location dictionary.
Step 1: apply NER.
Step 2: collect the entities tagged as locations.
Step 3: check them against the location dictionary to find the exact locations.
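Steps 2 and 3 can be sketched as a filter over the NER output. The sketch assumes the NER tool returns (text, label) pairs and uses "LOC" as the location label; both the output format and the label name are hypothetical, as is the function name `select_locations`.

```python
from typing import Iterable, List, Set, Tuple


def select_locations(entities: Iterable[Tuple[str, str]], gazetteer: Set[str]) -> List[str]:
    """Keep entities labeled as locations that also appear in the location dictionary."""
    known = {name.lower() for name in gazetteer}
    return [text for text, label in entities
            if label == "LOC" and text.lower() in known]


# Hypothetical NER output for one news item.
entities = [("Quốc lộ 1A", "LOC"), ("Nguyễn Văn A", "PER"), ("Sao Hoả", "LOC")]
gazetteer = {"Quốc lộ 1A", "Quảng Ngãi"}
print(select_locations(entities, gazetteer))
```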
3.4.2.4. Casualty extraction
The author uses rules to extract the number of casualties. The rule is given in formula (3.3):
Number of victims = <Number> + <Suffix> (3.3)
Number: the number of victims, written as digits or as a word:
number = {"một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín", "mười"}, and the digits [1..9]
Suffix: words such as tử vong, bị thương, thiệt mạng, chết, nhập viện, ...:
suffix = {"bị thương", "chết", "tử vong", "thiệt mạng", "chết thảm", "thương nặng", "thương nhẹ", "cấp cứu", "bệnh viện"}
The result is recorded as a list with two fields and one record: a deaths field and an injuries field, with the corresponding counts in the record.
Example 3: At about 12:55 this afternoon (2/6), on the Thăng Long - Nội Bài expressway (Hà Nội), about 200 m from the Mê Linh Plaza supermarket, a taxi collided with a motorbike. As a result, the 2 people on the motorbike were very seriously injured ("2 người trên xe máy bị thương rất nặng") and are still lying on the road, not yet taken to emergency care. The incident also obstructed traffic through the area, making it difficult for vehicles heading toward the city center.
Result for Example 3: 0 deaths, 2 injured.
Deaths    Injuries
0         2
Example 4: At about 22:00 on 27-5, at km 1045 + 950 on National Highway 1A, on the section through Thế Lợi village, Tịnh Phong commune, Sơn Tịnh district, Quảng Ngãi province, a serious traffic accident killed 1 young man at the scene and sent 1 person to hospital ("làm 1 thanh niên tử vong tại chỗ và 1 người phải nhập viện").
Result for Example 4: 1 death, 1 injured.
Deaths    Injuries
1         1
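A minimal sketch of rule (3.3) follows. Several choices are assumptions rather than details from the thesis: up to six intervening words are allowed between the number and the suffix within one clause, the suffixes are split into a death group and an injury group as shown, and the name `extract_casualties` is hypothetical.

```python
import re
import unicodedata
from typing import Tuple


def norm(s: str) -> str:
    """Lowercase and NFC-normalize so accented letters compare reliably."""
    return unicodedata.normalize("NFC", s).lower()


NUMBER_WORDS = {norm(w): n for n, w in enumerate(
    ("một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín", "mười"), start=1)}
# Assumed grouping of the suffix list into deaths and injuries.
DEATH_SUFFIXES = [norm(s) for s in ("chết thảm", "tử vong", "thiệt mạng", "chết")]
INJURY_SUFFIXES = [norm(s) for s in
                   ("bị thương", "thương nặng", "thương nhẹ", "nhập viện", "cấp cứu", "bệnh viện")]

CASUALTY_RULE = re.compile(
    r"(?<![\w/-])(?P<num>\d+|" + "|".join(NUMBER_WORDS) + r")(?![\w/-])"
    r"(?:\s+[^\s.,;]+){0,6}?\s+"          # a few words, without crossing punctuation
    r"(?P<suf>" + "|".join(DEATH_SUFFIXES + INJURY_SUFFIXES) + r")"
)


def extract_casualties(text: str) -> Tuple[int, int]:
    """Return (deaths, injuries) summed over all rule matches in the text."""
    deaths = injuries = 0
    for m in CASUALTY_RULE.finditer(norm(text)):
        raw = m.group("num")
        count = int(raw) if raw.isdigit() else NUMBER_WORDS[raw]
        if m.group("suf") in DEATH_SUFFIXES:
            deaths += count
        else:
            injuries += count
    return deaths, injuries


print(extract_casualties("Vụ tai nạn khiến 2 người trên xe máy bị thương rất nặng"))
```

A sentence with a suffix but no number, such as "Nạn nhân được người dân đưa đi cấp cứu", produces no match under this rule.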
3.4.2.5. Extracting the vehicle that caused the accident
The author uses a rule to extract the vehicle that caused the accident:
Vehicle causing the accident = <Noun> + <Verb> (3.4)
where:
Noun: a means of transport from the dictionary, such as xe khách, xe tải, xe đầu kéo, ... The full set is listed in Table 3.1.
Verb: words such as đối đầu, đâm xe, gây tai nạn, đụng xe, đâm nhau, ...
The full set of verbs is: verbs = {"đâm nát đầu", "đâm xe", "đấu đầu", "xe đối đầu", "đụng xe", "đâm nhau", "tai nạn giao thông", "gây tai nạn", "gặp tai nạn", "húc nhau", "lao xuống gầm", "chui vào gầm", "bị tông", "tông mạnh", "cán chết", "cán qua", "húc", "đâm", "chui gầm", "lật tàu", "trật bánh", "tàu trật bánh", "đắm thuyền", "chìm thuyền", "lật thuyền", "lật ngửa"};
Example 5: At about 17:00 on 26/5/2014, at km 677 + 700 on National Highway 1A, on the section through Dinh Mười village, Duy Ninh commune, Quảng Ninh district (Quảng Bình), an accident killed a young man. At that time, a truck with plate 60C-116.80 was travelling north to south. When it reached Duy Ninh commune, a young man riding a motorbike with plate 73G1 - 074.03 in the opposite direction suddenly crashed head-on into the front of the truck and died at the scene. The victim, born in 1989, lived in Gia Ninh commune, Quảng Ninh district (Quảng Bình).
Result for Example 5: the vehicle that caused the accident is "xe máy" (motorbike).
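Rule (3.4) can be sketched as a vehicle term followed, within a short window, by a collision verb. The window of ten words, the small term subsets, and the name `extract_vehicle` are assumptions for illustration, not details from the thesis.

```python
import re
import unicodedata
from typing import Optional


def norm(s: str) -> str:
    """Lowercase and NFC-normalize so accented letters compare reliably."""
    return unicodedata.normalize("NFC", s).lower()


def alternation(terms) -> str:
    """Build a regex alternation with longer terms first ("xe máy" before "xe")."""
    return "|".join(re.escape(norm(t)) for t in sorted(terms, key=len, reverse=True))


VEHICLES = ("xe", "xe máy", "xe khách", "xe tải", "xe buýt", "ô tô")
CRASH_VERBS = ("đâm", "húc", "tông mạnh", "cán qua", "gây tai nạn", "lao xuống gầm")

VEHICLE_RULE = re.compile(
    r"(?<!\w)(?P<vehicle>" + alternation(VEHICLES) + r")(?!\w)"
    r"(?:\s+\S+){0,10}?\s+"                      # assumed window of up to 10 words
    r"(?P<verb>" + alternation(CRASH_VERBS) + r")(?!\w)"
)


def extract_vehicle(text: str) -> Optional[str]:
    """Return the first vehicle term that is followed by a collision verb."""
    m = VEHICLE_RULE.search(norm(text))
    return m.group("vehicle") if m else None


print(extract_vehicle("một nam thanh niên điều khiển xe máy đang chạy ngược chiều "
                      "đâm chính diện vào đầu xe tải"))
```

A purely positional rule like this cannot tell active from passive constructions, so "xe máy bị đâm" also yields "xe máy".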
3.5. SUMMARY
This chapter proposed a method and a model for the overall problem of accident event extraction, and detailed the method and model for its two subproblems: accident event detection and accident event extraction. The first uses a combination of rules and machine learning to detect traffic accident events, and its output is the input to the second. The second extracts the time, location, number of casualties, and the vehicle that caused the accident. Both problems combine rules and machine learning. Chapter 4 evaluates the effectiveness of the method experimentally.
Chapter 4. EXPERIMENTS AND EVALUATION
This chapter describes the environment, the tools, and the packages built by the author; evaluates the effectiveness of the method on the two main problems, event detection and event extraction; and discusses the experimental results before summarizing the chapter.
4.1. EXPERIMENTAL ENVIRONMENT AND TOOLS
The hardware configuration and software tools used in the experiments are listed in Tables 4.1 and 4.2.
Table 4.1. Hardware configuration
No.  Component          Specification
1    CPU                2.6GHz Intel Core i5
2    RAM                8GB
3    Operating system   Win7
4    External storage   256GB
Table 4.2. Software tools used
No.  Name                              Function                      Source
1    Teleport Pro                      Downloads data from websites  http://teleport-pro.en.softonic.com/
2    Eclipse Standard/Kepler Release   Development environment       http://eclipse.org/eclipse
3    JsoupParser                       HTML parsing tool             http://jsoup.org/apidocs/org
4    JvnTextPro v.2.1                  Cam-Tu Nguyen                 http://jvntextpro.sourceforge.net
5    vn.hus.nlp.tokenizer-4.1.1        Open source                   https://code.google.com/p/vntaggergate-plugin/source/browse/lib/vn.hus.nlp.tokenizer-4.1.1.jar?r=85418c90bafeec89da9203f9a7f10338d2cff40c
4.2. BUILDING THE DATA SET
4.2.1. Data collection
The data was collected from http://vovgiaothong.vn/giao-thong-trong-nuoc/ (the VOV Giao thông national traffic channel of the Voice of Vietnam) and http://antoangiaothong.gov.vn/tai-nan-giao-thong/ (the National Traffic Safety Committee). The author chose these sites because they report accidents across the country quickly and fairly completely.
Data collection was carried out with Teleport Pro, which retrieved 500 news items from the above websites; after collection there were 3000 news items.
4.2.2. Data preprocessing
The data is stored in JSON format. The author converts it to HTML and then strips the HTML tags to obtain raw text. After processing, the author obtained 3000 news items. The components of a news item are shown in Table 4.3.
Table 4.3. Components of a news item
No.  Component          Description
1    Title              Title of the news item
2    Summary            Summary of the news item
3    Publication date   Date on which the news item was published
4    Content            Body of the news item
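The tag-stripping step can be sketched with a small HTML-to-text helper. The thesis lists Jsoup (a Java library) among its tools; the version below is only a stand-in written with Python's standard `html.parser` module, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse runs of whitespace into single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


print(html_to_text("<h1>Xe tải lật</h1> <p>Hai người <b>bị thương</b> nặng</p>"))
```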
4.3. EVALUATING EVENT DETECTION
4.3.1. Evaluating the data filter
Experiment description: this experiment evaluates the performance of the data filter.
Experiment statement:
- Input: a set of news items collected from http://vovgiaothong.vn/giao-thong-trong-nuoc/ and http://antoangiaothong.gov.vn/tai-nan-giao-thong/
- Output: the articles in the traffic accident domain
Experimental data: 3,000 news items.
Filtering yielded 919 news items in the traffic accident domain, of which very few are unrelated to traffic accidents. The error rate is computed with formula (4.1); details are given in Table 4.4.
Table 4.4. Error rate of the filtering process
Total news items   Unrelated news items   Error rate
919                19                     3.9%
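Formula for the error rate of the filtering process:
$$\text{Error rate} = \frac{\text{Number of unrelated items}}{\text{Total}} \times 100\% \tag{4.1}$$
where:
Total: the number of news items obtained after filtering.
Number of unrelated items: the number of news items that are not in the traffic accident domain.
As Table 4.4 shows, the filter achieves fairly high accuracy.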
4.3.2. Evaluating classification
Experiment description: this section evaluates the classification step.
Experiment statement:
Input: a set of filtered news items.
Output: the news items labeled EVENT or NOT_EVENT.
Experimental data: each evaluation run uses 100 news items drawn at random from the items kept by the data filter. The results of the runs are shown in Table 4.5.
Table 4.5. Classification results
Run      Correct   Incorrect   Not found   Precision   Recall   F1
1        85        12          3           88%         97%      92%
2        81        16          3           84%         96%      90%
3        83        15          2           85%         98%      91%
4        85        11          4           89%         96%      92%
5        80        17          3           82%         96%      89%
Average  82.8      14.2        3           85%         97%      91%
The results in Table 4.5 show that classification reaches a precision (P) of 85%, a recall (R) of 97%, and an F1 score of 91%.
4.4. EVALUATING EVENT EXTRACTION
4.4.1. Experiment without the classifier
Experiment description: this section evaluates extraction performance.
Experiment statement:
Input: a news item in the traffic accident domain.
Output: the accident event information: the time of the accident, the location, the number of casualties (deaths, injuries), and the vehicle that caused it.
Experimental data: 200 news items drawn at random from the items in the traffic accident domain that have not passed through the classifier.
An event E is defined as a tuple of time, location, number of casualties, and vehicle, as given in formula (3.1). A correct event should therefore contain all four components. An event that lacks the vehicle and the time of the accident is counted as incorrect.
To evaluate event extraction, the author uses three measures: precision (P), recall (R), and the F1 score. They are defined in formulas (4.2), (4.3), and (4.4).
$$P = \frac{\text{Correct events}}{\text{Correct events} + \text{Incorrect events}} \tag{4.2}$$
where:
- Correct events: the number of events the model extracts correctly.
- Incorrect events: the number of events the model extracts incorrectly.
$$R = \frac{\text{Correct events}}{\text{Correct events} + \text{Unextracted events}} \tag{4.3}$$
where:
- Correct events: the number of events the model extracts correctly.
- Unextracted events: the number of events the model fails to extract.
$$F_1 = \frac{2 \times P \times R}{P + R} \tag{4.4}$$
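Formulas (4.2) to (4.4) are easy to check numerically. The sketch below recomputes the first row of Table 4.5 (85 correct, 12 incorrect, 3 not found); the function names are illustrative, and returning 0.0 for an empty denominator is a choice made here, not something the thesis specifies.

```python
def precision(correct: int, incorrect: int) -> float:
    """Formula (4.2): correct / (correct + incorrect)."""
    total = correct + incorrect
    return correct / total if total else 0.0


def recall(correct: int, missed: int) -> float:
    """Formula (4.3): correct / (correct + not extracted)."""
    total = correct + missed
    return correct / total if total else 0.0


def f1_score(p: float, r: float) -> float:
    """Formula (4.4): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


p = precision(85, 12)
r = recall(85, 3)
print(round(100 * p), round(100 * r), round(100 * f1_score(p, r)))
```

The rounded values agree with the 88%, 97%, and 92% reported for that row.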
Using formulas (4.2), (4.3), and (4.4), the author evaluates the extraction model; details are given in Table 4.6.
Table 4.6. Extraction results - data not passed through the classifier
Website                  Correct events   Incorrect events   Events not found   P     R     F1
antoangiaothong.gov.vn   160              34                 6                  82%   96%   89%
vovgiaothong.vn          154              37                 9                  81%   94%   87%
Overall                  314              71                 15                 82%   95%   88%
4.4.2. Experiment with the classifier
Experimental data: 100 news items taken from the items that contain an accident event (labeled EVENT). The extraction results are again evaluated with formulas (4.2), (4.3), and (4.4) and are detailed in Table 4.7.
Table 4.7. Extraction results - data passed through the classifier
Website                  Correct events   Incorrect events   Events not found   P     R     F1
antoangiaothong.gov.vn   91               5                  4                  95%   96%   95%
vovgiaothong.vn          93               4                  2                  96%   98%   97%
Overall                  184              9                  6                  95%   97%   96%
4.4.3. Remarks
Comparing Table 4.6 (data not passed through the classifier) with Table 4.7 (data passed through the classifier) shows that extraction performs better on classified data, with an F1 score of 96% against 88%. This demonstrates the importance of the classifier in the model.
4.5. ERROR ANALYSIS
4.5.1. Error analysis of event detection
Reviewing the data after the experiments revealed errors where a title mentions a means of transport but the news item is not in the traffic accident domain. For example, in Figure 4.1 the title "khổ vì mua xe trả góp" (the trouble with buying a vehicle on installments) contains the vehicle word "xe", but the item actually belongs to the commerce domain, not the traffic accident domain. The filter nevertheless assigns it to the traffic accident domain.
Figure 4.1. Filter error on an item outside the traffic accident domain
4.5.2. Error analysis of event extraction
Extraction performance in the extraction phase is still limited. Investigating the causes, the author found the following common errors. In location extraction, some news items mention only a road name and no locality (commune/district/province); in such cases the exact location cannot be determined, or the result is Null. In a small number of cases, abbreviated information cannot be extracted. In vehicle extraction, the wrong vehicle is sometimes extracted: from "xe máy bị đâm, nạn nhân chết tại chỗ" (a motorbike was hit, the victim died at the scene), the motorbike is extracted as the vehicle that caused the accident, which is wrong. In casualty extraction, from a sentence such as "Nạn nhân được người dân đưa đi cấp cứu" (the victim was taken to emergency care by local people), no count can be extracted because no number precedes the suffix. More detail on these errors is given in Table 4.8.
Table 4.8. Some errors in extraction
No.  Correct information                          Extracted information
1    Phường 4, Quận 1, Phường 9, TP Hồ Chí Minh   Quận 5, Phường 7, Quận Bình Thạch
2    Tỉnh Pray Veng                               Null
3    Huyện Xuân Trường, Nam Định                  Nam Định
4    Quốc lộ 1A                                   Null
5    xe máy bị đâm                                Xe máy
6    Nạn nhân được người dân                      Null
4.6. SOME RESULTS OF ANALYZING THE EVENTS
The extraction results are used for statistics such as the number of accidents by week, by day of the week, by province, and by type of vehicle involved.
1./ Number of accidents by week over two months (April and May 2014). The data is concentrated in April and May 2014. The statistics show that over the 30/4 and 1/5 holidays the number of accidents rose alarmingly: nationwide there were 191 accidents, causing 109 casualties. Details are shown in Chart 4.1.
Chart 4.1. Number of accidents by week in April and May
2./ Number of accidents by day of the week. The results show that accidents increase markedly at weekends. The number of accidents on each day of the week is shown in Chart 4.2.
Chart 4.2. Number of accidents by day of the week
3./ Number of accidents by province nationwide (for 4 representative provinces). The results show that Ho Chi Minh City has the highest number of accidents. See Chart 4.3.
Chart 4.3. Number of accidents by province
4./ Vehicles most frequently involved in accidents (the 5 vehicle types with the highest accident counts). Details for each vehicle type are shown in Chart 4.4.
Chart 4.4. Number of accidents by type of vehicle
From the statistics on traffic accidents, the author draws the following
remarks:
For road users: when travelling on public holidays, at weekends, in large
cities, and on vehicles such as motorbikes, buses, coaches, container trucks
and especially trucks, people must be extremely careful, above all those
operating the vehicle, to avoid regrettable accidents for themselves and for
other road users.
For the authorities: effective measures should be taken to prevent traffic
accidents, especially during long holidays.
4.7. SUMMARY
In this chapter, the author carried out experiments and examined and
evaluated the results of the information extraction model for news text
built in Chapter 3. The experimental results show the feasibility of the
model for solving the problem of extracting accident events.
CONCLUSION
1./ Results achieved by the thesis
In this thesis, the author studied event extraction methods and the method
that combines rules with machine learning, as used for the event detection
problem and the event extraction problem. On that basis, a model and a
detailed solution method were built for the problem of detecting accident
events and the problem of extracting accident events. In the experiments on
the accident data domain, event extraction reached a precision (P) of 95%, a
recall (R) of 97% and an F1 measure of 96%, which demonstrates the
feasibility of the model.
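The reported F1 value is consistent with the reported precision and recall, as the short check below shows. It uses the standard definition of F1 as the harmonic mean of P and R; the function name is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.95, 0.97), 2))  # 0.96
```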
2./ Limitations
- The results of the classifier are not yet high because of ambiguity
between news items that contain an accident event and news items that
contain other traffic accident information.
- The rule set is built by hand, so it can hardly cover all of the data. As
a result, the rule set may miss data relevant to the domain.
- Dictionary-based location extraction is still ambiguous in some cases,
when the data do not provide enough information about the location.
- In some cases involving abbreviations, the extracted information is not
yet accurate.
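The dictionary limitation can be illustrated with a minimal sketch of dictionary lookup. The gazetteer contents, the function name and the substring matching are hypothetical simplifications and are not the dictionary or matching method used in the thesis.

```python
from typing import List

# Hypothetical location dictionary (gazetteer); a real one would be far larger.
GAZETTEER = ["Ho Chi Minh City", "Nam Dinh", "District 1"]


def extract_locations(text: str) -> List[str]:
    """Return the gazetteer entries found in the text, in dictionary order."""
    lowered = text.lower()
    return [name for name in GAZETTEER if name.lower() in lowered]


# The district is missing from the dictionary, so only the province is found.
print(extract_locations("The crash happened in Xuan Truong District, Nam Dinh"))
# Only a road is mentioned, so no locality can be determined.
print(extract_locations("The crash happened on National Highway 1A"))
```

In this sketch, a locality absent from the dictionary is silently dropped, and a news item that names only a road yields an empty result.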
3./ Future directions
The next research direction of the thesis is to continue improving and
developing the model for event extraction from Vietnamese news text.
Extraction will be extended to further important attributes such as: time of
day (the hour at which the accident occurred), the age of the driver of the
vehicle that caused the accident, the driver's occupation, the terrain where
the accident occurred, and so on. The extraction results can then be
aggregated into statistics such as: time of day (whether accidents tend to
happen at night, during the morning commute, or at the evening rush hour),
day of the week (which days accidents usually fall on, such as working days
or weekends), season of the year (the festival season, the university
entrance examination season, the rainy season, or the summer holidays),
terrain (steep roads, sharp bends, or roads with many junctions), and the
occupation of the driver. From these statistics it is possible to find the
causes of accidents and to compare the scale and severity of accidents
across time periods, and from there to give an overall assessment of the
trend in accidents. In addition, the statistical results will be visualised
on a map of Vietnam, marking accident-prone locations with warnings, signs
and notes.
TI LIU THAM KHO
Ti liu ting Anh
[1] Sunita Sarawagi (2008), Information Extraction, Indian Institute of
Technology, CSE, Mumbai 400076, India,
[2] Douglas E. Appelt. Introduction to information extraction technology. In
Tutorial held at IJCAI-99, Stockholm, Sweden, 1999.
[3] Young-Sook Hwang Chun Hong-Woo and Hae-Chang Rim. Unsupervised
event extraction from biomedical literature using co-occurrence information and
basic patterns. In: 1st International Joint Conference on Natural Language
Processing (IJCNLP 2004). Lecture Notes in Computer Science. Springer-
Verlag Berlin Heidelberg, vol. 3248:777786, 2004.
[4] Uzay Kaymak Frederik Hogenboom, Flavius Frasincar and Franciska de
Jong. An overview of event extraction from text. Workshop on Detection,
Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011)
at Tenth International Semantic Web Conference (ISWC 2011), 779:pp. 4857,
2011.10
[5] M.A Hearst. Automatic acquisition of hyponyms from large text corpora. In:
14th Conference on Computational Linguistics (COLING 1992), vol.
2:539545, 1992.
[6] M.A Hearst. Wordnet: An electronic lexical database and some of its
applications. In Automated Discovery of WordNet Relations, pp. 131151. MIT
Press, 1998.
[7] Frederik Hogenboom Jethro Borsje and Flavius Frasincar. Semi-automatic
financial events discovery based on lexico-semantic patterns. International
Journal of Web Engineering and Technology, 6(2):115140, 2010.
49
[8] Yea-Juan Chen Lee Chang-Shing and Zhi-Wei Jian. Ontology-based fuzzy
event extraction agent for chinese e-news summarization. In Expert Systems
with Applications 25(3), 431 447, 2003.
[9] Okamoto Masayuki and Masaaki Kikuchi. Discovering volatile events in
your neighborhood: Local-area topic extraction from blog entries. In: 5th Asia
Information Retrieval Symposium (AIRS 2009). Lecture Notes in Computer
Science. Springer-Verlag Berlin Heidelberg, vol. 5839:181192, 2009.
[10] Liang Xiang Xing Chen Mingrong Liu, Yicen Liu and Qing Yang.
Extracting key entities and significant events from online daily news. In: 9th
International Conference on Intel- ligent Data Engineering and Automated
Learning (IDEAL 2008). Lecture Notes in Computer Science. Springer-Verlag
Berlin Heidelberg, vol. 5326:201209, 2008.
[11] L. Peshkin and A. Pfeffer. Bayesian information extraction network. In
Proc.of the 18th International Joint Conference on Artificial Intelligence
(IJCAI), 2003.
[12] Hristo Tanev Piskorski Jakub and Pinar Oezden Wennerberg. Extracting
violent events from on-line news for ontology population. In: 10th International
Conference on Business Information Systems (BIS 2007). Lecture Notes in
Computer Science. Springer-Verlag Berlin Heidelberg, vol. 4439:287300,
2007.
[13] Silja Huttunen Ralph Grishman and Roman Yangaber. Information
extraction for enhenced access to disease outbreak reports. Journal of
Biomedical Informastic, 35(4):pp. 236246, 2002.
[14] Ai Kawazoe Son Doan and Nigel Collier. Global health monitor - a web-
based system for detecting and mapping infectious diseases. Proc. International
Joint Conference on Natural Language Processing (IJCNLP), Companion
Volume,Hyderabad, India:pp. 951956, 2008.
50
[15] William H. Hsu Svitlana Volkova, Doina Caragea and Swathi Bujuru.
Animal disease event recognition and classification. 2010
[16] Yusuke Miyao Akane Yakushiji, Yuka Tateisi and Jun ichi Tsujii. Event
extraction from biomedical papers using a full parser. In In: 6th Pacific
Symposium on Biocomputing (PSB 2001):pp. 408419, 2001.
[17] Helen L. Johnson Chris Roeder Philip V. Ogren-William A. Baumgartner Jr.
Elizabeth White Hannah Tipney K. Bretonnel Cohen, Karin Verspoor and Lawrence
Hunter. High-precision biological event extraction with a concept recognizer. In In:
Workshop on BioNLP: Shared Task collocated with the NAACL-HLT 2009 Meeting.
pp. 5058. Association for Computational Linguistics, 2009.
[18] S. Soderland, Learning information extraction rules for semi-structured and free
text, Machine Learning, vol. 34, 1999.
[19] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, Gate: A framework
and graphical development environment for robust nlp tools and applications, in
Proceedings of the 40th Anniversary Meeting of the Association for Computational
Linguistics, 2002
[20] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan, Declarative
information extraction using datalog with embedded extraction predicates, in VLDB,
pp. 10331044, 2007.
[21] Ralph Grishman and Beth Sundheim. Message understanding conference-6: a
brief history. Proceedings of the 16th conference on Computational linguistics,
COLING, Stroudsburg, PA, USA, Volume 1:pp. 466471, 1996.
[22] Doddington George R. The automatic content extraction (ace) program tasks,
data, and evaluation. In LREC, 2004
[23] Keita Sato Nishihara, Yoko and Wataru Sunayama. Event extraction and
visualization for obtaining personal experiences from blogs. In: Symposiumon Human
Interface 2009 on Human Interface and the Management of Information. Information
and Interaction. Part II. Lecture Notes in Computer Science, Springer-Verlag Berlin
Heidelberg, vol. 5618:315324, 2009.
51
[24] Chinatsu Aone and Mila Ramos-Santacruz. Rees: A large-scale relation and event
extraction system. In In: 6th Applied Natural Language Processing Conference (ANLP
2000):pp. 7683. Association for Computational Linguistics, 2000.
[25] Huanye Sheng Li Fang and Dongmo Zhang. Event pattern discovery from the
stock market bulletin. In: 5th International Conference on Discovery Science (DS
2002). Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg, vol.
2534:3549, 2002.
[26] Vargas-Vera Maria and David Celjuska. Event recognition on news stories and
semi-automatic population of an ontology. In In: 3rd IEEE/WIC/ACM International
Conference on Web Intelligence (WI 2004). pp. 615618 , 2004.
[27] Takuya Nakamura Agnes Sandor Cedric Tarsitano Philippe Capet, Thomas
Delavallade and Stavroula Voyatzi. A risk assessment system with automatic
extraction of event types. Intelligent Information Processing IV, IFIP International
Federation for Information Processing. Springer Boston, vol. 288:220229, 2008.