A Robust Shallow Parser for Swedish
Ola Knutsson, Johnny Bigert, Viggo Kann
Royal Institute of Technology, Sweden
Introduction
What is robustness?
Robust against noisy, ill-formed and partial natural language data
Shallow parsing
Many NLP-applications do not need full parsing
Shallow parsing:
• A parsing approach
• Pre-processing for full parsing
A collection of techniques
Abney - finite state cascades (1991)
Currently, much attention on machine learning (ML)
Well suited to modularization
Chunking and phrase identification
Common modules in a shallow parser:
• Tokenizer
• PoS-tagger
• Chunker
• Phrase identifier
• Grammatical function identifier
Chunking: [NP Den mycket gamla mannen] [VC gillade] [NP mat]
Phrase identification: [NP Den [AP mycket gamla] mannen] [VC gillade] [NP mat]
Parsers for Swedish
Full parsers: UCP (Sågvall Hein) and SLE (Gambäck)
Shallow parsers (phrase structure): Cass-Swe (Kokkinakis) and Megyesi (using machine learning)
Dependency: CG (Birn) and FDG (Voutilainen)
Granska Text Analyzer (GTA)
Hand-crafted rules
Context-free backbone
Partly object-oriented notation
Major Phrase Categories
NP: Han såg den lilla mannen på bänken
VC: Han har spelat kort hela natten
PP: Han såg spår i sanden
AP: Han ogillade små vita lögner
ADVP: Han vill inte gå på bio.
INFP: Han tycker om att spela
Clause Boundary Identification
Based on Ejerhed’s algorithm
Context-sensitive rules
Using only PoS information
Different kinds of rules
GTA contains 260 rules
200 identify phrase structure
20 identify clause boundaries
40 are selection rules (disambiguation)
Example rule, [NP den lilla bilen]
NPmin@ {
X(wordcl=dt | wordcl=hd | wordcl=rg),
X2(wordcl=ab | wordcl=rg)?,
Y(wordcl=jj | wordcl=ro | wordcl=pc)*,
Z(wordcl=nn)
-->
action(help, wordcl:=Z.wordcl, pnf:=undef,
gender:=Z.gender, num:=Z.num, spec:=Z.spec, case:=Z.case) }
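As a rough illustration of what the NPmin rule above matches, the pattern can be read as a sequence over word classes: one X, an optional X2, any number of Y, then a noun Z. The sketch below mirrors only that pattern in Python. The function name `match_npmin` and the set names are hypothetical, and feature checks other than `wordcl` are left out.

```python
# Word-class sets copied from the rule; the Python names are illustrative only.
X_CLASSES = {"dt", "hd", "rg"}
X2_CLASSES = {"ab", "rg"}
Y_CLASSES = {"jj", "ro", "pc"}


def match_npmin(wordclasses, start):
    """Return (end, head) if X X2? Y* Z matches at `start`, else None.

    `end` is the index just past the match, `head` is the index of the
    noun Z, whose features the rule's action copies onto the phrase.
    """
    i = start
    if i >= len(wordclasses) or wordclasses[i] not in X_CLASSES:
        return None
    i += 1
    # Optional X2. Greedy matching is safe here because the X2 set shares
    # no members with the Y set or with the noun class.
    if i < len(wordclasses) and wordclasses[i] in X2_CLASSES:
        i += 1
    # Zero or more Y elements.
    while i < len(wordclasses) and wordclasses[i] in Y_CLASSES:
        i += 1
    # Obligatory noun Z.
    if i < len(wordclasses) and wordclasses[i] == "nn":
        return (i + 1, i)
    return None


# "den lilla bilen" tagged dt jj nn
print(match_npmin(["dt", "jj", "nn"], 0))  # (3, 2)
```

The returned head index corresponds to Z in the rule, which is where the action takes gender, number, species and case from.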
Clause boundary rule
cl@ {
V(sed!=sen & text!="som" & wordcl!=sn),
X((wordcl=pn & pnf=sub)| (wordcl=pm & case=nom) |
(wordcl=nn & case=nom & V.case!=gen) | wordcl=ab),
---endleftcontext---,
Y(wordcl=kn),
---beginrightcontext---,
Y2(((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) |
(wordcl=nn & case=nom) | wordcl=ab) & wordcl=X.wordcl),
Z(wordcl=vb &
(vbf=prs | vbf=prt | vbf=imp))
-->
action(help, wordcl:=Y.wordcl) }
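To make the context conditions of the clause boundary rule easier to follow, here is a simplified Python reading of it. It only tests whether the five-token window V X [Y] Y2 Z matches around a conjunction. Tokens are plain dictionaries, the names `x_like` and `matches_cl_rule` are hypothetical, feature values are compared by exact equality (ambiguous values such as sub/obj are not handled), and the example sentence with its tags is made up for illustration.

```python
def x_like(tok):
    """Condition shared by X and Y2: subject pronoun, nominative proper
    noun, nominative noun, or adverb."""
    wc = tok["wordcl"]
    return ((wc == "pn" and tok.get("pnf") == "sub")
            or (wc == "pm" and tok.get("case") == "nom")
            or (wc == "nn" and tok.get("case") == "nom")
            or wc == "ab")


def matches_cl_rule(tokens, i):
    """True if tokens[i] is the conjunction Y with left context V X
    and right context Y2 Z, as in the rule."""
    if i < 2 or i + 2 >= len(tokens):
        return False
    v, x, y, y2, z = tokens[i - 2:i + 3]
    v_ok = (v.get("sed") != "sen" and v["text"] != "som"
            and v["wordcl"] != "sn")
    # A noun in X position is only accepted if V is not genitive.
    x_ok = x_like(x) and not (x["wordcl"] == "nn" and v.get("case") == "gen")
    y_ok = y["wordcl"] == "kn"
    y2_ok = x_like(y2) and y2["wordcl"] == x["wordcl"]
    z_ok = z["wordcl"] == "vb" and z.get("vbf") in {"prs", "prt", "imp"}
    return v_ok and x_ok and y_ok and y2_ok and z_ok


# Illustrative sentence (not from the slides): "Igår sjöng han och hon dansade"
sentence = [
    {"text": "Igår", "wordcl": "ab"},
    {"text": "sjöng", "wordcl": "vb", "vbf": "prt"},
    {"text": "han", "wordcl": "pn", "pnf": "sub"},
    {"text": "och", "wordcl": "kn"},
    {"text": "hon", "wordcl": "pn", "pnf": "sub"},
    {"text": "dansade", "wordcl": "vb", "vbf": "prt"},
]
print([i for i in range(len(sentence)) if matches_cl_rule(sentence, i)])  # [3]
```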
The Tetris Algorithm
NP: boken
NP: Fänrik Ax
PP: till general Claes
VC: gav
PP: till general Claes Olsson
NP: general Claes Olsson
PP: till general
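The slides list overlapping candidate phrases but do not spell out how the Tetris algorithm chooses among them. The sketch below is therefore not the Tetris algorithm itself; it shows a generic greedy longest-first selection of non-overlapping candidates, which is one common way to resolve such overlaps. The word order "Fänrik Ax gav boken till general Claes Olsson" and the token offsets are assumptions reconstructed from the fragments above, and `select_longest` is a hypothetical name.

```python
def select_longest(candidates):
    """Greedy longest-first choice of non-overlapping spans.

    Each candidate is (label, start, end) with `end` exclusive.
    """
    chosen = []
    by_length = sorted(candidates, key=lambda c: c[2] - c[1], reverse=True)
    for label, start, end in by_length:
        # Keep the candidate only if it is disjoint from everything chosen.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((label, start, end))
    return sorted(chosen, key=lambda c: c[1])


# Assumed token order: Fänrik Ax gav boken till general Claes Olsson
#                      0      1  2   3     4    5       6     7
candidates = [
    ("NP", 3, 4),  # boken
    ("NP", 0, 2),  # Fänrik Ax
    ("PP", 4, 7),  # till general Claes
    ("VC", 2, 3),  # gav
    ("PP", 4, 8),  # till general Claes Olsson
    ("NP", 5, 8),  # general Claes Olsson
    ("PP", 4, 6),  # till general
]
print(select_longest(candidates))
# [('NP', 0, 2), ('VC', 2, 3), ('NP', 3, 4), ('PP', 4, 8)]
```

This version keeps only top-level phrases. A parser that also outputs nested phrases would need a further step to retain candidates that fit entirely inside a chosen span.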
The IOB format
Ramshaw and Marcus (1995)
A phrase/clause tag contains two parts:
1. Phrase/clause type, e.g. NP, PP
2. One of two position tags:
I = Inside a phrase/clause
B = Beginning of a phrase/clause
A word that does not belong to any phrase gets a single tag:
O = Outside
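A small sketch of how flat chunks map onto this tagging scheme, assuming tags are written as phrase type followed by B or I (for example NPB, NPI), as in the GTA output shown on the following slides. The function name `chunks_to_iob` and the input layout are hypothetical.

```python
def chunks_to_iob(chunks):
    """Turn [(label, [words...]), ...] into [(word, tag), ...].

    A label of None marks words outside any phrase (tag O).
    """
    rows = []
    for label, words in chunks:
        for position, word in enumerate(words):
            if label is None:
                tag = "O"
            elif position == 0:
                tag = label + "B"  # first word begins the phrase
            else:
                tag = label + "I"  # later words are inside it
            rows.append((word, tag))
    return rows


chunks = [
    ("NP", ["Den", "mycket", "gamla", "mannen"]),
    ("VC", ["gillade"]),
    ("NP", ["mat"]),
    (None, ["."]),
]
for word, tag in chunks_to_iob(chunks):
    print(word, tag)
```

Because every phrase starts with a B tag, two adjacent phrases of the same type stay distinguishable without any explicit end marker.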
Disagreement error
De dt.utr/neu.plu.def NPB CLB
gamla jj.pos.utr/neu.plu.ind/def.nom APB|NPI CLI
äppelträdet nn.neu.sin.def.nom NPI CLI
kan vb.prs.akt.mod VCB CLI
bli vb.inf.akt.kop VCI CLI
som kn O CLI
nya jj.pos.utr/neu.plu.ind/def.nom APB CLI
. mad O CLI
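The phrase column above can hold several tags joined by "|", for example APB|NPI. Judging from the examples in these slides, the leftmost tag appears to be the innermost phrase and the rightmost the outermost; that reading is an inference, not something the slides state. Under that assumption, the sketch below turns the phrase column back into bracket notation. The names `parse_levels` and `to_brackets` are hypothetical.

```python
def parse_levels(tag):
    """Split 'APB|NPI' into [('NP', 'I'), ('AP', 'B')], outermost first."""
    if tag == "O":
        return []
    levels = [(part[:-1], part[-1]) for part in tag.split("|")]
    return levels[::-1]


def to_brackets(rows):
    """Rebuild bracketed phrases from (word, phrase_tag) rows."""
    out, stack = [], []

    def close_from(depth):
        # Close every open phrase at nesting depth `depth` or deeper.
        while len(stack) > depth:
            stack.pop()
            out.append("]")

    for word, tag in rows:
        levels = parse_levels(tag)
        depth = 0
        # Walk down the levels that continue an already open phrase.
        while (depth < len(levels) and depth < len(stack)
               and levels[depth][1] == "I"
               and stack[depth] == levels[depth][0]):
            depth += 1
        close_from(depth)
        # Everything below the continued levels is newly opened.
        for label, _flag in levels[depth:]:
            stack.append(label)
            out.append("[" + label)
        out.append(word)
    close_from(0)
    return " ".join(out)


rows = [
    ("De", "NPB"), ("gamla", "APB|NPI"), ("äppelträdet", "NPI"),
    ("kan", "VCB"), ("bli", "VCI"), ("som", "O"), ("nya", "APB"), (".", "O"),
]
print(to_brackets(rows))
# [NP De [AP gamla ] äppelträdet ] [VC kan bli ] som [AP nya ] .
```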
Partial input
Arrangör nn.utr.sin.ind.nom NPB CLB
var vb.prt.akt.kop VCB CLI
Järfälla pm.gen NPB|NPB CLI
naturskyddsförening nn.utr.sin.ind.nom NPB|NPI CLI
där ab ADVPB CLI
är vb.prs.akt.kop VCB CLI
medlem nn.utr.sin.ind.nom NPB CLI
. mad O CLI
Noisy data
Inte ab APB CLB
så ab ADVPB|APB|API CLI
tjck jj.pos.utr.sin.ind.nom APB|API|API CLI
som ha O CLB
det pn.neu.sin.def.sub/obj NPB CLI
ofta ab.pos ADVPB CLI
står vb.prs.akt VCB CLI
i pp PPB CLI
lärobökerna nn.utr.plu.def.nom NPB|PPI CLI
; mid O CLI
Word order violation
Ympkvisten nn.utr.sin.def.nom NPB CLB
inte ab ADVPB CLI
ska vb.prs.akt.mod VCB CLI
vara vb.inf.akt.kop VCI CLI
sådär ab ADVPB|APB CLI
lång jj.pos.utr.sin.ind.nom APB CLI
, mid O CLI
Evaluation
Manually corrected output from GTA
Untuned GTA in the evaluation
15 000 words from SUC
5 genres
F-scores for individual phrase types
Type Accuracy Count
ADVP 81.9 1008
AP 91.3 1332
INFP 81.9 512
NP 91.4 6895
O 94.4 2449
PP 95.3 3886
VC 92.9 2562
Total 88.7
F-score for clause boundary identification
Tagger F-score
UNIGRAM 84.2
BRILL 87.3
TNT 88.3
F-score for a baseline identifier was 69.0%
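For reference, the F-score is the harmonic mean of precision and recall. The slides do not say which unit was compared in the evaluation, so the sketch below assumes a simple per-token comparison of one label between a gold and a predicted sequence. The function name `f_score` and the toy data are illustrative only and unrelated to the reported figures.

```python
def f_score(gold, predicted, label):
    """F1 for one label over two aligned tag sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["NP", "NP", "VC", "NP", "O"]
predicted = ["NP", "VC", "VC", "NP", "O"]
print(round(f_score(gold, predicted, "NP"), 3))  # 0.8
```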
Applications using GTA
We are using GTA in:
Grammar checking, statistical and rule based
Clustering of medical texts
CALL-systems
What do you want to do with GTA?