A Robust Shallow Parser for Swedish

Ola Knutsson, Johnny Bigert, Viggo Kann

Royal Institute of Technology, Sweden

Introduction

What is robustness?

Robust against noisy, ill-formed and partial natural language data

Shallow parsing

Many NLP applications do not need full parsing

Shallow parsing:

• A parsing approach

• Pre-processing for full parsing

A collection of techniques

Abney - finite state cascades (1991)

Currently, much attention on machine learning (ML)

Well suited to modularization

Chunking and phrase identification

Common modules in a shallow parser:

• Tokenizer

• PoS-tagger

• Chunker

• Phrase identifier

• Grammatical function identifier

Chunking: [NP Den mycket gamla mannen] [VC gillade] [NP mat] ("The very old man liked food")

Phrase identification: [NP Den [AP mycket gamla] mannen] [VC gillade] [NP mat]
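The bracket notation above can be read mechanically. The following is a minimal sketch, not part of GTA: the function names `parse_brackets` and `words` are hypothetical, and the only assumption about the notation is that the token right after each `[` is the phrase label.

```python
import re

def parse_brackets(text):
    """Parse '[NP Den [AP mycket gamla] mannen]' into nested (label, children) tuples."""
    # Tokens are '[', ']' or any run of characters without spaces or brackets.
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    root = ("ROOT", [])
    stack = [root]
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "[":
            # Assumption: the token right after '[' is the phrase label.
            node = (tokens[i + 1], [])
            stack[-1][1].append(node)
            stack.append(node)
            i += 2
        elif tok == "]":
            stack.pop()
            i += 1
        else:
            stack[-1][1].append(tok)
            i += 1
    return root[1]

def words(node):
    """Flatten a node (or a bare word) back into its list of words."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node[1]:
        out.extend(words(child))
    return out

tree = parse_brackets("[NP Den [AP mycket gamla] mannen][VC gillade][NP mat]")
for label, _ in tree:
    print(label)
```

Chunking output is the flat case of the same structure: every top-level node then contains only words, with no nested phrases.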

Parsers for Swedish

Full parser: UCP (Sågvall Hein) and SLE (Gambäck)

Shallow parsers (phrase structure): Cass-Swe (Kokkinakis) and Megyesi using machine learning

Dependency: CG (Birn) and FDG (Voutilainen)

Granska Text Analyzer (GTA)

Hand-crafted rules

Context-free backbone

Partly object-oriented notation

Major Phrase Categories

NP: Han såg den lilla mannen på bänken

VC: Han har spelat kort hela natten

PP: Han såg spår i sanden

AP: Han ogillade små vita lögner

ADVP: Han vill inte gå på bio.

INFP: Han tycker om att spela

Clause Boundary Identification

Based on Ejerhed’s algorithm

Context-sensitive rules

Using only PoS information

Different kinds of rules

GTA contains 260 rules

200 identify phrase structure

20 clause boundary identification

40 selection rules (disambiguation)

Example rule, [NP den lilla bilen]

NPmin@ {
X(wordcl=dt | wordcl=hd | wordcl=rg),
X2(wordcl=ab | wordcl=rg)?,
Y(wordcl=jj | wordcl=ro | wordcl=pc)*,
Z(wordcl=nn)
-->
action(help, wordcl:=Z.wordcl, pnf:=undef,
gender:=Z.gender, num:=Z.num, spec:=Z.spec, case:=Z.case) }
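The left-hand side of the NPmin rule is a regular pattern over word classes: one determiner-like word, an optional adverb or numeral, any number of adjective-like modifiers, then a noun. The sketch below mimics only that pattern in plain Python. It is not GTA's rule engine, the function names are hypothetical, and the feature copying done by `action(...)` is not modeled.

```python
# Word-class sets taken from the NPmin rule above.
DET = {"dt", "hd", "rg"}   # X
ADV = {"ab", "rg"}         # X2, optional
MOD = {"jj", "ro", "pc"}   # Y, zero or more

def match_npmin(tags, start):
    """Return the end index (exclusive) of an NPmin match starting at `start`, or None."""
    i = start
    if i >= len(tags) or tags[i] not in DET:
        return None
    i += 1
    # ADV, MOD and {"nn"} are disjoint, so greedy matching needs no backtracking.
    if i < len(tags) and tags[i] in ADV:
        i += 1
    while i < len(tags) and tags[i] in MOD:
        i += 1
    if i < len(tags) and tags[i] == "nn":
        return i + 1
    return None

def find_npmin(tagged):
    """Scan a list of (word, wordclass) pairs and collect the words of each match."""
    tags = [t for _, t in tagged]
    found, i = [], 0
    while i < len(tags):
        end = match_npmin(tags, i)
        if end is None:
            i += 1
        else:
            found.append([w for w, _ in tagged[i:end]])
            i = end
    return found

sentence = [("Han", "pn"), ("såg", "vb"), ("den", "dt"), ("lilla", "jj"), ("bilen", "nn")]
print(find_npmin(sentence))
```

Note that X is not optional in the rule as shown, so a bare noun does not match this particular pattern.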

Clause boundary rule

cl@ {

V(sed!=sen & text!="som" & wordcl!=sn),

X((wordcl=pn & pnf=sub)| (wordcl=pm & case=nom) |

(wordcl=nn & case=nom & V.case!=gen) | wordcl=ab),

---endleftcontext---,

Y(wordcl=kn),

---beginrightcontext---,

Y2(((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) |

(wordcl=nn & case=nom) | wordcl=ab) & wordcl=X.wordcl),

Z(wordcl=vb &

(vbf=prs | vbf=prt | vbf=imp))

-->

action(help, wordcl:=Y.wordcl) }

The Tetris Algorithm

NP: boken

NP: Fänrik Ax

PP: till general Claes

VC: gav

PP: till general Claes Olsson

NP: general Claes Olsson

PP: till general

The IOB format

Ramshaw and Marcus 1995

A phrase/clause tag contains two parts:

1. Phrase/Clause type, e.g. NP, PP

2. One of two position tags:

I = Inside a phrase/clause

B = Beginning of a phrase/clause

A word that does not belong to any phrase gets a single tag:

O = Outside
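As an illustration of the tagging scheme, the sketch below turns a flat chunk sequence into IOB tags. The tag spelling (type followed by B or I, e.g. NPB, NPI) follows the example output later in these slides; the function name `chunks_to_iob` and the use of `None` for words outside any phrase are assumptions made for this example.

```python
def chunks_to_iob(chunks):
    """chunks: list of (label or None, [words]); returns a list of (word, tag) pairs."""
    tagged = []
    for label, chunk_words in chunks:
        for k, word in enumerate(chunk_words):
            if label is None:
                # Word outside every phrase.
                tagged.append((word, "O"))
            else:
                # First word of the chunk gets B, the rest get I.
                tagged.append((word, label + ("B" if k == 0 else "I")))
    return tagged

chunks = [
    ("NP", ["Den", "mycket", "gamla", "mannen"]),
    ("VC", ["gillade"]),
    ("NP", ["mat"]),
    (None, ["."]),
]
for word, tag in chunks_to_iob(chunks):
    print(word, tag)
```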

Disagreement error

De dt.utr/neu.plu.def NPB CLB

gamla jj.pos.utr/neu.plu.ind/def.nom APB|NPI CLI

äppelträdet nn.neu.sin.def.nom NPI CLI

kan vb.prs.akt.mod VCB CLI

bli vb.inf.akt.kop VCI CLI

som kn O CLI

nya jj.pos.utr/neu.plu.ind/def.nom APB CLI

. mad O CLI

Partial input

Arrangör nn.utr.sin.ind.nom NPB CLB

var vb.prt.akt.kop VCB CLI

Järfälla pm.gen NPB|NPB CLI

naturskyddsförening nn.utr.sin.ind.nom NPB|NPI CLI

där ab ADVPB CLI

är vb.prs.akt.kop VCB CLI

medlem nn.utr.sin.ind.nom NPB CLI

. mad O CLI

Noisy data

Inte ab APB CLB

så ab ADVPB|APB|API CLI

tjck jj.pos.utr.sin.ind.nom APB|API|API CLI

som ha O CLB

det pn.neu.sin.def.sub/obj NPB CLI

ofta ab.pos ADVPB CLI

står vb.prs.akt VCB CLI

i pp PPB CLI

lärobökerna nn.utr.plu.def.nom NPB|PPI CLI

; mid O CLI

Word order violation

Ympkvisten nn.utr.sin.def.nom NPB CLB

inte ab ADVPB CLI

ska vb.prs.akt.mod VCB CLI

vara vb.inf.akt.kop VCI CLI

sådär ab ADVPB|APB CLI

lång jj.pos.utr.sin.ind.nom APB CLI

, mid O CLI
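Each line in the examples above has four whitespace-separated fields: word, PoS tag with morphological features, phrase tags, and clause tag. A minimal reader for that layout might look as follows. The function names are hypothetical, and reading `APB|NPI` as innermost phrase first is an interpretation of the examples, not something the slides state.

```python
def split_tag(tag):
    """Split 'NPB' into ('NP', 'B'); the outside tag 'O' becomes (None, 'O')."""
    if tag == "O":
        return (None, "O")
    return (tag[:-1], tag[-1])

def parse_line(line):
    """Parse one 'word pos phrase-tags clause-tag' line into a dict."""
    word, pos, phrases, clause = line.split()
    return {
        "word": word,
        "pos": pos.split("."),  # word class first, then features
        # Assumption: nested phrase levels are separated by '|', innermost first.
        "phrases": [split_tag(t) for t in phrases.split("|")],
        "clause": split_tag(clause),
    }

token = parse_line("gamla jj.pos.utr/neu.plu.ind/def.nom APB|NPI CLI")
print(token["word"], token["pos"][0], token["phrases"], token["clause"])
```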

Evaluation

Manually corrected output from GTA

Untuned GTA in the evaluation

15 000 words from SUC

5 genres

F-scores for individual phrase types

Type    Accuracy   Count
ADVP    81.9       1008
AP      91.3       1332
INFP    81.9        512
NP      91.4       6895
O       94.4       2449
PP      95.3       3886
VC      92.9       2562
Total   88.7

F-score for clause boundary identification

Tagger    F-score
UNIGRAM   84.2
BRILL     87.3
TNT       88.3

F-score for a baseline identifier was 69.0%
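The slides do not say how the F-scores were computed. Assuming the usual balanced F-score (the harmonic mean of precision and recall), a comparison of predicted items against a gold standard can be sketched as below. The function name `f_score` and the toy boundary positions are made up for the example and are not taken from the evaluation.

```python
def f_score(gold, predicted):
    """Balanced F-score between two collections of items (e.g. boundary positions)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy data: word positions where a clause begins.
gold_boundaries = {0, 5, 9}
predicted_boundaries = {0, 5, 7, 12}
print(round(100 * f_score(gold_boundaries, predicted_boundaries), 1))
```

Here two of the four predictions are correct (precision 0.5) and two of the three gold boundaries are found (recall 2/3).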

Applications with GTA

We are using GTA in:

Grammar checking, statistical and rule based

Clustering of medical texts

CALL (computer-assisted language learning) systems

What do you want to do with GTA?

More information

www.nada.kth.se/theory/projects/xcheck

Contact: Ola Knutsson

[email protected]
