Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest [email protected] NATO-ASI, 15-27 October, 2007, Batumi - Georgia


Page 1

Page 2

Overview of the talks (I: approx. 8h)

• The BLARK and ELARK concepts
  – Monolingual and multilingual BLARK/ELARK
• Morpho-Syntactic Tagging
  – Tagset design; case study: tiered tagging
  – Creating training data; validation and correction of the training data
  – Gold standards, tagset mapping & cross-tagging
  – Combining language models for further improvements

Page 3

Overview of the talks (II: approx. 8h)

• Briefs on lemmatization, chunking and dependency linking

• Alignment
  – Sentence alignment
  – Word alignment; case study: reified alignment
  – Multilingual lexical repositories alignment; case study: wordnets (EuroWordNet, BalkaNet)

Page 4

Overview of the talks (III: approx. 18h)

• Applications
  – Exploiting annotations in one part of a parallel text and inducing similar annotations in the other part of the bitext; frame structures and grammar induction
  – Semantic validation of parallel semantic lexicons
  – Checking term translation consistency
  – Automatic extraction of bilingual dictionaries (words and MWEs)
  – Cross-lingual and monolingual Question Answering
  – Statistical Machine Translation systems

Page 5

BLARK: Tokenisation

• Identifying the lexical units in arbitrary texts:
  – copes with multiword-unit recognition (look_for, with_respect_to, used_to, back_and_forth, etc.), splitting multi-token concatenations (e.g. cliticization: damelo = da+me+lo; dându-mi-se = dându+mi+se), abbreviations, and the interpretation of non-alphabetic tokens ($100, 5/4/2005, etc.)
• The difficulty of this task is language dependent
• Various tools can do the job properly:
  – language-dependent tools (not very attractive, but possibly more efficient)
  – multilanguage tools, which need language-specific resources; an example is MtSeg (http://www.lpl.univ-Aix.fr/projects/multext/MtSeg)
• Named Entity Recognition (NER): “White House”, “Standard & Poor’s”, etc.

Page 6

An example: MtSeg

• There are three input formats: plain, normalized SGML, tabular. We will use the plain format.
• Consider “infile” containing the plain text (Ro):

  Într-un cuvânt, acesta este un exemplu.

• The segmenter can be invoked in three ways, depending on the input format; for plain text:

  mtseg -lang ro -input plain <infile >ofile

Page 7

Output format (black & red = full format; red = filtered format)

[CHUNK <DIV FROM="1">;
(PAR <P FROM="1">;
(SENT <S>;
1\1   TOK      Într
1\6   PROC     -un
1\9   TOK      cuvânt
1\15  PUNCT    ,
1\17  TOK      acesta
1\24  TOK      este
1\29  TOK      un
1\32  TOK      exemplu
1\39  PTERM_P  .
)SENT </S>;
)PAR </P>
]CHUNK </DIV>;

Page 8

BLARK: MSD/POS Tagging

• Given an input tokenized string:

  token1 token2 … tokenk

the process produces:

  token1 description1 token2 description2 … tokenk descriptionk

This process, which we would like to be as fast and correct as possible, is generically called tagging. If the description is done in terms of morpho-syntactic properties, it is called morpho-syntactic tagging (MS-tagging) or (less accurately) POS-tagging.

Page 9

Morpho-syntactic descriptors (MSDs, or simply MS-tags) and the set of all tags needed to describe the words in a lexicon = a tagset.

Tagsets may have various granularities: the finer the granularity, the larger the number of tags. Tagset design is a language-specific activity, but… to promote multilinguality, we need standardized ways of describing tagsets.

There are various initiatives towards standardization of the MSDs; one of the most influential is EAGLES (“Synopsis and comparison of morpho-syntactic phenomena encoded in lexicons and corpora. A common proposal and applications to European languages”; see http://www.ilc.pi.cnr.it/EAGLES/home.html).

Page 10

Part-of-Speech   Code   Attributes
Noun             N      10
Verb             V      15
Adjective        A      12
Pronoun          P      17
Determiner       D      10
Article          T      6
Adverb           R      6
Adposition       S      4
Conjunction      C      7
Numeral          M      12
Interjection     I      2
Residual         X      0
Abbreviation     Y      5
Particle         Q      3

Page 11

Verb description in MULTEXT-East

P   Attribute   Value              Code
1   Type        main               m
                auxiliary          a
                modal              o
                copula             c
                l.s. base          b
2   VForm       indicative         i
                subjunctive        s
                imperative         m
                conditional        c
                infinitive         n
                participle         p
                gerund             g
                supine             u
                l.s. transgressive t
                l.s. quotative     q
3   Tense       present            p
                imperfect          i
                future             f
                l.s. past          s
                l.s. pluperfect    l
                aorist             a
4   Person      first 1, second 2, third 3
5   Number      singular s, plural p, l.s. dual d
6   Gender      masculine m, feminine f, neuter n
7   Voice       active a, passive p
8   Negative    no n, yes y
9   Definite    no n, yes y, l.s. short_art s, l.s. full_art f, l.s. 1s2s 2
10  Clitic      no n, yes y
11  Case        n, g, d, a, l, i, x, 2, e, 4, 5
12  Animate     no n, yes y
13  Clitic_s    no n, yes y

Example: Vmmp2s--y = main verb, imperative, present, second person, singular, negative: “don’t sing!”

Page 12

Training

In practice, the learning program builds a data structure called a Language Model (LM), which is fed into an interpretation program (the tagger proper) that produces the tagged text.

[Diagram: Training corpus → Learning Program → Language Model = lexicon and guesser (unigrams) + HMM (n-gram database)]

THE GORY DETAILS

3-gram HMM = (S, P, A, B)
S = finite set of states (a state corresponds to a sequence of 2 tags)
P = initial state probabilities
A = transition matrix (probabilities of moving from s_ij to s_jk)
B = emission matrix (lexical probabilities, i.e. the lexicon)

A_ijk = probability of moving from state s_ij to state s_jk
B_ij(w) = probability of emitting w in state s_ij: P(w|s_ij)

The estimation of the LM parameters is usually done by a method called EM (expectation-maximization), or the Baum-Welch algorithm. It chooses the maximum-likelihood model parameters. These are self-organising techniques.
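Once the LM is estimated, decoding finds the most likely tag sequence. The dynamic programming can be sketched as follows; for brevity this is a first-order model (one tag per state) rather than the slide's 3-gram HMM, and the toy probabilities and tag names are invented.

```python
# Viterbi decoding for a (first-order, for brevity) HMM tagger; the slide's
# second-order model uses tag pairs as states, but the recursion is analogous.
# pi, A, B below are toy parameters, not a learnt language model.

def viterbi(words, tags, pi, A, B):
    """pi[t]: initial probs; A[s][t]: transition probs; B[t][w]: emission probs."""
    V = [{t: pi[t] * B[t].get(words[0], 1e-6) for t in tags}]   # column 0
    back = [{}]
    for w in words[1:]:
        V.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda s: V[-2][s] * A[s][t])
            V[-1][t] = V[-2][best_prev] * A[best_prev][t] * B[t].get(w, 1e-6)
            back[-1][t] = best_prev
    tag = max(tags, key=lambda t: V[-1][t])
    path = [tag]
    for bp in reversed(back[1:]):        # follow the back-pointers
        path.append(bp[path[-1]])
    return path[::-1]

tags = ["N", "V"]
pi = {"N": 0.7, "V": 0.3}
A = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
B = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"bark": 0.5, "dogs": 0.01}}
print(viterbi(["dogs", "bark"], tags, pi, A, B))   # ['N', 'V']
```

The small constant (1e-6) stands in for the guesser's treatment of unknown words.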

Page 13

Tagging

[Diagram: new text → Tokeniser (tokeniser database) → Classifier/Tagger (LM) → tagged text]

plain text:
…“Senatul este în vacanţă extraordinară”, spunea un parlamentar şi uimit, şi amuzat de ceea ce se întâmplă....

tokenized text:
… ” Senatul este în vacanţă extraordinară ” , spunea un parlamentar şi uimit , şi amuzat de ceea_ce se întâmplă .

tagged text:
…”/DBLQ Senatul/NSRY este/V3 în/S vacanţă/NSRN extraordinară/ASN ”/DBLQ ,/COMMA spunea/V3 un/TSR parlamentar/NSN şi/CR uimit/ASN ,/COMMA şi/CR amuzat/ASN de/S ceea_ce/RELR se/PXA întâmplă/V3 ./PERIOD

Page 14

Morpho-Syntactic Tagging: the Hidden Side of the Story

A. Tagset design issues and training data cleaning-up:
  – Cross-lingual standardization
  – Richness versus distributional adequacy and training data availability
  – Lexical tagsets vs. corpus tagsets
  – Reducing lexical tagset cardinality
    • Information-lossless reduction
    • Controlled information-loss reduction; recovering left-out information
  – Validation and correction of training data

B. Tagset mapping
  – Biased tagging
  – Direct tagging vs. cross tagging
  – Double cross tagging & improvement of the original annotations
  – Case study: Semcor

Page 15

A) Designing Tagsets: Lexical Tagsets

• Lexical tagsets: encodings of the morpho-lexical properties of the lexical stock;
• Word-form lexicons (usually very large) contain the paradigmatic families of the lemmas they cover; each member of a paradigmatic family is associated with its representative lemma and a comprehensive description of the morpho-lexical properties of the word-form in question;
• The need for standardization in the multilingual electronic world generated various encoding proposals;

Page 16

Lexical tagsets (II)

• The benefits of standardized morpho-lexical descriptions are manifold; one advantage is the availability of useful tools for corpus tagset generation;
• One of the most influential proposals is EAGLES/ISLE, further extended by Multext and Multext-East (http://www.ilc.cnr.it/EAGLES96/morphsyn/morphsyn.html; http://nl.ijs.si/ME/V3/msd/html/)
• Highly inflectional languages have large lexical tagsets (2000, 3000 or even 4000 tags are not unusual)

Page 17

Lexical tagsets (III)

• Fortunately, not every combination of feature values is possible; still, lexical tagsets, although maximally informative, are hardly adequate for a statistics-based approach to automatic POS-tagging (in spite of some attempts to do so). Supervised learning methods (the most accurate ones) would be hampered by training-data sparseness.
• Some fundamental observations:
  – Feature values are not independent of each other within a legal combination (lexical tag)
  – Some features (and all their values) are insensitive to morpho-syntactic distribution and as such cannot be reliably distinguished by distributional analysis methods
• A proper design process can reduce lexical tagsets to manageable corpus tagsets, much more appropriate for fast and accurate POS-tagging of large documents;

Page 18

Corpus tagsets

The quality of the tagging process depends on many factors:

The quantity and quality of the training data: insufficient discriminative examples for each considered class are very harmful to the reliability of the classification system; a training corpus containing errors induces wrong generalizations in the LM. If the learner is given wrong examples, it will learn bad classifications.

The adequacy of the tagset: a poorly designed tagset will destroy the performance of any tagger; on the contrary, a good tagset will allow even a simple-minded tagger to get reasonable results.

A too-small tagset (say, only parts of speech) will be too coarse and won’t allow the learner to abstract over various collocational restrictions. A too-large tagset will, most likely, generate a data-sparseness problem; there is a non-linear relation between the tagset size and the size of the needed training corpora.

Page 19

Good practices:

• leave out attributes/features that are irrelevant to word distribution (animate/inanimate), require more context or higher-level knowledge (transitive/intransitive), or are recoverable from the form of the word; this is particularly relevant for inflectional languages. The PennTB tagset omits a number of distinctions that are made in the LOB and Brown tagsets (on which it is based); M. Marcus et al., 1994, CL 19(2)
• morpho-lexical syncretism is also a source of noise (case system)
• clustering tags without information loss eliminates attributes without reducing the cardinality of any lexical item’s ambiguity class; Thorsten Brants, 1995 (ACL’95)
• “portmanteau” tags = tags assigned to one or a few words with an idiosyncratic ambiguity class (e.g. the English LOB and Brown tagsets use different tags for the auxiliaries “be”, “do” and “have”; Roger Garside et al., 1987)

Page 20

• observe frequent tagging errors: these signal that limited distributional analysis is insufficient for discriminating among the frequently confused tags; in such a case, merge tags. With the LOB and Brown tagsets, one of the most frequent errors is mistagging a word as a subordinating conjunction (CS) rather than a preposition and vice-versa: PennTB uses a single tag for both cases, leaving the resolution, if required, to other processes; E. Macklovitch, 1992 (4th ANLP).
• use multiple layers of tagging: tiered tagging; D. Tufiş, 1999 (TSD), 2000 (LREC), 2004 (LREC).

Building training corpora is expensive (hand-made), error-prone, boring and time-consuming. Fortunately, there are some effective ways to detect most of the unsystematic human errors!

Page 21

Biased evaluation of the taggers is the simplest way to catch such errors:
a) train the tagger on the training corpus
b) tag the same corpus based on the learnt language model
c) compare the hand-tagged and the machine-tagged versions: the non-systematic errors in the hand-tagged corpus are very likely to show up.

There are several other ways to spot human-made errors in training corpora (“gold standards”): Geoffrey Leech, David Elworthy (1994), Simone Teufel (1995), Jean-Pierre Chanod, Tapanainen (1995), Hans van Halteren (2000), Karel Oliva (2001), Yuji Matsumoto (2002), Dickinson, M., Meurers, W. D. (2003), etc.

We will discuss a recent approach (“Cross Tagging”, Pârvan & Tufiş (2006)) and its results on a well-known corpus: Semcor (Brown).
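Step (c), diffing the hand-tagged corpus against its biased re-tagging and ranking the disagreements for human inspection, can be sketched as follows; the data and tag names below are invented for illustration.

```python
# Sketch of the comparison step of biased evaluation: collect and rank
# the disagreements between hand tagging and biased machine tagging.
# The toy corpus below is invented.

from collections import Counter

def tagging_diffs(hand, machine):
    """hand, machine: parallel lists of (word, tag) pairs."""
    diffs = Counter()
    for (w, ht), (_, mt) in zip(hand, machine):
        if ht != mt:
            diffs[(w, ht, mt)] += 1
    return diffs.most_common()       # most frequent disagreements first

hand = [("spunea", "V3"), ("un", "TSR"), ("un", "NSN")]
machine = [("spunea", "V3"), ("un", "TSR"), ("un", "TSR")]
print(tagging_diffs(hand, machine))  # [(('un', 'NSN', 'TSR'), 1)]
```

The frequent disagreement types at the top of the list are the candidates most worth a human look, since unsystematic hand-tagging errors tend to surface there.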

Page 22

Tiered Tagging (TT)

Given:
  – a large lexical tagset (MSD), a properly reduced corpus tagset (TTAG) and a mapping between the two tagsets (MAP)
  – a training corpus (TC) annotated in terms of MSD

TT:
  – Training:
    • transform TC into TC’, where each tag from MSD is replaced by the corresponding tag from TTAG (using MAP)
    • build an LM from TC’
  – Tagging:
    • any new text T is tagged as T_TTAG by means of the LM constructed from TC’
    • transform T_TTAG into T_MSD, recovering (RECOVER) information contained in MSD but absent in TTAG

Critical elements: TTAG, MAP, RECOVER
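The tagging phase above can be walked through with toy data structures. Everything here (the tag names, the two-word lexicon, the recover function) is invented for illustration and is not the RACAI implementation; it shows only the plumbing of MAP, its inverse IMAP, and recovery.

```python
# Toy walk-through of the TT tagging phase: the reduced-tagset tagger emits
# TTAG tags, and RECOVER maps them back to MSDs via the inverse mapping and
# the word's ambiguity class. All names and tags are illustrative.

MAP = {"Ncfsrn": "NSR", "Ncfson": "NSO", "Vmis3s": "V3"}   # MSD -> TTAG
IMAP = {}                                                  # TTAG -> {MSD}
for msd, t in MAP.items():
    IMAP.setdefault(t, set()).add(msd)

LEXICON = {"vacanţă": {"Ncfsrn", "Ncfson"}, "spunea": {"Vmis3s"}}

def recover(word, ttag):
    # Intersect IMAP(ttag) with the word's MSD ambiguity class; a singleton
    # intersection means deterministic recovery (the BTTAG case).
    cands = IMAP[ttag] & LEXICON.get(word, IMAP[ttag])
    return next(iter(cands)) if len(cands) == 1 else sorted(cands)

def tiered_tag(ttag_annotated):          # output of the reduced-tagset tagger
    return [(w, recover(w, t)) for (w, t) in ttag_annotated]

print(tiered_tag([("spunea", "V3"), ("vacanţă", "NSR")]))
# [('spunea', 'Vmis3s'), ('vacanţă', 'Ncfsrn')]
```

When the intersection is not a singleton, the sketch returns all candidates; that is exactly where the rule-based or ME-based RECOVER procedures discussed later must decide.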

Page 23

TTAG Design

Properties of the TTAG and of the MAP mapping:

1. ∀ MSDi ∈ MSD: MAP(MSDi) = Ti ∈ TTAG
2. ∀ Ti ∈ TTAG: IMAP(Ti) = {MSDi1, MSDi2, …, MSDik} ⊆ MSD
3. ∀ Wk ∈ dictionary: AMB_MSD(Wk) = {MSDk1, MSDk2, …, MSDkn} and
   AMB_TTAG(Wk) = {MAP(MSDk1), MAP(MSDk2), …, MAP(MSDkn)} = {Tk1, Tk2, …, Tkm} (m ≤ n)

(EQ1) ∀ Wk, ∀ Ti ∈ AMB_TTAG(Wk):
      |IMAP(Ti) ∩ AMB_MSD(Wk)| = 1 in most cases (say 90%)
                                > 1 for the rest of the cases (say 10%)

This is the recoverability property of the TTAG.

If the intersection above is always equal to 1, then recoverability is 100% and deterministic; TTAG is called in this case a Baseline Tiered Tagging tagset (BTTAG) and MAP is an information-lossless transformation of MSD into TTAG; MAP takes care of feature-value redundancy elimination.

Page 24

TTAG design (cntd.)

• A baseline TTAG may significantly diminish the data-sparseness threat induced by a large MSD; it can be constructed automatically! But for a given MSD one can find many different BTTAGs, so one needs additional criteria to select one BTTAG (e.g. “minimal cardinality”, “best-performing induced LM”, etc.)
• In the general case of TT, the RECOVER procedure is non-deterministic (unless additional knowledge sources are used). But the reduction of the MSD cardinality is more significant than in the deterministic case (BTTAG).

RECOVER may be either a rule-based disambiguation procedure for the limited number of situations where the intersection in EQ1 is not 1 (needs human expertise, is language dependent, relies on a word-form lexicon; LREC 2000/2002) or an ME-based procedure (language independent, needs training data for high accuracy, can work with or without a word-form lexicon; ESSLLI 2006).

Page 25

Building an “optimal” BTTAG

• If the MSD contains a lot of redundant information, an “optimal” BTTAG could suffice for TT
• Deriving an “optimal” BTTAG is computationally intensive (it could take several days of running time on an average PC), but it is done only once; it pays off!

Page 26

An algorithm for BTTAG (LREC 2000)

1. extract all ambiguity classes from the MSD lexicon;
   e.g. MSD-ACj = (Ncfp-n Ncfson Vmis3s Vmm-2s Vmnp)
2. for each ambiguity class ACi:
     preserve only the intra-categorical ambiguities ICAi;
     e.g. ICAj1 = (Ncfp-n Ncfson), ICAj2 = (Vmis3s Vmm-2s Vmnp)
   endfor
3. for each ICAi repeat
     for each MSDij repeat
       for each attribute Ak in MSDij repeat
         if eliminating Ak does not reduce the cardinality of any of the ICAs
           then mark Ak as removable
         endif
       endfor
     endfor
   endfor
4. for all Ak marked as removable:
     compute the sets of attributes whose collective removal would still preserve the cardinality of all ICAs (the solution is not unique) and generate the BTTAGs
   endfor
5. choose the “optimal” BTTAG
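The removability test of step 3 can be sketched as follows. For illustration, MSDs are treated as raw positional strings and a single character position is dropped; the real algorithm reasons per category and per attribute, so this is only a sketch of the cardinality-preservation criterion.

```python
# Sketch of the step-3 test: an attribute position is removable if dropping
# it never merges two MSDs inside any intra-categorical ambiguity class (ICA).
# MSDs are positional strings (category letter + attribute codes), as in
# MULTEXT-East; the position-based simplification is illustrative.

def removable(attr_pos, icas):
    for ica in icas:
        kept = [m for m in ica if len(m) > attr_pos]
        reduced = {m[:attr_pos] + m[attr_pos + 1:] for m in kept}
        if len(reduced) < len(kept):   # two MSDs collapsed: information lost
            return False
    return True

icas = [{"Ncfp-n", "Ncfson"}, {"Vmis3s", "Vmm-2s", "Vmnp"}]
print(removable(2, icas))                              # True
print(removable(4, icas + [{"Vmis1s", "Vmis2s"}]))     # False: person merges
```

Step 4 then searches for maximal sets of jointly removable attributes, which is where the heavy computation mentioned on the previous slide comes from.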

Page 27

Step 5: the “optimal” BTTAG (LREC 2004):

for each BTTAG do
  set Best-Accuracy = prec(MSD); MinSize = size(MSD); MinSize4MSDprec = size(MSD)
  turn the MSD tags of the corpus into BTTAG tags (deterministic)
  use randomly 90% of the data for training and 10% for evaluation
  compute the Average-Accuracy of a ten-fold validation run
  if Average-Accuracy > Best-Accuracy then
    Best-Accuracy ← Average-Accuracy
  endif
  if size(BTTAG) < MinSize then
    MinSize ← size(BTTAG)
  endif
  if Average-Accuracy ≥ prec(MSD) and size(BTTAG) < MinSize4MSDprec then
    MinSize4MSDprec ← size(BTTAG)
  endif
endfor

Page 28

The “optimal” BTTAG (cntd.)

• Determining the “optimal” BTTAG:
  – is language independent and does not require human intervention (thus no language skills are needed);
  – requires a corpus annotated in terms of MSDs and word-form lexica, as delivered by the MULTEXT-EAST project for various languages;
• The multilingual corpus we used is “1984”;
• We made experiments for Czech, English, Estonian, Hungarian, Romanian and Slovene;

Page 29

Results and evaluation for various languages and BTTAGs
(no external lexicon; the tagger’s lexicon is learnt from the training corpus)

Language   MSD              Smallest BTTAG    Best-Prec BTTAG   BTTAG with Prec
           #msds   Prec     for (~) MSD Prec                    closest to MSD
                            #tags   Prec      #tags   Prec      #tags   Prec
RO          615    95.8       56    95.1       174    96.1        81    95.8
SI         2083    90.3      385    89.7       691    90.9       585    90.4
HU          618    94.4       44    94.7        84    95.0        44    94.7
EN          133    95.5       45    95.5        95    95.8        52    95.6
CZ         1428    89.0      299    89.0       735    90.2       319    89.2
ET          639    93.0      208    92.8       335    93.5       246    93.1

BTTAGs for the 6 languages are available at: http://www.racai.ro/Resources

Page 30

BTTAG is not the Hidden Tagset of Tiered Tagging

• The full-recoverability property of a BTTAG does not necessarily ensure a spectacular improvement of the language-model accuracy; it significantly improves its robustness over new data, because it may drastically reduce data sparseness;
• To take full advantage of the Tiered Tagging methodology, one should go from the BTTAG to a Hidden Tagging Tagset (HTTAG).

Page 31

HTTAG & RECOVER

• Converting MSD into HTTAG is an information-loss transformation! The MAP mapping is not deterministic; it will sometimes replace one HTTAG tag with two or (rarely) more MSD tags, in which case a RECOVER procedure is necessary.
• The initial RECOVER procedure in TT was a rule-based module; writing appropriate rules requires language skills; the local grammars focus on the ambiguities remaining in |IMAP(Ti) ∩ AMB_MSD(Wk)|, computed for each tagged word; it works only for words in the tagger’s lexicon (AMB_MSD(Wk) is read from the lexicon).

Page 32

RULE-based RECOVER

Example: the rule to distinguish between determiners and pronouns (Romanian):

Ps|Ds {Ds. : (-1 Ncy) || (-1 Af.y) || (-1 Mo.y) ||
             (-2 Af.n and -1 Ts) || (-2 Ncn and -1 Ts) ||
             (-2 Np and -1 Ts) || (-2 D.. and -1 Ts)
       Ps. : true}

The reading is as follows:

Choose the determiner interpretation when the previous word is tagged as a definite noun, a definite adjective, or a definite ordinal numeral, or when the previous two words are tagged as an indefinite adjective, an indefinite noun, a proper noun, or a determiner, followed by a possessive article.

Choose the pronoun interpretation if none of the above holds.
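Such a local grammar rule can be interpreted over the left context of an ambiguous token roughly as below. The regexes are my transliteration of the rule's tag patterns and assume the slide's conventions ('y' = definite, 'n' = indefinite, Ts = possessive article); the actual RACAI rule engine is not reproduced.

```python
# Sketch of interpreting the determiner/pronoun rule over a tag sequence;
# tag patterns are regex transliterations of the rule and are illustrative.

import re

def ds_or_ps(tags, i):
    """Decide the Ds (determiner) vs. Ps (pronoun) reading at position i."""
    prev1 = tags[i - 1] if i >= 1 else ""
    prev2 = tags[i - 2] if i >= 2 else ""
    definite_left = (re.fullmatch(r"Nc.*y", prev1)       # definite noun
                     or re.fullmatch(r"Af.*y", prev1)    # definite adjective
                     or re.fullmatch(r"Mo.*y", prev1))   # definite ordinal
    poss_art = prev1 == "Ts" and (re.fullmatch(r"Af.*n", prev2)
                                  or re.fullmatch(r"Nc.*n", prev2)
                                  or prev2.startswith("Np")
                                  or prev2.startswith("D"))
    return "Ds" if (definite_left or poss_art) else "Ps"

print(ds_or_ps(["Ncy"], 1))        # 'Ds' (after a definite noun)
print(ds_or_ps(["Ncn", "Ts"], 2))  # 'Ds' (indefinite noun + possessive article)
print(ds_or_ps(["V3"], 1))         # 'Ps' (default)
```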

Page 33

ME-based RECOVER (I)

• The main disadvantage of the rule-based RECOVER approach is that it works only for words present in the MSD lexicon (the one from which the hidden tagset was derived). The need to write the disambiguation rules might also be invoked as a drawback (but it is not really a big deal).
• The ME-based RECOVER algorithm uses contextual predicates relating feature values of the current word-form and feature values of the tags in context to decide on the current feature values. The tagset converter uses more contextual information than a usual tagger (it runs on the already-tagged corpus).
• If the word w is not in the MSD lexicon, the MSDk tag that the model predicts may not be among IMAP(Ti) = {MSD1 … MSDk-1}. In this case, the MSDk tag that the model predicted is taken into account in the K-breadth-first search. This way, the converter can correct tagging errors on unknown words.

Page 34

ME-based RECOVER (II)

• Builds on SharpEntropy (Richard Northedge, 2005, http://www.codeproject.com/csharp/sharpentropy.asp)
• It uses an a-priori mapping list containing the complete correspondences between the HTTAG and MSD tagsets, of the form:

  {Ti → IMAP(Ti) | i = 1..card(HTTAG)}

Based on this additional resource, the tagset converter generates, with high accuracy, MSD tags even for unknown or partially known words (i.e. either missing from the learnt lexicon or learnt with an incomplete ambiguity class).

Page 35

Contextual predicates: fi(tag, history)

Wordform-related features:
  – character length
  – prefix (1-2)
  – suffix (1-4)
  – upper case (all, initial)
  – is abbreviation
  – has underscore
  – has number
  – hyphen position (start, middle, end, none)

Context-related features:
  – previous MSD features
  – previous MSD 1-gram, 2-gram, 3-gram
  – previous Ctag 1-gram and 2-gram
  – next Ctag 1-gram and 2-gram
  – end-of-sentence punctuation mark

ME framework

p(a|b) = (1/Z(b)) · Π_{j=1..k} α_j^{f_j(a,b)}

Z(b) = Σ_{a'∈T} Π_{j=1..k} α_j^{f_j(a',b)}

where a ∈ T (T = the tagset), b = the history (a set of tags seen with the history), f_j(tag, history) are the contextual predicates and α_j their weights.

The ME-based tagging platform supports all the Tiered Tagging steps and complements our previous HMM-based platform.
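The extraction of the contextual predicates listed above can be sketched as follows. The feature names mirror the slide's lists, but the exact predicate inventory and representation of the RACAI converter are not reproduced here; everything below is illustrative.

```python
# Sketch of wordform- and context-related predicate extraction for an
# ME-based tagset converter. Feature names follow the lists above;
# the inventory is illustrative, not the actual platform's.

def predicates(word, prev_msds, prev_ctags, next_ctags):
    return {
        "char_length": len(word),
        "prefix1": word[:1], "prefix2": word[:2],
        "suffix1": word[-1:], "suffix4": word[-4:],
        "all_upper": word.isupper(),
        "init_upper": word[:1].isupper(),
        "has_underscore": "_" in word,
        "has_number": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
        "prev_msd_1gram": tuple(prev_msds[-1:]),
        "prev_msd_2gram": tuple(prev_msds[-2:]),
        "prev_ctag_2gram": tuple(prev_ctags[-2:]),
        "next_ctag_1gram": tuple(next_ctags[:1]),
    }

f = predicates("dându-mi-se", ["Vmg"], ["V3", "COMMA"], ["PXA"])
print(f["suffix4"], f["has_hyphen"])  # i-se True
```

Each such predicate becomes an f_j in the ME model, with its weight α_j estimated during training.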

Page 36

Accuracy of the ME-based RECOVER

Unknown word accuracy:                           95.20%
Total word accuracy without word-form lexicon:   98.66%
Total word accuracy with word-form lexicon:      99.04%

Page 37

Evaluation of the HMM tagging accuracy

Tagging method (HMM)                             MSD-tagger  Tiered tagging  Ctag-tagger
Unknown word accuracy without word-form lexicon  77.96%      78.57%          81.93%
Total word accuracy without word-form lexicon    95.79%      96.08%          96.78%
Total word accuracy with word-form lexicon       98.01%      98.42%          98.59%

The test data come from the same register as the training data and contain few unknown words. All MSDs in the test data were seen in the training data; these facts explain the small differences between the MSD-tagger and tiered tagging. Tiered tagging is more robust to register diversity and unknown words.

Page 38

Evaluation of the ME tagging accuracy

Tagging method (ME)                              MSD-tagger  Tiered tagging  Ctag-tagger
Unknown word accuracy without word-form lexicon  78.65%      78.76%          82.24%
Total word accuracy without word-form lexicon    96.56%      96.22%          96.81%
Total word accuracy with word-form lexicon       98.35%      98.58%          98.62%

All figures are slightly better for the ME tagger (by 0.1-0.3%), and the validity of tiered tagging is confirmed. We expect the differences between the accuracy of tiered tagging and that of direct MSD tagging to increase on texts from other registers.

Page 39

The TT methodology is useful not only when dealing with large tagsets, but also when the interest is only in simple grammatical distinctions (say, only part of speech). We made an experiment using a tagset with only 14 categories (the Multext-East grammar categories), and the average precision (ten-fold evaluation procedure) was very poor: 92.3%!! Using TT (with combined classifiers), the accuracy was always over 99%.

Evaluation for POS-only TT & CLAM
(Hidden tagset = 92 tags; Final tagset = 14 tags)

Text/LM           POS Accuracy   # errors
1994_20/CLAM      99.50          102
barnes_20/CLAM    99.61          79
ZiarNou_20/CLAM   99.66          69

Page 40

Remarks on TT

Models based on an HTTAG ensure not only an accuracy increase of the tagging process but also its robustness with respect to text-type variation. Experiments on various languages (e.g. Hungarian, German) and by various researchers (Oravecz, Dienes, Hinrichs, Trushkina, etc.) proved the validity of the approach. For Romanian, the average accuracy (10-fold validation on three different register corpora) is around 98.6%.

Page 41

Validation & Correction

This requires a coherent methodology for automatically identifying as many POS-annotation and lemmatization errors as possible; we relied on three main techniques for the automatic detection of potential errors:

1. when lemmatizing the corpus, we extracted all the triples <word-form, POS tag, lemma> that were unknown to the tagger (the tagger’s lexicon included a large word-form lexicon);
2. we checked the correctness of the POS annotation for closed-class lexical categories, a technique described by (Dickinson & Meurers, 2003);
3. we exploited the Biased Evaluation Conjecture hypothesis.

Page 42

Lemmatization and re-tokenization of unknown tokens

(I)a) If the current token is marked by the tagger as unknown, it is checked whether its POS annotation is NP (proper noun), in which case the lemma is considered being identical to the occurrence form of the token; sequences of NPs are joined together (separated by an underscore) with the new lemma similarly constructed.

b) Otherwise, the current token is processed by the probabilistic lemmatizer. The result, a triple <word-form tag lemma> (together with a backward reference), is saved in the file NotInTheLexicon.

The content of the NotInTheLexicon file was classified and analyzed in decreasing order of triple frequency. It revealed several error patterns, and for some of these errors the correction was easy:

Page 43:

Lemmatization and re-tokenization of unknown tokens (II)

1. more than 20,000 errors due to the wrong conversion of some diacritical characters into SGML entities, or to misspellings;
2. tokenization errors (misinterpretation of the period character; incomplete or incorrect specification of several frequent compounds);
3. incomplete specification of the cases where two or more consecutive numerals should have been taken together;

4. web and e-mail addresses (we added NNWEB and NNMAIL to our tag set, concatenating and retagging accordingly)

Page 44:

Using biased evaluation for better error identification (I)

Biased evaluation conjecture (Tufiş,1999): an accurately and consistently tagged corpus, re-tagged with the language model learnt from it (biased tagging), should reproduce the initial tagging in the vast majority of cases (usually more than 98%).

Reference Corpus: the version re-tokenized using the preceding techniques

Page 45:

1. trained the tagger on the newly acquired corpus, building a new language model;

2. retagged the new corpus with this new language model and compared the new tagging against the reference annotation;

3. found 96.8% identically tagged tokens and extracted the differences;

4. sorted the differences by their frequency;

5. the first 100 difference types (accounting on average for 8,000-10,000 difference occurrences) were examined in context, one by one, and the validation expert decided which of the tags (if any) was correct.
Obs.:
1. a time-consuming procedure, given the size of the corpus;
2. the differences are explained by inconsistent or partial corrections in the previous phases, and by the modification of the context for tokens neighbouring the corrected ones;

6. Correcting all the errors discovered in the analysis of the first 100 difference types ends the procedure.

Procedure description
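The comparison step of this procedure can be sketched as follows (an illustrative sketch; the function name, data layout and toy examples are ours, not from the slides): re-tag the reference corpus with the model learnt from it, compute the identity rate, and sort the difference types by frequency so a validator can inspect the most frequent ones first.

```python
from collections import Counter

def diff_report(reference, biased):
    """Compare a reference tagging with its biased re-tagging.

    Both inputs are lists of (token, tag) pairs over the same token
    sequence.  Returns the identity rate and the difference types
    (token, ref_tag, biased_tag) sorted by decreasing frequency.
    """
    assert len(reference) == len(biased)
    diffs = Counter()
    same = 0
    for (tok, ref_tag), (_, new_tag) in zip(reference, biased):
        if ref_tag == new_tag:
            same += 1
        else:
            diffs[(tok, ref_tag, new_tag)] += 1
    identity = same / len(reference)
    return identity, diffs.most_common()

# Toy example (invented tags), mirroring the 'vor' Vaux/Vm confusion:
ref = [("vor", "Vaux"), ("pleca", "Vinf"), ("la", "Sp"), ("mare", "Nc")]
new = [("vor", "Vm"), ("pleca", "Vinf"), ("la", "Sp"), ("mare", "Af")]
identity, diff_types = diff_report(ref, new)
print(identity)        # 0.5
print(diff_types[0])   # (('vor', 'Vaux', 'Vm'), 1)
```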

Page 46:

Using biased evaluation for better error identification (III)

• We repeated this procedure several times with a continuous decrease of the number of differences.

• After months of analysis/correction cycles, the number of differences stabilized around 1.2% of the entire corpus (identity approx. 98.8%).

• At present, there are approximately 85,000 differences (22,500 distinct) between the reference corpus and its biased tagging version.

Page 47:

Using biased evaluation for better error identification (IV)

• The analysis of the first 200 difference types (containing 168 different word-forms and accounting for 18,929 differences) outlined 96 distinct confusion pairs.

• For each confusion pair we constructed the set of words affected by the respective confusion.

[Obs.: The more words affected by a confusion pair, the less worrying it is: inherent statistical noise. When a confusion pair is specific to a small number of words, and these words are frequent, it might be useful to have a closer look at the respective confusion pair. We discovered a few words that were responsible for significantly more differences than the rest.]

Page 48:

Using biased evaluation for better error identification (V)

• Not surprisingly, the first four frequent words responsible for almost 3,500 tag confusions are closed-class words.

• The weak forms of the personal pronouns (le, ne, vă, îi) show the highest error rate: out of 11,345 occurrences, 3,368 had a wrong case label (accusative instead of dative and vice versa). [Obs.: The correct case assignment for these pronouns is very hard when only distributional properties are taken into account.]

• Another confusion, very difficult to avoid and relatively frequent in the RoCo-News corpus (210 occurrences), is vor tagged as Vm instead of Vaux when followed by words that could be interpreted either as infinitives (as they should have been) or as nouns (as they actually were in the respective contexts). Ex.: vízaN / vizáVinf, cífraN / cifráVinf, plácaN / placáVinf, recóltaN / recoltáVinf, sărbătóriN / sărbătoríVinf, táxaN / taxáVinf, adrésaN / adresáVinf, etc.

Page 49:

Conclusions

• We described a semi-automatic procedure by means of which we constructed a highly accurate annotated journalistic corpus for Romanian.

• Although it is language-, tagger- and tagset-dependent, this approach is easy to apply to a different setting: other languages or other linguistic registers.

• The type of analysis we described gives strong indications about which words might be unreliably tagged and does not require word-by-word inspection of the corpus. It does not ensure elimination of all existing errors, but the accuracy gain is substantial.

Page 50:

PART B: Tagset Mapping

• In TT, the two considered tagsets (the hidden and the lexical ones) are related by a subsumption relation. The mapping (MAP) between the tagsets is a function.

• A more interesting case is represented by two unrelated tagsets, such as the Penn tagset and the Multext-East tagset.

• Why should we bother?

Page 51:

Because, for instance,

given two corpora (gold standards), each tagged with its own tagset, we might want to:

• merge the two corpora and have the resulting corpus confidently tagged with either of the tagsets

• improve the gold standards’ tagging

Page 52:

Definitions

• Biased tagging (BT): tagging a corpus with a language model learnt from the same corpus

• Direct tagging (DT): tagging a corpus with a language model learnt from another corpus

• Cross-tagging (CT): tagging two corpora, each already tagged with its own tagset, with the other one’s tagset, using a mapping data structure (automatically built) for the two tagsets;

Page 53:

Cross-Tagging System Architecture

[Diagram: the Mapping System takes the gold standards AGS(X) and BGS(Y), builds the maps (MAPS) and the language models LM(X) and LM(Y); direct tagging: LM(Y)○A => ADT(Y) and LM(X)○B => BDT(X); applying the maps to the direct taggings yields the cross-taggings ACT(Y) and BCT(X). Legend: tagset mapping; improving the direct tagging.]

Page 54:

AGS(X) is replaced by BCT(X) and BGS(Y) is replaced by ACT(Y), and the CT procedure is repeated: Double Cross-Tagging (DCT). Compare AGS(X) : ADCT(X) and BGS(Y) : BDCT(Y).

[Diagram: the Mapping System is re-run on BCT(X) and ACT(Y), building MAPS' and new language models LM'(X) and LM'(Y); LM'(Y)○B => B'DT(Y) and LM'(X)○A => A'DT(X); applying MAPS' yields the double cross-taggings BDCT(Y) and ADCT(X), which are compared against the original AGS(X) and BGS(Y).]

Page 55:

Major Claims:

• ACT(Y) and BCT(X) are more accurately tagged than ADT(Y) and BDT(X), respectively.

• Comparing AGS(X) against ADCT(X) and BGS(Y) against BDCT(Y), one can accurately identify and fix errors in the Gold Standards, significantly improving their quality.

Page 56:

Partial Maps, Global Map (non-lexicalized)

Contingency table of tag co-occurrence counts (Nij = number of tokens tagged xi in one tagset and yj in the other; the last column and row hold the marginals, N the total):

        y1    y2   ...  ym
x1      N11   N12  ...  N1m   Nx1
x2      N21   N22  ...  N2m   Nx2
...     ...   ...  ...  ...   ...
xn      Nn1   Nn2  ...  Nnm   Nxn
        Ny1   Ny2  ...  Nym   N

PSet(xi) = {p(xi|yj) | yj ∈ Y}
Yx = {y ∈ Y | p(x|y) ∈ MSC(PSet(x))}
Xy = {x ∈ X | p(y|x) ∈ MSC(PSet(y))}
PMX = {(x, y) ∈ X×Y | y ∈ Yx}
PMY = {(x, y) ∈ X×Y | x ∈ Xy}
MA(X,Y) = PMAX ∪ PMAY
MB(X,Y) = PMBX ∪ PMBY
M(X,Y) = MA(X,Y) ∪ MB(X,Y)
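The construction of the partial maps can be sketched as follows. Since MSC is not spelled out on this slide, the sketch approximates it with a simple conditional-probability threshold (our assumption, not the original criterion), and combines the two directional maps by union:

```python
from collections import Counter

def partial_maps(pairs, threshold=0.3):
    """Sketch of the (non-lexicalized) partial-map construction.

    `pairs` is a list of (x_tag, y_tag) co-occurrences harvested from
    one corpus tagged with both tagsets.  Keeps (x, y) when p(x|y) or
    p(y|x) clears the threshold (a stand-in for the MSC selection)."""
    n_xy = Counter(pairs)
    n_x = Counter(x for x, _ in pairs)
    n_y = Counter(y for _, y in pairs)
    pm_x = {(x, y) for (x, y), n in n_xy.items() if n / n_y[y] >= threshold}
    pm_y = {(x, y) for (x, y), n in n_xy.items() if n / n_x[x] >= threshold}
    return pm_x | pm_y   # combined partial map for this corpus

# Invented co-occurrence data: MTE-like tags vs Penn-like tags.
pairs = [("Nc", "NN")] * 8 + [("Nc", "NNS")] * 2 + [("Vm", "VB")] * 5
print(sorted(partial_maps(pairs)))
```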

Page 57:

Token Maps (lexicalized maps)

• several possibly correct mappings are left out from the global map, either because of insufficient data or because of the idiosyncratic behaviour of some lexical items
• token maps are built only for the token types common to both corpora (except for hapax legomena)
• the global map is used only for tokens without a token map (hapax legomena, tokens occurring in one corpus but not in both)
• initially the token maps are built the same way the global map was built; therefore some tags associated with a token in one corpus or the other might remain unmapped in the token's map; these tags are subject to further processing in order to decide whether they are likely to be tagging errors (if not, a mapping will eventually be constructed).

Page 58:

Token Maps: Unmapped Tags

[Diagram: Mw, the token map of w, maps tags x1 ... x4 onto tags y1 ... y3; one tag remains unmapped.]

An unmapped tag yi for w may mean:

a) the contexts in GS(Y) where w has been tagged yi are dissimilar to all the contexts of w in GS(X) (the ambiguity class observed in GS(X) for w is incomplete: the tag will be mapped using the global map)

b) the tag yi is wrongly assigned to w in GS(Y) (the tag will be left unmapped)

If w occurs frequently in GS(X), explanation a) is unlikely; in such a case the decision is that yi is not a correct tag for w.

Otherwise, the decision needs more evidence: tag sympathies!

Page 59:

Token Maps: Tag Sympathies

• The sympathy between two tags from the same tagset is defined as the number of token types they commonly tag in a given corpus; a generalization of the notion of ambiguity class.

• Mw may be extended with the global mappings of an unmapped tag ym iff at least one tag xk mapped to ym in the global map is sympathetic enough to all the xi in Mw: xk extends the ambiguity class of w.

[Diagram: Mw, the token map of w, maps x1 ... x4 onto y1 ... y3; a global-map fragment maps x5 and x6 onto y3, the unmapped tag, licensing the extension.]
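Tag sympathy, as defined above, can be computed directly from a tagged corpus (an illustrative sketch; the function name and toy data are ours):

```python
from collections import defaultdict

def sympathy(corpus):
    """For each unordered pair of tags from the same tagset, count the
    number of token types they both tag in `corpus`, a list of
    (token, tag) pairs."""
    tags_of = defaultdict(set)
    for tok, tag in corpus:
        tags_of[tok].add(tag)
    sym = defaultdict(int)
    for tags in tags_of.values():
        ts = sorted(tags)
        for i in range(len(ts)):
            for j in range(i + 1, len(ts)):
                sym[(ts[i], ts[j])] += 1   # one more shared token type
    return dict(sym)

corpus = [("run", "NN"), ("run", "VB"), ("walk", "NN"), ("walk", "VB"),
          ("dog", "NN"), ("red", "JJ")]
print(sympathy(corpus))   # {('NN', 'VB'): 2}
```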

Page 60:

Improving the Direct Tagging: Error Identification

• let Mk be the map (either the token map Mwk or the global map M) used for token k
• if xk is mapped and <xk, yk> ∉ Mk, replace yk with the tags mapped to xk => A*(Y); similarly for B*(X)

A     AGS(X)   ADT(Y)   A*(Y)
w1
...
wk    xk       yk       yk1, ..., ykn
...
wN

Page 61:

Improving the Direct Tagging: Retagging

• build trigram HMM language models from AGS(X) and BGS(Y), with provision for unseen ambiguity classes
• a Viterbi-like algorithm that is coerced in the following way:
  – if A* says that token wk should be tagged by one of {yk1, ..., ykn}, then the only legal transition states from <yk-2, yk-1> are <yk-1, yk1>, ..., <yk-1, ykn>
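A minimal sketch of such coerced decoding, using a bigram model for brevity (the slides use trigrams) and invented toy probabilities; `allowed[k]` plays the role of the A*(Y) restriction for token k:

```python
import math

def constrained_viterbi(words, allowed, trans, emit):
    """Viterbi decoding where allowed[k] restricts the legal tags of
    word k; transitions may only pass through the permitted tags.
    Unseen events get a crude probability floor instead of real
    smoothing."""
    floor = 1e-6
    lp = lambda p: math.log(p if p > 0 else floor)
    best = {t: lp(emit.get((words[0], t), 0)) for t in allowed[0]}
    back = []
    for k in range(1, len(words)):
        new, ptr = {}, {}
        for t in allowed[k]:
            prev = max(best, key=lambda p: best[p] + lp(trans.get((p, t), 0)))
            new[t] = (best[prev] + lp(trans.get((prev, t), 0))
                      + lp(emit.get((words[k], t), 0)))
            ptr[t] = prev
        back.append(ptr)
        best = new
    path = [max(best, key=best.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

words = ["vor", "pleca"]
allowed = [{"Vaux"}, {"Vinf", "Nc"}]           # coerced by A*
trans = {("Vaux", "Vinf"): 0.9, ("Vaux", "Nc"): 0.1}
emit = {("vor", "Vaux"): 1.0, ("pleca", "Vinf"): 0.6, ("pleca", "Nc"): 0.4}
print(constrained_viterbi(words, allowed, trans, emit))   # ['Vaux', 'Vinf']
```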

Page 62:

Dealing with Unseen Events

Lexical probabilities:
• smooth the p(wk, xi) probabilities using the Simple Good-Turing estimation
• distribute the probability mass reserved for unseen events: the probability of a token w being tagged with a tag x not previously observed for w is taken to be directly proportional to the number of token types tagged x

Contextual probabilities:
• linear interpolation of unigram, bigram, and trigram probabilities
• lambda estimation: the greater the observed frequency of an (n-1)-gram and the fewer n-grams beginning with that (n-1)-gram, the more reliable such an n-gram is
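The lambda-estimation heuristic is in the spirit of TnT-style deleted interpolation; a sketch under that assumption (not the slides' exact formula): leave-one-out relative frequencies decide which context order each trigram supports, and the lambdas are the normalized vote counts.

```python
from collections import Counter

def estimate_lambdas(tag_seqs):
    """Deleted-interpolation style lambda estimation for a trigram tag
    model over a list of tag sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for seq in tag_seqs:
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
        tri.update(zip(seq, seq[1:], seq[2:]))
    n = sum(uni.values())
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), c in tri.items():
        # leave-one-out estimates for the three context orders
        f3 = (c - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0
        f2 = (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0
        f1 = (uni[t3] - 1) / (n - 1)
        if f3 >= f2 and f3 >= f1:
            l3 += c
        elif f2 >= f1:
            l2 += c
        else:
            l1 += c
    total = l1 + l2 + l3
    return l1 / total, l2 / total, l3 / total

seqs = [["DT", "NN", "VB", "DT", "NN"], ["DT", "JJ", "NN", "VB"]]
lambdas = estimate_lambdas(seqs)
print(lambdas)   # lambdas sum to 1
```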

Page 63:

Experiments: Resources

• 1984 corpus, about 120,000 tokens, automatically tagged, then human validated, MTE tagset

• SemCor corpus, about 780,000 tokens, tagged with the Brill tagger, Penn tagset

• the TnT tagger developed by T. Brants, as the reference tagger

Page 64:

Experiment 1: Cross-tagging two corpora

• Starting with 1984GS(MTE) and a comparable size fragment of SemCorGS(Penn), by direct tagging we got:

1984DT(Penn) & PSemCorDT(MTE)

• by cross-tagging the same texts we got:

1984CT(Penn) & PSemCorCT(MTE)

• randomly selected 100 differences between the direct- and cross-tagged versions of each corpus

• cross-tagging was correct in 69 out of 100 cases for the 1984 corpus and in 59 out of 100 cases for the PSemCor corpus => the LM built from SemCorGS(Penn) is less accurate than the LM built from 1984GS(MTE)

Page 65:

Experiment 2: Improving the tagged SemCor corpus

• used biased tagging as a means of evaluation:
  BTScore(SemCor, SemCorBT) = 93.81%
• double cross-tagged the entire SemCor corpus, getting a new version SemCor+, and then repeated the biased evaluation:
  BTScore(SemCor+, SemCor+BT) = 96.4%
• by analyzing the differences between SemCor+ and SemCor+BT, we noticed several tokenization inconsistencies (e.g. double quotes, formulae symbols), 6 tokens (am, are, is, was, were, and that) and 2 tag pairs (NN/NNS, NN/NNP) that caused most of the differences; we identified tagging patterns and adjusted the tagging to match those patterns, thus getting SemCor++:
  BTScore(SemCor++, SemCor++BT) = 98.08%

Page 66:

Experiment 2: Improvement Assessment

• 57,905 differences between SemCor and SemCor++, of 10,216 types
• the 200 most frequent types account for 25,136 differences
• the SemCor++ version is evaluated to be right in 84.44% of those 25,136 differences

The 10 most frequent difference types:

SemCor tag   token   SemCor++ tag   Freq.
VB           to      TO             1910
VB           be      VBN             674
RB           in      IN              655
VB           in      IN              646
RB           of      IN              478
VB           on      IN              381
VB           for     IN              334
VB           with    IN              324
RB           more    RBR             314
RB           the     DT              306

Page 67:

Comments on Cross-tagging

• the direct tagging of a corpus can be improved
• two tagsets can be compared from a distributional point of view
• errors in the training data may be spotted and corrected
• successively applying the method to different pairs of corpora tagged with different tagsets would permit building a much larger corpus, reliably tagged in parallel with all those tagsets

• the mapping system applies not only to POS tags, but to other types of tags as well

http://www.racai.ro/resources/SEMCOR++_Penn.zip

Page 68:

Is there any way to improve the accuracy of a tagging process once we have decided on a tagset, and we have one or more training corpora and one or more tagging systems (learner + tagger)?

YES! By using combined classifiers. Classifier = a trained tagger (in this context).

T: (Ci, W^n) → (W)^n
   w1 w2 ... wn ↦ w1/t^i_1 w2/t^i_2 ... wn/t^i_n

CC: ({C1 ... Ck}, W^n) → (W^k)^n
   w1 w2 ... wn ↦ w1/{t^1_1 ... t^k_1} w2/{t^1_2 ... t^k_2} ... wn/{t^1_n ... t^k_n}
              ↦ w1/t^a_1 w2/t^b_2 ... wn/t^z_n

Improving the accuracy of a tagging process, beyond what a single classifier can do.

Page 69:

Combining classifiers = defining a decision method for selecting one of the results provided by the various classifiers.

The basic assumption: different classifiers, even of comparable accuracy (cf. McNemar's test), do not make similar errors (cf. Brill and Wu's test).

Page 70:

a) McNemar's test for classifier comparison

n00 = no. of words mistagged by both A and B
n01 = no. of words mistagged by A but not by B
n10 = no. of words mistagged by B but not by A
n11 = no. of words correctly tagged by both A and B
n = n00 + n01 + n10 + n11

Under the null hypothesis, the two classifiers should have the same error rate, which means that n01 = n10 (each estimated as (n01 + n10)/2).

McNemar's test considers the statistic:

χ² = (|n01 - n10| - 1)² / (n01 + n10)

which is (approximately) distributed as χ² with 1 degree of freedom (the -1 in the numerator is a "continuity" correction term and accounts for the fact that the considered statistic is discrete while the χ² distribution is continuous).

If the null hypothesis is correct, then the probability that this statistic is greater than χ²(1, 0.95) = 3.84146 is less than 0.05.
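The test is straightforward to compute; 3.84146 is the 0.95 quantile of the chi-square distribution with 1 degree of freedom, and the counts below are invented:

```python
def mcnemar(n01, n10):
    """McNemar's test statistic with continuity correction.
    n01 / n10 = tokens mistagged by exactly one of the two classifiers.
    Reject the null hypothesis (equal error rates) at the 0.05 level
    when the statistic exceeds 3.84146 (chi-square, 1 df)."""
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return stat, stat > 3.84146

stat, significant = mcnemar(n01=120, n10=75)
print(round(stat, 3), significant)   # 9.928 True
```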

Page 71:

b) local error complementarity (Brill & Wu, 1998):

A, B = classifiers
COMP(A,B) = 100*(1 - Ncom/NA); in general COMP(A,B) ≠ COMP(B,A)
Ncom = number of common errors
NA = number of errors made by classifier A

If A made exactly the same errors as B, then COMP(A,B) = 0, therefore combining A with B would be useless.

COMP(A,B) shows, as a percentage, how frequently B is right when A is wrong.

If COMP(A,B) > COMP(B,A), then B is better than A with respect to the current text; the linguistic register of the analysed text is closer to that of the corpus used for building B.

If COMP(A,B) ≈ COMP(B,A), then the two classifiers are neutral to the analysed text (similar accuracy).

Testing the statistical hypotheses:
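COMP is equally simple to compute from the two classifiers' error sets (a sketch; the representation as sets of mistagged token positions and the toy data are ours):

```python
def comp(errors_a, errors_b):
    """Brill & Wu complementarity: COMP(A,B) = 100 * (1 - Ncom/NA),
    the percentage of A's errors on which B is right.  `errors_a` and
    `errors_b` are sets of token positions mistagged by each classifier."""
    common = len(errors_a & errors_b)
    return 100 * (1 - common / len(errors_a))

errors_a = {3, 7, 12, 20}
errors_b = {7, 20, 31}
print(comp(errors_a, errors_b))             # 50.0
print(round(comp(errors_b, errors_a), 2))   # 33.33
```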

Page 72:

The usual approach to combining classifiers: combine different taggers trained on the same data.

• H. van Halteren, J. Zavrel, W. Daelemans, 1998 (COLING-ACL)
  corpus: LOB
  taggers:
  - HMM 3-gram tagger (Steetskaamp: TOSCA tagger)
  - Brill's tagger (ftp://ftp.cs.jhu.edu/pub/brill/Programs/RULE_BASED_TAGGER_V.1.14.tar.Z)
  - MaxEnt tagger (ftp://ftp.cis.upenn.edu/pub/~adwait/jmx)
  - MBL (licensed from Walter Daelemans)

Tagger                           Accuracy
HMM                              96.08%
Brill                            96.46%
ME                               97.43%
MBL                              96.95%
Combined classifier (Prec-Rec)   97.84%

Page 73:

• E. Brill & J. Wu, 1998 (COLING-ACL)
  training corpus: WSJ
  taggers:
  - HMM 3-gram tagger (undocumented)
  - Brill's tagger (ftp://ftp.cs.jhu.edu/pub/brill/Programs/)
  - MaxEnt tagger (ftp://ftp.cis.upenn.edu/~adwait/jmx)

Tagger                Accuracy
HMM                   96.36%
Brill                 96.61%
ME                    96.83%
CC (pick-up tagger)   97.2%

• CLAM
Contrasting our approach with the previous approaches:

a) Tagger_i + Given_Training_Corpus = Classifier_i
   differences in tagging accuracy are motivated just by the software (tells you which program is better)

b) Given_Tagger + Training_Corpus_i = Classifier_i (CLAM)
   differences in tagging accuracy are motivated just by the data

Page 74:

TAG    RECALL   PRECISION   CONFUSION SET
A      96.02    92.51       R:7.48
AN     99.73    99.73       NN:0.26
APN    98.31    99.28       ASN:0.04 ASON:0.08 NPN:0.43 PI:0.01 V2:0.03 V3:0.09
APOY   100      94.23       NPOY:0.0576
APRY   100      97.97       NPN:0.5 NPRY:1.51
ASN    96.76    95.71       AN:0.01 M:0.05 NN:0.01 NSN:0.6 NSRN:0.13 PPPD:0.03 R:3.26 S:0.03 V3:0.03 VG:0.01 VP:0.1
ASON   98.7     92.37       APN:7.45 V3:0.17
ASOY   100      96.12       NSOY:3.87
ASRY   99.42    97.01       NSRY:2.99
ASVY   100      90.91       NSVY:9.09

Table 1: Adjectival Entries in a Credibility Profile

Page 75:

CLAM uses the CREDIBILITY combiner (the most satisfactory one):

argmax_k C_k(T_i), where C_k(T_i) = Pr_k(T_i) - Σ_j Pc_k(T_j|T_i)·δ(T_j)

where:
• C_k(T_i) is the credibility that the k-th classifier is right;
• Pr_k(T_i) is the probability of correct assignment of the tag T_i by the k-th classifier, as given by its credibility profile;
• Pc_k(T_j|T_i) is the probability of the k-th classifier wrongly assigning tag T_i when it should assign T_j, as given by its credibility profile;
• δ(T_j) = 1 if T_j is assigned by a competing classifier, 0 otherwise.

The selected tag is the one proposed by the classifier with the highest credibility figure.

The CREDIBILITY combiner is better than any of the individual classifiers.
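A sketch of the combiner; `prof_correct` and `prof_confuse` stand for the Pr_k and Pc_k entries of a credibility profile, and all the numbers below are invented for illustration:

```python
def credibility_combiner(proposals, prof_correct, prof_confuse):
    """Pick a tag from the classifiers' proposals by credibility.

    proposals[k]            tag proposed by classifier k
    prof_correct[k][t]      Pr_k(t): prob. that k is right on tag t
    prof_confuse[k][(tj,ti)] Pc_k(tj|ti): prob. that k assigned ti
                             when tj was correct
    """
    def credibility(k, ti):
        # subtract the confusion mass towards the competing tags
        penalty = sum(prof_confuse[k].get((tj, ti), 0.0)
                      for j, tj in enumerate(proposals)
                      if j != k and tj != ti)
        return prof_correct[k].get(ti, 0.0) - penalty

    scores = [credibility(k, t) for k, t in enumerate(proposals)]
    return proposals[scores.index(max(scores))]

proposals = ["NN", "VB"]                       # two classifiers disagree
prof_correct = [{"NN": 0.92}, {"VB": 0.95}]
prof_confuse = [{("VB", "NN"): 0.06}, {("NN", "VB"): 0.10}]
print(credibility_combiner(proposals, prof_correct, prof_confuse))  # NN
```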

Page 76:

4 training corpora ==> 4 basic classifiers

Evaluation

Unseen test data:

Text (Amb2)      #Sentences   Text "classification"                Occurrences
1994 (2,68)      3708         fiction (1984 follow-up)             20110
Barnes (2,73)    1250         scientific essay (not seen before)   20120
ziarNou (2,79)   4318         journalism (another journal)         20035

Additional Lexicon = all the words in the test data that are not in the basic lexicon.

Types of experiments:

l1) partial lexicon (Basic Lexicon)
l2) full lexicon (Basic Lexicon + Additional Lexicon => no unknown words)

A basic experiment: 24 runs = 3 text chunks × 4 classifiers × 2 experiment types.
Resampling the test texts and averaging (bagging):
Acc_unk = 98.6%, Acc_no-unk = 99.0%

Page 77:
Page 78:

References on Morpho-Syntactic Tagging

Felix Pîrvan, Dan Tufiş: Tagsets Mapping and Statistical Training Data Cleaning-up. In Proceedings of the 5th LREC Conference, Genoa, Italy, 22-28 May 2006, pp. 385-390

Dan Tufiş, Elena Irimia: RoCo_News - A Hand Validated Journalistic Corpus of Romanian. In Proceedings of the 5th LREC Conference, Genoa, Italy, 22-28 May, 2006, pp. 869-872

Dan Tufiş, Liviu Dragomirescu: Tiered Tagging Revisited. In Proceedings of the 4th LREC Conference, Lisbon, 2004, pp. 39-42

Dan Tufiş “High Accuracy Tagging with Large Tagsets”. In Proceedings of the ACCIDA’2000, Tunisia, 2000

Dan Tufiş “Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging”, Second International Conference on Language Resources and Evaluation, Athens May, 2000, pp. 1105-1112

Dan Tufiş, P. Dienes, C. Oravecz, T. Váradi: "Principled Hidden Tagset Design for Tiered Tagging of Hungarian". Second International Conference on Language Resources and Evaluation, Athens, May 2000, pp. 1421-1426

Dan Tufiş “Tiered Tagging and Combined Classifiers”. In F. Jelinek, E. Nöth (eds) Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence 1692, Springer, 1999, pp. 28-33

Tomaz Erjavec, Nancy Ide, Dan Tufiş: Development and Assessment of Common Lexical Specifications for Six Central and Eastern European Languages, ALLC-ACH '98 Conference, Debrecen, Hungary, July 5-10 1998, pp. 69-76.

Dan Tufiş, Nancy Ide, Tomaz Erjavec.: Standardised Specifications, Development and Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European Languages, First International Conference on Language Resources and Evaluation, Granada, 28-30 May, 1998, pp. 233-240

Dan Tufiş, Oliver Mason: Tagging Romanian texts: A Case Study for QTAG, a Language Independent probabilistic tagger, First International Conference on Language Resources and Evaluation, Granada, 28-30 May, 1998

Ludmila Dimitrova, Tomaz Erjavec, Nancy Ide, Heiki Kaalep, Vladimir Petkevic, Dan Tufiş: Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages, COLING, Montreal 1998

Tomaz Erjavec, Nancy Ide, Dan Tufiş, “Encoding and Parallel Alignment of Linguistic Corpora in Six Central and Eastern European Languages”. In Michael Levison (ed) Proceedings of the Joint ACH/ALL Conference Queen's University, Kingston, Ontario, June 1997

Page 79:

BLARK: Lemmatization

• Lemmatization after tagging is very easy, provided a wordform lexicon (à la Multext) is available (other solutions are possible as well).

• A wordform lexicon entry: <wordform> <lemma> <tag>
For most languages, in the vast majority of cases: <wordform> + <tag> → <lemma>
When this is not the case, one could use statistical evidence: <wordform> + <tag> → <lemma1>/P1, <lemma2>/P2 ...

Page 80:

Lemmatisation (cntd)

• The previous approach works for the words in the wordform lexicon; otherwise, for inflectional languages, a retrograde ending analysis could be one way to solve the problem.

[Diagram: an Ending Tree (ET): a trie over word endings read right to left; each node stores a set {(ending_i, tag_i)+}.]

wordform + tag_i => root + ending_i

The ET can be learnt from a core wordform lexicon; root → lemma
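A simplified ending-based lemmatizer can be sketched as follows; for brevity the trie is flattened into a dictionary of (ending, tag) → lemma-ending rules learnt from wordform/lemma pairs (class name, rule format and examples are ours, and regular root alternations, discussed next, are not handled):

```python
class EndingTree:
    """Flattened sketch of an Ending Tree: (ending, tag) rules learnt
    from a core wordform lexicon, applied by retrograde analysis."""

    def __init__(self):
        self.rules = {}                    # (ending, tag) -> lemma ending

    def learn(self, wordform, tag, lemma):
        # longest common prefix of wordform and lemma = candidate root
        i = 0
        while i < min(len(wordform), len(lemma)) and wordform[i] == lemma[i]:
            i += 1
        self.rules[(wordform[i:], tag)] = lemma[i:]

    def lemmatize(self, wordform, tag):
        # try the longest ending first, shrinking towards the right
        for cut in range(len(wordform)):
            rule = self.rules.get((wordform[cut:], tag))
            if rule is not None:
                return wordform[:cut] + rule
        return wordform                    # no rule: leave unchanged

et = EndingTree()
et.learn("mame", "Ncfpn", "mamă")          # Romanian: mothers -> mother
print(et.lemmatize("case", "Ncfpn"))       # casă (houses -> house)
```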

Page 81:

Lemmatisation (cntd)

root → lemma

If the root is common to all inflectional variants, then root + standard affix for the category = lemma.

Otherwise, use knowledge about the regular root alternation rules, e.g.:

{sub-category='common-noun' & gender='feminine' & <X> = >e|a|< & number='singular' & case='nom-acc'}
{[number='singular' & case='nom-acc'] → >e<}

Page 82:

BLARK: Chunking

• chunking: dividing a text into syntactically correlated sequences of words; an intermediate step towards full parsing, easier to achieve. For example, the sentence He reckons the current account deficit will narrow to only # 1.8 billion in September . can be divided as follows:

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .

• Text chunking is usually implemented by means of regular expressions defined over the tags of the tagged input text (language dependent), e.g. (G and N stand for shared agreement values):

NP -> Indef-art[gender=G; number=N] Det[gender=G; number=N]* Noun[gender=G; number=N]
VP -> Aux1 {Aux2} {Adv} Vpart|Vmain

• The goal of this task is to come up with machine learning methods which, after a training phase, can recognize the chunk segmentation of the test data as well as possible. The training data can be used for training the text chunker. Chunkers are evaluated with the F rate, which combines the precision and recall rates: F = 2*precision*recall / (precision+recall), computed over all types of chunks.

• This was the shared task of CoNLL-2000. Training and test data are available: the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking, namely sections 15-18 as training data (211,727 tokens) and section 20 as test data (47,377 tokens): http://www.cnts.ua.ac.be/conll2000/chunking/
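A toy regex chunker in the spirit of the rules above, operating over Penn-style tags (the grammar is a rough approximation of ours, not the CoNLL-2000 chunk definition):

```python
import re

# Toy chunk grammar over Penn-style tag sequences (ours, approximate).
GRAMMAR = [
    ("NP", r"((DT|PRP\$)\s)?(JJ\s)*((NNP|NNS|NN)\s)+|(PRP\s)"),
    ("VP", r"(MD\s)?((VBZ|VBP|VBD|VBN|VBG|VB)\s)+"),
    ("PP", r"(IN|TO)\s"),
]

def chunk(tagged):
    """Greedy left-to-right chunking of a list of (word, tag) pairs."""
    out, i = [], 0
    while i < len(tagged):
        rest = "".join(t + " " for _, t in tagged[i:])
        for label, pat in GRAMMAR:
            m = re.match(pat, rest)
            if m and m.group(0):
                n = m.group(0).count(" ")      # number of tags consumed
                words = " ".join(w for w, _ in tagged[i:i + n])
                out.append("[%s %s ]" % (label, words))
                i += n
                break
        else:                                  # no rule fired
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)

sent = [("He", "PRP"), ("reckons", "VBZ"), ("the", "DT"),
        ("deficit", "NN"), ("will", "MD"), ("narrow", "VB"),
        ("in", "IN"), ("September", "NNP"), (".", ".")]
print(chunk(sent))
```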

Page 83:

Dependency Linker

• A processing module that links words which are likely to depend on each other (syntactically or semantically)

• A slightly modified IBM-1 (EM algorithm), fed with the same source and target text is able to detect interesting dependencies among the words of a sentence (not necessarily adjacent).

• The collocation pairs which comply with a set of restrictions (Constrained Lexical Attraction Model, e.g. graph planarity and syntactic rules defined over the POS tagset) also identify interesting dependencies.

Page 84:

Dependency Linker

• There are two basic approaches to computing the statistical dependency of the tokens in one text:
  – compute the collocation scores among all the adjacent words and retain those pairs with a score beyond a given threshold;
  – use IBM model 1 (EM algorithm) to detect dependencies between the words of each sentence, ignoring the identity relation (a word being related to itself); since this approach does not impose word adjacency, it is able to detect noncontiguous related words.
• We used both approaches and combined the results.
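The first approach (collocation scoring of adjacent pairs) can be sketched with pointwise mutual information as the score; PMI is our choice for illustration, as the slides do not fix a particular collocation measure, and the toy sentences are invented:

```python
import math
from collections import Counter

def adjacent_collocations(sentences, threshold=1.0):
    """Score adjacent word pairs with PMI and keep those whose score
    clears the threshold.  `sentences` is a list of token lists."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        uni.update(s)
        bi.update(zip(s, s[1:]))
    n = sum(uni.values())
    nb = sum(bi.values())
    keep = {}
    for (a, b), c in bi.items():
        # PMI = log p(a,b) / (p(a) * p(b)), over adjacent positions
        pmi = math.log((c / nb) / ((uni[a] / n) * (uni[b] / n)))
        if pmi >= threshold:
            keep[(a, b)] = round(pmi, 2)
    return keep

sents = [["prim", "ministru", "a", "declarat"],
         ["un", "prim", "ministru"],
         ["a", "venit"], ["a", "plecat"]]
print(adjacent_collocations(sents))
```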

Page 85:

CLAM Linker

• Lexical Attraction Model (LAM)
  – Yuret, D. (1998). Discovery of Linguistic Relations Using Lexical Attraction. PhD thesis, Dept. of Computer Science and Electrical Engineering, MIT.
• Constrained LAM: a link is rejected if it fails any of the linking rules of a language, for instance number agreement
  – Radu Ion, Verginica Barbu Mititelu: Constrained Lexical Attraction Models. In Proc. Trends in Natural Language Processing, Special Track at the 19th FLAIRS, Melbourne Beach, Florida, May 11-13, 2006
  – Verginica Barbu Mititelu, Radu Ion: Cross-language Transfer of Syntactic Relations Using Parallel Corpora. In Proceedings of the Workshop on Cross-Language Knowledge Induction, Cluj-Napoca, July 2005, pp. 46-51

Page 86

IBM-1 and CLAM linkers

Page 87

References on tokenisation, lemmatization and dependency linking

• Dan Tufiş, Radu Ion, Elena Irimia, Verginica Barbu Mititelu, Alexandru Ceauşu, Dan Ştefănescu, Luigi Bozianu, Cătălin Mihăilă: Resources, Tools and Algorithms for the Semantic Web. In Proceedings of the “IST – Multidisciplinary approaches” Workshop, Romanian Academy, 2006, pp. 13-22, ISBN-10 973-0-04483-X, ISBN-13 978-973-0-04483-6

• Radu Ion “Methods for Semantic Disambiguation; Applications for English and Romanian“, PhD Thesis, Romanian Academy, May 2007

• Radu Ion and Verginica Barbu Mititelu. Constrained lexical attraction models. In Proceedings of the Nineteenth International Florida Artificial Intelligence Research Society Conference, pages 297–302, Menlo Park, Calif., USA, 2006. AAAI Press.

• Dan Tufiş, Radu Ion: Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure. In C. Burileanu, H.N. Teodorescu (eds): Proceedings of the 4th International Conference on Speech and Dialogue Systems (SPED2007), 10-12 May, 2007, Romanian Academy Publishing House, pp. 183-195

• Radu Ion, Alexandru Ceauşu, Dan Tufiş. Dependency-Based Phrase Alignment. In Proceedings of the 5th LREC Conference, Genoa, Italy, 22-28 May, 2006, pp. 1290-1293

• Dan Tufiş, Barbu Ana-Maria, "Extracting multilingual lexicons from parallel corpora". In Proceedings of the ACH-ALLC conference, New York, 12-17 June, 2001.

• Tomaz Erjavec, Dan Tufiş, Tamas Varadi “Developing TEI-Conformant Lexical Databases for CEE Languages”. In Proceedings of the 4th International Workshop on Computational Lexicography COMPLEX, Pécs, Hungary, 1999, pp. 205-210.

• Vladimir Petkevic, Dan Tufiş: The Multext-East lexicon, ALLC-ACH '98 Conference, Debrecen, Hungary, July 5-10 1998.

• Dan Tufiş, Octav Popescu. “A Unified Management And Processing of Word-Forms, Idioms And Analytical Compounds”, in Jurgen Kunze and Dorothy Reinman (eds.), Proceedings of the 5th European Conference of the Association for Computational Linguistics, Berlin, 1991

• Dan Tufiş. “It Would Be Much Easier If WENT Were GOED”, in Harry Somers, Mary McGee Wood (eds.), Proceedings of the 4th European Conference of the Association for Computational Linguistics, Manchester, 1989

Page 88

ELARK: Alignment Reification

An alignment: a set of mapping links between the entities of two sets of information representations (N to M). To reify is “To regard or treat (an abstraction) as if it has a concrete or material existence” (http://www.dictionary.com)


Page 89

Types of Aligned Entities

• Multilingual parallel texts (bitexts or multitexts) – implicit representation structures of the same underlying meaning:
– paragraph,
– sentence,
– phrase,
– word level;

• Multilingual thesauri (concept names, relation names, hierarchical structures);

• Ontologies (concept names and properties, relation names and properties, hierarchical structures);

• Text to semantic or conceptual structures (semantic dictionaries, thesauri, ontologies); WSD, document classification/clustering, IR, QA

• Multimedia data (text-speech, text-video, video-speech);

• …etc.

Page 90

Text alignment

• Regards the texts to be aligned as implicitly structured representations of the underlying meaning;

• Revealing the linguistic structure requires minimal preprocessing such as segmentation (paragraph, sentence, lexical token), POS-tagging, lemmatization (stemming), chunking (parsing), WSD; the more fine-grained the preprocessing, the more precise the alignment and the greater its benefits;

• If two texts encode the same meaning, text alignment enables automatic identification of the matching meaning blocks and computation of their common meaning; typical examples of texts encoding the same meaning are paraphrase corpora (for the monolingual case) and parallel corpora (for the multilingual case)

Page 91

Approaches to Alignment

• Symbolic: usually rule-based; rely on a priori knowledge about the entities to be aligned (knowledge intensive)

• Statistical: rely on large amounts of data; the ML approaches may require (human-prepared) training data from which a statistical alignment model is computed; based on this model, new unseen data can be aligned.

• Mixed: combine statistical and symbolic methods. Few approaches are purely symbolic or purely statistical.

Page 92

Why is parallel text alignment so important?

• Multilingual lexicography
• Multilingual terminology
• Annotation import in parallel corpora and automatic induction of annotation models
• Multilingual information retrieval
• Multilingual question-answering
• Machine translation

Page 93

Sentence alignment (I)

• The aligner is designed for alignment of large parallel corpora.

• Initial step: create a hand-validated reference sentence alignment of a sample of the corpus (about 1000 sentences per language pair). We used Moore’s aligner for this step and hand-validated the result. This is a cheap process, requiring no more than 20-30 minutes.

• This set of sentences, supplemented with 1000 wrong alignments (generated automatically from the correct ones), is used to train an SVM classifier. The features used for this initial model are: sentence length (both word- and character-based), the correlation factor for the number of non-lexical tokens contained in the candidate sentences, and the rank correlation for the words in the candidate sentences.
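The features listed above might be computed along these lines (a hedged sketch; the feature names and exact definitions are ours, not RACAI's):

```python
import re

def pair_features(src, tgt):
    """Toy versions of the sentence-pair features named on the slide;
    the names and exact definitions here are illustrative only."""
    def ratio(a, b):
        return min(a, b) / max(a, b) if max(a, b) else 1.0

    def non_lexical(tokens):
        # punctuation, numbers, symbols: anything that is not a word
        return [w for w in tokens if not re.fullmatch(r"[^\W\d_]+", w)]

    sw, tw = src.split(), tgt.split()
    return {
        "word_len_ratio": ratio(len(sw), len(tw)),
        "char_len_ratio": ratio(len(src), len(tgt)),
        "non_lex_ratio": ratio(len(non_lexical(sw)), len(non_lexical(tw))),
    }
```

Each GOOD/BAD training pair would be mapped to such a vector before being handed to the SVM.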

Page 94

Sentence Alignment (II)

The proper sentence alignment of the current bitext is a two-step procedure:

•The first step produces an initial alignment of the bitext and uses the initial SVM classifier to label the paired sentences as GOOD or BAD. The pairs classified as GOOD with a probability higher than 90% are used to retrain the classifier, this time with a highly discriminative feature: translation equivalence (enough data is assumed to be available for extracting reliable translation equivalence tables, aka IBM-1).

•The second phase of the process is an iterative procedure that generates a reduced set of sentence-pair candidates, which are evaluated by the new SVM classifier. The GOOD ones are retained in the final alignment. The set of GOOD links grows from one iteration to the next, and these links act as restrictors and/or supporters of the new candidate links

Page 95

Sentence Alignment (example)

English:
Done at Brussels , 14 August 2003 .
For the Commission
Franz Fischler
Member of the Commission
( 1 ) OJ L 198 , 22.7.1991 , p. 1 .
( 2 ) OJ L 85 , 2.4.2003 , p. 15 .
( 3 ) OJ L 169 , 10.7.2000 , p. 1 .
ANNEX
The Commission is currently...

Romanian:
Adoptat la Bruxelles , 14 august 2003 .
Pentru Comisie
Franz FISCHLER
Membru al Comisiei
ANEXĂ
Comisia studiază în prezent ...
1 JO L 198 , 22.7.1991 , p1 .
2 JO L 85 , 2.4.2003 , p. 15 .
3 JO L 169 , 10.7.2000 , p. 1 .

State-of-the-art sentence aligner: Moore, R. C. (2002). Fast and Accurate Sentence Alignment of Bilingual Corpora. In Machine Translation: From Research to Real Users (Proceedings, 5th Conference of the Association for Machine Translation in the Americas, Tiburon, California), Springer-Verlag, Heidelberg, Germany, pp. 135-144

Page 96

Evaluation of the Sentence Aligner

              Precision   Recall    F-measure
Moore En-It   100%        97.76%    98.87%
RACAI En-It   98.93%      98.99%    98.96%
Moore En-Fr   100%        98.62%    99.31%
RACAI En-Fr   99.46%      99.60%    99.53%
Moore En-Ro   99.80%      93.93%    96.78%
RACAI En-Ro   99.24%      99.04%    99.14%

~1000 sentences per language pair (4 files in AcqComm corpus)
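The F-measures in the evaluation are the harmonic mean of precision and recall, which is easy to re-derive:

```python
def f_measure(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

For instance, f_measure(1.0, 0.9776) ≈ 0.9887, matching the Moore En-It row.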

• Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Daniel Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th LREC Conference, Genoa, Italy, 2006

• Alexandru Ceauşu, Dan Ştefănescu, Dan Tufiş: Acquis Communautaire sentence alignment using Support Vector Machines. In Proceedings of the 5th LREC Conference, Genoa, Italy, 2006

Page 97

Word Alignment

• An automatic procedure that identifies, for each word in one part of a bitext, the word in the other part of the bitext that translates it;

• The correspondence between the words in the two parts of a bitext is not always 1-1.

• 1-1 links are easier to detect; N-M links require special treatment.

Page 98

Requirements

• Irrespective of the granularity, the alignment should be as accurate as possible;
– any alignment error in an early stage of bi-text processing would generate many other errors in the next steps (or, in the best case, would generate “silence” in the output)

• When the quantity of textual data is large (the interesting case), the alignment should be as fast as possible;

• With the current computing paradigms and technologies, a trade-off between speed and accuracy must be accepted;

Page 99

Word Alignment: an example

The patrols did not matter , however .
Şi totuşi , patrulele nu contau .

Page 100

Our Aligners: YAWA, MEBA & COWAL

• We developed various aligners at the sentence, chunk and word levels; they can work either independently (pipelined) or coupled in a feedback-based architecture, which provides much better alignment results at the price of a longer response time.

• Sentence alignment and monolingual chunking precede word alignment; however:
– if the number of aligned words in the current pair of sentences is below an expected number, it is highly probable that the sentence alignment is wrong; this is an extremely precise and valuable hint for automatic correction of the sentence alignment;
– if the links starting from the words in a chunk of one language point to words in different chunks in the other language, it is highly probable that either the chunking or the word alignment is locally wrong; this is an extremely precise and valuable hint for automatic correction of the chunking, the word alignment or both.

• Word Alignment is the most critical processing phase!

Page 101

Lexical alignment of parallel texts

• Method: reified lexical alignment (very high accuracy)
• Hub language – English: from the En-X and En-Y alignments we automatically derive the X-Y alignment
• The derived alignments are used to generate translation models for the X-Y pair of languages and, by bootstrapping (if necessary and linguistic expertise is available), to improve the initial X-Y alignment
• Reified Alignments – COWAL
– A bitext alignment is a set of lexical token pairs (links), each of them characterized by a weighted feature structure whose score should be higher than the acceptability threshold.

[feat1: val1, feat2: val2, . . ., featn: valn]

Score(Link) = Σ_{i=1..n} Coef_i · Score(Feat_i)
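The link score is a plain weighted sum, so it can be sketched directly (feature names, weights and threshold below are placeholders, not the tuned values):

```python
def link_score(feature_scores, weights):
    # Score(link) = sum_i coef_i * score(feat_i)
    return sum(weights[name] * score for name, score in feature_scores.items())

def license_link(feature_scores, weights, threshold):
    # a link enters the final alignment only above the threshold
    return link_score(feature_scores, weights) > threshold
```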

Page 102

Features characterizing a link

• A link <Token1 Token2> is characterized by a set of features, the values of which are real numbers in the [0,1] interval.

• context-independent features (CIF) refer to the tokens of the current link:
cognates, translation equivalents (TE), POS affinity, “obliqueness”, TE entropy
• context-dependent features (CDF) refer to the properties of the current link with respect to the rest of the links in a bi-text:
strong and/or weak locality, number of links crossed, collocations

• Based on the values of a link’s features we compute, for each possible link, a global reliability score which is used to license the link (or not) in the final result.

ACL2005 Word Alignment Shared Task (En-Ro):
1. COWAL (F=73.90%, AER=26.10%)
The current version (July 2007), evaluated against our corrected ACL2005 GS: COWAL (F=85.22%, AER=14.78%)

Page 103

Technical details (1)

Building a translation model (TM): estimating the probabilities of each possible translation pair from the training corpora and retaining only the most promising ones. The search space is a subset of {TS_1..m} × {TT_1..n}; each token TL_i is a <lemma_tag> pair.

Co-occurrence counts of a candidate pair (TS, TT) over the aligned sentence pairs (2×2 contingency table):

          TT     ¬TT
 TS       n11    n12    n1*
¬TS       n21    n22    n2*
          n*1    n*2    n**

Threshold < LL(TT, TS) = 2 · Σ_{i=1..2} Σ_{j=1..2} n_ij · log( (n_ij · n**) / (n_i* · n_*j) )

The search space is dramatically reduced by the sentence alignments, tagging and lemmatization, and by imposing a minimal LL threshold. For the “promising” translation pairs, we estimate their translation probabilities: P(TT|TS) & P(TS|TT)
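The LL filter over the 2×2 contingency table can be sketched as follows (a toy implementation of the standard log-likelihood association score, not RACAI's code):

```python
from math import log

def ll_score(n11, n12, n21, n22):
    # 2 * sum_ij n_ij * log(n_ij * n** / (n_i* * n_*j)) over a 2x2
    # co-occurrence table; zero-count cells contribute nothing.
    n = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    cells = ((n11, rows[0], cols[0]), (n12, rows[0], cols[1]),
             (n21, rows[1], cols[0]), (n22, rows[1], cols[1]))
    return 2 * sum(nij * log(nij * n / (r * c))
                   for nij, r, c in cells if nij > 0)
```

An independent pair scores near 0; strongly associated pairs clear the threshold easily.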

Page 104

Technical details (2)

• Multiple-step algorithm, controlled by several parameters (linearly interpolated):
– P1: Translation probabilities: P(lemma_L1i_tag_L1i | lemma_L2k_tag_L2k)
– P2: Fertility: P(n | lemma_L2k_tag_L2k) / collocation score
– P3: Translation equivalence entropy:
H(lemma_L2n_tag_L2n) = − Σ_j P(lemma_L1j_tag_L1j | lemma_L2n_tag_L2n) · log P(lemma_L1j_tag_L1j | lemma_L2n_tag_L2n)
– P4: Distortions: P(poz_L1i | poz_L2k)
– P5: POS affinity: P(POS_L1i | POS_L2k)
– P6: String similarity: SS(lemma_L1i, lemma_L2k)
– P7: Local and/or global localities

Page 105

Technical details (3)

AL(lemma_L1i | lemma_L2k) = argmax Σ_{i=1..n} λ_i · P_i ,  with Σ_{i=1..n} λ_i = 1

N-M alignments; the lambdas differ from one step to the next; additionally, there are minimal thresholds (also varying from step to step) for the value of each parameter (if below the threshold, its contribution is nil)

Page 106

Translation equivalents (TE)

• YAWA, TREQ-AL use competitive linking based on ll-scores, plus the Ro-En aligned wordnets

• MEBA uses GIZA++ generated candidates filtered with a log-likelihood threshold (11).

• The TE candidates search space is limited by lemmatization and POS meta-classes (e.g. meta-class 1 includes only N, V, Aj and Adv; meta-class 8 includes only proper names)

• For a pair of languages, translation equivalents are computed in both directions. The value of the TE feature of a candidate link <TOKEN1 TOKEN2> is 1/2 · (P_TR(TOKEN1, TOKEN2) + P_TR(TOKEN2, TOKEN1)).

Page 107

Entropy Score (ES)

• The entropy of a word's translation equivalents distribution proved to be an important hint for identifying highly reliable links (anchoring links)

• Skewed distributions are favored over uniform ones

• For a link <A B>, the link feature value is 0.5 · (ES(A) + ES(B))

ES(W) = 1 + (1 / log N) · Σ_{i=1..N} p(TR_i | W) · log p(TR_i | W)

where TR_1 . . . TR_N are the translation equivalents of W
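Under our reading of the formula (ES(W) normalized to [0, 1], high for skewed translation distributions, 0 for uniform ones), a sketch:

```python
from math import log

def entropy_score(probs):
    # ES(W) = 1 + (1/log N) * sum_i p_i * log p_i, where probs are the
    # translation-equivalent probabilities p(TR_i | W).
    n = len(probs)
    if n <= 1:
        return 1.0
    return 1 + sum(p * log(p) for p in probs if p > 0) / log(n)
```

entropy_score([0.25] * 4) is 0 (uniform), while a skewed distribution such as [0.97, 0.01, 0.01, 0.01] scores close to 1.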

Page 108

Cognates (COGN)

• The cognates feature assigns a string similarity score to the tokens of a candidate link

• We estimated the probability that a pair of orthographically similar words appearing in aligned sentences are cognates, for different string similarity thresholds. For the threshold 0.6 we didn’t find any exception. Therefore, the value of this feature is either 1 (if the similarity score is above the threshold) or 0 (otherwise).

• Before computing the string similarity score the words are normalized (duplicate letters are removed, diacritics are removed, some suffixes are discarded).
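The normalization and thresholding might look like this (a sketch: we use difflib.SequenceMatcher as a stand-in for the slide's unspecified similarity measure, and omit the suffix stripping):

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalise(word):
    # strip diacritics, then squeeze duplicate letters
    no_diac = "".join(c for c in unicodedata.normalize("NFD", word)
                      if unicodedata.category(c) != "Mn")
    return re.sub(r"(.)\1+", r"\1", no_diac.lower())

def cognate_feature(w1, w2, threshold=0.6):
    # 1 if the normalised similarity clears the threshold, else 0
    sim = SequenceMatcher(None, normalise(w1), normalise(w2)).ratio()
    return 1 if sim >= threshold else 0
```

For example, "commission" and Romanian "comisie" both normalise close to "comisi…" and clear the 0.6 threshold.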

Page 109

COGN

TS = α1 α2 . . . αk ;  TT = β1 β2 . . . βm

αi and βi are the matching characters; δ(αi) is the distance (in characters of TS) from the previous matching character, and δ(βi) is the distance (in characters of TT) from the previous matching character; q is the number of matching characters.

COGN(TS, TT) = (2q / (k + m)) · Π_{i=1..q} 1 / (1 + |δ(αi) − δ(βi)|)   if q ≥ 2
COGN(TS, TT) = 0   if q < 2

Page 110

Part-of-speech affinity (PA)

• An important clue in word alignment is the fact that translated words tend to keep their part-of-speech, and when they have different POSes, the difference is not arbitrary.

• The information was computed from a gold standard (the revised NAACL2003 data), in both directions (source-target and target-source).
– For a link <A,B>: PA = 0.5 · (P(cat(A)|cat(B)) + P(cat(B)|cat(A)))

Page 111

Collocation

• “A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things” (Manning & Schütze, 1999)

• “Collocations of a given word are statements of habitual or customary places of that word” (Firth 1957)

• "a recurrent combination of words that co-occur more often than expected by chance and that correspond to arbitrary word usages” (Smadja, 1993)

• “a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. " (Choueka 1988)

• An n-gram analysis program (such as Ted Pedersen’s) is extremely useful in discovering collocations; only adjacent words are considered. Out of the different available collocation tests, we found LL (with a threshold of 9) to work best

Page 112

Refining Collocation Extraction

• Considers not only adjacent words;

σ² = (1 / (n − 1)) · Σ_{i=1..n} (d_i − μ)²

where n is the total number of distances, d_i are the distances, μ is the mean and σ² is the variance.

If two words always appear together at the same distance, the variance is equal to 0. If the distribution of the distances is random (the case of those words which appear together by chance), then the variance has high values. Smadja (1990) shows that one can find collocations by looking for pairs of words for which the standard deviation (σ) of the distances is small.
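Smadja's distance-based filter can be sketched as follows (illustrative code; 1.5 is the standard-deviation threshold quoted later in the deck):

```python
from statistics import mean, stdev

def smadja_filter(distances, max_std=1.5):
    # Keep a word pair if the standard deviation of its co-occurrence
    # distances is small; also return the mean distance for later use.
    if len(distances) < 2:
        return False, None
    m, s = mean(distances), stdev(distances)   # stdev divides by n - 1
    return s < max_std, m
```

A pair always seen at the same distance (σ = 0) passes; a pair seen at random distances fails.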

Page 113

Refining collocation extraction

• We were interested in finding V-N (N-V), N-N and N-A (A-N) collocations. We should note that while V-N collocations usually characterize verb sub-categorization structures, the N-N and N-A ones are ordinarily terminological compounds. The method was applied to a tagged and lemmatized version of the Acquis Communautaire (AC) corpus. The size of the corpus is around 350 Mb.

• We computed the standard deviation for all V-N, N-N and N-A pairs (from the AC corpus) within a window of 11 non-functional words for French, German and Romanian. We considered as good all the pairs for which the standard deviation was smaller than 1.5 (a reasonable threshold according to the examples in Manning & Schütze (1999)) and kept them in a list along with their mean.

• This method allows us to find good candidates for collocations, but not good enough. We want to further filter out some of the pairs so that we keep only those composed of words which appear together more often than expected by chance. This can be done using Log-Likelihood. The idea behind the LL score is finding the hypothesis which better describes the data obtained by analyzing a text. The two hypotheses considered are:
– H0: P(w2|w1) = p = P(w2|¬w1) (null hypothesis - independence)
– H1: P(w2|w1) = p1 ≠ p2 = P(w2|¬w1) (non-independence hypothesis)

Page 114

Refining collocation extraction

• We computed the LL score for all the pairs obtained using Smadja’s method. In the computation for a certain (for example) V-N pair at distance d (the rounded mean of all distances between the words of this pair), we used only the V-N pairs of words with that same distance d. We kept in a final list the pairs for which the LL score was higher than 9. For this threshold the probability of error is less than 0.004

• If neither token of a candidate link has a relevant collocation score with the tokens in its neighborhood, the value of this feature is 0. Otherwise, the value is the maximum of the collocation probabilities of the link’s tokens. Competing links (starting or finishing in the same token) are licensed if and only if at least one of them has a non-null collocation score

Page 115

Refining collocation extraction

• Collocation analysis in a parallel corpus
– Large parallel corpus (Acq-Com)
– University Marc Bloch of Strasbourg, IMS Stuttgart University and RACAI independently extracted the collocations in Fr, Ge, Ro and En (hub).
– We identified the equivalent collocations in the four languages:
SURE-COLLOC_X = COLLOC_X ∩ TR_X(COLLOC_Y) (EQ1): member states, European Communities, international treaty, etc.
INT-COLLOC_Z = COLLOC_Z \ SURE-COLLOC_Z (EQ2):
• adversely affect <-> a aduce atingere [1]
• legal remedy <-> cale de atac [2]
• to make good the damage <-> a compensa daunele [3]

[1] A mot-à-mot translation would be "to bring a touch"
[2] A mot-à-mot translation would be "way to attack"
[3] A mot-à-mot translation would be "to compensate the damages"

Page 116

Localization

• This feature is relevant with or without chunking or dependency parsing modules. It accounts for the degree of cohesion of the links.
• When the chunking module is available, and the chunks are aligned via the linking of their respective heads, the links starting in one chunk should finish in the aligned chunk.
• When chunking information is not available, the link localization is judged against a window, the span of which depends on the aligned sentences’ length.

• Maximum localization (1) is reached when all the tokens in the source window are linked to all the tokens in the target window

Page 117

• Weak Locality

When chunking/dependency link information is not available, the link localization is judged against a window containing m links. The value of m depends on the aligned sentences’ length. The window is centered on the candidate link.

[diagram: source tokens s1 . . . sm linked to target tokens t1 . . . tm around the candidate link <s, t>]

LOC = (1/m) · Σ_{k=1..m} min(|s − s_k|, |t − t_k|) / max(|s − s_k|, |t − t_k|)
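Our reconstruction of the weak-locality score, as a sketch (token positions stand in for links; the candidate link itself is skipped):

```python
def weak_locality(candidate, window_links):
    # LOC = (1/m) * sum_k min(|s - s_k|, |t - t_k|) / max(|s - s_k|, |t - t_k|)
    s, t = candidate
    ratios = []
    for sk, tk in window_links:
        ds, dt = abs(s - sk), abs(t - tk)
        if max(ds, dt) == 0:
            continue  # the candidate link itself
        ratios.append(min(ds, dt) / max(ds, dt))
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Monotone parallel links in the window score 1.0; a link that crosses the candidate pulls the score down.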

• Combining classifiers

If multiple classifiers are comparable, and if they do not make similar errors, combining their classifications is always better than the individual classifications.

Page 118

Distortion / Relative position

• Each token in both sides of a bi-text is characterized by a position index, computed as the ratio between its relative position in the sentence and the length of the sentence. The absolute value of the difference between the tokens’ position indexes gives the link’s “obliqueness”:

OBL(SW_i, TW_j) = | i / length(Sent_S) − j / length(Sent_T) |

• The distortion feature of a link is its obliqueness: D(link) = OBL(SW_i, TW_j)
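Obliqueness is a one-liner; a sketch with 1-based token positions:

```python
def obliqueness(i, j, src_len, tgt_len):
    # OBL(SW_i, TW_j) = |i/length(Sent_S) - j/length(Sent_T)|:
    # 0 when the tokens sit at the same relative sentence position.
    return abs(i / src_len - j / tgt_len)
```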

Page 119

Crossed links

• The crossed links feature computes (for a window size depending on the categories of the candidates and the sentences lengths) the links that were crossed.

• The normalization factor (maximum number of crossable links) is empirically set, based on categories of the link’s tokens

CrossFeature = 1 − No. of crossed links / MAX(max_number(C_S, C_T), No. of crossed links)

Page 120

Heuristics for improving the alignment (1)

• The words unaligned in the previous step may get links via their aligned dependents (HLP: Head Linking Projection heuristic): if b is aligned to c and b is linked to a, link a to c, unless there exists d in the same chunk with c, linked or not to it, such that the POS category of d has a significant affinity with the category of a.

– Alignment of sequences of words surrounded by the aligned chunks
– Filtering out improbable links (e.g. links that cross many other links)

Page 121

Heuristics for improving the alignment (2)

• Unaligned chunks surrounded by aligned chunks get probable phrase alignment:

SL TL

Wsi ↔ Wtj

Wsk ↔ Wtm

Wsp Wsp+1…↔ Wtq Wtq+1…

Page 122

Dependency Links Alignment (I)

• If instead of taking lexical tokens as alignment units, one considers dependency links, COWAL produces dependency links alignment;

• from the links alignment => word alignment

Radu Ion, Alexandru Ceauşu and Dan Tufiş: Dependency-Based Phrase Alignment, in Proceedings of the LREC 2006, Genoa, Italy

Page 123

Dependency Links Alignment (II)

Page 124

Dependency chunks & Translation Model

• Regular expressions defined over the POS tags and dependency links

• Non-recursive chunks

• Chunk alignment based on their aligned constituents (one or more).

Page 125

Final Word Alignment

Page 126

YAWA

YAWA starts with all plausible links (licensed by the translation model). Then, using a competitive linking strategy, it retains the links that maximize the sentence translation equivalence score while minimizing the number of crossing links. This way, it generates only 1-1 alignments. N-M alignments are possible only when chunking and/or dependency linking are available; in that case, the unaligned words may get links via their heads’ links. Very good recall!

Page 127

MEBA

• Unlike YAWA, MEBA iterates several times over each pair of aligned sentences, at each iteration adding only the highest-scoring link. The links already established in previous iterations give support or create restrictions for the links to be added in a subsequent iteration. It generates N-M alignments (no competitive linking filtering). MEBA uses different weights and different significance thresholds for each feature and iteration step; they were set manually. Very good precision!

Score(Link) = Σ_{i=1..n} Coef_i · Score(Feat_i)

Page 128

Combining the Alignments

•The simplest method: just compute the union of YAWA and MEBA, remove the duplicates and eliminate impossible multiple links (i-j and i-0). The winner in the last case was heuristically determined by the properties of the language pair (En-Ro).

•The above method has a better recall, but precision deteriorates significantly unless the language-specific filtering is used. Solution? Try to remove as many bad links as possible from the union. We used an SVM binary classifier (Good/Bad) (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) trained on our version of the GS2003 (for positive examples) and on the differences between the basic alignments (YAWA, MEBA) and the GS2003 (for negative examples).

•The SVM classifier (LIBSVM; Fan et al., 2005) uses the default parameters: C-SVC classification (soft-margin classifier) and an RBF kernel (Radial Basis Function):

• Features used for training (10-fold cross-validation; about 7,000 good examples and 7,000 bad examples):

TE(S,T), TE(T,S), OBL(S,T), LOC(S,T), PA(S,T), PA(T,S)

The links labeled as incorrect were removed from the merged alignments.

RBF kernel: K(x, y) = e^(−γ‖x−y‖²)
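The RBF kernel can be computed directly; the gamma value below is an arbitrary illustration, not the value actually used by the classifier:

```python
# RBF kernel K(x, y) = exp(-gamma * ||x - y||^2), computed directly.
import math

def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel((1.0, 0.0), (1.0, 0.0)))  # identical points -> 1.0
print(rbf_kernel((1.0, 0.0), (0.0, 0.0)))  # farther apart -> smaller value
```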

Page 129:

COWAL
• An integrated platform that takes two parallel raw texts and produces their alignment:
– basic preprocessing modules: sentence aligner, tokenizers, lemmatizers, POS-taggers, dependency linkers, chunkers
– translation models builder
– two or more comparable word-aligners (YAWA, which superseded TREQ-AL; MEBA)
– alignment combiner
– an XML generator (XCES schema compliant)
– an alignment viewer & editor
– optional modules: bilingual lexical ontologies (Ro-En aligned wordnets)

Page 130:

En-Ro Word Alignment

Page 131:
Page 132:

Word Alignment Competitions
• There is a general interest in increasing the WA accuracy (mainly in the Statistical Machine Translation community). International competitions (very similar to TREC, CLEF, etc.) for evaluating the WA state of the art:
– 2003 NAACL in Edmonton, Canada
– 2005 ACL in Ann Arbor, Michigan, USA
• Although WA is far from perfect, its accuracy is rapidly improving!

Our word aligners, rated the best in the two competitions, progressed by almost 10% in two years:

• TREQ-AL, 2003: F-measure: 73.39% (highest F-measure for the Ro-En track)

• COWAL, 2005: F-measure: 82.52% (highest F-measure for the Ro-En track)

• The volume of the training data was approximately the same, and the texts were more difficult in 2005 => a real technological improvement

• In the next few years it is very likely we will see results above 90%!

Page 133:

ACL 2005 Word Alignment Competition

J. Martin, R. Mihalcea, T. Pedersen, Word Alignment for Languages with Scarce Resources, in Proceedings of the ACL Workshop on Building and Using Parallel Texts, June 2005, Ann Arbor, Michigan, Association for Computational Linguistics, pp. 65-74, http://www.aclweb.org/anthology/W/W05/W05-08

• 3 pairs of languages, with different quantities of training data:
– Inuktitut-English (1.6 Mw-3.4 Mw)
– Romanian-English (0.85 Mw-0.9 Mw)
– Hindi-English (0.07 Mw-0.06 Mw)

• Major differences in preparing the Gold Standard

Page 134:

Evaluation measures
• Let us consider an alignment A, in which each link is labeled either as S(ure) or P(ossible). If we have a Gold Standard (G) with S and P links, then A_S / G_S represents the subset of A / G containing only Sure links.
• As any Sure link is also a Possible link, A_P = A and G_P = G.

For T ∈ {S, P}:

P_T = |A_T ∩ G_T| / |A_T| ;  R_T = |A_T ∩ G_T| / |G_T| ;  F_T = 2 · P_T · R_T / (P_T + R_T)

AER = 1 − (|A_S ∩ G_S| + |A_P ∩ G_P|) / (|A_S| + |G_S|)   (Och and Ney, 2000)

If only S links are used, AER = 1 − F; if A = G, AER = 0.
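These measures can be computed directly from link sets. A minimal sketch (assuming a system that does not distinguish S/P links, so A_S = A_P = A), which also illustrates that adding P links to the gold standard lowers AER:

```python
# Precision/recall/F over a gold link set, and AER (Och and Ney, 2000)
# over Sure (g_s) and Possible (g_p) gold links, with g_s a subset of g_p.

def precision_recall_f(a, g):
    p = len(a & g) / len(a)
    r = len(a & g) / len(g)
    return p, r, 2 * p * r / (p + r)

def aer(a_s, a_p, g_s, g_p):
    return 1 - (len(a_s & g_s) + len(a_p & g_p)) / (len(a_s) + len(g_s))

a = {(1, 1), (2, 2), (3, 3)}      # system alignment (A_S = A_P = A here)
g_s = {(1, 1), (2, 2), (4, 4)}    # Sure gold links
print(aer(a, a, g_s, g_s))             # only S links: AER equals 1 - F
print(aer(a, a, g_s, g_s | {(3, 3)}))  # adding a P link lowers AER
```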

Page 135:

Comparing different pairs of languages against differently constructed GS
• The way AER is defined makes it difficult to compare alignment performance for pairs of languages whose GS were built with different strategies. If one pair of languages has only S links and the other has both S and P links, the results (in terms of AER) will always be better for the second pair. This is because adding P links to a GS always produces a better AER for the alignments judged against such a GS. The demonstration is very simple if one observes that in the definition of AER in (Mihalcea & Pedersen, 2003) A_P means both P links and S links (any sure link is also a possible link). Adding P links to the GS inexorably decreases AER!

Page 136:

Official Ranking
1. RACAI.COWAL (F=73.90%, AER=26.10%)
2. ISI.Run5.vocab.grow (F=73.45%, AER=26.55%)

but after correcting the obvious errors in the GS, actually:
COWAL (F=79.79%, AER=21.21%)

and considering our GS (a different tokenization):
COWAL (F=82.52%, AER=17.48%)

Page 137:

References on Sentence and Word Alignment

• Dan Tufiş, Radu Ion: Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure. In C. Burileanu, H. N. Teodorescu (eds): Proceedings of the 4th International Conference on Speech and Dialogue Systems (SPED2007), 2007, Romanian Academy Publishing House, pp. 183-195

• Alexandru Ceauşu, Dan Ştefănescu, Dan Tufiş: Acquis Communautaire Sentence Alignment Using Support Vector Machines. In Proceedings of the 5th LREC Conference, Genoa, Italy, 22-28 May, 2006, pp. 2134-2137

• Dan Tufiş, Radu Ion, Alexandru Ceauşu, Dan Ştefănescu: Improved Lexical Alignment by Combining Multiple Reified Alignments. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL2006), Trento, Italy, 3-7 April, 2006, pp. 153-160

• Dan Tufiş, Radu Ion, Alexandru Ceauşu, Dan Ştefănescu: Combined Aligners. In Proceedings of the ACL2005 Workshop on "Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond", June 2005, Ann Arbor, Michigan, Association for Computational Linguistics, pp. 107-110

• Dan Tufiş, Ana-Maria Barbu, Radu Ion: Extracting Multilingual Lexicons from Parallel Corpora. Computers and the Humanities, Volume 38, Issue 2, May 2004, pp. 163-189

• Dan Tufiş, Ana-Maria Barbu, Radu Ion: TREQ-AL: A Word-Alignment System with Limited Language Resources. In Proceedings of the NAACL 2003 Workshop on Building and Using Parallel Texts; Romanian-English Shared Task, Edmonton, Canada, 2003, pp. 36-39

• Dan Tufiş: A Cheap and Fast Way to Build Useful Translation Lexicons. In Proceedings of the 19th International Conference on Computational Linguistics (COLING2002), Taipei, 25-30 August, 2002, pp. 1030-1036

• Dan Tufiş, Ana-Maria Barbu: Revealing Translators' Knowledge: Statistical Methods in Constructing Practical Translation Lexicons for Language and Speech Processing. International Journal of Speech Technology, Kluwer Academic Publishers, no. 5, pp. 199-209, 2002

Page 138:

Lexical ontologies

• Princeton WordNet (PWN) = semantic lexicon for En
• PWN + SUMO + MILO + Domains = lexical ontology for En
• EuroWordNet = multilingual lexical ontology with PWN as ILI
• BalkaNet = multilingual lexical ontology with PWN as ILI
• Recently, SentiWordNet, a subjectivity-marked-up version of PWN 2.0

Page 139:

The BalkaNet project (2001-2004)

• An EU-funded project (IST-2000-29388) for the development of a core (approx. 8,000 synsets/language) multilingual semantic lexicon along the principles of EuroWordNet
• Languages concerned: Bulgarian, Czech, Greek, Romanian, Serbian, Turkish
• EuroWordNet-BalkaNet liaison: Piek Vossen (consultant) and MemoData (industrial partner)

Page 140:

BalkaNet

[Diagram: the six monolingual wordnets (BgWN, CzWN, GrWN, RoWN, SrWN, TrWN) connected to the ILI: 6 effective interlingual relations (Li -> ILI) and 30 virtual translation equivalence relations (Li -> Lj). ILI: PWN1.6 -> PWN1.7.1 -> PWN2.0 + BKN]

At the end of the project, more than 8,000 ILIs were implemented in all 6 languages (more than 336,000 virtual translation pairs of synsets, or almost 1 million word translation pairs). Most of the wordnets contained more than 18,000 synsets.

Page 141:

Main features of the BalkaNet wordnets

• Compatibility with the EuroWordNet wordnets
• Structured ILI = PWN + BKN
– BKN = Balkan-specific concepts
• Relations defined within a monolingual wordnet have precedence over the relations in the ILI (redundancy, but more expressive power)
• SUMO/MILO & DOMAINS labels, which are aligned with PWN2.0, are available in the monolingual wordnets
• Some monolingual wordnets (CZ, RO and BG) contain valency frames for a common set of verbal synsets
• A single interlingual relation (EQ-SYN); the rest may be emulated; less expressive power, but extremely efficient management of the multilingual displays of entries (VISDIC)

Page 142:

Design Principles
d1) ensuring as much as possible compatibility with the EuroWordNet approach (e.g. unstructured ILI based on Princeton WordNet) and maximisation of cross-lingual coverage
d2) synset structuring (relations) inside each wordnet (lots of redundancy, but much more powerful)
d3) keeping up with Princeton WordNet (PWN) developments
d4) ensuring conceptually dense wordnets
d5) defining a reusable methodology for data acquisition and validation (open for further development)
d6) linguistically motivated (reference language resources, with human experts actively involved in all decision making and validation)
d7) minimizing the development time and costs

Page 143:

Maximisation of the cross-lingual coverage (1)

• ILI = the set of PWN synsets (labeled by their offsets in the database) taken as interlingual concepts: (07766677-n; 02564241-v; 00933364-a; 00087007-b)
• The consortium selected a common set of ILI codes to be implemented for all languages; this selection took place in three steps:
– BCS1 (essentially the BC set of EuroWordNet): 1218 concepts
– BCS2: 3471 concepts
– BCS3: 3827 concepts

Page 144:

Maximisation of the cross-lingual coverage (2)

Selection criteria for BCS1,2,3 (8516 ILI codes):
– number of languages in EuroWordNet linked to an ILI code (imperative)
– conceptual density: once a concept was selected, all its ancestors (nouns and verbs), up to the top level, were also selected (imperative); adjectives were selected so that they would typically be related to nominal concepts in the selection (be_in_state)
– language-specific criteria: each team proposed a set of concepts of interest, and the maximum intersection set among these proposals became imperative

Page 145:

Synsets structuring (1)

• At the level of each individual wordnet
• Common set of relations (the semantic relations), as used in the PWN
• Language-specific relations (the lexical relations, such as derivative, usage_domain, region_domain)

Page 146:

Synsets structuring (2)

• Principle of hierarchy preservation:
If M1^L1 H+ M2^L1, M1^L1 = N1^L2 and M2^L1 = N2^L2, then N1^L2 H+ N2^L2.
Allows for importing taxonomic relations and checking interlingual alignments.
• When taxonomic relations were imported, they were hand-validated.

Page 147:

Keeping up with PWN developments

• When the project started, ILI was based on PWN1.5 (as EuroWordNet was).
• The BalkaNet ILI was updated following the new releases of PWN:
– PWN1.5 => PWN1.7.1
– PWN1.7.1 => PWN2.0
• As the automatic remapping is not always deterministic, the partners manually solved the remaining ambiguities in their wordnets.

Page 148:

Defining a reusable methodology for data acquisition and validation

• Each partner developed its own specific tools for acquisition and validation, with a commonly agreed set of functionalities.
• These tools were documented for a lay computer user.
• The language-specific tools differ mainly because of the set of language resources available to each partner; depending on available resources, each partner chose the appropriate balance between d6) and d7) (next issue).

Page 149:

Trading effort and development time for language centricity (1)

• This issue has been addressed differently by each partner, depending basically on the available manpower and language resources. For instance:
– if relevant (encoded) electronic dictionaries (bilingual dictionaries + explanatory dictionaries + synonym dictionaries + antonym dictionaries + etc.) were available, the development effort concentrated to a large extent on interlingual equivalence mappings. This approach allowed a more language-centric development (merge model).

Page 150:

Trading effort and development time for language centricity (2)

– if reliable dictionaries other than bilingual dictionaries (which every partner had) were not available (e.g. because of the reluctance of the copyright holders to release their data or to allow its use), a translation approach of the literals in the PWN was generally followed (approximately an expand model); additional efforts were necessary in this case to check the translated synsets as well as their language adequacy.

Page 151:

Validation methodologies

• Syntactic validation (wordnet well-formedness checking)
• Semantic validation (word sense alignment in parallel corpora)

Page 152:

Syntactic validations

Validation of syntactically well-formed wordnets:
– compliance with the DTD for the VISDIC editor
– no duplicate literals in the same synset
– no sense duplications (literal & sense number)
– valid set of semantic relations
– no dangling nodes (conceptual density)
– no loops
– valid synset identifiers
… and many others
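Two of the checks listed above (sense duplications and loops in the hypernymy graph) can be sketched over a toy wordnet representation; the data structures below are illustrative, not the actual BalkaNet format:

```python
# Toy well-formedness checks: duplicated (literal, sense-number) pairs
# across synsets, and loops in a child -> hypernym mapping.

def duplicate_senses(synsets):
    """synsets: synset id -> list of (literal, sense_number) pairs."""
    seen, dups = {}, []
    for sid, literals in synsets.items():
        for lit in literals:
            if lit in seen:
                dups.append((lit, seen[lit], sid))  # same sense, two synsets
            else:
                seen[lit] = sid
    return dups

def has_loop(hypernym):
    """hypernym: synset id -> parent synset id; walk up from every node."""
    for start in hypernym:
        node, visited = start, set()
        while node in hypernym:
            if node in visited:
                return True
            visited.add(node)
            node = hypernym[node]
    return False

synsets = {"s1": [("tufiş", 1)], "s2": [("tufiş", 1), ("desiş", 1)]}
print(duplicate_senses(synsets))            # ("tufiş", 1) is in s1 and s2
print(has_loop({"s1": "s2", "s2": "s1"}))   # s1 -> s2 -> s1 is a loop
```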

Page 153:

Consistency checking

Sense conflicts (a literal & sense-label in two or more synsets):
• easy to solve (obvious human errors in sense assignment)
• hard to solve (provide evidence for WordNet sense distinctions hard to make in other languages; hints for ILI soft clustering)

Page 154:

Cross-lingual validation of the ILI mapping

• A bilingual lexicon might say TR(w^L1) = w1^L2, w2^L2, … (not enough)

• A lexical alignment process might give you contextual translation information: the m-th word in language L1 (w_m^L1) is translated by the n-th word in language L2 (w_n^L2) (step 1)

TR-EQ(w_m^L1) = w_n^L2 (not enough, but better)

Page 155:

Cross-lingual validation of the ILI mapping

• A sense clustering procedure might give you info on similar senses of different occurrences of the same word:

Sense(Occ(w_i^L1, p), Occ(w_i^L1, q), …) = α (step 2)
Sense(Occ(w_j^L2, m), Occ(w_j^L2, n), …) = β (sense labeling)

• synset(w_i^L1) TR-EQV synset(w_j^L2) (step 3)

α, β are ILI codes (ideally α = β)

Page 156:

Cross-lingual validation of the ILI mapping (idealistic view)

Translation(W_i^L1) = W_j^L2  =>  ∃ Syn1^L1, Syn2^L2 such that W_i^L1 ∈ Syn1^L1, W_j^L2 ∈ Syn2^L2 and EQ-SYN(Syn1^L1) = EQ-SYN(Syn2^L2) = ILI_k

[Diagram: W_i^L1 in WN1 and W_j^L2 in WN2, linked by TR-EQ, with both EQ-SYN relations pointing to the same interlingual concept ILI_k]
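The idealistic check can be sketched directly: a translation pair confirms the ILI mapping if some synset of each word is EQ-SYN-linked to the same ILI code. The lexicons and ILI codes below are toy examples:

```python
# Toy cross-lingual check: do two translation-equivalent words share
# an ILI code through their synsets' EQ-SYN links?

def shared_ilis(w1, w2, synsets_of, eq_syn):
    """synsets_of: word -> synset ids; eq_syn: synset id -> ILI code."""
    ilis1 = {eq_syn[s] for s in synsets_of[w1]}
    ilis2 = {eq_syn[s] for s in synsets_of[w2]}
    return ilis1 & ilis2   # empty set = the mapping is suspect

synsets_of = {"jubileu": ["ro-1"], "jubilee": ["en-1"], "sos": ["ro-2"]}
eq_syn = {"ro-1": "ILI-100", "en-1": "ILI-100", "ro-2": "ILI-200"}
print(shared_ilis("jubileu", "jubilee", synsets_of, eq_syn))  # confirmed
print(shared_ilis("sos", "jubilee", synsets_of, eq_syn))      # mismatch
```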

Page 157:

Cross-lingual validation of the ILI mapping (more realistic view)

[Diagram: W_i^L1 in WN1 and W_j^Lk in WN2, linked by TR-EQ, with the two EQ-SYN relations pointing to different ILI concepts]

Page 158:

Monolingual wordnets construction

• Expand model (essentially based on translating the PWN synsets and importing the relations)
• Merge model (mapping independently built and already related synsets onto the best matching PWN synsets and relations)
• Combined model (*Expand + *Merge)
• Interlingual relation: EQ-SYN; the other types of interlingual relations are emulated via the non-lexicalized synsets.

Page 159:

Interlingual relations emulation

[Diagram: the English synsets jubilee:1 (50 years), silver jubilee:1 (25 years) and diamond jubilee:1 (60 years) and the Romanian synset jubileu:1, with their hypernymy links and EQ-SYN links to the ILI]

EQ-HAS-HYPO(jubilee:1, jubileu:1); EQ-HAS-HYPER(jubileu:1, jubilee:1)
EQ-NEAR-SYN(jubileu:1, silver jubilee:1); EQ-NEAR-SYN(jubileu:1, diamond jubilee:1)

Page 160:

RoWN

• Development strategy:
– Merge model for the synsets and lexical relations, and Expand model for the semantic relations
• Facilitated by the availability of resources and tools:
– EXPD – XML encoding of the reference explanatory dictionary for Romanian (includes, among others, sense definitions, grammatical information, expressions, examples, synonyms, etymology, derivation (for derivatives))
– SYND – XML encoding of the reference dictionary of synonyms in Romanian
– English-Romanian bilingual dictionary (automatically extracted, using WA technology, from very large parallel corpora)
– Various statistical tools for corpus processing and information extraction
– Lexicographer's tools (language independent, but dependent on the annotation schema (XCES, CONCEDE) of the language resources)

Page 161:

Example of an EXPD entry (CONCEDE schema)

<entry id="TUFIS_">
  <hw>TUFIŞ</hw>
  <stress>TUF`IŞ</stress>
  <alt>
    <brack><gram>nom_neut_sing_indef</gram><orth>tufiş</orth></brack>
    <brack><gram>nom_neut_pl_indef</gram><orth>tufişuri</orth></brack>
  </alt>
  <pos>substantiv</pos>
  <gen>neutru</gen>
  <struc>
    <alt>
      <def>Desiş de tufe sau de arbuşti</def>
      <def>mulţime de copaci tineri, stufoşi</def>
    </alt>
    <syn>tufăriş, tufărie</syn>
  </struc>
  <etym><m>tufă</m>+suf.<m>-iş</m></etym>
</entry>

Page 162:

XCES annotation in the parallel corpus

<tu id="Ozz20">
  <seg lang="en">
    <s id="Oen.1.1.4.9">
      <w lemma="the" ana="Dd">The</w>
      <w lemma="patrol" ana="Ncnp" sn="3" oc="Group" dom="military">patrols</w>
      <w lemma="do" ana="Vais">did</w>
      <w lemma="not" ana="Rmp" sn="1" oc="not" dom="factotum">not</w>
      <w lemma="matter" ana="Vmn" sn="1" oc="SubjAssesAttr" dom="factotum">matter</w><c>,</c>
      <w lemma="however" ana="Rmp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum">however</w><c>.</c>
    </s>
  </seg>
  <seg lang="ro">
    <s id="Oro.1.2.5.9">
      <w lemma="şi" ana="Crssp">Şi</w>
      <w lemma="totuşi" ana="Rgp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum">totuşi</w><c>,</c>
      <w lemma="patrulă" ana="Ncfpry" sn="1.1.x" oc="Group" dom="military">patrulele</w>
      <w lemma="nu" ana="Qz" sn="1.x" oc="not" dom="factotum">nu</w>
      <w lemma="conta" ana="Vmii3p" sn="2.x" oc="SubjAssesAttr" dom="factotum">contau</w><c>.</c>
    </s>
  </seg>
  …
</tu>
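An annotation like this can be processed with the standard library's XML parser; the trimmed fragment below is a toy version of the translation unit above, and the code collects the (lemma, sense number) pairs per language:

```python
# Parse a simplified XCES-like translation unit and collect
# (lemma, sense number) pairs per language segment.
import xml.etree.ElementTree as ET

fragment = """<tu id="Ozz20">
  <seg lang="en"><s><w lemma="patrol" ana="Ncnp" sn="3">patrols</w>
    <w lemma="matter" ana="Vmn" sn="1">matter</w></s></seg>
  <seg lang="ro"><s><w lemma="patrulă" ana="Ncfpry" sn="1.1.x">patrulele</w>
    <w lemma="conta" ana="Vmii3p" sn="2.x">contau</w></s></seg>
</tu>"""

tu = ET.fromstring(fragment)
senses = {seg.get("lang"): [(w.get("lemma"), w.get("sn"))
                            for w in seg.iter("w")]
          for seg in tu.findall("seg")}
print(senses["en"])  # [('patrol', '3'), ('matter', '1')]
```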

Page 163:

Selection criteria for the BalkaNet lexical stock

• All the partners implemented the synsets in BCS1, BCS2 and BCS3 (BalkaNet Common Synsets): 8516 synsets
– they were selected based on the same criteria as in EuroWordNet (base concepts, number of hyponyms, position in the PWN hierarchies, etc.)
• Partner-specific criteria for other synsets, BUT conceptually dense!

Page 164:

Hierarchy Preservation Principle

Notation:
• H – the hypernymy relation
• + – the Kleene operator (H+ meaning at least one H relation)
• M^L1 = N^L2 – synset M in L1 and synset N in L2 are mapped onto the same ILI concept

Page 165:

Hierarchy preservation principle

If M1^L1 H+ M2^L1, M1^L1 = N1^L2 and M2^L1 = N2^L2, then N1^L2 H+ N2^L2.

[Diagram: two parallel hierarchies, one in language L1 (Ma, Mb, M1, M2) and one in language L2 (Na, N1, N2), with the corresponding synsets mapped onto the same ILI concepts]
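The principle can be checked mechanically: for every hypernymy-related pair in L1 whose members are both ILI-mapped, the corresponding L2 synsets must also be hypernymy-related. The toy taxonomies below are illustrative (ketchup < sauce < ingredient in both languages):

```python
# Check the hierarchy preservation principle on toy taxonomies.
# hyper maps a synset to its direct hypernym (child -> parent).

def reachable(hyper, a, b):
    """Is b an ancestor of a via H+ (one or more hypernym steps)?"""
    node = a
    while node in hyper:
        node = hyper[node]
        if node == b:
            return True
    return False

def preserved(hyper1, hyper2, ili):
    """ili: L1 synset -> L2 synset mapped onto the same ILI concept."""
    for m1 in hyper1:
        for m2 in ili:
            if m1 in ili and reachable(hyper1, m1, m2):
                if not reachable(hyper2, ili[m1], ili[m2]):
                    return False   # L1 hierarchy not mirrored in L2
    return True

ro = {"ketchup_ro": "sos", "sos": "ingredient_ro"}
en = {"ketchup_en": "sauce", "sauce": "ingredient_en"}
ili = {"ketchup_ro": "ketchup_en", "sos": "sauce",
       "ingredient_ro": "ingredient_en"}
print(preserved(ro, en, ili))  # True: both hierarchies agree
```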

Page 166:

Examples of inconsistencies

[Diagram: the RO hierarchy around ingredient, condiment, sos and aromaizant (with mirodenie, muştar, dafin, maioneză, ketchup) mapped against the EN hierarchy around ingredient, flavorer, condiment, sauce and spice (with mustard, ketchup, mayonnaise)]

Page 167:
Page 168:
Page 169:
Page 170:

Page 171:

The first inconsistency

M_condiment^RO = name given to some (spicy) substances of mineral, vegetal, animal or synthetic origin which, added to some alimentary products, give them a specific, enjoyable taste or flavor (DEX)

M_condiment^EN = a powder or liquid, such as salt or ketchup, that you use to give a special taste to food (Longman)

[Diagram: the RO synsets condiment, sos, mirodenie and aromaizant against the EN synsets flavorer, condiment, sauce and spice]

Page 172:

The first inconsistency - solved

[Diagram: the same RO and EN synsets as on the previous slide, with the interlingual mapping of condiment corrected]

Page 173:

The second inconsistency

[Diagram: RO M_condiment > M_sos > M_ketchup against EN M_sauce > M_ketchup]

1 sense of ketchup. Sense 1:
catsup, ketchup, cetchup, tomato ketchup -- (thick spicy sauce made from tomatoes)
  => condiment -- (a preparation (a sauce or relish or spice) to enhance flavor or enjoyment: "mustard and ketchup are condiments")
  => flavorer, flavourer, flavoring, flavouring, seasoner, seasoning -- (something added to food primarily for the savor it imparts)
  => ingredient -- (food that is a component of a mixture in cooking)

Page 174:

The second inconsistency - solved

[Diagram: the same synsets as on the previous slide, with ketchup remapped so that the RO and EN hierarchies agree with the PWN hypernym chain above]

Page 175:

Successful overlapping

[Diagram: the corrected RO and EN hierarchies around ingredient, condiment/flavorer and sos/sauce, with ketchup, maioneză/mayonnaise, muştar/mustard, mirodenie/spice etc. now consistently aligned]

Page 176:

Valency frames (I)

An interesting experiment: the Cz partner gave us access to about 600 Cz verbal synsets associated with valency frames extracted from the Czech National Corpus.

Via the translation equivalence relations between the Ro-Cz wordnets, we imported the original valency frames and manually checked them for applicability in Romanian. About 84% of the imported valency frames were valid (sometimes with minor modifications)! Only 98 valency frames needed significant modifications. Very promising.

The semantic restrictions on the frame elements are wordnet-endogenous.

Page 177:

Valency frames (II)

The synset 20-02609765-v: (a_se_afla:3.1, a_se_găsi:9.1, a_fi:3.1), with the gloss: be located or situated somewhere; occupy a certain position

(nom*AG(fiinţă:1.1) | nom*PAT(obiect_fizic:1)) = prep-acc*LOC(loc:1)

fiinţă:1.1 =: a living thing that has (or can develop) the ability to act or function independently
obiect_fizic:1 =: a tangible and visible entity; an entity that can cast a shadow
loc:1 =: a point or extent in space

The reading is the following: any verb in the synset 20-02609765-v subcategorizes for two arguments:
• the first one, which usually precedes the verb, is an NP in the nominative case, with the semantic role of either AGent or PATient, depending on the category of the filler
• the second one, which usually follows the verb, is a PP in the accusative case, with the semantic role of LOCation (its filler must be loc:1 or a hyponym of it)
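Checking a filler against a wordnet-endogenous restriction amounts to a hyponymy test: the filler's synset must be the restriction synset or one of its hyponyms. The hypernym links below ("oraş:1", "grădină:1" under "loc:1") are toy data for illustration:

```python
# Is `concept` equal to `target` or one of its hyponyms?
# hypernym maps a synset to its direct hypernym (None at the top).

def satisfies(concept, target, hypernym):
    while concept is not None:
        if concept == target:
            return True
        concept = hypernym.get(concept)
    return False

hypernym = {"oraş:1": "loc:1", "grădină:1": "loc:1", "loc:1": None}
loc_restriction = "loc:1"
print(satisfies("oraş:1", loc_restriction, hypernym))      # valid LOC filler
print(satisfies("fiinţă:1.1", loc_restriction, hypernym))  # rejected
```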

Page 178:

Development tools (I)

• WnBuilder (Java)
– Graphical user interface putting together all the lexical resources (EXPD, SYND, PWN, RO-EN dictionary), based on which the lexicographer selects the best matching synsets and assigns sense numbers to the literals in the RO synset
– Allows for distributed and independent work by different lexicographers

Page 179:

Development tools (II)

• Distributed work makes room for inherent mapping errors; they are dealt with in a centralized way by means of WnCorrect.
• WnCorrect (Java)
– Graphical user interface putting together all the independently developed portions of the wordnet
– Allows for immediate spotting of most mapping errors (two or more synsets in the target wordnet mapped onto the same PWN synset; the same literal with the same sense number occurring in two or more synsets, etc.)

Page 180:
Page 181:
Page 182:
Page 183:

Development tools (III)

• Relations-Import (Perl)
– Allows for automatic import from PWN of the semantic relations and assists the user in defining lexical relations among the target synsets

• An alternative solution is VISDIC (Brno University); we use it as the standard multilingual viewer

Page 184
Page 185

Development tools (IV) WSDTool

• WSDTool is a Java application whose GUI allows the user to edit the semantic annotation of an XCES-compliant corpus.

• In editing mode, the resources involved (translation dictionaries and wordnets) can be validated and corrected in accordance with the corpus findings.

Page 186
Page 187

Some Quantitative Data about RoWordNet (October, 2007)

• Synsets: 43,302
• Relations: 57,178
• Literals: 67,270
• SUMO/MILO labels: 39,538 (1821 concepts)
• DOMAINS labels: 49,563 (165 domains)
• Sentiment-labeled synsets: 43,302
• Incorporation into Alexandria (Memodata), a multilingual reading-support system: http://www.memodata.com/2004/fr/Alexandria/
• Incorporation into MultiWordNet: http://multiwordnet.itc.it/online

Page 188

Some Direct Applications of These Technologies and Resources

• Word Alignment (translation model building)
– Best results in the shared task on aligning En-Ro parallel corpora (NAACL 2003, Edmonton, Canada; ACL 2005, Ann Arbor, USA)
• Word Sense Disambiguation
• Multilingual Thesauri Alignment
• Semantic annotation import (valency frames, frame semantics)
• Terminology consistency over translated texts
• Opinion mining
• Cross-lingual QA in open domains

Page 189

Word Alignment

• D. Tufiş, A.M. Barbu, R. Ion: „TREQ-AL: A word-alignment system with limited language resources”. In Proc. of the NAACL 2003 Workshop on Building and Using Parallel Texts: Romanian-English Shared Task, Edmonton, Canada, 2003, pp. 36-39

• D. Tufiş, R. Ion, A. Ceauşu, D. Ştefănescu: „Combined Aligners”. In Proc. of the ACL 2005 Workshop on “Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond”, Ann Arbor, Michigan, June 2005, Association for Computational Linguistics, pp. 107-110

• D. Tufiş, R. Ion, A. Ceauşu, D. Ştefănescu: „Improved Lexical Alignment by Combining Multiple Reified Alignments”. In Proc. of EACL 2006, Trento, 3-7 April 2006

The alignment tool has been incorporated into a complex processing platform which puts together most of the tools presented so far:

- sentence segmentation, tokenisation, POS tagging (monolingual)

- sentence alignment, word and phrase alignment, word sense disambiguation with multiple sense inventories: PWN, SUMO, Domains (bitexts)

Page 190

En-Ro Word Alignment

Page 191

Word Sense Disambiguation

• Setting: parallel texts
• Uses:
– Word Alignment (the major part of the WSD task in our setting)
– WordNet (as ILI) and the Xx-Wn (for an En-Xx bitext)
– SUMO, Domains
• Model:
– If <WL1 WL2> is a translation pair, then at least one sense of WL1 and one sense of WL2 must be closely conceptually related

Page 192

Parallel Corpora

• We made several experiments on three parallel corpora, all represented in the same format (Multext-East XCES-ANA). They are tokenized, POS-tagged, lemmatized and sentence-aligned.
– “1984”: corpus based on Orwell’s novel; contains 9 parallel translations (BG, CZ, ET, GR, HU, RO, SL, SR, TR) plus the EN original; the validation procedure is currently applied on all the bitexts pairing BalkaNet languages to English (BG-EN, CZ-EN, GR-EN, RO-EN, SR-EN and TR-EN) for the validation of the respective wordnets;
– “VAT”: corpus based on the Sixth VAT Directive (77/388/EEC); contains sentence-aligned variants in 3 languages (EN, NL, FR)
– NAACL parallel corpus (EN, RO)
• Similar work is ongoing with a much larger corpus (JRC-Acquis): 22 languages, more than 50,000,000 words per language; ever growing

Page 193

Parallel Corpus Representation

<tu id="Ozz.113">
  <seg lang="en">
    <s id="Oen.1.1.24.2">
      <w lemma="Winston" ana="Np">Winston</w>
      <w lemma="be" ana="Vais3s">was</w> ...
    </s>
  </seg>
  <seg lang="ro">
    <s id="Oro.1.2.23.2">
      <w lemma="Winston" ana="Np">Winston</w>
      <w lemma="fi" ana="Vmii3s">era</w> ...
    </s>
  </seg>
  <seg lang="cs">
    <s id="Ocs.1.1.24.2">
      <w lemma="Winston" ana="Np">Winston</w>
      <w lemma="se" ana="Px---d--ypn--n">si</w> ...
    </s>
  </seg>
  ...
</tu>
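A translation unit in this encoding can be loaded with a few lines of standard-library code. A minimal sketch (the element and attribute names are taken from the sample above; the `read_tu` helper and the shortened `SAMPLE` are mine):

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<tu id="Ozz.113">
  <seg lang="en"><s id="Oen.1.1.24.2">
    <w lemma="Winston" ana="Np">Winston</w>
    <w lemma="be" ana="Vais3s">was</w>
  </s></seg>
  <seg lang="ro"><s id="Oro.1.2.23.2">
    <w lemma="Winston" ana="Np">Winston</w>
    <w lemma="fi" ana="Vmii3s">era</w>
  </s></seg>
</tu>
"""

def read_tu(xml_text):
    """Map each language of a <tu> to its (form, lemma, tag) triples."""
    tu = ET.fromstring(xml_text)
    out = {}
    for seg in tu.findall("seg"):
        out[seg.get("lang")] = [(w.text, w.get("lemma"), w.get("ana"))
                                for w in seg.findall(".//w")]
    return out

print(read_tu(SAMPLE)["ro"])  # → [('Winston', 'Winston', 'Np'), ('era', 'fi', 'Vmii3s')]
```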

Page 194

Traditional WSD: monolingual data

• Various conceptual problems (continuum vs. discrete senses; how many senses a word has; the dependence of the right level of granularity on the intended application, etc.)
• A typical classification of WSD solutions:
– Unsupervised (with raw or pre-processed texts)
• The cheapest way; sense inventories are established ad hoc, depending on the application
– Supervised (hand-annotated WSD training data)
• Expensive, requires lots of hand-annotated data; the sense inventory is dictated by the one used by the human annotators of the training data
– Knowledge-based (based on MRDs, ontologies, domain taxonomies); can be built on either supervised or unsupervised methods
• Sense inventory biased toward the one used in the supporting knowledge source (KS)

Page 195

WSD based on parallel texts and aligned wordnets (I)

• A very different approach from the one used for monolingual data, with fewer conceptual problems

• The first approaches to using parallel corpora for WSD (1992-1995) were overshadowed by a lack of interest, owing to the scarcity, in those days, of sufficient parallel data (this argument no longer holds)

• Relatively cheap (provided aligned wordnets exist): a mixture of unsupervised and knowledge-based approaches

• Sense inventory biased by the interlingual index

• Given that these approaches can use any of the methods of monolingual WSD systems, but additionally have access to an invaluable knowledge source (the translators’ linguistic decisions), any decent implementation of a WSD system based on parallel texts and aligned wordnets is “doomed” to be more accurate than any competing monolingual system

Page 196

WSD based on parallel texts and aligned wordnets (II)

The advantage of using an interlingual index (PWN, the backbone of aligned wordnets) is manifold:
• PWN has been aligned with various other conceptual structures (SUMO, MILO, DOMAINS, various domain-specific ontologies, topic signatures, synset clusters, etc.), which become available at no cost to the other aligned wordnets
• Multiple sense inventories thus become available for any WSD application in the languages of a multilingual wordnet
• Multilingual R&D can benefit, in a much more controlled and principled way, from any advancement achieved in the interlingually connected languages
• …

Page 197

WSD based on parallel texts and aligned wordnets (III)

Our approach:

A mixture of unsupervised and KB approaches with multiple processing steps:

– word alignment in parallel corpora (COWAL) and translation equivalents extraction (translation model);
– sense labeling, based on the aligned wordnets (BalkaNet), using:
• the Princeton word sense inventory;
• the SUMO/MILO ontology;
• the IRST Domains classes;
• EXPD labels
(covers ~75% of the word occurrences in a corpus);
– sense clustering based on translation equivalents extracted from parallel corpora (takes care of the cases not covered by the previous step);
– generation of the WSD annotation in the parallel corpus

Page 198

WSD MAIN STEPS 1. Word Alignment & Filtering of the Translation Equivalents

• The word alignment system (COWAL) produces N-to-M, cross-POS alignment links with an average F-measure of more than 80% (F=82.52%, AER=17.48%).

• Only the translation links preserving the major POS (V, N, A, R) are retained. In that case F is better than 92% (F=92.04%, AER=7.96%).
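The POS-preserving filter can be sketched as follows. The link tuples and tags are illustrative, and the sketch assumes the first letter of an MSD tag (e.g. 'Ncns', 'Vmis') encodes the major POS:

```python
MAJOR_POS = {"N", "V", "A", "R"}  # noun, verb, adjective, adverb

def filter_links(links):
    """Keep only alignment links whose two words share the same major POS.

    `links` holds (src_word, src_tag, tgt_word, tgt_tag) tuples; the first
    letter of an MSD tag is taken to be the POS category."""
    return [l for l in links
            if l[1][0] == l[3][0] and l[1][0] in MAJOR_POS]

links = [("lamp", "Ncns", "lampă", "Ncfsrn"),
         ("of", "Sp", "de", "Sp"),            # preposition: dropped
         ("boiled", "Afp", "fiartă", "Vmp")]  # cross-POS link: dropped
print(filter_links(links))  # → [('lamp', 'Ncns', 'lampă', 'Ncfsrn')]
```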

Page 199

Word Sense Disambiguation

Page 200

WSD MAIN STEPS 2. Sense Labeling

• Aligned wordnets (lexical ontologies)
• Conceptual knowledge structuring (upper- and mid-level ontologies): SUMO/MILO
• Domain taxonomies (UDC, the librarian’s taxonomy): IRST-DOMAINS
• Explanatory Dictionary of Romanian (3 sense levels)
• Coverage heuristics: if one of the words in a translation pair is not a member of any synset in the respective language’s wordnet, but the other word is present in the aligned wordnet and, moreover, is monosemous, the first word gets the sense given by the monosemous word. If one of the languages is English, any other language can benefit from this heuristic (approx. 80% of the PWN literals are monosemous):
– Ex: hilarious <-> hilar => ENG20-01221243-a
burp <-> râgâi => ENG20-00003374-v
prospicience <-> clarviziune => ENG20-05469664-n
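The coverage heuristic above can be sketched in a few lines. The mini-wordnets are hypothetical stand-ins that just map literals to lists of interlingual sense ids:

```python
def monosemy_heuristic(w1, w2, wn1, wn2):
    """If one word of a translation pair is absent from its wordnet while
    the other is monosemous in its own wordnet, transfer that single sense.

    wn1, wn2: dicts mapping a literal to its list of interlingual sense ids."""
    s1, s2 = wn1.get(w1, []), wn2.get(w2, [])
    if not s1 and len(s2) == 1:
        return s2[0]
    if not s2 and len(s1) == 1:
        return s1[0]
    return None

# toy wordnets reproducing the 'burp <-> râgâi' example from the slide
pwn = {"burp": ["ENG20-00003374-v"]}
rown = {}  # 'râgâi' assumed not yet covered by the Romanian wordnet
print(monosemy_heuristic("râgâi", "burp", rown, pwn))  # → ENG20-00003374-v
```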

Page 201

Example (I): <lamp lampă>

PWN2.0 (lamp) = {03500372-n, 03500773-n}
RoWN (lampă) = {03500773-n, 03500872-n}
ILI = 03500773-n => <lamp(2) lampă(1)>
SUMO (03500773-n) = +Device
DOMAINS (03500773-n) = furniture

[diagram: the translation pair W(i)L1 <-> w(j)Lk (TR-EQV) is mapped through the EQ-SYN relations of WN1 and WNk onto a common ILI record, which carries the SUMO and Domains labels]

Page 202

Example (II): <lamp felinar>

PWN2.0 (lamp) = {03500372-n, 03500773-n}
RoWN (felinar) = {03505057-n}
δ (03500372-n, 03505057-n) = 0.5
δ (03500773-n, 03505057-n) = 0.125
ILI = 03500773-n => <lamp(1) felinar(1.1)>
SUMO (03500773-n) = IlluminationDevice
DOMAINS (03500773-n) = factotum

[diagram: the same TR-EQV / EQ-SYN configuration as in Example (I)]

Page 203

Example (III): <contain conţine>

PWN2.0 (contain) = {02435410-v, 02619614-v, 02551275-v, 02619957-v, 01096284-v, 02666612-v}
RoWN (conţine) = {02619614-v, 02554437-v, 02551275-v, 02619957-v, 02554853-v}

contain:2 (02619614-v) EQ-SYN conţine:1.1
verb_group contain:5 (02619957-v) EQ-SYN conţine:1.1.1

SUMO (02619614-v, 02619957-v) = contains
SUMO (02551275-v) = part
(disjointRelation contains part)

Page 204

Example (IV): <contain conţine>

SUMO definition for contains (the relation of spatial containment for two separable objects):

(subrelation contains partlyLocated)
(instance contains SpatialRelation)
(instance contains AsymmetricRelation)
(domain contains 1 SelfConnectedObject)
(domain contains 2 Object)
(subclass SelfConnectedObject Object)
(<=> (contains ?OBJ1 ?OBJ2)
     (exists (?HOLE)
             (and (hole ?HOLE ?OBJ1)
                  (properlyFills ?OBJ2 ?HOLE))))

ILI = 02619614-v => <contain(2) conţine(1.1)>

Page 205

WSD MAIN STEPS 3. Word Sense Clustering

[dendrogram of the clustered occurrences, with cluster cardinalities]

An agglomerative, hierarchical algorithm using a vector space model, Euclidean distance and cardinality-weighted computation of the centroid (the “center of weight” of a new class). The undecidable problem of how many classes gets hints from the work already done in steps 1 and 2.
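The clustering step can be sketched in a few lines. This is a naive O(n³) illustration of the idea (merge the closest pair, centroid weighted by cluster size), not the actual implementation:

```python
from math import dist  # Euclidean distance (Python >= 3.8)

def agglomerate(vectors, n_clusters):
    """Greedy agglomerative clustering over a vector-space model.
    At each step the two closest centroids are merged; the new centroid
    is the cardinality-weighted mean ('center of weight') of the two."""
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > n_clusters:
        # find the pair of clusters with the closest centroids
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: dist(clusters[p[0]][1], clusters[p[1]][1]))
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        wi, wj = len(mi), len(mj)
        merged = (mi + mj,
                  [(wi * x + wj * y) / (wi + wj) for x, y in zip(ci, cj)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(m) for m, _ in clusters]

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(agglomerate(points, 2)))  # → [[0, 1], [2, 3]]
```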

Page 206

Evaluation (I)

• “Lexical sample” and 1-tag annotation evaluation (with k-tag, the performance would essentially be that of the filtered word alignment, i.e. 92.04%)

• 216 English ambiguous words (at least two senses/POS) with 2081 occurrences in “1984” were semantically disambiguated by three experts in terms of the PWN2.0 sense inventory. The experts negotiated all the disagreement cases, resulting in the Gold Standard annotation (GS). This is the “lexical sample/lexical choice” evaluation type of SENSEVAL (much harder than “all words”, which includes monosemous words and homographs as well)

• For each PWN2.0 sense number, the GS was deterministically enriched with the SUMO category and the DOMAINS label.

• Thus, we had three sense inventories in the GS and could evaluate the system’s WSD accuracy in terms of each of them.

Page 207

Evaluation (II)

• Automatic WSD was performed in three ways:
– using only the RO-EN aligned BalkaNet wordnets (AWN)
– combining AWN with clustering (AWN+C)
– combining AWN+C with a simple heuristic (AWN+C+MFS)

• Out of the 2081 total occurrences, 61 (34 words) could not receive a sense tag, either because the target literal was wrongly aligned by the translation equivalents extractor module of the WSDtool, or because it was not translated or was wrongly translated by the human translator. In these cases we used MFS, a simple heuristic assigning the most frequent sense label (42 occurrences were correctly tagged).

Page 208

Evaluation (III)

WSD annotation Precision Recall F-measure

AWN 84.86% 62.22% 71.80%

AWN+C 80.29% 77.94% 79.10%

AWN+C+MFS 79.96% 79.96% 79.96%

WSD based on WN2.0+RoWN (PWN2.0 id)
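As a sanity check, the F-measure column of the table above is the harmonic mean of precision and recall:

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (values in percent)."""
    return 2 * p * r / (p + r)

# the AWN and AWN+C rows of the table above
print(round(f_measure(84.86, 62.22), 2),
      round(f_measure(80.29, 77.94), 2))  # → 71.8 79.1
```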

Page 209

Evaluation depends on the sense inventories

SENSE INVENTORY                      PRECISION   RECALL
PWN meanings (115424 categories)     79.96%      79.96%
SUMO/MILO (2066 categories)          86.54%      86.54%
IRST DOMAINS (163 categories)        93.46%      93.46%

PWN2.0+RoWN (AWN+C+MFS)

Page 210

4. WSD annotation in the parallel corpus

<tu id="Ozz.1">
  <seg lang="en">...</seg>
  <seg lang="ro">...</seg>
  ...
</tu>

Page 211

Thesauri Alignment

• Eurovoc (the multilingual thesaurus used for indexing the Acquis Communautaire corpus)
– EN version: the reference
– RO version: partial, unmapped (about 600 terms missing, some problematic translations)

• Alignment (translation equivalents, lemma-based, have been extracted from the Acquis Communautaire):
a) Full topological matching => translation equivalence checking (editable)
b) Partial topological matching => select the identity & translation equivalence suggestions (semi-automatic: the human selects from the system’s suggestions and edits the translations)

Dan Ştefănescu, Dan Tufiş: “Aligning Multilingual Thesauri”. In Proceedings of LREC 2006, Genoa, Italy

Page 212

Topological Alignment (full)

Page 213

Topological Alignment (partial)

[diagram: partial matching of RO term nodes RO1, RO2, RO3, RO5 against the EN hierarchy]

Page 214

Term Translation Discovery and Consistency Checking in Parallel Corpora: Background

• FF-POIROT (IST-2001-38248)
– http://www.starlab.vub.be/research/projects/poirot

• Consistent multilingual lexicalization of ontological concepts and relations, ensuring a common understanding of the legal stipulations

• The domain area is VAT: the VAT 6th Directive of the EEC (77/388/EEC of 17th May 1977)

• Different cross-country interpretations of the VAT directive favour fiscal fraud

Page 215

A sample of XCES-Ana aligned encoding:

TITLE I – INTRODUCTORY PROVISIONS
Hoofdstuk I : Inleidende bepalingen
Titre I er : Dispositions introductives

<tu id="Ozz.21">
  <seg lang="en"><s id="Oen.21"><w lemma="title" ana="Vm">TITLE</w> <w lemma="I" ana="Pp">I</w> <c>-</c> <w lemma="introductory" ana="Adj">INTRODUCTORY</w> <w lemma="provision" ana="Nc">PROVISIONS</w> </s></seg>
  <seg lang="nl"><s id="Onl.21"><w lemma="hoofd#stuk" ana="Nc">Hoofdstuk</w> <w lemma="i" ana="M">I</w> <c>:</c> <w lemma="inleiden" ana="A">Inleidende</w> <w lemma="bepaling" ana="Nc">bepalingen</w> </s></seg>
  <seg lang="fr"><s id="Ofr.21"><w lemma="titre" ana="Nc_sg">Titre</w> <w lemma="i" ana="M">I</w> <w lemma="er" ana="Nc_sg">er</w> <c>:</c> <w lemma="disposition" ana="Nc_pl">Dispositions</w> <w lemma="introductif" ana="Adj_pl">introductives</w> </s></seg>
</tu>

Page 216

VAT Corpus Overview

LANGUAGE EN FR NL

No. of occurrences 41722 45458 40594

No. of word forms 3473 3961 3976

No. of lemmas 2641 2755 3165

Additional resource: a list of EN terms, manually extracted by an expert in VAT legislation from the English variant of the VAT directive; 1043 (inflected) forms; after lemmatization and duplicate removal, only 900 terms remained.

Page 217

[screenshot callouts: the POS of the French word membres (common noun, plural) and its English translation equivalent (member); the English and French sentences of the Ozz.2 translation unit; English words and their French translations; main functions]

Page 218

Finding the Multiword Terms in a Parallel Corpus

A) Extract the 1-1 translation equivalents;

B) Using a “witness” monolingual term collection, we identify term translations in the other parts of the parallel corpus. This exploits the distribution of the indexes of the aligned words to define a target span of the text where candidates are looked for. The identified spans are checked for the longest common sequence of translations. The ranking of the different possible translation equivalents is based on the DICE score and takes into account the number of words in the source term, the number and adjacency score of the translated words from the source term, and the length of the target candidate.
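The DICE score used in the ranking, for a source term s and a target candidate t, is 2·c(s,t)/(c(s)+c(t)). A minimal ranking sketch over hypothetical counts (the candidate words and numbers are illustrative, not corpus figures):

```python
def dice(cooc, n_src, n_tgt):
    """DICE association score: 2*c(s,t) / (c(s) + c(t))."""
    return 2 * cooc / (n_src + n_tgt)

# hypothetical (co-occurrence, source count, target count) triples
# for candidate translations of 'procedure'
candidates = {"procédure": (38, 40, 41), "mesure": (3, 40, 25)}
ranked = sorted(candidates, key=lambda t: dice(*candidates[t]), reverse=True)
print(ranked[0])  # → procédure
```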

Page 219

Ex: Community transit procedure = procédure de transit communautaire

Translation equivalents:
Community <-> communautaire (cross-part-of-speech equivalents)
transit <-> transit
procedure <-> procédure

Main motivation for this process: checking consistency in cross-lingual term usage, ensuring the correct projection of a multilingual terminological database over the concepts of a language-independent ontology.

Page 220
Page 221
Page 222

A (rough) evaluation of the French and Dutch Terms

• Number of English terms: 900
• Number of French terms: 871
– Recall: 96.73%
– Precision: 99.08%

• Number of missed French multiword terms due to data preprocessing errors: 18 (the other 11 were single-word terms, occurring only once)

• Number of Dutch terms: 861
– Recall: 95.67%
– Precision: ?? (no idea!)

• Number of missed Dutch multiword terms due to data preprocessing errors: 32

Page 223

Hypotheses for text mining (term discovery):
a) a multiword term has a simple constituent structure;
b) if a “witness” term is translated in different ways in other languages, its usage in those languages is not terminologically “clean”;
c) a “witness” term which is not translated systematically in the same way in the other languages is probably not a proper term;
d) a significant term should re-occur in a representative document.
Considering these hypotheses reasonable, we developed a multilingual term discovery tool, thus removing the requirement for a witness monolingual term glossary.

Page 224

Term discovery with translation equivalents

1. For each language in the parallel corpus we extract (by means of NSP, using log-likelihood scoring and DICE-score ranking) the statistically meaningful collocations;

2. To loosen the effect of data sparseness, lemmatisation is needed;

3. Lists of stop-words and 18 regular expressions describing the term constituent structure (e.g. a term cannot begin with a number, a term cannot contain a conjunction, etc.) filter out parasitic high-scored n-grams.

Page 225

4. With each such list of monolingual collocations used as a source “witness term collection”, the translations into the other target languages are extracted as before.

5. If the translations in the target languages are also found in the language-specific collocation lists, they are taken to represent terms; with N languages in the parallel corpus, we have N lists of monolingual collocations and N*(N-1) bilingual translation equivalence extraction exercises (L1 source and L2 target is not the same as L2 source and L1 target; they reinforce each other).
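Step 3's stop-word and constituent-structure filtering can be sketched as follows. The two rules shown are illustrative instances of the kind described (begins-with-a-number, contains-a-conjunction); the actual set of 18 rules is not reproduced here:

```python
import re

STOP = {"the", "of", "and", "a", "to"}  # illustrative stop-word list
RULES = [re.compile(r"^\d"),               # a term cannot begin with a number
         re.compile(r"\b(and|or|but)\b")]  # ...nor contain a conjunction

def keep_term(ngram):
    """Filter a candidate collocation: drop stop-word edges and rule hits."""
    words = ngram.lower().split()
    if words[0] in STOP or words[-1] in STOP:
        return False
    return not any(r.search(ngram.lower()) for r in RULES)

cands = ["transit procedure", "77 obligations", "goods and services", "of supply"]
print([c for c in cands if keep_term(c)])  # → ['transit procedure']
```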

Page 226

Evaluation of the term miner

• Based on collocational analysis and grammar well-formedness rules (we used 18 rules for English).

• Several terms in the human-extracted VAT term list were not compliant with our grammar well-formedness rules (out of the 755 multiword terms, only 357 passed the well-formedness filters).

• 144 terms were discovered fully, 79 were discovered partially and 144 were not found at all;

• About 1500 terms found by the term extractor could, in our view, also be terms.

Page 227

Knowledge Induction/Transfer

• dependency relations
• word senses
• valency frames
• semantic paradigmatic (wordnet) relations
• syntagmatic (framenet) relations

Page 228

Annotation import

• Romance FrameNet initiative
– We started translating the SemCor corpus (about 1000 sentences by now), word-aligned the Ro-En bitext and WSD-ed the Romanian part; this complements the Italian initiative (MultiSemCor). Based on the alignment, the annotations were imported from English into Romanian (no evaluation done yet).
– The results (outdated now) can be seen at: http://multisemcor.itc.it/

Page 229

Dependency Relation Transfer

• Setting: an En-Ro bitext in which the English part is parsed with the FDG parser
• The Romanian part is linked with the CLAM linker
• The two parts of the bitext are word- and dependency-aligned
• Import the orientation and labelling of an English dependency relation if the English relation is properly aligned with the corresponding Romanian link; in other words, if the paired words of the English dependency relation are aligned with the paired words of the Romanian link
• A filter component controls the import and labelling (e.g. translations of active voice as passive voice)
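The import condition above can be sketched: a relation crosses over only when both of its endpoints are word-aligned. The positions and labels below are illustrative:

```python
def project_deps(en_deps, align):
    """Project English dependency relations onto Romanian via word alignment.

    en_deps: (head_idx, dep_idx, label) triples over English token positions;
    align:   dict mapping an English position to its aligned Romanian position.
    A relation is imported only when both of its words are aligned."""
    ro_links = []
    for head, dep, label in en_deps:
        if head in align and dep in align:
            ro_links.append((align[head], align[dep], label))
    return ro_links

en_deps = [(3, 2, "subj"), (3, 4, "phr")]
align = {2: 1, 3: 2}  # English position 4 left unaligned
print(project_deps(en_deps, align))  # → [(2, 1, 'subj')]
```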

Page 230

Tagged, Lemmatized and Parsed English side of the “1984” multilingual corpus

1  The      the      det:>2     @DN+       2+,Dd
2  hallway  hallway  subj:>3    @SUBJ      1+,Ncns
3  smelt    smell    main:>0    @+FMAINV   1+,Vmis
4  of       of       phr:>3     @ADVL      5+,Sp
5  boiled   boil     attr:>6    @A+        1+,Afp
6  cabbage  cabbage  pcomp:>4   @-P        1+,Ncns
7  and      and      cc:>6      @CC        31+,Cc-n
8  old      old      attr:>9    @A+        1+,Afp
9  rag      rag      attr:>10   @A+        1+,Ncns
10 mats     mat      pccomp:>6  @-P        1+,Ncnp
$.

In order to provide maximally accurate source data for knowledge induction and transfer, the EN annotated data has been hand-validated

Page 231

Dependency & Chunking Annotation

Page 232

Dependency Relations Transfer Cases

1. Perfect transfer

2. Transfer with amendments

3. Language specific phenomena

4. Impossibility of transfer

Page 233

1. Perfect Transfer

Page 234

2. Transfer with amendments: the dummy anticipatory ‘it’ is the subject and ‘book’ is the complement. Yet in Romanian, ‘carte’ is the subject.

Page 235

3. Language Specific Phenomena

• Pro-drop phenomenon

Page 236

4. Impossibility of transfer (I)

• Equivalent verbs with different syntactic behaviour: ‘like’ – ‘plăcea’

Page 237

Valency frames (I)

A preliminary experiment: the Czech partner gave us access to about 600 Czech verbal synsets associated with valency frames extracted from the Czech National Corpus.

Via translation equivalence relations among the Ro-Cz wordnets we imported the original valency frames and manually checked them for applicability in Romanian. About 84% of the imported valency frames were valid (sometimes with minor modifications)! Only 98 valency frames needed significant modifications. Very promising.

The semantic restrictions for the frame elements are wordnet-endogenous.

Page 238

Valency frames (II)
The synset 20-02609765-v: (a_se_afla:3.1, a_se_găsi:9.1, a_fi:3.1), with the gloss: be located or situated somewhere; occupy a certain position

(nom*AG(fiinţă:1.1) | nom*PAT(obiect_fizic:1)) = prep-acc*LOC(loc:1)

fiinţă:1.1 =: a living thing that has (or can develop) the ability to act or function independently
obiect_fizic:1 =: a tangible and visible entity; an entity that can cast a shadow
loc:1 =: a point or extent in space

The reading is the following: any verb in the synset 20-02609765-v subcategorizes for two arguments:

• the first one, which usually precedes the verb, is an NP in the nominative case, with the semantic role AGent or PATient depending on the category of the filler

• the second one, which usually follows the verb, is a PP in the accusative case, with the semantic role LOCation (its filler must be loc:1 or a hyponym of it)
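The hyponymy test behind the LOC restriction can be sketched as follows. This is not the author's implementation; the mini hypernym table below is invented for illustration, and real checks would query a wordnet.

```python
# Sketch: checking whether a filler satisfies a frame's semantic restriction
# by walking up a (toy, invented) hypernym chain until the restriction is
# reached or the chain ends.
HYPERNYMS = {
    "oraş:1": "loc:1",          # hypothetical: city -> place
    "loc:1": "entitate:1",
    "pisică:1": "fiinţă:1.1",   # hypothetical: cat -> living thing
    "fiinţă:1.1": "entitate:1",
}

def satisfies(filler, restriction):
    """True if filler equals the restriction or is one of its hyponyms."""
    while filler is not None:
        if filler == restriction:
            return True
        filler = HYPERNYMS.get(filler)
    return False

# 'oraş:1' is (in this toy table) a hyponym of 'loc:1', so it can fill LOC:
print(satisfies("oraş:1", "loc:1"))    # True
print(satisfies("pisică:1", "loc:1"))  # False
```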

Page 239: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Opinion Mining

• Goal: to assess the overall sentiment of an opinion holder with respect to a subject matter

• Different granularities (document, sentence)
– Identify the opinionated sentences (OpinionFinder) and the opinion holder
– Select those referring to the subject matter of interest
– Classify the opinionated sentences on the subject matter, according to their polarity (positive, negative, undecided) and force.

Page 240: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

SentiWordNet
Andrea Esuli, Fabrizio Sebastiani. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining, LREC 2006

The basic assumptions:

1. Words have graded polarities along the orthogonal axes Subjective-Objective (SO) and Positive-Negative (PN)

2. The SO and PN polarities depend on the various senses of a given word (context)

Page 241: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 242: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

© Andrea Esuli 2005 – [email protected]

Page 243: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

State of the art
• Monolingual research: more and more numerous, and in more and more languages
• Multilingual comparative studies (different comparable text-data, different languages): not very many, but their number is increasing
• We are not aware of cross-lingual studies (parallel texts). Why? Possible answers:
– The original opinions are those expressed in the source language; the target language contains (presumably faithful) translations of the holders' opinions;
– Expressing opinions is a cultural matter: most translations are concerned with preserving the factual content

Page 244: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Questions (Case 1)
• Assume a collection of original documents in Japanese SJP and two translations of it into English, TEN1 and TEN2, the first done by a Japanese with a perfect command of English and the second by an American with a perfect command of Japanese.

– Would opinions in SJP and TEN1 be "the same"?

– Would opinions in SJP and TEN2 be "the same"?

– Would opinions in TEN1 and TEN2 be "the same"?

"the same" = same number of opinionated sentences, same polarity

Answers: No idea! Possible guesses: YES?, YES?, YES?

Page 245: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Questions (Case 2)
• Assume a collection of original documents in Japanese SJP containing reports (newspaper articles, news agency briefs, official statements) on specific international events, and a collection of documents in English SEN containing reports of similar lengths, from corresponding sources, on the same international events.
– Would opinions in SJP and SEN be "the same"?

"the same" = same number of opinionated sentences, same polarity

Answer: Probably not!

Why? Due to cultural differences. For instance (cf. Kim & Myaeng, NTCIR 2007), "a sentence in Japanese, reporting on a merge of two companies should be judged to have negative sentiment whereas the same kind of activities in the US would be a positive event".

Page 246: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Opinion analysis across languages: NTCIR-6
• David Kirk Evans, Lun-Wei Ku, Yohei Seki, Hsin-Hsi Chen, Noriko Kando (2007)

– Case 1.5 experiments (comparable texts) in Japanese, English and Chinese (the English translations probably done by Japanese and Chinese employees of the local news agencies)

– Japanese data (1998-99): Mainichi Daily News, Yomiuri

– English data (1998-99): Mainichi Daily News, Korea Times, Xinhua

– Chinese data (1998-99): United Daily News, China Times, China Time Express, China Daily News, etc.

Language   Topics  Documents  Sentences  Opinionated (lenient/strict)  Relevant (lenient/strict)
Chinese    32      843        8546       62% / 25%                     39% / 16%
English    28      439        8528       30% / 7%                      69% / 37%
Japanese   30      490        12525      29% / 22%                     64% / 49%

Page 247: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

What does this experiment show?
• Big differences across languages in the Gold Standards
• Despite using similar approaches, big differences in the performances of the competing systems with respect to the processed language (best in Chinese, worst in English)
• Are these differences explained by the existing differences in the annotation? Partly!
– Annotator training could be a better explanation (big differences between the annotators in the three languages)
– Language and cultural differences probably mattered significantly

Page 248: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Why so much interest in subjectivity analysis ?

• Social community-oriented websites and user-generated content are becoming an extremely valuable source of information for everyday information consumers, but also for various kinds of decision makers;

• Two main types of textual information on the web: facts (objective) and opinions (subjective)

• Current search engines search for facts, not opinions (the current search ranking strategy is not appropriate for opinion search/retrieval)

Page 249: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

• Word-of-mouth on the web is sometimes perceived as more trustworthy than the regular mass-media sources!
– In user-generated content (review sites, forums, discussion groups, blogs, etc.) one can find descriptions of personal experiences and opinions on almost anything;
– valuable for common people in practical daily decisions (buying products or services, going to a movie/show, traveling somewhere, finding opinions on political topics or on various events…);
– valuable for professional decision makers in many areas, so they support this trend.

Page 250: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Feature-based opinion and sentiment analysis

• The building blocks: sentiment words
• Words become sentiment words in context
• The bag-of-words approach works pretty badly (but it works!), and there are various ways (maybe expensive) to improve opinion and sentiment analysis
• Syntax and punctuation (usually discarded) also play an important role in judging the subjectivity of a piece of text.

Page 251: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Resources

Polarity is a matter of context. However, lexical resources give you only prior polarities. Sentence/phrase polarity is compositional, based on prior polarities which may be altered by valence shifters (such as negation). Most researchers came to the conclusion that prior polarity is a matter of word senses! Having good resources creates the premises for building accurate opinion miners.

Princeton WordNet 2.0SUMO&MILO

SENTIWORDNET

Domains

Page 252: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Annotating WordNet for prior polarities

• Starting with a set of words hand-annotated for their prior polarities, most sentiment resources are built by applying ML techniques to induce prior polarities for the lexical items stored in lexico-semantic repositories. As WordNet is a highly praised repository of this kind, not surprisingly its structure and content are the backbone of such enterprises.

Page 253: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Word-Senses and Subjectivity

• SentiWordNet associates subjectivity scores (P, N, O) to WordNet synsets, i.e. to the word-senses.

• Lexical semantics is very important here.

• WSD would be highly instrumental (e.g. JW&RM)

• Dependency Linking (which is less than parsing, but easier to obtain) is much more appropriate than B-o-W

Page 254: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Cross-lingual opinion and sentiment analysis

• A parallel text (EN-XX), e.g. Orwell's "1984", MultiSemCor, Euro-Parl, JRC-Acquis, etc.
• Word-align and WSD the EN-XX bitext
• Use a scoring method for the senti-words and valency-shifter words in each part of the bitext (based on SentiWordNet scores) to classify the opinionated sentences
• Try to answer Question 1
– Evaluate monolingually (both in EN and XX) whether the mark-ups hold true; for EN you might use OpinionFinder (Wiebe, Riloff et al.) and compare its classification with the SentiWordNet-based classification
– Write immediately a breakthrough paper (whatever the results of the evaluation)

Page 255: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

What would you need to do it?

• Quality multilingual lexical and sentiment marked-up resources (multilingual lexical and sentiment ontologies are probably the best)

• A list of valency shifters, and rules defining their scope and effect on the sentiment words (Polanyi & Zaenen, 2006)

• Preprocessing tools (sentence alignment, tokenizer, POS taggers, lemmatizers, dependency linkers)

• Alignment tools (e.g. COWAL)
• WSD tools (WSD Tool, SynWSD)
• Sentence opinion scorer and classifier
• Annotation transfer tools

Page 256: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

• Princeton WordNet 2.0 (Fellbaum)

• SUMO/MILO (Niles, Pease)

• DOMAINS (Magnini, Cavaglià)
• SentiWordNet (Esuli, Sebastiani)

English Lexical & Sentiment Ontology (ELSO)

Page 257: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

• English Lexical & Sentiment Ontology

• BalkaNet wordnets (see D. Tufiş (ed.), ROMJIST Special Issue on BWN)

• EuroWordNet wordnets (see P. Vossen (ed.), CHUM Special Issue on EWN)

Multilingual Lexical & Sentiment Ontology (MLSO)

The EuroWordNet and BalkaNet multilingual wordnets use the Princeton WordNet as the InterLingual Index (ILI) => any sentiment and ontological mark-up in PWN is available in the aligned monolingual wordnets; altogether they make an MLSO.
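The ILI-based transfer can be sketched as a simple join on synset IDs: since the aligned wordnets share Princeton WordNet IDs, any (P, N, O) annotation attached to an ID carries over to every language. The scores and literals below are the ones shown for ENG20-04854135-n later in these slides; the dictionaries themselves are illustrative, not the real resource format.

```python
# Sketch of ILI-based transfer of sentiment annotations to an aligned wordnet.
sentiwn = {  # ILI id -> (P, N, O) prior polarities
    "ENG20-04854135-n": (0.75, 0.0, 0.25),
}
ro_wordnet = {  # ILI id -> Romanian literals of the aligned synset
    "ENG20-04854135-n": ["bine:16", "bun:51", "virtute:2"],
}

def transfer_scores(target_wordnet, scores):
    """Attach the PWN prior polarities to every aligned target synset."""
    return {ili: (literals, scores[ili])
            for ili, literals in target_wordnet.items() if ili in scores}

ro_senti = transfer_scores(ro_wordnet, sentiwn)
print(ro_senti["ENG20-04854135-n"][1])  # (0.75, 0.0, 0.25)
```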

Page 258: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

The encoding of a (sentiment) synset in RoWordNet

<SYNSET>
  <ID>ENG20-04854135-n</ID>
  <BCS>3</BCS>
  <DOMAIN>factotum</DOMAIN>
  <SUMO>SubjectiveAssessmentAttribute<TYPE>+</TYPE></SUMO>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL>bine<SENSE>16</SENSE></LITERAL>
    <LITERAL>bun<SENSE>51</SENSE></LITERAL>
    <LITERAL>virtute<SENSE>2</SENSE></LITERAL>
  </SYNONYM>
  <DEF>Înclinaţie statornică specială către un anumit fel de îndeletniciri sau acţiuni frumoase.</DEF>
  <ILR>ENG20-04521520-n<TYPE>hypernym</TYPE></ILR>
  <ILR>ENG20-04855887-n<TYPE>near_antonym</TYPE></ILR>
  <SENTIWN><P>0.75</P><N>0</N><O>0.25</O></SENTIWN>
</SYNSET>
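Reading the SENTIWN scores out of such an entry is straightforward with a standard XML parser. This is a minimal sketch over a trimmed fragment of the synset above, not the actual RoWordNet tooling.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed fragment of the synset entry shown above.
xml = """<SYNSET><ID>ENG20-04854135-n</ID>
<SENTIWN><P>0.75</P><N>0</N><O>0.25</O></SENTIWN></SYNSET>"""

synset = ET.fromstring(xml)
senti = synset.find("SENTIWN")
# Collect the three prior-polarity scores as floats.
scores = {tag: float(senti.find(tag).text) for tag in ("P", "N", "O")}
print(synset.find("ID").text, scores)
# ENG20-04854135-n {'P': 0.75, 'N': 0.0, 'O': 0.25}
```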

Page 259: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 260: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

How much trust one can have in such an approach?

• Pretty high; but if there are different viewpoints, one had better harmonize them.

• Domains: psychological_features, psychology, quality, military, etc.

• SUMO/MILO: EmotionalState, Happiness, PsychologicalProcess, SubjectiveAssessmentAttribute, StateOfMind, TraitAttribute, Unhappiness, War, etc.

• The SENTIWN, DOMAINS, SUMO&MILO and General Inquirer annotations should, intuitively, match! Do they? Not really!

Page 261: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Some statistics

• 2637 synsets labeled with the SUMO concept SubjectiveAssessmentAttribute or EmotionalState have the SentiWN annotation P:0, N:0, O:1
– E.g. (SAA): prosperously, impertinently, best, victimization, oppression, honeymoon, prettify, beautify, curse, threaten, justify, waste, cry, experience…
– E.g. (ES): unsatisfactorily, lonely, sordid, kindness, disappointment, frustration…

Page 262: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Some statistics (cont'd)
• 28434 synsets are marked for subjectivity. Many of them are questionably marked so:
– e.g. Abstract, BodyPart, Building, Device, EngineeringComponent (most of the time with negative polarity), FieldOfStudy (nonparametric statistics is bad: <P>0.0</P><N>0.625</N><O>0.375</O>, while gastronomy is much better: <P>0.5</P><N>0.0</N><O>0.5</O>)
– Happiness: happy, pleased, which is similar to glad (1), is not good (<P>0.0</P><N>0.75</N><O>0.25</O>)
– Human instances (both real persons and literary characters)
– LinguisticExpression (extralinguistic is very bad: <P>0.0</P><N>0.75</N><O>0.25</O>)
– PrimeNumber (<P>0.0</P><N>0.375</N><O>0.625</O>)
– Proposition (conservation of momentum: <P>0.0</P><N>0.25</N><O>0.75</O>)
– DiseaseOrSyndrome (influenza, flu, grippe: <P>0.75</P><N>0.0</N><O>0.25</O>)
– Prison (jail is not bad at all, it's a little fun: <P>0.25</P><N>0.0</N><O>0.75</O>)
– etc.

Page 263: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Some other statistics

• We took Wiebe's hand-crafted lists of PositivePolarity and MinusPolarity words (based on the General Inquirer):
– the PolPman file contains 657 words
– the PolMman file contains 679 words
• We extracted all the synsets in PWN 2.0 containing the literals in Wiebe's files
– the PwNPolPman file contains 2305 synsets
• 817 synsets are marked as entirely objective (O:1)
• 239 synsets have non-positive subjectivity (P:0)
• 486 synsets have P ≥ 0.5 (corresponding to 293 literals)
– the PwPolMman file contains 1803 synsets
• 461 synsets are marked as entirely objective (O:1)
• 213 synsets have non-negative subjectivity (N:0)
• 656 synsets have N ≥ 0.5 (corresponding to 356 literals)

Page 264: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Why does this happen?
Assuming WordNet structuring is perfect &
assuming the SUMO&MILO classification is perfect &
assuming Wiebe's polarity lists are perfect:

• Taxonomic generalization does not always work
– Nightmare is bad!
• A nightmare is a dream
• But dream is not bad (per se)
– An emotion is something good (P:0.5) and so is love, but hate or envy are not!

• Glosses are full of valence shifters (BoW is not sufficient):
– honest, honorable: not disposed to CHEAT- or DEFRAUD-, not DECEPTIVE- or FRAUDULENT-
– intrepid: invulnerable to FEAR- or INTIMIDATION-
– superfluous: serving no USEFUL+ purpose; having no EXCUSE+ for being

• Majority voting is democratic but not the best solution

Page 265: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Why does this happen (cont'd)?

But Wiebe’s polarity lists are not perfect

& SUMO&MILO classification is not perfect

& WordNet structuring is not perfect

Page 266: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Sentence Subjectivity Scorer
• A very naïve implementation:
– For each sentence, in each language, average the P, N and O figures of the words

He has(1) no(1) merits(1).
P:0.0;N:0.0;O:1   P:0.25;N:0.25;O:0.5   P:0.625;N:0.0;O:0.375

Sentence_1 score: P:0.292; N:0.083; O:0.625

He has(1) all the merits(1).
P:0.0;N:0.0;O:1   P:0.0;N:0.0;O:1   P:0.625;N:0.0;O:0.375

Sentence_2 score: P:0.208; N:0.0; O:0.792
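The naïve scorer above can be sketched in a few lines: the sentence score is just the average of the per-word (P, N, O) priors. The small lexicon holds the figures from the slide; in the real system they would come from SentiWordNet, and which token carries each triple (e.g. 'all' rather than 'the' in the second sentence) is a guess on my part.

```python
# A minimal sketch of the naïve sentence subjectivity scorer.
LEXICON = {            # word -> (P, N, O); values taken from the slide,
    "has":    (0.0,   0.0,  1.0),      # word-to-triple mapping assumed
    "no":     (0.25,  0.25, 0.5),
    "all":    (0.0,   0.0,  1.0),
    "merits": (0.625, 0.0,  0.375),
}

def score(words):
    """Average the (P, N, O) scores of the words found in the lexicon."""
    hits = [LEXICON[w] for w in words if w in LEXICON]
    n = len(hits)
    return tuple(round(sum(t[i] for t in hits) / n, 3) for i in range(3))

print(score(["he", "has", "no", "merits"]))         # (0.292, 0.083, 0.625)
print(score(["he", "has", "all", "the", "merits"]))  # (0.208, 0.0, 0.792)
```

Both outputs reproduce the Sentence_1 and Sentence_2 scores on the slide.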

Page 267: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

What's wrong with this naïve scorer?

• It doesn't consider the valency shifters.

The stuff(1) was_like nitric_acid(1)… P:0;N:0;O:1 => P:0.5;N:0.5;O:0

… had the sensation of being hit(4) on the back_of_the_head(1) with a rubber(1) club(3).

With valency shifters considered, either the SO or the PN polarity, or both, are switched.
Sentence_1 score: P:0.063; N:0.563; O:0.375

Now this is in line with OpinionFinder!

Page 268: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Exploiting SentiWordNet

Connotation analyzer (CONAN): given a sentence, check whether it may have an unwanted subjective/objective connotation (and to what extent, if any).

The experimental data: SEMCOR (English and its Romanian translation), also tagged, lemmatized, chunked, word-aligned and WSDed.

Page 269: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

A Translation Unit from the En-Ro SEMCOR parallel corpus

(tagged, lemmatized, chunked, word aligned and WSDed)

Page 270: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Three runtime modes:
a) Get the most objective reading: for each word, the sense with the highest O score is selected
b) Get the most positive subjective reading: for each word, the sense with the highest P score is selected
c) Get the most negative subjective reading: for each word, the sense with the highest N score is selected

If one of the scores is significantly higher than the others, there is no risk of inducing an unwanted connotation; otherwise, one can spot the word(s) whose senses determined the subjectivity polarity variation and maybe make other lexical choices.

The system works on tagged and lemmatized texts, whether WSDed or not. If the words are WSDed, CONAN returns the Subjectivity-Objectivity scores calculated according to the already assigned senses (the same scores, irrespective of the runtime mode).
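The sense selection in the three modes can be sketched as an argmax over per-sense scores. The per-sense (P, N, O) values below are invented for illustration (in CONAN they would come from SentiWordNet); only the contrast between an objective and a subjective sense of 'boil' is taken from these slides.

```python
# Sketch of CONAN's three runtime modes: pick, per word, the sense whose
# O, P, or N score is highest. Scores are hypothetical.
SENSES = {  # word -> {sense_id: (P, N, O)}
    "boil": {"boil:1": (0.0, 0.0, 1.0),      # physical sense (objective)
             "boil:5": (0.0, 0.625, 0.375)}  # 'boil with anger' (subjective)
}
MODE_INDEX = {"objective": 2, "positive": 0, "negative": 1}

def pick_sense(word, mode):
    """Return the sense of `word` maximizing the score the mode asks for."""
    i = MODE_INDEX[mode]
    return max(SENSES[word].items(), key=lambda kv: kv[1][i])[0]

print(pick_sense("boil", "objective"))  # boil:1
print(pick_sense("boil", "negative"))   # boil:5
```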

Page 271: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

What else could CONAN be good for?

Jan Wiebe & Rada Mihalcea (ACL 2006) state the following hypothesis: instances of subjective senses are more likely to occur in subjective sentences, so sentence subjectivity is an informative feature for the WSD of words with both subjective and objective senses.

S-O sentence polarity is a cheap process with pretty high accuracy. If the sentence's S-O polarity is known, this might be a very strong clue for sense disambiguation:

Subjective sentence: He was boiling with anger
boil(5): ENG20-01716662-v
anger(3): ENG20-00714423-n

Objective sentence: The water boiled
boil(1): ENG20-00363608-v

Page 272: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 273: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Personal opinions

• SentiWordNet is one of the best resources for Opinion Mining;
• We can bring evidence that it works cross-linguistically (via aligned wordnets) almost as well as for English (see my talk on Thursday);
• It should be supported by a concerted validation effort;
• The different synset labellings pertaining to subjectivity should be conciliated;
• Multilingual experiments could bring strong evidence for prior-polarity assignment to lexical items;
• The argument structure of the verb (and deverbal noun) is essential in finding out who or what is good or bad;
• The polarity of several adjectives and adverbs (modifiers) is head-dependent:
– long- response time vs. long+ life battery
– high- pollution vs. high+ standard

Prior polarity should be assigned only to head-independent modifiers; for the others, WN relations such as typically-modifies or is-characteristic-of, with attached prior polarities, might be very useful.

Page 274: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro

Language Web Services

• This has just started; it was fostered by the need to cooperate more closely with our partners at UAIC in various projects (ROTEL, CLEF, LT4L, etc.).

• Currently we have added basic text processing for Romanian and English: tokenisation, tiered tagging, lemmatization, RoWordNet (SOAP/WSDL/UDDI). Some others, for parallel corpora (sentence aligner, word aligner, dependency linker, etc.), will soon be there.

Page 275: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 276: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 277: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 278: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 279: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 280: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 281: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 282: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 283: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 284: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro
Page 285: Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest tufis@racai.ro