10
Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés. INTEX 4.0 for Bulgarian (Errol' checking as an INTEX application) Svetla KOEVA, Stoyan MIHOV 1. Introduction lNTEX under NextStep has been available since 1992. A module for Bulgarian under NextStep was implemented between 1995-1997 at LML (Linguistic Modelling LaboratOlY, http://www.hnl.bas.bg). For several years lNTEX has pOl"ted to Windows95-NT. Here we present the Bulgarian lexical and grammatical knowledge integrated in INTEX for Windows. The investigation is being developed as a part of the Joint research project "Computer Representation of granmlatical Knowledge of Bulgarian" between LADL (Laboratory for Information Retrieval Systems and Linguistics at the University of Paris-7, http://www.ladl.jussieu.fr) and CMD (Department for Computer Mo- deling of Bulgarian at the Institute for Bulgarian language, Bulgarian Academy of Sciences, http://www.ibl.bas.bg). 2. Preprocessing the text The prepl'Ocessing in INTEX includes the segmentation of the text in sentences, the identification of unambiguous compound words and the distinguishing of special tokens (contractions, elisions etc.). We sup- ply INTEX with Bulgarian lexical and grammatical knowledge as [01- [8] Svctla KOEVA, Department for Computer Modeling of Bulgarian Institutc for Bulgarian language, Bulgarian Academy of Sciences e-mail: [email protected] Stoyan Mmov, Seclor for Linguistic Modeling, Center for Paralle! Processing Bulgarian Academy of Sciences e-mail: [email protected]

(Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

INTEX 4.0 for Bulgarian

(Errol' checking as an INTEX application)

Svetla KOEVA, Stoyan MIHOV

1. Introduction

lNTEX under NextStep has been available since 1992. A module forBulgarian under NextStep was implemented between 1995-1997 atLML (Linguistic Modelling LaboratOlY, http://www.hnl.bas.bg).

For several years lNTEX has pOl"ted to Windows95-NT. Here wepresent the Bulgarian lexical and grammatical knowledge integrated inINTEX for Windows. The investigation is being developed as a part ofthe Joint research project "Computer Representation of granmlaticalKnowledge of Bulgarian" between LADL (Laboratory for InformationRetrieval Systems and Linguistics at the University of Paris-7,http://www.ladl.jussieu.fr) and CMD (Department for Computer Mo­deling of Bulgarian at the Institute for Bulgarian language, BulgarianAcademy of Sciences, http://www.ibl.bas.bg).

2. Preprocessing the text

The prepl'Ocessing in INTEX includes the segmentation of the text insentences, the identification of unambiguous compound words and thedistinguishing of special tokens (contractions, elisions etc.). We sup­ply INTEX with Bulgarian lexical and grammatical knowledge as [01-

[8] Svctla KOEVA, Department for Computer Modeling of BulgarianInstitutc for Bulgarian language, Bulgarian Academy of Sciences

e-mail: [email protected] Mmov, Seclor for Linguistic Modeling, Center for Paralle! ProcessingBulgarian Academy of Sciences

e-mail: [email protected]

Page 2: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

232 Svetla KOEVA, Sloyan Mmov

lows:

- FST for sentence recognition in Bulgarian;- Dictionary for Bulgarian unambiguous compounds;- FSTs for identifying some specifie Bulgarian tokens.

2.1. Identifying sentencesThe FST named Sentence.fst is applied ta the text in MERGE mode.The output of the FST - {S} is inserted in the text at the sentenceborders. Here bellow is presented Sentence.fst for Bulgarian (Fig. 1),which describes sorne specifie features of Bulgarian punctuation (forexample lexical abbreviations, Bulgarian compound words that endwith a capital letter, graphical abbreviations, etc.) together with thetrivial cases.

W'JiW#j@'·4@["":ffl'+ifUij!tw i+;g."u'lIl.iflIHj

{S

Pt' rj~")t, r~i-f,'~~"

""

~UI.l'-""'~pf

FriJ.:rn0913'20.:<82ifJO

Figure 1

Page 3: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

INTEX 4.0 FOR BULGARIAN 233

2.2. Idelltifyillg ullambiguous compoundsThe preprocessing includes the identifying and tagging of the unambi­guous compound words in the tex!. The dictionary Nonn.dic in whichBulgarian unambiguous compound words are listed is used in this op­eration.

Most of the compound words are unambiguous and il is good torecognize them during the preprocessing in order to avoid wrongreadings, i.e.:

Cherni v/J'h (Black peak)Cherni {cherni, cher.A:p} l'IJ1h (v/J'h, vryh.N+M:s)

{chemi, cheren.A:p}{dl/ni, chernja.V+IMP+T:E2s:E3s:12s}

The carrect reading is: Cherni v/J1h(Cherni v/J1h, Chenu vryh.N+M:s)

In general we classified Bulgarian unambiguous compound wordsas personal names (Baba Jaga), geographical names (Tihi okean-ThePacific), names of institutions (Uluversity of Sofia), tenns (stone age),complex prepositions (blagodarenie na-thanks to) and complexconjunctions (za da - in arder to).

2.3. Identifying special tokensThe last stage of the preprocessing consists of identifyillg and taggillgof the special tokens, such as abbreviated and contracted words. Ap­plying the FST Norm.fst in REPLACE mode performs the identifica­tion of special tokens. The FST Norm.fst consists of several embeddedFSTs. For example the FST Comparison.fst interprets Bulgm'ian com­parative and superlative pmticles that are separated by the corre­sponding adjectives and adverbs by a hyphen. The FST Elision.fst de­scribes the abbreviations in different positions in the sentence, the ab­breviations after numbers, the units of measure, and the elisions withapostrophe and hyphen.

Page 4: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

234 Svetla KOEVA, Stoyan MIHOV

3. Applying dictiollaries and lexical F8Ts

3.1. Bulgarian INTEX dictional"iesDictionaries applied by L\lTEX must be in a specific format, DELAFdictionaries for simple words and DELAFC dictionaries far compoundwards. Each lexical enlty in the DELAS dictionary is associated withone FST that represents ail corresponding inflected farms. DELAFdictionaries are generated l'rom DELAS dictionaries by derivation ofail paths of inflectional FSTs. For more detailed explanation ofDELAS, DELAF and DELACF dictionaries in INTEX see GROSS ANDPERRIN (1989), SILBERZTEIN (1986, 1987 and 1999, p. 16-24).

After the preprocessing procedure the simple words and the com­pound words in the text can be identified by the respective (DELAFand DELAFC) dictionaries and FSTs. The Bulgarian DELAF diction­ary is generated l'rom Bulgarian dictionmy Gramatic2000 (KOEVA,1999). The dictionm'y consists of about 80,000 lemmas and over1,000,000 corresponding forms, basically ail inflected simple wards.

3.2 Lexical FSTsThe lexical FSTs are used to describe infinite sets of lexical entries(e.g. analyticalnumerals). The following FST Dnum.fst (Fig. 2) iden­tifies and tags Bulgarian numerais l'rom 2 to 9,999.

4. Bulgariall gl'all1ll1ar ndes4.1. Disambiguatioll rulesMost of the illteresting problems in computationallinguistics, as weil asthe importaut applications in Naturai language processing require atagger-an automatic system that correctly associates the words with thegrammatical categories and theu' values. The mOlphological analyzerproduces ail legitimate tags for the wards that appem' in the text. Forinstance the Bulgarian word kosi will receive in Intex the following tags:

Noun, Feminine, Plural, Indefinite "hau'"Verb, Present, 3 Person, SUlgular "he mows"Verb, Petfect, 2 Person, Singular "you mew"Verb, Perfect, 3 Person, Singular "he mew"Verb, Imperative, 2 Person, Singular "mow!(now)"

Page 5: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

lNTEx 4.0 fOR BULGARIAN

IW!iltMSi@f'f"m§h@,"Œ'MUM!

98~gecem

rnpugecem'1elllUPliJ ECil!l

1-\~ A1i.ne",gl'ce'"

N •~ . A ~,~'::'i:::'oceugeC"I!!g<o&rngecem

omo

:i~~,~ill<cerncmo!!.lKmJ Ol':1l'l1

C~Œ"'I!ilIH

5JC"lJCl!l()l!llMgç&>mCIlliC'I!I1fr\

,B<mpu'<EIllUpU

"fi!!.lecm II-----j;;;.~,,~i"l""'-------=::::::::::,4"1c~eu li~~"..

Figure 2

,,~

g'.mpu""mupti"fi"-'Km

235

Il is weil known that there are Iwo different approaches 10 the dis­ambiguation-statistical (dala-driven) and linguistic (conslrainl­based). The INTEX offers good language independent tool for the lin­guistically oriented disambiguation. We can formulate a few types oflinguistic rules as FSTs. The contextual restrictions for the most fre­quently used ambiguous words can be defined to resolve ambiguity.For example a rule can say that:

- Fonns like hl/bava are adverbs, if there is not a nmm (neuter, singu­lar) in the sentence.

- Fonns like IllIbava are adjectives, if there is a noun (neuter, singu­lar, indefinite) that illnl1ediately follows them.

The rules described above are certainly not sufficient to providefull disambiguation. An interesting observation (which is also true forBulgmian) is that the most frequently used ambiguous words are usu­ally words that me corpus independenl (prepositions, pronouns, con-

Page 6: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

236 Svetla KOEVA. Sloyan MIHOV

junctions). We also need some heuristic rules coding the followinginformation:

Mi is a short fOl1n of a possessive pranoun, if it follows a deter­Inined noun, adjective or numeral. Otherwise it is a short dative formof a personal pranoun.

Bellow the grammar Iule for disambiguation of the Bulgarian per­sonal pranoun clitics is given.

--J>'0H

Mlf

HYr.HU

BU

111,1

<PRû+PER>

Figure 3

<PRO+PER>o

4.2. Grammal' checkillg as an INTEX aPllIicationThe typology of elTOr types in Bulgarian was explored under the pra­ject PEC~ 2824 "Language Teclmologies for Slavic Languages". Thegoal of the praject was the creation of an automatic system for detec­tian and correction of errars in Bulgarian texts (grannnar checker).We successfully applied the INTEX system for the errar detection inthe texts. For example the following FST named def-AN.fst (Fig. 4)describes several errars in the adjective, nmlll agreement and nounphrase definiteness.

Page 7: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

INTEX 4.0 FOR BULGARIAN

i%1ib$!l;;'H!iJ!'ilitjmfffl.h~ItM!i1tl.UJM"Y.li

&f.All<;1fV/dlhy)I13(16:2320oo

Figure 4

237

intelpretacijaintel]Jretation(indef)

intelpretacijataintelpretation(def)intelpretacijaintel]Jretation(indef)

The definiteness in the noun phrase is expressed only once-the defi­nite article is incorporated behind the first ward in the phrase, respec­tively the "indefinite article" immediately precedes the first ward inthe phrase.

S.a. *novata obshtatanew(def) cOlI/lI/on(def)

S.b. novata obshtanew(def) cOlI/lI/on(indef)the new cOlI/lI/on intelpretation

S.c. nova obshtanew(inde.f) cOlI/lI/on(indef)a new cOlI/lI/on intelpretation

The ill-formation of ward and definite article sequence is noticedvery frequent with masculine common nouns-these nouns can takeeither full or short fonn of definite article. The subject noun phrase(masculine, singular, definite) in Bulgarian literalY language receives

Page 8: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

238 Svetla KOEVA, Stoyan MIHOV

the so called full fon11 of definite article, if any. A very frequent erroris just the opposite usage.

6. *Studentkata chete ot dokladat.student (fm.def) read fi'om report (/IIs.def:ful/)

In so called grammatical gender languages like Bulgarian com­mon nouns are arbitrarily specified for gender, which is reflected intheit· morphological fon11. Bulgm'ian adjectives agree with theit· headnOllll with respect to gender and number.

7. *M1adata xubav /IIomiche.beautifid (fin.sg.def) young (/IIs.sg.) girl (fin.sg.)

The FST comma,fst is an example for the detection of punctuationerrors of different type-omission or wrong insertion of a punctuationmark.

<li>

<IPue;>

ERR

<PRO+INTtFPO>

6e3,!lanJl.aIbH KaiÛ

ra.,,!! 'I.eBl-npeYIf'l.e

IH1Œp'le

3a1I:(OTO

cünilIla.grfWed May 31 llû6:22 2000

Figure 5

Page 9: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

lNTEX 4.0 FOR BULGARIAN 239

On the next figure (Fig. 6) the result of the application of FSTs forgrall1ll1ar checking is presented. In the concordance window the de­tected grall1ll1ar errors are given.

3apa3eHUJI pm10H 6~~ npe'laT Ha TbprOBl1J'lTapa....l0HI1.{S} Te311 MepKll 6uxa Morllll,o,a 3aMeOl:

Cbll.leOTBYBautOTO ynoBaHl1e B 6e3uellH\1lB npoqBal(CllHal.l.\l.A Il nO-OTpOFIUl rpaHl14eH KOH1POI\.{S:

Ha. lMl18anpOrpaWl Sa pro.10HIlpaHB GlllP.RGBM06Bbp*e c npou.eAyplne no PBfllClpal.\IUI Ha:N1\1,QeHtuq:.llU11paHe Ha 60AeCHnB, eKtJo'HneI\HO 1\1

TecTBali.tO 060py,o,BaHB Il 06Y4eH nepCOHM.

{S} KrJo'lDBU MapKS1

.e..CCNJS H,a "" lIpl!Hl3He " 1I0II1'"""k" Ha HHB"',,-.

,a..IlITJ Oll.l'""" "01 ""na; g"f""b""" "4/432/1':00 "a6c,:-JU:..THO,a6co.nl.fleH.A,aD kUI Dp"ç(:;;;"E"na lia SCOlTIl"P""""" c,pa,U!6conbrHo,a6conbrllo.Wl (.",,, c"",,,,kOc,,,na>lcka lI""II""k-.., lIe"'P""1l0

,.01" HaITHOHa'HIII,a

Ha npr""'!lI;a Il "PiT"""

-,§'If: Il.51

Fig. 6

6. ConclusionsTo summarize, we supply INTEX with Bulgm'ian lexical and gram­matical kllowledge, as follows:

- FST for sentence recognition;- Dictionary for unal11biguous cOl11pounds;- FSTs for special tokens;- DELAF dictionm'y with over 1 000 000 entries;- DELACF dictionary;- FSTs for some lexical entries;- FSTs - local grall111lars for disambiguation;- FSTs -local gra111ll1ars for gra1llll1ar checking.

Page 10: (Errol' checking as an INTEX application)lml.bas.bg/~stoyan/SKoevaetc.pdfExtrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. -

Extrait de la Revue Informatique et Statistique dans les Sciences humaines XXXVI, 1 à 4, 2000. C.I.P.L. - Université de Liège - Tous droits réservés.

240 Svetla KOEVA, Stoyan MillOV

INTEX is a research tool with a lot of applications in computationallinguistics, corpus-based linguislics, information retrieval, etc. and webelieve that the description of Bulgarian lexical and grammaticalknowledge will conlribute to the multilinguallexicography, as weIl asto the unification of the language resources.

The team at the Bulgarian Academy of Sciences has successfullyapplied INTEX for various research tasks. Il has been proved that thesystem is very useful for prototyping, testing and verifying of new lin­guistic tools like grammar checker.

References

GROSS (Maurice), PERRIN (Dominique) (eds.): 1989, "Electronic Dic­tionaries and Automata in Computational Linguislics", LectureNotes in Computer Science (Berlin-New York: Springer Ver­lag).

KOEVA (Svetla), 1999: "Bulgarian Gran1l11atical Dictionary", Bulgar­ian Language, 2, p. 49-56.

KOEVA (Svetla), 1998: "Bulgarian Tagset Specification for Part-of­Speech Disambiguation", Papers fi'om Second Conference onFormaI Approaches ta South Slavic Languages, Working papersin Linguistics 1998 (Dragvoll: University of Trondheim), p. 71­77.

PEC~ 2824,1995: "Ill-formed input in Bulgarian", Review meeting ofthe Joint research project PECQ 2824 Language Technologiesfol' Slavic Languages, Saarbruecken, 28 October 1995.

SILBERZTEIN (Max), 1986: "Classification des mots des dictionnairesDELAS et DELAF". Rapport Technique du LADL na 14a-14b(Paris: Université Pm"is 7).

SILBERZTEIN (Max), 1987: "The lexical analysis of French", LecturesNotes in Computer Science (Berlin: Springer Verlag).

SILBERZTEIN (Max), 1999: 1NTEX (01' Windows,http://www.ladl.jussieu.frIINTExi