Seminários de Linguística 2 (1998): pp. 21-37. Universidade do Algarve – UCEH. Faro

A local grammar of proper nouns

Jorge Baptista
Universidade do Algarve – UCEH
CAUTL - LabEL**

Abstract

Usually, in natural language processing (NLP), proper nouns (Npr) are dealt with in a minimalist way: either by small test lists or by ignoring words with an initial capital letter. This paper presents some of the linguistic properties of proper nouns that are pertinent to their processing in the automatic analysis of texts, namely their formal variation (inflection and diminutives) and combinatorial constraints. This linguistic information is represented by finite state automata (FSA) and a machine-readable dictionary of proper nouns.

Resumo

Normalmente, os sistemas de processamento de linguagem natural (NLP) tratam os nomes próprios de forma minimalista: ou através de pequenas listas experimentais ou ignorando as palavras com maiúscula inicial. Neste artigo, apresentam-se algumas das propriedades linguísticas dos nomes próprios pertinentes para o seu processamento em análise automática de textos, nomeadamente a sua variação formal (flexão e diminutivos) e restrições combinatórias. Esta informação linguística é representada sob a forma de autómatos de estados finitos (FSA) e um dicionário electrónico de nomes próprios.

1. Introduction.

1.1. Machine-readable dictionaries.

Identifying every word of a text is an indispensable preliminary step towards any sort of automatic processing. In most current systems of NLP, this task is usually done by consulting machine-readable dictionaries.

Machine-readable dictionaries are quite different from dictionaries made for human readers. Many conventional dictionaries are already available on CD-ROM and can be consulted by computer, but since they are built for human readers they are totally ineffective for automatic language processing: their lexical lacunae and incoherent information are quite obvious.

Research for this paper was partially financed by JNICT (Program PRAXIS XXI/2/2.1/CSH/ 775/95).
** Centro de Automática da Universidade Técnica de Lisboa – Laboratório de Engenharia da Linguagem.


Lexicographers count on the readers’ own knowledge of the language to fill these lacunae.

However, computers cannot do this. Electronic (or machine-readable) dictionaries must therefore be as complete as possible, to the point of near exhaustiveness, for any failure during their consultation makes further analysis impossible. The linguistic information coded in them must be totally explicit, for computer programs have neither human adaptability nor any linguistic knowledge besides that made available by the linguist.

Finally, this information must be completely coherent and precise. Because of the precision required, semantic representation is left out, for now, of the machine-readable dictionary proposed here, since no coherent system exists so far for representing meaning¹.

Proper nouns are an exceptionally precise subset of the lexicon in this respect. They can all be classified as “human” nouns, in the syntactic-semantic sense defined by M. Gross (1975:47). A rudimentary subclassification can also be given in many cases, namely the distinction between given names and surnames, which is of syntactic interest (see §3).

This paper deals with the problem of processing a specific set of proper nouns, namely words designating people’s names (João ‘John’, Ana ‘Anne’). It disregards other sets of proper names: brands (Coca Cola), corporations (Microsoft), institutions (Gulbenkian), and so on. Toponyms (Portugal, Lisboa), which may be considered a subset of proper nouns, were not taken into account either.

1.2. Finite state automata (FSA).

As increasing evidence demonstrates (M. Gross 1988, 1989b, 1996 and 1997; see Roche and Schabes (eds.) 1997 for an overview of the domain), certain short-range restrictions on the combinations of words are better described by finite state automata than by extensive listing of those variations. The importance of such short-range constraints in sentences is remarkable.

A clear case of this is the formal variation of proverbs, usually considered a particular case of frozen sentences². For the well-known proverb:

1 On building machine-readable dictionaries, see M. Gross 1988, 1989a; for French machine-readable dictionaries, see Courtois 1990, Silberztein 1989, Courtois et al. 1997; for Portuguese, see Marques Ranchhod and Eleutério 1992, 1994, Eleutério et al. 1995.
2 On formal variation of Portuguese proverbs, see L. Chacoto 1994.


A galinha da vizinha é mais gorda do que a minha
The chicken of my neighbour is more fat than mine

one can find several variants:

the initial definite article a ‘the’ can be zeroed;

the possessive determinant minha ‘my’ can be inserted before vizinha ‘neighbour’;

before the possessive minha, the definite article a ‘the’ contracted with the preposition de ‘of’ can be zeroed, so that da minha ‘of the my’ is equivalent to de minha ‘of my’;

the adverb and adjective mais gorda ‘more fat’ can be replaced by melhor ‘better’;

the adverb sempre ‘always’ can be inserted to modify these two adjectives;

the comparative conjunction do que ‘than’ has an equivalent reduced form que;

the so-called mirror transformation around the verb é ‘is’ can be observed, but some of the former variations do not apply.

These and many other types of short-range restrictions can be described either by listing all variants in a dictionary of frozen adverbs or by a finite state automaton. In this case, all these local constraints produce twelve variants that can be efficiently described by the FSA ProvGalinhaDaVizinha, shown in Fig. 1, below:

Fig. 1. ProvGalinhaDaVizinha.graph

As M. Gross (1997:330) puts it, no matter how important local constraints seem to be in a language description, “they have been totally neglected, and the [linguists’] interest has shifted to the essential problems of long range constraints between words and phrases.[...] The model we advocate, and which we call finite state for short, is of a strictly local nature. In this perspective, the global nature of language results from the interaction of a multiplicity of local finite state schemes which we call finite-state local automata”.


Following M. Gross’s proposal, in this paper “we give elementary examples of where the finite constraints can be exhaustively described in a local way, that is, without interferences from the rest of the grammar” (idem: 331).

1.3. Methods.

For this paper, I used a small sample of text, consisting of several arbitrarily chosen issues of the on-line edition of the Portuguese daily newspaper PÚBLICO³. This is a small file of 460 KB, with 90,534 tokens (13,417 different), containing 74,783 words (13,058 different).

This small corpus was analysed using the machine-readable dictionary of Portuguese simple words of the DIGRAMA system (Eleutério et al. 1995).

In its present state, the dictionary (DIGRAS) contains about 100,000 common lexical entries and can be considered a real-size lexicon of Portuguese simple words. Each entry is given a conventional code that allows an adequate program to generate or recognize all inflected forms of Portuguese simple words (about 1,250,000 inflected forms, approximately 700,000 of them different) and to associate with them all pertinent linguistic information (grammatical category, gender-number, degree, person, tense, and so on). There are no proper names in this dictionary.

The dictionary of the text’s words after lexical analysis contains 14,498 lexical entries (many forms are homographs). For this paper, I focused on the list of the remaining 1,532 unknown tokens, that is, words unrecognized by the dictionary (see the extract of this list in Annex 1).

Finite state automata (FSA) were built to be used with a small machine-readable dictionary of about 3,000 proper nouns in the analysis of the corpus. Both the FSA, such as the one shown above, and the dictionary were built using INTEX 4.0, a linguistic development environment created by M. Silberztein (1993, 1997) for finite state descriptions of natural languages and for use on large corpora of texts⁴.

3 I acknowledge PÚBLICO for the permission to use this material.
4 I thank M. Silberztein for his many helpful observations on using INTEX.


2. Proper nouns.

2.1. Raw data.

Proper nouns may constitute one of the larger sets of unrecognized Portuguese words after applying a real-size machine-readable dictionary of Portuguese simple words to a corpus of texts.

Of the 13,417 different tokens of the small corpus used in this study, the dictionary of simple words did not recognize 1,532 (see Annex 1 for an extract from this list). This means that unrecognized tokens constitute about 11.4 per cent of the text units, which is a high rate of word-recognition failure and therefore a serious impediment to further linguistic analysis.

Looking through this list of unknown forms, one finds about 200 Portuguese proper nouns (13%), both given names and surnames, all of which were included in the dictionary of proper names. The rest of the unknown tokens consists mainly of the following (in decreasing order of importance):

foreign words, mostly proper nouns: Adams, Airbus, Alatas, Albert, Allen, Annie, Antonioni, Assayas, Aznar;

toponyms, both in Portuguese and in other languages: Alagoinhas, Alemanha, Alentejo, Algarve; [Los] Angeles, Auschwitz;

siglae: AAC, ABS, AEVP, AFP, AIP, AM, ANMP;

names of brands, corporations and other institutions: Abacus, Agrolongo, Antral, Autocoop, Autoeuropa;

orthographic and typing errors: acccionar, acordão, anfitriãos, Amnista, anosdepois, assuas, arcordão, Arias, avancadas;

other proper nouns: Alcorão.

Each of these situations can be dealt with independently and probably in a somewhat different manner. Here I focus only on proper nouns designating people’s names. Like the other types of proper names, they must also be recognized by NLP systems and processed in some way in order to allow adequate analysis of the sentences where they occur.
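The two proportions quoted in this section can be checked directly from the counts given above. The short Python sketch below only redoes that arithmetic; the variable names are mine, and the figure of 200 proper nouns is the approximate count reported in the text.

```python
# Counts reported in section 2.1 (variable names are illustrative).
different_tokens = 13417      # different tokens in the corpus
unrecognized_tokens = 1532    # tokens not found in the dictionary of simple words
portuguese_names = 200        # "about 200" Portuguese proper nouns among them

failure_rate = unrecognized_tokens / different_tokens
name_share = portuguese_names / unrecognized_tokens

print(f"recognition failure: {failure_rate:.1%}")        # 11.4%
print(f"names among unknown tokens: {name_share:.0%}")   # 13%
```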

2.2. Capital letters.

As all proper names begin with capital letters, a minimalist approach to this problem would be to consider all capitalized words (unrecognized by the common lexicon dictionary) as proper names.

However, this would not be adequate, for it presupposes that all capitalized words are proper nouns and that they are error-free, which is not always the case.

Still, using lists of capitalized words from a corpus proved an efficient method for constructing the dictionary of proper nouns.
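The limits of the capital-letter heuristic can be illustrated with a minimal Python sketch. It is not part of the system described here: the function name and the empty toy lexicon are assumptions made for the example, and the sample tokens are taken from the list of unknown forms in §2.1, plus one given name.

```python
def capitalized_unknowns(tokens, lexicon):
    """Return tokens that start with a capital letter and are absent from the lexicon."""
    return [t for t in tokens if t[:1].isupper() and t.lower() not in lexicon]

# Unknown forms of different kinds: foreign name, toponym, sigla, corporation,
# typing error, lower-case error, Portuguese given name.
sample = ["Aznar", "Alentejo", "AFP", "Autoeuropa", "Amnista", "acordão", "António"]

# Hypothetical stand-in for the common lexicon: none of the sample tokens is in it.
candidates = capitalized_unknowns(sample, lexicon=set())
print(candidates)
```

Every capitalized token is returned, so toponyms, siglae, corporation names and the typing error Amnista end up in the same bag as people's names. As a way of harvesting candidates for a dictionary of proper nouns the list is useful, but it still has to be sorted by hand.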

2.3. Gender-number inflection of proper nouns.

Another difficulty arises if one considers lexical analysis as the starting point for further text processing, namely, syntactic analysis. Many proper names (at least given names) present inflection and they induce gender-number agreement (ms=masculine, singular; fs=feminine, singular):

O António está cansado
The-ms Anthony-ms is tired-ms

A Antónia está cansada
The-fs Antonia-fs is tired-fs

Simple morpho-syntactic agreement rules cannot apply without information regarding the gender-number values of proper nouns.

The same can be said about number inflection, for some given names accept plural (mp = masculine, plural):

O António é esperto
The-ms António-ms is smart-ms

Os Antónios são espertos
The-mp Antónios-mp are smart-mp
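The dependence of agreement rules on gender-number information can be sketched as follows. This is a toy illustration in Python, assuming a flat table of gender-number values (ms, fs, mp) for the words of the examples above; the table and the function `agrees` are hypothetical and do not reproduce the DIGRAMA dictionary format.

```python
# Toy gender-number values for the words used in the examples (assumed for illustration).
FEATURES = {
    "O": "ms", "A": "fs", "Os": "mp",
    "António": "ms", "Antónia": "fs", "Antónios": "mp",
    "cansado": "ms", "cansada": "fs",
    "esperto": "ms", "espertos": "mp",
}

def agrees(determiner, noun, adjective, features=FEATURES):
    """True or False when agreement can be checked; None when a word has no value."""
    values = [features.get(word) for word in (determiner, noun, adjective)]
    if None in values:
        return None  # agreement cannot be decided without the missing information
    return len(set(values)) == 1

# The same table with the proper nouns removed, as in a dictionary without proper names.
WITHOUT_NAMES = {w: v for w, v in FEATURES.items() if not w.startswith("Ant")}

print(agrees("O", "António", "cansado"))                  # True
print(agrees("A", "António", "cansada"))                  # False
print(agrees("O", "António", "cansado", WITHOUT_NAMES))   # None
```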

Usually, surnames do not present gender inflection:

(A D. Silva + O Sr. Silva) veio cá ontem
(The-fs Mrs.-fs Silva-s + The-ms Mr.-ms Silva-s) came here yesterday

Number can be expressed either by plural morphemes or by syntactic means. In a plural form preceded by the masculine plural article, surnames designate the whole family:

Os Silvas vieram cá ontem
The-mp Silva-p came here yesterday

In this last construction, the surname can also appear in its basic, singular form; number information is then carried only by the definite article os:


Os Silva vieram cá ontem
The-mp Silva-s came here yesterday

Other surnames have the same shape for singular and plural:

O (Guedes + Lopes + Vaz) veio cá ontem
The-ms (Guedes + Lopes + Vaz) came-s here yesterday

Os (Guedes + Lopes + Vaz) vieram cá ontem
The-mp (Guedes + Lopes + Vaz) came-p here yesterday

Nouns that can be both given names and surnames were often divided into two morphological entries, because each one presents different inflected forms. For instance, Rosa ‘Rose’ is a feminine given name, accepting diminutives:

A (Rosa + Rosinha + Rosita) veio cá ontem
The-fs (Rosa-fs + Rosa-inha-fs + Rosa-ita-fs) came here yesterday

but the surname Rosa does not accept diminutives:

(A D. Rosa + O Sr. Rosa) veio cá ontem
(The-fs Mrs.-fs Rosa-s + The-ms Mr.-ms Rosa-s) came here yesterday

*O Sr. Rosinha veio cá ontem

In the sentence:

A D. Rosinha veio cá ontem

the diminutive Rosinha is normally interpreted as a given name (diminutive).

Morphologic description of given and surnames is shown below (§ 2.7).

2.4. Diminutives and nicknames.

Recognizing proper nouns requires adequate formalization of their morphological variation. Gender and number are the most common formal variation. But some nouns also admit diminutive suffixes, which have affective meaning:

António/Antoninho/Antoniozinho
Ana/Aninha/Anita/Aninhas
Maria/Mariazinha
Manuel/Manelinho/Manelito
João/Joãozinho/Joãozito/Janeca(s)
Joana/Joaninha/Joanita


Marginally, some nouns have equivalent nicknames:

António/Tó
Helena/Lena
José/Zé/Zézé/Zeca

and some of these even admit diminutives:

Lena/Leninha/Lenita
Tó/Toninho/Tonito
Zé/Zezinho/Zezito
Zeca/Zequinha

Nicknames can be treated as variants of the basic nouns, and this equivalence should be stated in the lexicon.
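One simple way to state this equivalence is a table that maps each variant to the form it derives from, followed transitively until a basic noun is reached. The Python sketch below is only an illustration built from the examples above; the table `VARIANT_OF` and the function `base_name` are hypothetical names, not part of the dictionary described in this paper.

```python
# Each nickname or diminutive points to the form it is a variant of
# (pairs taken from the examples in this section).
VARIANT_OF = {
    "Tó": "António", "Toninho": "Tó", "Tonito": "Tó",
    "Lena": "Helena", "Leninha": "Lena", "Lenita": "Lena",
    "Zé": "José", "Zeca": "José",
    "Zezinho": "Zé", "Zezito": "Zé", "Zequinha": "Zeca",
    "Aninha": "Ana", "Anita": "Ana", "Mariazinha": "Maria",
}

def base_name(name):
    """Follow the variant links until a basic noun is reached."""
    seen = set()
    while name in VARIANT_OF and name not in seen:  # 'seen' guards against cycles
        seen.add(name)
        name = VARIANT_OF[name]
    return name

print(base_name("Toninho"))   # António
print(base_name("Zequinha"))  # José
print(base_name("Maria"))     # Maria (already a basic noun)
```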

The formal variation of proper nouns is not substantially different from that of simple words. Therefore, in order to allow adequate processing of this linguistic information, the construction of machine-readable dictionaries of proper nouns is required.

2.5. Difficulties in building a machine-readable dictionary of proper names.

One might consider it impossible to build an exhaustive list of proper names, owing to the immense size expected of such a database. Even if one considers only Portuguese proper nouns, as I do here, the size of this list will be considerable. But this is not necessarily an obstacle: such lists can be completed progressively. Equivalent lists can also be made for other languages.

Building these dictionaries should be seen as a cumulative program for gathering linguistic data of a particular kind, similar to the construction of machine-readable dictionaries for the common lexicon.

Nevertheless, the simple listing of proper nouns and an adequate formalization of their morphological variation do not suffice for their automatic processing. These nouns combine with one another in texts, and this can be the subject of a formal description of another type, namely a finite state automaton.

2.6. Combinatory constraints on sequences of proper nouns.

People are often designated in texts by two or more of their names, frequently, using the pair formed by their given and surname (João Silva).


The two names (or longer sequences of proper nouns) then need to be analysed as a single name.

The combination of proper nouns follows rather strict rules. First, names are given in accordance with specific legislation⁵ and, second, the use of one’s name is frequently conditioned by socio-professional conventions. For instance, many people use only one given name and a surname in their profession (João Silva). Others also use the initial of one of their middle names (João M. Silva).

These rules and conventions can be stated explicitly and are generally observed in Portuguese:

Given names precede surnames: José Silva.

There can be several of each kind⁶: José Manuel Lopes Silva.

Some given names are connected by the preposition de ‘of’ or its contraction with the definite article: Maria de Lurdes, Maria do Carmo, Maria da Luz, Maria dos Prazeres, Maria das Dores. These combinations are rather limited, so it is possible to consider them as compound given names⁷.

There are a few compound given names connected by a hyphen⁸: José-Maria.

There are also some compound surnames connected by a hyphen: Corte-Real.

Surnames can be simply juxtaposed: Lopes Silva, or connected by the preposition de ‘of’ or its contraction with the definite article: Lopes de Almeida, Lopes do Carmo, Lopes da Silva, Lopes dos Santos, or connected by the conjunction e ‘and’⁹: Lopes e Silva, or by a combination of all these processes: Lopes Silva de Almeida e Cunha.

Quite rarely, the preposition de ‘of’ can be abbreviated to d’ before a surname beginning with a vowel: d’Almeida.

Given and middle names can be abbreviated to their initials: J. Silva, José M. L. Silva.

5 These legal provisions are stated in Código Civil (Decº-Lei nº. 47.344/66 November 11 with modifications introduced by Decº-Lei nº. 496/77 November 11, Art. §1677-A,B and C, §1875, §1876, §1988 and §1995) and in Código Registo Civil (Decº-Lei 36/97 January 31, Art. §103).
6 Today’s legislation imposes a maximum of two surnames from each parent, but older people may still carry more than four surnames, and married people can, under certain conditions imposed by law, add to their own up to two of their spouse’s surnames. Therefore, the number of proper names in the full identification of a person cannot be determined a priori.
7 Lists of compound proper names are being compiled, but owing to insufficient data I will not present them here. For an operational formal definition of compound nouns (of the common lexicon) see G. Gross 1988 and Baptista 1994. Some of the formal (syntactic) properties presented in these studies may also apply in determining compound proper nouns. For the purposes of building the FSA local grammar of proper names, these compound given names were considered ordinary N de N sequences of simple given names. Still, including compound proper names seems to require little modification of the FSA.
8 In this paper the hyphen connector was not taken into account.

Many of these restrictions can be described by finite state automata. Among other things, this requires that a distinction should be made between given names and surnames. It can be done as a sort of semantic feature in the machine-readable dictionary of proper names.

Of course, there are nouns that can be used both as given names and as surnames (Afonso, Alexandre, Bernardo, Caetano, David, Diogo, Duarte, Miguel, Rosa), and both features can be applied to these, although this may create some level of ambiguity.
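The rules listed in this section can be sketched as a small non-deterministic automaton over word classes. The Python code below is a simplified illustration and not the graph presented later in this paper: the toy lexicon, the state names and the function `is_name` are assumptions, hyphens and d’ are ignored, and only the constraints stated above are encoded. A noun tagged both as given name and as surname (here Rosa) simply follows both paths.

```python
import re

# Toy lexicon (assumed for illustration); Rosa is both a given name and a surname.
GIVEN = {"José", "Manuel", "Maria", "Lurdes", "João", "Rosa"}
SURNAMES = {"Silva", "Lopes", "Almeida", "Cunha", "Santos", "Rosa"}
PREPOSITIONS = {"de", "do", "da", "dos", "das"}

def classes(token):
    """Return the set of word classes a token may belong to."""
    found = set()
    if token in GIVEN:
        found.add("G")
    if token in SURNAMES:
        found.add("S")
    if token in PREPOSITIONS:
        found.add("P")
    if token == "e":
        found.add("E")
    if re.fullmatch(r"[A-Z]\.", token):  # unaccented initials only, for simplicity
        found.add("I")
    return found

TRANSITIONS = {
    ("start", "G"): "given", ("start", "I"): "given", ("start", "S"): "surname",
    ("given", "G"): "given", ("given", "I"): "given",
    ("given", "P"): "given_link", ("given", "S"): "surname",
    ("given_link", "G"): "given",
    ("surname", "S"): "surname", ("surname", "I"): "surname",
    ("surname", "P"): "surname_link", ("surname", "E"): "surname_link",
    ("surname_link", "S"): "surname",
}
ACCEPTING = {"given", "surname"}

def is_name(sequence):
    """Simulate the automaton on a whitespace-separated sequence of tokens."""
    states = {"start"}
    for token in sequence.split():
        states = {TRANSITIONS[(s, c)] for s in states for c in classes(token)
                  if (s, c) in TRANSITIONS}
        if not states:
            return False
    return bool(states & ACCEPTING)

print(is_name("José Manuel Lopes Silva"))         # True
print(is_name("Lopes Silva de Almeida e Cunha"))  # True
print(is_name("Silva José"))                      # False: given names precede surnames
```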

2.7. Formalization of proper nouns in machine-readable dictionaries.

As stated above, the formalization of proper nouns in a machine-readable dictionary is not substantially different from that of common nouns.

Like these, proper nouns are formed of an invariable and a variable sequence of characters. A conventional code is then added to each noun, designating the variable sequences of characters necessary to generate all the inflected forms of a given entry. Semantic features like Hum (=human), gn (=given name) and sn (=surname) come afterwards. The equivalence between nicknames and basic nouns is appended to the entry. The following are examples of the formalization of these entries:

Afons<o>     .N201A1 +Hum+gn
Afonso<>     .N101   +Hum+sn
Alexandre<>  .N200   +Hum+gn
Alexandre<>  .N101   +Hum+sn
An<a>        .N300A1 +Hum+gn
Aninhas      .N316   +Hum+gn/Dim:Ana
Antóni<o>    .N200A3 +Hum+gn
Beatriz<>    .N305   +Hum+gn
Bé<>         .N300   +Hum+gn/Dim:Isabel
Carl<os>     .N270A1 +Hum+gn
David<>      .N200   +Hum+gn
David<>      .N101   +Hum+sn
Diog<o>      .N201   +Hum+gn
Diogo<>      .N101   +Hum+sn
Elisa<>      .N300   +Hum+gn
Fernand<o>   .N001A1 +Hum+gn
Gustavo<>    .N200   +Hum+gn
Hug<o>       .N204   +Hum+gn
Jo<ão>       .N208   +Hum+gn
José<>       .N200A1 +Hum+gn
Lopes<>      .N116   +Hum+sn
Mári<o>      .N201A5 +Hum+gn
Maria<>      .N301A3 +Hum+gn
Manu<el>     .N214   +Hum+gn
Mendes<>     .N101   +Hum+sn
Nun<o>       .N201A1 +Hum+gn
Octávi<o>    .N001   +Hum+gn
Pedr<o>      .N201A1 +Hum+gn
Ros<a>       .N301A1 +Hum+gn
Rosa<>       .N101   +Hum+sn
Silva<>      .N101   +Hum+sn
Sousa<>      .N101   +Hum+sn
Tó<>         .N200   +Hum+gn/Dim:António
Toneca<>     .N200   +Hum+gn/Dim:António
Toninho<>    .N200   +Hum+gn/Dim:Tó
Tonito<>     .N200   +Hum+gn/Dim:Tó
Zé<>         .N200A1 +Hum+gn/Dim:José
Ze<ca>       .N214A1 +Hum+gn/Dim:José

9 I do not know of any use of the conjunction e ‘and’ for connecting given names.

The inflectional paradigms are the same as those used for simple words of the common lexicon. Surnames were given N101 (e.g., Silva) or N116 (e.g., Lopes) codes for uniform nouns, for they can appear in both genders.
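To make the layout of these entries concrete, the following Python sketch splits an entry into its parts. It rests on my reading of the examples: the sequence between angle brackets is taken to be the variable part, which, appended to the invariable part, gives the citation form. The regular expression, the `Entry` class and `parse_entry` are hypothetical helpers and are not part of INTEX or DIGRAMA; the inflectional codes are kept as opaque strings, since their paradigms are not described here.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# invariable part, optional <variable part>, ".code", then "+features[/Dim:base]"
ENTRY = re.compile(
    r"^(?P<stem>[^<\s.]+)(?:<(?P<var>[^>]*)>)?\s*"
    r"\.(?P<code>N\d+(?:A\d+)?)\s+(?P<rest>\S+)$"
)

@dataclass
class Entry:
    lemma: str                  # invariable + variable part (assumed citation form)
    stem: str                   # invariable part
    code: str                   # inflectional code, not interpreted here
    features: Tuple[str, ...]   # e.g. ("Hum", "gn")
    variant_of: Optional[str]   # basic noun of a nickname or diminutive, if any

def parse_entry(line):
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    features, _, base = match.group("rest").partition("/Dim:")
    return Entry(
        lemma=match.group("stem") + (match.group("var") or ""),
        stem=match.group("stem"),
        code=match.group("code"),
        features=tuple(f for f in features.split("+") if f),
        variant_of=base or None,
    )

print(parse_entry("Afons<o>     .N201A1 +Hum+gn"))
print(parse_entry("Aninhas      .N316   +Hum+gn/Dim:Ana"))
```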

2.8. Local grammar of proper names.

Most of these combinatory constraints can be represented, in a first approach, by the finite state automaton NprNpr.graph, shown in Fig. 2.

Abbreviated nouns are represented by the auxiliary FSA Maiusc.graph, which lists all capital letters (A, B, C, ... Z) and can be applied repeatedly.

Names are frequently preceded by (professional or honorific) titles, often abbreviated, such as Prof. António Silva, Drª. Manuela Ribeiro, Eng.º Lopes, D. Rosa, D. José and so on. A local grammar of abbreviated titles – AbbrevTitles.graph – has been built for this purpose.

Some names can be followed by a Roman numeral: João Paulo II, Carlos V, Afonso III. Usually, these names do not include abbreviations (*João P. II, *J. Paulo II), but some abbreviated titles may be incorrectly analysed as abbreviated nouns: D. Afonso III, D. Maria II. In these cases, ‘D.’ must be analysed as the abbreviation of the honorific title ‘Dom/Dona’. This restriction was not taken into account here¹⁰.

A specific local grammar (NumRom.graph) has been built (M. Silberztein 1993:126-127) to identify all Roman numerals.

I will not present these FSA here. In the automaton below, they are invoked by their graphs’ names in the respective boxes.
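Although the auxiliary graphs are not reproduced, what they are meant to recognize can be illustrated with ordinary regular expressions. The Python sketch below is only an approximation under my own assumptions: it uses a standard pattern for Roman numerals and a one-letter pattern for initials, and it does not reproduce NumRom.graph or Maiusc.graph.

```python
import re

# Standard pattern for Roman numerals from I to MMMCMXCIX (not NumRom.graph itself).
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
# A capital letter followed by a full stop, as in "J." or "M." (unaccented only).
INITIAL = re.compile(r"[A-Z]\.")

def is_roman(token):
    """True for a well-formed, non-empty Roman numeral."""
    return bool(token) and ROMAN.fullmatch(token) is not None

def is_initial(token):
    return INITIAL.fullmatch(token) is not None

def ends_with_numeral(name):
    """True when a sequence of at least two tokens ends in a Roman numeral."""
    tokens = name.split()
    return len(tokens) > 1 and is_roman(tokens[-1])

print(ends_with_numeral("João Paulo II"))  # True
print(ends_with_numeral("João Silva"))     # False
print(is_initial("D."))                    # True
```

Note that “D.” passes the test for initials, which is exactly why the title in D. Afonso III can be mistaken for an abbreviated noun.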

10 In fact, introducing this restriction implies building another FSA, almost identical to NprNpr.graph but without the Maiusc.graph auxiliary nodes and their neighbouring transitions, and in which “D.” and probably other abbreviated titles could be expanded and tagged as such.


Fig. 2. NprNpr.graph

2.9. Improving local grammar adequacy.

Automaton NprNpr.graph can be used to analyse most sequences of proper nouns in texts. Annex 2 shows an extract from the concordance of NprNpr.graph. Still, it does not adequately analyse all of them nor represent such subtle restrictions as those occurring between abbreviated names and Roman numerals.

First, it does not distinguish abbreviated given names from surnames without an unambiguous surname before that last abbreviation. Therefore, “L.” is ambiguous as to its status as a given name or a surname in the first example, but not in the second:

João M. S. L. Castro
João M. Sousa L. Castro

Second, sequences of two surnames connected by the conjunction e ‘and’ may be incorrectly analysed together as a single name. This happens with any sequence presenting the following configuration:

<N:Hum+sn> e <N:Hum+sn>

which corresponds to the incorrectly analysed sequences:

[Sampaio e Guterres]Nhum estão de acordo (‘to agree’)
[Leonor Beleza e Rebelo de Sousa]Nhum estão de acordo

Third, if there is an abbreviated surname before e, the following analysis was imposed – the surname after e could not be analysed as part of the same single name:

António M. e Sousa: [António M.]Nhum e [Sousa]Nhum

In other words, e <N:Hum+sn> must always follow a surname (optionally an abbreviated one):

António Cunha e Sousa
António Cunha M. e Sousa

However, if there is a given name or an abbreviated one after e the two names will be distinguished:

[Leonor Beleza]Nhum e [Marcelo Rebelo de Sousa]Nhum estão zangados
[Guterres]Nhum e [Mário Soares]Nhum estão zangados
[António M.]Nhum e [R. Sousa]Nhum estão zangados

Despite the inadequate analyses that result from the ambiguous syntactic status of the conjunction e ‘and’, eliminating it from the graph would make it impossible to recognize compound surnames with the connector e.
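The behaviour described in the second and third points can be reproduced with a small sketch that works on word classes only. In the Python code below, each token is assumed to be already tagged with one symbol (G given name, S surname, I initial, P preposition de or its contractions, E conjunction e); the pattern and the function `segment` are my own simplification, not the NprNpr.graph automaton.

```python
import re

# A name is a run of given names/initials optionally followed by a surname zone,
# or a surname zone alone. Inside the surname zone, "e" and "de" must be followed
# by a surname; an initial after a full surname counts as an abbreviated surname.
NAME = re.compile(r"[GI]+(?:S(?:[PE]S|[SI])*)?|S(?:[PE]S|[SI])*")

def segment(tagged):
    """Group (token, symbol) pairs into names; tokens outside any name are dropped."""
    symbols = "".join(symbol for _, symbol in tagged)
    return [" ".join(token for token, _ in tagged[m.start():m.end()])
            for m in NAME.finditer(symbols)]

# Two surnames joined by "e" are wrongly grouped into a single name.
print(segment([("Sampaio", "S"), ("e", "E"), ("Guterres", "S")]))
# An initial before "e", with no full surname before it, splits the sequence.
print(segment([("António", "G"), ("M.", "I"), ("e", "E"), ("Sousa", "S")]))
# A given name after "e" also splits the sequence.
print(segment([("Leonor", "G"), ("Beleza", "S"), ("e", "E"),
               ("Marcelo", "G"), ("Rebelo", "S"), ("de", "P"), ("Sousa", "S")]))
```

The first call returns one name, which is the incorrect analysis discussed above, while the other two return two names each.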

Fourth, introducing compound given names (mostly N de (E + o + a + os + as) N sequences and, more rarely, N-N sequences) and surnames (just N-N sequences) would greatly simplify NprNpr.graph.

Fifth, the unambiguous abbreviated title “D.”, and probably others, must be described independently, so that in certain sequences (D. Npr RomanNumeral) it can be adequately tagged.


Finally, this graph can be used as a finite state transducer (FST), due to labels “{” ... “,.N+Hum}” in the first and last nodes. This means the FST may recognize sequences of Npr as minimal text units and tag them with the N+Hum labels.
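The effect of these output labels can be imitated with a few lines of Python. This is a rough sketch under strong assumptions: recognition is reduced to maximal runs of tokens found in a toy list of names, and the output string is simply the recognized sequence wrapped between “{” and “,.N+Hum}”, as the labels suggest. Neither the function `tag_names` nor the toy list comes from INTEX.

```python
# Toy list of proper nouns (assumed for illustration).
NAMES = {"João", "Silva", "Mário", "Soares"}

def tag_names(text, names=NAMES):
    """Wrap each maximal run of known names as a single tagged unit."""
    output, run = [], []
    for token in text.split() + [None]:  # None is a sentinel that flushes the last run
        if token in names:
            run.append(token)
            continue
        if run:
            output.append("{" + " ".join(run) + ",.N+Hum}")
            run = []
        if token is not None:
            output.append(token)
    return " ".join(output)

print(tag_names("Ontem João Silva falou com Mário Soares"))
# Ontem {João Silva,.N+Hum} falou com {Mário Soares,.N+Hum}
```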

However, as it stands, the NprNpr.graph FST does not distinguish (or indicate) the name’s gender-number value, which is important for applying agreement rules. This would require one of the following solutions:

a) building four equivalent FST, one for each gender-number combination; in this case, transitions marked (1) would not be included, so that proper nouns beginning with surnames or abbreviations would be recognised not here but by an independent graph (which would remain ambiguous as to gender-number information).

b) after tagging the sequence with FST NprNpr.graph, further information on gender-number should be looked for, namely, using the information available for the first noun of that sequence, if it is a given name (not abbreviated or ambiguous with a surname).

The first solution is exemplified in NprNprMascSing.graph for masculine singular proper names, shown in Fig. 3. At this stage, we cannot yet see how the second solution could be implemented.
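Outside the finite state formalism, one naive reading of solution b) is a lookup performed after tagging. The Python sketch below is offered only as an illustration of that reading, not as the implementation left open here: the feature table, the list of ambiguous nouns and the function `gender_number` are all assumptions.

```python
# Gender-number values of a few unambiguous given names (assumed for illustration).
GIVEN_NAME_FEATURES = {"João": "ms", "António": "ms", "Maria": "fs", "Leonor": "fs"}
# Nouns that can be both given names and surnames give no reliable information.
AMBIGUOUS = {"Rosa", "Afonso", "David"}

def gender_number(name):
    """Return the gender-number value of a tagged name, or None if undecidable."""
    tokens = name.split()
    if not tokens or tokens[0] in AMBIGUOUS:
        return None
    # Initials and surnames are absent from the table, so they also yield None.
    return GIVEN_NAME_FEATURES.get(tokens[0])

print(gender_number("Leonor Beleza"))  # fs
print(gender_number("J. Silva"))       # None
print(gender_number("Rosa Silva"))     # None
```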

Fig. 3. NprNprMascSing.graph

3. Conclusion.

The FSA and the small companion dictionary proposed in this paper should be viewed as a tentative description for a local grammar of proper nouns. They allow some elementary automatic processing of sequences of proper nouns. While the elaboration of extensive lists of proper names and the determination of the combinatorial constraints between their elements is a matter of attentive linguistic survey, combining all these restrictions in a coherent and adequate local grammar may prove to be more difficult than it seems. Therefore, much is still left to be done.

References

Baptista, J. 1994. Estabelecimento e formalização de classes de nomes compostos. (M.Th.). Lisboa : FLUL.

Chacoto, L. 1994. Estudo e formalização das propriedades léxico-sintácticas das expressões fixas proverbiais. (M.Th.). Lisboa : FLUL.

Courtois, B. 1990. Un système de dictionnaires électroniques pour les mots simples du français. Langue Française 87: 11-22. Paris: Larousse.

Courtois, B., M.Garrigues, G. Gross, M. Gross, R. Jung, M. Mathieu-Colas, A. Monceaux, A. Poncet-Montange, M. Silberztein and R. Vivès. 1997. Dictionnaire électronique Delac: les noms composés binaires. Rapport technique du LADL 56. Paris: LADL.

Eleutério, S., E. Marques Ranchhod, H. Freire and J. Baptista 1995. A system of electronic dictionaries of Portuguese. Linguisticae Investigationes XIX:1. 57-82. Amesterdam: John Benjamins B.V.

Gross, G. 1988. Degré de figement des noms composés. Langages 90: 57-72. Paris: Larousse.

Gross, M. 1988. Linguistic representations and text analysis. Linguistic Unity and Linguistic Diversity in Europe. pp. 31-61. London: Academia Europaea.

Gross, M. 1989a. La construction de dictionnaires électroniques. Annales des télécommunications 44. Paris: CNET.

Gross, M. 1989b. The Use of Finite Automata in the Lexical Representation of Natural Language. Lecture Notes in Computer Science 377. Electronic Dictionaries and Automata in Computational Linguistics. LITP Spring School on Theoretical Computer Science (Saint-Pierre d’Oléron, France, May 1987). M. Gross and D. Perrin (Eds.) pp. 34-50. Berlin: Springer-Verlag.

Gross, M. 1996. Lexicon-Grammar. Concise Encyclopedia of Syntactic Theories. K. Brown and J. Miller (Eds.). pp. 224-259. Cambridge, Elsevier/Cambridge University Press.

Gross, M. 1997. The Construction of Local Grammars. Finite-State Language Processing. E. Roche and Y. Schabes (Eds.). pp. 329-354. Cambridge MA, London: MIT Press, Bradford.

Marques Ranchhod, E. and S. Eleutério 1992. As novas tecnologias e o ensino do Português. Actas do III Encontro da Associação das Universidades de Língua Portuguesa (Estoril, 1991). Lisboa: AULP.

Marques Ranchhod, E. and S. Eleutério 1994. Construção de dicionários electrónicos do Português: Problemas teóricos e metodológicos. Actas do Congresso Internacional sobre o Português (Lisboa, 1993). I. Duarte and I. Leiria (Orgs.). Vol. I: 265-281. Lisboa: Ed. Colibri/APL.

Silberztein, M. 1989. The lexical analysis of French. Lecture Notes in Computer Science 377. Electronic Dictionaries and Automata in Computational Linguistics. LITP Spring School on Theoretical Computer Science (Saint-Pierre d’Oléron, France, May 1987). M. Gross and D. Perrin (Eds.) pp. 93-110. Berlin: Springer-Verlag.

Silberztein, M. 1993. Dictionnaires électroniques et analyse automatique de textes. Le système INTEX. Paris: Masson.

Silberztein, M. 1997. Intex 4.0 Tutorial Notes. Paris: LADL.

Annex 1. Extract of the list of unknown tokens

This list was obtained from a small corpus of texts taken from the on-line edition of the PÚBLICO newspaper, after lexical analysis by INTEX using DIGRAS, the Portuguese dictionary of simple words.

AAC, Aaland, Abacus, Abecasis, Abel, Abílio, ABS, Academy, Acácio, acccionar, acordão, Adams, Adolosi, Adriano, Adrião, Advani, AEVP, AfDB, Afonso, AFP, Agrolongo, Aguilar, aidagarashugi, AIP, Airbus, Alagoinhas, Alanis, Alatas, Albert, Alberti, Albertina, Alberto, Alcorão, Alema, Alemanha, Alentejo, Alexandre, Alfredo, alfredo, Algarve, Allen, Almada, Altis, Alves, AM, América, Américo, Amnista, Amore, Ana, Anabela, And, and, Andaluzia, Andrade, André, anfitriãos, angel, Angeles, Angers, Anguita, Anita, Aníbal, ANMP, Ann, Annentregou, ANNIE, Annie, anosdepois, ANPES, antipropinas, Antonioni, Antónia, António, Antral, Antunes, APEC, APIMC, Aranda, Araújo, arbres, arcordão, Arias, Arieka, Arriaga, ARS, Arts, Artur, asdfghjklç, ASEAN, Asia, Assayas, asservie, assuas, Atami, Atlanta, August, Auschwitz, Australia, Austrália, Autocoop, Autoeuropa, Asutosil, avancada, Avdo, Aveiro, aviolinadas, AX, Azevedo, Aznar, Azpilueta, África, Álvares, Álvaro, Ásia, Áustria [...]
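As a rough illustration of how such a list arises, the sketch below keeps the tokens of a text that are absent from a dictionary of simple words and sets apart those with an initial capital as candidate proper nouns. It is not INTEX's lexical analysis: the tokenizer, the tiny `KNOWN_WORDS` set standing in for DIGRAS, the sample sentence and the function names are all assumptions made for this example.

```python
import re
from typing import List

# Hypothetical stand-in for a dictionary of simple words (lower-cased forms).
KNOWN_WORDS = {"o", "a", "ministro", "da", "defesa", "e", "visitou", "ontem"}

# A token is a maximal run of letters; digits and punctuation act as separators.
TOKEN = re.compile(r"[^\W\d_]+")


def unknown_tokens(text: str) -> List[str]:
    """Return the sorted, de-duplicated tokens not found in KNOWN_WORDS."""
    tokens = set(TOKEN.findall(text))
    return sorted(t for t in tokens if t.lower() not in KNOWN_WORDS)


def proper_noun_candidates(tokens: List[str]) -> List[str]:
    """Keep tokens made of an initial capital followed by lower-case letters."""
    return [t for t in tokens if t[0].isupper() and t[1:].islower()]


# Invented sample sentence (not taken from the corpus).
sample = "O ministro da Defesa, António Vitorino, visitou ontem Aveiro e a AEVP."
unknown = unknown_tokens(sample)              # includes the acronym AEVP
candidates = proper_noun_candidates(unknown)  # acronyms are filtered out
```

With this toy dictionary, the unknown tokens are `AEVP`, `António`, `Aveiro` and `Vitorino`; the capitalization filter sets the acronym aside. As the extract above shows, a real list also contains misspellings and fused words, which such a simple filter cannot tell apart from genuine proper nouns.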

Annex 2. Extract of concordance of NprNpr.graph. Recognized sequences in bold characters.

asculinos: Paulo Guerra, António Pinto, Alfredo Brás, Eduardo Henriques e Carlos

g) e Conceição Ferreira (SC Braga), com Ana Correia (Maratona da Maia) a suplent

), John Lee Hooker (blues tradicional), António Carlos Jobim (jazz latino), Anit

inho do Porto (AEVP), Mesquita Montes e António Filipe, respectivamente, estabel

aulo Paraty (Porto);{S} Chaves-Farense, António Marçal (Lisboa);{S} Leça-Campoma

S} Seniores - Masculinos: Paulo Guerra, António Pinto, Alfredo Brás, Eduardo Hen

viam antes sido reiteradas pelo próprio António Santos, num programa da RTP, mas

ntanto, esclarecidas em breve.{S} É que António Sousa Franco quer ter na sua mão

tre João Soares e o ministro da Defesa, António Vitorino. {S}Fernanda Ribeiro

rsão final da OPA, foram oferecidos por Artur Santos Silva.{S} Isto significa qu

ssária tal documentação.{S} Finalmente, Barbosa contesta a posição de Santana Lo

atação {S}As atletas Ana Alegria, do Braga, e Joana Soutinho, do FC Porto, es

aquela ser uma atleta portuguesa e por Campos ser treinador nacional.{S} Ou tud

-Flandres, a sobriedade da escultura de Carlos Nogueira, no pavilhão português,

projecto de urbanismo, de Graça Dias ou Carrilho da Graça e Souto Moura com proj

centenário {S}O presente do ministro Carrilho {S}O ministro da Cultura, Ma

em Libreville.{S} Ao colocar o nome de Chipenda numa lista em que também englob

declarações dos responsáveis pelo INE, Correia Gago e Daniel Santos, "a taxa de

s responsáveis pelo INE, Correia Gago e Daniel Santos, "a taxa de inflação em Po

ugiados, como assegurou um porta-voz da Cruz Vermelha, e em algumas cidades cont

(Leiria);{S} E.{S} Amadora-Salgueiros, Cunha Antunes (Braga);{S} Belenenses-Mar

tão bispo do Porto, D. António Ferreira Gomes, começou a tornar-se evidente.{S}

S} Ainda ontem, a propósito da morte de Daniel Júlio Chipenda, antigo embaixador

bre a banca portuguesa, encomendado por Eduardo Catroga, na altura ministro das

lo Guerra, António Pinto, Alfredo Brás, Eduardo Henriques e Carlos Monteiro (tod

percurso" (ver PÚBLICO de 8/9/1995).{S} Eduardo Lourenço licenciou-se em Históri

uto Moura, a energia de Graça Dias (com Egas Vieira) e João Santa Rita, as insul

: Rui Silva (Ouriquense).{S} Femininos: Elisabete Fonseca (USG Paredes). {S}L

os 50 por cento do capital. {S} Pedro Fernandes Cristalaria {S} Crise na cr

REP} {essa,esse.PRON+Dém:fs} carta, Rui Fernandes, presidente do Farense, apela

Crónica de D. João I, a mais célebre de Fernão Lopes, é lançada hoje numa edição

ÚBLICO tentou esclarecer a situação com Fonseca e Costa e com a Federação Portug

- É por isso que na cena em que Lucas ( Francisco Nascimento) é repudiado pelo a

Bolito, Marco Delgado, Orlando Sérgio e Francisco Nascimento LISBOA - Alfa 2: 14

s, espera que a História faça justiça a Franco, desejo que embaraçou a direcção

porque a empresária de Fernanda, Julia Garcia Fernandez, tinha afirmado que uma

e de 6 de Março, no Teatro Académico de Gil Vicente, e conta com a participação

antes árbitros e encontros da ronda: V. Guimarães-Tirsense, Olegário Benquerença

tado em Braga sob a direcção do árbitro Isidoro Rodrigues, de Viseu.{S} O jogo F

co pela Junta de Freguesia de Alvalade, Joaquim Cunha adianta que enquanto não t

a, vindo mais tarde a ser assistente de Joaquim de Carvalho.{S} Christo {S} C

s Patrício e Vítor Almeida (Sporting) e Joaquim Silva (Benfica), com Joaquim Pin

riam provir mesmo de Portugal, da dupla João Campos-Fernanda Ribeiro, até porque

uperação de edifícios ou de Barreiros e João Paulo Conceição, com instalações fa

até ao final do ano.{S} João Ribeiro da Fonseca, presidente da Portugália, afirm

ergia de Graça Dias (com Egas Vieira) e João Santa Rita, as insularidades de Pau

Socialista com aquele objectivo. {S} João Seabra Portugália compra aviões

de apresentação (da responsabilidade de João Vieira Caldas) é de grande coerênci

glio ainda maior do que o actual. {S} Jorge Heitor Cuba - Avionetas - Acordo Cr

s+O:1s} penso em vídeo há 20 anos".{S} Jorge Silva Melo continua: "'Corte de Cab

ra tutelar de "Os Verdes Anos".{S} Com Jorge Silva Melo e Paulo Rocha - cineast

as com os Mundiais de crosse", adiantou Jorge Vieira.{S} Quanto à inclusão de An

C Porto) para os 3000m, no feminino.{S} José Azevedo conquistou o direito a esta

ansferência de posições dentro do Grupo José de Mello - já que o vendedor é a "s

{S}O presidente da República de Angola, José Eduardo dos Santos, e o chefe da op

Saura - e de escritores - Luís Landero, José Luís Sampedro, José Goytisolo, Rosa

abilidades de vencer", disse ao PÚBLICO José Manuel da Graça Bau, presidente da

Porto e para a região do Douro". {S} José Manuel Santos, vice-presidente da C

Elisabete Fonseca (USG Paredes). {S} Luís Lopes Atletismo {S}Carla Sacrame

{S}O cineasta e os seus fantasmas {S} Luís Miguel Oliveira {S}"Corte de Cab

u.PN+Pess+Nom:3mp}. {S}Fotografia de Luís Vasconcelos Investido no estatuto d

legislação para táxis {S}O vereador Machado Rodrigues, com o pelouro dos Tra

"A Palavra" do Dreyer e "Francisca" do Manoel de Oliveira.{S} Ao falar sobre is
