Learning a token classification from a large corpus (A case study in abbreviations)
Petya Osenova & Kiril Simov
BulTreeBank Project(www.BulTreeBank.org)
Linguistic Modeling Laboratory, Bulgarian Academy of Sciences
[email protected], [email protected]
ESSLLI'2002 Workshop on
Machine Learning Approaches in Computational Linguistics
August 5 - 9, 2002
Plan of the talk
• BulTreeBank Project
• Text Archive
• Token Processing Problem
• Global Token Classification
• Application to Abbreviations
BulTreeBank project
• It is a joint project between the Linguistic Modeling Laboratory (LML), Bulgarian Academy of Sciences and Seminar fuer Sprachwissenschaft (SfS), Tuebingen. It is funded by Volkswagen Foundation, Germany.
• Its main goal is the creation of a high quality syntactic treebank of Bulgarian which will be HPSG oriented.
• It also aims at producing a parser and a partial grammar of Bulgarian.
• Within the project an XML-based system for corpora development is being created.
BulTreeBank team
Principal researcher:
Kiril Simov
Researchers:
Petya Osenova, Milena Slavcheva, Sia Kolkovska
PhD student:
Elisaveta Balabanova
Students:
Alexander Simov, Milen Kouylekov,
Krasimira Ivanova, Dimitar Dojkov
BulTreeBank text archive
• A collection of linguistically interpreted texts from different genres (target size 100 million words)
• A linguistically interpreted text is a text in which all meaningful tokens (including numbers, special signs and others) are marked up with linguistic descriptions
The current state of the text archive
• Nearly 90 000 000 running words: 15% fiction,
78% newspapers and 7% legal texts, government bulletins and other genres
• About 70 million running words are converted into XML format with respect to the TEI guidelines
• 10 million running words are morphologically tagged
• 500 000 running words are manually disambiguated
Pre-processing steps (1)
• Morphosyntactic tagger
Assigning all appropriate morpho-syntactic features to each potential word
• Part-of-speech disambiguator
Choosing the right morpho-syntactic features for each potential word in the context
• Partial grammar for non-word tokens
Pre-processing steps (2)
Partial grammars
• Sentence boundaries grammar
• Named Entity Recognition
– Names of people, places, organizations etc.
– Dates, currencies, numerical expressions
– Abbreviations
– Foreign tokens
• Chunk grammar (Abney 1991, 1996)
– Non-recursive constituents
Token processing problem
A token in a text receives its linguistic interpretation on the basis of two sources of information: (1) the language and (2) the context of use
Two problems:
• For less studied languages there are not enough language resources (low level of linguistic interpretation)
• Erroneous use in the context (wrong prediction)
Token classification
• Symbol-based classification
The tokens are defined by their inherent graphical characteristics
• General token classification
The tokens fall into several categories: common word, proper name, abbreviation, symbol, punctuation, error
• Grammatical and semantic classification
The tokens are presented in several lexicons, in which their grammatical and semantic features are listed
General token classification
Our goal is to learn a corpus-based classification of the tokens with respect to the general token classification
We use this classification in two ways:
– For an initial classification of the tokens in the texts before consulting the dictionary, and
– For linguistic processing of the tokens from the different classes
Learning general token categories (1)
Token classes:
• Common words
typical - lowercased, or with a capital first letter in sentence-initial position; non-typical - all caps
• Proper names
typical - capital first letter; non-typical - all caps; wrong - lowercased
• Abbreviations
typical - all caps, mixed case, or lowercased (with a period, a hyphen, or as a single letter)
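The graphical shapes above can be sketched as a small shape-based classifier. This is a minimal illustration, not the project's code, and it uses Latin-letter patterns where the original works over Bulgarian Cyrillic:

```python
import re

def potential_categories(token, sentence_initial=False):
    """Return the set of token classes a token may belong to, by shape alone."""
    cats = set()
    if re.fullmatch(r"[a-z]+", token):
        cats.add("common-word")          # typical lowercased common word
    if re.fullmatch(r"[A-Z][a-z]+", token):
        cats.add("proper-name")          # typical capitalized proper name
        if sentence_initial:
            cats.add("common-word")      # capital may be sentence-initial only
    if re.fullmatch(r"[A-Z]{2,}", token):
        # all caps: typical abbreviation, non-typical common word or name
        cats.update({"abbreviation", "common-word", "proper-name"})
    if re.fullmatch(r"[A-Za-z]{1,4}\.", token) or "-" in token or len(token) == 1:
        cats.add("abbreviation")         # period, hyphen, or single letter
    return cats
```

The overlapping outputs (e.g. an all-caps token falling into three classes) are exactly the ambiguity the statistical ranking described later is meant to resolve.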
Learning general token categories (2)
Some problems:
• Some tokens can belong to more than one class according to their graphical properties.
• Spelling errors in a large set of texts could cause misclassification.
Learning general token categories (3)
Our classification is not Boolean but gradual - a ranking of tokens with respect to each of the above categories.
Our initial procedure included the following steps:
– We used graphical criteria to assign potential categories to the unknown tokens.
– We used statistical methods to distinguish, within each category, the most frequent tokens of the category from rare tokens or tokens outside the category.
Learning general token categories (4)
Graphical criterion
It takes into account the graphical specificity of the tokens.
For each category, a list of tokens potentially belonging to it was constructed.
Well-known problems remain, such as:
– Common words written in capital letters
– Abbreviations written in a wrong way
The graphical criterion alone is therefore not sufficient.
Learning general token categories (5)
Statistical criterion
For each category, every candidate token is ranked in order to maximize the number of correct predictions.
In fact we classify normalized tokens.
A normalized token is an abstraction over tokens that share the same sequence of letters from a given alphabet.
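The normalization step can be sketched as grouping surface forms under a case-folded key, so statistics are collected per normalized token rather than per surface form (an illustration under that reading, not the project's code):

```python
from collections import Counter

def normalize(token):
    """Map a surface token to its normalized form (same letter sequence)."""
    return token.lower()

def normalized_counts(tokens):
    """Count the surface variants observed under each normalized token."""
    variants = {}
    for tok in tokens:
        variants.setdefault(normalize(tok), Counter())[tok] += 1
    return variants

counts = normalized_counts(["BAN", "Ban", "ban", "BAN"])
# counts["ban"] now records all surface variants of the same letter sequence
```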
Learning general token categories (6)
Ranking within a category (1)
The ranking formula is
Rank = TokPar * DocPar
where the two parameters are:
TokPar = True/All
The number of true appearances of the token divided by the number of all appearances of the token
DocPar
The number of documents in which the correctly written token was found, if this number is less than 50; otherwise this value is 50
Learning general token categories (7)
Ranking within a category (2)
The first parameter does not distinguish between one and a hundred occurrences; thus the real scope of the distribution is lost.
The impact that the token has over the text archive is represented by the second parameter. The upper bound of 50 is used as a normalization parameter.
Thus tokens that are rare or do not belong to the category receive a very small rank.
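The ranking formula is simple enough to state directly in code; the counts are assumed inputs here:

```python
def rank(true_count, all_count, doc_count):
    """Rank = TokPar * DocPar, as defined on the previous slide.

    true_count: appearances of the token written correctly for the category
    all_count:  all appearances of the token
    doc_count:  number of documents containing the correctly written token
    """
    tok_par = true_count / all_count
    doc_par = min(doc_count, 50)   # upper bound of 50 as normalization
    return tok_par * doc_par
```

A token always written correctly and spread over many documents approaches the maximal rank of 50, while a rare token, even if always correct, stays near its document count.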
Learning general token categories (8)
Usefulness
• The method favors the tokens with greater impact over the whole corpus
• The tokens appearing in a small number of documents are processed by local-based (document-centered) approaches
General token categories and local-based approaches
• The local-based approaches can filter the general classification with respect to ambiguous or unusual usage of tokens
• When the local-based approach is inapplicable, the information is taken from the general token classification
• The result of such a ranking is very useful for the other task mentioned above - the linguistic treatment of the unknown tokens
Abbreviations in the pre-processing
• Abbreviations are special tokens in the text
• They contribute to robust:
– tagging
– disambiguation
– shallow parsing
Extraction criteria
Three criteria
• Graphical criterion (as above)
• Statistical criterion (as above)
• Context criterion - we tried to extract abbreviations together with their extensions, which are usually written in brackets. This reduces the ambiguity.
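The context criterion can be sketched as a pattern matching an abbreviation followed by its bracketed extension, e.g. "ДП (Държавно предприятие)". The regular expression below is an illustration, not the project's actual partial grammar:

```python
import re

# An uppercase-initial candidate abbreviation (Latin or Cyrillic),
# immediately followed by a bracketed extension.
ABBR_WITH_EXPANSION = re.compile(
    r"\b([A-ZА-Я][A-ZА-Я.\-]{1,9})\s*\(([^)]+)\)"
)

def extract_abbreviations(text):
    """Return (abbreviation, extension) pairs found in the text."""
    return ABBR_WITH_EXPANSION.findall(text)
```

Pairing the abbreviation with its extension at extraction time is what removes the ambiguity: the bracketed phrase identifies which reading of the abbreviation is intended.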
Dealing with abbreviations
Our approach includes three steps:
• Typological classification - the existing classifications were refined with respect to the electronic treatment of abbreviations
• Extraction - different criteria were proposed for the extraction of the most frequent abbreviations in the archive
• Linguistic treatment - the abbreviations were extended and
the relevant grammatical information was added
Typological classification
Linguistic treatment (1)
• Encoding the linguistic information shared by all abbreviations:
– the head element presents the abbreviation itself
– every abbreviation has a generalized type: acronym or word
– every abbreviation has at least one extension
– every extension element consists of a phrase
Linguistic treatment (2)
• Encoding the linguistic information shared by some types of abbreviations:
– the non-lexicalized abbreviations were assigned grammatical information according to their syntactic head; thus the element 'class' was introduced
– the partly lexicalized abbreviations were additionally assigned grammatical information according to their inflection; thus the element 'flex' was introduced
– the abbreviations of foreign origin usually have an additional head element, called headforeign (headf)
Examples (1)
type ACRONYM
<abbr><head>АЧП</head><acronym/><expan><phrase>Агенция за чуждестранна помощ</phrase><class>Сжед</class></expan></abbr>
<abbr><head>ДП</head><acronym/>
<expan><phrase>Държавно предприятие</phrase> <class>Ссред</class> </expan>
<expan><phrase>Демократическа партия</phrase> <class>Сжед</class></expan></abbr>
<abbr><head>ЗУНК</head><acronym/><expan><phrase>Закон за уреждане на необслужваните кредити</phrase><class>Смед</class>
<flex>ЗУНК-а,ЗУНК-ът,ЗУНК-ове</flex></expan></abbr>
<abbr><head>ФБР</head><headf>FBI</headf><acronym/>
<expan><phrase>Федерално бюро за разследване</phrase>
<class>Ссред</class></expan></abbr>
Examples (2)
type WORD
<abbr><head>г-ца</head> <word/>
<expan><phrase>госпожица</phrase></expan></abbr>
<abbr><head>гр.</head><word/>
<expan><phrase>град</phrase></expan></abbr>
<abbr><head>в.</head><head>в-к</head><word/>
<expan><phrase>вестник</phrase></expan></abbr>
<abbr><head>ез.</head><word/>
<expan><phrase>езеро</phrase></expan>
<expan><phrase>език</phrase></expan></abbr>
Evaluation
The method is hard to evaluate absolutely with respect to only one class of tokens
We apply only relative evaluation with respect to a given rank threshold
Only the precision measure is really applicable
The recall is practically equal to 100%
Precision = 98.7% for the first 557 candidates (Rank >= 25)
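The relative evaluation amounts to measuring precision among the candidates whose rank meets the threshold (recall being taken as practically 100% on this data). A minimal sketch:

```python
def precision_at(candidates, threshold):
    """Precision among candidates with rank >= threshold.

    candidates: list of (rank, is_correct) pairs, where is_correct
    records whether the candidate truly belongs to the category.
    """
    kept = [ok for r, ok in candidates if r >= threshold]
    return sum(kept) / len(kept) if kept else 0.0
```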
Other applications
• Classification and linguistic treatment of other classes of tokens: names, sentence boundary markers
(similar to abbreviation)
• Determination of the vocabulary of a dictionary for human use
Lexemes with great impact over present-day texts will be chosen
Similar treatment of new words
Future work
• Dealing with different ambiguities
• Combination with other methods, such as document-centered approaches and morphological guessers
• Using other stochastic methods