
Automatic construction of a TMF Terminological

Database using a transducer cascade

Chihebeddine Ammar, Kais Haddar, Laurent Romary

To cite this version:

Chihebeddine Ammar, Kais Haddar, Laurent Romary. Automatic construction of a TMF Terminological Database using a transducer cascade. RANLP-2015, Sep 2015, Hissar, Bulgaria. Proceedings of the International Conference ”Recent Advances in Natural Language Processing”. <http://lml.bas.bg/ranlp2015>. <hal-01276816>

HAL Id: hal-01276816

https://hal.inria.fr/hal-01276816

Submitted on 22 Feb 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Distributed under a Creative Commons Attribution 4.0 International License


Automatic construction of a TMF Terminological Database using a transducer cascade

Chihebeddine Ammar
MIRACL & Sfax University
Sfax, [email protected]

Kais Haddar
MIRACL & Sfax University
Sfax, [email protected]

Laurent Romary
INRIA & Humboldt University
Berlin, [email protected]

Abstract

The automatic development of terminological databases, especially in a standardized format, is crucial for multiple applications related to technical and scientific knowledge that require semantic and terminological descriptions covering multiple domains. In this context, we face two challenges: the first is the automatic extraction of terms in order to build a terminological database, and the second is their normalization into a standardized format. To deal with these challenges, we propose an approach based on a cascade of transducers built with the CasSys tool of the Unitex platform, which benefits both from the success of the rule-based approach for the extraction of terms and from the performance of the TMF standard for the representation of terms. We have tested and evaluated our approach on Arabic scientific and technical documents for the elevator domain, and the results are very encouraging.

1 Introduction

The automation of terminology work will reduce the time and cost that terminological database construction usually takes. It will also help us to construct terminological databases with broad coverage, especially for recent concepts and poorly covered languages (Arabic, for example). On the other side, the representation of terminological data in a standard format allows the integration and merging of terminological data from multiple source systems, while improving terminological data quality and maintaining maximum interoperability between different applications.

Scientific and technical documents are a working area that is very rich in terminology. They cover several scientific and technical fields. That is why we will need several terminological databases, one for each field. For this reason, we decided to work on a specific domain: elevators.

To automate any process, we need a framework, and the choice of this framework is not an easy task. In fact, many frameworks exist, based on formal grammars, logical formalisms, discrete mathematics, etc. The rule-based approach requires a thorough study of the characteristics of terms and the construction of the necessary resources, such as dictionaries, trigger words and extraction rules.

Finite automata, and in particular transducers, are often used in Natural Language Processing (NLP). The general idea is to replace the rules of formal grammars with representation forms. Transducers offer a particularly clean and simple formulation, and their graphical representation proves their capability of representing complex grammars. They have been successful for the extraction of named entities (NE) and terms. Indeed, rule-based systems achieve particularly high precision.

Another issue is to decide which standard to choose to model our terminological databases, which standard best represents scientific and technical terms, and which model to use: onomasiological or semasiological?

Our main objective is to create, from a corpus of Arabic scientific and technical documents (patents, manuals, scientific papers), a standardized terminological resource able to support automatic text processing applications. Our approach is based on a cascade of transducers performed using the CasSys tool of Unitex. It aims to extract technical terms of a specific field (here, the field of lifts) and annotate them in the standardized TMF (Terminological Markup Framework) form. The first step is a pre-treatment consisting in resolving some problems of the Arabic language (e.g. agglutination). The second step is to extract and annotate terms.


And the final one is a post-treatment consisting in cleaning the documents.

This paper is organized as follows. Section 2 is devoted to the presentation of previous work. Section 3 presents the characteristics of Arabic scientific and technical terms. In section 4, we argue for our choice of terminology model. In section 5, we present our approach. Section 6 is devoted to experimentation and evaluation, and we conclude and enunciate some perspectives in section 7.

2 Previous work

Three methods for building a terminological knowledge base exist: manual, semi-automatic and automatic. In the literature, there are some terminological databases for scientific and technical fields; most of them were constructed manually or semi-automatically.

For instance, the multilingual terminology of the European Union, IATE [1], contains 8.4 million terms in 23 languages covering EU-specific terminology as well as multiple fields such as agriculture or information technology. The multilingual terminology portal of the World Intellectual Property Office, WIPO Pearl [2], gives access to scientific and technical terms in ten languages, including Arabic, derived from patent documents. It contains 15,000 concepts and 90,000 terms. Since WIPO does not have a collection of Arabic patents, Arabic terms are often translations produced by the WIPO translation service. In (Lopez and Romary, 2010b), the authors developed a multilingual terminological database called GRISP covering multiple technical and scientific fields from various open resources.

Three main approaches are generally followed for extraction: the rule-based (or linguistic) approach, the training-based (or statistical) approach and the hybrid approach. What distinguishes these approaches is not the type of information considered, but how it is acquired and handled. The linguistic approach is based on human intuition, with the manual construction of analysis models, usually in the form of contextual rules. It requires a thorough study of the types of terms, but it has been successful for the extraction of NE and terms. In fact, precision is higher for symbolic systems.

In previous work on non-scientific and technical documents, some authors used linguistic methods based on syntactic analysis (see for instance (Bourigault, 1992) and (Bourigault, 1994)). But the most widely used approach is a hybrid one, combining statistical and linguistic techniques (Dagan and Church, 1994).

[1] http://iate.europa.eu
[2] http://www.wipo.int/wipopearl/search/home.html

The most recent work on scientific and technical documents was mainly based on purely statistical approaches, using standard techniques of information retrieval and data extraction. Some of it uses machine learning tools to extract header metadata using support vector machines (SVM) (Do et al., 2013), hidden Markov models (HMM) (Binge, 2009), or conditional random fields (CRF) (Lopez, 2009). Others use machine learning tools to extract the metadata of citations (Hetzner, 2008), tables (Liu et al., 2007) or figures (Choudhury et al., 2013), or to identify concepts (Rao et al., 2013). All these approaches rely on prior training and natural language processing.

The need to allow exchanges between reference formats (Geneter, DXLT, etc.) led to the birth of the standard ISO 16642, TMF, specifying the minimum structural requirements to be met by every TML (Terminological Markup Language).

3 Characteristics of Arabic scientific and technical terms

Our study corpus contains 60 Arabic documents: 50 patents, 5 scientific papers and 5 manuals and installation documents of elevators, collected from multiple resources: manuals from the websites of elevator manufacturers, patents from multiple Arabic intellectual property offices and scientific papers from some Arabic journals. All of these documents are text files and contain a total number of 619k tokens.

This corpus will allow us to construct the necessary resources, such as dictionaries, trigger words and extraction rules, and to study the characteristics of Arabic terms. Indeed, we noted the existence of some semantic relationships among the terms of our collection, such as synonymy.

In fact, some terms have the same signified and different signifiers. For example, مصعد بدون وزن عكسي signifies مصعد بدون وزن معادل ”elevator without counterweight”. Here, the two terms have the same part (مصعد بدون وزن ”elevator without weight”) and two synonymous words (معادل ”equivalent” and عكسي ”reverse”).

Another type of semantic relationship is the hierarchical relationship, which runs in two directions. Firstly, from the generic term to the specific term(s) (from hyperonym to hyponym). For example, hyperonym: مركبة ”vehicle”; hyponyms: مصعد ”elevator”, عربة ”car”. Secondly, from the whole to its different parts (from holonym to meronyms). For example, holonym: مصعد ”elevator”; meronyms: عربة ”car”, باب ”door”, زر ”button”, etc.

In Arabic texts, some factors make their automatic analysis a painful task, such as the agglutination of Arabic terms. In fact, Arabic is a highly agglutinative language, since clitics stick to the nouns, verbs and adjectives to which they relate. Therefore, we find particles that stick to the radicals, preventing their detection. Indeed, textual forms are made up of the agglutination of prefixes (articles: definite article ال ”the”, prepositions: ل ”for”, conjunctions: و ”and”) and suffixes (linked pronouns) to the stems (inflected forms: أبوابه ”its doors” = أبواب ”doors” + ه ”its”).
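The clitic segmentation the pre-treatment step performs can be sketched as follows. This is an illustrative simplification, not the paper's actual transducer: the affix lists are a tiny assumed sample, and the `segment` helper only accepts a split when the remaining stem is found in a dictionary, as a real system would.

```python
# Illustrative sketch of Arabic clitic segmentation (NOT the paper's transducer).
# PREFIXES/SUFFIXES are a tiny assumed sample of the clitics mentioned above.

PREFIXES = ["ال", "و", "ل", "ب"]   # definite article "the", "and", "for", "with"
SUFFIXES = ["ه", "ها", "هم"]        # "its/his", "her", "their"

def segment(token, stems):
    """Return (prefix, stem, suffix) if stripping known clitics leaves a
    stem found in the dictionary; otherwise return the token unsplit."""
    for p in [""] + PREFIXES:
        for s in [""] + SUFFIXES:
            if not token.startswith(p) or not token.endswith(s):
                continue
            end = len(token) - len(s)
            core = token[len(p):end]
            if core in stems:
                return (p, core, s)
    return ("", token, "")

stems = {"أبواب"}                   # "doors"
print(segment("أبوابه", stems))      # splits off the suffix "its"
```

A dictionary lookup is what keeps the split safe: without it, any word that happens to start with و ”and” would be wrongly decomposed.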

Another problem is ambiguity, which may be caused by several factors. For example, Arabic is one of the Semitic languages, which are defined as diacritized languages. Unfortunately, diacritics are rarely used in current Arabic writing conventions, so two or more Arabic words can be homographic, such as يَعُدْ ”return”, يُعِدُّ ”prepare” and يَعُدُّ ”count”.
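This homography can be reproduced mechanically: Arabic diacritics are Unicode combining marks, so stripping them from the three vocalized verbs (the vocalizations here are our illustrative reading of the example) leaves a single undiacritized surface form.

```python
import unicodedata

def strip_diacritics(word):
    """Remove Arabic short vowels, shadda and sukun, which are all
    Unicode nonspacing marks (category 'Mn')."""
    return "".join(c for c in word if unicodedata.category(c) != "Mn")

# "return", "prepare", "count" in vocalized form (illustrative vocalizations)
words = ["يَعُدْ", "يُعِدُّ", "يَعُدُّ"]
stripped = {strip_diacritics(w) for w in words}
print(stripped)   # all three collapse to the same undiacritized form
```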

Although the documents of our corpus are in Arabic, some of them include a literal translation of key terms and technical words. These translations can be in English or French and are usually of very high quality, because they are made by professional human translators. They facilitate the implementation of our terminological database (the Language Section and Term Section of the TMF model) and make it multilingual.

4 TMF Terminological Model

Terminology is interested in what the term means: the notions and concepts, and the words or phrases that name them. This is the notional or conceptual approach. Motivated by industrial terminology practice, the Terminological Markup Framework (TMF [3]) (Romary, 2001) was developed as a standard for onomasiological (sense-to-term) resources. In this paper, we need a generic model able to cover a variety of terminological resources. That is why we consider that the TMF standard is the most appropriate for our terminological database. The meta-model of the TMF standard is defined by logical hierarchical levels. It thus represents a structural hierarchy of the relevant nodes in linguistic description. The meta-model describes the main structural elements and their internal connections.

It is combined with data categories (ISO 12620 [4]) from a data category selection (DCS). Using the data model based on ISO 16642 allows us to fulfill the requirements of standardization, to exploit a Data Category Registry (DCR) following the ISO 12620 standard for facilitating the implementation of filters and converters between different terminology instances, and to produce a Generic Mapping Tool (GMT) representation, i.e. a canonical XML representation. The main role of our terminological extractor is to automatically generate terms in the GMT format and create a normalized terminological database of scientific and technical terms.

Figure 1 shows an example of a scientific terminological entry (multi-car elevator) in the form of an XML document conforming to GMT in two languages (Arabic and English).
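An entry of the kind shown in Figure 1 could be generated as follows. The sketch follows the GMT style of nested `struct` levels (Terminological Entry, Language Section, Term Section) with `feat` children, but the exact attribute names and the Arabic term used here are our assumptions, not a copy of the paper's figure.

```python
import xml.etree.ElementTree as ET

def gmt_entry(concept_terms):
    """Build a GMT-style entry: one Terminological Entry (TE) containing one
    Language Section (LS) per language, each holding a Term Section (TS)."""
    te = ET.Element("struct", type="TE")
    for lang, term in concept_terms.items():
        ls = ET.SubElement(te, "struct", type="LS")
        ET.SubElement(ls, "feat", type="language").text = lang
        ts = ET.SubElement(ls, "struct", type="TS")
        ET.SubElement(ts, "feat", type="term").text = term
    return te

# hypothetical two-language entry for "multi-car elevator"
entry = gmt_entry({"en": "multi-car elevator", "ar": "مصعد متعدد العربات"})
print(ET.tostring(entry, encoding="unicode"))
```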

5 Proposed approach

The extraction method for Arabic terms that we advocate is rule-based. The rules, built manually, express the structure of the information to extract and take the form of transducers. These transducers generally exploit morphosyntactic information, as well as the information contained in the resources (lexicons or dictionaries). Moreover, they allow the description of the possible sequences of constituents of Arabic terms belonging to the field of elevators. The approach that we propose to extract terms for the field of elevators is composed of two steps (Figure 2): (i) identifying the resources needed to recognize the terms to extract, and (ii) creating a cascade of transducers, each of which has its own role.

[3] ISO 16642:2003. Computer Applications in Terminology: Terminological Markup Framework
[4] ISO 12620:2009. Terminology and Other Language and Content Resources – Specification of Data Categories and Management of a Data Category Registry for Language Resources

Figure 1: Terminological entry conforming to GMT

In the following, we detail the different resources and steps of our approach.

5.1 Necessary linguistic resources

For our approach, we construct linguistic resources from our study corpus, such as dictionaries, trigger words and extraction rules (syntactic patterns). In the following, we present these resources.

5.1.1 Dictionaries

For the elevator domain, the subject of our study, we identified the following dictionaries: a dictionary of inflected nouns and their canonical forms, a dictionary of inflected verbs, a dictionary of adjectives, a dictionary of trigger words of the domain, and dictionaries of particles, possessive pronouns, demonstrative pronouns and relative pronouns. The structure of the various dictionary entries is not the same; it can vary from one dictionary to another. It must contain the grammatical category of the entry (noun, adjective), but, depending on the dictionary, it may also contain: gender (masculine, feminine or neutral), number (singular, dual, plural or broken plural), definition (defined or undefined), case (accusative, nominative or genitive) or mood (indicative, subjunctive or jussive), person (1st, 2nd or 3rd) and voice (active or passive).

Figure 2: Proposed approach
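An entry of the kind described in Section 5.1.1 could be modelled as follows. The field names and the sample entry are illustrative assumptions; the actual Unitex dictionaries use their own line-based format rather than Python objects.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DictEntry:
    """One inflected-form dictionary entry; the optional fields vary
    from one dictionary to another, as described above."""
    form: str                         # inflected surface form
    lemma: str                        # canonical form
    pos: str                          # grammatical category: "N", "Adj", "V", ...
    gender: Optional[str] = None      # m / f
    number: Optional[str] = None      # s / d / p
    definition: Optional[str] = None  # r (defined) / n (undefined)
    case: Optional[str] = None        # a / u / i

# sample entry (hypothetical): the noun "elevator", masculine singular, undefined
entry = DictEntry(form="مصعد", lemma="مصعد", pos="N",
                  gender="m", number="s", definition="n")
print(entry.pos, entry.gender, entry.number)
```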

5.1.2 Trigger words

The extraction rules generally use morphosyntactic information, such as trigger words, for the detection of the beginning of a term. We opted for an increased number of rules and triggers in order to have an extraction system that is as efficient as possible. We identified 162 trigger words, some of which can trigger the recognition of up to 5 terms. For this reason, we classified them into classes.

5.1.3 Extraction rules

To facilitate the identification of the transducers necessary for the extraction of terms, we built a set of extraction rules. Indeed, they give the arrangement of the various constituents of the terms in a form readily transferable into linear graphs. We identified 12 extraction rules; Table 1 shows some of them. Four grammatical features are attributed here: gender (masculine (m) or feminine (f)), number (singular (s), dual (d) or plural (p)), definition (defined (r) or undefined (n)) and case (accusative (a), nominative (u) or genitive (i)).

R1: <Pattern 1> := <Trigger word> <N:nums> <PREP> (<N:nums>)+
R2: <Pattern 2> := <N:nums> <PREP> <N:nufs> [<Adj:nufs>]
R3: <Pattern 3> := <Trigger word> <N:nums> <Adj:nums>
R4: <Pattern 4> := <Trigger word> <N:nufs> <N:rums>
R5: <Pattern 5> := <Trigger word> <N:nufp> <N:rums>

Table 1: Some extraction rules of Arabic patent terms

Examples of trigger words are: تحريك ”mobilization” for rules R1 and R5, صيانة ”maintenance” for rule R3, and رفع ”lifting” for rule R4. Table 2 shows some terms extracted by the preceding extraction rules (here identified by their number in Table 1).

R1: مصعد بدون وزن عكسي ”elevator without counterweight”
R2: مصعد ببكرة محزوزة ”elevator with splined roller”
R3: مصعد آلي ”automatic elevator”
R4: عربة المصعد ”elevator car”, لوحة التحكم ”control panel”
R5: أحبال الرفع ”hoisting ropes”

Table 2: Terms extracted by the extraction rules
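The rules of Table 1 can be read as patterns over part-of-speech-tagged tokens. A toy matcher for a rule of the R3 shape (<Trigger word> <N> <Adj>) might look like this; the tagging scheme and the trigger set are assumptions for illustration, and a real system would also check the gender, number, definition and case features carried by the dictionaries.

```python
# Toy matcher for an R3-shaped rule: <Trigger word> <N> <Adj>.
# Tokens are (surface, tag) pairs; feature agreement checks are omitted.

TRIGGERS = {"صيانة"}   # "maintenance", an R3 trigger from the paper

def match_r3(tagged):
    """Yield each <N> <Adj> sequence that immediately follows a trigger word."""
    for i in range(len(tagged) - 2):
        w0, _ = tagged[i]
        w1, t1 = tagged[i + 1]
        w2, t2 = tagged[i + 2]
        if w0 in TRIGGERS and t1 == "N" and t2 == "Adj":
            yield f"{w1} {w2}"

sentence = [("صيانة", "N"), ("مصعد", "N"), ("آلي", "Adj")]
print(list(match_r3(sentence)))   # extracts the term "automatic elevator"
```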

5.2 Implementation of extraction rules

We created three types of transducers. The first is the pre-treatment transducer resolving the agglutination of Arabic prefixes and suffixes. To recognize the agglutinative character, we must enter inside the token. As Unitex works on a tokenized version of the text, it is not possible to make queries entering within the tokens, except with morphological filters or the morphological mode, which is more appropriate in our case. To do this, we must define the whole portion of grammar using the symbols < and >, as presented in Figure 3. The transducer annotates every part of the agglutinated token with the appropriate grammatical category.

Figure 3: Transducer for the resolution of agglutination

The second transducer, as shown in Figure 4, includes all the subgraphs of term extraction and annotation in the GMT format (the ”extraction transducers” box). In order to improve term extraction, trigger words are regrouped into the ”trigger words” box.

Figure 4: The main extraction transducer

Figure 5 shows one of the transducers that extract and annotate terms. It also recognizes the French or English translation of a term, if available, thanks to the ”French Translation” and ”English Translation” subgraphs, and annotates it in a new Language Section (LS) in the GMT format, as shown in Figure 1.

The final transducer is a post-treatment transducer consisting in document cleaning: its role is to delete all remaining text (whatever is not XML). Figure 6 is an overview of this transducer. The subgraph ”XML” recognizes all the XML elements that could be contained by the <struct type=”TE”> GMT element.
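A rough plain-Python stand-in for this cleaning step: keep only the annotated <struct type="TE"> elements and drop the raw text around them. This is a simplified sketch of the behaviour, not the paper's transducer; the depth counter is what lets it keep the nested LS/TS struct elements intact.

```python
import re

def keep_te_elements(text):
    """Return only the <struct type="TE"> ... </struct> spans, dropping all
    raw text in between. Depth counting handles nested struct elements."""
    out, buf, depth = [], [], 0
    for piece in re.split(r'(<struct[^>]*>|</struct>)', text):
        if piece.startswith('<struct') and (depth or 'type="TE"' in piece):
            depth += 1                      # entering a TE, or a struct inside one
        elif piece == '</struct>' and depth:
            buf.append(piece)
            depth -= 1
            if depth == 0:                  # a complete TE element has closed
                out.append(''.join(buf))
                buf = []
            continue
        if depth:                           # keep everything inside a TE
            buf.append(piece)
    return '\n'.join(out)

sample = 'raw text <struct type="TE"><feat type="term">مصعد</feat></struct> leftover'
print(keep_te_elements(sample))   # keeps only the TE element
```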


Figure 5: Example of extraction subgraph

6 Experimentation and evaluation

Our test corpus contains 160 Arabic documents from multiple resources: 100 patents, 50 scientific papers and 10 manuals and installation documents of elevators, with a total number of 1,665,635 tokens. Our transducers are called in a specific order in a transducer cascade that is directly implemented in the linguistic platform Unitex [5] using the CasSys tool (Friburger and Maurel, 2004). Each graph adds its own annotations thanks to the ”Replace” mode. This mode provides, as output, a recognized term surrounded by a GMT annotation defined in the transducers.
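The cascade-with-Replace behaviour can be pictured as ordinary function composition: each stage rewrites the text produced by the previous one. A schematic sketch, in which the three rewriting functions are pure placeholders for the actual transducers:

```python
from functools import reduce

# Placeholder "transducers": each rewrites the output of the previous stage,
# mirroring the ordered cascade and its "Replace" mode. The markers used
# here ("|AGG|", "TERM") are invented for the illustration.
def pretreatment(text):
    return text.replace("|AGG|", " ")      # e.g. undo an agglutination mark

def extraction(text):
    return text.replace("TERM", '<struct type="TE">TERM</struct>')

def posttreatment(text):
    return text.strip()                    # cleaning step

CASCADE = [pretreatment, extraction, posttreatment]

def run_cascade(text):
    """Apply each transducer, in order, to the output of the previous one."""
    return reduce(lambda t, f: f(t), CASCADE, text)

print(run_cascade("  TERM|AGG|here  "))
```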

In order to conduct an evaluation, we applied the implemented cascade to the test corpus and manually evaluated the quality of the results. The total number of terms is 852. Table 3 gives an overview of the obtained results.

Terms   Extracted terms   Erroneous terms
852     827               59

Table 3: Overview of the obtained results

The obtained results are satisfactory: the transducers were able to cover the majority of the terms contained therein, with a precision of 0.95, a recall of 0.97 and an F-score of 0.95. We therefore find that the proposed method is effective.
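For reference, these figures are computed from counts like those of Table 3. A sketch of the standard formulas, assuming that ”erroneous terms” means false positives among the extracted terms (the counts below are illustrative, not the paper's):

```python
def prf(n_gold, n_extracted, n_erroneous):
    """Precision, recall and F-score from three counts, assuming
    "erroneous" counts false positives among the extracted terms."""
    correct = n_extracted - n_erroneous
    precision = correct / n_extracted          # correct / all extracted
    recall = correct / n_gold                  # correct / all true terms
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# illustrative counts (not the paper's):
p, r, f = prf(n_gold=100, n_extracted=90, n_erroneous=9)
print(round(p, 2), round(r, 2), round(f, 2))   # 0.9 0.81 0.85
```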

[5] Unitex: http://www-igm.univ-mlv.fr/~unitex/

Figure 6: Post-treatment transducer

The noise can be caused by the absence of diacritics in our corpus and dictionaries, which could create ambiguity problems. It may also be caused by the absence of high-granularity features in our dictionary entries. For this reason, we will try to add other semantic and grammatical features to our dictionary entries to improve our results. Despite the good results, we had to send our terminological database to a terminologist to correct the erroneous terms and their definitions. We believe that the automatic integration and merging of our database with other existing databases can help us to automatically correct errors.

7 Conclusion

In this paper, we built a set of transducers, then generated a cascade allowing the extraction of scientific and technical terms. The extracted terms were represented in a standardized format (GMT). The generation of this cascade is performed using the CasSys tool, built into the Unitex linguistic platform. The operation of the transducer cascade required the construction of resources such as dictionaries.

In the immediate future, we will create a transducer cascade to extract bibliographic data and the metadata of citations, tables, formulas and figures from scientific and technical documents and patents. We will also extract terms using a statistical approach. Finally, we will try to combine the two approaches into a hybrid one.


References

Cui Binge. 2009. Scientific literature metadata extraction based on HMM. CDVE 2009, pages 64-68. Luxembourg, Luxembourg.

Didier Bourigault. 1992. Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases. In Proceedings of the 14th International Conference on Computational Linguistics COLING'92, volume 3, pages 977-981. Nantes, France.

Didier Bourigault. 1994. LEXTER, un Logiciel d'Extraction de TERminologie. Application à l'Acquisition de Connaissances à Partir de Textes. Doctoral thesis. École des Hautes Études en Sciences Sociales.

Erik Hetzner. 2008. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008, pages 280-284. New York, USA.

Huy H.N. Do, Muthu K. Chandrasekaran, Philip S. Cho and Min Y. Kan. 2013. Extracting and Matching Authors and Affiliations in Scholarly Documents. JCDL 2013. Indianapolis, Indiana, USA.

Ido Dagan and Ken Church. 1994. Termight: Identifying and Translating Technical Terminology. In Proceedings of the 4th Applied Natural Language Processing Conference ANLP'94, pages 34-40. Stuttgart, Germany.

Laurent Romary. 2001. An Abstract Model for the Representation of Multilingual Terminological Data: TMF Terminological Markup Framework. TAMA 2001. Antwerp, Belgium.

Nathalie Friburger and Denis Maurel. 2004. Finite-state transducer cascades to extract named entities in texts. In Theoretical Computer Science, volume 313, pages 93-104.

Patrice Lopez. 2009. GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. ECDL 2009. Corfu, Greece.

Patrice Lopez and Laurent Romary. 2010b. GRISP: A Massive Multilingual Terminological Database for Scientific and Technical Domains. LREC 2010. La Valette, Malta.

Pattabhi R.K. Rao, Sobha L. Devi and Paolo Rosso. 2013. Automatic Identification of Concepts and Conceptual Relations from Patents Using Machine Learning Methods. ICON 2013, pages 18-20. Noida, India.

Sagnik R. Choudhury, Prasenjit Mitra, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones and C. Lee Giles. 2013. Figure Metadata Extraction from Digital Documents. ICDAR 2013, pages 135-139. Washington, USA.

Ying Liu, Kun Bai, Prasenjit Mitra and C. Lee Giles. 2007. TableSeer: automatic table metadata extraction and searching in digital libraries. JCDL 2007, pages 91-100. Vancouver, Canada.