NLP resources:
construction, standardization, exploitation & API
Karim Bouzoubaa
outline
• Exploitation
• NLP resources
• Construction
• Standardization
• API
Exploitation
LRs are used in various NLP software tools:
• morphological, syntactic and semantic analysis
• automatic translation
• automatic generation of texts
• spell-checking
• automatic summarization
• handwriting recognition
• reformulation and paraphrasing
• information search and text mining
NLP Resources
Resources
• Introduction - Definition
• Types
• Examples
• Evaluation criteria
Introduction - Definition
• The key to NLT development is the language resource
• Resource production takes a lot of effort and is very expensive
Example: the standard Arabic LC-STAR phonetic lexicon from the European Language Resources Association (ELRA), with 110,271 entries, costs 21,250.00 EUR (for use in academic research)
Language resources are language-related data,
accessible in an electronic format, and used for
the development of NLP systems
Types – 2 categories
1. Corpus
• written: monolingual texts, multilingual texts, annotated texts, treebanks
• speech: texts read aloud, speeches, dialogues, radio and television broadcasts
• multimedia: images, sounds and videos
2. Lexicon
• monolingual and multilingual dictionaries
• gazetteers (geographical dictionaries)
• terminologies
• ontologies
Content of a lexicon
An entry in the lexicon may contain:
• morphological, syntactic, semantic and pragmatic information
• the grammatical category (noun, verb, etc.)
o subcategory properties (transitive verb or not, masculine or feminine)
• semantic information (animate noun, verb requiring a human subject)
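A lexicon entry carrying the kinds of information listed above can be sketched as a small record. This is only an illustration: the class name `LexiconEntry`, its field names and the sample word are assumptions made for the example, not part of any resource named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """Hypothetical lexicon entry; field names are illustrative only."""
    lemma: str
    category: str                                     # grammatical category: noun, verb, ...
    subcategory: dict = field(default_factory=dict)   # e.g. transitivity, gender
    semantics: dict = field(default_factory=dict)     # e.g. constraints on the subject


def is_transitive_verb(entry: LexiconEntry) -> bool:
    """True when the entry is a verb marked as transitive."""
    return entry.category == "verb" and entry.subcategory.get("transitive", False)


# Sample entry: a transitive verb that requires a human subject.
entry = LexiconEntry(
    lemma="read",
    category="verb",
    subcategory={"transitive": True},
    semantics={"subject": "human"},
)
```

A tool such as a parser could then query `is_transitive_verb(entry)` or `entry.semantics["subject"]` without knowing how the lexicon is stored.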
Examples
• Oxford Dictionary
• VerbNet
Evaluation criteria
• Formal (regardless of content)
  - Size
  - Maintenance (durability, scalability)
  - Compatibility
• Functional (language criteria)
  - Lexicographic annotation (existence and relevance)
  - Intrinsic rules
Construction
• Production cycle
  - Creating resources
  - Example (Contemporary Arabic)
  - Reusing resources
  - Example of free resources
• Good practices
  - Documentation
  - Interoperability
  - Viability
Creating resources
There are two approaches to developing LRs:
• creating new resources
• tuning existing resources
To create a new resource, collect "authentic" data, of a general nature or belonging to a particular sector of activity, directly in digital form or, in some cases, by digitizing it.
Example of creating resources
Contemporary Arabic
Resource reuse
• Reuse means adapting an existing resource so that it can perform certain functions, and perform them better, in a usage environment different from the original one
• Example: ....
Example of free resources
Corpus
• Corpus of Contemporary Arabic
• Khoja POS tagged corpus
• Quranic Arabic
• Collection of free Arabic texts and books:
  - Almeshkat
  - Al-Eman
Lexicon
• Buckwalter's list of Arabic roots
• Al-Baheth Al-Arabi
Good practices
To contribute to the creation of a set of sustainable LRs, some principles must be respected:
• Resource documentation
• Interoperability of resources
Documentation of resources
LRs are often poorly documented or not documented at all.
Documentation should be as comprehensive as possible, and include information on:
• the format of the data
• the content of the data
• the production context
• the possible uses
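The four documentation items above lend themselves to a simple completeness check on a resource's metadata record. The sketch below is only an illustration: the field names in `REQUIRED_FIELDS` and the sample record are assumptions chosen to mirror the list, not an established metadata schema.

```python
# Hypothetical field names mirroring the four documentation items above.
REQUIRED_FIELDS = ("format", "content", "production_context", "possible_uses")


def missing_documentation(metadata: dict) -> list:
    """Return the required fields that are absent or empty in a metadata record."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(metadata.get(name, "")).strip()
    ]


# Illustrative record: two of the four items are filled in.
record = {
    "format": "UTF-8 plain text, one sentence per line",
    "content": "newspaper articles with part-of-speech tags",
    "production_context": "",
}

missing = missing_documentation(record)
```

Here `missing` lists the production context and the possible uses, the two items a maintainer would still have to document before distributing the resource.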
Resource interoperability
• The interoperability of LRs is their ability to operate in different systems
• The formats of the LRs must be standard
Interoperability – documentation – reuse
Many difficulties are encountered when reusing available LRs.
• Contribute to the development of LRs that respect interoperability rules:
  – Availability
  – Portability
  – Reusability
  – Normalization
Standardization
Why?
• How to integrate existing resources into one's own context?
• How to separate the resources from the tools that manage them?
Panorama
Standardisation agencies:
• CNIS: China National Institute of Standardization
• AFNOR: Association Française de Normalisation
• DIN: Deutsches Institut für Normung
• ANSI: American National Standards Institute
• W3C: World Wide Web Consortium
• TEI: Text Encoding Initiative
• ISO: International Organization for Standardization
Projects:
• LIRICS: Linguistic Infrastructure for Interoperable Resources and Systems
• EAGLES: Expert Advisory Group on Language Engineering Standards
• Multext: Multilingual Text Tools and Corpora
Research structures:
• CLARIN: Common Language Resources and Technology Infrastructure
• FLaReNet: Fostering Language Resources Network
• Alpage: Analyse Linguistique Profonde A Grande Echelle
Organization
Standards proposition
Stages, in order, with the document produced at each stage:
• Preliminary: Preliminary Work Item (PWI)
• Proposal: New Work Item Proposal (NP)
• Preparatory: new project of the WG
• Committee: Committee Draft (CD)
• Enquiry: Draft International Standard (DIS)
• Approval: Final Draft International Standard (FDIS)
• Publication: International Standard (IS)
LMF
• Modeling Arabic inflection paradigms according to the LMF standard (Aïda Khemakhem et al., 2007)
• Automatic conversion of editorial dictionaries to LMF (Feten Baccar et al., 2008; Aïda Khemakhem et al., 2009)
• Domain ontology generation from LMF dictionaries (Feten Baccar et al., 2010)
• Proposed standardized representation of standard Arabic lexicons (Susanne Salmon-Alt et al., 2013)
• Detection of anomalies and evaluation of the content of LMF dictionaries (Wafa Wali et al., 2014)
• Realization of a system for producing Arabic dictionaries that respect the LMF standard (Mohammed Reqqass et al., 2014)
LMF Example
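A minimal sketch of what an LMF-style entry can look like when serialized as XML. It assumes the element names (LexicalResource, Lexicon, LexicalEntry, Lemma) and the feat att/val convention commonly used in LMF XML serializations. The helper names `feat` and `build_entry`, the sample word and its features are illustrative and not taken from these slides.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an attribute-value pair as a <feat att="..." val="..."/> child."""
    ET.SubElement(parent, "feat", att=att, val=val)


def build_entry(written_form, part_of_speech):
    """Build a LexicalEntry holding a part of speech and a Lemma."""
    entry = ET.Element("LexicalEntry")
    feat(entry, "partOfSpeech", part_of_speech)
    lemma = ET.SubElement(entry, "Lemma")
    feat(lemma, "writtenForm", written_form)
    return entry


# Assemble a one-entry lexicon (illustrative content only).
resource = ET.Element("LexicalResource")
lexicon = ET.SubElement(resource, "Lexicon")
feat(lexicon, "language", "ara")
lexicon.append(build_entry("كتاب", "noun"))

xml_text = ET.tostring(resource, encoding="unicode")
```

Keeping every property as an attribute-value pair is what lets the same tools read lexicons with different feature sets.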
TEI
<TEI>
  <teiHeader>
    <name>NAFIS Arabic Stemming Gold Standard</name>
    ...
  </teiHeader>
  <text>
    <phr>
      <val>عليكم بالجد فإنه أساس النجاح</val>
      <w rend="عليكم">
        <choice n="14">
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلَي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلِي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
          ...
        </choice>
      </w>
    </phr>
    ...
  </text>
</TEI>
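An annotation of this shape can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a cut-down fragment with the same element and attribute names. The function name `read_analyses` is hypothetical, and the fragment assumes no XML namespace, which a real TEI file may well declare.

```python
import xml.etree.ElementTree as ET

# Cut-down fragment with the same element and attribute names as the example.
SAMPLE = """<TEI>
  <text>
    <phr>
      <w rend="عليكم">
        <choice n="14">
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلَي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
        </choice>
      </w>
    </phr>
  </text>
</TEI>"""


def read_analyses(xml_text):
    """Map each annotated word to its list of candidate analyses."""
    root = ET.fromstring(xml_text)
    result = {}
    for w in root.iter("w"):
        analyses = []
        for seg in w.iter("seg"):
            # One dict per <seg>: <m> type (prefix, root, stem, suffix) -> text.
            analyses.append({m.get("type"): (m.text or "") for m in seg.iter("m")})
        result[w.get("rend")] = analyses
    return result


analyses = read_analyses(SAMPLE)
```

Each `<seg>` becomes one candidate analysis, so a stemmer's output can be compared against every alternative recorded for a word.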