NLP resources: construction, standardization, exploitation & API
Karim Bouzoubaa


Outline

• Exploitation
• NLP resources
• Construction
• Standardization
• API

Exploitation

Language resources (LRs) are used in various NLP software tools:

• morphological, syntactic and semantic analysis
• automatic translation
• automatic generation of texts
• spell-checking
• automatic summarization
• handwriting recognition
• reformulation and paraphrasing
• information search and text mining

NLP resources

• Introduction and definition
• Types
• Examples
• Evaluation criteria

Introduction - Definition

Language resources are language-related data, accessible in an electronic format, and used for the development of NLP systems.

• The key to NLT development is the language resource.
• Resource production takes a lot of effort and is very expensive.

Example: the Arabic standard LC-STAR phonetic lexicon of the European Language Resources Association (ELRA), with 110,271 entries, costs 21,250.00 EUR (for use in academic research).

Types: 2 categories

1. Corpus
   • written: monolingual texts, multilingual texts, annotated texts, treebanks
   • speech: texts read aloud, speeches, dialogues, radio and television broadcasts
   • multimedia: images, sounds and videos

2. Lexicon
   • monolingual and multilingual dictionaries
   • gazetteers (geographical dictionaries)
   • terminologies
   • ontologies

Content of a lexicon

An entry in the lexicon may contain:

• morphological, syntactic, semantic and pragmatic information
• the grammatical category (noun, verb, etc.)
   o subcategory properties (transitive verb or not, masculine or feminine)
• semantic information (animate noun, verb requiring a human subject)
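As an illustration, the kinds of information listed for a lexicon entry could be held in a small record type. This is only a sketch: the class name `LexiconEntry`, its field names, and the sample entry for "write" are assumptions made for the example, not part of any particular lexicon format.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """One lexicon entry (hypothetical layout, for illustration only)."""
    lemma: str
    category: str                                    # grammatical category: noun, verb, ...
    subcategory: dict = field(default_factory=dict)  # e.g. transitivity, gender
    semantics: dict = field(default_factory=dict)    # e.g. animacy, subject constraints


# Assumed sample entry: a transitive verb that requires a human subject.
write = LexiconEntry(
    lemma="write",
    category="verb",
    subcategory={"transitive": True},
    semantics={"subject": "human"},
)


def requires_human_subject(entry: LexiconEntry) -> bool:
    """Return True when the entry is a verb constrained to a human subject."""
    return entry.category == "verb" and entry.semantics.get("subject") == "human"


print(requires_human_subject(write))
```

Keeping subcategory and semantic properties in open dictionaries is one possible design choice; a real lexicon would more likely fix these attributes through a standard schema.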

Examples

• Oxford dictionary
• VerbNet

Evaluation criteria

• Formal (regardless of content)
   – Size
   – Maintenance (durability, scalability)
   – Compatibility
• Functional (language criteria)
   – Lexicographic annotation (existence and relevance)
   – Intrinsic rules

Construction

Production cycle
• Creating resources
• Example (Contemporary Arabic)
• Reusing resources
• Example of free resources

Good practices
• Documentation
• Interoperability
• Viability

Creating resources

There are two approaches for developing LRs:

• creating new resources
• tuning existing resources

Creating resources means collecting "authentic" data, of a general nature or belonging to a particular sector of activity, directly in digital form or, in some cases, by digitizing them.

Example of creating resources

Contemporary Arabic

Resource reuse

• Reuse is the operation of modifying a resource so that it can perform certain functions and work better in a usage environment different from the original one.
• Example: ....

Examples of free resources

Corpus
• Corpus of Contemporary Arabic
• Khoja POS-tagged corpus
• Quranic Arabic
• Collections of free Arabic texts and books:
   – Almeshkat
   – Al-Eman

Lexicon
• Buckwalter's list of Arabic roots
• Al-Baheth Al-Arabi

Good practices

To contribute to the creation of a set of sustainable LRs, some principles must be respected:

• Resource documentation
• Interoperability of resources

Documentation of resources

LRs are often poorly documented, or not documented at all. Documentation should be as comprehensive as possible, and include information on:

• the format of the data
• the content of the data
• the production context
• the possible uses
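One way to make these documentation requirements checkable is to keep them as a metadata record and flag whatever is missing. The sketch below assumes a plain dictionary per resource; the field names and the sample record are hypothetical and simply mirror the four items on this slide.

```python
# Field names are assumptions that mirror the four documentation items above.
REQUIRED_FIELDS = ("data_format", "content", "production_context", "possible_uses")


def missing_documentation(record: dict) -> list:
    """Return the required documentation fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not str(record.get(name) or "").strip()]


# Hypothetical record for an imaginary resource; "possible_uses" is left out.
record = {
    "data_format": "XML, UTF-8",
    "content": "news articles with part-of-speech tags",
    "production_context": "collected for a research project",
}

print(missing_documentation(record))
```

A check of this kind could run before a resource is published, so that undocumented fields are caught early.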

Resource interoperability

• The interoperability of LRs is their ability to operate in different systems.
• The formats of the LRs must be standard.

Interoperability - documentation - reuse

Many difficulties are encountered when reusing available LRs.

• Contribute to the development of LRs that respect interoperability rules:
   – Availability
   – Portability
   – Reusability
   – Normalization

Standardization

Why?

• How can existing resources be integrated into one's own context?
• How can resources be separated from the tools that manage them?

Panorama

Standardization agencies:
• CNIS: China National Institute of Standardization
• AFNOR: Association Française de Normalisation
• DIN: Deutsches Institut für Normung
• ANSI: American National Standards Institute
• W3C: World Wide Web Consortium
• TEI: Text Encoding Initiative
• ISO: International Organization for Standardization

Projects:
• LIRICS: Linguistic Infrastructure for Interoperable Resources and Systems
• EAGLES: Expert Advisory Group on Language Engineering Standards
• Multext: Multilingual Text Tools and Corpora

Research structures:
• CLARIN: Common Language Resources and Technology Infrastructure
• FLaReNet: Fostering Language Resources Network
• Alpage: Analyse Linguistique Profonde à Grande Échelle

Organization

Standards proposal stages

• Preliminary stage: Preliminary Work Item (PWI)
• Proposal stage: New Work Item Proposal (NP)
• Preparatory stage: new project of the working group (WG)
• Committee stage: Committee Draft (CD)
• Enquiry stage: Draft International Standard (DIS)
• Approval stage: Final Draft International Standard (FDIS)
• Publication stage: International Standard (IS)

LMF (Lexical Markup Framework)

• Modeling Arabic inflection paradigms according to the LMF standard – Aïda Khemakhem et al. 2007
• Automatic conversion of editorial dictionaries to LMF – Feten Baccar et al. 2008, Aïda Khemakhem et al. 2009
• Domain ontology generation from LMF dictionaries – Feten Baccar et al. 2010
• Proposed standardized representation of standard Arabic lexicons – Susanne Salmon-Alt et al. 2013
• Detection of anomalies and evaluation of the content of LMF dictionaries – Wafa Wali et al. 2014
• Realization of a system for producing Arabic dictionaries that respect the LMF standard – Mohammed Reqqass et al. 2014

LMF example

TEI

<TEI>
  <teiHeader>
    <name>NAFIS Arabic Stemming Gold Standard</name>
    ...
  </teiHeader>
  <text>
    <phr>
      <val>عليكم بالجد فإنه أساس النجاح</val>
      <w rend="عليكم">
        <choice n="14">
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلَي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلِي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
          ...
        </choice>
      </w>
    </phr>
    ...
  </text>
</TEI>
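An annotation of this kind can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a trimmed copy of the slide's example: the elided parts ("...") are simply left out, only two candidate segmentations are kept, and the function name `candidate_analyses` is an assumption made for the illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's example; elided parts are omitted, not filled in.
SAMPLE = """<TEI>
  <teiHeader><name>NAFIS Arabic Stemming Gold Standard</name></teiHeader>
  <text>
    <phr>
      <w rend="عليكم">
        <choice n="14">
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلَي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلِي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
        </choice>
      </w>
    </phr>
  </text>
</TEI>"""


def candidate_analyses(tei_xml: str) -> dict:
    """Map each annotated word to its list of candidate segmentations."""
    root = ET.fromstring(tei_xml)
    results = {}
    for word in root.iter("w"):
        analyses = []
        for seg in word.iter("seg"):
            # Collect every <m> under this <seg>, including those nested in <form>.
            analyses.append({m.get("type"): (m.text or "") for m in seg.iter("m")})
        results[word.get("rend")] = analyses
    return results


for surface, analyses in candidate_analyses(SAMPLE).items():
    print(surface, len(analyses), [a["stem"] for a in analyses])
```

Because the segmentation lives in the markup rather than in tool-specific code, any XML-aware tool can read the same annotation, which is the separation of resources from tools that standardization aims for.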