CITALA 2009 - Morocco
Rule-based approach in Arabic NLP: Tools, Systems and Resources
Dr Khaled Shaalan
Professor, Faculty of Computers & Information, Cairo University
On Secondment to BUiD, UAE
Khaled.shaalan@{buid.ac.ae, gmail.com}
Agenda
• Objective
• Language Tasks
• NLP Approaches
• Rule-based Arabic analysis and generation tools
• Rule-based Arabic NLP applications
• Some Arabic NLP free resources
• Major and Arabic mailing lists
• Conclusion
Objective
To show how the rule-based approach has been successfully used to develop Arabic natural language processing tools and applications.
Separating Language Tasks
• English vs. French vs. Arabic vs. ...
• Spoken language (dialogue) vs. written text vs. handwritten script
• Genuine script vs. transliterated (Romanized) script
• Vocalized (vowelized) vs. non-vocalized
• Understanding vs. generation
• First language learner vs. second language learner
• Classical or Qur'anic Arabic vs. Modern Standard Arabic vs. colloquial (dialects)
• Stem-based vs. root-based
Rules
Situation/Action
If match(stem.prefix, def_article)
then remove(stem.prefix, Stem_FS)
If match(stem.definiteness, indefinite)
then morph_gen(stem.definiteness, Stem_FS)
Common Mistake
A rule-based approach is not a rule-based expert system!
• Both consist of rules.
• A rule-based expert system solves the problem by means of:
  • a Recognize-Act Cycle loop
  • a conflict resolution strategy
Recognize-Act Cycle
[Figure: Recognize-Act Cycle over a Rule Base (domain knowledge) and a Fact Base (working memory): Match → Conflict Resolution → Execute, producing a New Fact or a New Rule.]
loop
1. Match: rules are compared to working memory to determine matches. If no rule matches, then stop.
2. Conflict Resolution: select or enable a single rule for execution.
3. Execute: fire the selected rule
   • Add a new fact, or
   • Learn a new rule
end loop
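The loop above can be sketched in a few lines of Python. This is only an illustration, not the design of any particular expert-system shell: the `Rule` type, the "first matching rule wins" conflict-resolution strategy, and the toy facts are all assumptions made for the example, and only the "add new fact" branch of the Execute step is modelled.

```python
from typing import Callable, List, NamedTuple, Set


class Rule(NamedTuple):
    """Hypothetical rule: a condition over working memory and the fact it adds."""
    name: str
    condition: Callable[[Set[str]], bool]
    new_fact: str


def recognize_act(rules: List[Rule], facts: Set[str]) -> Set[str]:
    """Run the recognize-act cycle until no rule matches."""
    memory = set(facts)  # working memory (fact base)
    while True:
        # 1. Match: keep rules whose condition holds and that add something new.
        matched = [r for r in rules
                   if r.condition(memory) and r.new_fact not in memory]
        if not matched:
            return memory  # no rule matches: stop
        # 2. Conflict resolution (assumed strategy): pick the first in rule order.
        selected = matched[0]
        # 3. Execute: fire the selected rule by adding its fact.
        memory.add(selected.new_fact)


RULES = [
    Rule("r1", lambda m: "a" in m, "b"),
    Rule("r2", lambda m: "a" in m and "b" in m, "c"),
]
RESULT = recognize_act(RULES, {"a"})
```

Starting from the single fact `a`, rule `r1` fires first and enables `r2`, so the cycle runs twice before stopping.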
NLP Approaches (1)
Rule-based approach
• Relies on hand-constructed rules that are acquired from language specialists
• Requires only a small amount of training data
• Development can be very time-consuming
Corpus-based (statistical) approach
• Developers do not need language specialists' expertise
• Requires a large amount of annotated training data (very large corpora)
• Automated
NLP Approaches (2)
Rule-based approach
• Some changes may be hard to accommodate
• Not easy to obtain high coverage of the linguistic knowledge
• Useful for limited domains
• Can be used with both well-formed and ill-formed input
• High quality, based on solid linguistic knowledge
Corpus-based (statistical) approach
• Some changes may require re-annotation of the entire training corpus
• Coverage depends on the training data
• Not easy to work with ill-formed input, as both well-formed and ill-formed input are still probable
• Lower quality: does not explicitly deal with syntax
Rule-based Arabic NLP tools
• Morphological Analyzers
• Morphological Generators
• Syntactic Analyzers
• Syntactic Generators
Morphological Analysis
Break down the inflected Arabic word into a root/stem, affixes, and features.
Example: sa-‘uEty-kumA (سأعطیكما), 'I will give you…'
• sa- (س): TYPE: Particle, INFLECTION: 'Future'
• ‘uEty- (أعطی): TYPE: VERB, ASPECT: IMPERF, MOOD: IND, PERS: 1, GENDER: M/F, NUMBER: SG, SUBJ: I
• -kumA (كما): TYPE: AFFPR, GENDER: M/F, NUMBER: DUAL, GF: OBJ
Rules - Augmented Transition Network (ATN) technique
• Rules associated with arcs represent the context-sensitive knowledge about the relation between a root and inflections.
• More than one rule may be associated with one arc.
• Conditions associated with the arcs are placed in such a way that the arc to be traversed first is the one that leads to the most probable solution.
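Analysis of the verb "شاهدتك" (I saw you): remove suffixes
[Figure: ATN traversal over states S0, S1, S2, S3, S10. The arc last1 = "ك" strips the object suffix (شاهدتك → شاهدت); the arc last2 = "ت" strips the subject suffix (شاهدت → شاهد).]
• stem: "شاهد" (saw)
• perfect
• 1st person sg pronoun: "ت"
• 2nd person sg pronoun: "ك"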
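Analysis of the verb "يلعبون" (they are playing): remove prefix & suffix
[Figure: ATN traversal over states S0, S1, S2, S3, S10. The arc Begin2 = "ي" strips the prefix (يلعبون → لعبون); the arc last2 = "ون" strips the suffix (لعبون → لعب).]
• stem: "لعب" (played)
• imperfect
• plural subject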
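The two analyses above amount to stripping known affixes until a known stem remains. The following is a minimal sketch of that idea, not the ATN implementation itself: the affix tables, the two-entry stem lexicon, the `analyze` function, and the "no prefix means perfect" shortcut are all simplifying assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Toy tables (assumptions): one imperfect prefix, three suffixes, two stems.
IMPERFECT_PREFIXES = ("ي",)
SUFFIXES = [
    ("ون", "plural subject"),
    ("ك", "2nd person sg object pronoun"),
    ("ت", "1st person sg subject pronoun"),
]
STEMS = {"شاهد": "saw", "لعب": "played"}


def analyze(word: str) -> Optional[Tuple[str, List[str]]]:
    """Strip one prefix and any number of suffixes; return (stem, features)."""
    rest = word
    features: List[str] = []
    if rest.startswith(IMPERFECT_PREFIXES):
        rest = rest[1:]
        features.append("imperfect")
    else:
        features.append("perfect")  # simplification: no prefix means perfect
    stripped = True
    while rest not in STEMS and stripped:
        stripped = False
        for suffix, feature in SUFFIXES:
            if rest.endswith(suffix):
                rest = rest[: -len(suffix)]  # remove the matched suffix
                features.append(feature)
                stripped = True
                break
    return (rest, features) if rest in STEMS else None
```

A real analyzer would order the arcs by likelihood and backtrack on failure, which this greedy loop does not attempt.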
Issues in the morphological analysis
• Overgeneration (too many outputs)
• Ambiguity
• Reconstruction of vowels
• Multiword/compound expressions
• Out-of-Vocabulary (OOV)
• Handling ill-formed input
  • Detection (spell checking)
  • Correction by relaxation ("ه" instead of "ة")
• Prevent ill-formed output
  • Check the compatibility (the prefix "ف" cannot come after the prefix "ب" (or "ك"))
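The compatibility check in the last item can be expressed as a small lookup over adjacent prefixes. This is a sketch under assumptions: the table holds only the single constraint stated above, and the names `FORBIDDEN_AFTER` and `prefixes_compatible` are invented for the example.

```python
from typing import Dict, List, Set

# For each prefix, the prefixes it may NOT follow (only the rule stated above).
FORBIDDEN_AFTER: Dict[str, Set[str]] = {"ف": {"ب", "ك"}}


def prefixes_compatible(prefixes: List[str]) -> bool:
    """Return False if any prefix directly follows one it is not allowed to follow."""
    for earlier, later in zip(prefixes, prefixes[1:]):
        if earlier in FORBIDDEN_AFTER.get(later, set()):
            return False
    return True
```

With this table, the sequence "ف" then "ب" is accepted, while "ب" then "ف" is rejected.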
Morphological generation
Synthesis of an inflected Arabic word from a given root/stem according to a combination of morphological properties that include: definiteness (definite article "ال"), gender (masculine, feminine), number (singular, dual, plural), case (nominative, genitive, accusative, …), person (first, second, third) …
Synthesis of inflected Nouns
• definite noun
• feminine noun
• pluralize noun
• dual noun
• attach a prefix preposition
• attach a suffix pronoun
• end case
• ….
Synthesis of feminine noun
If noun.gender = masculine
Then attach suffix feminine letter
Example: "زوج" (husband) → "زوجة" (wife)
Synthesis of suffix pronoun
If pronoun.person = first and pronoun.number = singular
Then attach first person singular suffix pronoun
Example: "زوجة" (wife) → "زوجتي" (my wife)
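The two noun rules above can be sketched as string operations. This is an illustrative simplification, not the generator described in the talk: the function names and the small pronoun table are assumptions, and the only orthographic adjustment modelled is that a final "ة" is written "ت" once a suffix is attached.

```python
TA_MARBUTA = "ة"

# Hypothetical table of suffix pronouns keyed by (person, number).
SUFFIX_PRONOUNS = {(1, "sg"): "ي", (1, "pl"): "نا"}


def make_feminine(noun: str) -> str:
    """Attach the feminine letter unless the noun already ends with it."""
    return noun if noun.endswith(TA_MARBUTA) else noun + TA_MARBUTA


def attach_suffix_pronoun(noun: str, person: int, number: str) -> str:
    """Attach a suffix pronoun, opening a final ta marbuta into a regular ta."""
    if noun.endswith(TA_MARBUTA):
        noun = noun[:-1] + "ت"
    return noun + SUFFIX_PRONOUNS[(person, number)]


WIFE = make_feminine("زوج")
MY_WIFE = attach_suffix_pronoun(WIFE, 1, "sg")
```

Chaining the two calls reproduces both slide examples: "زوج" becomes "زوجة", and "زوجة" becomes "زوجتي".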
Synthesis of inflected Verbs (very complex: rich in form and meaning)
• conjugate a verb with tense
• conjugate a verb with number
• conjugate a verb with prefix pronoun
• conjugate a verb with suffix pronoun
• ….
Rule: synthesize first person plural of assimilated verbs
Input: first person singular past verb
Output: inflected verb
Example: وصل → سنصل (future), نصل (present), وصلنا (past)
If verb.tense = future
then remove first weak & attach_prefix(verb.stem, "سن")
else if verb.tense = present
then remove first weak & attach_prefix(verb.stem, "ن")
else attach_suffix(verb.stem, "نا")
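A direct transcription of this rule into Python might look as follows. It is a sketch only: the function name is invented, the tense labels are plain strings, and the "first weak" letter is assumed to be an initial "و", as in the example verb.

```python
WEAK_INITIAL = "و"  # assumption: only waw-initial assimilated verbs are handled


def first_person_plural(stem: str, tense: str) -> str:
    """Synthesize the first person plural form of an assimilated verb stem."""
    if tense in ("future", "present"):
        # Remove the first weak letter, then attach the tense prefix.
        base = stem[1:] if stem.startswith(WEAK_INITIAL) else stem
        prefix = "سن" if tense == "future" else "ن"
        return prefix + base
    # Otherwise (past): keep the stem and attach the suffix pronoun.
    return stem + "نا"
```

For the stem "وصل" this yields "سنصل", "نصل", and "وصلنا" for the future, present, and past respectively.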
Issues in the morphological generation
• Multiword/compound expressions
• Out-of-Vocabulary (OOV)
• Some forms need special handling:
  • Substitution: This man – هذا الرجل
  • Literal numbers (complex nouns)
  • Arabic script: 'ل' + 'ال' → 'للـ'; "زملاء" + "ي" → 'زملاءي' → 'زملائي'; "غرفة" → "غرفتان"
Types of Rules
Grammatical rules: describe sentence and phrase structures, and ensure the agreement relations between various elements in the sentence.
Parsing
Accepts the input and generates the sentence structure (parse tree).
[Figure: parse tree for "الطالبة مجتهدة". "الطالبة" is a noun (definite, fem, sg) acting as inchoative (definite, fem, sg); "مجتهدة" is a noun (indefinite, fem, sg) acting as enunciative (indefinite, fem, sg); together they form a nominal sentence. Agreement: number, gender.]
Parsing of the sentence "الطالبة مجتهدة": The student (sg, f) is diligent (sg, f)
Nominal sentence -> definite_Inchoative(Number,Gender) indefinite_enunciative(Number,Gender)
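The grammar rule above succeeds only when both constituents share the same number and gender. A minimal sketch of that check is shown below; the `Noun` record and the `parse_nominal_sentence` function are hypothetical names, and the lexical features are supplied by hand rather than by a morphological analyzer.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Noun:
    """Hypothetical analyzed noun with the features the rule needs."""
    form: str
    definite: bool
    gender: str
    number: str


def parse_nominal_sentence(first: Noun, second: Noun) -> Optional[Dict[str, str]]:
    """Apply: nominal sentence -> definite inchoative + indefinite enunciative."""
    if not first.definite or second.definite:
        return None  # definiteness pattern does not fit the rule
    if (first.number, first.gender) != (second.number, second.gender):
        return None  # agreement in number and gender fails
    return {"type": "nominal sentence",
            "inchoative": first.form,
            "enunciative": second.form}


STUDENT = Noun("الطالبة", definite=True, gender="fem", number="sg")
DILIGENT_F = Noun("مجتهدة", definite=False, gender="fem", number="sg")
DILIGENT_M = Noun("مجتهد", definite=False, gender="masc", number="sg")
TREE = parse_nominal_sentence(STUDENT, DILIGENT_F)
```

The same function doubles as a simple grammar checker: pairing the feminine subject with the masculine predicate returns `None`.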
Issues in the syntactic analysis
• Ambiguity (more than one parse tree)
  • Disambiguation techniques
• Handling ill-formed input
  • Detection (grammar checking)
  • Recovering (partial parsing: parses = chunks to be related)
Types of Rules
• Determine phrase structures
• Determine syntactic structure
• Ensure the agreement relations between various elements in the sentence.
Rule: verb-subject agreement
Input: verb and inflected subject (a pre-verbal NP )
Output: inflected verb agreed with its inflected subject
synthesize_verb(Subject.number,verb.stem)
synthesize_verb(Subject.gender,verb.stem)
An agreement example:
الأولاد زاروا خمس متاحف قديمة
the-boys visited-they five museum old
The boys visited five old museums
Agreement relations: verb-subject (N, G), counted noun-numeral (G), adjective-noun (G)
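The verb-subject part of this example can be sketched as a lookup of the perfect-tense suffix by the subject's number and gender. The table below is an assumption made for illustration and covers only three third-person combinations; forms that change the verb stem are deliberately left out, so other combinations raise a `KeyError`.

```python
# Hypothetical suffix table for third-person perfect verbs after a pre-verbal subject.
PERFECT_SUFFIXES = {
    ("sg", "masc"): "",
    ("sg", "fem"): "ت",
    ("pl", "masc"): "وا",
}


def synthesize_verb(stem: str, number: str, gender: str) -> str:
    """Inflect the verb stem so that it agrees with its subject."""
    return stem + PERFECT_SUFFIXES[(number, gender)]


VISITED_PL = synthesize_verb("زار", "pl", "masc")
```

For the plural masculine subject "الأولاد", the stem "زار" is inflected as "زاروا", matching the sentence above.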
Issues in the syntactic generation
• Word order (VSO, SVO, etc.)
• Agreement (full/partial)
• Dropping the subject pronoun (called pro-drop), i.e., having a null subject when the inflected verb includes subject affixes
• Syntax that captures the source/intended meaning
  • My son is 8 = أبني عمره ثماني سنوات
  • I did not understand the last sentence = أنا لم أفهم الجملة الأخيرة
Rule-based Arabic NLP applications
• Named Entity Recognition
• Machine translation
• Transferring Egyptian Colloquial Dialect into Modern Standard Arabic
What is entity recognition?
Identifying, extracting, and normalizing entities from documents such as names of people, locations, or companies.
Makes unstructured data more structured
Entity Extractor
Politics of Ukraine
In July 1994, Leonid Kuchma was elected as Ukraine's second president in free and fair elections. Kuchma was reelected in November 1999 to another five-year term, with 56 percent of the vote. International observers criticized aspects of the election, especially slanted media coverage; however, the outcome of the vote was not called into question. In March 2002, Ukraine held its most recent parliamentary elections, which were characterized by the Organization for Security and Cooperation in Europe (OSCE) as flawed, but an improvement over the 1998 elections. The pro-presidential For a United Ukraine bloc won the largest number of seats, followed by the reformist Our Ukraine bloc of former Prime Minister Viktor Yushchenko, and the Communist Party. There are 450 seats in parliament, with half chosen from party lists by proportional vote and half from individual constituencies.
[Entity types highlighted in the example: Person, Location, Date]
Person Entity Recognition (1)
Example: 'الملك الأردني عبد الله الثاني' The Jordanian king Abdullah II
We want to have a rule that recognizes a person name composed of a first name followed by optional last names, based on a preceding person indicator pattern.
Person Entity Recognition (2)
The rule components of this example:
• Name entity: "عبد الله" [Abdullah]
• Indicator pattern: an honorific such as "الملك" [The king]
• Nasab (optional): inflected from a location name, "الأردني" [Jordanian]
• The rule also matches an optional ordinal number appearing at the end of some names, such as "الثاني" [II].
Person Entity Recognition (3)
((honorfic+(location( ي|ية ))?)+first_Name(last_Name)?+(number)?)
This (Regular Expression) rule can recognize:
• الملك عبد الله
• الملك الأردني عبد الله
• الملك الأردني عبد الله الثاني
• الملكة الأردنية رانيا
• …
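The rule above can be approximated with Python's standard `re` module. The sketch below is not the NERA system: the tiny word lists stand in for real gazetteers, the last-name entry is a made-up illustration, and the helper names `alt` and `find_person` are assumptions. The person indicator is required, and the matched name is returned without it.

```python
import re
from typing import List, Optional

# Toy gazetteers (assumptions); a real system would load much larger lists.
HONORIFICS = ["الملكة", "الملك"]   # longer form first so it is tried first
LOCATIONS = ["الأردن"]
FIRST_NAMES = ["عبد الله", "رانيا"]
LAST_NAMES = ["العبد الله"]
ORDINALS = ["الثاني"]


def alt(items: List[str]) -> str:
    """Build a non-capturing alternation from a word list."""
    return "(?:" + "|".join(re.escape(item) for item in items) + ")"


# honorific + (location + ي|ية)? + first_Name + (last_Name)? + (number)?
PERSON = re.compile(
    rf"(?P<indicator>{alt(HONORIFICS)}(?: {alt(LOCATIONS)}(?:ية|ي))?) "
    rf"(?P<name>{alt(FIRST_NAMES)}(?: {alt(LAST_NAMES)})?(?: {alt(ORDINALS)})?)"
)


def find_person(text: str) -> Optional[str]:
    """Return the first person name introduced by an indicator, if any."""
    match = PERSON.search(text)
    return match.group("name") if match else None
```

Because Arabic has no capitalization to signal proper nouns, the indicator pattern is what anchors the match; a bare name without an honorific is not recognized by this sketch.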
Issues in the Arabic NER
• Complex morphological system (inflections)
• Non-casing language (no initial capital for proper nouns)
• Non-standardization and inconsistency in Arabic written text (typos and spelling variants)
• Ambiguity
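MT Approaches: MT Pyramid
[Figure: MT pyramid. Analysis climbs from the source word to the source syntax; generation descends from the target syntax to the target word. The three crossing levels, from base to apex, are Direct, Transfer, and Interlingua.]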
English-to-Arabic Transfer-based Approach
Source sentence (English)
→ Sentence Analysis (morphological & syntactic analysis rules of English; English dictionary) → English parse tree
→ Transfer (English-to-Arabic transformation rules; bilingual dictionary) → Arabic parse tree
→ Sentence Synthesis (morphological generation & synthesis rules of Arabic; Arabic dictionary)
→ Target sentence (Arabic)
Transfer approach
• Involves analysis, transfer, and generation components
• If you have an Arabic parser and an Arabic syntactic generator, all you need is to acquire the transfer rules and build the transfer component
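[Figure: transfer of an English noun-phrase parse tree into an Arabic one. English np: noun "networks" (pl), noun "performance" (sg), noun "evaluation" (sg). Arabic np after transfer: noun "تقييم" (sg), noun "أداء" (sg), noun "شبكة" (pl).]
Networks performance evaluation → تقييم أداء شبكة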
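The transfer step in this example combines a bilingual lookup with a reordering: the English compound is head-final, while the Arabic noun phrase puts the head first. The sketch below illustrates just that one rule; the three-entry dictionary and the function names are assumptions, and morphological generation of the final surface forms (for example, the plural of "شبكة") is left to a later stage.

```python
from typing import Dict, List, Tuple

# Hypothetical bilingual dictionary: English noun -> (Arabic stem, number feature).
BILINGUAL: Dict[str, Tuple[str, str]] = {
    "networks": ("شبكة", "pl"),
    "performance": ("أداء", "sg"),
    "evaluation": ("تقييم", "sg"),
}


def transfer_noun_compound(english_nouns: List[str]) -> List[Tuple[str, str]]:
    """Map each noun through the dictionary and reverse the order of the compound."""
    return [BILINGUAL[word] for word in reversed(english_nouns)]


def stems(arabic_nouns: List[Tuple[str, str]]) -> str:
    """Join the transferred stems, before any morphological generation."""
    return " ".join(stem for stem, _ in arabic_nouns)


ARABIC_NP = transfer_noun_compound(["networks", "performance", "evaluation"])
```

Calling `stems(ARABIC_NP)` gives the stem sequence shown on the slide, with the number features still attached to each node for the generator to use.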
Issues in the Transfer-based MT approach
• Synonyms of a word: Acquisition "اكتساب" or "استخلاص"
• Agreement: intelligent tutoring systems "نظم التعليم الذكية" or "نظم التعليم الذكي"
• Problems with prepositions: did you do fungal analysis? "هل قمت بـتحليل الفطر؟"
• …
Interlingua MT – Multilingual translation
• Interlingua = Semantic Representation
• Deep analysis (no need for a transfer component)
• Only analysis and generation components
• Add an Arabic analyzer to translate to other languages
• Add an Arabic generator to translate from other languages
Analysis of Arabic to Interlingua
Input: العميل: أنا أرغب في حجز غرفة في الفندق
[Figure: analysis pipeline. Components: Preprocessor, Morphological Analyzer, Sentence Analyzer (producing a Parse Tree), Mapper. Resources: Arabic Lexicon, Arabic Morphology Rules, Arabic Grammar Rules, Map Lexicon, Ontology.]
Interlingua (IF): c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))
Generating Arabic from Interlingua
Interlingua (IF): c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))
[Figure: generation pipeline. Components: Mapper (producing a Feature Structure), Sentence Generator, Morphological Generator. Resources: Map Rules, Map Lexicon, Ontology, Arabic Grammar Rules, Arabic Morphology Rules, Arabic Lexicon.]
Output: العميل: أنا أرغب في حجز غرفة في الفندق
Issues in the interlingua approach
• Interlingua: a language-neutral representation that captures the intended meaning of the source sentence
• Requires a fully-disambiguating parser
Transferring Egyptian Colloquial Dialect into Modern Standard Arabic
Be able to reuse MSA processing tools with colloquial Arabic by transferring colloquial Arabic words into their corresponding MSA words.
Facilitate the communication with colloquial Arabic speakers
Restore the Arabic dialect to the standard language in use nowadays.
A complete sentence example
Input: جيت امتي؟ (You-came when?)
• Step (1), mapping: جيت → جئت, امتي → متي, giving جئت متي؟
• Step (2), reordering: the new segment position for the word "امتى" is start of sentence (SoS), giving متي جئت؟ (When did-you-come?)
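The two steps of this example, word mapping followed by reordering, can be sketched as below. The mapping table and the set of words that move to the start of the sentence contain only the entries from this example and are assumptions of the sketch; tokenization and punctuation handling are left out.

```python
from typing import Dict, List, Set

# Step (1) table: colloquial word -> MSA word (entries from the example only).
WORD_MAP: Dict[str, str] = {"جيت": "جئت", "امتي": "متي"}

# Step (2) table: MSA words whose new segment position is start of sentence (SoS).
SENTENCE_INITIAL: Set[str] = {"متي"}


def to_msa(tokens: List[str]) -> List[str]:
    """Map colloquial tokens to MSA, then move SoS words to the front."""
    mapped = [WORD_MAP.get(token, token) for token in tokens]
    front = [token for token in mapped if token in SENTENCE_INITIAL]
    rest = [token for token in mapped if token not in SENTENCE_INITIAL]
    return front + rest


MSA_TOKENS = to_msa(["جيت", "امتي"])
```

Keeping the two steps as separate tables means new dialect words or new reordering rules can be added without touching the function itself.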
Arabic Morphological Analyzers
• Tim Buckwalter Morphological Analyzer
  http://www.qamus.org/
  http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49
• Xerox
  http://www.cis.upenn.edu/~cis639/arabic/input/keyboard_input.html
Tokenization & POS tagging
• ArabicSVMTools: the tools utilize the Yamcha SVM tools to tokenize, POS tag, and Base Phrase Chunk Arabic text
  http://www1.cs.columbia.edu/~mdiab/
  http://www1.cs.columbia.edu/~mdiab/software/AMIRA-1.0.tar.gz
Tokenization & POS tagging
• MADA: a full morphological tagger for Modern Standard Arabic
  http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
POS tagging
• Stanford Log-linear Part-Of-Speech Tagger
  http://nlp.stanford.edu/software/tagger.shtml
  http://nlp.stanford.edu/software/stanford-arabic-tagger-2008-09-28.tar.gz
Tokenization & POS tagging
• Attia's Finite State Tools for Modern Standard Arabic
  http://www.attiaspace.com/getrec.asp?rec=htmFiles/fsttools
Arabic Parsers
• Dan Bikel's Parser
  http://www.cis.upenn.edu/~dbikel/
  http://www.cis.upenn.edu/~dbikel/software.html
• Attia Arabic Parser
  http://www.attiaspace.com/
  http://decentius.aksis.uib.no/logon/xle.xml
Arabic wordnet
Arabic WordNet http://www.globalwordnet.org/AWN/
http://personalpages.manchester.ac.uk/staff/paul.thompson/AWNBrowser.zip
Translation resources
• Tools: GIZA++, MOSES, Pharaoh, Rewrite and BLEU
  http://www.statmt.org/
• APIs:
  http://code.google.com/apis/ajax/playground/#translate
  http://code.google.com/apis/ajax/playground/#batch_translate
Mailing Lists – just to be connected to the NLP community
corpora@uib.no http://mailman.uib.no/listinfo/corpora
linguist@LINGUISTLIST.ORG http://www.linguistlist.org/
semitic@cs.haifa.ac.il http://www.semitic.tk/
caasl-list@arabicscript.org http://www.arabicscript.org/CAASL3/index.html
Conclusion (1)
Arabic requires the treatment of the language constituents at all levels: morphology, syntax, and semantics.
Most research in Arabic NLP concentrates on the analysis part, aiming at automated understanding of the Arabic language.
Conclusion (2)
Arabic NLP in general is significantly underdeveloped.
In order to bridge this gap and help Arabic NLP research catch up with the many recent advances for Latin languages, we need collaborative efforts from the Arabic research community.
Conclusion (3)
We need Public Domain (in Electronic Form) for:
• Linguistic resources such as large Arabic (bilingual) corpora and treebanks
• Machine-readable (bilingual) dictionaries
• Morphological analyzers
• Parsers
• …
Conclusion (4)
We need to secure funds for:
• Exchanging visits (experience, Expert Network)
• Buying software
• Securing dedicated RAs and/or PhD students for the NLP task