Advanced Search/Indexing in Holy-Quran Assem Chelli 2011 / 2012

Proposal of an Advanced Retrieval System for Noble Qur’an


  • Ministry of Higher Education and Scientific Research

    National Higher School of Computer Science

    Magister Thesis

    Option: Mobile Distributed Computing (IRM)

    Proposal of an Advanced Retrieval System for Noble Quran

    Written by: Assem CHELLI

    Supervised by: Pr. Amar BALLA and Mr. Taha ZERROUKI

    2011/2012

  • ... and say: O my Lord! have compassion on them, as they brought me up (when I was) little. Al-isra 24

  • Acknowledgment

    First of all, I thank Allah, the Almighty, for giving me the strength and patience to write this modest thesis.

    I gratefully acknowledge Pr. Amar Balla and Mr. Taha Zerrouki for honoring me with their supervision during this year and for guiding me with advice and meaningful criticism.

    I also thank the jury members for agreeing to evaluate this modest work.

    Many thanks to the faculty and administration of the National Higher School of Computer Science (ESI), who took care of my training and supervision throughout the study program.

    My deepest thanks go to the Arab open source community, which has offered me great support, especially the Alfanous team and community, whose valuable contribution helped carry out this work. I hope to be worthy of the confidence they have placed in me.

    Finally, I express my appreciation to all who contributed their advice and encouragement to the completion of this work, and to my family and friends for their assistance and support.

  • Abstract

    The Noble Quran is unlike any other document we know. It is the sacred book of Muslims and contains knowledge about all aspects of life. Only a small part of this huge quantity of information can be extracted manually, which is insufficient compared with the amount of knowledge the Quran contains. This raises the need for a method to extract that information, because currently there is no efficient method apart from the many printed lexicons and the many tools that offer simple sequential search with regular expressions. This limitation calls for new ways of interacting with the Quran.

    The goal of this work is to propose a system for advanced search over all the information contained in the Quran, taking into account the morphology of the Arabic language and the properties of the Quranic text. It should be based on modern information retrieval methods to achieve good stability and high search speed. It would be very useful for researchers and could be generalized to cover all content in Arabic.

    Keywords: Indexing/Search, Arabic, Holy Quran, Information retrieval, Search engines.

  • Résumé

    The Quran is different from all the documents we know. It is the sacred book of Muslims and holds knowledge about all aspects of life. With such a volume of information, only a tiny part can be extracted manually, which proves insufficient given the amount of knowledge the Quran contains. Hence the need to find a method for extracting this information. Yet no tool exists apart from a few printed lexicons and a few tools for simple sequential search with regular expressions. Because of this limitation, the Quran obliges us to find new ways of interaction.

    The aim of this work is to propose an advanced system for searching all the information contained in the Quran, taking into consideration the morphology of the Arabic language and the properties of the Quranic text. It must be founded on modern information retrieval methods to obtain good stability and high-speed search. It would be very useful for researchers and could be generalized to cover all content in Arabic.

    Keywords: Indexing/Search, Arabic, Quran, Information retrieval, Search engines.


  • Contents

    Dedication
    Acknowledgment
    Table of Contents
    List of Figures
    List of Tables
    List of Abbreviations
    Glossary
    General Introduction

    I State of the Art

    1 Search engines
      1.1 Introduction
      1.2 Definitions
        1.2.1 Keyword
        1.2.2 Descriptor
        1.2.3 Document
        1.2.4 Query
        1.2.5 Relevance
      1.3 Search engines
      1.4 Full-text search
      1.5 Crawling
        1.5.1 Crawler Features
          1.5.1.1 Features a crawler must provide
          1.5.1.2 Features a crawler should provide
      1.6 Indexing
        1.6.1 Definition
        1.6.2 Indexing modes
          1.6.2.1 Manual indexing
          1.6.2.2 Automatic indexing
          1.6.2.3 Semi-automatic indexing
        1.6.3 Index types
          1.6.3.1 Document Index
          1.6.3.2 Forward Index
          1.6.3.3 Inverted index
          1.6.3.4 N-gram index
        1.6.4 Index storage
        1.6.5 Index update
          1.6.5.1 Incremental update
          1.6.5.2 Global update
        1.6.6 Indexing phases
          1.6.6.1 Tokenization
          1.6.6.2 Normalization
          1.6.6.3 Elimination of stop-words
          1.6.6.4 Weighting
      1.7 Querying
        1.7.1 Relevance concept
        1.7.2 Similarity Function
        1.7.3 Search process
      1.8 Semantic Approach
      1.9 Conclusion

    2 Arabic Language
      2.1 Introduction
      2.2 Orthography
      2.3 Lexicography
        2.3.1 Verbs
          2.3.1.1 Verbs with a simple root
          2.3.1.2 Verbs with augmented root
        2.3.2 Nouns
          2.3.2.1 Primitive nouns
          2.3.2.2 Nouns derived from verbals
          2.3.2.3 Numbers
          2.3.2.4 Demonstrative pronouns
          2.3.2.5 Relative pronouns
          2.3.2.6 Personal pronouns
        2.3.3 Function words
      2.4 Morphology
        2.4.1 Flexional Morphology
          2.4.1.1 Flexion of verbs
          2.4.1.2 Flexion of nouns
          2.4.1.3 Flexion of function words
        2.4.2 Derivational morphology
          2.4.2.1 Deverbal noun
          2.4.2.2 Active participle
          2.4.2.3 Passive participle
          2.4.2.4 Nouns of time and place
          2.4.2.5 Noun of instrument
          2.4.2.6 The Nomen Vicis
          2.4.2.7 The Nomen Speciei
      2.5 Ambiguity issues
        2.5.1 The absence of vocalization
        2.5.2 Prefixes
        2.5.3 Suffixes
      2.6 The computerization of Arabic language
      2.7 Conclusion

    3 The Quran
      3.1 Introduction
      3.2 Definitions
      3.3 Quran Structure
        3.3.1 Fragmentation into surahs
        3.3.2 Fragmentation into Hizbs
        3.3.3 Fragmentation into Stops (Waqfs)
      3.4 Quranic Sciences
        3.4.1 Knowledge of Ayahs revelation places
        3.4.2 Knowledge of Ayahs revelation causes
        3.4.3 Knowledge of Morphology
        3.4.4 Knowledge of Orthography
        3.4.5 Grammatical analysis of the Quran
        3.4.6 Science of allegorical ayahs
        3.4.7 The beginnings of surahs
        3.4.8 Knowledge of Quranic Parables
        3.4.9 Tafssir
      3.5 Computerization of the Quran
        3.5.1 Advantages of Computerization
      3.6 Quran Indexes
        3.6.1 History
          3.6.1.1 Indexing words of the Quran
          3.6.1.2 Majim of Quranic words
          3.6.1.3 Specialized Majim of Quranic words
          3.6.1.4 Quranic Indexes and Computer
        3.6.2 Classification
          3.6.2.1 By unit
          3.6.2.2 By purpose
        3.6.3 Projects of building indexes
          3.6.3.1 Midd lbayn
          3.6.3.2 Indexes by Taha Zerrouki
          3.6.3.3 Quranic Arabic Corpus
          3.6.3.4 Tanzil Project
          3.6.3.5 Boundary-Annotated Quran Corpus
          3.6.3.6 Qurany Concepts Tool
      3.7 Quran Ontologies
        3.7.1 Quranic Concepts Ontology
        3.7.2 The Ontology made by Hadj Henni
      3.8 Quranic Search Tools
        3.8.1 Alawfa
        3.8.2 Al-Monaqeb-Alqurany
        3.8.3 Quran complex search service
        3.8.4 Quranic Researcher
        3.8.5 Quranologie
        3.8.6 Quranic Corpus Word-by-Word Search
        3.8.7 Tanzil
        3.8.8 Zekr
      3.9 Conclusion

    II Analysis & Conception

    4 Classification & Proposition of Quranic Search Features
      4.1 Introduction
      4.2 Difficulties of Search in Quran
      4.3 Classification
      4.4 Proposals
        4.4.1 Advanced Query
        4.4.2 Output Improvements
        4.4.3 Suggestion Systems
        4.4.4 Linguistic Aspects
        4.4.5 Quranic Options
        4.4.6 Semantic Queries
        4.4.7 Statistical System
      4.5 Discussion
        4.5.1 Survey
          4.5.1.1 Survey Participants Details
          4.5.1.2 Results of survey
      4.6 Conclusion

    5 Conception
      5.1 Introduction
      5.2 Previous Work
        5.2.1 Text Processing
        5.2.2 Query Processing
        5.2.3 Suggestions
        5.2.4 Results Processing
        5.2.5 Indexes Importing
      5.3 Full vocalized search engine
      5.4 Othmani script and text processing
        5.4.1 Substitution
          5.4.1.1 Romanizations
          5.4.1.2 Numbers into words
        5.4.2 Tokenization
        5.4.3 Normalization
        5.4.4 Filtering stop-words
        5.4.5 Lemmatization
      5.5 Quranic Word Search
        5.5.1 Word properties search
        5.5.2 Semantically Related Words
        5.5.3 Multi-level Derivations
        5.5.4 Specific Derivations
        5.5.5 Fuzzy Search
      5.6 Conclusion

    III Implementation

    6 Implementation
      6.1 Introduction
      6.2 Why Open Source?
        6.2.1 License: AGPL
        6.2.2 Python
        6.2.3 Whoosh Search API
      6.3 Previous Code Base
      6.4 Our improvements
        6.4.1 A New Centralized JSON Output System
        6.4.2 Many new features
        6.4.3 Resource Importing Manager
        6.4.4 Automating the API building
        6.4.5 A new console interface
        6.4.6 Enhancing the web interface
        6.4.7 Packaging system
        6.4.8 Multiple search units
        6.4.9 Coding Standardization
        6.4.10 Documentation covering
        6.4.11 Open Issues
      6.5 Interfaces
        6.5.1 Application Programming Interface
          6.5.1.1 JSON web service
          6.5.1.2 Console interface
      6.6 Conclusion

    General Conclusion
    Bibliography
    Appendices
    Annex A: Paper Abstracts

  • List of Figures

    1.1 The various components of a web search engine
    1.2 Indexing Benefits
    1.3 Search process
    1.4 Search results returned as a web page: one of the possible ways to expose results

    2.1 Ligature of lam and alif

    3.1 Explanatory diagram of fragmentation into Surahs
    3.2 Explanatory diagram of fragmentation into Hizbs
    3.3 Index using the word as a unit [Arabic Quranic Corpus]
    3.4 Index that takes the ayah as a unit [Arabeyes Quran Model]
    3.5 Classification of indexes by purpose
    3.6 The various structures of Quran
    3.7 Preview of Qurany
    3.8 A closer look at Quran Concepts Ontology
    3.9 Diagram of domain ontology of Quranic documents made by Hadj Henni [Hadjhenni2008]
    3.10 Preview of Alawfa website: www.alawfa.com
    3.11 Preview of Al Monaqeb Alqurany
    3.12 Preview of Quran Complex Search page
    3.13 Preview of Quranic Researcher (www.quranicresearcher.com)
    3.14 Preview of Quranologie (quranologie.com)
    3.15 Preview of Quranic Arabic Corpus Word By Word search
    3.16 Preview of Tanzil
    3.17 Preview of Zekr Application

    4.1 Pages view in Google.com
    4.2 Highlight the keyword (alone) in Ayah 11 of al-modather - al-fanous.org
    4.3 Ayah in full diacritical marks - quran.com
    4.4 Query spell correction in Google.com
    4.5 Focusing on Yaqub in Ontology of concepts of corpus.quran.com
    4.6 Related searches suggestion in Google.com
    4.7 Keyboard mapping Arabic to English in Google.com
    4.8 Different romanizations for the word
    4.9 Different transliterations used in ElixirFM Resolve Online
    4.10 Syntactic Coloration of Basmalah - bayt-al-hikma.com
    4.11 Google Voice Search on Android
    4.12 Annotations shown by Quranic Arabic Corpus website
    4.13 Divine names Highlight in Quran Reader iPhone application
    4.14 Faceted Thematic Browsing - Qurany Project
    4.15 Audience Background
    4.16 Audience experience
    4.17 Clarity, Usefulness, and Need percentage of each feature

    5.1 Basic Prototype
    5.2 The behavior of Searcher
    5.3 Text processing
    5.4 Results processing phases
    5.5 Different possible declensions of the word
    5.6 Different types of the possible vocalizations of the word
    5.7 Different writing forms of the word using Othmani script
    5.8 Merging Words in Uthmani Script
    5.9 General Schema of Uthmani and Standard text processing
    5.10 Example of Substitution
    5.11 Tokenization of the word
    5.12 Sub-tokens separation schema
    5.13 Example of tokenization
    5.14 Arabic case markers
    5.15 Example of normalization
    5.16 Example of stop-word filtering
    5.17 Examples of lemmatization
    5.18 Two-Steps search behavior
    5.19 Semantically related words: Idols in Quran
    5.20 Word properties search example: First person, Plural, Masculine
    5.21 Searching through an ontology
    5.22 Semantically Related Words, Hyponymy of the word (prophet)
    5.23 Multi-level Derivation Search example
    5.24 Special derivations example, Imperative of (to say)

    6.1 Screenshot of the Qt desktop interface
    6.2 JSON results of
    6.3 Fuzzy search example
    6.4 Showing adjacent ayahs
    6.5 Showing ayahs in different scripts
    6.6 Suggestion example of Vocalizations, Derivations, and Synonyms of
    6.7 Annotations of the keyword
    6.8 Buckwalter translation example
    6.9 Fields table
    6.10 Translation-as-unit search, Query: seven
    6.11 Word-as-unit JSON output, Query:
    6.12 Interfaces dependency hierarchy
    6.13 API usage sample code
    6.14 Preview of the JSON web service
    6.15 Preview of the Console interface

  • List of Tables

    1.1 Document Index
    1.2 Forward Index
    1.3 Inverted Index
    1.4 N-gram index

    2.1 3 types of Arabic letters: 1 form, 2 forms or 4 forms
    2.2 The change of meaning by changing the diacritical marks
    2.3 The change of function by changing diacritical marks
    2.4 Ambiguities of prefixes
    2.5 Ambiguities due to suffixes

    3.1 Waqfs types [Web-Islamweb]
    3.2 Some numerical miracles of Quran [Nawfal1975]
    3.3 Index based on word parts
    3.4 Index based on sentences, surah: al-fatiha
    3.5 Overview of main index - Midd lbayn
    3.6 Overview of words index - M. Taha Zerrouki
    3.7 Overview of the topics index - Taha Zerrouki
    3.8 Overview of the index of synonyms - Taha Zerrouki
    3.9 Overview of morphology index - Quranic Arabic corpus
    3.10 Example of simple proper Quranic text - Tanzil.info
    3.11 Example of Surah index - Tanzil.info
    3.12 Sajdah index - Tanzil.info
    3.13 Example of rub index - Tanzil.info
    3.14 Sample of Boundary Annotated Quran Corpus

    5.1 Partial vocalization
    5.2 Some numbers as they are mentioned in the Quran

    6.1 Search request flags
    6.2 Pylint Analysis stats
    6.3 Implementation State of search features

  • List of Abbreviations

    AGPL Affero General Public License.

    API Application Programming Interface.

    GPL GNU General Public License.

    GUI Graphical User Interface.

    IDF Inverse Document Frequency.

    OWL Web Ontology Language.

    PC Personal Computer.

    POS Part Of Speech.

    RSV Retrieval Status Value.

    TF Term Frequency.

    TF*IDF Term Frequency - Inverse Document Frequency.

    UI User Interface.


  • Glossary

    A: Abrogated ayahs; Abrogating ayahs; Accusative; Active Participle; Active voice; Allegorical ayah; Assimilated verb; Attaching pronoun; Ayah

    B: Basmalah; Book Science; Broken plural

    C: Conjugation

    D: Declension; Declinable; Defect; Demonstrative pronoun; Deverbal noun; Diacritical marks; Diphthong; Diptote; Dual form

    E: Expansion; External feminine plural; External masculine plural; External plural

    F: Fiqh; First ayahs of surah; First person; Fusional language

    G: Geminated verb; General and Particular; Genitive

    H: Hamzah; Hamzated verb; Healthy verb; Hizb; Hollow verb

    I: Imperative; Imperfective; Instrument noun; Internal plural; Inversion

    J: Jussive; Juz

    L: Lam-Alef; Last ayahs of surah; Laws; Lemma; Lexicology

    M: Makkan; Medinan; Migration of Prophet; Morphology; Mushaf

    N: Narration of Hadiths; Nisf; Nomen Speciei; Nomen Vicis; Nominative; Noun of place; Noun of time

    O: Object; Orthography; Othmani script

    P: Passive Participle; Passive voice; People of the book; Perfective; Personal pronoun; Plural form; Primitive noun; Prophet's Sunnah; Prosody; Prostration of recitation

    Q: Qiblah; Quranic comma; Quranic Parable

    R: Recitation; Relative pronoun; Revelation Science; Rewayate; Rewayate of Kaloun; Rhetoric; Root; Rubu

    S: Sajdah; Second person; Singular form; Standard script; Subject; Superlative noun; Surah; Surah keys

    T: Tafssir; The Five Nouns; Third person; Thumn; Translation of Quran; Triptote

    V: Verb with a simple root; Verb with augmented root; Virtues of surah

    W: Waqf; Weakened verb

  • General Introduction

    Work Context

    Quran, in Arabic, means "the read" or "the recitation". Muslim scholars define it as the words of Allah revealed to His Prophet Muhammad, written in the Mushaf and transmitted by successive generations [Mahssin1973]. The Quran is also known by other names, such as Al-Furkan, Al-kitab, Al-dhikr, Al-wahy and Al-ruh. It is the sacred book of all Muslims and the first reference for Islamic law. More than 14 centuries have passed since its revelation, and Muslims are still studying it, teaching it, writing books about it and, recently, developing applications for it.

    The Quran is an important source of information on all aspects of life: scientific, social, historical, political, etc.

    Problematic

    The Quran holds so much information that general-purpose search engines struggle to extract what its readers need. When searching for a book on English grammar, for example, you simply Google it, select a PDF file and download it. Search engines such as Google are generally used on Latin-script text and search general document information such as content, title and author. Searching the Quranic text is much more complicated: it requires a more in-depth solution, because a lot of information must be extracted to meet the needs of Quran scholars. Before computers, Quran scholars used manually compiled printed lexicons. These lexicons are of limited help, since many searches waste the searcher's time and effort, and each lexicon is written to answer one specific, generally simple, kind of query. Nowadays many applications address specific search needs; most applications developed for the Quran offer a search feature, but only in a simple form: sequential search with regular expressions.

    Simple search with an exact query offers few options and remains inefficient for moving toward thematic search, for example. Full-text search is the newer approach that replaced sequential search and is the one used in search engines. Unfortunately, this approach has not yet been applied to the Quran. The questions are: why do we need this approach? Why search engines? Do Quran search applications really need to be implemented as search engines?
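    The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not the system proposed in this thesis: the three-document corpus, the identifiers d1 to d3 and the function names are hypothetical, and English placeholder words stand in for Quranic text. Sequential search rescans every document for each query, whereas full-text search consults an inverted index built once in advance.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus: placeholder documents keyed by an identifier.
DOCS = {
    "d1": "mercy and patience",
    "d2": "patience in hardship",
    "d3": "mercy for all people",
}


def sequential_search(pattern, docs):
    """Scan every document with a regular expression on each query."""
    regex = re.compile(pattern)
    return sorted(doc_id for doc_id, text in docs.items() if regex.search(text))


def build_inverted_index(docs):
    """Map each token to the set of documents containing it (built once)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def indexed_search(query, index):
    """Answer an AND query by intersecting posting sets, without rescanning text."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


INDEX = build_inverted_index(DOCS)
print(sequential_search(r"\bmercy\b", DOCS))   # documents matching the regex
print(indexed_search("mercy patience", INDEX))  # documents containing both words
```

    In this sketch the cost of the sequential scan grows with the size of the text on every query, while the indexed lookup only touches the posting sets of the query terms. Linguistic processing such as normalization or derivation would act on the tokens before they enter the index; none of that is modeled here.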

    Objectives

    Our proposal is to design a retrieval system that fits the needs of Quran search. To reach this objective, we must first list and classify all the search features that are possible and helpful. We then need to study how to implement each feature and what its requirements are.

    Report organization

    We organized the report as follows:

    First Part: State of the Art

    This part contains three chapters:

    Chapter 1: Search Engines
    To design a powerful search engine, it is essential to understand how search engines work. In this chapter we discuss the different parts of a search engine, namely crawling, indexing and querying, and define the basic concepts of the field of information retrieval. The chapter also contains an introduction to the semantic approach.

    Chapter 2: Arabic Language
    The objective of this chapter is to present the properties of the Arabic language, its spelling and its morphology, and to introduce some ambiguity issues that arise from the nature of Arabic.

    Chapter 3: The Quran
    This chapter presents an overview of the Quran and its sciences. It gives a historical background on the evolution of the Quran, the structure of the Mushaf, and the main problems in the computerization of the Quran, including the Uthmani script and the authentication of Quranic texts.


    Second Part: Analysis & Conception

    This part contains two chapters:

    Chapter 4: Quranic search features
    The objective of this chapter is to present the possible search features in the Quran. It is of great importance to our work since it defines our objectives and the path of the work. We conducted a survey about the usefulness, need, and clarity of each feature in order to validate our choice of features.

    Chapter 5: Conception
    In this chapter, we start with an overview of our previous work, then we propose several improvements to carry out all the feasible search features mentioned in the previous chapter.

    Third Part: Implementation

    This part covers the different steps of the implementation of our retrieval system. It includes one chapter:

    Chapter 6: Implementation
    This chapter describes the choice of technologies and development tools, and presents the prototype with a description of its various features.

    Finally, we close the report with a conclusion that summarizes our work. We include an appendix that describes the papers published about this work. There are currently two papers:

    An Arabic paper in NITS 2011, KSA, entitled "An Application Programming Interface for indexing and search in Noble Quran"1 [Chelli2011].

    An English paper in a pre-conference workshop of LREC 2012, Turkey: "LRE-Rel: Language Resource and Evaluation for Religious Texts". The paper was entitled "Advanced Search in Quran: Classification and Proposition of All Possible Features" [Chelli2012].

    1Arabic title:


  • Part I

    State of the Art


  • Chapter 1

    Search engines

    "How could the world beat a path to your door when the path was uncharted, uncatalogued, and could be discovered only serendipitously?"

    Paul Gilster, Digital Literacy

    1.1 Introduction

    Our work falls within the field of Information Retrieval, as it aims to design a search engine. In this chapter we discuss how search engines work by explaining their main components.

    Crawling (exploration) is the part that feeds the search engine with the documents it collects. As the amount of information grows larger and larger, it becomes necessary to develop better search methods; only indexing is able to accelerate search in very large systems such as the Web, because it anticipates the search by extracting and organizing keywords in advance.

    For search results to be satisfactory, we must properly calculate the relevance of the results to the query; this is done during querying. The query language must also be able to express simple questions as well as complex ones.

    The quality of search is directly related to the quality of crawling, indexing and querying; these three operations can be considered the core of a search engine. The objective of this chapter is to define the main concepts of this area: we start by defining crawling, then study indexing, its methods and steps, and finally explain the search process and the notion of relevance.


    1.2 Definitions

    1.2.1 Keyword

    A word or set of words chosen to represent the contents of a document and used to find it in a document search. It can come from the document itself (title, text, abstract, ...) or from a controlled vocabulary. [Hensens1998]

    1.2.2 Descriptor

    A keyword selected from a set of equivalent terms to clearly represent a concept. It is usually part of an organized, hierarchical vocabulary such as a thesaurus. [Hensens1998]

    1.2.3 Document

    A document can be a text, a piece of text, a web page, an image, a video, etc. We call a document any unit that can be an answer to a user query. Textual documents come in several forms with respect to structure: a document can be a text without any structure (also called full-text), a text with a structured part (partially structured or semi-structured document), or a fully structured text. [Amrouche2008]

    1.2.4 Query

    A query expresses a user's need for information. Various types of query languages have been proposed to formulate a query. A query can be expressed:

    In natural (or near-natural) language (e.g. "find all the car manufacturing facilities and their addresses") [Salton1971]

    In a structured format, also called a Boolean query language (e.g. "cars and factories and brand") [Bourne1979]

    As a graphical language from a GUI [Lelu1992]

    1.2.5 Relevance

    Relevance simply means returning the information considered the most useful at the top of a result list. While the definition is simple, getting a program to compute relevance is not a trivial task, mainly because the notion of usefulness is hard for a machine to understand. [Bernard2009]


    1.3 Search engines

    A search engine is software that retrieves resources (web pages, images, videos, files, information, etc.) related to given words. Some websites offer a search engine as their main feature; the website itself is then called a search engine (Google, Yahoo, Bing, ... are search engines). [Nejjari2007]

    A search engine is also a crawling tool on the web, made up of robots that explore websites periodically and automatically (without human intervention, which is what distinguishes search engines from directories). They follow the links encountered on each page reached (pages link to each other). Each identified page is indexed in a database and then becomes accessible to Internet users through keywords. [Sanan2008]

    Search engines do not apply only to the Web: some engines are software installed on personal computers. These are known as desktop search engines; they search the files stored on the PC. Examples include Exalead Desktop, Google Desktop and Copernic Desktop Search.

    In December 2004, Tim Berners-Lee (the inventor of the World Wide Web) talked about a new project: the Semantic Web, which is based on processing web information automatically according to its meaning. The reason was that 80% of the Web consists of text intended to be read and understood by humans, while computer programs, web browsers and search engines are unable to understand this content and are therefore unable to speed up the search. Less than two years after Berners-Lee's article, the first foundations of the Semantic Web were laid. They seemed to lead the world toward a new revolution in the Internet and search engines. [Abulhajjaj2009]

    At first glance, nothing distinguishes a classic search engine from a semantic one: the same sparse interface, with a text box in the center of the page where the user can enter a search query. In fact, the difference lies in the search mode. A classic search engine, such as Google, works as follows: its robots browse the pages and index the words, then store these words in a gigantic database. Users search by sending their queries, and a search algorithm retrieves the results and sorts them in a certain order based on their relevance. [Mentre2008]

    1.4 Full-text search

    Full-text search is a technology focused on finding documents matching a set of words.

    While sounding like a mouthful, full-text search is more common than you might think. You have probably used full-text search today. Most web search engines, such as Google and Yahoo!, use full-text search engines at the heart of their service. The differences between them are secret recipes (and sometimes not so secret), such as the Google PageRank™ algorithm. PageRank™ modifies the importance of a given web page (result) depending on how many web pages point to it and how important each of those pages is.
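    The idea that a page's importance depends on how many pages point to it, and on how important those pages are, can be sketched with a small power iteration. This is only the commonly published simplified formulation, not the algorithm as run in production; the function name `pagerank`, the damping factor 0.85, the iteration count and the toy link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    All parameter values are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: a, b and d all point to c; c points back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

    In this toy graph, page c ends up with the highest rank because three pages point to it, and page a ranks above b and d because its single incoming link comes from the important page c.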

    Be careful, though; these so-called web search engines are way more than the core of full-text search: they have a web UI, they crawl the web to find new pages or existing ones, and so on. They provide business-specific wrapping around the core of a full-text search engine.

    Given a set of words (the query), the main goal of full-text search is to provide access to all the documents matching those words. Because sequentially scanning all the documents to find the matching words is very inefficient, a full-text search engine (its core) is split into two main operations: indexing the information into an efficient format, and searching the relevant information from this precomputed index. From the definition, you can clearly see that the notion of word is at the heart of full-text search; this is the atomic piece of information that the engine will manipulate. [Bernard2009]

    1.5 Crawling

    Crawling is the process by which we gather pages from the Web in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. The web crawler is sometimes referred to as a spider. [Manning2009]

    This process takes place in the phase preceding the indexing phase, see Figure 1.1:


    Figure 1.1: The various components of a web search engine

    1.5.1 Crawler Features

    We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide. [Manning2009]

    1.5.1.1 Features a crawler must provide

    Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side effect of faulty website development.

    Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
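    As an illustration of the politeness requirement, the sketch below enforces a minimum delay between two visits to the same host. It covers only an implicit rate policy; the class name `PolitenessScheduler`, the 2-second delay and the example URLs are assumptions made for this example, and a real crawler would also have to honor explicit policies such as robots.txt.

```python
from urllib.parse import urlparse


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next fetch is allowed."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay   # seconds between two hits on one host (assumed value)
        self.next_allowed = {}       # host -> earliest allowed fetch time

    def wait_time(self, url, now):
        """Return how many seconds the crawler must still wait before fetching url."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Remember that url's host was just visited at time `now`."""
        host = urlparse(url).netloc
        self.next_allowed[host] = now + self.min_delay


scheduler = PolitenessScheduler(min_delay=2.0)
scheduler.record_fetch("http://example.org/a", now=100.0)
```

    Passing the clock value explicitly as `now` keeps the sketch deterministic; a crawler would typically pass `time.monotonic()` and sleep for the returned wait time.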

    1.5.1.2 Features a crawler should provide

    Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.

    Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.


    Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.

    Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching useful pages first.

    Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine's index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.

    Extensible: Crawlers should be designed to be extensible in many ways, to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.

    1.6 Indexing

    To keep the cost of search acceptable, the document collection must first go through an essential phase. This phase consists in analyzing each document in the collection to create a set of keywords: we call it the indexing phase. These keywords are then more easily used by the system during the subsequent search process. Indexing creates a representation of the documents in the system. Its objective is to find the most important concepts of the document (or query), which form the descriptor of the document. [Sauvagnat2005]

    1.6.1 Definition

    Indexing is the act of describing or classifying a document by index terms or other symbols in order to indicate what the document is about, to summarize its content or to increase its findability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents (such as books and articles) within a field of knowledge.

    The process of indexing begins with an analysis of the subject of the document. The indexer must then identify terms which appropriately identify the subject, either by extracting words directly from the document or by assigning words from a controlled vocabulary. The terms in the index are then presented in a systematic order. Indexers must decide how many terms to include and how specific the terms should be. Together, these choices determine the depth of indexing. [Lancaster2003]

    Indexing is most often used for information retrieval, but it can also be used in other areas such as automatic classification of documents, keyword suggestion, computation of co-occurring terms, automatic summarization, etc. [Abar2009]

    Figure 1.2: Indexing Benefits

    1.6.2 Indexing modes

    1.6.2.1 Manual indexing

    Manual indexing is performed by a human expert (a librarian or a specialist in the field) who analyzes the content of the text to identify the terms representing the document.

    Manual indexing ensures greater relevance in the answers, because it identifies more specific keywords describing a document.

    However, it has several drawbacks: the vocabulary used and the result depend on the indexer's knowledge of the topic. The same document can be indexed in several ways (according to the view of the person who does the indexing), and the same indexer may, at two different times, choose two distinct terms to represent the same concept.

    The major drawback of this method is its cost in time; it is therefore not appropriate when the number of documents to be indexed is substantial. [Sauvagnat2005, Abar2009, Amrouche2008]


    Manual indexing is based on four key points [Chartron1989]:

    reading the entire document in preparation;

    consideration of descriptors, objectives (applications) and user needs;

    permanent complementarity between the terms of manual indexing and the abstract;

    in the absence of an appropriate descriptor, and when the emergence of a new concept is not explicit enough to propose a candidate descriptor, the ability to use a close or generic descriptor.

    Given these costs, the use of computers for indexing was considered early on [Mustafa].

    1.6.2.2 Automatic indexing

    Automatic indexing is a set of automated processing phases applied to documents. We distinguish: tokenization (automatic extraction of words), elimination of stop words, stemming (lemmatization or root extraction), scoring of words, and finally the creation of the index [Sauvagnat2005].

    The first approach to automatic indexing, KWIC (Key Word In Context), was introduced by Luhn (1957) [Luhn1957]. The question of how to weight index terms was then discussed. In the early days of information retrieval, statistical methods were based on the frequency of words in the document. Later, this measure was extended to take into account the specificity of a term for the document. To this end, other methods have been exploited, such as 2-Poisson (Nie, 2003) [Gaussier2003][Mustafa].

    The automatic indexing systems use several methods of analysis:

    1.6.2.2.1 Linguistic analysis: A technology derived from text mining, which implements simplified models of linguistic theories in computer learning systems. It is part of the field of artificial intelligence. [Allab2008]

    The linguistic method consists of several modules of linguistic analysis: morphological, lexical, syntactic and pragmatic. The fact that some indexing systems use natural language processing techniques demonstrates the relevance of a linguistic approach. [Elhachani1997]

    1.6.2.2.2 Statistical analysis: The initiator of automatic indexing methods is H. P. Luhn, with his influential article "The Automatic Creation of Literature Abstracts", published in 1958 in the IBM Journal of Research and Development. He states: "(...) instead of sampling at random, as a reader normally does when scanning, the new mechanical method selects those among all the sentences of an article that are the most representative of pertinent information". H. P. Luhn thus opened the door to work on automatic indexing by proximity, also called the statistical method [Luhn1958].

    Automatic indexing involves the following steps:

    Extracting words (tokenization): the extraction rules are language-dependent.

    Eliminating stop words: these are words that are very frequent but carry little meaning. Examples: the, a, of, or, etc.

    Stemming: for example, the stem of the word "stemmers" is "stem".

    Transformation rules: removal of plural endings.

    Truncation: choose an optimal truncation length for words. It is better to truncate suffixes. There is no absolute rule for this.
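    The steps listed above can be sketched as a small analysis pipeline. This is a toy illustration for English only: the stop-word set, the suffix list, the minimum stem length and the function names are all assumptions, and the truncation-based stemmer is far cruder than a real stemming algorithm.

```python
import re

# Toy resources, chosen only for illustration.
STOP_WORDS = {"the", "a", "of", "or", "and"}
SUFFIXES = ("ers", "ing", "es", "s")   # checked in this order


def tokenize(text):
    """Extract lowercase alphabetic words (the rules are language-dependent)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop very frequent words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token, min_stem=3):
    """Truncate one known suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            stemmed = token[: -len(suffix)]
            # Undo a doubled final consonant left by truncation ("stemm" -> "stem").
            if (len(stemmed) > min_stem and stemmed[-1] == stemmed[-2]
                    and stemmed[-1] not in "aeiou"):
                stemmed = stemmed[:-1]
            return stemmed
    return token


def analyze(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]
```

    For instance, `analyze("The stemmers of the indexes")` yields `["stem", "index"]`: the stop words are dropped and the plural endings are truncated.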

    1.6.2.3 Semi-automatic indexing

    The two previous techniques can be combined: a first automatic pass extracts the terms of the document, but the final choice remains with the domain expert or librarian, who establishes semantic relations between keywords and chooses the significant terms using a thesaurus or a terminology database, that is, an organized list of descriptors (keywords) obeying specific terminology rules [Abar2009, Sauvagnat2005, Hadjhenni2008].

    1.6.3 Index types

    The index is the output of the indexing process. There are several types of indexes, according to the technique used and the desired function:

    1.6.3.1 Document Index

    The document index keeps information about each document. It is a fixed-width ISAM (Index Sequential Access Mode) index, ordered by document ID. The information stored in each entry includes data, a checksum of the document and various statistics. If the document has been crawled, the entry also contains a pointer to a variable-width file, called the document information file, that contains the URL and title. This design decision was driven by the desire to have a relatively compact data structure and the ability to fetch a record with a single disk access when queried. [Brin1998]

    The following table is a simplified illustration of a document index:

    Document ID   Text                               Link
    Document 1    The cow says moo                   /ex/doc1.txt
    Document 2    The cat and the hat                /ex/doc2.txt
    Document 3    The dish ran away with the spoon   /ex/doc3.txt

    Table 1.1: Document Index

    1.6.3.2 Forward Index

    The forward index stores a list of words for each document. The following is a simplified form of a forward index:

    Document ID   Words
    Document 1    the, cow, says, moo
    Document 2    the, cat, and, the, hat
    Document 3    the, dish, ran, away, with, the, spoon

    Table 1.2: Forward Index

    The rationale behind developing a forward index is that, as documents are parsed, it is better to immediately store the words per document. This delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it into an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by document. Converting the forward index to an inverted index is only a matter of sorting the pairs by word. In this regard, the inverted index is a word-sorted forward index. [Brin1998]
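    The conversion described above, sorting (document, word) pairs by word, can be sketched in a few lines. The three documents are those of the illustrative tables in this section; the function names and the use of integer document IDs are assumptions made for the example.

```python
DOCUMENTS = {
    1: "The cow says moo",
    2: "The cat and the hat",
    3: "The dish ran away with the spoon",
}


def build_forward_index(documents):
    """Map each document ID to its list of words, in document order."""
    return {doc_id: text.lower().split() for doc_id, text in documents.items()}


def invert(forward_index):
    """Turn the forward index into an inverted index by sorting pairs by word."""
    pairs = [(word, doc_id)
             for doc_id, words in forward_index.items()
             for word in words]
    pairs.sort()  # sorts by word first, then by document ID
    inverted = {}
    for word, doc_id in pairs:
        postings = inverted.setdefault(word, [])
        # Skip repeated pairs such as ("the", 2) appearing twice.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return inverted


forward_index = build_forward_index(DOCUMENTS)
inverted_index = invert(forward_index)
```

    After the sort, all pairs sharing the same word are adjacent, so the postings list of each word is built in a single pass.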

    1.6.3.3 Inverted index

    Many search engines use an inverted index when evaluating a search query, to quickly retrieve the documents that contain the words of the query and then sort them by relevance. Since the inverted index stores the list of documents containing each word, the search engine can use direct access to find the documents associated with each word in a query and thus retrieve the matching documents quickly. The following table is a simplified illustration of an inverted index:


    Word   Documents
    the    Document 1, Document 3, Document 4, Document 5
    cow    Document 2, Document 3, Document 4
    says   Document 5
    moo    Document 7

    Table 1.3: Inverted Index

    This index can only tell whether a word exists in a particular document, because it does not store any information regarding the frequency or the position of the word. It is therefore considered a Boolean index. Such an index determines which documents match a query, but does not rank them. In some models, the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. The position information allows the search algorithm to identify adjacent words in order to support phrase search. The frequency can be used to help calculate the relevance of documents to the query. [Grossman2002, Tang2004]
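    To make the role of the additional information concrete, the sketch below stores word positions in the index, derives the frequency from them, and uses the positions to answer a phrase query. The data layout (word, then document, then list of positions) and all names are assumptions chosen for illustration, not a description of a specific engine.

```python
from collections import defaultdict

DOCUMENTS = {
    1: "The cow says moo",
    2: "The cat and the hat",
    3: "The dish ran away with the spoon",
}


def build_positional_index(documents):
    """Map word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def term_frequency(index, word, doc_id):
    """The frequency is simply the number of stored positions."""
    return len(index.get(word, {}).get(doc_id, []))


def phrase_search(index, phrase):
    """Return the documents where the words of the phrase are adjacent, in order."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + offset in index.get(word, {}).get(doc_id, [])
                   for offset, word in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


positional_index = build_positional_index(DOCUMENTS)
```

    A purely Boolean index could only report that "the" and "hat" both occur in Document 2; with positions, the engine can additionally check that "hat" directly follows "the".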

    1.6.3.4 N-gram index

    An n-gram is a sequence of n consecutive characters. For any document, the set of n-grams (n usually takes the value 2 or 3) is obtained by sliding a window of n characters over the body text. The window moves in steps of one character. We then count the frequency of each n-gram found. For example¹ [Jalam2002], the French sentence "La nourrice nourrit le nourrisson" is represented by the following trigrams:

    No.           1    2    3    4    5    6    7    8    9    10   11   12
    n-grams       la_  a_n  _no  nou  our  urr  rri  ric  ice  ce_  e_n  rit
    Frequencies   1    1    3    3    3    3    3    1    1    1    2    1

    Table 1.4: N-gram index

    One benefit of n-grams is the automatic capture of the most common stems [Grefenstette1995]: in the previous example, using techniques based on n-grams, we find the common root of nourrir, nourri, nourrit, nourrissez, nourriture, etc. Tolerance to spelling mistakes and distortions is also an important property. [Sanan2008]
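    The sliding-window extraction can be sketched as follows, using the example sentence from this section. Lowercasing the text and replacing spaces with the character _ follow the convention of the example; the function name `char_ngrams` is an assumption.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Slide a window of n characters over the text, one character at a time."""
    normalized = text.lower().replace(" ", "_")
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


counts = Counter(char_ngrams("La nourrice nourrit le nourrisson"))
```

    The trigrams "_no", "nou", "our", "urr" and "rri" each occur three times, once per word built on the root "nourri", which is how the shared stem shows up in the frequencies.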

    1.6.4 Index storage

    Storage of index structures is mainly characterized by the index size and the organization of its elements. Index structures vary widely in their use of space, which is closely related to the organization of data in the index. This organization has a significant impact on search latency. The more closely related items are placed near each other in storage, the lower the search latency; this is called the concept of locality. It is also very important that the index fit in main memory, which avoids disk accesses and reduces search latency.

    The ideal index is one that occupies the least space and minimizes search latency. [Dahak2006]

    1 The character _ is used instead of spaces, in order to facilitate the reading.

    1.6.5 Index update

    Updating the index refers to applying changes to the index. Changes can be insertions, modifications or deletions. An index can be more or less able to adapt to these changes. This adaptation can occur in two forms:

    1.6.5.1 Incremental update

    In an incremental update, the index structure is updated by adding the index entries of new documents without modifying existing ones. The number of changes in this case is, however, often limited. [Dahak2006]

    1.6.5.2 Global update

    The second case, and the worst, is when the structure of the entire index must be rebuilt from scratch. [Dahak2006]

    1.6.6 Indexing phases

    The indexing process consists of the following phases:

    1.6.6.1 Tokenization

    Tokenization is a phase that may seem trivial at first, and yet it provides the basis for the rest of the indexing phases. This phase must therefore be carried out with the highest quality. [Meylan2001]

    Some retrieval systems use a list of predefined keywords. This list is designed manually and, in most cases, built for a specific topic. This method allows control over the index size. The choice between automatic extraction of keywords and a list of predefined keywords determines the type of indexing: document-oriented in the first case and query-oriented in the second. [Berrut1997, Dahak2006]


    1.6.6.2 Normalization

    This processing finds, for each word, its normalized form (usually the masculine for nouns, the infinitive for verbs, the masculine singular for adjectives, etc.). Thus, only normalized forms are stored in the index, which offers a significant saving in size. More importantly, when the same processing is also applied to the query, search becomes much quicker and more flexible: for example, if a user searches with a verb, documents that contain this verb in any of its conjugated forms will be considered, not just documents containing the word in the form provided by the user. This step is also called morphological processing of keywords. [Denoyer2004]

    This phase can also be enriched with syntactic and semantic processing of keywords. Syntactic processing identifies and groups a set of words whose meaning depends on their union. For example, the words "White House" do not usually mean you're dealing with a house that is white, but instead the seat of the presidency of the United States. It also serves to remove ambiguities such as the problems of homography.

    Semantic processing is intended to make distinctions between the different possible meanings of a word (polysemy). For example, this phase helps distinguish the senses of the French word "pièce", which can mean a coin or a room in a house. This is an arduous task that is not currently well mastered, and its effect on system performance is not always proven. [Dahak2006]

    1.6.6.3 Elimination of stop-words

    This phase is important since it strongly influences the accuracy of the search. Failure to remove stop words inevitably causes noise. Stop words, which are words of everyday language that do not carry much semantic information, must be eliminated both at indexing time and at querying time (by removing stop words from the query). [Dahak2006]

    1.6.6.4 Weighting

    This step is entirely dependent on the information retrieval model used. It defines how important a term is in a given document. [Dahak2006]

    In general, most term weighting formulas are built by combining two factors: a local weighting factor measuring the local representativity of a term in the document, and an overall weighting factor measuring the global representativity of a term with respect to the collection of documents [Amrouche2008].

    This leads to two types of weighting:


    1.6.6.4.1 Local weighting Local weighting takes into account the local information of the term, which depends only on the document. It is typically a function of the frequency of occurrence of the word in the document, denoted tf (Term Frequency). A term that frequently appears in a document is considered relevant to describe its contents. [Dahak2006]

    1.6.6.4.2 Overall weighting The overall weight measures the importance of a term within the whole collection of documents. It aims to represent its discriminatory nature, in other words its ability to distinguish between documents. In fact, a term appearing in few documents is considered more discriminatory and should be favored over a term found in many documents. The calculation of the overall weighting is based on the number of documents in which a term appears. One of the most used measures is idf (Inverse Document Frequency), given by the following formula:

    $\mathrm{idf}_i = \log\left(\frac{N}{n_i}\right)$

    where $n_i$ is the number of documents containing the word i and N is the total number of documents.

    The value tf * idf gives a good approximation of the importance of a term in the document, particularly in a corpus of documents of similar size. [Dahak2006]
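    As a worked illustration of the tf * idf weight, the sketch below computes it for the three toy documents used in the index tables of this chapter. The raw count is used as tf and the natural logarithm is used for idf; both are assumptions, since the text does not fix the exact tf function or the base of the logarithm.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (same documents as in the index tables).
DOCUMENTS = {
    1: ["the", "cow", "says", "moo"],
    2: ["the", "cat", "and", "the", "hat"],
    3: ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def tf_idf(documents):
    """Return {doc_id: {word: tf * idf}} with idf = log(N / n_i)."""
    n_docs = len(documents)
    term_counts = {doc_id: Counter(words) for doc_id, words in documents.items()}
    # n_i: number of documents that contain each word.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {word: tf * math.log(n_docs / doc_freq[word])
                 for word, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


weights = tf_idf(DOCUMENTS)
```

    The word "the" appears in all three documents, so its idf is log(3/3) = 0 and its weight vanishes even where it occurs twice, whereas "cow" appears in a single document and receives the weight log(3).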

    1.7 Querying

    Querying is the phase of interaction between the system and the user. The user expresses a need for information via a query language that the system takes care of interpreting. This interpretation is done according to the query model and is designed to understand the user's needs and express them in a formalism similar to the one used when indexing documents. This process produces an inner query. Following this phase of query interpretation, a matching model calculates the match between the inner query and each document in the index. This calculation, performed by the matching function, traditionally results in an ordered list of documents. At this level, the comparison between the concepts of the document and those of the query should be semantic (not a strict equality test).

    The comparison between query and document rarely leads to strict equivalences, but rather to partial equivalences: the document matches only part of the query. The first document in the list returned by the system is the one that the system considers the most relevant, that is to say the one that best suits the query, again according to the system. The last document is the one that the system considers the least relevant. This notion of relevance is based on the proximity between the needs expressed by the user and the results provided by the system. [Dahak2006]

    1.7.1 Relevance concept

    Relevance is a central concept of querying because all evaluations are based on it. But it is also the most poorly understood concept, despite numerous studies such as the one in [Denos1997].

    Let us see some definitions of relevance. Relevance is:

    The correspondence between a document and a query, a measure of the informativeness of the document with respect to the query;

    A degree of relationship (overlap, relativity, ...) between the document and the query;

    A degree of surprise that comes with a document that is relevant to the needs of the user;

    A measure of the usefulness of the document to the user.

    Even in these definitions, the concepts used (informativeness, relativity, surprise, ...) remain very vague, because users have very different needs and very different criteria for judging whether a document is relevant. So the notion of relevance is used to cover a very wide range of criteria and relations [Dahak2006].

    1.7.2 Similarity Function

    Comparing a document with a query amounts to calculating a score, assumed to represent the relevance of the document with respect to the query. This value is calculated from a similarity function or probability denoted rsv(q,d) (retrieval status value), where q is a query and d is a document, and whose formula depends entirely on the information retrieval model used. This measure takes into account the weight of terms in documents, determined by statistical analysis and probability. The matching function is very closely related to the operations of indexing and weighting of query terms and of documents in the corpus. In general, the document-query matching and the indexing model together characterize and identify an information retrieval model. The similarity function is then used to order the documents returned to the user. The quality of this ordering is paramount. In fact, users are generally content to examine only the first documents (the top 10 or 20). If the documents sought are not present in this slice, the user will consider the ranking bad with respect to the query [Sauvagnat2005, Dahak2006].
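    Since the formula of rsv(q,d) depends on the retrieval model, the sketch below picks one common choice as an assumption: the cosine similarity of the vector space model, applied to term-weight dictionaries. The function names, the toy weights and the tie-breaking by document ID are all illustrative.

```python
import math


def rsv(query_weights, doc_weights):
    """Cosine similarity between two {term: weight} dictionaries (assumed model)."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


def rank(query_weights, corpus_weights):
    """Order document IDs by decreasing score, breaking ties by document ID."""
    scored = [(rsv(query_weights, w), doc_id) for doc_id, w in corpus_weights.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical term weights for three documents and a one-word query.
corpus = {
    "d1": {"cow": 1.0, "moo": 1.0},
    "d2": {"cat": 1.0, "hat": 1.0},
    "d3": {"cow": 1.0, "spoon": 2.0},
}
query = {"cow": 1.0}
ranking = rank(query, corpus)
```

    Here d1 is ranked before d3 because the query term accounts for a larger share of its weight vector, and d2, which shares no term with the query, comes last with a score of zero.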


    1.7.3 Search process

    Search takes a user query and returns the list of matching results sorted by relevance. Like indexing, searching is a multiphase process, as shown in Figure 1.3.

    Figure 1.3: Search process

    The first operation is about building the query. Depending on the full-text search tool, a query is expressed in one of two ways:

    1. String based: A text-based query language. Depending on the focus, such a language can be as simple as handling words and as complex as having Boolean operators, approximation operators, field restriction, and much more!

    2. Programmatic API based: For advanced and tightly controlled queries, a programmatic API is very neat. It gives the developer a flexible way to express complex queries and decide how to expose the query flexibility to users.

    Some tools will focus on the string-based query, some on the programmatic API, and some on both.

    The second operation, let's call it analyzing, is responsible for taking sentences or lists of words and applying the same operations that were performed at indexing time (chunking into words, stems, or phonetic descriptions). This is critical because the result of this operation is the common language that indexing and searching use to talk to each other, and it happens to be the one stored in the index. If the same set of operations is not applied, the search won't find the indexed words, which is not so useful!
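
    The need to share one analysis step between indexing and searching can be sketched as follows. The analyze function and its toy plural-stripping rule are assumptions made for illustration only; they do not describe any particular search library.

```python
import re

def analyze(text):
    """Lowercase, split into words, and strip a trailing plural 's'.

    The stemming rule is deliberately naive; it only shows that the
    same function must run at indexing time and at query time.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

# Indexing side: terms stored in the index.
indexed_terms = set(analyze("The Prophets and the Books"))

# Searching side: the query goes through the same analyzer.
query_terms = analyze("book")
found = all(term in indexed_terms for term in query_terms)
```

    Here the query "book" matches because both sides were reduced to the same term; if the query skipped the analyzer and the index held only stems, a query such as "Books" would find nothing.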

    Based on the common language between indexing and searching, the third operation (finding documents) reads the index and retrieves the index information associated with each matching word. Remember, for each word, the index can store the list of matching documents, the frequency, the word positions in a document, and so on. The implicit deal here is that the document itself is not loaded, and that's one of the reasons why full-text search is efficient: the document does not have to be loaded to know whether it matches or not.

    The next operation (filtering and ordering) processes the information retrieved from the index and builds the list of documents (or more precisely, handlers to documents). From the information available (matching documents per word, word frequency, and word position), the search engine is able to exclude documents from the matching list. More importantly, it is able to compute a score for each document. The higher its score, the higher a document will be in the result list. Let's have a look at some factors influencing its value:

    In a query involving multiple words, the closer they are in a document, the higher the rank.

    In a query involving multiple words, the more of them are found in a single document, the higher the rank.

    The higher the frequency of a matching word in a document, the higher the rank.

    The less approximate a word, the higher the rank.

    Depending on how the query is expressed and how the product computes the score, these rules may or may not apply. This list is here to give you a feeling of what may affect the score, and therefore the relevance, of a document. Once the ordered list of documents is ready, the full-text search engine exposes the results to the user. This can be through a programmatic API or through a web page. Figure 1.4 shows a result page from the Google search engine [Bernard2009].
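
    The finding and ordering steps described above can be sketched with a small positional inverted index. The scoring weights (1.0 per matched query word, 0.1 per occurrence) are arbitrary assumptions chosen for illustration, and only two of the listed factors are modelled; proximity and approximation are left out.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]} without keeping the text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Return document ids ordered by a simple illustrative score."""
    scores = defaultdict(float)
    for word in query.lower().split():
        for doc_id, positions in index.get(word, {}).items():
            scores[doc_id] += 1.0                   # one more query word found
            scores[doc_id] += 0.1 * len(positions)  # frequency bonus
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

docs = {1: "mercy and patience", 2: "patience patience reward", 3: "mercy"}
index = build_index(docs)
results = search(index, "mercy patience")
```

    Note that search only consults the index: the documents themselves are never reloaded, which is the efficiency argument made in the text.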


    Figure 1.4: Search results returned as a web page: one of the possible ways to expose results

    1.8 Semantic Approach

    Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results. Semantic search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results [Web-Techulator]. Major web search engines like Google and Bing incorporate some elements of semantic search.

    Rather than using ranking algorithms such as Google's PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely related keyword results. However, Google itself has subsequently also announced its own Semantic Search project [Web-WSJ].

    Other authors primarily regard semantic search as a set of techniques for retrieving knowledge from richly structured data sources like ontologies. Such technologies enable the formal articulation of domain knowledge at a high level of expressiveness and could enable the user to specify his intent in more detail at query time [Web-ESWC2012].

    Semantic search does not just mean contextual search or search based on the intent of the question. It includes several other factors as well. A smart search engine would consider several factors to provide the most relevant and useful search results, including [Web-Techulator]:

    Current trend: If the presidential election has just finished in the country and someone is searching for "Who is the new president", the semantic search system should be able to understand the query and give relevant results based on the current trend and news.

    Location of search: If a person is searching for "what is the temperature", the semantic search engine should be able to provide results based on the current location of the search. If the person is searching from California, search results should include the current temperature in California.

    Intent of the search: Semantic search engines should be able to give appropriate search results based on the intent of the search and not based on the specific words used in the search query.

    Variations of words in Semantic Search: Semantic search should consider tenses, plural, singular, etc. and provide relevant search results for all semantic variations of the words. For example, words like dog, dogs, dog's, etc.

    Synonyms and Semantic Search: A semantic search engine should be able to understand synonyms and give more or less the same search results for any synonym of the word users search for. For example, try searching for "biggest mountain" and "highest mountain". You would get pretty much the same results, since both mean the same in this particular query, even though "biggest" and "highest" could mean different things in other cases.

    Generalized and Specialized queries: A semantic search engine should be able to relate generalized and specialized queries and provide appropriate and relevant results. For example, consider an article on general health topics and another article specifically on diabetes. If someone searches for health information, both articles could match, even though the article on diabetes does not talk specifically about health.

    Concept matching: This is a subset of context matching in semantic search. Semantic search should understand the broad concept of the query and return relevant results. For example, a query on "Traffic problems in New Jersey" could return relevant results including the topics "narrow roads", "non-functioning traffic lights", "lack of roadside assistance", etc., because from a broad conceptual point of view all of these lead to traffic problems.


    Natural language queries: Not everyone is tech savvy, and not many people know what to search for to get relevant results. Most users simply type in queries in natural language. For example, if someone wants to find the current time in Arizona, USA, they would search for "What time is it in Arizona". Most search engines would simply show results from the websites and articles that talk about "Time" and "Arizona". However, smart search engines that use semantic search would actually show the current time in Arizona, USA. Try it yourself at Google search.

    Change of meaning based on the group of words: By combining different words, the true meaning of a search term can change. Consider the following search terms:

    New egg health products

    New egg health benefits

    If you search for both of the above terms in Google, you get completely different results. Instead of just picking results based on the individual words, Google Search treats the query as a single term and combines it with common user search patterns. The first term returns search results primarily from the popular online shopping website NewEgg.com and shows health products from that site and similar sites. The second term shows search results for the health benefits of eggs.
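
    One of the factors above, synonym handling, is often approximated by query expansion. The sketch below is a minimal illustration under stated assumptions: the SYNONYMS table is hypothetical hand-written data, and real semantic search engines rely on much richer resources (thesauri, ontologies, usage statistics).

```python
# Hypothetical synonym table, written by hand for this example only.
SYNONYMS = {
    "biggest": {"highest", "largest"},
    "highest": {"biggest", "largest"},
}

def expand(query):
    """Replace each query word by the sorted list of its alternatives."""
    expanded = []
    for word in query.lower().split():
        alternatives = {word} | SYNONYMS.get(word, set())
        expanded.append(sorted(alternatives))
    return expanded

a = expand("biggest mountain")
b = expand("highest mountain")
```

    With this table both queries expand to the same groups of alternatives, which is one simple way the two searches could end up returning the same results.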

    Semantic search is a big challenge for search engines, and none of them is perfect. Most search engines have improved significantly in the last few years. Search engines like Bing and Google provide significantly relevant search results by incorporating some degree of semantic search. There are many other specialized search engines (like Hakia) which offer purely semantic search results, but they lack many other qualities of normal search engines.

    1.9 Conclusion

    In this chapter, the study focused on the working mechanism of search engines and information retrieval systems, with an emphasis on indexing due to its importance. Indeed, it is the most important step in the search process, as it allows the extraction and processing of keywords.

    The search phase not only provides the interaction between users and the system, but also computes the degree of match between the query and the documents in order to provide the most relevant results.


  • Chapter 2

    Arabic Language

    ***

    2.1 Introduction

    Arabic (العربية) is a name applied to the descendants of the Classical Arabic language of the sixth century AD, the language used in the Quran, the Islamic holy book. Arabic is a Central Semitic language, closely related to Hebrew and to the modern Aramaic languages¹.

    In this chapter, we discuss the orthography and morphology of the Arabic language, both of which have distinctive features. We also discuss some ambiguities that may arise in Arabic due to the absence of vocalization.

    2.2 Orthography

    The Arabic script is one of the most widely used scripts in the world. It dominates in the Arab countries, of course, but it also holds a special place for all Muslims because it is the script used to write the Quran [Jabri].

    It is written from right to left like other Semitic languages. Its alphabet has twenty-nine² consonant letters, three of which (ا, و, ي) are considered vowels. Optionally, one of the three diacritical marks (fatha ـَ, damma ـُ, kasra ـِ) can be placed on certain characters to resolve ambiguity in pronunciation and/or meaning when it arises. In a fully vocalized Arabic text, the absence of a diacritic on a letter can be regarded as a sukun ـْ (silence). In some cases, a doubled letter may be replaced by a single letter with a tashdeed ـّ (reinforcement) placed above it [AlKharashi1999].

    ¹ Modern Aramaic languages are varieties of Aramaic that are spoken vernaculars in the medieval to modern era, evolving out of Middle Aramaic dialects around AD 1200.

    ² Arabic grammarians count the Hamzah as a letter.
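
    Because diacritics are optional, search tools commonly compare words after removing them. The sketch below shows one way to do this with the Python standard library, relying on the fact that Arabic diacritics are Unicode combining marks; the function name is an assumption made for illustration. It also shows how removing the marks creates ambiguity.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks (short vowels, tanwin, shadda, sukun)."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)

# kaf + fatha, ta + fatha, ba + fatha: kataba, "he wrote"
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
# kaf + damma, ta + damma, ba: kutub, "books"
kutub = "\u0643\u064F\u062A\u064F\u0628"

# Both vocalized words collapse to the same three bare letters (k-t-b).
same_skeleton = strip_diacritics(kataba) == strip_diacritics(kutub)
```

    Once the marks are removed, the verb and the noun can no longer be told apart by spelling alone, which is the kind of ambiguity unvocalized text introduces.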

    It is also important to note that the notion of uppercase and lowercase letters does not exist: Arabic writing is said to be unicameral. Moreover, Arabic is a semi-cursive script: most letters are attached to each other, and their shapes differ depending on whether they are preceded and/or followed by other letters or are isolated. Only six of them do not attach to the following letter (ا, د, ذ, ر, ز, و), and one letter does not attach at all (ء) [Mesfar2008].

    Letter    Spellings
    Hamzah    ء
    Wāw       و ـو
    ʿAyn      ع عـ ـعـ ـع

    Table 2.1: 3 types of Arabic letters: 1 form, 2 forms or 4 forms

    2.3 Lexicography

    Traditional Arabic grammar distinguishes only three word classes: nouns, verbs and particles.

    2.3.1 Verbs

    A verb is an entity expressing a time-dependent sense. Most Arabic verbs are formed on three radical consonants, as in the verb كتب (kataba, to write), and occasionally on four consonants, as in the verb دحرج (daḥraǧa, to roll along). These roots may form several patterns as a result of one or more morphological transformations (e.g. repetition of a consonant, lengthening of a vowel, addition of a morpheme, etc.); we then speak of roots with an augmented pattern.

    Several linguistic studies have been conducted on the verbal system in Arabic; see [Larcher2003]. In this section, we introduce a classification of verbs according to their radicals:

    2.3.1.1 Verbs with a simple root:

    A verb with a simple root has a base of three consonants called radical consonants. These verbs are associated with the verbal pattern فعل (faʿala). When none of the root consonants of the verb is a long vowel, the verb is called healthy. The radicals may also be affected by causes of defectiveness, of which we mention:

    The presence of ء (ʾ, hamz), ي (y, yāʾ) or و (w, wāw) among the radical consonants. Depending on its position, we distinguish different types of verbs:

    If one of the root consonants is ء (ʾ, hamz), regardless of its position => Hamzated verb (مهموز);

    The first radical consonant is a و (w) or ي (y) => Assimilated verb (مثال);

    The second radical consonant is a و (w) or ي (y) => Hollow verb (أجوف);

    The third radical consonant is a و (w) or ي (y) => Weakened verb (ناقص);

    The presence of two identical consonants in the second and third positions of the root => Geminated verb (مضاعف).
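
    The classification rules above translate directly into a small function. This is a minimal sketch that assumes the root is given as three bare Arabic letters; the label names follow the text, while the function name and the set of hamza spellings are assumptions made for illustration.

```python
# Spellings of hamza: bare, on alif (above/below), on waw, on ya.
HAMZA = set("\u0621\u0623\u0625\u0624\u0626")
# The weak letters waw and ya.
WEAK = set("\u0648\u064A")

def classify_root(root):
    """Return the verb types that apply to a triliteral root."""
    if len(root) != 3:
        raise ValueError("expected a root of exactly three letters")
    labels = []
    if any(letter in HAMZA for letter in root):
        labels.append("hamzated")
    if root[0] in WEAK:
        labels.append("assimilated")
    if root[1] in WEAK:
        labels.append("hollow")
    if root[2] in WEAK:
        labels.append("weakened")
    if root[1] == root[2]:
        labels.append("geminated")
    return labels or ["healthy"]

print(classify_root("\u0642\u0648\u0644"))  # root q-w-l (to say)
```

    A root can receive several labels at once, which is why the function returns a list rather than a single category.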

    2.3.1.2 Verbs with augmented root:

    The patterns of verbs with an augmented root are formed from simple roots by a set of morphological operations that give a specific meaning to the resulting verbs. We mention:

    .. (faala)

    .. (faala)

    .. (faala)

    .. (tafaala)

    .. (tafaala)

    .. (ftaala)

    .. (nfaala)

    .. (stafala)


    2.3.2 Nouns

    The morphological system of Arabic nouns contains three subcategories:

    2.3.2.1 Primitive nouns:

    Primitive nouns are nouns that cannot be attached to a verbal root. They form the fundamental vocabulary of the concrete language, e.g. رأس (raʾs, head), كرسي (kursiyy, chair), كبش (kabš, sheep), etc. In this category we also include nouns composed of two letters such as دم (dam, blood), فم (fam, mouth), أب (ʾab, father), أخ (ʾaḫ, brother), etc.

    2.3.2.2 Nouns derived from verbs:

    These are the nouns that can be derived from a verbal root. The number and nature of these forms vary depending on the status of the verb to which they relate. As nouns, they can receive marks of case, gender and indefiniteness.

    2.3.2.3 Numbers:

    This category of nouns is made up of simple numerals representing the units, from صفر (ṣifr, zero, 0) to تسعة (tisʿat, nine, 9); the tens, such as عشرة (ʿašarat, ten, 10), عشرون (ʿišruwn, twenty, 20) and تسعون (tisʿuwn, ninety, 90); the hundreds, etc.; as well as compound numerals such as the cardinals from أحد عشر (ʾaḥada ʿašara, eleven, 11) to تسعة عشر (tisʿata ʿašara, nineteen, 19).

    In their classification, Arab grammarians group adjectives with nouns because they take almost all the same morphological forms and may, for example, be definite or indefinite and inflect for case, number and gender.

    2.3.2.4 Demonstrative pronouns:

    Demonstrative pronouns represent a subcategory of nouns expressing an idea of demonstration. They indicate that the object referred to is found either in the text or in the space or time defined by the situation of utterance. There are two subsets: near-deictic (e.g. هذا (hāḏā, this), هؤلاء (hāʾulāʾi, these)) and far-deictic (e.g. ذلك (ḏālika, that), أولئك (ʾuwlāʾika, those), etc.). Demonstratives are declinable only in the dual.


    2.3.2.5 Relative pronouns:

    Relative pronouns relate to the noun or personal pronoun that precedes them, which we call the antecedent. Relatives agree with their antecedents but are declinable only in the dual (like demonstratives). Among the relative pronouns, we mention: الذي (al-laḏiy, that, masculine, singular), اللتين (al-latayni, those, feminine, dual), الذين (al-laḏiyna, those, masculine, plural), etc.

    2.3.2.6 Personal pronouns:

    Personal pronouns are intended to identify three types of grammatical persons:

    First person, i.e. the speaker (المتكلم), the one who is talking: أنا (ʾanā, I) or نحن (naḥnu, we);

    Second person, i.e. the listener (المخاطب), the one who is talked to: أنتَ (ʾanta, you, masculine, singular), أنتِ (ʾanti, you, feminine, singular), أنتما (ʾantumā, you, dual), أنتم (ʾantum, you, masculine, plural), أنتن (ʾantunna, you, feminine, plural);

    Third person, i.e. the absent (الغائب), the one who is talked about: هو (huwa, he), هي (hiya, she), هما (humā, they, dual), هم (hum, they, masculine), هن (hunna, they, feminine).

    2.3.3 Function words

    Function words are used to locate entities, facts or objects in relation to time or place. They also play a key role in the coherence and sequencing of a text. For example, we have particles that designate a time:

    بعد (baʿda, after)

    قبل (qabla, before)

    منذ (munḏu, since)

    or a place, like

    حيث (ḥayṯu, where).

    According to their semantic meaning and their function in the sentence, they can play an important role in the interpretation of a sentence by expressing an introduction, explanation, consequence, etc. [Kadri1992]. Function words include various categories, of which we mention:


    Prepositions: في (fiy, in) or على (ʿalā, on);

    Conjunctions: ثم (ṯumma, then);

    Adverbs: أبدا (ʾabadan, never) or بشكل عادي (bišakl ʿādiyy, normally, in the normal way);

    Quantifiers: كل (kulla, all) or بعض (baʿḍa, some);

    Etc.

    Function words are divided into subgroups: those that are variable (quantifiers) and those that are invariable (adverbs, prepositions, etc.).

    2.4 Morphology

    There are several categories of fusional languages, and Arabic belongs to the category of languages with introflexion: in this category, the consonants carry the meaning and the vowels mark the inflection of the word. This system is found especially in the Semitic languages (e.g. Arabic, Hebrew) [Choueiter2006].

    Morphologically, the Arabic language is very rich and is based on a structure of patterns and roots. Most Arabic words are generated from a finite set of roots (about 7000 roots) transformed using one or more patterns (about 400-500). Theoretically, a single Arabic root can generate hundreds of words (nouns, verbs, ...). An Arabic word can appear in about a hundred forms in a normal text through the addition of certain suffixes and prefixes (which mostly correspond to stop-words in English) [AlKharashi1999].
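
    The root-and-pattern mechanism can be sketched by treating a pattern as a template in which the letters fa, ayn and lam stand for the three radicals, as in the traditional notation. This is a simplified illustration: the function name is an assumption, and the phonological changes that affect weak or geminated roots are ignored.

```python
# fa, ayn, lam: conventional placeholders for the three radicals.
SLOTS = "\u0641\u0639\u0644"

def apply_pattern(root, pattern):
    """Substitute the radicals of a triliteral root into a pattern."""
    if len(root) != 3:
        raise ValueError("expected a root of exactly three letters")
    # Each placeholder letter is mapped to the radical in the same position.
    return pattern.translate(str.maketrans(SLOTS, root))

root = "\u0643\u062A\u0628"  # k-t-b, the root of "to write"
# Pattern fa-alif-ayn-lam (active participle) gives katib, "writer".
writer = apply_pattern(root, "\u0641\u0627\u0639\u0644")
# Pattern mim-fa-ayn-waw-lam (passive participle) gives maktub, "written".
written = apply_pattern(root, "\u0645\u0641\u0639\u0648\u0644")
```

    Applying a few hundred such patterns to a few thousand roots is what lets a finite inventory generate the very large vocabulary described above.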

    2.4.1 Flexional Morphology

    To inflect verbs and nouns, Arabic uses markers of aspect, mood, tense, person, gender, number and case, which are generally suffixes and prefixes [Gaudefroy1975]. Generally, these flexional marks distinguish [?]:

    Mode of verbs: e.g. for the verb ذهب (ḏahaba, to go), forms in the Perfective (الماضي) can be identified by their suffixes, as in ذهبتُ (ḏahabtu, I went), and forms in the Imperfective (المضارع) by their prefixes, as in أذهب (ʾaḏhabu, I go);

    Function of nouns: through suffixes, as in رجلان (raǧulāni, two men) in the Nominative or رجلين (raǧulayni, two men) in the Accusative or Genitive. [?]


    2.4.1.1 Flexion of verbs

    Also called conjugation, the flexion of verbs describes how their forms vary according to circumstances. Conjugation generally involves the following values:

    Aspect: Aspect is a grammatical feature, associated in most cases with verbs, that presents the process expressed by the verb from the perspective of its development (beginning, progress, completion, overall evolution, etc.), regardless of when it takes place;

    Mood: Mood indicates how the action expressed by the verb is conceived and presented. The action can be doubted, or affirmed as actual or potential. Moods combine with the semantics of verbs and thereby create aspects;

    Tense: Tense is a grammatical feature that locates a fact (which may be a state or an action) on the time axis of the utterance relative to three markers: past, present and future. Temporal indications are often accompanied by aspectual indications that are more or less related.

    These three key values are closely related ; they can describe two basic forms of theverb in Arabic :

    Perfective (الماضي): it indicates that the action expressed by the verb is completed, which implies the past. It is formed by adding suffixes of person, gender, number and mood to the verb's stem. For example, for the feminine plural of the verb كتب (kataba, to write), we add the suffix ـن (na) to get the form كتبن (katabna, they wrote, feminine), and for the masculine plural, we add the suffix ـوا (uw) to get the form كتبوا (katabuw, they wrote, masculine);

    Imperfective (المضارع): it indicates an unfinished action, which may imply the present. It is formed by adding a prefix and one or more infixes, such as a letter duplication or a vowel substitution. For example, for the verb مدّ (madda, to give), we can get أمدّ (ʾamuddu, I give) or يمددن (yamdudna, they give, feminine). It includes two types of modal inflections:

    The indicative, or actual mood, in which the speaker states the actual character (achieved, to be achieved, in progress, etc.) of the action or state expressed by the verb;

    The subjunctive, or potential mood, in which the speaker merely states the possible or virtual nature of the action or state expressed by the verb.


    Imperative (الأمر): it expresses an order, command, exhortation, etc. It exists only in the 2nd person, in the singular, dual and plural.
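
    The suffixing mechanism described for the Perfective can be sketched as a lookup table applied to a transliterated stem. The table below is a partial illustration written for this example: it covers only a few person, number and gender combinations, and the transliteration follows the "katabna" and "katabuw" examples of the text.

```python
# Partial table of perfective suffixes (transliterated), for illustration.
PERFECTIVE_SUFFIXES = {
    ("1", "sg", None): "tu",  # I
    ("3", "sg", "m"): "a",    # he
    ("3", "sg", "f"): "at",   # she
    ("3", "pl", "m"): "uw",   # they (masculine)
    ("3", "pl", "f"): "na",   # they (feminine)
}

def perfective(stem, person, number, gender=None):
    """Attach the perfective suffix for the given person/number/gender."""
    return stem + PERFECTIVE_SUFFIXES[(person, number, gender)]

print(perfective("katab", "3", "pl", "f"))
print(perfective("katab", "3", "pl", "m"))
```

    The Imperfective would need more than a suffix table, since it also involves prefixes and internal vowel changes.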

    2.4.1.2 Flexion of nouns

    In Arabic, the declension of nouns involves three cases: Nominative, Accusative and Genitive. Except for some special cases, nouns are declinable and appear in one of these three cases according to their function in the sentence. In terms of spelling, the case is represented only by an auxiliary graphic sign at the end of nominal forms. The nominal system of Arabic admits different declensions depending on the nature of the variation of the form (triptote, diptote, etc.) and on number (singular, dual or plural). We can distinguish:

    2.4.1.2.1 Declension of singular nouns:

    Basic declension of triptotes: This is the most frequent case. It takes the vowel ـُ (ḍammat, u) as the sign of the nominative, the vowel ـَ (fatḥat, a) in the accusative and the vowel ـِ (kasrat, i) in the genitive. When the noun is indefinite, the tanwīn is marked respectively by the three diacritics ـٌ (un), ـً (an) and ـٍ (in). In the indefinite accusative, except for nouns ending in ة (tāʾ) or in اء (āʾ), an ʾalif ا strengthens the tanwīn ـً (an): for example, in the indefinite accusative the noun كتاب (kitāb, book) becomes كتابًا (kitāban, book, accusative, indefinite) and the noun جزيرة (ǧaziyrat, island) becomes جزيرةً (ǧaziyratan, island, accusative, indefinite).

    Declension of diptotes: Nouns that are diptotes, when grammatically indefinite, do not accept tanwīn and take the same mark in the accusative and the genitive, namely ـَ (fatḥat, a). By contrast, when they are definite, they follow the declension of triptotes. This is the case of feminine nouns that end with اء (āʾ) such as صحراء (ṣaḥrāʾ, desert), masculine adjectives of color with the pattern أفعل (ʾafʿal) such as أحمر (ʾaḥmar, red), and their feminine counterparts with the pattern فعلاء (faʿlāʾ) such as بيضاء (bayḍāʾ, white, feminine).
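
    The case endings of singular triptotes and diptotes can be summarized in a few lines of code. This is a simplified sketch on transliterated stems: the function names are assumptions, the definite article is not added, and the spelling rule about the supporting ʾalif is not modelled.

```python
# Short-vowel case endings shared by definite nouns of both classes.
CASE_VOWELS = {"nominative": "u", "accusative": "a", "genitive": "i"}

def case_ending(case, definite=False, diptote=False):
    """Return the transliterated ending for a singular noun."""
    if definite:
        # Definite diptotes follow the triptote declension.
        return CASE_VOWELS[case]
    if diptote:
        # No tanwin; accusative and genitive share the vowel a.
        return "u" if case == "nominative" else "a"
    # Indefinite triptote: case vowel followed by the n of tanwin.
    return CASE_VOWELS[case] + "n"

def decline(stem, case, definite=False, diptote=False):
    return stem + case_ending(case, definite, diptote)

print(decline("kitab", "accusative"))
print(decline("ahmar", "genitive", diptote=True))
```

    The two calls contrast an indefinite triptote, which takes the tanwin, with an indefinite diptote, which takes a bare vowel.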

    Declension of The Five Nouns (الأسماء الخمسة): The five nouns are:

    Three nouns: أبو (ʾabuw, father), أخو (ʾaḫuw, brother) and حمو (ḥamuw, father-in-law);

    A variant of فم (fam, mouth): فو, فا and في;


    The noun ذو (ḏuw, possessor of).

    These are bi-literal nouns that lengthen their final vowel when they are defined by a complement.

    Declension of deverbals with defective roots: Some active participles and deverbal nouns of verbs with a defective root, such as the active participle ماضٍ (māḍin, past) and the deverbal noun تخلٍّ (taḫallin, abandonment), only take the mark of case in the accusative: the last letter of the root, ي (y), is replaced by the tanwīn (in) in the indefinite nominative and genitive. As for the passive participles that end in ى or ا, such as معطى (muʿṭan, given), they lose their case inflection. A tanwīn distinguishes ind