A RULE BASED APPROACH ON STEMMING OF BENGALI VERBS

A Project Work Submitted in Partial Fulfilment of the Requirements for the Degree of
BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE & ENGINEERING

by
ANASUYA PAUL (Roll No. 10700111006)
JOYEETA BAGCHI (Roll No. 10700111021)
KOUSHIK DUTTA (Roll No. 10700111024)
SNEHA SARKAR (Roll No. 10700111049)

Under the supervision of Mr. Alok Ranjan Pal

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT
(Affiliated to West Bengal University of Technology)
Purba Medinipur 721171, West Bengal, India


CERTIFICATE OF APPROVAL

This is to certify that the work embodied in this project entitled "A RULE BASED APPROACH ON STEMMING OF BENGALI VERBS", submitted by Anasuya Paul, Joyeeta Bagchi, Koushik Dutta and Sneha Sarkar to the Department of Computer Science & Engineering, is carried out under my direct supervision and guidance. The project work has been prepared as per the regulations of West Bengal University of Technology, and I strongly recommend that this project work be accepted in fulfilment of the requirement for the degree of B.Tech.

Supervisor
Mr. Alok Ranjan Pal
Asst. Prof., Dept. of CSE

Countersigned by
Prof. (Dr.) Dilip Kumar Gayen
Head, Department of CSE
Certificate by the Board of Examiners

This is to certify that the project work entitled "A RULE BASED APPROACH ON STEMMING OF BENGALI VERBS", submitted by Anasuya Paul, Joyeeta Bagchi, Koushik Dutta and Sneha Sarkar to the Department of Computer Science and Engineering of College of Engineering & Management, Kolaghat, has been examined and evaluated. The project work has been prepared as per the regulations of West Bengal University of Technology and qualifies to be accepted in fulfilment of the requirement for the degree of B.Tech.

Project Co-ordinator
Board of Examiners
ABSTRACT

Based on the various inflexions of verbs available in the Bengali Dictionary, an attempt is made to retrieve the stem word from its inflexions in the underlying sentences. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while the information about the different inflexions of a particular verb is collected from the Bengali Dictionary. In this project, we present a lightweight stemmer for 14 selected Bengali verbs that strips the suffixes using a predefined suffix list, on a longest-match basis, and then finds the root on the basis of some rules. We have applied the algorithm over 450 sentences and achieved around 99.36% accuracy in retrieving the root word from its inflexions in the underlying sentences. The proposed stemmer is both computationally inexpensive and domain independent.
INDEX

1. Introduction
2. Theoretical Study
3. Related Work
4. Proposed Approach
   4.1. Overall Pictorial Representation
      4.1.1. Explanation of Proposed Approach with Example
      4.1.2. Detail explanation of Module 1 (Suffix Stripping)
      4.1.3. Detail explanation of Module 2 (Applying Rules)
      4.1.4. Sentence Collection
      4.1.5. Normalization
      4.1.6. Tagging of Verbs
      4.1.7. Preparing Output File
      4.1.8. Preparing Suffix List
      4.1.9. Verification
   4.2. Algorithm
5. Output and Discussion
   5.1. Partial View of Input File
   5.2. Suffix List
   5.3. Partial View of Output File
   5.4. Efficiency
   5.5. Time Complexity
6. Conclusion and Future Work
i. Acknowledgement
ii. References
iii. Appendix
1. INTRODUCTION

Stemming is an operation that splits a word into its constituent root part and affix without doing complete morphological analysis. It is used to improve the performance of spelling checkers and information retrieval applications, where morphological analysis would be too computationally expensive. It is a pre-processing step in Text Mining applications as well as a very common requirement of Natural Language Processing functions. The main purpose of stemming is to reduce the different grammatical forms / word forms of a word (its noun, adjective, verb, adverb forms, etc.) to its root form. We can say that the goal of stemming is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Bengali is one of the most morphologically rich languages. More than one inflection can be applied to a stem to form a word type. Stemming is a hard problem for the four categories Noun, Adjective, Adverb and Verb, but Verb is the most problematic area for stemming. Bangla has a vast inflectional system; the number of inflected and derivational forms of a certain lexicon is huge. For example, there are nearly 10 * 5 = 50 forms of a given verb in Bengali, as there are 10 tenses and 5 persons, and a root verb changes its form according to tense and person. For example, there are 20 forms of the verb root KA ( ). Other than this, there are lots of prefixes and suffixes which can attach to a root word and form a new word.
Different forms of the verb root DEKHA ( ) are dekhi ( ), dekhis ( ), dekh ( ), dekhe ( ), dekhen ( ), dekhbo ( ), dekhbi ( ), dekhbe ( ), dekhben ( ), dekhchi ( ), dekhchis ( ), dekhche ( ), dekhchen ( ), dekhchilam ( ), dekhchili ( ), dekhchilo ( ), dekhchilen ( ), dekhlam ( ), dekhli ( ), dekhlo ( ), dekhlen ( ), dekhtis ( ), dekhtam ( ), dekhto ( ), dekhten ( ), dekhai ( ), dekhay ( ), dekhas ( ), dekhao ( ), dekhechi ( ), dekhecho ( ), dekhechis ( ), dekhechen ( ), dekhtei ( ), dekhar ( ), dekhabo ( ), dekhaben ( ), dekhabi ( ), etc. Different suffixes that are added to a root word to form a new word are chilen ( ), chilam ( ), chilis ( ), chilo ( ), chile ( ), chili ( ), chen ( ), lam ( ), len ( ), tam ( ), tei ( ), tis ( ), ten ( ), ben ( ), chi ( ), che ( ), bi ( ), be ( ), te ( ), le ( ), li ( ), lo ( ), to ( ), etc.
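A longest-match suffix strip over a list like the one above can be sketched as follows. This is a minimal illustration using a few of the romanized suffixes; the suffix list actually used by the stemmer (Section 5.2) is in Bengali script:

```python
# Longest-match suffix stripping over an illustrative romanized suffix list.
SUFFIXES = sorted(
    ["chilen", "chilam", "chilis", "chilo", "chile", "chili",
     "chen", "lam", "len", "tam", "tei", "tis", "ten", "ben",
     "chi", "che", "bi", "be", "te", "le", "li", "lo", "to"],
    key=len, reverse=True)  # try longer suffixes before shorter ones

def strip_suffix(word):
    """Return (stripped_word, suffix) for the longest matching suffix."""
    for suf in SUFFIXES:
        # require the word to be longer than the suffix so a stem remains
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)], suf
    return word, ""
```

For instance, `strip_suffix("dekhchilam")` strips the six-character suffix "chilam" rather than the shorter "lam", which is exactly why the match must be longest-first.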
Overview of Stemming of Bengali Verbs: Table 1 shows, for each root word, the inflected verb form, the stripped word with its suffix, and the suffix itself.

Table 1: Stemming of Bengali Verbs

We review the existing work in this area in Section 3; we present the proposed stemming algorithm in Section 4, followed by its output, discussion and evaluation in Section 5. Finally, we conclude in Section 6 with a look at future research directions.
2. THEORETICAL STUDY

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".

Modern NLP algorithms are based on machine learning, especially statistical machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing. The machine-learning paradigm calls instead for using general learning algorithms, often (although not always) grounded in statistical inference, to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural, "corpora") is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.
The following is a list of some of the most commonly researched tasks in NLP. What distinguishes these tasks from other potential and actual NLP tasks is not only the volume of research devoted to them but the fact that for each one there is typically a well-defined problem setting, a standard metric for evaluating the task, standard corpora on which the task can be evaluated, and competitions devoted to the specific task.

a. Automatic summarization: Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.

b. Coreference resolution: Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names that they refer to. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).
c. Discourse analysis: This rubric includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).

d. Machine translation: Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.

e. Morphological segmentation: Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms. The same holds not only for Turkish but also for Manipuri [4], a highly agglutinated Indian language.

f. Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Capitalization is an unreliable cue here: the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized.

g. Natural language generation: Convert information from computer databases into readable human language.

h. Natural language understanding: Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves the identification of the intended semantics from the multiple possible semantics which can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. Introduction and creation of a language metamodel and ontology are efficient, though empirical, solutions. An explicit formalization of natural language semantics without confusion with implicit assumptions such as closed world assumption (CWA) vs. open world assumption, or subjective Yes/No vs. objective True/False, is expected for the construction of a basis of semantics formalization.

i. Optical character recognition (OCR): Given an image representing printed text, determine the corresponding text.
j. Part-of-speech tagging: Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity.

k. Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).

l. Question answering: Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?"). Recent works have looked at even more complex questions.

m. Relationship extraction: Given a chunk of text, identify the relationships among named entities (e.g. who is the wife of whom).

n. Sentence breaking (also known as sentence boundary disambiguation): Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).

o. Sentiment analysis: Extract subjective information, usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion in the social media, for the purpose of marketing.

p. Speech recognition: Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process.

q. Speech segmentation: Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition and typically grouped with it.
r. Topic segmentation and recognition: Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment.

s. Word segmentation: Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.

t. Word sense disambiguation: Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.

In some cases, sets of related tasks are grouped into subfields of NLP that are often considered separately from NLP as a whole. Examples include:

Information retrieval (IR): This is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.

Information extraction (IE): This is concerned in general with the extraction of semantic information from text. This covers tasks such as named entity recognition, coreference resolution, relationship extraction, etc.

Speech processing: This covers speech recognition, text-to-speech and related tasks.

Stemming is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form.
The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty", etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root), but "argument" and "arguments" reduce to the stem "argument". The design of stemmers is language specific, and requires some to significant linguistic expertise in the language, as well as an understanding of the needs of a spelling checker for that language. A typical simple stemmer algorithm involves removing suffixes using a list of frequent suffixes, while a more complex one would use morphological knowledge to derive a stem from the words. Words that are identified to have the same root form are grouped in a cluster, with the identified root word as the cluster centre.

An inflectional suffix is a terminal affix that does not change the word-class (part of speech) of the root during concatenation; it is added to maintain the syntactic environment of the root in Bangla. On the other hand, derivational suffixes change the word-class (part of speech) and the orthographic form of the root word.

Experiments have been carried out with two types of algorithms: a simple suffix stripping algorithm and a score based stemming cluster identification algorithm. The suffix stripping algorithm simply checks whether a word has any suffixes (one or more) from a manually generated suffix list, and then the word is assigned to the appropriate cluster whose centre is the assumed root word, i.e., the form obtained after deleting the suffix from the surface form. The suffix stripping algorithm works well for the Noun, Adjective and Adverb categories. The words of other part-of-speech categories, especially Verbs, follow derivational morphology.

The score based stemming technique has been designed to resolve the stem for inflected word forms. The technique uses the Minimum Edit Distance method, well known for spelling error detection, to measure the cost of classifying every word as being in a particular class. The score based technique considers two standard operations of Minimum Edit Distance, i.e., insertion and deletion. The consideration range of insertion and deletion for the present task is a maximum of three characters. The idea is that the present word matches an existing cluster centre after insertion and/or deletion of at most three characters. The present word will be assigned to the cluster that can be reached with the minimum number of insertions and/or deletions.
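The insertion/deletion-only edit distance just described can be sketched as follows. The three-operation cap comes from the text above; the function names and the cluster-centre list are illustrative:

```python
def ins_del_distance(a, b):
    """Edit distance allowing only insertions and deletions.
    With these two operations the distance equals
    len(a) + len(b) - 2 * LCS(a, b), so we compute the longest
    common subsequence by dynamic programming."""
    m, n = len(a), len(b)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return m + n - 2 * lcs[m][n]

def nearest_cluster(word, centres, max_ops=3):
    """Assign `word` to the cluster centre reachable with the fewest
    insertions/deletions, if that cost is within `max_ops`."""
    best = min(centres, key=lambda c: ins_del_distance(word, c))
    return best if ins_del_distance(word, best) <= max_ops else None
```

For example, "fishing" reaches the centre "fish" with three deletions, so it joins that cluster; a word more than three operations away from every centre is left unassigned.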
This is an iterative clustering mechanism for assigning each word to a cluster. A separate list of verb inflections (only 50 entries; manually edited) has been maintained to validate the result of the score based technique.

Stemming algorithms can be broadly classified into two categories, namely Rule Based and Statistical.

2.1. Rule Based Approach

In a rule based approach, language specific rules are encoded, and based on these rules stemming is performed. In this approach various conditions are specified for converting a word to its derivational stem, a list of all valid stems is given, and there are also some exceptional rules which are used to handle the exceptional cases. For example, the word "absorption" is derived from the stem "absorpt" and "absorbing" is derived from the stem "absorb". The problem of spelling exceptions arises in the above case when we try to match the two stems "absorpt" and "absorb". Such exceptions are handled very carefully by introducing recoding and partial matching techniques in the stemmer as post-stemming procedures.
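The absorpt/absorb case above can be sketched as a tiny rule table followed by a recoding step. The two rules and the recoding entry here are purely illustrative:

```python
# Illustrative rule-based stemming with a recoding table for spelling
# exceptions (the "absorpt"/"absorb" case discussed in the text).
RULES = [("ption", "pt"), ("ing", "")]   # suffix -> replacement
RECODING = {"absorpt": "absorb"}         # post-stemming recoding step

def rule_stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            word = word[:-len(suffix)] + replacement
            break
    # recoding maps exceptional stem spellings onto a common stem
    return RECODING.get(word, word)
```

Here "absorption" first becomes "absorpt" via the rule table and is then recoded to "absorb", so it conflates with "absorbing".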
Advantages of the Rule Based Approach are:
1. They are fast in nature, i.e. the computation time used to find a stem is lower.
2. The retrieval results for English using a Rule Based Stemmer are very high.

But one of the main disadvantages of a Rule Based Stemmer is that one needs extensive language expertise to build it.

2.2. Statistical Approach

Statistical stemming is an effective and popular approach in information retrieval. Some recent studies show that statistical stemmers are good alternatives to rule-based stemmers. Additionally, their advantage lies in the fact that they do not require language expertise. Rather, they employ statistical information from a large corpus of a given language to learn the morphology of words. Yet Another Suffix Stripper (YASS) is one such statistics based, language independent stemmer. Its performance is comparable to that of the Porter and Lovins stemmers, both in terms of average precision and the total number of relevant documents retrieved, and it addresses the challenge of retrieval from languages with poor resources. GRAS is a graph based, language independent stemming algorithm for information retrieval [19]. The following features make this algorithm attractive and useful: (1) retrieval effectiveness, (2) generality, that is, its language-independent nature, and (3) low computational cost.

Advantages of a Statistical Stemmer are:
1. Statistical stemmers are useful for languages having scarce resources.
2. This approach yields the best retrieval results for suffixing languages, or languages which are morphologically more complex like French, Portuguese, Hindi, Marathi, and Bengali, rather than English.

A disadvantage of the statistical approach is that statistical stemmers are time consuming, because for these stemmers to work we need complete language coverage in terms of the morphology of words, their variants, etc.
3. RELATED WORK

Martin Porter developed the Porter Stemmer, which is a conflation stemmer, in 1980 at the University of Cambridge [5]. The Porter Stemmer uses the fact that English language suffixes are mostly a combination of smaller and simpler suffixes. Porter designed a rule-based stemmer with five steps, each of which applies a set of rules.

Ramanathan and Rao (2003) proposed a lightweight stemmer for Hindi which uses a hand crafted suffix list and performs longest match stripping. Light stemming refers to stripping a small set of prefixes and/or suffixes, without trying to deal with infixes or to recognize patterns and find roots. This lightweight stemmer proposed for Hindi is based on the grammar of the Hindi language, in which a list of 65 suffixes in total is generated manually. Terms are conflated by stripping off word endings from the suffix list on a 'longest match' basis. Noun, adjective and verb inflections have been discussed, and based on these, 65 unique suffixes were collected. The major advantage of this approach is that it is computationally inexpensive. Documents were chosen from varied domains such as Films, Health, Business, Sports and Politics. The collection contained 35977 unique words. Under-stemming and over-stemming errors calculated in this methodology were 4.68% and 13.84% respectively. No recall/precision-based evaluation of the work has been reported; thus the effectiveness of this stemming procedure is difficult to estimate.

Majumder et al. (2007) developed the statistical approach YASS: Yet Another Suffix Stripper, which uses a clustering based approach based on string distance measures and requires no linguistic knowledge. They concluded that stemming improves the recall of IR systems for Indian languages like Bengali. YASS is based on a string distance measure which is used to cluster a lexicon created from a text corpus into homogeneous groups.
Each group is expected to represent an equivalence class consisting of morphological variants of a single root word.

Dasgupta and Ng (2006) proposed unsupervised morphological parsing of Bengali. Unsupervised morphological analysis is the task of segmenting words into prefixes, suffixes and stems without prior knowledge of language-specific morphotactics and morphophonological rules. This parser is composed of two steps: (1) inducing prefixes, suffixes and roots from a vocabulary consisting of words taken from a large, unannotated corpus, and (2) segmenting a word based on these induced morphemes. When evaluated on a set of 4,110 human-segmented Bengali words, their algorithm achieves 83% success.

Pandey and Siddiqui (2008) [17] proposed an unsupervised stemming algorithm for Hindi based on the approach of Goldsmith (2001) [69]. It is based on the split-all method. For unsupervised learning (training), words from Hindi documents in the EMILLE corpus have been extracted. These words have been split to give n-gram (n = 1, 2, 3, ..., l) suffixes, where l is the length of the word. Then suffix and stem probabilities are computed. These probabilities are multiplied to give a split probability. The optimal segment corresponds to the maximum split probability. Some post-processing steps have been taken to refine the
learned suffixes. It is evaluated on 1000 words randomly extracted from the Hindi WordNet database. The training data was constructed by extracting 106403 words from the EMILLE corpus. The observed accuracy is 89.9% after applying some heuristic measures. The F-score is 94.96%. The algorithm does not require any language specific information.

Majgaonker and Siddiqui (2010) developed an unsupervised approach for a Marathi stemmer. Three different approaches (rule based, suffix stripping and statistical stripping) for suffix rule generation have been used in the unsupervised stemmer. The rule-based stemmer uses a set of manually extracted suffix stripping rules, whereas the unsupervised approach learns suffixes automatically from a set of words extracted from raw Marathi text. The performance of both stemmers has been compared on a test dataset consisting of 1500 manually stemmed words. The maximum accuracy observed is 82.5%, for the statistical suffix stripping approach. This approach uses a set of words to learn suffixes.

Suba et al. (2011) proposed two stemmers for Gujarati: a lightweight inflectional stemmer based on a hybrid approach, and a heavyweight derivational stemmer based on a rule-based approach. The inflectional stemmer has an average accuracy of about 90.7%, which is considerable as far as IR is concerned. The boost in accuracy due to POS based stemming was 9.6%, and due to the inclusion of language characteristics it was further boosted by 12.7%. The derivational stemmer has an average accuracy of 70.7%, which can act as a good baseline and can be useful in tasks such as dictionary search or data compression. The limitations of the inflectional stemmer can be easily overcome if modules like a Named Entity Recognizer are integrated with the system.

"A Light Weight Stemmer for Bengali and Its Use in Spelling Checker" by Md. Zahurul Islam, Md. Nizam Uddin and Mumit Khan, from the Center for Research on Bangla Language Processing, BRAC University, Dhaka, Bangladesh, presents a computationally inexpensive stemming algorithm for Bengali which handles suffix removal in a domain independent way. First, the spelling checker checks the given word against a lexicon containing only the root words. If the word is found, then it is a valid word, terminating the checking process. If the word is not found in the lexicon, they apply the stemming algorithm. There are two possible scenarios: the stemming algorithm finds and returns a stem, or it cannot find a possible suffix. Then they try to get a probable stem list with their suffixes from the modified stemming method. Correction accuracy for single error misspellings: 90.8%. Correction accuracy for multi-error misspellings: 67%.

In 2012 an iterative stemmer for the Tamil language was proposed by Vivekanandan Ramachandran et al. In this proposed model, a suffix stripper algorithm is used to stem Tamil words to their root words. Upendra Mishra and Chandra Prakash present a hybrid approach which is a combination of the brute force and suffix removal approaches and reduces the problems of over-stemming and under-stemming.
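The split-probability scoring used by the Goldsmith-style unsupervised stemmers above can be sketched as follows. Estimating the stem and suffix probabilities as relative frequencies over all candidate splits of the corpus is only one way to realize the description, and the corpus here is a toy:

```python
from collections import Counter

def best_split(word, words):
    """Split-all scoring: try every stem+suffix split of `word`, score
    each split by P(stem) * P(suffix) estimated from all candidate
    splits of the corpus, and return the maximum-probability split."""
    stem_counts, suffix_counts = Counter(), Counter()
    for w in words:
        for i in range(1, len(w) + 1):      # every split point of every word
            stem_counts[w[:i]] += 1
            suffix_counts[w[i:]] += 1
    total_stems = sum(stem_counts.values())
    total_suffixes = sum(suffix_counts.values())

    def score(i):
        return (stem_counts[word[:i]] / total_stems) * \
               (suffix_counts[word[i:]] / total_suffixes)

    i = max(range(1, len(word) + 1), key=score)
    return word[:i], word[i:]
```

On a toy English corpus containing several "-ing" words, the frequent suffix "ing" pulls the split of "playing" to "play" + "ing".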
4. PROPOSED APPROACH

Our proposed algorithm is based on a lightweight stemmer for Bengali verbs that strips the suffixes using a predefined suffix list, on a longest-match basis, and then finds the root on the basis of some rules. For this purpose, firstly the input file is read and the inflected verb forms are fetched. The inflexion of each such inflected verb is then compared with the suffixes in the suffix list and removed if any match is found. The subroot is then checked. If it ends with e-kar ( ), o-kar ( ), a-kar ( ) or aa-kar ( ), then replace the ending with aa-kar ( ). If it starts with e-kar ( ), u-kar ( ) or a-kar ( ), then replace the start with a-kar ( ), o-kar ( ) or aa-kar ( ) respectively. Generate the output doc file by copying the contents of the input file and concatenating each tagged word with its obtained root word wherever the word contains /verb. Finally, compare the generated output file with the desired output file and calculate the efficiency.

4.1. Overall Pictorial Representation

Figure 1 shows the overall flow of the proposed approach:
Reading Input Text -> Selecting & tagging verbs -> Fetching of tagged verbs -> Module 1: Applying suffix stripping -> Obtaining stripped part -> Module 2: Applying rules -> Generating Output File -> Calculating Efficiency

Figure 1: Pictorial representation of proposed approach
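The vowel-sign rules above can be sketched in code. The Bengali glyphs did not survive in this copy, so the characters below are assumptions: the standard Bengali vowel signs aa-kar U+09BE, e-kar U+09C7, o-kar U+09CB and u-kar U+09C1. The rules involving the report's "a-kar" are omitted here because that glyph is ambiguous without the original script:

```python
# Sketch of the rule module. The specific code points are assumed, since
# the report's Bengali glyphs were lost in this copy.
AA_KAR, E_KAR, O_KAR, U_KAR = "\u09be", "\u09c7", "\u09cb", "\u09c1"

def apply_rules(subroot):
    """Map a suffix-stripped verb form (subroot) to its root."""
    if subroot and subroot[-1] in (E_KAR, O_KAR):
        subroot = subroot[:-1] + AA_KAR          # ending kar -> aa-kar
    elif len(subroot) < 3 and not subroot.endswith(AA_KAR):
        subroot = subroot + AA_KAR               # short subroot: append aa-kar
    if subroot and subroot[0] == U_KAR:
        subroot = O_KAR + subroot[1:]            # starting u-kar -> o-kar
    return subroot
```

Under these assumptions, a stripped form like "dekhe" (ending in e-kar) maps to the dictionary root "dekha" (ending in aa-kar).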
4.1.1. Explanation of Proposed Approach with Example

Table 2 traces a sample Bengali sentence through each stage of the approach: reading the input text; selecting the verbs and tagging them with /verb; fetching the tagged verbs; applying suffix stripping, which splits each inflected verb into a subroot and a suffix; obtaining the stripped part; applying the rules to obtain the root; and generating the output file, in which each tagged verb is followed by its root.

Table 2: Proposed approach with example

4.1.2. Detailed Explanation of Module 1 (Suffix Stripping)

Figure 2 shows Module 1 as a flowchart. The suffix list is read and suffixes are fetched from it one by one. If the considered verb contains the current suffix, the suffix is stripped from the inflected verb and the subroot (stripped verb) is obtained; otherwise the next suffix is fetched, until all the suffixes have been tried.

Figure 2: Module 1 (Suffix Stripping)
4.1.3. Detailed Explanation of Module 2 (Applying Rules)

Figure 3 shows Module 2 as a flowchart. The stripped verb (subroot) is read and the following checks are applied in turn. If the subroot ends with e-kar, o-kar, a-kar or aa-kar, the ending kar is replaced with aa-kar; otherwise, if the length of the subroot is less than 3, it is concatenated with aa-kar. Then, if the subroot starts with e-kar, the starting kar is replaced with a-kar; if it starts with u-kar, the starting kar is replaced with o-kar; and if it starts with a-kar, the starting kar is replaced with aa-kar. The result is the root verb.

Figure 3: Module 2 (Applying Rules)

4.1.4. Sentence Collection

The Technology Development for Indian Languages (TDIL) Programme, initiated by the Department of Electronics & Information Technology (DeitY), Ministry of Communication & Information Technology (MC&IT), Govt. of India, has the objective of developing information processing tools and techniques to facilitate human-machine interaction without a language barrier; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services.
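The rule cascade of Module 2 can be sketched as below. Since the Bengali vowel signs cannot be reproduced here, single ASCII letters stand in for them (e = e-kar, o = o-kar, a = a-kar, A = aa-kar, u = u-kar); the control flow follows the flowchart, but the class name and the stand-in constants are illustrative assumptions, not the project code.

```java
// Sketch of Module 2's rewrite rules with ASCII stand-ins for the Bengali
// vowel signs (e = e-kar, o = o-kar, a = a-kar, A = aa-kar, u = u-kar).
public class KarRules {
    static final char E_KAR = 'e', O_KAR = 'o', A_KAR = 'a', AA_KAR = 'A', U_KAR = 'u';

    public static String applyRules(String subroot) {
        // Ending rule: e-kar, o-kar, a-kar or aa-kar at the end becomes aa-kar
        char last = subroot.charAt(subroot.length() - 1);
        if (last == E_KAR || last == O_KAR || last == A_KAR || last == AA_KAR) {
            subroot = subroot.substring(0, subroot.length() - 1) + AA_KAR;
        } else if (subroot.length() < 3) {
            // Otherwise a short subroot is concatenated with aa-kar
            subroot = subroot + AA_KAR;
        }
        // Starting rules: e-kar -> a-kar, u-kar -> o-kar, a-kar -> aa-kar
        char first = subroot.charAt(0);
        if (first == E_KAR) {
            subroot = A_KAR + subroot.substring(1);
        } else if (first == U_KAR) {
            subroot = O_KAR + subroot.substring(1);
        } else if (first == A_KAR) {
            subroot = AA_KAR + subroot.substring(1);
        }
        return subroot;
    }
}
```

The else-if chains matter: each subroot takes at most one ending rewrite and one starting rewrite, exactly as in the flowchart's single pass through the decision boxes.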
The Programme also promotes language technology standardization through active participation in international and national standardization bodies such as ISO, Unicode, the World Wide Web Consortium (W3C) and BIS (Bureau of Indian Standards), to ensure adequate representation of Indian languages in existing and future language technology standards.

The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while the information about the different inflexions of a particular verb is collected from a Bengali dictionary. We have selected 14 Bengali verbs and presented a sentence for each inflexion of a particular verb. Accordingly, we have applied our algorithm over 638 sentences.

4.1.5. Normalization

The Bengali text corpus developed in the TDIL project of the Govt. of India separates words by |, whereas we have separated words by spaces. Moreover, the end of each sentence is marked by |, and any other sentence-delimiting sign, e.g. question mark ?, comma ,, exclamation mark !, etc., is replaced by |.

Figure 4: Screen shot of un-normalized document

Figure 5: Screen shot of normalized document
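The normalization step described above can be sketched as a pair of regular-expression replacements. The class name is illustrative, and the ASCII | here stands in for the Bengali danda of the corpus.

```java
// Sketch of the normalization step: sentence-delimiting punctuation is
// rewritten to | and words end up separated by single spaces.
public class Normalizer {
    public static String normalize(String text) {
        // Replace question marks, commas and exclamation marks with |
        String normalized = text.replaceAll("[?,!]", "|");
        // Collapse runs of whitespace to single spaces
        return normalized.replaceAll("\\s+", " ").trim();
    }
}
```

For example, `Normalizer.normalize("ke tumi?")` yields `"ke tumi|"`, so every sentence in the corpus ends with the same delimiter before tagging.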
4.1.6. Tagging of Verbs

In every sentence, the inflected word whose root is to be found is tagged with /verb.

Figure 6: Screen shot of verb-tagged document

4.1.7. Preparing the Output File

An output file is prepared in which the inflected word of every sentence whose root is to be found is tagged as /verb/ concatenated with the actual root word. This file is prepared in order to calculate the efficiency of our proposed algorithm.

Figure 7: Screen shot of the desired output document

4.1.8. Preparing the Suffix List

After surveying the inflexions of various Bengali verbs across the 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, we have prepared a suffix list by selecting the 35 most frequently occurring suffixes.
4.1.9. Verification

The generated output file is compared with the prepared output file, and thereby the efficiency of the algorithm is calculated.

4.2. Algorithm

STEP 1. Start of algorithm.
STEP 2. Create four new String arrays, namely splits1[ ], splits2[ ], splits3[ ] and input1[ ].
STEP 3. Read the contents of the doc files and split the words by the space ( ) separator.
    3.1. Store the words of each sentence in splits1[ ].
    3.2. Store the inflexions in splits2[ ].
    3.3. Store the desired root words in splits3[ ].
STEP 4. Declare and initialize the variables l1 = length of splits1[ ] and l2 = length of splits2[ ].
STEP 5. Fetch the inflected verb forms into input1[ ] from splits1[i] if /verb is contained in the currently fetched word. This step is repeated l1 times.
    5.1. Determine the subroot from input1[i] by repeating the following steps l2 times.
        5.1.1. If splits2[j] is contained in input1[i] then,
            5.1.1.a. Declare a variable index which stores the index of the last occurrence of splits2[j] in input1[i].
            5.1.1.b. If index is greater than or equal to 2 then,
                5.1.1.b.i. Store the substring of input1[i] from begindex=0 to endindex=index in input1[i].
                5.1.1.b.ii. Break the loop.
    5.2. Determine the actual root input1[i] by repeating the following steps l1 times.
        5.2.1. Check the ending kar of input1[i].
            5.2.1.a. If input1[i] ends with e-kar, o-kar, a-kar or aa-kar, then replace it with aa-kar.
            5.2.1.b. If the length of input1[i] is less than 3, concatenate it with aa-kar.
        5.2.2. Check the starting kar of input1[i].
            5.2.2.a. If input1[i] starts with e-kar, then replace it with a-kar.
            5.2.2.b. If input1[i] starts with u-kar, then replace it with o-kar.
            5.2.2.c. If input1[i] starts with a-kar, then replace it with aa-kar.
STEP 6. Generate the output doc file by copying the contents of splits1[ ] and concatenating each word containing /verb with its obtained root word from input1[ ].
STEP 7. Compare the obtained sentences in splits1[ ] with the desired sentences in splits3[ ] and calculate the efficiency.
STEP 8. End of algorithm.
5. OUTPUT AND DISCUSSION:

5.1. Partial View of Input File:

Figure 8: Partial view of input file

5.2. Suffix List:

Figure 9: Screen shot of suffix list
5.3. Partial View of Output File:

Figure 10: Partial view of output file
5.4. EFFICIENCY:

Dealing with 500 sentences, our proposed approach gives an efficiency of 99.4%.

Figure 11: Screen shot of the efficiency of the proposed approach

5.5. TIME COMPLEXITY:

The worst-case time complexity of the proposed algorithm is O(n²).
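One plausible reading of the verification step is a token-by-token comparison of the generated and desired output files; the report does not spell out the exact matching granularity, so the method below is a sketch under that assumption, with an illustrative class name.

```java
// Sketch of the verification step: the fraction of generated tokens that
// match the desired output, reported as a percentage.
public class Efficiency {
    public static double efficiency(String[] generated, String[] desired) {
        int matched = 0;
        int total = Math.min(generated.length, desired.length);
        for (int i = 0; i < total; i++) {
            if (generated[i].equals(desired[i])) matched++;
        }
        return 100.0 * matched / total;
    }
}
```

With this reading, the reported 99.4% would mean 99.4% of tokens in the generated file agree with the hand-prepared desired output file.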
6. CONCLUSION AND FUTURE WORK:

Stemming plays a vital role in information retrieval systems, and its effect is very large. In this project, we present a lightweight stemmer for 14 selected Bengali verbs that strips suffixes using a predefined suffix list, on a longest-match basis, and then finds the root on the basis of some rules. Except in a few cases, the result obtained from our algorithm is quite satisfactory and in line with our expectations. We argue that a stronger and more fully populated learning set would invariably yield better results. In future, we plan to test our algorithm with more sets of Bengali verbs. As research in the Bengali language is much scarcer than in languages like English and Hindi, a lot of dimensions remain untouched. Using several relevant new approaches, a better Bengali stemmer can be developed, which will be useful for further linguistic computing.
ACKNOWLEDGEMENT

It gives us great pleasure to find an opportunity to express our deep and sincere gratitude to our project guide, Mr. Alok Ranjan Pal. We very respectfully recollect his constant encouragement, kind attention and keen interest throughout the course of our work. We are highly indebted to him for the way he modelled and structured our work with the valuable tips and suggestions that he accorded to us in every aspect of our work. We are extremely grateful to the Department of Computer Science & Engineering, CEMK, for extending all the facilities of our department. We humbly extend our sense of gratitude to the other faculty members, laboratory staff, library staff and administration of this institute for providing us their valuable help and time in a congenial working environment. Last but not least, we would like to convey our heartiest thanks to all our classmates who from time to time have helped us with their valuable suggestions during our project work.

Date: 23.05.2015

Anasuya Paul
University Roll: 10700111006
University Registration No: 111070110006

Joyeeta Bagchi
University Roll: 10700111021
University Registration No: 111070110021

Koushik Dutta
University Roll: 10700111024
University Registration No: 111070110024

Sneha Sarkar
University Roll: 10700111049
University Registration No: 111070110049
References:
1. A. Ramanathan and D. D. Rao, A Lightweight Stemmer for Hindi, in Workshop on Computational Linguistics for South-Asian Languages, EACL, 2003.
2. M. Z. Islam, M. N. Uddin and M. Khan, A Light Weight Stemmer for Bengali and its Use in Spelling Checker, in Proc. 1st Intl. Conf. on Digital Communications and Computer Applications (DCCA '07), Irbid, Jordan, March 19-23, 2007.
3. P. Majumder, M. Mitra, S. K. Parui, G. Kole, P. Mitra and K. Datta, YASS: Yet Another Suffix Stripper, ACM Transactions on Information Systems, 25(4):18-38, 2007.
4. S. Dasgupta and V. Ng, Unsupervised Morphological Parsing of Bengali, Language Resources and Evaluation, 40(3-4):311-330, 2006.
5. A. K. Pandey and T. J. Siddiqui, An Unsupervised Hindi Stemmer with Heuristic Improvements, in Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, 303:99-105, 2008.
6. M. M. Majgaonker and T. J. Siddiqui, Discovering Suffixes: A Case Study for Marathi Language, International Journal on Computer Science and Engineering, Vol. 02, No. 08, pp. 2716-2720, 2010.
7. K. Suba, D. Jiandani and P. Bhattacharyya, Hybrid Inflectional Stemmer and Rule-based Derivational Stemmer for Gujarati, in Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), IJCNLP 2011, Chiang Mai, Thailand, pp. 1-8, 2011.
8. M. F. Porter, An Algorithm for Suffix Stripping, Program, 14(3), pp. 130-137, 1980.
9. P. Kundu and B. B. Chaudhuri, Error Pattern in Bengali Text, International Journal of Dravidian Linguistics, 28(2), 1999.
10. B. B. Chaudhuri, Reversed Word Dictionary and Phonetically Similar Word Grouping Based Spell-Checker for Bengali Text, in Proceedings of the LESAL Workshop, 2001.
12. Sandipan Sarkar and Sivaji Bandyopadhyay, Study on Rule-Based Stemming Patterns and Issues in a Bengali Short Story-Based Corpus, in ICON 2009.
13. S. Dasgupta and M. Khan, Morphological Parsing of Bangla Words Using PC-KIMMO, in ICCIT 2004.
14. R. Barzilay and M. Elhadad, Using Lexical Chains for Text Summarization, in Proceedings of the Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 1997.
15. Pratikkumar Patel and Kashyap Popat, Hybrid Stemmer for Gujarati, in Proc. of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), pages 51-55, the 23rd International Conference on Computational Linguistics (COLING), Beijing, August 2010.
16. Upendra Mishra and Chandra Prakash, MAULIK: An Effective Stemmer for Hindi Language, International Journal on Computer Science and Engineering (IJCSE); Abduelbaset M. Goweder, Husien A. Alhammi, Tarik Rashed and Abdulsalam Musrat, A Hybrid Method for Stemming Arabic Text.
17. Kartik Suba, Dipti Jiandani and Pushpak Bhattacharyya, Hybrid Inflectional Stemmer and Rule-based Derivational Stemmer for Gujarati.
18. Haidar Harmanani, Walid Keirouz and Saeed Raheel, A Rule-Based Extensible Stemmer for Information Retrieval with Application to Arabic, The International Arab Journal of Information Technology, Vol. 3, July 2006.
19. Navanath Saharia, Utpal Sharma and Jugal Kalita, Analysis and Evaluation of Stemming Algorithms: A Case Study with Assamese, in ICACCI '12, August 3-5, 2012, Chennai, Tamil Nadu, India.
20. Nikhil Kanuparthi, Abhilash Inumella and Dipti Misra Sharma, Hindi Derivational Morphological Analyzer, in Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON 2012), pages 10-16, Montreal, Canada, June 7, 2012, Association for Computational Linguistics.
21. Juhi Ameta, Nisheeth Joshi and Iti Mathur, A Lightweight Stemmer for Gujarati.
22. Mohamad Ababneh, Riyad Al-Shalabi, Ghassan Kanaan and Alaa Al-Nobani, Building an Effective Rule-Based Light Stemmer for Arabic Language to Improve Search Effectiveness, The International Arab Journal of Information Technology, Vol. 9, No. 4, July 2012.
23. Anjali Ganesh Jivani, A Comparative Study of Stemming Algorithms, Int. J. Comp. Tech. Appl., Vol. 2(6), pp. 1930-1938.
24. M. F. Porter, An Algorithm for Suffix Stripping, Program, 14(3):130-137, 1980.
25. V. M. Orengo and C. Huyck, A Stemming Algorithm for the Portuguese Language, in Proceedings of the Eighth International Symposium on String Processing and Information Retrieval, pages 186-193, 2001.
26. Deepika Sharma, Stemming Algorithms: A Comparative Study and their Analysis, International Journal of Applied Information Systems (IJAIS), ISSN: 2249-0868, Foundation of Computer Science (FCS), New York, USA, Volume 4, No. 3, September 2012.
27. J. B. Lovins, Development of a Stemming Algorithm, Mechanical Translation and Computational Linguistics, 11(1-2):22-31, 1968.
Appendix:

1. Program Code:

package stemming_verb;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import java.io.*;

public class Stemming_verb {
    public static void main(String[] args) {
        File file1 = null, file2 = null, file3 = null, file4 = null;
        WordExtractor extractor1 = null, extractor2 = null, extractor3 = null, extractor4 = null;
        try {
            /*------------------ Reading sentences ------------------*/
            file1 = new File("G:\\Stemming\\final_project\\sentence_input.doc");
            FileInputStream fis1 = new FileInputStream(file1.getAbsolutePath());
            HWPFDocument document1 = new HWPFDocument(fis1);
            extractor1 = new WordExtractor(document1);
            String fileData1 = extractor1.getText();
            String[] splits1 = fileData1.split(" ");
            String[] input1 = new String[splits1.length];
            int l1 = splits1.length;

            /*------------------ Reading inflexions ------------------*/
            file2 = new File("G:\\Stemming\\final_project\\suffixes.doc");
            FileInputStream fis2 = new FileInputStream(file2.getAbsolutePath());
            HWPFDocument document2 = new HWPFDocument(fis2);
            extractor2 = new WordExtractor(document2);
            String fileData2 = extractor2.getText();
            String[] splits2 = fileData2.split(""); // separator literal lost in extraction of this listing
            int l2 = splits2.length;
            /*------------------- Reading desired output file -------------------*/
            file4 = new File("G:\\Stemming\\final_project\\sentence_output.doc");
            FileInputStream fis4 = new FileInputStream(file4.getAbsolutePath());
            HWPFDocument document4 = new HWPFDocument(fis4);
            extractor4 = new WordExtractor(document4);
            String fileData4 = extractor4.getText();
            String[] splits4 = fileData4.split(" ");
            int l4 = splits4.length;

            /*------------------- Suffix stripping -------------------*/
            // The loop headers below were truncated in extraction;
            // they are reconstructed from Step 5 of the algorithm.
            int verb = 0;
            for (int i = 0; i < l1; i++) {
                if (splits1[i].contains("/verb")) {
                    input1[i] = splits1[i];
                    verb++;
                    for (int j = 0; j < l2; j++) {
                        if (input1[i].contains(splits2[j])) {
                            int index = input1[i].lastIndexOf(splits2[j]);
                            if (index >= 2) {
                                input1[i] = input1[i].substring(0, index);
                                break;
                            }
                        }
                    }
                }
            }
            /*-------------------- Applying rules ------------------*/
            // The body of this loop was lost in extraction; it is reconstructed
            // from Step 5.2 of the algorithm. E_KAR, O_KAR, A_KAR, AA_KAR and
            // U_KAR stand for the Bengali vowel-sign string literals of the
            // original listing, which could not be recovered.
            for (int i = 0; i < l1; i++) {
                if (input1[i] == null) continue;
                // Ending rule: e-kar, o-kar or a-kar becomes aa-kar
                if (input1[i].endsWith(E_KAR) || input1[i].endsWith(O_KAR) || input1[i].endsWith(A_KAR)) {
                    input1[i] = input1[i].substring(0, input1[i].length() - 1) + AA_KAR;
                } else if (input1[i].length() < 3) {
                    input1[i] = input1[i] + AA_KAR;
                }
                // Starting rules: e-kar -> a-kar, u-kar -> o-kar, a-kar -> aa-kar
                if (input1[i].startsWith(E_KAR)) {
                    input1[i] = A_KAR + input1[i].substring(1);
                } else if (input1[i].startsWith(U_KAR)) {
                    input1[i] = O_KAR + input1[i].substring(1);
                } else if (input1[i].startsWith(A_KAR)) {
                    input1[i] = AA_KAR + input1[i].substring(1);
                }
            }

            /*-------------------- Writing obtained root words to doc file ----------------------*/
            boolean append75 = true;
            FileWriter write75 = new FileWriter("G:\\Stemming\\final_project\\output_new.doc", append75);
            PrintWriter print_line75 = new PrintWriter(write75);
            String word1, word2;
            // The remainder of the listing was truncated in extraction; the loop
            // below is reconstructed from Step 6 of the algorithm. Step 7 (the
            // efficiency comparison) followed in the original listing.
            for (int i = 0; i < l1; i++) {
                word1 = splits1[i];
                print_line75.print(word1 + " ");
                if (word1.contains("/verb")) {
                    print_line75.print(input1[i] + " ");
                }
            }
            print_line75.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}