A RULE BASED APPROACH ON STEMMING OF BENGALI VERBS A Project Work Submitted in Partial Fulfilment of the Requirements for the Degree of BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE & ENGINEERING by ANASUYA PAUL (Roll No. 10700111006) JOYEETA BAGCHI (Roll No. 10700111021) KOUSHIK DUTTA (Roll No. 10700111024) SNEHA SARKAR (Roll No. 10700111049) Under the supervision of Mr. Alok Ranjan Pal DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT (Affiliated to West Bengal University of Technology) Purba Medinipur – 721171, West Bengal, India
2. 1 DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT (Affiliated to West Bengal University of Technology) Purba Medinipur – 721171, West Bengal, India CERTIFICATE OF APPROVAL This is to certify that the
work embodied in this project entitled A RULE BASED APPROACH ON
STEMMING OF BENGALI VERBS submitted by Anasuya Paul, Joyeeta
Bagchi, Koushik Dutta and Sneha Sarkar to the Department of
Computer Science & Engineering, is carried out under my direct
supervision and guidance. The project work has been prepared as per
the regulations of West Bengal University of Technology and I
strongly recommend that this project work be accepted in fulfilment
of the requirement for the degree of B.Tech. Supervisor Mr. Alok
Ranjan Pal Asst. Prof., Dept. of CSE Countersigned by Prof. (Dr.)
Dilip Kumar Gayen Head Department of CSE
3. 2 DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT (Affiliated to West Bengal University of Technology) Purba Medinipur – 721171, West Bengal, India Certificate by the Board of Examiners This is to
certify that the project work entitled A RULE BASED APPROACH ON
STEMMING OF BENGALI VERBS submitted by Anasuya Paul, Joyeeta
Bagchi, Koushik Dutta and Sneha Sarkar to the Department of
Computer Science and Engineering of College of Engineering & Management, Kolaghat has been examined and evaluated. The project
work has been prepared as per the regulations of West Bengal
University of Technology and qualifies to be accepted in fulfilment
of the requirement for the degree of B. Tech. Project Co-ordinator
Board of Examiners
4. 3 ABSTRACT Based on the various inflexions of verbs available in the Bengali dictionary, an attempt is made to retrieve the stem word from its inflexions in the underlying sentences. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while the information about the different inflexions of a particular verb is collected from the Bengali dictionary. In this project, we present a lightweight stemmer for 14 selected Bengali verbs that strips the suffixes using a predefined suffix list, on a longest-match basis, and then finds the root on the basis of some rules. We have applied the algorithm over 450 sentences and achieved around 99.36% accuracy in retrieving the root word from its inflexions in the underlying sentences. The proposed stemmer is both computationally inexpensive and domain independent.
5. 4 INDEX
Sl. No.  TITLE  Pg. No.
1. Introduction  5-6
2. Theoretical Study  7-12
3. Related Work  13-14
4. Proposed Approach  15-21
  4.1. Overall Pictorial Representation  15
  4.1.1. Explanation of Proposed Approach with Example  16
  4.1.2. Detailed Explanation of Module 1 (Suffix Stripping)  16
  4.1.3. Detailed Explanation of Module 2 (Applying Rules)  17
  4.1.4. Sentence Collection  17
  4.1.5. Normalization  18
  4.1.6. Tagging of Verbs  19
  4.1.7. Preparing Output File  19
  4.1.8. Preparing Suffix List  19
  4.1.9. Verification  20
  4.2. Algorithm  20-21
5. Output and Discussion  22-24
  5.1. Partial View of Input File  22
  5.2. Suffix List  22
  5.3. Partial View of Output File  23
  5.4. Efficiency  24
  5.5. Time Complexity  24
6. Conclusion and Future Work  25
i. Acknowledgement  26
ii. References  27-28
iii. Appendix  29-32
6. 5 1. INTRODUCTION Stemming is an operation that splits a
word into the constituent root part and affix without doing
complete morphological analysis. It is used to improve the
performance of spelling checkers and information retrieval
applications, where morphological analysis would be too
computationally expensive. It is a pre-processing step in Text
Mining applications as well as a very common requirement of Natural
Language processing functions. The main purpose of stemming is to
reduce different grammatical forms / word forms of a word like its
noun, adjective, verb, adverb etc. to its root form. We can say
that the goal of stemming is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base
form. Bengali is one of the most morphologically rich languages. More than one inflection can be applied to a stem to form a word type. Stemming is a hard problem for the four categories Noun, Adjective, Adverb and Verb, but Verb is the most problematic area for stemming. Bangla has a vast inflectional system; the number of inflected and derivational forms of a certain lexicon is huge. There are nearly 50 (10 × 5) forms for a certain verb in Bengali, as there are 10 tenses and 5 persons, and a root verb changes its form according to tense and person: the verb root KA alone has some 20 forms. Other than this, there are lots of prefixes and suffixes which can attach to a root word and form a new word. Different forms of the verb root DEKHA are dekhi, dekhis, dekh, dekhe, dekhen, dekhbo, dekhbi, dekhbe, dekhben, dekhchi, dekhchis, dekhche, dekhchen, dekhchilam, dekhchili, dekhchilo, dekhchilen, dekhlam, dekhli, dekhlo, dekhlen, dekhtis, dekhtam, dekhto, dekhten, dekhai, dekhay, dekhas, dekhao, dekhechi, dekhecho, dekhechis, dekhechen, dekhtei, dekhar, dekhabo, dekhaben, dekhabi, etc. Different suffixes that are added to a root word to form a new word are chilen, chilam, chilis, chilo, chile, chili, chen, lam, len, tam, tei, tis, ten, ben, chi, che, bi, be, te, le, li, lo, to, etc.
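The longest-match stripping that the report relies on can be sketched in a few lines. This is not the report's exact implementation: the suffix list below is a partial, romanized stand-in for the report's Bengali suffix list (the Bengali script is not reproduced in this copy), so the entries are assumptions.

```python
# A minimal sketch of longest-match suffix stripping. The suffix list is a
# romanized, partial stand-in for the report's Bengali list (an assumption).
SUFFIXES = ["chilen", "chilam", "chilis", "chilo", "chile", "chili",
            "chen", "lam", "len", "tam", "tei", "tis", "ten", "ben",
            "chi", "che", "bi", "be", "te", "le", "li", "lo", "to"]

def strip_suffix(word):
    """Strip the longest matching suffix; return (stem, suffix)."""
    # Try longer suffixes first so that e.g. "chilam" wins over "lam".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], suffix
    return word, ""

print(strip_suffix("dekhchilam"))  # -> ('dekh', 'chilam')
print(strip_suffix("dekhlam"))     # -> ('dekh', 'lam')
```

Sorting by length guarantees that a longer suffix such as chilam is preferred over its embedded shorter suffix lam, which is the essence of longest-match stripping.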
7. 6 Overview of Stemming of Bengali Verbs. Table 1 (Stemming of Bengali Verbs) lists, for each root word, the inflected verb form and its decomposition into the stripped word plus suffixes; the Bengali entries are not reproduced in this copy. We review the existing work in this area in Section 3; we then present the proposed stemming algorithm in Section 4, followed by its output, discussion and evaluation in Section 5. Finally, we conclude with a look at future research directions in Section 6.
8. 7 2. THEORETICAL STUDY Natural language processing (NLP) is
a field of computer science, artificial intelligence, and
computational linguistics concerned with the interactions between
computers and human (natural) languages. As such, NLP is related to
the area of human-computer interaction. Many challenges in NLP
involve natural language understanding, that is, enabling computers
to derive meaning from human or natural language input, and others
involve natural language generation. The history of NLP generally
starts in the 1950s, although work can be found from earlier
periods. In 1950, Alan Turing published an article titled
"Computing Machinery and Intelligence" which proposed what is now
called the Turing test as a criterion of intelligence. Some notably
successful NLP systems developed in the 1960s were SHRDLU, a
natural language system working in restricted "blocks worlds" with
restricted vocabularies, and ELIZA, a simulation of a Rogerian
psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.
Using almost no information about human thought or emotion, ELIZA
sometimes provided a startlingly human-like interaction. When the
"patient" exceeded the very small knowledge base, ELIZA might
provide a generic response, for example, responding to "My head
hurts" with "Why do you say your head hurts?". Modern NLP
algorithms are based on machine learning, especially statistical
machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing, which relied on hand-written rules. The machine-learning paradigm calls instead for using general learning algorithms, often (although not always) grounded in statistical inference, to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural,
"corpora") is a set of documents (or sometimes, individual
sentences) that have been hand-annotated with the correct values to
be learned. The following is a list of some of the most commonly
researched tasks in NLP. What distinguishes these tasks from other
potential and actual NLP tasks is not only the volume of research
devoted to them but the fact that for each one there is typically a
well-defined problem setting, a standard metric for evaluating the
task, standard corpora on which the task can be evaluated, and
competitions devoted to the specific task. a. Automatic
summarization Produce a readable summary of a chunk of text. Often
used to provide summaries of text of a known type, such as articles
in the financial section of a newspaper. b. Coreference resolution
Given a sentence or larger chunk of text, determine which words
("mentions") refer to the same objects ("entities"). Anaphora
resolution is a specific example of this task, and is specifically
concerned with matching up pronouns with the nouns or names that
they refer to. The more general task of co-reference resolution
also includes identifying so-called "bridging relationships"
involving referring expressions. For example, in a sentence such as
"He entered John's house through the front door", "the front door"
is a referring expression and the bridging relationship to be
identified is the fact that the door being referred to is the front
door of John's house (rather than of some other structure that
might also be referred to).
9. 8 c. Discourse analysis This rubric includes a number of
related tasks. One task is identifying the discourse structure of
connected text, i.e. the nature of the discourse relationships
between sentences (e.g. elaboration, explanation, contrast).
Another possible task is recognizing and classifying the speech
acts in a chunk of text (e.g. yes-no question, content question,
statement, assertion, etc.). d. Machine translation Automatically
translate text from one human language to another. This is one of
the most difficult problems, and is a member of a class of problems
colloquially termed "AI-complete", i.e. requiring all of the
different types of knowledge that humans possess (grammar,
semantics, facts about the real world, etc.) in order to solve
properly. e. Morphological segmentation Separate words into
individual morphemes and identify the class of the morphemes. The
difficulty of this task depends greatly on the complexity of the
morphology (i.e. the structure of words) of the language being
considered. English has fairly simple morphology, especially
inflectional morphology, and thus it is often possible to ignore
this task entirely and simply model all possible forms of a word
(e.g. "open, opens, opened, opening") as separate words. In
languages such as Turkish, however, such an approach is not
possible, as each dictionary entry has thousands of possible word
forms. The same holds not only for Turkish but also for Manipuri [4], which is a highly agglutinative Indian language. f. Named entity recognition
(NER) Given a stream of text, determine which items in the text map
to proper names, such as people or places, and what the type of
each such name is (e.g. person, location, organization). Capitalization can help to recognize named entities in a language such as English, but this cue alone is insufficient: for example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. g. Natural language generation Convert information
from computer databases into readable human language. h. Natural
language understanding Convert chunks of text into more formal
representations such as first-order logic structures that are
easier for computer programs to manipulate. Natural language
understanding involves the identification of the intended semantics from the multiple possible semantics that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. The introduction and creation of a language metamodel and ontology are efficient, though empirical, solutions. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed world assumption (CWA) vs. the open world assumption, or subjective Yes/No vs. objective True/False, is expected for the construction of a basis of semantics formalization. i. Optical character recognition (OCR) Given an image representing printed text, determine the corresponding text.
10. 9 j. Part-of-speech tagging Given a sentence, determine the
part of speech for each word. Many words, especially common ones,
can serve as multiple parts of speech. For example, "book" can be a
noun ("the book on the table") or verb ("to book a flight"); "set"
can be a noun, verb or adjective; and "out" can be any of at least
five different parts of speech. Some languages have more such
ambiguity than others. Languages with little inflectional
morphology, such as English, are particularly prone to such
ambiguity. k. Parsing Determine the parse tree (grammatical
analysis) of a given sentence. The grammar for natural languages is
ambiguous and typical sentences have multiple possible analyses. In
fact, perhaps surprisingly, for a typical sentence there may be
thousands of potential parses (most of which will seem completely
nonsensical to a human). l. Question answering Given a
human-language question, determine its answer. Typical questions
have a specific right answer (such as "What is the capital of
Canada?"), but sometimes open-ended questions are also considered
(such as "What is the meaning of life?"). Recent works have looked
at even more complex questions. m. Relationship extraction Given a
chunk of text, identify the relationships among named entities
(e.g. who is the wife of whom). n. Sentence breaking (also known as
sentence boundary disambiguation) Given a chunk of text, find the
sentence boundaries. Sentence boundaries are often marked by
periods or other punctuation marks, but these same characters can
serve other purposes (e.g. marking abbreviations). o. Sentiment
analysis Extract subjective information usually from a set of
documents, often using online reviews to determine "polarity" about
specific objects. It is especially useful for identifying trends of
public opinion in the social media, for the purpose of marketing.
p. Speech recognition Given a sound clip of a person or people
speaking, determine the textual representation of the speech. This
is the opposite of text to speech and is one of the extremely
difficult problems colloquially termed "AI-complete" (see above).
In natural speech there are hardly any pauses between successive
words, and thus speech segmentation is a necessary subtask of
speech recognition (see below). Note also that in most spoken
languages, the sounds representing successive letters blend into
each other in a process termed coarticulation, so the conversion of
the analog signal to discrete characters can be a very difficult
process. q. Speech segmentation Given a sound clip of a person or
people speaking, separate it into words. A subtask of speech
recognition and typically grouped with it.
11. 10 r. Topic segmentation and recognition Given a chunk of
text, separate it into segments each of which is devoted to a
topic, and identify the topic of the segment. s. Word segmentation
Separate a chunk of continuous text into separate words. For a
language like English, this is fairly trivial, since words are
usually separated by spaces. However, some written languages like
Chinese, Japanese and Thai do not mark word boundaries in such a
fashion, and in those languages text segmentation is a significant
task requiring knowledge of the vocabulary and morphology of words
in the language. t. Word sense disambiguation Many words have more
than one meaning; we have to select the meaning which makes the
most sense in context. For this problem, we are typically given a
list of words and associated word senses, e.g. from a dictionary or
from an online resource such as WordNet. In some cases, sets of
related tasks are grouped into subfields of NLP that are often
considered separately from NLP as a whole. Examples include:
Information retrieval (IR) This is concerned with storing,
searching and retrieving information. It is a separate field within
computer science (closer to databases), but IR relies on some NLP
methods (for example, stemming). Some current research and
applications seek to bridge the gap between IR and NLP. Information
extraction (IE) This is concerned in general with the extraction of
semantic information from text. This covers tasks such as named
entity recognition, Coreference resolution, relationship
extraction, etc. Speech processing This covers speech recognition,
text-to-speech and related tasks. Stemming is the term used in
linguistic morphology and information retrieval to describe the
process for reducing inflected (or sometimes derived) words to
their word stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the
word; it is usually sufficient that related words map to the same
stem, even if this stem is not in itself a valid root. A stemmer
for English, for example, should identify the string "cats" (and
possibly "catlike", "catty" etc.) as based on the root "cat", and
"stemmer", "stemming", "stemmed" as based on "stem". A stemming
algorithm reduces the words "fishing", "fished", and "fisher" to
the root word, "fish". On the other hand, "argue", "argued",
"argues", "arguing", and "argus" reduce to the stem "argu"
(illustrating the case where the stem is not itself a word or root)
but "argument" and "arguments" reduce to the stem "argument". The
design of stemmers is language specific, and requires some significant linguistic expertise in the language, as well as an understanding of the needs of a spelling checker for that
language. A typical simple stemmer algorithm involves removing
suffixes using a
12. 11 list of frequent suffixes, while a more complex one
would use morphological knowledge to derive a stem from the words.
Words that are identified to have the same root form are grouped in a
cluster with the identified root word as cluster centre. An
inflectional suffix is a terminal affix that does not change the
word-class (parts of speech) of the root during concatenation; it
is added to maintain the syntactic environment of the root in
Bangla. On the other hand, derivational suffixes change word-class
(parts of speech) and the orthographic-form of the root word.
Experiments have been carried out with two types of algorithms:
simple suffix stripping algorithm and score based stemming cluster
identification algorithm. The Suffix stripping algorithm simply
checks if a word has any suffixes (one or more) from a manually generated suffix list, and then the word is assigned
to the appropriate cluster where cluster centre is the assumed root
word, i.e., the form obtained after deleting the suffix from the
surface form. Suffix stripping algorithm works well for Noun,
Adjective and Adverb categories. The words of other part-of-speech categories, especially verbs, follow derivational morphology. The
score based stemming technique has been designed to resolve the
stem for inflected word forms. The technique uses Minimum Edit
Distance method, well known for spelling error detection, to
measure the cost of classifying every word being in a particular
class. Score based technique considers two standard operations of
Minimum Edit Distance, i.e., insertion and deletion. The
consideration range of insertion and deletion for the present task
is maximum three characters. The idea is that the present word
matches an existing cluster centre after insertion and/or deletion
of maximum three characters. The present word will be assigned to
the cluster that can be reached with minimum number of insertion
and/or deletion. This is an iterative clustering mechanism for
assigning each word into a cluster. A separate list of verb
inflections (only 50 entries; manually edited) has been maintained
to validate the result of the score based technique. Stemming
algorithms can be broadly classified into two categories, namely
Rule Based and Statistical. 2.1. Rule Based Approach In a rule-based approach, language-specific rules are encoded, and stemming is performed based on these rules. In this approach various
conditions are specified for converting a word to its derivational
stem, a list of all valid stems are given and also there are some
exceptional rules which are used to handle the exceptional cases.
For example the word absorption is derived from the stem absorpt
and absorbing is derived from the stem absorb. The problem of the
spelling exceptions arises in the above case when we try to match
the two words absorpt and absorb. Such exceptions are handled very
carefully by introducing recoding and partial-matching techniques
in the stemmer as post stemming procedures.
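The score-based technique described earlier measures closeness with a Minimum Edit Distance restricted to insertions and deletions, allowing at most three operations when matching a word against a cluster centre. A minimal sketch of that restricted distance, using illustrative English strings in place of the Bengali forms (the cluster centres and helper names here are assumptions; only the insert/delete-only distance and the three-operation cap come from the text):

```python
def indel_distance(a, b):
    """Edit distance using only insertions and deletions (no substitution).
    It equals len(a) + len(b) - 2 * LCS(a, b)."""
    m, n = len(a), len(b)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]  # longest-common-subsequence table
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return m + n - 2 * lcs[m][n]

def assign_cluster(word, centres, max_ops=3):
    """Assign word to the nearest cluster centre reachable with at most
    max_ops insertions/deletions; return None if no centre qualifies."""
    best = min(centres, key=lambda c: indel_distance(word, c))
    return best if indel_distance(word, best) <= max_ops else None

print(assign_cluster("fishing", ["fish", "argu"]))  # -> fish
```

Here "fishing" reaches the centre "fish" by deleting three characters, so it falls within the three-operation range and joins that cluster.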
13. 12 Advantages of Rule Based Approach are: 1. They are fast in nature, i.e. the computation time used to find a stem is lower. 2. The retrieval results for English using a Rule Based Stemmer are very high. But one of the main disadvantages of a Rule Based Stemmer is that one needs extensive language expertise to build one. 2.2. Statistical Approach Statistical stemming is an
effective and popular approach in information retrieval. Some
recent studies show that statistical stemmers are good alternatives
to rule-based stemmers. Additionally, their advantage lies in the
fact that they do not require language expertise. Rather they
employ statistical information from a large corpus of a given
language to learn morphology of words. Yet another suffix stripper
(YASS) is one such statistics-based, language-independent stemmer. Its performance is comparable to that of the Porter and Lovins stemmers, both in terms of average precision and the total number of relevant documents retrieved, and it addresses the challenge of retrieval from languages with poor resources. GRAS is a graph-based, language-independent stemming algorithm for information retrieval [19]. The
following features make this algorithm attractive and useful: (1)
retrieval effectiveness, (2) generality, that is, its
language-independent nature, and (3) low computational cost.
Advantages of Statistical Stemmer are: 1. Statistical stemmers are
useful for languages having scarce resources. 2. This approach
yields the best retrieval results for suffixing languages or languages which are morphologically more complex, like French, Portuguese, Hindi, Marathi, and Bengali, rather than English.
A disadvantage of the statistical approach is that statistical stemmers are time consuming, because for these stemmers to work we need complete language coverage in terms of the morphology of words, their variants, etc.
14. 13 3. RELATED WORK Martin Porter developed the Porter
Stemmer, which is a conflation stemmer, in 1980 at the University of
Cambridge [5]. The Porter Stemmer uses the fact that English
language suffixes are mostly a combination of smaller and simpler
suffixes. Porter designed a rule-based stemmer with five steps,
each of which applies a set of rules. Ramanathan and Rao (2003)
proposed a lightweight stemmer for Hindi which has used a hand
crafted suffix list and has performed longest match stripping.
Light stemming refers to stripping of a small set of either
prefixes or suffixes or both, without trying to deal with infixes,
or recognize patterns and find roots. This lightweight stemmer
proposed for Hindi is based on the grammar for Hindi language in
which a list of total 65 suffixes is generated manually. Terms are
conflated by stripping off word endings from a suffix list on a
`longest match' basis. Noun, adjective and verb inflections have been discussed, and based on these, 65 unique suffixes are collected.
The major advantage of this approach is that it is computationally inexpensive. Documents were chosen from varied domains such as
Films, Health, Business, Sports and Politics. The collection
contained 35,977 unique words. Under-stemming and over-stemming errors calculated in this methodology were 4.68% and 13.84%
respectively. No recall/precision-based evaluation of the work has
been reported; thus the effectiveness of this stemming procedure is
difficult to estimate. Majumder et al. (2007) developed the statistical approach YASS: Yet Another Suffix Stripper, which uses a clustering-based approach built on string distance measures and requires no
linguistic knowledge. They concluded that stemming improves recall
of IR systems for Indian languages like Bengali. YASS is based on
string distance measure which is used to cluster a lexicon created
from a text corpus into homogenous groups. Each group is expected
to represent an equivalence class consisting of morphological
variants of the single root word. Dasgupta and Ng (2006) proposed
unsupervised morphological parsing of Bengali. Unsupervised
morphological analysis is the task of segmenting words into
prefixes, suffixes and stems without prior knowledge of
language-specific morphotactics and morphophonological rules. This
parser is composed of two steps: (1) inducing prefixes, suffixes
and roots from a vocabulary consisting of words taken from a large,
unannotated corpus, and (2) segmenting a word based on these
induced morphemes. When evaluated on a set of 4,110 human-segmented
Bengali words, their algorithm achieves 83% accuracy. Pandey and
Siddiqui (2008) [17] proposed an unsupervised stemming algorithm
for Hindi based on Goldsmith's (2001) [69] approach. It is based on the
split-all method. For unsupervised learning (training), words from
Hindi documents from EMILLE corpus have been extracted. These words
have been split to give n-gram (n = 1, 2, 3, ..., l) suffixes, where l is
length of the word. Then suffix and stem probabilities are
computed. These probabilities are multiplied to give split
probability. The optimal segment corresponds to maximum split
probability. Some post-processing steps have been taken to refine
the
15. 14 learned suffixes. It is evaluated on 1,000 words randomly extracted from the Hindi WordNet database. The training data has been constructed by extracting 106,403 words from the EMILLE corpus. The observed accuracy is 89.9% after
applying some heuristic measures. The F-score is 94.96%. The
algorithm does not require any language specific information.
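The split-probability idea behind Pandey and Siddiqui's unsupervised stemmer can be sketched as follows. The toy vocabulary is hypothetical, standing in for the words extracted from the corpus; only the scoring (multiplying estimated stem and suffix probabilities and taking the maximum over all split points) follows the description above.

```python
from collections import Counter

# Hypothetical toy vocabulary standing in for the corpus-extracted words.
WORDS = ["opening", "opened", "closing", "closed", "walking", "walked"]

# Count every candidate stem and suffix over all split points of all words.
stems, suffixes = Counter(), Counter()
for w in WORDS:
    for i in range(1, len(w)):
        stems[w[:i]] += 1
        suffixes[w[i:]] += 1

def best_split(word):
    """Choose the split point maximizing P(stem) * P(suffix)."""
    total_stems, total_suffixes = sum(stems.values()), sum(suffixes.values())
    return max(
        ((word[:i], word[i:]) for i in range(1, len(word))),
        key=lambda p: (stems[p[0]] / total_stems) * (suffixes[p[1]] / total_suffixes),
    )

print(best_split("walking"))  # -> ('walk', 'ing')
```

The split ("walk", "ing") wins because both halves recur across the vocabulary, so their frequency-based probabilities, and hence the product, are highest at that point.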
Majgaonker and Siddiqui (2010) developed an unsupervised approach
for Marathi stemmer. Three different approaches (rule based, suffix
stripping and statistical stripping) for suffix rules generation
have been used in the unsupervised stemmer. The rule-based stemmer uses
a set of manually extracted suffix stripping rules whereas the
unsupervised approach learns suffixes automatically from a set of
words extracted from raw Marathi text. The performance of both the stemmers has been compared on a test dataset consisting of 1,500 manually stemmed words. The maximum accuracy observed is 82.5% for
the statistical suffix stripping approach. This approach uses a set
of words to learn suffixes. Suba et al. (2011) proposed two
stemmers for Gujarati: a lightweight inflectional stemmer based on a
hybrid approach and a heavyweight derivational stemmer based on a
rule-based approach. The inflectional stemmer has an average
accuracy of about 90.7% which is considerable as far as IR is
concerned. The boost in accuracy due to POS-based stemming was 9.6%, and
due to inclusion of the language characteristics it was further
boosted by 12.7%. The derivational stemmer has an average accuracy
of 70.7% which can act as a good baseline and can be useful in
tasks such as dictionary search or data compression. The
limitations of the inflectional stemmer can be easily overcome if
modules like Named Entity Recognizer are integrated with the
system. A Light Weight Stemmer for Bengali and Its Use in Spelling Checker, by Md. Zahurul Islam, Md. Nizam Uddin and Mumit Khan of the Center for Research on Bangla Language Processing, BRAC University, Dhaka, Bangladesh, presents a computationally inexpensive stemming algorithm for Bengali, which handles suffix removal in a domain-independent way. First, the spelling checker checks the given word against a lexicon containing only the root words. If the word is found, then it is a valid word, terminating the checking process. If the word is not found in the lexicon, they apply the stemming algorithm. There are two possible scenarios: the stemming algorithm finds and returns a stem, or it cannot find a possible suffix. In the latter case, they try to get a probable stem list with their suffixes from a modified stemming method. Correction accuracy for single-error misspellings: 90.8%. Correction accuracy for multi-error misspellings: 67%. In 2012 an iterative stemmer for
the Tamil language was proposed by Vivekanandan Ramachandran et al. In this proposed model, a suffix-stripper algorithm is used to stem Tamil words to their root words. Upendra Mishra and Chandra Prakash present a hybrid approach which is a combination of the brute-force and suffix-removal approaches and reduces the problems of over-stemming and under-stemming.
16. 15 4. PROPOSED APPROACH Our proposed algorithm is based on a lightweight stemmer for Bengali verbs that strips the suffixes using a predefined suffix list, on a longest-match basis, and then finds the root on the basis of some rules. For this purpose, firstly the input file is read and the inflected verb forms are fetched. The inflexion of each such inflected verb is then compared with the suffixes in the suffix list and removed, if any match is found. The subroot is then checked. If it ends with e-kar, o-kar, a-kar or aa-kar, it is replaced with aa-kar. If it starts with e-kar, u-kar or a-kar, it is replaced with a-kar, o-kar or aa-kar respectively. The output doc file is generated by copying the contents of the input file and concatenating each tagged word with its obtained root word wherever the word contains /verb. Finally, the generated output file is compared with the desired output file and the efficiency is calculated. 4.1. Overall Pictorial
Presentation Reading Input Text Selecting & tagging verbs
Fetching of tagged verbs Module 1: Applying suffix striping
Obtaining stripped part Module 2: Applying rules Generating Output
File Calculating Efficiency Figure 1: Pictorial representation of
proposed approach
4.1.1. Explanation of the Proposed Approach with an Example
Table 2: Proposed approach with an example
PROCESS | EXAMPLE
Reading input text |
Selecting & tagging verbs | /verb /verb
Fetching of tagged verbs | /verb, /verb
Applying suffix stripping | -> + , -> +
Obtaining stripped part | ,
Applying rules | ,
Generating output file | /verb/ /verb/
4.1.2. Detailed Explanation of Module 1 (Suffix Stripping)
Figure 2: Module 1 (suffix stripping). Flowchart: read the suffix list; fetch a suffix from the suffix list; check whether the considered verb contains the suffix; if yes, strip the suffix from the inflected verb and obtain the subroot/stripped verb; if no, fetch the next suffix from the suffix list until all the suffixes have been fetched.
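Module 1 above can be sketched as a small longest-match suffix stripper. The suffix list and words below are romanized placeholders rather than the report's actual 35 Bengali suffixes, and, unlike the appendix listing, this sketch additionally requires the matched suffix to sit at the end of the word:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SuffixStripper {
    // Placeholder, romanized suffixes; the real system reads its 35 Bengali
    // suffixes from the suffix list file.
    static final String[] SUFFIXES = { "chhilam", "chhi", "lam", "bo", "e" };

    // Strip the longest matching suffix, keeping a subroot of at least two
    // characters (the index >= 2 check of STEP 5.1.1.b).
    static String strip(String word) {
        String[] byLength = SUFFIXES.clone();
        Arrays.sort(byLength, Comparator.comparingInt(String::length).reversed());
        for (String suffix : byLength) {
            int index = word.lastIndexOf(suffix);
            if (index >= 2 && index + suffix.length() == word.length()) {
                return word.substring(0, index); // subroot/stripped verb
            }
        }
        return word; // no suffix matched
    }

    public static void main(String[] args) {
        System.out.println(strip("korchhilam")); // longest suffix "chhilam" wins over "lam"
    }
}
```

Sorting the suffixes by descending length before matching is what gives the longest-match behaviour the proposed approach calls for.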
4.1.3. Detailed Explanation of Module 2 (Applying Rules)
Figure 3: Module 2 (applying rules). Flowchart: read the stripped verb/subroot; if the subroot ends with e-kar, o-kar, a-kar or aa-kar, replace the ending kar with aa-kar; if the length of the subroot is less than 3, concatenate it with aa-kar; if the subroot starts with e-kar, replace the starting kar with a-kar; if the subroot starts with u-kar, replace the starting kar with o-kar; if the subroot starts with a-kar, replace the starting kar with aa-kar; obtain the root verb.
4.1.4. Sentence Collection
The Technology Development for Indian Languages (TDIL) Programme, initiated by the Department of Electronics & Information Technology (DeitY), Ministry of Communication & Information Technology (MC&IT), Govt. of India, has the objective of developing information processing tools and techniques to facilitate human-machine interaction without a language barrier; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services.
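A minimal sketch of the Module 2 ending rules in Java. The Bengali vowel signs did not survive in this copy of the report, so the code points below are assumptions (e-kar U+09C7, o-kar U+09CB, aa-kar U+09BE); the a-kar case is omitted here because the inherent vowel has no separate sign:

```java
public class KarRules {
    // Assumed code points for the vowel signs (kars); the report's own
    // glyphs were lost in this copy of the document.
    static final char E_KAR = '\u09C7', O_KAR = '\u09CB', AA_KAR = '\u09BE';

    // Ending rules of Module 2 (STEP 5.2.1): replace a final e-kar, o-kar
    // or aa-kar with aa-kar; if the subroot is shorter than 3 characters,
    // append aa-kar.
    static String applyEndingRules(String subroot) {
        if (!subroot.isEmpty()) {
            char last = subroot.charAt(subroot.length() - 1);
            if (last == E_KAR || last == O_KAR || last == AA_KAR) {
                subroot = subroot.substring(0, subroot.length() - 1) + AA_KAR;
            }
        }
        if (subroot.length() < 3) {
            subroot = subroot + AA_KAR;
        }
        return subroot;
    }

    public static void main(String[] args) {
        // a 3-character subroot ending in e-kar gets its ending replaced by aa-kar
        System.out.println(applyEndingRules("\u0995\u09B0\u09C7"));
    }
}
```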
The Programme also promotes Language Technology
standardization through active participation in International and
national standardization bodies such as ISO, UNICODE,
World-wide-Web consortium (W3C) and BIS (Bureau of Indian
Standards) to ensure adequate representation of Indian languages in
existing and future language technology standards. The input
sentences are collected from 50 different categories of the Bengali
text corpus developed in the TDIL project of the Govt. of India,
while the information about the different inflexions of a particular verb is collected from a Bengali dictionary. We have selected 14 Bengali
Verbs, and presented a sentence for each inflexion of a particular
verb. Accordingly, we have applied our algorithm over 638
sentences. 4.1.5. Normalization The Bengali text corpus developed in the TDIL project of the Govt. of India separates words by '|', whereas we have separated words by spaces. Moreover, the end of each sentence is marked by '|', and any punctuation sign, e.g. the question mark '?', comma ',', exclamation mark '!', etc., is replaced by '|'. Figure 4: Screenshot of un-normalized document. Figure 5: Screenshot of normalized document.
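The normalization step of section 4.1.5 can be sketched as below. The report does not show its normalization code, so this is an assumed illustration that maps question marks, commas and exclamation marks to the sentence mark '|' and uses single spaces between words:

```java
public class Normalizer {
    // Replace punctuation with the sentence mark '|' and collapse runs of
    // whitespace to single spaces; a sketch of section 4.1.5, not the
    // report's own code.
    static String normalize(String line) {
        return line.replaceAll("[?!,]", "|").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("tumi  ki korchho?"));
    }
}
```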
4.1.6. Tagging of Verbs In every sentence, the inflected word whose root is to be found is tagged with /verb. Figure 6: Screenshot of verb-tagged document. 4.1.7. Preparing the Output File An output file is prepared in which the inflected word of every sentence whose root is to be found is tagged as /verb/ concatenated with the actual root word. This file is prepared in order to calculate the efficiency of our proposed algorithm. Figure 7: Screenshot of the desired output document. 4.1.8. Preparing the Suffix List After surveying the inflexions of various Bengali verbs in the 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, we have prepared a suffix list by selecting the 35 most frequently occurring suffixes.
4.1.9. Verification The generated output file is compared with the prepared output file, and thereby the efficiency of the algorithm is calculated.
4.2. Algorithm
STEP 1. Start of algorithm.
STEP 2. Create new String arrays splits1[], splits2[], splits3[] and input1[].
STEP 3. Read the contents of the doc files and split the words by the space separator.
3.1. Store the words of each sentence in splits1[].
3.2. Store the inflexions (suffixes) in splits2[].
3.3. Store the desired root words in splits3[].
STEP 4. Declare and initialize variables l1 = length of splits1[], l2 = length of splits2[].
STEP 5. Fetch the inflected verb forms into input1[] from splits1[i] if /verb is contained in the currently fetched word. This step is repeated l1 times.
5.1. Determine the subroot from input1[i] by repeating the following steps l2 times.
5.1.1. If splits2[j] is contained in input1[i] then,
5.1.1.a. Declare a variable index which stores the index of the last occurrence of splits2[j] in input1[i].
5.1.1.b. If index is greater than or equal to 2 then,
5.1.1.b.i. Store the substring of input1[i] from begindex=0 to endindex=index in input1[i].
5.1.1.b.ii. Break the loop.
5.2. Determine the actual root input1[i] by repeating the following steps l1 times.
5.2.1. Check the ending kar of input1[i].
5.2.1.a. If input1[i] ends with e-kar, o-kar, a-kar or aa-kar, then replace it with aa-kar.
5.2.1.b. If the length of input1[i] is less than 3, concatenate it with aa-kar.
5.2.2. Check the starting kar of input1[i].
5.2.2.a. If input1[i] starts with e-kar, then replace it with a-kar.
5.2.2.b. If input1[i] starts with u-kar, then replace it with o-kar.
5.2.2.c. If input1[i] starts with a-kar, then replace it with aa-kar.
STEP 6. Generate the output doc file by copying the contents of splits1[] and concatenating them with the obtained root words from input1[] wherever a word contains /verb.
STEP 7. Compare the obtained sentences in splits1[] with the desired sentences in splits3[] and calculate the efficiency.
STEP 8. End of algorithm.
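The comparison of STEP 7 can be sketched as a token-by-token match. The exact counting used in the report is not shown, so the formulation below (percentage of generated tokens equal to the corresponding desired tokens) is an assumption:

```java
public class Efficiency {
    // Percentage of generated tokens that equal the corresponding desired
    // tokens; an assumed formulation of STEP 7 of the algorithm.
    static double efficiency(String[] generated, String[] desired) {
        int n = Math.min(generated.length, desired.length);
        if (n == 0) return 0.0;
        int correct = 0;
        for (int i = 0; i < n; i++) {
            if (generated[i].equals(desired[i])) correct++;
        }
        return 100.0 * correct / n;
    }

    public static void main(String[] args) {
        String[] got  = { "kor", "ja", "khel" };
        String[] want = { "kor", "ja", "khela" };
        System.out.println(efficiency(got, want));
    }
}
```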
5. OUTPUT AND DISCUSSION
5.1. Partial View of the Input File: Figure 8: Partial view of the input file.
5.2. Suffix List: Figure 9: Screenshot of the suffix list.
5.3. Partial View of the Output File: Figure 10: Partial view of the output file.
5.4. EFFICIENCY: Dealing with 500 sentences, our proposed approach gives an efficiency of 99.4%. Figure 11: Screenshot of the efficiency of the proposed approach.
5.5. TIME COMPLEXITY: The time complexity of the proposed algorithm is O(n^2) in the worst case.
6. CONCLUSION AND FUTURE WORK: Stemming plays a vital role in information retrieval systems, and its effect is considerable. In this project, we present a lightweight stemmer for 14 selected Bengali verbs that strips suffixes using a predefined suffix list, on a longest-match basis, and then finds the root on the basis of some rules. Except in a few cases, the result obtained from our algorithm is quite satisfactory and meets our expectations. We argue that a stronger, more populated learning set would invariably yield better results. In future, we plan to test our algorithm with more sets of Bengali verbs. As research on the Bengali language is much less extensive than on languages like English and Hindi, many dimensions are still untouched. Using several relevant new approaches, a better Bengali stemmer can be developed, which will be useful for further linguistic computing.
ACKNOWLEDGEMENT It gives us great pleasure to find an opportunity to express our deep and sincere gratitude to our project guide, Mr. Alok Ranjan Pal. We very respectfully recollect his constant encouragement, kind attention and keen interest throughout the course of our work. We are highly indebted to him for the way he modeled and structured our work with the valuable tips and suggestions that he accorded to us in every respect of our work. We are extremely grateful to the Department of Computer Science & Engineering, CEMK, for extending all the facilities of our department. We humbly extend our sense of gratitude to the other faculty members, laboratory staff, library staff and administration of this institute for providing us their valuable help and time in a congenial working environment. Last but not least, we would like to convey our heartiest thanks to all our classmates who from time to time have helped us with their valuable suggestions during our project work. Date: 23.05.2015
Anasuya Paul University Roll:10700111006 University Registration
No:111070110006 Joyeeta Bagchi University Roll:10700111021
University Registration No:111070110021 Koushik Dutta University
Roll:10700111024 University Registration No:111070110024 Sneha
Sarkar University Roll:10700111049 University Registration
No:111070110049
References: 1. A. Ramanathan and D. D. Rao, A Lightweight Stemmer for Hindi, Workshop on Computational Linguistics for
South-Asian Languages, EACL, 2003. 2. M. Z. Islam, M. N. Uddin and
M. Khan, A Light Weight Stemmer for Bengali and its Use in Spelling
Checker. Proc. 1st Intl. Conf. on Digital Comm. and Computer
Applications (DCCA07), Irbid, Jordan, March 19-23 2007. 3. P.
Majumder, M. Mitra, S. K. Parui, G. Kole, P. Mitra, and K. Datta,
YASS: Yet Another Suffix Stripper, Association for Computing
Machinery Transactions on Information Systems, 25(4):18-38, 2007.
4. S. Dasgupta and V. Ng, Unsupervised Morphological Parsing of
Bengali, Language Resources and Evaluation, 40(3-4):311-330, 2006.
5. A. K. Pandey and T. J. Siddiqui, An Unsupervised Hindi Stemmer
with Heuristic Improvements, In Proceedings of the Second Workshop
on Analytics For Noisy Unstructured Text Data, 303:99-105, 2008. 6.
M. M. Majgaonker and T. J Siddiqui, Discovering Suffixes: A Case
Study for Marathi Language, International Journal on Computer
Science and Engineering, Vol. 02, No. 08, pp. 2716-2720, 2010. 7.
K. Suba, D. Jiandani and P. Bhattacharyya, Hybrid Inflectional
Stemmer and Rule-based Derivational Stemmer for Gujarati, In
proceedings of the 2nd Workshop on South and Southeast Asian
Natural Language Processing (WSSANLP), IJCNLP 2011, Chiang Mai,
Thailand, pp.1-8, 2011. 8. M.F. Porter, An algorithm for suffix
stripping, Program, 14(3), 1980, pp. 130-137. 9. P. Kundu and B.B.
Chaudhuri, Error Pattern in Bengali Text, International Journal of
Dravidian Linguistics, 28(2), 1999. 10. B.B. Chaudhuri, Reversed word dictionary and phonetically similar word grouping based spell-checker for Bengali text, In the Proceedings of LESAL Workshop, 2001. 12. Sandipan Sarkar and Sivaji Bandyopadhyay, Study
on Rule-Based Stemming Patterns and Issues in a Bengali Short
Story-Based Corpus. In ICON 2009. 13. S. Dasgupta and M. Khan, Morphological Parsing of Bangla Words Using PC-KIMMO. In: ICCIT 2004. 14. R. Barzilay and M. Elhadad, 1997, Using Lexical Chains for Text Summarization. In Proceedings of the
Workshop on Intelligent Scalable Text Summarization. Madrid, Spain.
15. Pratikkumar Patel and Kashyap Popat, Hybrid Stemmer for Gujarati, In Proc. of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), pages 51-55, the 23rd International Conference on Computational Linguistics (COLING), Beijing, August 2010. 16. Upendra Mishra and Chandra Prakash, MAULIK: An Effective Stemmer for Hindi Language, International Journal on Computer Science and Engineering (IJCSE). Abduelbaset M. Goweder, Husien A. Alhammi, Tarik Rashed, and Abdulsalam Musrat, A Hybrid Method for Stemming Arabic Text.
17. Kartik Suba, Dipti Jiandani and Pushpak
Bhattacharyya Hybrid Inflectional Stemmer and Rule-based
Derivational Stemmer for Gujarati. 18. Haidar Harmanani, Walid Keirouz, Saeed Raheel, A Rule Based Extensible Stemmer for Information Retrieval with Application to Arabic, The International Arab Journal of Information Technology, Vol. 3, July 2006. 19.
Navanath Saharia, Utpal Sharma and Jugal Kalita, Analysis and Evaluation of Stemming Algorithms: A Case Study with Assamese, ICACCI'12, August 3-5, 2012, Chennai, Tamil Nadu, India.
20. Nikhil Kanuparthi, Abhilash Inumella and Dipti Misra Sharma
Hindi Derivational Morphological Analyzer. Proceedings of the
Twelfth Meeting of the Special Interest Group on Computational
Morphology and Phonology (SIGMORPHON 2012), pages 10-16, Montreal, Canada, June 7, 2012. © 2012 Association for Computational
Linguistics. 21. Juhi Ameta, Nisheeth Joshi, Iti Mathur A
Lightweight Stemmer for Gujarati. 22. Mohamad Ababneh, Riyad
Al-Shalabi, Ghassan Kanaan, and Alaa AlNobani Building an Effective
Rule-Based Light Stemmer for Arabic Language to Improve Search
Effectiveness The International Arab Journal of Information
Technology, Vol. 9, No. 4, July 2012. 23. Ms. Anjali Ganesh Jivani
A Comparative Study of Stemming Algorithms Int. J. Comp. Tech.
Appl., Vol 2 (6), 1930-1938. 24. M. F. Porter, 1980, "An Algorithm for Suffix Stripping", Program, 14(3):130-137. 25. V. M. Orengo and
C. Huyck A Stemming Algorithm for the Portuguese Language
Proceedings of the Eighth International Symposium on String
Processing and Information Retrieval, pages 186-193, 2001. 26.
Deepika Sharma Stemming Algorithms: A Comparative Study and their
Analysis International Journal of Applied Information Systems
(IJAIS) ISSN: 2249-0868 Foundation of Computer Science FCS, New
York, USA, Volume 4, No. 3, September 2012. 27. J. B. Lovins, 1968, "Development of a Stemming Algorithm", Mechanical Translation and Computational Linguistics, 11(1-2), 22-31.
Appendix:
1. Program Code:

package stemming_verb;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import java.io.*;

public class Stemming_verb {
    public static void main(String[] args) {
        File file1 = null, file2 = null, file3 = null, file4 = null;
        WordExtractor extractor1 = null, extractor2 = null,
                      extractor3 = null, extractor4 = null;
        try {
            /*------------------ Reading sentences ------------------*/
            file1 = new File("G:\\Stemming\\final_project\\sentence_input.doc");
            FileInputStream fis1 = new FileInputStream(file1.getAbsolutePath());
            HWPFDocument document1 = new HWPFDocument(fis1);
            extractor1 = new WordExtractor(document1);
            String fileData1 = extractor1.getText();
            String[] splits1 = fileData1.split(" ");
            String[] input1 = new String[splits1.length];
            int l1 = splits1.length;

            /*------------------ Reading inflexions (suffixes) ------------------*/
            file2 = new File("G:\\Stemming\\final_project\\suffixes.doc");
            FileInputStream fis2 = new FileInputStream(file2.getAbsolutePath());
            HWPFDocument document2 = new HWPFDocument(fis2);
            extractor2 = new WordExtractor(document2);
            String fileData2 = extractor2.getText();
            String[] splits2 = fileData2.split(" ");
            int l2 = splits2.length;

            /*------------------ Reading desired output file ------------------*/
            file4 = new File("G:\\Stemming\\final_project\\sentence_output.doc");
            FileInputStream fis4 = new FileInputStream(file4.getAbsolutePath());
            HWPFDocument document4 = new HWPFDocument(fis4);
            extractor4 = new WordExtractor(document4);
            String fileData4 = extractor4.getText();
            String[] splits4 = fileData4.split(" ");
            int l4 = splits4.length;

            /*------------------ Suffix stripping (STEP 5) ------------------*/
            int verb = 0;
            for (int i = 0; i < l1; i++) {
                if (splits1[i].contains("/verb")) {           // fetch tagged verb
                    input1[i] = splits1[i];
                    for (int j = 0; j < l2; j++) {
                        if (input1[i].contains(splits2[j])) { // suffix found
                            int index = input1[i].lastIndexOf(splits2[j]);
                            if (index >= 2) {                 // keep a subroot of length >= 2
                                input1[i] = input1[i].substring(0, index);
                                break;
                            }
                        }
                    }
                }
            }