[IEEE 2013 International Conference on Research and Innovation in Information Systems (ICRIIS) - Kuala Lumpur, Malaysia (2013.11.27-2013.11.28)] 2013 International Conference on Research

3rd International Conference on Research and Innovation in Information Systems – 2013 (ICRIIS’13)

An Improved Arabic Light Stemmer

Osama Mohamed Elrajubi

Department of Communication and Networks Faculty of Information Technology, Misurata University

Misurata, Libya [email protected]

Abstract- According to the desired level of word analysis, Arabic stemming algorithms can be classified into stem-based (light stemming) algorithms and root-based algorithms. Light stemming algorithms remove only prefixes and suffixes from words, while root-based algorithms remove prefixes, suffixes, and infixes. There are several light stemmers for Arabic (Light1, Light2, Light3, Light8, and Light10); for information retrieval, the Light10 stemmer outperformed the other light stemmers. In this paper, Arabic stemming algorithms are studied and the literature on Arabic stemmers is reviewed. In addition, a new Arabic light stemmer is proposed and implemented. The main step of the light stemmer is removing the prefixes and suffixes of words. Because this step changes the meaning of some words, several additional steps are designed and implemented in the proposed stemmer. The proposed stemmer and the Light10 stemmer were tested on the same Arabic data set, which was developed in this work. The accuracy rate of the Light10 stemmer was 66%, while the accuracy rate of the proposed stemmer was 88.25%. The reasons for incorrect stemming by the proposed stemmer are also discussed.

Keywords: Arabic stemming, Arabic light stemmer, suffixes and prefixes stripping, Arabic retrieval.

I. INTRODUCTION

Word stemming is very useful in many applications, such as information retrieval, data encryption, text compression, and text classification and categorization [1]. Arabic word stemming is particularly important for Arabic Information Retrieval (IR), because IR has to determine an appropriate form of each word to use as an index term. Arabic is a highly inflected language; therefore, there is a need for an efficient stemming algorithm for the retrieval and indexing of Arabic documents [1][2]. Fig. 1 shows the stem and the root of the word (jalesoun), which means "sitting" in English, after applying a light stemmer and a root-based stemmer.

Figure 1. Example of applying light stemmer and Root-based stemmer

Light stemming algorithms remove only prefixes and suffixes from words, while root-based algorithms remove prefixes, suffixes, and infixes. Analyzing Arabic words to their roots is preferred in linguistic-based applications, while analyzing words to their stems is better for other applications such as information retrieval [3]. Some researchers have aimed to develop stemming algorithms in general, while others have studied the impact of stemming on a specific purpose [2]. Khoja (1999) designed and tested a novel algorithm for root extraction and found it to be more useful than prior approaches. The stemmer removes the suffix and the prefix; the remaining word is then compared with a group of patterns of the same length to determine the root. The stemmer also uses linguistic data files, such as lists of all diacritic characters, definite articles, and punctuation characters [4][5].

Kazem Taghva et al. (2005) implemented a root-extraction stemmer for Arabic that is similar to the Khoja stemmer but does not use a root dictionary. Their stemmer was found to perform equivalently to the Khoja stemmer, as well as to so-called "light" stemmers, in monolingual document retrieval tasks performed on the Arabic TREC-2001 collection. They therefore concluded that a root dictionary does not improve Arabic monolingual document retrieval [6].

Leah S. Larkey et al. (2007) developed several light stemmers for Arabic (Light1, Light2, Light3, Light8, and Light10) and assessed their effectiveness for information retrieval using standard TREC data. They also compared light stemming with several root-based stemmers. For information retrieval, their Light10 stemmer outperformed the other approaches [7]. Table 1 shows the prefixes and suffixes that are removed from Arabic words by each version of the light stemmer.

TABLE 1. THE DIFFERENT VERSIONS OF LIGHT STEMMER

Version of Stemmer Prefixes Suffixes

Light1 None

Light2 None

Light3

Light8

Light10

[Figure 1 content: (jalesoun), sitting (adj., plural); light stemmer: (jales), sitting (adj., singular); root-based stemmer: (jls), sat (verb).]

In 2008, Kchaou and Kanoun proposed an approach to stemming Arabic words. Although they used two dictionaries, their approach is similar to that of Khoja. One dictionary contains stems and the other contains roots. The approach has the advantage of reducing words that are derived from roots to their roots, and words that are derived from stems to their stems. Therefore, it addresses the difficulty that the Khoja stemmer has with such stems and roots [8]. May Y. Al-Nashashibi et al. (2010) examined five methods for extracting Arabic roots. An algorithm for correcting irregular words was run with each of these five methods, and all approaches were compared. The approach with the highest accuracy was the rule-based algorithm with the correction algorithm included [9].

Mohammad Hijjawi et al. (2011) proposed a new machine-learning-based methodology to overcome the problem of stemming in the Arabic language. Their results showed that the method achieved a very high level of accuracy. Their study also indicated that stemming can be helpful as a pre-processing step for dialogue systems, where it can reduce the number of written patterns to a minimum [10]. A. Anwar et al. (2013) proposed an approach that enables the classification and retrieval of video scenes. Their approach is based on the Arabic closed-caption text of the video. They used the Light10 stemmer to remove the most frequent suffixes and prefixes of the words, and reported that the approach is efficient for retrieving Arabic videos [11].

The rest of the paper is organized as follows. Section II presents stemming algorithms. Section III presents the design of the proposed stemmer, with the details of each step. Section IV describes the experimental work, while Section V presents the evaluation and the reasons for incorrect stemming by the proposed stemmer. The conclusion and future work are given in Sections VI and VII.

II. STEMMING ALGORITHMS

A. Light stemming algorithms (Stem-based algorithms)

Light stemming algorithms are the most common algorithms. They remove affixes (only suffixes and prefixes) from words, producing a form called a stem, which often approximates the root morpheme of a word [12]. An affix is a morpheme that is attached to a word to generate a new word. A prefix is an affix that is attached before the stem of a word, while a suffix (postfix or ending) is an affix that is attached after the stem of a word [13]. Table 2 shows most of the prefixes in Arabic, and Table 3 shows most of the suffixes in Arabic.

TABLE 2. MOST OF PREFIXES IN ARABIC

Length Prefixes

one letter , , , , , , , ,

Two letters ,

Three letters , , , ,

TABLE 3. MOST OF SUFFIXES IN ARABIC

Length Suffixes

One letter , , , , , ,

Two letters , , , , , , , , , , , , , ,

,

Three letters ,

B. Root-based algorithms (Root-based Stemmers)

Root-based algorithms are also called morphological analysis algorithms or pattern-based algorithms. These algorithms reduce words to their 3-letter roots. The problem with these algorithms is that a number of words with different meanings might have the same root [14]. Light stemming, on the other hand, can fail to conflate some words that should go together [7].

III. THE PROPOSED STEMMER

Fig. 2 shows the general diagram of the proposed stemmer. It consists of eight steps, two of which are not found in the Light10 stemmer: “searching in irregular words list” and “applying rules”.

Figure 2. General diagram of the proposed stemmer

[Figure 2 content, in order: Dividing the text into words; Normalization; Searching in stop word list; Removing letter “ ” (and); Searching in irregular words list; Removing the prefixes and suffixes from the words; Applying rules; The result.]

The details of each step in the proposed stemmer are discussed in the following subsections.


A. Dividing the text into Words

The first step of the stemmer is dividing the input text into words. Words are delimited by the space after each word. For example, the stemmer counts four words in the sentence “She goes to school”.
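As a minimal sketch of this step, assuming that any run of whitespace separates two words, the splitting can be written in Python as follows. The function name split_into_words is hypothetical and is not taken from the stemmer's implementation.

```python
def split_into_words(text: str) -> list[str]:
    """Split the input text on whitespace; each token counts as one word."""
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading and trailing spaces.
    return text.split()


words = split_into_words("She goes to school")
print(len(words), words)
```

For the example sentence this yields four tokens, which agrees with the word count stated above.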

B. Search in Stop Words List

Before processing each word to extract its light stem, the stemmer searches for the word in a special list of words, the “stop words list”. If the word is on the list, the stemmer ignores it, because it is one of the “functional words”, such as conjunctions and particles. The “stop words list” used by the stemmer was taken from the stop word list of the Khoja stemmer, and fifty new words were added to it. Table 4 shows these fifty new stop words and their meanings.

TABLE 4. NEW STOP WORDS

Stop words

The meaning

Stop words

The meaning

And thus While

Other And always

But And perhaps

except And If

It What

Therefore Including

Now That

It Then

It And of them

Therefore And Like

But The place

Including Often

What With

And They And that

All Even

But With

After And must

When And Can be

Also Alone

And also of them

And why They

And so on Some

Sometimes Including

And sometimes And him

And whenever Alone
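The stop-word search can be illustrated with a set lookup. The sketch below is only an assumption about how such a check could be coded; the four Arabic entries are illustrative placeholders chosen here and are not the Khoja list or the entries of Table 4.

```python
# Illustrative stop words only (assumed): "in", "from", "on", "but".
STOP_WORDS = {"في", "من", "على", "لكن"}


def is_stop_word(word: str, stop_words: set[str] = STOP_WORDS) -> bool:
    """Return True if the word is in the stop words list."""
    return word in stop_words


def drop_stop_words(words: list[str], stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Keep only the words that should be passed on to the stemming steps."""
    return [w for w in words if not is_stop_word(w, stop_words)]


print(drop_stop_words(["كتاب", "في", "مدرسة"]))
```

A set is used so that each lookup takes constant time on average, which matters when the list grows as new stop words are added.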

C. Normalization

The third step in the stemmer is normalization of the words. The normalization process in the proposed stemmer is similar to that of the Light10 stemmer, which runs as follows:

1- Remove punctuation and non-letters.
2- Remove diacritics (primarily weak vowels).
3- Remove hamza from letter “ ” (replace , , and with ).
4- Replace final letter with .
5- Replace final letter with .
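The five normalization steps can be sketched in Python as shown below. The Arabic letters are not legible in the list above, so the specific characters in the code are an assumption based on the commonly described Light10 normalization: alef with hamza or madda is replaced by bare alef, final alef maqsura by yeh, and final teh marbuta by heh. The function name normalize and the Unicode ranges are likewise choices made for this sketch.

```python
import re

# Assumed character classes (Unicode Arabic block):
# letters U+0621-U+063A and U+0641-U+064A, diacritics U+064B-U+0652.
NOT_LETTER_OR_DIACRITIC = re.compile(r"[^\u0621-\u063A\u0641-\u0652]")
DIACRITICS = re.compile(r"[\u064B-\u0652]")
HAMZATED_ALEF = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below

ALEF = "\u0627"
ALEF_MAQSURA, YEH = "\u0649", "\u064A"
TEH_MARBUTA, HEH = "\u0629", "\u0647"


def normalize(word: str) -> str:
    """Apply the five normalization steps to a single word."""
    word = NOT_LETTER_OR_DIACRITIC.sub("", word)  # 1- punctuation and non-letters
    word = DIACRITICS.sub("", word)               # 2- diacritics
    word = HAMZATED_ALEF.sub(ALEF, word)          # 3- hamza removed from alef
    if word.endswith(ALEF_MAQSURA):               # 4- final alef maqsura -> yeh (assumed)
        word = word[:-1] + YEH
    if word.endswith(TEH_MARBUTA):                # 5- final teh marbuta -> heh (assumed)
        word = word[:-1] + HEH
    return word
```

Punctuation is stripped before the final-letter checks so that a trailing comma or full stop does not hide the last letter of the word.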

D. Searching in Irregular Words List

In this step, the stemmer searches for the word in a table of irregular words. If the word is in the table, the stemmer returns the stem stored for it.

The following table contains twenty-six irregular words and their stems, which were added to the table according to the data used in testing the stemmer. Users of the stemmer can add new irregular words to the table whenever they wish.

TABLE 5. IRREGULAR WORDS

Irregular word (meaning) | Stem (meaning)
Allah | Allah
Contentment | Contentment
Lebanon | Lebanon
And Germany | Germany
Million | Million
The television | television
And the television | television
France | France
And Britain | Britain
Syria | Syria
United States | United States
Europe | Europe
Million | Million
The Internet | Internet
For the Internet | Internet
And the Internet | Internet
Twentieth | Twentieth
Fifty | Fifty
Day | Day
The Nineties | Nineties
The Japan | Japan
Paris | Paris
Baghdad | Baghdad
Years | Years
The Saudi Arabia | Saudi Arabia
The Data | Data
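The irregular words table behaves like a mapping from a word to its stored stem. The following sketch shows one possible form of the lookup; the Arabic spellings of the three entries are assumptions made for illustration (the Arabic column of Table 5 is not legible here), and the names IRREGULAR_STEMS and lookup_irregular are hypothetical.

```python
# Assumed spellings for three rows of Table 5 (illustrative only).
IRREGULAR_STEMS = {
    "الله": "الله",          # Allah -> Allah
    "لبنان": "لبنان",        # Lebanon -> Lebanon
    "والانترنت": "انترنت",   # And the Internet -> Internet
}


def lookup_irregular(word, table=IRREGULAR_STEMS):
    """Return the stored stem for an irregular word, or None if it is not listed."""
    return table.get(word)


stem = lookup_irregular("والانترنت")
print(stem if stem is not None else "not irregular: continue with affix removal")

# Users can extend the table with new irregular words at any time.
IRREGULAR_STEMS["باريس"] = "باريس"  # Paris -> Paris (assumed spelling)
```

Returning None for unlisted words lets the caller decide to continue with the remaining steps of the stemmer.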

E. Removing letter “ ” (and)

In this step, the stemmer removes the letter “ ” (“and”) from the beginning of a word if the length of the word is more than three characters. Removing the letter “ ” is important because it is usually a conjunction. At the same time, it is problematic, because many common Arabic words begin with this character. Such words can be added to the irregular words list.
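A minimal sketch of this step is given below, assuming that the conjunction letter is the Arabic letter waw (U+0648). The function name strip_leading_waw is hypothetical.

```python
WAW = "\u0648"  # the Arabic letter waw, assumed to be the conjunction "and"


def strip_leading_waw(word: str) -> str:
    """Remove a leading waw only when the word has more than three characters."""
    if len(word) > 3 and word.startswith(WAW):
        return word[1:]
    return word


print(strip_leading_waw("وكتاب"))  # "and a book": the conjunction is removed
print(strip_leading_waw("ورد"))    # three characters only: left unchanged
print(strip_leading_waw("وزير"))   # "minister": the waw belongs to the word, yet it is removed
```

The last call illustrates the problem described above: the length condition alone cannot tell a conjunction from a word that simply begins with waw, which is why such words are candidates for the irregular words list.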

F. Removing the prefixes and suffixes

Removing the prefixes and suffixes from words is the main step of the stemmer. The stemmer removes a prefix or suffix only if the length of the word does not become less than three characters. Fig. 3 shows the flow chart of the process of removing the prefixes and suffixes.

Table 6 shows the prefixes of the proposed stemmer. Table 7 shows the suffixes of the proposed stemmer.

TABLE 6. THE PREFIXES OF THE PROPOSED STEMMER

Length Prefixes

One letter ,

Two letters ,

Three letters , , ,

TABLE 7. THE SUFFIXES OF THE PROPOSED STEMMER

Length Suffixes

One letter ,

Two letters , , , , , , , ,

Three letters

G. Applying rules

The step that follows the deletion of prefixes and suffixes is correcting any word whose meaning has changed. Three rules are applied in the stemmer to correct words whose meaning was affected:

1. Adding ( ) to the end of the word if the suffix ( ) is deleted.

2. Adding ( ) to the end of the word if the suffix ( ) is deleted.

3. Replacing the letter ( ) at the end of the word by ( ) if the suffix of the word is deleted.

Figure 3. Flow chart of the process of removing the prefixes and suffixes.

[Figure 3 content: Start; Is Length > 5 char? (Yes: Removing three characters prefixes and suffixes); Is Length > 4 char? (Yes: Removing two characters prefixes and suffixes); Has two characters prefix not removed and Length > 3 char? (Yes: Removing one character prefix); Is Length > 3 char? (Yes: Removing one character suffix); End.]


H. The result

The final step of the stemmer is determining the result. The stemming result for a word is correct if the output form of the word is the same as its target form; otherwise, the result is incorrect.

IV. EXPERIMENTAL WORK

A. Data collection

The proposed stemmer and the Light10 stemmer were tested on the same Arabic data, which was collected and prepared in this work as follows:

1- Four news articles written in Arabic were chosen from the Aljazeera channel website (http://www.aljazeera.net) [15]. The total word count of these articles is 2791 words.

The first article: It is entitled “ ”, which translates into English as "efforts to prevent the collapse of the Egyptian Stock Exchange". It consists of 210 words.

The second article: It is entitled “ ”, which translates into English as "Carthage Airport is witnessing stories of returnees". It consists of 588 words.

The third article: It is entitled “ ”, which translates into English as "Iraq back into the Arabic university". It consists of 761 words.

The fourth article: It is entitled “ ”, which translates into English as "Children of the Internet". It consists of 1232 words.

2- Determining the target form (the correct form) of each word. The following two steps were taken when determining the target form:

• Deleting the prefixes and suffixes of the words in order to convert each word from plural form to singular form and from feminine form to masculine form, and deleting definite articles from the beginning of the word.

• Manually correcting any word whose meaning was changed as a result of deleting some of its suffixes.

B. Test plan

Two tests were carried out in this work. The first test runs the Light10 stemmer on the Arabic data developed in this work. The second test runs the proposed stemmer on the same data. Both tests compare the light stems obtained automatically by the stemmer with the target forms. The accuracy rate of each stemmer is then calculated as follows:

\[ \text{The accuracy rate} = \frac{\text{The number of the matching words}}{\text{Word count}} \times 100\% \]

TABLE 8. TEST PLAN

Test Case No | Test Details | Expected Results
1 | Comparing the light stems of the words, which are obtained automatically by the Light10 stemmer, with the target forms. | The light stems of the words (2791 words) are expected to match the target forms.
2 | Comparing the light stems of the words, which are obtained automatically by the proposed stemmer, with the target forms. | The light stems of the words (2791 words) are expected to match the target forms.

C. Test Results

Table 9 shows the test results of the two tests.

TABLE 9. TEST RESULTS

Test Case No | Actual Result | Comment
1 | The number of words whose light stems match the target forms is 1841. | The accuracy rate of the Light10 stemmer is 66%.
2 | The number of words whose light stems match the target forms is 2463. | The accuracy rate of the proposed stemmer is 88.25%.
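The accuracy rates in Table 9 can be checked with a short calculation. The sketch below assumes that the accuracy rate is the number of matching words divided by the word count, expressed as a percentage; the function name accuracy_rate is hypothetical.

```python
def accuracy_rate(matching_words: int, word_count: int) -> float:
    """Percentage of words whose light stem matches the target form."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 100.0 * matching_words / word_count


WORD_COUNT = 2791  # total words in the four articles

print(f"Light10:  {accuracy_rate(1841, WORD_COUNT):.2f}%")  # 65.96%, reported as 66%
print(f"Proposed: {accuracy_rate(2463, WORD_COUNT):.2f}%")  # 88.25%
print("Incorrect stems (proposed):", WORD_COUNT - 2463)     # 328
```

The difference 2791 - 2463 = 328 matches the number of incorrectly stemmed words discussed in the evaluation.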

V. EVALUATION

In this paper, an improved Arabic light stemmer was proposed and implemented. The proposed stemmer and the Light10 stemmer, which is the best light stemmer for Arabic information retrieval [11], were tested on the same Arabic data collected in this work. The accuracy rate of the Light10 stemmer was 66%, while the accuracy rate of the proposed stemmer was 88.25%. Therefore, on this data, the proposed stemmer is more accurate than the Light10 stemmer. Even though the proposed stemmer improved the accuracy rate, it does not provide the correct stem for a large number of words (328 out of 2791). The next subsection gives the main reasons for this.

A. Reasons for incorrect stemming

There are several reasons why the stemmer does not provide the correct stem for a large number of words (328 out of 2791).

• Some words start with the letter “ ”. It is important to remove this letter from most words, because it is usually a conjunction. However, this is problematic, because many Arabic words begin with this character as part of the word itself. To solve this problem, such words can be added to the irregular words list.



• Deleting the suffix “ ” from some words that end with “ ” changes the meaning of these words, because the letter “ ” is a necessary part of these words. However, in most words this letter is used to change them to the feminine form.

• Deleting the suffix “ ” from some words changes their meaning. This happens for two reasons. The first is that the suffix “ ” of some words is necessary for their meaning (it does not mark the plural form). The second is that the suffix “ ” does mark the plural form, but when it is removed, the letter “ ” must be added to the end of these words.

• Deleting the prefix “ ” from some words changes their meaning. The prefix “ ” of some words is necessary for their meaning, as in the word “ ” (she threw).

• Deleting the prefix “ ” from some words changes their meaning. In many Arabic words, the initial “ ” is a necessary part of the word and is not a prefix.

VI. CONCLUSION

• Word stemming is an essential process for many applications such as information retrieval, data encryption, text compression, and text classification and categorization.

• The main step of the light stemmer is removing the prefixes and suffixes of words. This step changes the meaning of some words; therefore, applying rules to correct these words is very useful.

• Most rules in the Arabic language have exceptions. Therefore, the use of an irregular words list in the stemmer is necessary, as some words used in the Arabic language are not originally Arabic words.

• Comparing the stemming accuracy rate of the Light10 stemmer (66%) with that of the proposed stemmer (88.25%) shows that the proposed stemmer is more accurate than the Light10 stemmer on the test data.

VII. FUTURE WORK

• Extending the stop words list and the irregular words list by adding new words to them.

• Applying other rules to correct words whose meaning is affected as a result of removing their prefixes and suffixes.

• Assessing the effectiveness of the proposed stemmer for information retrieval using standard data.

• Improving the proposed stemmer by building an additional database in the stemmer for saving all the words that were stemmed correctly when running the stemmer on any data.

• Developing the proposed stemmer to extract roots of Arabic words by applying a root-based algorithm to the result of the proposed light stemmer.

• Developing the proposed stemmer for other languages such as English.

REFERENCES

1- I. A. Al Kharashi and I. A. Al Sughaiyer, Performance Evaluation of an Arabic Rule-Based Stemmer, In Proceedings of the 17th National Computer Conference, Al-Madinah Al-Munw'warah, Saudi Arabia, 2004, pp. 517-528.

2- O. M. Elrajubi, An Improved Arabic Light Stemmer, Unpublished master’s thesis, Nottingham Trent University, Nottingham, UK, 2011.

3- H. K. Al Ameed, S. O. Al Ketbi, A. A. Al Kaabi, K. S. Al Shebli, N. F. Al Shamsi, N. H. Al Nuaimi, and S. S. Al Muhairi, Arabic Light Stemmer: A New Enhanced Approach, The Second International Conference on Innovations in Information Technology (IIT’05), UAE University, UAE, 2005.

4- M. Aljlayl and O. Frieder, On Arabic Search: Improving the Retrieval Effectiveness via a Light Stemming Approach, In International Conference on Information and Knowledge Management, CIKM'02, ACM, McLean, VA, USA, pp. 340-347, 2002.

5- M. Sawalha and E. Atwell, Comparative Evaluation of Arabic Language Morphological Analysers and Stemmers, Proceedings of COLING 2008 22nd International Conference on Computational Linguistics, pp. 107-110, 2008.

6- K. Taghva, R. Elkhoury, and J. Coombs, Arabic Stemming Without A Root Dictionary, Information Technology: Coding and Computing (ITCC'04), Vol. 2, 2005.

7- L. S. Larkey, L. Ballesteros, and M. E. Connell, Light Stemming for Arabic Information Retrieval, chapter in book: Arabic Computational Morphology, Springer, Vol. 38, pp. 221-244, 2007.

8- Z. Kchaou and S. Kanoun, Arabic stemming with two dictionaries, 2008 International Conference on Innovations in Information Technology (IIT 2008), Dec. 2008.

9- M. Y. Al-Nashashibi, D. Neagu, and A. A. Yaghi, Stemming techniques for Arabic words: A comparative study, 2010 2nd International Conference on Computer Technology and Development (ICCTD 2010), Nov. 2010.

10- M. Hijjawi, Z. Bandar, K. Crockett, and D. Mclean, An Arabic Stemming Approach using Machine Learning with Arabic Dialogue System, ICGST AIML-11 Conference, April 2011.

11- A. Anwar, G. I. Salama, and M. B. Abdelhalim, Video Classification and Retrieval Using Arabic Closed Caption, ICIT 2013 The 6th International Conference on Information Technology, May 2013.

12- M. A. H. Omer and S. Ma, Stemming Algorithm to Classify Arabic Documents, Journal of Communication and Computer, Vol. 7, No. 9, Sep. 2010.

13- Wikipedia [online], web site: http://en.wikipedia.org/wiki/Main_Page (Accessed 23 June 2013).

14- R. Duwairi, M. Al-Refai, and N. Khasawneh, Stemming versus light stemming as feature selection techniques for Arabic text categorization, Innovations'07: 4th International Conference on Innovations in Information Technology, IIT, Nov. 2007.

15- Aljazeera website channel [online], Available at: http://www.aljazeera.net/portal (Accessed 10 June 2013).
