

  • ISSN: 2289-7615 Page 1

    International Journal of Information System and Engineering

    www.ftms.edu.my/journals/index.php/journals/ijise

    Vol. 1(No.1), April, 2013

    Page : 01-11 ISSN: 2289-7615

    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Malayalam Text Compression

    Sajilal Divakaran
    School of Engineering and Computing Sciences, FTMS College, Kuala Lumpur, Malaysia
    sajilald@gmail.com

    Anjali C.
    University of Kerala, Thiruvananthapuram, Kerala, India 695581
    anjalichandrika@gmail.com

    Biji C. L.
    University of Kerala, Thiruvananthapuram, Kerala, India 695581
    bijijomy@gmail.com

    Achuthsankar S. Nair
    University of Kerala, Thiruvananthapuram, Kerala, India 695581
    sankar.achuth@gmail.com

    Abstract

    In natural language processing and analysis, a very large number of problems remain unaddressed, particularly in Malayalam computing. For instance, the informational analysis of Malayalam text has not been widely studied. Language studies of English based on the concepts of information theory are well established, as evidenced by the success of text compression methods for English. However, to the best of our knowledge, no attempt at Malayalam text compression has been reported, even though Unicode-based Malayalam content is increasing in blogs, Wikipedia and websites. The general motivation behind compression is the optimum use of resources such as data storage space or transmission capacity.

    The availability of a standard Unicode script and of Google's online translation service has encouraged the use of Malayalam on the internet. The statistics of Malayalam Wikipedia clearly indicate that Malayalam content has been increasing steadily since 2006. Moreover, the searchable archives of Malayalam publications, including eBooks and journals, are likely to grow in the coming years. This makes it worthwhile to think seriously about Malayalam text compression for the optimum use of resources. Every language normally has certain hidden, statistically significant features and a certain redundancy. Exploiting these features helps us to frame a suitable text compression tool. Motivated by the language studies of English based on Shannon's theory, our framework proposes an informational analysis of Malayalam text. Interestingly, every language structure has a certain bias in its messages: some characters are more likely to occur than others. In general, the symbols in a language follow an unequal probability distribution. Every compression algorithm tries to represent the input message in a new form with fewer bits by exploiting this probability distribution. The proposed Malayalam text compressor follows a variable-length encoding technique in which the most probable Unicode characters are represented by fewer bits. Moreover, we were able to derive a theoretical limit of 21% for Malayalam text compression. A compression tool was developed using Java/J2EE with Apache Tomcat as the web server. Since no similar work has been reported, we created a small dataset from Malayalam blogs, Wikipedia and websites to test the performance of the tool. The proposed compressor, based on variable-length coding, achieved a compression ratio of 17% in the best case. The performance of the proposed algorithm is analysed in terms of percentage of compression and compression ratio.

    Keywords: Compression, Entropy coding, Natural language processing

    I. Introduction

    Malayalam is the mother tongue of about 3 crore (30 million) people residing in Kerala, the southern state of India. As historical background, Malayalam is the youngest of the four major Dravidian languages spoken in South India, and it is the official language of Kerala. It is from the traditions of Sanskrit, the Indo-Aryan language, that Malayalam draws its rich diversity of words and compound alphabets (conjuncts). Malayalam is closer to the pre - Tamil Malayalam in phonology, morphology and syntax, the major feature which sets apart the two being the heavy Sanskrit borrowing in Malayalam [1]. It is only from the 8th century AD that Malayalam developed literature independent of Tamil.

    Languages are generally carriers of communication. Computer technology has advanced so far that people can now convey messages and share their thoughts in their own mother tongue. It is commendable that the Kerala Government has given importance to Malayalam computing in information technology. This opens a new world of opportunities for many who are not proficient in English to get in touch with the wider world. Malayalam content started appearing on the internet during the early 2000s. The statistics of Malayalam Wikipedia content show a progressive rise since 2006. According to the statistics published by Malayalam Wikipedia in April 2012 [2], there are nearly 24,000 articles.


    Figure 1.1: Wiki Malayalam Content Statistics

    [Plot: Malayalam content (vertical axis, 0 to 30000) against year (horizontal axis).]

    The growth rate of Malayalam articles displayed in Figure 1.1 clearly indicates a need for Malayalam compression tools in the near future. Compression is required for the effective storage of information and for its smooth transmission over a channel. Compression is employed everywhere: images on the web generally follow the JPEG or GIF standards, and audio files follow the MP3 standard. Moreover, several file systems automatically compress files when they are stored. The possibility of compression was first studied in detail for English by the American mathematician Claude Elwood Shannon. Shannon's seminal paper [3] stated that sequences in English are not formed at random; they usually follow a statistical structure. For example, ‘e’ occurs more frequently than ‘q’. This structure can be exploited to achieve a smaller representation of the input file. Under the same assumption, as a first step we took the frequencies of occurrence of Malayalam characters from a study report [4]. In addition to the characters specified in the report, space and full stop were included. The informational analysis of Malayalam text is carried out by creating a dataset from popular Malayalam blogs and websites [5, 6]. To date, no state-of-the-art work is known to perform compression of Malayalam text. Hence no benchmarks exist for comparison with the proposed work.

    II. Entropy and Compressibility

    Communication is the process of sharing ideas, thoughts, facts and information from one person to another. Languages have developed as a means of effective communication. Irrespective of the diversity in human biological traits, every communication system follows a common process of transmitting a message from one point to another. The hidden statistical nature of the communication process was first recognized by Shannon, who used mathematics to unify the theory [3]. In this famous work, Shannon emphasized that languages are not framed in a random manner; a specific style is followed in framing a language. Most of the advances in digital technology since then, including the connecting of people through social networking sites, blogs and email, draw inspiration from Shannon's idea.

    The information content of a message is the amount of surprise it creates in us [7]; in other words, an unusual scenario carries more information than a usual one. Shannon defined the measure of information contained in a message based on the probability of each symbol in it. Suppose there are n symbols {a1, a2, …, an} emanating independently of each other from a source, with probabilities {p1, p2, …, pn} respectively. Then the information content of any message of size k made out of these symbols is given by

    $$I = -\sum_{i=1}^{k} \log p_i \tag{1}$$
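    Equation (1) can be illustrated with a minimal sketch. It assumes base-2 logarithms (so the result is in bits), which the text does not state, and a hypothetical four-symbol source whose probabilities are chosen only for illustration. The function name information_content and the probability table are assumptions and are not part of the original work.

```python
import math


def information_content(message, probabilities):
    """Return -sum(log2 p) over every symbol occurrence in the message (equation 1).

    Base 2 is an assumption; the paper does not state the logarithm base.
    """
    return -sum(math.log2(probabilities[symbol]) for symbol in message)


# Hypothetical source alphabet; these probabilities are illustrative only.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Frequent symbols contribute little information, rare symbols contribute more.
print(information_content("aab", probabilities))  # 1 + 1 + 2 = 4.0
print(information_content("dcd", probabilities))  # 3 + 3 + 3 = 9.0
```

    Both messages have three symbols, but the one built from rare symbols carries more information, which is the property a variable-length code exploits.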

    For example, the information content of a phrase such as “vande matharam” can be computed from the standard probabilities of occurrence of the letters of the English alphabet [7] as 54.79. The symbol ai, which occurs with probability pi, is expected to occur n*pi times in the whole message. Thus the total information IT of the message is given by

    $$I_T = -\sum_{i} (n \cdot p_i) \log p_i \tag{2}$$

    And the average information per symbol is the information entropy H, given by

     