
Entropy Rate of Thai Text and Testing Author Authenticity Using Character Combination Distribution

Theerawat Kiatdarakun and Prapun Suksompong

School of Information, Computer and Communication Technology (ICT), Sirindhorn International Institute of Technology, Thammasat University

P.O. Box 22, Thammasat-Rangsit Post Office, Pathum Thani 12121, Thailand; [email protected], [email protected]

Abstract—This paper has two main goals. The first goal is to estimate the entropy rate of Thai text, which is found to be roughly 2 bits/character. The second goal is to develop methods for text authentication based on probability distributions and information theoretic quantities. Using the proposed methods, we found that digital books composed by the same author give close numerical values, while those from different authors give much larger differences. Among the three techniques under consideration, we found that the entropy-based method provides the best test. Thirty Thai text sources of various styles are tested to increase the reliability of the study. Additionally, a comparison of the effectiveness of the proposed methods is presented.

I. INTRODUCTION

Given a text source, a probability distribution can be estimated describing how frequently individual characters are used in the text. One can extend this idea by considering how often pairs of characters appear. Higher-order consideration (more than two consecutive characters) gives even more insight into the dependency (and redundancy) among the characters used; we refer to these distributions as character combination distributions. The distributions can be used to compute information theoretic quantities such as the entropy rate and the Kullback-Leibler divergence, which will be discussed further. The entropy rate indicates the average number of bits per symbol needed to encode the data. Equivalently, it tells us the amount of redundancy in the text and the maximum compression ratio. The entropy rate of English text is well known: it is approximated to be 1.3 bits/symbol by Cover and King [15], which agrees with Shannon's estimate [16]. Interested readers can consult the references therein. To our knowledge and from a literature search, an entropy estimate for Thai text is not available. Therefore, the first goal of this paper is to provide such an estimate. We first rely on the direct computation of entropy, which will be discussed in Section IV. In addition, we also estimate the entropy rate using compression software.

Entropy and other information theoretic quantities have been applied to the categorization of texts. There are many works on text categorization into fields such as sports, politics, and computing [4]-[7]. In this paper, we consider the use of character combination distributions to categorize texts based on their authors, i.e., testing text authenticity; this is our second goal. Why is author-based categorization of texts important? Imagine a situation in which two persons claim ownership of the same book. One possible way to find the real author is to compare this work with their other works. To achieve such a comparison, we can gather the probability distribution of the characters or symbols used. In Sections IV and VI, we will demonstrate that the different writing styles of two authors lead to different character combination distributions, which can be quantified by various kinds of distance. In other words, the probability distribution characterizes the author's writing style, and the numerical values computed from it lead us to a solution of the author authenticity validation problem.

II. DETAILS OF THE STUDY

In this work, we study thirty Thai text sources, listed in Table III. Text sources # 5-20 are selected from the Tripiṭaka, which represents an old writing style. Text sources # 21-30 are collected from Dhamma-oriented novels written by Dungtrin. Dungtrin's style of writing is modern, since his works are quite new and are written for a general audience. Text sources # 1-4 are from other writers. The complete Thai alphabet, along with Arabic numerals and additional symbols, is taken into consideration. The Thai alphabet itself contains 44 consonant characters, 14 vowels, 8 tonal marks, and 10 number symbols [13]. The criterion for selecting characters is whether they are still used in modern texts. Some Thai characters, such as ‘ฃ’ and ‘ฅ’, are obsolete; all of these are excluded.


Thai text has no explicit word boundaries, but the space character should still be considered. Symbols other than the space and those in Table I are stored as a single unknown character. Therefore, a total of 92 different characters are considered here. Although some text authentication techniques for English based on word frequency have been proposed, e.g., [2], the use of character combination distributions does not require word segmentation. This is important for Thai text, which has no spaces between words. Therefore, we test text authenticity based on character combination distributions in this work.
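To make this preprocessing concrete, the following is a minimal sketch of the normalization step described above. The alphabet shown is a small placeholder, not the actual 92-character set of Table I, and all names are ours, not the paper's.

```python
UNKNOWN = "?"  # single stand-in for every out-of-alphabet symbol

def normalize(text, alphabet):
    """Keep characters in the alphabet; map everything else to UNKNOWN."""
    return "".join(ch if ch in alphabet else UNKNOWN for ch in text)

# Placeholder alphabet: a few Thai consonants plus the space character.
# The actual study uses the 92 characters of Table I (including the
# space and the unknown bucket).
alphabet = set("กขคงจฉช") | {" "}
print(normalize("ก1ข ค!", alphabet))  # -> "ก?ข ค?"
```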

Table I. The characters considered in this study.

III. ENTROPY RATE ESTIMATION

Entropy is a measure of the information content of text in a language [1]. The joint entropy per character of order n is

$$H(n) = \frac{1}{n} H(X_1, \ldots, X_n) = -\frac{1}{n} \sum_{x_1} \cdots \sum_{x_n} p(x_1, \ldots, x_n) \log_2 p(x_1, \ldots, x_n) \qquad (1)$$

Once the joint entropy is known, the conditional entropy (in bits) can be calculated as

$$H_{\mathrm{cond}}(n) = H(X_n \mid X_{n-1}, \ldots, X_1) = H(X_1, \ldots, X_n) - H(X_1, \ldots, X_{n-1}) \qquad (2)$$
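As a hedged illustration, Eqs. (1) and (2) can be computed from sliding-window n-gram counts as in the sketch below; function and variable names are our own, not from the paper.

```python
import math
from collections import Counter

def ngram_counts(text, n):
    """Count every contiguous n-character sequence in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def joint_entropy_per_char(text, n):
    """H(n) of Eq. (1): joint entropy (bits) of n consecutive
    characters, divided by n."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values()) / n

def conditional_entropy(text, n):
    """H_cond(n) of Eq. (2): H(X_1..X_n) - H(X_1..X_{n-1})."""
    if n == 1:
        return joint_entropy_per_char(text, 1)
    return (n * joint_entropy_per_char(text, n)
            - (n - 1) * joint_entropy_per_char(text, n - 1))
```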

The entropy of text is hard to measure. Language is not a stationary ergodic source, since it changes over time. Therefore, the best we can do is to assume our text sources are stationary and ergodic and estimate the entropy rate under this assumption. Assuming that printed text is stationary and ergodic, Cover and King [15] stated that, as the order increases to infinity,

$$\lim_{n \to \infty} H(n) = \lim_{n \to \infty} H_{\mathrm{cond}}(n) \qquad (3)$$

The entropy rate of the text is the common limit at which Eq. (3) holds. Directly estimating $H(n)$ and $H_{\mathrm{cond}}(n)$ for large n turns out to be unreliable due to the limited number of characters in each text source. Therefore, we also use WinRAR [17] to estimate the entropy rate of Thai text.
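In the same spirit, the compression-based estimate can be sketched with a general-purpose compressor. We use Python's lzma module as a stand-in for WinRAR (our assumption), together with cp874, a single-byte Thai encoding, so that the uncompressed text occupies 8 bits per character as in the arithmetic of Section VI-A.

```python
import lzma

def compression_entropy_estimate(text, encoding="cp874"):
    """Estimate bits/character as (compressed size / original size) * 8.
    cp874 (a single-byte Thai encoding) keeps the uncompressed text at
    8 bits per character, matching the paper's arithmetic."""
    raw = text.encode(encoding, errors="replace")
    return len(lzma.compress(raw)) / len(raw) * 8
```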

IV. FINDING AND COMPARING PROBABILITY DISTRIBUTION

The first step in authenticating text is to find the probability distributions of the texts, as they are strongly related to writing patterns. To find them, we first count every possible n-gram (a contiguous sequence of n characters) in the text and store the counts in an array, called data. Then, we find the sum of all elements of data, called sum. The probability distribution of the text is then p(x) = data/sum. This step is repeated for 1-gram, 2-gram, and 3-gram sequences, so the results are stored in three arrays. To compare two probability distributions, we consider three different families of distance.
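A minimal sketch of this counting-and-normalizing step, reusing ngram_counts from the earlier sketch (names are ours):

```python
def ngram_distribution(text, n):
    """Probability of each n-gram: its count divided by the total
    number of n-grams, i.e., p(x) = data/sum as described above."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

# One distribution per order, as in the paper:
# dists = {n: ngram_distribution(text, n) for n in (1, 2, 3)}
```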

A. Power-Sum Difference

The simplest distance we studied is the Euclidean distance between two arrays of joint probabilities. However, we consider a more general expression:

$$D_m(n) = \sum_{x_1, \ldots, x_n} \bigl( p(x_1, \ldots, x_n) - q(x_1, \ldots, x_n) \bigr)^m \qquad (4)$$

where n is the number of contiguous characters considered, m is an integer greater than 1, and p and q are the probability distributions of the two text sources under comparison. We confine the scope of this paper to n = 1, 2, and 3 only. The smaller $D_m(n)$ is, the more likely it is that the two text sources were composed by the same person.
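A direct sketch of Eq. (4) over two such n-gram distributions (dictionaries mapping n-grams to probabilities); an n-gram absent from one distribution is treated as having probability 0, which is our reading of the formula:

```python
def power_sum_difference(p, q, m=2):
    """D_m(n) of Eq. (4): sum over all n-grams of (p - q)^m."""
    grams = set(p) | set(q)
    return sum((p.get(g, 0.0) - q.get(g, 0.0)) ** m for g in grams)
```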

B. Kullback-Leibler Divergence

Another method we can use to compare two distributions is the Kullback-Leibler divergence. This distance was proposed as an index for finding rates of convergence of densities and distribution functions [3]. In other words, it is a measure of the difference between two probability distributions p and q:

$$D_{KL}(p \| q) = \sum_i p(x_i) \ln \frac{p(x_i)}{q(x_i)} \qquad (5)$$

It has been used for text categorization in [8], [9], but we use it to separate texts from different writers. If p and q are collected from books written by the same person, $D_{KL}(p \| q)$ will have a low value. Note that $D_{KL}(p \| q)$ is not symmetric, i.e.,

$$D_{KL}(p \| q) \neq D_{KL}(q \| p) \qquad (6)$$

Therefore, we experiment with both schemes, swapping p and q, and observe which one yields the better result. In addition, more contiguous characters can be considered by generalizing Eq. (5) to

978-1-4673-0734-5/12/$31.00 ©2012 IEEE 493

Page 3: [IEEE 2012 Second International Conference on Digital Information and Communication Technology and it's Applications (DICTAP) - Bangkok, Thailand (2012.05.16-2012.05.18)] 2012 Second

$$D_{KL,n}(p \| q) = \sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) \ln \frac{p(x_1, \ldots, x_n)}{q(x_1, \ldots, x_n)} \qquad (7)$$

Higher-order Kullback-Leibler divergence exploits the dependency between characters; thus, more effective text authentication may be achieved.
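Eq. (7) can be sketched as below. The small floor on q is our assumption: the paper does not specify how n-grams present in p but absent from q are handled, and without some floor the divergence would be infinite.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL,n(p||q) of Eq. (7), using the natural log as in the paper.
    eps floors q-probabilities of n-grams unseen in q (our assumption)."""
    return sum(pv * math.log(pv / max(q.get(g, 0.0), eps))
               for g, pv in p.items() if pv > 0)

# Per Eq. (6), both orderings should be tested:
# kl_divergence(p, q) and kl_divergence(q, p)
```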

C. Entropy-Based Distance

Entropy measures the degree of uncertainty of a random variable. It can be used in text, image, or object categorization [10]-[12]; however, we use it to measure the similarity between two texts, since it is derived from the probability distributions of each text.

Figure 1: Joint Entropy per Character of 30 Thai text sources.

We could not see a noticeable difference directly from these entropy values; therefore, we derived a simple equation to obtain standardized values:

$$D_{H,m}(n) = \sum_{k=1}^{n} \bigl( H(k)_{\text{Sample 1}} - H(k)_{\text{Sample 2}} \bigr)^m \qquad (8)$$

where m is an integer greater than 1 and the two subscripts, ‘Sample 1’ and ‘Sample 2’, denote the two text sources being compared. A very small $D_{H,m}(n)$ means the two text sources are very likely written by the same person. Conversely, a large value of $D_{H,m}(n)$ implies that they were written by different authors.
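Reusing joint_entropy_per_char from the earlier sketch, Eq. (8) becomes (names are ours):

```python
def entropy_distance(text1, text2, n, m=2):
    """D_H,m(n) of Eq. (8): sum over orders k = 1..n of the m-th power
    of the difference between the two texts' H(k) values."""
    return sum((joint_entropy_per_char(text1, k)
                - joint_entropy_per_char(text2, k)) ** m
               for k in range(1, n + 1))
```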

V. CRITERIA FOR THE PERFORMANCE COMPARISON

After applying these methods to all text sources, we can see that some methods separate the text sources with more distinctive numerical results than others. Therefore, it is important to rank the proposed methods in order of their efficiency so that we know which one is most suitable. How do we judge the effectiveness of each method? This cannot be done by visual inspection of the results alone; it requires a standard numerical measure.

In Section VI, we use text source # 22 as the base sample and find the distance between it and text sources # 1-30. Let α be the minimum distance from this text source to sources # 1-20, which were not written by Dungtrin, and let β be the maximum distance from this text source to sources # 22-30, which were written by Dungtrin. The most desirable result is for α to be much higher than β. We define the variable

$$\mathrm{Eff} = \frac{\alpha - \beta}{\beta} \qquad (9)$$

where Eff stands for ‘efficiency’. The larger Eff is, the better the method's capability to authenticate texts. An Eff of less than zero means the method is not usable, since some texts from a different writer have lower numerical values than those written by the same author, that is, α < β.
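A small sketch of this criterion, assuming the per-source distances have already been computed (the index sets and parameter names are ours, following Section V):

```python
def efficiency(distances, other_author_ids, same_author_ids):
    """Eff of Eq. (9). alpha: minimum distance to texts by other
    authors; beta: maximum distance to texts by the same author."""
    alpha = min(distances[i] for i in other_author_ids)
    beta = max(distances[i] for i in same_author_ids)
    return (alpha - beta) / beta

# Example with text source #22 as the base sample:
# efficiency(distances, other_author_ids=range(1, 21),
#            same_author_ids=range(22, 31))
```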

VI. EXPERIMENTAL RESULTS AND DISCUSSION

A. Entropy Rate Estimation

Figure 2: The trend of $H(n)$ and $H_{\mathrm{cond}}(n)$ for text source # 3.

We estimate the joint entropy per character of text source # 3 as shown in Fig. 2. Both the joint entropy per character and the conditional entropy show a downward trend as the order increases. The 1st- to 3rd-order entropies are higher than those of English text [14]; this is likely because Thai text contains a greater number and variety of characters.

We can see that Fig. 2 shows no trend conforming to Eq. (3), since higher orders lead to greater separation between the two curves. Therefore, the sample size of 806,191 characters (text source # 3) may not be large enough for entropy rate estimation at high orders. An alternative fast and efficient method is to estimate the entropy rate using compression software.


The uncompressed version of the third text source has an original file size of 801 KB, which is reduced to 201 KB after compression by WinRAR. The compression ratio is then 201/801 = 0.251. Uncompressed digital text uses 8 bits/character; therefore, the entropy rate of Thai text would be 0.251 × 8 ≈ 2.01 bits/character. Applying this method to all text sources of reasonably large size, say larger than 600,000 characters, the entropy rate of Thai text is estimated to be 1.09-3.27 bits/character.

B. Testing Authenticity Using Power-Sum Difference

We compared each text source with the base sample (text source # 22) using Eq. (4) with m = 2 and plotted the results in Fig. 3. We chose text source # 22 as the base sample because of its large size. The line in Fig. 3 separates Dungtrin's works (text sources # 21-30) from the other text sources. The values of $D_2(1)$, $D_2(2)$, and $D_2(3)$ for text sources # 21-30 are expected to be low, since the same person wrote them. Conversely, the values for text sources # 1-20 should be relatively high due to the different writers. There is a lapse at text source # 29 due to its sample size of less than 60,000 characters; this is an example of performance degradation with a very small sample. If m is greater than 2, the values to the left and right of the line are more separated. Overall, the power-sum difference may not be a good choice, since the values do not differ greatly. The n = 1 scheme yields the clearest separation; however, it does not take into account the dependency between contiguous characters.

Figure 3: The value of $D_2(n)$ for n = 1, 2, and 3 when comparing each text source with text source # 22.

C. Testing Authenticity Using Kullback-Leibler Divergence

Again, we use text source # 22 as the base sample, and each text source is compared with it using Eq. (7). Because of Eq. (6), two schemes are tested: the base sample as p and the base sample as q.

1) 1st Order (n = 1)

Figure 4: The value of $D_{KL,1}(p \| q)$ when comparing each text source with text source # 22.

When the base sample is q, some values (dots) for text sources # 21-30 in Fig. 4 are greater than some values (dots) from the other text sources. This implies that the method is not usable. In addition, the results from the other scheme (circles) are also unacceptable, as only minor differences can be seen in the graph.

2) 2nd Order (n = 2) and 3rd Order (n = 3)

Figure 5: The value of $D_{KL,n}(p \| q)$ for n = 2 and 3 when comparing each text source with text source # 22.

Clearer distinctions in value can be observed in the plot in Fig. 5. Text sources # 1 and 2 are modern Dhamma books written by Sudassā Onkom, which makes them quite similar to Dungtrin's books in terms of character combination distribution. Even more noticeably higher values come from text sources # 5-20, selected from the Tripiṭaka, whose writing style differs greatly from Dungtrin's.

Even more distinction in values can be observed in the red plots in Fig. 5. Although the ‘x’ values for text sources # 27-30 are higher than the first ‘x’ value, the method is still acceptable, since text sources # 27-30 contain relatively few characters, fewer than 260,000. Very good results can be seen for text sources # 21, 23, and 24, each about a million characters long.

D. Testing Authenticity Using Joint Entropy per Character

Figure 6: The value of $D_{H,2}(n)$ for n = 1 and 2 when comparing each text source with text source # 22.

Figure 7: The value of $D_{H,2}(n)$ for n = 3, 4, and 5 when comparing each text source with text source # 22.

As expected, the values to the left of the line are higher than the values to the right of the line in Figs. 6 and 7. We can see that the values for Dungtrin's works (text sources # 21-30) are very near zero, except for samples 29 and 30, which have only about 60,000 characters each. Hence, Eq. (8) is very likely to be valid. To make Eq. (8) more reliable, higher values of n are applied; the results are shown in Fig. 7. A very clear separation in value can be observed for Dungtrin's text sources # 21, 23, and 24, which have the largest numbers of characters among text sources # 21-30. Therefore, larger text sources make Eq. (8) much more accurate. The variable m in Eq. (8) could be increased to yield more distinguishable results.

E. Comparison of Proposed Methods

We compare all previous results using Eq. (9) and summarize them in Table II. However, we must keep in mind from the previous results that a text source of very small size degrades performance. Therefore, only text sources of more than 600,000 characters are considered here. Note from Eq. (9) that

$$\mathrm{Eff} = \frac{\alpha - \beta}{\beta}$$

From Table II, $D_{KL,1}(q \| p)$ (q is the base sample) is a good example of an unusable method, since Eff < 0. By using the joint entropy, we can effectively separate pieces of work, as reflected in its high efficiency. Higher values of n tend to yield more satisfactory results, since more entropy values, i.e., $H(1)$ to $H(n)$, are taken into account.

Table II. Eff of each method.

Method            Eff      Rank
D_2(1)            0.5      10
D_2(2)            0.19     12
D_2(3)            0.17     13

p is the base sample:
D_KL,1(p||q)      2.51     5
D_KL,2(p||q)      0.90     9
D_KL,3(p||q)      0.92     8

q is the base sample:
D_KL,1(q||p)      -0.11    N/A
D_KL,2(q||p)      1.00     7
D_KL,3(q||p)      1.53     6

D_H,2(1)          6.60     3
D_H,2(2)          4.00     4
D_H,2(3)          7.72     2
D_H,2(4)          9.42     1
D_H,2(5)          10.59    11


VII. CONCLUSION

Entropy rates vary across languages due to the components and usage of each language. This can be seen from the entropy rate of Thai text, which is higher than that of English text; consequently, the achievable data compression ratio for Thai is more constrained than for English text. Most methods can distinguish authors accurately, provided the text is large; an instance of this is Fig. 7, where the results are likely valid for large text sources (more than 600,000 characters). Thirty text sources validate our methods to some extent. Finally, there is plenty of future work for this project, such as optimizing the proposed equations, which the authors have not done at this point, and testing more text sources.

REFERENCES

[1] Claude Elwood Shannon, “A Mathematical Theory of Communication”, Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, 1948.

[2] Ronald Thisted, and Bradley Efron, “Did Shakespeare write a newly-discovered poem?” Biometrika, vol.74, pp. 445–455, 1987.

[3] M.E. Mayer and D.V. Gokhale, “Kullback-Leibler Information Measure for Studying Convergence Rates of Densities and Distributions”, IEEE Transactions on Information Theory, vol. 39, no. 4, pp. 1401-1404, 1993.

[4] Xin-Fu Li, Hai-Bin He, and Lei-Lei Zhao, “Chinese Text Categorization Based on CCIPA and SMO”, vol.5, pp.2514-2518, 2008.

[5] Zhong Gao, Guanming Lu, and Daquan Gu, “A Novel Hybrid system for Large-Scale Chinese Text Classification Problem”, pp. 121-124, 2009.

[6] Dingsheng Luo, Xinhao Wang, Xihong Wu, and Huisheng Chi, “Learning Effective Features for Chinese Text Categorization”, pp. 608-613, 2005.

[7] Tam T. Nguyen, Kuiyu Chang, and Siu Cheung Hui, “Word Cloud Model for Text Categorization”, pp. 487-496, 2012.

[8] Zhilong Zhen, Xiaoqin Zeng, Haijuan Wang, Lixin Han, “A Global Evaluation Criterion for Feature Selection in Text Categorization Using Kullback-Leibler Divergence”, pp. 440-445, 2011.

[9] Baoyi Wang and Shaomin Zhang, “A Novel Text Classification Algorithm Based on Naïve Bayes and KL-divergence”, pp. 913-915, 2005.

[10] Zhili Pei, Yuxin Zhou, Lisha Liu, Lihua Wang, Yinan Lu and, Ying Kong, “A Mutual Information and Information Entropy Pair based Feature Selection Method in Text Classification”, vol.6, pp. V6-258 – V6-261, 2010.

[11] Sungho Kim and In So Kweon, “Object Categorization Robust to Surface Markings using Entropy-guided Codebook”, WACV ‘07, pp. 22, 2007.

[12] Sungho Kim, In so Kweon, and Chi-Woo Lee, “Visual Categorization Robust to Large Intra-Class Variations using Entropy-guided Codebook”, pp. 3793-3798, 2007.

[13] Chomtip Pornpanomchai and Montri Daveloh, “Printed Thai Character Recognition by Genetic Algorithm”, vol.6, pp. 3354-3359, 2007.

[14] M. Wanas, A. Zayed, M. Shaker, and E. Taha, “First- second- and third-order entropies of Arabic text (Corresp.)”, IEEE Transactions on Information Theory, vol. 22, no. 1, p. 123, 1976.

[15] Thomas M. Cover and Roger C. King, “A Convergent Gambling Estimate of the Entropy of English”, IEEE Transactions on Information Theory, vol. 24, no. 4, pp. 413-421, 1978.

[16] Claude Elwood Shannon, “Prediction and Entropy of Printed English”, Bell System Technical Journal, vol. 30, pp. 50-64, Jan. 1951.

[17] Eugene Roshal, WinRAR (Version 3.80) [Computer Software]

Table III. Details of Selected Thai Text Sources

Text Source Index   # of Characters   Title
1    1131664   Sudassā Onkom, “All sentient beings are dependent on their Karma”
2    70774     Sudassā Onkom, “Fai nai lao ron thao fai narok”
3    806191    The Milinda Panha, Silapa Bannakarn Press, 2006
4    676329    Summarized Report on the Situation of the Nation, Office of the National Economic and Social Development Board, June 2011
5    1099648   Vinaya Piṭaka, Mahabhivang, 1st Book, Episode 1
6    1055476   Vinaya Piṭaka, Mahabhivang, 1st Book, Episode 2
7    1298296   Vinaya Piṭaka, Mahabhivang, 1st Book, Episode 3
8    1055478   Vinaya Piṭaka, Mahabhivang, 2nd Book
9    517376    Vinaya Piṭaka, Bhikkhunivibhanga
10   772386    Vinaya Piṭaka, Mahavagga, Episode 1
11   646506    Vinaya Piṭaka, Mahavagga, Episode 2
12   634843    Vinaya Piṭaka, Cullavagga, Episode 1
13   698208    Vinaya Piṭaka, Cullavagga, Episode 2
14   975300    Vinaya Piṭaka, Parivara
15   737737    Sutta Piṭaka, Digha Nikaya, Silakkhandha-vagga, Episode 1
16   395342    Sutta Piṭaka, Digha Nikaya, Silakkhandha-vagga, Episode 2
17   821163    Sutta Piṭaka, Digha Nikaya, Maha-vagga, Episode 1
18   561041    Sutta Piṭaka, Digha Nikaya, Maha-vagga, Episode 2
19   438612    Sutta Piṭaka, Digha Nikaya, Patika-vagga, Episode 1
20   590214    Sutta Piṭaka, Digha Nikaya, Patika-vagga, Episode 2
21   1148988   Dungtrin, Thāng narưphān
22   1035545   Dungtrin, Kam phayākon. Ton chana kam
23   972629    Dungtrin, Kam phayākon. Ton lưak kœt mai
24   646919    Dungtrin, Čhit čhakkraphat (All 3 Episodes)
25   559703    Dungtrin, Čhet duan banlutham
26   320989    Dungtrin, Sīadāi--khontāi maidai ʻān
27   251561    Dungtrin, Mī chīwit thī khit mai thưng
28   102543    Dungtrin, Na moranā: rūam botkhwām khatsan khong Dangtrin
29   59795     Dungtrin, Watha DungTrin: Chabap Khwam Rak Lak Si
30   59071     Dungtrin, Watha DungTrin: Chabap Chuan Khit
