Main Content Extraction from Persian HTML Files

Motivation UTF-8 Encoding Form R2L Results Conclusion

Using UTF-8 to Extract Main Contentof

Right to Left Language Web Pages

Hadi Mohammadzadeh Franz Schweiggert and GholamrezaNakhaeizadeh

Ph.D. CandidateInstitute of Applied Information Processing

University of UlmGermany

[email protected]

July 2011

1 / 27


Outline

1 Motivation

2 UTF-8 Encoding Form

3 R2LAlgorithmData Sets, Evaluation

4 Results

5 Conclusion

2 / 27


Outline

1 Motivation



4 Results

5 Conclusion

3 / 27


What are the R2L laguages?Laguages which are written from the right side to the leftside, Arabic, Persian, Urdu and Pshtoo

What are the benefits of MCE?Web Mining (WM) and Information Retrieval (IR)applications use the MCE algorithms as a preprocessingstep on the raw HTML data to reduce noise and to obtainmore accurate resultsOther applications use the MCE to reduce the documentsize for presenting on screen readers and small screendevices .Furthermore, the MCE can be considered as preprocessingfor text mining .

4 / 27


Why should we work on the right to left (R2L) languages?In the early stage of the Internet, most of the web pageswere written in the English language. Now, especially in thelast decade, a large portion of information has beenpublished in other languages such as Spanish, German,French and right-to-left (R2L) languages such as Persian,Arabic languages

What are our goals?From a technical point of view, most of the main contentextraction approaches separate the MC from the extraneousitems using the HTML tags. This leads to use a parser forall web pages which consequently increases the computationtime. Thus, the main goal of my research is to increase boththe accuracy and effectiveness of the MCE algorithmsdealing with R2L languages

5 / 27


An example of web pages with selected MC

6 / 27


Outline

1 Motivation



4 Results

5 Conclusion

7 / 27


In UTF-8, English characters need only one byte, withvalues less than 128In UTF-8, all letters of non-English-languages (Persian ,Arabic, Pashto, and Urdu Languages) take exactly 2 bytes,each with value greater than 127By using UTF-8 and simple condition , we are able toseperate English characters from R2L characters .

If the value of one byte is < 128 ⇒ this byte is a member offirst 128 charcters including English charactersOtherwise it is a member of R2L characters

8 / 27


Outline

1 Motivation



4 Results

5 Conclusion

9 / 27


Algorithm

Characteristics

R2L is stable despite invalid or ill-defined HTML files.R2L produces high accuracy results on most testeddocuments.The basic idea behind the R2L is the differentiationbetween the English character set and the R2L characterset.

10 / 27


Algorithm

R2L consists of two steps:In the First step, we seperate R2L from non-R2L charactersin each line of the HTML file and save results in an array,named TIn the Second step, we find a region comprising MC in thearray T

11 / 27


Algorithm

Step One : Separating R2L Characters from Non-R2L ones based on Simple Rule

S1 = {All characters belonging to R2L languages}S2 = {All first 128 characters of UTF-8}

R2L reads an HTML file line by line, then classifies charactersin each line in two sets, setOne and setTwo. If character is amember of S1 then we add it to setOne, otherwise to setTwo.

setOne ⊂ S1setTwo ⊂ S2

12 / 27


Algorithm

Step One : Separating R2L Characters from Non-R2L ones based on Simple Rule

To save setOne and setTwo of one line we use structure below:struct HTML_file_line{

String EPL; // Non-R2L ChractersString NEPL; // R2L Characters

}

For each line, the first and second fields save setOne andsetTwo, respectively.For saving all the lines of the HTML file, we declare anarray, T, of the above structure.

13 / 27


Algorithm

Step Two : Finding a Region Comprising MC in Array T

number of linesin the HTML le

number of characters

belonging to S1

number of characters

belonging to S2

B

A

A

C

Recognizing an area in the array T, in which:Characters in fields of NEPL have high densityCharacters in fields of EPL have low density

NowFor the field NEPL, a vertical line with identical length is drawn upside of the x-axis.Similarly, for the field EPL a vertical line with equal length is drawn downside of thex-axis.

14 / 27


Algorithm


Different types of regions:

Regions with low or near zero density of columns above thex-axis and high density of columns below the x-axis, labeledARegion with high density of columns above the x-axis and alow density of columns below the x-axis, labelled BRegions with slight difference between the density of thecolumns above and below the x-axis, labelled C

15 / 27


Algorithm


Now the problem of finding MC in the HTML file becomes theproblem of finding region B. To find Region B we should follow3 steps:

1) For all columns we calculate diff i and draw new diagram

diffi = length(NEPLi) − length(EPLi)

If diff i > 0 then we draw a line with the length of diff i

above the x-axixOtherwise , we draw a line with the length of absolute valueof diff i below the x-axix

2) We find a column with the longest length above thex-axis3) Finding the boundaries of the MC region, but how

16 / 27


Algorithm


diff calculated by formula (1)

number of linesin the HTML le

main content

17 / 27


Algorithm


After recognizing the longest column above the x-axis, thealgorithm moves up and down in the HTML file to find allparagraph belonging to the MC. But where is the end ofthese movements?The number of lines we need to traverse to find the nextMC paragraph is defined as a parameter P.By considering this parameter, we move up or down,respectively until we can not find a line containingcharacters which are members of S1.At this moment, all lines we found make our MC.Based on the heuristic fitness function, which will beexplained in next pages, we consider a value of 8 as adefault value for parameter P.

18 / 27


Algorithm

Fitness Function to Optimize the R2L Algorithm

All MCE algorithms suffer from some parameters whichshould be tunedThere is, at most, 20 lines between two paragraphs in theMC regionIn R2L, to find the best value for the parameter P, weimplemented a heuristic fitness function (HFF)HFF finds the best value for parameter P in the interval[1, 20] for each web site

19 / 27


Data Sets, Evaluation

Data Sets

Web site Num. of Pages LanguagesBBC 598 Farsi

Hamshahri 375 FarsiJame Jam 136 Farsi

Ahram 188 ArabicReuters 116 Arabic

Embassy of 31 FarsiGermany, Iran

BBC 234 UrduBBC 203 PashtoBBC 252 ArabicWiki 33 Farsi

Arabic, Farsi, Pashto, and UrduWe have collected 2166 web pages from different web sites.

20 / 27


Data Sets, Evaluation

Evaluation

To calculate the accuracy, we first need to manually make agold standard file, gs-file.Compare the output of the R2L , called MC-file, with acorresponding gs-file.Use a metric to compare the gs-file with the MC-file.Utilize Longest Common Subsequence (LCS) as a metric tocompare two substrings.Evaluate the accuracy by applying - Recall, Precision, andF1-measure.

r =length(k)

length(g), p =

length(k)

length(m), F1 = 2 ∗

p ∗ r

p + r

21 / 27


Outline

1 Motivation



4 Results

5 Conclusion

22 / 27


Web site Languages F1-measure Best value for F1-measure withwith P = 8 Parameter P New Parameter

in Column 4BBC Farsi 0.9906 8 0.9906

Hamshahri Farsi 0.9909 8 0.9909Jame Jam Farsi 0.9769 3 0.9872

Ahram Arabic 0.9295 7 0.983Reuters Arabic 0.9356 4 0.9708

Embassy of Farsi 0.9536 15 0.9715Germany, Iran

BBC Urdu 0.9564 11 0.9972BBC Pashto 0.9745 8 0.9745BBC Arabic 0.987 8 0.987Wiki Farsi 0.283 16 0.3852

23 / 27


Comparison with Other Algorithms

Web site R2L , P = 8 R2L, the best P LICEABBC-Farsi 0.9906 0.9906 0.7755Hamshahri 0.9909 0.9909 0.9406Jame Jam 0.9769 0.9872 0.9085

Ahram 0.9295 0.983 0.9342Reuters 0.9356 0.9708 0.9221

Embassy of 0.9536 0.9715 0.8919Germany, Iran

BBC-Urdu 0.9564 0.9972 0.9495BBC-Pashto 0.9745 0.9745 0.9403BBC-Arabic 0.987 0.987 0.302

Wiki 0.283 0.3852 0.7121

LICEA, Language Independent CE Algorithm, wascompared with many MCE algorithms and it is one thebest MCE algorithm proposed in 2009Results of R2L are superior to LICEA

24 / 27


Outline

1 Motivation



4 Results

5 Conclusion

25 / 27


Conclusion and Future Work

Result shows that the R2L produces satisfactory MC withF1-measure > 0.92Advantages

It is DOM tree and HTML-format independentWe do not need to use parser for our algorithm, Usingparser is time consumingBy fine-tuning of the parameter P for each web site, we seean increase in the F1-measure.

Future worksWorking on Wikipedia web pages to obtain better resultsTrying to extend R2L to a free-parameter algorithm

26 / 27


Thank you for your attention

27 / 27

Technology

Main Content Extraction from Persian HTML Files