Upload
hadi-mohammadzadeh
View
1.571
Download
4
Tags:
Embed Size (px)
Citation preview
Motivation UTF-8 Encoding Form R2L Results Conclusion
Using UTF-8 to Extract Main Contentof
Right to Left Language Web Pages
Hadi Mohammadzadeh Franz Schweiggert and GholamrezaNakhaeizadeh
Ph.D. CandidateInstitute of Applied Information Processing
University of UlmGermany
July 2011
1 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Outline
1 Motivation
2 UTF-8 Encoding Form
3 R2LAlgorithmData Sets, Evaluation
4 Results
5 Conclusion
2 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Outline
1 Motivation
2 UTF-8 Encoding Form
3 R2LAlgorithmData Sets, Evaluation
4 Results
5 Conclusion
3 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
What are the R2L laguages?Laguages which are written from the right side to the leftside, Arabic, Persian, Urdu and Pshtoo
What are the benefits of MCE?Web Mining (WM) and Information Retrieval (IR)applications use the MCE algorithms as a preprocessingstep on the raw HTML data to reduce noise and to obtainmore accurate resultsOther applications use the MCE to reduce the documentsize for presenting on screen readers and small screendevices .Furthermore, the MCE can be considered as preprocessingfor text mining .
4 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Why should we work on the right to left (R2L) languages?In the early stage of the Internet, most of the web pageswere written in the English language. Now, especially in thelast decade, a large portion of information has beenpublished in other languages such as Spanish, German,French and right-to-left (R2L) languages such as Persian,Arabic languages
What are our goals?From a technical point of view, most of the main contentextraction approaches separate the MC from the extraneousitems using the HTML tags. This leads to use a parser forall web pages which consequently increases the computationtime. Thus, the main goal of my research is to increase boththe accuracy and effectiveness of the MCE algorithmsdealing with R2L languages
5 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
An example of web pages with selected MC
6 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Outline
1 Motivation
2 UTF-8 Encoding Form
3 R2LAlgorithmData Sets, Evaluation
4 Results
5 Conclusion
7 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
In UTF-8, English characters need only one byte, withvalues less than 128In UTF-8, all letters of non-English-languages (Persian ,Arabic, Pashto, and Urdu Languages) take exactly 2 bytes,each with value greater than 127By using UTF-8 and simple condition , we are able toseperate English characters from R2L characters .
If the value of one byte is < 128 ⇒ this byte is a member offirst 128 charcters including English charactersOtherwise it is a member of R2L characters
8 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Outline
1 Motivation
2 UTF-8 Encoding Form
3 R2LAlgorithmData Sets, Evaluation
4 Results
5 Conclusion
9 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Characteristics
R2L is stable despite invalid or ill-defined HTML files.R2L produces high accuracy results on most testeddocuments.The basic idea behind the R2L is the differentiationbetween the English character set and the R2L characterset.
10 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
R2L consists of two steps:In the First step, we seperate R2L from non-R2L charactersin each line of the HTML file and save results in an array,named TIn the Second step, we find a region comprising MC in thearray T
11 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Step One : Separating R2L Characters from Non-R2L ones based on Simple Rule
S1 = {All characters belonging to R2L languages}S2 = {All first 128 characters of UTF-8}
R2L reads an HTML file line by line, then classifies charactersin each line in two sets, setOne and setTwo. If character is amember of S1 then we add it to setOne, otherwise to setTwo.
setOne ⊂ S1setTwo ⊂ S2
12 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Step One : Separating R2L Characters from Non-R2L ones based on Simple Rule
To save setOne and setTwo of one line we use structure below:struct HTML_file_line{
String EPL; // Non-R2L ChractersString NEPL; // R2L Characters
}
For each line, the first and second fields save setOne andsetTwo, respectively.For saving all the lines of the HTML file, we declare anarray, T, of the above structure.
13 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Step Two : Finding a Region Comprising MC in Array T
number of linesin the HTML le
number of characters
belonging to S1
number of characters
belonging to S2
B
A
A
C
Recognizing an area in the array T, in which:Characters in fields of NEPL have high densityCharacters in fields of EPL have low density
NowFor the field NEPL, a vertical line with identical length is drawn upside of the x-axis.Similarly, for the field EPL a vertical line with equal length is drawn downside of thex-axis.
14 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Step Two : Finding a Region Comprising MC in Array T
Different types of regions:
Regions with low or near zero density of columns above thex-axis and high density of columns below the x-axis, labeledARegion with high density of columns above the x-axis and alow density of columns below the x-axis, labelled BRegions with slight difference between the density of thecolumns above and below the x-axis, labelled C
15 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Step Two : Finding a Region Comprising MC in Array T
Now the problem of finding MC in the HTML file becomes theproblem of finding region B. To find Region B we should follow3 steps:
1) For all columns we calculate diff i and draw new diagram
diffi = length(NEPLi) − length(EPLi)
If diff i > 0 then we draw a line with the length of diff i
above the x-axixOtherwise , we draw a line with the length of absolute valueof diff i below the x-axix
2) We find a column with the longest length above thex-axis3) Finding the boundaries of the MC region, but how
16 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Step Two : Finding a Region Comprising MC in Array T
diff calculated by formula (1)
number of linesin the HTML le
main content
17 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Step Two : Finding a Region Comprising MC in Array T
After recognizing the longest column above the x-axis, thealgorithm moves up and down in the HTML file to find allparagraph belonging to the MC. But where is the end ofthese movements?The number of lines we need to traverse to find the nextMC paragraph is defined as a parameter P.By considering this parameter, we move up or down,respectively until we can not find a line containingcharacters which are members of S1.At this moment, all lines we found make our MC.Based on the heuristic fitness function, which will beexplained in next pages, we consider a value of 8 as adefault value for parameter P.
18 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Algorithm
Fitness Function to Optimize the R2L Algorithm
All MCE algorithms suffer from some parameters whichshould be tunedThere is, at most, 20 lines between two paragraphs in theMC regionIn R2L, to find the best value for the parameter P, weimplemented a heuristic fitness function (HFF)HFF finds the best value for parameter P in the interval[1, 20] for each web site
19 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Data Sets, Evaluation
Data Sets
Web site Num. of Pages LanguagesBBC 598 Farsi
Hamshahri 375 FarsiJame Jam 136 Farsi
Ahram 188 ArabicReuters 116 Arabic
Embassy of 31 FarsiGermany, Iran
BBC 234 UrduBBC 203 PashtoBBC 252 ArabicWiki 33 Farsi
Arabic, Farsi, Pashto, and UrduWe have collected 2166 web pages from different web sites.
20 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Data Sets, Evaluation
Evaluation
To calculate the accuracy, we first need to manually make agold standard file, gs-file.Compare the output of the R2L , called MC-file, with acorresponding gs-file.Use a metric to compare the gs-file with the MC-file.Utilize Longest Common Subsequence (LCS) as a metric tocompare two substrings.Evaluate the accuracy by applying - Recall, Precision, andF1-measure.
r =length(k)
length(g), p =
length(k)
length(m), F1 = 2 ∗
p ∗ r
p + r
21 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Outline
1 Motivation
2 UTF-8 Encoding Form
3 R2LAlgorithmData Sets, Evaluation
4 Results
5 Conclusion
22 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Web site Languages F1-measure Best value for F1-measure withwith P = 8 Parameter P New Parameter
in Column 4BBC Farsi 0.9906 8 0.9906
Hamshahri Farsi 0.9909 8 0.9909Jame Jam Farsi 0.9769 3 0.9872
Ahram Arabic 0.9295 7 0.983Reuters Arabic 0.9356 4 0.9708
Embassy of Farsi 0.9536 15 0.9715Germany, Iran
BBC Urdu 0.9564 11 0.9972BBC Pashto 0.9745 8 0.9745BBC Arabic 0.987 8 0.987Wiki Farsi 0.283 16 0.3852
23 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Comparison with Other Algorithms
Web site R2L , P = 8 R2L, the best P LICEABBC-Farsi 0.9906 0.9906 0.7755Hamshahri 0.9909 0.9909 0.9406Jame Jam 0.9769 0.9872 0.9085
Ahram 0.9295 0.983 0.9342Reuters 0.9356 0.9708 0.9221
Embassy of 0.9536 0.9715 0.8919Germany, Iran
BBC-Urdu 0.9564 0.9972 0.9495BBC-Pashto 0.9745 0.9745 0.9403BBC-Arabic 0.987 0.987 0.302
Wiki 0.283 0.3852 0.7121
LICEA, Language Independent CE Algorithm, wascompared with many MCE algorithms and it is one thebest MCE algorithm proposed in 2009Results of R2L are superior to LICEA
24 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Outline
1 Motivation
2 UTF-8 Encoding Form
3 R2LAlgorithmData Sets, Evaluation
4 Results
5 Conclusion
25 / 27
Motivation UTF-8 Encoding Form R2L Results Conclusion
Conclusion and Future Work
Result shows that the R2L produces satisfactory MC withF1-measure > 0.92Advantages
It is DOM tree and HTML-format independentWe do not need to use parser for our algorithm, Usingparser is time consumingBy fine-tuning of the parameter P for each web site, we seean increase in the F1-measure.
Future worksWorking on Wikipedia web pages to obtain better resultsTrying to extend R2L to a free-parameter algorithm
26 / 27