Web Page Language Identification Based on URLs
Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao
Reference
  E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.
Outline
  Introduction
  Language Identification Based On URLs
  Experimental Setup
  Experimental Results
  Conclusions
Introduction
  Given only the URL of a web page, can we identify its language?
    Web crawlers
    Personalized web browsers
  We consider the problem of determining the language of a web page using only its URL.
    English, French, German, Spanish, and Italian
    .com (60%), .org (10%)
    Example URL: www.wasserbett-test.com
Introduction
  Applying machine learning techniques
  Features
    Word features
    N-gram features
    Custom-made features
  Machine learning algorithms
    Naïve Bayes
    Decision Tree
    Relative Entropy
    Maximum Entropy
Extracting Feature Vectors
  Words as features
    Remove tokens such as "www", "index", "html", etc.
    For example, http://www.internetwordstats.com/africa2.htm is split into: internetwordstats, com, africa
    "cnn" and "gov" are indicative of English
    "produits" and "recherche" are indicative of French
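
The word-feature step can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `url_tokens` is hypothetical, and the stop list is an assumption. The slide names "www", "index", and "html" and elides the rest; "http", "https", and "htm" are added here because the slide's example output omits them.

```python
import re

# Assumed stop list: only "www", "index", and "html" are named on the slide;
# the remaining entries are guesses that reproduce the slide's example.
STOP_TOKENS = {"www", "index", "html", "htm", "http", "https"}


def url_tokens(url):
    """Split a URL at every run of non-letter characters and drop stop tokens."""
    tokens = re.split(r"[^a-z]+", url.lower())
    return [t for t in tokens if t and t not in STOP_TOKENS]


print(url_tokens("http://www.internetwordstats.com/africa2.htm"))
# ['internetwordstats', 'com', 'africa']
```

Splitting at every non-letter character means digits such as the "2" in "africa2" act as separators, which matches the example on the slide.
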
  Trigrams as features
    Start with the same tokens as in the method above (words as features)
    E.g., "weather" yields "_we", "wea", "eat", "ath", "the", "her", "er_"
    "_th" and "ing" are very common in English
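
A minimal sketch of the trigram extraction, assuming each token is padded with one underscore on either side as in the "weather" example. The function name `trigrams` is hypothetical.

```python
def trigrams(token):
    """Return the character trigrams of a token padded with '_' on both sides."""
    padded = f"_{token}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(trigrams("weather"))
# ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```

The padding is what lets a trigram such as "_th" capture that a token starts with "th".
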
  Custom-made features
    Top-level domain country code
    OpenOffice dictionaries
    Dictionary with city names
    Number of hyphens
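
As a rough illustration of how such custom-made features might be computed, the sketch below derives the top-level domain, the hyphen count, and per-language dictionary hits for one URL. Everything in it is an assumption: the toy word lists stand in for the OpenOffice dictionaries, the city-name dictionary is omitted, hyphens are counted over the whole URL, and the names `custom_features` and `TOY_DICTIONARIES` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical toy word lists standing in for the OpenOffice dictionaries.
TOY_DICTIONARIES = {
    "German": {"wasserbett", "wetter"},
    "English": {"weather", "test"},
}


def custom_features(url, dictionaries=TOY_DICTIONARIES):
    """Compute a few hand-crafted features for a single URL."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    tokens = [t for t in re.split(r"[^a-z]+", url.lower()) if t]
    features = {
        "tld": host.rsplit(".", 1)[-1],   # last label of the host name
        "hyphens": url.count("-"),        # assumption: counted over the whole URL
    }
    # Number of tokens found in each language's word list.
    for language, words in dictionaries.items():
        features[f"dict_{language}"] = sum(t in words for t in tokens)
    return features


print(custom_features("www.wasserbett-test.com"))
```
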
Classification Algorithms
  Country code top-level domain only (ccTLD)
  Country code top-level domain plus (ccTLD+)
  Naïve Bayes (NB)
  Decision Trees (DT)
  Relative Entropy (RE)
  Maximum Entropy (ME)
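
To make one of the listed learners concrete, here is a minimal multinomial Naïve Bayes over URL trigrams with add-one smoothing. It is a sketch under stated assumptions, not the configuration used in the paper: the training URLs are invented toy data, stop-token removal is skipped for brevity, and the names `trigram_features` and `NaiveBayes` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def trigram_features(url):
    """Character trigrams of every letter-only token, padded with '_'."""
    feats = []
    for token in re.split(r"[^a-z]+", url.lower()):
        if token:
            padded = f"_{token}_"
            feats.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return feats


class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()                # URLs seen per language
        self.feature_counts = defaultdict(Counter)   # trigram counts per language
        self.vocabulary = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            self.class_counts[label] += 1
            for feat in trigram_features(url):
                self.feature_counts[label][feat] += 1
                self.vocabulary.add(feat)
        return self

    def predict(self, url):
        total = sum(self.class_counts.values())
        feats = trigram_features(url)
        best_label, best_score = None, -math.inf
        for label, n_urls in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus add-one smoothed log likelihood of each trigram.
            score = math.log(n_urls / total)
            score += sum(math.log((counts[f] + 1) / denom) for f in feats)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, for illustration only.
train_urls = [
    "www.weather-news.com",
    "www.the-shopping-guide.org",
    "www.wetter-nachrichten.de",
    "www.wasserbett-kaufen.de",
]
train_labels = ["English", "English", "German", "German"]

model = NaiveBayes().fit(train_urls, train_labels)
print(model.predict("www.wasserbett-test.com"))
```

With this toy data the ".com" URL from the introduction is labelled German, because the trigrams of "wasserbett" outweigh those of "com".
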
Data Sets
  The algorithms were evaluated on three different data sets
    Open Directory Project
    Microsoft's Live Search
    1260 pages from a large web crawl, labeled by hand
Data set                Language  Training size  Test size
Open Directory Project  English   145,000        4,910
                        German    144,999        4,965
                        French    144,996        4,961
                        Spanish   144,974        4,878
                        Italian   144,987        4,933
Search Engine Results   English   99,992         999
                        German    99,572         992
                        French    99,549         997
                        Spanish   99,838         997
                        Italian   99,786         997
Web Crawl               English   0              1,082
                        German    0              81
                        French    0              57
                        Spanish   0              19
                        Italian   0              21
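Evaluation measures ($n_+$ and $n_-$ denote the numbers of positive and negative test examples)
  Precision: $P = \frac{n_+ \, p(+|+)}{n_+ \, p(+|+) + n_- \, (1 - p(-|-))}$
  Recall: $R = p(+|+)$
  $p(-|-)$: fraction of negative examples classified as negative
  F-measure: $F = \frac{2}{1/R + 1/P}$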
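
A small worked example of these measures, as a sketch. The function names and the input numbers (100 positive and 400 negative test URLs, p(+|+) = 0.9, p(-|-) = 0.95) are made up for illustration and are not results from the paper.

```python
def precision(n_pos, n_neg, p_pos, p_neg):
    """P = n+ p(+|+) / (n+ p(+|+) + n- (1 - p(-|-)))."""
    true_pos = n_pos * p_pos          # positives classified as positive
    false_pos = n_neg * (1 - p_neg)   # negatives classified as positive
    return true_pos / (true_pos + false_pos)


def f_measure(recall, prec):
    """F = 2 / (1/R + 1/P), the harmonic mean of recall and precision."""
    return 2 / (1 / recall + 1 / prec)


R = 0.9                               # recall is p(+|+) itself
P = precision(100, 400, p_pos=0.9, p_neg=0.95)
F = f_measure(R, P)
print(round(P, 4), round(F, 4))
# 0.8182 0.8571
```

Here 90 true positives and 20 false positives give P = 90/110, and the harmonic mean with R = 0.9 gives F = 6/7. Precision depends on the class balance through $n_+$ and $n_-$, while recall does not.
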
Human Performance
Baseline: ccTLD
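
The ccTLD baseline amounts to a table lookup on the country-code top-level domain. The sketch below assumes a small mapping from country codes to the five languages. The mapping and the name `cctld_baseline` are illustrative assumptions, not the table used in the paper.

```python
from urllib.parse import urlsplit

# Assumed mapping from country-code TLD to language, for illustration only.
CCTLD_TO_LANGUAGE = {
    "uk": "English",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
}


def cctld_baseline(url):
    """Return the language implied by the ccTLD, or None if it is not in the table."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return CCTLD_TO_LANGUAGE.get(host.rsplit(".", 1)[-1])


print(cctld_baseline("http://www.example.de/seite"))   # German
print(cctld_baseline("www.wasserbett-test.com"))       # None
```

The second call shows the limitation of this baseline: a German-looking URL under ".com" receives no label from the country code alone.
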
Conclusions
  This paper shows that high-quality language identifiers for web pages can be built based on URLs alone.
  The largest challenge is to identify English-looking URLs of non-English web pages.