Web Spam Detection: link-based and content-based techniques

Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao
2010/11/8

Outline

• Introduction
• Web Spam: a debatable problem
• Characterizing Spam Pages
• DataSets
• Method
• Combined Classifier
• Conclusion

Introduction

• Characteristics of Web spam pages [1][2]
– Inclusion of many unrelated keywords and links.
– Use of many keywords in the URL.
– Redirection of the user to another page.
– Creation of many copies with substantially duplicate content.
– Insertion of hidden text, written in the same color as the background of the page.
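Two of the characteristics above (many keywords in the URL, and heavily repeated content) can be turned into simple numeric features. The following is a minimal sketch, not the feature set of [1] or [2]: the function names are hypothetical, and the use of zlib compressibility as a proxy for repetition is an assumption in the spirit of the content analysis in [5].

```python
import re
import zlib


def url_keyword_count(url: str) -> int:
    """Count alphabetic tokens in a URL (scheme and domain parts included)."""
    return len(re.findall(r"[A-Za-z]+", url))


def compression_ratio(text: str) -> float:
    """Return uncompressed size divided by zlib-compressed size.

    Highly repetitive text (for example, a stuffed keyword list) compresses
    well and therefore yields a large ratio. Empty text yields 0.0.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))
```

A page whose body is one phrase repeated hundreds of times gets a much higher ratio than ordinary prose; where to put the threshold is left open here and would have to be learned from labeled data.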


(Figure from [3])

Web Spam: a debatable problem

• Some definitions
– All deceptive actions that try to increase the ranking of a page in search engines are generally referred to as Web spam or spamdexing.
– An unjustifiably favorable relevance or importance score for some web page, considering the page’s true value. [4]
– Any attempt to deceive a search engine’s relevancy algorithm.
• Search Engine Optimization (SEO)

Characterizing Spam Pages

• Content spam
– Inserting a large number of keywords.
– An automatic classifier has been shown to detect 82-86% of spam pages of this type. [5]
• Link spam
– A link farm is a densely connected set of pages, created explicitly to deceive a link-based ranking algorithm.
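"Densely connected" can be made concrete by measuring what fraction of the possible directed links inside a candidate set of pages actually exist. This is an illustrative sketch only; the function name and the (source, target) edge-list input are assumptions, and it is not the dense-subgraph discovery method of [6].

```python
def link_density(pages, links):
    """Fraction of possible directed links among `pages` that are present.

    `pages` is an iterable of page identifiers and `links` an iterable of
    (source, target) pairs. Self-links and links that leave the set are
    ignored. Returns a value between 0.0 and 1.0.
    """
    members = set(pages)
    n = len(members)
    if n < 2:
        return 0.0
    internal = {
        (u, v)
        for (u, v) in links
        if u in members and v in members and u != v
    }
    return len(internal) / (n * (n - 1))
```

A set of pages in which every page links to every other has density 1.0, while a typical group of unrelated pages would be expected to score near zero.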


Link Farm[6]

“manipulation of the link structure by a group of users with the intent of improving the rating of one or more users in the group”.

High and low-ranked pages are different


DataSet[7]

• WEBSPAM-UK2006
– .uk domain
– 77.9 million pages, over 3 billion links, 11,400 hosts, May 2006.
– http://barcelona.research.yahoo.net/webspam/

TrustRank[4]
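TrustRank [4] propagates trust from a manually chosen seed set of good pages along out-links, using a PageRank-style iteration whose random jump returns only to the seeds. The sketch below assumes a dictionary mapping each page to its list of out-links; the function name, the default `alpha` and `iterations`, and the choice to let trust from pages without out-links simply vanish are assumptions of this example, not details taken from the slides.

```python
def trustrank(out_links, seeds, alpha=0.85, iterations=20):
    """Biased PageRank: trust flows from `seeds` along out-links."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    seed_set = {s for s in seeds if s in nodes}
    if not seed_set:
        raise ValueError("at least one seed must be a node of the graph")

    # Static distribution d: uniform over the trusted seeds, zero elsewhere.
    d = {v: (1.0 / len(seed_set) if v in seed_set else 0.0) for v in nodes}
    trust = dict(d)

    for _ in range(iterations):
        nxt = {v: (1.0 - alpha) * d[v] for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if not targets:
                continue  # trust held by dangling pages is dropped
            share = alpha * trust[u] / len(targets)
            for v in targets:
                nxt[v] += share
        trust = nxt
    return trust
```

Pages that no seed can reach keep a score of exactly zero, which is the property that makes the score useful for separating a link farm from the trusted part of the graph.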


Truncated PageRank(1/2)[2]


Truncated PageRank(2/2)
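The idea in [2] is that PageRank can be read as a sum of contributions from paths of every length t, each weighted by a damping term proportional to alpha^t, and that Truncated PageRank drops the contributions of the shortest paths (t <= T), which are the cheapest ones for a link farm to create. The sketch below follows that reading; the function name, the defaults, the finite cut-off `max_len`, and the normalization constant chosen so that the kept weights would sum to one are assumptions of this example.

```python
def truncated_pagerank(out_links, truncation=2, alpha=0.85, max_len=50):
    """Sum path contributions of length t only for t > `truncation`."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    # Assumed normalization: the weights c * alpha**t for t > truncation
    # sum to one over an infinite series.
    c = (1.0 - alpha) / alpha ** (truncation + 1)

    contrib = {v: c / n for v in nodes}  # contribution of paths of length 0
    score = {v: 0.0 for v in nodes}

    for t in range(1, max_len + 1):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = alpha * contrib[u] / len(targets)
                for v in targets:
                    nxt[v] += share
        contrib = nxt
        if t > truncation:  # ignore the shortest paths
            for v in nodes:
                score[v] += contrib[v]
    return score
```

A page supported only by direct in-links from pages that nobody links to receives a score of zero as soon as `truncation` is at least 1, whereas ordinary PageRank would still reward it.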


Estimation of Supporters[2]
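A supporter of a page at distance d is a page that can reach it by following at most d links. In [2] the number of supporters is estimated with probabilistic counting so that it scales to the whole Web graph; the sketch below is not that estimator. It is an exact baseline using a reverse breadth-first search, suitable only for small graphs, and the function name and input format are assumptions.

```python
from collections import deque


def supporters_within(out_links, target, max_dist):
    """Return the set of pages with a path of length <= max_dist to target."""
    # Build the reverse adjacency: for each page, who links to it.
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, set()).add(u)

    dist = {target: 0}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        if dist[v] == max_dist:
            continue  # do not expand beyond the distance limit
        for u in in_links.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)

    del dist[target]  # a page is not counted as its own supporter
    return set(dist)
```

Comparing the sizes of these sets for d = 1, 2, 3, ... gives the growth curve of supporters for a page.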


Link and Content features


Topological dependencies: in-links [6]


Topological dependencies: out-links
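One way to exploit topological dependencies, in the spirit of [3], is to give each host an extra feature: the average spam score that a base classifier assigned to the hosts it is linked with. The sketch below is a loose illustration of that idea under stated assumptions: the function name is hypothetical, in-links and out-links are pooled into one neighbor set, and hosts without neighbors keep their own score.

```python
def neighbor_average(out_links, scores):
    """Average the base spam scores of each host's in- and out-neighbors."""
    neighbors = {h: set() for h in scores}
    for u, targets in out_links.items():
        for v in targets:
            # Only links between two scored, distinct hosts are used.
            if u in scores and v in scores and u != v:
                neighbors[u].add(v)
                neighbors[v].add(u)

    averaged = {}
    for h, nb in neighbors.items():
        if nb:
            averaged[h] = sum(scores[x] for x in nb) / len(nb)
        else:
            averaged[h] = scores[h]  # isolated host: fall back to own score
    return averaged
```

The averaged value can then be appended to the host's link and content features before training a second classifier.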


Conclusion

• The current precision and recall of Web spam detection algorithms can be improved by combining factors that search engines already use.

• User interaction features (e.g. data collected via toolbar or by observing clicks in search engine results).


References
• [1] Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. Link-based characterization and detection of Web Spam. In Second International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Seattle, USA, August 2006. (cita 57)
• [2] Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press. (cita 49)
• [3] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR), pages 423–430, Amsterdam, Netherlands, 2007. ACM Press. (cita 90)
• [4] Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann. (cita 455)
• [5] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland, May 2006. (cita 196)
• [6] Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 721–732. VLDB Endowment. (cita 96)
• [7] http://barcelona.research.yahoo.net/webspam/