Sadia Masood 2

Embed Size (px)

Citation preview

  • 7/31/2019 Sadia Masood 2

    1/53

    Web Spam

    Seminar: Future Of Web Search

    Know Your Neighbors: Web Spam

    Detection using the Web Topology

    Presenter: Sadia Masood

    Tutor : Klaus Berberich

    Date : 17-Jan-2008

    University of Saarland

  • 7/31/2019 Sadia Masood 2

    2/53

    Web Spam

    The Agenda

    Focus of the First Paper

    Motivation

    The Objective

    DATASET

    Link Based Features

    Content Based Features

    Using the Content & Link Based Features

    Using the Web Topology

    Conclusion

    2

  • 7/31/2019 Sadia Masood 2

    3/53

    Web Spam

    The Agenda

    Focus of the First Paper

    Motivation

    The Objective

    DATASET

    Link Based Features

    Content Based Features

    Using the Content & Link Based Features

    Using the Web Topology

    Conclusion

    3

  • 7/31/2019 Sadia Masood 2

    4/53

  • 7/31/2019 Sadia Masood 2

    5/53

    What is Web Spamming?

    Any deliberate human action that is meant to trigger an

    unjustifiably favorable relevance or importance of some web page

    considering the pages true value [5]

    Also called Search Engine SpammingorSpamdexing

    [5]: Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on

    Adversarial Information Retrieval on the Web, 2005.5

  • 7/31/2019 Sadia Masood 2

    6/53

    What is Web Spamming?

    Example

    Query: Kaiser Pharmacy (at the time paper was written 2005)

    Result: had techdictionary.com as a 3rd hit

    6

  • 7/31/2019 Sadia Masood 2

    7/53

    Why is Web Spamming Bad?

    Search Engines suffer because

    It damages the search engines reputation

    there is a the cost involved to crawling, indexing, storing the spam

    pages.

    Users suffer because

    precision in the query results is lowered

    7

  • 7/31/2019 Sadia Masood 2

    8/53

    How is it done?

    Web Spamming Techniques involve

    Boosting Techniques

    Content spam (e.g, Term repetition)

    Link spam (e.g Link farming)

    Hiding Techniques

    For example, Content hiding, Cloaking

    8

  • 7/31/2019 Sadia Masood 2

    9/53

    What do the Search Engines Ultimately want ?

    Want to calculate and return the exact page ranks based on

    relevance and importance

    Want to avoid the spam pages altogether before they use

    resources that might be used in storing or processing those webpages

    9

  • 7/31/2019 Sadia Masood 2

    10/53

    Web Spam

    The Agenda

    Focus of the First Paper

    Motivation

    The Objective

    DATASET

    Link Based Features

    Content Based Features

    Using the Content & Link Based Features

    Using the Web Topology

    Conclusion

    10

  • 7/31/2019 Sadia Masood 2

    11/53

  • 7/31/2019 Sadia Masood 2

    12/53

    The Flow

    Feature extraction

    Combining link & content based features

    Link based

    Content based

    Classification

    Clustering Propagation Stacked graphical learning

    Smoothing

    12

  • 7/31/2019 Sadia Masood 2

    13/53

    Web Spam

    The Agenda

    Focus of the First Paper

    Motivation

    The Objective

    DATASET

    Link Based Features

    Content Based Features

    Using the Content & Link Based Features

    Using the Web Topology

    Conclusion

    13

  • 7/31/2019 Sadia Masood 2

    14/53

    Web Spam

    Data Set

    Data Set

    Collected in May 2006

    Publicly available WEBSPAM-UK2006 dataset [15]

    Pages from .uk domain

    77.9 million pages were collected corresponding to 11,400 hosts

    A group of volunteers labeled them normal, borderline or spam

    6,552 hosts evaluated

    [15] http://www.yr-bcn.es/webspam/datasets/14

  • 7/31/2019 Sadia Masood 2

    15/53

    Data Set

    Distribution of host labels, as judged by human volunteers

    Evaluation

    True positive rate, or Recall R False positive rate

    F-measure

    15

  • 7/31/2019 Sadia Masood 2

    16/53

    Web Spam

    The Agenda

    Focus of the First Paper

    Motivation

    The Objective DATASET

    Link Based Features

    Content Based Features

    Using the Content & Link Based Features

    Using the Web Topology

    Conclusion

    16

  • 7/31/2019 Sadia Masood 2

    17/53

    Link Based Features

    Degree related measures

    PageRank

    TrustRank Truncated PageRank

    Estimation of d supporters

    140 features per host

    17

  • 7/31/2019 Sadia Masood 2

    18/53

    Link Based Features

    Page Rank

    Uses link structure to determine importance or popularity of a web page

    Intuition:

    A web page is important if several other important pages point to

    it

    PageRank of a page influences and is being influenced by

    importance of the other pages

    18

  • 7/31/2019 Sadia Masood 2

    19/53

    Link Based Features

    Page Rank

    Calculation: Initial scores already defined for all the pages

    A

    B

    C

    D

    [6] http://en.wikipedia.org/wiki/PageRank

    A

    B

    C

    D

    19

  • 7/31/2019 Sadia Masood 2

    20/53

    Link Based Features

    Page Rank

    Page Rank with Random Surfer Model

    d is damping factor,

    M(pi) is the set of pages that link topi,

    L(pj) is the number of outbound links on pagepj,

    Nis the total number of pages.

    20

    Li k B d F t

  • 7/31/2019 Sadia Masood 2

    21/53

    Link Based Features

    Trust Rank

    A page having high pageRank is more likely to be spam if it had no

    relationship with a set of trusted pages

    How it works?

    Determine the seed set S (using high PageRank or high out-degree)

    Determine good or bad nodes of the set S (using oracle)

    E.g, S ={ , , }

    where, good =

    bad =21

    Li k B d F t

  • 7/31/2019 Sadia Masood 2

    22/53

    Link Based Features

    Trust Rank

    Image [4]: http://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdf

    Calculate & propagate the trust from good pages (adjusting trust

    attenuation)

    22

    Link Based Feat res

  • 7/31/2019 Sadia Masood 2

    23/53

    Link Based Features

    Histogram of ratio b/w TrustRank & PageRank

    23

    Link Based Features

  • 7/31/2019 Sadia Masood 2

    24/53

    Link Based Features

    Truncated Page Rank

    A variant of PageRank, diminishes the influence of a page

    to the PageRank score of its close neighbors

    spamweb web

    24

    Link Based Features

  • 7/31/2019 Sadia Masood 2

    25/53

    Link Based Features

    Estimation of d-supporters

    x is d-supporter of node y if shortest path from x to y has

    length d Nd(x) be the set of d-supporters of page x

    For each page x, cardinality of Nd(x) is an increasing fuction

    with respect to d.

    Image [4]25

    Link Based Features

  • 7/31/2019 Sadia Masood 2

    26/53

    Link Based Features

    Bottleneck Number

    The bottleneck measure for page x, defined as

    indicates the minimum rate of growth of neighbors of x up to a

    certain distance

    spam pages have smaller bottle neck numbers than non-spam

    pages

    Bottleneck: Non-Spam & Spam

    Image [4] 26

    Link Based Features

  • 7/31/2019 Sadia Masood 2

    27/53

    Link Based Features

    Histogram of b4(x) of Spam and Non-spam Pages

    27

    The Agenda

  • 7/31/2019 Sadia Masood 2

    28/53

    Web Spam

    The Agenda

    Focus of the First Paper

    Motivation

    The Objective

    DATASET

    Link Based Features

    Content Based Features Using the Content & Link Based Features

    Using the Web Topology

    Conclusion

    28

    Content Based Features

  • 7/31/2019 Sadia Masood 2

    29/53

    Content Based Features

    Number of words in the page, title & average word length

    Fraction of anchor text

    Fraction of visible text

    Compression rate

    Corpus precision & corpus recall

    Query precision and query recall Independent trigram likelihood

    Entropy of trigrams

    96 features per host

    29

    Content Based Features

  • 7/31/2019 Sadia Masood 2

    30/53

    Co te t ased eatu es

    Average word length

    30

    Content Based Features

  • 7/31/2019 Sadia Masood 2

    31/53

    Compression Rate

    Compression rate = size of compressed text (visible text)

    size of uncompressed text

    Precision and Recall

    F= Frequent terms in the collection

    T= Terms in the page

    Q= Frequent terms in the query log

    Corpus Recall = | F T | / | F |

    Query Recall = | Q T | / | Q |

    31

    Content Based Features

  • 7/31/2019 Sadia Masood 2

    32/53

    Corpus Precision = | F T | / | T |

    Query Precision = | Q T | / | T |

    32

    Content Based Features

  • 7/31/2019 Sadia Masood 2

    33/53

    Entropy of trigrams (Compression)

    Calculated on distribution of trigrams

    Let

    be probability distribution on trigrams of a page

    be the set of all trigrams in a page

    Entropy of trigrams=

    33

    The Agenda

  • 7/31/2019 Sadia Masood 2

    34/53

    Web Spam

    Focus of the First Paper

    Motivation

    The Objective DATASET

    Link Based Features

    Content Based Features

    Using the Link and Content Features

    Using the Web Topology

    Conclusion

    34

    Using Link and Content Based Features

  • 7/31/2019 Sadia Masood 2

    35/53

    Cost Sensitive Decision Tree

    Bagging

    35

    Using Content & Link Based Features

  • 7/31/2019 Sadia Masood 2

    36/53

    Comparing Link and Content based features

    36

    The Agenda

  • 7/31/2019 Sadia Masood 2

    37/53

    Web Spam

    Focus of the First Paper

    Motivation

    The Objective

    DATASET

    Link Based Features

    Content Based Features

    Using the Content & Link Based Features

    Using the Web Topology

    Conclusion

    37

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    38/53

    Observation :

    Similar pages tend to be linked to be linked together more frequently

    than dissimilar ones

    Similar pages tend to be linked to be linked together more frequentlythan dissimilar ones

    38

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    39/53

    39

    Using the Web Topology -Topological Dependencies of Spam Nodes

  • 7/31/2019 Sadia Masood 2

    40/53

    Sin

    Sout= No. of spam hosts linked by x

    All hosts linked by x

    Sout

    Sin= No. of spam hosts linked to xAll hosts linked to x

    40

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    41/53

    Clustering

    Using METIS graph clustering algorithm

    41

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    42/53

    Clustering

    42

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    43/53

    Use the graph topology to smooth predictions by propagating them

    as random walks

    Main Idea

    Use thepredicted spamicityof a particular classification method and

    start a random walk with the restart probabilty 1-

    Where

    h: host

    p(h) : [0..1] ( p(h)=0: non-spam , p(h)=1: spam, for each host h)

    v(0) : vector such that v(0)h =

    outdeg(g): the out-degree of g

    Propagation

    43

    Using the Web Topology- Propagation

  • 7/31/2019 Sadia Masood 2

    44/53

    Applied with three forms of random walk

    on the host graph (forward)

    on the transposed host graph (backward)

    on the undirected host graph (both)

    Observed improvements on out of other tried values

    Propagation

    44

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    45/53

    Results with in the table

    Propagation

    45

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    46/53

    Stacked Graph Learning

    Uses initial predictions by the classification scheme C for

    all the objects in the data set

    Generates a set of extra features for each object based on

    relevance ( w.r.t in-links, outlinks or both)

    Let r(h) be the set if pages related to host h

    Then,

    Adds these extra features f(h) to the input of C and run the algorithm

    again

    46

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    47/53

    Stacked Graph Learning

    Improvement is seen in the first Iteration

    47

    Using the Web Topology

  • 7/31/2019 Sadia Masood 2

    48/53

    Stacked Graph Learning

    Second Iteration showed even better results

    48

    The Agenda

  • 7/31/2019 Sadia Masood 2

    49/53

    Web Spam

    Focus of the First Paper

    Motivation

    The Objective

    DATASET

    Link Based Features

    Content Based Features Using the Web Topology

    Conclusion

    49

    Conclusion

  • 7/31/2019 Sadia Masood 2

    50/53

    Experiments have clearly shown that

    Combining both the link and content based features bringssignificant improvement to spam detection techniques

    Using web graph dependencies , we can detect the web spam by

    turning spammers ingenuity against themselves

    50

  • 7/31/2019 Sadia Masood 2

    51/53

    Thank You!

    Q&A and Discussion

    51

    References-1

  • 7/31/2019 Sadia Masood 2

    52/53

    [1] www.milliondollarhomepage.com

    [2] http://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdf

    [3] http://images.google.com/imgres?

    [4] http://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdf

    [5] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International

    Workshop on Adversarial Information Retrieval on the Web, 2005.

    [6] http://en.wikipedia.org/wiki/PageRank

    [7] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based

    characterization and detection of Web Spam. In AIRWeb, 2006.

    [8] http://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdf52

    References-2

    http://www.milliondollarhomepage.com/http://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdfhttp://images.google.com/imgreshttp://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdfhttp://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdfhttp://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdfhttp://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdfhttp://images.google.com/imgreshttp://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdfhttp://www.milliondollarhomepage.com/
  • 7/31/2019 Sadia Masood 2

    53/53

    [9] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using

    rank propagation and probabilistic counting for link-based spam detection.

    Technical report, DELIS Dynamically Evolving, Large-Scale Information

    Systems, 2006.

    [10] http://en.wikipedia.org/wiki/Cross-validation

    [11]A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web

    pages through content analysis. In Proceedings of the World Wide Web

    conference, pages 8392, Edinburgh, Scotland, May 2006.[12] http://www.salford-systems.com/doc/BAGGING_PREDICTORS.pdf

    [13] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with

    TrustRank. In Proc. of the 30th VLDB Conf., 2004.

    [14] M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search

    engines. ACM SIGIR Forum, 36(2), 2002.

    [15] http://www.yr-bcn.es/webspam/datasets/

    53

    http://en.wikipedia.org/wiki/Cross-validationhttp://www.salford-systems.com/doc/BAGGING_PREDICTORS.pdfhttp://www.salford-systems.com/doc/BAGGING_PREDICTORS.pdfhttp://en.wikipedia.org/wiki/Cross-validation