WikiTrust: Turning Wikipedia Quantity into Quality

B. Thomas Adler, Luca de Alfaro, and Ian Pye


•Wikipedia:

•3,000,000+ Articles

•1,000,000,000+ Revisions

Our Goal: Crowd-sourcing community consensus

Vandalism

•Prevents Wikipedia being taken fully seriously

•Harder to use Wikipedia in schools

•Harder to make static selections

Vandalism Detection

Given a new revision, classify it as Vandalism or Regular.

•Zero-delay: Use only those features which are available at the time the revision is created (no lookahead).

•Historical: Use the full set of WikiTrust features, including how the revision is treated by subsequent authors (lookahead).

Revision Selection

Given an article, select the “best” revision to show to a user.

•Wikipedia 1.0 Project: Aims to extract a static snapshot of Wikipedia.

•Use in Schools, Developing Countries, OLPC Project.

Core Concepts

•Wikipedia Article

•Many Revisions

•1 Author per Revision

•Author has Reputation, Revision has Trust.

•Binary Classifier: Either A or B.

Zero Day Features

•Author is Anonymous (Turns out we don’t care)

•Time interval after the previous edit (Useful, but only as a predicate time > 12 seconds)

•Time of day of edit (Not used)

Zero Day Features

•Difference from previous revisions (Not really)

•Comment Length (Nope)

Zero Day Features (we care about these)

•Previous Text Trust Histogram

•Current Text Trust Histogram

•Histogram Difference
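The three histogram features can be pictured with a minimal sketch. It assumes each word carries a numeric trust value that is bucketed into ten integer bins (0 to 9); the bin count, the function names, and the sample values are illustrative assumptions, not details given on the slides.

```python
from collections import Counter

NUM_BINS = 10  # assumption: trust values are bucketed into bins 0..9


def trust_histogram(word_trusts, num_bins=NUM_BINS):
    """Count how many words fall into each integer trust bin."""
    # Clamp each value into the valid bin range before counting.
    counts = Counter(min(max(int(t), 0), num_bins - 1) for t in word_trusts)
    return [counts.get(b, 0) for b in range(num_bins)]


def histogram_difference(previous, current):
    """Per-bin change in word count from the previous to the current revision."""
    return [c - p for p, c in zip(previous, current)]


# Hypothetical word trust values for two consecutive revisions.
prev_hist = trust_histogram([9, 9, 8, 8, 7])
curr_hist = trust_histogram([9, 9, 8, 0, 0, 0])  # three low-trust words appear
diff_hist = histogram_difference(prev_hist, curr_hist)
```

In this toy case the difference shows three words arriving in the lowest bin while two higher-trust words disappear, which is the kind of shift a classifier could pick up on.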

Text Trust

•New text starts with a trust value proportional to the author's reputation.

•Text can gain trust when revised.

•Cut-and-paste and deletions result in local trust loss.

•We remember deleted text and its trust.

A Sequence of Differences

•For revisions v_1, v_2, v_3, ... of a wiki, word trust is computed from the difference between v_i and v_{i-1}

•How did we arrive at the current version of an article?
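As a rough illustration of walking a revision history one difference at a time, the sketch below diffs each revision against its predecessor at word level. It uses Python's standard `difflib` as a stand-in; the slides do not say which differencing algorithm WikiTrust uses, and `word_diff` and the sample revisions are hypothetical.

```python
import difflib


def word_diff(prev_text, curr_text):
    """Return (tag, prev_words, curr_words) for each edit block between two revisions."""
    prev_words = prev_text.split()
    curr_words = curr_text.split()
    matcher = difflib.SequenceMatcher(a=prev_words, b=curr_words, autojunk=False)
    return [
        (tag, prev_words[i1:i2], curr_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Hypothetical revision history v_1, v_2, v_3.
revisions = [
    "the cat sat",
    "the black cat sat",
    "the black cat sat down",
]

# Diff each revision v_i against its predecessor v_{i-1}.
history = [word_diff(a, b) for a, b in zip(revisions, revisions[1:])]
```

Each entry of `history` lists which word blocks were kept (`equal`) and which were inserted, deleted, or replaced, which is the information a per-word trust computation needs.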

Text Trust: The Algorithm Illustrated

1) Trust of new text

2) New block borders have the same trust as new text

3) The revision effect increases the trust of existing text

4) Note: this is not a new border

[Figure: annotated revision; labels 1 to 4 mark where each rule applies]
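The rules illustrated on this slide can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the WikiTrust implementation: the constants `NEW_TEXT_FACTOR` and `REVISION_GAIN`, the use of `difflib` for block matching, and the choice to treat block edges at the very start or end of both revisions as "not a new border" are all assumptions made for the example.

```python
import difflib

NEW_TEXT_FACTOR = 0.5  # assumption: new text gets this fraction of the author's reputation
REVISION_GAIN = 0.25   # assumption: kept text moves this fraction of the way up to the reputation


def update_trust(prev_words, prev_trust, curr_words, reputation):
    """Return one trust value per word of the current revision."""
    new_trust = NEW_TEXT_FACTOR * reputation
    # Rule 1: every word starts with the trust of new text.
    curr_trust = [new_trust] * len(curr_words)

    matcher = difflib.SequenceMatcher(a=prev_words, b=curr_words, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    for block in blocks:
        # Rule 3: surviving text gains trust from being revised (never loses it here).
        for k in range(block.size):
            old = prev_trust[block.a + k]
            curr_trust[block.b + k] = old + REVISION_GAIN * max(reputation - old, 0.0)

        # Rule 4 (as assumed here): an edge at the start or end of both
        # revisions was already an edge, so it is not a new border.
        starts_both = block.a == 0 and block.b == 0
        ends_both = (block.a + block.size == len(prev_words)
                     and block.b + block.size == len(curr_words))

        # Rule 2: all other block edges touch changed text and get new-text trust.
        if not starts_both:
            curr_trust[block.b] = new_trust
        if not ends_both:
            curr_trust[block.b + block.size - 1] = new_trust
    return curr_trust


prev_words = "the cat sat on the mat".split()
curr_words = "the cat quietly sat on the mat".split()
trust = update_trust(prev_words, [6.0] * 6, curr_words, reputation=8.0)
```

With these made-up numbers, the inserted word and the two words next to it end up at the new-text trust, while the untouched words further away gain a little trust from the revision.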


Historical Features

•Next revision comment length (length > 110 chars)

•Next revision comment has the word revert in it (too noisy)

Historical Features

•Author Reputation (How do other users judge this user’s edits?)

Historical Features

•Minimum Revision Quality

•Average Revision Quality

•Maximum Dissent

Historical Features

•Total Weight of Judges (not at all)

ROC AUC Scoring

•>0.90 = Excellent

•0.8 - 0.9 = Good

•< 0.8 = Poor

•0.5 = Expected result from flipping a coin

Probability that a binary classifier scores a randomly chosen positive example higher than a randomly chosen negative one
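One standard reading of ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition; the function name and the sample labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = vandalism, 0 = regular; scores are classifier outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. A classifier that assigns every example the same score gets 0.5, matching the coin-flip baseline on the slide.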

Results (PAN 2010)

•ROC of 0.937

•ROC of 0.914 ?

•ROC of 0.904 ?

Other Directions

•Wikipedia 1.0

•Vandalism API

•Newsgroup Reputation

•IP Address Reputation

Revision Quality

The fraction of the change that goes in the same direction as the future.

•Qual = 1: v_j is a totally good edit

•Qual = -1: v_j is reverted

•-1 ≤ Qual ≤ 1

[Figure: v_i (the past), v_j, and v_k (the future). “Work done”: d(v_i, v_j). “Progress”: d(v_i, v_k) - d(v_j, v_k).]
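A minimal sketch of revision quality, reading "the fraction of change in the same direction as the future" as progress divided by work done, with progress taken as d(v_i, v_k) - d(v_j, v_k), the form that yields -1 for a reverted edit. The distance function here is a hypothetical stand-in (words inserted plus words deleted, matched with `difflib`); the slides do not define d.

```python
import difflib


def distance(a_text, b_text):
    """Hypothetical stand-in for d: number of words deleted plus words inserted."""
    a_words, b_words = a_text.split(), b_text.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (len(a_words) - matched) + (len(b_words) - matched)


def revision_quality(v_i, v_j, v_k):
    """Quality of v_j, judged from a past revision v_i and a future revision v_k."""
    work_done = distance(v_i, v_j)
    if work_done == 0:
        raise ValueError("v_j does not differ from v_i")
    progress = distance(v_i, v_k) - distance(v_j, v_k)
    # difflib's matching is heuristic, so clamp in case the triangle
    # inequality is slightly violated by this stand-in distance.
    return max(-1.0, min(1.0, progress / work_done))


v_i = "the cat sat"
v_j = "the black cat sat"

# The future keeps the edit and builds on it.
kept = revision_quality(v_i, v_j, "the black cat sat down")

# The future restores the past revision, i.e. v_j is reverted.
reverted = revision_quality(v_i, v_j, v_i)
```

In the first case all of the work done by v_j survives into v_k, so the quality is 1; in the second the future is identical to the past, so the quality is -1.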
