Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs Moshe Koppel and Yaron Winter


The Ultimate Problem

•Let’s skip right to the hardest problem: Given two anonymous short documents, determine if they were written by the same author.

•If we can solve this, we can solve pretty much any variation of the attribution problem.

Experimental Setup

•Construct pairs <Bi,Ej> by choosing the first 500 words of blog i and the last 500 words of blog j.

•Create 1000 such pairs, half of which are same-author pairs (i=j). (In the real world, there are many more different-author pairs than same-author pairs, but let’s keep the bookkeeping simple for now.)

Note: no individual author appears in more than one pair.

•The task is to label each pair as same-author or different-author.
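The pair construction above can be sketched roughly as follows. This is only an illustration: the function name `make_pairs`, the representation of each blog as a list of word tokens, and the random assignment of authors to pairs are assumptions, not details given in the lecture.

```python
import random

def make_pairs(blogs, n_pairs=1000, n_words=500, seed=0):
    """Build (B, E, same_author) triples from `blogs`, a list of token lists.

    Assumption: each blog is one author's text, long enough that its first
    and last `n_words` words do not overlap.
    """
    rng = random.Random(seed)
    n_same = n_pairs // 2
    n_diff = n_pairs - n_same
    # A same-author pair uses one author, a different-author pair uses two,
    # so that no author appears in more than one pair.
    needed = n_same + 2 * n_diff
    if len(blogs) < needed:
        raise ValueError(f"need at least {needed} blogs, got {len(blogs)}")
    order = rng.sample(range(len(blogs)), needed)

    pairs = []
    for i in order[:n_same]:
        pairs.append((blogs[i][:n_words], blogs[i][-n_words:], True))
    rest = order[n_same:]
    for i, j in zip(rest[0::2], rest[1::2]):
        pairs.append((blogs[i][:n_words], blogs[j][-n_words:], False))
    rng.shuffle(pairs)
    return pairs
```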

A Simple Unsupervised Baseline Method

1. Vectorize B and E (e.g., as frequencies of character n-grams).

2. Compute the cosine similarity of B and E.

3. Iff it exceeds some (optimally chosen) threshold, assign the pair <B,E> to same-author.


This method yields accuracy of 70.6% (using the optimal threshold).
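A minimal sketch of this baseline, using only the standard library. The n-gram length `n=4` and the function names are assumptions made for illustration; the threshold is left as a parameter, since the slides only say it is chosen optimally.

```python
import math
from collections import Counter

def char_ngrams(text, n=4):
    """Relative frequencies of character n-grams (n=4 is an assumed choice)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def same_author_baseline(b_text, e_text, threshold, n=4):
    """Label the pair same-author iff the similarity exceeds the threshold."""
    return cosine(char_ngrams(b_text, n), char_ngrams(e_text, n)) > threshold
```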

A Simple Supervised Baseline Method

•Suppose that, in addition to the (test) corpus just described, we have a training corpus constructed the same way, but with each pair labeled.

•We can do the obvious thing:

1. Vectorize B and E (e.g., as frequencies of character n-grams).

2. Compute the difference vector (e.g., with terms |b_i - e_i| / (b_i + e_i)).

3. Learn a suitable classifier on the training corpus.


•With a lot of effort, we get accuracy of 79.8%. But we suspect we can do better, even without relying (much) on a labeled training corpus.
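The difference vector from step 2 might be computed as below. The dict-based vectors, the fixed `features` ordering, and the convention of returning 0 when a feature is absent from both documents are assumptions; the classifier itself is left out, since the slides do not say which learner or settings were used for this step.

```python
def difference_vector(b, e, features):
    """Per-feature normalized difference |b_i - e_i| / (b_i + e_i).

    `b` and `e` map features (e.g. character n-grams) to frequencies.
    A feature missing from both documents contributes 0.0 (an assumed
    convention to avoid dividing by zero).
    """
    diff = []
    for g in features:
        bi, ei = b.get(g, 0.0), e.get(g, 0.0)
        diff.append(abs(bi - ei) / (bi + ei) if bi + ei > 0 else 0.0)
    return diff
```

Each term lies between 0 (identical frequency) and 1 (the feature occurs in only one of the two documents), so the vectors for labeled training pairs can be passed to any standard classifier.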

Exploiting the Many-Authors Method

1. Given B and E, generate a list of impostors E1,..,En.

2. Use our algorithm for the many-candidate problem for anonymous text B and candidates {E, E1,…,En}.

3. Iff E is selected as the author with sufficiently high score, assign the pair to same-author.

4. (Optionally, add impostors to B and check if anonymous document E is assigned to author B.)
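The steps above can be sketched as follows. The slides do not restate the many-candidate algorithm here, so this sketch assumes one plausible form: repeatedly pick a random subset of features, rank the candidates by cosine similarity to B on that subset, and score E by the fraction of rounds in which it beats every impostor. The names `impostor_score` and `same_author`, and the parameters `n_iter` and `feat_frac`, are hypothetical.

```python
import math
import random

def _cosine_on(u, v, feats):
    """Cosine similarity of dict vectors u, v restricted to `feats`."""
    dot = sum(u.get(f, 0.0) * v.get(f, 0.0) for f in feats)
    norm_u = math.sqrt(sum(u.get(f, 0.0) ** 2 for f in feats))
    norm_v = math.sqrt(sum(v.get(f, 0.0) ** 2 for f in feats))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def impostor_score(b, e, impostors, n_iter=100, feat_frac=0.4, seed=0):
    """Fraction of random-feature rounds in which E is B's best match.

    Assumed scoring scheme; n_iter and feat_frac are illustrative values.
    """
    rng = random.Random(seed)
    features = sorted(set(b).union(e, *impostors))
    if not features:
        return 0.0
    k = max(1, int(feat_frac * len(features)))
    wins = 0
    for _ in range(n_iter):
        feats = rng.sample(features, k)
        sim_e = _cosine_on(b, e, feats)
        # E must strictly beat every impostor to be credited for this round.
        if all(sim_e > _cosine_on(b, imp, feats) for imp in impostors):
            wins += 1
    return wins / n_iter

def same_author(b, e, impostors, threshold=0.05, **kwargs):
    """Same-author iff E's score reaches the threshold (0.05 mirrors k=5%)."""
    return impostor_score(b, e, impostors, **kwargs) >= threshold
```

The optional step 4 would correspond to calling the same function with the roles of B and E swapped and impostors chosen for B.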

Design Choices

There are some obvious questions we need to consider:

•How many impostors is optimal? (Fewer impostors means more false positives; more impostors means more false negatives.)

•Where should we get the impostors from? (If the impostors are not convincing enough, we’ll get too many false positives; if the impostors are too convincing, e.g. drawn from the genre of B that is not also the genre of E, we’ll get too many false negatives.)

How Many Impostors?

•We generated a random corpus of 25,000 impostor documents (results of Google searches for medium-frequency words in our corpus).

•For each pair, we randomly selected N of these documents as impostors and applied our algorithm (using a fixed score threshold k=5%).

•Here are the accuracy results (y-axis) for different values of N:

• Best result: 83.4% at 625 impostors

[Figure: Random Impostors. Accuracy (y-axis, 50.0 to 85.0) vs. number of impostors N (x-axis), k=5%. Annotations: “Fewer false positives”, “Fewer false negatives”.]

Which Impostors?

•Now, instead of using random impostors, for each pair <B,E>, we choose the N impostors that have the most “lexical overlap” with B (or E).

•The idea is that more convincing impostors should prevent false positives.


[Figure: Lexically Similar Impostors. Accuracy (y-axis, 50.0 to 85.0) vs. number of impostors (x-axis: 5, 50, 100, 350, 625), k=5%. Series: Similar Impostors, Random Impostors. Annotation: “Only 2% false positive”.]

•Best accuracy result: 83.8% at 50 impostors

Which Impostors?

•It turns out that (for a fixed score threshold k) using similar impostors doesn’t improve accuracy, but it allows us to use fewer impostors.

•We can also try to match impostors to the suspect’s genre.

•For example, suppose that we know that B and E are drawn from a blog corpus. We can limit impostors to blog posts.


[Figure: Same-Genre Impostors. Accuracy (y-axis, 50.0 to 90.0) vs. number of impostors (x-axis: 5, 50, 100, 250, 750), k=5%. Series: Blog Impostors, Random Impostors.]

Impostors Protocol

Optimizing on a development corpus, we settle on the following protocol:

1. From a large blog universe, choose as potential impostors the 250 blogs most similar to E.

2. Randomly choose 25 actual impostors from among the potential impostors.

3. Say that <B,E> are same-author if score(B,E)≥k, where k is used to trade off precision and recall.
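Steps 1 and 2 of the protocol could look roughly like this. The function name `choose_impostors` and the caller-supplied `similarity` function are assumptions; the slides do not specify which similarity measure ranks the blog universe, and the universe is assumed to exclude the authors of B and E.

```python
import random

def choose_impostors(e, universe, similarity,
                     n_potential=250, n_actual=25, seed=0):
    """Pick impostors for E from `universe`.

    1. Keep the `n_potential` documents most similar to E.
    2. Sample `n_actual` of them uniformly at random.
    `similarity(e, doc)` is a caller-supplied function (assumed here).
    """
    rng = random.Random(seed)
    ranked = sorted(universe, key=lambda doc: similarity(e, doc), reverse=True)
    potential = ranked[:n_potential]
    return rng.sample(potential, min(n_actual, len(potential)))
```

Sampling from the 250 nearest documents, rather than simply taking the 25 nearest, keeps the impostors convincing without making all of them near-duplicates of E.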

Results

[Figure: Recall for Same-Author Pairs. Precision (y-axis, 0.4 to 1.0) vs. recall (x-axis, 0.3 to 0.9). Series: Blogs, Google, Cosine, Min-Max, SVM.]

Results

Optimizing thresholds on a development corpus, we obtain accuracies as follows:

[Figure: Accuracy (y-axis, 50.0 to 100.0) for each method: Cosine, Min-Max, SVM, Google, Blogs.]

Conclusions

•We can use (almost) unsupervised methods to determine if two short documents are by the same author. This actually works better than a supervised baseline method.

•The trick is to see how robustly the two can be tied together from among some set of impostors.

•The right number of impostors to use depends on the quality of the impostors and the relative cost of false positives vs. false negatives.

•We assumed throughout that the prior probability of same-author is 0.5; we have obtained similar results for skewed corpora (just by changing the score threshold).

Open Questions

•What if x and y are in two different genres (e.g. blogs and Facebook statuses)?

•What if a text was merely “influenced” by x but mostly written by y? Can we discern (or maybe quantify) this influence?

•Can we use these methods (or related ones) to identify outlier texts in a corpus (e.g. a play attributed to Shakespeare that wasn’t really written by Shakespeare)?
