Text-Based Measures of Document Diversity

Text-Based Measures of Document DiversityDate： 2014/02/12Source：KDD’13Authors：Kevin Bache, David Newman, and Padhraic

SmythAdvisor：Dr. Jia-Ling, KohSpeaker： Shun-Chen, Cheng

2

Outline

IntroductionMethodExperimentConclusions

3

Introduction (Interdiscipli

nary)

the hypothesis：

interdisciplinary research can lead to new discoveries at a rate faster than that of traditional research projects conducted within single disciplines

(single disciplines)

4

Introduction

Task：

Diversity score

assign

quantifying how diverse a document is in terms of its content

Goal

5

Framework

Diversity score of each document

corpus

LDALearn

T for D D x T

matrix

Rao’s Diversity measure

Topic co-occurrence similarity measures

T ： topicD ：document

6

Outline


7

Topic-based Diversity(1)

LDA ： collapsed Gibbs samplerUsing the topic-word assignments from the

final iteration of the Gibbs samplerndj corresponding to the number of word

tokens in document d that are assigned to topic j.

Example of create D x T matrix ：9 0 10 10 62 15 81 2 16

t1 t2 t3d1

d2d3d4

n13

8


Rao’s Diversity for a document d ：

ndj ： the value of entry (d,j) in DxT matrix nd ： the number of word tokens in d

d

dj

nn

djP )|(

),( ji measure of the distance between topic i and topic j

9


Example of Rao’s diversity ：

9 0 10 10 62 15 81 2 16

t1 t2 t3d1

d2d3d4

div(1) = 1.26div(2) = 0.04688div(3) = 0.09344div(4) = 1.557895

i)(j, j)(i, similarity cosineon based assume

1.0)3,2(7.0)3,1(2.0)2,1(

10

Topicco-occurranceSimilarity

Cosine similarity ：

Probabilistic-based ：

NndP d)( N ： number of word tokens in

the corpus.

ndj ： the value of entry (d,j) in DxT matrix

),( jis

11

Similarity toDistance

Similarity measures

Cosine similarity

Probability based

),( ji

distance means ),( ji

Similarity to Distance

12

Outline


13

Experiment

DatasetPubMed Central Open Access dataset (PubMed )NSF Awards from 2007 to 2012 (NSF)Association of Computational Linguistics Anthology

Network (ACL)

Topic Modeling (LDA)MALLETα ： 0.05*(N/D*T) ， β ： 0.015,000 iterations. Keep only the final sample in the

chain.T = 10, 30, 100 and 300 topics.

14

Pseudo-Documents

Reason ： no ground-truth measure for a document's diversity.Half of which were designed to have high diversity and half of

which were designed to have low diversity.High diversity pseudo-document ：

manually selecting

Randomly select an article from A and one from B.

Relatively unrelated

Journal A

Journal B

Pseudo-document

Randomly select

15

ExperimentROC Curve

AUC： Area under the ROC curve

16

ExperimentAUC scores for different diversity measures based on 1000 pseudo-documents from PubMed

17

ExperimentEvaluating transformations

18

Experimentmost diverse NSF grant proposals

19

Outline


20

Conclusions

Presented an approach for quantifying the diversity of individual documents in a corpus based on their text content.

More data-driven, performing the equivalent of learning journal categories by learning topics from text.

Can be run on any collection of text documents, even without a prior categorization scheme.

A possible direction for future work is that of temporal document diversity.