Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty (Gabrilovich et al.)

WWW2004

Main Contents

• Identify novelty of news stories given preceding news a user has read

• Newsjunkie: a set of algorithms for different (but related) tasks

• Technique: text collection comparison

• Tasks:

– Ranking news by novelty

– Personalized news updates

– Characterization of relevance types of articles

• Evaluation or Examples

Review: Text Comparison

• Syntactic differences b/w Web pages

– e.g.: AT&T Internet Difference Engine

• Characteristic words

– e.g.: genre classification

• Language models for entire collections

– e.g.: corpus linguistics

• Comparing one set of documents to another

– e.g.: MMR (Maximum Marginal Relevance)

– Newsjunkie

Research Problems

• Focus on temporal aspects of content difference

– automatically assess the novelty over time of news articles coming from live newsfeeds

• Look for documents most dissimilar from documents reviewed earlier

– limitation: outputs entire documents rather than the novel parts of multiple documents

– extracting novel parts would be much harder: it also needs IE and summarization

Difference of Text Content

• KL divergence

• Density of new named entities

– assumption: novelty is often conveyed by introducing new named entities

? Is normalization reasonable? What we need is new information, regardless of how long the document is.
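The two distance metrics on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the additive smoothing constant `alpha`, the direction KL(document || background), and the function names are assumptions, and named entities are assumed to be extracted beforehand by some external tagger.

```python
import math
from collections import Counter


def kl_divergence(doc_tokens, background_tokens, alpha=0.01):
    """KL(doc || background) over additively smoothed unigram distributions.

    alpha is a hypothetical smoothing constant; the slides do not say
    which smoothing scheme the paper uses.
    """
    doc_counts = Counter(doc_tokens)
    bg_counts = Counter(background_tokens)
    vocab = set(doc_counts) | set(bg_counts)
    doc_total = len(doc_tokens) + alpha * len(vocab)
    bg_total = len(background_tokens) + alpha * len(vocab)
    score = 0.0
    for word in vocab:
        p = (doc_counts[word] + alpha) / doc_total
        q = (bg_counts[word] + alpha) / bg_total
        score += p * math.log(p / q)
    return score


def new_entity_density(doc_entities, background_entities, doc_length):
    """Distinct entities unseen in the background, divided by document length."""
    if doc_length == 0:
        return 0.0
    new_entities = set(doc_entities) - set(background_entities)
    return len(new_entities) / doc_length
```

Dropping the division by `doc_length` gives the unnormalized count that the question on this slide argues for.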

Task 1: news ranking

Evaluation 1

• Users evaluated 3 distance metrics on 12 topics

– KL divergence; density of NE; chronological order

• Each metric produced a set of 3 novel documents

• Users judged which set is the most novel

• Statistical significance tests on mean ranks

– KL & NE are superior to chronological order

– No significant difference b/w KL & NE

? The order of the 3 articles is not considered, although the task is ranking!

? Statistical tests only on the mean; what about variance?

Task 2: personalized news update

• Task 2.1: single daily update

– articles from the preceding day serve as background

– user specifies a novelty threshold

Future work: consider more previous articles with weights decaying with age

• No evaluation in this part
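The daily update above can be sketched roughly as below. The helper `unseen_fraction` is a hypothetical stand-in distance (share of tokens absent from the background), used only to keep the example self-contained; the paper's metrics are KL divergence and named-entity density. Function and parameter names are assumptions.

```python
def unseen_fraction(article_tokens, background_tokens):
    """Stand-in distance: fraction of article tokens not seen in the background."""
    if not article_tokens:
        return 0.0
    seen = set(background_tokens)
    return sum(tok not in seen for tok in article_tokens) / len(article_tokens)


def daily_update(todays_articles, yesterdays_articles, threshold,
                 distance=unseen_fraction):
    """Return today's articles whose novelty exceeds the user's threshold.

    Yesterday's articles are pooled into one background token list;
    results are ordered from most to least novel.
    """
    background = [tok for art in yesterdays_articles for tok in art]
    scored = [(distance(art, background), art) for art in todays_articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [art for score, art in scored if score > threshold]
```

A higher `threshold` yields a shorter update, which is the only per-user control described on this slide.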

Task 2.2: breaking news report

• Detect new information about a story

• Preceding articles within a sliding window serve as background

– empirically, a window size of 40 articles

• Filtering out delayed reports and recaps

– those are narrow spikes in a distance graph

• Based on the nature of news reports

– a median filter removes narrow spikes

– empirically, filter width: 5

? How should these parameters be set?
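A rough sketch of the pipeline on this slide: score each article against a sliding window of preceding articles, smooth the scores with a median filter, and report what stays above a threshold. The window size of 40 and filter width of 5 come from the slide; the edge handling (truncated windows), the zero score for the first article, the alert threshold, and all names are assumptions.

```python
from statistics import median


def windowed_scores(articles, distance, window=40):
    """Distance of each article from the pooled tokens of the preceding `window` articles."""
    scores = []
    for i, article in enumerate(articles):
        background = [tok for prev in articles[max(0, i - window):i] for tok in prev]
        # Assumption: with no background yet, the score is set to 0.0.
        scores.append(distance(article, background) if background else 0.0)
    return scores


def median_filter(scores, width=5):
    """Replace each score by the median of its neighborhood (truncated at the edges)."""
    half = width // 2
    return [median(scores[max(0, i - half):i + half + 1]) for i in range(len(scores))]


def breaking_news(scores, threshold, width=5):
    """Indices whose median-filtered score still exceeds the threshold."""
    smoothed = median_filter(scores, width)
    return [i for i, s in enumerate(smoothed) if s > threshold]
```

With width 5, a spike lasting one or two articles is removed, while a burst of three or more consecutive high scores survives, which matches the intuition that delayed reports and recaps produce only narrow spikes.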

Task 2.2: example

Task 3: relevance type of articles

• Four types of relevance to background

– Recap: repeats old information

– Elaboration: adds new information

– Offshoot: mainly about another topic

– Irrelevant: a totally different topic

• Identify them using intra-document dynamics

Task 3: intra-document dynamics

• Estimate relevance of different parts within a document

• Sliding window with a fixed size

• Compare content within the window to background

• Plot the distance scores

• Identify different patterns

What will the graph of an irrelevant article look like?

– Higher absolute scores, but a small dynamic range
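The intra-document analysis can be sketched as below: slide a fixed-size window over the document's tokens, score each window against the background, and inspect the resulting curve. The window size, the stand-in distance `unseen_fraction`, and the `dynamic_range` summary are assumptions for illustration; the slides do not give the paper's settings.

```python
def unseen_fraction(window_tokens, background_tokens):
    """Stand-in distance: fraction of window tokens not seen in the background."""
    if not window_tokens:
        return 0.0
    seen = set(background_tokens)
    return sum(tok not in seen for tok in window_tokens) / len(window_tokens)


def window_scores(doc_tokens, background_tokens, window=20, step=1,
                  distance=unseen_fraction):
    """Distance of each fixed-size window of the document from the background."""
    if len(doc_tokens) <= window:
        return [distance(doc_tokens, background_tokens)]
    return [distance(doc_tokens[start:start + window], background_tokens)
            for start in range(0, len(doc_tokens) - window + 1, step)]


def dynamic_range(scores):
    """Spread of the score curve: small for uniformly familiar or uniformly novel text."""
    return max(scores) - min(scores)
```

Under this sketch, an article that starts on the old story and drifts to new material gives a rising curve with a large range, while an irrelevant article gives uniformly high scores with a small range.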

Contributions

• Novel novelty metric

– density of named entities

• Evaluation by users

• Breaking news detection

– novel adoption of median filter

• Characterization of article types

– intra-story pattern novelty

Limitations

• Generalization of the named-entity metric

– works well in the news domain, but what about others?

• User evaluation: too coarse

– does not consider the order of articles

– used old news which users had seen before the tests

• Claims to be “personalized”, but only provides flexibility in the threshold and, possibly, in article relevance type selection

• Better if it could identify novel parts

– or maybe not, to keep the integrity of a piece of news

Thank you!
