Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty
Gabrilovich et al.
WWW2004
Main Contents
• Identify novelty of news stories given preceding news a user has read
• Newsjunkie: a set of algorithms for different (but related) tasks
• Technique: text collection comparison
• Tasks:
– Ranking news by novelty
– Personalized news updates
– Characterization of relevance types of articles
• Evaluation or Examples
Review: Text Comparison
• Syntactic differences b/w Web pages
– e.g.: AT&T Internet Difference Engine
• Characteristic words
– e.g.: genre classification
• Language models for entire collections
– e.g.: corpus linguistics
• Comparing one set of documents to another
– e.g.: MMR (Maximum Marginal Relevance)
– Newsjunkie
Research Problems
• Focus on temporal aspects of content difference
– automatically assess the novelty over time of news articles coming from live newsfeeds
• Look for documents most dissimilar from documents reviewed earlier
– limitation: outputs entire documents rather than the novel parts of multiple documents; the latter is much harder (+ IE + summarization)
Difference of Text Content
• KL divergence
• Density of new named entities
– assumption: novelty is often conveyed by introducing new named entities
? Is normalization reasonable? What we need is new info, regardless of how long the document is.
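The two distance metrics above can be sketched as follows. This is only an illustration under stated assumptions: documents are pre-tokenized word lists, named entities have already been extracted by some tagger, additive smoothing (`alpha`) is used here just to keep the divergence finite (the slides do not say which smoothing the paper uses), and density is taken as the number of distinct new entities divided by document length. All function names are hypothetical.

```python
import math
from collections import Counter


def word_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed unigram distribution over a shared vocabulary (assumed scheme)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(doc_tokens, background_tokens, alpha=1.0):
    """KL(doc || background): larger means the document differs more."""
    vocabulary = set(doc_tokens) | set(background_tokens)
    p = word_distribution(doc_tokens, vocabulary, alpha)
    q = word_distribution(background_tokens, vocabulary, alpha)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocabulary)


def new_entity_density(doc_entities, background_entities, doc_length):
    """Distinct entities unseen in the background, normalized by document length."""
    seen = set(background_entities)
    new_entities = {e for e in doc_entities if e not in seen}
    return len(new_entities) / doc_length if doc_length else 0.0
```

The normalization by `doc_length` is exactly what the question above challenges: dropping the division would give a raw count of new entities instead of a density.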
Task 1: news ranking
Evaluation 1
• Users evaluated 3 distance metrics on 12 topics
– KL divergence; density of NE; chronological order
• Each metric produced a set of 3 novel documents
• Users judged which set was the most novel
• Statistical significance tests on mean ranks
– KL & NE are superior to chronological order
– No significant difference b/w KL & NE
? Does not consider the order of the 3 articles, although the task is ranking!
? Statistical tests only on the mean; what about variance?
Task 2: personalized news update
• Task 2.1: single daily update
– articles from the preceding day as background
– user specifies a novelty threshold
Future work: consider more previous articles with weights decaying with age
• No evaluation in this part
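A rough sketch of the single daily update, assuming articles are token lists and that any distance function can be plugged in. The `unseen_fraction` distance below is a toy placeholder, not one of the paper's metrics, and all names are hypothetical.

```python
def unseen_fraction(tokens, background_tokens):
    """Toy distance (placeholder): share of tokens absent from the background."""
    seen = set(background_tokens)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in seen) / len(tokens)


def daily_update(todays_articles, background_tokens, threshold,
                 distance=unseen_fraction):
    """Keep articles whose distance from yesterday's background exceeds the
    user-specified threshold, most novel first."""
    scored = [(distance(a, background_tokens), a) for a in todays_articles]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [article for _, article in kept]
```

Raising `threshold` yields fewer, more novel articles, which is the only personalization knob this task exposes.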
Task 2.2: breaking news report
• Detect new information about a story
• Preceding articles within a sliding window as background
– empirically, size of 40 articles
• Filtering out delayed reports and recaps
– those are narrow spikes in a distance graph
• based on the nature of news reports
– median filter filters out narrow spikes
– empirically, width of filter: 5
? Parameter settings?
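The sliding background window and the median filter can be sketched as below. The defaults of 40 and 5 come from the slide; truncating the filter window at the ends of the series is an assumption, and the function names are hypothetical.

```python
from statistics import median


def background_window(articles, index, size=40):
    """The `size` articles immediately preceding position `index`."""
    return articles[max(0, index - size):index]


def median_filter(scores, width=5):
    """Replace each score with the median of its neighborhood.

    Spikes narrower than about half the width are suppressed, while wider
    plateaus survive. The window is truncated at the edges (assumption).
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(median(scores[lo:hi]))
    return smoothed
```

For example, a single-article spike such as `[0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]` is flattened, whereas a run of five high scores keeps its peak, which is how a burst of genuinely new reports would be distinguished from an isolated delayed report or recap.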
Task 2.2: example
Task 3: relevance type of articles
• Four types of relevance to background
– Recap: repeats old stuff
– Elaboration: adds new info
– Offshoot: mainly about another topic
– Irrelevant: totally different topic
• Identify them using intra-document dynamics
Task 3: intra-document dynamics
• Estimate relevance of different parts within a document
• Sliding window with a fixed size
• Compare content within the window to background
• Plot the distance scores
• Identify different patterns
What will the graph of an irrelevant article look like?
-- Higher absolute scores, but small dynamic range
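The intra-document procedure can be sketched as a window sliding over the article's tokens. The window size of 20 is an arbitrary illustrative value (the slides only say "fixed size"), the `unseen_fraction` distance is a toy placeholder rather than the paper's metric, and all names are hypothetical.

```python
def unseen_fraction(tokens, background_tokens):
    """Toy distance (placeholder): share of tokens absent from the background."""
    seen = set(background_tokens)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in seen) / len(tokens)


def intra_document_scores(doc_tokens, background_tokens, window=20, step=1,
                          distance=unseen_fraction):
    """Distance of each fixed-size window of the document to the background."""
    last_start = max(1, len(doc_tokens) - window + 1)
    return [
        distance(doc_tokens[start:start + window], background_tokens)
        for start in range(0, last_start, step)
    ]


def dynamic_range(scores):
    """Spread of the score curve; small for uniformly distant articles."""
    return max(scores) - min(scores) if scores else 0.0
```

Plotting the returned list against window position gives the curve whose shape is inspected: an article that starts on familiar material and then moves to new material produces a rising curve, while a uniformly unrelated article stays high with a small dynamic range.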
Contributions
• Novel novelty metric
– density of named entities
• Evaluation by users
• Breaking news detection
– novel adoption of median filter
• Characterization of article types
– intra-story pattern novelty
Limitations
• Generalization of the metric on named entities:
– works well on news domain, but others?
• User evaluation: too coarse
– without considering order of articles
– used old news which users had seen before the tests
• Claimed “personalized”, but only provided flexibility in threshold and, possibly, article relevance type selection
• Better if it could identify novel parts
– or maybe not: keeping whole articles preserves the integrity of a piece of news
Thank you!