Automatic Discovery of Useful Facet Terms
Wisam Dakka – Columbia University
Rishabh Dayal – Columbia University
Panagiotis G. Ipeirotis – NYU
Searching the NYT Archive for Book Research
Motivation: News Archive
Accessing and searching a news archive is not an easy task
Researchers and reporters spend a large amount of time going through their long query results
News archives are huge and cover tens of years
Many relevant results
Results on the first page are not more relevant than the results on the 5th or the 10th page (NYT archive)
Search engines for news archives mainly follow the paradigm: search, skim through long results, modify, and search again
Goal: Multifaceted Interfaces (MI) over the news archive of Newsblaster
Newsblaster archive:
About 6 years of news from 24 news sources
Stories are clustered daily into hierarchies of topics and events
Events are threaded over time, summarized, and classified
Motivation: MI for Newsblaster Archive
Our work on multifaceted interfaces has some limitations [CIKM2005]:
Supervised learning: the facets that our algorithm could identify are those that appear in the training set
WordNet hypernyms: WordNet has rather poor coverage of named entities
Free text collections: the quality of the hierarchies built on top of news stories was low
Challenge: Automatic Extraction of the Useful Facets from News Archive
Automatically discover, in an unsupervised manner, a set of candidate facet terms from free text
Automatically group together facet terms that belong to the same facet
Build the appropriate browsing structure for each facet
Intuition: Look for Facet Terms Elsewhere
Pilot study: 100 stories from The NYTimes
Common facets: Location, Institutes, History, People, Social Phenomenon, Markets, Nature, and Event
Sub-facets: Leaders under People, Corporations under Markets
Clear phenomenon: the terms for the useful facets do not usually appear in the news stories
A journalist writing a story about Jacques Chirac will not necessarily use the terms Political Leader, Europe, or France. Such missing terms are tremendously useful for identifying the appropriate facets for the story
We will look for these terms elsewhere: terms that are infrequent in the original collection but frequent in the expanded documents
Context-Aware Expansion
Murkowski made the announcement three days after BP said it would shut down a Prudhoe Bay oil field after a small leak was found. Energy officials have said pipeline repairs are likely to take months, curtailing Alaskan production into next year
[Figure: the passage above expanded with text from Wikipedia, WordNet, and Google. Slide labels also mention named entities and the Yahoo Term Extractor.]
[Figure: term frequency in context-aware documents plotted against term frequency in original documents; both axes run from 0 to 1000]
Useful Facet Terms are Elsewhere
[Figure: a term t_i shown in both the original collection and the context-aware collection; one region is labeled "Infrequent Terms"]
Frequency-based shifting
Due to the Zipfian nature of term frequencies, frequency-based shifting favors terms that already have high frequencies (inverse problem)
Rank-shifting
Term Frequency Analysis
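The slides name the two shifting measures but do not give their formulas, so the following is only a minimal sketch under stated assumptions: frequency-based shifting is taken to be the raw increase in document frequency after expansion, and rank-shifting is taken to be the improvement in frequency rank. The function names and the toy documents are hypothetical and are not taken from the talk.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return df


def frequency_shift(df_original, df_expanded):
    """Assumed definition: increase in document frequency after expansion."""
    vocabulary = set(df_original) | set(df_expanded)
    return {t: df_expanded[t] - df_original[t] for t in vocabulary}


def ranks(df, vocabulary):
    """Rank terms by descending frequency (1 = most frequent); ties are broken alphabetically."""
    ordered = sorted(vocabulary, key=lambda t: (-df[t], t))
    return {t: i + 1 for i, t in enumerate(ordered)}


def rank_shift(df_original, df_expanded):
    """Assumed definition: how many rank positions a term gains after expansion."""
    vocabulary = set(df_original) | set(df_expanded)
    r_original = ranks(df_original, vocabulary)
    r_expanded = ranks(df_expanded, vocabulary)
    return {t: r_original[t] - r_expanded[t] for t in vocabulary}


# Toy collections (hypothetical): each document is a list of terms.
original_docs = [
    ["chirac", "paris", "said"],
    ["bp", "oil", "said"],
    ["chirac", "said"],
]
expanded_docs = [
    ["chirac", "paris", "said", "france", "political leader"],
    ["bp", "oil", "said", "energy", "corporation"],
    ["chirac", "said", "france", "political leader"],
]

df_o = document_frequencies(original_docs)
df_e = document_frequencies(expanded_docs)
f_shift = frequency_shift(df_o, df_e)
r_shift = rank_shift(df_o, df_e)
```

In this toy data, a term such as "france" is absent from the original documents but appears in two expanded documents, so it scores positively on both measures, while "said" is equally frequent in both collections and scores zero on both.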
Summary: Candidate Facet Terms
For each document in the database, identify the important terms that are useful to characterize the contents of the document
For each important term in the original database, query the external resource and retrieve the terms that appear in the results. Add the retrieved terms to the original document, in order to create an expanded, “context-aware” document
Analyze the frequency of the terms in both the original and the expanded database, and identify the candidate facet terms
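The three summary steps can be sketched end to end as follows. This is an illustration only, not the authors' implementation: `TOY_RESOURCE` is a hypothetical in-memory stand-in for an external resource such as Wikipedia or WordNet, important terms are identified by simple dictionary lookup rather than a real term extractor, and candidates are selected with an assumed document-frequency threshold `min_shift`.

```python
import re
from collections import Counter

# Hypothetical stand-in for an external resource: important term -> context terms.
# Keys and values are lowercase because matching below is done on lowercased text.
TOY_RESOURCE = {
    "jacques chirac": ["political leader", "france", "europe"],
    "bp": ["corporation", "energy", "europe"],
}


def contains(text, term):
    """True if the (lowercase) term occurs in the text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text.lower()) is not None


def candidate_facet_terms(docs, resource, min_shift=1):
    # Steps 1 and 2: find important terms and collect their context terms per document.
    expanded = []
    for text in docs:
        important = [t for t in resource if contains(text, t)]
        context = {c for t in important for c in resource[t]}
        expanded.append((text, context))

    vocabulary = set(resource) | {c for _, ctx in expanded for c in ctx}

    # Step 3: compare document frequencies in the original and expanded collections.
    df_original, df_expanded = Counter(), Counter()
    for text, context in expanded:
        for term in vocabulary:
            in_original = contains(text, term)
            if in_original:
                df_original[term] += 1
            if in_original or term in context:
                df_expanded[term] += 1

    shift = {t: df_expanded[t] - df_original[t] for t in vocabulary}
    selected = [t for t in vocabulary if shift[t] >= min_shift]
    # Largest shift first; ties are broken alphabetically.
    return sorted(selected, key=lambda t: (-shift[t], t))


docs = [
    "Jacques Chirac met BP executives in Paris.",
    "BP said it would shut down a Prudhoe Bay oil field.",
    "Jacques Chirac addressed reporters.",
]
candidates = candidate_facet_terms(docs, TOY_RESOURCE)
```

With these toy inputs, the terms that appear only through expansion (such as "europe" and "france") become candidates, while "jacques chirac" and "bp", which are equally frequent before and after expansion, do not.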
Indicative
Research in Progress
Cleaning and filtering
Grouping similar facet terms under one facet
Evaluation:
The resulting candidate terms
The resulting hierarchies