Sailing the Corpus Sea: Tools for Visual
Discovery of Stories in Blogs and News
Bettina Berendt
www.cs.kuleuven.be/~berendt
About me
Thanks to ...
Ilija Subašić
Tool forthcoming;
all beta testers and experiment participants welcome!
Daniel Trümper
Tool at http://www.cs.kuleuven.be/~berendt/PORPOISE/
PASCAL
for funding PORPOISE
References
Subašić, I. & Berendt, B. (2008). Web Mining for Understanding Stories through Graph Visualisation. In Proceedings of ICDM 2008. IEEE Press.
Berendt, B. & Trümper, D. (2009). Semantics-based analysis and navigation of heterogeneous text corpora: the Porpoise news and blogs engine. In I.-H. Ting & H.-J. Wu (Eds.), Web Mining Applications in E-commerce and E-services (pp. 45-64). Berlin etc.: Springer. Studies in Computational Intelligence, Vol. 172.
Berendt, B. & Subašić, I. (2009). Measuring graph topology for interactive temporal event detection. Künstliche Intelligenz, 02/09, 11-17.
Subašić, I. & Berendt, B. (in press). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems.
Please see http://www.cs.kuleuven.be/~berendt/ for these papers.
Motivation 1: What's the story?
Motivation 2: Global + local interaction; beyond "similar documents"
with respect to what?
Solution vision: Sailing the Internet
Global analysis
Search
Local analysis
Solution approach: Architecture & states overview (version 1)
Construct composite-similarity neighbourhood
Search + select document(s)
Aspect-based similarity search
Story understanding
Select neighbourhood
Search
Global analysis
Local analysis
Refocus
Source docs database
Story/ontology learning
Import
Web
Retrieval & Preprocessing
Specify sources & filters
Story space
Document space
Retrieval and preprocessing
• Crawler / wrapper, using
  • Yahoo! Web services
  • Yahoo! / Google News
  • Blogdigger
• Translator (uses Babelfish)
• Preprocessing (uses Textgarden, Terrier)
• Named-entity recognition (uses GATE, OpenCalais)
• Similarity computation
Web
Source docs database
Retrieval & Preprocessing
Story learning
(Architecture & states overview repeated from above, highlighting the story/ontology learning component)
Story learning: a focus on news-type stories
Story learning needs timelines
... But where's the evolution? Approach: temporal latent topics
Mei & Zhai, PKDD 2005
• no fine-grained relational information
• "themes" are fixed by the algorithm
• no "drill down" possible → no combination of machine and human intelligence
The ETP3 problem
Evolutionary theme patterns discovery, summary & exploration
1. identify topical sub-structure in a set (generally, a time-indexed stream) of documents that are constrained to be about a common topic
2. show how these substructures emerge, change, and disappear (and maybe re-appear) over time
3. provide intuitive interfaces for interactively exploring the topics (story space) and the underlying documents (document space)
and for users' own sense-making
– use machine-generated summarization only as a starting point!
So ...
Demo
Evaluations (so far ...)
1. Information retrieval quality
Challenge: What is the ground truth?
Build on Wikipedia or other timelines
Edges (events): up to 80% recall, ca. 30% precision
2. Search quality
Story subgraphs index coherent document clusters
3. Learning effectiveness
Document search with story graphs leads to averages of 67-75% accuracy on judgments of story-fact truth and 3.4 nodes/words per query – but this appears to depend on the corpus
How to spot events / eventfulness (1)
Search for properties that can be detected formally and visually
Tested so far: Graph topology
Global properties
– Graph size
– Number of connected components
– Size of connected components (avg, max)
Local properties
– Degree
– Degree centrality
– Sum of adjacent TR (time-relevance) weights
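The size, component, and degree properties listed above can be computed directly on a co-occurrence graph. A stdlib-only sketch (all function and variable names are my own, not from the STORIES tool; the sum of adjacent TR weights would additionally need edge weights, omitted here):

```python
from collections import deque

def connected_components(adj):
    """BFS over an undirected adjacency dict -> list of node sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def topology_summary(adj):
    """Global and local graph properties used to spot eventful periods."""
    comps = connected_components(adj)
    sizes = [len(c) for c in comps]
    n = len(adj)
    return {
        "graph_size": n,
        "num_components": len(comps),
        "avg_component_size": sum(sizes) / len(sizes),
        "max_component_size": max(sizes),
        "degree": {v: len(nb) for v, nb in adj.items()},
        # degree centrality = degree / (n - 1)
        "degree_centrality": {v: len(nb) / (n - 1) for v, nb in adj.items()},
    }

# toy story graph with two components
adj = {
    "mother": {"suspect", "police"},
    "suspect": {"mother"},
    "police": {"mother"},
    "beach": {"wave"},
    "wave": {"beach"},
}
summary = topology_summary(adj)
```

Tracking these numbers per time interval then gives the event signals discussed on the following slides.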
Evidence 1 (a story begins)
Evidence 2 (a suspect is found)
Evidence 3 (An eventless time)
Spotting events (2):
Changes in story graphs mark the advent of an event
So what happens once you have found documents?
(Architecture & states overview repeated from above)
The neighbourhood of a document
Constructing the similarity measure & neighbourhood (I)
Constructing the similarity measure & neighbourhood (II)
Constructing the similarity measure & neighbourhood (III)
A news source
A German-language blog
Most neighbours are blogs
Most neighbours are English-language blogs
English blog
German blog
English news
Comparing documents
Comparing documents; utilizing multilingual sources
Refocusing
Structuring a neighbourhood
Ex.: Finding a "story"
Behind the scenes
Ingredients of a solution to ETP3 = Evolutionary theme patterns discovery, summary and exploration

ETP3 ingredient → STORIES realization:
• Document / text pre-processing → template recognition; multi-document named entities; stopword removal, lemmatization
• Document summarization strategy → no topics, but salient concepts & relations; time window; word-span window
• Selection approach for concepts → concepts = words or named entities; salient concept = high TF & involved in a salient relation, time-indexed
• Similarity measure to determine relations → bursty co-occurrence
• Burstiness measure → time relevance, a "temporal co-occurrence lift"
• Interaction approach → graphs (& layout); comparative statics or morphing; drill-down: "uncovering" relations; links to documents (in progress)
Data collection and preprocessing
Articles retrieved from a news archive with a search term identifying the top-level story
Only English-language articles
Only freely available articles
Preprocessing: (repeated from above)
HTML cleaning
tokenization
stopword removal
lemmatization
multi-document named-entity recognition
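A stdlib-only sketch of the text-side preprocessing steps (HTML cleaning, tokenization, stopword removal); lemmatization and multi-document named-entity recognition are left to dedicated tools as on the slide, and the stopword list here is a toy stand-in:

```python
import re
from html.parser import HTMLParser

STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "was", "to"}  # toy list

class TextExtractor(HTMLParser):
    """Crude HTML cleaning: keep only the text content of a page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def preprocess(html):
    """HTML cleaning -> tokenization -> stopword removal."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.chunks)
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

tokens = preprocess("<p>The mother of the suspect was found.</p>")
```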
Story elements
Content-bearing words: the 150 top-TF words, stopwords excluded
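A minimal sketch of this selection (names are illustrative; the documents are assumed to be already preprocessed token lists):

```python
from collections import Counter

def content_bearing_words(docs, k=150):
    """Top-k terms by corpus-wide term frequency; docs are token lists
    that have already been stopword-filtered and lemmatized."""
    tf = Counter(token for doc in docs for token in doc)
    return [word for word, _ in tf.most_common(k)]

# toy corpus of two preprocessed articles
docs = [["mother", "suspect", "police"], ["suspect", "arrest"]]
top = content_bearing_words(docs, k=2)
```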
Story stages: co-occurrence in a window
"mother" and "suspect" co-occur
• in a window of size ≥ 6 (all words)
• in a window of size ≥ 2 (non-stopwords only)
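A sketch of the window test. The exact window convention is not spelled out on the slide, so this assumes "window of size n" means both words fall within n consecutive tokens:

```python
def cooccur_in_window(tokens, w1, w2, window):
    """True iff some occurrence of w1 and some occurrence of w2
    fall within a span of `window` consecutive tokens."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) < window for i in pos1 for j in pos2)

# all words vs. non-stopwords only (toy sentence, not the slide's example)
tokens = "the mother of the young suspect".split()
content_only = ["mother", "young", "suspect"]
hit_all = cooccur_in_window(tokens, "mother", "suspect", window=5)
hit_content = cooccur_in_window(content_only, "mother", "suspect", window=3)
```

Filtering stopwords first shrinks the required window, which is exactly the contrast the slide's two bullets make.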
Salient story elements
1. Split the whole corpus T by week (or some other time interval)
2. For each week, compute the weights for that week's sub-corpus t
3. Weight = support of the co-occurrence of two content-bearing words w1, w2 in t =
(# articles in t containing both w1 and w2 within the window) / (# all articles in t)
4. Thresholds:
number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
time relevance TR of co-occurrence(w1, w2) = support(co-occurrence(w1, w2)) in t / support(co-occurrence(w1, w2)) in T ≥ θ2 (e.g., 2)
5. Rank by TR; for each week, identify the top 2 pairs
6. Story elements = peak words = all elements of these top-2 pairs (here, # = 38)
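The support and time-relevance computations in steps 3-4 can be sketched as follows (a minimal illustration with hypothetical helper names; the real tool's windowing and thresholds may differ):

```python
def _cooccur(tokens, w1, w2, window=2):
    """Both words occur within a span of `window` consecutive tokens."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) < window for i in pos1 for j in pos2)

def support(articles, w1, w2, window=2):
    """Share of articles in which w1 and w2 co-occur within the window."""
    return sum(_cooccur(a, w1, w2, window) for a in articles) / len(articles)

def time_relevance(week_articles, all_articles, w1, w2,
                   window=2, theta1=5, theta2=2):
    """TR = support in the week's sub-corpus t / support in the whole corpus T.
    Returns TR if both thresholds are met, else None."""
    if sum(_cooccur(a, w1, w2, window) for a in week_articles) < theta1:
        return None
    tr = (support(week_articles, w1, w2, window)
          / support(all_articles, w1, w2, window))
    return tr if tr >= theta2 else None

# toy corpus: 5 articles in the focus week, 5 elsewhere
week = [["mother", "suspect", "found"]] * 5
rest = [["beach", "wave", "warning"]] * 5
tr = time_relevance(week, week + rest, "mother", "suspect")  # 1.0 / 0.5 = 2.0
```

TR acts as a "temporal co-occurrence lift": a pair is salient in a week only if it co-occurs markedly more often there than across the whole corpus.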
Salient story stages, and story evolution
7. Story stage = co-occurrences of peak words in t
For each week t: aggregate over t-2, t-1, t (moving average)
8. Story evolution = how story stages evolve over the weeks t in T
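The moving-average aggregation of step 7 might look like this (an illustrative sketch; the representation of weekly pair weights and all names are assumptions):

```python
def story_stages(weekly_pairs):
    """Smooth weekly co-occurrence weights with a moving average over
    t-2, t-1, t; weekly_pairs maps week index -> {pair: weight}."""
    weeks = sorted(weekly_pairs)
    smoothed = {}
    for idx, t in enumerate(weeks):
        recent = weeks[max(0, idx - 2): idx + 1]      # t-2, t-1, t
        pairs = set().union(*(weekly_pairs[w].keys() for w in recent))
        smoothed[t] = {p: sum(weekly_pairs[w].get(p, 0.0) for w in recent)
                          / len(recent)
                       for p in pairs}
    return smoothed

# toy weekly weights for two peak-word pairs
weekly = {
    1: {("mother", "suspect"): 0.6},
    2: {("mother", "suspect"): 0.3},
    3: {("suspect", "arrest"): 0.9},
}
stages = story_stages(weekly)
```

Story evolution then simply reads off how the smoothed stage for each pair rises and fades across the weeks.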
Summary
Navigation in story space → story building
+
Document search
+
Navigation in document space
lead to understandable, useful and intuitive interfaces
Datasets: Where does STORIES not work well?
• The need for contiguity and a sizeable dataset: DUC dataset
• The need for action and a storyline: Tsunami dataset
• The benefit of hindsight: Enron
My questions to you – and: The Future
• linguistic analysis: sparsity problem; which word classes, which relations?
• semantic enhancements
• linkage information
• indexing and ranking for search
• graph mining
• business models for search: news/blogs; other texts
Thanks!
... also for:
your questions!