43
1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt www.cs.kuleuven.b e/~berendt

1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt berendt

Embed Size (px)

Citation preview

Page 1: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

1

1

Sailing the Corpus Sea: Tools for Visual

Discovery of Stories in Blogs and News

Bettina Berendt

www.cs.kuleuven.be/~berendt

Page 2: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

2

2

About me

Page 3: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

3

3

Thanks to ...

Ilija Subašić

Tool forthcoming;

all beta testers and experiment participants welcome!

Daniel Trümper

Tool at http://www.cs.kuleuven.be/~berendt/PORPOISE/

PASCAL

for funding PORPOISE

Page 4: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

4

4

References

Subašić, I. & Berendt, B. (2008). Web Mining for Understanding Stories through Graph Visualisation. In Proceedings of ICDM 2008. IEEE Press.

Berendt, B. and D. Trümper (2009). Semantics-based analysis and navigation of heterogeneous text corpora: the Porpoise news and blogs engine. In I.-H. Ting & H.-J. Wu (Eds.), Web Mining Applications in E-commerce and E-services (pp. 45-64). Berlin etc.: Springer, Studies in Computational Intelligence, Vol. 172

Berendt, B. & Subašić, I. (2009). Measuring graph topology for interactive temporal event detection. Künstliche Intelligenz, 02/09, 11-17.

Subašić, I. & Berendt, B. (in press). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems.

Please see http://www.cs.kuleuven.be/~berendt/ for these papers.

Page 5: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

5

5

Motivation 1: What‘s the story?

Page 6: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

6

6Motivation 2: Global+local interaction; beyond “similar documents“

with respect to what?

Page 7: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

7

7Solution vision: Sailing the Internet

GlobalAnalysis

Search

Localanalysis

Page 8: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

8

8

Solution approach: Architecture & states overview (version 1)

Construct composite-similarity neighbourhood

Search + selectdocument(s)

Aspect-based similarity search

Storyunderstanding

Selectneighbour- hood

Search

GlobalAnalysis

Localanalysis

Refocus

Source doc.s databaseStory/ontology learning

Import

Web

Retrieval & Preprocessing

Specify sources & filters

Storyspace

Documentspace

Page 9: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

9

9

Retrieval and preprocessing

• Crawler / wrapper, using • Yahoo! Web services• Yahoo! / Google News• Blogdigger

• Translator (uses Babelfish)• Preprocessing (uses Textgarden, Terrier)• Named-entity recognition (uses GATE, OpenCalais)• Similarity Computation

Web

Source doc.s databaseRetrieval & Preprocessing

Page 10: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

10

10

Story learning

Construct composite-similarity neighbourhood

Search + selectdocument(s)

Aspect-based similarity search

Storyunderstanding

Selectneighbour- hood

Search

GlobalAnalysis

Localanalysis

Refocus

Source doc.s databaseStory/ontology learning

Web

Retrieval & Preprocessing

Specify sources & filters

Storyspace

Documentspace

Page 11: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

11

11

Story learning: a focus on news-type stories

Page 12: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

12

12

Story learning needs timelines

Page 13: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

13

13

... But where‘s the evolution? Approach Temporal latent topics

Mei & Zhai, PKDD 2005

• no fine-grained relational information• “themes“ are fixed by the algorithm• no „drill down“ possible no combination of machine and human intelligence

Page 14: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

14

14

The ETP3 problem

Evolutionary theme patterns discovery, summary & exploration

1. identify topical sub-structure in a set (generally, a time-indexed stream) of documents constrained by being about a common topic

2. show how these substructures emerge, change, and disappear (and maybe re-appear) over time

3. provide intuitive interfaces for interactively exploring the topics (story space) and the underlying documents (document space)

and for their own sense-making

– use machine-generated summarization only as a starting point!

Page 15: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

15

15

So ...

Page 16: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

16

16

Demo

Page 17: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

17

17

Evaluations (so far ...)

3. Learning effectiveness

Document search with story graphs leads to averages of

67-75% accuracy on judgments of story fact truth

3.4 nodes/words per query – but this appears to depend on the corpus

1. Information retrieval quality

Challenge: What is the ground truth

Build on Wikipedia or other timelines

Edges – events: up to 80% recall, ca. 30% precision

2. Search quality

Story subgraphs index coherent document clusters

Page 18: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

18

18

How to spot events / eventfulness (1)

Search for properties that can be detected formally and visually

Tested so far: Graph topology

Global properties

– Graph size

– Number of connected components

– Size of connected components (avg, max)

Local properties

– Degree

– Degree centrality

– Sum of adjacent TR weights

Page 19: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

19

19

Evidence 1 (a story begins)

Page 20: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

20

20

Evidence 2 (a suspect is found)

Page 21: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

21

21

Evidence 3 (An eventless time)

Page 22: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

22

22

Spotting events (2):

Changes in story graphs mark theadvent of an event

Page 23: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

23

23

So what happens once you have found documents?

Construct composite-similarity neighbourhood

Search + selectdocument(s)

Aspect-based similarity search

Storyunderstanding

Selectneighbour- hood

Search

GlobalAnalysis

Localanalysis

Refocus

Source doc.s databaseStory/ontology learning

Import

Web

Retrieval & Preprocessing

Specify sources & filters

Storyspace

Documentspace

Page 24: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

24

24The neighbourhood of a document

Page 25: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

25

25Constructing the similarity measure & neighbourhood (I)

Page 26: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

26

26Constructing the similarity measure & neighbourhood (II)

Page 27: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

27

27Constructing the similarity measure & neighbourhood (III)

A news source

A German-language blog

Most neighbours are blogs

Most neighbours are English-

language blogs

English blog

German blog

English news

Page 28: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

28

28Comparing documents

Page 29: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

29

29Comparing documents; utilizing multilingual sources

Page 30: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

30

30Refocusing

Page 31: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

31

31Structuring a neighbourhood

Page 32: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

32

32Ex.: Finding a “story“

Page 33: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

33

33

Behind the scenes

Page 34: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

34

34

Ingredients of a solutionto ETP3 = Evolutionary theme patterns discovery, summary and exploration

Document / text pre-processing

Document summarization strategy

Selection approach for concepts

Similarity measure to determine relations

Burstiness measure

Interaction approach

STORIES

• Template recognition• Multi-document named entities• Stopword removal, lemmatization

• no topics, but salient concepts & relations• time window; word-span window

• concepts = words or named entities• salient concept = high TF & involved in a salient relation, time-indexed

• bursty co-occurrence

• time relevance, a “temporal co-occurrence lift”

• Graphs (& layout)• Comparative statics or morphing• Drill-down: “uncovering” relations• Links to documents (in progress)

ETP3

Page 35: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

35

35

Data collection and preprocessing

Articles from a news archive for search term identifying the top-level story

Only English-language articles

Only freely available articles

Preprocessing: (repeated from above)

HTML cleaning

tokenization

stopword removal

lemmatization

multi-document named-entity recognition

Page 36: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

36

36

Story elements

content-bearing words

the 150 top-TF words without stopwords

Page 37: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

37

37

Story stages: co-occurrence in a window

“mother“ and “suspect“ co-occur• in a window of size ≥ 6 (all words)• in a window of size ≥ 2 (non-stopwords only)

Page 38: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

38

38

Salient story elements

1. Split whole corpus T by week (or some other time interval)

2. For each week

Compute the weights for corpus t for this week

3. Weight =

Support of co-occurrence of 2 content-bearing words w1, w2 in t =

(# articles from t containing both w1, w2 in window) / (# all articles in t)

4. Threshold

Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)

Time-relevance TR of co-occurrence(w1, w2) =

support(co-occurrence(w1, w2)) in t / support(co-occurrence(w1, w2)) in T ≥

θ2 (e.g., 2) *

5. Rank by TR, for each week identify top 2

6. Story elements = peak words = all elements of these top 2 pairs (# = 38)

Page 39: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

39

39

Salient story stages, and story evolution

7. Story stage = co-occurrences of peak words in t

For each week t: aggregate over t-2, t-1, t moving average

8. Story evolution = how story stages evolve over the t in T

Page 40: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

40

40

Summary

Navigation in story space story building

+

Document search

+

Navigation in document space

lead to understandable, useful + intuitive interfaces

Page 41: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

41

41

Datasets: Where does STORIES not work well?

the need for contiguity and a sizeable dataset

DUC dataset

the need for action and a storyline

Tsunami dataset

the benefit of hindsight

Enron

Page 42: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

42

42

My questions to you – and: The Future

linguistic analysis

sparsity problem

which word classes, which relations?

semantic enhancements

linkage information

indexing and ranking for search

graph mining

business models for search

news/blogs

other texts

Page 43: 1 1 Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News Bettina Berendt  berendt

43

43

Thanks!

... also for:

your questions!