
Sailing the Corpus Sea: Tools for Visual Discovery of Stories in Blogs and News

Bettina Berendt

www.cs.kuleuven.be/~berendt


About me


Thanks to ...

Ilija Subašić

Tool forthcoming;

all beta testers and experiment participants welcome!

Daniel Trümper

Tool at http://www.cs.kuleuven.be/~berendt/PORPOISE/

PASCAL

for funding PORPOISE


References

Subašić, I. & Berendt, B. (2008). Web Mining for Understanding Stories through Graph Visualisation. In Proceedings of ICDM 2008. IEEE Press.

Berendt, B. & Trümper, D. (2009). Semantics-based analysis and navigation of heterogeneous text corpora: the Porpoise news and blogs engine. In I.-H. Ting & H.-J. Wu (Eds.), Web Mining Applications in E-commerce and E-services (pp. 45-64). Berlin etc.: Springer, Studies in Computational Intelligence, Vol. 172.

Berendt, B. & Subašić, I. (2009). Measuring graph topology for interactive temporal event detection. Künstliche Intelligenz, 02/09, 11-17.

Subašić, I. & Berendt, B. (in press). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems.

Please see http://www.cs.kuleuven.be/~berendt/ for these papers.


Motivation 1: What's the story?

Motivation 2: Global+local interaction; beyond “similar documents”

with respect to what?

Solution vision: Sailing the Internet

Global analysis

Search

Local analysis


Solution approach: Architecture & states overview (version 1)

Construct composite-similarity neighbourhood

Search + select document(s)

Aspect-based similarity search

Story understanding

Select neighbourhood

Search

Global analysis

Local analysis

Refocus

Source documents database

Story/ontology learning

Import

Web

Retrieval & preprocessing

Specify sources & filters

Story space

Document space


Retrieval and preprocessing

• Crawler / wrapper, using
– Yahoo! Web services
– Yahoo! / Google News
– Blogdigger
• Translator (uses Babelfish)
• Preprocessing (uses Textgarden, Terrier)
• Named-entity recognition (uses GATE, OpenCalais)
• Similarity computation

[Diagram labels: Web; Retrieval & preprocessing; Source documents database]


Story learning




[Architecture diagram, repeated. Labels: Story understanding; Search; Global analysis; Local analysis; Refocus; Web; Story space; Document space]


Story learning: a focus on news-type stories


Story learning needs timelines


... But where's the evolution? Approach: temporal latent topics

Mei & Zhai, PKDD 2005

• no fine-grained relational information
• “themes” are fixed by the algorithm
• no “drill-down” possible, hence no combination of machine and human intelligence


The ETP3 problem

Evolutionary theme patterns discovery, summary & exploration

1. identify topical sub-structure in a set (generally, a time-indexed stream) of documents constrained by being about a common topic

2. show how these substructures emerge, change, and disappear (and maybe re-appear) over time

3. provide intuitive interfaces for interactively exploring the topics (story space) and the underlying documents (document space), and for users' own sense-making

– use machine-generated summarization only as a starting point!


So ...


Demo


Evaluations (so far ...)

1. Information retrieval quality

Challenge: What is the ground truth?

Build on Wikipedia or other timelines

Edges – events: up to 80% recall, ca. 30% precision

2. Search quality

Story subgraphs index coherent document clusters

3. Learning effectiveness

Document search with story graphs leads to averages of

67-75% accuracy on judgments of story fact truth

3.4 nodes/words per query, but this appears to depend on the corpus


How to spot events / eventfulness (1)

Search for properties that can be detected formally and visually

Tested so far: Graph topology

Global properties

– Graph size

– Number of connected components

– Size of connected components (avg, max)

Local properties

– Degree

– Degree centrality

– Sum of adjacent TR weights
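
The properties listed above can be computed directly from a weighted edge list. The following is a minimal sketch, not the STORIES implementation: it assumes a story graph is given as a dictionary mapping each undirected word pair (listed once) to its TR weight, and it reports both node and edge counts because the slide does not say which one "graph size" refers to. The function name `topology_summary` is hypothetical.

```python
from collections import defaultdict


def topology_summary(edges):
    """Summarise global and local topology of an undirected story graph.

    edges: dict mapping (word1, word2) tuples to a TR weight.
    Assumption: each undirected pair appears only once in the dict.
    """
    adjacency = defaultdict(dict)
    for (u, v), weight in edges.items():
        adjacency[u][v] = weight
        adjacency[v][u] = weight

    # Connected components via an iterative depth-first search.
    seen, component_sizes = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        component_sizes.append(size)

    n = len(adjacency)
    local = {
        node: {
            "degree": len(neighbours),
            # Degree normalised by the maximum possible degree (n - 1).
            "degree_centrality": len(neighbours) / (n - 1) if n > 1 else 0.0,
            "tr_sum": sum(neighbours.values()),
        }
        for node, neighbours in adjacency.items()
    }
    return {
        "nodes": n,
        "edges": len(edges),
        "components": len(component_sizes),
        "avg_component_size": (
            sum(component_sizes) / len(component_sizes) if component_sizes else 0.0
        ),
        "max_component_size": max(component_sizes, default=0),
        "local": local,
    }


example_edges = {
    ("mother", "suspect"): 2.5,
    ("suspect", "police"): 3.0,
    ("dog", "search"): 2.0,
}
summary = topology_summary(example_edges)
```

Computing these values for each time window yields one row of numbers per story graph, which can then be compared across periods.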


Evidence 1 (a story begins)


Evidence 2 (a suspect is found)


Evidence 3 (an eventless time)


Spotting events (2):

Changes in story graphs mark the advent of an event
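
One simple way to quantify how much a story graph changes between two consecutive periods is the Jaccard distance between their edge sets. This is only an illustrative sketch: the slides do not say which change measure was used, and the function name `edge_change` is an assumption.

```python
def edge_change(previous_edges, current_edges):
    """Jaccard distance between the edge sets of two undirected graphs.

    Each edge is a pair of words; direction is ignored.
    Returns 0.0 for identical graphs and 1.0 for graphs sharing no edge.
    """
    previous = {frozenset(edge) for edge in previous_edges}
    current = {frozenset(edge) for edge in current_edges}
    union = previous | current
    if not union:
        return 0.0  # two empty graphs: no change
    return 1.0 - len(previous & current) / len(union)


week_1 = [("mother", "suspect"), ("suspect", "police")]
week_2 = [("suspect", "mother"), ("police", "arrest")]
change = edge_change(week_1, week_2)
```

A high value between two periods would then flag a candidate event for visual inspection.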


So what happens once you have found documents?




[Architecture diagram, repeated. Labels: Story understanding; Search; Global analysis; Local analysis; Refocus; Import; Web; Story space; Document space]

The neighbourhood of a document

Constructing the similarity measure & neighbourhood (I)

Constructing the similarity measure & neighbourhood (II)

Constructing the similarity measure & neighbourhood (III)

A news source

A German-language blog

Most neighbours are blogs

Most neighbours are English-language blogs

English blog

German blog

English news

Comparing documents

Comparing documents; utilizing multilingual sources

Refocusing

Structuring a neighbourhood

Example: Finding a “story”


Behind the scenes


Ingredients of a solution to ETP3 = Evolutionary theme patterns discovery, summary and exploration

ETP3 ingredient, followed by the STORIES approach to it:

Document / text pre-processing
• Template recognition
• Multi-document named entities
• Stopword removal, lemmatization

Document summarization strategy
• no topics, but salient concepts & relations
• time window; word-span window

Selection approach for concepts
• concepts = words or named entities
• salient concept = high TF & involved in a salient relation, time-indexed

Similarity measure to determine relations
• bursty co-occurrence

Burstiness measure
• time relevance, a “temporal co-occurrence lift”

Interaction approach
• Graphs (& layout)
• Comparative statics or morphing
• Drill-down: “uncovering” relations
• Links to documents (in progress)


Data collection and preprocessing

Articles from a news archive for search term identifying the top-level story

Only English-language articles

Only freely available articles

Preprocessing: (repeated from above)

HTML cleaning

tokenization

stopword removal

lemmatization

multi-document named-entity recognition


Story elements

content-bearing words

the 150 top-TF words without stopwords
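
Selecting the content-bearing words amounts to a frequency count over the preprocessed corpus. A minimal sketch, assuming each document is already a list of lemmatized tokens and that a stopword set is available; the function name `content_bearing_words` and the toy data are illustrative only.

```python
from collections import Counter


def content_bearing_words(documents, stopwords, top_n=150):
    """Return the top_n most frequent non-stopword tokens in the corpus.

    documents: iterable of token lists (assumed already lemmatized).
    stopwords: set of tokens to ignore.
    """
    counts = Counter(
        token
        for tokens in documents
        for token in tokens
        if token not in stopwords
    )
    return [word for word, _ in counts.most_common(top_n)]


docs = [
    ["the", "mother", "saw", "the", "suspect"],
    ["suspect", "seen", "suspect"],
]
top_words = content_bearing_words(docs, stopwords={"the"}, top_n=2)
```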


Story stages: co-occurrence in a window

“mother” and “suspect” co-occur
• in a window of size ≥ 6 (all words)
• in a window of size ≥ 2 (non-stopwords only)
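
The window test can be sketched as follows, reading "co-occur in a window of size k" as "both words fit inside some run of k consecutive tokens". The example sentence, the stopword set, and the function name `cooccur_in_window` are assumptions made for illustration; they are not taken from the original slide.

```python
def cooccur_in_window(tokens, w1, w2, window):
    """True if w1 and w2 both occur within `window` consecutive tokens."""
    positions_1 = [i for i, token in enumerate(tokens) if token == w1]
    positions_2 = [i for i, token in enumerate(tokens) if token == w2]
    # Two positions i, j fit in a window of size k iff |i - j| + 1 <= k.
    return any(abs(i - j) < window for i in positions_1 for j in positions_2)


# Hypothetical sentence: the two words are five tokens apart.
all_tokens = ["mother", "who", "was", "not", "the", "suspect"]
stopwords = {"who", "was", "not", "the"}
content_tokens = [token for token in all_tokens if token not in stopwords]

in_all_words = cooccur_in_window(all_tokens, "mother", "suspect", window=6)
in_content_words = cooccur_in_window(content_tokens, "mother", "suspect", window=2)
```

Removing stopwords before applying the window lets a much smaller window capture the same relation, which is why the two thresholds on the slide differ.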


Salient story elements

1. Split whole corpus T by week (or some other time interval)

2. For each week

Compute the weights for corpus t for this week

3. Weight = support of the co-occurrence of 2 content-bearing words w1, w2 in t:

\[ \mathrm{support}_t(w_1, w_2) = \frac{\#\,\text{articles from } t \text{ containing both } w_1, w_2 \text{ in window}}{\#\,\text{all articles in } t} \]

4. Threshold

Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)

Time relevance TR of co-occurrence(w1, w2):

\[ \mathrm{TR}_t(w_1, w_2) = \frac{\mathrm{support}_t(w_1, w_2)}{\mathrm{support}_T(w_1, w_2)} \geq \theta_2 \quad (\text{e.g., } 2) \]

5. Rank by TR, for each week identify top 2

6. Story elements = peak words = all elements of these top 2 pairs (# = 38)
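
Steps 1 to 6 can be sketched in a few lines. This is a rough illustration under stated assumptions, not the STORIES code: the input is assumed to be a dictionary from week to a list of articles, each article already reduced to the set of word pairs that co-occur in a window; "number of occurrences" is read as the number of articles in the week containing the pair; and support in T is computed over all articles of all weeks. The names `salient_pairs` and `peak_words` are hypothetical.

```python
from collections import Counter


def salient_pairs(pairs_by_week, theta1=5, theta2=2.0, top_k=2):
    """Return, per week, the top_k (TR, pair) tuples that pass both thresholds.

    pairs_by_week: {week: [set of frozenset({w1, w2}) per article, ...]}
    """
    total_articles = sum(len(articles) for articles in pairs_by_week.values())
    corpus_counts = Counter(
        pair
        for articles in pairs_by_week.values()
        for pairs in articles
        for pair in pairs
    )
    result = {}
    for week, articles in pairs_by_week.items():
        week_counts = Counter(pair for pairs in articles for pair in pairs)
        scored = []
        for pair, count in week_counts.items():
            if count < theta1:  # step 4: minimum number of occurrences in t
                continue
            support_t = count / len(articles)
            support_T = corpus_counts[pair] / total_articles
            tr = support_t / support_T  # time relevance
            if tr >= theta2:
                scored.append((tr, pair))
        scored.sort(key=lambda item: item[0], reverse=True)  # step 5: rank by TR
        result[week] = scored[:top_k]
    return result


def peak_words(salient):
    """Step 6: all words that appear in any week's top pairs."""
    return {word for pairs in salient.values() for _, pair in pairs for word in pair}


pair = frozenset({"mother", "suspect"})
toy_corpus = {
    "week-1": [{pair} for _ in range(5)],  # pair appears in all 5 articles
    "week-2": [set() for _ in range(5)],   # pair never appears
}
salient = salient_pairs(toy_corpus)
story_elements = peak_words(salient)
```

In the toy corpus the pair has support 1.0 in week 1 and 0.5 in the whole corpus, so its TR is 2.0 and it just passes the example threshold θ2 = 2.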


Salient story stages, and story evolution

7. Story stage = co-occurrences of peak words in t

For each week t: aggregate over t-2, t-1, t (moving average)

8. Story evolution = how story stages evolve over the weeks t in T
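
The three-week aggregation can be sketched as a trailing moving average over per-week edge weights. The slide does not specify exactly what is averaged, so this sketch assumes one weight per word pair per week, treats a pair missing from a week as weight 0, and averages over fewer weeks at the start of the series. The function name `smoothed_weights` is hypothetical.

```python
def smoothed_weights(weekly_weights, span=3):
    """Trailing moving average of edge weights over the last `span` weeks.

    weekly_weights: list of {pair: weight} dicts, ordered by week.
    Returns a list of the same length with smoothed {pair: weight} dicts.
    """
    smoothed = []
    for index in range(len(weekly_weights)):
        # Weeks t-2, t-1, t (fewer at the start of the series).
        window = weekly_weights[max(0, index - span + 1): index + 1]
        pairs = set().union(*window)
        smoothed.append({
            pair: sum(week.get(pair, 0.0) for week in window) / len(window)
            for pair in pairs
        })
    return smoothed


weeks = [
    {("mother", "suspect"): 3.0},
    {("mother", "suspect"): 0.0, ("suspect", "police"): 6.0},
    {("suspect", "police"): 3.0},
]
stages = smoothed_weights(weeks)
```

Reading the resulting list in order gives one story stage per week, and the sequence of stages is the story evolution.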


Summary

Navigation in story space (story building)

+

Document search

+

Navigation in document space

lead to understandable, useful + intuitive interfaces


Datasets: Where does STORIES not work well?

the need for contiguity and a sizeable dataset

DUC dataset

the need for action and a storyline

Tsunami dataset

the benefit of hindsight

Enron


My questions to you – and: The Future

linguistic analysis

sparsity problem

which word classes, which relations?

semantic enhancements

linkage information

indexing and ranking for search

graph mining

business models for search

news/blogs

other texts


Thanks!

... also for:

your questions!