1

The evolution of a story in a network – a Web mining perspective

Bettina Berendt

www.cs.kuleuven.be/~berendt

2

About me: My public (and mine-able) profile

: Information Systems: Computer Science / Cognitive Science: Artificial Intelligence: Business Science: Economics

: Computer Science

3

Agenda

Web mining: Text, Link structure, Usage

Story evolution: texts change (T)

Story evolution: authors change (U)

Story evolution: communities of authors change (L)

Story evolution: reading behaviour changes (U)

4

Web mining

5

Information retrieval and data mining

What's in this list?

How is it ordered?

6

Information retrieval and data mining

7

Data mining & Web mining

Knowledge discovery (aka Data mining):

“the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” 1

Web mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources. Web mining areas:

Web content mining: texts, pictures, sounds, ...

Web structure mining: simple, bipartite, tripartite, ... graphs

Web usage mining: navigation, queries, content access & creation

1 Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press

8

Story evolution: texts change

– joint work with Ilija Subašić, 2008 –

* All references are given on slide no. 47

9

Dynamic Web content

10

A story begins

http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/05/22/nmaddy122.xml

11

The story unfolds

12

The story unfolds – new actors enter the stage (and old ones change their roles)

13

Basic idea: A story is about relational statements; story stages are expressed by co-occurrences

Robert Murat – suspect

Kate McCann (the mother) – suspect

Gabriel Ruget's talk

14

Data collection and preprocessing

Articles from Google News, 05/2007 – 11/2007, for the search term “madeleine mccann”

(there was a Google problem in the December archive)

Only English-language articles

For each month, the first 100 hits

Of these, all that were freely available: 477 documents

Preprocessing: HTML cleaning, tokenization, stopword removal
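The three preprocessing steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline used in the study; the function names and the tiny stopword list are assumptions made for the example.

```python
import re
from html import unescape

# Tiny illustrative stopword list (an assumption; not the list used in the study).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "by", "for", "in", "is",
             "of", "on", "the", "to", "was", "with"}


def clean_html(html: str) -> str:
    """Drop script/style blocks and all tags, then decode HTML entities."""
    html = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return unescape(text)


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def preprocess(html: str) -> list[str]:
    """HTML cleaning, then tokenization, then stopword removal."""
    return remove_stopwords(tokenize(clean_html(html)))


page = "<p>The mother is a <b>suspect</b> &amp; more</p><script>var x=1;</script>"
print(preprocess(page))
```

A regex-based cleaner is enough for a sketch; real news pages would usually call for a proper HTML parser and a fuller stopword list.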

15

Story elements

content-bearing words

the 150 top-TF words without stopwords
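A minimal sketch of selecting content-bearing words as the top-k terms by total term frequency, assuming each document has already been tokenized and stripped of stopwords. The name `top_tf_words` is hypothetical, and summing raw counts over the whole corpus is an assumption about how "top-TF" is computed.

```python
from collections import Counter


def top_tf_words(docs, k=150):
    """Return the k most frequent tokens across all documents.

    docs: iterable of token lists (stopwords already removed).
    """
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(k)]


docs = [["police", "suspect", "mother"], ["police", "suspect"], ["police"]]
print(top_tf_words(docs, k=2))
```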

16

Story stages: co-occurrence in a window

“mother” and “suspect” co-occur
• in a window of size ≥ 6 (all words)
• in a window of size ≥ 2 (non-stopwords only)
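A window test of this kind can be sketched as below. The sketch assumes that "window size" means the number of consecutive tokens spanned by the two words, both included; the function name and the example sentence are hypothetical.

```python
def cooccur_in_window(tokens, w1, w2, window):
    """True if w1 and w2 both occur within `window` consecutive tokens."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    # A span from position i to position j covers abs(i - j) + 1 tokens.
    return any(abs(i - j) + 1 <= window for i in pos1 for j in pos2)


all_words = ["the", "mother", "is", "now", "also", "a", "suspect"]
content_words = ["mother", "suspect"]  # same sentence without stopwords

print(cooccur_in_window(all_words, "mother", "suspect", 6))      # True
print(cooccur_in_window(all_words, "mother", "suspect", 5))      # False
print(cooccur_in_window(content_words, "mother", "suspect", 2))  # True
```

Counting over non-stopwords only lets a much smaller window capture the same relation, which is why the two thresholds differ.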

17

Salient story elements

1. Split the whole corpus T by week (week 17 = 30 Apr onwards, up to week 44 = 12 Nov onwards)

2. For each week: compute the weights for the corpus t of that week

3. Weight = support of the co-occurrence of 2 content-bearing words w1, w2 in t =

(# articles from t containing both w1, w2 in a window) / (# all articles in t)

4. Thresholds:

Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)

Time relevance TR of co-occurrence(w1, w2) = (support of co-occurrence(w1, w2) in t) / (support of co-occurrence(w1, w2) in T) ≥ θ2 (e.g., 2) *

5. Rank by TR; for each week, identify the top 2 pairs

6. Story elements = peak words = all elements of these top 2 pairs (# = 38)
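Steps 1 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each article is represented as the set of word pairs that co-occur in it within the chosen window, "number of occurrences" is read as the number of articles containing the pair, and all function names are hypothetical.

```python
from collections import Counter


def pair_counts(articles):
    """Number of articles in which each word pair co-occurs."""
    counts = Counter()
    for pairs in articles:
        counts.update(set(pairs))
    return counts


def salient_pairs(weeks, theta1=5, theta2=2.0, top=2):
    """weeks: dict mapping week number -> list of articles (sets of pairs).

    Returns, per week, up to `top` (TR, pair) tuples ranked by time relevance.
    """
    all_articles = [a for articles in weeks.values() for a in articles]
    total_counts = pair_counts(all_articles)
    n_total = len(all_articles)
    result = {}
    for week, articles in weeks.items():
        scored = []
        for pair, count in pair_counts(articles).items():
            if count < theta1:                            # threshold theta1
                continue
            support_t = count / len(articles)             # support in week t
            support_all = total_counts[pair] / n_total    # support in corpus T
            tr = support_t / support_all                  # time relevance
            if tr >= theta2:                              # threshold theta2
                scored.append((tr, pair))
        scored.sort(key=lambda item: (-item[0], sorted(item[1])))
        result[week] = scored[:top]
    return result


def peak_words(salient):
    """Story elements: all words appearing in any salient pair."""
    return {w for pairs in salient.values() for _, pair in pairs for w in pair}


A = frozenset({"murat", "suspect"})
B = frozenset({"mother", "suspect"})
weeks = {17: [{A}, {A}], 18: [{B}, {B}]}
salient = salient_pairs(weeks, theta1=2, theta2=2.0)
print(salient)
print(peak_words(salient))
```

In the toy data each pair appears in every article of its own week but only half of the whole corpus, so its time relevance is 1.0 / 0.5 = 2.0.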

18

Salient story stages, and story evolution

7. Story stage = co-occurrences of peak words in t

For each week t: aggregate over t-2, t-1, t (moving average)

8. Story evolution = how story stages evolve over the t in T
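The three-week aggregation can be sketched as a trailing moving average over per-week pair weights. The data layout (a dict of weeks, each mapping a pair to its support) and the choice to count missing weeks or pairs as zero are assumptions for this example.

```python
def moving_average_stage(weekly_support, week, span=3):
    """Average each pair's support over the weeks week-span+1 .. week."""
    window = [weekly_support.get(w, {}) for w in range(week - span + 1, week + 1)]
    pairs = set().union(*window)  # every pair seen in any week of the window
    return {p: sum(s.get(p, 0.0) for s in window) / span for p in pairs}


weekly_support = {
    17: {("murat", "suspect"): 0.6},
    18: {("murat", "suspect"): 0.3, ("mother", "suspect"): 0.3},
    19: {("mother", "suspect"): 0.9},
}
print(moving_average_stage(weekly_support, 19))
```

Story evolution is then the sequence of these smoothed stages over all weeks t in T.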

19

Story stages: Example result

<week 17>

<show sliders>

20

Story evolution: result

<morphAll.py>

21

... the story is lost if we go back to single entities

Robert Murat – suspect

Kate McCann (the mother) – suspect

22

Future work

“beyond words” (e.g., semantics)

Web communities

Michael Barber's talk

Gabriel Ruget's qualia?!

23

Story evolution: authors change (and stories with them)

24

Multi-authored texts

http://en.wikipedia.org/wiki/Madeleine_McCann

25

Who authored?

26

Visualizing conflict – example: “edit wars”

Viégas, Wattenberg, & Dave, 2004

27

The bone of contention ...

28

Story evolution: communities of authors develop parallel stories

29

Basic data for Web structure mining: hyperlinks and textual references

30

Example: Political blogs in the US

Adamic & Glance, 2005 (visualization modified)

Two views: all links; thresholded (link occurrence ≥ 25)

blue: liberal; red: conservative

Gabriel Ruget's “publics”
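Thresholding a link graph by link occurrence can be sketched as below. Whether links are counted per directed (source, target) pair, as done here, or per undirected pair is an assumption, as are the function name and the blog names.

```python
from collections import Counter


def thresholded_edges(links, min_count=25):
    """Keep only (source, target) edges observed at least min_count times.

    links: iterable of (source_blog, target_blog) tuples, one per hyperlink.
    """
    counts = Counter(links)
    return {edge: n for edge, n in counts.items() if n >= min_count}


links = [("blog_a", "blog_b")] * 30 + [("blog_a", "blog_c")] * 3
print(thresholded_edges(links))
```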

31

Example: Blogs sourcing mainstream media

Hyperlinks from blogs to mainstream news media: Germany vs. USA

[Berendt, Schlegel, & Koch, in Kommunikation, Partizipation und Wirkungen im Social Web, 2008]

32

The German and the US blogospheres

Data reported in [Berendt, Schlegel, & Koch, 2008]

33

Example: The politics of sourcing – what do blog posts on global warming refer to?

Walejko & Ksiazek, in press

34

Story evolution: communities of authors change

35

Who authored? (revisited)

36

Tracing anonymous edits

37

Why?

38

Story evolution: reading behaviour changes

39

The story unfolds – query analysis may reveal more than text analysis

40

Reading may “predate” writing

41

Request frequency for a specific diagnosis in the investigated eHealth portal, depending on time and request language

Which diagnosis is that?

[Yihune, 2003; see also Heino & Toivonen, 2003]

42

My story has reached its end

46

My story has reached its end

is our discussion's beginning!

47

References

Adamic, L., & Glance, N. (2005). The political blogosphere and the 2004 U.S. Election: Divided they blog. In Proc. of the 3rd Int. Worksh. on Link Discovery at ACM SIGKDD (pp. 36–44).

Berendt, B., Schlegel, M., & Koch, R. (2008). Die deutschsprachige Blogosphäre: Reifegrad, Politisierung, Themen und Bezug zu Nachrichtenmedien [The German-speaking blogosphere: Maturity, political focus, and relation to news media]. To appear in A. Zerfaß, M. Welker, & J. Schmidt (Eds.), Kommunikation, Partizipation und Wirkungen im Social Web (Band 2: Strategien und Anwendungen: Perspektiven für Wirtschaft, Politik, Publizistik) [Communication, Participation and Effects in the Social Web (Vol. 2: Strategies and Applications: Perspectives for the Economy, Politics, and Journalism)] (pp. 72–96). Köln, Germany: Herbert von Halem Verlag.

Berendt, B. & Subašić, I. (in press). Identifying, measuring and visualizing the evolution of a story: A Web mining approach. To appear in Proc. COLLNET 2008 (Fourth International Conference on Webometrics, Informetrics and Scientometrics & Ninth COLLNET Meeting). Berlin, July/August 2008.

Griffith, V. (2007). WikiScanner: List anonymous wikipedia edits from interesting organizations. http://wikiscanner.virgil.gr

Heino, J. & Toivonen, H. (2003). Automated Detection of Epidemics from the Usage Logs of a Physicians' Reference Database. In Proc. PKDD 2003. http://www.springerlink.com/content/g8h9f8y2fd3xq7ft/

Viégas, F.B., Wattenberg, M., & Dave, K. (2004). Studying Cooperation and Conflict between Authors with history flow Visualizations. In Proc. CHI 2004 (pp. 575-582).

Walejko, G. & Ksiazek, T. (in press). The Politics of Sourcing: A Study of Journalistic Practices in the Blogosphere. To appear in Proc. of the Second International Conference on Weblogs and Social Media (ICWSM 2008). Seattle, March/April 2008. http://www.icwsm.org/2008

Yihune, G. (2003). Evaluation eines medizinischen Informationssystems im World Wide Web. Nutzungsanalyse am Beispiel www.dermis.net. Dissertation. Medizinische Fakultät der Ruprecht-Karls-Universität Heidelberg.

48

Backup Slides

49

(Some) further work in text processing

50

Improving on words and weights

51

Stemming

Want to reduce all morphological variants of a word to a single index term

e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document)

Stemming - reduce words to their root form

e.g. fish – becomes a new index term

Porter stemming algorithm (1980)

relies on a preconstructed suffix list with associated rules

e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE

– BINARIZATION => BINARIZE
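The single rule above can be sketched in code. This is one rule only, not the full Porter stemmer (whose later steps would strip further suffixes); the condition is implemented via Porter's measure m, the number of vowel-run/consonant-run sequences in the stem, and treating "y" always as a consonant is a simplifying assumption. Function names are hypothetical.

```python
import re


def measure(stem: str) -> int:
    """Porter's measure m: how many vowel-run + consonant-run (VC) sequences."""
    # Simplification: 'y' is always treated as a consonant here.
    pattern = "".join("V" if ch in "aeiou" else "C" for ch in stem.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)  # e.g. "CVCCVC" -> "CVCVC"
    return collapsed.count("VC")


def apply_ization_rule(word: str) -> str:
    """(m > 0) IZATION -> IZE; every other word is returned unchanged."""
    suffix = "ization"
    if word.lower().endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 0:
            return stem + ("IZE" if word.isupper() else "ize")
    return word


print(apply_ization_rule("BINARIZATION"))  # BINARIZE
print(apply_ization_rule("organization"))  # organize
print(apply_ization_rule("fishing"))       # fishing (rule does not apply)
```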

52

Inverse document frequency (IDF)

A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents

n_j: number of documents that contain term j

n: total number of documents in the set

Inverse document frequency

53

Full Weighting (TF-IDF)

The TF-IDF weight of a term j in document di is
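A minimal sketch of both weights, assuming the common textbook forms idf_j = log(n / n_j) and w_ij = tf_ij · idf_j, where tf_ij is the raw count of term j in document d_i. The exact variant intended here (log base, smoothing, TF normalization) is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def idf(docs):
    """idf_j = log(n / n_j) for every term j that occurs in the collection."""
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n / n_j) for term, n_j in doc_freq.items()}


def tf_idf(docs):
    """One dict per document: w_ij = tf_ij * idf_j (raw term counts as TF)."""
    idf_weights = idf(docs)
    return [
        {term: tf * idf_weights[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]


docs = [["police", "suspect", "suspect"], ["police", "mother"]]
for weights in tf_idf(docs):
    print(weights)
```

In the toy collection "police" occurs in every document, so its weight is zero, which is the discriminator effect described on the IDF slide.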

54

Beyond words

55

N-grams and Named-Entity Recognition

Madeleine, Madeleine McCann, Maddie, Maddy, ... → MADELEINE_MCCANN
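Mapping name variants to one canonical entity can be sketched with a hand-made alias table. This is only a lookup illustration, not a named-entity recognizer; the alias table, the function name, and the longest-match-first policy are assumptions for the example.

```python
import re

# Hand-made alias table (illustrative; a real system would learn or curate it).
ALIASES = {
    "madeleine mccann": "MADELEINE_MCCANN",
    "madeleine": "MADELEINE_MCCANN",
    "maddie": "MADELEINE_MCCANN",
    "maddy": "MADELEINE_MCCANN",
}


def normalize_entities(text, aliases=ALIASES):
    """Replace every alias with its canonical entity label."""
    # Longest alias first, so "madeleine mccann" wins over "madeleine".
    alternatives = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: aliases[m.group(1).lower()], text)


print(normalize_entities("Maddie, or Madeleine McCann, was last seen in May."))
```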

56

Semantics (e.g., word-sense disambiguation)

57

The need for word sense disambiguation

“She sat by the bank and looked sentimentally at the last fish.”

“She sat by the bank and looked sentimentally at the last coins.”

58

WordNet semantic relations

59

Web mining for analyzing multiple perspectives:

[Fortuna, Galleguillos, & Cristianini, in press]

What characterizes different news sources?

Nearest neighbour / best reciprocal hit for document matching; Kernel Canonical Correlation Analysis and vector operations for finding topics and characteristic keywords
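Best-reciprocal-hit matching can be sketched generically: a document from source A and one from source B are paired only if each is the other's nearest neighbour. The bag-of-words cosine similarity used here is an assumption and not necessarily the representation used in the cited work; the Kernel Canonical Correlation Analysis step is not shown, and all names are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_reciprocal_hits(docs_a, docs_b):
    """Index pairs (i, j) where docs_a[i] and docs_b[j] pick each other."""
    vec_a = [Counter(d) for d in docs_a]
    vec_b = [Counter(d) for d in docs_b]
    if not vec_a or not vec_b:
        return []
    # Nearest neighbour in B for every document in A, and vice versa.
    best_for_a = [max(range(len(vec_b)), key=lambda j: cosine(va, vec_b[j]))
                  for va in vec_a]
    best_for_b = [max(range(len(vec_a)), key=lambda i: cosine(vec_a[i], vb))
                  for vb in vec_b]
    return [(i, j) for i, j in enumerate(best_for_a) if best_for_b[j] == i]


source_a = [["police", "suspect", "murat"], ["weather", "rain"]]
source_b = [["rain", "weather", "storm"], ["suspect", "murat", "police", "villa"]]
print(best_reciprocal_hits(source_a, source_b))
```

Requiring the hit to be reciprocal filters out one-sided matches, for example a story that one source covered and the other did not.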

60

Syntactic analysis

From simple part-of-speech tagging to full-scale NLP parsing
