1
The evolution of a story in a network – a Web mining perspective
Bettina Berendt
www.cs.kuleuven.be/~berendt
2
About me: My public (and mine-able) profile
: Information Systems: Computer Science / Cognitive Science: Artificial Intelligence: Business Science: Economics
: Computer Science
3
Story evolution: texts change (T)
Story evolution: authors change (U)
Web mining: Text, Link structure, Usage
Agenda
Story evolution: communities of authors change (L)
Story evolution: reading behaviour changes (U)
4
Web mining
5
Information retrieval and data mining
What's in this list?
How is it ordered?
6
Information retrieval and data mining
7
Data mining & Web mining
Knowledge discovery (aka Data mining):
“the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” 1
Web mining: the application of data mining techniques on the content, (hyperlink) structure, and usage of Web resources. Web mining areas:
Web content mining
Web structure mining
Web usage mining
1 Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press
Navigation, queries, content access & creation
Simple, bipartite, tripartite, ... graphs
Texts, pictures, sounds, ...
8
Story evolution: texts change
– joint work with Ilija Subašić, 2008 –
* All references are given on slide no. 47
9
Dynamic Web content
10
A story begins
http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/05/22/nmaddy122.xml
11
The story unfolds
12
The story unfolds – new actors enter the stage (and old ones change their roles)
13
Basic idea: A story is about relational statements; story stages are expressed by co-occurrences
Robert Murat – suspect
Kate McCann (the mother) – suspect
Gabriel Ruget's talk
14
Data collection and preprocessing
Articles from Google News 05/2007 – 11/2007 for search term “madeleine mccann“
(there was a Google problem in the December archive)
Only English-language articles
For each month, the first 100 hits
Of these, all that were freely available: 477 documents
Preprocessing: HTML cleaning
tokenization
stopword removal
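The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the regular expressions, the tiny stopword list, and the function names are all assumptions made for this example.

```python
import html
import re

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "was", "to", "by", "as"}

def clean_html(raw: str) -> str:
    """Drop script/style blocks and all tags, then decode HTML entities."""
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1>", " ", raw)
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return html.unescape(no_tags)

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def preprocess(raw: str) -> list[str]:
    """HTML cleaning, then tokenization, then stopword removal."""
    return remove_stopwords(tokenize(clean_html(raw)))

print(preprocess("<p>The mother was named as a <b>suspect</b>.</p>"))
# prints ['mother', 'named', 'suspect']
```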
15
Story elements
content-bearing words
the 150 top-TF words without stopwords
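A sketch of selecting the top-TF words, assuming TF here means the raw term count summed over the whole corpus and that documents are already tokenized. The function name and the toy documents are hypothetical.

```python
from collections import Counter

def top_tf_words(docs, stopwords, k=150):
    """Return the k most frequent non-stopword tokens across all documents."""
    counts = Counter(tok for doc in docs for tok in doc if tok not in stopwords)
    return [word for word, _ in counts.most_common(k)]

# Toy corpus (hypothetical), already tokenized
docs = [
    ["the", "mother", "is", "a", "suspect"],
    ["the", "mother", "a", "suspect"],
    ["the", "police", "question", "the", "mother"],
]
print(top_tf_words(docs, stopwords={"the", "is", "a"}, k=2))
# prints ['mother', 'suspect']
```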
16
Story stages: co-occurrence in a window
“mother“ and “suspect“ co-occur
• in a window of size ≥ 6 (all words)
• in a window of size ≥ 2 (non-stopwords only)
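A sketch of the window test, assuming a window of size s means a span of s consecutive tokens. The example sentence and the stopword set are hypothetical, chosen so that the two words are five positions apart in the full token list and adjacent once stopwords are removed.

```python
def cooccur_in_window(tokens, w1, w2, size):
    """True if w1 and w2 both fall inside some span of `size` consecutive tokens."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    # Two positions fit into one window of `size` tokens iff they differ by < size.
    return any(abs(i - j) < size for i in pos1 for j in pos2)

# Hypothetical sentence and stopword list
tokens = "the mother is now also a suspect".split()
stopwords = {"the", "is", "now", "also", "a"}
content = [t for t in tokens if t not in stopwords]

print(cooccur_in_window(tokens, "mother", "suspect", 6))   # prints True
print(cooccur_in_window(tokens, "mother", "suspect", 5))   # prints False
print(cooccur_in_window(content, "mother", "suspect", 2))  # prints True
```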
17
Salient story elements
1. Split whole corpus T by week (week 17, from 30 Apr, until week 44, from 12 Nov)
2. For each week
Compute the weights for corpus t for this week
3. Weight =
Support of co-occurrence of 2 content-bearing words w1, w2 in t =
(# articles from t containing both w1, w2 in window) / (# all articles in t)
4. Threshold
Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
Time-relevance TR of co-occurrence(w1, w2) =
(support(co-occurrence(w1, w2)) in t) / (support(co-occurrence(w1, w2)) in T)
TR ≥ θ2 (e.g., 2) *
5. Rank by TR, for each week identify top 2
6. Story elements = peak words = all elements of these top 2 pairs (# = 38)
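Steps 3 to 6 can be sketched as below. This is an illustration under assumptions, not the authors' code: each article is represented by the collection of word pairs that co-occur in it within the window, θ1 is applied to the number of articles of the week containing the pair, and the week's articles are assumed to be part of the whole corpus T (so the support in T is never zero). All function names and the toy data are hypothetical.

```python
from collections import Counter

def pair_counts(articles):
    """For each pair, count the articles in which it co-occurs."""
    counts = Counter()
    for pairs in articles:
        counts.update(set(pairs))
    return counts

def salient_pairs(week_articles, all_articles, theta1=5, theta2=2.0, top=2):
    """Return up to `top` (TR, pair) tuples for one week, highest TR first."""
    week_counts = pair_counts(week_articles)
    all_counts = pair_counts(all_articles)
    scored = []
    for pair, n_week in week_counts.items():
        if n_week < theta1:                       # threshold theta1
            continue
        support_t = n_week / len(week_articles)   # support in week t
        support_T = all_counts[pair] / len(all_articles)  # support in corpus T
        tr = support_t / support_T                # time relevance
        if tr >= theta2:                          # threshold theta2
            scored.append((tr, pair))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top]

def peak_words(scored):
    """Story elements: all words that occur in the selected pairs."""
    return {word for _, pair in scored for word in pair}

# Toy data (hypothetical): 2 articles this week, 8 in the whole corpus
week = [[("mother", "suspect")], [("mother", "suspect"), ("police", "search")]]
corpus = week + [[("police", "search")]] * 6
result = salient_pairs(week, corpus, theta1=2)
print(result)              # prints [(4.0, ('mother', 'suspect'))]
print(peak_words(result))
```

In the toy data the pair (mother, suspect) has support 1.0 in the week and 0.25 in the corpus, giving TR = 4, while (police, search) is common throughout and is filtered out.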
18
Salient story stages, and story evolution
7. Story stage = co-occurrences of peak words in t
For each week t: aggregate over t-2, t-1, t (moving average)
8. Story evolution = how story stages evolve over the t in T
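A sketch of the three-week aggregation, assuming the aggregate is an arithmetic mean and that the first two weeks are averaged over the weeks available so far. The slide specifies neither, and the function name and toy values are hypothetical.

```python
def moving_average(weekly, window=3):
    """Average each week's value with up to window-1 preceding weeks."""
    smoothed = []
    for i in range(len(weekly)):
        chunk = weekly[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Toy weekly counts for one co-occurrence pair (hypothetical)
print(moving_average([0, 3, 6, 0]))
# prints [0.0, 1.5, 3.0, 3.0]
```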
19
Story stages: Example result
<week 17>
<show sliders>
20
Story evolution: result
<morphAll.py>
21
... the story is lost if we go back to single entities
Robert Murat – suspect
Kate McCann (the mother) – suspect
22
Future work
“beyond words“ (e.g., semantics)
Web communities
Michael Barber's talk
Gabriel Ruget's qualia?!
23
Story evolution: authors change (and stories with them)
24
Multi-authored texts
http://en.wikipedia.org/wiki/Madeleine_McCann
25
Who authored?
26
Visualizing conflict – example “edit wars“
Viégas, Wattenberg, & Dave, 2004
27
The bone of contention ...
28
Story evolution: communities of authors develop parallel stories
29
Basic data for Web structure mining: hyperlinks and textual references
30
Example: Political blogs in the US
Adamic & Glance, 2005 (visualization modified)
Panels: all links; thresholded (link occurrence ≥ 25)
blue: liberal; red: conservative
Gabriel Ruget's “publics“
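One possible reading of the thresholded view is sketched below: count how often each (source, target) link occurs and keep only edges that reach the threshold. This is an assumption about the filtering, not a description of the cited study's procedure, and the blog names are hypothetical.

```python
from collections import Counter

def threshold_links(links, min_count=25):
    """Keep (source, target) edges that occur at least min_count times."""
    counts = Counter(links)
    return {edge: c for edge, c in counts.items() if c >= min_count}

# Hypothetical link list: one frequent edge, one rare edge
links = [("blog_a", "blog_b")] * 30 + [("blog_a", "blog_c")] * 3
print(threshold_links(links))
# prints {('blog_a', 'blog_b'): 30}
```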
31
Example: Blogs sourcing mainstream media
Hyperlinks from blogs to mainstream news media (Germany, USA)
[Berendt, Schlegel, & Koch, in Kommunikation, Partizipation und Wirkungen im Social Web, 2008]
32
The German and the US blogospheres
Data reported in [Berendt, Schlegel, & Koch, 2008]
33
Example: The politics of sourcing – what do blogposts on global warming refer to?
Walejko & Ksiazek, in press
34
Story evolution: communities of authors change
35
Who authored? (revisited)
36
Tracing anonymous edits
37
Why?
38
Story evolution: reading behaviour changes
39
The story unfolds – query analysis may reveal more than text analysis
40
Reading may “predate“ writing
41
Request frequency for a specific diagnosis in the investigated eHealth portal, depending on time and request language
Which diagnosis is that?
[Yihune, 2003; see also Heino & Toivonen, 2003]
42
My story has reached its end
is our discussion's beginning!
47
References
Adamic, L., & Glance, N. (2005). The political blogosphere and the 2004 U.S. Election: Divided they blog. In Proc. of the 3rd Int. Worksh. on Link Discovery at ACM SIGKDD (pp. 36–44).
Berendt, B., Schlegel, M., & Koch, R. (2008). Die deutschsprachige Blogosphäre: Reifegrad, Politisierung, Themen und Bezug zu Nachrichtenmedien [The German-speaking blogosphere: Maturity, political focus, and relation to news media]. To appear in A. Zerfaß, M. Welker, & J. Schmidt (Eds.), Kommunikation, Partizipation und Wirkungen im Social Web (Band 2: Strategien und Anwendungen: Perspektiven für Wirtschaft, Politik, Publizistik) [Communication, Participation and Effects in the Social Web (Vol. 2: Strategies and Applications: Perspectives for the Economy, Politics, and Journalism)] (pp. 72-96). Köln, Germany: Herbert von Halem Verlag.
Berendt, B. & Subašić, I. (in press). Identifying, measuring and visualizing the evolution of a story: A Web mining approach. To appear in Proc. COLLNET 2008 (Fourth International Conference on Webometrics, Informetrics and Scientometrics & Ninth COLLNET Meeting). Berlin, July/August 2008.
Griffith, V. (2007). WikiScanner: List anonymous wikipedia edits from interesting organizations. http://wikiscanner.virgil.gr
Heino, J. & Toivonen, H. (2003). Automated Detection of Epidemics from the Usage Logs of a Physicians' Reference Database. In Proc. PKDD 2003. http://www.springerlink.com/content/g8h9f8y2fd3xq7ft/
Viégas, F.B., Wattenberg, M., & Dave, K. (2004). Studying Cooperation and Conflict between Authors with history flow Visualizations. In Proc. CHI 2004 (pp. 575-582).
Walejko, G. & Ksiazek, T. (in press). The Politics of Sourcing: A Study of Journalistic Practices in the Blogosphere. To appear in Proc. of the Second International Conference on Weblogs and Social Media (ICWSM 2008). Seattle, March/April 2008. http://www.icwsm.org/2008
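Yihune, G. (2003). Evaluation eines medizinischen Informationssystems im World Wide Web. Nutzungsanalyse am Beispiel www.dermis.net [Evaluation of a medical information system on the World Wide Web: Usage analysis using www.dermis.net as an example]. Dissertation. Medizinische Fakultät der Ruprecht-Karls-Universität Heidelberg.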
48
Backup Slides
49
(Some) further work in text processing
50
Improving on words and weights
51
Stemming
Want to reduce all morphological variants of a word to a single index term
e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document)
Stemming - reduce words to their root form
e.g. fish – becomes a new index term
Porter stemming algorithm (1980)
relies on a preconstructed suffix list with associated rules
e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE
– BINARIZATION => BINARIZE
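The single rule quoted above can be sketched as follows. This is a simplified illustration of that one rule only, not the full Porter algorithm: it treats only a, e, i, o, u as vowels (the real algorithm also handles y and uses a more general "measure" condition), it works on lowercased input, and the function name is hypothetical.

```python
import re

# "At least one vowel followed by a consonant" somewhere in the stem
_VOWEL_CONSONANT = re.compile(r"[aeiou][^aeiou]")

def apply_ization_rule(word: str) -> str:
    """Rewrite the suffix -ization to -ize if the remaining stem qualifies."""
    lower = word.lower()
    if lower.endswith("ization"):
        stem = lower[: -len("ization")]
        if _VOWEL_CONSONANT.search(stem):
            return stem + "ize"
    return lower

print(apply_ization_rule("BINARIZATION"))  # prints binarize
print(apply_ization_rule("fishing"))       # prints fishing (rule does not apply)
```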
52
Inverse document frequency (IDF)
A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents
nj - Number of documents which contain the term j
n - total number of documents in the set
Inverse document frequency, commonly defined as $\mathrm{idf}_j = \log(n / n_j)$
53
Full Weighting (TF-IDF)
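The TF-IDF weight of a term j in document di is commonly defined as $w_{ij} = \mathrm{tf}_{ij} \cdot \log(n / n_j)$, where $\mathrm{tf}_{ij}$ is the frequency of term j in document di, n is the total number of documents, and nj is the number of documents containing term j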
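A minimal sketch of the weighting, assuming raw term counts for TF and the common logarithmic form log(n / n_j) for IDF with the natural logarithm. Other variants (normalized TF, smoothed IDF) exist; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(n / n_j)."""
    n = len(docs)
    # n_j: number of documents that contain term j
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / doc_freq[t]) for t in tf})
    return weights

# Toy corpus (hypothetical), already tokenized
docs = [["fish", "fish", "bank"], ["bank", "coins"]]
weights = tf_idf(docs)
print(weights[0]["bank"])  # prints 0.0, since "bank" occurs in every document
print(weights[0]["fish"])  # 2 * log(2)
```

A term that appears in every document gets weight 0, which matches the idea that it does not discriminate between documents.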
54
Beyond words
55
N-grams and Named-Entity Recognition
Madeleine, Madeleine McCann, Maddie, Maddy, ... → MADELEINE_MCCANN
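The mapping of name variants onto one entity label can be sketched with a simple alias dictionary. This is a lookup-based illustration only, not a named-entity recognizer; the alias list is taken from the slide, while the function name and the example sentence are hypothetical.

```python
import re

ALIASES = ["Madeleine McCann", "Madeleine", "Maddie", "Maddy"]

def normalize_entity(text, aliases, label):
    """Replace every alias (case-insensitive, whole words) with one label."""
    # Longest alias first, so "Madeleine McCann" is matched as one unit.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(label, text)

print(normalize_entity("Maddie, or Madeleine McCann, is missing",
                       ALIASES, "MADELEINE_MCCANN"))
# prints MADELEINE_MCCANN, or MADELEINE_MCCANN, is missing
```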
56
Semantics (e.g., word-sense disambiguation)
57
The need for word sense disambiguation
“She sat by the bank and looked sentimentally at the last fish.“
“She sat by the bank and looked sentimentally at the last coins.“
58
WordNet semantic relations
59
Web mining for analyzing multiple perspectives:
[Fortuna, Galleguillos, & Cristianini, in press]
What characterizes different news sources?
Nearest neighbour / best reciprocal hit for document matching; Kernel Canonical Correlation Analysis and vector operations for finding topics and characteristic keywords
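Best-reciprocal-hit matching can be sketched as follows, assuming a precomputed similarity matrix between the documents of two news sources. Two documents are paired only if each is the other's nearest neighbour. How similarities are computed is left open here, and the function name and toy matrix are hypothetical.

```python
def best_reciprocal_hits(sim):
    """sim[i][j] = similarity of document i (source A) and document j (source B).

    Returns all (i, j) where j is the best match for i and i is the best match for j.
    """
    if not sim or not sim[0]:
        return []
    n_a, n_b = len(sim), len(sim[0])
    best_for_a = [max(range(n_b), key=lambda j, i=i: sim[i][j]) for i in range(n_a)]
    best_for_b = [max(range(n_a), key=lambda i, j=j: sim[i][j]) for j in range(n_b)]
    return [(i, j) for i, j in enumerate(best_for_a) if best_for_b[j] == i]

# Toy similarity matrix (hypothetical): 2 documents in A, 3 in B
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.7],
]
print(best_reciprocal_hits(sim))
# prints [(0, 0)]
```

Document 1 of source A also prefers document 0 of source B, but that document's own best match is document 0 of A, so only the pair (0, 0) is reciprocal.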
60
Syntactic analysis
From simple part-of-speech tagging to full-scale NLP parsing