Where does this new information belong? From developing mining algorithms to supporting knowledge discovery Bettina Berendt – thanks for joint work with

Where does this new information belong?

From developing mining algorithms

to supporting knowledge discovery

Bettina Berendt – thanks for joint work with and support from

Ilija Subašić, Mathias Verbeke, Siegfried Nijssen, Luc De Raedt

K.U. Leuven

Yes we can! The problem

The solution? Automatic topic detection

Period 1 Period 2 Period 3 Period 4

Healthcare agenda

Green energy plan

Opposition to healthcare reform

Another healthcare vote

Climate agenda

A healthcare vote

Peace Nobel Prize

Copenhagen climate summit

Topic word weights: Health 0.017, Care 0.015, Insurance 0.013, American 0.013, Uninsured 0.009, Families 0.008, Working 0.005

Same event/document; different interpretations & categorisations

Visionary president

Party-politics (right and left)

Obama's overall agenda

Damp-rag president

Rhetoric

Similar problems in science and learning

Topic detection in time-indexed corpora of news texts

Conference programme!

Text mining

Stream mining

Media studies

Music collections, multimedia collections: see Andreas Nürnberger's talk at SML 2010

Similar problems in other areas

The solution? Context-aware systems / personalisation

Female

Has problems with anger management

You probably do / should think about it this way: ...

Political activist

is (nearly) green

What users want

left right

squares / circles

green / not green

... to structure the world how they see it

... to re-use their categories (that they worked so hard to find)

... to be able to see through their eyes

interactivity

... to acknowledge that others see the world differently

semantics

Social similarity / diversity

perspective-taking

... to provide data mining methods to do all that!

Research agenda

interactivity

semantics


perspective-taking


automatic topic detection

support sense-making

= provide methods / tools for Knowledge Discovery (in the full sense)

The problem

Our solution approach


STORIES: functionality basics

STORIES: mining basics (1)

Graphical summarisation of multiple text documents

Document / text pre-processing
• Template recognition
• Multi-document named entities
• Stopword removal, lemmatization
• “Fact (assertion) recognition”

Document summarization strategy
• No topics, but salient concepts & relations
• Time window; word-span window

Selection approach for concepts
• Concepts = words or named entities
• Salient concept = high TF & involved in a salient relation, time-indexed

Similarity measure to determine salient relations
• Bursty co-occurrence

Burstiness measure
• Time relevance, a “temporal co-occurrence lift”
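One possible reading of a "temporal co-occurrence lift" is sketched below: the rate at which two concepts co-occur in one period, divided by their co-occurrence rate in the whole corpus. This is an illustrative assumption rather than the exact STORIES formula. The function names are hypothetical, and co-occurrence is counted per document instead of within a word-span window.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each unordered concept pair, the documents containing both."""
    counts = Counter()
    for tokens in docs:
        concepts = sorted(set(tokens))  # sorted so each pair has one canonical order
        counts.update(combinations(concepts, 2))
    return counts


def temporal_lift(period_docs, corpus_docs):
    """Co-occurrence rate in the period divided by the rate in the whole corpus.

    Assumes period_docs is a subset of corpus_docs (hypothetical set-up).
    A value above 1 suggests the pair is bursty in this period.
    """
    period = cooccurrence_counts(period_docs)
    corpus = cooccurrence_counts(corpus_docs)
    lift = {}
    for pair, n_period in period.items():
        if corpus[pair] == 0:
            continue  # pair unseen in the corpus: period is not a subset, skip
        rate_period = n_period / len(period_docs)
        rate_corpus = corpus[pair] / len(corpus_docs)
        lift[pair] = rate_period / rate_corpus
    return lift


if __name__ == "__main__":
    # Toy, made-up documents, already reduced to lists of concepts.
    period_1 = [["health", "care", "vote"], ["health", "care"]]
    period_2 = [["climate", "summit"], ["health", "care"]]
    for pair, value in sorted(temporal_lift(period_1, period_1 + period_2).items()):
        print(pair, round(value, 2))
```

Pairs whose lift exceeds a chosen threshold would then become the edges of the story graph for that period. The threshold is a free parameter in this sketch.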

STORIES: mining basics (2)

Graph analysis for query recommendation

Aim: highlight subgraphs that represent an event

Topological properties

Change: subgraph new in this period

STORIES: evaluation

1. Information retrieval quality
• Edges – events: up to 80% recall, ca. 30% precision

2. Search quality
• Subgraphs index coherent document clusters

3. Learning effectiveness
• Document search with story graphs leads to averages of 67-75% accuracy on judgments of story fact truth
• On average, 1.3-4.7 queries with 3.4-5.2 nodes/words per query

4. Comparison with other temporal text mining methods
• New (and only) framework for cross-method comparison
• Recall- & precision-style metrics → different method rankings

Damilicious: functionality basics

Apply my grouping rfid (Security/privacy, Group 2, ...) to the following new search result:

* Show users and how similarly they group
* Apply U4's grouping to my new search result:

Damilicious: mining basics (1)

Methods and process

1. Query
2. Automatic clustering
3. Manual regrouping
4. Re-use
   1. Learn classifier & present way(s) of grouping
   2. Transfer the constructed concepts

Features/methods for the conceptual/predictive clustering: Lingo phrases, Lingo clustering, Ripper; co-citation, bibliometric coupling, word or LSA similarity, combinations; k-means, hierarchical
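The re-use step (learn a classifier from a user's grouping, then transfer it to a new search result) can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-centroid classifier over bag-of-words vectors, not Ripper or the Lingo features named above. All function names and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def learn_grouping(labelled_docs):
    """Build one summed term vector per group from (text, group_label) pairs."""
    centroids = defaultdict(Counter)
    for text, label in labelled_docs:
        centroids[label].update(bag_of_words(text))
    return dict(centroids)


def apply_grouping(centroids, new_docs):
    """Assign each new document to the group whose centroid is most similar."""
    return [
        max(centroids, key=lambda g: cosine(centroids[g], bag_of_words(doc)))
        for doc in new_docs
    ]


if __name__ == "__main__":
    # A user's manual regrouping of an earlier search result (made-up data).
    my_grouping = learn_grouping([
        ("rfid privacy tracking risks", "Security/privacy"),
        ("rfid tag antenna design", "Group 2"),
    ])
    new_result = ["privacy risks of tracking tags", "antenna design for tags"]
    print(apply_grouping(my_grouping, new_result))
```

Because the learned model is just a set of per-group term vectors, another user's grouping can be applied to one's own search result in the same way, which is the transfer idea of step 4.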

Damilicious: mining basics (2)

Measures of grouping and user diversity

• “How similarly do two users group documents?”
• For each query q, consider their groupings gr:
• For several queries: aggregate

Diversity = 1 - similarity = 1 - normalized mutual information (NMI, an entropy-based measure)

NMI = 0
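A minimal sketch of the diversity measure follows. It assumes NMI is normalised by the geometric mean of the two entropies and that per-query diversities are aggregated by a plain average. The slides state neither choice, so both are assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy of a grouping, given one group label per document."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two groupings of the same documents."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log((c * n) / (pa[x] * pb[y])) for (x, y), c in pab.items())


def nmi(a, b):
    """NMI normalised by sqrt(H(a) * H(b)); this normalisation is an assumption."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        # A one-group grouping carries no information; treat two such
        # groupings as identical and any other combination as unrelated.
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / sqrt(ha * hb)


def user_diversity(groupings_u, groupings_v):
    """Mean of 1 - NMI over the queries both users have grouped."""
    shared = sorted(set(groupings_u) & set(groupings_v))
    if not shared:
        raise ValueError("users share no query")
    return sum(1 - nmi(groupings_u[q], groupings_v[q]) for q in shared) / len(shared)


if __name__ == "__main__":
    # Group labels for the same four documents of a query "rfid" (made-up data).
    u1 = {"rfid": [0, 0, 1, 1]}
    u2 = {"rfid": [1, 1, 0, 0]}  # same partition, different group names
    u3 = {"rfid": [0, 1, 0, 1]}  # statistically independent partition
    print(user_diversity(u1, u2), user_diversity(u1, u3))
```

NMI depends only on the partition, not on group names, so two users who draw the same boundaries under different labels have diversity 0. Groupings that are statistically independent give NMI = 0 and diversity 1.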


Damilicious: evaluation

• Clustering: Does it generate meaningful document groups?
  – Yes (tradition in bibliometrics)
  – But: data?
  – Small expert evaluation of CiteseerCluster

• Choosing the clustering and classification methods for conceptual clustering
  – Experiments: different features, clustering methods, classification methods → quality of reconstruction and extension-over-time (NMI)

• Technology acceptance
  – End-user experiment (clustering & regrouping)
  – 5-person formative user study (transfer of own results)

Conclusions and (some) questions

• Sense-making involves
  – Extracting information from texts
  – Extracting structural information between entities
  – Creating, using and modifying categories
  – Interacting with external representations
  – Acknowledging diversity and perspective-taking
  – ...

KD approach

Text mining

Graph mining

Semantics

Interactivity

Usage mining and “model-processing” (conceptual / predictive clustering)

• Appropriate mining methods, measures, ...?
• More/better evaluation methods and frameworks?
• Use cases?

Thank you!

Questions?

• Subašić, I. & Berendt, B. (2009). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems. DOI 10.1007/s10115-009-0227-x (PDF)

• Berendt, B. & Subašić, I. (2009). STORIES in time: a graph-based interface for news tracking and discovery. In N. Cristianini & M. Turchi (Eds.), Proceedings of Intelligent Analysis and Processing of Web News Content (IAPWNC) at The 2009 IEEE / WIC / ACM International Conferences Web Intelligence (WI'09) / Intelligent Agent Technology (IAT'09). 15 September 2009, Milan, Italy. (Proceedings of WI-IAT 2009, DOI 10.1109/WI-IAT.2009.342, pp. 531-534) (PDF)

• Verbeke, M., Berendt, B., & Nijssen, S. (2009). Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search. In G. Boato & C. Niederee (Eds.), Proceedings of First International Workshop on Living Web, collocated with the 8th International Semantic Web Conference (ISWC-2009), Washington D.C., USA, October 26, 2009. CEUR Workshop Proceedings Vol-515. (PDF)

• Berendt, B. (2010). Diversity in search: what, how, and what for? Talk at Barcelona Media / Yahoo! Research and UPF, 4 March 2010. (PPT)

• Berendt, B., Krause, B., & Kolbe-Nusser, S. (2010). Intelligent scientific authoring tools: Interactive data mining for constructive uses of citation networks. Information Processing & Management, 46(1), 1-10. (PDF)

To Read

http://www.cs.kuleuven.be/~berendt/Papers/subasic_berendt_2009.pdf

http://www.cs.kuleuven.be/~berendt/Papers/berendt_subasic_authorcopy.pdf

http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-515/livingweb2009_paper6.pdf

http://www.cs.kuleuven.be/~berendt/Talks/berendt_2010_03_04.ppt

http://www.cs.kuleuven.be/~berendt/Papers/Berendt_etal_IPM_authorversion.pdf
