24
1 Why and how is this a “related document”?: Semantics-based analysis of and navigation through heterogeneous text corpora Bettina Berendt & Daniel Trümper (KU Leuven / HU Berlin) Blaž Fortuna, Marko Grobelnik & Dunja Mladenič (JSI Ljubljana) www.cs.kuleuven.be/~berendt

1 1 Why and how is this a “related document”?: Semantics-based analysis of and navigation through heterogeneous text corpora Bettina Berendt & Daniel Trümper

Embed Size (px)

Citation preview

1

1

Why and how is this a “related document”?:

Semantics-based analysis of and

navigation through heterogeneous text

corpora

Bettina Berendt & Daniel Trümper(KU Leuven / HU Berlin)

Blaž Fortuna, Marko Grobelnik & Dunja Mladenič (JSI Ljubljana)

www.cs.kuleuven.be/~berendt

2

2ICT Motivation: Global+local interaction; beyond “similar documents“

with respect to what?

3

3

1. News and blogs

Application motivation: Beyond dedicated search engines

(Lloyd et al., Proc. CAAW 2006; Berendt et al., Kommunikation, Partizipation und Wirkungen im Social Web , 2008; Berendt, Fortuna et al., in prep.)

2. Multilingual sources Good results in semi-automatic ontology learning based on simple machine translation

4

4PASCAL motivation: Re-use Textgarden‘s bread&butter and advanced tools

Text to bag-of-words

Ontogen

http://www.textmining.nethttp://ontogen.ijs.si/

5

5Solution vision: PORPOISE – Sailing the Internet

GlobalAnalysis

Search

Localanalysis

6

6

Solution approach: Architecture & states overview

Construct composite-similarity neighbourhood *

SelectDocument *

Aspect-based similarity search*

Build ontology

Selectneighbour- hood *

Search

GlobalAnalysis

Localanalysis

Data / toolExternalTextgarden toolUser actionCreated in this project *

Refocus *

Source doc.s database*Ont. Learning (Ontogen)

Import ontology *

Web

Retrieval & Preprocessing *

Specify sources & filters *

7

7

Retrieval and preprocessing

• Crawler / wrapper * (uses Blogdigger)• Translator * (uses Babelfish)• Preprocessing (Txt2Bow)• NER (GATE)• Similarity Computation *

Web

Source doc.s databaseRetrieval & Preprocessing

8

8Ontology learning (1)

9

9Ontology learning (2)

10

10Ontology learning (3)

11

11Inspection of ontology and instances

12

12Inspection of documents

13

13More on documents

14

14The neighbourhood of a document

15

15Constructing the similarity measure & neighbourhood (I)

16

16Constructing the similarity measure & neighbourhood (II)

17

17Constructing the similarity measure & neighbourhood (III)

A news source

A German-language blog

Most neighbours are blogs

Most neighbours are English-

language blogs

English blog

German blog

English news

18

18Comparing documents

19

19Comparing documents; utilizing multilingual sources

20

20Refocusing

21

21Structuring a neighbourhood

22

22Ex.: Finding a “story“

Evaluation?User studies!

23

23

“Pump-priming“: PORPOISE as catalyst

Using PASCAL softwarefor analyzing

social-media doc.s

Using PASCAL softwarefor analyzing multilingual

social-media doc.s

Analyzingblogs and news

PORPOISE

PORPOISE+:More fine-grained

sailing

STORYGROWTH:Tracking conceptand community

evolutionSupporting

constructive search

DM4E:“More constructive

search“

24

24

Finally ... could I express it better?

Mood:

presentation