
TMS 09 Text Mining Creating Semantics in the Real World

asv.informatik.uni-leipzig.de/media_asset/link/9/tms09_text_mining...


Prof. Dr. Stefan Wrobel

Dr. Gerd Paass, Andreas Schäfer, Dr. Stefan Eickeler

Text Mining: Creating Semantics in the Real World


Fraunhofer IAIS: Intelligent Analysis and Information Systems

250 people: scientists, project engineers, technical and administrative staff, students

Located on Fraunhofer Campus Schloss Birlinghoven/Bonn

Joint research groups and cooperation with

Core research areas:

Machine learning/data mining

Multimedia pattern recognition

Visual Analytics

Process Intelligence

Adaptive robotics

Cooperating objects

Directors: T. Christaller, S. Wrobel (exec.)


Where is all the knowledge we lost with information?

T. S. Eliot

Thomas Stearns Eliot, OM (September 26, 1888 – January 4, 1965)

US-born British poet, dramatist and literary critic

Brainyquote.com


Outline

Why is Text Mining cool?

• Drowning in Data: The Challenge of Meaning

• Text Mining: Creating Meaning from Large Collections

• Text Mining Markets

What can we do with Text Mining in the Real World? Some case studies

• Document classification: eBay, antiPhish

• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web

• Structuring and Monitoring: EmotionRadar

Conclusion


Internet Trends

Convergence

Ubiquitous intelligent systems

Users as producers


Users as producers

Web 2.0, Social Web, Crowdsourcing

Exploding growth of content

Media providers transform from content providers to confidence providers, competing with social communities

Users expect full interactivity and control

Quality control, confidence, choice and searching are becoming central


Drowning in Data …

Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes

Size of digital universe:

2007: 161 exabytes

2010: 998 exabytes

[IDC]


The data iceberg

Structured data (20%): database tables, Excel spreadsheets, other data with fixed structure

Unstructured data (80%): Email, Notes, Word documents, PDF, PowerPoint, other text, images, video, audio


Drowning in Unstructured Data …

Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes

… and need meaning!


Semantics: The need for meaning

Knowledge will be the driving force of business excellence

Quality of services increasingly distinguished by the amount of knowledge they can use

Enormous savings if existing unstructured documents could be used without needing to structure them first

cf. failures of knowledge management!


The challenge of semantics

[Diagram labels: very large set of (electronic) documents; manual structuring; intelligent data and text mining technologies; intelligent service]


Text Mining is cool, since … the entire world works for us!

215,675,903 websites (Netcraft, March 2009)

19 200 000 000 webpages (Yahoo, Aug 2005)

29 700 000 000 webpages (boutell.com, Jan 2007)

Google index: 26 million (1998), 1 billion (2000), 1 trillion (1,000,000,000,000) unique URLs (25 July 2008)

=> perhaps quadrillions of words (images, videos) …

And most of them put together meaningfully (somewhat)!

=> smart algorithms can build on that.


The basic idea

If two words occur frequently in the same context

- page, paragraph, sentence, part-of-speech

Then there must be some semantic relation between them

Add in a lot of statistics, algorithms, intelligence…

AND YOU CAN DO A LOT!

raw material (web, documents, …)

+ correlations and statistics

+ intelligent data mining algorithms

You can create (a bit of) semantics!
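The co-occurrence idea above can be illustrated with a small sketch. It scores word pairs by pointwise mutual information (PMI) over contexts such as sentences. PMI is one common co-occurrence statistic chosen here for illustration; the slides do not name a specific measure, and the function name `pmi_scores` and the toy contexts are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(contexts):
    """Return PMI for every word pair that shares at least one context.

    contexts: list of token lists (for example one list per sentence).
    """
    n = len(contexts)
    word_df = Counter()   # number of contexts containing each word
    pair_df = Counter()   # number of contexts containing both words
    for tokens in contexts:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return {
        (a, b): math.log((c / n) / ((word_df[a] / n) * (word_df[b] / n)))
        for (a, b), c in pair_df.items()
    }

# Toy contexts, invented for illustration only.
contexts = [
    ["bayern", "trainer", "spiel"],
    ["bayern", "trainer"],
    ["kanzler", "partei"],
    ["kanzler", "partei", "spiel"],
]
scores = pmi_scores(contexts)
```

Pairs that co-occur more often than chance predicts get a positive score, which is the "some semantic relation" signal the slide refers to; real systems add smoothing and frequency thresholds on top.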


Automated Clustering [Paass 07]

100 000 documents

<document>

<title>Bayern München verlor Tabellenführung und Elber beim 1 : 1 in Wolfsburg</title>

<text>Ausgerechnet der VfL Wolfsburg hat den FC Bayern München vom Thron der Fußball - Bundesliga gestoßen .

Mit dem 1 : 1 ( 0 : 1 ) gelang den Wolfsburgern am Samstag der erste Punkt im sechsten Spiel gegen den

Deutschen Rekordmeister . Durch das Remis und den gleichzeitigen Sieg von Konkurrent Bayer Leverkusen

beim TSV 1860 München verlor der FC Bayern die Tabellenführung . Carsten Jancker ( 29 . ) hatte die Gäste in

Führung gebracht . Doch vor 20 400 Zuschauern im ausverkauften VfL - Stadion wurden die Bayern für ihre

pomadige Spielweise durch den Wolfsburger Ausgleichstreffer von Andrzej Juskowiak ( 60 . ) bestraft . Zudem

verlor das Team von Trainer Ottmar Hitzfeld auch noch Stürmer Giovane Elber ( 80 . ) . Er sah wegen einer

Tätlichkeit gegen VfL - Abwehrspieler Holger Ballwanz die Rote Karte . Die Bayern gingen ersatzgeschwächt in

die Partie . Vor allem das Fehlen des verletzten Regisseurs Stefan Effenberg und des ebenfalls angeschlagenen

Mehmet Scholl machte sich bemerkbar . Die Wolfsburger mussten weiter auf die Abwehrspieler Claus Thomsen

und Thomas Hengen sowie den gesperrten Waldemar Kryger verzichten . Die Münchener konnten ihre Ausfälle

anfangs besser kompensieren . Aus einer gestärkten Deckung , die vor der Pause nur selten von den

Wolfsburger Stürmern Juskowiak und Jonathan Akpoborie gefordert wurde , kontrollierten die Bayern die

Partie . Mit ihrer Taktik hatten sie nach knapp einer halben Stunde Erfolg : Jancker spitzelte den Ball nach

einem abgefälschten Freistoß von Michael Tarnat ins Tor . Der Brasilianer Paolo Sergio ( 14 . ) hätte sogar schon

früher sein Team in Führung schießen können . Doch traf er aus 14 m nur die Oberkante der Latte des VfL -

Tores . Die Gastgeber besaßen nur eine Möglichkeit in der ersten Halbzeit , als der starke Spielmacher Dorinel

Munteanu ( 37 . ) mit einem Schuss an dem großartig reagierenden Nationaltorhüter Oliver Kahn scheiterte .

Nach dem Wechsel wurden die Wolfsburger mutiger und munterer . Sie übernahmen langsam das Kommando .

Beim Ausgleichstreffer durch Juskowiak half die Bayern - Deckung allerdings mit . Samuel Kuffour verlor den

Ball an den polnischen Nationalspieler , Juskowiak zog sofort ab und ließ dem besten Bayern - Spieler Kahn

keine Chance . Danach bemühten sich die Münchner noch einmal und erhöhten den Druck . Doch klare

Möglichkeiten besaßen sie nicht mehr . In der hektischen Schlussphase verlor Elber die Nerven , so dass die

Bayern Glück hatten , in Unterzahl nicht auch noch zu verlieren .</text>

<dpa_TextEndCode>dpa yyni ce jo</dpa_TextEndCode>

</document>


Unsupervised hierarchical term Clustering: dpa data

Team sport: Spiel Bundesliga Team Trainer Sieg Mannschaft Niederlage Samstag Platz Saison Erfolg Punkte Pokal Nationalspieler …

  Not football: Finale Frankfurt deutsche Meister Hamburg Zuschauer Zuschauern Männer Halle WM Titelverteidiger Final EM …

    Basketball + Boxing: Basketball Berlin Weltmeister Bonn Kampf K Hagen Trier Würzburg LBA Playoff Runde Box Berliner Klitschko Titel …

    Handball: Kiel Handball Magdeburg Flensburg HSG VfL TV THW Tore Bad Wuppertal Lemgo Bundesliga Handewitt …

  Football: FC Trainer Fußball München Spieler Bayern Mannschaft Saison Hertha Stürmer Stadion Spiel SV Dortmund Coach …

    German League: Minute Tor VfL Schiedsrichter Zuschauer Minuten Führung Tore Hansa Eintracht Schalke Bundesliga Karte Wolfsburg …

    European League: Bayern League Champions Fußball UEFA United Hinspiel Cup Manager Vertrag Leeds Club Fans Real Hitzfeld …


Text Mining Market Size

“The text mining market has roughly $50-100 million annual product revenue, and is growing at roughly 40-60% annually.” [Monash 06, texttechnologies.com]

Sounds small …

But then …

• Several research sites devoted to the technology

So the real market must be somewhere else …


The Text Mining Market … is called “Text Analytics”

Primary areas:

Web search, site search

knowledge management, enterprise portals

Information collection, extraction, harvesting

Email handling, security, spam and phishing filtering

Market research

Online advertising

Specialized markets

• litigation, juridical

• Patent search

[cf. Monash/2008]


Application Field Market Research: Germany 1.6 billion, growing

http://www.adm-ev.de/zahlen.html

Both ad-hoc studies and panels can benefit from text mining


Enterprise Search as a text mining market

More than 1.2 billion $ in 2010

Software revenue (million $) by year:

2006: 717

2007: 860

2008: 989

2009: 1108

2010: 1219

[Gartner 2008]


Companies


Outline

Why is Text Mining cool?

• Drowning in Data: The Challenge of Meaning

• Text Mining: Creating Meaning from Large Collections

• Text Mining Markets

What can we do with Text Mining in the Real World? Some case studies

• Document classification: eBay, antiPhish

• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Wikinger, Fraunhofer Web

• Structuring and Monitoring: Semantic Map, EmotionRadar

Conclusion


Text Mining Tasks

Document classification, scoring and/or ranking, isolated retrieval

• Assign a class, score or rank to an entire document

In-collection, linked retrieval and organization

• Find documents in a collection

• Link results to other results

Information and relation extraction

• Extract pieces of information, fill particular relations

Overview and monitoring of collections

• Give summary impression of information in a collection or source


Spotting Faked Offers at Internet Auctions

Techniques to sell fakes

Put faked products on an internet auction platform, e.g. eBay

Describe product as forged or falsified, e.g. “very similar to XXX”

Aspects

Infringement of registered trademarks

Violation of patents

Enormous sales volume


Counter Measures

Use trainable classifiers

Compile training set of genuine and faked internet auction offers

Train classifiers to detect these classes

Use text, format information, etc. as features

Use different classifiers for different brands / products

Apply to new internet auction offers

Ban faked offers from auction

Update classifiers to new techniques

[Figure: fakes and originals in feature space (x1, x2), separated by a hyperplane]
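The counter measures above amount to learning a separating hyperplane over text features. A minimal sketch follows, assuming bag-of-words features and a plain perceptron; the slides do not say which classifier was used, and the offer texts, the brand name "brandx" and all function names are invented for illustration.

```python
from collections import Counter

def featurize(text):
    """Bag-of-words counts over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def train_perceptron(examples, epochs=20):
    """Learn weights and bias of a hyperplane.

    examples: list of (text, label), label +1 for fake, -1 for original.
    """
    weights = Counter()
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = bias + sum(weights[w] * c for w, c in feats.items())
            if label * score <= 0:  # misclassified: move the hyperplane
                for w, c in feats.items():
                    weights[w] += label * c
                bias += label
    return weights, bias

def predict(weights, bias, text):
    """Return +1 (fake) or -1 (original) for a new offer text."""
    feats = featurize(text)
    score = bias + sum(weights[w] * c for w, c in feats.items())
    return 1 if score > 0 else -1

# Hypothetical training offers for one brand.
training = [
    ("very similar to brandx watch", 1),
    ("original brandx watch with receipt", -1),
    ("replica brandx bag looks like original", 1),
    ("genuine brandx bag bought in store", -1),
    ("brandx style watch cheap replica", 1),
    ("brandx watch with warranty card", -1),
]
weights, bias = train_perceptron(training)
```

Retraining on newly labeled offers corresponds to the "update classifiers to new techniques" step; one model per brand would be trained on that brand's offers only.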


Results

A classifier was developed and tested

Similar techniques as for spam detection

Good results: F-value >> 90%

The German Federal Court of Justice: Internet auction providers have to filter the auctions using appropriate methods to detect faked offers.
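For reference, the F-value combines precision and recall. The sketch below assumes the balanced F1 variant (beta = 1), since the slides do not state which variant was reported; the counts in the example are invented.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented counts: 90 fakes found, 10 false alarms, 10 fakes missed.
example = f_measure(tp=90, fp=10, fn=10)
```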


Phishing

E-mail fraud

Send official-looking email

Include web link or form

Ask for confidential information, e.g. password, account details

Attacker uses information to withdraw money, enter computer system, etc.


AntiPhish

Project AntiPhish

Develop content-based phishing filters

Include other clues like whitelists

Trainable and adaptive filters

adapt to new phishing attacks

anticipate attacks

Consortium

Fraunhofer IAIS (DE)

Symantec (GB, IRL)

Tiscali (IT)

Nortel (FR)

K.U. Leuven (BE)


Phishing: Defense Techniques

Workflow

Obtain training data from email stream

Extract features

Estimate and update classifiers and filters

Integrate new filters into email filtering framework

Deploy at internet service provider

Deploy at central wireless packet switch


Approach: Multiple feature sets


Basic Features


Dynamic Markov Chains


DMC Details


Latent Topic Models


Class-Specific Topic Models


Feature Processing and Selection


Test Corpora


Overall Result


The THESEUS research program in Germany [theseus-program.com]

Deutsche Nationalbibliothek

Deutsche Thomson OHG (DTO)

Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI GmbH)

empolis GmbH

Festo AG

Fraunhofer-Gesellschaft (7 Institutes)

Friedrich-Alexander-Universität Erlangen

FZI Forschungszentrum Informatik

Institut für Rundfunktechnik GmbH (IRT)

intelligent views gmbh

Ludwig-Maximilians-Universität (LMU)

moresophy GmbH

LYCOS Europe

mufin GmbH

ontoprise GmbH

SAP AG

Siemens AG

Technische Universität Darmstadt

Technische Universität Dresden

Technische Universität München

Universität Karlsruhe (TH)

Verband Deutscher Maschinen- und Anlagenbau e.V. (VDMA)

[Figure axes: syntactic / semantic; single author / multiple authors] [Wess 07]


The THESEUS use cases

ALEXANDRIA: The Internet Knowledge Platform

CONTENTUS: Next Generation Digital Libraries for saving our cultural heritage

MEDICO: Semantic Image Search in Medicine

ORDO: Personal Ordered Knowledge Management

PROCESSUS: Semantic Business Processes

TEXO: Business Webs in the Internet of Things


CONTENTUS - Next Generation Digital Libraries for saving our cultural heritage

Publishers, libraries, broadcasters, etc. are interested in using, distributing and selling their archive content

In analog form, archives are threatened by deterioration, are not linked, difficult to use, and huge.

Goals:

Digitization, optimization of quality, availability

Indexing, semantic and social linking and intelligent search, communities

Rescue of cultural heritage, preventing losses from deterioration

Project duration: until 2012


Showcases Semantic Digital Libraries

225 years Neue Zürcher Zeitung NZZ

GDR music archive German National Library


CONTENTUS Workflow

1 Digitization

2 Automated optimization of quality

3 Automated generation of metadata

4 Semantic linking of metadata

5 Open knowledge networks – user augmentation

6 Semantic access to knowledge and content

Shell

Data generation: registered users / communities, algorithms with acceptable quality

Quality control: self control (see Wikipedia)

Mantle: controlled quality

Data generation: automatically generated through high-quality algorithms

Quality control: training and improvement of algorithms

Core: guaranteed quality

Data generation and correction: libraries, museums, universities, experts, etc.

Quality control: schooling, rules, advisory boards

Highest stability, highest persistence


1 Digitization

High-throughput methods

Modern book scanners: thousands of pages per day

Almost fully automatic

Data volumes: 70 TB (NZZ), peta- to exabytes (DNB)


2 Automated optimization of quality

Development of intelligent algorithms for optimizing print, images, sound & movies

Automated generation of presentation formats

Margin removal

Sharpening, straightening

Denoising, declicking

Scratch removal


3 Automated generation of metadata

Structural and content metadata

OCR, speech, music, video recognition

Structure analysis and type recognition

Linking with current norms & standards


4 Semantic linking of contents

Link-up with related media

Incorporation of external knowledge sources (metadata systems, Wikipedia, …)

Disambiguation, classification, relation extraction


Determining meaning

The words of natural language are often ambiguous

Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“ (Die Zeit, 18.7.08)

(English: Strauß sneered about Kohl: “He will never become chancellor.” In German, “Kohl” also means cabbage and “Strauß” also means ostrich or bouquet.)

» For each word / term, find a meaning

» Subproblems:

» Part-of-speech tagging: noun, verb, adjective, …

» Named entity recognition: people, places, organizations, …

» Assignment of concepts: plant, bird, politician, …


Named entity recognition

» Analyze the surroundings of words

» “Kohl” in a sentence with “Kanzler” (chancellor) is probably a “person”

» “Kohl” in a sentence with “kochen” (to cook) is probably a “vegetable”

Über Kohl höhnte Strauß: „Er wird nie Kanzler werden.“

» Statistical model for person names

» Word + surroundings -> word is a person

» Training using annotated sentences

» Automatic recognition of words / phrases that represent people
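The idea of deciding the meaning of “Kohl” from its surroundings can be sketched with a tiny naive Bayes model trained on annotated contexts. Naive Bayes is used here only because it is short; it is a stand-in for the statistical model on the slides, and the training contexts and function names are invented.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (context_words, label) for one ambiguous word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in samples:
        label_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify(model, words):
    """Return the label with the highest add-one-smoothed log probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented annotated contexts of the word "Kohl".
model = train([
    (["kanzler", "wird", "nie"], "person"),
    (["kanzler", "partei", "rede"], "person"),
    (["kochen", "suppe", "topf"], "vegetable"),
    (["kochen", "essen", "markt"], "vegetable"),
])
```

With this toy model, a context containing "kanzler" is labeled "person" and one containing "kochen" is labeled "vegetable", mirroring the two examples on the slide.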


Conditional Random Field Model

» Observed words X1,…,Xn

» Category of words Y1,…,Yn

$$p(Y_1,\dots,Y_n \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{n} \sum_{k=1}^{K} \lambda_k \, f_k(Y_{t-1}, Y_t, X) \right)$$

» Properties f may depend on two subsequent states and on all observed words

Example

» Property f10293 has value 1,

- if Yt-1 = “PER” and Yt = “PER” and

- Xt has value “Müller”.

Otherwise its value is 0.

[Lafferty, McCallum, Pereira 01]
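To make the role of such properties concrete, the following sketch scores tag sequences with weighted binary feature functions and normalizes by brute-force enumeration. Only `f_per_per_mueller` mirrors the slide's example; the other two features, all weights, the two-tag set and the "START" marker are assumptions, and enumeration only works for very short sentences (real CRF implementations use dynamic programming).

```python
import itertools
import math

TAGS = ["PER", "O"]

def f_per_per_mueller(y_prev, y_cur, words, t):
    """Slide example: 1 if both tags are PER and word t is "Müller"."""
    return 1.0 if y_prev == "PER" and y_cur == "PER" and words[t] == "Müller" else 0.0

def f_capitalized_per(y_prev, y_cur, words, t):
    """Hypothetical property: a capitalized word tagged PER."""
    return 1.0 if y_cur == "PER" and words[t][:1].isupper() else 0.0

def f_lowercase_other(y_prev, y_cur, words, t):
    """Hypothetical property: a lowercase word tagged O."""
    return 1.0 if y_cur == "O" and words[t].islower() else 0.0

# (lambda_k, f_k) pairs; the weights are invented, not learned.
FEATURES = [(2.0, f_per_per_mueller), (0.5, f_capitalized_per), (0.5, f_lowercase_other)]

def score(tags, words):
    """Sum over positions t and features k of lambda_k * f_k(y_{t-1}, y_t, x)."""
    total = 0.0
    prev = "START"
    for t, cur in enumerate(tags):
        total += sum(lam * f(prev, cur, words, t) for lam, f in FEATURES)
        prev = cur
    return total

def probabilities(words):
    """Normalize exp(score) over all tag sequences (brute force)."""
    seqs = list(itertools.product(TAGS, repeat=len(words)))
    weights = [math.exp(score(s, words)) for s in seqs]
    z = sum(weights)  # the normalization constant Z
    return {s: w / z for s, w in zip(seqs, weights)}

probs = probabilities(["Gerd", "Müller", "spielt"])
best = max(probs, key=probs.get)
```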


Modeling of names: features for a CRF model

» Title FirstName Connective LastName

» Properties recorded for the words xt-2, xt-1, xt, xt+1, xt+2

» Words, stem, part of speech

» Prefix, suffix (3 letters)

» Shape properties: capital character at the beginning, only numbers, contains numbers, mix of capital / non-capital letters, contains hyphens

» LDA topic model class

» Contained in list of first names, contained in list of last names

In progress
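A rough sketch of how such per-word properties could be computed over a window of two words on each side is shown below. Stem, part of speech and LDA topic class are left out because they need external tools; the feature names, the key format and the name lists are assumptions for illustration.

```python
def word_features(word, first_names=frozenset(), last_names=frozenset()):
    """Surface properties of a single word."""
    return {
        "word": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "init_cap": word[:1].isupper(),
        "only_digits": word.isdigit(),
        "has_digit": any(c.isdigit() for c in word),
        "mixed_case": any(c.isupper() for c in word[1:]) and any(c.islower() for c in word),
        "has_hyphen": "-" in word,
        "in_first_names": word in first_names,
        "in_last_names": word in last_names,
    }

def window_features(words, t, size=2, **name_lists):
    """Collect properties for words t-size .. t+size, keyed by relative offset."""
    feats = {}
    for offset in range(-size, size + 1):
        i = t + offset
        if 0 <= i < len(words):  # skip positions outside the sentence
            for name, value in word_features(words[i], **name_lists).items():
                feats[f"{offset:+d}:{name}"] = value
    return feats

# Hypothetical sentence and name lists.
feats = window_features(
    ["Dr.", "Helmut", "Kohl", "sagte"], 2,
    first_names={"Helmut"}, last_names={"Kohl"},
)
```

Each resulting key, such as `-1:in_first_names`, can serve as one binary or categorical input to the feature functions of the CRF.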


Identity of names

» There are several people named “Helmut Kohl”

» Helmut Kohl, born 1930, Chancellor

» Helmut Kohl, born 1943, Referee

» Helmut Kohl, textile merchant

» … 99 further hits in the telephone book

» Identification in Wikipedia

» Compare the words of the Wikipedia article with the text in which “Helmut Kohl” was found

» Similar words -> probably the same person

» Automated assignment: person name -> Wikipedia article


Assignment by similarity of the context

Simple algorithm for assigning people to Wikipedia articles

» Occurrence in text: Helmut Kohl

» Description using characteristic terms -> u

» Wikipedia article on Kohl

» Description using characteristic terms -> w

» Comparison using a distance measure, for example the cosine distance d(w,u)

» Implemented in a prototype

» Further approach: assignment as a classification task

f(w,u) = 0 or 1

Master's thesis
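The simple algorithm can be sketched as follows: build a term vector u for the text surrounding the mention and a term vector w for each candidate article, then pick the candidate with the smallest cosine distance d(w,u) = 1 - cos(w,u). Assumptions for illustration: raw term counts stand in for the "characteristic terms" (a real system would more likely weight terms, for example with tf-idf), and the candidate descriptions are short hypothetical snippets, not actual Wikipedia text.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts; a stand-in for 'characteristic terms'."""
    return Counter(text.lower().split())

def cosine_distance(u, w):
    """1 - cosine similarity of two term-count vectors."""
    dot = sum(u[t] * w[t] for t in u.keys() & w.keys())
    norm = (math.sqrt(sum(v * v for v in u.values()))
            * math.sqrt(sum(v * v for v in w.values())))
    if norm == 0:
        return 1.0  # an empty description is treated as maximally distant
    return 1.0 - dot / norm

def assign(mention_context, candidates):
    """Return the candidate whose description is closest to the mention context."""
    u = term_vector(mention_context)
    return min(candidates,
               key=lambda name: cosine_distance(u, term_vector(candidates[name])))

# Hypothetical candidate descriptions (assumption, not real article text).
CANDIDATES = {
    "Helmut Kohl (chancellor)": "politician cdu chancellor germany reunification bonn",
    "Helmut Kohl (referee)": "football referee match league whistle",
}
CONTEXT = "the chancellor spoke in bonn about reunification"
print(assign(CONTEXT, CANDIDATES))
```

The classification variant f(w,u) = 0 or 1 mentioned above would replace the `min` over distances with a trained decision on each (w,u) pair.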


Semantic categories currently assigned in the Contentus prototype

» Names: people, organizations, points in time, places, …

» Assignment to Wikipedia articles

Under development:

» Hypernyms in an ontology (GermaNet): supersenses of nouns and verbs

» Clusters of words with similar meaning: topics

» Relations between names / concepts

“Berthold Brecht” studied in “München”

» Classes of documents: politics, economy, …

Semantic Interpretation


Knowledge store

» Further information on entities that were found in the text

» Dates, publications

» Number of inhabitants, topological relationships

» Social networks

» Who knows whom?

» Who was at the same place at the same time?

» Who influenced whom?

Example entries:

Helmut Kohl
Date of birth: 30.4.1930
Place of birth: Ludwigshafen
Spouse: Hannelore K.
Education: historian
Religion: Catholic
Party: CDU

Berlin
Area: 891 km²
Population: 3,420,786
GDP: €83.6 billion
Elevation: 34–115 m
Latitude: 52° 31′ N
Longitude: 13° 25′ E


Frankfurt Book Fair, 10 October 2008

Knowledge store: Format

» Factual knowledge as logical expressions: <subject> <predicate> <object>

» Semantic Web standards: RDF, RDFS, OWL

» Technical basis: MySQL database; Jena + Joseki triple store

» Query language: SPARQL
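To make the <subject> <predicate> <object> format concrete, here is a toy in-memory triple store with single-pattern matching. It is only a sketch: the identifiers and predicate names are hypothetical, and this is not the Jena/Joseki API. A call such as `match(("Helmut_Kohl", "party", "?o"))` plays roughly the role of the SPARQL query `SELECT ?o WHERE { :Helmut_Kohl :party ?o }`.

```python
# Facts as (subject, predicate, object) triples; names are illustrative assumptions.
TRIPLES = {
    ("Helmut_Kohl", "birthPlace", "Ludwigshafen"),
    ("Helmut_Kohl", "party", "CDU"),
    ("Helmut_Kohl", "religion", "Catholic"),
    ("Berlin", "population", "3420786"),
}

def match(pattern, triples=TRIPLES):
    """Match one triple pattern; terms starting with '?' are variables.

    Returns a list of bindings (dicts from variable name to value).
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # the same variable must bind to the same value everywhere
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results

print(match(("Helmut_Kohl", "party", "?o")))
print(sorted(b["?p"] for b in match(("Helmut_Kohl", "?p", "?o"))))
```

A real triple store adds indexing, joins across several patterns, and schema reasoning via RDFS/OWL on top of this basic matching step.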


» Semantic Integration of data and information from different sources

» DBpedia: an interpreted form of Wikipedia

» GeoNames Ontology: all the places in the world

» Catalogue of the German National Library: books and publications

» Based on open standards

» W3C Semantic Web Stack

» RDF, RDFS, OWL, SPARQL

Diagram labels: triple store, linking of knowledge


Knowledge sources

» DBpedia (www.dbpedia.org)

» GeoNames Ontology

» Already in RDF/OWL format

» Person reference database PND

» Topic reference database SWD

» Online catalogue OPAC

» Partial export to RDF

» Entities found in the text

» Identification using Wikipedia

» Linked to DBpedia data via links


Open knowledge networks

Further annotations from experts and users

• Completions, corrections

• Cooperation with the ALEXANDRIA project in Theseus

• Suitable measures to assure high quality of data

Process chain: 1 Digitization, 2 Automated optimization of quality, 3 Automated generation of metadata, 4 Semantic linking of metadata, 5 Open knowledge networks (user augmentation)


Open knowledge networks

The Multiple Shell Model

• Core: assured quality. Data generation and correction: libraries, universities, museums, groups of experts, etc. Quality control: fixed rules, committees, maximal stability and persistence.

• Mantle: controlled quality. Data generation: high-quality algorithms. Quality control: training and improvement of algorithms.

• Outer shell: open knowledge network. Data generation: registered users / communities, algorithms. Quality control: self-regulation (cf. Wikipedia).

Cf. Wikinger [Bröcker et al. 08]


Semantic search

The knowledge network

• Digital, multimedia data

• Content is semantically linked

• Is enriched from external sources and user groups

Access

• Structure by Ontology

• Content relationships become clear

• “Knowledge exploration” is possible

Process chain: 1 Digitization, 2 Automated optimization of quality, 3 Automated generation of metadata, 4 Semantic linking of metadata, 5 Open knowledge networks (user augmentation), 6 Semantic access to knowledge and content


The Contentus Demonstrator


Outline

Why is Text Mining cool?

• Drowning in Data: The Challenge of Meaning

• Text Mining: Creating Meaning from Large Collections

• Text Mining Markets

What can we do with Text Mining in the Real World? Some case studies

• Document classification: eBay, antiPhish

• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web

• Structuring and Monitoring: EmotionRadar

Conclusion


Example of a classified webpage

Match with the document model: 80%; classified as: Projekte (projects)


Workflow for semantic processing of documents

Diagram components: crawl documents, pre-processing, categorization, entity recognition, extracted metadata, extracted search regions, search index, knowledge store. Annotations: using the document model; using the structure model.

Watch out for the relaunch in summer 2009!


Emotion Radar

Which issues are important to people, and where are the emotional discussions in blogs and discussion forums?

Goal: market research, …


Emotion Radar example: looking at two large automobile companies A and B*

Selection and crawling of discussion forums

• Criteria: search engine ranking, size, activity

• Period used: January 2008 to January 2009

• Storage: 2 GB of data, crawled over a period of 7 days

Structure analysis of the discussion forums:

• Manufacturer A:

– Number of postings: 188,487

– Monthly number of new postings: approx. 1,500

– Number of threads: 21,613

– Number of authors: 15,445

• Manufacturer B:

– Number of postings: 406,814

– Monthly number of new postings: approx. 2,700

– Number of threads: 38,758

– Number of authors: 21,919

* anonymized
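The structure figures above can be turned into simple activity ratios. The sketch below only does arithmetic on the numbers from the slide; the ratio names are chosen here for illustration and are not measures reported in the talk.

```python
# Forum statistics as given on the slide.
FORUMS = {
    "A": {"postings": 188_487, "threads": 21_613, "authors": 15_445},
    "B": {"postings": 406_814, "threads": 38_758, "authors": 21_919},
}

def ratios(stats):
    """Average thread length and average activity per author."""
    return {
        "postings_per_thread": stats["postings"] / stats["threads"],
        "postings_per_author": stats["postings"] / stats["authors"],
    }

for name, stats in FORUMS.items():
    r = ratios(stats)
    print(name, round(r["postings_per_thread"], 1), round(r["postings_per_author"], 1))
```

On these figures, threads in forum B are somewhat longer on average, and its authors post more often than those in forum A.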


Case study: Internet postings related to the introduction of a new car model in Germany in 2008

Timeline chart with three marked events: manufacturer publishes first pictures; manufacturer publishes further product features / start of sales; cars are delivered.


Partially automated emotion analysis shows a mood swing from positive to negative (“in love” -> “angry”)

Emotions along the timeline: “in love”, “hoping”, “turned off”, “interested”, “surprised”, “angry”, and finally “angry” and “proud”

Marked events: manufacturer publishes first pictures; manufacturer publishes further product features / start of sales; cars are delivered
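The slides do not say how the emotion analysis works internally. Purely as an illustration of one automatable step, the sketch below tags postings with emotions via a keyword lexicon and reports the dominant emotion per phase. The lexicon, the example postings and the phase names are all hypothetical.

```python
from collections import Counter

# Hypothetical cue words per emotion; the talk does not publish its lexicon.
LEXICON = {
    "in love": {"beautiful", "love", "gorgeous"},
    "angry": {"unbelievable", "annoying", "disappointed"},
    "proud": {"proud", "envy"},
}

def emotions_in(posting):
    """Emotions whose cue words occur in the posting (whitespace tokenization)."""
    words = set(posting.lower().split())
    return [emotion for emotion, cues in LEXICON.items() if words & cues]

def dominant_emotion(postings):
    """Most frequent emotion across a group of postings, or None if no cue matched."""
    counts = Counter(e for p in postings for e in emotions_in(p))
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical postings grouped by phase of the product launch.
PHASES = {
    "first pictures": ["what a beautiful design", "i love the front"],
    "delivery": [
        "fuel consumption is annoying",
        "really disappointed by the mileage",
        "my neighbours envy me",
    ],
}
for phase, postings in PHASES.items():
    print(phase, "->", dominant_emotion(postings))
```

In a partially automated setting, output like this would be reviewed and corrected by analysts before it is charted over time.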


Topic recognition shows a change in the product features that are discussed, from design to gasoline consumption

Topics and emotions along the timeline (marked events: first photos, deliveries):

− Vehicle length, design: “in love”

− Gearshift, efficiency, technology, fuel consumption: “hoping”

− Chrome trim, perceived quality, “giant fish mouth”: “turned off”

− Sunroof, steering wheel, on-board computer, audio system: “interested”

− Prices in the car configurator and price list: “surprised”

− Test drives, fuel consumption: “angry”

− Fuel consumption, folding key, audio: “angry”

− Neighbors: “proud”

Forum quote: “... hard to believe how much gasoline this little car uses!” (Kalle83)


How can manufacturers use these text mining results?

• Short term: e.g. early recognition of relevant topics (fuel consumption) and preparation of an appropriate response (fuel-saving driver training, fuel-efficient tires, proactive communication)

• Long term: continuous monitoring of emotional topics


Summary

Text Mining is cool!

• Drowning in Data: The Challenge of Meaning

• Text Mining: Creating Meaning from Large Collections

• Text Mining Markets

We can do a lot with Text Mining in the Real World!

• Document classification: eBay, antiPhish

• Retrieval and Relation Extraction: NZZ, THESEUS CONTENTUS, Fraunhofer Web

• Structuring and Monitoring: EmotionRadar

