Ontology construction from text Blaz Fortuna


Ontology construction from text

Blaz Fortuna


Outline

Big picture
OntoGen
Future work


Big picture


Vision

Extracting structured information from text

What is “text”?
  From single documents to large corpora (different granularity)

What is “structured information”?
  From topic taxonomies to full-blown ontologies (different expressivity)


Available tools

Text mining
  … for dealing with large corpora
Natural Language Processing (NLP)
  … for dealing with sentence-level structure
Machine learning
  … for abstracting structure from data (modeling)
  … inside many text mining and NLP algorithms
Visualization
  … for user interactions


The Plan

[Figure: chart with axes "Expressiveness" and "granularity" (from document to corpus), positioning OntoGen, Template Extraction, Semantic Graphs, and Q&A.]


OntoGen


OntoGen

Tool for semi-automatic ontology construction from large text corpora

Integrates several text-mining methods:
  Clustering
  Active learning
  Classification
  Visualizations

Publicly available at ontogen.ijs.si

[Fortuna, Mladenić, Grobelnik, 2005]


Ontology construction with OntoGen

Semi-automatic
  The system provides suggestions and insights into the domain
  The user interacts with the parameters of the methods
  Final decisions are taken by the user

Data-driven
  Most of the aid provided by the system is based on some underlying data
  Instances are described by features extracted from the data (e.g. word vectors)


Ontology model in OntoGen

An ontology is a data model representing:
  a set of concepts within a domain
  the relationships between these concepts

OntoGen models an ontology as a graph/network structure consisting of:
  a set of concepts (vertices in a graph),
  a set of instances assigned to particular concepts (data records assigned to vertices in a graph),
  a set of relationships connecting concepts (directed edges in a graph).

Each instance is described by a set of features.
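The graph model above could be sketched in Python roughly as follows. This is only an illustration: the class names, the "sub-concept-of" label, and the sample instance are assumptions made for the example, not OntoGen's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A data record described by a set of features (e.g. word counts)."""
    name: str
    features: dict[str, float]


@dataclass
class Concept:
    """A vertex in the ontology graph with the instances assigned to it."""
    name: str
    instances: list[Instance] = field(default_factory=list)


@dataclass
class Ontology:
    """Concepts (vertices) plus directed, labelled relationships (edges)."""
    concepts: dict[str, Concept] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add_concept(self, name: str) -> Concept:
        # Reuse the concept if it already exists, otherwise create it.
        return self.concepts.setdefault(name, Concept(name))

    def add_relation(self, source: str, label: str, target: str) -> None:
        # Both endpoints must already be concepts in the graph.
        if source not in self.concepts or target not in self.concepts:
            raise KeyError("both endpoints must be existing concepts")
        self.relations.append((source, label, target))


# Hypothetical two-concept ontology with one instance.
ontology = Ontology()
root = ontology.add_concept("root")
maths = ontology.add_concept("Mathematics")
maths.instances.append(Instance("d2", {"numerical": 1.0, "analysis": 1.0}))
ontology.add_relation("Mathematics", "sub-concept-of", "root")
```

Storing relations as (source, label, target) tuples keeps the edges directed, as the slide describes.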


Example of a Topic Ontology


Instance representation

Bag of words:
  Vocabulary: {wi | i = 1, …, N}
  Documents are represented with vectors (word space)

Example:
  Document set:
    d1 = “Canonical Correlation Analysis”
    d2 = “Numerical Analysis”
    d3 = “Numerical Linear Algebra”
  Vocabulary: {“Canonical”, “Correlation”, “Analysis”, “Numerical”, “Linear”, “Algebra”}
  Document vector representation:
    x1 = (1, 1, 1, 0, 0, 0)
    x2 = (0, 0, 1, 1, 0, 0)
    x3 = (0, 0, 0, 1, 1, 1)
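The example can be reproduced with a few lines of Python. This is a minimal sketch that splits on whitespace and counts raw occurrences; the function names are made up for the illustration, and a real system would add tokenization, stop-word removal, and weighting.

```python
def build_vocabulary(documents):
    """Collect distinct words in order of first appearance."""
    vocabulary = []
    for text in documents:
        for word in text.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    words = text.split()
    return tuple(words.count(word) for word in vocabulary)


documents = [
    "Canonical Correlation Analysis",
    "Numerical Analysis",
    "Numerical Linear Algebra",
]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(text, vocabulary) for text in documents]
```

With the three documents from the slide, `vectors` holds x1, x2, and x3 in the same vocabulary order.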


Basic idea behind OntoGen

[Figure: diagram with the labels Domain, Text corpus, and Ontology; the ontology contains Concept A, Concept B, and Concept C.]


Concept discovery – unsupervised

Clustering-based approach
  K-means clustering of the instances
  Clusters are offered as suggestions
  The user selects relevant suggestions
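A minimal k-means sketch over bag-of-words vectors, to make the clustering step concrete. It assumes squared Euclidean distance and takes the first k vectors as initial centroids so that the run is deterministic. The slides do not say which distance measure or initialization OntoGen uses, and the four sample vectors are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def k_means(vectors, k, iterations=20):
    """Plain k-means; the first k vectors serve as initial centroids."""
    centroids = [tuple(v) for v in vectors[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each instance joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids


# Hypothetical word-count vectors for four documents.
instances = [
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (1, 1, 0, 0, 0, 0),
    (0, 0, 0, 0, 1, 1),
]
assignment, centroids = k_means(instances, k=2)
```

Each resulting cluster would then be shown to the user as a candidate sub-concept.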


Concept discovery – unsupervised

Visualization-based approach
  Topic-landscape based visualization
  Each instance is one yellow point on the map
  Similar instances appear closer together
  The user can make a concept by selecting a region of the map
  Pink points on the map are selected instances


Concept discovery – supervised

Active learning based approach
  The user enters a query
  The system ranks the instances according to the query
  The user labels instances:
    Yes – belongs to the concept
    No – does not belong to the concept
  Once there are enough labelled instances, the system switches to SVM-based active learning
  When done, the concept is added to the ontology
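The slides do not spell out how the SVM-based active learner picks the next instance to show to the user. One common strategy, assumed here purely for illustration, is uncertainty sampling: ask about the unlabeled instance that lies closest to the current separating hyperplane. The weights, bias, and data below are hypothetical.

```python
def decision_value(weights, bias, vector):
    """Signed score of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def most_uncertain(weights, bias, unlabeled):
    """Return the index of the unlabeled instance closest to the hyperplane."""
    return min(
        range(len(unlabeled)),
        key=lambda i: abs(decision_value(weights, bias, unlabeled[i])),
    )


# Hypothetical linear model, as if trained on the labels collected so far.
weights, bias = (1.0, -1.0), 0.0
unlabeled = [(3.0, 0.5), (1.0, 0.9), (0.2, 2.0)]
next_index = most_uncertain(weights, bias, unlabeled)
```

In a full loop, the user's Yes/No answer for that instance would be added to the training set and the classifier retrained before the next question.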


Concept discovery – supervised

Classification-based approach
  Instances are classified into a background ontology called OntoLight
  Concepts with the most instances are provided as sub-concept suggestions


Concept naming – unsupervised

Automatic extraction of keywords for describing the concepts
  First approach: based on TFIDF weights of words
  Second approach: based on an SVM-based feature selection algorithm
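The first approach can be illustrated with a small TFIDF ranking. This sketch assumes raw term frequency and a logarithmic inverse document frequency, summed over the documents of the concept; the exact weighting used in OntoGen is not given on the slide, and the five-document corpus is made up.

```python
import math
from collections import Counter


def tfidf_keywords(concept_docs, all_docs, top=3):
    """Rank the words of a concept by summed TF * IDF over its documents.

    concept_docs is expected to be a subset of all_docs.
    """
    tokenized = [doc.lower().split() for doc in all_docs]
    # Number of documents in the whole corpus that contain each word.
    document_frequency = Counter(
        word for words in tokenized for word in set(words)
    )
    scores = Counter()
    for doc in concept_docs:
        for word, tf in Counter(doc.lower().split()).items():
            idf = math.log(len(all_docs) / document_frequency[word])
            scores[word] += tf * idf
    return [word for word, _ in scores.most_common(top)]


# Hypothetical corpus; the first two documents form the concept.
corpus = [
    "numerical analysis",
    "numerical linear algebra",
    "canonical correlation analysis",
    "stock market analysis",
    "market news",
]
keywords = tfidf_keywords(corpus[:2], corpus)
```

Words that are frequent inside the concept but rare in the rest of the corpus rank highest, which is why "analysis" drops out here.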


Concept naming – supervised

Classification-based approach
  The concept’s instances are classified into a background ontology called OntoLight
  Names from the background ontology with the most classified instances are provided as suggestions
  Shows what the concept is called in some pre-defined vocabulary


Concept visualization

Instances are visualized as points on 2D map.

The distance between two instances on the map corresponds to their similarity.

Characteristic keywords are shown for all parts of the map.

User can select groups of instances on the map to create sub-concepts.


Ontology visualization

Ontology concepts visualized as points on the 2D topic map.

Topic map generated from a set of text documents.


Multiple views of the same data

Simple taxonomy on top of Reuters news articles

Two different views, one focuses on topics, one focuses on geography

Each view yields a different taxonomy on the data.

SVM based method detects importance of keywords for each view.

Topics view

Countries view

UK takeovers and mergers
The following are additions and deletions to the takeovers and mergers list for the week beginning August 19, as provided by the Takeover …

Lloyd’s CEO questioned in recovery suit in U.S.
Ronald Sandler, chief executive of Lloyd's of London, on Tuesday underwent a second day of court interrogation about …


Word weight learning

The word weight learning method is based on SVM feature selection.

Besides ranking the words, it also assigns them weights based on the SVM classifier.

Notation:
  N – number of documents
  {x_1, …, x_N} – documents
  C(x_i) – set of categories for document x_i
  n – number of words
  {w_1, …, w_n} – word weights
  {n^j_1, …, n^j_n} – SVM normal vector for the j-th category

Algorithm:
1. Calculate a linear SVM classifier for each category.

2. Calculate word weights for each category from the SVM normal vectors. The weight for the i-th word and j-th category is:

\[ w_i^j = \frac{1}{N} \sum_{k=1}^{N} x_{k,i} \cdot n_i^j \]

3. Final word weights are calculated separately for each document:

\[ x_{k,i} = TF_i \cdot \sum_{j \in C(x_k)} w_i^j \]


Relations – preprocessing

Named-entity profile
  Sentences extracted from articles in which the named entity appears
  Example: Agassi
    Olympic champion Agassi meets of Morocco in the first round.

Co-occurrence profiles
  Sentences extracted from articles in which two named entities appear together
  Example: Sampras – Agassi
    There will be no repeat of last year's men's final with eighth-ranked Agassi landing in Sampras's half of the draw.

Relationship
  By extracting keywords from co-occurrence profiles we can get a summary of the relationship between two named entities.
  Keywords are extracted from the bag-of-words vectors of the co-occurrence profiles.
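Building the two kinds of profile amounts to filtering sentences by the entities they mention. The sketch below assumes plain substring matching on entity names, which stands in for a real named-entity recognizer. The function names are made up, and the second and third sample sentences are hypothetical.

```python
def entity_profile(sentences, entity):
    """Collect the sentences that mention the named entity."""
    return [s for s in sentences if entity in s]


def cooccurrence_profile(sentences, entity_a, entity_b):
    """Collect the sentences that mention both named entities."""
    return [s for s in sentences if entity_a in s and entity_b in s]


sentences = [
    "There will be no repeat of last year's men's final with eighth-ranked "
    "Agassi landing in Sampras's half of the draw.",
    "Agassi won his opening match.",  # hypothetical sentence
    "Sampras is seeded first.",  # hypothetical sentence
]
agassi = entity_profile(sentences, "Agassi")
pair = cooccurrence_profile(sentences, "Sampras", "Agassi")
```

The sentences in `pair` would then be turned into a bag-of-words vector, from which the relationship keywords are extracted.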


Relations – example

Bill Clinton
  Iraq [476]: president, missiles, attacks, Kurdish, northern
  Bob Dole [294]: republican, president, presidential, candidates, poll
  United States [204]: president, Monday, southern, move, election
  White House [146]: president, spokesman, reporters, Friday, campaign
  Iran [74]: president, investment, gas, law, penalize
  Congress [66]: president, calling, billion, republican, democrat
  Chicago [42]: president, conventional, democrat, drug, campaign
  Al Gore [40]: president, vice, bus, tour, election

Chicago
  Clinton [236]: conventional, democrat, training, day, campaign
  U.S. [164]: trader, markets, purchasers, index, future
  New York [100]: variety, mixed, critical, poll, bulletproof
  Dole [70]: conventional, democrat, campaign, drug, Sunday
  Kansas City [70]: basis, wheat, bushels, fob, red
  Los Angeles [60]: variety, mixed, critical, poll, stg
  Illinois [34]: democrat, state, conventional, trip, mayor
  Chicago Board of Trade [34]: future, deliverable, stocks, bus, reporters
  San Francisco [34]: operations, municipal, full, remain, services
  Boston [32]: fared, comparatively, game, existed, American


Relations – abstraction

Named entities are clustered using k-means clustering.

Relations between clusters are established based on the co-occurrence profiles of the named entities:
  Let C1 and C2 be two clusters
  Let pij be a co-occurrence profile between document di and dj
  P = {pij | di from C1 and dj from C2}
  The relation is defined by the profile set P
  The summary of the relation is extracted from the centroid vector of the profiles from P
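The abstraction step can be sketched as follows: gather the profile set P for a pair of clusters, average the profile vectors into a centroid, and report the highest-weighted words as the summary. The vocabulary, the profile vectors, and the function names are hypothetical, and the sketch assumes the profiles are already bag-of-words vectors over a shared vocabulary.

```python
def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def relation_summary(profiles, cluster_1, cluster_2, vocabulary, top=3):
    """Keywords with the largest centroid weight over the profile set P."""
    # P = { p_ij | i in cluster_1, j in cluster_2 }, keeping only pairs that
    # actually have a co-occurrence profile.
    profile_set = [
        profiles[(i, j)]
        for i in cluster_1
        for j in cluster_2
        if (i, j) in profiles
    ]
    if not profile_set:
        return []
    center = centroid(profile_set)
    ranked = sorted(range(len(vocabulary)), key=lambda k: center[k], reverse=True)
    return [vocabulary[k] for k in ranked[:top]]


# Hypothetical vocabulary and co-occurrence profile vectors.
vocabulary = ["election", "peace", "trade", "missiles"]
profiles = {
    ("Bosnia", "Washington"): [3, 2, 0, 1],
    ("Sarajevo", "United States"): [2, 2, 0, 0],
    ("Bosnia", "Russia"): [0, 1, 4, 0],
}
summary = relation_summary(
    profiles,
    ["Bosnia", "Sarajevo"],
    ["Washington", "United States"],
    vocabulary,
)
```

Only the two profiles that cross the chosen clusters contribute, so the "trade" weight from the Bosnia and Russia profile does not affect this summary.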


Relations – example

Example of clusters:
  Cluster 1
    Named entities: Bosnia, Bosnian, Sarajevo
    Keywords: serbs, moslems, bosnian, election
  Cluster 2
    Named entities: Russia, Britain, Germany, France
    Keywords: meeting, country, government, told
  Cluster 3
    Named entities: Washington, United States
    Keywords: spokesman, military, missiles

Example of relations:
  Cluster 1 vs. Cluster 3
    Named entities: U.N., U.S., American, Washington, Bosnia, Turkey, Richard Holbrooke, U.N. Security Council, White House
    Keywords: election, serb, war, bosnians, moslem, peace, tribunal, police, spokesman, crime
  Cluster 1 vs. Cluster 2
    Named entities: NATO, Yugoslavia, Bosnia, Croatia, Serbia, Belgrade, Balkan, OSCE, Burns
    Keywords: country, election, state, international, peace, meeting, secretary, foreign, talks, member


Relations – example

Named entities: Russia, Britain, Germany, France, China, EU
Keywords: meeting, country, government, told, officials, union, minister, secretary, trade, report

Named entities: Hashimoto, Romano Prodi, Benjamin Netanyahu, Jim Bolger
Keywords: minister, prime, meeting, foreign, talks, president, peace, visit, told, officials

Named entities: Bill Clinton, Jacques Chirac, Suharto, Hosni Mubarak, Leonid Kuchma
Keywords: president, meeting, visit, talks, leaders, minister, secretary, officials, state

Named entities: Supreme Court, U.S. District Court, Simpson, Justice Department
Keywords: courts, case, year, told, rules, trials, charges, sentenced, law, file

Named entities: Tennessee Valley Authority, New Hill, TVA, Florida Power & Light Co, St Lucie
Keywords: plant, powerful, company, venture, electrical, projects, million, joint, province, state


Relations – example

[Figure: graph of the abstracted concepts Country, President, Minister, Court, and Power plant, connected by relations labelled Visit, Visit, Invest, and Rule.]


Evaluation

The first prototype was successfully used:
  Applied in multiple domains: business, legislation, and digital libraries (SEKT project)
  Users were always domain experts with limited knowledge of and experience with ontology construction / knowledge engineering
  Feedback from the first trials was used as input for the second prototype, the one presented here

A user study was performed for the second prototype:
  Main impression: the tool saves time and is especially useful when working with large collections of documents
  Main disadvantages: abstraction, unattractive interface design

Used in several EU projects: SWING, TAO, NEON, ECOLEAD, E4, TOOLEAST


From the users


Future work


The Plan

[Figure: chart with axes "Expressiveness" and "granularity" (from document to corpus), positioning OntoGen, Template Extraction, Semantic Graphs, and Q&A.]


Move towards bigger granularity

Semantic graphs
  Extract data points at the sentence level (OntoGen does it at the document level)
  Based on triplets extracted from sentence structure: subject, predicate, object
  Extraction can be done with:
    Parsers
    Structured learning
  Triplets from one document can be merged into semantic graphs
  Stronger than bag-of-words
  Example application: document summarization
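Merging triplets into a semantic graph can be sketched as building an adjacency map keyed by subject. The triplets below are hypothetical stand-ins for parser output, and the function name is made up. A real pipeline would also need co-reference resolution so that different mentions of the same entity end up in one node.

```python
from collections import defaultdict


def build_semantic_graph(triplets):
    """Merge (subject, predicate, object) triplets into an adjacency map."""
    graph = defaultdict(list)
    for subject, predicate, obj in triplets:
        edge = (predicate, obj)
        # Identical triplets from different sentences collapse into one edge.
        if edge not in graph[subject]:
            graph[subject].append(edge)
    return dict(graph)


# Hypothetical triplets, as a parser might produce for one document.
triplets = [
    ("earthquake", "hits", "city"),
    ("earthquake", "kills", "people"),
    ("earthquake", "hits", "city"),
    ("government", "sends", "aid"),
]
graph = build_semantic_graph(triplets)
```

Unlike a bag of words, the graph keeps who did what to whom, which is what makes it usable for tasks such as summarization.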


The Plan

[Figure: chart with axes "Expressiveness" and "granularity" (from document to corpus), positioning OntoGen, Template Extraction, Semantic Graphs, and Q&A.]


The Plan

[Figure: chart with axes "Expressiveness" and "granularity" (from document to corpus), positioning OntoGen, Template Extraction, Semantic Graphs, and Q&A.]


Template extraction

Hypothesis: People view events through “templates”
  Models of how things evolve and relate
  These models are used to understand and predict

Goal: automatic extraction of such models from texts


Search over triplets

Triplet extraction was run over the Reuters corpus: 800k news articles from 1996 to 1997.


Search over triplets


Template earthquake

[Figure: template graph centred on Earthquake. Node labels: Places, Time-period, People, Buildings, Richter scale, Government. Relation labels: Hits, Hits in, Kills, Collapses, Registered in, Measured by.]


Thank you!

Questions?
