
ailab.ijs.si

Machine Learning and Knowledge Discovery for Semantic Web

Dunja Mladenić, Artificial Intelligence Laboratory,

J. Stefan Institute,

Slovenia


Jožef Stefan Institute, Artificial Intelligence Laboratory

Selection of FP6 & FP7 Projects (Integrated Projects and Networks of Excellence only):

FP7 IP ACTIVE – Enabling the Knowledge Powered Enterprise

FP7 IP COIN – COllaboration and INteroperability for networked enterprises

FP7 IP EURIDICE – Inter-Disciplinary Research on Intelligent Cargo for Efficient, Safe and Environment-friendly Logistics

FP7 NoE PASCAL2 – Pattern Analysis, Statistical Modeling and Computational Learning

FP7 NoE MetaNet – Machine Translation & Multilingual Information Retrieval

FP7 NoE Multilingual Web

FP6 IP NeOn – Lifecycle Support for Networked Ontologies

FP6 IP ECOLEAD – European Collaborative Networked Organizations Leadership Initiative

FP6 IP SEKT – Semantically-Enabled Knowledge Technologies

Jožef Stefan Institute (JSI) is the leading Slovene research institution for natural sciences (900+ people) in the areas of computer science, physics and chemistry

The Artificial Intelligence Laboratory has over 30 people working in various areas of artificial intelligence (machine learning, data mining, semantic technologies, computational linguistics, logic)

Spin-offs: Quintelligence, Cyc-Europe, LiveNetLife, ModroOko, Envigence

Selection of Portals and Products:

Text-Garden (http://www.textmining.net)

Enrycher (http://enrycher.ijs.si/)

VideoLectures.NET (http://videolectures.net/)

IST-World (http://www.ist-world.org/)

Project Intelligence (http://pi.ijs.si/)

Search-Point (http://searchpoint.ijs.si/)

OntoGen (http://ontogen.ijs.si/)

Document-Atlas (http://docatlas.ijs.si/)

AnswerArt (http://answerart.net/)

Contextify (http://contextify.net/)


Business Clients: Accenture Labs, Bloomberg, British Telecom, Google Labs, Microsoft Research, New York Times, Siemens, Wikipedia

Academic Partners: Carnegie Mellon, Cornell, Stanford, MIT, University of Maryland, KIT, UCL

Enrycher, IST-World, SearchPoint, OntoGen, AnswerArt, Contextify (e-mails)


AILab Technologies

Graph/Social Network Analysis (GraphGarden/SNAP, IST-World, FPIntelligence)

Complex Data Visualization (DocAtlas, NewsExplorer, SearchPoint)

Computational Linguistics (Enrycher, AnswerArt)

Social Computing/Web2.0 (LiveNetLife)

Light-Weight Semantic Technologies (OntoGen, Contextify)

Deep Semantics & Reasoning (Cyc)

Statistical Machine Learning

Data/Web/Text/Stream-Mining (TextGarden Suite of tools)


Outline

Motivation

Machine Learning and Ontologies

OntoGen

OntoPlus

Semantics for search and browsing

SearchPoint

AnswerArt

Enrycher

Sensor Search

Real-time data processing

NYTMiner, BBMiner, Personalized News Search

…to conclude


Motivation

Semantic Web

integrates many existing ideas and technologies, focusing on upgrading the existing nature of web-based information systems to a more “semantic” oriented nature

typical approach is top-down modeling of knowledge, proceeding down towards the data

Machine Learning and Knowledge Discovery in Databases

aims at data modeling and extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from large datasets

data-driven bottom-up approach that tries to discover the structure in the data and express it in more abstract ways and richer knowledge formalisms


ML & KDD role within Semantic Web

Ontology construction

SW applications involve deep structured knowledge composed into ontologies

ML/KDD discovers structure in the data, structuring knowledge

semi-automatically extract knowledge from data into ontological structure

Integrating domain knowledge

ML/KDD approaches, e.g., “Active Learning” and “Semi-supervised Learning”, make use of small pieces of human knowledge for better guidance towards the desired model (e.g., ontology)

reduce human effort by an order of magnitude while preserving the quality of results

Handling data over time - dynamic ontologies

data and the corresponding semantic structures change over time

KDD technologies for stream mining deal with the stream of incoming data fast enough to keep the corresponding models (ontologies) up to date

Supporting different data modalities

ML/KDD technologies are not limited to a specific data representation - handling different data modalities (databases, text, multimedia, graphs)

ML/KDD for Language Technologies

SW mainly deals with textual data; LT are thus important for SW, including lexical, syntactical and semantic levels of natural language processing

ML/KDD for modeling natural language by automatic learning from rare/costly data

Scalability

KDD approaches consider scalability

SW is ultimately concerned with real-life data on the web, which grows exponentially


Ontology - SW commonly uses ontologies to structure knowledge

Ontology can be seen as a graph/network structure consisting of:

a set of concepts (vertices in a graph),

a set of relationships connecting concepts (directed edges in a graph),

a set of instances assigned to particular concepts (data records assigned to vertices in a graph)
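The graph view above can be sketched as a small data structure. This is only an illustration: the class name `Ontology`, the relation label `subConceptOf` and the example concepts are assumptions made for the sketch, not taken from any of the tools presented here.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology: concepts are vertices, relations are labelled directed edges."""
    concepts: set = field(default_factory=set)
    relations: set = field(default_factory=set)    # (source, label, target) triples
    instances: dict = field(default_factory=dict)  # concept -> list of data records

    def add_concept(self, name):
        self.concepts.add(name)
        self.instances.setdefault(name, [])

    def add_relation(self, source, label, target):
        # Both endpoints must already be vertices of the graph.
        if source not in self.concepts or target not in self.concepts:
            raise KeyError("both concepts must exist before relating them")
        self.relations.add((source, label, target))

    def add_instance(self, concept, record):
        if concept not in self.concepts:
            raise KeyError(concept)
        self.instances[concept].append(record)

    def sub_concepts(self, concept):
        """Concepts with a (hypothetical) 'subConceptOf' edge pointing at `concept`."""
        return sorted(s for s, label, t in self.relations
                      if label == "subConceptOf" and t == concept)


onto = Ontology()
for name in ("News", "Business", "Society"):
    onto.add_concept(name)
onto.add_relation("Business", "subConceptOf", "News")
onto.add_relation("Society", "subConceptOf", "News")
onto.add_instance("Business", "UK takeovers and mergers")
```

Keeping instances in a mapping keyed by concept mirrors the "data records assigned to vertices" wording; a real system would likely store feature vectors rather than raw strings.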


Ontology construction

A methodology for semi-automatic ontology construction, analogous to the CRISP-DM methodology, can be defined as consisting of the following interrelated phases:

1. domain understanding (what is the area we are dealing with?),

2. data understanding (what is the available data and its relation to semi-automatic ontology construction?),

3. task definition (based on the available data and its properties, define task(s) to be addressed),

4. ontology learning (semi-automated process addressing the task(s)),

5. ontology evaluation (estimate quality of the solutions to the addressed task(s)),

6. refinement with human in the loop (perform any transformation needed to improve the ontology and return to any of the previous steps, as desired)

[Grobelnik & Mladenić 2006]


ML/KDD for ontology learning

Define the ontology learning tasks in terms of mappings between ontology components, where some of the components are given and some are missing and we want to induce the missing ones.

Some typical scenarios in ontology learning are the following:

Inducing concepts/clustering of instances (given instances)

Inducing relations (given concepts and the associated instances)

Ontology population (given an ontology and relevant, but not associated instances)

Ontology generation (given instances and any other background information)

Ontology updating/extending (given an ontology and background information, such as, new instances or the ontology usage patterns)


Ontology Population via document classification into topic ontology

Goal: given a collection of documents organized into a topic ontology, classify a new document into the ontology

Different classification algorithms were applied to different data representations (e.g., word vectors, word n-gram vectors, flexible phrase vectors) and to different datasets (e.g., Yahoo! directory of Web pages, US patent database, directory of Slovenian/Croatian Web pages, News directory)
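As a rough illustration of the goal stated above, the sketch below files a new document under the most similar topic using word vectors and cosine similarity. The nearest-centroid classifier is a stand-in chosen for brevity; the slides do not say which classifiers were used, and the topic names and texts are invented for the example.

```python
import math
from collections import Counter


def word_vector(text):
    """Bag-of-words vector: lower-cased token -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(topic_docs):
    """Sum the word vectors of the documents filed under each topic."""
    centroids = {}
    for topic, docs in topic_docs.items():
        total = Counter()
        for doc in docs:
            total.update(word_vector(doc))
        centroids[topic] = total
    return centroids


def classify(text, centroids):
    """Rank topics by cosine similarity to the new document, best first."""
    vec = word_vector(text)
    return sorted(((cosine(vec, c), t) for t, c in centroids.items()), reverse=True)


# Hypothetical two-topic ontology with a few training documents each.
topic_docs = {
    "Business/Stocks": ["shares fell on the stock market",
                        "stock prices and bonds rallied"],
    "Society/Government": ["parliament passed the new law",
                           "the government announced elections"],
}
centroids = train_centroids(topic_docs)
ranking = classify("bonds and shares rallied on the market", centroids)
best_topic = ranking[0][1]
```

Returning the full ranking rather than a single label makes it easy to show the user several candidate topics.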


OntoClassify

System for scalable classification of text into large topic ontologies [Grobelnik & Mladenić, 2005]

Available as Web service

for DMoz directory of Web pages

for Inspec ontology for annotating papers

for MeSH medical ontology


Constructing ontology from data stream

Goal: given a stream of documents (e.g., news arriving over time), construct an ontology

Solution: a framework that incorporates the stream mining process into a formal definition of ontology [Grobelnik et al., 2006]

Extract named entities and use them as instances of the ontology

Entities and co-occurring entity pairs are represented by feature vectors based on the content of the documents they occur in

Concepts and relations can be formed either by clustering or by classification into an existing topic hierarchy
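The representation step can be sketched as follows, assuming named entities have already been extracted by some upstream component. The class `EntityStreamModel` and the toy stream are hypothetical; the sketch only accumulates content vectors for entities and co-occurring pairs, leaving clustering or classification to a later stage.

```python
from collections import Counter, defaultdict
from itertools import combinations


class EntityStreamModel:
    """Incrementally accumulates content vectors for entities and entity pairs."""

    def __init__(self):
        self.entity_vectors = defaultdict(Counter)
        self.pair_vectors = defaultdict(Counter)

    def update(self, text, entities):
        # `entities` is assumed to come from an upstream named-entity extractor.
        words = Counter(text.lower().split())
        unique = sorted(set(entities))
        for entity in unique:
            self.entity_vectors[entity].update(words)
        # Sorted entity names give each co-occurring pair one canonical key.
        for pair in combinations(unique, 2):
            self.pair_vectors[pair].update(words)


model = EntityStreamModel()
stream = [
    ("government talks on regional policy", ["France", "UK"]),
    ("stocks and bonds investing outlook", ["UK", "France"]),
    ("stocks rally", ["UK"]),
]
for text, entities in stream:
    model.update(text, entities)
```

Because each document only adds counts, the vectors can be kept up to date as the stream arrives; tracking them per time window would expose how a relation's vocabulary shifts.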


Illustrative results on Reuters news

Observe change in relations between entities over time, e.g.,

France – UK relation focused first on Society (Society, Government, Regional, ...) and later moves to Business (Investing, Business, Stocks, Bonds, …)


Ontology Learning from text

Extending the existing ontology

commonly used is the English lexical ontology WordNet, which is extended using some text, e.g., Web documents [Agirre et al., 2000]

Learning relations for an existing ontology (from docs)

learn relations between the concepts (e.g., “isa” [Cimiano et al., 2004], “hasPart” [Maedche, Staab, 2001]), extract semantic relations from text based on collocations [Heyer et al., 2001]

Ontology construction based on clustering (from docs)

split each document into sentences, parse the text and apply clustering for semi-automatic construction of an ontology [Bisson et al., 2000; Reinberger et al., 2004]

cluster sentences and map them onto the concepts of a general ontology (e.g., WordNet [Hotho et al., 2003])

use whole documents and guide the user through a semi-automatic process of ontology construction [Fortuna et al., 2005]


Ontology Learning from text (cont)

Ontology construction based on semantic graphs

parse the documents and construct semantic graphs, use them for learning document summaries [Leskovec et al., 2004]

Ontology construction from a collection of news stories

represent news as graphs of named entities with relationships based on collocations, used for visualization/browsing [Grobelnik, Mladenić, 2004]

More information in edited book [Buitelaar et al., 2005]


SEMI-AUTOMATIC DATA-DRIVEN ONTOLOGY CONSTRUCTION

Blaž Fortuna, Dunja Mladenić, Marko Grobelnik

http://ontogen.ijs.si


Ontology Learning with OntoGen

Semi-Automatic

provides suggestions and insights into the domain

the user interacts with parameters of methods

final decisions are taken by the user

Data-Driven

most of the aid provided by the system is based on some underlying data

instances are described by features extracted from the data (e.g., word vectors)

Installation package available at ontogen.ijs.si


Main Features

Interactive user interface

User can interact in real time with the integrated machine learning and text mining methods

Concept discovery methods:

Unsupervised

k-means clustering

Latent Semantic Indexing (LSI)

Supervised

Active learning

Concept visualization

Methods for helping to understand the discovered concepts:

Keyword extraction

TFIDF and SVM-normal based keyword extraction

Concept visualization

LSI and multi-dimensional scaling based visualization

Also available as a separate tool named Document Atlas: http://docatlas.ijs.si
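To make the TFIDF-based keyword extraction concrete, here is a minimal sketch that describes a concept by the heaviest words of its summed TF-IDF vectors. It is an assumption about how such keywords could be computed, not OntoGen's implementation, and it does not cover the SVM-normal variant; the documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: tf * idf} dict per document, with idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(tokens).items()}
            for tokens in tokenized]


def concept_keywords(vectors, member_ids, top_k=3):
    """Sum the TF-IDF vectors of a concept's instances and return the heaviest words."""
    centroid = Counter()
    for i in member_ids:
        centroid.update(vectors[i])
    # Ties are broken alphabetically so the output is deterministic.
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]


docs = [
    "bank merger bank shares",
    "merger approved bank merger",
    "election results election voters",
    "voters back election reform",
]
vectors = tfidf_vectors(docs)
finance_keywords = concept_keywords(vectors, member_ids=[0, 1], top_k=2)
```

Words that occur in every document get an idf of zero and therefore never surface as keywords, which is the usual motivation for the idf factor.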


Ontology management

Concept hierarchy

List of suggested sub-concepts

Ontology visualization

Selected concept


Concept management

Concept’s details

Concept’s instance

management

Selected concept

Keywords

Selected instance


Active Learning for concept learning

SVM hyperplane distance based active learning algorithm

First few labelled documents are bootstrapped from a query search

Instances for final concept are selected using the final SVM model
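The query-selection step of hyperplane-distance active learning can be sketched as below. The sketch assumes a linear model (weights `w`, bias `b`) has already been trained on the documents labelled so far; training itself is omitted, and the weights and documents are made up for illustration.

```python
import math


def margin_distance(w, b, x):
    """Geometric distance of sparse vector x from the hyperplane w.x + b = 0.

    Assumes w has at least one non-zero weight.
    """
    score = sum(w.get(k, 0.0) * v for k, v in x.items()) + b
    norm = math.sqrt(sum(v * v for v in w.values()))
    return abs(score) / norm


def next_query(w, b, unlabeled):
    """Pick the id of the unlabeled instance the current model is least sure about."""
    return min(unlabeled, key=lambda doc_id: margin_distance(w, b, unlabeled[doc_id]))


# Hypothetical current model and pool of unlabeled documents (word -> count).
w = {"merger": 1.0, "election": -1.0}
b = 0.0
unlabeled = {
    "d1": {"merger": 3},
    "d2": {"merger": 1, "election": 1},
    "d3": {"election": 2},
}
query_id = next_query(w, b, unlabeled)
```

The document closest to the hyperplane is shown to the user for labelling, the model is retrained, and the loop repeats until the user is satisfied with the concept.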

New Concept


Multiple views of the same data

Reuters news articles are used in the example with two different sets of categories: topics, or the list of countries that appear in the news articles.

Each set of categories offers a different view on the data.

The SVM-based method detects the importance of keywords for each view.

Views: Topics, Countries

Example articles:

UK takeovers and mergers: The following are additions and deletions to the takeovers and mergers list for the week beginning August 19, as provided by the Takeover …

Lloyd’s CEO questioned in recovery suit in U.S.: Ronald Sandler, chief executive of Lloyd's of London, on Tuesday underwent a second day of court interrogation about …


Concept’s instances visualization

Instances are visualized as points on a 2D map. The distance between two instances on the map corresponds to their similarity.

Characteristic keywords are shown for all parts of the map.

The user can select groups of instances on the map to create sub-concepts.


New documents

Classification of selected document

Selected document

Ontology population

The system uses a one-vs-all linear SVM trained on the created ontology to classify new instances into concepts.

Users can finalize the classifications using an interactive user interface


ONTOGEN ON IMAGES

Nenad Tomašev, Blaž Fortuna, Dunja Mladenić, Marko Grobelnik


Image representation

Diagram labels: SIFT features, Extract features, Mining, Application


Image representation - features

SIFT features

Rotation, scale and translation invariant orientation gradients located at “interesting” points on an image

Usually, the SIFT feature space is quantized to get “representative” vectors (“codebook” histogram)

Color histogram

Simply divide the color spectrum into “buckets” and calculate the distribution of colors into these buckets (color histogram)

Distance - weighted sum of SIFT codebook and color data distances
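The color histogram and the combined distance can be sketched as follows. Several details are assumptions made for the example: the SIFT codebook histograms are taken as precomputed inputs, the L1 distance is used for both parts, the weight `sift_weight` defaults to 0.5, and `buckets_per_channel` must divide 256.

```python
def color_histogram(pixels, buckets_per_channel=4):
    """Normalised histogram of RGB pixels (0-255 per channel) over equal-width buckets.

    Assumes a non-empty pixel list and that buckets_per_channel divides 256.
    """
    width = 256 // buckets_per_channel
    hist = [0.0] * buckets_per_channel ** 3
    for r, g, b in pixels:
        idx = ((r // width) * buckets_per_channel + g // width) * buckets_per_channel + b // width
        hist[idx] += 1.0
    total = len(pixels)
    return [h / total for h in hist]


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def image_distance(sift_a, color_a, sift_b, color_b, sift_weight=0.5):
    """Weighted sum of the codebook-histogram and colour-histogram distances."""
    return (sift_weight * l1_distance(sift_a, sift_b)
            + (1 - sift_weight) * l1_distance(color_a, color_b))


# Two toy "images": one all red, one all blue, with identical SIFT codebook histograms.
red = color_histogram([(255, 0, 0)] * 4)
blue = color_histogram([(0, 0, 255)] * 4)
same_shape = [0.5, 0.5]  # hypothetical two-word SIFT codebook histogram
d = image_distance(same_shape, red, same_shape, blue)
```

Here the images differ only in color, so the whole distance comes from the color term; shifting `sift_weight` towards 1 would make such images look increasingly alike.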


OntoGen on ImageNet subset (flowers, fire, buildings)


Document list for quick overview


Collection visualization (without displaying images)


Collection visualization (displaying images)


Creating ontology on images

Grouping similar images - concepts

Displaying relevant features as concept names


Sub-concept visualization

flower

buildings


Adding sub-concepts


TEXT-DRIVEN ONTOLOGY EXTENSION

Inna Novalija, Dunja Mladenić


OntoPlus

OntoPlus methodology

allows for the effective extension of very large ontologies.

provides the user with required concepts and relationships in the form of a ranked list.

combines textual ontology content, ontology structure and co-occurrence information.

OntoPlus architecture (diagram labels):

Domain Information Module (DIM): Domain Keywords; Domain Glossary (Term Names; Term Descriptions)

Domain Subset Extraction Module (DSEM): Upper-Level Domain Extractor; Domain Knowledge Extractor; Extraction of relevant domains; Extraction of ontology concepts defined in relevant domains; Extraction of ontology concepts with denotation similar to Glossary Term; Relevant Ontology Subset

Ontology Extension Module (OEM): Ontology Extender; Candidate Entries (for each Glossary Term, a ranked list of related ontology concepts and correspondent relations suggested); Validated Entries (Glossary Term, Ontology Concept, Relation)

Knowledge sources: Multi-Domain Ontology; Domain KB

Steps: Domain information identification; Extraction of the domain relevant ontology subset; Related concepts extraction; User validation; Ontology reuse


OntoPlus

Text-Driven Ontology Extension Using Ontology Content, Structure and Co-occurrence

Ranking existing ontology concepts as corresponding to a new domain concept suggested for the ontology extension

Experiments using the Cyc ontology and textual material from two domains: Finances, and Fisheries & Aquaculture

Best results by combining content, structure and co-occurrence information

Financial domain - ontology content and structure

Fisheries & Aquaculture domain - ontology content and co-occurrence
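The combination of the three signals can be illustrated with a weighted sum, using cosine similarity for content and Jaccard similarity for co-occurrence as named in the results. Everything else in the sketch is assumed: the structure score is taken as a precomputed number because the slides do not define it, the default weights are only an example, and the concept names and texts are invented rather than taken from Cyc.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def rank_concepts(term, candidates, weights=(0.5, 0.4, 0.1)):
    """Rank candidate ontology concepts for a glossary term, best first."""
    w_content, w_structure, w_cooccur = weights
    term_vec = Counter(term["text"].lower().split())
    scored = []
    for c in candidates:
        content = cosine(term_vec, Counter(c["text"].lower().split()))
        cooccur = jaccard(term["docs"], c["docs"])
        # c["structure"] is a hypothetical precomputed structural score in [0, 1].
        score = w_content * content + w_structure * c["structure"] + w_cooccur * cooccur
        scored.append((score, c["name"]))
    return [name for _, name in sorted(scored, reverse=True)]


term = {"text": "loan interest rate", "docs": {1, 2, 3}}
candidates = [
    {"name": "FishingVessel", "text": "boat used for fishing",
     "structure": 0.1, "docs": {9}},
    {"name": "InterestRate", "text": "rate of interest on a loan",
     "structure": 0.8, "docs": {2, 3}},
]
ranked = rank_concepts(term, candidates)
```

Setting the structure weight to zero and raising the co-occurrence weight gives the content plus co-occurrence variant.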


Results – Concept Ranking

Evaluation of the top suggested candidate concepts for ontology extension (Financial glossary), 100 random terms

Weighting measure                                  HR (Top 1)             HR (Top 5)             HR (Top 10)
(weights in brackets)                              Eqv or Hier  Any Rels  Eqv or Hier  Any Rels  Eqv or Hier  Any Rels
Baseline - Name: [1.0]                             18           28        24           36        25           40
Content (cos. similarity): [1.0]                   32           65        60           92        68           95
Co-occur (Jaccard similarity): [1.0]               30           48        48           62        52           73
Content: [0.5], Structure: [0.4], Co-occur: [0.1]  38           68        66           95        76           98

Evaluation of the top suggested candidate concepts for ontology extension (ASFA thesaurus), 100 random terms

Weighting measure                                  HR (Top 1)             HR (Top 5)             HR (Top 10)
(weights in brackets)                              Eqv or Hier  Any Rels  Eqv or Hier  Any Rels  Eqv or Hier  Any Rels
Baseline - Name: [1.0]                             24           37        25           38        27           40
Content (cos. similarity): [1.0]                   32           72        52           88        56           91
Co-occur (Jaccard similarity): [1.0]               33           71        49           89        51           90
Content: [0.5], Structure: [0.0], Co-occur: [0.5]  42           84        63           96        66           96

Baseline - Name: string edit distance of concept name


CONTEXT SENSITIVE SEARCH

Boštjan Pajntar, Marko Grobelnik, Dunja Mladenić

http://SearchPoint.ijs.si


SearchPoint

Search engines generally work very well

There are cases where it is difficult to specify a query

Idea: help the user by clustering all the hits and visualise the results space

Some related work:

mindset.research.yahoo.com – research vs. shopping aspect

www.ujiko.com – clustering & user interface

vivisimo.com – hierarchical clustering


Approach Description

Search results clustered and shown in 2D space

Each point in this cluster space corresponds to a ranking

Hits are ordered according to the position of the focus - the selected point

Initial focus position corresponds to Google ranking

Positioning clusters with respect to centroid-to-centroid similarity

Calculating ranking of document using its similarity to each centroid (formula shown on the slide)

Classifying documents into web directory (DMoz), visualising relevant parts of the directory
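One possible way to turn a focus position into a ranking is sketched below. This is not the SearchPoint formula, which is not reproduced in the text; the inverse-distance weighting, the function names and the toy hits are all assumptions chosen to illustrate the idea that moving the focus towards a cluster promotes hits similar to that cluster's centroid.

```python
import math


def cluster_weights(focus, cluster_positions):
    """Weight each cluster by closeness of the focus point to its 2D position."""
    raw = {c: 1.0 / (1e-9 + math.dist(focus, pos))
           for c, pos in cluster_positions.items()}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


def rerank(hits, focus, cluster_positions):
    """Order hit ids by focus-weighted similarity to the cluster centroids."""
    weights = cluster_weights(focus, cluster_positions)

    def score(hit_id):
        sims = hits[hit_id]  # cluster name -> similarity of this hit to the centroid
        return sum(weights[c] * sims.get(c, 0.0) for c in weights)

    return sorted(hits, key=score, reverse=True)


# Hypothetical clusters for the query "jaguar" and two hits with centroid similarities.
cluster_positions = {"cars": (0.0, 0.0), "animals": (1.0, 0.0)}
hits = {
    "car-review": {"cars": 0.9, "animals": 0.1},
    "big-cat-page": {"cars": 0.1, "animals": 0.8},
}
near_cars = rerank(hits, (0.1, 0.0), cluster_positions)
near_animals = rerank(hits, (0.9, 0.0), cluster_positions)
```

Because the weights vary smoothly with the focus position, the user is never forced into a single cluster: a focus placed between clusters blends their rankings.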


Search

“Internet search” – one of the most common tasks involving text manipulation in everyday

…but – how smart is search technology today?

…not too smart!

It is sophisticated, but not smart


Example – Searching for “jaguar”

Query “jaguar” has many meanings…

…but the first page of search engines doesn’t provide us with many answers

…there are 84M more results


Context sensitive search

Conceptual map

Search Point

Dynamic contextual ranking based on the search


SearchPoint (screenshots)


Main advantages

Generated clusters (in contrast to predefined)

User can search the whole cluster space and is not forced to select a single cluster

(Computer-generated clusters are not necessarily what the user has in mind)


SearchPoint integrated in Accenture’s intranet search


ANSWER ART

Luka Bradeško, Lorand Dali, Blaž Fortuna, Marko Grobelnik, Dunja Mladenić, Inna Novalija, Boštjan Pajntar

http://AnswerArt.net


AnswerArt – System Architecture

Diagram components: AnswerArt preprocessing; Extraction; Triplets; Semantic enhancement of triplets; Domain ontology (ASFA, WordNet); Extended ontology; AnswerArt; Question; Answer


AnswerArt using Medline (screenshots; callouts: document, Show document overview)

AnswerArt using ASFA (screenshots; callouts: document, Show document overview)


NATURAL LANGUAGE TEXT ENRICHMENT

Tadej Štajner, Delia Rusu, Lorand Dali, Blaž Fortuna,

Dunja Mladenić, Marko Grobelnik

http://enrycher.ijs.si


Enrycher Service

Annotation Features:

Entity extraction

People, locations, organizations, dates, percentages and money amounts

Entity resolution

co-reference

anaphora

Entity linkage to Linked Open Data (LOD)

Word Sense Disambiguation to LOD (WordNet 3.0 VUA)

Assertion extraction

Subject – predicate – object sentence elements together with their modifiers

Categories – from the Open Directory and the Wikipedia category schema


Entity resolution in text


Enrycher Service Dependencies

Dashed lines mark optional dependencies between components, whereas solid lines mark required dependencies


A comparative view on five systems: Enrycher, Text Runner, Open Calais, GATE and Read the

Systems (columns): Enrycher, Text Runner, Open Calais, GATE, NELL

Features (rows): Named Entity Extraction; Co-reference and Anaphora Resolution; Entity resolution; Disambiguation; Assertion Extraction

Assertion Extraction row entries: Relationship extraction; Events and Facts; Relationship extraction
