Ontology Learning
Shalini Gupta - 07305R02
Apoorv Sharma - 07305913
Chirag Patel - 07305909
Shitanshu Verma - 07305037
Issue
There is a lot of information, but its current representation renders it uninterpretable for machines.
Consequence: most of the information remains undiscovered.
Big and popular search engines are able to search only 3-4% of the total information on the web.
What is needed?
Improved machine intelligence.
Make machines read, understand, use, and modify information, with minimal human intervention.
How to achieve it?
Enable machines to populate, enrich, evaluate, and maintain their knowledge representation.
What is an ontology?
A representation format that conceptualizes a domain.
Captures classes, instances, attributes, and relationships.
Provides a sound semantic ground for machine-understandable descriptions of digital content.
Is used in various fields, such as SE and AI.
Is represented using languages such as OWL.
What is ontology learning?
The process of preparing and updating ontologies from sources such as documents in natural language, with the help of dictionaries, thesauruses, etc.
Environment
The flow
An initial ontology is given.
Information sources are given.
Machines work over the data sources to enrich the ontology.
Once the ontology is enriched, a consistency check and an evaluation are done.
Terms related to the process
Ontology enrichment: improving an existing ontology.
Ontology population: creating a new ontology or adding new concepts to it.
Inconsistency resolution: resolving inconsistencies that come up while acquiring ontologies.
Enrichment of Ontology
Term identification: identify the important terms in the text.
Taxonomy extraction: identify taxonomic relationships between the identified terms.
Non-taxonomic relationship extraction: identify the other relationships.
Review
Ontology learning includes ontology enrichment, which consists of term identification, taxonomy extraction, and non-taxonomic relationship extraction.
Term Identification: Basics
Everything is a concept: an object, an idea, or a thing.
A term lexicalizes a concept.
A term is a word or multi-word string that conveys 'a single meaning' within a given community, e.g. company, Paris, man, cellphone, Red Hat, car parking.
Goal: find the representative concepts.
Term Identification: Steps
Term recognition: find the terms.
Term classification: cluster the terms that are the same.
Term mapping: link the terms to well-defined concepts of referent data sources.
Various techniques exist for every step.
Term Identification: Tokenizing
Different combinations of linguistic techniques have been used to accomplish this step.
Tokenizing: scan the text in order to identify the boundaries of words and complex expressions.
Remove stop words such as 'a', 'the', 'of', 'with'.
E.g. 'Check of the Electrical Bonding of External Composite Panels with a CORAS Resistivity-Continuity Test Set'
Terms: Check, Electrical Bonding, External Composite Panels, CORAS Resistivity-Continuity Test Set.
Generally, nouns are considered as candidate concepts.
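The stop-word step can be sketched in a few lines of Python. This is only an illustration, assuming a plain regular-expression tokenizer and just the four stop words listed on the slide; the function name candidate_terms is hypothetical. Splitting at stop words is a simplification, since a real system would also use part-of-speech information to keep only noun phrases.

```python
import re

# Only the stop words named on the slide; a real list would be much longer.
STOP_WORDS = {"a", "the", "of", "with"}

def candidate_terms(text):
    """Split a text into candidate terms at stop-word boundaries."""
    terms, current = [], []
    # Tokens are runs of letters, with internal hyphens kept together.
    for token in re.findall(r"[A-Za-z][A-Za-z-]*", text):
        if token.lower() in STOP_WORDS:
            if current:
                terms.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        terms.append(" ".join(current))
    return terms

sentence = ("Check of the Electrical Bonding of External Composite Panels "
            "with a CORAS Resistivity-Continuity Test Set")
print(candidate_terms(sentence))
```

On the example sentence this yields the four terms listed on the slide.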
Term Identification: Importance of a term
The TF-IDF technique can be used to find the important keywords [6]. It is a balanced measure: a word is more important if it appears several times in a target document and, at the same time, appears rarely in other documents.
Seed concepts can be taken from existing ontologies.
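A minimal TF-IDF sketch in Python follows. It assumes one common variant of the weighting (term frequency normalized by document length, inverse document frequency as log(N / df)); other variants exist, and the slide does not say which one is meant. The toy corpus and the function name tf_idf are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    tf  = occurrences of the term / length of the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus, already tokenized.
docs = [
    "hotel room hotel price".split(),
    "room price city".split(),
    "city price museum".split(),
]
scores = tf_idf(docs)
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

In the first document, 'hotel' scores highest because it is frequent there and absent elsewhere, while 'price' scores zero because it appears in every document.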
Term Identification: Importance of a term
Multi-word terms: the C/NC-value method [5] takes into account
(1) the frequency of occurrence of the candidate term,
(2) the frequency of occurrence as a sub-string of other candidate terms,
(3) the number of candidate terms containing the given term as a sub-string,
(4) the number of words contained in the candidate term.
The relevant terms can also be determined by their mutual cohesiveness, using Mutual Expectation.
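The four factors above can be combined as in the following Python sketch of the C-value part of the method. It assumes the commonly cited form of the formula: log2 of the term length times the term frequency, reduced by the average frequency of the longer candidate terms that contain it. The candidate terms and frequencies are hypothetical, and the NC-value (context) part is not covered.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_value(freq):
    """freq maps candidate terms (tuples of words) to their corpus frequency."""
    result = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term.
        nesting = [f_b for b, f_b in freq.items() if contains(b, term)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        result[term] = math.log2(len(term)) * f
    return result

# Hypothetical candidate terms and frequencies.
freq = {
    ("composite", "panels"): 10,
    ("external", "composite", "panels"): 6,
}
for term, value in c_value(freq).items():
    print(" ".join(term), round(value, 2))
```

Note that single-word terms always score zero under this formula, which is why the method is described for multi-word terms.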
Term Identification: Morphological Analysis
Uses morphological knowledge of a word [9].
A technique that identifies a word-stem from a full word-form.
To identify small domain-specific units, it studies patterns of word formation and attempts to formulate rules using the word structure.
E.g. in the biomedical domain, a word ending in “-ofilous” or “-itis” is very probably a bio-molecule or a medical term.
Advantage: can identify “background terms” even with a low frequency of appearance.
Term Identification: Named Entity Recognition
Recognition of person, location, and organization names as single complex entities, e.g. 'Merrill Lynch'.
Also covers complex date and time expressions, percentages, and monetary values.
The next step associates single words or complex expressions with concepts, e.g. 'Merrill Lynch' is related to the concept organization.
Identifying Relationships
Provides more information for later steps.
Dependency relations: between a word and its neighbours, the mind perceives connections, the totality of which forms the structure of the sentence.
Structural connections establish dependency relations between the words.
Deriving Relationships from Dependency Relations
Syntactic dependency relations coincide closely with semantic relations [3].
E.g. 'France Telecom in Paris offers the new DSL technology.'
Dependency relations would give a link between France Telecom (organization) and Paris (city).
From this we can derive a semantic relationship between organization and city.
Term Identification
Identifying Relationships
Taxonomic Relationships
Non-Taxonomic Relationships
Taxonomy Construction
A taxonomy is a hierarchy of concepts.
Inclusion relations provide a tree view of the ontology and imply inheritance between super-concepts and sub-concepts.
E.g. 'living being' is a super-concept and 'mammal' is a sub-concept.
In an ontology, the root node is the most general concept for the domain of interest.
Discovering taxonomic relations
Based on lexico-syntactic patterns.
Can find inclusion relations between concepts through simple pattern matching on a set of documents.
E.g. the pattern: NP such as NP, NP, ..., and NP
Text: '... works by authors such as Herrick, Goldsmith, and Shakespeare'
Result: hyponym(“author”, Herrick), hyponym(“author”, Goldsmith), hyponym(“author”, Shakespeare)
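The 'such as' pattern can be approximated with a regular expression, as in the Python sketch below. It is a deliberate simplification: every noun phrase is assumed to be a single word, and no lemmatization is done, so the general term comes out as 'authors' rather than 'author'. The function name hearst_such_as is hypothetical.

```python
import re

# "NP such as NP, NP, ..., and NP", with each NP reduced to a single word.
PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, and |, | and ))*\w+)")

def hearst_such_as(text):
    """Return (general term, specific term) pairs found by the pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        general = match.group(1)
        for specific in re.split(r", and |, | and ", match.group(2)):
            pairs.append((general, specific))
    return pairs

text = "works by authors such as Herrick, Goldsmith, and Shakespeare"
for general, specific in hearst_such_as(text):
    print(f"hyponym({general!r}, {specific})")
```

A practical system would run a noun-phrase chunker first and match the pattern over chunks instead of raw words.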
Discovering new patterns
The idea is to use a pattern learner to generate new patterns.
The generated patterns can then be used to generate new information (new inclusion relations), as well as to assess the validity of extracted information.
E.g. starting from the pattern 'NP such as NP, NP, ..., and NP', we can generate new patterns like:
NP is NP
NP, NP, ..., and other NP
NP, especially NP, NP, ..., and NP
Algorithm for finding new patterns
1. Decide on a lexical relation, R, that is of interest, e.g. "group/member" or a hyponym relation like (author, Shakespeare).
2. Gather a list of terms/instances for which this relation holds.
3. Find places in the corpus where these terms/instances occur syntactically near one another and record the environment.
4. Find new patterns from the recorded environments.
5. Once a new pattern has been positively identified, use it to gather more instances of the target relation and go to Step 2.
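Steps 2 to 4 can be illustrated with the following Python sketch. It is only one possible reading of 'record the environment': the environment is taken to be the raw text between the two terms of a known pair, and frequent environments become candidate patterns. The seed pairs, the sentences, the HYPER/HYPO placeholders, and the function name collect_environments are all hypothetical; step 5 (applying a confirmed pattern to find new instances) is not implemented.

```python
import re
from collections import Counter

def collect_environments(sentences, seed_pairs, max_gap=30):
    """Count the text that separates the two terms of each known pair."""
    environments = Counter()
    for sentence in sentences:
        for general, specific in seed_pairs:
            # Try both orders: general term first, or specific term first.
            for first, second, template in (
                (general, specific, "HYPER{}HYPO"),
                (specific, general, "HYPO{}HYPER"),
            ):
                # \w* absorbs inflection such as the plural 's'.
                gap = r"\w*(\W.{0,%d}?)" % max_gap
                match = re.search(re.escape(first) + gap + re.escape(second), sentence)
                if match:
                    environments[template.format(match.group(1))] += 1
    return environments

# Hypothetical seed instances of the hyponym relation and a toy corpus.
seed_pairs = [("author", "Shakespeare"), ("author", "Herrick"), ("author", "Goldsmith")]
sentences = [
    "works by authors such as Shakespeare",
    "authors such as Herrick wrote lyric poems",
    "Goldsmith and other authors were widely read",
]
for pattern, count in collect_environments(sentences, seed_pairs).most_common():
    print(count, pattern)
```

Environments that recur across many seed pairs are the candidates a human or a learner would then confirm as new patterns.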
Multi-word concepts
A concept may be represented by a multi-word term.
A concept 'A' is a hyponym of a concept 'B' if:
A has more tokens than B,
all the tokens of B are present in A,
both terms have the same head.
E.g. the concepts 'private customer' and 'business customer' are hyponyms of the concept 'customer'.
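The three conditions translate directly into code. The Python sketch below assumes whitespace tokenization and takes the head of a term to be its last token, which holds for English noun compounds like the ones in the example but is an assumption, not part of the slide.

```python
def is_hyponym(a, b):
    """Check the three conditions for term `a` being a hyponym of term `b`."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    return (
        len(tokens_a) > len(tokens_b)                # A has more tokens than B
        and all(t in tokens_a for t in tokens_b)     # all tokens of B are in A
        and tokens_a[-1] == tokens_b[-1]             # same head (assumed: last token)
    )

print(is_hyponym("private customer", "customer"))
print(is_hyponym("business customer", "customer"))
print(is_hyponym("customer service", "customer"))
```

The third call shows why the head condition matters: 'customer service' contains 'customer' but is a kind of service, not a kind of customer.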
Mining non-taxonomic relations
Relationships other than is-a relationships.
E.g. linguistic processing may find that the word 'cost' occurs frequently with the words 'hotel', 'guest house', and 'youth hostel' in sentences like 'Costs at the youth hostel are $20 per night'.
The relations (cost, hotel), (cost, guest house), and (cost, youth hostel) exist.
The discovery algorithm finds support and confidence measures for these pairs, as well as for relationships at higher levels of abstraction, such as accommodation and costs.
Finding non-taxonomic relations
Based on the basic association rule algorithm [3].
Basic association rule algorithm:
Given a set of transactions T, each transaction has a set of items i1, i2, ..., in.
Goal: compute association rules of the form i1 → i2.
Trick: exploit the fact that many items appear together, so the occurrence of one implies the occurrence of another with a high probability (confidence).
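Association Rule Mining
E.g. consider the transactions:
(bread, butter, jam, chips)
(bread, butter, jam, ketchup)
(ketchup, chips)
(bread, butter, jam, chips)
(bread, rice)
E.g. the rule bread → butter, jam, with X = {bread} and Y = {butter, jam}.
Support = n(X ∪ Y)/N, where n(X ∪ Y) is the number of transactions containing all items of X and Y, and N is the total number of transactions. E.g. Support = 3/5.
Confidence = n(X ∪ Y)/n(X), where n(X) is the number of transactions containing X. E.g. Confidence = 3/4.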
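The two measures can be checked on the five example transactions with a short Python sketch. The helper names count, support, and confidence are illustrative; exact fractions are used so that the results can be compared with the values on the slide.

```python
from fractions import Fraction

transactions = [
    {"bread", "butter", "jam", "chips"},
    {"bread", "butter", "jam", "ketchup"},
    {"ketchup", "chips"},
    {"bread", "butter", "jam", "chips"},
    {"bread", "rice"},
]

def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y):
    return Fraction(count(x | y), len(transactions))

def confidence(x, y):
    return Fraction(count(x | y), count(x))

x, y = {"bread"}, {"butter", "jam"}
print("support:", support(x, y))
print("confidence:", confidence(x, y))
```

Three of the five transactions contain bread, butter, and jam together, and four contain bread, which gives the support 3/5 and the confidence 3/4.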
Algorithm
1. Extend each transaction to include the ancestors of each item. E.g. include the word 'accommodation' in the transactions containing the word 'guest house'.
2. Determine association rules of the form Xk → Yk, where |Xk| = 1 and |Yk| = 1.
3. Determine the confidence for all rules that exceed the user-determined support.
4. Prune the rules subsumed by ancestral rules. E.g. if we found two rules, (cost, accommodation) and (cost, hotel), we prune the latter rule (cost, hotel).
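The four steps can be sketched in Python as follows. This is a simplified reading, not the exact procedure of [3]: the taxonomy has a single level, pruning removes any rule for which the same rule over an ancestor was also found, and the transactions, thresholds, and names (ANCESTOR, mine_rules) are hypothetical.

```python
from itertools import permutations

# Hypothetical one-level taxonomy: item -> ancestor.
ANCESTOR = {"hotel": "accommodation", "guest house": "accommodation",
            "youth hostel": "accommodation"}

def mine_rules(transactions, min_support, min_confidence):
    # Step 1: extend each transaction with the ancestors of its items.
    extended = [set(t) | {ANCESTOR[i] for i in t if i in ANCESTOR} for t in transactions]
    n = len(extended)
    rules = {}
    # Steps 2-3: single-item rules x -> y that pass both thresholds.
    for x, y in permutations(sorted(set().union(*extended)), 2):
        if ANCESTOR.get(x) == y or ANCESTOR.get(y) == x:
            continue  # an item trivially co-occurs with its own ancestor
        n_x = sum(1 for t in extended if x in t)
        n_xy = sum(1 for t in extended if x in t and y in t)
        if n_xy / n >= min_support and n_xy / n_x >= min_confidence:
            rules[(x, y)] = (n_xy / n, n_xy / n_x)

    # Step 4: drop a rule if the same rule over an ancestor was also found.
    def ancestral(x, y):
        xs, ys = {x, ANCESTOR.get(x, x)}, {y, ANCESTOR.get(y, y)}
        return [(a, b) for a in xs for b in ys if (a, b) != (x, y)]

    return {r: m for r, m in rules.items() if not any(g in rules for g in ancestral(*r))}

# Hypothetical transactions: the terms found in each sentence.
transactions = [
    {"cost", "hotel"}, {"cost", "guest house"}, {"cost", "youth hostel"},
    {"cost", "hotel"}, {"city", "museum"},
]
rules = mine_rules(transactions, min_support=0.3, min_confidence=0.5)
for (x, y), (sup, conf) in sorted(rules.items()):
    print(f"{x} -> {y}: support={sup}, confidence={conf}")
```

With these toy transactions, (cost, hotel) passes the thresholds but is pruned because (cost, accommodation) is also found, matching the example in step 4.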
Statistics-based Extraction of Taxonomic Relations [12][13]
Uses hierarchical clustering.
Groups similar terms in a bottom-up fashion.
Uses the cosine similarity function.
The cosine measure, or normalized correlation coefficient, between two vectors x and y is given by
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2}}$$
Algorithm
Computation of the similarity function
The similarity matrix is given by
Hotel vector = (0, 14, 7, 4, 6)
Accommodation vector = (14, 0, 11, 2, 5)
cos(Hotel, Accommodation) = (0*14 + 14*0 + 7*11 + 4*2 + 6*5) / (sqrt(297) * sqrt(346)) = 115 / (sqrt(297) * sqrt(346)) ≈ 0.36
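The cosine measure can be computed for the Hotel and Accommodation vectors with the Python sketch below. It applies the standard definition (dot product divided by the product of the Euclidean norms) to the full five-component vectors; whether the original computation used all five components is an assumption.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

hotel = (0, 14, 7, 4, 6)
accommodation = (14, 0, 11, 2, 5)
print(round(cosine(hotel, accommodation), 3))
```

In bottom-up clustering, this value would be computed for every pair of terms, and the most similar pair would be merged first.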
Case study: Web-based Ontology Learning with ISOLDE
ISOLDE (Information System for Ontology Learning and Domain Exploration) produces a domain ontology from a base ontology.
It uses the following:
An unsupervised named entity recognition system.
Web resources such as DWDS, Wikipedia, and Wiktionary.
Analysis steps used by ISOLDE
1. Named-entity recognition (NER): uses a domain-specific corpus, a base ontology, and a general-purpose NER system (SProUT, see Drozdzynski et al. 2004) to find instances for the classes in the base ontology.
2. Linguistic pattern analysis: extracts class candidates from the context of the instances extracted in step 1, using lexico-syntactic patterns.
3. Collecting web-based knowledge: collects information on and between the extracted class candidates from online resources and integrates it into a new or extended taxonomy/ontology.
Architecture
Stage-wise Examples
After step 1, we get Ballack, Munich, etc. as named entities from the soccer corpus.
In the second step, we find the class candidates for the named entities from the sentences in the corpus, and then filter the domain-specific candidates using the χ² method. E.g. the sentence 'Ballack, the best midfielder in the German national team' gives 'midfielder' as the class candidate of Ballack.
In the third step, we search the web for the class candidates. E.g. the Wikipedia definition of midfielder is: 'A midfielder is a player whose position of play is midway between the attacking strikers and the defenders.'
Example (contd.)
We learn the relation 'midfielder is a player' (a taxonomic relationship).
Relevance Factor χ²
O matrix for 'striker'
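A χ² relevance score can be sketched in Python under the assumption that the standard Pearson statistic is meant: the sum over all cells of (O - E)² / E, where O is the observed count and E is the count expected if the term were independent of the corpus. The 2x2 table below is entirely hypothetical and does not reproduce the O matrix for 'striker'.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2

# Hypothetical counts. Rows: sentences with / without the term.
# Columns: domain corpus / reference corpus.
observed = [
    [30, 10],
    [70, 90],
]
print(chi_square(observed))
```

A high value indicates that the term is distributed very differently in the domain corpus and the reference corpus, which is what makes it a domain-specific candidate.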
Issues in Learning
Human-understandable vs machine-understandable
Learning higher-degree relations
Mapping to a high-level ontology
Evaluation benchmark
Incremental ontology learning
Multi-agent learning
Applications of ontology
Ontology is ubiquitous in information systems [2].
Improving the performance of information retrieval and reasoning.
Making data interoperable between different applications.
Ontology-type semantic descriptions of behaviors and services allow software agents in a multi-agent system to better coordinate themselves.
References
[1] Elias Zavitsanos, Georgios Paliouras, and George Vouros. Ontology Learning and Evaluation: A Survey. Technical Report, 2006.
[2] Nicolas Weber and Paul Buitelaar. Web-based Ontology Learning with ISOLDE. DFKI GmbH - Language Technology Lab, Saarbrücken, Germany, 2006.
[3] Alexander Maedche and Steffen Staab. Mining Ontologies from Text. 2000.
[4] Alexander Maedche, Viktor Pekar, and Steffen Staab. Ontology Learning Part One - On Discovering Taxonomic Relations from the Web. 2003.
[5] K. Frantzi, S. Ananiadou, and H. Mima. Automatic recognition of multi-word terms: The C-value/NC-value method. 3(2):115–130, 2000.
[6] G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[7] D.I. Moldovan and R.C. Girju. An interactive tool for the rapid development of knowledge bases. International Journal on Artificial Intelligence Tools (IJAIT), 10(1-2), 2001.
[8] J.D. Cohen. Highlights: Language and domain independent automatic indexing terms for abstracting. Journal of the American Society for Information Science, 46(3):162–174, 1995.
[9] U. Heid. A linguistic bootstrapping approach to the extraction of term candidates from German text. Terminology, 5(2):161–181, 1998.
[10] L.M. Iwanska, N. Mata, and K. Kruger. Fully Automatic Acquisition of Taxonomic Knowledge from Large Corpora of Texts, pages 335–345. MIT/AAAI Press, 2000.
[11] J.U. Kietz, A. Maedche, and R. Volz. A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet. Juan-Les-Pins, France, 2000.
[12] A. Maedche, V. Pekar, and S. Staab. Ontology learning part one - on discovering taxonomic relations from the web. In Proceedings of the Web Intelligence conference. Springer Verlag, 2002.
[13] Vincent Schickel-Zuber and Boi Faltings. Using hierarchical clustering for learning the ontologies used in recommendation systems. KDD 2007: 599-608.
[14] A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In Proceedings of ECAI 2000, IOS Press, Amsterdam, 2000.
Thank You