Concept Hierarchy Induction
by Philipp Cimiano
Objective
Structure information into categories
Provide a level of generalization to define relationships between data
Application: Backbone of any ontology
Overview
Different approaches to acquiring conceptual hierarchies from a text corpus
Various clustering techniques
Evaluation
Related Work
Conclusion
Machine Readable Dictionaries
Entries such as 'a tiger is a mammal' or 'mammals such as tigers, lions or elephants'
Exploit the regularity of dictionary entries
The head of the first NP is taken as the hypernym
Example
Exception
is-a(corolla, part) … NOT VALID
is-a(republican, member) … NOT VALID
The heads 'part' and 'member' are empty (transparent) nouns; the intended relations are is-a(corolla, flower) and is-a(republican, political party)
Alshawi's solution
Results using MRDs
Dolan et al.: 87% of the extracted hypernym relations are correct
Calzolari cites a precision of > 90%
Alshawi: precision of 77%
Strengths And Weaknesses
Correct, explicit knowledge
Robust basis for ontology learning
Weakness: dictionaries are domain-independent, so domain-specific terms are missing
Lexico-Syntactic patterns
Task: automatically learning hyponym relations from corpora.
'Such injuries as bruises, wounds and broken bones'
hyponym (bruise, injury)
hyponym (wound, injury)
hyponym (broken bone, injury)
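A minimal sketch of matching one such pattern with a regular expression; the regex, tokenization, and the crude `singularize` helper are illustrative assumptions, since real systems match a whole pattern set over POS-tagged or chunked text:

```python
import re

def singularize(word):
    """Very crude singularization, just enough for the running example."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def hearst_such_as(text):
    """Extract hyponym(y, x) pairs from the pattern 'such X as Y1, Y2 and Yn'."""
    m = re.search(r"such (\w+) as ([\w ,]+)", text, re.IGNORECASE)
    if not m:
        return []
    hypernym = singularize(m.group(1))
    parts = re.split(r",\s*|\s+and\s+|\s+or\s+", m.group(2))
    return [(singularize(p.strip()), hypernym) for p in parts if p.strip()]
```

Applied to the slide's example sentence, this yields the three hyponym pairs above.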
Hearst patterns
'Such injuries as bruises, wounds and broken bones'
Requirements
Occur frequently in many text genres.
Accurately indicate the relation of interest.
Be recognizable with little or no pre-encoded knowledge
Strengths And Weaknesses
Easily identified and accurate
Weakness: the patterns appear rarely; many is-a relations never occur in a Hearst-style pattern
Distribution Similarity
'You shall know a word by the company it keeps' [Firth, 1957].
Semantic similarity of words is measured by the similarity of their contexts.
Using distribution similarity
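The idea can be sketched by comparing bag-of-words context vectors with cosine similarity; the window size and whitespace tokenization are illustrative choices, not the method used in the paper:

```python
from collections import Counter
from math import sqrt

def context_vector(target, sentences, window=2):
    """Count the words occurring within +/-window tokens of each target occurrence."""
    vec = Counter()
    for s in sentences:
        toks = s.lower().split()
        for i, t in enumerate(toks):
            if t == target:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Terms that occur in similar contexts ('tiger' and 'lion' in hunting sentences) then score higher than terms that do not ('tiger' and 'car').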
Strengths And Weaknesses
Produces a reasonable concept hierarchy.
Weakness: the cluster tree lacks a clear, formal interpretation; it does not provide any intensional description of concepts; similarities may be accidental (sparse data)
Formal Concept Analysis (FCA)
FCA output
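A brute-force sketch of computing the formal concepts of a small binary context; the object and attribute names are illustrative, and enumeration over attribute subsets is exponential in general:

```python
from itertools import combinations

def formal_concepts(context):
    """Enumerate all formal concepts (extent, intent) of a binary context.

    `context` maps each object to its set of attributes. Brute force over
    attribute subsets; fine for toy contexts, exponential in general.
    """
    objects = list(context)
    attributes = sorted({a for attrs in context.values() for a in attrs})
    concepts = set()
    for r in range(len(attributes) + 1):
        for seed in combinations(attributes, r):
            extent = {o for o in objects if set(seed) <= context[o]}
            intent = set(attributes)   # closure of the extent: the attributes
            for o in extent:           # shared by every object in the extent
                intent &= context[o]
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts
```

On a tiny tourism-flavored context such as `{"hotel": {"bookable"}, "car": {"rentable", "driveable"}, "bike": {"rentable", "rideable"}}`, this yields, among others, the concept ({car, bike}, {rentable}).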
Similarity measures
Smoothing
Evaluation
Semantic cotopy (SC): the set of all super- and subconcepts of a concept
Taxonomy overlap (TO): compares the cotopies of the concepts shared by two hierarchies
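A simplified reading of the two measures, restricted to single-parent trees given as child-to-parent maps; the paper's exact SC/SC' definitions differ in detail:

```python
def cotopy(concept, tax):
    """Semantic cotopy: the concept plus all its super- and subconcepts.
    `tax` maps child -> parent (a single-parent tree, for simplicity)."""
    supers, c = set(), concept
    while c in tax:
        c = tax[c]
        supers.add(c)
    subs, frontier = set(), {concept}
    while frontier:
        frontier = {ch for ch, p in tax.items() if p in frontier}
        subs |= frontier
    return {concept} | supers | subs

def taxonomy_overlap(tax1, tax2):
    """Average cotopy overlap over the concepts shared by both hierarchies."""
    nodes = lambda t: set(t) | set(t.values())
    shared = nodes(tax1) & nodes(tax2)
    scores = [len(cotopy(c, tax1) & cotopy(c, tax2)) /
              len(cotopy(c, tax1) | cotopy(c, tax2)) for c in shared]
    return sum(scores) / len(scores) if scores else 0.0
```

Identical hierarchies score 1.0; moving a single concept to a different parent lowers the average.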
Evaluation Measure
(Diagram: precision/recall trade-off; 100% precision comes with low recall, high recall with low precision)
Results
Strengths And Weaknesses
FCA generates formal concepts and provides an intensional description
Weakness: the size of the lattice can grow exponentially in the size of the context; spurious clusters; finding appropriate labels for the clusters is hard
Problems with Unsupervised Approaches to Clustering
Data sparseness leads to spurious syntactic similarities
Produced clusters can’t be appropriately labeled
Guided Clustering
Hypernyms directly used to guide clustering
Sources: WordNet, Hearst patterns
Agglomerative clustering
Similarity Computation
Ten most similar terms of the tourism reference taxonomy
The Hypernym Oracle
Three sources: WordNet, Hearst patterns matched in a corpus, Hearst patterns matched in the World Wide Web
Record hypernyms and the amount of evidence found in support of each hypernym.
WordNet
Collect hypernyms found in any dominating synset containing the term t
Include the number of times the hypernym appears in a dominating synset
Hearst Patterns (Corpus)
Record the number of isa-relations found between the two terms
Hearst Patterns (WWW)
Download 100 Google abstracts for each concept and clue:
Evidence
Total Evidence for Hypernyms:
•time: 4
•vacation: 2
•period: 2
Clustering Algorithm
1. Input a list of terms
2. Calculate the similarity between each pair of terms and sort from highest to lowest
3. For each potential pair to be clustered, consult the oracle.
Consulting the Oracle case 1
If term 1 is a hypernym of term 2 or vice versa: create the appropriate subconcept relationship.
Consulting the Oracle case 2
Find the common hypernym h of both terms with the greatest evidence.
If one term has already been classified as t', distinguish three cases:
t' = h
h is a hypernym of t'
t' is a hypernym of h
Consulting the Oracle case 3
Neither term has been classified: each term becomes a subconcept of the common hypernym.
Consulting the Oracle case 4
The terms do not share a common hypernym: set aside the terms for further processing.
r-matches
For all unprocessed terms, check for r-matches (e.g. 'credit card' r-matches 'international credit card')
Further Processing
If either term in a pair is already classified as t’, the other term is classified under t’ as well.
Otherwise place both terms under the hypernym of either term with the most evidence.
Any unclassified terms are added under the root concept.
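The oracle-guided loop above can be sketched as follows; the `similarity` and `oracle` callables, the tie-breaking, and the simplified handling of case 2 (already-classified terms) are assumptions of this sketch, not the paper's exact algorithm:

```python
def guided_cluster(terms, similarity, oracle):
    """Hypernym-guided clustering sketch.

    `similarity(t1, t2)` returns a float; `oracle(t)` returns a dict mapping
    candidate hypernyms of t to their evidence counts (aggregated from
    WordNet and Hearst matches). Returns child -> parent edges; terms with
    no common hypernym end up under 'root'.
    """
    parent, deferred = {}, []
    pairs = sorted(((similarity(a, b), a, b)
                    for i, a in enumerate(terms) for b in terms[i + 1:]),
                   reverse=True)                       # highest similarity first
    for _, t1, t2 in pairs:
        h1, h2 = oracle(t1), oracle(t2)
        if t2 in h1:                                   # case 1: t2 is a hypernym of t1
            parent.setdefault(t1, t2)
        elif t1 in h2:                                 # case 1, other direction
            parent.setdefault(t2, t1)
        else:
            common = set(h1) & set(h2)
            if common:                                 # cases 2/3: attach both under the
                h = max(common, key=lambda c: h1[c] + h2[c])  # best-evidence hypernym
                parent.setdefault(t1, h)
                parent.setdefault(t2, h)
            else:                                      # case 4: defer
                deferred.append((t1, t2))
    for t1, t2 in deferred:                            # further processing fallback
        for t in (t1, t2):
            parent.setdefault(t, "root")
    return parent
```

`setdefault` ensures that decisions made for higher-similarity pairs are never overwritten by later, weaker evidence.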
Evaluation
Taxonomic overlap (TO): ignores leaf nodes
Sibling overlap (SO): measures the quality of clusters
Evaluation
Tourism domain: Lonely Planet, Mecklenburg
Finance domain: Reuters-21578
Tourism Results—TO
Finance Results—TO
Tourism Results—SO
Finance Results—SO
Human Evaluation
Future Work
Take word sense into consideration for the WordNet source.
Summary
Hypernym-guided agglomerative clustering works well: better than the "gold standard", with good human evaluation
Provides labels for clusters; no spurious similarities; faster than plain agglomerative clustering
Learning from Heterogeneous Sources of Evidence
Many ways to learn concept hierarchies. Can we combine different paradigms?
Any manual attempt to combine strategies would be ad hoc
Use supervised learning to combine techniques
Determining relationships with machine learning. Example: determine whether a pair of words stands in an "isa" relationship.
Feature 1: Matching patterns in a corpus
Given two terms t1 and t2, record how many times a Hearst pattern indicating an isa-relation between t1 and t2 is matched in the corpus
Normalize by the maximum number of Hearst patterns found for t1
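The normalization can be sketched as follows; the shape of the `counts` table is an assumption of this sketch:

```python
def hearst_feature(t1, t2, counts):
    """Feature 1: Hearst-pattern match count for isa(t1, t2), normalized
    by the maximum count observed for t1 with any candidate hypernym.

    `counts` maps (hyponym, hypernym) -> number of corpus matches.
    """
    m = max((c for (hypo, _), c in counts.items() if hypo == t1), default=0)
    return counts.get((t1, t2), 0) / m if m else 0.0
```

The feature is thus 1.0 for t1's most frequently attested hypernym and 0.0 for pairs never matched.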
Example
This provided the best F-measure with a single-feature classifier
Feature 2: Matching patterns on the web
Use the Google API to count the matches of a certain expression on the Web
Feature 3: Downloading webpages
Allows matching expressions with a more complex linguistic structure
Assign functions to each of the Hearst patterns to be matched
Use these "clues" to decide which pages to download
Download 100 abstracts matching the query “such as conferences”
Example
Feature 4: WordNet, all senses
Is there a hypernym relationship between t1 and t2?
There can be more than one path from the synsets of t1 to the synsets of t2
Feature 5: WordNet, first sense
Only consider the first sense of t1
Feature 6: "Head" heuristic
If t1 r-matches t2, we derive the relation isa(t2, t1), e.g.:
t1 = "conference"
t2 = "international conference"
isahead("international conference", "conference")
Feature 7: Corpus-based subsumption
t1 is a subclass of t2 if all the syntactic contexts in which t1 appears are also shared by t2
Feature 8: Document-based subsumption
t1 is a subclass of t2 if t2 appears in all documents in which t1 appears
Ratio: (# of pages where t1 and t2 occur) / (# of pages where t1 occurs)
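The ratio can be computed directly; representing documents as sets of terms is an assumption of this sketch:

```python
def doc_subsumption(t1, t2, docs):
    """Feature 8: fraction of documents (pages) containing t1 that also
    contain t2; 1.0 means t2 appears in every document in which t1 appears."""
    with_t1 = [d for d in docs if t1 in d]
    if not with_t1:
        return 0.0
    return sum(1 for d in with_t1 if t2 in d) / len(with_t1)
```

Note the asymmetry: a specific term like "tiger" scores high against "animal", while "animal" scores lower against "tiger", which is what makes the ratio usable as a subsumption signal.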
Example
Naïve Threshold Classifier
Used as a baseline
Classify an example as positive if the value of a given feature is above some threshold t
For each feature, the threshold is varied from 0 to 1 in steps of 0.01
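The sweep can be sketched as follows; using F1 as the selection criterion is an assumption, chosen to match the F-measures the slides report:

```python
def best_threshold(values, labels):
    """Naive threshold classifier baseline: sweep t from 0 to 1 in steps
    of 0.01 over (feature value, isa?) pairs and keep the best-F1 threshold.
    Returns (best_f1, best_t)."""
    best = (0.0, 0.0)
    for i in range(101):
        t = i / 100
        tp = sum(1 for v, y in zip(values, labels) if v > t and y)
        fp = sum(1 for v, y in zip(values, labels) if v > t and not y)
        fn = sum(1 for v, y in zip(values, labels) if v <= t and y)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        best = max(best, (f1, t))   # ties resolved toward the larger threshold
    return best
```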
Baseline Measures
Evaluation
Classifiers: Naïve Bayes, Decision Tree, Perceptron, Multi-layer Perceptron
Evaluation Strategies
Undersampling: remove a number of majority-class examples (non-isa examples)
Oversampling: add additional examples to the minority class
Varying the classification threshold: try threshold values other than 0.5
Introducing a cost matrix: different penalties for different types of misclassification
One-class SVMs: consider only positive examples
Results
Results (cont.)
Discussion
The best results were achieved with the one-class SVM (F = 32.96%)
More than 10 points above the baseline classifier's average (F = 21.28%) and maximum (F = 21%) strategies
More than 14 points better than the best single-feature classifier (F = 18.84%), which used the isawww feature
Second best results obtained with a Multilayer Perceptron using oversampling or undersampling
Discussion
Gain insight from finding which features were most used by classifiers
Used this information to modify features and rerun experiments
Summary
Using different approaches is useful
Machine learning approaches outperform naïve averaging
The unbalanced character of the dataset poses a problem
SVMs (which are not affected by the imbalance) produce the best results
This approach can show which features are the most reliable predictors
Related Work
Taxonomy Construction: lexico-syntactic patterns, clustering, linguistic approaches
Taxonomy Refinement
Taxonomy Extension
Lexico-syntactic patterns
Hearst
Iwanska et al.: added extra patterns
Poesio et al.: anaphoric resolution
Ahmad et al.: applying patterns to specific domains
Etzioni et al.: patterns matched on the WWW
Cederberg and Widdows: precision improved with Latent Semantic Analysis
Others are working on learning patterns automatically
Clustering
Hindle: groups nouns semantically; derives verb-subject and verb-object dependencies from a 6 million word sample of Associated Press news stories
Pereira et al.: top-down soft clustering algorithm with deterministic annealing; words can appear in different clusters (multiple meanings of words)
Caraballo: bottom-up clustering approach to build a hierarchy of nouns; uses conjunctive and appositive constructions for nouns derived from the Wall Street Journal corpus
Clustering (cont.)
The ASIUM System The Mo'K Workbench Grefenstette Gasperin et al. Reinberger et al. Lin et al. CobWeb Crouch et al. Haav Curran et al. Terascale Knowledge Acquisition
Linguistic Approaches
Linguistic analysis exploited more directly, rather than just for feature extraction
OntoLT: uses a shallow parser to label parts of speech and grammatical relations (e.g. HeadNounToClass-ModToSubClass, which maps a common noun to a concept or class)
OntoLearn: analyzes multi-word terms compositionally with respect to an existing semantic resource (WordNet)
Morin et al.: tackle the problem of projecting semantic relations between single terms to multiple terms (e.g. projecting the isa-relation between apple and fruit to an isa-relation between apple juice and fruit juice)
Linguistic Approaches
Sanchez and Moreno – download first n hits for a search word and process the neighborhood linguistically to determine candidate modifiers for the search term
Sabou - inducing concept hierarchies for the purpose of modeling web services (applies methods not to full text, but to Java-documentation of web services)
Taxonomy Refinement
Hearst and Schütze, Widdows, Maedche, Pekar and Staab, Alfonseca et al.
Taxonomy Extension
Agirre et al., Faatz and Steinmetz, Turney
Conclusions
Compared different hierarchical clustering approaches with respect to: effectiveness, speed, traceability
Set-theoretic approaches, such as FCA, can outperform similarity-based approaches.
Conclusions
Presented an algorithm for clustering guided by a hypernym oracle.
More efficient than agglomerative clustering.
Conclusions
Used machine learning techniques to effectively combine different approaches for learning taxonomic relations from text.
A learned model indeed outperforms all single approaches.
Open Issues
Which similarity or weighting measure should be chosen?
Which features should be considered to represent a certain term?
Can features be aggregated to represent a term at a more abstract level?
How should we model polysemy of terms?
Can we automatically induce lexico-syntactic patterns (unsupervised!)?
What other approaches are there for combining different paradigms, and how can we compare them?
Questions