Automated building of
taxonomies for search
engines
Boris Galitsky & Greg Makowski
What can be a scalable way to
automatically build taxonomies
of entities to improve search
relevance?
Taxonomy construction starts from
seed entities and mines the
web for new entities associated
with them.
To form these new entities, machine
learning of syntactic parse trees
(syntactic generalization) is applied:
it finds commonalities between
various search results for existing
entities on the web.
Taxonomy and syntactic
generalization are applied to
relevance improvement in search
and text similarity assessment in
a commercial setting; evaluation
results show a substantial
contribution from both sources.
Automated customer service rep.
Q: Can you reactivate my card which I am trying to use in Nepal?
A: We value you as a customer… We will cancel your card… New card will be mailed to your California address …
A child with a severe form of autism
Q: Can you give your candy to my daughter who is hungry now and is about to cry?
A: No, my mom told me not to feed babies. Its wrapper is nice and blue. I need to wash my hands before I eat it … …
Entities need to
make sense together
Why ontologies are
needed for search
Human and automated agents have
difficulties processing texts if
required ontologies are missing.
Knowing how entities are connected
would improve search results.
The condition “active paddling” is ignored or
misinterpreted, although Google knows that
it is a valid combination (‘paddling’ can be
‘active’)
• In the above example “white water
rafting in Oregon with active
paddling with kids” active is
meaningless without paddling.
• So if the system can’t find answers
with ‘active paddling’, try finding with
‘paddling’, but do not try finding with
‘active’ but without ‘paddling’.
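The relaxation rule above can be sketched in Python. This is a toy illustration, not the paper's code: the dependency map and the function name are assumptions made for the example.

```python
# Sketch: query relaxation that respects term dependencies.
# 'dependent_on' is a hypothetical map from a modifier to the head
# term it depends on (e.g. 'active' is meaningless without 'paddling').

def relax_query(terms, dependent_on):
    """Drop one term at a time, but never keep a modifier whose head is gone."""
    candidates = []
    for i in range(len(terms)):
        relaxed = terms[:i] + terms[i + 1:]
        # discard relaxations that keep a dependent word without its head
        if all(dependent_on.get(t) is None or dependent_on[t] in relaxed
               for t in relaxed):
            candidates.append(relaxed)
    return candidates

deps = {"active": "paddling"}  # 'active' depends on 'paddling'
queries = relax_query(["rafting", "active", "paddling"], deps)
# a relaxation keeping 'active' while dropping 'paddling' is never proposed
```

The system thus tries `{rafting, paddling}` as a fallback but never `{rafting, active}`.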
Difficulty in building
taxonomies
Building, tuning and managing taxonomies and ontologies is
rather costly since a lot of manual operations are required.
A number of studies proposed automated building of
taxonomies based on linguistic resources and/or statistical
machine learning (Kerschberg et al 2003, Liu & Birnbaum
2008, Kozareva et al 2009).
However, most of these approaches have not found practical
applications due to:
– insufficient accuracy of resultant search,
– limited expressiveness of representations of queries of
real users,
– high cost associated with manual construction of
linguistic resources and their limited adjustability.
The main challenge in building a taxonomy tree is to make it
as deep as possible to incorporate longer chains of
relationships, so more specific and more complicated
questions can be answered.
Therefore an automated or semi-automated
approach is required for practical applications.
We propose an automated
taxonomy building
mechanism
• It is based on an initial set of key entities (a
seed) for a given vertical knowledge domain.
• This seed is then automatically extended by
mining web documents which include a
meaning of the current taxonomy node.
• This node is further extended by entities
which are the results of inductive learning of
commonalities between these documents.
• These commonalities are extracted using
an operation of syntactic generalization,
which finds the common parts of syntactic
parse trees of a set of documents obtained
for the current taxonomy node.
Contribution
The proposed taxonomy learning algorithm aims to improve
vertical search relevance and is evaluated in a
number of search-related tasks. The contribution of this
study is three-fold:
• Propose and implement a mechanism for using
taxonomy trees for the deterministic classification of
answers as relevant and irrelevant.
• Implement an algorithm to automate the building of
such a taxonomy for a vertical domain, given a seed of
key entities.
• Design a domain-independent linguistic engine for
finding commonalities/similarities between texts, based
on parse trees, to support (1) and (2).
A number of currently available general-
purpose resources, such as DBpedia,
Freebase, and Yago, assist entity-related
searches but are insufficient to filter out
irrelevant answers that concern a certain
activity with an entity and its multiple
parameters. Vertical ontologies, such
as last.fm for artists, are also helpful for
entity-based searches in vertical domains;
however, their taxonomy trees are rather
shallow, and their usability for recognizing
irrelevant answers is limited.
For a query q with keywords {a b c} and its arbitrary relevant answer A, we define that the query is about b,
is-about(q, {b}), if queries {a b} and {b c} are relevant or marginally relevant to A, and {a c} is irrelevant to
A.
Our definition of query understanding, which is rather narrow, is the ability to say which keywords in the
query are essential (such as b in the above example), so that without them the other query terms
become meaningless. Also, an answer which does not contain b is irrelevant to the query which includes
b.
• For example, is-about({machine, learning, algorithm}, {machine, learning}), is-about({machine,
learning, algorithm}, {algorithm}), is-about({machine, learning, algorithm}, {learning, algorithm}), but
not is-about({machine, learning, algorithm}, {machine}).
• For query {a b c d}, if b is essential (is-about({a b c d}, {b})), c can also be essential when b is in the
query, such that {a b c}, {b c d}, {b c} are relevant, even {a b}, {b d} are (marginally) relevant, but {a d}
is not (is-about({a b c d}, {b c})).
Hence for a query {abcd} and two answers (snippets) {bcd…efg} and {acd…efg}, the former is relevant
and the latter is not.
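The is-about test can be sketched against a relevance oracle. This is a toy sketch: the `relevant` oracle is a hypothetical black box (in practice, a search-engine check of whether a keyword set still retrieves answer A), and the toy oracle below simply treats any keyword set containing 'algorithm' as relevant.

```python
# Sketch: deciding is-about(q, core) from a relevance oracle.
from itertools import combinations

def is_about(query, core, relevant):
    """query is about 'core' if every sub-query keeping the core stays
    relevant, while the non-core terms alone are irrelevant."""
    rest = [t for t in query if t not in core]
    # sub-queries that keep the core must stay relevant
    keeps = all(relevant(set(core) | set(c))
                for r in range(len(rest) + 1)
                for c in combinations(rest, r))
    # the non-core terms on their own must be irrelevant
    loses = not relevant(set(rest))
    return keeps and loses

# toy oracle: any keyword set containing 'algorithm' is relevant
oracle = lambda s: "algorithm" in s

q = ["machine", "learning", "algorithm"]
# is_about(q, ["algorithm"], oracle) holds; is_about(q, ["machine"], oracle) fails,
# since {learning, algorithm} without 'machine' is still relevant
```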
Must-occur keywords
Achieving relevancy using a taxonomy is based on a totally different mechanism than conventional
TF∗IDF-based search.
In the latter, the importance of terms is based on their frequency of occurrence.
For an NL query (not a Boolean query), any term can be omitted in the search result if the rest of the
terms give an acceptable relevancy score.
In the case of a Boolean query, this is true for each of its conjunctive members. However, in taxonomy-
based search we know which terms should occur in the answer and which terms must occur there;
otherwise the search result becomes irrelevant.
Let us consider the totality of keywords in a domain D; these keywords occur in questions and answers in
this domain. There is always a hierarchy among these keywords: some are always more important than
others in the sense of the is-about relation.
• The keyword tax is more important than deduction, individual, return, business.
• is-about({tax deduction}, {tax}) but not is-about({tax deduction}, {deduction}), since without the context of
tax the keyword deduction is ambiguous;
• is-about({individual, tax, return}, {tax, return}) but not is-about({individual, tax, return}, {individual}),
since individual acquires its sense (as an adjective) only in the context of tax.
• At the same time, the above keywords are more important than partial cases of the situations they
denote such as submission deadline;
• is-about({individual, tax, return, submission, deadline}, {individual, tax, return}) but not
is-about({individual, tax, return, submission, deadline}, {submission, deadline})
because submission deadline may refer to something totally different
Hierarchy of must-occur keywords
We introduce a partial order on the set of subsets of
keywords K1, K2 ∈ 2^D:
K1 > K2 iff is-about(K1 ∪ K2, K1) but not is-about(K1 ∪ K2, K2).
We say that a path Tp covers a query Q if the set of keywords
for the nodes of Tp is a super-set of Q. If multiple paths cover a
query Q, producing different intersections Q ∩ Tp, then this
query has multiple meanings in the domain; for each such
meaning a separate set of acceptable answers is expected.
An answer ai ∈ A is acceptable if it includes all essential
(according to is-about) keywords from the query Q as found in
the taxonomy path Tp ∈ T. For any taxonomy path Tp which
covers the question Q (the intersection of their keywords is not
empty), these intersection keywords must be in the
acceptable answer ai:
∀ Tp ∈ T : Tp ∩ Q ≠ ∅ ⇒ (Tp ∩ Q) ⊆ ai.
Answer acceptable given
taxonomy
For a question
(Q) "When can I file extension of time for my tax
return?"
let us imagine two answers:
• (A1) "You need to file form 1234 to request a 4
month extension of time to file your tax return"
• (A2) "You need to download file with extension
'pdf', print and complete it to do your taxes".
We expect the closest taxonomy path to be:
(T) tax - file-return - extension-of-time.
tax is the main entity, file-return we expect to be in the
seed, and extension-of-time would be the learned
entity; so A1 matches the taxonomy and is an
acceptable answer, and A2 is not.
A question and
two answers
Relevance verification algorithm
input: query Q
output: the best answer abest and the set of acceptable answers Aa
1) For a query Q, obtain a set of candidate answers A by available means (using
keywords, using an internal index, or using an external index of search engine APIs);
2) Find a path of the taxonomy Tp which covers the maximal number of terms in Q, along with
other paths which cover Q, to form a set P = {Tp1, Tp2, …}.
Unless an acceptable answer is found:
3) Compute the set Tp ∩ Q.
For each answer ai ∈ A:
4) Compute ai ∩ (Tp ∩ Q) and test whether all essential words from the query
which exist in Tp are also in the answer (acceptability test).
5) Compute the similarity score of Q with each ai.
6) Compute the best answer abest and the set of acceptable answers Aa.
If no acceptable answer is found, return to 2) for the next path from P.
7) Return abest and the set of acceptable answers Aa if available.
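The algorithm above can be sketched on toy data. This is an illustrative simplification, not the paper's implementation: taxonomy paths and answers are plain word lists, and word overlap with the query stands in for the similarity score of step 5.

```python
# Toy sketch of the relevance verification algorithm.
def verify_relevance(query, taxonomy_paths, answers):
    """Return (best_answer, acceptable_answers) for a keyword query."""
    q = set(query)
    # step 2: order paths by how many query terms they cover
    paths = sorted(taxonomy_paths, key=lambda p: len(set(p) & q), reverse=True)
    for path in paths:
        essential = set(path) & q                           # step 3: Tp ∩ Q
        # step 4: acceptability test - all essential words must be in the answer
        acceptable = [a for a in answers
                      if essential and essential <= set(a)]
        if acceptable:
            # steps 5-6: score acceptable answers by overlap with the query
            best = max(acceptable, key=lambda a: len(set(a) & q))
            return best, acceptable
    return None, []                      # no acceptable answer on any path

paths = [["tax", "deduct", "overlook", "mortgage"]]
answers = [["deduct", "tax", "on", "mortgage", "interest"],
           ["download", "file", "with", "extension", "pdf"]]
best, acc = verify_relevance(["deduct", "tax", "mortgage"], paths, answers)
# only the first answer contains all essential keywords, so it is returned
```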
Providing multiple
answers as a result of
default reasoning
Using default logic to handle
ambiguity in search:
building extensions of the default theory for each
meaning.
Facts Si comprise the query representation (occurrences of
words in a query). Default rules establish the meanings of
words based on the other words and the meanings that have
been established. Each successful and closed process yields
an extension (@S1, @S2, …) corresponding to one answer;
an unsuccessful or non-closed process yields no extension.
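The extension-building idea can be sketched on a toy default theory. This is a deliberately simplified model, not the paper's machinery: a default is encoded as (prerequisite, blocking set, conclusion), where the default fires if its prerequisite is believed and none of its blocking beliefs hold, and "extensions" are the distinct fixed points reached by trying the defaults in every order.

```python
# Toy sketch: extensions of a default theory for word meanings.
from itertools import permutations

def extensions(facts, defaults):
    """Each default is (prerequisite_set, blocking_set, conclusion)."""
    results = set()
    for order in permutations(defaults):
        believed = set(facts)
        changed = True
        while changed:
            changed = False
            for pre, block, concl in order:
                # fire the default if its prerequisite holds and nothing blocks it
                if pre <= believed and not (block & believed) and concl not in believed:
                    believed.add(concl)
                    changed = True
        results.add(frozenset(believed))
    return results

# 'file' in a tax query: each sense blocks the other
defaults = [
    ({"file"}, {"file_as_document"}, "file_as_submit"),
    ({"file"}, {"file_as_submit"}, "file_as_document"),
]
exts = extensions({"file", "tax"}, defaults)
# two extensions => two competing meanings => two separate answer sets
```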
A simplified step 1 of ontology learning
Currently available taxonomy path: tax – deduct
1) Get search results for currently available expressions.
2) Select attributes based on their linguistic occurrence (shown in
yellow).
3) Find common attributes
(commonalities between
search results, shown in
red, like ‘overlook’).
4) Extend the taxonomy
path by adding the newly
acquired attribute:
tax – deduct – overlook
Step 2 of ontology learning (more details)
Currently available taxonomy path: tax – deduct – overlook
1) Get search results.
2) Select attributes based on their linguistic occurrence (modifiers of
entities from the current taxonomy path).
3) Find common expressions between search results as a syntactic
generalization, like ‘PRP-mortgage’.
4) Extend the taxonomy path
by adding the newly acquired
attributes:
tax – deduct – overlook – mortgage,
tax – deduct – overlook – no_itemize
…
Step 3 of ontology learning
Currently available taxonomy path: tax – deduct – overlook –
mortgage
1) Get search results.
2) Perform syntactic generalization, finding common maximal parse
sub-trees excluding the current taxonomy path.
3) If there is nothing in common any more, this is a taxonomy leaf (stop
growing the current path).
Possible learning results
(taxonomy fragment)
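One extension step of the loop above can be sketched in Python. This is a toy stand-in: plain word-set intersection replaces full syntactic generalization, the stopword list is ad hoc, and the snippets are made-up substitutes for real web search results.

```python
# Sketch of one taxonomy-learning step: extend the current path with
# attributes common to the search results obtained for that path.
def extend_path(path, snippets,
                stopwords=frozenset({"the", "a", "to", "for", "do", "not"})):
    """Return new taxonomy paths, one per common attribute found."""
    word_sets = [set(s.lower().split()) - set(path) - stopwords
                 for s in snippets]
    common = set.intersection(*word_sets) if word_sets else set()
    if not common:
        return []          # nothing in common: this node is a leaf
    return [path + [w] for w in sorted(common)]

snippets = ["Do not overlook the tax deduction for mortgage interest",
            "Taxpayers overlook mortgage deduction opportunities"]
new_paths = extend_path(["tax", "deduction", "overlook"], snippets)
# 'mortgage' is the only word shared by both snippets outside the path,
# so the path grows to tax - deduction - overlook - mortgage
```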
If a keyword is in a query, and in the
closest taxonomy path, it HAS TO BE
in the answer
Query:
can I deduct tax on
mortgage escrow account?
Closest taxonomy path:
tax – deduct – overlook –
mortgage – escrow_account
Then the keywords/multiwords
have to be in the answer:
{deduct, tax, mortgage,
escrow_account}
Wrong answers
sell_hobby=>[[deductions, collection], [making, collection],
[sales, business, collection], [collectibles, collection], [loss,
hobby, collection], [item, collection], [selling, business,
collection], [pay, collection], [stamp, collection], [deduction,
collection], [car, collection], [sell, business, collection], [loss,
collection]]
benefit=>[[office, child, parent], [credit, child, parent],
[credits, child, parent], [support, child, parent], [making, child,
parent], [income, child, parent], [resides, child, parent],
[taxpayer, child, parent], [passed, child, parent], [claiming,
child, parent], [exclusion, child, parent], [surviving, benefits,
child, parent], [reporting, child, parent]]
hardship=>[[apply, undue], [taxpayer, undue], [irs, undue],
[help, undue], [deductions, undue], [credits, undue], [cause,
undue], [means, required, undue], [court, undue]]
Taxonomy fragment
Improving the precision of text similarity:
articles, blogs, tweets, images and
videos
We verify if an image belongs here, based on
its caption.
Using syntactic generalization to assess
relevance
Generalizing two sentences and
its application
Improvement of search relevance by checking syntactic similarity between the query and sentences in search hits. Syntactic similarity is measured via generalization.
Such syntactic similarity is important when a search query contains keywords
which form a phrase, a domain-specific expression, or an idiom, such as “shot to
shot time” or “high number of shots in a short amount of time”.
Based on syntactic similarity, search results can be re-sorted using the obtained similarity score.
Based on generalization, we can distinguish meaningful (informative) and
meaningless (uninformative) opinions, having collected respective datasets
Meaningful sentence to
be shown as
search result
Not very meaningful sentence to be shown,
even if it matches the
search query
Generalizing sentences & phrases
noun phrase [ [JJ-* NN-zoom NN-* ], [JJ-digital NN-camera ]]
About ZOOM and DIGITAL CAMERA
verb phrase [ [VBP-* ADJP-* NN-zoom NN-camera ], [VB-* NN-
zoom IN-* NN-camera ]]
To do something with ZOOM –…- CAMERA
prepositional phrase [ [IN-* NN-camera ], [IN-for NN-* ]]
With/for/to/in CAMERA, FOR something
Obtain parse trees. Group by sub-trees for each phrase type
Extend list of phrases by paraphrasing (semantically equivalent expressions)
For every phrase type
For each pair of tree lists, perform pair-wise generalization
For a pair of trees, perform alignment
For a pair of words (nodes), generalize them
Remove more general trees (if less general exist) from the resultant list
VP [VB-use DT-the JJ-digital NN-zoom IN-of DT-this NN-
camera IN-for VBG-filming NNS-insects ] +
VP [VB-get JJ-short NN-focus NN-zoom NN-lens IN-for JJ-
digital NN-camera ]
=
[VB-* JJ-* NN-zoom NN-* IN-for NNS-* ]
score = score(NN) + score(PREP) + 3*score(<POS*>)
Meaning:
“Do-something with some-kind-of ZOOM something FOR
something-else”
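The pairwise generalization above can be sketched at the word-node level. This is an illustrative simplification: nodes are (POS, lemma) pairs aligned by position, and the scoring weights (2.0 for a kept lemma, 0.5 for a wildcard) are assumptions for the example, not the paper's actual score formula.

```python
# Sketch of word-level generalization between two POS-tagged phrases:
# equal lemmas are kept, same-POS mismatches become wildcards (POS-*),
# a POS mismatch yields nothing.
def generalize_nodes(a, b):
    """a, b: (pos, lemma) pairs aligned at the same position."""
    (pos1, lem1), (pos2, lem2) = a, b
    if pos1 != pos2:
        return None
    return (pos1, lem1 if lem1 == lem2 else "*")

def generalize_phrase(p1, p2):
    result = [g for g in map(generalize_nodes, p1, p2) if g]
    # illustrative weights: kept lemma counts more than a wildcard
    score = sum(2.0 if lem != "*" else 0.5 for _, lem in result)
    return result, score

vp1 = [("VB", "use"), ("NN", "zoom"), ("IN", "for"), ("NN", "filming")]
vp2 = [("VB", "get"), ("NN", "zoom"), ("IN", "for"), ("NN", "camera")]
gen, score = generalize_phrase(vp1, vp2)
# gen keeps NN-zoom and IN-for, and wildcards the differing words:
# "do-something with ZOOM for something-else"
```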
Generalizing phrases
Deriving a meaning by
generalization
Generalization: from words to
phrases to sentences to paragraphs
Syntactic generalization
helps with microtext when
ontology use is limited
Learning similarity between syntactic
trees
1. Obtain a parse tree for each sentence. For each word (tree node)
we have its lemma, part of speech and word-form information, as
well as an arc to the other node.
2. Split sentences into sub-trees which are phrases of each type:
verb, noun, prepositional and others; these sub-trees are
overlapping. The sub-trees are coded so that information about
occurrence in the full tree is retained.
3. Group all sub-trees by phrase type.
4. Extend the list of phrases by adding equivalence transformations,
then generalize each pair of sub-trees of both sentences for each
phrase type.
5. For each pair of sub-trees, yield the alignment, and then generalize
each node for this alignment. For the obtained set of trees
(generalization results), calculate the score.
6. For each pair of sub-trees for phrases, select the set of
generalizations with the highest score (least general).
7. Form the sets of generalizations for each phrase type whose
elements are sets of generalizations for this type.
8. Filter the list of generalization results: for the list of
generalizations for each phrase type, exclude more general
elements from the lists of generalizations for a given pair of phrases.
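Step 8, filtering out the more general generalization results, can be sketched as follows. The representation is an illustrative assumption: a result is a tuple of (POS, lemma) nodes, and one result is strictly more general than another if it can be obtained from it by turning lemmas into wildcards.

```python
# Sketch of step 8: keep only least-general generalization results.
def more_general(r1, r2):
    """True if r1 is strictly more general than r2 (wildcards subsume lemmas)."""
    if len(r1) != len(r2) or r1 == r2:
        return False
    return all(p1 == p2 and (l1 == l2 or l1 == "*")
               for (p1, l1), (p2, l2) in zip(r1, r2))

def filter_least_general(results):
    # drop any result that is strictly more general than some other result
    return [r for r in results
            if not any(more_general(r, other) for other in results)]

results = [(("NN", "zoom"), ("NN", "camera")),
           (("NN", "*"), ("NN", "camera")),    # more general: dropped
           (("NN", "zoom"), ("NN", "*"))]      # more general: dropped
kept = filter_least_general(results)
```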
Generalization of semantic role
expressions
Generalization algorithm
Evaluation
Media / method of text similarity assessment | Full-size news articles | Abstracts of articles | Blog postings | Comments | Images | Videos
Frequencies of terms in documents | 29.3% | 26.1% | 31.4% | 32.0% | 24.1% | 25.2%
Syntactic generalization | 17.8% | 18.4% | 20.8% | 27.1% | 20.1% | 19.0%
Taxonomy-based | 45.0% | 41.7% | 44.9% | 52.3% | 44.8% | 43.1%
Hybrid (taxonomy + syntactic) | 13.2% | 13.6% | 15.5% | 22.1% | 18.2% | 18.0%
Hybrid approach improves text
similarity/relevance assessment
Ordering of search results based on
generalization, taxonomy, and conventional
search engine
Classification of short texts
Evaluation in vertical search domain
Query phrase sub-type | Relevancy of baseline Yahoo search, %, averaging over 20 searches | Relevancy of baseline Bing search, %, averaging over 20 searches | Relevancy of re-sorting by generalization, %, averaging over 20 searches | Relevancy of re-sorting by using taxonomy, %, averaging over 20 searches | Relevancy of re-sorting by using taxonomy and generalization, %, averaging over 20 searches | Relevancy improvement for hybrid approach, comp. to baseline (averaged for Bing & Yahoo)
3-4 word phrases:
noun phrase | 86.7 | 85.4 | 87.1 | 93.5 | 93.6 | 1.088
verb phrase | 83.4 | 82.9 | 79.9 | 92.1 | 92.8 | 1.116
how-to expression | 76.7 | 78.2 | 79.5 | 93.4 | 93.3 | 1.205
average | 82.3 | 82.2 | 82.2 | 93.0 | 93.2 | 1.134
5-10 word phrases:
noun phrase | 84.1 | 84.9 | 87.3 | 91.7 | 92.1 | 1.090
verb phrase | 83.5 | 82.7 | 86.1 | 92.4 | 93.4 | 1.124
how-to expression | 82.0 | 82.9 | 82.1 | 88.9 | 91.6 | 1.111
average | 83.2 | 83.5 | 85.2 | 91.0 | 92.4 | 1.108
2-3 sentences:
one verb one noun phrases | 68.8 | 67.6 | 69.1 | 81.2 | 83.1 | 1.218
both verb phrases | 66.3 | 67.1 | 71.2 | 77.4 | 78.3 | 1.174
one sentence of how-to type | 66.1 | 68.3 | 73.2 | 79.2 | 80.9 | 1.204
average | 67.1 | 67.7 | 71.2 | 79.3 | 80.8 | 1.199
This evaluation is the focus of this
study. The higher the complexity of
the query, the stronger the
contribution of the hybrid system.
OpenNLP Contribution
There are four Java classes for building and running a taxonomy:
TaxonomyExtenderSearchResultFromYahoo.java performs web mining by
taking the current taxonomy path, submitting the formed keywords to the Yahoo API web
search, obtaining snippets and possibly fragments of webpages, and extracting
commonalities between them to add the next node to the taxonomy. Various
machine learning components for forming commonalities will be integrated in
future versions, maintaining hypotheses in various ways.
TaxoQuerySnapshotMatcher.java is used in real time to obtain a taxonomy-
based relevance score between a question and an answer.
TaxonomySerializer.java is used to write the taxonomy in a specified format: binary,
text or XML.
AriAdapter.java is used to import seed taxonomy data from a PROLOG
ontology; in future versions of the taxonomy builder more seed formats and options
will be supported.
Related Work
• Mapping to First Order Logic representations with a general prover and without
using acquired rich knowledge sources
• Semantic entailment [de Salvo Braz et al 2005]
• Semantic Role Labeling: for each verb in a sentence, the goal is to identify all
constituents that fill a semantic role, and to determine their roles, such as Agent,
Patient or Instrument [Punyakanok et al 2005]
• A generic semantic inference framework that operates directly on syntactic trees.
New trees are inferred by applying entailment rules, which provide a unified
representation for varying types of inferences [Bar-Haim et al 2005]
• A generic paraphrase-based approach for a specific case such as relation extraction,
to obtain a generic configuration for relations between objects from text [Romano et
al 2006]
Conclusions
Ontologies are a more sensitive way to match
keywords (compared to bag-of-words and TF*IDF).
When text for indexing includes abbreviations and
acronyms, and we don’t ‘know’ all mappings,
semantic analysis should be tolerant to the omission of
some entities and still understand “what this text
fragment is about”.
Since we are unable to filter out noise “statistically”,
like most NLP environments do, we have to rely on
ontologies.
Syntactic generalization takes the bag-of-words and
pattern-matching classes of approaches to the next
level, allowing unknown words to be treated systematically
as long as their part-of-speech information is
available from context.
We proposed a taxonomy-building mechanism
for a vertical domain, extending approaches
where a taxonomy is formed based on:
- specific semantic rules,
- specific semantic templates, or
- a limited corpus of texts.
Relying on a web search engine API for taxonomy
construction, we leverage not only the
whole web universe of texts, but also the
meanings formed by search engines as a result
of learning from user search sessions.
When a user selects certain search results, a
web search engine acquires a set of
associations between entities in questions and
entities in answers. These associations are then
used by our taxonomy learning process to find
adequate parameters for entities being learned
at a current taxonomy building step.