Automated building of
taxonomies for search
engines
Boris Galitsky & Greg Makowski
What can be a scalable way to
automatically build taxonomies
of entities to improve search
relevance?
Taxonomy construction starts from
seed entities and mines the
web for new entities associated
with them.
To form these new entities, machine
learning of syntactic parse trees
(syntactic generalization) is applied:
it finds commonalities between
various search results for existing
entities on the web.
Taxonomy and syntactic
generalization are applied to
relevance improvement in search
and text similarity assessment in
a commercial setting; evaluation
results show a substantial
contribution from both sources.
Automated customer service rep.
Q: Can you reactivate my card which I am trying to use in Nepal?
A: We value you as a customer… We will cancel your card… New card will be mailed to your California address …
A child with a severe form of autism
Q: Can you give your candy to my daughter who is hungry now and is about to cry?
A: No, my mom told me not to feed babies. Its wrapper is nice and blue. I need to wash my hands before I eat it … …
Entities need to
make sense together
Why ontologies are
needed for search
Human and automated agents have
difficulties processing texts if
required ontologies are missing.
Knowing how entities are connected
would improve search results.
The condition “active paddling” is ignored or
misinterpreted, although Google knows that
it is a valid combination (‘paddling’ can be
‘active’)
• In the above example “white water
rafting in Oregon with active
paddling with kids” active is
meaningless without paddling.
• So if the system can’t find answers
with ‘active paddling’, try finding with
‘paddling’, but do not try finding with
‘active’ but without ‘paddling’.
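The relaxation rule above can be sketched in Python. This is a toy illustration, not the paper's code: the dependency map and the function name are assumptions made for the example.

```python
# Sketch: query relaxation that respects term dependencies.
# 'dependent_on' is a hypothetical map from a modifier to the head
# term it depends on (e.g. 'active' is meaningless without 'paddling').

def relax_query(terms, dependent_on):
    """Drop one term at a time, but never keep a modifier whose head is gone."""
    candidates = []
    for i in range(len(terms)):
        relaxed = terms[:i] + terms[i + 1:]
        # discard relaxations that keep a dependent word without its head
        if all(dependent_on.get(t) is None or dependent_on[t] in relaxed
               for t in relaxed):
            candidates.append(relaxed)
    return candidates

deps = {"active": "paddling"}  # 'active' depends on 'paddling'
queries = relax_query(["rafting", "active", "paddling"], deps)
# a relaxation keeping 'active' while dropping 'paddling' is never proposed
```

The system thus tries `{rafting, paddling}` as a fallback but never `{rafting, active}`.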
Difficulty in building
taxonomies
Building, tuning and managing taxonomies and ontologies is
rather costly since a lot of manual operations are required.
A number of studies proposed automated building of
taxonomies based on linguistic resources and/or statistical
machine learning (Kerschberg et al 2003, Liu & Birnbaum
2008, Kozareva et al 2009).
However, most of these approaches have not found practical
applications due to:
– insufficient accuracy of resultant search,
– limited expressiveness of representations of queries of
real users,
– high cost associated with manual construction of
linguistic resources and their limited adjustability.
The main challenge in building a taxonomy tree is to make it
as deep as possible to incorporate longer chains of
relationships, so more specific and more complicated
questions can be answered.
Therefore an automated or semi-automated
approach is required for practical applications.
We propose an automated
taxonomy building
mechanism
• It is based on an initial set of key entities (a
seed) for a given vertical knowledge domain.
• This seed is then automatically extended by
mining web documents which include a
meaning of the current taxonomy node.
• This node is further extended by entities
which are the results of inductive learning of
commonalities between these documents.
• These commonalities are extracted using
an operation of syntactic generalization,
which finds the common parts of syntactic
parse trees of a set of documents obtained
for the current taxonomy node.
Contribution
The proposed taxonomy learning algorithm aims to improve
vertical search relevance and is evaluated in a
number of search-related tasks. The contribution of this
study is three-fold:
• Propose and implement a mechanism for using
taxonomy trees for the deterministic classification of
answers as relevant and irrelevant.
• Implement an algorithm to automate the building of
such a taxonomy for a vertical domain, given a seed of
key entities.
• Design a domain-independent linguistic engine for
finding commonalities/similarities between texts, based
on parse trees, to support (1) and (2).
A number of currently available general-
purpose resources, such as DBpedia,
Freebase, and Yago, assist entity-related
searches but are insufficient to filter out
irrelevant answers that concern a certain
activity with an entity and its multiple
parameters. Vertical ontologies, such
as last.fm for artists, are also helpful for
entity-based searches in vertical domains;
however, their taxonomy trees are rather
shallow, and their usability for recognizing
irrelevant answers is limited.
For a query q with keywords {a b c} and its arbitrary relevant answer A, we define that the query is about b,
is-about(q, {b}), if queries {a b} and {b c} are relevant or marginally relevant to A, and {a c} is irrelevant to
A.
Our definition of query understanding, which is rather narrow, is the ability to say which keywords in the
query are essential (such as b in the above example), so that without them the other query terms
become meaningless. Also, an answer which does not contain b is irrelevant to the query which includes
b.
• For example, is-about({machine, learning, algorithm}, {machine, learning}), is-about({machine,
learning, algorithm}, {algorithm}), is-about({machine, learning, algorithm}, {learning, algorithm}), but
not is-about({machine, learning, algorithm}, {machine}).
• For query {a b c d}, if b is essential (is-about({a b c d}, {b})), c can also be essential when b is in the
query, such that {a b c}, {b c d}, {b c} are relevant, even {a b}, {b d} are (marginally) relevant, but {a d}
is not (is-about({a b c d}, {b c})).
Hence for a query {abcd} and two answers (snippets) {bcd…efg} and {acd…efg}, the former is relevant
and the latter is not.
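The is-about test can be sketched against a relevance oracle. This is a toy sketch: the `relevant` oracle is a hypothetical black box (in practice, a search-engine check of whether a keyword set still retrieves answer A), and the toy oracle below simply treats any keyword set containing 'algorithm' as relevant.

```python
# Sketch: deciding is-about(q, core) from a relevance oracle.
from itertools import combinations

def is_about(query, core, relevant):
    """query is about 'core' if every sub-query keeping the core stays
    relevant, while the non-core terms alone are irrelevant."""
    rest = [t for t in query if t not in core]
    # sub-queries that keep the core must stay relevant
    keeps = all(relevant(set(core) | set(c))
                for r in range(len(rest) + 1)
                for c in combinations(rest, r))
    # the non-core terms on their own must be irrelevant
    loses = not relevant(set(rest))
    return keeps and loses

# toy oracle: any keyword set containing 'algorithm' is relevant
oracle = lambda s: "algorithm" in s

q = ["machine", "learning", "algorithm"]
# is_about(q, ["algorithm"], oracle) holds; is_about(q, ["machine"], oracle) fails,
# since {learning, algorithm} without 'machine' is still relevant
```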
Must-occur keywords
Achieving relevancy using a taxonomy is based on a totally different mechanism than conventional
TF∗IDF-based search.
In the latter, the importance of terms is based on their frequency of occurrence.
For an NL query (not a Boolean query), any term can be omitted in the search result if the rest of the
terms give an acceptable relevancy score.
In the case of a Boolean query, this is true for each of its conjunctive members. However, in taxonomy-
based search we know which terms should occur in the answer and which terms must occur there;
otherwise the search result becomes irrelevant.
Let us consider the totality of keywords in a domain D; these keywords occur in questions and answers in
this domain. There is always a hierarchy among these keywords: some are always more important than
others in the sense of the is-about relation.
• The keyword tax is more important than deduction, individual, return, business.
• is-about({tax deduction}, {tax}) but not is-about({tax deduction}, {deduction}), since without the context of
tax the keyword deduction is ambiguous;
• is-about({individual, tax, return}, {tax, return}) but not is-about({individual, tax, return}, {individual}),
since individual acquires its sense (as an adjective) only in the context of tax.
• At the same time, the above keywords are more important than partial cases of the situations they
denote such as submission deadline;
• is-about({individual, tax, return, submission, deadline}, {individual, tax, return}) but not
is-about({individual, tax, return, submission, deadline}, {submission, deadline})
because submission deadline may refer to something totally different
Hierarchy of must-occur keywords
We introduce a partial order on the set of subsets of
keywords K1, K2 ∈ 2^D:
K1 > K2 iff is-about(K1 ∪ K2, K1) but not is-about(K1 ∪ K2, K2).
We say that a path Tp covers a query Q if the set of keywords
for the nodes of Tp is a super-set of Q. If multiple paths cover a
query Q, producing different intersections Q ∩ Tp, then this
query has multiple meanings in the domain; for each such
meaning a separate set of acceptable answers is expected.
An answer ai ∈ A is acceptable if it includes all essential
(according to is-about) keywords from the query Q as found in
the taxonomy path Tp ∈ T. For any taxonomy path Tp which
covers the question Q (the intersection of their keywords is not
empty), these intersection keywords must be in the
acceptable answer ai:
∀ Tp ∈ T : Tp ∩ Q ≠ ∅ ⇒ (Tp ∩ Q) ⊆ ai.
Answer acceptable given
taxonomy
For a question
(Q) "When can I file extension of time for my tax
return?"
let us imagine two answers:
• (A1) "You need to file form 1234 to request a 4
month extension of time to file your tax return"
• (A2) "You need to download file with extension
'pdf', print and complete it to do your taxes".
We expect the closest taxonomy path to be:
(T) tax - file-return - extension-of-time.
tax is the main entity, file-return we expect to be in the
seed, and extension-of-time would be the learned
entity; so A1 matches the taxonomy and is an
acceptable answer, and A2 is not.
A question and
two answers
Relevance verification algorithm
input: query Q
output: the best answer abest and the set of acceptable answers Aa
1) For a query Q, obtain a set of candidate answers A by available means (using
keywords, using an internal index, or using an external index of search engine APIs);
2) Find a path of the taxonomy Tp which covers the maximal number of terms in Q, along with
other paths which cover Q, to form a set P = {Tp1, Tp2, …}.
Unless an acceptable answer is found:
3) Compute the set Tp ∩ Q.
For each answer ai ∈ A:
4) Compute ai ∩ (Tp ∩ Q) and test whether all essential words from the query
which exist in Tp are also in the answer (acceptability test).
5) Compute the similarity score of Q with each ai.
6) Compute the best answer abest and the set of acceptable answers Aa.
If no acceptable answer is found, return to 2) for the next path from P.
7) Return abest and the set of acceptable answers Aa if available.
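The algorithm above can be sketched on toy data. This is an illustrative simplification, not the paper's implementation: taxonomy paths and answers are plain word lists, and word overlap with the query stands in for the similarity score of step 5.

```python
# Toy sketch of the relevance verification algorithm.
def verify_relevance(query, taxonomy_paths, answers):
    """Return (best_answer, acceptable_answers) for a keyword query."""
    q = set(query)
    # step 2: order paths by how many query terms they cover
    paths = sorted(taxonomy_paths, key=lambda p: len(set(p) & q), reverse=True)
    for path in paths:
        essential = set(path) & q                           # step 3: Tp ∩ Q
        # step 4: acceptability test - all essential words must be in the answer
        acceptable = [a for a in answers
                      if essential and essential <= set(a)]
        if acceptable:
            # steps 5-6: score acceptable answers by overlap with the query
            best = max(acceptable, key=lambda a: len(set(a) & q))
            return best, acceptable
    return None, []                      # no acceptable answer on any path

paths = [["tax", "deduct", "overlook", "mortgage"]]
answers = [["deduct", "tax", "on", "mortgage", "interest"],
           ["download", "file", "with", "extension", "pdf"]]
best, acc = verify_relevance(["deduct", "tax", "mortgage"], paths, answers)
# only the first answer contains all essential keywords, so it is returned
```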
Providing multiple
answers as a result of
default reasoning
Using default logic to handle
ambiguity in search:
building extensions of the default theory for each
meaning.
Facts Si comprise the query representation (occurrences of
words in a query). Default rules establish the meanings of
words based on the other words and the meanings that have
been established. Each successful and closed process yields
an extension (@S1, @S2, …) corresponding to one answer;
an unsuccessful or non-closed process yields no extension.
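The extension-building idea can be sketched on a toy default theory. This is a deliberately simplified model, not the paper's machinery: a default is encoded as (prerequisite, blocking set, conclusion), where the default fires if its prerequisite is believed and none of its blocking beliefs hold, and "extensions" are the distinct fixed points reached by trying the defaults in every order.

```python
# Toy sketch: extensions of a default theory for word meanings.
from itertools import permutations

def extensions(facts, defaults):
    """Each default is (prerequisite_set, blocking_set, conclusion)."""
    results = set()
    for order in permutations(defaults):
        believed = set(facts)
        changed = True
        while changed:
            changed = False
            for pre, block, concl in order:
                # fire the default if its prerequisite holds and nothing blocks it
                if pre <= believed and not (block & believed) and concl not in believed:
                    believed.add(concl)
                    changed = True
        results.add(frozenset(believed))
    return results

# 'file' in a tax query: each sense blocks the other
defaults = [
    ({"file"}, {"file_as_document"}, "file_as_submit"),
    ({"file"}, {"file_as_submit"}, "file_as_document"),
]
exts = extensions({"file", "tax"}, defaults)
# two extensions => two competing meanings => two separate answer sets
```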
A simplified step 1 of ontology learning
Currently available taxonomy path: tax – deduct
1) Get search results for currently available expressions.
2) Select attributes based on their linguistic occurrence (shown in
yellow).
3) Find common attributes
(commonalities between
search results, shown in
red, like ‘overlook’).
4) Extend the taxonomy
path by adding the newly
acquired attribute:
tax – deduct – overlook
Step 2 of ontology learning (more details)
Currently available taxonomy path: tax – deduct – overlook
1) Get search results.
2) Select attributes based on their linguistic occurrence (modifiers of
entities from the current taxonomy path).
3) Find common expressions between search results as a syntactic
generalization, like ‘PRP-mortgage’.
4) Extend the taxonomy path
by adding the newly acquired
attributes:
tax – deduct – overlook – mortgage,
tax – deduct – overlook – no_itemize
…
Step 3 of ontology learning
Currently available taxonomy path: tax – deduct – overlook –
mortgage
1) Get search results.
2) Perform syntactic generalization, finding common maximal parse
sub-trees excluding the current taxonomy path.
3) If there is nothing in common any more, this is a taxonomy leaf (stop
growing the current path).
Possible learning results
(taxonomy fragment)
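One extension step of the loop above can be sketched in Python. This is a toy stand-in: plain word-set intersection replaces full syntactic generalization, the stopword list is ad hoc, and the snippets are made-up substitutes for real web search results.

```python
# Sketch of one taxonomy-learning step: extend the current path with
# attributes common to the search results obtained for that path.
def extend_path(path, snippets,
                stopwords=frozenset({"the", "a", "to", "for", "do", "not"})):
    """Return new taxonomy paths, one per common attribute found."""
    word_sets = [set(s.lower().split()) - set(path) - stopwords
                 for s in snippets]
    common = set.intersection(*word_sets) if word_sets else set()
    if not common:
        return []          # nothing in common: this node is a leaf
    return [path + [w] for w in sorted(common)]

snippets = ["Do not overlook the tax deduction for mortgage interest",
            "Taxpayers overlook mortgage deduction opportunities"]
new_paths = extend_path(["tax", "deduction", "overlook"], snippets)
# 'mortgage' is the only word shared by both snippets outside the path,
# so the path grows to tax - deduction - overlook - mortgage
```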
If a keyword is in a query, and in the
closest taxonomy path, it HAS TO BE
in the answer
Query:
can I deduct tax on
mortgage escrow account?
Closest taxonomy path:
tax – deduct – overlook –
mortgage – escrow_account
Then the keywords/multiwords
have to be in the answer:
{deduct, tax, mortgage,
escrow_account}
Wrong answers
sell_hobby=>[[deductions, collection], [making, collection],
[sales, business, collection], [collectibles, collection], [loss,
hobby, collection], [item, collection], [selling, business,
collection], [pay, collection], [stamp, collection], [deduction,
collection], [car, collection], [sell, business, collection], [loss,
collection]]
benefit=>[[office, child, parent], [credit, child, parent],
[credits, child, parent], [support, child, parent], [making, child,
parent], [income, child, parent], [resides, child, parent],
[taxpayer, child, parent], [passed, child, parent], [claiming,
child, parent], [exclusion, child, parent], [surviving, benefits,
child, parent], [reporting, child, parent]]
hardship=>[[apply, undue], [taxpayer, undue], [irs, undue],
[help, undue], [deductions, undue], [credits, undue], [cause,
undue], [means, required, undue], [court, undue]]
Taxonomy fragment
Improving the precision of text similarity:
articles, blogs, tweets, images and
videos
We verify if an image belongs here, based on
its caption.
Using syntactic generalization to assess
relevance
Generalizing two sentences and
its application
Improvement of search relevance by checking syntactic similarity between the query and sentences in search hits. Syntactic similarity is measured via generalization.
Such syntactic similarity is important when a search query contains keywords
which form a phrase, a domain-specific expression, or an idiom, such as “shot to
shot time” or “high number of shots in a short amount of time”.
Based on syntactic similarity, search results can be re-sorted using the obtained similarity score.
Based on generalization, we can distinguish meaningful (informative) and
meaningless (uninformative) opinions, having collected respective datasets
Meaningful sentence to
be shown as
search result
Not very meaningful sentence to be shown,
even if it matches the
search query
Generalizing sentences & phrases
noun phrase [ [JJ-* NN-zoom NN-* ], [JJ-digital NN-camera ]]
About ZOOM and DIGITAL CAMERA
verb phrase [ [VBP-* ADJP-* NN-zoom NN-camera ], [VB-* NN-
zoom IN-* NN-camera ]]
To do something with ZOOM –…- CAMERA
prepositional phrase [ [IN-* NN-camera ], [IN-for NN-* ]]
With/for/to/in CAMERA, FOR something
Obtain parse trees. Group by sub-trees for each phrase type
Extend list of phrases by paraphrasing (semantically equivalent expressions)
For every phrase type
For each pair of tree lists, perform pair-wise generalization
For a pair of trees, perform alignment
For a pair of words (nodes), generalize them
Remove more general trees (if less general exist) from the resultant list
VP [VB-use DT-the JJ-digital NN-zoom IN-of DT-this NN-
camera IN-for VBG-filming NNS-insects ] +
VP [VB-get JJ-short NN-focus NN-zoom NN-lens IN-for JJ-
digital NN-camera ]
=
[VB-* JJ-* NN-zoom NN-* IN-for NNS-* ]
score = score(NN) + score(PREP) + 3*score(<POS*>)
Meaning:
“Do-something with some-kind-of ZOOM something FOR
something-else”
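The pairwise generalization above can be sketched at the word-node level. This is an illustrative simplification: nodes are (POS, lemma) pairs aligned by position, and the scoring weights (2.0 for a kept lemma, 0.5 for a wildcard) are assumptions for the example, not the paper's actual score formula.

```python
# Sketch of word-level generalization between two POS-tagged phrases:
# equal lemmas are kept, same-POS mismatches become wildcards (POS-*),
# a POS mismatch yields nothing.
def generalize_nodes(a, b):
    """a, b: (pos, lemma) pairs aligned at the same position."""
    (pos1, lem1), (pos2, lem2) = a, b
    if pos1 != pos2:
        return None
    return (pos1, lem1 if lem1 == lem2 else "*")

def generalize_phrase(p1, p2):
    result = [g for g in map(generalize_nodes, p1, p2) if g]
    # illustrative weights: kept lemma counts more than a wildcard
    score = sum(2.0 if lem != "*" else 0.5 for _, lem in result)
    return result, score

vp1 = [("VB", "use"), ("NN", "zoom"), ("IN", "for"), ("NN", "filming")]
vp2 = [("VB", "get"), ("NN", "zoom"), ("IN", "for"), ("NN", "camera")]
gen, score = generalize_phrase(vp1, vp2)
# gen keeps NN-zoom and IN-for, and wildcards the differing words:
# "do-something with ZOOM for something-else"
```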
Generalizing phrases
Deriving a meaning by
generalization
Generalization: from words to
phrases to sentences to paragraphs
Syntactic generalization
helps with microtext when
ontology use is limited
Learning similarity between syntactic
trees
1. Obtain a parse tree for each sentence. For each word (tree node)
we have its lemma, part of speech and word-form information, as
well as an arc to the other node.
2. Split sentences into sub-trees which are phrases of each type:
verb, noun, prepositional and others; these sub-trees are
overlapping. The sub-trees are coded so that information about
occurrence in the full tree is retained.
3. Group all sub-trees by phrase type.
4. Extend the list of phrases by adding equivalence transformations,
then generalize each pair of sub-trees of both sentences for each
phrase type.
5. For each pair of sub-trees, yield the alignment, and then generalize
each node for this alignment. For the obtained set of trees
(generalization results), calculate the score.
6. For each pair of sub-trees for phrases, select the set of
generalizations with the highest score (least general).
7. Form the sets of generalizations for each phrase type whose
elements are sets of generalizations for this type.
8. Filter the list of generalization results: for the list of
generalizations for each phrase type, exclude more general
elements from the lists of generalizations for a given pair of phrases.
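Step 8, filtering out the more general generalization results, can be sketched as follows. The representation is an illustrative assumption: a result is a tuple of (POS, lemma) nodes, and one result is strictly more general than another if it can be obtained from it by turning lemmas into wildcards.

```python
# Sketch of step 8: keep only least-general generalization results.
def more_general(r1, r2):
    """True if r1 is strictly more general than r2 (wildcards subsume lemmas)."""
    if len(r1) != len(r2) or r1 == r2:
        return False
    return all(p1 == p2 and (l1 == l2 or l1 == "*")
               for (p1, l1), (p2, l2) in zip(r1, r2))

def filter_least_general(results):
    # drop any result that is strictly more general than some other result
    return [r for r in results
            if not any(more_general(r, other) for other in results)]

results = [(("NN", "zoom"), ("NN", "camera")),
           (("NN", "*"), ("NN", "camera")),    # more general: dropped
           (("NN", "zoom"), ("NN", "*"))]      # more general: dropped
kept = filter_least_general(results)
```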
Generalization of semantic role
expressions
Generalization algorithm
Evaluation
Media / method of text similarity assessment | Full-size news articles | Abstracts of articles | Blog postings | Comments | Images | Videos
Frequencies of terms in documents | 29.3% | 26.1% | 31.4% | 32.0% | 24.1% | 25.2%
Syntactic generalization | 17.8% | 18.4% | 20.8% | 27.1% | 20.1% | 19.0%
Taxonomy-based | 45.0% | 41.7% | 44.9% | 52.3% | 44.8% | 43.1%
Hybrid (taxonomy + syntactic) | 13.2% | 13.6% | 15.5% | 22.1% | 18.2% | 18.0%
Hybrid approach improves text
similarity/relevance assessment
Ordering of search results based on
generalization, taxonomy, and conventional
search engine
Classification of short texts
Evaluation in vertical search domain
Query phrase sub-type | Relevancy of baseline Yahoo search, %, averaging over 20 searches | Relevancy of baseline Bing search, %, averaging over 20 searches | Relevancy of re-sorting by generalization, %, averaging over 20 searches | Relevancy of re-sorting by using taxonomy, %, averaging over 20 searches | Relevancy of re-sorting by using taxonomy and generalization, %, averaging over 20 searches | Relevancy improvement for hybrid approach, comp. to baseline (averaged for Bing & Yahoo)
3-4 word phrases:
noun phrase | 86.7 | 85.4 | 87.1 | 93.5 | 93.6 | 1.088
verb phrase | 83.4 | 82.9 | 79.9 | 92.1 | 92.8 | 1.116
how-to expression | 76.7 | 78.2 | 79.5 | 93.4 | 93.3 | 1.205
average | 82.3 | 82.2 | 82.2 | 93.0 | 93.2 | 1.134
5-10 word phrases:
noun phrase | 84.1 | 84.9 | 87.3 | 91.7 | 92.1 | 1.090
verb phrase | 83.5 | 82.7 | 86.1 | 92.4 | 93.4 | 1.124
how-to expression | 82.0 | 82.9 | 82.1 | 88.9 | 91.6 | 1.111
average | 83.2 | 83.5 | 85.2 | 91.0 | 92.4 | 1.108
2-3 sentences:
one verb one noun phrases | 68.8 | 67.6 | 69.1 | 81.2 | 83.1 | 1.218
both verb phrases | 66.3 | 67.1 | 71.2 | 77.4 | 78.3 | 1.174
one sentence of how-to type | 66.1 | 68.3 | 73.2 | 79.2 | 80.9 | 1.204
average | 67.1 | 67.7 | 71.2 | 79.3 | 80.8 | 1.199
This evaluation is the focus of this
study. The higher the complexity of
the query, the stronger the
contribution of the hybrid system.
OpenNLP Contribution
There are four Java classes for building and running a taxonomy:
TaxonomyExtenderSearchResultFromYahoo.java performs web mining by
taking the current taxonomy path, submitting the formed keywords to the Yahoo API web
search, obtaining snippets and possibly fragments of webpages, and extracting
commonalities between them to add the next node to the taxonomy. Various
machine learning components for forming commonalities will be integrated in
future versions, maintaining hypotheses in various ways.
TaxoQuerySnapshotMatcher.java is used in real time to obtain a taxonomy-
based relevance score between a question and an answer.
TaxonomySerializer.java is used to write the taxonomy in a specified format: binary,
text or XML.
AriAdapter.java is used to import seed taxonomy data from a PROLOG
ontology; in future versions of the taxonomy builder more seed formats and options
will be supported.
Related Work
• Mapping to First Order Logic representations with a general prover and without
using acquired rich knowledge sources
• Semantic entailment [de Salvo Braz et al 2005]
• Semantic Role Labeling: for each verb in a sentence, the goal is to identify all
constituents that fill a semantic role, and to determine their roles, such as Agent,
Patient or Instrument [Punyakanok et al 2005]
• A generic semantic inference framework that operates directly on syntactic trees.
New trees are inferred by applying entailment rules, which provide a unified
representation for varying types of inferences [Bar-Haim et al 2005]
• A generic paraphrase-based approach for a specific case such as relation extraction,
to obtain a generic configuration for relations between objects from text [Romano et
al 2006]
Conclusions
Ontologies are a more sensitive way to match
keywords (compared to bag-of-words and TF*IDF).
When text for indexing includes abbreviations and
acronyms, and we don’t ‘know’ all mappings,
semantic analysis should be tolerant to the omission of
some entities and still understand “what this text
fragment is about”.
Since we are unable to filter out noise “statistically”,
like most NLP environments do, we have to rely on
ontologies.
Syntactic generalization takes the bag-of-words and
pattern-matching classes of approaches to the next
level, allowing unknown words to be treated systematically
as long as their part-of-speech information is
available from context.
We proposed a taxonomy-building mechanism
for a vertical domain, extending approaches
where a taxonomy is formed based on:
- specific semantic rules,
- specific semantic templates, or
- a limited corpus of texts.
Relying on a web search engine API for taxonomy
construction, we leverage not only the
whole web universe of texts, but also the
meanings formed by search engines as a result
of learning from user search sessions.
When a user selects certain search results, a
web search engine acquires a set of
associations between entities in questions and
entities in answers. These associations are then
used by our taxonomy learning process to find
adequate parameters for entities being learned
at a current taxonomy building step.