MEASURING THE SIMILARITY BETWEEN IMPLICIT SEMANTIC RELATIONS USING WEB SEARCH ENGINES

Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka (WSDM’09)

Speaker: Yi-Ling Tai
Date: 2009/11/23

OUTLINE
- Introduction
- Method
  - Retrieving Contexts
  - Extracting Lexical Patterns
  - Identifying Semantic Relations
  - Measuring Relational Similarity
- Experiments
- Conclusions

INTRODUCTION
- Implicit semantic relations between two words
  - Google, YouTube (acquisition)
  - Ostrich, bird (is a large)
- Similar semantic relations between two word-pairs
  - Google, YouTube → Yahoo, Inktomi
  - Ostrich, bird → lion, cat
- This paper proposes a method to compute the similarity between implicit semantic relations in two word-pairs.


OUTLINE OF THE SIMILARITY METHOD
- Web search component: queries a Web search engine to find the contexts in which the two words co-occur
- Pattern extraction component: extracts lexical patterns that express semantic relations
- Pattern clustering component: clusters the patterns to identify particular relations
- Similarity computation component: computes the relational similarity between two word-pairs

RETRIEVING CONTEXTS
- Snippets: brief summaries provided by Web search engines along with the search results
  - A snippet that contains both words captures the local context in which they co-occur
- Example query: “Google * * YouTube”

RETRIEVING CONTEXTS
- “*”: wildcard operator; matches one word or none
- To retrieve snippets for a word-pair (A, B), issue the queries “A * B”, “B * A”, “A * * B”, “B * * A”, “A * * * B”, “B * * * A”, and A B
  - The query words co-occur within a maximum of three words of each other
  - The quotation marks ensure that the two words appear in the specified order
- Remove duplicate snippets, i.e. those that contain exactly the same sequence of words
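The query set for a word-pair can be generated mechanically. The following is a minimal Python sketch; the function names `build_queries` and `remove_duplicates` are hypothetical (they do not come from the slides), and the call to the search engine itself is left out.

```python
def build_queries(a: str, b: str, max_wildcards: int = 3) -> list[str]:
    """Return the wildcard queries for the word-pair (a, b)."""
    queries = []
    for n in range(1, max_wildcards + 1):
        gap = " ".join(["*"] * n)
        # Double quotes make the engine keep the two words in this order.
        queries.append(f'"{a} {gap} {b}"')
        queries.append(f'"{b} {gap} {a}"')
    # Last query, unquoted as listed on the slide: the two words alone.
    queries.append(f"{a} {b}")
    return queries


def remove_duplicates(snippets: list[str]) -> list[str]:
    """Drop snippets whose word sequence was already seen, keeping order."""
    seen = set()
    unique = []
    for snippet in snippets:
        key = tuple(snippet.split())  # exact sequence of all words
        if key not in seen:
            seen.add(key)
            unique.append(snippet)
    return unique


if __name__ == "__main__":
    for query in build_queries("Google", "YouTube"):
        print(query)
```

With the default of three wildcards this yields seven queries per word-pair, which matches the list above.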

EXTRACTING LEXICAL PATTERNS
- A shallow lexical pattern extraction algorithm
  - Extracts the semantic relations between two words from Web snippets
  - Does not require language preprocessing
- Consists of the following three steps
- Step 1:
  - Replace the two words with the variables X and Y
  - Replace all numeric values with D
  - Do not remove punctuation marks

EXTRACTING LEXICAL PATTERNS
- Step 2: generate subsequences of the snippet that satisfy the following conditions
  - Exactly one X and one Y must exist in a subsequence
  - The maximum length of a subsequence is L words
  - A gap should not exceed g words
  - The total length of all gaps should not exceed G words
  - All negation contractions are expanded, e.g. didn’t → did not
- Step 3: select subsequences with frequency greater than N
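Steps 1 to 3 can be sketched in Python as follows. This is an illustrative brute-force enumeration under stated assumptions, not the modified PrefixSpan algorithm used in the paper: the tokenizer, the contraction table, the choice to count each pattern once per snippet, and all function names are assumptions. The defaults L = 5, g = 2, G = 4 are the values reported in the experiments.

```python
import re
from collections import Counter

# Assumed special cases; other contractions fall back to "n't" -> " not".
CONTRACTIONS = {"can't": "can not", "won't": "will not"}


def normalize(snippet: str, a: str, b: str) -> list[str]:
    """Step 1: map the word-pair to X and Y, numbers to D; keep punctuation."""
    text = snippet.lower().replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"n't\b", " not", text)  # didn't -> did not
    text = re.sub(r"\b" + re.escape(a.lower()) + r"\b", "X", text)
    text = re.sub(r"\b" + re.escape(b.lower()) + r"\b", "Y", text)
    text = re.sub(r"\b\d+(?:[.,]\d+)*\b", "D", text)
    # Words and punctuation marks both become tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def subsequences(tokens: list[str], L: int = 5, g: int = 2, G: int = 4) -> set[str]:
    """Step 2: subsequences with exactly one X and one Y within the limits."""
    found = set()

    def extend(chosen: list[str], last: int, total_gap: int) -> None:
        if chosen.count("X") == 1 and chosen.count("Y") == 1:
            found.add(" ".join(chosen))
        if len(chosen) == L:
            return
        for gap in range(g + 1):  # number of tokens skipped before the next one
            nxt = last + 1 + gap
            if nxt >= len(tokens) or total_gap + gap > G:
                break
            token = tokens[nxt]
            if token in ("X", "Y") and token in chosen:
                continue  # never allow a second X or a second Y
            extend(chosen + [token], nxt, total_gap + gap)

    for start in range(len(tokens)):
        extend([tokens[start]], start, 0)
    return found


def extract_patterns(snippets, a, b, N=1, L=5, g=2, G=4):
    """Step 3: keep patterns whose frequency over the snippets exceeds N."""
    counts = Counter()
    for snippet in snippets:
        # Assumption: a pattern is counted at most once per snippet.
        counts.update(subsequences(normalize(snippet, a, b), L, g, G))
    return {p: n for p, n in counts.items() if n > N}
```

For the snippet "Google to acquire YouTube" this enumeration produces, among others, the patterns "X to acquire Y" and "X acquire Y", because skipping "to" is a gap of one word.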

EXTRACTING LEXICAL PATTERNS
- Uses a modified PrefixSpan algorithm
  - Considers all the words in a snippet
  - Is not limited to extracting patterns from only the mid-fix (the words between X and Y)
- Example patterns: “X to acquire Y”, “X acquire Y”, “X to acquire Y for”

IDENTIFYING SEMANTIC RELATIONS
- A semantic relation can be expressed using more than one pattern
- If two word-pairs share many related patterns, we can expect a high relational similarity between them
- Cluster lexical patterns by their distributions over word-pairs to identify semantically related patterns

IDENTIFYING SEMANTIC RELATIONS
Notation used by the pattern clustering algorithm:
- p: the word-pair frequency vector of a pattern p; each element is the frequency with which p occurs with one word-pair
- SORT: sorts the patterns in descending order of their total occurrence over all word-pairs
- c: the vector sum of the word-pair frequency vectors of all patterns that belong to a cluster
- The algorithm also uses vector addition (to update a cluster vector) and a similarity threshold
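The notation above suggests a greedy, single-pass clustering. The Python sketch below is one plausible reading, not a transcription of the paper's algorithm: it assumes that cosine similarity is used to compare a pattern with a cluster vector, that a pattern joins its most similar cluster when the similarity is at least the threshold, and that it otherwise starts a new cluster. The names `cosine`, `sequential_clustering`, and `theta` are illustrative.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sequential_clustering(pattern_vectors: dict, theta: float) -> list[dict]:
    """pattern_vectors maps pattern -> {word_pair: frequency}."""
    # SORT: descending total occurrence over all word-pairs.
    ordered = sorted(
        pattern_vectors,
        key=lambda p: sum(pattern_vectors[p].values()),
        reverse=True,
    )
    clusters = []  # each cluster: {"patterns": [...], "vector": {...}}
    for pattern in ordered:
        vec = pattern_vectors[pattern]
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["vector"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= theta:
            # Vector addition: merge the pattern into the cluster vector.
            best["patterns"].append(pattern)
            for pair, freq in vec.items():
                best["vector"][pair] = best["vector"].get(pair, 0) + freq
        else:
            clusters.append({"patterns": [pattern], "vector": dict(vec)})
    return clusters
```

Processing frequent patterns first means the early clusters are seeded by well-attested patterns, and rarer patterns attach to them only if their distributions over word-pairs are close enough.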

MEASURING RELATIONAL SIMILARITY
- Each word-pair is represented by a feature vector
  - The elements of the feature vector are the total frequencies of the word-pair in each cluster
- The relational similarity between two word-pairs is computed from their feature vectors and a correlation matrix over the clusters
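The formula itself did not survive on this slide, so the Python sketch below rests on an assumption: that the similarity is the bilinear form of the two feature vectors with the correlation matrix in the middle, which reduces to a plain inner product when the matrix is the identity. Whether the vectors are normalized first is not stated here, so no normalization is applied. All names are illustrative.

```python
def feature_vector(word_pair, clusters, pattern_freq):
    """One element per cluster: total frequency of the word-pair in it.

    clusters: list of lists of patterns.
    pattern_freq: pattern -> {word_pair: frequency}.
    """
    return [
        sum(pattern_freq.get(p, {}).get(word_pair, 0) for p in cluster)
        for cluster in clusters
    ]


def relational_similarity(x, y, corr):
    """Assumed form: sum over i, j of x[i] * corr[i][j] * y[j]."""
    return sum(
        x[i] * corr[i][j] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


if __name__ == "__main__":
    gy, em = ("Google", "YouTube"), ("Einstein", "Ulm")
    freq = {"X acquired Y": {gy: 3}, "X was born in Y": {em: 4}}
    clusters = [["X acquired Y"], ["X was born in Y"]]
    x, y = feature_vector(gy, clusters, freq), feature_vector(em, clusters, freq)
    identity = [[1, 0], [0, 1]]
    print(relational_similarity(x, y, identity))  # no shared cluster
```

Under this assumption, off-diagonal entries of the matrix let two word-pairs score above zero even when their patterns fall into different but correlated clusters.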

MEASURING RELATIONAL SIMILARITY
- The correlation between two clusters is given by the corresponding element of the correlation matrix
- The definition of each element uses the union of the two clusters

EXPERIMENTS
- Dataset
  - 100 instances (word or named-entity pairs)
  - Five relation types
    - ACQUIRER-ACQUIREE
    - PERSON-BIRTHPLACE
    - CEO-COMPANY
    - COMPANY-HEADQUARTERS
    - PERSON-FIELD

EXPERIMENTS
- Manually select 20 instances for each relation type from
  - Wikipedia
  - online newspapers
  - company reviews
- For each instance, download snippets using the Yahoo BOSS API

EXPERIMENTS - LEXICAL PATTERNS
- Run the pattern extraction algorithm with L = 5, g = 2, and G = 4
- The total number of unique patterns is 473,910
- Only the 148,655 patterns that occur at least twice are selected

EXPERIMENTS - PATTERN CLUSTERS
- Ratio: number of singleton clusters to total number of clusters

EXPERIMENTS - RELATION CLASSIFICATION
- We evaluate the proposed relational similarity measure in a relation classification task
  - k-nearest neighbor classification
- Evaluation measures
  - Classification accuracy
  - Average precision
    - Rel(r): a binary-valued function that returns 1 if the word-pair at rank r has the same relation as the query word-pair, and 0 otherwise
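The average precision formula is not preserved on this slide. The Python sketch below assumes the standard definition: the precision at each rank r where Rel(r) = 1, averaged over the relevant word-pairs found in the ranked list. The k-nearest neighbor vote is likewise a generic majority vote, and both function names are illustrative.

```python
from collections import Counter


def knn_label(neighbors, k):
    """Majority relation label among the k most similar word-pairs.

    neighbors: list of (similarity, relation_label) tuples.
    """
    top = sorted(neighbors, key=lambda item: item[0], reverse=True)[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]


def average_precision(rel):
    """rel[r-1] is Rel(r): 1 if the word-pair at rank r is relevant, else 0."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(rel, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumption: normalize by the relevant items present in the list.
    return total / hits if hits else 0.0


if __name__ == "__main__":
    print(average_precision([1, 0, 1, 0]))
```

For the ranking [1, 0, 1, 0], the precisions at the two relevant ranks are 1/1 and 2/3, so the average precision is 5/6.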

EXPERIMENTS - RELATION CLASSIFICATION
- Similarity threshold = 0.955
  - 2629 non-singleton clusters
  - 6930 singletons

EXPERIMENTS - RELATION CLASSIFICATION
- The top 10 clusters with the largest number of lexical patterns
- For each cluster, the top four patterns that occur with the largest number of word-pairs

RELATIONAL SIMILARITY MEASURES
Compare the following relational similarity measures:
- VSM: vector space model
  - Each word-pair is represented by a vector of pattern frequencies
  - The relational similarity between two word-pairs is computed as the cosine similarity of their vectors
- LRA: Latent Relational Analysis
  - Create a matrix in which the rows represent word-pairs and the columns represent lexical patterns
  - Apply singular value decomposition (SVD) to the matrix

RELATIONAL SIMILARITY MEASURES
- IP:
  - Set the correlation matrix in Formula 2 to the identity matrix
  - Compute relational similarity using pattern clusters
- CORR: the proposed relational similarity measure


CONCLUSIONS
- We proposed a method to compute the similarity between implicit semantic relations in two word-pairs
  - It requires only a few queries, so the similarity can be computed quickly
  - It can compute relational similarity for unseen word-pairs
- A general framework: designing relational similarity measures can be modeled as searching for a matrix
