Nearest Neighbor & Information Retrieval Search Artificial Intelligence CMSC 25000 January 29, 2004

Nearest Neighbor &Information Retrieval Search

Artificial Intelligence

CMSC 25000

January 29, 2004

Agenda

• Machine learning: Introduction• Nearest neighbor techniques

– Applications: Robotic motion, Credit rating– Information retrieval search

• Efficient implementations:– k-d trees, parallelism

• Extensions: K-nearest neighbor• Limitations:

– Distance, dimensions, & irrelevant attributes

Nearest Neighbor

• Memory- or case- based learning

• Supervised method: Training– Record labeled instances and feature-value vectors

• For each new, unlabeled instance– Identify “nearest” labeled instance– Assign same label

• Consistency heuristic: Assume that a property is the same as that of the nearest reference case.

Nearest Neighbor Example

• Problem: Robot arm motion– Difficult to model analytically

• Kinematic equations – Relate joint angles and manipulator positions

• Dynamics equations– Relate motor torques to joint angles

– Difficult to achieve good results modeling robotic arms or human arm

• Many factors & measurements


• Solution: – Move robot arm around– Record parameters and trajectory segment

• Table: torques, positions,velocities, squared velocities, velocity products, accelerations

– To follow a new path:• Break into segments • Find closest segments in table• Get those torques (interpolate as necessary)


• Issue: Big table– First time with new trajectory

• “Closest” isn’t close• Table is sparse - few entries

• Solution: Practice– As attempt trajectory, fill in more of table

• After few attempts, very close

Nearest Neighbor Example II

• Credit Rating:– Classifier: Good /

Poor– Features:

• L = # late payments/yr; • R = Income/Expenses

Name L R G/P

A 0 1.2 G

B 25 0.4 P

C 5 0.7 G

D 20 0.8 PE 30 0.85 P

F 11 1.2 G

G 7 1.15 GH 15 0.8 P


Name L R G/P

A 0 1.2 G

B 25 0.4 P

C 5 0.7 G

D 20 0.8 PE 30 0.85 P

F 11 1.2 G

G 7 1.15 GH 15 0.8 P L

R

302010

1 A

B

C D E

FG

H


L 302010

1 A

B

C D E

FG

HR

Name L R G/P

I 6 1.15

J 22 0.45

K 15 1.2

G

IP

J

??

K

Distance Measure:

Sqrt ((L1-L2)^2 + [sqrt(10)*(R1-R2)]^2))

- Scaled distance

Efficient Implementations

• Classification cost:– Find nearest neighbor: O(n)

• Compute distance between unknown and all instances

• Compare distances

– Problematic for large data sets

• Alternative:– Use binary search to reduce to O(log n)

Roadmap

• Problem: – Matching Topics and Documents

• Methods:– Classic: Vector Space Model

• Challenge I: Beyond literal matching– Expansion Strategies

• Challenge II: Authoritative source– Page Rank– Hubs & Authorities

Matching Topics and Documents

• Two main perspectives:– Pre-defined, fixed, finite topics:

• “Text Classification”

– Arbitrary topics, typically defined by statement of information need (aka query)

• “Information Retrieval”

Three Steps to IR● Three phases:

– Indexing: Build collection of document representations

– Query construction:● Convert query text to vector

– Retrieval:● Compute similarity between query and doc

representation● Return closest match

Matching Topics and Documents

• Documents are “about” some topic(s)• Question: Evidence of “aboutness”?

– Words !!• Possibly also meta-data in documents

– Tags, etc

• Model encodes how words capture topic– E.g. “Bag of words” model, Boolean matching– What information is captured?– How is similarity computed?

Models for Retrieval and Classification

• Plethora of models are used

• Here:– Vector Space Model

Vector Space Information Retrieval

• Task:– Document collection– Query specifies information need: free text– Relevance judgments: 0/1 for all docs

• Word evidence: Bag of words– No ordering information

Vector Space Model

Computer

Tv

Program

Two documents: computer program, tv programQuery: computer program : matches 1 st doc: exact: distance=2 vs 0 educational program: matches both equally: distance=1

Vector Space Model

• Represent documents and queries as– Vectors of term-based features

• Features: tied to occurrence of terms in collection

– E.g.

• Solution 1: Binary features: t=1 if present, 0 otherwise– Similiarity: number of terms in common

• Dot product

),...,,();,...,,( ,,2,1,,2,1 kNkkkjNjjj tttqtttd

ji

N

ikijk ttdqsim ,

1,),(

Question

• What’s wrong with this?

Vector Space Model II

• Problem: Not all terms equally interesting– E.g. the vs dog vs Levow

• Solution: Replace binary term features with weights– Document collection: term-by-document matrix

– View as vector in multidimensional space• Nearby vectors are related

– Normalize for vector length

),...,,();,...,,( ,,2,1,,2,1 kNkkkjNjjj wwwqwwwd

Vector Similarity Computation

• Similarity = Dot product

• Normalization:– Normalize weights in advance– Normalize post-hoc

ji

N

ikijkjk wwdqdqsim ,

1,),(

N

i ji

N

i ki

N

i jikijk

ww

wwdqsim

1

2,1

2,

1 ,,),(

Term Weighting

• “Aboutness”– To what degree is this term what document is about?– Within document measure– Term frequency (tf): # occurrences of t in doc j

• “Specificity”– How surprised are you to see this term?– Collection frequency– Inverse document frequency (idf):

)log(i

i n

Nidf

ijiji idftfw ,,

Term Selection & Formation

• Selection:– Some terms are truly useless

• Too frequent, no content– E.g. the, a, and,…

– Stop words: ignore such terms altogether

• Creation:– Too many surface forms for same concepts

• E.g. inflections of words: verb conjugations, plural

– Stem terms: treat all forms as same underlying

Key Issue

• All approaches operate on term matching– If a synonym, rather than original term, is used,

approach fails

• Develop more robust techniques– Match “concept” rather than term

• Expansion approaches– Add in related terms to enhance matching

• Mapping techniques– Associate terms to concepts

» Aspect models, stemming

Expansion Techniques

• Can apply to query or document

• Thesaurus expansion– Use linguistic resource – thesaurus, WordNet

– to add synonyms/related terms

• Feedback expansion– Add terms that “should have appeared”

• User interaction– Direct or relevance feedback

• Automatic pseudo relevance feedback

Query Refinement

• Typical queries very short, ambiguous– Cat: animal/Unix command– Add more terms to disambiguate, improve

• Relevance feedback– Retrieve with original queries– Present results

• Ask user to tag relevant/non-relevant

– “push” toward relevant vectors, away from nr

– β+γ=1 (0.75,0.25); r: rel docs, s: non-rel docs– “Roccio” expansion formula

S

kk

R

jjii sS

rR

qq11

1

Compression Techniques

• Reduce surface term variation to concepts• Stemming

– Map inflectional variants to root• E.g. see, sees, seen, saw -> see• Crucial for highly inflected languages – Czech, Arabic

• Aspect models– Matrix representations typically very sparse– Reduce dimensionality to small # key aspects

• Mapping contextually similar terms together• Latent semantic analysis

Authoritative Sources

• Based on vector space alone, what would you expect to get searching for “search engine”?– Would you expect to get Google?

Issue

Text isn’t always best indicator of content

Example:

• “search engine” – Text search -> review of search engines

• Term doesn’t appear on search engine pages• Term probably appears on many pages that point

to many search engines

Hubs & Authorities

• Not all sites are created equal– Finding “better” sites

• Question: What defines a good site?– Authoritative– Not just content, but connections!

• One that many other sites think is good• Site that is pointed to by many other sites

– Authority

Conferring Authority

• Authorities rarely link to each other– Competition

• Hubs:– Relevant sites point to prominent sites on topic

• Often not prominent themselves• Professional or amateur

• Good Hubs Good Authorities

Computing HITS

• Finding Hubs and Authorities

• Two steps:– Sampling:

• Find potential authorities

– Weight-propagation:• Iteratively estimate best hubs and authorities

Sampling

• Identify potential hubs and authorities– Connected subsections of web

• Select root set with standard text query

• Construct base set:– All nodes pointed to by root set– All nodes that point to root set

• Drop within-domain links

– 1000-5000 pages

Weight-propagation

• Weights:– Authority weight: – Hub weight:

• All weights are relative

• Updating:

• Converges • Pages with high x: good authorities; y: good hubs

pxpy

qpqpp

pqqpp

xy

yx

,

,

Google’s PageRank

• Identifies authorities– Important pages are those pointed to by many other

pages• Better pointers, higher rank

– Ranks search results

– t:page pointing to A; C(t): number of outbound links• d:damping measure

– Actual ranking on logarithmic scale– Iterate

))(/)(...)(/)(()1()( 11 nn tCtprtCtprddApr

Contrasts

• Internal links– Large sites carry more weight

• If well-designed

– H&A ignores site-internals

• Outbound links explicitly penalized

• Lots of tweaks….

Web Search

• Search by content– Vector space model

• Word-based representation• “Aboutness” and “Surprise”• Enhancing matches• Simple learning model

• Search by structure– Authorities identified by link structure of web

• Hubs confer authority

Efficient Implementation: K-D Trees

• Divide instances into sets based on features– Binary branching: E.g. > value– 2^d leaves with d split path = n

• d= O(log n)

– To split cases into sets,• If there is one element in the set, stop• Otherwise pick a feature to split on

– Find average position of two middle objects on that dimension

» Split remaining objects based on average position» Recursively split subsets

K-D Trees: Classification

R > 0.825?

L > 17.5? L > 9 ?

No Yes

R > 0.6? R > 0.75? R > 1.025 ?R > 1.175 ?

NoYes No Yes

No

Poor Good

Yes No Yes

Good Poor

No Yes

Good Good

No

Poor

Yes

Good

Efficient Implementation:Parallel Hardware

• Classification cost:– # distance computations

• Const time if O(n) processors

– Cost of finding closest• Compute pairwise minimum, successively• O(log n) time

Nearest Neighbor: Issues

• Prediction can be expensive if many features

• Affected by classification, feature noise– One entry can change prediction

• Definition of distance metric– How to combine different features

• Different types, ranges of values

• Sensitive to feature selection

Nearest Neighbor Analysis

• Problem: – Ambiguous labeling, Training Noise

• Solution:– K-nearest neighbors

• Not just single nearest instance

• Compare to K nearest neighbors– Label according to majority of K

• What should K be?– Often 3, can train as well

Nearest Neighbor: Analysis

• Issue: – What is a good distance metric?– How should features be combined?

• Strategy:– (Typically weighted) Euclidean distance– Feature scaling: Normalization

• Good starting point: – (Feature - Feature_mean)/Feature_standard_deviation– Rescales all values - Centered on 0 with std_dev 1

Nearest Neighbor: Analysis

• Issue: – What features should we use?

• E.g. Credit rating: Many possible features– Tax bracket, debt burden, retirement savings, etc..

– Nearest neighbor uses ALL – Irrelevant feature(s) could mislead

• Fundamental problem with nearest neighbor

Nearest Neighbor: Advantages

• Fast training:– Just record feature vector - output value set

• Can model wide variety of functions– Complex decision boundaries– Weak inductive bias

• Very generally applicable

Summary

• Machine learning:– Acquire function from input features to value

• Based on prior training instances

– Supervised vs Unsupervised learning• Classification and Regression

– Inductive bias: • Representation of function to learn• Complexity, Generalization, & Validation

Summary: Nearest Neighbor

• Nearest neighbor:– Training: record input vectors + output value– Prediction: closest training instance to new

data

• Efficient implementations

• Pros: fast training, very general, little bias

• Cons: distance metric (scaling), sensitivity to noise & extraneous features