Information Extraction
• Informal Communication: e-mail, mailing lists, bulletin boards
• Issues:
  – Context switching
  – Abbreviations & shortened forms
  – Variable punctuation, formatting, grammar
Thesis Advertisement: Outline
• The thesis is not an end-to-end IE system
• We address some IE problems:
1. Identifying & Resolving Named Entities
2. Tracking Context
3. Learning User Preferences
Identifying Named Entities
• “Rialto is now open until 11pm”
• Facts/Opinions usually about a named entity
• Tools typically rely on punctuation, capitalization, formatting, grammar
• We developed a criterion for identifying topic-oriented words using occurrence statistics
[Rennie & Jaakkola, SIGIR 2005]
Resolving Named Entities
• “They’re now open until 11pm”
• What does “they” refer to?
• Clustering
  – Group noun phrases that co-refer
• McCallum & Wellner (2005)
  – Excellent for proper nouns
• Our contribution: better modeling of non-proper nouns (incl. pronouns)
Tracking Context
• “The Swordfish was fabulous”
  – Indirect comment on a restaurant.
  – Restaurant identified by context.
• Use word statistics to find topic switches
• Contribution: new sentence clustering algorithm
Learning User Preferences
• Examples:
  – “I loved Rialto last night.”
  – “Overall, Oleana was worth the money”
  – “Radius wasn’t bad, but wasn’t great”
  – “Om was purely pretentious”
• Issues:
  1. Translate text to partial ordering or rating
  2. Predict unobserved ratings
Preference Problems
• Single User w/ Item Features
• Multi-user, no features
  – a.k.a. Collaborative Filtering
Single User, Item Features

User Weights: Capacity -0.1, Price -0.1, French? +10, New American? +5, Ethnic? 0, Formality 0, Location +2

Feature Values:

                 10 Tables   #9 Park   Lumiere   Tanjore   Chennai   Rendezvous
  Capacity          30          90        60        80        40        80
  Price             30          60        50        30        20        40
  French?            1           0         1         0         0         0
  New American?      0           1         0         0         0         1
  Ethnic?            0           0         0         1         1         0
  Formality          2           4         3         1         0         2
  Location           2           3         1         2         0         2

Preference Scores (weights · features):   +8   -4   +1   -7   -6   -3
Thresholds: score ≥ 6 ⇒ rating 5, ≥ 3 ⇒ 4, ≥ -2 ⇒ 3, ≥ -5 ⇒ 2, else 1
Ratings:                                   5    2    3    1    1    2
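The arithmetic in the figure can be checked directly. A minimal numpy sketch, with the data transcribed from the figure and the thresholds read as lower bounds for each rating level (the boundary convention is an assumption):

```python
import numpy as np

# Feature values: rows = Capacity, Price, French?, New American?,
# Ethnic?, Formality, Location; columns = the six restaurants.
X = np.array([[30, 90, 60, 80, 40, 80],
              [30, 60, 50, 30, 20, 40],
              [ 1,  0,  1,  0,  0,  0],
              [ 0,  1,  0,  0,  0,  1],
              [ 0,  0,  0,  1,  1,  0],
              [ 2,  4,  3,  1,  0,  2],
              [ 2,  3,  1,  2,  0,  2]])

w = np.array([-0.1, -0.1, 10, 5, 0, 0, 2])    # user weights
scores = w @ X                                # preference scores
print(scores)                                 # [ 8. -4.  1. -7. -6. -3.]

# Thresholds from the figure; searchsorted maps each score to a rating 1-5.
thresholds = np.array([-5, -2, 3, 6])
ratings = 1 + np.searchsorted(thresholds, scores, side='right')
print(ratings)                                # [5 2 3 1 1 2]
```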
Single User, Item Features

[Feature values for the six restaurants as on the previous slide]

User Weights:        ?  ?  ?  ?  ?  ?  ?
Preference Scores:   ?  ?  ?  ?  ?  ?
Observed Ratings:    5  2  3  1  ?  ?

Goal: learn the weights and thresholds from the observed ratings, then predict the missing ones.
Many Users, No Features

Ratings matrix (users × items; several entries are hidden as “?” in the figure):

  2 3 2 3 2 3
  2 1 5 1 2 4
  1 2 1 3 1 3
  5 2 3 5 2 4
  4 2 5 2 1 5
  3 3 3 5 3 2
  4 5 2 4 3 5

[Figure: the matrix factored as Weights × Features = real-valued Preference Scores, which are thresholded into Ratings]
• Possible goals:
  – Predict missing entries
  – Cluster users or items
• Applications:
  – Movies, Books
  – Genetic interaction
  – Network routing
  – Sports performance
Collaborative Filtering

Ratings matrix (rows = users, columns = items):

  2 3 2 3 2 3
  2 1 5 1 2 4
  1 2 1 3 1 3
  5 2 3 5 2 4
  4 2 5 2 1 5
  3 3 3 5 3 2
  4 5 2 4 3 5
Outline
• Single User, Features
  – Loss functions, Convexity, Large Margin
  – Loss function for Ratings
• Many Users, No Features
  – Feature Selection, Rank, SVD
  – Regularization: tie together multiple tasks
  – Optimization: scale to large problems
• Extensions
This Talk: Contributions
• Implementation and systematic evaluation of loss functions for single-user prediction.
• Scaling multi-user regularization to large (thousands of users/items) problems
  – Analysis of optimization
• Extensions
  – Hybrid: features + multiple users
  – Observation model & multiple ratings
Rating Classification
• n ordered classes
• Learn weight vector, thresholds
[Figure: points with class labels 1, 2, 3 in feature space, separated by thresholds along the direction of the weight vector w]
Loss Functions

[Plots of loss vs. margin agreement: 0-1, Hinge, Logistic, Smooth Hinge, and Modified Least Squares]
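For concreteness, here is a sketch of these losses as functions of the margin agreement z; the smooth hinge follows the piecewise form from Rennie & Srebro (2005), the rest are the standard definitions:

```python
import numpy as np

def zero_one(z):  return (z < 0).astype(float)           # 1 when the sign is wrong; not convex
def hinge(z):     return np.maximum(0.0, 1.0 - z)
def logistic(z):  return np.log1p(np.exp(-z))
def mod_ls(z):    return np.maximum(0.0, 1.0 - z) ** 2   # modified least squares

def smooth_hinge(z):
    # hinge with the corner at z = 1 smoothed by a quadratic piece
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
           np.where(z <= 0.0, 0.5 - z, 0.5 * (1.0 - z) ** 2))

z = np.linspace(-2, 2, 5)
for f in (zero_one, hinge, logistic, mod_ls, smooth_hinge):
    print(f.__name__, f(z))
```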
Convexity
• Convex function => no local minima
• A set is convex if every line segment between its points stays within the set
Convexity of Loss Functions
• 0-1 loss is not convex
  – Local minima, sensitive to small changes
• Convex bound
  – Large margin solution with regularization
  – Stronger guarantees
Proportional Odds
• McCullagh introduced the original rating model
  – Linear interaction: weights & features
  – Thresholds
  – Maximum likelihood
[McCullagh, 1980]
[Figure: rating classes 1, 2, 3 separated by thresholds along the weight vector w]
Immediate-Thresholds
[Plot: immediate-thresholds loss as a function of the score, with thresholds separating ratings 1-5]
[Shashua & Levin, 2003]
Some Errors are Better than Others
[Example: a user’s true ratings shown alongside System 1’s and System 2’s predictions; the systems differ in how far their errors fall from the true ratings]
Not a Bound on Absolute Diff.
[Plot: the immediate-thresholds loss over ratings 1-5 does not bound the absolute difference between the predicted and true rating]
All-Thresholds Loss

[Plot: all-thresholds loss over ratings 1-5, with a margin penalty at every threshold, so the loss grows with the size of the error]

[Srebro, Rennie & Jaakkola, NIPS 2004]
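A sketch contrasting the two constructions, with the hinge as the per-threshold penalty; the indexing convention here is illustrative, not the paper's notation:

```python
def hinge(z):
    return max(0.0, 1.0 - z)

def all_thresholds(score, rating, thetas):
    """Penalize EVERY threshold: those below the true rating should sit
    under the score, those at or above it should sit over the score."""
    loss = 0.0
    for l, theta in enumerate(thetas, start=1):   # theta_l separates rating l from l+1
        if l < rating:
            loss += hinge(score - theta)          # want score >= theta + 1
        else:
            loss += hinge(theta - score)          # want score <= theta - 1
    return loss

def immediate_thresholds(score, rating, thetas):
    """Penalize only the (at most two) thresholds adjacent to the true rating."""
    loss = 0.0
    if rating > 1:
        loss += hinge(score - thetas[rating - 2])
    if rating <= len(thetas):
        loss += hinge(thetas[rating - 1] - score)
    return loss

thetas = [-5, -2, 3, 6]
# All-thresholds charges more the farther the score is from the true rating:
print(all_thresholds(-7, 5, thetas), all_thresholds(4.5, 5, thetas))   # 34.0 2.5
```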
Experiments

Test error (lower is better):

  Loss       Multi-Class   Imm-Thresh   All-Thresh   p-value
  MLS        .7486         .7491        .6700        1.7e-18
  Hinge      .7433         .7628        .6702        6.6e-17
  Logistic   .7490         .7248        .6623        7.3e-22

  Least Squares: 1.3368

[Rennie & Srebro, IJCAI 2005]
Many Users, No Features

[Figure: the same users × items ratings matrix as before (some entries “?”), factored as Weights × Features = real-valued Preference Scores, which are thresholded into Ratings]
Background: Lp-norms

• L0: # of non-zero entries: ||<0,2,0,3,4>||0 = 3
• L1: sum of absolute values: ||<2,-2,1>||1 = 5
• L2: Euclidean length: ||<1,-1>||2 = √2
• General: ||v||p = (Σi |vi|^p)^(1/p)
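A quick numpy check of these definitions:

```python
import numpy as np

v = np.array([0, 2, 0, 3, 4])
print(np.count_nonzero(v))               # L0: 3
print(np.linalg.norm([2, -2, 1], 1))     # L1: 5.0
print(np.linalg.norm([1, -1], 2))        # L2: 1.4142... = sqrt(2)
print((np.abs(v) ** 3).sum() ** (1/3))   # general Lp, here p = 3
```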
Background: Feature Selection
• Objective: Loss + Regularization
[Plots: the L2-squared and L1 regularization penalties]
Singular Value Decomposition
• X = USV’
  – U, V: orthogonal (rotations)
  – S: diagonal, non-negative
• Eigenvalues of XX’ = USV’VSU’ = US²U’ are the squared singular values of X
• Rank = ||s||0 (the number of non-zero singular values)
• SVD: used to obtain the least-squares low-rank approximation
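A quick numpy illustration of these properties:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4)) @ rng.standard_normal((4, 6))   # rank <= 4

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.count_nonzero(s > 1e-10))                  # rank = ||s||_0 = 4

# eigenvalues of X X' are the squared singular values
eig = np.linalg.eigvalsh(X @ X.T)
print(np.allclose(np.sort(eig)[::-1], s ** 2))      # True
```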
Low Rank Matrix Factorization

[Figure: X ≈ U × V’, a rank-k factorization of the observed matrix Y]

• Sum-squared loss, fully observed Y: use SVD to find the global optimum
• Classification error loss, partially observed Y: non-convex, no explicit solution
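A sketch of the fully observed, sum-squared case, where the truncated SVD is the exact global optimum (Eckart-Young):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((7, 6))
k = 2

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Y_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation
print(np.linalg.matrix_rank(Y_k))                  # 2
print(np.linalg.norm(Y - Y_k, 'fro') ** 2)         # equals the discarded spectrum:
print((s[k:] ** 2).sum())                          # same number
```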
Low-Rank: Non-Convex Set
[Figure: the sum of two rank-1 matrices can have rank 2, so the set of low-rank matrices is not convex]
Trace Norm Regularization

• Trace norm: the sum of the singular values

[Fazel et al., 2001]
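A one-liner check of the definition:

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [0.0, -3.0]])
print(np.linalg.svd(X, compute_uv=False).sum())   # 5.0 (singular values 3 and 2)
print(np.linalg.norm(X, 'nuc'))                   # same: the nuclear/trace norm
```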
Many Users, No Features

[Figure: the observed ratings matrix Y modeled as Y ≈ X = UV’, with U the per-user weights, V’ the learned item features, and X the real-valued preference scores that are thresholded into ratings]
Max Margin Matrix Factorization

• Objective: All-Thresholds loss + trace norm regularization
• Convex function of X (and the thresholds)
• Trace norm encourages low rank in X

[Srebro, Rennie & Jaakkola, NIPS 2004]
Properties of the Trace Norm

• ||X||Σ = min over X=UV’ of ||U||F ||V||F = min over X=UV’ of ½(||U||F² + ||V||F²)
• The factorization U√S, V√S (from the SVD X = USV’) minimizes both quantities
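A numerical check of this identity, using U√S and V√S from the SVD X = USV’:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

A = U @ np.diag(np.sqrt(s))       # U sqrt(S)
B = Vt.T @ np.diag(np.sqrt(s))    # V sqrt(S)
assert np.allclose(A @ B.T, X)    # still a factorization of X

print(s.sum())                                                # trace norm of X
print(np.linalg.norm(A) * np.linalg.norm(B))                  # ||U||_F ||V||_F
print(0.5 * (np.linalg.norm(A)**2 + np.linalg.norm(B)**2))    # (||U||_F^2 + ||V||_F^2)/2
```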
Factorized Optimization
• Factorized objective (a tight bound): Loss(UV’; Y) + λ/2 (||U||F² + ||V||F²)
• Gradient descent: O(n³) per round
• Stationary points, but no local minima
[Rennie & Srebro, ICML 2005]
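A minimal sketch of the factorized approach, with squared loss on the observed entries standing in for the all-thresholds loss (the actual system uses the smooth hinge; the dimensions, learning rate, and iteration count here are illustrative):

```python
import numpy as np

def factored_mmmf(Y, observed, k=5, lam=0.1, lr=0.01, iters=2000, seed=0):
    """Gradient descent on  sum_observed (UV' - Y)^2 / 2 + lam/2 (||U||_F^2 + ||V||_F^2).
    Squared loss stands in for the all-thresholds loss used in the thesis."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(iters):
        R = observed * (U @ V.T - Y)       # residuals on observed entries only
        U, V = U - lr * (R @ V + lam * U), V - lr * (R.T @ U + lam * V)
    return U, V

# Tiny demo: fill in missing entries of a random low-rank "ratings" matrix.
rng = np.random.default_rng(3)
truth = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))
observed = rng.random(truth.shape) < 0.5
U, V = factored_mmmf(truth * observed, observed, k=3, lam=0.01)
print(np.abs((U @ V.T - truth)[~observed]).mean())   # held-out error (small if it worked)
```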
Collaborative Prediction Results

size, sparsity:   EachMovie 36656×1648, 96%   |   MovieLens 6040×3952, 96%

              EachMovie                   MovieLens
  Algorithm   Weak Error   Strong Error   Weak Error   Strong Error
  URP         .8596        .8859          .6946        .7104
  Attitude    .8787        .8845          .6912        .7000
  MMMF        .8548        .8439          .6650        .6725

[URP & Attitude: Marlin, 2004]  [MMMF: Rennie & Srebro, 2005]
Extensions
• Multi-user + Features
• Observation model
  – Predict which restaurants a user will rate, and
  – The rating she will make
• Multiple ratings per user/restaurant
  – E.g. Food, Service and Décor ratings
• SVD Parameterization
Multi-User + Features

• Feature parameters (V):
  – Some are fixed
  – Some are learned
• Learn weights (U) for all features
• The fixed part of V does not affect regularization

[Figure: V’ partitioned into fixed-feature and learned-feature blocks]
Observation Model
• Common assumption: ratings observed at random
• Restaurant selection:
  – Geography, popularity, price, food style
• Remove bias: model observation process
Observation Model

• Model observation as binary classification
• Add a binary classification loss
• Tie together the rating and observation models:
  X = UX V’,  W = UW V’  (shared V)
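One hypothetical way to realize the tying (a sketch under my own assumptions, not the thesis implementation): logistic loss on a "was it rated" indicator plus squared loss on the observed ratings, with V shared across both models:

```python
import numpy as np

def joint_loss(UX, UW, V, Y, observed, lam=0.1):
    """X = UX V' models the ratings; W = UW V' models which entries get rated.
    Sharing V ties the rating and observation models together."""
    X = UX @ V.T
    W = UW @ V.T
    rating_loss = 0.5 * ((X - Y)[observed] ** 2).sum()    # only observed ratings
    Z = np.where(observed, 1.0, -1.0)                     # +/-1 "was it rated" labels
    observation_loss = np.log1p(np.exp(-Z * W)).sum()     # logistic loss
    reg = 0.5 * lam * ((UX ** 2).sum() + (UW ** 2).sum() + (V ** 2).sum())
    return rating_loss + observation_loss + reg

rng = np.random.default_rng(4)
n, m, k = 8, 6, 2
print(joint_loss(rng.standard_normal((n, k)), rng.standard_normal((n, k)),
                 rng.standard_normal((m, k)),
                 rng.integers(1, 6, (n, m)).astype(float),
                 rng.random((n, m)) < 0.5))
```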
Multiple Ratings
• Users may provide multiple ratings:
  – Service, Décor, Food
• Add in loss functions
• Stack parameter matrices for regularization
SVD Parameterization
• Too many parameters: X = (UA)(A⁻¹V’) is another factorization of X for any invertible A
• Alternative: parameterize as U, S, V
  – U, V orthogonal, S diagonal
• Advantages:
  – Not over-parameterized
  – Exact objective (not a bound)
  – No stationary points
Summary
• Loss function for ratings
• Regularization for multiple users
• Scaled MMMF to large problems (e.g. > 1000x1000)
• Trace norm: widely applicable
• Extensions
Code: http://people.csail.mit.edu/jrennie/matlab
Thanks!
• Helen, for supporting me for 7.5 years!
• Tommi Jaakkola, for answering all my questions and directing me to the “end”!
• Mike Collins and Tommy Poggio for additional guidance.
• Nati Srebro & John Barnett for endless valuable discussions and ideas.
• Amir Globerson, David Sontag, Luis Ortiz, Luis Perez-Breva, Alan Qi, Patrycja Missiuro & all past members of Tommi’s reading group for paper discussions, conference trips and feedback on my talks.
• Many, many others who have helped me along the way!
Low-Rank Optimization

[Figure: the objective landscape restricted to low-rank matrices, contrasting the unconstrained objective minimum, the low-rank minimum, and a low-rank local minimum]