1
Autonomous Web-scale Information Extraction
Doug Downey
Advisor: Oren Etzioni
Department of Computer Science and Engineering, Turing Center, University of Washington
2
Web Information Extraction
…cities such as Chicago… => Chicago ∈ City
C such as x => x ∈ C [Hearst, 1992]
…Edison invented the light bulb… => (Edison, light bulb) ∈ Invented
x V y => (x, y) ∈ V
e.g., KnowItAll [Etzioni et al., 2005], TextRunner [Banko et al., 2007], others [Pasca et al., 2007]
3
Identifying correct extractions
…mayors of major cities such as Giuliani… => Giuliani ∈ City
Supervised IE: hand-label examples of each concept
Not possible on the Web (far too many concepts)
=> Unsupervised IE (UIE)
How can we automatically identify correct extractions for any concept without hand-labeled data?
4
KnowItAll Hypothesis (KH)
Extractions that occur more frequently in distinct sentences in the corpus are more likely to be correct.
Repetitions of the same error are relatively rare
…mayors of major cities such as Giuliani… …hotels in popular cities such as Marriot.…
Misinformation is the exception rather than the rule
“Elvis killed JFK” – 200 hits“Oswald killed JFK” – 3000 hits
5
Redundancy
KH can identify many correct statements because the Web is highly redundant
– same facts repeated many times, in many ways – e.g., “Edison invented the light bulb” – 10,000 hits
(but leveraging the KH is a little tricky => probabilistic model)
Thesis: We can identify correct extractions without labeled data using a probabilistic model of redundancy.
6
Outline
1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models
7
Classical Supervised Learning
?
Learn a function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)
x1
x2
8
Semi-Supervised Learning (SSL)
Learn a function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)
x1
x2
9
Monotonic Features
x1
x2
Learn a function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1 and unlabeled examples (x)
10
Monotonic Features
x1
x2
Learn a function from x = (x1, …, xd) to y ∈ {0, 1}, given monotonic feature x1 and unlabeled examples (x)
P(y=1 | x1) increases with x1
11
Common Structure
Task and monotonic feature:
- UIE: “C such as x” [Etzioni et al., 2005]
- Word Sense Disambiguation: “plant and animal species” [Yarowsky, 1995]
- Information Retrieval: search query [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
- Document Classification: topic word, e.g. “politics” [McCallum & Nigam, 1999; Gliozzo, 2005]
- Named Entity Recognition: contains(“Mr.”) [Collins & Singer, 1998]
12
The MF model is provably distinct from the standard smoothness assumptions in SSL (the cluster assumption and the manifold assumption) => MFs can complement other methods.
Unlike co-training, the MF model requires neither labeled data nor pre-defined “views”.
Isn’t this just ___ ?
13
Theoretical Results
One MF implies PAC-learnability without labeled data, when the MF is conditionally independent of the other features and is minimally informative (a corollary to the co-training theorem [Blum and Mitchell, 1998]).
MFs provide more information (vs. labels) about unlabeled examples as the feature space grows: as the number of features increases, the information gain due to MFs stays constant, while the information gain due to labeled examples falls (under assumptions).
14
Classification with the MF Model
MFA: Given MFs and unlabeled data:
1) Use the MFs to produce noisy labels
2) Train any classifier
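The two-step recipe above can be sketched in a few lines. This is an illustrative implementation, not the thesis's: it assumes a single monotonic feature, produces noisy labels from the MF's extreme values (the cutoff fraction is an arbitrary choice), and uses a nearest-centroid classifier to stand in for "any classifier".

```python
import statistics

def mfa(examples, mf_index, label_fraction=0.34):
    """Sketch of MFA: use a monotonic feature (MF) to produce noisy
    labels at the MF's extremes, then train a classifier on all
    features.  Nearest-centroid stands in for "any classifier"."""
    ranked = sorted(examples, key=lambda x: x[mf_index])
    m = max(1, int(len(examples) * label_fraction))
    noisy_neg, noisy_pos = ranked[:m], ranked[-m:]   # noisy labels from the MF
    def centroid(rows):
        return [statistics.mean(col) for col in zip(*rows)]
    c_neg, c_pos = centroid(noisy_neg), centroid(noisy_pos)
    def classify(x):
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
        return 1 if dist(c_pos) < dist(c_neg) else 0
    return classify

# Toy data: feature 0 is the MF, feature 1 is an ordinary feature.
data = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (4.0, 5.0), (5.0, 4.5), (5.5, 5.0)]
clf = mfa(data, mf_index=0)
```

Note that only the noisy labels come from the MF; the trained classifier then uses every feature, which is what lets MFA generalize beyond the MF itself.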
15
Experimental Results
20 Newsgroups dataset (MF: newsgroup name)
vs. two SSL baselines (NB + EM, LP)
Without labeled data:
16
MFA-SSL provides a 15% error reduction for 100-400 labeled examples.
MFA-BOTH provides a 31% error reduction for 0-800 labeled examples.
Experimental Results
17
Bad News: confusable MFs
For more complex tasks, monotonicity is insufficient
Example: City extractions
MF: extraction frequency with e.g., “cities such as x”
…but extraction frequency is also an MF for: has skyscrapers, has an opera house, located on Earth, …
Extraction MF value
New York 1488
Chicago 999
Los Angeles 859
… …
Twisp 1
Northeast 1
18
Performance of MFA in UIE
19
MFA for SSL in UIE
20
Outline
1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models
21
If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?
Consider a single pattern suggesting C, e.g.:
countries such as x
Redundancy: Single Pattern
22
“…countries such as Saudi Arabia…”
“…countries such as the United States…”
“…countries such as Saudi Arabia…”
“…countries such as Japan…”
“…countries such as Africa…”
“…countries such as Japan…”
“…countries such as the United Kingdom…”
“…countries such as Iraq…”
“…countries such as Afghanistan…”
“…countries such as Australia…”
C = Country
n = 10 occurrences
Redundancy: Single Pattern
23
Naïve Model: Noisy-Or
C = Country, n = 10 (extraction, k, P_noisy-or):
Saudi Arabia, 2, 0.99
Japan, 2, 0.99
United States, 1, 0.9
Africa, 1, 0.9
United Kingdom, 1, 0.9
Iraq, 1, 0.9
Afghanistan, 1, 0.9
Australia, 1, 0.9
p = probability the pattern yields a correct extraction; here p = 0.9
P_noisy-or(x ∈ C | x seen k times) = 1 - (1 - p)^k
[Agichtein & Gravano, 2000; Lin et al. 2003]
Noisy-or ignores: sample size (n), and the distribution of C
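The noisy-or formula is easy to state in code; this small sketch reproduces the slide's p = 0.9 example:

```python
def noisy_or(p, k):
    """P(x in C | x seen k times): each of the k occurrences of the
    pattern is independently correct with probability p."""
    return 1 - (1 - p) ** k

# Slide example with p = 0.9:
print(noisy_or(0.9, 1))  # seen once  -> 0.9
print(noisy_or(0.9, 2))  # seen twice -> 0.99
```

Because k is the only input that varies, the estimate is identical whether the corpus yielded 10 occurrences of the pattern or 50,000, which is exactly the failure the next slides illustrate.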
24
Needed in Model: Sample Size
C = Country, n ≈ 50,000 (extraction, k, P_noisy-or):
United States, 3899, 0.9999…
China, 1999, 0.9999…
OilWatch Africa, 1, 0.9
Religion Paraguay, 1, 0.9
Chicken Mole, 1, 0.9
Republics of Kenya, 1, 0.9
Atlantic Ocean, 1, 0.9
C = Country, n = 10 (extraction, k, P_noisy-or):
Saudi Arabia, 2, 0.99
Japan, 2, 0.99
United States, 1, 0.9
Africa, 1, 0.9
United Kingdom, 1, 0.9
Iraq, 1, 0.9
Afghanistan, 1, 0.9
Australia, 1, 0.9
As sample size increases, noisy-or becomes inaccurate.
25
Needed in Model: Distribution of C
C = Country, n ≈ 50,000 (extraction, k, P_noisy-or):
United States, 3899, 0.9999…
China, 1999, 0.9999…
OilWatch Africa, 1, 0.9
Religion Paraguay, 1, 0.9
Chicken Mole, 1, 0.9
Republics of Kenya, 1, 0.9
Atlantic Ocean, 1, 0.9
P_freq(x ∈ C | x seen k times) = 1 - (1 - p)^(k/n)
26
Needed in Model: Distribution of C
C = Country, n ≈ 50,000 (extraction, k, P_freq):
United States, 3899, 0.9999…
China, 1999, 0.9999…
OilWatch Africa, 1, 0.05
Religion Paraguay, 1, 0.05
Chicken Mole, 1, 0.05
Republics of Kenya, 1, 0.05
Atlantic Ocean, 1, 0.05
P_freq(x ∈ C | x seen k times) = 1 - (1 - p)^(k/n)
27
Needed in Model: Distribution of C
C = City, n ≈ 50,000 (extraction, k, P_freq):
New York, 1488, 0.9999…
Chicago, 999, 0.9999…
El Estor, 1, 0.05
Nikki, 1, 0.05
Ragaz, 1, 0.05
Villegas, 1, 0.05
Northeastwards, 1, 0.05
C = Country, n ≈ 50,000 (extraction, k, P_freq):
United States, 3899, 0.9999…
China, 1999, 0.9999…
OilWatch Africa, 1, 0.05
Religion Paraguay, 1, 0.05
Chicken Mole, 1, 0.05
Republics of Kenya, 1, 0.05
Atlantic Ocean, 1, 0.05
The probability that x ∈ C depends on the distribution of C.
28
My solution: URNS Model
Urn for C = City: balls labeled Tokyo, U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.
A draw: …cities such as Tokyo…
29
Urn – Formal Definition
C – set of unique target labels
E – set of unique error labels
num(C) – distribution of target labels
num(E) – distribution of error labels
30
Urn Example
Urn for C = City: Tokyo, Tokyo, Atlanta, Atlanta, Sydney, Cairo, Yakima (targets); U.K., U.K., Utah (errors)
distribution of target labels: num(C) = {2, 2, 1, 1, 1}
distribution of error labels: num(E) = {2, 1}
31
If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?
Computing Probabilities
32
Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?
P(x ∈ C | x appears k of n) = Σ_{r ∈ num(C)} (r/s)^k (1 - r/s)^(n-k) / Σ_{r′ ∈ num(C) ∪ num(E)} (r′/s)^k (1 - r′/s)^(n-k)
where s is the total number of balls in the urn
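Under the urn's draws-with-replacement semantics, this probability can be computed directly. The sketch below assumes the urn contents are known (in the thesis the distributions are instead estimated from unlabeled data), and uses the fact that the binomial coefficients cancel in the ratio:

```python
def urns_probability(num_C, num_E, k, n):
    """P(x in C | x appears k times in n draws with replacement),
    for an urn with known target multiplicities num_C and error
    multiplicities num_E.  Binomial coefficients cancel in the ratio."""
    s = sum(num_C) + sum(num_E)                       # total balls in the urn
    weight = lambda r: (r / s) ** k * (1 - r / s) ** (n - k)
    target = sum(weight(r) for r in num_C)
    return target / (target + sum(weight(r) for r in num_E))

# The toy urn from the Urn Example slide: num(C) = {2,2,1,1,1}, num(E) = {2,1}
p = urns_probability([2, 2, 1, 1, 1], [2, 1], k=2, n=10)
```

A sanity check on the formula: with a perfectly symmetric urn (num(C) identical to num(E)), the probability is exactly 1/2 for any k and n.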
Computing Probabilities
33
URNS without labeled data
Needed: num(C), num(E)
Assumed to be Zipfian:
frequency of the ith element ∝ i^(-z)
With assumptions, learn Zipfian parameters for any class C from unlabeled data alone
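The Zipfian assumption can be illustrated in a few lines. The log-log regression below is an illustrative estimator, not the one used in the thesis; it shows that rank-frequency data generated with exponent z yields a straight line in log-log space whose slope recovers -z:

```python
import math

def zipf_frequencies(num_elements, z, total):
    """Expected counts when the frequency of the ith element is
    proportional to i**(-z)."""
    weights = [i ** -z for i in range(1, num_elements + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]

def estimate_z(counts):
    """Least-squares slope of log(count) vs. log(rank); the slope is -z."""
    xs = [math.log(i) for i in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

counts = zipf_frequencies(1000, z=1.2, total=50000)
print(round(estimate_z(counts), 3))  # -> 1.2
```

Real extraction counts mix two such curves (targets and errors), which is why the following slides learn the mixture rather than a single exponent.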
34
URNS without labeled data
[Diagram: the observed frequency distribution is a mixture: with probability p a draw comes from the Zipf-distributed targets C; with probability 1 - p, from the Zipf-distributed errors E.]
p and num(E) are constant across C, for a given pattern
=> Learn num(C) from unlabeled data!
35
Probabilities Assigned by URNS
C = City, n ≈ 50,000 (extraction, k, P_URNS):
New York, 1488, 0.9999…
Chicago, 999, 0.9999…
El Estor, 1, 0.63
Nikki, 1, 0.63
Ragaz, 1, 0.63
Villegas, 1, 0.63
Cres, 1, 0.63
Northeastwards, 1, 0.63
C = Country, n ≈ 50,000 (extraction, k, P_URNS):
United States, 3899, 0.9999…
China, 1999, 0.9999…
OilWatch Africa, 1, 0.03
Religion Paraguay, 1, 0.03
Chicken Mole, 1, 0.03
Republics of Kenya, 1, 0.03
Atlantic Ocean, 1, 0.03
New Zeland, 1, 0.03
36
Probability Accuracy
[Bar chart: deviation from ideal log likelihood (0-5) for City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi]
URNS’s probabilities are 15-22x closer to optimal.
37
Sensitivity Analysis
URNS assumes num(E), p are constant
If we alter parameter choices substantially, URNS still outperforms noisy-or, PMI by at least 8x
Most sensitive to p
p ≈ 0.85 is relatively consistent across randomly selected classes from WordNet (solvents, devices, thinkers, relaxants, mushrooms, mechanisms, resorts, flies, tones, machines, …)
38
Multiple Extraction Patterns
Multiple urns: target label frequencies are correlated across urns, while error label frequencies can be uncorrelated.
Phrase Hits
“Omaha and other cities” 950
“Illinois and other cities” 24,400
“cities such as Omaha” 930
“cities such as Illinois” 6
39
Benefits from Multiple Urns
Precision at K:
K Single Multiple
10 1.0 1.0
20 0.9875 1.0
50 0.925 0.955
100 0.8375 0.845
200 0.7075 0.71
Using multiple urns reduces error by 29%.
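Precision at K, as used in this table, is just the fraction of correct extractions among the top K of a ranking. A minimal sketch (the correctness flags here are made up for illustration):

```python
def precision_at_k(ranked_correct, k):
    """Fraction of the top-k ranked extractions that are correct.
    ranked_correct: list of 1/0 correctness flags, best-ranked first."""
    top = ranked_correct[:k]
    return sum(top) / len(top)

ranking = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # hypothetical correctness flags
print(precision_at_k(ranking, 5))  # -> 0.8
```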
40
URNS vs. MFA
41
URNS + MFA in SSL
MFA-SSL (URNS) reduces error by 6%, on average.
42
URNS: Learnable from unlabeled data
All URNS parameters can be learned from unlabeled data alone [Theorem 20]
URNS implies PAC learnability from unlabeled data alone [Theorem 21]
Even with confusable MFs (i.e. even without conditional independence)
(with assumptions)
43
Parameters Learnable (1)
We can express the URNS model as a compound Poisson process mixture.
gC(·) + gE(·) can be learned, given enough samples [Loh, 1993]
Task: learn the power-law distributions gC(·), gE(·) from their sum
44
Parameters Learnable (2)
Assume:
- sufficiently high frequency => only target elements
- sufficiently low frequency => only errors
Then the sum gC(·) + gE(·) can be separated into its two components.
45
Outline
1) Background
2) KH as a general problem structure
• Monotonic Feature Model
3) URNS model
• How does probability increase with repetition?
4) Challenge: The “long tail”
• Unsupervised language models
46
Challenge: the “long tail”
[Plot: number of times an extraction appears in the pattern (0-500) vs. frequency rank of the extraction (0-100,000)]
Head of the distribution: tend to be correct, e.g., (Bloomberg, New York City)
Tail: a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)
47
Mayor McCheese
48
Assessing Sparse Extractions
Strategy:
1) Model how common extractions occur in text
2) Rank sparse extractions by their fit to the model
49
The Distributional Hypothesis
Terms in the same class tend to appear in similar contexts.
Context, hits with Chicago, hits with Twisp:
“cities including __” 42,000 1
“__ and other cities” 37,900 0
“__ hotels” 2,000,000 1,670
“mayor of __” 657,000 82
50
Unsupervised Language Models
- Precomputed => scalable
- Handle sparsity
51
Baseline: context vectors
Form a context vector for each extracted argument:
…cities such as Chicago , Boston ,
But Chicago isn’t the best
cities such as Chicago , Boston ,
Los Angeles and Chicago .
…
Chicago: “such as <x> , Boston”: 2, “But <x> isn’t the”: 1, “Angeles and <x> .”: 1
Compute dot products between the vectors of common and sparse extractions [cf. Ravichandran et al. 2005]
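A sketch of this baseline (the tokenization and two-token window are illustrative choices, not the thesis's exact setup):

```python
from collections import Counter

def context_vector(sentences, term, window=2):
    """Counts of (left, right) token windows around each occurrence of term."""
    vec = Counter()
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            if w == term:
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                vec[(left, right)] += 1
    return vec

def dot(v1, v2):
    """Dot product over shared contexts (missing contexts count as 0)."""
    return sum(count * v2[ctx] for ctx, count in v1.items())

sents = ["cities such as Chicago , Boston",
         "cities such as Twisp , Boston"]
common = context_vector(sents, "Chicago")
sparse = context_vector(sents, "Twisp")
print(dot(common, sparse))  # shared context "such as __ , Boston" -> 1
```

A sparse extraction scores well only if it happens to share an exact context string with a common one, which is why the next slides compress these vectors into denser distributional summaries.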
52
HMM Compresses Context Vectors
Twisp: < . . . 0 0 0 1 . . . >
HMM(Twisp): < 0.14 0.01 … 0.06 >  (t = 1, 2, …, N)
The HMM provides a “distributional summary”:
- Compact (efficient: 10-50x less data retrieved)
- Dense (accurate: 23-46% error reduction)
53
Task: Ranking sparse TextRunner extractions.
Metric: Area under precision-recall curve.
Language models reduce missing area by 39% over nearest competitor.
Experimental Results
Headquartered Merged Average
Frequency 0.710 0.784 0.713
PL 0.651 0.851 … 0.785
LM 0.810 0.908 0.851
54
Summary of Thesis
Formalization of Monotonic Features (MFs):
- One MF enables PAC learnability from unlabeled data alone [Corollary 4.1]
- MFs provide greater information gain vs. labels as the feature space increases in size [Theorem 8]
- The MF model is formally distinct from other SSL approaches [Theorems 9 and 10]
- The MF model is insufficient when “subconcepts” are present [Proposition 12]
55
Summary: MFs (Continued)
MFA: a general SSL algorithm for MFs
- Given MFs, MFA performance is equivalent to a state-of-the-art SSL algorithm with 160 labeled examples. [Table 2.1]
- Even when MFs are not given, MFA can detect MFs in SSL, reducing error by 16%. [Figure 2.5]
- MFA is not effective for UIE. [Table 2.2 & Figure 2.6]
56
Summary: URNS
URNS: a formal model of redundancy in IE
- Describes how probability increases with MF value [Proposition 13]
- Models corroboration among multiple extraction mechanisms (multiple urns) [Proposition 14]
57
URNS Theoretical Results
Uniform Special Case (USC):
- Odds in the USC increase exponentially with repetition [Theorem 15]
- Error decreases exponentially when parameters are known [Theorem 16]
Zipfian Case (ZC):
- Closed-form expression for the ZC probability given the parameters, and for the odds given repetitions [Theorem 17]
- Error in the ZC is bounded above by K/n^(1-ε) for any ε > 0 when parameters are known [Theorem 19]
58
URNS Theoretical Results (cont.)
Zipfian Case (ZC):
- In the ZC, with probability 1 - δ, the parameters of URNS can be estimated with error < ε, for all ε, δ > 0, given sufficient data [Theorem 20]
- In the ZC, URNS guarantees PAC learnability given only unlabeled data, provided the MF is sufficiently informative and a “separability” criterion is met in the concept space [Theorem 21]
59
URNS Experimental Results
Supervised learning [Table 3.3]:
- 19% error reduction over noisy-or
- 10% error reduction over logistic regression
- Comparable performance to SVM
Semi-supervised IE [Figure 3.4]:
- 6% error reduction over LP
Unsupervised IE [Figure 3.2]:
- 1500% error reduction over noisy-or
- 2200% error reduction over PMI
Improved efficiency [Table 3.2]:
- 8x faster than PMI
60
Other Applications of URNS
Estimating extraction precision and recall [Table 3.7]
Identifying synonymous objects and relations (RESOLVER) [Yates & Etzioni, 2007]
Identifying functional relations in text [Ritter et al., 2008]
61
Assessing Sparse Extractions
Hidden Markov Model assessor (HMM-T):
- Error reduction of 23-46% over context vectors on the typechecking task [Table 4.1]
- Error reduction of 28% over context vectors on sparse unary extractions [Table 4.2]
- 10-50x more efficient than context vectors
Sparse extraction assessment with language models:
Error reduction of 39% over previous work [Table 4.3]
Massively more scalable than previous techniques
62
Acknowledgements: Oren Etzioni, Mike Cafarella, Pedro Domingos, Susan Dumais, Eric Horvitz, Alan Ritter, Stef Schoenmackers, Stephen Soderland, Dan Weld
63
64
65
Extraction is sometimes “easy”: generic extraction patterns
…cities such as Chicago… => Chicago ∈ City
C such as x => x ∈ C [Hearst, 1992]
But most sentences are “tough”:
We walked the tree-lined streets of the bustling metropolis that is Atlanta.
Extracting Atlanta ∈ City requires:
- Syntactic parsing (Atlanta -> is -> metropolis)
- Subclass discovery (metropolis(x) => city(x))
Challenging & difficult to scale, e.g. [Collins, 1997; Snow & Ng 2006]
Web IE without labeled examples
66
Extraction is sometimes “easy”: generic extraction patterns
…cities such as Chicago… => Chicago ∈ City
C such as x => x ∈ C [Hearst, 1992]
But most sentences are “tough”:
We walked the tree-lined streets of the bustling metropolis that is Atlanta.
“cities such as Atlanta” – 21,600 Hits
Web IE without labeled examples
67
Web IE without labeled examples
Extraction is sometimes “easy”: generic extraction patterns
…cities such as Chicago… => Chicago ∈ City
C such as x => x ∈ C [Hearst, 1992]
…Bloomberg, mayor of New York City… => (Bloomberg, New York City) ∈ Mayor
x, C of y => (x, y) ∈ C
The scale and redundancy of the Web makes a multitude of facts “easy” to extract.
68
http://www.cs.washington.edu/research/textrunner/
[Banko et al., 2007]
TextRunner Search
69
Extraction patterns make errors:
“Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and…”
But…
Task: Assess which extractions are correct
- without hand-labeled examples
- at Web-scale
Thesis: “We can assess extraction correctness by leveraging redundancy and probabilistic models.”
70
1) Motivation
2) Background on Web IE
3) Estimating extraction correctness URNS model of redundancy
[Downey et al., IJCAI 2005]
(Distinguished Paper Award)
4) Challenge: The “long tail”
5) Machine learning generalization
Outline
71
Redundancy – Two Intuitions
1) Repetition
2) Multiple patterns
Phrase Hits
“Chicago and other cities” 94,400
“Illinois and other cities” 23,100
“cities such as Chicago” 42,500
“cities such as Illinois” 7
Goal: a formal model of these intuitions.
Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?
89
Scalability
Computation is efficient: continuous Zipf & Poisson approximations => a closed-form expression for P(x ∈ C | evidence)
vs. Pointwise Mutual Information (PMI) [Etzioni et al. 2005]: PMI is computed with search-engine hit counts (inspired by [Turney, 2000])
URNS requires no hit-count queries (~8x faster)
90
URNS: Contributions
- Probabilistic model of redundancy
- Accurate without hand-labeled examples (15-22x improvement in accuracy)
- Scalable (8x faster)
[Downey et al., IJCAI 2005]
91
1) Motivation
2) Background on Web IE
3) Estimating extraction correctness
4) Challenge: The “long tail” Language models to the rescue
[Downey et al., ACL 2007]
5) Machine learning generalization
Outline
95
The “distributional hypothesis”: Instances of the same relationship tend to appear in similar contexts.
…David B. Shaver was elected as the new mayor of Pickerington, Ohio.
http://www.law.capital.edu/ebriefsarchive/Summer2004/ClassActionsLeft.asp
…Mike Bloomberg was elected as the new mayor of New York City.
http://www.queenspress.com/archives/coverstories/2001/issue52/coverstory.htm
Assessing Sparse Extractions
96
Type errors are common:
Alexander the Great conquered Egypt… => (Great, Egypt) ∈ Conquered
Locally acquired malaria is now uncommon… => (Locally, malaria) ∈ Acquired
Type checking
98
Baseline: context vectors (2)
Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . . 0 0 0 1 . . . >
(contexts: “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)
Problems:
- Vectors are large
- Intersections are sparse
99
Hidden Markov Model (HMM)
States t_i, t_{i+1}, t_{i+2}, t_{i+3} – unobserved
Words w_i, w_{i+1}, w_{i+2}, w_{i+3} – observed (e.g., “cities such as Seattle”)
Hidden states t_i ∈ {1, …, N} (N fairly small)
Train on unlabeled data:
- P(t_i | w_i = w) is an N-dimensional distributional summary of w
- Compare extractions using KL divergence
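The KL-divergence comparison of two distributional summaries is a one-liner; the state distributions below are made-up three-state examples, not values from the thesis:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same N hidden states."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical P(t | w) summaries for two city names and a non-city term:
chicago      = [0.70, 0.20, 0.10]
pickerington = [0.65, 0.25, 0.10]
northeast    = [0.05, 0.15, 0.80]

# Lower divergence => more similar distributional summaries
print(kl_divergence(chicago, pickerington) < kl_divergence(chicago, northeast))  # True
```

Because the summaries are dense (every state has some mass), two terms can be close even if they never share an exact context string, unlike raw context vectors.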
100
HMM Compresses Context Vectors
Twisp: < . . . 0 0 0 1 . . . >
P(t | Twisp): < 0.14 0.01 … 0.06 >  (t = 1, 2, …, N)
Distributional summary P(t | w):
- Compact (efficient: 10-50x less data retrieved)
- Dense (accurate: 23-46% error reduction)
101
Example
Is Pickerington of the same type as Chicago?
“Chicago , Illinois”  “Pickerington , Ohio”
Chicago: “<x> , Illinois”: 291, “<x> , Ohio”: 0
Pickerington: “<x> , Illinois”: 0, “<x> , Ohio”: 1
=> Context vectors say no; the dot product is 0!
102
HMM Generalizes:
Chicago , Illinois
Pickerington , Ohio
Example
104
REALM: Contributions
- No hand-labeled data
- Scalability: language models are precomputed => can be queried at interactive speed
- Improved accuracy over previous work
[Downey et al., ACL 2007]
105
1) Motivation
2) Background on Web IE
3) Estimating extraction correctness
4) Challenge: The “long tail”
5) Machine learning generalization Monotonic Features
[Downey et al., 2008 (submitted)]
Outline
106
Common Structure
Task, hint, bootstrap:
- Web IE: “x, C of y”; Distributional Hypothesis
- Word Sense Disambiguation: “plant and animal species”; one sense per context, one sense per discourse [Yarowsky, 1995]
- Information Retrieval: search query; pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
- Document Classification: topic word, e.g. “politics”; semi-supervised learning [McCallum & Nigam, 1999; Gliozzo, 2005]
107
Common Structure
Classification of examples x = (x1, …, xd) into classes y ∈ {0, 1}, given the identity of a monotonic feature xi such that: P(y = 1 | xi) increases strictly monotonically with xi
113
Exploiting MF Structure
1. No labeled data, MFs given (MA): with noisy labels from the MFs, train any classifier
2. Labeled data, no MFs given (MA-SSL): detect MFs from labeled data, then run MA
3. Labeled data and MFs given (MA-BOTH): run MA with the given & detected MFs
114
Experimental Results
20 Newsgroups dataset (MFs: newsgroup names)
Task: Given text, determine the newsgroup of origin
Without labeled data:
115
MA-SSL provides a 15% error reduction for 100-400 labeled examples.
MA-BOTH provides a 31% error reduction for 0-800 labeled examples.
Experimental Results
116
Relationship to other approaches
- Co-training: requires labeled examples and known views
- Semi-supervised smoothness assumptions (the cluster assumption, the manifold assumption): both provably distinct from MF structure
117
Summary of Results
Best known methods for IE without labeled data:
- Probabilities of correctness (URNS): massive improvements in accuracy (15-22x)
- Handling sparse data (language models): vastly more scalable than previous work, with accuracy wins (39% error reduction)
Generalization beyond IE:
- Monotonic Feature abstraction: widely applicable, with accuracy wins in document classification
118
Conclusions and Future Work
IE => Web IE. But we still need:
- A coherent knowledge base: is the “Chicago” in MayorOf(Chicago, Daley) the same “Chicago” as in Starred-in(Chicago, Zeta-Jones)? Future work: entity resolution, schema discovery
- Improved accuracy and coverage: we currently ignore character/document features, recursive structure, etc. Future work: more sophisticated language models (e.g. PCFGs)
119
Thanks!
Acknowledgements:Oren Etzioni
Mike CafarellaPedro DomingosSusan Dumais
Eric HorvitzStef Schoenmackers
Dan Weld
120
Self-Supervised Learning
Paradigm Input Examples Output
Supervised Labeled Classifier
Semi-supervised Labeled & Unlabeled Classifier
Self-supervised Unlabeled Classifier
Unsupervised Unlabeled Clustering
121
Future Work
- Language modeling for IE: REALM is simple; it ignores character- or document-level features, Web structure, and recursive structure (PCFGs)
- Goal: “x won an Oscar for playing a villain…” – what is P(x)?
- From facts to knowledge: entity resolution and inference
122
Other Work
- Named entity location: lexical statistics improve the state of the art [Downey et al., IJCAI 2007]
- Modeling Web search: characterizing user behavior [Downey et al., SIGIR 2007 (poster); Liebling et al., 2008 (submitted)]; predictive models [Downey et al., IJCAI 2007]
123
Web Fact-Finding
Who has won three or more Academy Awards?
124
Web Fact-Finding
Problems: the user has to pick the right words, often a tedious process:
"world foosball champion in 1998" – 0 hits
“world foosball champion” 1998 – 2 hits, no answer
What if I could just ask for P(x) in “x was world foosball champion in 1998”?
How far can language modeling and the distributional hypothesis take us?
125
Miami: < . . . 98 0 20 250 30 513 . . . >
Twisp: < . . . 5 0 1 2 1 1 . . . >
Star Wars: < . . . 1 1000 0 2 1 1 . . . >
(contexts include: “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”)
KnowItAll Hypothesis vs. Distributional Hypothesis
127
TextRunner Search: “invent”, run in real time, ranked by frequency
REALM improves precision of the top 20 extractions by an average of 90%.
128
Improving TextRunner: Example (1)
“headquartered” – TextRunner top 10: company, Palo Alto; held company, Santa Cruz; storage hardware and software, Hopkinton; Northwestern Mutual, Tacoma; 1997, New York City; Google, Mountain View; PBS, Alexandria; Linux provider, Raleigh; Red Hat, Raleigh; TI, Dallas (TR Precision: 40%)
REALM top 10: Tarantella, Santa Cruz; International Business Machines Corporation, Armonk; Mirapoint, Sunnyvale; ALD, Sunnyvale; PBS, Alexandria; General Dynamics, Falls Church; Jupitermedia Corporation, Darien; Allegro, Worcester; Trolltech, Oslo; Corbis, Seattle (REALM Precision: 100%)
129
Improving TextRunner: Example (2)
“conquered” – TextRunner top 10: Great, Egypt; conquistador, Mexico; Normans, England; Arabs, North Africa; Great, Persia; Romans, part; Romans, Greeks; Rome, Greece; Napoleon, Egypt; Visigoths, Suevi Kingdom (TR Precision: 60%)
REALM top 10: Arabs, Rhodes; Arabs, Istanbul; Assyrians, Mesopotamia; Great, Egypt; Assyrians, Kassites; Arabs, Samarkand; Manchus, Outer Mongolia; Vandals, North Africa; Arabs, Persia; Moors, Lagos (REALM Precision: 90%)
130
Previous n-gram technique (1)
1) Form a context vector for each extracted argument:…
cities such as Chicago , Boston ,
But Chicago isn’t the best
cities such as Chicago , Boston ,
Los Angeles and Chicago .
…
2) Compute dot products between extractions and seeds in this space [cf. Ravichandran et al. 2005].
1 2 1… …
such as <x> , Boston
But <x> isn’t the
Angeles and <x> .
131
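The two steps on this slide can be sketched in a few lines. This is a toy illustration, not the actual system: the sentences, the window size k, and the whitespace tokenization are all made up for the example.

```python
from collections import Counter

def context_vector(sentences, term, k=2):
    """Step 1: count the k-token contexts around each occurrence of
    `term`, with the occurrence itself replaced by the placeholder <x>."""
    vec = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == term:
                ctx = toks[max(0, i - k):i] + ["<x>"] + toks[i + 1:i + 1 + k]
                vec[" ".join(ctx)] += 1
    return vec

def dot(u, v):
    """Step 2: dot product of two sparse context vectors."""
    return sum(u[c] * v[c] for c in u if c in v)

sentences = [
    "he visited Chicago and other cities",
    "he visited Chicago and other cities",
    "he visited Boston and other cities",
    "But Chicago isn't the best",
]
chicago = context_vector(sentences, "Chicago")
boston = context_vector(sentences, "Boston")
print(dot(chicago, boston))  # shared context "he visited <x> and other" => 2
```

A seed like Boston scores highly against Chicago because they share contexts; an unrelated term shares none and scores 0, which is exactly the sparsity problem the next slides address.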
Previous n-gram technique (2)

Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . . 0 0 0 1 . . . >

(contexts: “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems:
 Vectors are large
 Intersections are sparse
132
Compressing Context Vectors

Miami:        < . . . 71 25 1 513 . . . >
P(t | Miami): < 0.14 0.01 … 0.06 >   (t = 1, 2, …, N)

Latent state distribution P(t | w):
 Compact (efficient – 10-50x less data retrieved)
 Dense (accurate – 23-46% error reduction)
133
Example: N-Grams on Sparse Data

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

               <x> , Illinois   <x> , Ohio
Chicago:       291              0
Pickerington:  0                1

=> N-grams says no – the dot product is 0!
134
Example: HMM-T on Sparse Data

HMM generalizes:

Chicago , Illinois
Pickerington , Ohio
135
HMM-T Limitations

Learning iterations take time proportional to (corpus size × T^(k+1))
 T = number of latent states
 k = HMM order

We use limited values T=20, k=3
 Sufficient for typechecking (Santa Clara is a city)
 Too coarse for relation assessment
 (Santa Clara is where Intel is headquartered)
136
The REALM Architecture

Two steps for assessing R(arg1, arg2):

Typechecking
 Ensure arg1 and arg2 are of the proper type for R
 e.g., rules out MayorOf(Intel, Santa Clara)
 Leverages all occurrences of each arg

Relation Assessment
 Ensure R actually holds between arg1 and arg2
 e.g., rules out MayorOf(Giuliani, Seattle)

Both steps use pre-computed language models
=> Scales to Open IE
137
Relation Assessment

Type checking isn’t enough:
 “NY Mayor Giuliani toured downtown Seattle.”

Want: How do arguments behave in relation to each other?
138
REL-GRAMS (1)

N-gram language model:
 P(wi, wi-1, …, wi-k)

arg1 and arg2 are often far apart => requires large k (inaccurate)
139
REL-GRAMS (2)

Relational Language Model (REL-GRAMS):

For any two arguments e1, e2:
 P(wi, wi-1, …, wi-k | wi = e1, e1 near e2)

k can be small – REL-GRAMS still captures entity relationships
Mitigate sparsity with the BM25 metric (from IR)
Combine with HMM-T by multiplying ranks.
140
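A toy version of the REL-GRAMS idea can be sketched as follows. The sentences and single-token entity names are made up, and a plain overlap count stands in for the BM25 weighting the real system uses:

```python
from collections import Counter

def rel_grams(sentences, e1, e2, k=2, near=8):
    """Collect the k tokens preceding each occurrence of e1 that has e2
    within `near` tokens - a toy stand-in for
    P(wi, wi-1, ..., wi-k | wi = e1, e1 near e2)."""
    grams = Counter()
    for sent in sentences:
        toks = sent.split()
        pos1 = [i for i, t in enumerate(toks) if t == e1]
        pos2 = [i for i, t in enumerate(toks) if t == e2]
        for i in pos1:
            if any(abs(i - j) <= near for j in pos2):
                grams[tuple(toks[max(0, i - k):i])] += 1
    return grams

def overlap(u, v):
    """Toy similarity between two rel-gram distributions
    (the real system weights contexts with BM25)."""
    return sum(min(u[g], v[g]) for g in u)

sents = [
    "Google is headquartered in MountainView",
    "RedHat is headquartered in Raleigh",
    "Napoleon conquered Egypt in 1798",
]
seed = rel_grams(sents, "MountainView", "Google")
good = rel_grams(sents, "Raleigh", "RedHat")
bad = rel_grams(sents, "Egypt", "Napoleon")
print(overlap(seed, good), overlap(seed, bad))
```

Because only a k-token window anchored at e1 is kept, k stays small even when the two arguments are several tokens apart, which is the point of the relational conditioning.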
Experiments

Task: Re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, Merged

REALM vs.
 TextRunner (TR) – frequency ordering (equivalent to PMI [Etzioni et al., 2005] and URNS [Downey et al., 2005])
 Pattern Learning (PL) – based on Snowball [Agichtein 2000]
 HMM-T and REL-GRAMS in isolation
141
Learning num(C) and num(E)

From untagged data: an ill-posed problem
• num(C) can vary wildly with C
 e.g., countries vs. cities vs. mayors

Assume:
1) Consistent precision of a single co-occurrence,
 e.g., in a randomly drawn phrase “C such as x”,
 x ∈ C about p of the time (0.9 for [Etzioni et al., 2005])
2) num(E) is constant for all C
3) num(C) is Zipf distributed

Estimate num(C) from untagged data using EM [Downey et al., 2005]
(Also: multiple contexts)
142
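The model's central prediction, that the probability an extraction is correct grows with its repetition count, can be checked numerically. This sketch uses made-up values for num(C), num(E), z, and the number of draws, and a Poisson approximation rather than the exact urn combinatorics:

```python
import math

def poisson(k, lam):
    # Poisson probability of observing exactly k draws given mean lam
    return math.exp(-lam) * lam ** k / math.factorial(k)

def precision_given_k(k, num_c=100, num_e=10000, z=1.0, n=5000):
    """P(label is correct | it appeared exactly k times in n draws),
    assuming the i-th most frequent label of each urn is drawn with
    probability proportional to i**-z (toy parameters)."""
    wc = [i ** -z for i in range(1, num_c + 1)]   # correct labels C
    we = [i ** -z for i in range(1, num_e + 1)]   # error labels E
    total = sum(wc) + sum(we)
    pc = sum(poisson(k, n * w / total) for w in wc)
    pe = sum(poisson(k, n * w / total) for w in we)
    return pc / (pc + pe)

# singletons are dominated by the long tail of rare errors,
# while repeated labels are far more likely to be correct (the KH)
print(precision_given_k(1), precision_given_k(10))
```

Under these toy parameters, a label seen once is almost always an error (the error urn has many rare labels), while a label seen ten times is much more likely to be correct, which is the monotone relationship the EM procedure exploits.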
URNS without labeled data

[Figure: frequency vs. frequency rank distributions]

P(x ∈ C) in “C such as x”: assumed ~0.9
Error distribution: assumed large, with Zipf parameter 1.0
143
URNS without labeled data

[Figure: frequency vs. frequency rank distributions]

Can vary wildly (e.g., cities vs. countries)
Learned from unlabeled data using EM
144
Distributional Similarity

Naïve approach – find sentences containing seed1 & seed2 or arg1 & arg2:

 wb … wh seed1 wh+2 … wi seed2 wi+2 … we
 wb … wh arg1 wh+2 … wi arg2 wi+2 … we

Compare context distributions:
 P(wb, …, we | seed1, seed2)
 P(wb, …, we | arg1, arg2)

But e – b can be large
 Many parameters, sparse data => inaccuracy
145
http://www.cs.washington.edu/research/textrunner/
TextRunner Search
146
Thesis

Large textual corpora are redundant, and we can use this observation to bootstrap extraction and classification models from minimally labeled, or even completely unlabeled, data.
147
Monotonic Features

Supervised classification task:
 Feature space X of d-tuples x = (x1, …, xd)
 Binary output space Y = {0, 1}
Inputs:
 Labeled examples DL = {(x, y)} ~ P(x, y)
Output: concept c: X -> {0, 1} that approximates P(y | x).
148
Monotonic Features

Semi-supervised classification task:
 Feature space X of d-tuples x = (x1, …, xd)
 Binary output space Y = {0, 1}
Inputs:
 Labeled examples DL = {(x, y)} ~ P(x, y)   (smaller)
 Unlabeled examples DU = {(x)} ~ P(x)
Output: concept c: X -> {0, 1} that approximates P(y | x).
149
Monotonic Features

Semi-supervised classification task:
 Feature space X of d-tuples x = (x1, …, xd)
 Binary output space Y = {0, 1}
Inputs:
 Labeled examples DL = {(x, y)} ~ P(x, y)   (potentially empty!)
 Unlabeled examples DU = {(x)} ~ P(x)
 Monotonic features M ⊆ {1, …, d} such that:
  P(y=1 | xi) increases strictly monotonically with xi for all i ∈ M.
Output: concept c: X -> {0, 1} that approximates P(y | x).
150
URNS without labeled data

Problem: num(C) can vary wildly
 e.g., cities vs. countries

Assume:
 num(C), num(E) Zipf distributed (freq. of ith element ∝ i^-z)
 p and num(E) independent of C

Learn num(C) from unlabeled data alone, with Expectation Maximization
151
Experimental Results

20 Newsgroups dataset
Task: Given text, determine newsgroup of origin
 (MFs: newsgroup name)

Without labeled data:
152
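How a monotonic feature lets the system label its own examples can be sketched as follows. The documents and threshold are made up; the real experiments use actual 20 Newsgroups posts and train a full classifier from the self-labeled seeds:

```python
def mf_score(text, group_name):
    """The monotonic feature: occurrences of the newsgroup's name.
    By the MF assumption, P(doc is from the group | score) increases
    monotonically with the score."""
    return text.lower().count(group_name.lower())

def self_label(docs, group_name, threshold=1):
    """Self-supervised seed labeling: docs where the MF fires become
    positive examples, the rest negative - no hand labels needed."""
    return [(doc, 1 if mf_score(doc, group_name) >= threshold else 0)
            for doc in docs]

docs = [
    "I posted this to rec.autos because my car won't start",
    "The new kernel release fixes a scheduler bug",
]
print(self_label(docs, "rec.autos"))
```

The seeds produced this way are noisy, but because the feature is monotonic, the high-scoring documents are reliable enough to bootstrap a classifier over the full feature space.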
HMM Type-checking

Typecheck each arg by comparing the HMM’s distributional summaries:

 f(arg) = (1 / |Seeds|) Σ_{seed_i ∈ Seeds} KL( P(t | seed_i) ‖ P(t | arg) )

Rank arguments in ascending order of f(arg).
153
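A minimal sketch of this ranking, with made-up latent-state distributions P(t | w) (in the real system these summaries come from the trained HMM):

```python
import math

def kl(p, q, eps=1e-9):
    """KL divergence KL(p || q) between two latent-state distributions,
    with a small epsilon guarding against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (q.get(t, 0.0) + eps))
               for t, pi in p.items() if pi > 0)

def f(arg_dist, seed_dists):
    """Mean KL from each seed's distribution to the argument's;
    arguments are ranked in ascending order of f."""
    return sum(kl(s, arg_dist) for s in seed_dists) / len(seed_dists)

# toy distributions over N = 3 latent states
seeds = [{1: 0.7, 2: 0.2, 3: 0.1}, {1: 0.6, 2: 0.3, 3: 0.1}]
city = {1: 0.65, 2: 0.25, 3: 0.10}   # behaves like the seed cities
film = {1: 0.05, 2: 0.15, 3: 0.80}   # behaves very differently
print(f(city, seeds) < f(film, seeds))  # prints True: the city type-checks better
```

Because the summaries are small dense vectors rather than sparse context vectors, the comparison stays cheap and works even for arguments, like Twisp, that appear only a handful of times.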
Classical Supervised Learning
?
Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)
x1
x2
154
Semi-supervised Learning (SSL)
Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)
x1
x2
155
Self-supervised Learning
x1
x2
Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x)
156
Self-supervised Learning
x1
x2
Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x), where the system labels its own examples
157
Self-supervised Learning

                  Input Examples        Output
Supervised        Labeled               Classifier
Semi-supervised   Labeled & Unlabeled   Classifier
Self-supervised   Unlabeled             Classifier
Unsupervised      Unlabeled             Clustering
158