Machine Reading of Web Text
Oren Etzioni
Turing Center, University of Washington
http://turing.cs.washington.edu
2
Rorschach Test
3
Rorschach Test for CS
4
Moore’s Law?
5
Storage Capacity?
6
Number of Web Pages?
7
Number of Facebook Users?
8
9
Turing Center Foci
Scale MT to 49,000,000 language pairs
  2,500,000 word translation graph
  P(V F C)?
  PanImages
Accumulate knowledge from the Web
A new paradigm for Web Search
10
Outline
1. A New Paradigm for Search
2. Open Information Extraction
3. Tractable Inference
4. Conclusions
11
Web Search in 2020?
Type key words into a search box?
Social or “human powered” Search?
The Semantic Web?
What about our technology exponentials?
“The best way to predict the future is to invent it!”
12
Intelligent Search
Instead of merely retrieving Web pages, read ‘em!
Machine Reading = Information Extraction (IE) + tractable inference
IE(sentence) = who did what?
  e.g., speaker(Alon Halevy, UW)
Inference = uncover implicit information
  e.g., Will Alon visit Seattle?
13
Application: Information Fusion
What kills bacteria?
What west coast, nano-technology companies are hiring?
Compare Obama’s “buzz” versus Hillary’s?
What is a quiet, inexpensive, 4-star hotel in Vancouver?
14
Opinion Mining
Opine (Popescu & Etzioni, EMNLP ’05)
IE(product reviews)
  Informative
  Abundant, but varied
  Textual
Summarize reviews without any prior knowledge of product category
15
16
17
But “Reading” the Web is Tough
Traditional IE is narrow
IE has been applied to small, homogeneous corpora
No parser achieves high accuracy
No named-entity taggers
No supervised learning
How about semi-supervised learning?
18
Semi-Supervised Learning
Few hand-labeled examples
  Limit on the number of concepts
  Concepts are pre-specified
  Problematic for the Web
Alternative: self-supervised learning
  Learner discovers concepts on the fly
  Learner automatically labels examples per concept!
19
2. Open IE = Self-supervised IE (Banko, Cafarella, Soderland, et al., IJCAI ’07)
                 Traditional IE                  Open IE
Input:           Corpus + Hand-labeled Data      Corpus
Relations:       Specified in Advance            Discovered Automatically
Complexity:      O(D * R), R relations           O(D), D documents
Text analysis:   Parser + Named-entity tagger    NP Chunker
20
Extractor Overview (Banko & Etzioni, ’08)
1. Use a simple model of relationships in English to label extractions
2. Bootstrap a general model of relationships in English sentences, encoded as a CRF
3. Decompose each sentence into one or more (NP1, VP, NP2) “chunks”
4. Use CRF model to retain relevant parts of each NP and VP.
The extractor is relation-independent!
21
TextRunner Extraction
Extract Triple representing binary relation (Arg1, Relation, Arg2) from sentence.
Internet powerhouse, EBay, was originally founded by Pierre Omidyar.
(Ebay, Founded by, Pierre Omidyar)
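To make the target representation concrete, here is a toy sketch of turning the example sentence into an (Arg1, Relation, Arg2) triple. It is not TextRunner's extractor: the `Triple` type, the `PASSIVE` regular expression, and the `extract` function are hypothetical names, and a single hand-written pattern stands in for the learned, relation-independent model described on the previous slide.

```python
import re
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    arg1: str
    relation: str
    arg2: str


# Hypothetical pattern for one sentence shape only:
# "<Name>, was [adverb ending in -ly] <verb> by <Name>."
PASSIVE = re.compile(
    r"(?P<arg1>[A-Z]\w*),? was (?:\w+ly )?(?P<verb>\w+) by (?P<arg2>[A-Z][\w ]*)\."
)


def extract(sentence: str) -> Optional[Triple]:
    """Return a triple if the sentence matches the toy pattern, else None."""
    m = PASSIVE.search(sentence)
    if m is None:
        return None
    # The optional adverb ("originally") is matched but not kept.
    return Triple(m["arg1"], f"{m['verb']} by", m["arg2"])


triple = extract(
    "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."
)
```

The appositive "Internet powerhouse" and the adverb are skipped by the pattern, which mirrors the idea of keeping only the essential parts of each chunk.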
22
Numerous Extraction Challenges
Drop non-essential info: “was originally founded by” → “founded by”
Retain key distinctions: Ebay founded by Pierre ≠ Ebay founded Pierre
Non-verb relationships: “George Bush, president of the U.S…”
Synonymy & aliasing: Albert Einstein = Einstein ≠ Einstein Bros.
23
TextRunner (Web’s 1st Open IE system)
1. Self-Supervised Learner: automatically labels example extractions & learns an extractor
2. Single-Pass Extractor: single pass over corpus, identifying extractions in each sentence
3. Query Processor: indexes extractions; enables queries at interactive speeds
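The query-processor idea (index the extractions once, then answer lookups without rescanning the corpus) can be sketched with a small in-memory index. This is an illustration under assumptions, not TextRunner's actual index: the `TripleIndex` class and its methods are hypothetical, and matching is exact and case-insensitive only.

```python
from collections import defaultdict


class TripleIndex:
    """Toy inverted index over (arg1, relation, arg2) triples."""

    def __init__(self):
        self._triples = []
        # (field position, lowercased value) -> ids of triples with that value
        self._by_field = defaultdict(set)

    def add(self, arg1, relation, arg2):
        triple_id = len(self._triples)
        self._triples.append((arg1, relation, arg2))
        for pos, value in enumerate((arg1, relation, arg2)):
            self._by_field[(pos, value.lower())].add(triple_id)

    def query(self, arg1=None, relation=None, arg2=None):
        """Return triples matching every field that is not None."""
        ids = None
        for pos, value in enumerate((arg1, relation, arg2)):
            if value is None:
                continue
            matches = self._by_field.get((pos, value.lower()), set())
            ids = matches if ids is None else ids & matches
        if ids is None:  # no constraints given: return everything
            ids = range(len(self._triples))
        return [self._triples[i] for i in sorted(ids)]


index = TripleIndex()
index.add("Oppenheimer", "taught at", "Berkeley")
index.add("fruit", "contain", "vitamins")
index.add("EBay", "founded by", "Pierre Omidyar")
```

Each lookup touches only the posting sets for the constrained fields, which is what keeps queries fast as the number of stored triples grows.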
TextRunner Demo
25
26
27
Sample of 9 million Web Pages
Triples: 11.3 million
With Well-Formed Relation: 9.3 million
With Well-Formed Entities: 7.8 million
  Abstract: 6.8 million (79.2% correct)
  Concrete: 1.0 million (88.1% correct)
Concrete facts: (Oppenheimer, taught at, Berkeley)
Abstract facts: (fruit, contain, vitamins)
28
3. Tractable Inference
Much of textual information is implicit
I. Entity and predicate resolution
II. Probability of correctness
III. Composing facts to draw conclusions
29
I. Entity Resolution
Resolver (Yates & Etzioni, HLT ’07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin ‘01)
(X, born in, 1941)     (M, born in, 1941)
(X, citizen of, US)    (M, citizen of, US)
(X, friend of, Joe)    (M, friend of, Mary)
P(X = M) ~ shared relations
30
Relation Synonymy
(1, R, 2) (2, R, 4) (4, R, 8) etc.
(1, R’, 2) (2, R’, 4) (4, R’, 8) etc.
P(R = R’) ~ shared argument pairs
• Unsupervised probabilistic model
• O(N log N) algorithm run on millions of docs
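Both resolution slides rest on the same intuition: two strings are more likely to be synonyms the more extraction contexts they share. The sketch below uses Jaccard overlap as a simple stand-in for that intuition. It is not Resolver's probabilistic model, and the helper names (`jaccard`, `entity_contexts`, `relation_contexts`) are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Fraction of distinct contexts that two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def entity_contexts(triples):
    """Map each Arg1 to the set of (relation, Arg2) pairs it appears with."""
    contexts = defaultdict(set)
    for arg1, relation, arg2 in triples:
        contexts[arg1].add((relation, arg2))
    return contexts


def relation_contexts(triples):
    """Map each relation to the set of (Arg1, Arg2) pairs it connects."""
    contexts = defaultdict(set)
    for arg1, relation, arg2 in triples:
        contexts[relation].add((arg1, arg2))
    return contexts


# The examples from the two slides above.
entity_triples = [
    ("X", "born in", "1941"), ("M", "born in", "1941"),
    ("X", "citizen of", "US"), ("M", "citizen of", "US"),
    ("X", "friend of", "Joe"), ("M", "friend of", "Mary"),
]
relation_triples = [
    (1, "R", 2), (2, "R", 4), (4, "R", 8),
    (1, "R'", 2), (2, "R'", 4), (4, "R'", 8),
]

entities = entity_contexts(entity_triples)
relations = relation_contexts(relation_triples)
entity_score = jaccard(entities["X"], entities["M"])        # 2 shared of 4 distinct
relation_score = jaccard(relations["R"], relations["R'"])   # identical pair sets
```

A raw overlap score like this ignores how informative each shared context is; a probabilistic model can weight rare shared contexts more heavily than common ones.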
31
II. Probability of Correctness
How likely is an extraction to be correct?
Factors to consider include:
  Authoritativeness of source
  Confidence in extraction method
  Number of independent extractions
32
Counting Extractions
Lexico-syntactic patterns (Hearst ’92): “…cities such as Seattle, Boston, and…”
Turney’s PMI-IR, ACL ’02: PMI ~ co-occurrence frequency ~ # results
# results → confidence in class membership.
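One simple way to turn result counts into a score is to divide the number of results for the instance together with a class pattern by the number of results for the instance alone. The sketch below assumes that ratio form; the function name `pmi_score` and all hit counts are hypothetical, and no real search API is called.

```python
def pmi_score(hits_instance_with_pattern: int, hits_instance: int) -> float:
    """Ratio of co-occurrence hits to hits for the instance alone."""
    if hits_instance <= 0:
        raise ValueError("instance must have at least one hit")
    return hits_instance_with_pattern / hits_instance


# Hypothetical hit counts for the pattern "cities such as <X>".
seattle = pmi_score(hits_instance_with_pattern=250, hits_instance=1000)
not_a_city = pmi_score(hits_instance_with_pattern=1, hits_instance=1000)
```

A higher ratio means the instance appears with the class pattern more often than its overall frequency would suggest, which is read as evidence of class membership.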
33
Formal Problem Statement
If an extraction x appears k times in a set of n distinct sentences each suggesting that x belongs to C, what is the probability that x ∈ C?
C is a class (“cities”) or a relation (“mayor of”)
Note: we only count distinct sentences!
34
Combinatorial Model (“Urns”)
Odds increase exponentially with k, but decrease exponentially with n
See Downey et al.’s IJCAI ’05 paper for formal details.
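The exponential behaviour in k and n can be illustrated with a deliberately simplified two-rate model: assume a correct extraction shows up in any given sentence with probability `p_correct` and an erroneous one with a smaller probability `p_error`. This is an assumption made for illustration, not the full urn model from the paper, and the parameter values below are hypothetical.

```python
def odds_correct(k, n, p_correct, p_error, prior_odds=1.0):
    """Posterior odds that x is in C after seeing x in k of n sentences.

    Uses a binomial likelihood ratio: each appearance multiplies the odds by
    p_correct / p_error, and each sentence without x multiplies them by
    (1 - p_correct) / (1 - p_error), which is below 1 when p_correct > p_error.
    """
    appear = (p_correct / p_error) ** k
    absent = ((1 - p_correct) / (1 - p_error)) ** (n - k)
    return prior_odds * appear * absent


def probability_correct(k, n, p_correct, p_error, prior_odds=1.0):
    odds = odds_correct(k, n, p_correct, p_error, prior_odds)
    return odds / (1 + odds)


# Hypothetical rates: correct extractions repeat ten times as often as errors.
P_CORRECT, P_ERROR = 0.02, 0.002
seen_twice = odds_correct(2, 100, P_CORRECT, P_ERROR)
seen_three_times = odds_correct(3, 100, P_CORRECT, P_ERROR)
larger_sample = odds_correct(3, 200, P_CORRECT, P_ERROR)
```

Under these assumptions each extra appearance multiplies the odds by a constant factor, while each extra sentence in which x fails to appear shrinks them by a constant factor.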
35
[Figure: bar chart of deviation from ideal log likelihood (y-axis, 0 to 5) for the urns, noisy-or, and pmi methods on City, Film, Country, and MayorOf]
Performance (15x Improvement)
Self supervised, domain independent method
36
URNS limited on “sparse” facts
[Figure: number of times an extraction appears in pattern context (y-axis, 0 to 500) against frequency rank of extraction (x-axis, 0 to 100,000)]
High-frequency extractions: tend to be correct, e.g., (Michael Bloomberg, New York City)
Low-frequency (“sparse”) extractions: a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)
37
Language Models to the Rescue (Downey, Schoenmackers, Etzioni, ACL ’07)
Instead of only lexico-syntactic patterns, leverage all contexts of a particular entity
Statistical ‘type check’:
  does Pickerington “behave” like a city?
  does Shaver “behave” like a mayor?
Language model = HMM (built once per corpus)
  Project string to point in 20-dimensional space
  Measure proximity of Pickerington to Seattle, Boston, etc.
38
III. Compositional Inference (work in progress, Schoenmackers, Etzioni, Weld)
Implicit information (2+2=4)
TextRunner: (Turing, born in, London)
WordNet: (London, part of, England)
Rule: ‘born in’ is transitive through ‘part of’
Conclusion: (Turing, born in, England)
Mechanism: MLN instantiated on the fly
Rules: learned from corpus (future work)
Inference Demo
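The Turing example can be reproduced with a few lines of deterministic forward chaining. This sketch only illustrates how the single rule composes two facts; it is not an MLN and carries no probabilities, and the function name `infer_born_in` is hypothetical.

```python
def infer_born_in(facts):
    """Apply the rule: (x, born in, a) and (a, part of, b) => (x, born in, b).

    Returns only the newly derived triples, repeating until nothing new
    can be added so that chains of 'part of' facts are followed.
    """
    part_of = {(a, b) for a, rel, b in facts if rel == "part of"}
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for x, rel, place in list(known):
            if rel != "born in":
                continue
            for inner, outer in part_of:
                new_fact = (x, "born in", outer)
                if inner == place and new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known - set(facts)


facts = {
    ("Turing", "born in", "London"),   # from TextRunner on the slide
    ("London", "part of", "England"),  # from WordNet on the slide
}
derived = infer_born_in(facts)
```

A probabilistic mechanism is needed in practice because both the extracted facts and the rules can be wrong; this sketch treats them all as certain.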
39
KnowItAll Family Tree
Mulder ’01, WebKB ’99, PMI-IR ’01
KnowItAll ’04
UrnsBE ’05
KnowItNow ’05
TextRunner ’07
Opine ’05
Woodward ’06
Resolver ’07
REALM ’07
Inference ’08
40
KnowItAll Team
Michele Banko, Michael Cafarella, Doug Downey, Alan Ritter, Dr. Stephen Soderland, Stefan Schoenmackers, Prof. Dan Weld, Mausam
Alumni: Dr. Ana-Maria Popescu, Dr. Alex Yates, and others.
41
Related Work
Sekine’s “preemptive IE”
Powerset
Textual Entailment
AAAI ’07 Symposium on “Machine Reading”
Growing body of work on IE from the Web
42
4. Conclusions
Imagine search systems that operate over a (more) semantic space
Key words, documents → extractions
TF-IDF, pagerank → relational models
Web pages, hyper links → entities, relns
Reading the Web → new Search Paradigm
43
44
Machine Reading = Unsupervised understanding of text
Much is implicit → tractable inference is key!
45
HMM in more detail
Training: seek to maximize probability of corpus w given latent states t using EM:
t_i      t_i+1    t_i+2    t_i+3    t_i+4      (latent states)
w_i      w_i+1    w_i+2    w_i+3    w_i+4      (observed words)
cities   such     as       Los      Angeles
$w_i \in \text{words}$, $t_i \in \{1, \ldots, N\}$, $k = 1$
46
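Using the HMM at Query Time
Given a set of extractions (Arg1, Rln, Arg2)
Seeds = most frequent Args for Rln
$$f(\mathrm{arg}, \mathrm{seeds}) = \frac{1}{|\mathrm{seeds}|} \sum_{i} \mathrm{KL}\big(P(t \mid \mathrm{seed}_i) \,\|\, P(t \mid \mathrm{arg})\big)$$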
1. Distribution over t is read from the HMM
2. Compute KL divergence via f(arg, seeds)
3. For each extraction, average f over Arg1 & Arg2
4. Sort “sparse” extractions in ascending order
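The four steps above can be sketched as follows. The latent-state distributions here are made-up three-state vectors standing in for what would be read from the trained HMM, and the names `kl_divergence`, `f`, `rank_sparse`, and `state_dist` are hypothetical; the direction of the KL divergence (seed first, argument second) is also an assumption.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same latent states."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def f(arg, seeds, state_dist):
    """Average KL divergence from each seed's state distribution to arg's."""
    total = sum(kl_divergence(state_dist[s], state_dist[arg]) for s in seeds)
    return total / len(seeds)


def rank_sparse(extractions, seeds1, seeds2, state_dist):
    """Sort (arg1, arg2) pairs in ascending order of f averaged over both args."""
    def score(extraction):
        arg1, arg2 = extraction
        return (f(arg1, seeds1, state_dist) + f(arg2, seeds2, state_dist)) / 2
    return sorted(extractions, key=score)


# Hypothetical P(t | string) vectors; a real system would read them from the HMM.
state_dist = {
    "Seattle": [0.80, 0.10, 0.10],
    "Boston": [0.70, 0.20, 0.10],
    "Pickerington": [0.75, 0.15, 0.10],
    "Microsoft": [0.10, 0.80, 0.10],
    "Bloomberg": [0.10, 0.10, 0.80],
    "Shaver": [0.15, 0.10, 0.75],
    "Spice Girls": [0.20, 0.70, 0.10],
}
mayor_seeds = ["Bloomberg"]          # frequent Arg1 values for "mayor of"
city_seeds = ["Seattle", "Boston"]   # frequent Arg2 values for "mayor of"
ranked = rank_sparse(
    [("Spice Girls", "Microsoft"), ("Shaver", "Pickerington")],
    mayor_seeds, city_seeds, state_dist,
)
```

Lower scores mean an argument's contexts look more like the seeds' contexts, so the extractions most likely to be type-correct come first.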
47
Language Modeling & Open IE
Self supervised
Illuminating phrases → full context
Handles sparse extractions
48
Focus: Open IE on Web Text
Advantages                             Challenges
“Semantically tractable” sentences     Difficult, ungrammatical sentences
Redundancy                             Unreliable information
Search engines                         Heterogeneous corpus
49
II. Probability of Correctness
How likely is an extraction to be correct?
Distributional Hypothesis: “words that occur in the same contexts tend to have similar meanings”
KnowItAll Hypothesis: extractions that occur in the same informative contexts more frequently are more likely to be correct.
50
Argument “Type checking” via HMM
Relation’s arguments are “typed”: (Person, Mayor Of, City)
Training: Model distribution of Person & City contexts in corpus (Distributional Hypothesis)
Query time: Rank sparse triples by how well each argument’s context distribution matches that of its type
51
Silly Example
Rank (Shaver, Mayor of, Pickerington) over (Spice Girls, Mayor of, Microsoft)
Because:
  Shaver’s contexts are more like “other mayors” than Spice Girls’, and
  Pickerington’s contexts are more like “other cities” than Microsoft’s
52
Utilizing HMMs to Check Types
Challenges:
  Argument types are not known
  Can’t build model for each argument type
  “Textual types” are fuzzy
Solution: Train an HMM for the corpus using EM & bootstrap
REALM improves precision by 90%
53
[Figure: MLN inference flow diagram]
Inputs:
  Knowledge Bases: TextRunner, WordNet
  Query Formula: BornIn(Turing, England)?
  Inference Rules: BornIn(X, city) -> BornIn(X, country)
Boxes: Find Best Query, Run Query, Find Implied Nodes & Cliques, MLN, Results
Edge labels: Best KB + Query, Query Results, New Nodes + Cliques
Worked example, query: Was Turing born in England?
  WordNet: X is in England → London is in England → In(London, England)
  TextRunner: Turing born in X → Turing was born in London → BornIn(Turing, London)
  MLN: In(London, England), BornIn(Turing, London), BornIn(Turing, England)
  Result: Yes! Turing was born in England!