Machine Reading of Web Text

Oren Etzioni
Turing Center, University of Washington
http://turing.cs.washington.edu


2

Rorschach Test

3

Rorschach Test for CS

4

Moore’s Law?

5

Storage Capacity?

6

Number of Web Pages?

7

Number of Facebook Users?

8

9

Turing Center Foci

Scale MT to 49,000,000 language pairs: a 2,500,000-word translation graph, P(V F C)?, PanImages

Accumulate knowledge from the Web

A new paradigm for Web Search

10

Outline

1. A New Paradigm for Search
2. Open Information Extraction
3. Tractable Inference
4. Conclusions

11

Web Search in 2020?

Type keywords into a search box?
Social or "human powered" search?
The Semantic Web?
What about our technology exponentials?

“The best way to predict the future is to invent it!”

12

Intelligent Search

Instead of merely retrieving Web pages, read ‘em!

Machine Reading = Information Extraction (IE) + tractable inference

IE(sentence) = who did what? e.g., speaker(Alon Halevy, UW)

Inference = uncover implicit information: Will Alon visit Seattle?

13

Application: Information Fusion

What kills bacteria?
What west coast nanotechnology companies are hiring?
Compare Obama's "buzz" versus Hillary's?
What is a quiet, inexpensive, 4-star hotel in Vancouver?

14

Opine (Popescu & Etzioni, EMNLP ’05)

IE(product reviews): informative; abundant, but varied; textual

Summarize reviews without any prior knowledge of the product category

Opinion Mining

15

16

17

But "Reading" the Web is Tough

Traditional IE is narrow: it has been applied to small, homogeneous corpora
On Web text, no parser achieves high accuracy, named-entity taggers do not suffice, and supervised learning does not scale

How about semi-supervised learning?

18

Semi-Supervised Learning

Few hand-labeled examples
Limit on the number of concepts
Concepts are pre-specified
Problematic for the Web

Alternative: self-supervised learning
Learner discovers concepts on the fly
Learner automatically labels examples per concept!

19

2. Open IE = Self-Supervised IE (Banko, Cafarella, Soderland, et al., IJCAI '07)

                Traditional IE                Open IE
Input:          Corpus + hand-labeled data    Corpus
Relations:      Specified in advance          Discovered automatically
Complexity:     O(D * R), R relations         O(D), D documents
Text analysis:  Parser + named-entity tagger  NP chunker

20

Extractor Overview (Banko & Etzioni, ’08)

1. Use a simple model of relationships in English to label extractions

2. Bootstrap a general model of relationships in English sentences, encoded as a CRF

3. Decompose each sentence into one or more (NP1, VP, NP2) “chunks”

4. Use CRF model to retain relevant parts of each NP and VP.

The extractor is relation-independent!

21

TextRunner Extraction

Extract a triple (Arg1, Relation, Arg2) representing a binary relation from each sentence.

Internet powerhouse, EBay, was originally founded by Pierre Omidyar.

(Ebay, Founded by, Pierre Omidyar)
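A minimal sketch of the step just illustrated, assuming the (NP1, VP, NP2) chunking from the extractor overview is already given: trim each chunk to its essential words to form the triple. The stop lists below are hand-written stand-ins for TextRunner's learned CRF and exist only for illustration.

    # Sketch: form (Arg1, Relation, Arg2) from an (NP1, VP, NP2) chunking.
    # The real system makes the keep/drop decision with a CRF learned by
    # self-supervision; these stop lists are hand-written stand-ins.
    NON_ESSENTIAL_VP = {"was", "were", "is", "are", "originally", "also"}
    NON_ESSENTIAL_NP = {"internet", "powerhouse"}  # appositive noise (assumed)

    def to_triple(np1: str, vp: str, np2: str):
        """Trim each chunk to its essential words -> (Arg1, Relation, Arg2)."""
        def trim(text, stop):
            return " ".join(w for w in text.replace(",", " ").split()
                            if w.lower() not in stop)
        return (trim(np1, NON_ESSENTIAL_NP),
                trim(vp, NON_ESSENTIAL_VP),
                trim(np2, set()))

    print(to_triple("Internet powerhouse, EBay",
                    "was originally founded by",
                    "Pierre Omidyar"))
    # -> ('EBay', 'founded by', 'Pierre Omidyar')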

22

Numerous Extraction Challenges

Drop non-essential info: "was originally founded by" → founded by
Retain key distinctions: (EBay, founded by, Pierre) ≠ (EBay, founded, Pierre)
Non-verb relationships: "George Bush, president of the U.S…"
Synonymy & aliasing: Albert Einstein = Einstein ≠ Einstein Bros.

23

TextRunner (the Web's 1st Open IE system)

1. Self-Supervised Learner: automatically labels example extractions & learns an extractor
2. Single-Pass Extractor: makes a single pass over the corpus, identifying extractions in each sentence
3. Query Processor: indexes extractions, enabling queries at interactive speeds

TextRunner Demo

25

26

27

Sample of 9 million Web pages:

Triples:                     11.3 million
With well-formed relation:    9.3 million
With well-formed entities:    7.8 million
  Abstract:                   6.8 million (79.2% correct)
  Concrete:                   1.0 million (88.1% correct)

Concrete facts: (Oppenheimer, taught at, Berkeley)

Abstract facts: (fruit, contain, vitamins)

28

3. Tractable Inference

Much of textual information is implicit

I. Entity and predicate resolution
II. Probability of correctness
III. Composing facts to draw conclusions

29

I. Entity Resolution

Resolver (Yates & Etzioni, HLT ’07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin ‘01)

(X, born in, 1941)     (M, born in, 1941)
(X, citizen of, US)    (M, citizen of, US)
(X, friend of, Joe)    (M, friend of, Mary)

P(X = M) ~ shared relations

30

Relation Synonymy

(1, R, 2)     (1, R', 2)
(2, R, 4)     (2, R', 4)
(4, R, 8)     (4, R', 8)
etc.          etc.

P(R = R’) ~ shared argument pairs

Unsupervised probabilistic model
O(N log N) algorithm, run on millions of docs
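A sketch of the signal Resolver builds on, using the toy triples from the two slides above. Bare counting only: the real system wraps these counts in its unsupervised probabilistic model and an O(N log N) candidate-generation pass, so everything here is an illustrative simplification.

    # Entity synonymy: strings sharing many relational "properties" are
    # candidate synonyms. Relation synonymy: relations sharing many
    # argument pairs are candidate synonyms.
    triples = [
        ("X", "born in", "1941"), ("M", "born in", "1941"),
        ("X", "citizen of", "US"), ("M", "citizen of", "US"),
        ("X", "friend of", "Joe"), ("M", "friend of", "Mary"),
    ]

    def properties(entity):
        """The (relation, other-argument) pairs an entity occurs with."""
        return ({(rel, a2) for a1, rel, a2 in triples if a1 == entity} |
                {(a1, rel) for a1, rel, a2 in triples if a2 == entity})

    def arg_pairs(relation):
        """The (Arg1, Arg2) pairs a relation occurs with."""
        return {(a1, a2) for a1, rel, a2 in triples if rel == relation}

    # Evidence that X = M grows with the number of shared properties.
    print(len(properties("X") & properties("M")))  # 2 shared relations
    # Evidence that R = R' would be len(arg_pairs(R) & arg_pairs(R')).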

31

II. Probability of Correctness

How likely is an extraction to be correct? Factors to consider include:

Authoritativeness of the source
Confidence in the extraction method
Number of independent extractions

32

Counting Extractions

Lexico-syntactic patterns (Hearst '92): "…cities such as Seattle, Boston, and…"

Turney's PMI-IR (ACL '02): PMI ~ co-occurrence frequency, estimated as a ratio of search-engine result counts; higher PMI means greater confidence in class membership.
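For reference, the PMI score in this line of work is a ratio of hit counts; for an instance I and a discriminator phrase D (notation assumed, following the KnowItAll formulation):

    \mathrm{PMI}(I, D) = \frac{|\mathrm{Hits}(D + I)|}{|\mathrm{Hits}(I)|}

For example, with D = "cities such as" and I = "Seattle": the more often the combined query matches relative to the instance alone, the stronger the evidence that Seattle is a city.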

33

Formal Problem Statement

If an extraction x appears k times in a set of n distinct sentences, each suggesting that x belongs to C, what is the probability that x ∈ C?

C is a class ("cities") or a relation ("mayor of")

Note: we only count distinct sentences!

34

Combinatorial Model (“Urns”)

Odds increase exponentially with k, but decrease exponentially with n

See Downey et al.’s IJCAI ’05 paper for formal details.
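To see where those exponentials come from, here is a deliberately simplified single-urn instance (my simplification, with uniform label frequencies; the paper's full model is more general). Suppose each of |C| correct labels appears in any given sentence with probability p_c, each of |E| error labels with probability p_e < p_c, and the priors are uniform. A binomial likelihood ratio then gives the posterior odds:

    \frac{P(x \in C \mid k, n)}{P(x \notin C \mid k, n)}
      = \frac{|C|}{|E|}
        \left(\frac{p_c}{p_e}\right)^{k}
        \left(\frac{1 - p_c}{1 - p_e}\right)^{n - k}

Each of the k appearances multiplies the odds by p_c / p_e > 1, and each of the n - k misses multiplies them by (1 - p_c) / (1 - p_e) < 1, which is exactly the behavior stated above.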

35

Performance (15x Improvement)

[Figure: deviation from ideal log likelihood (y-axis, 0 to 5) for the City, Film, Country, and MayorOf classes, comparing the urns, noisy-or, and PMI models.]

Self-supervised, domain-independent method

36

URNS is limited on "sparse" facts

[Figure: number of times an extraction appears in a pattern (y-axis, 0 to 500) vs. frequency rank of the extraction (x-axis, 0 to 100,000): a long tail of sparse extractions.]

Sparse extractions are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)

High-frequency extractions tend to be correct, e.g., (Michael Bloomberg, New York City)

37

Language Models to the Rescue (Downey, Schoenmackers, Etzioni, ACL '07)

Instead of only lexico-syntactic patterns, leverage all contexts of a particular entity

Statistical 'type check': does Pickerington "behave" like a city? Does Shaver "behave" like a mayor?

Language model = HMM (built once per corpus)
Project each string to a point in a 20-dimensional space
Measure proximity of Pickerington to Seattle, Boston, etc.

38

III. Compositional Inference (work in progress; Schoenmackers, Etzioni, Weld)

Much information is implicit (2 + 2 = 4):

TextRunner: (Turing, born in, London)
WordNet: (London, part of, England)
Rule: 'born in' is transitive through 'part of'
Conclusion: (Turing, born in, England)

Mechanism: MLN instantiated on the fly
Rules: learned from corpus (future work)

Inference Demo

39

KnowItAll Family Tree

WebKB '99, Mulder '01, and PMI-IR '01 led to KnowItAll '04, which spawned Opine '05, UrnsBE '05, KnowItNow '05, Woodward '06, TextRunner '07, Resolver '07, REALM '07, and Inference '08.

40

KnowItAll Team

Michele Banko, Michael Cafarella, Doug Downey, Alan Ritter, Dr. Stephen Soderland, Stefan Schoenmackers, Prof. Dan Weld, Mausam

Alumni: Dr. Ana-Maria Popescu, Dr. Alex Yates, and others.

41

Related Work

Sekine's "preemptive IE"
Powerset
Textual entailment
AAAI '07 Symposium on "Machine Reading"
Growing body of work on IE from the Web

42

4. Conclusions

Imagine search systems that operate over a (more) semantic space:

Keywords, documents → extractions
TF-IDF, PageRank → relational models
Web pages, hyperlinks → entities, relations

Reading the Web → a new search paradigm

43

44

Machine Reading = unsupervised understanding of text

Much is implicit → tractable inference is key!

45

HMM in More Detail

Training: seek to maximize the probability of the corpus w given latent states t, using EM, where each word w_i is emitted by a latent state t_i ∈ {1, …, N}

[HMM trellis: latent states t_i, …, t_{i+4} emitting the words w_i, …, w_{i+4}, e.g., "cities such as Los Angeles"]

46

Using the HMM at Query Time

Given a set of extractions (Arg1, Rln, Arg2), take as seeds the most frequent Args for Rln

    f(arg, seeds) = (1 / |seeds|) Σ_i KL( P(t | seed_i) || P(t | arg) )

1. The distribution over t is read from the HMM
2. Compute the KL divergence via f(arg, seeds)
3. For each extraction, average f over Arg1 & Arg2
4. Sort "sparse" extractions in ascending order
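A sketch of this ranking step, assuming the HMM's state distributions P(t | word) are already given (in the real system they come from the HMM trained on the corpus with EM; the distributions below are made up for illustration).

    import math

    def kl(p, q, eps=1e-9):
        """KL divergence between two distributions over latent states."""
        return sum(pi * math.log((pi + eps) / (qi + eps))
                   for pi, qi in zip(p, q))

    def f(arg, seeds, state_dist):
        """Average KL from each seed's state distribution to arg's."""
        return sum(kl(state_dist[s], state_dist[arg]) for s in seeds) / len(seeds)

    # Toy state distributions over N = 3 latent states (assumed values).
    state_dist = {
        "Seattle":      [0.7, 0.2, 0.1],
        "Boston":       [0.6, 0.3, 0.1],
        "Pickerington": [0.5, 0.3, 0.2],   # behaves like a city
        "Microsoft":    [0.1, 0.2, 0.7],   # does not
    }

    seeds = ["Seattle", "Boston"]  # most frequent arguments of the relation
    for cand in ["Pickerington", "Microsoft"]:
        print(cand, round(f(cand, seeds, state_dist), 3))
    # Lower scores rank higher: Pickerington type-checks as a city.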

47

Language Modeling & Open IE

Self-supervised
From illuminating phrases to full context
Handles sparse extractions

48

Focus: Open IE on Web Text

Advantages:                           Challenges:
"Semantically tractable" sentences    Difficult, ungrammatical sentences
Redundancy                            Unreliable information
Search engines                        Heterogeneous corpus

49

II. Probability of Correctness

How likely is an extraction to be correct?

Distributional Hypothesis: "words that occur in the same contexts tend to have similar meanings"

KnowItAll Hypothesis: extractions that occur in the same informative contexts more frequently are more likely to be correct.

50

Argument "Type Checking" via HMM

A relation's arguments are "typed": (Person, Mayor Of, City)

Training: model the distribution of Person & City contexts in the corpus (Distributional Hypothesis)

Query time: rank sparse triples by how well each argument's context distribution matches that of its type

51

Silly Example

Prefer (Shaver, Mayor of, Pickerington) over (Spice Girls, Mayor of, Microsoft)

Because Shaver's contexts are more like "other mayors'" than the Spice Girls', and Pickerington's contexts are more like "other cities'" than Microsoft's

52

Utilizing HMMs to Check Types

Challenges: argument types are not known; we can't build a model for each argument type; "textual types" are fuzzy

Solution: train an HMM for the corpus using EM & bootstrapping

REALM improves precision by 90%

53

[Architecture diagram: the MLN, knowledge bases, and a query formula feed a loop of "find best query" → "run query" → "find implied nodes & cliques", passing along the best KB + query, the query results, and new nodes + cliques.]

Query: Was Turing born in England?  BornIn(Turing, England)?

TextRunner: "Turing was born in London" → BornIn(Turing, London)
WordNet: "London is in England" → In(London, England)
Inference rule: BornIn(X, city) ∧ In(city, country) → BornIn(X, country)

BornIn(Turing, London) ∧ In(London, England) → BornIn(Turing, England)

Yes! Turing was born in England!
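A minimal forward-chaining sketch of the composition step above. The actual system instantiates a Markov Logic Network on the fly and reasons probabilistically; the hard rule here is a simplification for illustration.

    # Facts from the two sources in the walkthrough.
    facts = {("BornIn", "Turing", "London"),   # from TextRunner
             ("In", "London", "England")}      # from WordNet

    def apply_transitivity(facts):
        """BornIn(x, place) & In(place, region) => BornIn(x, region)."""
        derived = {("BornIn", x, region)
                   for rel, x, place in facts if rel == "BornIn"
                   for rel2, place2, region in facts
                   if rel2 == "In" and place2 == place}
        return facts | derived

    facts = apply_transitivity(facts)
    print(("BornIn", "Turing", "England") in facts)  # True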
