CS 4705 Robust Semantics, Information Extraction, and Information Retrieval

Page 1

CS 4705

Robust Semantics, Information Extraction, and Information Retrieval

Page 2

Problems with Syntax-Driven Semantics

• Syntactic structures often don’t fit semantic structures very well
– Important semantic elements often distributed very differently in trees for sentences that mean ‘the same’

I like soup. Soup is what I like.

– Parse trees contain many structural elements not clearly important to making semantic distinctions

– Syntax-driven semantic representations are sometimes pretty verbose

V --> serves {∃ e,x,y Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, x)}

Page 3

Semantic Grammars

• Alternative to modifying syntactic grammars to deal with semantics too

• Define grammars specifically in terms of the semantic information we want to extract (see the sketch below)
– Domain specific: rules correspond directly to entities and activities in the domain
I want to go from Boston to Baltimore on Thursday, September 24th
– Greeting --> {Hello | Hi | Um…}
– TripRequest --> Need-spec travel-verb from City to City on Date
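Below is a minimal sketch of how a rule like this might be operationalized as a pattern matcher. The regular expression, slot names, and word lists are illustrative assumptions (single-word city names, a handful of need-spec and travel-verb realizations), not the formalism of any particular system.

```python
import re

# Hypothetical regex rendering of the rule:
#   TripRequest --> Need-spec travel-verb from City to City on Date
TRIP_REQUEST = re.compile(
    r"i (?:want|need|would like) to "     # Need-spec
    r"(?:go|travel|fly) "                 # travel-verb
    r"from (?P<from_city>[A-Z]\w+) "      # City
    r"to (?P<to_city>[A-Z]\w+) "          # City
    r"on (?P<date>.+)",                   # Date
    re.IGNORECASE,
)

def parse_trip_request(utterance):
    """Return the filled slots, or None if the utterance falls outside the grammar."""
    match = TRIP_REQUEST.search(utterance)
    return match.groupdict() if match else None

print(parse_trip_request(
    "I want to go from Boston to Baltimore on Thursday, September 24th"))
# {'from_city': 'Boston', 'to_city': 'Baltimore', 'date': 'Thursday, September 24th'}
```

Note that even this toy version breaks on multi-word city names or any phrasing outside the rule, which is exactly the brittleness discussed in the drawbacks slide.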

Page 4

Predicting User Input

• Rely on knowledge of task and (sometimes) constraints on what the user can do
– Can handle very sophisticated phenomena

I want to go to Boston on Thursday.

I want to leave from there on Friday for Baltimore.

TripRequest --> Need-spec travel-verb from City on Date for City

Dialogue postulate maps filler for ‘from-city’ to pre-specified to-city

Page 5

Priming User Input

• Users will tend to use the vocabulary they hear from the system: lexical entrainment (Clark & Brennan ’96)
– Reference to objects: the scary M&M man
– Re-use of system prompt vocabulary/syntax:

Please tell me where you would like to leave/depart from.

Where would you like to leave/depart from?

• Explicit training vs. implicit training
• Training the user vs. retraining the system

Page 6

Drawbacks of Semantic Grammars

• Lack of generality
– A new one for each application

– Large cost in development time

• Can be very large, depending on how much coverage you want them to have

• If users go outside the grammar, things may break disastrously
I want to leave from my house at 10 a.m.

I want to talk to a person.

Page 7

Information Retrieval

• How related to NLP?
– Operates on language (speech or text)
– Does it use linguistic information?
• Stemming
• Bag-of-words approach
• Very simple analyses
– Does it make use of document formatting?
• Headlines, punctuation, captions

• Collection: a set of documents
• Term: a word or phrase
• Query: a set of terms

Page 8

But…what is a term?

• Stop list
• Stemming
• Homonymy, polysemy, synonymy

Page 9

Vector Space Model

• Simple versions represent documents and queries as feature vectors, one binary feature for each term in collection

• Is a term t in this document or in this query or not?
D = (t1, t2, …, tn)
Q = (t1, t2, …, tn)
• Similarity metric: how many terms does a query share with each candidate document? (sketched below)

• Weighted terms: term-by-document matrix
D = (w_t1, w_t2, …, w_tn)
Q = (w_t1, w_t2, …, w_tn)
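A small sketch of the simple binary version described above, assuming whitespace tokenization over a fixed collection vocabulary; similarity is just the count of shared terms.

```python
def binary_vector(text, vocabulary):
    """One binary feature per term in the collection vocabulary."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

def overlap(query_vec, doc_vec):
    """How many terms does the query share with the document?"""
    return sum(q & d for q, d in zip(query_vec, doc_vec))

vocabulary = sorted({"i", "like", "soup", "salad", "is", "what"})
doc = binary_vector("Soup is what I like", vocabulary)
query = binary_vector("i like soup", vocabulary)
print(overlap(query, doc))  # 3 shared terms
```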

Page 10

• How do we compare the vectors?
– Normalize each term weight by the number of terms in the document: how important is each t in D?

– Compute dot product between vectors to see how similar they are

– Cosine of angle: 1 = identity; 0 = no common terms

• How do we get the weights?
– Term frequency (tf): how often does term i occur in doc j?
– Inverse document frequency (idf): # docs / # docs term i occurs in
– tf·idf weighting: the weight of term i for doc j is the product of the frequency of i in j and the (log-scaled) idf of i in the collection

idf_i = log(N / n_i)

w_{i,j} = tf_{i,j} × idf_i
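A compact sketch that puts the pieces together, using raw counts for tf and log(N/n_i) for idf as on the slide; real systems differ in smoothing and normalization, so treat the details as assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight w_{i,j} = tf_{i,j} * idf_i, with idf_i = log(N / n_i)."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(t for toks in tokenized for t in toks))
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    idf = {t: math.log(N / df[t]) for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """1 = identical direction, 0 = no common (weighted) terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["soup is what I like", "I like salad", "stocks fell sharply today"]
vocab, vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 2))   # > 0: the docs share 'i' and 'like'
print(round(cosine(vecs[0], vecs[2]), 2))   # 0.0: no common terms
```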

Page 11

Evaluating IR Performance

• Precision: #relevant docs returned/total #docs returned -- how often are you right when you say this document is relevant?

• Recall: #relevant docs returned/#relevant docs in collection -- how many of the relevant documents do you find?

• F-measure combines P and R
• Are P and R equally important?

F = 2PR / (P + R)
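A tiny sketch of the three measures over sets of document ids; the F shown is the balanced version that weights P and R equally (a β-weighted variant is the answer when they are not equally important).

```python
def precision_recall_f(returned, relevant):
    """returned, relevant: sets of document ids."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

print(precision_recall_f({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"}))
# (0.5, 0.666..., 0.571...)
```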

Page 12

Improving Queries

• Relevance feedback: users rate retrieved docs
• Query expansion: many techniques
– add top N docs retrieved to query and resubmit expanded query (sketched below)

– WordNet

• Term clustering: cluster rows of terms in term-by-document matrix to produce synonyms and add to query
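A rough sketch of the "add terms from the top N retrieved docs and resubmit" idea (pseudo-relevance feedback). The `search` function, stop list, and term-selection heuristic are assumptions for illustration.

```python
from collections import Counter

def expand_query(query, search, top_n=5, new_terms=10, stop_words=frozenset()):
    """Resubmit the query expanded with frequent terms from the top N hits.

    `search(query)` is assumed to return a ranked list of document strings.
    """
    top_docs = search(query)[:top_n]
    query_terms = set(query.lower().split())
    counts = Counter(
        tok for doc in top_docs for tok in doc.lower().split()
        if tok not in stop_words and tok not in query_terms
    )
    expansion = [term for term, _ in counts.most_common(new_terms)]
    return search(query + " " + " ".join(expansion))
```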

Page 13

IR Tasks

• Ad hoc retrieval: ‘normal’ IR
• Routing/categorization: assign new doc to one of a predefined set of categories
• Clustering: divide a collection into N clusters
• Segmentation: segment text into coherent chunks
• Summarization: compress a text by extracting summary items or eliminating less relevant items
• Question-answering: find a span of text (within some window) containing the answer to a question

Page 14

Information Extraction

• Another ‘robust’ alternative
• Idea: ‘extract’ particular types of information from arbitrary text or transcribed speech
• Examples:

– Named entities: people, places, organizations, times, dates

• <Organization> MIPS</Organization> Vice President <Person>John Hime</Person>

– MUC evaluations

• Domains: Medical texts, broadcast news (terrorist reports), …

Page 15

Appropriate where Semantic Grammars and Syntactic Parsers are not

• Appropriate where information needs are very specific and specifiable in advance
– Question answering systems, gisting of news or mail…
– Job ads, financial information, terrorist attacks

• Input too complex and far-ranging to build semantic grammars

• But full-blown syntactic parsers are impractical
– Too much ambiguity for arbitrary text
– 50 parses or none at all
– Too slow for real-time applications

Page 16

Information Extraction Techniques

• Often use a set of simple templates or frames with slots to be filled in from input text (see the sketch below)
– Ignore everything else
– My number is 212-555-1212.
– The inventor of the wiggleswort was Capt. John T. Hart.
– The king died in March of 1932.

• Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots

• How to do better than everyone else?
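A minimal sketch of the template-and-slots idea over the example sentences above; the patterns are purely illustrative and far more brittle than a real extractor.

```python
import re

# One illustrative pattern per template slot; everything else is ignored.
PATTERNS = {
    "phone_number": re.compile(r"\b(\d{3}-\d{3}-\d{4})\b"),
    "inventor":     re.compile(r"inventor of the \w+ was (.+)\.$"),
    "death_date":   re.compile(r"died in (\w+ of \d{4})"),
}

def fill_slots(text):
    """Return whichever slots any pattern can fill from this sentence."""
    return {slot: m.group(1)
            for slot, pat in PATTERNS.items()
            if (m := pat.search(text))}

print(fill_slots("My number is 212-555-1212."))
print(fill_slots("The inventor of the wiggleswort was Capt. John T. Hart."))
print(fill_slots("The king died in March of 1932."))
```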

Page 17

The IE Process

• Given a corpus and a target set of items to be extracted:
– Clean up the corpus

– Tokenize it

– Do some hand labeling of target items

– Extract some simple features

• POS tags

• Phrase Chunks …

– Do some machine learning to associate features with target items, or derive this association by intuition

– Use e.g. FSTs, simple or cascaded, to iteratively annotate the input, eventually identifying the slot fillers (toy sketch below)
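A toy illustration of the cascade idea, with two hand-written stages standing in for learned FSTs: the first pass marks candidate PERSON and DATE spans, the second matches a slot pattern over the annotated text. The patterns, slot names, and example sentence are hypothetical.

```python
import re

# Stage 1: annotate low-level entities (purely illustrative patterns).
def annotate(text):
    text = re.sub(r"\b((?:[A-Z][a-z]*\.? )+[A-Z][a-z]+)\b",
                  r"<PERSON>\1</PERSON>", text)
    text = re.sub(r"\b(\w+ of \d{4})\b", r"<DATE>\1</DATE>", text)
    return text

# Stage 2: fill slots by matching patterns over the stage-1 annotations.
SLOT = re.compile(r"<PERSON>(.+?)</PERSON> died in <DATE>(.+?)</DATE>")

def extract(text):
    m = SLOT.search(annotate(text))
    return {"victim": m.group(1), "date": m.group(2)} if m else None

print(extract("President John Smith died in March of 1932."))
# {'victim': 'President John Smith', 'date': 'March of 1932'}
```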

Page 18

Domain-Specific IE from the Web (Patwardhan & Riloff ’06)

• The Problem:
– IE systems typically domain-specific – a new extraction procedure for every task

– Supervised learning depends on hand annotation for training

• Goals:
– Acquire domain specific texts automatically on the Web

– Identify domain-specific IE patterns automatically

• Approach:

Page 19

– Start with a set of seed IE patterns learned from a hand-labeled corpus

– Use these to identify relevant documents on the web

– Find new seed patterns in the retrieved documents

Page 20

MUC-4 IE Task

• Corpus:
– 1700 news stories about terrorist events in Latin America
– Answer keys about information that should be extracted
• Problems:
– All upper case
– 50% of texts irrelevant
– Stories may describe multiple events
• Best results:
– 50-70% precision and recall with hand-built components
– 41-44% recall and 49-51% precision with automatically generated templates

Page 21

Procedure

• Apply pre-defined syntactic patterns to a training corpus of documents for which relevant/irrelevant judgments are known

• Count how often partial lexicalizations of each (e.g. <subj> was killed) appear in relevant vs. irrelevant documents
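A hedged sketch of one way to turn those counts into a ranking: score each lexicalized pattern by its relevance rate (share of occurrences in relevant documents) scaled by log frequency, an RlogF-style heuristic. This scoring is an assumption for illustration; the exact statistic used in this line of work may differ.

```python
import math

def rank_patterns(counts):
    """counts: {pattern: (freq_in_relevant_docs, freq_in_irrelevant_docs)}."""
    scored = []
    for pattern, (rel, irrel) in counts.items():
        if rel == 0:
            continue  # never seen in a relevant document
        relevance_rate = rel / (rel + irrel)
        scored.append((relevance_rate * math.log2(rel), pattern))
    return [p for score, p in sorted(scored, reverse=True)]

ranking = rank_patterns({
    "<subj> was killed": (120, 15),
    "<subj> was held":   (40, 60),
})
print(ranking)  # '<subj> was killed' ranks first
```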

Page 22

• Rank patterns based on association with domain (frequency in domain documents vs. non-domain documents)

• Manually review patterns and assign thematic roles to those deemed useful
– From 40K+ patterns down to 291

• Now find similar web documents

Page 23

Domain Corpus Creation

• Create IR queries by crossing the names of 5 terrorist organizations (e.g. Al Qaeda, IRA) with 16 terrorist activities (e.g. assassinated, bombed, hijacked, wounded), giving 80 queries (see the sketch below)
– Restricted to CNN, English documents

– Eliminated TV transcripts

– Yield from 2 runs: 6,182 documents

– Cleaned corpus: 5,618 documents
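The crossing itself is mechanical; a sketch with stand-in lists (only a few of the 5 organizations and 16 activity terms appear on the slide, so the lists here are abbreviated).

```python
from itertools import product

organizations = ["Al Qaeda", "IRA"]                              # 5 in the study
activities = ["assassinated", "bombed", "hijacked", "wounded"]   # 16 in the study

queries = [f'"{org}" {act}' for org, act in product(organizations, activities)]
print(len(queries))   # 2 x 4 = 8 here; 5 x 16 = 80 in the study
print(queries[0])     # "Al Qaeda" assassinated
```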

Page 24

Learning Domain-Specific Patterns

• Hypothesis: new extraction patterns co-occurring with seed patterns from training corpus will also be associated with terrorism

• Generate all extraction patterns in CNN corpus (147,712)

• Compute correlation of each with seed patterns based on frequency of co-occurrence in the same sentence – keep those occurring more often than chance with some seed (sketched below)

• Rank new patterns by their seed correlations
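A rough sketch of the co-occurrence idea, assuming each sentence is represented as the set of patterns it contains; the "more often than chance" test here is a simple observed-vs-expected ratio, which is an assumption rather than the paper's exact statistic.

```python
def rank_by_seed_cooccurrence(sentences, seeds, candidates):
    """sentences: list of sets of pattern ids found in each sentence.
    seeds, candidates: sets of pattern ids."""
    n = len(sentences)
    seed_sents = [s for s in sentences if s & seeds]
    p_seed = len(seed_sents) / n            # chance rate of landing near a seed
    scores = {}
    for pat in candidates:
        freq = sum(pat in s for s in sentences)
        if freq == 0:
            continue
        observed = sum(pat in s for s in seed_sents) / freq
        # Keep candidates that co-occur with some seed more often than chance.
        if observed > p_seed:
            scores[pat] = observed / p_seed
    return sorted(scores, key=scores.get, reverse=True)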

Page 25

• Filter: Measure semantic affinity: how often does this pattern extract an entity of a particular category (e.g. victim, target)?

• Compute semantic affinity for each extraction pattern wrt 6 categories: target, victim, plus distractors: perpetrator, organization, weapon, other
– E.g. frequency of extracting target / frequency of extracting any of the 6 categories, weighted by log probability of target
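A literal sketch of the affinity measure as stated on the slide (category extractions over extractions across all 6 categories, times a log term); the exact form of the log weighting is not spelled out here, so it is an assumption.

```python
import math

CATEGORIES = ["target", "victim", "perpetrator", "organization", "weapon", "other"]

def semantic_affinity(extraction_counts, category):
    """extraction_counts: {category: #times the pattern extracted that category}."""
    f_cat = extraction_counts.get(category, 0)
    f_all = sum(extraction_counts.get(c, 0) for c in CATEGORIES)
    if f_cat == 0 or f_all == 0:
        return 0.0
    return (f_cat / f_all) * math.log2(f_cat)   # log weighting is an assumption

counts = {"target": 30, "victim": 5, "other": 5}
print(round(semantic_affinity(counts, "target"), 2))   # 3.68, above a 3.0 threshold
```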

Highly Ranked Patterns

Page 26

• Remove patterns not strongly associated with desired classes:

• Evaluate on MUC-4
– Baseline:
• Recall 64% / Precision 43% on targets
• Recall 50% / Precision 52% on victims

Page 27

Results for Web-Learned Patterns

• Use 396 terrorism extraction patterns learned from MUC training set as seeds

• Produce ranked list of new patterns from the web using a semantic affinity threshold of 3.0
• Choose top N (50-300) patterns to add to the seed set
• Performance:

Page 28

Combining IR and IE for QA

• Information extraction: AQUA

Page 29

Summary

• Many approaches to ‘robust’ semantic analysis
– Semantic grammars targeting particular domains

Utterance --> Yes/No Reply

Yes/No Reply --> Yes-Reply | No-Reply

Yes-Reply --> {yes, yeah, right, ok, "you bet", …}

– Information extraction techniques targeting specific tasks

• Extracting information about terrorist events from news

– Information retrieval techniques --> more like NLP