CS 4705
Robust Semantics, Information Extraction, and Information Retrieval
Problems with Syntax-Driven Semantics
• Syntactic structures often don’t fit semantic structures very well
  – Important semantic elements are often distributed very differently in the trees of sentences that mean ‘the same’:
    I like soup. Soup is what I like.
  – Parse trees contain many structural elements not clearly important to making semantic distinctions
  – Syntax-driven semantic representations are sometimes pretty verbose:
V --> serves {∃ e, x, y  Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, x)}
Semantic Grammars
• An alternative to modifying syntactic grammars to deal with semantics too
• Define grammars specifically in terms of the semantic information we want to extract
  – Domain specific: rules correspond directly to entities and activities in the domain
    I want to go from Boston to Baltimore on Thursday, September 24th
  – Greeting --> {Hello | Hi | Um…}
  – TripRequest --> Need-spec travel-verb from City to City on Date
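To make rules like TripRequest concrete, here is a minimal sketch (not from the slides) of one semantic-grammar rule implemented as a regular expression that fills domain slots directly; the rule and slot names are illustrative assumptions.

```python
import re

# One semantic-grammar rule: the pattern mirrors the TripRequest rule
# above, mapping directly to domain slots rather than syntactic structure.
TRIP_REQUEST = re.compile(
    r"(?:i want to|i need to|i'd like to)\s+"   # Need-spec
    r"(?:go|travel|fly)\s+"                     # travel-verb
    r"from\s+(?P<from_city>\w+)\s+"             # City
    r"to\s+(?P<to_city>\w+)"                    # City
    r"(?:\s+on\s+(?P<date>.+))?",               # Date (optional)
    re.IGNORECASE,
)

def parse_trip_request(utterance: str):
    """Return the filled TripRequest slots, or None if out of grammar."""
    m = TRIP_REQUEST.search(utterance)
    return m.groupdict() if m else None

print(parse_trip_request(
    "I want to go from Boston to Baltimore on Thursday, September 24th"))
# {'from_city': 'Boston', 'to_city': 'Baltimore',
#  'date': 'Thursday, September 24th'}
```

Anything the rule does not anticipate simply fails to parse, which previews the drawbacks discussed below.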
Predicting User Input
• Rely on knowledge of the task and (sometimes) constraints on what the user can do
  – Can handle very sophisticated phenomena:
    I want to go to Boston on Thursday.
    I want to leave from there on Friday for Baltimore.
    TripRequest --> Need-spec travel-verb from City on Date for City
    A dialogue postulate maps the filler for ‘from-city’ to the pre-specified to-city
Priming User Input
• Users tend to use the vocabulary they hear from the system: lexical entrainment (Clark & Brennan ’96)
  – Reference to objects: the scary M&M man
  – Re-use of system prompt vocabulary/syntax:
    Please tell me where you would like to leave/depart from.
    Where would you like to leave/depart from?
• Explicit training vs. implicit training
• Training the user vs. retraining the system
Drawbacks of Semantic Grammars
• Lack of generality
  – A new grammar for each application
  – Large cost in development time
• Can be very large, depending on how much coverage you want them to have
• If users go outside the grammar, things may break disastrously:
  I want to leave from my house at 10 a.m.
  I want to talk to a person.
Information Retrieval
• How is it related to NLP?
  – Operates on language (speech or text)
  – Does it use linguistic information?
    • Stemming
    • Bag-of-words approach
    • Very simple analyses
  – Does it make use of document formatting?
    • Headlines, punctuation, captions
• Collection: a set of documents
• Term: a word or phrase
• Query: a set of terms
But…what is a term?
• Stop list
• Stemming
• Homonymy, polysemy, synonymy
Vector Space Model
• Simple versions represent documents and queries as feature vectors, with one binary feature for each term in the collection
• Is a term t in this document or in this query or not?
  D = (t1, t2, …, tn)
  Q = (t1, t2, …, tn)
• Similarity metric: how many terms does a query share with each candidate document?
• Weighted terms: term-by-document matrix
  D = (w_t1, w_t2, …, w_tn)
  Q = (w_t1, w_t2, …, w_tn)
• How do we compare the vectors?
  – Normalize each term weight by the number of terms in the document: how important is each t in D?
  – Compute the dot product between the vectors to see how similar they are
  – Cosine of the angle: 1 = identical; 0 = no common terms
• How do we get the weights?
  – Term frequency (tf): how often does term i occur in document j?
  – Inverse document frequency (idf): the number of documents divided by the number of documents in which term i occurs
  – tf·idf weighting: the weight of term i in document j is the product of its frequency in j and the log of its inverse document frequency in the collection:

    idf_i = log(N / n_i)
    w_{i,j} = tf_{i,j} · idf_i
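A minimal sketch of these weights and the cosine comparison, on a toy corpus invented for illustration:

```python
import math
from collections import Counter

docs = [
    "soup is what i like",
    "i like soup and bread",
    "parse trees contain structural elements",
]

N = len(docs)
doc_terms = [Counter(d.split()) for d in docs]

def idf(term):
    """idf_i = log(N / n_i), where n_i = number of docs containing the term."""
    n_i = sum(1 for tf in doc_terms if term in tf)
    return math.log(N / n_i) if n_i else 0.0

def tfidf_vector(terms):
    """w_{i,j} = tf_{i,j} * idf_i for every term in a document or query."""
    return {t: tf * idf(t) for t, tf in terms.items()}

def cosine(v1, v2):
    """1 = identical direction; 0 = no common (weighted) terms."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values()))
            * math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0

query = tfidf_vector(Counter("i like soup".split()))
for text, terms in zip(docs, doc_terms):
    print(f"{cosine(query, tfidf_vector(terms)):.3f}  {text}")
```

The first two documents share query terms and score above zero; the third shares none and scores 0.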
Evaluating IR Performance
• Precision: #relevant docs returned/total #docs returned -- how often are you right when you say this document is relevant?
• Recall: #relevant docs returned/#relevant docs in collection -- how many of the relevant documents do you find?
• F-measure combines P and R:

  F = 2PR / (P + R)

• Are P and R equally important?
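A quick worked example (numbers invented): if a system returns 8 documents, 6 of them relevant, from a collection containing 10 relevant documents, then P = 6/8 = 0.75, R = 6/10 = 0.60, and F = 2(0.75)(0.60) / (0.75 + 0.60) = 0.9 / 1.35 ≈ 0.67. The F-measure as given weighs P and R equally; a weighted variant (F_β) can favor one over the other.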
Improving Queries
• Relevance feedback: users rate retrieved docs
• Query expansion: many techniques
  – Add the top N docs retrieved to the query and resubmit the expanded query (see the sketch after this list)
  – WordNet
• Term clustering: cluster rows of terms in the term-by-document matrix to produce synonyms, and add them to the query
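A minimal sketch of the first expansion technique, pseudo-relevance feedback; it assumes documents arrive already ranked (e.g. by the cosine scoring above), and all names are illustrative.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_n=3, new_terms=5):
    """Pseudo-relevance feedback: treat the top-N retrieved documents
    as relevant, add their most frequent unseen terms to the query,
    and return the expanded query for resubmission."""
    pool = Counter()
    for doc in ranked_docs[:top_n]:
        pool.update(doc.split())
    added = [t for t, _ in pool.most_common() if t not in set(query_terms)]
    return list(query_terms) + added[:new_terms]

print(expand_query(["soup"],
                   ["i like soup and bread", "soup is what i like"]))
```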
IR Tasks
• Ad hoc retrieval: ‘normal’ IR
• Routing/categorization: assign a new doc to one of a predefined set of categories
• Clustering: divide a collection into N clusters
• Segmentation: segment a text into coherent chunks
• Summarization: compress a text by extracting summary items or eliminating less relevant items
• Question answering: find a span of text (within some window) containing the answer to a question
Information Extraction
• Another ‘robust’ alternative
• Idea: ‘extract’ particular types of information from arbitrary text or transcribed speech
• Examples:
  – Named entities: people, places, organizations, times, dates
    <Organization>MIPS</Organization> Vice President <Person>John Hime</Person>
– MUC evaluations
• Domains: Medical texts, broadcast news (terrorist reports), …
Appropriate where Semantic Grammars and Syntactic Parsers are not
• Appropriate where information needs are very specific and specifiable in advance
  – Question-answering systems, gisting of news or mail, …
  – Job ads, financial information, terrorist attacks
• Input too complex and far-ranging to build semantic grammars
• But full-blown syntactic parsers are impractical
  – Too much ambiguity for arbitrary text: 50 parses or none at all
  – Too slow for real-time applications
Information Extraction Techniques
• Often use a set of simple templates or frames with slots to be filled in from the input text, ignoring everything else:
  – My number is 212-555-1212.
  – The inventor of the wiggleswort was Capt. John T. Hart.
  – The king died in March of 1932.
• Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots
• How to do better than everyone else?
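A minimal sketch of template filling over the slide’s three examples, using one regular expression per slot; the patterns and slot names are illustrative assumptions, not a general extractor.

```python
import re

# Each pattern fills one slot and ignores everything else in the text.
TEMPLATES = {
    "phone_number": re.compile(r"[Mm]y number is (\d{3}-\d{3}-\d{4})"),
    "inventor":     re.compile(r"[Tt]he inventor of the \w+ was ([A-Z][\w. ]+)\."),
    "death_date":   re.compile(r"[Tt]he king died in (\w+ of \d{4})"),
}

def fill_slots(text):
    """Return every slot whose template matches somewhere in the text."""
    return {slot: m.group(1)
            for slot, pat in TEMPLATES.items()
            if (m := pat.search(text))}

print(fill_slots("My number is 212-555-1212."))
print(fill_slots("The inventor of the wiggleswort was Capt. John T. Hart."))
print(fill_slots("The king died in March of 1932."))
```

Context cues like capitalization do the real work here: the inventor pattern relies on the capitalized name following “was”, exactly the kind of cue the slide mentions.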
The IE Process
• Given a corpus and a target set of items to be extracted:
– Clean up the corpus
– Tokenize it
– Do some hand labeling of target items
– Extract some simple features
• POS tags
• Phrase chunks, …
– Do some machine learning to associate features with target items, or derive this association by intuition
– Use e.g. FSTs, simple or cascaded, to iteratively annotate the input, eventually identifying the slot fillers
Domain-Specific IE from the Web (Patwardhan & Riloff ’06)
• The problem:
  – IE systems are typically domain-specific: a new extraction procedure for every task
  – Supervised learning depends on hand annotation for training
• Goals:
  – Acquire domain-specific texts automatically from the Web
– Identify domain-specific IE patterns automatically
• Approach:
– Start with a set of seed IE patterns learned from a hand-labeled corpus
– Use these to identify relevant documents on the web
– Find new seed patterns in the retrieved documents
MUC-4 IE Task
• Corpus:
  – 1700 news stories about terrorist events in Latin America
  – Answer keys specifying the information that should be extracted
• Problems:
  – All upper case
  – 50% of texts irrelevant
  – Stories may describe multiple events
• Best results:
  – 50-70% precision and recall with hand-built components
  – 41-44% recall and 49-51% precision with automatically generated templates
Procedure
• Apply pre-defined syntactic patterns to a training corpus of documents for which relevant/irrelevant judgments are known
• Count how often partial lexicalizations of each (e.g. <subj> was killed) appear in relevant vs. irrelevant documents
• Rank patterns based on their association with the domain (frequency in domain documents vs. non-domain documents); a sketch of this step follows the list
• Manually review the patterns and assign thematic roles to those deemed useful
  – From 40K+ patterns down to 291
• Now find similar web documents
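A minimal sketch of ranking partially lexicalized patterns by domain association; the relevance-rate statistic and the frequency cutoff here are simple assumptions, not the paper’s exact formula.

```python
def rank_patterns(pattern_hits, doc_is_relevant, min_freq=3):
    """Rank lexicalized patterns (e.g. '<subj> was killed') by how
    strongly they associate with relevant (domain) documents.

    pattern_hits:    {pattern: [doc_id, ...]} occurrences per pattern
    doc_is_relevant: {doc_id: bool} known relevance judgments
    """
    scored = []
    for pattern, doc_ids in pattern_hits.items():
        freq = len(doc_ids)
        if freq < min_freq:          # ignore rare, unreliable patterns
            continue
        rel = sum(1 for d in doc_ids if doc_is_relevant[d])
        scored.append((rel / freq, freq, pattern))  # relevance rate first
    return sorted(scored, reverse=True)
```

The top of this ranking is what a human would then review and assign thematic roles to.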
Domain Corpus Creation
• Create IR queries by crossing the names of 5 terrorist organizations (e.g. Al Qaeda, IRA) with 16 terrorist activities (e.g. assassinated, bombed, hijacked, wounded): 80 queries
  – Restricted to CNN, English documents
– Eliminated TV transcripts
– Yield from 2 runs: 6,182 documents
– Cleaned corpus: 5,618 documents
Learning Domain-Specific Patterns
• Hypothesis: new extraction patterns that co-occur with seed patterns from the training corpus will also be associated with terrorism
• Generate all extraction patterns in the CNN corpus (147,712)
• Compute the correlation of each with the seed patterns, based on frequency of co-occurrence in the same sentence; keep those co-occurring with some seed more often than chance
• Rank new patterns by their seed correlations
• Filter: measure semantic affinity: how often does this pattern extract an entity of a particular category (e.g. victim, target)?
• Compute semantic affinity for each extraction pattern w.r.t. 6 categories: target and victim, plus the distractors perpetrator, organization, weapon, and other
  – E.g. the frequency of extracting target divided by the frequency of extracting any of the 6 categories, weighted by the log probability of target
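A minimal sketch of the semantic-affinity filter as the slide describes it; the exact weighting term is an assumption (the slide says “log probability”; the log of the raw category frequency is used here for illustration).

```python
import math

CATEGORIES = ["target", "victim",                              # desired classes
              "perpetrator", "organization", "weapon", "other"]  # distractors

def semantic_affinity(extraction_counts, category):
    """extraction_counts: {category: how many entities of that category
    this pattern extracted}. The score rewards patterns that extract one
    category both predominantly and frequently."""
    f_cat = extraction_counts.get(category, 0)
    f_all = sum(extraction_counts.get(c, 0) for c in CATEGORIES)
    if f_cat == 0 or f_all == 0:
        return 0.0
    return (f_cat / f_all) * math.log2(f_cat)

counts = {"target": 20, "victim": 3, "other": 2}
print(round(semantic_affinity(counts, "target"), 2))  # (20/25)*log2(20) ≈ 3.46
```

Under a threshold like the 3.0 used later in these slides, this pattern would be kept as a target extractor.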
Highly Ranked Patterns
• Remove patterns not strongly associated with the desired classes
• Evaluate on MUC-4
  – Baseline:
• Recall 64%/Precision 43% on targets
• Recall 50%/Precision 52% on victims
Results for Web-Learned Patterns
• Use 396 terrorism extraction patterns learned from the MUC training set as seeds
• Produce a ranked list of new patterns from the web, using a semantic affinity threshold of 3.0
• Choose the top N (50-300) patterns to add to the seed set
• Performance:
Combining IR and IE for QA
• Information extraction: AQUA
Summary
• Many approaches to ‘robust’ semantic analysis
  – Semantic grammars targeting particular domains:
    Utterance --> Yes/No Reply
    Yes/No Reply --> Yes-Reply | No-Reply
    Yes-Reply --> {yes, yeah, right, ok, “you bet”, …}
– Information extraction techniques targeting specific tasks
• Extracting information about terrorist events from news
– Information retrieval techniques --> more like NLP