
Information Extraction

Chris Brew, The Ohio State University

Algorithms have weaknesses

Decision trees embody a strong bias, searching only for models that can be expressed in tree form.

This bias means that they need less training data than more general learners to settle on the model they prefer.

But if the data does not fit comfortably into the class of models that decision trees allow, their predictions become unstable and difficult to make sensible use of.

No Free Lunch

In order to model data well, we have to choose between strong assumptions and flexibility

If we are unbiased and flexible, we're bound to need a lot of data to get the right model.

If the assumptions are largely valid, we get to learn effectively from not too much data. So, in a certain sense bias is a good thing to have.

But it is always possible to construct data that will make a biased algorithm look bad. So one can never unconditionally say “This is the best learning algorithm”

Using classifiers in information extraction

Focus on the Merge part of information extraction

Create corpus, including accurate annotations of co-reference.

For each pair of phrases, generate an instance, labelled coref or not-coref.

At training time, learn a classifier to predict coref or not

At test time, create an instance every time a decision needs to be made about a pair of phrases.
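A minimal sketch of this pipeline, assuming a hypothetical features(a, b) function that maps a phrase pair to a feature vector, and using scikit-learn's DecisionTreeClassifier as a stand-in learner (the systems discussed below used their own learners):

from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

def make_instances(phrases, coref_pairs, features):
    # One instance per phrase pair, labelled 1 (coref) or 0 (not-coref).
    X, y = [], []
    for a, b in combinations(phrases, 2):
        X.append(features(a, b))
        y.append(1 if (a, b) in coref_pairs or (b, a) in coref_pairs else 0)
    return X, y

# Training time: learn the classifier from the annotated corpus.
#   X_train, y_train = make_instances(train_phrases, train_coref_pairs, features)
#   clf = DecisionTreeClassifier().fit(X_train, y_train)
# Test time: clf.predict([features(a, b)]) whenever a pair needs a decision.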

Details matter

Two representative systems

Aone and Bennett's MLR

McCarthy and Lehnert's RESOLVE

Both work with MUC-5 Joint Venture data: Aone and Bennett on the Japanese corpus, McCarthy and Lehnert on the English.

Both formally evaluated, but in different ways

MLR

Uses output of Sentence Analysis component from real IE system. Considers anaphors involving entities that the system thought were organizations.

Evaluated on 250 texts for which correct answers were created.

Features used by MLR

Lexical features (e.g. whether one phrase contains a character sequence of the other)

The grammatical roles of the phrases

Semantic class information

Information about position

Total of 66 features

RESOLVE

Used a hand-created, noise-free evaluation corpus (in principle the features used are available from the IE system, but the pattern of expected error is not known).

Features were domain-specific.

Evaluated on JV-ENTITIES: things that were organizations and that had been identified as a party in a joint venture by the extraction component.

Features used by RESOLVE

Whether each phrase contains a proper name from a known list (2 features)

Which of the phrases are names of joint ventures (3 features)

Whether the two phrases are related by an entry in a known list of aliases (1 feature)

Whether phrases have same base noun phrase (1 feature)

Whether phrases are in same sentence (1 feature)
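Purely as an illustration (not RESOLVE's code), pairwise features of this kind might be computed as follows; the phrase attributes and the known_names, jv_names, and aliases resources are hypothetical stand-ins:

def resolve_style_features(p1, p2, known_names, jv_names, aliases):
    # One boolean per feature listed above, for a single pair of phrases.
    return {
        "p1_has_known_name": any(n in p1.text for n in known_names),
        "p2_has_known_name": any(n in p2.text for n in known_names),
        "p1_is_jv_name": p1.text in jv_names,
        "p2_is_jv_name": p2.text in jv_names,
        "both_are_jv_names": p1.text in jv_names and p2.text in jv_names,
        "related_by_alias": (p1.text, p2.text) in aliases or (p2.text, p1.text) in aliases,
        "same_base_np": p1.base_np == p2.base_np,
        "same_sentence": p1.sentence_id == p2.sentence_id,
    }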

Performance

Baseline: assume non-coreferential, ~74% for the RESOLVE test set

RESOLVE: Precision ~80%, Recall ~87% on noise-free data

MLR: Precision ~67%, Recall ~80%

Are these comparable?

Data sets are not identical.

Feature repertoires are not identical.

The exact choice of which pairs the data are evaluated on is not identical.

Inter-annotator agreement is not that high: definite descriptions and bare nominals cause the most difficulty.

More evaluation

Aone and Bennett (working on Japanese) find that the character subsequence feature is all that is needed when the phrases are proper names of companies
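The character subsequence feature is essentially a containment test; a minimal version (my own illustration, not Aone and Bennett's implementation):

def char_subsequence(a: str, b: str) -> bool:
    # True if one phrase contains the other as a contiguous character sequence,
    # e.g. "Sony" appears inside "Sony Corp." (illustrative only).
    return a in b or b in a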

Ideally one should evaluate these for their effects on the performance of the whole IE task, not just for the sub-task of getting the co-reference decision right.

But if you do that, it becomes even harder to compare work across systems, because there is so much going on.

Advantages of classifiers

Decision trees and other learning may give insight into importance of features. Decision trees are especially easy to interpret.

Once the instance set has been created, running multiple variations with different feature sets is easy.

Disadvantages of classifiers

Because the classifiers answer independently, some patterns of answers make no sense (could say A = B, B = C, but A != C).

Unlike language models, classifiers don't give a global measure of the quality of the whole interpretation.

Classifiers for global structure

It is possible, and sometimes useful, to string together classifiers in ways which do better justice to global structure.

Documents have non-trivial structure.

One of the challenges is to find ways of incorporating knowledge about such structure, while still leaving the system able to learn.

Background: HMMs

Standard HMMs have two stochastic components

First is a stochastic finite state network, given by P(s_n | s_{n-1}). In a part-of-speech tagger, the state labels would be parts of speech.

Second is a stochastic emitter at each state, given by P(observation | s_n). In a part-of-speech tagger the observation would (often) be a word.

HMM training is a scheme that adjusts the parameters from data.
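A toy sketch of the two components plus Viterbi decoding; the states, probabilities, and observation indices are made up purely for illustration:

import numpy as np

states = ["DT", "NN", "VB"]
trans = np.array([[0.1, 0.8, 0.1],    # P(s_n | s_{n-1}), rows = previous state
                  [0.2, 0.3, 0.5],
                  [0.4, 0.5, 0.1]])
emit = np.array([[0.9, 0.05, 0.05],   # P(observation | s_n), rows = state
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
start = np.array([0.5, 0.3, 0.2])

def viterbi(obs):
    # Most likely state sequence for a sequence of observation indices.
    V = start * emit[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = V[:, None] * trans * emit[:, o][None, :]
        back.append(scores.argmax(axis=0))
        V = scores.max(axis=0)
    path = [int(V.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi([0, 1, 2]))   # -> ['DT', 'NN', 'VB'] with these toy numbers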

When can you use HMMs?

If the output alphabet is of manageable size and there are relatively few states, you can hope to see enough training data to reliably estimate the emission matrices.

But if the outputs are very complex, then there will be many numbers to estimate.

Discriminative and generative

HMMs are generative probabilistic models for the state sequences and their outputs.

In application, we can see the sequence of observations, and only need to infer the best sequence of states.

Modelling the competition between the observations seems like wasted effort.

The plus is that we have good and efficient training algorithms.

Twisting classifiers

Classifiers are not generative, but discriminative.

They model P(class|observations), not P(observations|class)

They represent observations as sets of attributes

No effort is expended on modelling the observations themselves.

But they don't capture sequence.

FAQ classification

McCallum, Freitag, Pereira (citeseer.nj.nec.com/mccallum00maximum.html)

Task: given Usenet FAQ documents, identify the question answer structure.

Intuition: formatting will give cues to this.

Details: label FAQ lines with (head | question | answer | tail).

Input: one labelled document.

Test data: n-1 documents from the same FAQ (which presumably has the same unknown formatting conventions).

Relevant features

Idea is to provide features which will allow modelling of way in which different FAQs use formatting, not to hand-build the models for each set of conventions that exist

begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-Subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-query, contains-question-word, ends-with-query, first-alpha-capitalized, indented, indented-1-4, indented-5-10, more-than-one-third-space, punctuation-only, previous-line-is-blank, previous-begins-with-ordinal, shorter-than-thirty-chars
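Several of these line features can be approximated with regular expressions; a sketch of a handful of them (an illustrative reimplementation, with "query" assumed to mean a question mark):

import re

def line_features(line, prev_line=""):
    return {
        "begins-with-number": bool(re.match(r"\s*\d", line)),
        "begins-with-punctuation": bool(re.match(r"\s*[^\w\s]", line)),
        "begins-with-question-word": bool(re.match(r"\s*(what|who|when|where|why|how)\b", line, re.I)),
        "blank": line.strip() == "",
        "contains-http": "http" in line,
        "contains-query": "?" in line,
        "ends-with-query": line.rstrip().endswith("?"),
        "indented": bool(re.match(r"\s", line)),
        "shorter-than-thirty-chars": len(line) < 30,
        "previous-line-is-blank": prev_line.strip() == "",
    }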

Competing approaches

Token HMM: Tokenize each line, treat each token as an output from the relevant state.

Feature HMM: Reduce each line to set of features, treat each feature as an output of the relevant state.

MEMM: Learn a maximum entropy classifier for P(s_n | obs_n, s_{n-1}). This is a discriminative approach, and performs well on the FAQ task. In principle any classifier will do: they used maximum entropy because it deals well with overlapping features. (TCMM is the term used below for this kind of conditional Markov model.)
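In outline, such a model replaces the HMM's generative components with a conditional classifier over next states. A rough sketch using scikit-learn's LogisticRegression as the maximum entropy learner and greedy decoding (the paper used its own maxent trainer and Viterbi-style search):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_memm(sequences):
    # sequences: list of [(feature_dict, label), ...], one list per document.
    X, y = [], []
    for seq in sequences:
        prev = "START"
        for feats, label in seq:
            X.append({**feats, "prev_state=" + prev: 1})   # condition on previous state
            y.append(label)
            prev = label
    return make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000)).fit(X, y)

def greedy_decode(model, obs_seq):
    # Label each line in turn, feeding the predicted state back in as a feature.
    prev, out = "START", []
    for feats in obs_seq:
        label = model.predict([{**feats, "prev_state=" + prev: 1}])[0]
        out.append(label)
        prev = label
    return out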

Similar tasks

NP Chunking: state labels are things like BEGIN-NP, IN-NP, END-NP, NOT-IN. (Dan Roth's group has explored this approach) Emissions are sets of features of context (perhaps also including parts-of-speech).

Sentence boundary detection (see me).

Lexical analysis for Thai (Chotimongol and Black).

Constraints

In both cases there are constraints on the sequence of labels (brackets must balance). One idea is to do as McCallum et al. do, making models conditional on the previous state.

Another is to collect probabilities for each state at every position, then hunt for the best sequence that respects the constraints.

Both approaches can work well; which is better depends on the task.
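The second idea can be implemented as a Viterbi-style search over per-position classifier scores, with disallowed transitions excluded. A sketch with an illustrative set of NP-chunk labels and a made-up transition table:

import numpy as np

labels = ["BEGIN-NP", "IN-NP", "END-NP", "NOT-IN"]
allowed = {                          # which label may follow which (illustrative)
    "BEGIN-NP": {"IN-NP", "END-NP"},
    "IN-NP": {"IN-NP", "END-NP"},
    "END-NP": {"BEGIN-NP", "NOT-IN"},
    "NOT-IN": {"BEGIN-NP", "NOT-IN"},
}

def best_valid_sequence(probs):
    # probs: array of shape (n_positions, n_labels) from any per-position classifier.
    n, k = probs.shape
    score = np.log(probs[0])
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        new_score = np.full(k, -np.inf)
        for j in range(k):
            cand = [score[i] if labels[j] in allowed[labels[i]] else -np.inf
                    for i in range(k)]
            back[t, j] = int(np.argmax(cand))
            new_score[j] = max(cand) + np.log(probs[t, j])
        score = new_score
    seq = [int(np.argmax(score))]
    for t in range(n - 1, 0, -1):
        seq.append(int(back[t, seq[-1]]))
    return [labels[i] for i in reversed(seq)]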

Label bias

In an HMM, different states compete to account for the observations that we see.

In a TCMM, we just use observations to choose between available next states

If the next state is known or overwhelmingly probable, the observation has no possibility of influencing the outcome.

If there are two competing situations like this, one with a likely observation and one with a rare one, no useful competition is possible.

Conditional Random Fields

Lafferty, McCallum, and Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data (on CiteSeer)

Demonstrates that the method can handle synthetic data with label bias, and that the MEMM does not.

Shows a small performance improvement over HMMs and MEMMs on POS tagging data.

Method harder to understand than MEMMs, and more costly than HMMs, but more flexible.

Summary

Using classifiers for text tasks

Using classifiers for text tasks where there is a significant sequential component

More experience with learning from data

Next time: some text tasks that don't use classifiers, but special techniques.

Reference clustering (McCallum, Nigam, Ungar)

Inductive learning of extraction rules (Thompson, Califf, Mooney)

Non-classifier learning

Mary-Elaine Califf and Ray Mooney (2002) Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction (submitted to JMLR, available from Mooney's web page)

Also, if time permits, something on reference extraction and clustering, from the CMU group around McCallum.

Sample job posting

Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: [email protected]

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888
[email protected]

The filled template

computer_science_job
id: [email protected]
title: SOFTWARE PROGRAMMER
salary: Not specified
company: Not specified
recruiter: Not specified
state: TN
city: Not specified
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX   (multiple fillers)
application area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
post_date: 17 Nov 1996

How shallow(!|?)

Feature-based techniques require design of a reasonable set of features.

Might still exclude useful contextual information.

Not obvious what features one wants in the job posting domain.

Califf and Mooney's RAPIER learns a rich relational representation from structured examples. Related to Inductive Logic Programming.

RAPIER's rules

Strongly resemble the patterns used by ELIZA to extract slot fillers.

Pre-Filler Pattern:  1) syntactic: {nn, nnp}   2) list: length 2

Filler Pattern:      1) word: undisclosed, syntactic: jj

Post-Filler Pattern: 1) semantic: price

This one is highly specific to contexts like

“sold to the bank for an undisclosed amount”

“paid Honeywell an undisclosed price”

Using rules

Given a set of rules, indexed by template and slot RAPIER applies all of them. If the ruleset proposes multiple fillers for a slot after duplicates are eliminated, all are returned.

There are lots of possible rules, some highly specific, some highly general. How should the algorithm structure the search for rules at the right level of abstraction?

Each slot is learnt separately; no attempt is made to build generalizations across different slots.
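A sketch of the application step, assuming a hypothetical rule.match(doc) interface that returns candidate fillers:

def fill_slot(doc, ruleset, template, slot):
    # Apply every rule indexed under (template, slot); return de-duplicated fillers.
    fillers = []
    for rule in ruleset.get((template, slot), []):
        for filler in rule.match(doc):
            if filler not in fillers:        # eliminate duplicates
                fillers.append(filler)
    return fillers                           # all surviving fillers are returned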

Knowledge sources

Words

Parts of speech

Semantic classes from WordNet (could probably get such resources for Spanish/Italian/German/Japanese, but not Pushtun/Urdu/Tamil)

Rule learning strategies

Rule learning systems use two basic strategies

Compression – begin with a set of highly specific rules (usually one per example) that cover all examples. Look for more general rules that can replace several specific rules. Learning ends when no further compression possible.

Covering – start with a set containing all the examples. Find a rule to cover some of the examples. Repeat until all examples covered.
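The covering strategy in schematic form (a generic sketch; RAPIER itself uses the compression strategy):

def covering(examples, find_rule):
    # find_rule(remaining) proposes a rule covering some of the remaining examples.
    rules, remaining = [], set(examples)
    while remaining:
        rule = find_rule(remaining)
        covered = {ex for ex in remaining if rule.covers(ex)}
        if not covered:            # stop if no further progress can be made
            break
        rules.append(rule)
        remaining -= covered       # covered examples can be ignored from now on
    return rules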

Rule generation

Whether doing covering or compression, need choices about how to generate rule candidates.

Could go top-down, starting with very general rules and specializing them as needed.

Could go bottom-up, using sets of existing rules to gradually generalize away from the data.

Could do a combination of top-down and bottom-up.

Trade-offs

Covering systems are efficient, because they can ignore examples already covered.

But compression systems do a more thorough search at each iteration, so may find better rules in the end.

Bottom-up rule generation may not generalize well enough to cover unseen data.

Top-down generation might inefficiently thrash around generating irrelevant rules.

RAPIER's choices

Specific-to-general search (bottom-up)

Because: wanted to start with words actually represented in the examples, not search at random.

Because: specific rules tend to be high-precision at the possible expense of recall, and the designers favoured precision over recall.

Compression-based outer loop

Because: its efforts to generalize counteract the tendency to over-specificity.

The RAPIER algorithm

for each slot, S, in the template being learned
    SlotRules = most specific rules for S from example documents
    while compression has failed fewer than CompressLim times
        initialize RuleList to be an empty priority queue of length k
        randomly select M pairs of rules from SlotRules
        find the set L of generalizations of the fillers of each rule pair
        for each pattern P in L
            create a rule NewRule with filler P and empty pre- and post-fillers
            evaluate NewRule and add NewRule to RuleList
        let n = 0
        loop
            increment n
            for each rule, CurRule, in RuleList
                NewRuleList = SpecializePreFiller(CurRule, n)
                evaluate each rule in NewRuleList and add it to RuleList
            for each rule, CurRule, in RuleList
                NewRuleList = SpecializePostFiller(CurRule, n)
                evaluate each rule in NewRuleList and add it to RuleList
        until best rule in RuleList produces only valid fillers
              or the value of the best rule in RuleList has failed to improve
              over the last LimNoImprovements iterations
        if best rule in RuleList covers no more than an allowable percentage of spurious fillers
            add it to SlotRules and remove empirically subsumed rules

Rule evaluation

Rules are evaluated by:

Their ability to separate positive from negative examples

Their size (the smaller, the better)

RAPIER allows rules that produce some spurious slot fillers, primarily because it doesn't want to be too sensitive to human error in the annotation.
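Califf and Mooney combine these criteria into a single score, roughly a smoothed informativeness term plus a size penalty, with lower being better. The sketch below is my paraphrase, so the exact weighting should be checked against the paper:

import math

def rule_value(pos, neg, rule_size):
    # pos = correct fillers produced, neg = spurious fillers, rule_size = pattern length.
    info = -math.log2((pos + 1) / (pos + neg + 2))   # rewards separating positives from negatives
    return info + rule_size / max(pos, 1)            # penalizes large rules (lower is better)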

Initial rulebase

Filler: word and part-of-speech. No semantic constraint (these are only introduced via generalization). Must be a word actually extracted for one of the templates.

Pre-filler: words and parts-of-speech going right back to beginning of document.

Post-filler: words and parts of speech going on to end of document.

Filler Generalisation

Elements to be generalised:
    Element A: word: man,   syntactic: nnp, semantic: (none)
    Element B: word: woman, syntactic: nn,  semantic: (none)

Resulting generalisations:
    1) word: (none),       syntactic: {nn, nnp}, semantic: person
    2) word: {man, woman}, syntactic: {nn, nnp}, semantic: person
    3) word: (none),       syntactic: (none),    semantic: person
    4) word: {man, woman}, syntactic: (none),    semantic: person

How semantic constraints arise

If both items being generalized have semantic constraints, search for a common superclass

If both items being generalized have different words, see whether both belong in same synset

Don't guess semantic constraints for single words, since there are not enough clues to make this useful.
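With NLTK's WordNet interface, the common-superclass search can be sketched as follows (a simplification; RAPIER's own WordNet handling predates NLTK):

from nltk.corpus import wordnet as wn

def common_semantic_class(word_a, word_b):
    # Find the most specific noun synset that covers both words,
    # e.g. generalizing 'man' and 'woman' towards 'person'.
    best = None
    for syn_a in wn.synsets(word_a, pos=wn.NOUN):
        for syn_b in wn.synsets(word_b, pos=wn.NOUN):
            if syn_a == syn_b:                  # same synset: use it directly
                return syn_a
            for common in syn_a.lowest_common_hypernyms(syn_b):
                if best is None or common.min_depth() > best.min_depth():
                    best = common               # prefer the deeper (more specific) superclass
    return best

print(common_semantic_class("man", "woman"))    # typically Synset('person.n.01') or similar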

Generalizing contexts

If we know which elements of the pre/post fillers go with which, we can generalize in the same way that we did with fillers.

Patterns of equal length

Patterns to be generalized:
    Pattern A: 1) word: ate, syntactic: vb   2) word: the, syntactic: dt   3) word: pasta, syntactic: nn
    Pattern B: 1) word: hit, syntactic: vb   2) word: the, syntactic: dt   3) word: ball,  syntactic: nn

Resulting generalizations:
    1) word: {ate, hit}, syntactic: vb   2) word: the, syntactic: dt   3) word: {pasta, ball}, syntactic: nn
    1) word: (none),     syntactic: vb   2) word: the, syntactic: dt   3) word: {pasta, ball}, syntactic: nn
    1) word: {ate, hit}, syntactic: vb   2) word: the, syntactic: dt   3) word: (none),        syntactic: nn
    1) word: (none),     syntactic: vb   2) word: the, syntactic: dt   3) word: (none),        syntactic: nn

Patterns of unequal length

Risk of combinatorial explosion

Assume every element of the shorter pattern must align with something in the longer. Still many possibilities.

Align pairs of elements of length two or more, assuming they must be correct.

Several special cases

Example of unequal length

Patterns to be generalized:
    Pattern A: 1) word: bank,  syntactic: nn   2) word: vault, syntactic: nn
    Pattern B: 1) list: length 3, word: (none), syntactic: nnp

Resulting generalizations:
    1) list: length 3, word: (none), syntactic: {nn, nnp}
    1) list: length 3, word: (none), syntactic: (none)

Specialization

We also need a mechanism for getting the rules learnt down to a manageable length.

Basic strategy is to calculate all correspondences in the pre/post parts of the parent rules, then work outwards, generating rules that use successively more and more context.

In cases like “6 years experience required” and “4 years experience is required”, we want to be able to work outwards different amounts in the two original rules.

Example

Rule 1:
    Pre-filler Pattern:  1) word: located, tag: vbn   2) word: in, tag: in
    Filler Pattern:      1) word: atlanta, tag: nnp
    Post-filler Pattern: 1) word: ',' tag: ','   2) word: georgia, tag: nnp   3) word: '.' tag: '.'

Rule 2:
    Pre-filler Pattern:  1) word: offices, tag: nns   2) word: in, tag: in
    Filler Pattern:      1) word: kansas, tag: nnp   2) word: city, tag: nnp
    Post-filler Pattern: 1) word: ',' tag: ','   2) word: missouri, tag: nnp   3) word: '.' tag: '.'

Output of generalizations

Pre-filler Pattern:  1) word: in, tag: in

Filler Pattern:      1) list: max length: 2, tag: nnp

Post-filler Pattern: 1) word: ',' tag: ','   2) tag: nnp, semantic: state

Evaluation

Job postings:

Precision ~80%, Recall ~60%

Fast learning from relatively little data

Word constraints do most of the work

POS tags and semantic classes help, but only a little

Evaluation

Seminar announcements:

Very good on start-time and end-time

Less good on location: POS tags help more here

Speaker name is much the hardest slot; POS tags help because they indicate proper names

Semantic classes not helpful in this domain.

Conclusions-RAPIER

Highly text-specific extraction patterns work well.

The difficulty of evaluating complex systems is ever present.

Sometimes the restriction that each rule provides a filler for just one slot is annoying. For example, ads often describe multiple positions at different salary and responsibility levels. One really wants to get the connections as well.

Named entity recognition would allow better slot filling. Currently RAPIER sees individual words and POS tags only.

Prospects for STP

Increasing use of learning techniques tuned to the structure of the documents.

XML is a good tool for organizing STP, allowing creation of easily transferable document representations (e.g. take articles from 6 different newspapers, each of which signals headline, byline, and section in different ways; map all to a common XML representation in which structure is made explicit).

More multilingual work.

More work on sub-documents (tables, captions, pictures).

Prospects for STP

One way of seeing STP is as a way of re-organizing and re-presenting data.

Raw documents: conceptualize as a mixture of relevant and irrelevant stuff. Ignore the irrelevant, route the relevant to appropriate locations.

Templates: database rows. Contents are NL text, which has its own internal structure, but which we do not attempt to uncover.

Representation

On the Web, the “same” data exists in multiple places in different forms. If we can detect these forms, we may be able to:

Pull together different presentations of the same information

Begin to work out the characteristics of some of the operations that link different forms

Get closer to automated data “re-presentation”