Information Extraction and Named Entity Recognition-2



  • 7/22/2019 Information Extraction and Named Entity Recognition-2

Information Extraction and Named Entity Recognition

Introducing the tasks: Getting simple structured information out of text


    Christopher Manning

    Information Extraction

Information extraction (IE) systems:

Find and understand limited relevant parts of texts

Gather information from many pieces of text

Produce a structured representation of relevant information: relations (in the database sense), a.k.a. a knowledge base

Goals:

1. Organize information so that it is useful to people

2. Put information in a semantically precise form that allows inferences to be made by computer algorithms


    Information Extraction (IE)

IE systems extract clear, factual information. Roughly: Who did what to whom when?

E.g., gathering earnings, profits, board members, headquarters, etc. from company reports:

The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.

headquarters(BHP Billiton Limited, Melbourne, Australia)

Learn drug-gene product interactions from medical research literature


    Low-level information extraction

Is now available (and I think popular) in applications like Apple or Google mail, and web indexing

Often seems to be based on regular expressions and name lists
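A minimal sketch of the kind of regex-driven extraction such applications perform; the patterns below are illustrative assumptions of mine, not any particular product's rules:

```python
import re

# Toy low-level extractors: pull date and time mentions out of an email
# body with regular expressions, as mail clients do when they offer
# "add to calendar" actions.
DATE = re.compile(
    r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day, "
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{1,2}\b")
TIME = re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)\b", re.IGNORECASE)

def extract_low_level(text):
    return {"dates": DATE.findall(text), "times": TIME.findall(text)}

msg = "Dinner on Saturday, February 27 at 7:30pm - reply by 10:00 AM."
print(extract_low_level(msg))
# {'dates': ['Saturday, February 27'], 'times': ['7:30pm', '10:00 AM']}
```

Patterns like these are brittle (they miss "Feb 27th", "19:30", etc.), which is exactly why this is "low-level" extraction.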


    Low-level information extraction


    Why is IE hard on the web?

[Figure: an annotated product web page. We need specific fields from it, e.g. the price and the title, and must tell that the item is a book, not a toy.]


How is IE useful? Classified Advertisements (Real Estate)

Background: Plain text advertisements

Lowest common denominator: only thing that 70+ newspapers using many different publishing systems can all handle

[Figure: a plain-text real estate ad, dated March 02, 1998]



Why doesn't text search (IR) work?

What you search for in real estate advertisements:

Town/suburb. You might think easy, but:

Real estate agents: Coldwell Banker, Mosman

Phrases: Only 45 minutes from Parramatta

Multiple property ads have different suburbs in one ad

Money: want a range not a textual match

Multiple amounts: was $155K, now $145K

Variations: offers in the high 700s [but not rents for $270]

Bedrooms: similar issues: br, bdr, beds, B/R
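These variations are exactly what a small normalization layer handles. A sketch, with patterns that are illustrative and tuned only to the examples above:

```python
import re

# Normalize two of the problem fields above: bedroom counts written as
# br/bdr/beds/B/R, and prices written with a K (thousands) suffix.
BEDS = re.compile(r"\b(\d+)\s*(?:br|bdr|beds?|b/r|bedrooms?)\b", re.IGNORECASE)
PRICE = re.compile(r"\$\s?(\d+(?:,\d{3})*)\s?(K?)\b", re.IGNORECASE)

def parse_ad(text):
    beds = BEDS.search(text)
    prices = [int(num.replace(",", "")) * (1000 if k else 1)
              for num, k in PRICE.findall(text)]
    return {"bedrooms": int(beds.group(1)) if beds else None,
            "prices": prices}

print(parse_ad("Charming 3 bdr cottage, was $155K, now $145K"))
# {'bedrooms': 3, 'prices': [155000, 145000]}
```

With numeric prices extracted, range queries ("under $150K") become possible where plain text match would fail.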


    Named Entity Recognition (NER)

A very important sub-task: find and classify names in text, for example:

The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.



    Named Entity Recognition (NER)

The uses:

Named entities can be indexed, linked off, etc.

Sentiment can be attributed to companies or products

A lot of IE relations are associations between named entities

For question answering, answers are often named entities.

Concretely:

Many web pages tag various entities, with links to bio or topic pages

Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, ...

Apple/Google/Microsoft/... smart recognizers for document content


Evaluation of Named Entity Recognition

Precision, Recall, and the F measure; their extension to sequences


    The 2-by-2 contingency table

                  correct    not correct
    selected      tp         fp
    not selected  fn         tn


    Precision and recall

Precision: % of selected items that are correct
Recall: % of correct items that are selected

                  correct    not correct
    selected      tp         fp
    not selected  fn         tn


    A combined measure: F

A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

F = 1 / (α(1/P) + (1 − α)(1/R)) = ((β² + 1)PR) / (β²P + R)

The harmonic mean is a very conservative average (see Sec. 8.3)

People usually use balanced F1 measure, i.e., with β = 1 (that is, α = 1/2): F = 2PR/(P + R)


    Quiz question

    What is the F1?

    P= 40% R= 40% F1 =



    Quiz question

    What is the F1?

    P= 75% R= 25% F1 =
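The two quiz answers can be checked directly from the formula in a few lines:

```python
def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall:
    # F = (beta^2 + 1) * P * R / (beta^2 * P + R)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

print(round(f_measure(0.40, 0.40), 4))  # 0.4   -> F1 = 40%
print(round(f_measure(0.75, 0.25), 4))  # 0.375 -> F1 = 37.5%
# Note how conservative the harmonic mean is: the arithmetic mean of
# 75% and 25% would be 50%, but F1 stays close to the lower value.
```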



    The Named Entity Recognition Task

    Task: Predict entities in a text

    Foreign      ORG
    Ministry     ORG
    spokesman    O
    Shen         PER
    Guofang      PER
    told         O
    Reuters      ORG
    :            :

Standard evaluation is per entity, not per token


    Precision/Recall/F1 for IE/NER

Recall and precision are straightforward for tasks like text categorization, where there is only one grain size (documents)

The measure behaves a bit funnily for IE/NER when there are boundary errors (which are common):

First Bank of Chicago announced earnings

This counts as both a fp and a fn

Selecting nothing would have been better

Some other metrics (e.g., MUC scorer) give partial credit (according to complex rules)
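A sketch of per-entity exact-match scoring that shows why a boundary error is doubly penalized; entities are represented here as (start, end, type) triples, a choice of mine rather than a standard scorer's format:

```python
def entity_prf(gold, predicted):
    """Exact-match, per-entity scoring: an entity counts as correct only
    if both its boundaries and its type are exactly right."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1, fp, fn

# Gold: "First Bank of Chicago" (tokens 0-3) is an ORG.
gold = {(0, 3, "ORG")}
# The system found only "Bank of Chicago" (tokens 1-3): a boundary error.
pred = {(1, 3, "ORG")}
print(entity_prf(gold, pred))  # (0.0, 0.0, 0.0, 1, 1)
```

The boundary error yields one fp and one fn; predicting nothing at all would have produced only the fn.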


Methods for doing NER and IE

The space; hand-written patterns


    Three standard approaches to NER

1. Hand-written regular expressions

   Perhaps stacked

2. Using classifiers

   Generative: Naïve Bayes

   Discriminative: Maxent models

3. Sequence models

   HMMs

   CMMs/MEMMs

   CRFs


Hand-written Patterns for Information Extraction

If extracting from automatically generated web pages, simple regex patterns usually work.

    Amazon page


    MUC: the NLP genesis of IE

DARPA funded significant efforts in IE in the early to mid 1990s

Message Understanding Conference (MUC) was an annual event/competition where results were presented.

    Focused on extracting information from news articles

    Terrorist events

    Industrial joint ventures

    Company management changes

    Starting off, all rule-based, gradually moved to ML



MUC Information Extraction Example

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and metal wood clubs a month.

JOINT-VENTURE-1
  Relationship: TIE-UP
  Entities: Bridgestone Sports Co., a local concern, a Japanese trading house
  Joint Ent: Bridgestone Sports Taiwan Co.
  Activity: ACTIVITY-1
  Amount: NT$20,000,000

ACTIVITY-1
  Activity: PRODUCTION
  Company: Bridgestone Sports Taiwan Co.
  Product: iron and metal wood clubs
  Start date: DURING: January 1990


Natural Language Processing-based Hand-written Information Extraction

For unstructured human-written text, some NLP may help:

Part-of-speech (POS) tagging

Mark each word as a noun, verb, preposition, etc.

Syntactic parsing

Identify phrases: NP, VP, PP

Semantic word categories (e.g. from WordNet)

KILL: kill, murder, assassinate, strangle, suffocate


[Figure: a finite automaton over states 0-4 with transitions labeled PN, Art, N, P, and 's, recognizing noun groups such as "John's interesting book with a nice cover"]

Grep++ = Cascaded grepping


Natural Language Processing-based Hand-written Information Extraction

We use cascaded regular expressions to match relations

Higher-level regular expressions can use categories matched by lower-level expressions

E.g. the CRIME-VICTIM pattern can use things matching NOUN-GROUP

This was the basis of the SRI FASTUS system in later MUCs

Example extraction pattern:

Crime victim:
  Prefiller: [POS: V, Hypernym: KILL]
  Filler: [Phrase: NOUN-GROUP]


    Rule-based Extraction Examples

    Determining which person holds what office in what organization

    [person] , [office] of [org]

    Vuk Draskovic, leader of the Serbian Renewal Movement

    [org] (named, appointed, etc.) [person] Prep [office]

    NATO appointed Wesley Clark as Commander in Chief

    Determining where an organization is located

[org] in [loc]

    NATO headquarters in Brussels

    [org] [loc] (division, branch, headquarters, etc.)

    KFOR Kosovo headquarters
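A hedged sketch of the first pattern, "[person], [office] of [org]". Here the entity slots are crudely approximated by runs of capitalized words; a real system would match over NER output instead:

```python
import re

# Capitalized-word runs stand in for NER spans in this toy pattern.
ENT = r"((?:[A-Z][\w-]*\s?)+)"
PERSON_OFFICE_ORG = re.compile(
    ENT + r",\s+(\w[\w\s]*?)\s+of\s+(?:the\s+)?" + ENT)

m = PERSON_OFFICE_ORG.search(
    "Vuk Draskovic, leader of the Serbian Renewal Movement")
person, office, org = (g.strip() for g in m.groups())
print((person, office, org))
# ('Vuk Draskovic', 'leader', 'Serbian Renewal Movement')
```

This illustrates why such rules are written over entity categories rather than raw words: the capitalization heuristic fails as soon as an office name is itself capitalized.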


Information Extraction as Classification


Naïve use of text classification for IE

Use conventional classification algorithms to classify substrings of a document as "to be extracted" or not.

In some simple but compelling domains, this naïve technique is remarkably effective. But do think about when it would and wouldn't work!


    Change of Address email



Change-of-Address detection [Kushmerick et al., ATEM 2001]

From: Robert Kubinsky  Subject: Email update

Hi everyone so.... My new email address is: [email protected] Hope a...

[Figure: a two-stage pipeline. 1. Classification: a message-level naïve Bayes model decides whether the email is a change-of-address message (Yes/No). 2. Extraction: an address-level naïve Bayes model scores each address in the message, e.g. P[[email protected]].]


Change-of-Address detection results [Kushmerick et al., ATEM 2001]

Corpus of 36 CoA emails and 5720 non-CoA emails

Results from 2-fold cross validation (train on half, test on other half)

Very skewed distribution intended to be realistic

Note very limited training data: only 18 training CoA messages

36 CoA messages have 86 email addresses: old, new, and misc.

                            P      R      F1
    Message classification  98%    97%    98%
    Address classification  98%    68%    80%


Sequence Models for Named Entity Recognition


    The ML sequence model approach

    Training

    1. Collect a set of representative training documents

    2. Label each token for its entity class or other (O)

    3. Design feature extractors appropriate to the text and classes

    4. Train a sequence classifier to predict the labels from the data

    Testing

    1. Receive a set of testing documents

    2. Run sequence model inference to label each token

    3. Appropriately output the recognized entities



Encoding classes for sequence labeling

               IO encoding    IOB encoding
    Fred       PER            B-PER
    showed     O              O
    Sue        PER            B-PER
    Mengqiu    PER            B-PER
    Huang      PER            I-PER
    's         O              O
    new        O              O
    painting   O              O
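A minimal decoder from IOB tags back to entity spans (a helper of mine, not from the lecture), which shows why IOB can separate the adjacent "Sue" and "Mengqiu Huang" where plain IO encoding cannot:

```python
def iob_to_entities(tokens, tags):
    """Decode IOB tags into (entity_text, type) pairs. A B- tag always
    starts a new entity, so two adjacent entities stay distinct."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current: entities.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0].append(token)
        else:  # "O" (or an ill-formed I- tag, treated as outside)
            if current: entities.append(current)
            current = None
    if current: entities.append(current)
    return [(" ".join(words), etype) for words, etype in entities]

tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang", "'s", "new", "painting"]
tags   = ["B-PER", "O", "B-PER", "B-PER", "I-PER", "O", "O", "O"]
print(iob_to_entities(tokens, tags))
# [('Fred', 'PER'), ('Sue', 'PER'), ('Mengqiu Huang', 'PER')]
```

Under IO encoding the same sentence would tag Sue, Mengqiu, and Huang all PER, and the decoder could only produce one three-word entity.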


    Features for sequence labeling

    Words

    Current word (essentially like a learned dictionary)

    Previous/next word (context)

    Other kinds of inferred linguistic classification

    Part-of-speech tags

    Label context

    Previous (and perhaps next) label



Features: Word substrings

[Figure: counts of words containing particular substrings, broken down by class (drug, company, movie, place, person), for examples like Cotrimoxazole, Wethersfield, and "Alien Fury: Countdown to Invasion": the substring "oxa" occurs essentially only in drug names, ":" mostly in movie titles, and "field" mostly in place names.]


    Features: Word shapes

Word Shapes

Map words to simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.

    Varicella-zoster   Xx-xxx
    mRNA               xXXX
    CPA1               XXXd
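One simple variant of a word-shape function. It collapses repeated character classes in longer words, so it gives "Xx-x" rather than the slide's "Xx-xxx" for Varicella-zoster; real systems such as Stanford NER use fancier shape functions:

```python
def word_shape(word, short=4):
    # X = uppercase, x = lowercase, d = digit; other chars kept as-is.
    classes = ["X" if c.isupper() else "x" if c.islower()
               else "d" if c.isdigit() else c for c in word]
    if len(classes) <= short:          # short words keep their full shape
        return "".join(classes)
    out = [classes[0]]                  # longer words: collapse class runs
    for c in classes[1:]:
        if c != out[-1]:
            out.append(c)
    return "".join(out)

print(word_shape("mRNA"))              # xXXX
print(word_shape("CPA1"))              # XXXd
print(word_shape("Varicella-zoster"))  # Xx-x
```

Shapes let the model generalize from seen drug names like CPA1 to unseen ones with the same XXXd pattern.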


Maximum entropy sequence models

Maximum entropy Markov models (MEMMs), a.k.a. Conditional Markov models


    Sequence problems

Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences

We can think of our task as one of labeling each item:

    VBG     NN          IN  DT  NN   IN  NN
    Chasing opportunity in  an  age  of  upheaval
    (POS tagging)

    B B I I B I
    (Word segmentation)

    PERS    O         O      O  ORG  ORG
    Murdoch discusses future of News Corp.
    (Named entity recognition)


    MEMM inference in systems

For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions

A larger space of sequences is usually explored via search

Local context (decision point at position 0):

    Position:  -3   -2   -1    0     +1
    Tag:       DT   NNP  VBD   ???   ???
    Word:      The  Dow  fell  22.6  %

Features: W0, W+1, W-1, T-1, T-1-T-2, hasDigit?, ...

(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)


    Example: POS Tagging

Scoring individual labeling decisions is no more complex than standard classification decisions

We have some assumed labels to use for prior positions

We use features of those and the observed data (which can include current, previous, and next words) to predict the current label


    Example: POS Tagging

Features can include:

Current, previous, next words in isolation or together.

Previous one, two, three tags.

Word-internal features: word types, suffixes, dashes, etc.


    Inference in Systems

[Figure: at the local level, local data goes through feature extraction to features, and then into a maximum entropy model (the classifier type), trained with conjugate gradient optimization and smoothed with quadratic penalties; at the sequence level, these local label decisions are combined with the sequence data into a sequence model.]


    Greedy Inference

Greedy inference:

We just start at the left, and use our classifier at each position to assign a label

The classifier can depend on previous labeling decisions as well as observed data

Advantages:

Fast, no extra memory requirements

Very easy to implement

With rich features including observations to the right, it may perform quite well

Disadvantage:

Greedy. We commit to errors that we cannot recover from
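A sketch of greedy decoding; the score function here is a made-up stand-in for the trained classifier's log-probabilities:

```python
def greedy_decode(tokens, labels, score):
    """score(i, prev_label, label) -> local score of `label` at position i,
    given the label already committed at position i-1."""
    seq = []
    for i in range(len(tokens)):
        prev = seq[-1] if seq else "<s>"
        seq.append(max(labels, key=lambda y: score(i, prev, y)))
    return seq

# A toy score table where the greedy choice at position 0 blocks a much
# better continuation: greedy commits to "A" and cannot recover.
s = {(0, "<s>", "A"): 1.0, (0, "<s>", "B"): 0.9,
     (1, "A", "A"): 0.0, (1, "A", "B"): 0.0,
     (1, "B", "A"): 2.0, (1, "B", "B"): 0.0}
print(greedy_decode(["w1", "w2"], ["A", "B"], lambda i, p, y: s[(i, p, y)]))
# ['A', 'A']  (total score 1.0, although ['B', 'A'] would score 2.9)
```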


    Beam Inference

Beam inference:

At each position keep the top k complete sequences.

Extend each sequence in each local way.

The extensions compete for the k slots at the next position.

Advantages:

Fast; beam sizes of 3-5 are almost as good as exact inference in many cases.

Easy to implement (no dynamic programming required).

Disadvantage:

Inexact: the globally best sequence can fall off the beam.
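A sketch of beam inference, with the same kind of made-up local scoring interface a trained classifier would provide:

```python
def beam_decode(tokens, labels, score, k=3):
    """Keep the k best partial label sequences at each position."""
    beam = [((), 0.0)]  # (label sequence, cumulative score)
    for i in range(len(tokens)):
        candidates = [(seq + (y,),
                       total + score(i, seq[-1] if seq else "<s>", y))
                      for seq, total in beam for y in labels]
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return list(beam[0][0])

# With beam size 1 this degenerates to greedy decoding; with a beam of 2
# the globally best sequence for this toy score table survives.
s = {(0, "<s>", "A"): 1.0, (0, "<s>", "B"): 0.9,
     (1, "A", "A"): 0.0, (1, "A", "B"): 0.0,
     (1, "B", "A"): 2.0, (1, "B", "B"): 0.0}
score = lambda i, p, y: s[(i, p, y)]
print(beam_decode(["w1", "w2"], ["A", "B"], score, k=1))  # ['A', 'A']
print(beam_decode(["w1", "w2"], ["A", "B"], score, k=2))  # ['B', 'A']
```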


    Viterbi Inference

Viterbi inference:

Dynamic programming or memoization.

Requires small window of state influence (e.g., past two states are relevant).

Advantage:

Exact: the global best sequence is returned.

Disadvantage:

Harder to implement long-distance state-state interactions (but beam inference tends to allow long-distance resurrection of sequences anyway).
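A compact Viterbi sketch for the first-order case, where the score depends only on the previous label, matching the "small window of state influence" requirement; the score table is the same kind of toy stand-in as above:

```python
def viterbi_decode(tokens, labels, score):
    n = len(tokens)
    # delta[y]: best score of any labeling of tokens[:i+1] that ends in y
    delta = {y: score(0, "<s>", y) for y in labels}
    backptrs = []
    for i in range(1, n):
        new_delta, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: delta[p] + score(i, p, y))
            ptr[y] = prev
            new_delta[y] = delta[prev] + score(i, prev, y)
        backptrs.append(ptr)
        delta = new_delta
    y = max(labels, key=lambda l: delta[l])  # best final label
    seq = [y]
    for ptr in reversed(backptrs):           # follow backpointers
        y = ptr[y]
        seq.append(y)
    return seq[::-1]

# A toy score table on which greedy decoding would fail: Viterbi
# returns the exact global optimum.
s = {(0, "<s>", "A"): 1.0, (0, "<s>", "B"): 0.9,
     (1, "A", "A"): 0.0, (1, "A", "B"): 0.0,
     (1, "B", "A"): 2.0, (1, "B", "B"): 0.0}
print(viterbi_decode(["w1", "w2"], ["A", "B"], lambda i, p, y: s[(i, p, y)]))
# ['B', 'A']
```

The cost is O(n · |labels|²) time, versus O(n · |labels|) for greedy decoding.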


    Viterbi Inference: J&M Ch. 6

I'm punting on this; read J&M Ch. 5/6.

I'll do dynamic programming for parsing.

Basically, providing you only look at neighboring states, you can dynamic program a search for the optimal state sequence.


    Viterbi Inference: J&M Ch. 5/6



CRFs [Lafferty, Pereira, and McCallum 2001]

Another sequence model: Conditional Random Fields (CRFs)

A whole-sequence conditional model rather than a chaining of local models

The space of c's is now the space of sequences, but if the features fi remain local, the conditional sequence likelihood can be calculated using dynamic programming

Training is slower, but CRFs avoid causal-competition biases

These (or a variant using a max margin criterion) are seen as the state of the art these days, but in practice usually work much the same as MEMMs

P(c|d) = exp(Σi λi fi(c,d)) / Σc′ exp(Σi λi fi(c′,d))
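The whole-sequence normalization can be seen by computing it by brute force; the scores here are a made-up toy, and real CRFs compute the denominator with dynamic programming rather than enumeration:

```python
from itertools import product
from math import exp

def crf_prob(labels, n, score, target):
    """P(c|d) = exp(total score of c) / sum over all c' of exp(total
    score of c'), with the partition function computed by enumerating
    every length-n label sequence c'."""
    def total(seq):  # sum of local (prev label, label) feature scores
        return sum(score(i, seq[i - 1] if i else "<s>", seq[i])
                   for i in range(n))
    z = sum(exp(total(seq)) for seq in product(labels, repeat=n))
    return exp(total(target)) / z

s = {(0, "<s>", "A"): 1.0, (0, "<s>", "B"): 0.9,
     (1, "A", "A"): 0.0, (1, "A", "B"): 0.0,
     (1, "B", "A"): 2.0, (1, "B", "B"): 0.0}
score = lambda i, p, y: s[(i, p, y)]
probs = {c: crf_prob(("A", "B"), 2, score, c)
         for c in product(("A", "B"), repeat=2)}
print(probs)  # probabilities over all 4 sequences; they sum to 1
```

Because sequences compete against each other globally (one softmax over all of c'), there is no per-position normalization, which is what lets CRFs avoid the causal-competition (label bias) problem of chained local models.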


The Full Task of Information Extraction


The Full Task of Information Extraction

Information Extraction = segmentation + classification + association

As a family of techniques:

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Now Gates himself says Microsoft will gladly disclose its crown jewels, the coveted code behind the Windows operating system, to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

Extracted segments: Microsoft Corporation, CEO, Bill Gates, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

Slide by Andrew McCallum. Used with permission.


    An Even Broader View

[Figure: an IE pipeline. Spider a document collection, filter by relevance, then run IE (segment, classify, associate, cluster) and load the results into a database for query and search; supporting steps include creating an ontology, labeling training data, and training extraction models.]


Landscape of IE Tasks: Document Formatting

Text paragraphs without formatting:

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University.

Grammatical sentences and some formatting & links

Non-grammatical snippets, rich formatting & links

Tables


Landscape of IE Tasks: Intended Breadth of Coverage

Web site specific (Formatting): Amazon.com book pages

Genre specific (Layout): Resumes

Wide, non-specific (Language): University names

Landscape of IE Tasks: Complexity of entities/relations

Closed set (e.g., U.S. states):

    He was born in Alabama... The big Wyoming sky...

Regular set (e.g., U.S. phone numbers):

    Phone: (413) 545-13...

Complex pattern (e.g., U.S. postal addresses):

    University of Arkansas, P.O. Box 140, Hope, AR 71802
    Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (e.g., person names):

    ...was among the six sold by Hope Feldman...
    Pawel Opalinski, Software Engineer at WhizBang
    The CALD main office is 412-... 1299

Landscape of IE Tasks: Arity of relation

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity (named entity extraction):

    Person: Jack Welch
    Person: Jeffrey Immelt
    Location: Connecticut

Binary relationship:

    Relation: Person-Title
    Person: Jack Welch
    Title: CEO

    Relation: Company-Location
    Company: General Electric
    Location: Connecticut

N-ary relation:

    Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt


Association task = Relation Extraction

Checking if groupings of entities are instances of a relation

1. Manually engineered rules

Rules defined over words/entities: "located in"


Relation Extraction: Disease Outbreaks

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis...

Information Extraction System output:

    Date        Disease Name
    Jan. 1995   Malaria
    July 1995   Mad Cow Disease
    Feb. 1995   Pneumonia
    May 1995    Ebola


Relation Extraction: Protein Interactions

We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.

[Diagram: CBF-A interacts with CBF-C, forming the CBF-A-CBF-C complex; CBF-B associates with the CBF-A-CBF-C complex]

Binary Relation Association as Binary Classification

Christos Faloutsos (Person) conferred with Ted Senator (Person), the KDD 2003 General Chair (Role).

Person-Role (Christos Faloutsos, KDD 2003 General Chair) → NO
Person-Role (Ted Senator, KDD 2003 General Chair) → YES

Resolving coreference (both within and across documents)

John Fitzgerald Kennedy was born at 83 Beals Street in Brookline, Massachusetts on May 29, 1917, at 3:00 pm,[7] the second son of Joseph P. Kennedy, Sr., and Rose Fitzgerald; who, in turn, was the eldest child of John "Honey Fitz" Fitzgerald, a prominent Boston politician who was the city's mayor and a three-term member of Congress. Kennedy lived in Brookline for ten years and attended Edward Devotion School, Noble and Greenough Lower School, and the Dexter School, through 4th grade. In 1927, the family moved to 5040 Independence Avenue in Riverdale, Bronx, New York City; two years later, they moved to 294 Pondfield Road in Bronxville, New York, where Kennedy was a member of Scout Troop 2 (and was the first Boy Scout to become President).[8] Kennedy spent summers with his family at their home in Hyannisport, Massachusetts, and Christmas and Easter holidays with his family at their winter home in Palm Beach, Florida. For the 5th through 7th grade, Kennedy attended Riverdale Country School, a private school for boys. For 8th grade in September 1930, the 13-year-old Kennedy attended Canterbury School in New Milford, Connecticut.



Rough Accuracy of Information Extraction

Errors cascade (an error in entity tagging becomes an error in relation extraction)

These are very rough, actually optimistic, numbers

Hold for well-established tasks, but lower for many specific/novel IE tasks

    Information type   Accuracy
    Entities           90-98%
    Attributes         80%
    Relations          60-70%
    Events             50-60%
