Information Extraction - GATE, JAPE, ANNIE -

Presentation for the advanced seminar „Endliche Automaten“ (Finite Automata), PD Dr. Karin Haenelt, summer semester 2008, Universität Heidelberg

Ching-Yi Sabrina Lin, Shajy Valiath, Torsten Hopp

23.06.2008


23.06.2008 Hauptseminar Endliche Automaten, SS 2008 2

Outline

1. Introduction
   • What is information extraction?
2. GATE
   • Architecture, Design Goals
   • Functionality
3. JAPE
   • Functionality
   • Examples
4. ANNIE
   • Components and how they work
   • Walk-Through Example
5. Summary

Introduction ► What is Information Extraction?

Definition (Wikipedia.org):

"In natural language processing, information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information […] from unstructured machine-readable documents."

□ Domain-specific information from free text
□ Searching and structuring
□ Filtering of irrelevant information

Reference: [Wik08a]

Introduction ► What is Information Extraction?

□ What is relevant?
  ▪ Predefined by domain-specific lexicons or rules
□ Core functionality of an IE system
  ▪ Input:
    • Specification of the type of relevant information (templates) → set of attributes
    • Set of free-text documents
  ▪ Output:
    • Set of instantiated attributes → filled with identified and normalized text fragments

Reference: [Neu01]
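The input/output view above can be illustrated with a minimal Python sketch. It is not taken from [Neu01]: the template, its two attributes (`amount`, `currency`), the regular expressions and the sample documents are all hypothetical, chosen only to show a template being instantiated with normalized text fragments.

```python
import re

# Hypothetical template: attribute name -> (pattern, normaliser).
# Neither the attributes nor the patterns come from the slides.
TEMPLATE = {
    "amount": (re.compile(r"\b\d{1,3}(?:,\d{3})*\b"),
               lambda s: int(s.replace(",", ""))),
    "currency": (re.compile(r"\b(?:US dollars|pounds|euros)\b"), str.lower),
}

def fill_template(template, text):
    """Return one instantiated template: each attribute is filled with the
    first matching text fragment (normalised), or None if nothing matches."""
    filled = {}
    for attribute, (pattern, normalise) in template.items():
        match = pattern.search(text)
        filled[attribute] = normalise(match.group(0)) if match else None
    return filled

documents = [
    "The firm paid 800,000 US dollars for the site.",
    "No deal was announced.",
]
results = [fill_template(TEMPLATE, doc) for doc in documents]
```

A document without relevant information simply yields an empty (all-`None`) template, which corresponds to the filtering of irrelevant information.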


GATE

□ General Architecture for Text Engineering

□ Framework + graphical development environment

□ Current Version: 4.0 (July 2007)

Reference: [Cun+02]

GATE ► Design Goals

□ Separate low-level tasks from language processing algorithms and structures
□ Automate performance measurement of language processing components
□ Provide standard mechanisms to communicate data about language, using standards
  ▪ Java, XML
□ Provide a baseline set of language processing components

Reference: [Cun+02]

GATE ► Architecture

□ Architecture:
  ▪ Defines the organisation of a language engineering (LE) system
  ▪ Ensures component interactions
□ Framework:
  ▪ Reusable design for LE software
  ▪ Prefabricated components
□ Development environment:
  ▪ Helps users build LE systems
  ▪ Debugging mechanisms

Reference: [Cun+02]

GATE ► Components

□ Language Resources (LR)
  ▪ lexicons, corpora, ontologies
□ Processing Resources (PR)
  ▪ algorithms, e.g. parsers, generators, n-gram modellers
□ Visual Resources (VR)
  ▪ visualization + editing (GUI)

Reference: [Cun+02]

GATE ► CREOLE

□ Collection of REusable Objects for Language Engineering
  ▪ repository described by an XML file (name, implementing class, parameters, icons)
  ▪ searched by the framework to discover available resources

Reference: [Cun+02]

GATE ► Data

□ Supports a variety of formats
  ▪ e.g. XML, RTF, HTML, SGML, email, plain text
□ Persistent storage
  ▪ relational database
  ▪ XML-based internal format
  ▪ Java serialisation
□ Export functionality (with annotations)

Reference: [Cun+02]

GATE ► Processing Resources

□ Set of reusable processing resources for common NLP tasks
  ▪ ANNIE → mainly finite state algorithms
□ Used individually or coupled
□ Implementation:
  ▪ Robustness, usability, clear distinction between data and finite state algorithms
  ▪ Easily modifiable by the user
  ▪ Good performance (because of FSAs)

Reference: [Cun+02]

GATE ► Evaluation

□ AnnotationDiff
  ▪ compares a system-annotated text with a reference text
  ▪ figures are generated for each annotation type:
    • precision, recall, F-measure, false positives
□ Benchmarking Tool
  ▪ evaluation over a whole corpus
  ▪ performance statistics

Reference: [Cun+02]
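The figures named above can be computed as in the following Python sketch. It is not GATE's AnnotationDiff implementation: annotations are assumed to be plain `(start, end, type)` tuples, only exact (strict) matches count as correct, and the function names and sample data are hypothetical.

```python
def annotation_diff(system, reference):
    """Compare two sets of (start, end, type) annotations using strict matching."""
    correct = system & reference            # found by the system and in the reference
    false_positives = system - reference    # found by the system only
    missing = reference - system            # in the reference only
    precision = len(correct) / len(system) if system else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "false_positives": len(false_positives),
        "missing": len(missing),
    }

def diff_per_type(system, reference):
    """Generate the same figures separately for each annotation type."""
    types = {ann[2] for ann in system | reference}
    return {
        t: annotation_diff({a for a in system if a[2] == t},
                           {a for a in reference if a[2] == t})
        for t in sorted(types)
    }

# Hypothetical sample data.
reference = {(0, 5, "Person"), (10, 16, "Location"), (20, 24, "Date")}
system = {(0, 5, "Person"), (10, 16, "Location"), (30, 34, "Date"), (40, 45, "Person")}
overall = annotation_diff(system, reference)
by_type = diff_per_type(system, reference)
```

Here the F-measure is the balanced harmonic mean of precision and recall; a weighted variant would need an extra parameter.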


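JAPE

□ Java Annotation Patterns Engine, one of the facilities of GATE
□ A version of CPSL (Common Pattern Specification Language)
□ GATE's IE tools use JAPE
□ Language for writing regular expressions (RE) over annotations
□ Provides finite state transduction (FST) over annotations
□ Recognises regular expressions in annotations on documents

Reference: [Cun+08]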

JAPE ► Grammar

□ A JAPE grammar consists of a set of phases, each of which consists of a set of pattern rules
□ Each rule consists of a left-hand side (LHS) and a right-hand side (RHS)
□ The LHS contains an annotation pattern that may contain regular expression operators
  ▪ e.g. +, ?, *
□ The RHS consists of annotation manipulation statements
□ A label is used to refer to the LHS match from the RHS; it is denoted by a preceding colon (:)
  ▪ e.g. :location

Reference: [Cun+08]

JAPE ► Grammar

□ A pattern can be specified in 3 ways:
  1. Specify a string of text, e.g. {Token.string == "of"}
  2. Specify an annotation previously assigned by a gazetteer, tokeniser, or other module, e.g. {Lookup}
  3. Specify the attributes of an annotation, e.g. {Token.kind == number}

Reference: [Cun+08]

JAPE ► Grammar

□ The RHS of the rule contains information about the annotation to be created.
□ The matched text is passed from the LHS to the RHS using labels
□ It is annotated with the entity type
□ Finally, attributes and their corresponding values are added to the annotation
□ Alternatively, the RHS can contain Java code to create or manipulate annotations.

Reference: [Cun+08]

JAPE ► Grammar

□ A grammar has several options, such as "Control" and "Debug"
□ Control defines the method of rule matching
□ Debug displays conflicts on the standard output
□ Input annotations must be declared at the beginning of the grammar

Reference: [Cun+08]

JAPE ► Grammar (Example)

    Rule: GazLocation
    (
      {Lookup.majorType == location}
    )
    :location -->
      :location.Enamex = {kind = "location", rule = GazLocation}

□ The matched pattern is given an annotation of type "Enamex" (an entity name).
□ "kind" is an attribute with value "location"
□ "rule" is another attribute with value "GazLocation"

Reference: [Cun+08]
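What the GazLocation rule does can be mimicked outside GATE with a few lines of Python. This is only a sketch of the rule's effect, not of the JAPE engine: the `Annotation` class, the function name and the sample offsets are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Minimal stand-in for an annotation: a type, a text span and features."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)

def gaz_location(annotations):
    """Mimic Rule GazLocation: every Lookup with majorType == location
    yields an Enamex annotation over the same span (the :location label)."""
    created = []
    for ann in annotations:
        if ann.type == "Lookup" and ann.features.get("majorType") == "location":
            created.append(Annotation("Enamex", ann.start, ann.end,
                                      {"kind": "location", "rule": "GazLocation"}))
    return created

# Hypothetical gazetteer output for some text.
lookups = [
    Annotation("Lookup", 11, 21, {"majorType": "location"}),
    Annotation("Lookup", 26, 29, {"majorType": "organization"}),
]
created = gaz_location(lookups)
```

Only the first Lookup satisfies the LHS constraint, so exactly one Enamex annotation is created.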

JAPE ► Grammar Rule

□ Grammar rules are essentially of two types:
  ▪ First type
    • No gazetteer lookup
    • Can be defined using a set of formats
    • Little potential for ambiguity
  ▪ Second type
    • Rely more on gazetteer lists
    • Wide range of possibilities
    • Greater potential for ambiguity

Reference: [Cun+08]

JAPE ► Grammar Rule

    Rule: IPAddress
    (
      {Token.kind == number}
      {Token.string == "."}
      {Token.kind == number}
      {Token.string == "."}
      {Token.kind == number}
      {Token.string == "."}
      {Token.kind == number}
    )
    :ipAddress -->
      :ipAddress.Address = {kind = "ipAddress"}

□ A single rule is sufficient to identify an IP address, because there is only one basic format
□ A series of numbers, each set connected by a dot.

Reference: [Cun+08]
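The IPAddress rule is a fixed sequence of seven token constraints, which the following Python sketch reproduces. The tokeniser here is a simplified stand-in, not ANNIE's, and all function names are hypothetical. Like the JAPE rule, the sketch does not check that each number lies between 0 and 255.

```python
import re

def tokenise(text):
    """Split text into number, word and punctuation tokens (whitespace is skipped)."""
    tokens = []
    for m in re.finditer(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", text):
        s = m.group(0)
        kind = "number" if s.isdigit() else "word" if s.isalpha() else "punctuation"
        tokens.append({"string": s, "kind": kind, "start": m.start(), "end": m.end()})
    return tokens

def is_number(token):
    return token["kind"] == "number"

def is_dot(token):
    return token["string"] == "."

# The LHS of the rule: number . number . number . number
PATTERN = [is_number, is_dot, is_number, is_dot, is_number, is_dot, is_number]

def find_ip_addresses(text):
    """Slide the seven-token pattern over the token stream."""
    tokens = tokenise(text)
    found = []
    i = 0
    while i + len(PATTERN) <= len(tokens):
        window = tokens[i:i + len(PATTERN)]
        if all(test(tok) for test, tok in zip(PATTERN, window)):
            start, end = window[0]["start"], window[-1]["end"]
            found.append({"type": "Address", "kind": "ipAddress",
                          "start": start, "end": end, "text": text[start:end]})
            i += len(PATTERN)   # continue after the matched region
        else:
            i += 1
    return found

addresses = find_ip_addresses("Ping 192.168.0.1 or 10.0.0.254 now.")
```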

JAPE ► Grammar Rule

□ The same date information can appear in many formats:
  ▪ Mon, 23/8/08
  ▪ Mon, 23/Aug/08
  ▪ Mon, 23 August, 2008
  ▪ Mon 23rd of August, 2008
  ▪ Mon. August 23rd, 2008
  ▪ Mon 23 August 2008
□ There are many possible variations, so many rules are needed to identify a date or time

Reference: [Cun+08]

JAPE ► Grammar Rule

□ Different types of date information can also be expressed:
  ▪ The late '80s
  ▪ Monday
  ▪ St. Andrew's day
  ▪ 99 BC
  ▪ Mid-November
  ▪ 1980-81
  ▪ From March to April
□ These are all classified as date entities.

Reference: [Cun+08]

JAPE ► Use of context

□ Dealing with context in grammar rules
□ The part of the pattern to be annotated is enclosed in a set of round brackets
□ Preceding context can be included in the rule; it is placed before this set of brackets
□ If context following the pattern needs to be included, it is placed after the label given to the annotation

Reference: [Cun+08]

JAPE ► Use of context

    Rule: YearContext
    (
      {Token.string == "in"} |
      {Token.string == "by"}
    )
    (YEAR)
    :date -->
      :date.Timex = {kind = "date", rule = "YearContext"}

□ Assuming an appropriate macro for "YEAR"
□ A year is only recognised if it is preceded by the word "in" or "by"

Reference: [Cun+08]

JAPE ► Use of context

    Rule: EmailAddress
    ( {Token.string == "<"} )
    (
      (EMAIL)
    )
    :email
    ( {Token.string == ">"} )
    -->
      :email.Address = {kind = "email", rule = "EmailAddress"}

□ Assuming an appropriate macro for "EMAIL"
□ An email address is only recognised if it occurs inside angle brackets.

Reference: [Cun+08]
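The effect of context in the EmailAddress rule, where the brackets must be present but only the labelled part is annotated, can be sketched with a Python regular expression. The `EMAIL` pattern below is a deliberately simple, hypothetical stand-in for the macro that the slide assumes; it is not GATE's definition.

```python
import re

# Hypothetical stand-in for the EMAIL macro (far from a complete address grammar).
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

# "<" and ">" are context: they must match, but lie outside the labelled group.
RULE = re.compile(r"<(?P<email>" + EMAIL + r")>")

def annotate_emails(text):
    """Create an Address annotation over the labelled group only."""
    return [
        {"type": "Address", "kind": "email", "rule": "EmailAddress",
         "start": m.start("email"), "end": m.end("email")}
        for m in RULE.finditer(text)
    ]

text = "Write to <info@example.org> or to admin@example.org."
annotations = annotate_emails(text)
```

The second address is not annotated because the required context is missing, and the annotation of the first one excludes the angle brackets themselves.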

JAPE ► Use of priority

□ A grammar has 5 control styles
  ▪ "brill", "all", "first", "once" and "appelt"
□ The style is specified at the beginning of the grammar

Reference: [Cun+08]

JAPE ► Use of priority

□ Brill → when more than one rule matches the same region of the document, they are all fired
□ All → similar to Brill in that it also executes all matching rules, but matching continues from the offset following the current one
□ First → a rule fires for the first match that is found
□ Once → after the first rule has fired, the whole JAPE phase exits
□ Appelt → only one rule can be fired for the same region of text, according to a set of priority rules

Reference: [Cun+08]

JAPE ► Use of priority

□ Priority rules:
  1. From all the rules that match a region of the document starting at some point X, the one which matches the longest region is fired
  2. If more than one rule matches the same region, the one with the highest priority is fired
  3. If there is more than one rule with the same priority, the one defined earlier in the grammar is fired

Reference: [Cun+08]
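The three priority rules amount to a lexicographic comparison, as the following Python sketch shows. The `Match` class and its field names are hypothetical; the sketch only selects among candidate matches that start at the same point X and does not model the rest of the JAPE engine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    rule: str       # name of the rule that matched
    start: int      # all candidates are assumed to share the same start X
    end: int
    priority: int   # higher value = higher priority
    order: int      # position of the rule in the grammar (0 = defined first)

def appelt_select(matches):
    """Pick the single match to fire: longest region first, then highest
    priority, then the rule defined earliest in the grammar."""
    return min(matches, key=lambda m: (-(m.end - m.start), -m.priority, m.order))

candidates = [
    Match("A", 0, 5, priority=10, order=0),   # shorter region: loses despite priority
    Match("B", 0, 9, priority=1, order=1),    # longest, but lower priority
    Match("C", 0, 9, priority=5, order=2),    # longest, highest priority, earlier than D
    Match("D", 0, 9, priority=5, order=3),
]
winner = appelt_select(candidates)
```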


ANNIE

□ ANNIE: A Nearly-New Information Extraction System
  ▪ ANNIE is an IE system
    • that is distributed with GATE
    • that relies on finite state algorithms and the JAPE language
  ▪ ANNIE has 5 components:
    • Tokeniser
    • Gazetteer
    • Sentence Splitter
    • Semantic Tagger
    • Orthographic Coreference (NameMatcher)

Reference: [Cun+08]

ANNIE ► Components (figure)

Reference: [Cun+08]

ANNIE ► Components: Tokeniser I

□ Tokeniser:
  ▪ Splits the text into very simple components, tokens, of different types such as
    • number: any combination of consecutive digits
    • punctuation: start-punctuation "(", end-punctuation ")", and other punctuation ":" …
    • symbol: currency symbols "$", "£" … and other symbols "&", "#" …
    • SpaceToken: space tokens and control tokens
      – space token: pure space characters
      – control token: control characters

Reference: [Cun+08]

ANNIE ► Components: Tokeniser II

□ Tokeniser:
  ▪ word: any set of contiguous upper- or lowercase letters, including a hyphen. A word also has the "orth" attribute:
    • upperInitial - initial letter is uppercase, and the rest are lowercase
    • allCaps - all uppercase letters
    • lowerCase - all lowercase letters
    • mixedCaps - any mixture of upper and lowercase letters

Reference: [Cun+08]
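The token types and the "orth" attribute described above can be approximated with the following Python sketch. It is a simplification, not ANNIE's rule set: it handles ASCII letters only, treats a run of spaces or tabs as one space token, uses a small hypothetical symbol set, classifies a single capital letter as upperInitial, and spells the lowercase value as "lowercase" (the spelling used in the walk-through output later in the deck).

```python
import re

TOKEN_RE = re.compile(
    r"(?P<word>[A-Za-z]+(?:-[A-Za-z]+)*)"   # letters, possibly with hyphens
    r"|(?P<number>\d+)"                     # consecutive digits
    r"|(?P<space>[ \t]+)"                   # space characters
    r"|(?P<control>[\x00-\x1f])"            # remaining control characters
    r"|(?P<other>.)",                       # symbols and punctuation
    re.DOTALL,
)
SYMBOLS = set("$£€&#")   # hypothetical, incomplete symbol set

def orth(word):
    """Orthographic class of a word token (hyphens are ignored)."""
    letters = word.replace("-", "")
    if letters[0].isupper() and (len(letters) == 1 or letters[1:].islower()):
        return "upperInitial"
    if letters.isupper():
        return "allCaps"
    if letters.islower():
        return "lowercase"
    return "mixedCaps"

def tokenise(text):
    annotations = []
    for m in TOKEN_RE.finditer(text):
        s, group = m.group(0), m.lastgroup
        ann = {"type": "Token", "string": s, "length": len(s)}
        if group == "word":
            ann.update(kind="word", orth=orth(s))
        elif group == "number":
            ann["kind"] = "number"
        elif group in ("space", "control"):
            ann.update(type="SpaceToken", kind=group)
        else:
            ann["kind"] = "symbol" if s in SYMBOLS else "punctuation"
        annotations.append(ann)
    return annotations

tokens = tokenise("800,000 US dollars")
```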

ANNIE ► Components: Tokeniser III

□ Tokeniser rules:
  ▪ A rule has a left-hand side (LHS) and a right-hand side (RHS), separated by the ">" symbol. The LHS is a regular expression that has to match the input. The RHS describes the annotations to be added to the AnnotationSet.
  ▪ LHS operators:
    • | → or
    • * → 0 or more occurrences
    • ? → 0 or 1 occurrences
    • + → 1 or more occurrences

Reference: [Cun+08]

ANNIE ► Components: Tokeniser IV

□ Tokeniser rules:
  ▪ The RHS uses ";" as a separator, so a rule has the following format:
    • {LHS} > {Annotation type};{attribute1}={value1};…;{attributeN}={valueN}
□ Example:
  ▪ A tokeniser rule for a word beginning with a single capital letter:
    • "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

Reference: [Cun+08]
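The rule format above can be taken apart mechanically, as this small Python sketch shows. It only splits a rule string into its parts and does not interpret the LHS regular expression; the function name is hypothetical, and the sketch assumes that the LHS itself contains no ">" character.

```python
def parse_tokeniser_rule(rule):
    """Split '{LHS} > {type};{attr}={value};...' into its components."""
    lhs, rhs = rule.split(">", 1)                      # assumes no ">" inside the LHS
    parts = [p.strip() for p in rhs.split(";") if p.strip()]
    attributes = {}
    for part in parts[1:]:
        name, value = part.split("=", 1)
        attributes[name.strip()] = value.strip()
    return {"lhs": lhs.strip(), "type": parts[0], "attributes": attributes}

parsed = parse_tokeniser_rule(
    '"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;'
)
```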

ANNIE ► Components: Gazetteer I

□ Gazetteer:
  ▪ The gazetteer lists used here are plain text files, with one entry per line. Each list represents a set of names, such as city names, organizations, days of the week…
  ▪ Example: list of units of currency

    Euro
    New Taiwan dollar
    New Taiwan dollars
    Pound
    Pounds
    Dollar
    Dollars

Reference: [Cun+08]

ANNIE ► Components: Gazetteer II

□ Gazetteer files:
  ▪ In order to access the lists, an index file (lists.def) is used. For each list, a major type and, optionally, a minor type are specified. In the following examples, the first column is the list name, the second column the major type, and the third the minor type. These lists are compiled into finite state machines.
    • currency_prefix.lst : currency_unit : pre_amount
    • day.lst : date : day

Reference: [Cun+08]
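Reading such an index file can be sketched in Python as follows. The function names, the in-memory list contents and the extra `city.lst` entry are hypothetical, and a plain dictionary stands in for the finite state machines that GATE compiles the lists into.

```python
def parse_lists_def(lines):
    """Parse index lines of the form 'list : majorType [: minorType]'."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = [p.strip() for p in line.split(":")]
        minor = parts[2] if len(parts) > 2 and parts[2] else None
        index[parts[0]] = {"majorType": parts[1], "minorType": minor}
    return index

def build_lookup_table(index, list_contents):
    """Map every list entry to the types of the list(s) it appears in.
    A dict is used here instead of a compiled finite state machine."""
    table = {}
    for list_name, types in index.items():
        for entry in list_contents.get(list_name, []):
            table.setdefault(entry, []).append(types)
    return table

index = parse_lists_def([
    "currency_prefix.lst : currency_unit : pre_amount",
    "day.lst : date : day",
    "city.lst:location",            # hypothetical entry without a minor type
])
table = build_lookup_table(index, {
    "currency_prefix.lst": ["$", "£"],      # hypothetical list contents
    "day.lst": ["Monday", "Tuesday"],
})
```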

ANNIE ► Components: Gazetteer III

□ Flexibility:
  ▪ ANNIE allows arbitrary feature values to be associated with particular entries in a single list when the optional gazetteerFeatureSeparator parameter is set to a single character.
  ▪ Example (separator "&"):

    Software_company.lst:company:software

    Red Hat&stockSymbol=RHAT
    Apple Computer&abbrev=Apple&stockSymbol=AAPL

Reference: [Cun+08]
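How an entry line is split once a feature separator is set can be sketched in a few lines of Python. The function name is hypothetical, and "&" is assumed as the separator because it is the character used in the example entries.

```python
def parse_entry(line, separator="&"):
    """Split 'name<sep>feature=value<sep>...' into the entry and its features."""
    name, *pairs = line.split(separator)
    features = dict(pair.split("=", 1) for pair in pairs)
    return name, features

entries = [parse_entry(line) for line in [
    "Red Hat&stockSymbol=RHAT",
    "Apple Computer&abbrev=Apple&stockSymbol=AAPL",
]]
```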

ANNIE ► Components: Sentence Splitter I

□ Sentence Splitter:
  ▪ A cascade of finite-state transducers that segments the text into sentences; this segmentation is required by the tagger.
  ▪ The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.
  ▪ The sentence splitter is domain- and application-independent.

Reference: [Cun+08]

ANNIE ► Components: Sentence Splitter II

□ Annotations:
  ▪ Each sentence is annotated with the type Sentence.
  ▪ Each sentence break is also given a Split annotation
    • possible types include ".", "punctuation", "CR" (a line break), "multi" …
□ An alternative sentence splitter is the RegEx Sentence Splitter. It is intended to improve execution time and robustness, especially when faced with irregular input.

Reference: [Cun+08]
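The role of the abbreviation list can be illustrated with a small regex-based Python sketch. It is neither ANNIE's transducer cascade nor the RegEx Sentence Splitter: the abbreviation set is hypothetical, only ".", "!" and "?" are treated as possible breaks, and Sentence and Split annotations are reduced to `(start, end)` offset pairs.

```python
import re

ABBREVIATIONS = {"Dr", "Prof", "e.g", "etc"}   # hypothetical abbreviation list

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Return (sentences, splits) as lists of (start, end) offsets."""
    sentences, splits = [], []
    start = 0
    for m in re.finditer(r"[.!?]", text):
        end = m.end()
        # A break must be followed by whitespace or by the end of the text.
        if end < len(text) and not text[end].isspace():
            continue
        # A full stop after a known abbreviation does not end the sentence.
        words = text[start:m.start()].split()
        if m.group(0) == "." and words and words[-1] in abbreviations:
            continue
        sentences.append((start, end))
        splits.append((m.start(), end))
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):                       # trailing text without a final break
        sentences.append((start, len(text)))
    return sentences, splits

text = "Dr. Haenelt teaches in Heidelberg. The seminar covers automata."
sentences, splits = split_sentences(text)
```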

ANNIE ► Components: Semantic Tagger

□ Semantic Tagger:
  ▪ ANNIE's semantic tagger is based on the JAPE language.
  ▪ It contains rules which act on annotations assigned in earlier phases, in order to produce outputs of annotated entities.

Reference: [Cun+08]

ANNIE ► Components: Orthographic Coreference I

□ Orthographic Coreference (NameMatcher):
  ▪ It adds identity relations between named entities found by the semantic tagger, in order to perform coreference.
  ▪ It does not find new named entities, but it may assign a type to an unclassified proper name, using the type of a matching name.
  ▪ The matching rules are only invoked if the names being compared are both of the same type, or if one of them is classified as unknown.

Reference: [Cun+08]

ANNIE ► Components: Orthographic Coreference II

□ GATE interface
  ▪ Input: entity annotations with an id attribute
  ▪ Output: matches attributes added to the existing entity annotations
□ Resources
  ▪ A lookup table of aliases: records non-matching strings that represent the same entity, e.g. "IBM" and "Big Blue" or "Coca-Cola" and "Coke"
  ▪ A table of spurious matches: matching strings which do not represent the same entity, e.g. "BT Wireless" and "BT Cellnet"
□ Processing
  ▪ An array of the strings, types and IDs of all name annotations is built by the wrapper
  ▪ It is then passed to a string comparison function for pairwise comparison of all entries

Reference: [Cun+08]
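The interplay of the type check, the two tables and the pairwise comparison can be sketched in Python as follows. This is not the NameMatcher's actual rule set: the single orthographic rule used here (same first word, ignoring case) is a deliberately crude, hypothetical stand-in that shows why a table of spurious matches is needed, and the entity records are made up for illustration.

```python
from itertools import combinations

# The two resources, filled with the examples from the slide.
ALIASES = {frozenset({"IBM", "Big Blue"}), frozenset({"Coca-Cola", "Coke"})}
SPURIOUS = {frozenset({"BT Wireless", "BT Cellnet"})}

def names_match(a, b):
    """String comparison for one pair of (non-empty) names."""
    pair = frozenset({a, b})
    if pair in SPURIOUS:          # would match orthographically, but must not
        return False
    if pair in ALIASES:           # would not match orthographically, but must
        return True
    if a.lower() == b.lower():
        return True
    # Hypothetical stand-in rule: the names share their first word.
    return a.split()[0].lower() == b.split()[0].lower()

def coreference(entities):
    """Pairwise comparison of all entries; adds a 'matches' list of ids."""
    entities = [dict(e, matches=[]) for e in entities]
    for a, b in combinations(entities, 2):
        # Rules are only invoked for equal types or if one type is Unknown.
        if a["type"] != b["type"] and "Unknown" not in (a["type"], b["type"]):
            continue
        if names_match(a["string"], b["string"]):
            a["matches"].append(b["id"])
            b["matches"].append(a["id"])
            # An unclassified name inherits the type of its match.
            if a["type"] == "Unknown":
                a["type"] = b["type"]
            elif b["type"] == "Unknown":
                b["type"] = a["type"]
    return entities

result = coreference([
    {"id": 1, "string": "IBM", "type": "Organization"},
    {"id": 2, "string": "Big Blue", "type": "Unknown"},
    {"id": 3, "string": "BT Wireless", "type": "Organization"},
    {"id": 4, "string": "BT Cellnet", "type": "Organization"},
])
```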

ANNIE

A Walk-Through Example

ANNIE ► A walk-through example

□ A 3-stage procedure using the tokeniser, gazetteer and named-entity grammar.
□ We wish to recognize the phrase "800,000 US dollars" as an entity of type "Number", with the feature "money".
□ We first give an example of a grammar rule (and corresponding macros) for money, which will recognize this type of pattern:

Reference: [Cun+08]

ANNIE ► A walk-through example: Macros

□ Macro for recognizing million/billion:

    Macro: MILLION_BILLION
    (
      {Token.string == "m"} |
      {Token.string == "million"} |
      {Token.string == "b"} |
      {Token.string == "billion"}
    )

□ Macro for recognizing an amount:

    Macro: AMOUNT_NUMBER
    (
      {Token.kind == number}
      (
        ( {Token.string == ","} | {Token.string == "."} )
        {Token.kind == number}
      )*
      (
        ( {SpaceToken.kind == space} )?
        (MILLION_BILLION)?
      )
    )

Reference: [Cun+08]

ANNIE ► A walk-through example: Rule

□ Define a grammar rule:

    Rule: Money1
    // e.g. 30 pounds
    (
      (AMOUNT_NUMBER)
      ( {SpaceToken.kind == space} )?
      ( {Lookup.majorType == currency_unit} )
    )
    :money -->
      :money.Number = {kind = "money", rule = "Money1"}

Reference: [Cun+08]

ANNIE ► A walk-through example

□ Step 1 - Tokenisation: "800,000 US dollars"
  ▪ The tokeniser separates the given phrase into the following tokens (numbers, punctuation, words and space tokens):
    • Token, string = "800", kind = number, length = 3
    • Token, string = ",", kind = punctuation, length = 1
    • Token, string = "000", kind = number, length = 3
    • SpaceToken, string = " ", kind = space, length = 1
    • Token, string = "US", kind = word, length = 2, orth = allCaps
    • SpaceToken, string = " ", kind = space, length = 1
    • Token, string = "dollars", kind = word, length = 7, orth = lowercase

Reference: [Cun+08]

ANNIE ► A walk-through example

□ Step 2 - List lookup:
  ▪ The gazetteer lists are then searched to find all occurrences of matching words in the text.
  ▪ The following match is found for the string "US dollars".
  ▪ Lookup:
    • minorType = post_amount
    • majorType = currency_unit

Reference: [Cun+08]

ANNIE ► A walk-through example

□ Step 3 - Grammar Rules I
  ▪ The grammar rule for money is then invoked.
    • The macro MILLION_BILLION matches nothing, since the given phrase contains none of the specified strings.
    • The macro AMOUNT_NUMBER recognizes "800,000" as a number followed by a sequence of the form comma plus number, followed by an optional space.
    • The rule Money1 then recognizes the string identified by the macro AMOUNT_NUMBER, followed by an optional space, followed by a unit of currency.

Reference: [Cun+08]

ANNIE ► A walk-through example

□ Step 3 - Grammar Rules II
    • In our case, "US dollars" is recognized as a currency unit, so the rule Money1 recognizes the entire string "800,000 US dollars".
    • Finally, it is annotated as a Number entity of kind money:
  ▪ Result:
    "Number, kind = money, rule = Money1"

Reference: [Cun+08]
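The three stages of the walk-through can be strung together in one Python sketch. It is a simplified simulation, not ANNIE: the tokeniser, the currency list, the function names and the greedy matching without backtracking are all assumptions made for illustration; only the macros and the rule Money1 follow the slides.

```python
import re

CURRENCY_UNITS = ["US dollars", "dollars", "pounds", "euros"]   # hypothetical list
MILLION_BILLION = {"m", "million", "b", "billion"}

def tokenise(text):
    """Stage 1: Token and SpaceToken annotations with offsets."""
    tokens = []
    for m in re.finditer(r"[A-Za-z]+|\d+|\s|[^\sA-Za-z\d]", text):
        s = m.group(0)
        kind = ("space" if s.isspace() else "number" if s.isdigit()
                else "word" if s.isalpha() else "punctuation")
        tokens.append({"string": s, "kind": kind,
                       "start": m.start(), "end": m.end()})
    return tokens

def lookup(text):
    """Stage 2: Lookup annotations for every gazetteer entry found in the text."""
    return [{"majorType": "currency_unit", "minorType": "post_amount",
             "start": m.start(), "end": m.end()}
            for entry in CURRENCY_UNITS
            for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text)]

def amount_number(tokens, i):
    """Macro AMOUNT_NUMBER: index after the match starting at token i, or None."""
    if i >= len(tokens) or tokens[i]["kind"] != "number":
        return None
    i += 1
    while (i + 1 < len(tokens) and tokens[i]["string"] in {",", "."}
           and tokens[i + 1]["kind"] == "number"):
        i += 2
    if i < len(tokens) and tokens[i]["kind"] == "space":          # optional space
        i += 1
    if i < len(tokens) and tokens[i]["string"] in MILLION_BILLION:
        i += 1                                                    # optional macro
    return i

def money1(text):
    """Stage 3: Rule Money1 = AMOUNT_NUMBER, optional space, currency Lookup."""
    tokens, lookups = tokenise(text), lookup(text)
    results, i = [], 0
    while i < len(tokens):
        j = amount_number(tokens, i)
        if j is not None and j < len(tokens) and tokens[j]["kind"] == "space":
            j += 1
        ends = [] if j is None or j >= len(tokens) else [
            l["end"] for l in lookups if l["start"] == tokens[j]["start"]]
        if not ends:
            i += 1
            continue
        start, end = tokens[i]["start"], max(ends)
        results.append({"type": "Number", "kind": "money", "rule": "Money1",
                        "text": text[start:end]})
        while i < len(tokens) and tokens[i]["start"] < end:       # skip matched region
            i += 1
    return results

numbers = money1("He paid 800,000 US dollars.")
```

The annotation covers the amount, the space and the longest currency Lookup starting at that position, which mirrors the result shown on the last walk-through slide.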


Summary

□ GATE
  ▪ Framework + graphical development environment
  ▪ Provides a set of language processing components
□ JAPE
  ▪ Used by components to define grammars and rules
  ▪ Patterns → regular expressions
□ ANNIE
  ▪ Information Extraction system
  ▪ Several components (e.g. Tokeniser, Sentence Splitter)
  ▪ Uses JAPE and finite state algorithms


Thanks for your attention!

Questions?


References

[Neu01] G. Neumann: "Informationsextraktion." In: Klabunde et al. (eds.): "Computerlinguistik und Sprachtechnologie - Eine Einführung." Spektrum Akademischer Verlag, Heidelberg, 2001, ISBN 3-8274-1027-4.

[Wik08a] Wikipedia contributors: "Information extraction." Wikipedia, The Free Encyclopedia, 14 May 2008, 16:53 UTC, http://en.wikipedia.org/wiki/Information_extraction [accessed 1 June 2008].

[Cun+02] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan: "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications." In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, July 2002.

[Cun+08] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, C. Ursu, M. Dimitrov, M. Dowman, N. Aswani, I. Roberts, Y. Li, A. Shafirin, A. Funk: "Developing Language Processing Components with GATE Version 4 (a User Guide)." Online at http://gate.ac.uk/sale/tao/tao.pdf [accessed 13 June 2008].