
23.06.2008

Information Extraction - GATE, JAPE, ANNIE -

Presentation for the advanced seminar "Endliche Automaten" (Finite Automata), PD Dr. Karin Haenelt, summer semester 2008, Universität Heidelberg

Ching-Yi Sabrina Lin, Shajy Valiath, Torsten Hopp


Outline

1. Introduction
   • What is information extraction?

2. GATE
   • Architecture, design goals
   • Functionality

3. JAPE
   • Functionality
   • Examples

4. ANNIE
   • Components and how they work
   • Walk-through example

5. Summary


Introduction ► What is Information Extraction?

Definition (Wikipedia.org):

In natural language processing, information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information […] from unstructured machine-readable documents.

□ Domain-specific information from free text

□ Searching and structuring

□ Filtering of irrelevant information

Reference: [Wik08a]


Introduction ► What is Information Extraction?

□ What is relevant?

  ▪ Predefined by domain-specific lexicons or rules

□ Core functionality of an IE system

  ▪ Input:

    • Specification of the type of relevant information (templates) → set of attributes

    • Set of free-text documents

  ▪ Output:

    • Set of instantiated attributes → filled with identified and normalized text fragments

Reference: [Neu01]




GATE

□ General Architecture for Text Engineering

□ Framework + graphical development environment

□ Current Version: 4.0 (July 2007)

Reference: [Cun+02]


GATE ► Design Goals

□ Separating low-level tasks from language processing algorithms and data structures

□ Automating performance measurement of language processing components

□ Providing standard mechanisms for components to communicate data about language, using open standards

  ▪ Java, XML

□ Providing a baseline set of language processing components

Reference: [Cun+02]


GATE ► Architecture

□ Architecture:

  ▪ Defines the organisation of a language engineering (LE) system

  ▪ Ensures component interactions

□ Framework:

  ▪ Reusable design for LE software

  ▪ Prefabricated components

□ Development environment:

  ▪ Helps users build LE systems

  ▪ Debugging mechanisms

Reference: [Cun+02]


GATE ► Components

□ Language Resources (LR)

  ▪ lexicons, corpora, ontologies

□ Processing Resources (PR)

  ▪ algorithms, e.g. parsers, generators, n-gram modellers

□ Visual Resources (VR)

  ▪ visualization + editing (GUI)

Reference: [Cun+02]


GATE ► CREOLE

□ Collection of REusable Objects for Language Engineering

  ▪ Repository described by an XML file (name, implementing class, parameters, icons)

  ▪ Searched by the framework to discover available resources

Reference: [Cun+02]


GATE ► Data

□ Supports a variety of formats

  ▪ e.g. XML, RTF, HTML, SGML, email, plain text

□ Persistent storage

  ▪ relational database

  ▪ XML-based internal format

  ▪ Java serialisation

□ Export functionality (with annotations)

Reference: [Cun+02]


GATE ► Processing Resources

□ Set of reusable processing resources for common NLP tasks

  ▪ ANNIE → mainly finite-state algorithms

□ Used individually or coupled

□ Implementation:

  ▪ Robustness, usability, clear distinction between data and finite-state algorithms

  ▪ Easily modifiable by the user

  ▪ Good performance (because of finite-state automata)

Reference: [Cun+02]


GATE ► Evaluation

□ AnnotationDiff

  ▪ Compares a system-annotated text with a reference text

  ▪ For each annotation type, the following figures are generated:

    • precision, recall, F-measure, false positives

□ Benchmarking Tool

  ▪ Evaluation over a whole corpus

  ▪ Performance statistics

Reference: [Cun+02]
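The figures listed for AnnotationDiff can be illustrated with a minimal sketch. It assumes annotations are plain `(start, end, type)` tuples and that only exact matches count as correct; the function name `score` is hypothetical and this is not GATE's implementation.

```python
def score(key, response):
    """Compare system annotations (response) against reference annotations (key).

    Both arguments are collections of (start, end, type) tuples.
    Assumption: an annotation is correct only if it matches the key exactly.
    """
    key, response = set(key), set(response)
    correct = len(key & response)
    false_positives = len(response - key)   # produced by the system, not in the key
    missing = len(key - response)           # in the key, not produced by the system

    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "false_positives": false_positives,
        "missing": missing,
    }
```

Running the function once per annotation type gives per-type figures; averaging over all documents gives corpus-level figures of the kind a benchmarking tool would report.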




JAPE

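□ One of the facilities of GATE

□ A version of CPSL (Common Pattern Specification Language)

□ IE tools in GATE use JAPE

□ Language for writing regular expressions (RE) over annotations

□ Provides finite-state transduction (FST) over annotations

□ Used to recognize regular expressions in annotations on documents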

Reference: [Cun+08]


JAPE ► Grammar

□ A JAPE grammar consists of a set of phases, each of which consists of a set of pattern rules

□ Each rule consists of a left-hand side (LHS) and a right-hand side (RHS)

□ The LHS contains an annotation pattern that may contain regular expression operators

  ▪ e.g. +, ?, *

□ The RHS consists of annotation manipulation statements

□ A label is used to refer to the LHS match from the RHS; it is denoted by a preceding colon (:)

  ▪ e.g. :location

Reference: [Cun+08]


JAPE ► Grammar

□ A pattern can be specified in 3 ways:

1. Specify a string of text, e.g. {Token.string == "of"}

2. Specify an annotation previously assigned by a gazetteer, tokeniser, or other module, e.g. {Lookup}

3. Specify the attributes of an annotation, e.g. {Token.kind == number}

Reference: [Cun+08]
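How a single pattern element is tested against an annotation can be sketched as follows. The dictionary layout for annotations and the `(type, feature, value)` tuple encoding of constraints are assumptions made for illustration only; they are not GATE data structures.

```python
def satisfies(annotation, constraint):
    """Check one annotation against one pattern element.

    annotation: {"type": str, "features": dict}   (assumed layout)
    constraint: (annotation_type, feature, value); feature is None
                when only the annotation type is required.
    """
    ann_type, feature, value = constraint
    if annotation["type"] != ann_type:
        return False
    if feature is None:
        return True
    return annotation["features"].get(feature) == value


# The three pattern forms from the slide, in the assumed encoding
STRING_OF = ("Token", "string", "of")    # {Token.string == "of"}
ANY_LOOKUP = ("Lookup", None, None)      # {Lookup}
NUMBER = ("Token", "kind", "number")     # {Token.kind == number}
```

In this encoding the first and third forms are the same kind of test (a feature compared with a value); only the second form checks the annotation type alone.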


JAPE ► Grammar

□ The RHS of the rule contains information about the annotation to be created.

□ The matched span is passed from the LHS to the RHS using labels.

□ The span is annotated with the entity type.

□ Finally, attributes and their corresponding values are added to the annotation.

□ Alternatively, the RHS can contain Java code to create or manipulate annotations.

Reference: [Cun+08]


JAPE ► Grammar

□ A grammar can set several options, such as "Control" and "Debug"

□ Control defines the method of rule matching

□ Debug is used to display conflicts on the standard output

□ Input annotations must be declared at the beginning of the grammar

Reference: [Cun+08]


JAPE ► Grammar (Example)

□ The matched pattern is given an annotation of type "Enamex" (an entity name).

□ "kind" is an attribute with the value "location"

□ "rule" is another attribute, with the value "GazLocation"

  Rule: GazLocation
  (
    {Lookup.majorType == location}
  )
  :location -->
    :location.Enamex = {kind = "location", rule = GazLocation}

Reference: [Cun+08]


JAPE ► Grammar Rule

□ Grammar rules can essentially be of two types:

  ▪ First type

    • No gazetteer lookup

    • Can be defined using a set of formats

    • Little potential for ambiguity

  ▪ Second type

    • Rely more on gazetteer lists

    • Wide range of possibilities

    • Greater potential for ambiguity

Reference: [Cun+08]


  Rule: IPAddress
  (
    {Token.kind == number}
    {Token.string == "."}
    …
  )
  :ipAddress -->
    :ipAddress.Address = {kind = "ipAddress"}

□ A single rule is sufficient to identify an IP address, because there is only one basic format

□ A series of numbers, each connected to the next by a dot

Reference: [Cun+08]
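The idea behind the IPAddress rule can be sketched over a token list. The sketch assumes that the part of the rule not shown on the slide repeats the number/dot pair until four numbers have been matched; the `(string, kind)` token tuples and the function names are hypothetical.

```python
# Assumed full pattern: number . number . number . number
PATTERN = ["number", ".", "number", ".", "number", ".", "number"]


def matches_at(tokens, i):
    """Return True if PATTERN matches the tokens starting at index i."""
    if i + len(PATTERN) > len(tokens):
        return False
    for (string, kind), expected in zip(tokens[i:], PATTERN):
        if expected == "number":
            if kind != "number":
                return False
        elif string != expected:
            return False
    return True


def find_ip_addresses(tokens):
    """Return (start, end) token index spans, end exclusive."""
    spans = []
    i = 0
    while i < len(tokens):
        if matches_at(tokens, i):
            spans.append((i, i + len(PATTERN)))
            i += len(PATTERN)   # continue after the matched region
        else:
            i += 1
    return spans
```

No gazetteer is consulted: token kinds and literal strings are enough, which is why this type of rule carries little potential for ambiguity.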


□ Mon, 23/8/08

□ Mon, 23/Aug/08

□ Mon, 23 August, 2008

□ Mon 23rd of August, 2008

□ Mon. August 23rd, 2008

□ Mon 23 August 2008

□ There are many possible variations, so many rules are needed to identify a date or time

□ The same date information can appear in many formats


Reference: [Cun+08]


□ The late '80s

□ Monday

□ St. Andrew's day

□ 99 BC

□ Mid-November

□ 1980-81

□ From March to April

□ Different types of date can also be expressed.

□ These are all classified as date entities.


Reference: [Cun+08]


JAPE ► Use of context

□ Dealing with context in grammar rules

□ The pattern to be annotated is enclosed in a set of round brackets

□ Preceding context can be included in the rule; it is placed before this set of brackets

□ If context following the pattern needs to be included, it is placed after the label given to the annotation

Reference: [Cun+08]


  Rule: YearContext
  (
    {Token.string == "in"} |
    {Token.string == "by"}
  )
  (YEAR)
  :date -->
    :date.Timex = {kind = "date", rule = "YearContext"}

□ Assuming an appropriate macro for "YEAR"

□ A year is recognised only if it is preceded by the word "in" or "by"


Reference: [Cun+08]


  Rule: EmailAddress
  ( {Token.string == "<"} )
  (
    (EMAIL)
  )
  :email
  ( {Token.string == ">"} )
  -->
    :email.Address = {kind = "email", rule = "EmailAddress"}

□ Assuming an appropriate macro for "EMAIL"

□ An email address is recognised only if it occurs inside angle brackets


Reference: [Cun+08]
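The effect of context can be sketched with a plain regular expression over raw text rather than over annotations: the angle brackets must be present, but only the labelled group is annotated. The `EMAIL` pattern below is a deliberately simplified stand-in for the macro, and `annotate_emails` is a hypothetical name.

```python
import re

# Simplified stand-in for the EMAIL macro (assumption, not the GATE macro)
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

# "<" and ">" are context; the parenthesised group plays the role of :email
RULE = re.compile(r"<(" + EMAIL + r")>")


def annotate_emails(text):
    """Return Address annotations covering only the labelled part."""
    return [
        {
            "type": "Address",
            "start": m.start(1),
            "end": m.end(1),
            "features": {"kind": "email", "rule": "EmailAddress"},
        }
        for m in RULE.finditer(text)
    ]
```

An address that is not enclosed in angle brackets is not annotated, and the brackets themselves lie outside the annotated span.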


JAPE ► Use of priority

□ A grammar has 5 control styles

  ▪ "brill", "all", "first", "once" and "appelt"

□ The style is specified at the beginning of the grammar

Reference: [Cun+08]


□ Brill → When more than one rule matches the same region of the document, they are all fired

□ All → Similar to Brill, in that it also executes all matching rules, but the matching continues from the next offset to the current one

□ First → A rule fires for the first match that is found

□ Once → Once a rule has fired, the whole JAPE phase exits

□ Appelt → Only one rule can be fired for the same region of text, according to a set of priority rules


Reference: [Cun+08]



□ Priority rules:

1. From all the rules that match a region of the document starting at some point X, the one which matches the longest region is fired

2. If more than one rule matches the same region, the one with the highest priority is fired

3. If there is more than one rule with the same priority, the one defined earlier in the grammar is fired

Reference: [Cun+08]
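The three priority rules amount to a lexicographic ordering of the candidate matches, which can be sketched as follows. The dictionary keys (`rule`, `start`, `end`, `priority`, `order`) are an assumed representation, with `order` standing for the position of the rule in the grammar.

```python
def select_appelt(matches):
    """Pick the single match to fire among candidates starting at the same point.

    Ordering: longest region first, then highest priority,
    then earliest definition in the grammar.
    """
    return min(
        matches,
        key=lambda m: (-(m["end"] - m["start"]), -m["priority"], m["order"]),
    )


candidates = [
    {"rule": "A", "start": 0, "end": 3, "priority": 50, "order": 0},
    {"rule": "B", "start": 0, "end": 5, "priority": 10, "order": 1},
    {"rule": "C", "start": 0, "end": 5, "priority": 20, "order": 2},
    {"rule": "D", "start": 0, "end": 5, "priority": 20, "order": 3},
]
winner = select_appelt(candidates)   # rule "C"
```

Rule A loses despite its high priority because it matches a shorter region; C beats B on priority and D on definition order.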




ANNIE

□ ANNIE: A Nearly-New Information Extraction System

  ▪ ANNIE is an IE system

    • that is distributed with GATE

    • that relies on finite-state algorithms and the JAPE language

  ▪ ANNIE has 5 components:

    • Tokeniser

    • Gazetteer

    • Sentence Splitter

    • Semantic Tagger

    • Orthographic Coreference - NameMatcher

Reference: [Cun+08]


ANNIE ► Components

Reference: [Cun+08]


ANNIE ► Components: Tokeniser I

□ Tokeniser:

  ▪ Splits the text into very simple components, tokens, of different types, such as

    • Number: any combination of consecutive digits

    • Punctuation: start punctuation "(", end punctuation ")", and other punctuation ":" …

    • Symbol: currency symbols "$", "£" … and other symbols "&", "#" …

    • SpaceToken: space tokens and control tokens

      – space token: pure space characters

      – control token: control characters

Reference: [Cun+08]


ANNIE ► Components: Tokeniser II

□ Tokeniser:

  ▪ Word: any set of contiguous upper- or lowercase letters, including a hyphen. A word also has the "orth" attribute:

    • upperInitial - initial letter is uppercase, and the rest are lowercase

    • allCaps - all uppercase letters

    • lowerCase - all lowercase letters

    • mixedCaps - any mixture of upper and lowercase letters

Reference: [Cun+08]
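The four values of the "orth" attribute can be derived from a word with a few string tests. This is a minimal sketch, assuming a non-empty word made of letters and hyphens; the function name `orth` is hypothetical and the ANNIE tokeniser itself assigns the attribute through its rules, not through code like this.

```python
def orth(word):
    """Classify the capitalisation of a non-empty word."""
    if word.islower():
        return "lowerCase"
    if word.isupper():
        return "allCaps"
    if word[0].isupper() and word[1:].islower():
        return "upperInitial"
    return "mixedCaps"
```

The order of the tests matters: a single capital letter such as "A" is reported as allCaps here because the all-uppercase test comes before the initial-capital test.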


ANNIE ► Components: Tokeniser III

□ Tokeniser rules:

  ▪ A rule has a left-hand side (LHS) and a right-hand side (RHS), separated by the ">" symbol. The LHS is a regular expression that has to match the input. The RHS describes the annotations to be added to the AnnotationSet.

  ▪ LHS operators:

    • | → or

    • * → 0 or more occurrences

    • ? → 0 or 1 occurrences

    • + → 1 or more occurrences

Reference: [Cun+08]


ANNIE ► Components: Tokeniser IV

□ Tokeniser rules:

  ▪ The RHS uses ";" as a separator; a rule has the following format

    • {LHS} > {Annotation type};{attribute1}={value1};…;{attributeN}={valueN}

□ Example:

  ▪ A tokeniser rule for a word beginning with a single capital letter:

    • "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

Reference: [Cun+08]


ANNIE ► Components: Gazetteer I

□ Gazetteer:

  ▪ The gazetteer lists used here are plain text files, with one entry per line. Each list represents a set of names, such as city names, organizations, days of the week …

  ▪ Example: list of units of currency

    Euro
    New Taiwan dollar
    New Taiwan dollars
    Pound
    Pounds
    Dollar
    Dollars

Reference: [Cun+08]


ANNIE ► Components: Gazetteer II

□ Gazetteer files:

  ▪ An index file (lists.def) is used to access the lists. For each list, a major type and, optionally, a minor type are specified. In the following examples, the first column is the list name, the second the major type, and the third the minor type. The lists are compiled into finite-state machines.

    • currency_prefix.lst:currency_unit:pre_amount

    • day.lst:date:day

Reference: [Cun+08]
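Reading the index file can be sketched in a few lines. The sketch assumes one colon-separated entry per line with an optional third column; the function name `parse_lists_def` and the returned dictionary layout are hypothetical.

```python
def parse_lists_def(text):
    """Parse the contents of an index file into {list name: types}.

    Each line has the form  list_name:major_type[:minor_type].
    """
    index = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(":")]
        name, major = fields[0], fields[1]
        # The minor type is optional
        minor = fields[2] if len(fields) > 2 and fields[2] else None
        index[name] = {"majorType": major, "minorType": minor}
    return index
```

Every entry found in a given list would then receive a Lookup annotation carrying that list's majorType and, if present, minorType.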


ANNIE ► Components: Gazetteer III

□ Flexibility:

  ▪ ANNIE allows arbitrary feature values to be associated with particular entries in a single list, by setting the optional gazetteerFeatureSeparator parameter to a single character.

  ▪ Example (separator "&"): Software_company.lst:company:software

    Red Hat&stockSymbol=RHAT

    Apple Computer&abbrev=Apple&stockSymbol=AAPL

Reference: [Cun+08]
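Splitting such an entry into the name and its features can be sketched as follows, assuming "&" as the separator character and `name=value` pairs after it; the function name `parse_entry` is hypothetical.

```python
def parse_entry(line, separator="&"):
    """Split a gazetteer entry into (entry text, feature dictionary)."""
    parts = line.rstrip("\n").split(separator)
    # Everything after the first separator is a name=value pair
    features = dict(part.split("=", 1) for part in parts[1:])
    return parts[0], features
```

An entry without any separator yields an empty feature dictionary, so plain lists continue to work unchanged.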


ANNIE ► Components: Sentence Splitter I

□ Sentence splitter:

  ▪ A cascade of finite-state transducers that segments the text into sentences; this segmentation is required for the tagger.

  ▪ The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.

  ▪ The sentence splitter is domain- and application-independent.

Reference: [Cun+08]


ANNIE ► Components: Sentence Splitter II

□ Annotations:

  ▪ Each sentence is annotated with the type Sentence.

  ▪ Each sentence break is also given a Split annotation

    • possible types are: ".", "punctuation", "CR" (a line break), "multi" …

□ An alternative sentence splitter is the RegEx Sentence Splitter. It is intended to improve execution time and robustness, especially when faced with irregular input.

Reference: [Cun+08]
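The role of the abbreviation list can be shown with a small regular-expression sketch. It is not the ANNIE splitter (which is a JAPE cascade); the abbreviation set and the function name `split_sentences` are assumptions for illustration.

```python
import re

# Hypothetical abbreviation list (stored without the trailing full stop)
ABBREVIATIONS = {"Dr", "Prof", "etc"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at ., ! or ? unless a full stop follows an abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+(?=\s|$)", text):
        words = text[start:m.start()].split()
        last_word = words[-1] if words else ""
        if m.group() == "." and last_word in abbreviations:
            continue   # full stop belongs to an abbreviation, not a sentence end
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Without the abbreviation check, the full stop after "Dr" in "Dr. Haenelt" would be treated as a sentence break.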


ANNIE ► Components: Semantic Tagger

□ Semantic tagger:

  ▪ ANNIE's semantic tagger is based on the JAPE language.

  ▪ It contains rules which act on annotations assigned in earlier phases, in order to produce outputs of annotated entities.

Reference: [Cun+08]


ANNIE ► Components: Orthographic Coreference I

□ Orthographic Coreference – NameMatcher:

  ▪ It adds identity relations between named entities found by the semantic tagger, in order to perform coreference.

  ▪ It does not find new named entities, but it may assign a type to an unclassified proper name, using the type of a matching name.

  ▪ The matching rules are only invoked if the names being compared are both of the same type, or if one of them is classified as unknown.

Reference: [Cun+08]


ANNIE ► Components: Orthographic Coreference II

□ GATE interface

  ▪ Input: entity annotations with an id attribute

  ▪ Output: matches attributes added to the existing entity annotations

□ Resources

  ▪ A lookup table of aliases: records non-matching strings that represent the same entity, e.g. "IBM" and "Big Blue" or "Coca-Cola" and "Coke"

  ▪ A table of spurious matches: matching strings which do not represent the same entity, e.g. "BT Wireless" and "BT Cellnet"

□ Processing

  ▪ An array of the strings, types and IDs of all name annotations is built from the wrappers

  ▪ The array is then passed to a string comparison function for pairwise comparison of all entries

Reference: [Cun+08]
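The interplay of the two tables and the type check can be sketched as a pairwise comparison. The shared-first-word test below is an assumed stand-in for the real matching rules, and the `(id, string, type)` tuples and function names are hypothetical.

```python
from itertools import combinations

# Table contents taken from the slide examples
ALIASES = {frozenset({"IBM", "Big Blue"}), frozenset({"Coca-Cola", "Coke"})}
SPURIOUS = {frozenset({"BT Wireless", "BT Cellnet"})}


def same_entity(a, b):
    """Decide whether two name strings refer to the same entity."""
    pair = frozenset({a, b})
    if pair in SPURIOUS:
        return False            # look alike, but are different entities
    if pair in ALIASES:
        return True             # look different, but are the same entity
    # Assumed stand-in rule: identical strings or identical first word
    return a == b or a.split()[0] == b.split()[0]


def match_names(names):
    """names: list of (id, string, type). Returns pairs of matching ids."""
    matches = []
    for (id1, s1, t1), (id2, s2, t2) in combinations(names, 2):
        # Compare only names of the same type, or where one type is unknown
        if t1 != t2 and "Unknown" not in (t1, t2):
            continue
        if same_entity(s1, s2):
            matches.append((id1, id2))
    return matches
```

With the stand-in rule alone, "BT Wireless" and "BT Cellnet" would match on their first word; the table of spurious matches is what prevents this.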


ANNIE

A Walk-Through Example


ANNIE ► A walk-through example

□ A 3-stage procedure using the tokeniser, gazetteer and named-entity grammar.

□ We wish to recognize the phrase "800,000 US dollars" as an entity of type "Number", with the feature "money".

□ We first give an example of a grammar rule (and corresponding macros) for money, which will recognize this type of pattern:

Reference: [Cun+08]


ANNIE ► A walk-through example: Macros

□ Macro for recognizing million/billion:

  Macro: MILLION_BILLION
  (
    {Token.string == "m"} |
    {Token.string == "million"} |
    {Token.string == "b"} |
    {Token.string == "billion"}
  )

□ Macro for recognizing an amount:

  Macro: AMOUNT_NUMBER
  (
    {Token.kind == number}
    (
      ( {Token.string == ","} | {Token.string == "."} )
      {Token.kind == number}
    )*
    (
      ( {SpaceToken.kind == space} )?
      (MILLION_BILLION)?
    )
  )

Reference: [Cun+08]



□ Define a grammar rule:

  Rule: Money1
  // e.g. 30 pounds
  (
    (AMOUNT_NUMBER)
    ( {SpaceToken.kind == space} )?
    ( {Lookup.majorType == currency_unit} )
  )
  :money -->
    :money.Number = {kind = "money", rule = "Money1"}

Reference: [Cun+08]



□ Step 1 - Tokenisation: "800,000 US dollars"

  ▪ The tokeniser separates the given phrase into tokens of the following kinds: word, number, punctuation and space token

    • Token, string = "800", kind = number, length = 3

    • Token, string = ",", kind = punctuation, length = 1

    • Token, string = "000", kind = number, length = 3

    • SpaceToken, string = " ", kind = space, length = 1

    • Token, string = "US", kind = word, length = 2, orth = allCaps

    • SpaceToken, string = " ", kind = space, length = 1

    • Token, string = "dollars", kind = word, length = 7, orth = lowercase

Reference: [Cun+08]
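The tokenisation step can be reproduced with a small regular-expression tokeniser. This is a simplified sketch, not the ANNIE tokeniser: it leaves out the "orth" attribute, treats every non-alphanumeric character as punctuation (so symbols such as "$" are not distinguished), and the names `TOKEN_RE` and `tokenise` are hypothetical.

```python
import re

# One named alternative per token kind; the name becomes the "kind" value
TOKEN_RE = re.compile(
    r"(?P<number>\d+)"
    r"|(?P<word>[A-Za-z]+(?:-[A-Za-z]+)*)"
    r"|(?P<space>\s)"
    r"|(?P<punctuation>[^\w\s])"
)


def tokenise(text):
    """Return a list of Token / SpaceToken dictionaries for the text."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        tokens.append({
            "type": "SpaceToken" if kind == "space" else "Token",
            "string": m.group(),
            "kind": kind,
            "length": len(m.group()),
        })
    return tokens


tokens = tokenise("800,000 US dollars")
```

For the example phrase this yields seven tokens in the same order, with the same strings, kinds and lengths as listed on the slide.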



□ Step 2 - Gazetteer lookup:

  ▪ The gazetteer lists are then searched to find all occurrences of matching words in the text.

  ▪ The following match is found for the string "US dollars".

  ▪ Lookup:

    • minorType = post_amount

    • majorType = currency_unit

Reference: [Cun+08]
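The lookup step can be sketched as a longest-match search over the text. The gazetteer contents below are hypothetical (only "US dollars" and its types come from the slide), the sketch keeps only the longest non-overlapping match at each position, and real gazetteers are compiled into finite-state machines instead of a single regular expression.

```python
import re

# Hypothetical list entries mapped to (majorType, minorType)
GAZETTEER = {
    "US dollars": ("currency_unit", "post_amount"),
    "dollars": ("currency_unit", "post_amount"),
    "pounds": ("currency_unit", "post_amount"),
}


def lookup(text, gazetteer=GAZETTEER):
    """Return Lookup annotations for gazetteer entries found in the text."""
    # Longest entries first, so "US dollars" wins over "dollars"
    entries = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(entry) for entry in entries) + r")\b"
    )
    annotations = []
    for m in pattern.finditer(text):
        major, minor = gazetteer[m.group()]
        annotations.append({
            "type": "Lookup",
            "start": m.start(),
            "end": m.end(),
            "majorType": major,
            "minorType": minor,
        })
    return annotations
```

The grammar rule only needs the majorType of the resulting Lookup annotation, not the matched string itself.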



□ Step 3 - Grammar rules I

  ▪ The grammar rule for money is then invoked.

    • The macro MILLION_BILLION is skipped, since the given phrase contains none of the specified strings.

    • The macro AMOUNT_NUMBER matches a number, followed by sequences of the form comma plus number, and then a space.

    • Then the rule Money1 is applied: it recognizes the string matched by the macro AMOUNT_NUMBER, followed by an optional space, followed by a unit of currency.

Reference: [Cun+08]



□ Step 3 - Grammar Rules II

    • In our case, "US dollars" is recognized as a currency unit, so the rule Money1 recognizes the entire string "800,000 US dollars".

    • Finally, the string is annotated as a Number entity of kind money:

  ▪ Result:

    "Number, kind = money, rule = Money1"

Reference: [Cun+08]




Summary

□ GATE

  ▪ Framework + graphical development environment

  ▪ Provides a set of language processing components

□ JAPE

  ▪ Used by components to define grammars and rules

  ▪ Patterns → regular expressions

□ ANNIE

  ▪ Information extraction system

  ▪ Several components (e.g. Tokeniser, Sentence Splitter)

  ▪ Uses JAPE and finite-state algorithms


Thanks for your attention!

Questions?


References

[Neu01] G. Neumann: "Informationsextraktion", in Klabunde et al. (eds.): "Computerlinguistik und Sprachtechnologie - Eine Einführung", Spektrum Akademischer Verlag, Heidelberg, 2001, ISBN 3-8274-1027-4.

[Wik08a] Wikipedia contributors: "Information extraction", Wikipedia, The Free Encyclopedia, 14 May 2008, 16:53 UTC, http://en.wikipedia.org/wiki/Information_extraction [accessed 1 June 2008].

[Cun+02] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan: "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications", in Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, July 2002.

[Cun+08] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, C. Ursu, M. Dimitrov, M. Dowman, N. Aswani, I. Roberts, Y. Li, A. Shafirin, A. Funk: "Developing Language Processing Components with GATE Version 4 (a User Guide)", online at http://gate.ac.uk/sale/tao/tao.pdf [accessed 13 June 2008].
