Iesl03 Multiling IEff

  • 8/10/2019 Iesl03 Multiling IEff

    GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction

    Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham

    Department of Computer Science, University of Sheffield

    http://gate.ac.uk/

    Structure of the talk:

    • A brief introduction to GATE
    • Multilingual infrastructure in GATE
    • Simple multilingual IE components

    GATE is...

    • An architecture: a macro-level organisational picture for LE (language engineering) software systems.
    • A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
    • A development environment: for language engineers, computational linguists et al., a graphical development environment.

    GATE comes with...

    • Some free components...
    • ...and wrappers for other people's components
    • Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
    • Free software (LGPL). Download at http://gate.ac.uk/download/

    Architectural principles

    • Non-prescriptive, theory neutral (strength and weakness)
    • Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
    • (Almost) everything is a component, and component sets are user-extendable
    • (Almost) all operations are available both from API and GUI

    Component-based development

    CREOLE: Collection of REusable Objects for Language Engineering

    • Java Beans: an OO way of chunking software
    • GATE components: modified Java Beans with XML configuration
    • The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
    • Three types: Language Resources, Processing Resources, Visual Resources

    Why bother?

    • Allows the system to load arbitrary language processing components
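
To illustrate why a declarative component description lets a system load arbitrary components, here is a minimal Python sketch (GATE itself is Java). The XML element names, the `load_components` helper, and the standard-library classes used as stand-in components are all assumptions made for illustration. This is not the actual CREOLE configuration format.

```python
import importlib
import xml.etree.ElementTree as ET

# Hypothetical, simplified component descriptor; NOT the real CREOLE schema.
# Standard-library classes stand in for language processing components.
CONFIG = """
<components>
  <resource>
    <name>Counter</name>
    <class>collections.Counter</class>
  </resource>
  <resource>
    <name>Fraction</name>
    <class>fractions.Fraction</class>
  </resource>
</components>
"""


def load_components(xml_text):
    """Return {name: class} for every resource declared in the XML."""
    registry = {}
    for resource in ET.fromstring(xml_text).iter("resource"):
        name = resource.findtext("name")
        module_path, _, class_name = resource.findtext("class").rpartition(".")
        module = importlib.import_module(module_path)  # resolved by name at runtime
        registry[name] = getattr(module, class_name)
    return registry


registry = load_components(CONFIG)
counter = registry["Counter"]("gate")  # instantiate a component chosen by name
```

The host code never names the component classes directly, so adding a component only requires a new descriptor entry.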

    Language Resources (LRs)

    • LRs are documents, ontologies, corpora, lexicons, ...
    • LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
    • Documents / corpora:
        - Diverse document formats: text, HTML, XML, email, RTF, SGML
        - Optional format-preserving markup analyse / save
        - Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES
    • Coping with diverse character encodings:
        - New internationalised versions of JVM support >100 different encodings.
        - Other encodings: developing system for user-entry of mapping tables (remove programming from the process)
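
The standoff annotation model mentioned above (start, end, type, features) can be sketched in a few lines of Python. The class and method names here are hypothetical and do not reproduce GATE's Java API. The point is only that the text is stored once, untouched, and annotations refer to it by character offsets.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One standoff annotation: offsets into the text plus a type and features."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """The text is never modified; annotations live beside it."""
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, start, end, type_, **features):
        ann = Annotation(start, end, type_, features)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]


doc = Document("GATE was developed at the University of Sheffield.")
phrase = "University of Sheffield"
start = doc.text.index(phrase)
org = doc.annotate(start, start + len(phrase), "Organization", source="manual")
```

Because annotations only point into the text, several overlapping annotation layers can coexist without altering the original markup.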

    Processing Resources (PRs)

    • Algorithmic components known as PRs: beans with execute methods.
    • All PRs can handle Unicode data by default.
    • Clear distinction between code and data (simple repurposing).
    • 20-30 freebies with GATE
    • Controllers: execute a set of PRs
        - SerialController: sequential run of arbitrary PR set
        - SerialAnalyserController: analyser PRs over corpus
        - Conditional controllers: execution depends on features
        - Parallel controller?
    • PRs + Controller = Applications
    • Application parameterisation state can be saved and restored, and used for embedding / batching
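
The idea "PRs + Controller = Applications" can be sketched as follows. This is a Python illustration under stated assumptions, not GATE's Java API: the class names echo the slide, but their methods, the two toy PRs, and the use of plain dictionaries as documents are all hypothetical.

```python
class ProcessingResource:
    """Hypothetical stand-in for a PR: a component with an execute method."""

    def execute(self, document):
        raise NotImplementedError


class WhitespaceTokeniser(ProcessingResource):
    def execute(self, document):
        document["tokens"] = document["text"].split()


class TokenCounter(ProcessingResource):
    def execute(self, document):
        document["token_count"] = len(document["tokens"])


class SerialController:
    """Runs its PRs in order over one document."""

    def __init__(self, prs):
        self.prs = list(prs)

    def execute(self, document):
        for pr in self.prs:
            pr.execute(document)


class SerialAnalyserController(SerialController):
    """Runs the same pipeline over every document in a corpus."""

    def execute_corpus(self, corpus):
        for document in corpus:
            self.execute(document)


class ConditionalController:
    """Runs each PR only if its predicate on the document holds."""

    def __init__(self, conditional_prs):
        self.conditional_prs = list(conditional_prs)  # (predicate, pr) pairs

    def execute(self, document):
        for predicate, pr in self.conditional_prs:
            if predicate(document):
                pr.execute(document)


corpus = [
    {"text": "GATE handles Unicode data", "lang": "en"},
    {"text": "Multilingual IE", "lang": "en"},
]
app = SerialAnalyserController([WhitespaceTokeniser(), TokenCounter()])
app.execute_corpus(corpus)
```

Keeping the pipeline order in the controller rather than in the PRs is what lets the same components be recombined into different applications.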

    VRs (2): Coreference

    VRs (3): Syntax

    Displaying Multilingual Data

    GATE uses the standard (& imperfect) Java rendering engine for displaying text.

    Editing Multilingual Data

    GATE Unicode Kit (GUK)

    • Complements Java's facilities
    • Support for defining Input Methods (IMs)
    • Currently 30 IMs for 17 languages
    • Pluggable in other applications (e.g. JEdit, EUDICO)
    • Can use virtual keyboard or standard layouts over QWERTY
    • IMs defined in plain text files
    • GUK comes with a standalone Unicode editor

    Processing Multilingual Data

    • All processing, visualisation and editing tools use GUK

    Multilingual IE Components

    The ANNIE system: a reusable and easily extendable set of components

    The Unicode Tokeniser

    • A very portable component for multiple languages: splits text into typed tokens, using an FSM dynamically constructed from rules
    • Rules are based on the character categories defined by Unicode, e.g.:
        UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
    • Output is generally localised by a later module (e.g. "don't" becomes "do" + "n't")
    • 23 rules seem able to handle Indo-European languages without changes.
    • The English tokeniser: Unicode tokeniser + pattern grammar FST
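
The example rule above can be approximated in Python with the standard `unicodedata` module, which exposes the same Unicode general categories (`Lu`, `Ll`, `Pd`). This is a sketch of the single rule quoted on the slide, not the GATE tokeniser: the function name, the one-letter category codes, and the regex lookarounds are assumptions of this illustration.

```python
import re
import unicodedata

# One-letter codes for the Unicode general categories the rule mentions.
CATEGORY_CODES = {"Lu": "U", "Ll": "L", "Pd": "D"}

# UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)*
# The lookarounds are an assumption of this sketch: they stop this single rule
# from firing inside a longer run of letters, which a full rule set would
# have claimed with some other rule.
UPPER_INITIAL = re.compile(r"(?<![UL])U[LD]*(?![UL])")


def upper_initial_tokens(text):
    """Return (token, features) pairs for every match of the rule."""
    text = unicodedata.normalize("NFC", text)
    # Encode the text as a string of category codes, one per character.
    codes = "".join(CATEGORY_CODES.get(unicodedata.category(ch), "x") for ch in text)
    features = {"orth": "upperInitial", "kind": "word"}
    return [
        (text[m.start():m.end()], dict(features))
        for m in UPPER_INITIAL.finditer(codes)
    ]


tokens = upper_initial_tokens("GATE runs in Sheffield, София and Αθήνα.")
```

Because the rule refers to character categories rather than to specific letters, the same pattern picks out capitalised words in Latin, Cyrillic and Greek script without any per-language change.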

    POS tagging in new languages

    • TIDES Surprise Language: Hepple tagger, but substituted Cebuano/Hindi lexicon for English
    • Used empty ruleset since no training data available
    • Used default heuristics (e.g. return NNP for capitalised words)
    • Very experimental, but reasonable results
    • 67% correctness for Hindi and 75% for Cebuano
    • Adaptation time per language - 2 days
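
The approach described above (lexicon lookup, an empty rule set, and default heuristics) can be sketched as below. This is not the Hepple tagger. The `tag` function, the toy lexicon entries and their tags, and the `NN` fallback for unknown lowercase words are assumptions for illustration only. The `NNP` heuristic for capitalised words is the one named on the slide.

```python
def tag(tokens, lexicon, default_tag="NN"):
    """Tag tokens by lexicon lookup, falling back to simple heuristics.

    No contextual rules are applied afterwards, mirroring the empty ruleset.
    """
    tagged = []
    for token in tokens:
        if token.lower() in lexicon:
            pos = lexicon[token.lower()]
        elif token[:1].isupper():
            pos = "NNP"  # heuristic from the slide: capitalised unknown word
        else:
            pos = default_tag  # assumption: other unknown words default to NN
        tagged.append((token, pos))
    return tagged


# Toy stand-in for a substituted lexicon; entries and tags are illustrative only.
toy_lexicon = {"ang": "DT", "sa": "IN"}

tagged = tag(["Ang", "lungsod", "sa", "Sugbo"], toy_lexicon)
```

Swapping the language then amounts to swapping the lexicon argument, which is consistent with the short adaptation time reported.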

    Porting NE grammars

    • Most English JAPE rules are based on POS tags and gazetteer lookup
    • Grammars can be reused for languages with similar word order, orthography etc.
    • No time to make a detailed study of Cebuano, but very similar in structure to English
    • Most of the rules left as for English, but some adjustments to handle especially dates
    • Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages

    TIDES Evaluation Results

                  Cebuano             English Baseline
    Entity        P     R     F       P     R     F
    Person        71    65    68      36    36    36
    Org           75    71    73      31    47    38
    Location      73    78    76      65    7     12
    Date          83    100   92      42    58    49
    Total         76    79    77.5    45    41.7  43
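
Assuming the F column is the balanced F-measure (the harmonic mean of precision P and recall R), it can be recomputed from the P and R columns. The sketch below does this for the Cebuano rows. The `f_measure` helper is introduced here for illustration. The recomputed values agree with the reported ones to within about one and a half points, a gap that may simply reflect rounding of P and R on the slide.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) for the Cebuano columns, copied from the results table.
cebuano = {
    "Person": (71, 65, 68),
    "Org": (75, 71, 73),
    "Location": (73, 78, 76),
    "Date": (83, 100, 92),
    "Total": (76, 79, 77.5),
}

for entity, (p, r, reported) in cebuano.items():
    print(f"{entity:9s} computed F = {f_measure(p, r):5.1f}   reported F = {reported}")
```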

    Conclusion

    • GATE: a Unicode-based NLP infrastructure, particularly suitable for multilingual adaptation of IE systems
    • Requires little involvement of native speakers and very little annotated data for a basic job

    Future work

    • Improving multilingual support, e.g., morphology support, automatic language and encoding identification
    • Learning gazetteer lists from annotated corpora