
Talk on Plazi Markup System at Artdatenbanken Workshop


Page 1: Golden Gat Ev3   The Plazi Markup System

The PLAZI Markup System

Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter

Universität Karlsruhe (TH) Research University – founded 1825

Page 2: Golden Gat Ev3   The Plazi Markup System

Guido Sautter, Universität Karlsruhe (TH)

The PLAZI Markup System

Components:
• GoldenGATE Document Editor: document markup, external referencing
• PLAZI Server: XML & PDF storage, treatment server
• PLAZI Search Portal: search portal, TAPIR provider, RSS feed
• External Data Sources: taxonomic data sources & web services

Data exchanged:
• Editor and Server: marked-up documents
• Search Portal and Server: queries; treatments, detail data, PDF document handles
• Search Portal and External Data Sources: links, materials citations
• Editor and External Data Sources: taxon LSIDs, GeoData; new taxon names

Page 3: Golden Gat Ev3   The Plazi Markup System

The PLAZI Search Portal

• Series of Java Servlets running in Apache Tomcat
• Front-end for SRS Web Service
• Linker plug-ins create hyperlinks to other web sites
• HTML-based search portal for humans
  – Search treatments & index data
  – Links submitting new search queries
  – Links to external data sources (e.g. HNS, Google Maps)
  – Links to PDF document & XML versions of treatments
• XML document access in various XML schemas
• TAPIR provider
  – Taxonomic names
  – Materials citations
• RSS feed for new treatments
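The linker plug-in idea from this slide can be sketched as follows. This is only an illustration in Python, not the portal's actual Java servlet code: the `Linker` interface, the class names and the record fields `latitude` and `longitude` are hypothetical. The generated link uses the public Google Maps URLs search format.

```python
from typing import Optional, Protocol
from urllib.parse import urlencode


class Linker(Protocol):
    """Hypothetical plug-in interface: turn an index record into a hyperlink."""

    def link_for(self, record: dict) -> Optional[str]: ...


class GoogleMapsLinker:
    """Links a record that carries coordinates to a map search."""

    def link_for(self, record: dict) -> Optional[str]:
        lat, lon = record.get("latitude"), record.get("longitude")
        if lat is None or lon is None:
            return None  # nothing to link without coordinates
        query = urlencode({"api": 1, "query": f"{lat},{lon}"})
        return f"https://www.google.com/maps/search/?{query}"


def collect_links(record: dict, linkers: list) -> list:
    """Ask every registered linker and keep the links that apply."""
    links = (linker.link_for(record) for linker in linkers)
    return [link for link in links if link is not None]
```

Keeping each external site behind its own small linker class means a new link target can be added without touching the search portal itself, which is presumably the point of making linkers pluggable.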

Page 4: Golden Gat Ev3   The Plazi Markup System

The PLAZI Search Portal

[Screenshot of the search portal; visible label: Probolomyrmex tani]

Page 5: Golden Gat Ev3   The Plazi Markup System

The PLAZI Markup System (overview diagram repeated from Page 2)

Page 6: Golden Gat Ev3   The Plazi Markup System

The PLAZI Server

• GoldenGATE Search & Retrieval Server (SRS)
  – Extracts individual treatments from XML documents
  – Stores and indexes treatments
  – Based on independent, pluggable Indexers
    • Taxonomic names
    • Materials citations
    • Document metadata
    • Full text
  – Serves treatments or indexed details
• DSpace
  – Stores PDF and XML documents
  – Issues Handles for documents

[Diagram labels: Web Service; SRS; Document Management; Indexers (TN, MC, MD, FT); Index Data; XML Documents; PostgreSQL; File System]
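The idea of independent, pluggable indexers can be illustrated with a small sketch. It is written in Python for brevity and is not the SRS implementation: `TreatmentStore`, `FullTextIndexer` and their methods are hypothetical names, and the in-memory dictionaries stand in for the PostgreSQL and file system storage shown on the slide.

```python
import re
from collections import defaultdict


class FullTextIndexer:
    """Hypothetical indexer: maps each lower-cased word to treatment IDs."""

    name = "FT"

    def __init__(self):
        self.index = defaultdict(set)

    def add(self, treatment_id, text):
        for word in re.findall(r"\w+", text.lower()):
            self.index[word].add(treatment_id)

    def search(self, term):
        return set(self.index.get(term.lower(), set()))


class TreatmentStore:
    """Stores treatments and forwards each one to all registered indexers."""

    def __init__(self):
        self.treatments = {}
        self.indexers = {}

    def register(self, indexer):
        self.indexers[indexer.name] = indexer

    def store(self, treatment_id, text):
        self.treatments[treatment_id] = text
        for indexer in self.indexers.values():
            indexer.add(treatment_id, text)

    def search(self, indexer_name, term):
        ids = self.indexers[indexer_name].search(term)
        return [self.treatments[i] for i in sorted(ids)]
```

A taxonomic name or materials citation indexer would plug in the same way: it only needs a `name`, an `add` method and a `search` method, so the store never has to know what a particular indexer extracts.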

Page 7: Golden Gat Ev3   The Plazi Markup System

[Architecture diagram]

Component labels: Apache Tomcat; GoldenGATE Server; DIO; DIO-SRS; EXC; RES; SRS; UAA; DIO DB; SRS DB; GoldenGATE Editor; DIO-CON; eXist; SRS Search Portal; SRS TAPIR Service

Remote GoldenGATE Server labels: DIO; DIO-SRS; SRS; RES-DIO; DIO DB; SRS DB

External labels: OCR Data Sources; GBIF, etc.; Public; EOL, etc.

Annotations:
• Extract individual treatments
• Transform to TaxonX
• Transform to SPM
• Forward updates & deletions
• Upload or update document
• Checkout document for further markup

Page 8: Golden Gat Ev3   The Plazi Markup System

The PLAZI Markup System (overview diagram repeated from Page 2)

Page 9: Golden Gat Ev3   The Plazi Markup System

The PLAZI Markup Process

• Process yields semantically enhanced documents:
  – Treatment as atomic unit of text, inner structure for details
  – Materials citations machine-readably associated with taxa
• Sequence of steps engineered for maximum automation
• First part (upper row) generic to enhancing OCR output
• Only second part (lower row) specific to taxonomy

[Process diagram labels: Printed document; Scanning & OCR; HTML document; Layout Artifacts; Paragraph Normalization; Structural Normalization; Normalized Document; MODS; Taxon Name Markup; Location Markup; Treatment Markup; Treatment Structure; Treatments & Structure. Tools: Abbyy FineReader; GoldenGATE Document Editor]
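To make "treatment as atomic unit" and "materials citations machine-readably associated with taxa" concrete, here is a minimal sketch in Python. The element names `document`, `treatment`, `taxonName` and `materialsCitation` are assumptions chosen for illustration, not the actual schema used by the system, and the citation texts are placeholders.

```python
import xml.etree.ElementTree as ET

# Illustrative document; element names and citation texts are assumed.
SAMPLE = """
<document>
  <treatment>
    <taxonName>Probolomyrmex tani</taxonName>
    <materialsCitation>Specimen record A</materialsCitation>
    <materialsCitation>Specimen record B</materialsCitation>
  </treatment>
  <treatment>
    <taxonName>Example taxon</taxonName>
  </treatment>
</document>
"""


def citations_by_taxon(xml_text):
    """Return {taxon name: [materials citation text, ...]}, one entry per treatment."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for treatment in root.iter("treatment"):
        taxon = treatment.findtext("taxonName", default="").strip()
        citations = [
            mc.text.strip()
            for mc in treatment.iter("materialsCitation")
            if mc.text
        ]
        result[taxon] = citations
    return result
```

Because each citation sits inside the treatment of its taxon, the association falls out of the document structure and no separate lookup table is needed.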

Page 10: Golden Gat Ev3   The Plazi Markup System

The GoldenGATE Editor

• Java-based editor for semi-automated document markup
• Extensible through plug-in mechanism
• Independent of specific XML schema
• Element-level XML editing (XML syntax is generated)
• Flexible display for clear view on all detail levels
• Existing plug-ins provide broad spectrum of functionality:
  – NLP-based markup generation
    • Regular expressions, gazetteers, GATE JAPE
    • Homegrown and third-party NLP components
    • Import of data from external sources (e.g. LSIDs)
  – Specialized document views for correcting NLP results
  – Markup transformation & filtering
  – IO components for different data formats & storage locations (e.g. for uploading XML documents to PLAZI server)
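How regular expressions and gazetteers can work together for markup generation is easiest to see in a toy example. The sketch below is in Python and is not a GoldenGATE plug-in: the `GENERA` gazetteer, the pattern and the `taxonName` tag are assumptions for illustration, and a real tagger would need far more rules (abbreviated genera, subspecies, authorities).

```python
import re
from xml.sax.saxutils import escape

# Hypothetical gazetteer of known genus names; a real one would be much larger.
GENERA = {"Probolomyrmex", "Formica"}

# Candidate binomial: a capitalised word followed by a lower-case word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def mark_taxon_names(text):
    """Wrap gazetteer-confirmed binomials in <taxonName> tags (tag name assumed)."""
    out, last = [], 0
    for match in BINOMIAL.finditer(text):
        if match.group(1) not in GENERA:
            continue  # regex hit not confirmed by the gazetteer
        out.append(escape(text[last:match.start()]))
        out.append(f"<taxonName>{escape(match.group(0))}</taxonName>")
        last = match.end()
    out.append(escape(text[last:]))
    return "".join(out)
```

The regular expression proposes candidates cheaply and the gazetteer filters out false hits such as "The ant"; whatever the automation still gets wrong is what the editor's correction views are for.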

Page 11: Golden Gat Ev3   The Plazi Markup System


The GoldenGATE Editor

Page 12: Golden Gat Ev3   The Plazi Markup System

GoldenGATE Markup Wizard

• Basically, a GoldenGATE Editor with another GUI
• Higher degree of automation and user guidance
• Especially configurable for a specific series of publications
• Automatically decides how to proceed with markup
• Highlights potential errors for user to correct
• Prevents overlooking errors
  – Less effort for correction
  – No error propagation
• Editing functionality for corrections like in Editor

More efficient markup of well-structured publications

Page 13: Golden Gat Ev3   The Plazi Markup System

Community Markup

[Architecture diagram]

Component labels: Apache Tomcat; GoldenGATE Server; CMS; DIO; DIO-SRS; EXC; DPS; SRS; USS; ECS; UAA; DIO DB; SRS DB; GoldenGATE Editor; DIO-CON; eXist; SRS Search Portal; SRS TAPIR Service; Markup Community Portal

External labels: OCR Data Sources; GBIF, etc.; Public; EOL, etc.; Markup Volunteers; Public Ranking

Annotations:
• Community does interactive part of markup
• Online interface, no local program
• Shows who contributes most
• GoldenGATE Editor embedded in Server
• Measures contribution

Page 14: Golden Gat Ev3   The Plazi Markup System

The PLAZI Markup System (overview diagram repeated from Page 2)

Page 15: Golden Gat Ev3   The Plazi Markup System

The External Data Sources

• Hymenoptera Name Server (HNS)
  – Retrieve LSIDs for taxon names
  – Enter new taxon names in HNS database
• Further LSID sources: ZooBank, Index Fungorum
• GBIF pulls materials citations via TAPIR
• EOL pulls treatments via TAPIR (to start soon)
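Since LSIDs are the identifiers retrieved for taxon names, a short sketch of their structure may help. An LSID is a URN of the form `urn:lsid:authority:namespace:object[:revision]`. The Python parser below is illustrative only: `Lsid` and `parse_lsid` are hypothetical names, and the example identifier in the usage note is made up rather than taken from HNS.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(value):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = value.strip().split(":")
    prefix = [p.lower() for p in parts[:2]]  # the 'urn:lsid' prefix is case-insensitive
    if len(parts) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {value!r}")
    if any(not p for p in parts[2:]):
        raise ValueError(f"empty LSID component in {value!r}")
    return Lsid(*parts[2:])
```

For example, `parse_lsid("urn:lsid:example.org:taxname:12345")` yields the authority `example.org`, the namespace `taxname` and the object ID `12345`, with no revision.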

Page 16: Golden Gat Ev3   The Plazi Markup System

Thank you! Questions?

Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter

Universität Karlsruhe (TH) Research University – founded 1825

PLAZI homepage: http://plazi.org
PLAZI search portal: http://plazi.org:8080/GgServer
PLAZI main web portal: http://plazi2.cs.umb.edu/GgServer
GoldenGATE homepage: http://idaho.ipd.uka.de/GoldenGATE

Page 17: Golden Gat Ev3   The Plazi Markup System

Outlook

• Tighter integration of GoldenGATE editor with server
  – Load plug-ins from server
    → Easier update distribution
  – Upload documents directly after OCR
  – Host documents at server throughout markup
    → Users can share markup work (experts do LSIDs, etc.)
    → Treatments available in search portal as soon as marked up
  – Auto-distribute documents to different storage locations
  – Run automated markup generation on server side
  – Get corrections from community via online feedback forms
• Other extensions of GoldenGATE editor
  – Simplified, more flexible plug-in architecture
  – Extensible user interface

Page 18: Golden Gat Ev3   The Plazi Markup System

The GoldenGATE Editor V3

• Plug-in GUI extensions (hideable)
• Simplified, more flexible architecture
• Pre-OCR page images for correcting OCR errors
• Document navigator for finding stuff more quickly