Talk on Plazi Markup System at Artdatenbanken Workshop
The PLAZI Markup System
Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter
Universität Karlsruhe (TH) Research University – founded 1825
Guido Sautter, Universität Karlsruhe (TH)
The PLAZI Markup System
[Diagram: system overview]
Components:
• GoldenGATE Document Editor: document markup, external referencing
• PLAZI Server: XML & PDF storage, treatment server
• PLAZI Search Portal: search portal, TAPIR provider, RSS feed
• External Data Sources: taxonomic data sources & web services
Data flows: marked-up documents; queries; treatments, detail data, PDF document handles; links, materials citations; taxon LSIDs, GeoData; new taxon names
The PLAZI Search Portal
• Series of Java Servlets running in Apache Tomcat
• Front-end for SRS Web Service
• Linker plug-ins create hyperlinks to other web sites
• HTML-based search portal for humans
  – Search treatments & index data
  – Links submitting new search queries
  – Links to external data sources (e.g. HNS, GoogleMaps)
  – Links to PDF document & XML versions of treatments
• XML document access in various XML schemas
• TAPIR provider
  – Taxonomic names
  – Materials citations
• RSS feed for new treatments
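As an illustration of what a linker plug-in might do, the following minimal sketch turns a materials citation with coordinates into a map hyperlink. It is not the portal's actual Java code: the function names and the dictionary keys (`text`, `latitude`, `longitude`) are assumptions made for this example, and the URL follows the public Google Maps search URL pattern.

```python
import html
from urllib.parse import urlencode


def google_maps_link(latitude: float, longitude: float) -> str:
    """Build a Google Maps search URL for a coordinate pair."""
    query = urlencode({"api": 1, "query": f"{latitude},{longitude}"})
    return f"https://www.google.com/maps/search/?{query}"


def link_materials_citation(citation: dict) -> str:
    """Render a materials citation as HTML.

    The citation is a plain dict (hypothetical structure): 'text' is
    required, 'latitude' and 'longitude' are optional. A hyperlink is
    produced only when both coordinates are present.
    """
    label = html.escape(citation["text"])
    if "latitude" in citation and "longitude" in citation:
        url = google_maps_link(citation["latitude"], citation["longitude"])
        return f'<a href="{html.escape(url)}">{label}</a>'
    return label
```

Keeping the URL construction separate from the HTML rendering mirrors the plug-in idea: a linker for another external site would only need a different URL builder.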
The PLAZI Search Portal
[Screenshot: search portal, example Probolomyrmex tani]
The PLAZI Server
• GoldenGATE Search & Retrieval Server (SRS)
  – Extracts individual treatments from XML documents
  – Stores and indexes treatments
  – Based on independent, pluggable Indexers
    • Taxonomic names
    • Materials citations
    • Document meta data
    • Full text
  – Serves treatments or indexed details
• DSpace
  – Stores PDF and XML documents
  – Issues Handles for documents
[Diagram: SRS architecture. Labels: Web Service; SRS; Document Management; XML Documents; Index Data; PostgreSQL; File System; four Indexers labeled TN, MC, MD, FT (matching the four Indexer types listed above)]
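The pluggable-indexer idea can be sketched as follows. This is an illustrative Python model, not the SRS implementation (which is Java): the class names, the element names `treatment` and `taxonName`, and the in-memory dictionaries standing in for the file system and PostgreSQL are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import Protocol


class Indexer(Protocol):
    """Anything with a name and an index() method can be plugged in."""
    name: str

    def index(self, treatment: ET.Element) -> list[str]: ...


class TaxonNameIndexer:
    name = "TN"

    def index(self, treatment: ET.Element) -> list[str]:
        # Assumes taxon names are marked up as <taxonName> (hypothetical).
        return ["".join(el.itertext()).strip() for el in treatment.iter("taxonName")]


class FullTextIndexer:
    name = "FT"

    def index(self, treatment: ET.Element) -> list[str]:
        # Whitespace tokenization of all text inside the treatment.
        return "".join(treatment.itertext()).split()


class SearchAndRetrievalStore:
    """Extracts treatments from documents, stores them, and indexes them."""

    def __init__(self, indexers):
        self.indexers = list(indexers)
        self.treatments = {}  # treatment id -> XML string
        self.index = {}       # (indexer name, lower-cased term) -> set of ids

    def add_document(self, doc_id: str, xml_text: str) -> list[str]:
        root = ET.fromstring(xml_text)
        ids = []
        for n, treatment in enumerate(root.iter("treatment")):
            tid = f"{doc_id}#{n}"
            self.treatments[tid] = ET.tostring(treatment, encoding="unicode")
            for indexer in self.indexers:
                for term in indexer.index(treatment):
                    self.index.setdefault((indexer.name, term.lower()), set()).add(tid)
            ids.append(tid)
        return ids

    def search(self, indexer_name: str, term: str) -> list[str]:
        return sorted(self.index.get((indexer_name, term.lower()), set()))
```

Because the store only calls `index()` on each plug-in, a materials-citation or meta-data indexer could be added without touching the extraction or storage code.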
[Diagram: GoldenGATE Server architecture]
Labels:
• Apache Tomcat: SRS Search Portal, SRS TAPIR Service
• GoldenGATE Server: DIO, DIO-SRS, EXC, RES, SRS, UAA; DIO DB, SRS DB
• Remote GoldenGATE Server: DIO, DIO-SRS, SRS, RES-DIO; DIO DB, SRS DB
• GoldenGATE Editor: DIO-CON
• eXist
• OCR Data Sources; GBIF, etc.; EOL, etc.; Public
Annotations:
• Upload or update document
• Checkout document for further markup
• Extract individual treatments
• Transform to TaxonX
• Transform to SPM
• Forward updates & deletions
The PLAZI Markup Process
• Process yields semantically enhanced documents:
  – Treatment as atomic unit of text, inner structure for details
  – Materials citations machine-readably associated to taxa
• Sequence of steps engineered for maximum automation
• First part (upper row) generic to enhancing OCR output
• Only second part (lower row) specific to taxonomy
[Diagram: markup process flow in two rows. Recoverable labels: Printed document; Scanning & OCR; HTML document; Layout Artifacts; Paragraph Normalization; Structural Normalization; MODS; Normalized Document; Taxon Name Markup; Location Markup; Treatment Markup; Treatment Structure; Treatments & Structure. Tools: Abbyy FineReader, GoldenGATE Document Editor]
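The generic first part of the process can be pictured as a chain of small steps, each taking the output of the previous one. The sketch below is a simplified stand-in written for this summary, not GoldenGATE code: the step functions and their heuristics (page-number-only lines as layout artifacts, a trailing hyphen as a broken word) are assumptions.

```python
import re
from functools import reduce


def strip_layout_artifacts(lines):
    """Drop lines holding only a page number (stand-in for artifact removal)."""
    return [ln for ln in lines if not re.fullmatch(r"\s*\d+\s*", ln)]


def normalize_paragraphs(lines):
    """Merge consecutive non-empty lines into paragraphs.

    A word hyphenated at a line end is rejoined. This is naive: a genuine
    hyphen at a line end would be dropped as well.
    """
    paragraphs, current = [], ""
    for ln in lines:
        ln = ln.strip()
        if not ln:
            # Blank line closes the current paragraph.
            if current:
                paragraphs.append(current)
            current = ""
        elif current.endswith("-"):
            current = current[:-1] + ln
        else:
            current = f"{current} {ln}".strip()
    if current:
        paragraphs.append(current)
    return paragraphs


def run_pipeline(data, steps):
    """Apply each step to the result of the previous one, in order."""
    return reduce(lambda acc, step: step(acc), steps, data)


GENERIC_STEPS = [strip_layout_artifacts, normalize_paragraphs]
```

Taxonomy-specific steps (taxon name markup, treatment markup) would be appended to the same list, which keeps the generic and the domain-specific parts separable.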
The GoldenGATE Editor
• Java-based editor for semi-automated document markup
• Extensible through plug-in mechanism
• Independent of specific XML schema
• Element-level XML editing (XML syntax is generated)
• Flexible display for clear view on all detail levels
• Existing plug-ins provide broad spectrum of functionality:
  – NLP-based markup generation
    • Regular expressions, gazetteers, GATE JAPE
    • Homegrown and third-party NLP components
    • Import of data from external sources (e.g. LSIDs)
  – Specialized document views for correcting NLP results
  – Markup transformation & filtering
  – IO components for different data formats & storage locations (e.g. for uploading XML documents to PLAZI server)
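To make the combination of regular expressions and gazetteers concrete, here is a minimal sketch of markup generation for taxon names. It does not reproduce any GoldenGATE plug-in: the one-entry gazetteer, the binomial pattern, and the `taxonName` element with `genus` and `species` attributes are assumptions chosen for illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical one-entry gazetteer of known genus names.
GENUS_GAZETTEER = {"Probolomyrmex"}

# Candidate binomials: a capitalized word followed by a lower-case word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def mark_up_taxon_names(text: str) -> str:
    """Wrap binomials whose genus is in the gazetteer in <taxonName> tags.

    The regular expression proposes candidates; the gazetteer filters out
    false positives such as a sentence-initial word followed by any word.
    The XML syntax is generated, and plain text is escaped.
    """
    out, pos = [], 0
    for match in BINOMIAL.finditer(text):
        genus, species = match.group(1), match.group(2)
        if genus not in GENUS_GAZETTEER:
            continue
        out.append(escape(text[pos:match.start()]))
        out.append(
            f'<taxonName genus="{genus}" species="{species}">'
            f"{escape(match.group(0))}</taxonName>"
        )
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

The result of such an automated pass is what a correction view would then present to the user for review.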
[Screenshot: the GoldenGATE Editor]
GoldenGATE Markup Wizard
• Basically, a GoldenGATE Editor with another GUI
• Higher degree of automation and user guidance
• Especially configurable for a specific series of publications
• Automatically decides how to proceed with markup
• Highlights potential errors for user to correct
• Prevents overlooking errors
  – Less effort for correction
  – No error propagation
• Editing functionality for corrections like in Editor
More efficient markup of well-structured publications
Community Markup
[Diagram: community markup architecture]
Labels:
• Apache Tomcat: SRS Search Portal, SRS TAPIR Service, Markup Community Portal
• GoldenGATE Server: CMS, DIO, DIO-SRS, EXC, DPS, SRS, USS, ECS, UAA; DIO DB, SRS DB
• GoldenGATE Editor: DIO-CON
• eXist
• OCR Data Sources; GBIF, etc.; EOL, etc.; Public; Markup Volunteers; Public Ranking
Annotations:
• Community does interactive part of markup
• Online interface, no local program
• GoldenGATE Editor embedded in server
• Measures contribution
• Shows who contributes most
The External Data Sources
• Hymenoptera Name Server (HNS)
  – Retrieve LSIDs for taxon names
  – Enter new taxon names in HNS database
• Further LSID sources: ZooBank, Index Fungorum
• GBIF pulls materials citations via TAPIR
• EOL pulls treatments via TAPIR (to start soon)
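LSIDs retrieved from such sources follow the general pattern `urn:lsid:authority:namespace:object`, optionally followed by `:revision`. The sketch below validates and splits an LSID before it is attached to a taxon name. The `Lsid` type and `parse_lsid` function are names invented for this example, and no actual name-server lookup is shown.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(value: str) -> Lsid:
    """Split an LSID into its parts, raising ValueError if it is malformed.

    The 'urn:lsid' prefix is compared case-insensitively; the remaining
    parts are returned unchanged.
    """
    parts = value.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {value!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing 'urn:lsid' prefix: {value!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {value!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)
```

For instance, `parse_lsid("urn:lsid:example.org:names:12345")` (a made-up identifier) yields the authority `example.org`, the namespace `names`, and the object `12345`.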
Thank you! Questions?
Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter
Links (PLAZI homepage, PLAZI search portal, PLAZI main web portal, GoldenGATE homepage):
http://plazi.org
http://plazi.org:8080/GgServer
http://plazi2.cs.umb.edu/GgServer
http://idaho.ipd.uka.de/GoldenGATE
Outlook
• Tighter integration of GoldenGATE editor with server
  – Load plug-ins from server
    → Easier update distribution
  – Upload documents directly after OCR
  – Host documents at server throughout markup
    → Users can share markup work (experts do LSIDs, etc.)
    → Treatments available in search portal as soon as marked up
  – Auto-distribute documents to different storage locations
  – Run automated markup generation on server side
  – Get corrections from community via online feedback forms
• Other extensions of GoldenGATE editor
  – Simplified, more flexible plug-in architecture
  – Extensible user interface
The GoldenGATE Editor V3
• Plug-in GUI extensions (hideable)
• Simplified, more flexible architecture
• Pre-OCR page images for correcting OCR errors
• Document navigator for finding stuff more quickly