IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Workflow Development for OCR (and beyond)
Clemens Neudecker, KB National Library of the Netherlands
Creating and Communicating Digital Content Conference
Umeå, 26 May 2011
IMPACT – Improving access to text
Funded by the EC as part of the 7th Framework Programme
Coordinated by KB – National Library of the Netherlands
EU funding: € 12 100 000
26 partners: Libraries, Research Institutes, Industry Partners
Start date: 1 January 2008
Duration: 48 Months
2012: Centre of Competence
Project website: www.impact-project.eu
IMPACT blog: http://impactocr.wordpress.com/
Twitter: @impactocr, #impactproject
Join us on LinkedIn!
A familiar scene?
VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S' aö'Jifeert mo?üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te /
sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met
beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
OCR: A multitude of challenges…
I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
OCR: A multitude of challenges…
II. Language challenges (spelling variants, inflection, and many more!)
Example: historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
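To illustrate why such variants matter for search, here is a minimal sketch of a lexicon lookup that maps historical spellings to the modern form. The dictionary holds only a handful of the variants listed above, and the names `LEXICON`, `normalise` and `normalise_query` are hypothetical; the structure of the actual IMPACT lexica is not described on this slide.

```python
# Toy lexicon: a few historical spellings from the list above,
# all mapped to the modern Dutch form 'wereld'. Illustrative only.
VARIANTS = "werelt weerelt wereld weerelds wereldt waereld vvereld werlt weirelt".split()
LEXICON = {variant: "wereld" for variant in VARIANTS}


def normalise(token: str) -> str:
    """Return the modern form for a known variant, else the token unchanged."""
    return LEXICON.get(token.lower(), token)


def normalise_query(text: str) -> list[str]:
    """Normalise every whitespace-separated token in a query string."""
    return [normalise(token) for token in text.split()]
```

With a lookup of this kind, a search for 'wereld' can also match pages where the OCR text reads 'vvereld' or 'waereld'.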
And a multitude of solutions!
22 different ‘tools’ from diverse work packages (WPs) and developers:
OCR (C++, C#), Image Processing & Lexica (DLL), Command Line Tools (Win/Linux), Java, Ruby, PHP, Perl, etc.
+ 3rd party software!
“One ring to rule them all...”
IMPACT Interoperability Framework (IIF)
Requirement: Interoperability Framework
Interoperability vs. integration
Web based vs. local installation/platform
Most important: flexible, scalable, user friendly
Java 6
Apache Axis2
Apache Tomcat
Apache Synapse (optional)
Taverna Workflow Engine
Generic Web Service Wrapper
Only requirement: Command Line Application
HTML form
Available on OPFlabs: https://github.com/openplanets/scape/tree/master/xa-toolwrapper
Minimise integration effort: developers can focus on their application and have to worry less about integration = higher quality software
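The wrapping idea can be sketched roughly as follows: a generic function writes the incoming data to a file, fills the file paths into a command template, runs the tool and returns its output. This is not the toolwrapper from the repository above, only an illustration of the principle; `run_wrapped_tool`, the `{input}`/`{output}` placeholders and `DEMO_TOOL` are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_wrapped_tool(command_template: list[str], input_bytes: bytes) -> bytes:
    """Run a command line tool on the given data and return what it wrote.

    Each part of command_template may contain the placeholders
    {input} and {output}, which are replaced by temporary file paths.
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "input"
        out_path = Path(tmp) / "output"
        in_path.write_bytes(input_bytes)
        command = [part.format(input=in_path, output=out_path)
                   for part in command_template]
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(command, check=True, capture_output=True)
        return out_path.read_bytes()


# Stand-in "tool" (hypothetical): a Python one-liner that upper-cases a file.
DEMO_TOOL = [
    sys.executable, "-c",
    "import sys; open(sys.argv[2], 'wb').write(open(sys.argv[1], 'rb').read().upper())",
    "{input}", "{output}",
]
```

Calling `run_wrapped_tool(DEMO_TOOL, b"abc")` runs the stand-in tool; any real command line application could be substituted by changing only the template, which is the point of a generic wrapper.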
Service Oriented Architecture
Java as programming language = platform independence
Standard Apache components = easy to maintain, well supported
Synapse as enterprise service bus = load balancing & fail over
HTTPS encryption & authentication = secure
Minimise deployment effort: scalability, hot deployment/update
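As a rough illustration of what fail over means in practice, the sketch below tries a list of service endpoints in order and returns the first successful answer. In the architecture above this is handled by the service bus, not by client code; the function `call_with_failover` and the callable endpoints are hypothetical stand-ins.

```python
from typing import Callable, Sequence


def call_with_failover(endpoints: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to each endpoint in turn until one succeeds."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint(request)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} endpoints failed")
```

The caller sees a single service; which node actually answered stays hidden, which is what allows nodes to be added or updated without interrupting users.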
Workflow development
OCR workflow = data pipeline
Building blocks = processing steps (nodes)
Integration = interaction between nodes (mashup)
Maximise usability
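The pipeline view can be sketched in a few lines: each node is a function, and a workflow is the chain of nodes applied in order. The node names below (`binarise`, `segment`, `recognise`) are hypothetical placeholders that only append a label to a string; they do not correspond to specific IMPACT tools.

```python
from functools import reduce
from typing import Callable

Node = Callable[[str], str]


def make_workflow(*nodes: Node) -> Node:
    """Chain processing steps so the output of one node feeds the next."""
    def workflow(data: str) -> str:
        return reduce(lambda value, node: node(value), nodes, data)
    return workflow


# Placeholder nodes: each one just records that it has been applied.
def binarise(page: str) -> str:
    return page + " > binarised"


def segment(page: str) -> str:
    return page + " > segmented"


def recognise(page: str) -> str:
    return page + " > recognised"


ocr_workflow = make_workflow(binarise, segment, recognise)
```

Swapping one node for another, or inserting an extra step, yields a new workflow without touching the remaining nodes.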
Workflow management
Web 2.0 style registry: myExperiment
Local client: Taverna Workbench
Web client: project website
Workflow registry
Share resources and experience
Rate/tag/comment workflows
Organised in groups
Workflow modules
“Basic” workflows = each wraps exactly one software tool/web service
Documented inputs/outputs
Complex workflows
Tool/data pipeline
Easily derived from workflow modules
Task/goal oriented
Reusable
Local client: Taverna Workbench
http://www.taverna.org.uk/
Background: BioSciences
Developed and maintained by myGrid, UK
Available for Windows/Linux/OSX and as open source
Funding secured until 2014
Web client: Taverna Server/Workflow Parser
SOAP/REST API
Remote execution of workflows (webapp)
Use case: Workflows for Evaluation
Tool A vs Tool B (Tool A(v1) vs Tool A(v2))
Workflow X (Tool A + Tool B) vs Workflow Y (Tool A + Tool C)
Workflow X vs previously digitised material
Users identify optimal workflow for source material/project
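One way such a comparison could work is sketched below: the output of each tool or workflow is scored against a ground-truth transcription, and the best-scoring candidate is selected. The slides do not say which metric is used; character accuracy based on edit distance is an assumption here, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground-truth characters reproduced correctly, clipped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


def best_candidate(outputs: dict[str, str], ground_truth: str) -> str:
    """Name of the tool or workflow whose output scores highest."""
    return max(outputs, key=lambda name: character_accuracy(outputs[name], ground_truth))
```

The same scoring applies whether the candidates are two tools, two versions of one tool, or two complete workflows, since only their text output is compared.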
Other examples
Workflows for Digitisation: IMPACT
Workflows for Linguistic Analysis: CLARIN
Workflows for Preservation: SCAPE
Interface for automatic storage of results, based on DAV, realised as a workflow module (native beanshell support)
And there are many more…
Benefits & Outlook
Modular
Transparent
Expandable
Scalable
Platform independent
User friendly
Growing interest in workflow management in the cultural heritage (CH) sector
Easy to set up, deploy, free (open source)
Domain independent
Thank you! Questions?