IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

IMPACT OCR in a nutshell. Clemens Neudecker, National Library of the Netherlands. IMPACT Demo Day, Biblioteca Nacional de España.

DESCRIPTION

Presented at the "IMPACT demonstration session at the BNE". October. Biblioteca Nacional de España.

Page 1: IMPACT OCR in a nutshell. Clemens Neudecker

IMPACT OCR in a nutshell

Clemens Neudecker, National Library of the Netherlands

IMPACT Demo Day, Biblioteca Nacional de España

Page 2: IMPACT OCR in a nutshell. Clemens Neudecker

OCR Process

Binarisation = transform greyscale or colour images to bitonal (b/w) in order to separate foreground (text) from background

Segmentation = detection of layout elements in hierarchical order (blocks/regions, lines, words, glyphs)

Pattern Matching (Recognition) = matching of character shapes with internal font database (classifiers)
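The recognition step can be illustrated with a toy nearest-template classifier. This is only a sketch of the general idea of matching glyph shapes against stored patterns; the 3x3 templates, the `hamming` distance and the `recognise` function are hypothetical and say nothing about how the FineReader classifiers work internally.

```python
# Hypothetical 3x3 bitonal glyph templates (1 = ink, 0 = background).
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def hamming(a, b):
    """Count the pixels at which two equally sized glyph bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognise(glyph, templates=TEMPLATES):
    """Return the label of the closest template and its distance to the glyph."""
    label = min(templates, key=lambda k: hamming(glyph, templates[k]))
    return label, hamming(glyph, templates[label])


# A "T" whose lowest stem pixel was lost during binarisation.
noisy_t = ((1, 1, 1),
           (0, 1, 0),
           (0, 0, 0))
result = recognise(noisy_t)  # ("T", 1)
```

The damaged glyph is still closest to the "T" template, which is why recognition quality depends so strongly on the binarisation and segmentation steps before it.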

Page 3: IMPACT OCR in a nutshell. Clemens Neudecker

ABBYY FineReader

Main OCR technology provider in IMPACT

OCR technology experts for 30 years

IMPACT uses FineReader Engine (SDK)

Page 4: IMPACT OCR in a nutshell. Clemens Neudecker

Binarisation

Page 5: IMPACT OCR in a nutshell. Clemens Neudecker

Adaptive Binarisation

Original scan, Previous binarisation, New binarisation
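To make the difference between a fixed and an adaptive threshold concrete, here is a minimal pure-Python sketch. It uses a simple local-mean rule, which is an assumption chosen for illustration and not the method developed in IMPACT; the function names, the `radius` and `offset` parameters and the sample strip are all hypothetical.

```python
def global_binarise(img, threshold=128):
    """One fixed threshold for the whole page: 1 = ink (dark), 0 = background."""
    return [[1 if p < threshold else 0 for p in row] for row in img]


def adaptive_binarise(img, radius=1, offset=20):
    """Compare each pixel with the mean grey value of its local neighbourhood."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the (2*radius+1)^2 window, clipped at the image border.
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local_mean = sum(window) / len(window)
            row.append(1 if img[y][x] < local_mean - offset else 0)
        out.append(row)
    return out


# Unevenly lit strip: bright left half (background 200, ink 150),
# dark right half (background 100, ink 40). One vertical stroke per half.
page = [[200, 150, 200, 100, 40, 100] for _ in range(3)]

fixed = global_binarise(page)       # each row: [0, 0, 0, 1, 1, 1]
adaptive = adaptive_binarise(page)  # each row: [0, 1, 0, 0, 1, 0]
```

On this strip the fixed threshold loses the stroke in the bright half and turns the whole dark half into ink, while the local rule recovers both strokes.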

Page 6: IMPACT OCR in a nutshell. Clemens Neudecker

IMPACT Binarisation

Original, State of the Art, IMPACT

Page 7: IMPACT OCR in a nutshell. Clemens Neudecker

Segmentation

Blocks/Regions, Words, Glyphs
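The hierarchical idea behind segmentation can be sketched with projection profiles: first find text lines as runs of rows that contain ink, then find glyphs inside each line as runs of columns that contain ink. This is a textbook simplification assumed here for illustration; it does not handle blocks, skew or touching characters, and the function names and sample bitmap are hypothetical.

```python
def runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries; end is exclusive."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(img):
    """Text lines are runs of rows whose ink count is non-zero."""
    return runs([sum(row) for row in img])


def segment_glyphs(img, line):
    """Glyphs are runs of columns with ink, restricted to one text line."""
    top, bottom = line
    columns = range(len(img[0]))
    return runs([sum(img[y][x] for y in range(top, bottom)) for x in columns])


# Bitonal sample: 1 = ink. Two text lines separated by an empty row.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

lines = segment_lines(page)                        # [(1, 3), (4, 5)]
glyphs = [segment_glyphs(page, l) for l in lines]  # [[(0, 2), (3, 4), (6, 7)], [(1, 3)]]
```

Words could be derived from the glyph spans by grouping glyphs whose gaps fall below a threshold; that step is left out to keep the sketch short.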

Page 8: IMPACT OCR in a nutshell. Clemens Neudecker

IMPACT Segmentation example

Pre-IMPACT FR Engine 9, FR Engine 10

Part of column was misclassified as image

Page 9: IMPACT OCR in a nutshell. Clemens Neudecker

IMPACT Segmentation example

v. 9, v. 10

Linear word order errors

Page 10: IMPACT OCR in a nutshell. Clemens Neudecker

IMPACT Segmentation example

v. 9, v. 10

Lost text

Page 11: IMPACT OCR in a nutshell. Clemens Neudecker

Fraktur recognition

Page 12: IMPACT OCR in a nutshell. Clemens Neudecker

Languages and Dictionaries

Goal:
• Develop an interface so that external dictionaries can be integrated into the FineReader Engine

2008 - 2009:
• External Dictionary beta interface
• Same quality as with internal dictionaries possible

2010 - 2011:
• Make interface work reliably
• Teach partners how to use it
• Support for any language, any time period
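Why an external word list helps with historical text can be shown with a small sketch. Note the assumptions: this is a post-correction step run after recognition, whereas the interface described above feeds dictionaries into the engine itself; none of the names below belong to the FineReader Engine API, and the three-word lexicon is a made-up sample.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def correct(token, lexicon, max_distance=1):
    """Replace a token by its nearest lexicon word if that word is close enough."""
    if token in lexicon:
        return token
    # Sorting first makes the choice deterministic when distances tie.
    best = min(sorted(lexicon), key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_distance else token


# Hypothetical external lexicon with historical spellings.
lexicon = {"ende", "het", "boek"}
ocr_tokens = ["cnde", "hct", "bock", "xyz"]
corrected = [correct(t, lexicon) for t in ocr_tokens]
# corrected == ["ende", "het", "boek", "xyz"]
```

Tokens with no close lexicon entry are left untouched, so a lexicon that matches the language and period of the material matters more than its size.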

Page 13: IMPACT OCR in a nutshell. Clemens Neudecker

ALTO: New native export format

Available since FRE 10 R2

Supports most recent schema: ALTO v. 2.0

Line coordinates available
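As an illustration of what line coordinates in ALTO make possible, the sketch below reads text lines and their bounding boxes with Python's standard library. The embedded fragment is a hand-written, hypothetical sample that uses the ALTO v2 namespace and element names; it is not FineReader output and omits parts a complete ALTO file would contain.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hypothetical, trimmed-down ALTO v2 fragment with a single text line.
SAMPLE = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1" HPOS="100" VPOS="200" WIDTH="900" HEIGHT="40">
            <String CONTENT="IMPACT" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="OCR" HPOS="420" VPOS="200" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def text_lines(alto_xml):
    """Return (text, (hpos, vpos, width, height)) for every TextLine element."""
    root = ET.fromstring(alto_xml)
    result = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT") for s in line.iterfind("alto:String", ALTO_NS)]
        box = tuple(float(line.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        result.append((" ".join(words), box))
    return result


lines = text_lines(SAMPLE)
# lines == [("IMPACT OCR", (100.0, 200.0, 900.0, 40.0))]
```

With the box of each line available, a viewer can highlight search hits on the page image without re-running layout analysis.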

Page 14: IMPACT OCR in a nutshell. Clemens Neudecker

Thank you! Questions?