OCRFeeder (FOSDEM 2010)

DESCRIPTION

By Joaquim Rocha. OCRFeeder is a document layout analysis and optical character recognition system that I wrote for my Master's Thesis project. As its website says, given a set of images it automatically outlines their contents, distinguishes between graphics and text, and performs OCR on the latter. It generates multiple formats, the main one being ODT. I think this is currently the most complete and user-friendly OCR application for GNU/Linux and, of course, I wrote it to be used mainly with GNOME, featuring a GUI written in PyGTK that respects, as far as I could, the GNOME User Interface Guidelines. I would like to present how the application works on the inside, for example the page segmentation algorithm I created for it. I think this would be of interest to the GNOME community and to general attendees of the GNOME Dev room at FOSDEM.

Page 1: OCRFeeder (FOSDEM 2010)

static void
_f_do_barnacle_install_properties (GObjectClass *gobject_class)
{
  GParamSpec *pspec;

  /* Party code attribute */
  pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE,
                               "Barnacle code.",
                               "Barnacle code",
                               0,
                               G_MAXUINT64,
                               G_MAXUINT64 /* default value */,
                               G_PARAM_READABLE | G_PARAM_WRITABLE |
                               G_PARAM_PRIVATE);
  g_object_class_install_property (gobject_class,
                                   F_DO_BARNACLE_PROP_CODE,

Joaquim Rocha

OCRFeeder

Documents conversion on GNOME

FOSDEM 2010

Page 2: OCRFeeder (FOSDEM 2010)

Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010

What is it?

Document Analysis and Optical Character Recognition

for GNOME

Page 3: OCRFeeder (FOSDEM 2010)

Why is it?

Paper has a number of problems

No applications for GNU/Linux to do a fair job

Page 4: OCRFeeder (FOSDEM 2010)

Paper problems: Security

CC Photo by: http://www.flickr.com/photos/badwsky/

Page 5: OCRFeeder (FOSDEM 2010)

Paper problems: Preservation

CC Photo by: http://www.flickr.com/photos/98469445@N00/

Page 6: OCRFeeder (FOSDEM 2010)

Paper problems: Data processing

CC Photo by: http://www.flickr.com/photos/hugovk/

Page 7: OCRFeeder (FOSDEM 2010)

Paper problems: Ecology

CC Photo by: http://www.flickr.com/photos/pranavsingh/

Page 8: OCRFeeder (FOSDEM 2010)

No fair conversion apps for GNU/Linux

apart from OCR engines, but...

Page 9: OCRFeeder (FOSDEM 2010)

OCR != Document Conversion

(it only deals with chars)

(does not consider the layout)

(does not distinguish contents)

Page 10: OCRFeeder (FOSDEM 2010)

what you want is

Document Analysis and Recognition

(conversion of documents to an electronic format)

(first projects in the 80s)

Page 11: OCRFeeder (FOSDEM 2010)

Where were we at?

* Some closed solutions

* Only for proprietary systems

* Various prices

* still... arguable results

Page 12: OCRFeeder (FOSDEM 2010)

How?

Page 13: OCRFeeder (FOSDEM 2010)

How

Page 14: OCRFeeder (FOSDEM 2010)

Base concept:

1. Clip the contents

2. Classify them

2.1. They are graphics → Paste on document

2.2. They are text → Calculate letter size; paste on document

Page 15: OCRFeeder (FOSDEM 2010)

So many layouts...

CC Photo by: http://www.flickr.com/photos/uber-tuber/

Page 16: OCRFeeder (FOSDEM 2010)

Layouts vary with the type of document

What works for detecting one won't work for others

Page 17: OCRFeeder (FOSDEM 2010)

OCRFeeder focuses on contents, not on layouts!

Page 18: OCRFeeder (FOSDEM 2010)

Key concept:

If a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents

Page 19: OCRFeeder (FOSDEM 2010)

Sliding Window Algorithm:

1. An NxN pixel window runs through the document top to bottom, left to right

Page 20: OCRFeeder (FOSDEM 2010)

Sliding Window Algorithm:

2. For every iteration, if there's a pixel inside the window which contrasts with the background, then the window gets a 1; otherwise it gets a 0
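
Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not OCRFeeder's actual code: the image is assumed to be a plain list of rows of grayscale values (0-255), and the names `window_grid`, `background` and `threshold`, as well as the contrast test itself, are assumptions made for the example.

```python
def window_grid(image, n, background=255, threshold=40):
    """Return a grid of 1s and 0s, one value per n x n window.

    image is a list of rows of grayscale values (0-255); this plain
    representation and the threshold test are assumptions of the sketch.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    grid = []
    for top in range(0, height, n):
        row = []
        for left in range(0, width, n):
            # any() stops at the first pixel that contrasts with the
            # background, so a window with content is not scanned in full.
            has_content = any(
                abs(image[y][x] - background) > threshold
                for y in range(top, min(top + n, height))
                for x in range(left, min(left + n, width))
            )
            row.append(1 if has_content else 0)
        grid.append(row)
    return grid
```

For a 4x4 white image with a single dark pixel in the top-left corner and `n=2`, only the top-left window gets a 1.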

Page 21: OCRFeeder (FOSDEM 2010)

Page 22: OCRFeeder (FOSDEM 2010)

Sliding Window Algorithm:

It does not check all the pixels, so performance is better

Page 23: OCRFeeder (FOSDEM 2010)

Sliding Window Algorithm:

3. After all windows have a value assigned, the ones with the value 1 are grouped

Page 24: OCRFeeder (FOSDEM 2010)

Sliding Window Algorithm:

4. Every time a set of 1s is grouped, each of its windows is reassigned the value 0

(these groups are called blocks)

Page 25: OCRFeeder (FOSDEM 2010)

Sliding Window Algorithm:

5. When all windows have the value 0, the algorithm has reached the end
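
Steps 3 to 5 amount to finding groups of adjacent 1s in a grid of window values. The sketch below is one possible way to do that, under assumptions the slides do not state: the grid is a list of rows of 0s and 1s, only horizontal and vertical neighbours count as adjacent, and a block is reported as a bounding box in window coordinates. The name `group_blocks` is hypothetical.

```python
def group_blocks(grid):
    """Group adjacent 1-windows into blocks.

    Returns one bounding box per block as (left, top, right, bottom) in
    window coordinates. The grid is modified in place: every grouped
    window is reset to 0, so the work is done once no 1s remain.
    """
    blocks = []
    for top, row in enumerate(grid):
        for left, value in enumerate(row):
            if value != 1:
                continue
            grid[top][left] = 0
            pending = [(left, top)]
            x0, y0, x1, y1 = left, top, left, top
            while pending:
                x, y = pending.pop()
                x0, y0 = min(x0, x), min(y0, y)
                x1, y1 = max(x1, x), max(y1, y)
                # Assumption: only horizontal and vertical neighbours
                # count as adjacent (4-connectivity).
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                    if inside and grid[ny][nx] == 1:
                        grid[ny][nx] = 0
                        pending.append((nx, ny))
            blocks.append((x0, y0, x1, y1))
    return blocks
```

After the call, every window in the grid holds 0, which mirrors the end condition in step 5.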

Page 26: OCRFeeder (FOSDEM 2010)

Block structure:

Page 27: OCRFeeder (FOSDEM 2010)

Joining Blocks:

Blocks are checked against each other and joined when appropriate

When no blocks can be joined, the analysis part is finished
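
The slides do not say when joining two blocks is "appropriate". As an illustration only, the sketch below assumes blocks are bounding boxes `(left, top, right, bottom)` and joins two of them when they overlap or lie within a small gap of each other; the names `boxes_are_close`, `join_blocks` and the `gap` parameter are hypothetical. What it does take from the slide is the loop: keep joining until no pair can be joined.

```python
def boxes_are_close(a, b, gap=1):
    """True if boxes (left, top, right, bottom) overlap or are at most gap apart."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2] and
            a[1] - gap <= b[3] and b[1] - gap <= a[3])


def join_blocks(blocks, gap=1):
    """Merge close blocks repeatedly until no pair can be joined."""
    blocks = list(blocks)
    joined = True
    while joined:
        joined = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                a, b = blocks[i], blocks[j]
                if boxes_are_close(a, b, gap):
                    # Replace the pair with the box that encloses both,
                    # then start checking again from the beginning.
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    joined = True
                    break
            if joined:
                break
    return blocks
```

Restarting the scan after every merge is simple rather than fast; it keeps the stopping rule easy to see.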

Page 28: OCRFeeder (FOSDEM 2010)

Recognition:

System-wide OCR engines are used

Engines are configured from the GUI or XML files

Page 29: OCRFeeder (FOSDEM 2010)

Engine configuration:

<?xml version="1.0" encoding="UTF-8"?>
<engine>
  <name>Tesseract</name>
  <image_format>TIFF</image_format>
  <engine_path>/usr/bin/tesseract</engine_path>
  <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
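
To make the role of `$IMAGE` and `$FILE` concrete, here is a small sketch that reads such a configuration and builds the command line. How OCRFeeder itself performs the substitution is not shown in the slides; using `string.Template` and `shlex.quote` is an assumption of this example, and `load_engine` and `build_command` are hypothetical names. The sketch only builds the command string and does not run it.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

ENGINE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<engine>
  <name>Tesseract</name>
  <image_format>TIFF</image_format>
  <engine_path>/usr/bin/tesseract</engine_path>
  <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
"""


def load_engine(xml_text):
    """Return the engine settings as a dict keyed by element name."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image_path, output_base):
    """Fill in $IMAGE and $FILE and prepend the engine path."""
    arguments = Template(engine["arguments"]).substitute(
        IMAGE=shlex.quote(image_path),
        FILE=shlex.quote(output_base),
    )
    return engine["engine_path"] + " " + arguments


engine = load_engine(ENGINE_XML)
command = build_command(engine, "page.tiff", "/tmp/out")
```

With these inputs the command runs the engine on `page.tiff`, prints the resulting `/tmp/out.txt`, and then removes it, so the recognized text arrives on standard output.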

Page 30: OCRFeeder (FOSDEM 2010)

Classification:

It is graphics if:

* Text is empty

* More than 50% of the chars are failure chars, punctuation or spaces
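
The two rules can be written as a short predicate. This is a sketch under assumptions: which characters count as "failure chars" depends on the OCR engine and is not given in the slides, so `failure_chars` and its default value are placeholders, and `is_graphics` is a hypothetical name.

```python
import string


def is_graphics(text, failure_chars="~"):
    """Classify OCR output as graphics (True) or text (False).

    failure_chars stands for whatever characters the OCR engine emits
    when recognition fails; the default is a placeholder assumption.
    """
    if not text.strip():
        # Rule 1: empty OCR output means there was no text to read.
        return True
    noise = set(string.punctuation) | set(string.whitespace) | set(failure_chars)
    noisy = sum(1 for char in text if char in noise)
    # Rule 2: more than half of the characters are noise.
    return noisy / len(text) > 0.5
```

A block whose OCR output is exactly half noise is still treated as text, because the rule asks for more than 50%.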

Page 31: OCRFeeder (FOSDEM 2010)

Font Size Detection:

“Measures” in pixels the size of each text line

Although it results in different sizes...

Page 32: OCRFeeder (FOSDEM 2010)

Font Size Detection:

The value equal to or greater than the average is chosen

(results in values equal to or close to the original font size)
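
The slide leaves open which value is picked when several measurements are at or above the average. One possible reading, shown here purely as an assumption, is to take the smallest measured line height that is not below the average; `choose_line_height` is a hypothetical name and the heights are integer pixel counts.

```python
def choose_line_height(heights):
    """Return the smallest measured height that is >= the average.

    heights are per-line measurements in pixels (integers). Picking the
    smallest qualifying value is one interpretation of the rule.
    """
    if not heights:
        raise ValueError("no text lines were measured")
    average = sum(heights) / len(heights)
    return min(h for h in heights if h >= average)
```

For measurements of 38, 40, 40 and 41 pixels the average is 39.75, so this reading selects 40.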

Page 33: OCRFeeder (FOSDEM 2010)

Font Size Detection:

The font size is calculated in inches using the resolution (DPI)

(if there's no resolution info, assume 300 DPI)

The value is then divided by the DTP (DeskTop Publishing) point, which is 1/72 of an inch
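
As a worked example of the conversion: a DTP point is 1/72 of an inch, so dividing a length in inches by the point size is the same as multiplying it by 72. The sketch below assumes the measured text height is available in pixels; `pixels_to_points` is a hypothetical name, and the 300 DPI fallback is the one given on the slide.

```python
POINTS_PER_INCH = 72  # one DTP point is 1/72 of an inch
DEFAULT_DPI = 300     # fallback when the image carries no resolution info


def pixels_to_points(height_px, dpi=None):
    """Convert a text height in pixels to a font size in points."""
    if not dpi:
        dpi = DEFAULT_DPI
    inches = height_px / dpi
    return inches * POINTS_PER_INCH
```

A line measured at 50 pixels in a 300 DPI scan is 1/6 of an inch tall, which comes out at roughly 12 points.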

Page 34: OCRFeeder (FOSDEM 2010)

Exportation to ODT:

Uses ODFPy

(abstracts ODF creation)

(just above XML)

Page 35: OCRFeeder (FOSDEM 2010)

User interaction:

Users can edit everything and review the algorithm's results

So the UI can work in attended and unattended modes

The CLI only works in unattended mode

Page 36: OCRFeeder (FOSDEM 2010)

Nuance OmniPage test

ABBYY FineReader test

Page 37: OCRFeeder (FOSDEM 2010)

OmniPage's results

FineReader's results

Page 38: OCRFeeder (FOSDEM 2010)

Demo time!

Page 39: OCRFeeder (FOSDEM 2010)

Other features:

* PDF import

* Unpaper preprocessor

* Font style editing

* Export to HTML

* Project saving/loading

Page 40: OCRFeeder (FOSDEM 2010)

Future:

* Integrate OCRopus as an alternative analysis backend

* More export formats: hOCR, txt, PDF

* Improved a11y

* Better integration with GNOME and other GNOME apps

Page 41: OCRFeeder (FOSDEM 2010)

GNOME:

Development moved to GNOME's infrastructure last month

Page 42: OCRFeeder (FOSDEM 2010)

Webpage: http://live.gnome.org/OCRFeeder

git: http://git.gnome.org/ocrfeeder

Bugzilla: coming soon...

Page 43: OCRFeeder (FOSDEM 2010)

Thank you!