OCRFeeder, documents conversion on GNOME

DESCRIPTION

The presentation of OCRFeeder for the GNOME track in FOSDEM 2010.

Page 1: OCRFeeder, documents conversion on GNOME

static void
_f_do_barnacle_install_properties (GObjectClass *gobject_class)
{
  GParamSpec *pspec;

  /* Party code attribute */
  pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE,
                               "Barnacle code.",
                               "Barnacle code",
                               0,
                               G_MAXUINT64,
                               G_MAXUINT64 /* default value */,
                               G_PARAM_READABLE | G_PARAM_WRITABLE |
                               G_PARAM_PRIVATE);
  g_object_class_install_property (gobject_class,
                                   F_DO_BARNACLE_PROP_CODE,

Joaquim Rocha [email protected]

OCRFeeder

Documents conversion on GNOME

FOSDEM 2010

Page 2: OCRFeeder, documents conversion on GNOME

Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010

What is it?

Document Analysis and Optical Character Recognition

for GNOME

Page 3: OCRFeeder, documents conversion on GNOME

Why is it?

Paper has a number of problems

No applications for GNU/Linux to do a fair job

Page 4: OCRFeeder, documents conversion on GNOME

Paper problems: Security

CC Photo by: http://www.flickr.com/photos/badwsky/

Page 5: OCRFeeder, documents conversion on GNOME

Paper problems: Preservation

CC Photo by: http://www.flickr.com/photos/98469445@N00/

Page 6: OCRFeeder, documents conversion on GNOME

Paper problems: Data processing

CC Photo by: http://www.flickr.com/photos/hugovk/

Page 7: OCRFeeder, documents conversion on GNOME

Paper problems: Ecology

CC Photo by: http://www.flickr.com/photos/pranavsingh/

Page 8: OCRFeeder, documents conversion on GNOME

No fair conversion apps for GNU/Linux

apart from OCR engines, but...

Page 9: OCRFeeder, documents conversion on GNOME

OCR != Document Conversion

(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)

Page 10: OCRFeeder, documents conversion on GNOME

what you want is

Document Analysis and Recognition

(conversion of documents to an electronic format)

(first projects in the 80s)

Page 11: OCRFeeder, documents conversion on GNOME

Where were we at?

* Some closed solutions
* Only for proprietary systems
* Various prices
* Still... arguable results

Page 12: OCRFeeder, documents conversion on GNOME

How?

Page 13: OCRFeeder, documents conversion on GNOME

How

Page 14: OCRFeeder, documents conversion on GNOME

Base concept:

1. Clip the contents
2. Classify them
2.1. They are graphics → Paste on document
2.2. They are text → Calculate letter size; paste on document

Page 15: OCRFeeder, documents conversion on GNOME

So many layouts...

CC Photo by: http://www.flickr.com/photos/uber-tuber/

Page 16: OCRFeeder, documents conversion on GNOME

Layouts vary with the type of document

What works for detecting one layout won't work for others

Page 17: OCRFeeder, documents conversion on GNOME

OCRFeeder focuses on contents, not on layouts!

Page 18: OCRFeeder, documents conversion on GNOME

Key concept:

If a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents

Page 19: OCRFeeder, documents conversion on GNOME

Sliding Window Algorithm:

1. An N×N pixel window runs through the document, top to bottom and left to right

Page 20: OCRFeeder, documents conversion on GNOME

Sliding Window Algorithm:

2. For every iteration, if there's a pixel inside the window which contrasts with the background, then the window gets a 1, otherwise it gets a 0

Page 21: OCRFeeder, documents conversion on GNOME

Page 22: OCRFeeder, documents conversion on GNOME

Sliding Window Algorithm:

It does not check all the pixels, so performance is better
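
Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not OCRFeeder's actual code: the image is assumed to be a list of rows of grayscale values, and `window_has_content`, `mark_windows`, `background`, `tolerance` and `stride` are hypothetical names. Returning at the first contrasting pixel, and optionally sampling only every `stride`-th pixel, is one possible way to avoid checking all the pixels.

```python
def window_has_content(pixels, top, left, n, background=255, tolerance=50, stride=1):
    """Return 1 if a sampled pixel in the n x n window contrasts with the background."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    for y in range(top, min(top + n, height), stride):
        for x in range(left, min(left + n, width), stride):
            if abs(pixels[y][x] - background) > tolerance:
                return 1  # stop early: the remaining pixels are not checked
    return 0


def mark_windows(pixels, n, **options):
    """Slide an n x n window over the image and return a grid of 1s and 0s."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    return [
        [window_has_content(pixels, top, left, n, **options)
         for left in range(0, width, n)]
        for top in range(0, height, n)
    ]


# A 4x4 white page with one dark pixel in the top-right corner (assumed data).
page = [[255] * 4 for _ in range(4)]
page[0][3] = 0
grid = mark_windows(page, 2)
```

With a window size of 2, only the top-right window of this page is marked as content.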

Page 23: OCRFeeder, documents conversion on GNOME

Sliding Window Algorithm:

3. After all windows have a value assigned, the ones with the value 1 are grouped

Page 24: OCRFeeder, documents conversion on GNOME

Sliding Window Algorithm:

4. Every time a set of 1s is grouped, each of its windows is reassigned the value 0
(these groups are called blocks)

Page 25: OCRFeeder, documents conversion on GNOME

Sliding Window Algorithm:

5. When all windows have the value 0, the algorithm has reached the end
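
Steps 3 to 5 can be sketched as a flood fill over the grid of window values. This is only an illustration of the idea: the function name `group_blocks`, the use of 8-neighbour adjacency, and the representation of a block as a bounding box `(left, top, right, bottom)` in window coordinates are all assumptions, not details taken from OCRFeeder.

```python
def group_blocks(grid):
    """Group adjacent 1-windows into blocks, resetting each grouped window to 0."""
    grid = [row[:] for row in grid]  # work on a copy of the caller's grid
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1:
                continue
            # Start a new block and flood-fill its neighbours.
            stack = [(r, c)]
            grid[r][c] = 0
            top, left, bottom, right = r, c, r, c
            while stack:
                y, x = stack.pop()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] == 1:
                            grid[ny][nx] = 0  # step 4: grouped windows become 0
                            stack.append((ny, nx))
            blocks.append((left, top, right, bottom))
    # Step 5: every window is now 0, so the algorithm has ended.
    return blocks


blocks = group_blocks([
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
])
```

Here the three connected windows in the top-left corner form one block and the isolated window in the bottom-right corner forms another.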

Page 26: OCRFeeder, documents conversion on GNOME

Block structure:

Page 27: OCRFeeder, documents conversion on GNOME

Joining Blocks:

Blocks are checked against each other and joined when appropriate

When no blocks can be joined, the analysis part is finished
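
The joining loop can be sketched as below. The slide does not say what "appropriate" means, so the criterion used here (bounding boxes that overlap or lie within `margin` units of each other) is purely an assumption, as are the names `boxes_touch` and `join_blocks` and the `(left, top, right, bottom)` block representation.

```python
def boxes_touch(a, b, margin=0):
    """Assumed criterion: the boxes overlap or are at most `margin` apart."""
    return (a[0] - margin <= b[2] and b[0] - margin <= a[2] and
            a[1] - margin <= b[3] and b[1] - margin <= a[3])


def join_blocks(blocks, margin=0):
    """Repeatedly merge pairs of blocks until no pair can be joined."""
    blocks = list(blocks)
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                a, b = blocks[i], blocks[j]
                if boxes_touch(a, b, margin):
                    # Replace the pair with the box that encloses both.
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break  # restart the comparison with the updated list
    return blocks


joined = join_blocks([(0, 0, 2, 2), (2, 1, 4, 3), (10, 10, 11, 11)])
```

The first two boxes overlap and are merged, while the third stays separate unless a large enough `margin` is given.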

Page 28: OCRFeeder, documents conversion on GNOME

Recognition:

System-wide OCR engines are used

Engines are configured from the GUI or XML files

Page 29: OCRFeeder, documents conversion on GNOME

Engine configuration:

<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
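
As an illustration of how such a file could be consumed, the sketch below parses the engine description with Python's standard library and expands the `$IMAGE` and `$FILE` placeholders. How OCRFeeder itself loads the file and performs the substitution is not shown on the slide; `load_engine`, `build_command` and the example paths are hypothetical, and `string.Template` is simply one convenient way to expand `$NAME` placeholders.

```python
import xml.etree.ElementTree as ET
from string import Template

ENGINE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
"""


def load_engine(xml_bytes):
    """Return the engine settings as a dict keyed by element name."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image_path, output_base):
    """Expand $IMAGE and $FILE in the arguments and prepend the engine path."""
    arguments = Template(engine["arguments"]).substitute(
        IMAGE=image_path, FILE=output_base)
    return engine["engine_path"] + " " + arguments


engine = load_engine(ENGINE_XML)
# Hypothetical paths, used only to show the expansion.
command = build_command(engine, "/tmp/page.tiff", "/tmp/out")
```

The resulting string is a shell command line; real code would also need to quote paths that contain spaces or shell metacharacters.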

Page 30: OCRFeeder, documents conversion on GNOME

Classification:

It is graphics if:

* Text is empty

* More than 50% of the chars are failure chars, punctuation or spaces
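
The two rules above can be sketched as a small predicate. The set of "failure chars" depends on the OCR engine and is not given on the slide, so the `failure_chars` parameter is left to the caller; the function name `is_graphics` and the use of `string.punctuation` as the punctuation set are also assumptions.

```python
import string


def is_graphics(text, failure_chars="", threshold=0.5):
    """Classify recognised text as graphics when it is empty or mostly noise."""
    if not text:
        return True  # rule 1: the engine returned no text
    noise = sum(
        1 for ch in text
        if ch in failure_chars or ch in string.punctuation or ch.isspace()
    )
    # Rule 2: more than half of the characters are noise.
    return noise / len(text) > threshold


examples = {
    "": is_graphics(""),
    "Hello world": is_graphics("Hello world"),
    ". , ; a": is_graphics(". , ; a"),
}
```

A block that yields ordinary words is kept as text, while one that yields mostly punctuation and spaces is treated as an image.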

Page 31: OCRFeeder, documents conversion on GNOME

Font Size Detection:

“Measures” in pixels the size of each text line

Although it results in different sizes...

Page 32: OCRFeeder, documents conversion on GNOME

Font Size Detection:

The value equal to or greater than the average is chosen

(results in values equal to or close to the original font size)
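
One possible reading of this rule, sketched in Python: among the measured line heights, take the smallest one that is not below the average. The slide does not say which qualifying value is picked, so that choice, like the name `pick_line_height`, is an assumption.

```python
def pick_line_height(heights):
    """Return the smallest measured height that is >= the average height."""
    if not heights:
        raise ValueError("no text lines were measured")
    mean = sum(heights) / len(heights)
    # Fall back to the largest height if rounding leaves no candidate.
    return min((h for h in heights if h >= mean), default=max(heights))


# Hypothetical pixel heights measured for five text lines.
chosen = pick_line_height([40, 41, 42, 42, 43])
```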

Page 33: OCRFeeder, documents conversion on GNOME

Font Size Detection:

The font size is calculated in inches using the resolution (DPI)

(if there's no resolution info, assume 300 DPI)

The value is then converted to points by dividing it by the size of a DTP (DeskTop Publishing) point, 1/72 inch (that is, multiplying by 72)
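
As a worked example of the conversion: a DTP point is 1/72 inch, so a height in pixels becomes inches when divided by the DPI, and points when that result is multiplied by 72. The function name and constants below are illustrative, not OCRFeeder's own.

```python
POINTS_PER_INCH = 72  # one DTP point is 1/72 inch
DEFAULT_DPI = 300     # assumed when the image carries no resolution info


def font_size_in_points(height_px, dpi=None):
    """Convert a text height in pixels to a font size in points."""
    dpi = dpi or DEFAULT_DPI
    return height_px * POINTS_PER_INCH / dpi


size_default = font_size_in_points(50)       # 50 px at the assumed 300 DPI
size_150dpi = font_size_in_points(50, 150)   # same height at 150 DPI
```

A 50-pixel line at 300 DPI is 1/6 inch tall, which corresponds to a 12-point font; at 150 DPI the same pixel height corresponds to 24 points.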

Page 34: OCRFeeder, documents conversion on GNOME

Exportation to ODT:

Uses ODFPy

(abstracts ODF creation)
(just above XML)

Page 35: OCRFeeder, documents conversion on GNOME

User interaction:

Users can edit everything and review the algorithm's results

So the UI can work in attended and unattended ways

The CLI only works in unattended mode

Page 36: OCRFeeder, documents conversion on GNOME

Nuance OmniPage test

ABBYY FineReader test

Page 37: OCRFeeder, documents conversion on GNOME

OmniPage's results

FineReader's results

Page 38: OCRFeeder, documents conversion on GNOME

Demo time!

Page 39: OCRFeeder, documents conversion on GNOME

Other features:

* PDF importation
* Unpaper preprocessor
* Font style edition
* Exportation to HTML
* Project saving/loading

Page 40: OCRFeeder, documents conversion on GNOME

Future:

* Integrate OCRopus as an alternative analysis backend
* More exportation formats: hOCR, txt, PDF
* Improved a11y
* Better integration with GNOME and other GNOME apps

Page 41: OCRFeeder, documents conversion on GNOME

GNOME:

Development moved to GNOME's infrastructure last month

Page 42: OCRFeeder, documents conversion on GNOME

Webpage: http://live.gnome.org/OCRFeeder

git: http://git.gnome.org/ocrfeeder

Bugzilla: coming soon...

Page 43: OCRFeeder, documents conversion on GNOME

Thank you!