DESCRIPTION
By Joaquim Rocha. OCRFeeder is a document layout analysis and optical character recognition system that I wrote for my Master's thesis project. As its website says, given a set of images it automatically outlines their contents, distinguishes between graphics and text, and performs OCR over the latter. It exports to several formats, the main one being ODT. I think it is currently the most complete and user-friendly OCR application for GNU/Linux and, of course, I wrote it to be used mainly with GNOME: it features a GUI written in PyGTK and respects, as far as I could, the GNOME User Interface Guidelines. I would like to present how the application works on the inside, for example the page segmentation algorithm I created for it. I think this would be of interest to the GNOME community and to general attendees of the GNOME Dev room at FOSDEM.
static void
_f_do_barnacle_install_properties (GObjectClass *gobject_class)
{
  GParamSpec *pspec;

  /* Party code attribute */
  pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE,
                               "Barnacle code.",
                               "Barnacle code",
                               0,
                               G_MAXUINT64,
                               G_MAXUINT64 /* default value */,
                               G_PARAM_READABLE | G_PARAM_WRITABLE |
                               G_PARAM_PRIVATE);
  g_object_class_install_property (gobject_class,
                                   F_DO_BARNACLE_PROP_CODE,
Joaquim Rocha
OCRFeeder
Document conversion on GNOME
FOSDEM 2010
Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010
What is it?
Document Analysis and Optical Character Recognition
for GNOME
Why is it?
Paper has a number of problems
No GNU/Linux applications do a fair job
Paper problems: Security
CC Photo by: http://www.flickr.com/photos/badwsky/
Paper problems: Preservation
CC Photo by: http://www.flickr.com/photos/98469445@N00/
Paper problems: Data processing
CC Photo by: http://www.flickr.com/photos/hugovk/
Paper problems: Ecology
CC Photo by: http://www.flickr.com/photos/pranavsingh/
No fair conversion apps for GNU/Linux
apart from OCR engines, but...
OCR != Document Conversion
(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)
what you want is
Document Analysis and Recognition
(conversion of documents to an electronic format)
(first projects in the 80s)
Where were we at?
* Some closed solutions
* Only for proprietary systems
* Various prices
* still... arguable results
How?
Base concept:
1. Clip the contents
2. Classify them
2.1. They are graphics → Paste on document
2.2. They are text → Calculate letter size; paste on document
So many layouts...
CC Photo by: http://www.flickr.com/photos/uber-tuber/
Layouts vary with the type of document
What works for detecting one won't work for others
OCRFeeder focuses on contents, not on layouts!
Key concept:
If a document image can be divided into windows of 1 (content)
or 0 (not content), then it is possible to group all the
1s and outline the contents
Sliding Window Algorithm:
1. An NxN pixel window runs through the document, top to bottom, left to right
2. For every iteration, if there's a pixel inside the window which contrasts with the background,
then the window gets a 1, otherwise it gets a 0
It does not have to check all the pixels, which improves performance
3. After all windows have a value assigned,
the ones with the value 1 are grouped
4. Every time a set of 1s is grouped (these groups are called blocks),
each of its windows is reassigned the value 0
5. When all windows have the value 0, the algorithm
has reached the end
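The five steps above can be sketched in Python. This is only an illustration under stated assumptions, not OCRFeeder's actual code: the image is taken to be a grayscale list of pixel rows, the background is assumed to be white (255), "contrasts" is assumed to mean differing by more than a tolerance, and neighbouring windows are grouped with 8-connectivity. The names window_grid and group_blocks are hypothetical.

```python
from collections import deque


def window_grid(image, n, background=255, tolerance=32):
    """Steps 1-2: give each n x n window a 1 if it holds a contrasting pixel."""
    rows, cols = len(image), len(image[0])
    grid = []
    for top in range(0, rows, n):
        grid_row = []
        for left in range(0, cols, n):
            # any() stops at the first contrasting pixel, so a window
            # with content is rarely scanned in full.
            has_content = any(
                abs(image[y][x] - background) > tolerance
                for y in range(top, min(top + n, rows))
                for x in range(left, min(left + n, cols))
            )
            grid_row.append(1 if has_content else 0)
        grid.append(grid_row)
    return grid


def group_blocks(grid):
    """Steps 3-5: group adjacent 1-windows into blocks.

    Each block is a bounding box (left, top, right, bottom) in window units.
    Windows are reset to 0 as they are grouped, so the scan ends once every
    window holds 0.
    """
    grid = [row[:] for row in grid]  # work on a copy
    height, width = len(grid), len(grid[0])
    blocks = []
    for r in range(height):
        for c in range(width):
            if grid[r][c] != 1:
                continue
            grid[r][c] = 0
            queue = deque([(r, c)])
            left, top, right, bottom = c, r, c, r
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < height and 0 <= nx < width and grid[ny][nx] == 1:
                            grid[ny][nx] = 0
                            queue.append((ny, nx))
            blocks.append((left, top, right, bottom))
    return blocks
```

Multiplying a block's coordinates by the window size N gives its outline in pixels. A larger N means fewer windows to process but coarser outlines.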
Block structure:
Joining Blocks:
Blocks are checked against each other and joined when appropriate
When no more blocks can be joined, the analysis part is finished
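The slides do not say what makes joining two blocks "appropriate". The sketch below assumes one possible rule: two blocks are joined when their bounding boxes overlap or lie within a small gap of each other. The names boxes_touch and join_blocks and the gap parameter are hypothetical, chosen only to illustrate the repeat-until-nothing-changes loop.

```python
def boxes_touch(a, b, gap=0):
    """True if boxes (left, top, right, bottom) overlap or are within `gap`."""
    return (
        a[0] - gap <= b[2] and b[0] - gap <= a[2]
        and a[1] - gap <= b[3] and b[1] - gap <= a[3]
    )


def join_blocks(blocks, gap=0):
    """Merge blocks pairwise until no pair can be joined any more."""
    blocks = list(blocks)
    joined = True
    while joined:
        joined = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                a, b = blocks[i], blocks[j]
                if boxes_touch(a, b, gap):
                    # Replace the pair with the box that encloses both.
                    blocks[i] = (
                        min(a[0], b[0]), min(a[1], b[1]),
                        max(a[2], b[2]), max(a[3], b[3]),
                    )
                    del blocks[j]
                    joined = True
                    break
            if joined:
                break  # restart the scan with the merged block
    return blocks
```

Restarting the scan after each merge keeps the logic simple: a merged block may now touch a block it did not touch before.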
Recognition:
System-wide OCR engines are used
Engines are configured from the GUI or XML files
Engine configuration:
<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
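A configuration like this could be read and turned into a shell command roughly as follows. This is a sketch, not OCRFeeder's own loader: it assumes that $IMAGE stands for the clipped image path, that $FILE stands for a temporary output base name, and that the engine path is simply prepended to the arguments. The names load_engine and build_command are hypothetical.

```python
import shlex
import xml.etree.ElementTree as ET

ENGINE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>"""


def load_engine(xml_bytes):
    """Return the engine settings as a dict keyed by tag name."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image_path, output_base):
    """Fill in the placeholders (assumed semantics) and prepend the engine path."""
    arguments = (
        engine["arguments"]
        .replace("$IMAGE", shlex.quote(image_path))
        .replace("$FILE", shlex.quote(output_base))
    )
    return f'{shlex.quote(engine["engine_path"])} {arguments}'


engine = load_engine(ENGINE_XML)
command = build_command(engine, "/tmp/page.tiff", "/tmp/out")
```

The resulting string is meant to be run through a shell, because the arguments chain several commands with semicolons. The recognized text is then read from the command's standard output.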
Classification:
It is graphics if:
* Text is empty
* More than 50% of the chars are failure chars, punctuation or spaces
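The two rules above fit in a few lines of Python. This is a minimal sketch: is_graphics is a hypothetical name, and failure_chars is an assumed parameter holding whatever characters the OCR engine emits when it fails to recognize a glyph (the default used here is only a placeholder).

```python
import string


def is_graphics(ocr_text, failure_chars="\ufffd"):
    """Classify a clipped block from the text the OCR engine returned for it."""
    if not ocr_text:
        return True  # rule 1: the text is empty
    noise = sum(
        1
        for ch in ocr_text
        if ch in failure_chars or ch in string.punctuation or ch.isspace()
    )
    # rule 2: more than 50% failure chars, punctuation or spaces
    return noise / len(ocr_text) > 0.5
```

A photograph run through an OCR engine typically yields either nothing or a scatter of punctuation, which is what this heuristic relies on.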
Font Size Detection:
“Measures” in pixels the size of each text line
Although it results in different sizes...
The value equal to or greater than the average is chosen
(results in values equal or close to the original font size)
The font size is calculated in inches using the resolution (DPI)
(if there's no resolution info, assume 300 DPI)
The value is then converted to DTP (DeskTop Publishing) points: 1 point = 1/72 inch
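As a worked example, the font size computation can be sketched like this. Two assumptions are made: "the value equal or greater than the average" is read as the smallest measured line height that is not below the average, and one DTP point is taken as 1/72 of an inch, so a size in inches is multiplied by 72 to get points. The function names are hypothetical.

```python
def choose_line_height(line_heights_px):
    """Pick the smallest measured line height that is >= the average."""
    average = sum(line_heights_px) / len(line_heights_px)
    return min(h for h in line_heights_px if h >= average)


def font_size_points(height_px, dpi=None):
    """Convert a height in pixels to DTP points (72 points per inch)."""
    dpi = dpi or 300  # no resolution info: assume 300 DPI
    return height_px * 72 / dpi


# Lines measured at 48, 50 and 52 pixels in a 300 DPI scan
height = choose_line_height([48, 50, 52])
size = font_size_points(height)
```

With these inputs the chosen height is 50 pixels, which is 50/300 of an inch, or 12 points.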
Exportation to ODT:
Uses ODFPy
(abstracts ODF creation)
(just above XML)
User interaction:
Users can edit everything and review the algorithm's results
So the UI can work in attended and unattended modes
The CLI only works in unattended mode
Nuance OmniPage test
ABBYY FineReader test
OmniPage's results
FineReader's results
Demo time!
Other features:
* PDF importation
* Unpaper preprocessor
* Font style edition
* Exportation to HTML
* Project saving/loading
Future:
* Integrate OCRopus as an alternative analysis backend
* More exportation formats: hOCR, txt, PDF
* Improved a11y
* Better integration with GNOME and other GNOME apps
GNOME:
Development moved to GNOME's infrastructure last month
Webpage: http://live.gnome.org/OCRFeeder
git: http://git.gnome.org/ocrfeeder
Bugzilla: coming soon...
Thank you!