The presentation of OCRFeeder for the GNOME track in FOSDEM 2010.
Joaquim Rocha
OCRFeeder
Documents conversion on GNOME
FOSDEM 2010
Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010
What is it?
Document Analysis and Optical Character Recognition
for GNOME
Why is it?
Paper has a number of problems
No applications for GNU/Linux to do a fair job
Paper problems: Security
CC Photo by: http://www.flickr.com/photos/badwsky/
Paper problems: Preservation
CC Photo by: http://www.flickr.com/photos/98469445@N00/
Paper problems: Data processing
CC Photo by: http://www.flickr.com/photos/hugovk/
Paper problems: Ecology
CC Photo by: http://www.flickr.com/photos/pranavsingh/
No fair conversion apps for GNU/Linux
apart from OCR engines, but...
OCR != Document Conversion
(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)
what you want is
Document Analysis and Recognition
(conversion of documents to an electronic format)
(first projects in the 80s)
Where were we at?
* Some closed solutions
* Only for proprietary systems
* Various prices
* Still... arguable results
How?
Base concept:
1. Clip the contents
2. Classify them
2.1. They are graphics → Paste on document
2.2. They are text → Calculate letter size; paste on document
So many layouts...
CC Photo by: http://www.flickr.com/photos/uber-tuber/
Layouts vary with the type of document
What works for detecting one layout won't work for others
OCRFeeder focuses on contents, not on layouts!
Key concept:
If a document image can be divided into windows valued 1 (content)
or 0 (not content), then it is possible to group all the
1s and outline the contents
Sliding Window Algorithm:
1. An NxN pixel window runs through the document, top to bottom, left to right
Sliding Window Algorithm:
2. For every iteration, if there's a pixel inside the window which contrasts with the background,
then the window gets a 1, otherwise it gets a 0
Sliding Window Algorithm:
It does not check all the pixels, so it performs better
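Steps 1 and 2 could be sketched in Python roughly as follows. This is only an illustration, not OCRFeeder's actual code: the function name window_grid, the grayscale list-of-rows image format, and the background and tolerance parameters are all assumptions. Stopping at the first contrasting pixel (which any() does) is one possible way of not checking every pixel.

```python
def window_grid(image, n, background=255, tolerance=30):
    """Return one value per n-by-n window: 1 for content, 0 otherwise.

    `image` is a list of rows of grayscale values (0-255). The names and
    the `background`/`tolerance` defaults are hypothetical.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    grid = []
    for top in range(0, height, n):
        row = []
        for left in range(0, width, n):
            # any() stops at the first pixel that contrasts with the
            # background, so a window with content is not fully scanned.
            has_content = any(
                abs(image[y][x] - background) > tolerance
                for y in range(top, min(top + n, height))
                for x in range(left, min(left + n, width))
            )
            row.append(1 if has_content else 0)
        grid.append(row)
    return grid


# A 4x4 white page with a single dark pixel in the top-right corner.
page = [[255] * 4 for _ in range(4)]
page[0][3] = 0
grid = window_grid(page, 2)
```

Here only the top-right 2x2 window receives a 1.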
Sliding Window Algorithm:
3. After all windows have a value assigned,
the ones with the value 1 are grouped
Sliding Window Algorithm:
4. Every time a set of 1s is grouped, each window in it is reassigned the value 0
(these groups are called blocks)
Sliding Window Algorithm:
5. When all windows have the value 0, the algorithm
has reached the end
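Steps 3 to 5 can be illustrated with a flood fill over the grid of window values. This is a sketch under assumptions, not the project's implementation: the name group_blocks, the 4-neighbour adjacency rule, and the (left, top, right, bottom) bounding-box representation in window units are all hypothetical.

```python
def group_blocks(grid):
    """Group adjacent 1-windows into blocks.

    Returns one bounding box (left, top, right, bottom), in window
    units, per block. Works on a copy so the caller's grid is kept.
    """
    grid = [row[:] for row in grid]
    blocks = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value != 1:
                continue
            # Collect every 1 reachable from (r, c), resetting each
            # visited window to 0 as it joins the block.
            stack = [(r, c)]
            grid[r][c] = 0
            left = right = c
            top = bottom = r
            while stack:
                y, x = stack.pop()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                            and grid[ny][nx] == 1):
                        grid[ny][nx] = 0
                        stack.append((ny, nx))
            blocks.append((left, top, right, bottom))
    # At this point every window in the copy holds 0: the end condition.
    return blocks


blocks = group_blocks([[1, 1, 0, 0],
                       [0, 1, 0, 0],
                       [0, 0, 0, 1]])
```

The example grid yields two blocks: one covering the three connected windows at the top left and one for the isolated window at the bottom right.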
Block structure:
Joining Blocks:
Blocks are checked against each other and joined when appropriate
When no blocks can be joined, the analysis part is finished
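The slides do not say when joining is "appropriate". The sketch below assumes one simple rule, purely for illustration: two blocks are joined when their bounding boxes overlap or lie within a given gap of each other. The names boxes_close and join_blocks and the gap parameter are hypothetical.

```python
def boxes_close(a, b, gap):
    """True if boxes a and b overlap or lie within `gap` of each other.

    Boxes are (left, top, right, bottom) tuples. This proximity rule is
    an assumption, not OCRFeeder's documented criterion.
    """
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def join_blocks(blocks, gap=1):
    """Repeatedly merge close blocks until no pair can be joined."""
    blocks = list(blocks)
    joined = True
    while joined:
        joined = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                a, b = blocks[i], blocks[j]
                if boxes_close(a, b, gap):
                    # Replace the pair with their common bounding box.
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    joined = True
                    break
            if joined:
                break
    return blocks


merged = join_blocks([(0, 0, 1, 1), (2, 0, 3, 1), (10, 10, 11, 11)])
```

With the default gap the first two boxes are merged into one and the distant third box is left alone.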
Recognition:
System-wide OCR engines are used
Engines are configured from the GUI or XML files
Engine configuration:
<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
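To show how such a configuration might be consumed, here is a small sketch using only the Python standard library. How OCRFeeder itself expands $IMAGE and $FILE is not shown in the slides, so the use of string.Template, the function names load_engine and build_command, and the example paths are assumptions.

```python
import xml.etree.ElementTree as ET
from string import Template

ENGINE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<engine>
    <name>Tesseract</name>
    <image_format>TIFF</image_format>
    <engine_path>/usr/bin/tesseract</engine_path>
    <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>"""


def load_engine(xml_bytes):
    """Read the engine settings into a dict keyed by element name."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image, out_file):
    """Fill in $IMAGE and $FILE (assumed placeholder semantics)."""
    arguments = Template(engine["arguments"]).substitute(
        IMAGE=image, FILE=out_file)
    return engine["engine_path"] + " " + arguments


engine = load_engine(ENGINE_XML)
command = build_command(engine, "page.tif", "/tmp/out")
```

The resulting string is a shell command line: the engine writes its text to a temporary file, which is then printed and removed.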
Classification:
It is graphics if:
* Text is empty
* More than 50% of the chars are failure chars, punctuation or spaces
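The two rules above might look like this in Python. It is a sketch only: the function name is_graphics is hypothetical, and since the slides do not say which characters count as failure chars, the failure_chars default (the Unicode replacement character) is a placeholder assumption.

```python
import string


def is_graphics(text, failure_chars="\ufffd"):
    """Classify recognized text as graphics using the two rules above.

    `failure_chars` is a placeholder; the real set depends on the
    OCR engine in use.
    """
    if not text:
        return True  # rule 1: the engine recognized nothing
    ignorable = (set(string.punctuation) | set(string.whitespace)
                 | set(failure_chars))
    noise = sum(1 for ch in text if ch in ignorable)
    return noise / len(text) > 0.5  # rule 2: more than 50% noise


is_graphics("")             # empty text: graphics
is_graphics("Hello world")  # mostly letters: text
is_graphics("..,, - a")     # mostly punctuation and spaces: graphics
```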
Font Size Detection:
“Measures” in pixels the size of each text line
Although it results in different sizes...
Font Size Detection:
The value equal to or greater than the average is chosen
(results in values equal or close to the original font size)
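One possible reading of this rule, sketched in Python: among the measured line heights, take the smallest one that is equal to or greater than the average. That reading, and the name pick_line_height, are assumptions; the slides do not spell out how ties or several candidates are handled.

```python
def pick_line_height(line_heights):
    """Pick the smallest measured height that is >= the average.

    `line_heights` holds the pixel height measured for each text line.
    Choosing the smallest qualifying value is an assumed interpretation.
    """
    average = sum(line_heights) / len(line_heights)
    return min(h for h in line_heights if h >= average)


chosen = pick_line_height([38, 40, 41, 45])  # average is 41
```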
Font Size Detection:
The font size is calculated in inches using the resolution (DPI)
(if there's no resolution info, assume 300 DPI)
The value is then divided by the DTP (DeskTop Publishing) point, which is 1/72 inch (i.e. multiplied by 72)
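As a worked example of this conversion, a minimal sketch (the function name font_size_points is hypothetical): the pixel height is turned into inches with the resolution, and since one DTP point is 1/72 inch, the inches are multiplied by 72.

```python
def font_size_points(height_px, dpi=None):
    """Convert a text height in pixels to a font size in points."""
    if not dpi:
        dpi = 300  # fallback from the slide when the image has no DPI info
    inches = height_px / dpi
    return inches * 72  # one DTP point is 1/72 inch


size = font_size_points(50)             # 50 px at 300 DPI, about 12 pt
same = font_size_points(100, dpi=600)   # same physical size at 600 DPI
```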
Exportation to ODT:
Uses ODFPy
(abstracts ODF creation)
(just above XML)
User interaction:
Users can edit everything and review the algorithm's results
So the UI can work in attended and unattended ways
CLI only works in an unattended mode
Nuance OmniPage test
ABBYY FineReader test
OmniPage's results
FineReader's results
Demo time!
Other features:
* PDF importation
* Unpaper preprocessor
* Font style edition
* Exportation to HTML
* Project saving/loading
Future:
* Integrate Ocropus as an alternative analysis backend
* More exportation formats: HOCR, txt, PDF
* Improved a11y
* Better integration with GNOME and other GNOME apps
GNOME:
Development moved to GNOME's infrastructure last month
Webpage: http://live.gnome.org/OCRFeeder
git: http://git.gnome.org/ocrfeeder
Bugzilla: coming soon...
Thank you!