Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Development of an OCR System

Nathan Harmata

TJHSST Computer Systems Lab 2007-2008

What is OCR?

Optical Character Recognition

Font and handwriting based

Goals of My Project

Generic recognition for Latin-based fonts

Proper handling of most formatting

System built from scratch

Overview of Idocrase System

Image Processing

Transformations

Attribute

Character Model

Transformations

Sector Vector - image is parsed into parts that pass the vertical line test

- then each part is transformed into a collection of line segments
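One possible reading of the two Sector Vector steps, as a minimal Python sketch rather than the actual Idocrase code. It assumes a binary image stored as a list of rows of 0/1 pixels. The function names, the rule of assigning the i-th run of each column to part i, and the use of run centres as y-values are all assumptions made for illustration.

```python
def column_runs(image, x):
    """Return (top, bottom) row ranges of consecutive foreground pixels in column x."""
    runs, start = [], None
    for y, row in enumerate(image):
        if row[x] and start is None:
            start = y
        elif not row[x] and start is not None:
            runs.append((start, y - 1))
            start = None
    if start is not None:
        runs.append((start, len(image) - 1))
    return runs


def split_into_parts(image):
    """Assign the i-th run of every column to part i as an (x, centre_y) point.

    Each part then has at most one y-value per x, so it passes the
    vertical line test. This hypothetical rule ignores connectivity.
    """
    parts = []
    for x in range(len(image[0])):
        for i, (top, bottom) in enumerate(column_runs(image, x)):
            while len(parts) <= i:
                parts.append([])
            parts[i].append((x, (top + bottom) / 2))
    return parts


def to_segments(points):
    """Merge consecutive points that share a direction into (start, end) segments."""
    if len(points) < 2:
        return []
    segments = []
    start = points[0]
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        dx1, dy1 = cur[0] - prev[0], cur[1] - prev[1]
        dx2, dy2 = nxt[0] - cur[0], nxt[1] - cur[1]
        if dy1 * dx2 != dy2 * dx1:  # direction changes at cur
            segments.append((start, cur))
            start = cur
    segments.append((start, points[-1]))
    return segments
```

For a small ring-shaped glyph, this sketch yields an upper part made of three segments and a lower part made of one horizontal segment.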

Gap Vector - gaps, if any, are found on the four sides of the image
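The slides do not define a gap precisely. The sketch below assumes one plausible definition: a side has a gap when its depth profile (distance from that side to the first foreground pixel) dips inward by at least `min_depth` pixels between two shallower points. The names `side_profile`, `has_gap`, `gap_vector` and the `min_depth` threshold are hypothetical.

```python
def side_profile(image, side):
    """Distance from the given side to the first foreground pixel, per row or column.

    Pixels are assumed to be exactly 0 (background) or 1 (foreground).
    """
    height, width = len(image), len(image[0])
    if side in ("left", "right"):
        lines = [list(row) for row in image]
    else:  # "top" or "bottom": scan along columns instead of rows
        lines = [[image[y][x] for y in range(height)] for x in range(width)]
    if side in ("right", "bottom"):
        lines = [line[::-1] for line in lines]
    # A line with no foreground pixel counts as fully open.
    return [line.index(1) if 1 in line else len(line) for line in lines]


def has_gap(profile, min_depth=2):
    """True if some interior position is deeper than the rim on both sides of it."""
    for i in range(1, len(profile) - 1):
        rim = max(min(profile[:i]), min(profile[i + 1:]))
        if profile[i] - rim >= min_depth:
            return True
    return False


def gap_vector(image, min_depth=2):
    """Map each of the four sides to whether a gap was found on it."""
    sides = ("top", "right", "bottom", "left")
    return {side: has_gap(side_profile(image, side), min_depth) for side in sides}
```

Under this definition a "C"-shaped glyph reports a gap only on its right side.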

Transformations

Pixel Concentration Vector - which sides, if any, have a higher concentration of pixels
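A minimal sketch of how such a vector could be computed, assuming it has one horizontal and one vertical component and that "balanced" means the two halves differ by no more than a tolerance. The function name, the 10% tolerance, and the label strings other than "balanced" are assumptions, not taken from Idocrase.

```python
def pixel_concentration_vector(image, tolerance=0.1):
    """Return (horizontal, vertical) labels for a 0/1 image given as a list of rows.

    The middle column or row is ignored when the width or height is odd.
    """
    height, width = len(image), len(image[0])
    left = sum(row[x] for row in image for x in range(width // 2))
    right = sum(row[x] for row in image for x in range((width + 1) // 2, width))
    top = sum(sum(row) for row in image[: height // 2])
    bottom = sum(sum(row) for row in image[(height + 1) // 2:])

    def label(first, second, first_name, second_name):
        total = first + second
        # Treat the halves as balanced when they differ by at most
        # `tolerance` of their combined pixel count.
        if total == 0 or abs(first - second) <= tolerance * total:
            return "balanced"
        return first_name if first > second else second_name

    return (label(left, right, "left", "right"),
            label(top, bottom, "top", "bottom"))
```

With these assumptions a symmetric ring comes out as ("balanced", "balanced"), while an "L" shape comes out as ("left", "bottom").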

Character Recognition

GCDD – Generic Character Definition Database

Averages of Character Models for every character from many different fonts

0
PixelConcentrationVector balanced balanced
SectorVector 4 3
GapVector
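The idea of averaging Character Models across fonts and then matching against those averages can be sketched as follows. This is an illustration only: it assumes each Character Model has already been reduced to a numeric feature vector of fixed length, and it uses Euclidean distance for matching, neither of which is stated in the slides. The names `build_gcdd` and `recognize` are hypothetical.

```python
import math
from collections import defaultdict


def build_gcdd(samples):
    """Average feature vectors per character.

    samples: iterable of (character, feature_vector) pairs, one per font.
    Returns a dict mapping each character to its mean feature vector.
    """
    grouped = defaultdict(list)
    for char, vector in samples:
        grouped[char].append(vector)
    return {
        char: [sum(column) / len(vectors) for column in zip(*vectors)]
        for char, vectors in grouped.items()
    }


def recognize(gcdd, vector):
    """Return the character whose averaged model is closest to the given vector."""
    return min(gcdd, key=lambda char: math.dist(gcdd[char], vector))


# Made-up feature values, purely for illustration.
samples = [("0", [4, 3, 0]), ("0", [4, 5, 0]), ("C", [2, 2, 1]), ("C", [2, 4, 1])]
gcdd = build_gcdd(samples)
best = recognize(gcdd, [4, 4, 0])
```

Categorical attributes such as "balanced" would need a numeric encoding or a different distance measure before they could be averaged this way.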

Character Recognition

For a single character:

For words, dictionary and grammar references are used.
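A dictionary reference could work along these lines. The sketch uses Python's standard `difflib.get_close_matches` to replace a recognized word with its closest dictionary entry. The function name `correct_word`, the 0.6 cutoff, and the tiny word list are assumptions, and grammar references are not covered here.

```python
import difflib


def correct_word(raw, dictionary, cutoff=0.6):
    """Return raw if it is a known word, else the closest dictionary word.

    Falls back to raw unchanged when nothing is similar enough.
    """
    if raw in dictionary:
        return raw
    matches = difflib.get_close_matches(raw, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


# Hypothetical word list for illustration.
words = ["system", "character", "vector"]
fixed = correct_word("systen", words)
```

A lower cutoff corrects more misreadings but also raises the chance of replacing a correct but unlisted word.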

Idocrase Application

Results

- Mediocre word recognition
- Doesn’t handle formatting well
- Doesn’t handle small letters well
- Fairly accurate single character recognition (93.7%)
