Optical Character Recognition
Lecture 8

Qurat-ul-Ain (Ainie) Akram, Sarmad Hussain
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science
University of Engineering and Technology, Lahore, Pakistan



ISSALE 2014 2

Syllable String Creation using lookup table

Syllable String   Main body ID   Diacritics1_ID   Diacritics2_ID   ….
تا                500            2
و                 501
پتھر              200            2                1                2
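The lookup table above can be sketched as a dictionary keyed by the main body ID together with the ordered diacritic IDs. This is only an illustration: the key layout, the name `LOOKUP_TABLE`, and the function `lookup_syllable` are assumptions, and only the three rows come from the slide.

```python
from typing import Dict, Optional, Sequence, Tuple

# Key: (main body ID, diacritic IDs in recognition order). Value: syllable string.
# The three entries are the rows shown on the slide; the key layout is an assumption.
SyllableKey = Tuple[int, Tuple[int, ...]]

LOOKUP_TABLE: Dict[SyllableKey, str] = {
    (500, (2,)): "تا",
    (501, ()): "و",
    (200, (2, 1, 2)): "پتھر",
}


def lookup_syllable(main_body_id: int,
                    diacritic_ids: Sequence[int] = ()) -> Optional[str]:
    """Return the syllable string for a recognized main body and its
    diacritics, or None when the combination is not in the table."""
    return LOOKUP_TABLE.get((main_body_id, tuple(diacritic_ids)))


if __name__ == "__main__":
    print(lookup_syllable(500, [2]))
    print(lookup_syllable(200, [2, 1, 2]))
```

Returning `None` for an unknown combination lets the caller count it as a recognition error instead of crashing.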


Project Presentation

1. Front Page
   – Optical Character Recognition (in English)
   – Optical Character Recognition (in your language)
   – Document image
   – Output of OCR (recognized syllable strings)
   – Syllable string recognition accuracy (correctly recognized syllables / total syllables × 100)
   – Group members' names
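The accuracy figure asked for on the front page can be computed as below. This is a minimal sketch that assumes the recognized and reference syllable strings are compared position by position. The function names and the sample strings are illustrative, not part of the assignment.

```python
from typing import Sequence


def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy as a percentage: correct / total * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def syllable_accuracy(recognized: Sequence[str],
                      reference: Sequence[str]) -> float:
    """Percentage of reference syllables recognized correctly.
    Assumes both sequences are aligned one to one."""
    correct = sum(1 for got, want in zip(recognized, reference) if got == want)
    return accuracy_percent(correct, len(reference))


if __name__ == "__main__":
    reference = ["تا", "و", "پتھر", "تا"]
    recognized = ["تا", "و", "پتھر", "و"]
    print(syllable_accuracy(recognized, reference))
```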


2. Preprocessing
   – Line Segmentation
     • Samples of line segmentation
     • Line segmentation accuracy results
     • Samples of incorrect line segmentation
   – Syllable/Ligature Segmentation
     • Samples of syllable/ligature segmentation
     • Syllable/ligature segmentation accuracy results
     • Samples of incorrect syllable/ligature segmentation

Line segmentation results table (columns):
Total Lines   Correct Lines   Incorrect Lines   % Accuracy

Syllable/ligature segmentation results table (columns):
Total Syllables   Correct Syllables   Incorrect Syllables   % Accuracy
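The slides do not say how line segmentation is to be done. One common approach is a horizontal projection profile: rows with no ink separate consecutive text lines. The sketch below assumes a binarized image given as a list of rows with 1 for ink and 0 for background. The function name and the `min_ink` parameter are illustrative.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]],
                  min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each text line in a binary image.
    A row belongs to a line when it has at least `min_ink` ink pixels."""
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the line currently being scanned
    for row_index, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = row_index
        elif not has_ink and start is not None:
            lines.append((start, row_index - 1))
            start = None
    if start is not None:  # the last line touches the bottom edge
        lines.append((start, len(image) - 1))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(segment_lines(page))
```

Diacritics that sit close to a neighbouring line can leave no empty row between lines, so real document images may need more than this simple rule.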


• Pre-processing
  – Main body and diacritics disambiguation

Main body results table (columns):
Total main bodies   Correctly classified as main bodies   % Accuracy

Diacritics results table (columns):
Total diacritics   Correctly classified as diacritics   % Accuracy


• Classification and Recognition
  – Data Description
    • 15 Main body Types (DataSet-1)
      – Training Data (35 Tokens)
      – Testing Data (15 Tokens)
      – Image samples
    • Document Images (DataSet-2)
      – Testing Data
        » X Tokens of Y main body Types
        » X Tokens of Y diacritics Types
        » Image sample

Main body Type   Total tokens in document images   Total unique syllables in document images
500              15                                4
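The DataSet-1 split can be sketched as follows, assuming the 35 training and 15 testing tokens are counted per main body type. That reading is an assumption, as are the function name, the fixed random seed, and the file names in the example.

```python
import random
from typing import Dict, List, Tuple

TRAIN_TOKENS = 35  # training tokens, from the slide
TEST_TOKENS = 15   # testing tokens, from the slide


def split_tokens(tokens_by_type: Dict[int, List[str]],
                 seed: int = 0) -> Dict[int, Tuple[List[str], List[str]]]:
    """Shuffle the tokens of each main body type and split them into
    (training, testing) lists of 35 and 15 tokens."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    split: Dict[int, Tuple[List[str], List[str]]] = {}
    for type_id, tokens in tokens_by_type.items():
        if len(tokens) < TRAIN_TOKENS + TEST_TOKENS:
            raise ValueError(f"type {type_id} has too few tokens")
        shuffled = list(tokens)
        rng.shuffle(shuffled)
        train = shuffled[:TRAIN_TOKENS]
        test = shuffled[TRAIN_TOKENS:TRAIN_TOKENS + TEST_TOKENS]
        split[type_id] = (train, test)
    return split


if __name__ == "__main__":
    data = {500: [f"mb500_{i}.png" for i in range(50)]}  # hypothetical files
    train, test = split_tokens(data)[500]
    print(len(train), len(test))
```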


• Classification and recognition results
  – Recognition Results on DataSet-1 using Decision Trees
    • Main body recognition accuracy
    • Diacritics recognition accuracy
  – Recognition Results on DataSet-1 using Tesseract
    • Main body recognition accuracy
    • Diacritics recognition accuracy

Results tables (two, same columns):
Class Type   Total Samples (Test data, 15 Tokens)   Correctly Recognized   % Accuracy
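The rows of the results tables can be tallied from (actual class, predicted class) pairs, whichever recognizer produced the predictions. This is a minimal sketch. The function name and the sample pairs are illustrative, and the class IDs 500 and 501 are borrowed from the lookup table only as placeholders.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def per_class_results(
    pairs: Iterable[Tuple[int, int]]
) -> List[Tuple[int, int, int, float]]:
    """Return one row per class:
    (class type, total samples, correctly recognized, % accuracy)."""
    totals: Counter = Counter()
    correct: Counter = Counter()
    for actual, predicted in pairs:
        totals[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    return [
        (cls, totals[cls], correct[cls], 100.0 * correct[cls] / totals[cls])
        for cls in sorted(totals)
    ]


if __name__ == "__main__":
    sample = [(500, 500), (500, 501), (501, 501), (501, 501)]
    for row in per_class_results(sample):
        print(row)
```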


• Classification and recognition results
  – Recognition Results on DataSet-2 using Decision Trees
    • Main body recognition accuracy
    • Diacritics recognition accuracy
  OR
  – Recognition Results on DataSet-2 using Tesseract
    • Main body recognition accuracy
    • Diacritics recognition accuracy

Results tables (two, same columns):
Class Type   Total Samples   Correctly Recognized   % Accuracy


• Post-processing
  – Syllable String Creation
  – Syllable String Recognition Accuracy

Syllable String   Main body ID   Diacritics1_ID   Diacritics2_ID   ….
تا                500            2
و                 501

Syllable Type   Total Samples   Correctly Recognized   % Accuracy


Output of OCR

• Input Document Image

• OCR Output


Deliverables to submit

1. Presentation slides
2. OCR Complete Code
   1. Line segmentation
   2. Syllable segmentation
   3. Recognition of diacritics and main bodies
   4. Syllable string creation using lookup table
   5. Output.txt file generation
3. DataSet-1
4. DataSet-2
5. Tesseract traineddata file
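The Output.txt step can be sketched as below. The slides do not specify the file layout, so this assumes one text line per line of the document image, syllable strings separated by single spaces, and UTF-8 encoding. The function name `write_output` is illustrative.

```python
from pathlib import Path
from typing import List


def write_output(lines: List[List[str]], path: str = "Output.txt") -> None:
    """Write recognized syllable strings to a UTF-8 text file,
    one document line per text line, syllables separated by spaces."""
    text = "\n".join(" ".join(syllables) for syllables in lines)
    Path(path).write_text(text + "\n", encoding="utf-8")
```

Calling `write_output([["تا", "و"], ["پتھر"]])` would create Output.txt with two lines. Stating the encoding explicitly matters because the platform default may not be able to represent the script being recognized.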

Good Luck


Document Image Creation

• Syllable_of_MB1_Samples_1 Syllable_of_MB2_Samples_1 Syllable_of_MB3_Samples_1 Syllable_of_MB4_Samples_1 Syllable_of_MB5_Samples_1 … Syllable_of_MB15_Samples_1
• Syllable_of_MB1_Samples_2 Syllable_of_MB2_Samples_2 Syllable_of_MB3_Samples_2 Syllable_of_MB4_Samples_2 Syllable_of_MB5_Samples_2 … Syllable_of_MB15_Samples_2
• Syllable_of_MB1_Samples_3 Syllable_of_MB2_Samples_3 Syllable_of_MB3_Samples_3 Syllable_of_MB4_Samples_3 Syllable_of_MB5_Samples_3 … Syllable_of_MB15_Samples_3
• Syllable_of_MB1_Samples_4 Syllable_of_MB2_Samples_4 Syllable_of_MB3_Samples_4 Syllable_of_MB4_Samples_4 Syllable_of_MB5_Samples_4 … Syllable_of_MB15_Samples_4
• …
• Syllable_of_MB1_Samples_15 Syllable_of_MB2_Samples_15 Syllable_of_MB3_Samples_15 Syllable_of_MB4_Samples_15 Syllable_of_MB5_Samples_15 … Syllable_of_MB15_Samples_15

Syllable = MB + Diacritics or Syllable = MB
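The layout above, one line per sample number with one syllable of each main body type on it, can be generated rather than typed by hand. This sketch only produces the placeholder names used on the slide. The function name and its default sizes (15 types, 15 samples) are assumptions read off the pattern.

```python
from typing import List


def document_image_layout(num_types: int = 15,
                          num_samples: int = 15) -> List[List[str]]:
    """Return the document image as rows of placeholder names:
    row s holds sample s of a syllable for each main body type."""
    return [
        [f"Syllable_of_MB{mb}_Samples_{sample}"
         for mb in range(1, num_types + 1)]
        for sample in range(1, num_samples + 1)
    ]


if __name__ == "__main__":
    for row in document_image_layout(num_types=3, num_samples=2):
        print(" ".join(row))
```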


Examples of Document Image