Census Data Capture Challenge Intelligent Document Capture Solution UNSD Workshop - Minsk Dec 2008 Amir Angel Director of Government Projects

Census Data Capture Challenge

Intelligent Document Capture Solution

UNSD Workshop - Minsk Dec 2008

Amir Angel Director of Government Projects

The evolution of data capture in census projects

From OCR to an IDR Solution

eFLOW

Five steps:

Key From Paper (manual data entry)

– Slow process

– High error rate in the data entry process

– Recruitment, training and management of personnel

Key From Image

– Archive

– Approx. 20% faster than key from paper

OMR (hardware readers for checkboxes)

– Requires special scanners and specially printed forms

– Cannot handle handwritten/printed data

– Forms are not user-friendly

– OMR requires more answers => more space => increased paper expenditures => more handling and printing costs

– Not flexible, difficult to adjust to other applications once census is over

– No possibility to add business rules: imputation, validations, coding

Automated Data Capture

– Requires less human intervention and enables the census data capture to be completed much faster (less space, lower salary costs, less hardware)

– Full flexibility in the type of data gathered (checkbox, OMR, handwritten, alpha and numeric, barcode…)

– Ensures data integrity: enables both automatic and manual online validations, exception handling and coding

– The most advanced and proven technology for Censuses, recommended by the UN and used by all modern countries for census projects

– Creates a correlation between the image and the actual form

– Remote capabilities enable all forms to be scanned locally and then sent to a central site for processing


Intelligent data capture platform (IDR) using OCR/ICR/OMR/PDA/Web/email:

– Automated data capture, plus:

– Automatic classification of documents: understands and differentiates between various types of documents and languages, based on state-of-the-art machine learning algorithms

– Artificial intelligence algorithms that provide enough information for the system to find the location of the fields on its own


Traditional Data Capture

Mail Room, Scanning, Data Entry, Back-Office, End Users

– Document prep, sorting

– Manual key from image

Intelligent Document Capture

Mail Room, Scanning, Data Entry, Back-Office, End Users

– Document prep, no sorting

– Reduce manual data entry by 40-70%

– Increase accuracy and consistency

India 2001

Turkey 1997

Brazil 2000

South Africa 2001

Ireland 2002

Italy 2002

Cyprus 2002

Turkey 2000

Kenya 2000

Slovak Republic 2001

Hong Kong 2001

Thailand 2008 (Community)

Slovenia 2006

Hong Kong 2006

South Africa Survey 2007

Ireland 2006

Manual

Saving of 25%

Saving of 50%

(Source: CSO, Central Statistics Office Ireland)

Automated Data Capture = time saving

The technology is there

No need to reinvent the wheel

Reduce risks by using ‘off-the-shelf’ technologies.

Data Types

– OCR

– OMR

– ICR

Automatic Recognition

A * C * E F

1 2 3 4 5 * 7

ICR

*=Unrecognized Character

Improve Recognition – Voting mechanism

Engines: OCR Type A, OCR Type B, OCR Type C

Engine results: *7521, 97*21, 9*5*1

VOTING

Voting result: 97521

Voting: Single Engine vs. Virtual Engines

                 AEG    Nestor   OCE    Virtual
Good             90.6   88.5     92     90.2
Reject            5.6    5.7      3.3    8.9
False Positive    3.8    5.8      4.7    0.9

Figure Of Merit Example

A system recognizes 90% of the characters contained in a batch, but misclassifies 4%

90 - (10*4) = 50

The Figure Of Merit in this example is 50

A system recognizes 80% of the characters contained in a batch, but misclassifies 1%

80 - (10*1) = 70

The Figure Of Merit in this example is 70

The second system is more efficient
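
The arithmetic in the two examples can be sketched in a few lines of Python. The function name `figure_of_merit` and its `penalty` parameter are illustrative only; the factor of 10 is taken from the worked examples. Applying the same formula to the "Good" and "False Positive" rows of the engine comparison table is an assumption, not something the slides state.

```python
def figure_of_merit(recognized_pct, misclassified_pct, penalty=10):
    """Recognition rate minus a penalty for every misclassified percent.

    penalty=10 mirrors the factor used in the worked examples; the name
    and signature are illustrative, not taken from any product API.
    """
    return recognized_pct - penalty * misclassified_pct


# The two systems from the worked examples.
first = figure_of_merit(90, 4)
second = figure_of_merit(80, 1)
print(first, second)

# Assumption: treat "Good" as recognized and "False Positive" as misclassified.
engines = {
    "AEG": (90.6, 3.8),
    "Nestor": (88.5, 5.8),
    "OCE": (92, 4.7),
    "Virtual": (90.2, 0.9),
}
scores = {name: round(figure_of_merit(good, fp), 1)
          for name, (good, fp) in engines.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Under that assumption the virtual engine scores highest, because its low false-positive rate outweighs its higher reject rate.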

Benefits of Multiple ICRs

– Identify false positives

– Alpha & numeric fields

– Highlight for verification

– Quality control for ICR

Unique tiling station: checking for false positives

Voting Methods Example

Assume we have a virtual (V.) engine that includes 4 engines. We want to identify the following number: 253478. The results of each engine are:

Engine   Result
1        25***8
2        2*5378
3        253478
4        2*34*8

The final results of the V. engine will be:

Safe: 2****8

Normal: 25**78

Majority: 253478

Order: 255378

Equalizer: ??????
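
The four results above can be reproduced with a small Python sketch. The rules are inferred from the slide's outputs rather than from any product documentation: "safe" keeps a character only when every engine returns the same one, "normal" keeps it when all engines that recognized something agree, "majority" takes the most frequent recognized character, and "order" takes the first engine, in listed order, that recognized anything. The function name `vote`, the tie handling in "majority" (reject on a tie) and the omission of "Equalizer" (the slide gives no result for it) are assumptions.

```python
from collections import Counter

REJECT = "*"  # marker for an unrecognized character, as on the slides


def vote(results, method):
    """Combine per-engine strings of equal length, position by position."""
    out = []
    for chars in zip(*results):
        recognized = [c for c in chars if c != REJECT]
        distinct = set(recognized)
        if method == "safe":
            # Every engine must recognize the character and agree on it.
            ok = len(recognized) == len(chars) and len(distinct) == 1
            out.append(recognized[0] if ok else REJECT)
        elif method == "normal":
            # Rejects are ignored, but recognized characters must not conflict.
            out.append(recognized[0] if len(distinct) == 1 else REJECT)
        elif method == "majority":
            top = Counter(recognized).most_common(2)
            tie = len(top) == 2 and top[0][1] == top[1][1]
            # Assumption: a tie (or no recognition at all) is rejected.
            out.append(REJECT if not top or tie else top[0][0])
        elif method == "order":
            # First engine in the listed order that recognized something.
            out.append(recognized[0] if recognized else REJECT)
        else:
            raise ValueError(f"unknown method: {method}")
    return "".join(out)


engines = ["25***8", "2*5378", "253478", "2*34*8"]
for method in ("safe", "normal", "majority", "order"):
    print(method, vote(engines, method))
```

The same function also covers a single character read by four ICR engines: `vote(["3", "3", "8", "3"], "majority")` gives `3`, while `"safe"` gives `*`.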

ICR 1   ICR 2   ICR 3   ICR 4
3       3       8       3

Majority = 3, Safe = *

Processing Example

Automatic Recognition Time + Completion Time + Correction Time = THROUGHPUT

[Figure: image, recognition, completion]

Fuzzy/Approximate Search

[Figure: image, recognition, completion]

Other Approaches: Auto Coding

– Coding tasks and data validations performed on the data capture platform: a ‘cost-effective’ solution

– Use artificial intelligence and statistical software to “understand” sentences

Q: “What do you do for a living?”

A: “I am guiding children” => “Teacher” => 2030

– Use approximate search tools to improve results via a DB (Exorbyte)
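
As a rough illustration of the approximate-search step only, the sketch below completes a partially recognized answer against a lookup table and returns its code. It uses Python's standard `difflib` as a stand-in for a dedicated tool such as Exorbyte, whose API is not shown in the slides. The function name `complete_and_code` and the table contents are hypothetical, apart from the "Teacher" => 2030 pair, which comes from the slide. Interpreting a free-text sentence such as "I am guiding children" is a separate step that is not covered here.

```python
import difflib

# "TEACHER" -> "2030" is taken from the slide; every other entry is a
# made-up placeholder so that the lookup has something to compare against.
OCCUPATION_CODES = {
    "TEACHER": "2030",
    "FARMER": "0001",  # hypothetical code
    "NURSE": "0002",   # hypothetical code
    "DRIVER": "0003",  # hypothetical code
}


def complete_and_code(recognized, codes=OCCUPATION_CODES, cutoff=0.6):
    """Return (label, code) for the closest table entry, or None.

    `recognized` may contain '*' for characters the ICR engine rejected.
    `cutoff` is difflib's similarity threshold between 0 and 1.
    """
    matches = difflib.get_close_matches(
        recognized.upper(), list(codes), n=1, cutoff=cutoff
    )
    if not matches:
        return None  # leave the field for manual completion
    label = matches[0]
    return label, codes[label]


print(complete_and_code("T*ACH*R"))
print(complete_and_code("XXXX"))
```

Returning `None` when nothing clears the cutoff keeps doubtful fields in the manual completion queue instead of guessing a code.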

Scanning, OCR, Validation, Export

Process integrity, questionnaire integrity: a workflow according to the client's needs

Flexibility

Thank You

Census Data Capture Platform