Census Data Capture Challenge
Intelligent Document Capture Solution
UNSD Workshop - Minsk Dec 2008
Amir Angel, Director of Government Projects
The evolution of data capture in census projects
From OCR into IDR Solution
eFLOW
Five steps:
Manual data entry (Key from paper):
– Slow process
– High error rate in the data entry process
– Recruitment, training and management of personnel
Key from Image:
– Archive
– Approx. 20% faster than key from paper
Key From Paper
Key From Image
OMR (Hardware readers for checkbox)
– Requires special scanners and specially printed forms
– Cannot handle handwritten/printed data
– Forms are not user-friendly
– OMR requires more answers => more space => increased paper expenditures => more handling and printing costs
– Not flexible, difficult to adjust to other applications once census is over
– No possibility to add business rules: imputation, validations, coding
OMR
Automated Data Capture
– Requires less human intervention and enables the census data capture to be completed much faster (less space, lower salary costs, less hardware)
– Full flexibility in the type of data gathered (checkbox, OMR, handwritten, alpha and numeric, barcode…)
– Ensures data integrity: enables both automatic and manual online validations, exception handling and coding
– The most advanced and proven technology for Censuses, recommended by the UN and used by all modern countries for census projects
– Creates a correlation between the image and the actual form
– Remote capabilities enable all forms to be scanned locally and then sent to a central site for processing
eFLOW
Automated Data Capture
Intelligent data capture platform (IDR) using OCR/ICR/OMR/PDA/Web/email:
– Automated data capture +
– Automatic classification of documents: understands and differentiates between various types of documents and languages, based on state-of-the-art machine learning algorithms
– Artificial intelligence algorithms that give the system enough information to find the location of the fields on its own
Intelligent Data Capture
eFLOW
Traditional Data Capture
[Diagram: Mail Room, Scanning, Data Entry, Back-Office, End Users; document prep with sorting; manual key from image]
Intelligent Document Capture
[Diagram: Mail Room, Scanning, Data Entry, Back-Office, End Users; document prep with no sorting]
– Reduce manual data entry by 40-70%
– Increase accuracy and consistency
India 2001
Turkey 1997
Brazil 2000
South Africa 2001
Ireland 2002
Italy 2002
Cyprus 2002
Turkey 2000
Kenya 2000
Slovak Republic 2001
Hong Kong 2001
Thailand 2008 (Community)
Slovenia 2006
Hong Kong 2006
South Africa Survey 2007
Ireland 2006
Manual
Saving of 25%
Saving of 50%
(Source: CSO – Central Statistics Office, Ireland)
Automated Data Capture = time saving
The technology is there
No need to reinvent the wheel
Reducing risk by using ‘off-the-shelf’ technologies
Improve Recognition – Voting mechanism
[Figure: three OCR engines (Type A, Type B, Type C) read the same field and return *7521, 97*21 and 9*5*1, where * marks a character the engine could not read. Voting combines the three results into 97521.]
Voting: Single Engine vs. Virtual Engines (values in %)
                AEG    Nestor  OCE    Virtual
Good            90.6   88.5    92     90.2
Reject          5.6    5.7     3.3    8.9
False Positive  3.8    5.8     4.7    0.9
Figure Of Merit Example
A system recognizes 90% of the characters contained in a batch, but misclassifies 4%
90 - (10*4) = 50
The Figure Of Merit in this example is 50
A system recognizes 80% of the characters contained in a batch, but misclassifies 1%
80 - (10*1) = 70
The Figure Of Merit in this example is 70
The second system is more efficient
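The arithmetic in this example can be sketched in a few lines of Python. The function name `figure_of_merit` and the `penalty` parameter are illustrative and do not come from the slides; the weight of 10 per misclassified percent is read off the two worked examples, and it is an assumption that the same weight applies in general.

```python
def figure_of_merit(recognized_pct, misclassified_pct, penalty=10):
    """Recognition rate minus a weighted penalty for misclassified characters.

    The default penalty of 10 is taken from the worked examples on the
    slide; treating it as a general constant is an assumption.
    """
    return recognized_pct - penalty * misclassified_pct


# (recognized %, misclassified %) for the two systems in the example
systems = {
    "first system": (90, 4),
    "second system": (80, 1),
}

scores = {name: figure_of_merit(r, m) for name, (r, m) in systems.items()}
for name, score in scores.items():
    print(f"{name}: {score}")

# The system with the higher figure of merit is preferred
best = max(scores, key=scores.get)
print("preferred:", best)
```

Under this measure a misclassified character costs ten times as much as a rejected one, which is why the system with the lower raw recognition rate still comes out ahead.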
Unique Tiling station – Checking for false positives
– Identify false positives
– Alpha & Numeric fields
– Highlight for verifications
– Quality control for ICR
Engine  Result
1       25***8
2       2*5378
3       253478
4       2*34*8
Voting Methods Example
Assume we have a virtual engine that includes 4 engines. We want to identify the following number: 253478. The results of each engine are listed under Engine / Result. The final results of the virtual engine will be:
Safe: 2****8
Normal: 25**78
Majority: 253478
Order: 255378
Equalizer: ??????
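The slide does not define the voting rules, but the four listed results are consistent with the per-character rules sketched below. This is a minimal reconstruction inferred from the example outputs, not the product's actual implementation; the function name `vote`, the tie handling in the majority rule, and the treatment of `*` as a rejected character are assumptions. The Equalizer method is left out because the slide gives no result for it.

```python
from collections import Counter

REJECT = "*"  # assumed marker for a character an engine could not read


def vote(results, method):
    """Combine per-engine strings character by character.

    `results` is a list of equal-length strings in engine priority order.
    """
    out = []
    for chars in zip(*results):
        read = [c for c in chars if c != REJECT]  # characters actually recognized
        if method == "safe":
            # every engine must read the character and all must agree
            ok = len(read) == len(chars) and len(set(read)) == 1
            out.append(read[0] if ok else REJECT)
        elif method == "normal":
            # engines that read something must all agree; rejects are ignored
            out.append(read[0] if read and len(set(read)) == 1 else REJECT)
        elif method == "majority":
            # most frequent reading wins; a tie for first place is rejected
            ranked = Counter(read).most_common(2)
            tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
            out.append(ranked[0][0] if ranked and not tie else REJECT)
        elif method == "order":
            # first engine in priority order that read the character wins
            out.append(read[0] if read else REJECT)
        else:
            raise ValueError(f"unknown method: {method}")
    return "".join(out)


engines = ["25***8", "2*5378", "253478", "2*34*8"]
for m in ("safe", "normal", "majority", "order"):
    print(m, vote(engines, m))
```

The same function applied with the majority rule to the three readings *7521, 97*21 and 9*5*1 from the earlier voting figure yields 97521. The methods trade rejects against false positives: Safe rejects the most characters, while Order never rejects a character that any engine read and can therefore pass on a wrong reading, as in 255378.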
Other Approaches: Auto Coding
– Coding tasks and data validations performed on the data capture platform: a ‘cost-effective’ solution
– Use artificial intelligence and statistical software to “understand” sentences
Q: “What do you do for a living?”
A: “I am guiding children” → “Teacher” → 2030
– Use Approximate Search tools for improving results via DB (Exorbyte)
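The approximate-search step can be illustrated with a small sketch. It uses Python's standard `difflib` module as a stand-in and is not the Exorbyte product; the function name `auto_code`, the cutoff value and the code table are hypothetical, and only the pair "teacher" and 2030 comes from the slide. It covers tolerant lookup of a misspelled or misrecognized answer, not the sentence-understanding step.

```python
import difflib

# Hypothetical code table: only "teacher" -> 2030 comes from the slide;
# the other entries are invented for the example.
OCCUPATION_CODES = {
    "teacher": 2030,
    "farmer": 6110,
    "nurse": 3220,
}


def auto_code(answer, cutoff=0.8):
    """Return the code of the closest table entry, or None so the answer
    can be routed to a manual coding operator."""
    matches = difflib.get_close_matches(
        answer.strip().lower(), list(OCCUPATION_CODES), n=1, cutoff=cutoff
    )
    return OCCUPATION_CODES[matches[0]] if matches else None


print(auto_code("Techer"))                 # misspelled or misrecognized answer
print(auto_code("I am guiding children"))  # no close entry in the table
```

A free-text answer such as “I am guiding children” finds no close entry and falls through to manual coding here, which is the gap the sentence-understanding software mentioned above is meant to close.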
Process integrity, questionnaire integrity - a workflow according to the client's needs
Scanning, OCR, Validation, Export
Flexibility