Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm
Kenneth Martin Tubbs Jr.
A Thesis Proposal Presented to the Department of Computer Science
Brigham Young University
Motivation
• Millions of people want genealogical information
• Acquiring microfilm is expensive and time consuming
Problem
• Searching microfilm by hand is slow, error prone, and tedious
• Extraction by hand requires enormous amounts of time and manpower
Problem
• Tables have different layouts and styles
• Tables contain different records
• Tables lack information and are ambiguous
Related Work
• Current work exploits the geometric properties of tables
• Regular expressions, grammars, probabilistic models, and templates
• These approaches ignore the ontological constraints of the information
Input Features
1. Coordinates of each cell.
2. Printed text of each cell.
3. Whether or not each cell is empty.
XML Input File
<cell rectangle="335,114,521,172" printed_text="NAME and Surname of each Person" empty="0" />
…
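A minimal sketch of reading the three input features from such a file. The enclosing `<cells>` root element, the `Cell` class, the left/top/right/bottom reading of `rectangle`, and the convention that `empty="1"` marks an empty cell are all assumptions made for illustration; the slides show only a single `cell` element.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Cell:
    # Assumed order of the four "rectangle" numbers: left, top, right, bottom.
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


def parse_cells(xml_text):
    """Parse every <cell> element into a Cell record."""
    root = ET.fromstring(xml_text)
    cells = []
    for node in root.iter("cell"):
        left, top, right, bottom = (int(v) for v in node.get("rectangle").split(","))
        cells.append(
            Cell(
                left, top, right, bottom,
                node.get("printed_text", ""),
                node.get("empty", "0") == "1",  # assumption: "1" means empty
            )
        )
    return cells


# Hypothetical wrapper element around the cell shown on the slide.
SAMPLE = """<cells>
  <cell rectangle="335,114,521,172"
        printed_text="NAME and Surname of each Person" empty="0" />
</cells>"""

cells = parse_cells(SAMPLE)
```

Keeping the coordinates as integers makes the later geometric comparisons between cells straightforward.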
Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Collect Evidence, Apply Rules, Verify Results
Output: SQL Insert Statements
Cell Types
• Label Cells
• Value Cells
• Empty Cells
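One plausible way to derive the three cell types from the input features, sketched under a loud assumption: a non-empty cell that carries printed text is treated as a label cell, and a non-empty cell without printed text (for example, handwritten content) as a value cell. The slides do not state this rule; `classify_cell` is a hypothetical helper.

```python
def classify_cell(printed_text, empty):
    """Return "empty", "label", or "value" for one table cell.

    Assumption (not stated in the slides): printed text marks a label cell.
    """
    if empty:
        return "empty"
    if printed_text.strip():
        return "label"
    return "value"
```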
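Input: Genealogical Ontology
[Figure: ontology diagram relating the object sets Family, Address, Person, Name, Age, and Gender; edges are annotated with participation constraints (1, *) and average cardinalities (4.3, 1.1, 1.3, 1.1, 1, 1).]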
Extract Features
• The algorithm extracts features
• Features support or refute geometric and ontological relationships
• Extracted features yield a confidence value between 0 and 1
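As an illustration of a geometric feature that yields a confidence between 0 and 1, the sketch below measures how much of a value cell's width lies in the same column span as a label cell. This particular feature, the name `column_overlap`, and the left/top/right/bottom tuple layout are assumptions; the slides do not list the actual features.

```python
def column_overlap(label, value):
    """Fraction of the value cell's width shared with the label cell (0.0 to 1.0).

    Both arguments are (left, top, right, bottom) tuples (assumed layout).
    """
    l_left, _, l_right, _ = label
    v_left, _, v_right, _ = value
    width = v_right - v_left
    if width <= 0:
        return 0.0
    shared = min(l_right, v_right) - max(l_left, v_left)
    return max(0.0, shared) / width


label = (335, 114, 521, 172)
below = (335, 173, 521, 231)      # directly under the label
shifted = (428, 173, 614, 231)    # half under the label
elsewhere = (521, 173, 700, 231)  # in the next column
```

A value cell directly under the label scores 1.0, a half-shifted cell 0.5, and a cell in the neighboring column 0.0, so the score can be read as evidence for or against a label-value association.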
5 Relationships
1. Associate value cells with label cells
2. Associate label cells with label cells
3. Associate value cells with value cells
4. Match label cells to object sets in the genealogical ontology
5. Identify label cells that factor other label cells
Evidence Matrix
              Label Cells
Value Cells   .75   .10
              .20   .32
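The matrix above can be held as a nested list, with one row per value cell and one column per label cell (this orientation is an assumption read off the slide layout). The helper `best_label` is hypothetical and shows only one simple way to read such a matrix, namely picking the highest-confidence label for a value cell; it is not presented as the proposal's decision procedure.

```python
# evidence[v][l] = confidence that value cell v belongs to label cell l
evidence = [
    [0.75, 0.10],
    [0.20, 0.32],
]


def best_label(matrix, value_index):
    """Index of the label cell with the highest confidence for one value cell."""
    row = matrix[value_index]
    return max(range(len(row)), key=row.__getitem__)
```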
Apply Rules
• A set of correlation rules associates the values of the evidence matrices
• The algorithm iterates over the set of correlation rules
A Rule
$$\forall j:\quad \min[LV_{ji}, LV_{jk}] = \min\{\, \min[LV_{ji}, LV_{jk}] \cdot [VV_{ik} + .3],\ \max[LV_{ji}, LV_{jk}] \,\}$$
Label - Value
.75 .10
.20 .32
Value - Value
.90
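A minimal sketch of the rule as read from the slide: for each label cell j, the smaller of the two label-value confidences for value cells i and k is scaled by the value-value confidence plus .3 and capped at the larger one. The indexing `lv[j][i]` (label first, value second), the function name `apply_rule`, and the choice to return a copy are assumptions; the numbers below are arranged for illustration and are not claimed to reproduce the slide's worked result.

```python
def apply_rule(lv, vv, i, k, boost=0.3):
    """Apply the correlation rule to value cells i and k for every label cell j.

    lv[j][i]: confidence that label cell j labels value cell i (assumed indexing).
    vv[i][k]: confidence that value cells i and k belong together.
    """
    updated = [row[:] for row in lv]
    for j, row in enumerate(lv):
        low, high = min(row[i], row[k]), max(row[i], row[k])
        new_low = min(low * (vv[i][k] + boost), high)
        # Overwrite whichever of the two entries held the minimum.
        target = i if row[i] <= row[k] else k
        updated[j][target] = new_low
    return updated


lv = [[0.75, 0.20],
      [0.10, 0.32]]
vv = [[1.0, 0.9],
      [0.9, 1.0]]
result = apply_rule(lv, vv, 0, 1)
```

With a value-value confidence of .90 the multiplier is 1.2, so the weaker association in each row is raised; a multiplier below 1 (value-value confidence under .7) would lower it instead.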
A Rule (continued)
[Figure: the rule applied to the example matrices, highlighting the Label - Value entries .75 and .32 and the Value - Value entry .90.]
Factoring
[Name] per [Address] = 9 / 2 = 4.5
Genealogical Ontology
[Name] per [Address] = 1 * 4.3 * 1.1 = 4.73
[Figure: the genealogical ontology diagram (Family, Address, Person, Name, Age, Gender) with its cardinality annotations.]
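The two ratios can be reproduced with a few lines. The sketch assumes the expected ratio is the product of the average cardinalities along the ontology path between the two object sets, and that the observed ratio is a simple count of label occurrences; the path named in the comment is a guess at which edges the factors 1, 4.3, and 1.1 belong to.

```python
import math

# Assumed path: Address -> Family -> Person -> Name
path_cardinalities = [1, 4.3, 1.1]
expected_ratio = math.prod(path_cardinalities)

# Counts taken from the slide's example: 9 [Name] cells per 2 [Address] cells.
name_count, address_count = 9, 2
observed_ratio = name_count / address_count
```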
A Factoring Rule
• Compare the expected cardinality ratio, $O_{ij}$, for a pair of label cells with the observed cardinality ratio, $N_i/N_j$.
$$FM_{ij} = FM_{ij} \cdot [1 - |O_{ij} - N_i/N_j| + C]$$
$$= FM_{ij} \cdot [1 - |4.73 - 4.5| + .5] = FM_{ij} \cdot 1.27$$
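The update above is a one-line computation. The sketch below takes C = .5 from the slide's worked example; the function name `update_factoring` and the starting confidence of 1.0 are illustrative assumptions.

```python
def update_factoring(fm, expected, observed, c=0.5):
    """Scale a factoring confidence by 1 - |expected - observed| + c."""
    return fm * (1 - abs(expected - observed) + c)


multiplier = update_factoring(1.0, expected=4.73, observed=4.5)
```

Note that the multiplier exceeds 1 only while the two ratios differ by less than C, so a close match strengthens the factoring evidence and a poor match weakens it.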
Score Results
• Score extracted record structure
• Human user for verification
[Figure: results stored in a Database, with Name and Family tables.]
Store Results
• Create SQL Insert statements to store table cell coordinates
…
INSERT INTO Person (Name) VALUES ('335,114,521,172')
INSERT INTO Person (Name) VALUES ('335,173,521,231')
…
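A minimal sketch of generating statements in the form shown above from a cell's coordinates. The helper `insert_statement` is hypothetical, and building the statement by string formatting is assumed to be acceptable here only because the table and column names come from the ontology rather than from user input.

```python
def insert_statement(table, column, rectangle):
    """Build one INSERT statement storing a cell's coordinates as text."""
    coords = ",".join(str(v) for v in rectangle)
    return f"INSERT INTO {table} ({column}) VALUES ('{coords}')"


statements = [
    insert_statement("Person", "Name", (335, 114, 521, 172)),
    insert_statement("Person", "Name", (335, 173, 521, 231)),
]
```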
Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Collect Evidence, Apply Rules, Store Results
Output: SQL Insert Statements
Measurements
• 5 to 7 concept tables
• Training set: 5 real-world tables
• Test set: 15 real-world tables
• Precision, recall, and accuracy of the cells written in the SQL statements
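The three measures can be computed from per-cell counts using their standard definitions. In the sketch below, how a cell counts as a true or false positive (for example, written to the correct table and column or not) is an assumption, and the counts passed in are made-up placeholders, not results.

```python
def cell_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from per-cell counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Placeholder counts for illustration only.
precision, recall, accuracy = cell_metrics(tp=8, fp=2, fn=1, tn=9)
```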
Contributions
• Exploits the constraints of both a genealogical ontology and geometry
• Combines extracted features using correlation rules
Delimitations
• Tables of rows and columns
• Genealogical domain
• English-language documents
• Tables that do not span multiple documents
Artifacts
• Application/demo in the Java programming language