
Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables

Kenneth Martin Tubbs Jr.

A Thesis Submitted to the Faculty of Brigham Young University

Motivation

• Millions of people want genealogical information

• Acquiring microfilm is expensive and time consuming

Extraction Problem

• Searching microfilm by hand is slow, error prone, and tedious

• Extraction by hand requires enormous amounts of time and manpower

Difficulties

• Tables have different layouts and styles

• Tables contain different records

• Tables do not use a uniform schema

• Tables lack information and are ambiguous

Related Work

• Current work exploits the geometric properties of tables

• Regular expressions, grammars, probabilistic models, and templates

• They ignore the ontological constraints of this information

Contributions

• Exploit both ontological and geometric constraints

• Identify complex records

• Work with tables with hand-written values

Algorithm

Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology

Method: 1. Generate Confidences 2. Enforce Constraints 3. Verify Results

Output: SQL Insert Statements

Training Set

• 25 tables from 5 different microfilm rolls

• Used to:

– Identify relationships between table cells

– Create genealogical ontology

– Define features to extract

– Generate rules (constraints)

Input: Microfilm Table

Input Features

1. Coordinates of each cell.

2. Printed text for label cells.

3. Whether or not each value cell is empty.

Input: Microfilm Table

<index source="0444770/0444770_2.gif" ontology="ontology.xml">
<cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
<cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
<cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
<cell rect="62,260,120,295" printed_text="2" empty="0" />
<cell rect="118,260,436,298" printed_text="3" empty="0" />
<cell rect="7,458,62,497" printed_text="" empty="1" />

. . .
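As a rough illustration of how the three input features could be read from such a file, the following Python sketch parses `cell` elements into simple records. The `Cell` class, the `parse_cells` function, the reading of `rect` as left, top, right, bottom, and the closing `</index>` tag in the sample are assumptions made for this example; they are not taken from the thesis tool.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    # Assumption: rect is "left,top,right,bottom" in image pixels.
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool

    @property
    def is_label(self) -> bool:
        # Assumption: a cell carrying printed text is treated as a label cell.
        return bool(self.printed_text)


def parse_cells(xml_text: str) -> List[Cell]:
    """Read every <cell> element into a Cell record."""
    root = ET.fromstring(xml_text)
    cells = []
    for node in root.iter("cell"):
        left, top, right, bottom = (int(v) for v in node.get("rect").split(","))
        cells.append(
            Cell(left, top, right, bottom,
                 node.get("printed_text", ""),
                 node.get("empty") == "1")
        )
    return cells


# Two cells copied from the slide; the closing tag is added so the sample parses.
SAMPLE = """<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>"""

cells = parse_cells(SAMPLE)
```

Here `cells[0]` is a label cell and `cells[1]` is an empty value cell.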

Genealogical Ontology

<Ontology>
<ObjectSet id="0" name="Person" syn="" lex="0"/>
<ObjectSet id="1" name="Family" syn="families" lex="0"/>
<ObjectSet id="2" name="Event" syn="" lex="0"/>
<ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
<ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/>
<ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
<ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/>
<ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/>
<ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
<ObjectSet id="9" name="Title(s)" syn="title" lex="1"/>

. . .

Generate Confidences

• Confidence of relationships between pairs of cells

• Generate confidence values between 0 and 1

Relationships

• A label cell describes a value cell

• Value cells in same row or column

• Label cells form a multi-level label

• A label cell maps to an object set

• Identify factoring

Label Cell and Value Cell

A continuous path between a label cell and a value cell

Confidence =

1 If a path exists

0 If no path exists

Label Cell and Value Cell

Preferences for label – value orientations

Label Orientation    Confidence
Above                1
Left                 .75
Right                .5
Below                .25
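The orientation preferences can be pictured as a lookup keyed by where the label sits relative to the value cell. The sketch below is only an illustration: the confidence values come from the slide, but classifying the orientation by comparing cell centers, the `(left, top, right, bottom)` rectangle layout, and the function names are assumptions.

```python
# Confidence values taken from the slide's orientation table.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def center(rect):
    """Center point of a (left, top, right, bottom) rectangle."""
    left, top, right, bottom = rect
    return ((left + right) / 2, (top + bottom) / 2)


def label_orientation(label_rect, value_rect):
    """Classify where the label lies relative to the value cell.

    Assumption: image coordinates, with y growing downward, and the
    dominant axis of the center-to-center offset decides the orientation.
    """
    lx, ly = center(label_rect)
    vx, vy = center(value_rect)
    dx, dy = vx - lx, vy - ly
    if abs(dy) >= abs(dx):
        return "above" if dy > 0 else "below"
    return "left" if dx > 0 else "right"


def orientation_confidence(label_rect, value_rect):
    return ORIENTATION_CONFIDENCE[label_orientation(label_rect, value_rect)]
```

For the first label cell of the sample input, `7,131,62,261`, and the empty value cell `7,458,62,497` beneath it, this sketch classifies the label as above the value.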

Label Cell and Value Cell

Compare the height or width of each label cell with each value cell

Confidence ranges from 0 (Not Similar) to 1 (Similar)

Value Cell and Value Cell (Same Row)

A continuous, horizontal path exists between a pair of value cells

Confidence =

1 If a path exists

0 If no path exists

Value Cell and Value Cell (Same Column)

A continuous, vertical path exists between a pair of value cells

Confidence =

1 If a path exists

0 If no path exists

Value Cell and Value Cell (Geometrically Similar)

Compare height and width

Confidence ranges from 0 (Not Similar) to 1 (Similar)
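The slides do not give the formula that turns a height and width comparison into a value between 0 and 1. One plausible scoring, shown here purely as an assumption, multiplies the ratio of the smaller to the larger width by the same ratio for height, so identical cells score 1 and very different cells approach 0.

```python
def dimension_ratio(a, b):
    """Ratio of the smaller to the larger of two lengths, in [0, 1]."""
    if a == b:
        return 1.0  # also covers two zero-length sides
    return min(a, b) / max(a, b)


def geometric_similarity(rect_a, rect_b):
    """Hypothetical similarity score for two (left, top, right, bottom) cells.

    Assumption: similarity is the product of the width ratio and the
    height ratio; the thesis may use a different measure.
    """
    width_a, height_a = rect_a[2] - rect_a[0], rect_a[3] - rect_a[1]
    width_b, height_b = rect_b[2] - rect_b[0], rect_b[3] - rect_b[1]
    return dimension_ratio(width_a, width_b) * dimension_ratio(height_a, height_b)
```

Two value cells with equal width and height score 1.0 under this sketch, while a cell twice as wide as another of the same height scores 0.5.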

Multi-level Labels

• Distance between the midpoints

• A line through the midpoints

• Share a common border

Match Label Cells to Object Sets

• Match synonyms of object sets to words in a label

– Location of matched words

– Order that object sets match words

Object Sets: Full Name, Location, Day, Family
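A minimal sketch of synonym matching, assuming the synonym lists from the ontology excerpt and a simple whole-word comparison, might look like the following. It records the position of the first matched word for each object set and orders the results by that position; how the thesis weights location and order is not stated on the slides, so no confidence value is computed here. The names `OBJECT_SETS` and `match_object_sets` are hypothetical.

```python
import re

# Synonym lists copied from the ontology excerpt (object sets with no
# synonyms are omitted because they cannot match any word).
OBJECT_SETS = {
    "Family": "families",
    "Age": "age birthday",
    "Relationship": "relationship relation",
    "Full Name": "full name whom who",
    "First Name": "first given christian",
    "Middle Name(s)": "middle initial",
    "Last Name": "last surname",
    "Title(s)": "title",
}


def match_object_sets(label_text):
    """Return (object set, position of first matched word), ordered by position."""
    words = re.findall(r"[a-z]+", label_text.lower())
    matches = []
    for name, synonyms in OBJECT_SETS.items():
        synonym_words = set(synonyms.split())
        positions = [i for i, word in enumerate(words) if word in synonym_words]
        if positions:
            matches.append((name, positions[0]))
    return sorted(matches, key=lambda match: match[1])


label = ("The Name of every Person whose usual place of abode on the "
         "first day of June, 1840, was in this family.")
matches = match_object_sets(label)
```

On this census label the sketch matches Full Name through "name" and also First Name through "first" in "first day of June", which shows how a purely lexical match can produce a spurious mapping that later constraints have to resolve.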

Enforce Constraints

• A set of rules describes geometric and ontological constraints.

• For example:

– Value cells of the same type have the same dimensions

– A family can't have 100 members

• The algorithm iterates over the rules

1. Similar Value Cells

Lower Confidence

2. Combine Aggregations

3. Multi-level Labels

4. Factoring

Check Cardinality Constraints

• Observed cardinality:

– microfilm table

• Expected cardinality:

– genealogy ontology

Observed Cardinality

[First Name] per [Family] = 45 / 9 = 4.67

. . .

Expected Cardinality

[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
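The cardinality check can be sketched as a comparison of two numbers: an observed ratio counted from the table and an expected ratio obtained by multiplying average cardinalities along a path in the ontology. In the sketch below, the reading of the factors 4.8, 1, and 1 as Person per Family, Full Name per Person, and First Name per Full Name is an assumption, as are the function names, the relative-difference measure, and the illustrative counts used in the example.

```python
from math import prod


def observed_cardinality(child_count, parent_count):
    """Ratio counted directly from the cells of one microfilm table."""
    return child_count / parent_count


def expected_cardinality(path_cardinalities):
    """Product of the average cardinalities along an ontology path."""
    return prod(path_cardinalities)


def relative_difference(observed, expected):
    """How far the observed ratio is from the expected one, as a fraction."""
    return abs(observed - expected) / expected


# Assumed path: Person per Family (4.8), Full Name per Person (1),
# First Name per Full Name (1).
expected = expected_cardinality([4.8, 1, 1])

# Hypothetical counts chosen for illustration: 42 first names in 9 families.
observed = observed_cardinality(42, 9)
gap = relative_difference(observed, expected)
```

A small gap supports the candidate factoring; a large gap (for example, a family with 100 members) would count against it. How the gap is converted into a confidence adjustment is left open here because the slides do not specify it.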


5. Ontological Similarity

Increase Confidence of Label to Object Set Mappings

6. Same Microfilm Roll

• Tables from the same microfilm roll have the same structure and relationships

• Generate the confidence values for multiple tables from the same roll

• Take the average of the respective confidence values
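Averaging across a roll can be sketched as follows, assuming each table's confidences are stored in a dictionary keyed by a relationship identifier that is comparable across tables from the same roll. The dictionary layout, the key names, and the function name are assumptions for this example.

```python
from statistics import mean


def average_confidences(tables):
    """Average each confidence value over the tables that report it.

    tables: list of dicts mapping a relationship key to a confidence in [0, 1].
    """
    keys = set()
    for table in tables:
        keys.update(table)
    return {
        key: mean(table[key] for table in tables if key in table)
        for key in keys
    }


# Hypothetical confidences for two tables from one roll.
roll = [
    {"label0->value6": 1.0, "label1->FullName": 0.5},
    {"label0->value6": 0.5, "label1->FullName": 0.0},
]
averaged = average_confidences(roll)
```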

Verify Results

Database: Full Name …

• Create SQL Insert statements to store value cell coordinates

INSERT INTO Person (Full Name) VALUES ('335,114,521,172')
INSERT INTO Person (Full Name) VALUES ('335,173,521,231')
…
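Producing statements of this shape from recognized value cells is mostly string formatting. The sketch below reproduces the format shown on the slide; the function name and the `(left, top, right, bottom)` tuple are assumptions, and the table and column names are passed through unquoted exactly as the slide writes them.

```python
def insert_statement(table, column, rect):
    """Build one INSERT statement storing a value cell's coordinates.

    Assumption: table and column names come from the trusted ontology,
    so they are interpolated directly rather than escaped.
    """
    coords = ",".join(str(v) for v in rect)
    return f"INSERT INTO {table} ({column}) VALUES ('{coords}')"


statements = [
    insert_statement("Person", "Full Name", rect)
    for rect in [(335, 114, 521, 172), (335, 173, 521, 231)]
]
```

A real database would likely need the column name quoted or written without a space, as in the later examples that use `Full_Name`.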


Training Set Results

Relationship                         Precision   Recall   Accuracy
Label Cell Describes Value Cell      100%        100%     100%
Value Cells in Same Row or Column    100%        100%     100%
Multilevel Labels                    100%        100%     100%
Label Cells – Object Set Matches     74.45%      100%     84.65%
Factoring                            100%        100%     100%
SQL Fields                           99.42%      100%     99.71%

Ambiguous Factoring

Experiments

• 75 tables from 15 different microfilm rolls

• Precision, recall, and accuracy

– Populated SQL fields

– Each relationship
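For reference, precision and recall in their standard form can be computed from counts of true positives, false positives, and false negatives, as in the sketch below. The slides do not say how the accuracy column is derived, so it is deliberately left out, and the counts in the example are hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Standard precision and recall from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts: 9 correct fields, 1 spurious field, 3 missed fields.
precision, recall = precision_recall(true_pos=9, false_pos=1, false_neg=3)
```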

Test Set Results

Relationship                         Precision   Recall   Accuracy
Label Cell Describes Value Cell      100%        98.12%   98.12%
Value Cells in Same Row or Column    100%        100%     100%
Multilevel Labels                    100%        99.67%   99.82%
Label Cells – Object Set Matches     84.98%      92.76%   88.18%
Factoring                            100%        93.40%   93.47%
SQL Fields                           93.20%      92.41%   92.15%

3 Success Examples

1. Specialized Record

2. Ontology Constraints

3. Factoring

1. Specialized Records

INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Gender, Occupation, Race, Family_Identifier, Birth_Identifier) VALUES (1, '109,455,267,478', '314,456,336,479', '291,456,314,478', '505,457,637,480', '267,456,291,478', 1, 1)
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (2, 2)
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (3, 3)
INSERT INTO MOTHER_CHILD (Mother_Identifier, Child_Identifier) VALUES (3, 1)
INSERT INTO FATHER_CHILD (Father_Identifier, Child_Identifier) VALUES (2, 1)
INSERT INTO EVENT (Event_Identifier, Location) VALUES (1, '894,460,997,483')
INSERT INTO EVENT (Event_Identifier, Location) VALUES (2, '997,460,1076,483')
INSERT INTO EVENT (Event_Identifier, Location) VALUES (3, '1076,461,1153,484')

2. Ontology Constraints

INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Family_Identifier, Burial_Identifier) VALUES (1, '70,243,331,373', '620,243,687,370', 1, 1)
INSERT INTO FAMILY (Family_Identifier, Location) VALUES (1, '331,243,508,372')
INSERT INTO EVENT (Event_Identifier, Date) VALUES (1, '508,243,620,371')
INSERT INTO PERSON (Person_Identifier, Full_Name) VALUES (2, '687,241,861,372')

3. Factoring

3 Types of Errors

1. Ambiguous Factoring

2. Long Label Names

3. Ambiguous Columns

2. Long Label Names

3. Ambiguous Columns

Artifacts

• Tool in the Java programming language

• http://www.rdhd.byu.edu/

• Executable Jar File

• Source Code

• Input Files

• Documentation

Future Work

• Advanced natural language processing

• Hand-written values

• Machine learning

Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm

Kenneth Martin Tubbs Jr.

A Thesis Presented to the Department of Computer Science

Brigham Young University
