Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm. Kenneth Martin Tubbs Jr. A Thesis Proposal Presented to the Department of Computer Science, Brigham Young University


Page 1: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm

Kenneth Martin Tubbs Jr.

A Thesis Proposal Presented to the Department of Computer Science

Brigham Young University

Page 2: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Motivation

• Millions of people want genealogical information

• Acquiring microfilm is expensive and time-consuming

Page 3: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Problem

• Searching microfilm by hand is slow, error-prone, and tedious

• Extraction by hand requires enormous amounts of time and manpower

Page 4: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Problem

• Tables have different layouts and styles

• Tables contain different records

• Tables lack information and are ambiguous

Page 5: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Related Work

• Current work exploits the geometric properties of tables

• Approaches include regular expressions, grammars, probabilistic models, and templates

• These approaches ignore the ontological constraints of the information

Page 6: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Related Work

Input Features

1. Coordinates of each cell.

2. Printed text of each cell.

3. Whether or not each cell is empty.

XML Input File

<cell rectangle="335,114,521,172" printed_text="NAME and Surname of each Person" empty="0" />
…
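As a rough illustration of how the three input features could be read from one such `<cell>` element, here is a minimal Python sketch. The `Cell` class and `parse_cell` function are hypothetical names, and two details are assumptions not stated on the slide: that the rectangle is ordered left, top, right, bottom, and that `empty="1"` marks an empty cell.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Cell:
    # Coordinate order (left, top, right, bottom) is an assumption.
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


def parse_cell(xml_fragment: str) -> Cell:
    """Read the three input features from a single <cell .../> element."""
    element = ET.fromstring(xml_fragment)
    left, top, right, bottom = (
        int(value) for value in element.get("rectangle").split(",")
    )
    return Cell(
        left=left,
        top=top,
        right=right,
        bottom=bottom,
        printed_text=element.get("printed_text", ""),
        # Assumption: "1" means empty, "0" means not empty.
        empty=element.get("empty") == "1",
    )


sample = (
    '<cell rectangle="335,114,521,172" '
    'printed_text="NAME and Surname of each Person" empty="0" />'
)
cell = parse_cell(sample)
print(cell)
```

A full input file would presumably wrap many such elements in a root element; that wrapper is not shown on the slide, so the sketch parses one element at a time.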

Page 7: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Algorithm

Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology

Method: Collect Evidence, Apply Rules, Verify Results

Output: SQL Insert Statements

Page 8: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Cell Types

Label Cells (printed)

Value Cells

Empty Cells

Page 9: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Input

Page 10: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

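Genealogical Ontology

[Figure: genealogical ontology diagram. Object sets: Family, Address, Person, Name, Age, Gender. Relationship edges are annotated with expected cardinalities; the values shown are 1, 4.3, 1.1, 1.3, and *.]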

Page 11: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Extract Features

• The algorithm extracts features

• Each feature supports or refutes a geometric or ontological relationship

• Each extracted feature yields a confidence value between 0 and 1

Page 12: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

5 Relationships

1. Associate value cells to label cells

2. Associate label cells to label cells

3. Associate value cells to value cells

4. Match label cells to object sets in the genealogical ontology

5. Identify label cells that factor other label cells

Page 13: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Evidence Matrix

             Label Cells
Value Cells  .75  .10
             .20  .32
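One plausible in-memory form for such an evidence matrix is a list of rows, one row per value cell and one column per label cell. The Python sketch below is illustrative only: the function names are hypothetical, and picking the strongest label per value cell is just a way to read the matrix, not the rule-based method the slides describe.

```python
# Label-value evidence matrix from the slide.
# Assumption: rows are value cells and columns are label cells.
label_value = [
    [0.75, 0.10],
    [0.20, 0.32],
]


def validate(matrix):
    """Check that every entry is a confidence value between 0 and 1."""
    for row in matrix:
        for confidence in row:
            if not 0.0 <= confidence <= 1.0:
                raise ValueError(f"confidence out of range: {confidence}")


def strongest_label_per_value(matrix):
    """For each value cell (row), return the index of its strongest label."""
    return [max(range(len(row)), key=lambda col: row[col]) for row in matrix]


validate(label_value)
print(strongest_label_per_value(label_value))  # prints [0, 1]
```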

Page 14: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Apply Rules

• A set of correlation rules associates the values of the evidence matrices

• The algorithm iterates over the set of correlation rules

Page 15: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

A Rule

$\forall j:\ \min[LV_{ji} \,\&\, LV_{jk}] = \min\{\, \min[LV_{ji} \,\&\, LV_{jk}] \cdot [VV_{ik} + 0.3],\ \max[LV_{ji} \,\&\, LV_{jk}] \,\}$

Label - Value:
.75 .10
.20 .32

Value - Value: .90
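Read literally, the rule takes the two label-value confidences that one label cell j has with value cells i and k, scales the smaller one by the value-value confidence plus 0.3, and caps the result at the larger one. The Python sketch below follows that reading. The function name `apply_rule` is hypothetical, and pairing .75 with .32 as the two label-value entries is an assumption about what the figure highlights.

```python
def apply_rule(lv_ji, lv_jk, vv_ik, boost=0.3):
    """Update the smaller of two label-value confidences.

    The smaller value is multiplied by (vv_ik + boost) and capped at the
    larger value; the larger value is left unchanged. The 0.3 boost is the
    constant shown on the slide.
    """
    low, high = sorted((lv_ji, lv_jk))
    new_low = min(low * (vv_ik + boost), high)
    # Return the pair in its original (ji, jk) order.
    if lv_ji <= lv_jk:
        return new_low, lv_jk
    return lv_ji, new_low


# Slide values; which matrix entries play LV_ji and LV_jk is an assumption.
updated = apply_rule(0.75, 0.32, 0.90)
print(tuple(round(value, 3) for value in updated))
```

With these numbers the smaller confidence rises from .32 to about .384, because .90 + .3 exceeds 1. A value-value confidence below .7 would lower it instead.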

Page 16: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

A Rule

$\forall j:\ \min[LV_{ji} \,\&\, LV_{jk}] = \min\{\, \min[LV_{ji} \,\&\, LV_{jk}] \cdot [VV_{ik} + 0.3],\ \max[LV_{ji} \,\&\, LV_{jk}] \,\}$

Value - Value: .90

Label - Value: .75 and .32 (the figure repeats this pair as .32 and .75)

Page 17: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Factoring

[Name] per [Address] = 9 / 2 = 4.5

Page 18: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Genealogical Ontology

[Figure: genealogical ontology diagram. Object sets: Family, Address, Person, Name, Age, Gender. Relationship edges are annotated with expected cardinalities; the values shown are 1, 4.3, 1.1, 1.3, and *.]

[Name] per [Address] = 1 * 4.3 * 1.1 = 4.73
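The expected ratio above is a product of per-edge cardinalities along a path through the ontology. A minimal Python sketch of that arithmetic follows. The assignment of 1, 4.3, and 1.1 to the Address-Family, Family-Person, and Person-Name edges is an assumption inferred from the order of the factors, and `expected_per_step` and `expected_ratio` are hypothetical names.

```python
from math import prod

# Expected cardinality for each step of the path.
# Assumption: which edge carries which number is inferred, not stated.
expected_per_step = {
    ("Address", "Family"): 1.0,
    ("Family", "Person"): 4.3,
    ("Person", "Name"): 1.1,
}


def expected_ratio(path):
    """Multiply the expected cardinalities of consecutive object sets."""
    return prod(expected_per_step[(a, b)] for a, b in zip(path, path[1:]))


ratio = expected_ratio(["Address", "Family", "Person", "Name"])
print(round(ratio, 2))  # prints 4.73
```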

Page 19: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

A Factoring Rule

• Compare the expected cardinality ratio, $O$, for a pair of label cells with the observed cardinality ratio, $N_i/N_j$.

$FM_{ij} = FM_{ij} \cdot [1 - |O_{ij} - N_i/N_j| + C]$

$= FM_{ij} \cdot [1 - |4.73 - 4.5| + 0.5] = FM_{ij} \cdot 1.27$
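The worked numbers can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide: `factoring_update` is a hypothetical name, and the starting confidence of 1.0 is chosen so that the printed result equals the multiplier.

```python
def factoring_update(fm_ij, expected, n_i, n_j, c=0.5):
    """Scale a factoring confidence by how closely the observed
    cardinality ratio n_i / n_j matches the expected ratio."""
    observed = n_i / n_j
    return fm_ij * (1 - abs(expected - observed) + c)


# Slide values: expected 4.73, observed 9 / 2 = 4.5, C = 0.5.
multiplier = factoring_update(1.0, 4.73, 9, 2)
print(round(multiplier, 2))  # prints 1.27
```

The formula as written turns negative once the two ratios differ by more than 1 + C. The slides do not say how that case is handled, so the sketch leaves it as is.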

Page 20: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Score Results

• Score the extracted record structure

• A human user verifies the results

Page 21: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Score Results

Page 22: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Database

[Figure: database with Name and Family tables.]

• Create SQL Insert statements to store table cell coordinates

INSERT INTO Person (Name) VALUES ('335,114,521,172')

INSERT INTO Person (Name) VALUES ('335,173,521,231')
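Statements of this shape could be produced from cell rectangles as sketched below in Python. The helper names are hypothetical, and the `Person` table with a single text column `Name` is an assumption based on the two statements shown. The second half runs the same inserts against an in-memory SQLite database with parameter binding, which is generally safer than building SQL strings when the data reaches a real database.

```python
import sqlite3


def format_rectangle(rectangle):
    """Join four cell coordinates into the comma-separated form on the slide."""
    return ",".join(str(value) for value in rectangle)


def insert_statement(table, column, rectangle):
    """Build the textual INSERT statement for one cell rectangle."""
    return (
        f"INSERT INTO {table} ({column}) "
        f"VALUES ('{format_rectangle(rectangle)}')"
    )


rectangles = [(335, 114, 521, 172), (335, 173, 521, 231)]
statements = [insert_statement("Person", "Name", r) for r in rectangles]
for statement in statements:
    print(statement)

# Same data inserted with parameter binding (hypothetical schema).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Person (Name TEXT)")
for rectangle in rectangles:
    connection.execute(
        "INSERT INTO Person (Name) VALUES (?)", (format_rectangle(rectangle),)
    )
rows = connection.execute("SELECT Name FROM Person ORDER BY rowid").fetchall()
connection.close()
```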

Page 23: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Algorithm

Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology

Method: Collect Evidence, Apply Rules, Store Results

Output: SQL Insert Statements

Page 24: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Measurements

• 5 to 7 concept tables

• Training set: 5 real-world tables

• Test set: 15 real-world tables

• Precision, recall, and accuracy of the cells written in the SQL statements.
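The slides do not define how the three measures are computed, so the Python sketch below uses the standard set-based definitions as an assumption. It treats the cells written into the SQL statements for one column as the predicted set and compares them with a hand-labeled expected set. All names and the sample cell identifiers are hypothetical.

```python
def score(predicted, expected, all_cells):
    """Return (precision, recall, accuracy) for one column of cells.

    predicted: cells the algorithm wrote into SQL statements
    expected:  cells a human says belong there
    all_cells: every cell in the table
    """
    true_pos = len(predicted & expected)
    false_pos = len(predicted - expected)
    false_neg = len(expected - predicted)
    true_neg = len(all_cells - predicted - expected)

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    accuracy = (true_pos + true_neg) / len(all_cells) if all_cells else 0.0
    return precision, recall, accuracy


# Made-up cell identifiers, for illustration only.
cells = {"a", "b", "c", "d"}
result = score(predicted={"a", "c"}, expected={"a", "b"}, all_cells=cells)
print(result)  # prints (0.5, 0.5, 0.5)
```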

Page 25: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Contributions

• Exploits the constraints of both a genealogical ontology and geometry

• Combines extracted features using correlation rules

Page 26: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Delimitations

• Tables of rows and columns

• Genealogical domain

• English language documents

• Tables that do not span multiple documents

Page 27: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Artifacts

• Application/demo in the Java programming language.

Page 28: Recognizing Table Structure from the  Extracted Cells of Genealogical Microfilm

Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm

Kenneth Martin Tubbs Jr.

A Thesis Proposal Presented to the Department of Computer Science

Brigham Young University