Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm


Kenneth Martin Tubbs Jr.

A Thesis Proposal Presented to the Department of Computer Science

Brigham Young University

Motivation

• Millions of people want genealogical information

• Acquiring microfilm is expensive and time consuming

Problem

• Searching microfilm by hand is slow, error prone, and tedious

• Extraction by hand requires enormous amounts of time and manpower

Problem

• Tables have different layouts and styles

• Tables contain different records

• Tables lack information and are ambiguous

Related Work

• Current work exploits the geometric properties of tables

• Regular expressions, grammars, probabilistic models, and templates

• They ignore the ontological constraints of the information

Input Features

1. Coordinates of each cell.

2. Printed text of each cell.

3. Whether or not each cell is empty.

XML Input File

<cell rectangle="335,114,521,172" printed_text="NAME and Surname of each Person" empty="0" />
…
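As a rough illustration of how the three input features could be read from such a file, here is a minimal Python sketch. The wrapping `<table>` root element, the `Cell` class, the `parse_cells` helper, the coordinate order (left, top, right, bottom), the second sample cell's attributes, and the reading of `empty="1"` as "empty" are all assumptions made for this example; the slides only show a single `<cell>` element.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    # Assumed coordinate order: left, top, right, bottom.
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


def parse_cells(xml_text: str) -> List[Cell]:
    """Read every <cell> element into a Cell record."""
    root = ET.fromstring(xml_text)
    cells = []
    for node in root.iter("cell"):
        left, top, right, bottom = (int(v) for v in node.get("rectangle").split(","))
        cells.append(
            Cell(
                left, top, right, bottom,
                node.get("printed_text", ""),
                node.get("empty", "0") == "1",  # assumption: "1" marks an empty cell
            )
        )
    return cells


# Hypothetical wrapper element; the first cell is the one shown on the slide.
SAMPLE = """<table>
  <cell rectangle="335,114,521,172" printed_text="NAME and Surname of each Person" empty="0" />
  <cell rectangle="335,173,521,231" printed_text="" empty="0" />
</table>"""

cells = parse_cells(SAMPLE)
```

A cell with printed text would plausibly be a label cell, and a non-empty cell without printed text a value cell, but that classification is not spelled out in the slides.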

Algorithm

Input: XML Input File (Preprocessed Microfilm Image); Genealogical Ontology

Method: Collect Evidence -> Apply Rules -> Verify Results

Output: SQL Insert Statements

Cell Types

Label Cells

Value Cells

Empty Cells

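Input: Genealogical Ontology

[Figure: genealogical ontology diagram connecting the object sets Family, Address, Person, Name, Age, and Gender, annotated with participation constraints (1, *) and average cardinalities (4.3, 1.1, 1.3, 1, 1.1, 1).]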

Extract Features

• The algorithm extracts features

• Features support or refute geometric and ontological relationships

• Each extracted feature yields a confidence value between 0 and 1

5 Relationships

1. Associate value cells to label cells

2. Associate label cells to label cells

3. Associate value cells to value cells

4. Match label cells to object sets in the genealogical ontology

5. Identify label cells that factor other label cells

Evidence Matrix

Rows are value cells; columns are label cells.

            Label 1   Label 2
Value 1       .75       .10
Value 2       .20       .32
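To make the idea of an evidence matrix concrete, the following Python sketch fills a value-by-label matrix from one geometric feature. The feature used here (horizontal overlap between a value cell and a label cell) and the names `horizontal_overlap` and `evidence_matrix` are assumptions chosen for illustration; the slides do not say which features the algorithm actually extracts. The second label and value rectangles are made up.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # assumed order: left, top, right, bottom


def horizontal_overlap(a: Rect, b: Rect) -> float:
    """Fraction of the narrower cell's width that the two cells share (0 to 1)."""
    shared = min(a[2], b[2]) - max(a[0], b[0])
    narrower = min(a[2] - a[0], b[2] - b[0])
    if shared <= 0 or narrower <= 0:
        return 0.0
    return shared / narrower


def evidence_matrix(value_cells: List[Rect], label_cells: List[Rect]) -> List[List[float]]:
    """Rows are value cells, columns are label cells, entries are confidences."""
    return [[horizontal_overlap(v, l) for l in label_cells] for v in value_cells]


labels = [(335, 114, 521, 172), (522, 114, 600, 172)]  # second rectangle is hypothetical
values = [(335, 173, 521, 231), (522, 173, 600, 231)]  # second rectangle is hypothetical
matrix = evidence_matrix(values, labels)
```

In this toy layout each value cell sits directly under one label, so the matrix is the identity; real microfilm tables would presumably give intermediate confidences like those in the matrix above.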

Apply Rules

• A set of correlation rules associates the values of the evidence matrices

• The algorithm iterates over the set of correlation rules

A Rule

$$\forall j:\quad \min[LV_{ji} \,\&\, LV_{jk}] = \min\{\, \min[LV_{ji} \,\&\, LV_{jk}] \cdot [VV_{ik} + 0.3],\ \max[LV_{ji} \,\&\, LV_{jk}] \,\}$$

Label - Value evidence matrix:

.75   .10
.20   .32

Value - Value evidence: .90
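One possible reading of this rule, sketched in Python: for every label j, the weaker of the two label-value confidences for value cells i and k is scaled by (VV_ik + 0.3) and capped at the stronger one. The function name `apply_correlation_rule`, the `boost` parameter, and the `lv[j][i]` indexing (label first, following the subscript order LV_ji, so the slide's matrix appears transposed) are assumptions for this example, as is the interpretation of the leading "j" as "for all j".

```python
from typing import List


def apply_correlation_rule(
    lv: List[List[float]],
    vv: List[List[float]],
    i: int,
    k: int,
    boost: float = 0.3,
) -> List[List[float]]:
    """Return a copy of lv with the rule applied for value cells i and k.

    lv[j][i] is the label-value confidence for label j and value i;
    vv[i][k] is the value-value confidence for value cells i and k.
    """
    updated = [row[:] for row in lv]
    for j, row in enumerate(lv):
        low, high = min(row[i], row[k]), max(row[i], row[k])
        # Scale the weaker confidence, but never past the stronger one.
        new_low = min(low * (vv[i][k] + boost), high)
        weaker = i if row[i] <= row[k] else k
        updated[j][weaker] = new_low
    return updated


# Label-value matrix from the slide, transposed so that rows are labels.
lv = [[0.75, 0.20], [0.10, 0.32]]
vv = [[1.0, 0.90], [0.90, 1.0]]
result = apply_correlation_rule(lv, vv, i=0, k=1)
```

Under this reading, a value-value confidence above 0.7 raises the weaker entry and one below 0.7 lowers it, while the stronger entry is left unchanged.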

[Figure: the rule illustrated on the Label - Value entries .75 and .32 and the Value - Value entry .90.]

Factoring

Observed in the table: [Name] per [Address] = 9 / 2 = 4.5

Expected from the Genealogical Ontology: [Name] per [Address] = 1 * 4.3 * 1.1 = 4.73


A Factoring Rule

• Compare the expected cardinality ratio, O, for a pair of label cells with the observed cardinality ratio, N_i/N_j.

$$FM_{ij} = FM_{ij} \cdot \left[\,1 - \left|O_{ij} - N_i/N_j\right| + C\,\right]$$

$$= FM_{ij} \cdot \left[\,1 - |4.73 - 4.5| + 0.5\,\right] = FM_{ij} \cdot 1.27$$
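The arithmetic of the factoring rule can be checked with a short Python sketch. The helper names `expected_ratio` and `factoring_multiplier` are hypothetical; the cardinalities 1, 4.3, and 1.1, the observed counts 9 and 2, and the constant C = 0.5 are the values shown on the slides.

```python
from typing import Iterable


def expected_ratio(path_cardinalities: Iterable[float]) -> float:
    """Multiply the average cardinalities along an ontology path."""
    product = 1.0
    for cardinality in path_cardinalities:
        product *= cardinality
    return product


def factoring_multiplier(expected: float, observed: float, c: float = 0.5) -> float:
    """Factor applied to FM_ij: 1 - |O_ij - N_i/N_j| + C."""
    return 1 - abs(expected - observed) + c


expected = expected_ratio([1, 4.3, 1.1])   # [Name] per [Address] from the ontology
observed = 9 / 2                           # 9 Name cells over 2 Address cells
multiplier = factoring_multiplier(expected, observed)
fm_ij = 0.5 * multiplier                   # 0.5 is an arbitrary starting confidence
```

With this form the multiplier exceeds 1 only when the gap between the expected and observed ratios is smaller than C, so a close match strengthens the factoring evidence and a large mismatch weakens it.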

Score Results

• Score the extracted record structure

• Present the result to a human user for verification

[Figure: Score Results. A database with Name and Family tables and a scale marked 0, 1, 2, 3.]

Store Results

• Create SQL Insert statements to store table cell coordinates

INSERT INTO Person (Name) VALUES ('335,114,521,172')

INSERT INTO Person (Name) VALUES ('335,173,521,231')
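Statements of this shape could be generated with a few lines of Python. This is only a sketch: the `insert_statement` helper is hypothetical, and it assumes the table and column names come from the ontology rather than from untrusted input (a production system would more likely use parameterized queries).

```python
from typing import Sequence


def insert_statement(table: str, column: str, rectangle: Sequence[int]) -> str:
    """Build an INSERT statement that stores a cell's coordinates as text."""
    coords = ",".join(str(v) for v in rectangle)
    return f"INSERT INTO {table} ({column}) VALUES ('{coords}')"


# Value cells recognized as belonging to the Name object set of Person.
name_cells = [(335, 114, 521, 172), (335, 173, 521, 231)]
statements = [insert_statement("Person", "Name", rect) for rect in name_cells]
```

Storing coordinates rather than text fits the input described earlier, where only printed text is available and the handwritten values are identified by their position on the image.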

Algorithm

Input: XML Input File (Preprocessed Microfilm Image); Genealogical Ontology

Method: Collect Evidence -> Apply Rules -> Store Results

Output: SQL Insert Statements

Measurements

• 5 to 7 concept tables

• Training set: 5 real-world tables

• Test set: 15 real-world tables

• Precision, recall, and accuracy of the cells written in the SQL statements.
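The three measures could be computed as in the Python sketch below, using their standard definitions. How the proposal counts a correct cell is not stated, so the representation here (sets of (column, rectangle) pairs, plus a `universe` of all candidate pairs needed for accuracy) and the `score` function are assumptions for illustration, with made-up sample data.

```python
from typing import Set, Tuple

Item = Tuple[str, str]  # assumed form: (column name, rectangle string)


def score(predicted: Set[Item], expected: Set[Item], universe: Set[Item]):
    """Return (precision, recall, accuracy); both sets must lie within universe."""
    tp = len(predicted & expected)             # written and correct
    fp = len(predicted - expected)             # written but wrong
    fn = len(expected - predicted)             # correct but not written
    tn = len(universe - predicted - expected)  # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(universe) if universe else 0.0
    return precision, recall, accuracy


universe = {
    ("Name", "335,114,521,172"),
    ("Name", "335,173,521,231"),
    ("Age", "335,114,521,172"),
    ("Age", "335,173,521,231"),
}
predicted = {("Name", "335,114,521,172"), ("Age", "335,173,521,231")}
expected = {("Name", "335,114,521,172"), ("Name", "335,173,521,231")}
precision, recall, accuracy = score(predicted, expected, universe)
```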

Contributions

• Exploits the constraints of both a genealogical ontology and geometry

• Combines extracted features using correlation rules

Delimitations

• Tables of rows and columns

• Genealogical domain

• English language documents

• Tables that do not span multiple documents

Artifacts

• Application/demo in the Java programming language.
