(Semi)automatic Extraction of Genealogical Information from Scanned & OCRed Historical Documents

Elder David W. Embley



Overview

• Big Picture
  • Diagram
  • Details & Demo

• Current Status and Expectations


Fe6: 1. Prepare 2. Extract 3. Merge & Split 4. Check & Correct 5. Generate 6. Convert

FROntIER

ListReader

OntoSoar

GreenFIE


1. Prepare



2. Extract


3. Merge & Split

Person

Couple

ParentsWithChildren
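The three record types named on this slide can be pictured as simple data structures. The following is only an illustrative sketch: the field names (`name`, `spouse1`, `spouse2`, `parents`, `children`) and the helper `people_in` are hypothetical and are not taken from the Fe6 tools themselves.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    # Hypothetical field: the slides do not list Person attributes.
    name: str


@dataclass
class Couple:
    # Hypothetical fields: two Person records linked as a couple.
    spouse1: Person
    spouse2: Person


@dataclass
class ParentsWithChildren:
    # Hypothetical fields: a couple plus the children recorded with them.
    parents: Couple
    children: List[Person] = field(default_factory=list)


def people_in(family: ParentsWithChildren) -> List[Person]:
    """Return every Person mentioned in a ParentsWithChildren record."""
    return [family.parents.spouse1, family.parents.spouse2, *family.children]


# Placeholder names only; no real genealogical data is implied.
family = ParentsWithChildren(
    parents=Couple(Person("Parent A"), Person("Parent B")),
    children=[Person("Child C")],
)
print([p.name for p in people_in(family)])
```

Keeping the three types separate mirrors how the results later in the deck are reported per record type.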


4. Check & Correct


5. Generate


6. Convert


Highlighted Results


Fe6: 1. Prepare 2. Extract 3. Merge & Split 4. Check & Correct 5. Generate 6. Convert

FROntIER

ListReader

OntoSoar

GreenFIE

COMET


Precision, Recall, F-Measure Results

                         Precision   Recall   F-Measure
FROntIER
  Person                 0.86        0.66     0.75
  Couple                 1.00        0.40     0.57
  ParentsWithChildren    0.89        0.89     0.89
GreenFIE
  Person                 0.94        0.83     0.88
  Couple                 1.00        0.90     0.95
  ParentsWithChildren    1.00        0.78     0.86
OntoSoar
  Person                 0.67        0.67     0.67
  Couple                 0.75        0.30     0.43
  ParentsWithChildren    1.00        0.44     0.62
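For readers unfamiliar with the third column, the F-measure is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard (balanced, F1) definition, which the slides do not state explicitly; the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was extracted correctly
    return 2 * precision * recall / (precision + recall)


# Recompute the F-measure for the FROntIER Couple row (P = 1.00, R = 0.40).
print(round(f_measure(1.00, 0.40), 2))
```

The reported figures were presumably computed from unrounded counts, so recomputing from the two-decimal precision and recall values can differ in the last digit for some rows.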


Fe6: 1. Prepare 2. Extract 3. Merge & Split 4. Check & Correct 5. Generate 6. Convert

FROntIER

ListReader

OntoSoar

GreenFIE

Feedback Loop

Automated Check (Fix & Warn)

“Sanity” Check

Name, Date, Place Standardization

Administrative and Batch-Processing Management System

COMET


Fe6: 1. Prepare 2. Extract 3. Merge & Split 4. Check & Correct 5. Generate 6. Convert

FROntIER

ListReader

OntoSoar

GreenFIE

Feedback Loop

Automated Check (Fix & Warn)

“Sanity” Check

Name, Date, Place Standardization

Administrative and Batch-Processing Management System

Bootstrapping, Ever-learning, Feedback Loop

Extraction Tools:
  • Layout
  • Machine Learning

Non-English Languages

COMET


Summary

• (Semi)automatic Extraction

• Green, Ever-Learning System (improves with use)

• Status:
  • Extraction Tools (tech-transfer of academic prototypes)
  • Thin-Line Ensemble Prototype (being thickened)


BYU Data Extraction Research Group
www.deg.byu.edu