EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION Kurt Maly Steven Zeil Mohammad Zubair WWW/Internet 2007 Vila Real, Portugal October 5-8, 2007


EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION

Kurt Maly
Steven Zeil
Mohammad Zubair

WWW/Internet 2007
Vila Real, Portugal
October 5-8, 2007


OUTLINE

1. Background: Robust automatic extraction of metadata from heterogeneous collections

2. Validation of extracted metadata

3. Post-hoc classification of document layouts

4. Conclusions


1. Background

• Diverse, growing government document collections
• Amount of metadata available varies considerably
• Automated system to extract metadata from new documents
  – Classify documents by layout similarity
  – Template defines extraction process for a layout class


Process Overview

[Figure: process flow diagram. Stages: OCR, Layout Classification, Extract Metadata, Validator, Human Correction, Enter into database. Data labels: document (PDF), document (XML), layout templates, selected template, metadata, untrusted metadata, trusted metadata, corrected metadata.]


Sample Metadata Record (including mistakes)

<?xml version="1.0"?>
<metadata>
  <UnclassifiedTitle>Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby>Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby>Robert F. Baumann, Ph.D.</acceptedby>
</metadata>


Issue: Layout Classification

• Key to keeping extraction templates simple

• Previously explored a variety of techniques based upon geometric position of text and graphics
  – e.g., MX-Y trees, learning machines(??)

• Generally unsatisfactory in either accuracy or in compatibility with template approach


Issue: Robustness

• Sources of errors
  – OCR software failures
  – Poor document quality
  – Classification errors
  – Template errors
  – Extraction engine faults

• Need to detect dubious outputs
  – refer to human for inspection & correction


2. Validation

Exploit statistical and heuristic approaches to evaluate quality of extracted metadata

• Reference Models

• Validation Process
  – tests
  – specifications


Reference Models

• From previously extracted metadata
  – specific to document collection

• Phrase dictionaries constructed for fields with specialized vocabularies
  – e.g., author, organization

• Statistics collected
  – mean and standard deviation
  – permits detection of outputs that are significantly different from collection norms


Statistics collected

• Field length statistics
  – title, abstract, author, ...

• Phrase recurrence rates for fields with specialized vocabularies
  – author and organization

• Dictionary detection rates for words in natural language fields
  – abstract, title, ...


Field               Avg.    Std. Dev.
UnclassifiedTitle   9.9     4.8
Abstract            114     58
PersonalAuthor      2.8     0.5
CorporateAuthor     7       2.3

Field Length (in words), DTIC collection


Field               Avg.    Std. Dev.
UnclassifiedTitle   88%     13%
Abstract            94%     5%

Dictionary Detection (% of recognized words), DTIC collection


Field             Phrase length   Avg.    Std. Dev.
PersonalAuthor    1               97%     11%
PersonalAuthor    2               83%     32%
PersonalAuthor    3               71%     45%
CorporateAuthor   1               100%    2%
CorporateAuthor   2               99%     6%
CorporateAuthor   3               99%     10%
CorporateAuthor   4               98%     13%

Phrase Dictionary Hit Percentage, DTIC collection


Validation Process

• Extracted outputs for fields are subjected to a variety of tests
  – Test results are normalized to obtain confidence value in range 0.0-1.0

• Test results for same field are combined to form field confidence

• Field confidences are combined to form overall confidence


Validation Tests

• Deterministic
  – Regular patterns such as date, report numbers

• Probabilistic
  – Length: if value of metadata is close to average -> high score
  – Vocabulary: recurrence rate according to field's phrase dictionary
  – Dictionary: detection rate of words in English dictionary
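
The deterministic and probabilistic tests above might be sketched as follows. The slides do not give the normalization formula, so the linear falloff over three standard deviations, the function names, and the date pattern are all assumptions for illustration, and the scores will not reproduce the validator's own numbers. The mean and standard deviation are the PersonalAuthor figures from the DTIC field length table.

```python
import re

# Field-length statistics (in words) for PersonalAuthor, from the DTIC table.
PERSONAL_AUTHOR_MEAN = 2.8
PERSONAL_AUTHOR_STD = 0.5


def length_confidence(value, mean, std, cutoff=3.0):
    """Return 1.0 at the collection mean, falling linearly to 0.0 at
    `cutoff` standard deviations (an assumed normalization)."""
    n_words = len(value.split())
    z = abs(n_words - mean) / std
    return max(0.0, 1.0 - z / cutoff)


# Hypothetical pattern: accepts dates such as "18 June 2004" or "June 2004".
DATE_PATTERN = re.compile(
    r"^(?:\d{1,2}\s+)?"
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{4}$"
)


def date_confidence(value):
    """Deterministic test: 1.0 if the value matches the pattern, else 0.0."""
    return 1.0 if DATE_PATTERN.match(value.strip()) else 0.0


clean = length_confidence("Matthew H. Fath",
                          PERSONAL_AUTHOR_MEAN, PERSONAL_AUTHOR_STD)
noisy = length_confidence("Name of Candidate: Major Matthew H. Fath",
                          PERSONAL_AUTHOR_MEAN, PERSONAL_AUTHOR_STD)
bad_date = date_confidence("Accepted this 18th day of June 2004 by:")
```

Under these assumptions the three-word author name scores high, the seven-word string with the leftover label and rank scores 0.0, and the mis-extracted date line fails the pattern test.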


Combining results

• Validation specification describes
  – which tests to apply to which fields
  – how to combine field tests into field confidence
  – how to combine field confidences into overall confidence


Validation Specification for DTIC Collection

<?xml version="1.0"?>
<val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs…">
  <val:average>
    <val:field name="UnclassifiedTitle">
      <val:average>
        <val:dictionary/>
        <val:length/>
      </val:average>
    </val:field>
    <val:field name="PersonalAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
        </val:max>
      </val:min>
    </val:field>


Validation Specification - continued

    <val:field name="CorporateAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
          <val:phrases length="4"/>
        </val:max>
      </val:min>
    </val:field>
    <val:field name="ReportDate">
      <val:dateFormat/>
    </val:field>
  </val:average>
</val:validate>
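
One way to read the nested combinators in this specification is as a small expression tree over per-test scores. The sketch below assumes that `val:average`, `val:min`, and `val:max` mean the arithmetic mean, minimum, and maximum of their children. The slides do not define these semantics, and the tuple encoding, the `combine` function, and the test scores are hypothetical.

```python
from statistics import mean


def combine(node, scores):
    """Recursively evaluate a nested (operator, children) tree.

    A leaf is the name of a test; its value is looked up in `scores`.
    """
    if isinstance(node, str):
        return scores[node]
    op, children = node
    values = [combine(child, scores) for child in children]
    if op == "average":
        return mean(values)
    if op == "min":
        return min(values)
    if op == "max":
        return max(values)
    raise ValueError(f"unknown combinator: {op}")


# Mirrors the PersonalAuthor field: min(length, max(phrases 1..3)).
personal_author = (
    "min",
    ["length", ("max", ["phrases1", "phrases2", "phrases3"])],
)

# Made-up test scores, for illustration only.
scores = {"length": 0.9, "phrases1": 0.4, "phrases2": 0.7, "phrases3": 0.2}
field_confidence = combine(personal_author, scores)
```

With these made-up scores the best phrase test (0.7) is lower than the length test (0.9), so the field confidence is 0.7. Taking the minimum means a field only scores well if both its length and its vocabulary look normal.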


<?xml version="1.0"?>
<metadata confidence="0.460"
          warning="ReportDate field does not match required pattern">
  <UnclassifiedTitle confidence="0.979">Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor confidence="0.4"
                  warning="PersonalAuthor: unusual number of words">Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate confidence="0.0"
              warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby warning="unvalidated">Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby warning="unvalidated">Robert F. Baumann, Ph.D.</acceptedby>
</metadata>

Sample Output from the Validator
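
As a quick arithmetic check, the overall confidence of 0.460 in this output equals the plain mean of the three field confidences shown. This is consistent with the top-level `val:average` in the DTIC specification, but treating the absent CorporateAuthor field and the unvalidated fields as excluded from the mean is an inference, not something the slides state.

```python
# Field confidences copied from the sample validator output.
field_confidences = {
    "UnclassifiedTitle": 0.979,
    "PersonalAuthor": 0.4,
    "ReportDate": 0.0,
}

overall = sum(field_confidences.values()) / len(field_confidences)
print(f"{overall:.3f}")  # prints 0.460
```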


3. Classification

• Post hoc classification

• Experimental Results


Post hoc Classification

• Previously attempted a priori classification
  – choose one layout based on geometry of page
  – apply template for that chosen layout

• Alternative: exploit validator for post hoc selection of layout
  – Apply all templates to given document
  – Score each output using validator
  – Select template which scored highest
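
The three post hoc steps above can be sketched as a simple argmax loop. The function name `classify_post_hoc`, the toy templates, and the toy validator below are hypothetical stand-ins. The real system applies layout templates to OCR output and scores them with the validation specification.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]


def classify_post_hoc(
    document: str,
    templates: Dict[str, Callable[[str], Metadata]],
    validate: Callable[[Metadata], float],
) -> Tuple[Optional[str], Optional[Metadata], float]:
    """Apply every template, score each output, and keep the best one."""
    best_name, best_metadata, best_score = None, None, float("-inf")
    for name, extract in templates.items():
        metadata = extract(document)
        score = validate(metadata)
        if score > best_score:
            best_name, best_metadata, best_score = name, metadata, score
    return best_name, best_metadata, best_score


# Toy stand-ins: each "template" guesses where the author line is.
toy_templates = {
    "first_line_author": lambda doc: {"PersonalAuthor": doc.splitlines()[0]},
    "last_line_author": lambda doc: {"PersonalAuthor": doc.splitlines()[-1]},
}


def toy_validate(metadata: Metadata) -> float:
    """Toy validator: author values of at most three words look plausible."""
    return 1.0 if len(metadata["PersonalAuthor"].split()) <= 3 else 0.2


doc = "Matthew H. Fath\nIntrepidity, Iron Will, and Intellect"
layout, metadata, score = classify_post_hoc(doc, toy_templates, toy_validate)
```

Here the template that reads the author from the first line wins, because the other template's output has too many words for an author field. The cost of this design is that every template runs on every document, in exchange for not needing a separate geometric classifier.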


Experimental Design

• How effective is post-hoc classification?

• Selected several hundred documents recently added to DTIC collection
  – Visually classified by humans
    • comparing to 4 most common layouts from studies of earlier documents
    • discarded documents not in one of those classes
    • 167 documents remained

• Applied all templates, validated extracted metadata, selected highest confidence as the validator's choice

• Compared validator's preferred layout to human choices


Automatic vs. Human Classifications

• Post-hoc classifier agreed with human on 74% of cases

Manually          Validator   Validator   Validator   Validator   Total
Assigned Class    au          eagle       rand        title       Manual
au                86          0           0           0           86
eagle             0           8           33          4           45
rand              0           0           8           4           12
title             0           0           1           23          24
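
The agreement rate can be recomputed from the table: the diagonal cells are the documents where the validator's choice matched the human label. The short script below is only an arithmetic check on the figures above. The variable names are illustrative.

```python
labels = ["au", "eagle", "rand", "title"]

# Rows: manually assigned class. Columns: validator's choice, in `labels` order.
confusion = {
    "au": [86, 0, 0, 0],
    "eagle": [0, 8, 33, 4],
    "rand": [0, 0, 8, 4],
    "title": [0, 0, 1, 23],
}

agree = sum(confusion[label][i] for i, label in enumerate(labels))
total = sum(sum(row) for row in confusion.values())
agreement = agree / total
```

This gives 125 agreements out of 167 documents, about 74.9%, in line with the 74% reported. Most of the disagreement comes from one cell: 33 documents manually labeled eagle that the validator assigned to rand.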


Post hoc Classification

• Problem:
  – WYSIWYG extraction often results in extra words in extracted data
    • e.g., in author field ("Name of Candidate", "Major")
  – Not desired in final output
    • post-processing to remove these is anticipated but not yet implemented
  – Extra words artificially reduce validator scores
    • they are not part of the phrase dictionary

• Solution:
  – Post-processing must be done prior to validation


Re-interpreting the experiment

• Subjected author metadata to simulated post-processing
  – scripts to remove
    • known extraneous phrases specific to the document layouts
    • military ranks and other honorifics

• Agreement between post-hoc classifier and human classification rose to 99%
  – far exceeds our best a priori classifiers to date
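
A minimal sketch of the kind of cleanup script described above, assuming simple phrase and rank lists. The lists, the function name `clean_author`, and the regex approach are illustrative assumptions. The authors' actual scripts are not shown in the slides.

```python
import re

# Label seen in the sample record; a real list would be layout-specific.
EXTRANEOUS_PHRASES = ["Name of Candidate:"]

# Illustrative, incomplete list of ranks and honorifics.
RANKS = ["Major", "Lieutenant Colonel", "Colonel", "Captain"]


def clean_author(value):
    """Strip known labels and ranks from an extracted author field."""
    for phrase in EXTRANEOUS_PHRASES:
        value = re.sub(re.escape(phrase), " ", value, flags=re.IGNORECASE)
    # Longest first, so "Lieutenant Colonel" goes before "Colonel".
    for rank in sorted(RANKS, key=len, reverse=True):
        value = re.sub(r"\b" + re.escape(rank) + r"\b", " ", value)
    # Collapse the whitespace left behind by the removals.
    return " ".join(value.split())


cleaned = clean_author("Name of Candidate: Major Matthew H. Fath")
```

For the sample record this leaves "Matthew H. Fath", three words, which is close to the PersonalAuthor length average of 2.8 words. A plain word match like this would also strip a genuine surname such as "Major", so a production script would need more care.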


Conclusions

• Creating a statistical model of existing metadata is a very useful tool for validating metadata extracted from new documents

• Validation can be used to classify documents and select the right template for the automated extraction process