EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION Kurt Maly Steven Zeil Mohammad Zubair WWW/Internet 2007 Vila

EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION

Kurt Maly

Steven Zeil

Mohammad Zubair

WWW/Internet 2007, Vila Real, Portugal, October 5-8, 2007

OUTLINE

1. Background: Robust automatic extraction of metadata from heterogeneous collections

2. Validation of extracted metadata

3. Post-hoc classification of document layouts

4. Conclusions

1. Background

• Diverse, growing government document collections
• Amount of metadata available varies considerably
• Automated system to extract metadata from new documents
  – Classify documents by layout similarity
  – Template defines extraction process for a layout class

Process Overview

[Figure: process flow diagram. Stages: OCR, Layout Classification, Extract Metadata, Validator, Human Correction, Enter into database. Data labels: document (PDF), document (XML), layout templates, selected template, metadata, untrusted metadata, trusted metadata, corrected metadata.]

Sample Metadata Record (including mistakes)

<?xml version="1.0"?>
<metadata>
  <UnclassifiedTitle>Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby>Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby>Robert F. Baumann, Ph.D.</acceptedby>
</metadata>

Issue: Layout Classification

• Key to keeping extraction templates simple

• Previously explored a variety of techniques based upon geometric position of text and graphics
  – e.g., MX-Y trees, learning machines

• Generally unsatisfactory in either accuracy or in compatibility with template approach

Issue: Robustness

• Sources of errors
  – OCR software failures
  – Poor document quality
  – Classification errors
  – Template errors
  – Extraction engine faults
• Need to detect dubious outputs
  – refer to human for inspection & correction

2. Validation

Exploit statistical and heuristic approaches to evaluate quality of extracted metadata

• Reference Models

• Validation Process
  – tests
  – specifications

Reference Models

• From previously extracted metadata
  – specific to document collection
• Phrase dictionaries constructed for fields with specialized vocabularies
  – e.g., author, organization
• Statistics collected
  – mean and standard deviation
  – permits detection of outputs that are significantly different from collection norms
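As a rough illustration of how such a reference model could be built and used, the sketch below computes the mean and standard deviation of field lengths (in words) from previously extracted values and flags a new value that lies far from the collection norm. The function names, the use of the population standard deviation, and the two-standard-deviation threshold are assumptions made for this example; the slides do not specify them.

```python
from statistics import mean, pstdev


def field_length_stats(values):
    """Return (mean, std dev) of the word counts of previously extracted values."""
    lengths = [len(v.split()) for v in values]
    return mean(lengths), pstdev(lengths)


def is_outlier(value, avg, std, max_z=2.0):
    """Flag a value whose word count is more than max_z std devs from the mean.

    max_z is a hypothetical threshold chosen for illustration only.
    """
    return abs(len(value.split()) - avg) > max_z * std


# Toy "collection" of earlier author values (made up for the example).
avg, std = field_length_stats(["Jane Q. Doe", "John Smith", "A. B. Jones"])

# With the DTIC PersonalAuthor statistics from this deck (2.8 words, std dev 0.5),
# the seven-word sample author value is far from the norm.
flagged = is_outlier("Name of Candidate: Major Matthew H. Fath", 2.8, 0.5)
```

Here `flagged` is true because the sample value has seven words, while a three-word name such as "Matthew H. Fath" stays within the threshold.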

Statistics collected

• Field length statistics
  – title, abstract, author, ...
• Phrase recurrence rates for fields with specialized vocabularies
  – author and organization
• Dictionary detection rates for words in natural language fields
  – abstract, title, ...

Field              Avg.   Std. Dev.
UnclassifiedTitle  9.9    4.8
Abstract           114    58
PersonalAuthor     2.8    0.5
CorporateAuthor    7      2.3

Field Length (in words), DTIC collection

Field              Avg.   Std. Dev.
UnclassifiedTitle  88%    13%
Abstract           94%    5%

Dictionary Detection (% of recognized words), DTIC collection

Field            Phrase length  Avg.   Std. Dev.
PersonalAuthor   1              97%    11%
PersonalAuthor   2              83%    32%
PersonalAuthor   3              71%    45%
CorporateAuthor  1              100%   2%
CorporateAuthor  2              99%    6%
CorporateAuthor  3              99%    10%
CorporateAuthor  4              98%    13%

Phrase Dictionary Hit Percentage, DTIC collection

Validation Process

• Extracted outputs for fields are subjected to a variety of tests
  – Test results are normalized to obtain confidence value in range 0.0-1.0
• Test results for same field are combined to form field confidence
• Field confidences are combined to form overall confidence

Validation Tests

• Deterministic
  – Regular patterns such as dates, report numbers
• Probabilistic
  – Length: if length of the metadata value is close to the collection average -> high score
  – Vocabulary: recurrence rate according to field's phrase dictionary
  – Dictionary: detection rate of words in English dictionary
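The three kinds of tests above might look roughly like the following sketch, with each test returning a confidence in the range 0.0-1.0. The date pattern, the linear falloff over three standard deviations, and the tiny word list are all assumptions made for illustration; the slides do not give the actual rules, and this scoring is not expected to reproduce the confidences shown in the sample validator output.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed date pattern, e.g. "June 2004" or "18 June 2004" (hypothetical rule).
DATE_RE = re.compile(rf"^(\d{{1,2}} )?({MONTHS}) \d{{4}}$")


def date_format_score(value: str) -> float:
    """Deterministic test: 1.0 if the value matches the pattern, else 0.0."""
    return 1.0 if DATE_RE.match(value.strip()) else 0.0


def length_score(value: str, avg: float, std: float) -> float:
    """Probabilistic test: high when the word count is near the collection mean.

    The linear falloff to 0.0 at three standard deviations is an assumption.
    """
    z = abs(len(value.split()) - avg) / std
    return max(0.0, 1.0 - z / 3.0)


def dictionary_score(value: str, dictionary: set) -> float:
    """Probabilistic test: fraction of words found in an English word list."""
    words = [w.strip(".,:;").lower() for w in value.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


bad_date = date_format_score("Accepted this 18th day of June 2004 by:")
good_date = date_format_score("18 June 2004")
author_len = length_score("Matthew H. Fath", 2.8, 0.5)
title_dict = dictionary_score("Iron Will, and Intellect", {"iron", "will", "and"})
```

Under these assumptions the sample ReportDate value fails the pattern test, which matches the warning reported for that field in the validator output.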

Combining results

• Validation specification describes
  – which tests to apply to which fields
  – how to combine field tests into field confidence
  – how to combine field confidences into overall confidence

Validation Specification for DTIC Collection

<?xml version="1.0"?>
<val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs…">
  <val:average>
    <val:field name="UnclassifiedTitle">
      <val:average>
        <val:dictionary/>
        <val:length/>
      </val:average>
    </val:field>
    <val:field name="PersonalAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
        </val:max>
      </val:min>
    </val:field>

Validation Specification - continued

    <val:field name="CorporateAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
          <val:phrases length="4"/>
        </val:max>
      </val:min>
    </val:field>
    <val:field name="ReportDate">
      <val:dateFormat/>
    </val:field>
  </val:average>
</val:validate>

<?xml version="1.0"?>
<metadata confidence="0.460"
          warning="ReportDate field does not match required pattern">
  <UnclassifiedTitle confidence="0.979">Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor confidence="0.4"
                  warning="PersonalAuthor: unusual number of words">Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate confidence="0.0"
              warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby warning="unvalidated">Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby warning="unvalidated">Robert F. Baumann, Ph.D.</acceptedby>
</metadata>

Sample Output from the Validator
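The nesting of val:average, val:min, and val:max in the DTIC specification can be read as a small expression tree over test scores. The sketch below is one possible interpretation, not the validator's actual implementation: the tuple representation, the `evaluate` helper, and the individual test scores are hypothetical. The overall confidence is computed from the three field confidences in the sample output, under the assumption that fields absent from the record (here CorporateAuthor) are left out of the average.

```python
from statistics import mean

# Combinators named after the spec elements val:average, val:min, val:max.
COMBINE = {"average": mean, "min": min, "max": max}


def evaluate(node, scores):
    """Evaluate a rule: a test name (str) or a tuple (combinator, [children])."""
    if isinstance(node, str):
        return scores[node]
    op, children = node
    return COMBINE[op]([evaluate(child, scores) for child in children])


# PersonalAuthor rule from the specification: min(length, max(phrases 1..3)).
personal_author_rule = ("min", ["length",
                                ("max", ["phrases1", "phrases2", "phrases3"])])

# Hypothetical normalized test scores, for illustration only.
test_scores = {"length": 0.4, "phrases1": 0.9, "phrases2": 0.6, "phrases3": 0.2}
field_confidence = evaluate(personal_author_rule, test_scores)

# Field confidences taken from the sample validator output.
field_confidences = {"UnclassifiedTitle": 0.979,
                     "PersonalAuthor": 0.4,
                     "ReportDate": 0.0}
overall = round(mean(field_confidences.values()), 3)
```

Averaging the three reported field confidences gives 0.46, which agrees with the overall confidence of 0.460 in the sample output. Using min for the author fields means a single failing test is enough to pull the field confidence down.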

3. Classification

• Post hoc classification

• Experimental Results

Post hoc Classification

• Previously attempted a priori classification
  – choose one layout based on geometry of page
  – apply template for that chosen layout
• Alternative: exploit validator for post hoc selection of layout
  – Apply all templates to given document
  – Score each output using validator
  – Select template which scored highest
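The post hoc selection steps above can be sketched as a simple loop over templates. This is a minimal illustration under stated assumptions: templates are modeled as plain functions from a document to a metadata dictionary, the validator as a function returning an overall confidence, and the toy templates and validator below are invented for the example.

```python
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]


def classify_post_hoc(
    document: str,
    templates: Dict[str, Callable[[str], Metadata]],
    validate: Callable[[Metadata], float],
) -> Tuple[str, Metadata, float]:
    """Apply every template, score each output, return the best-scoring one."""
    if not templates:
        raise ValueError("no templates supplied")
    best = None
    for name, extract in templates.items():
        metadata = extract(document)
        confidence = validate(metadata)
        if best is None or confidence > best[2]:
            best = (name, metadata, confidence)
    return best


# Toy stand-ins: each "template" takes the author from a different line.
toy_templates = {
    "author_first": lambda d: {"PersonalAuthor": d.splitlines()[0]},
    "title_first": lambda d: {"PersonalAuthor": d.splitlines()[1]},
}


def toy_validate(metadata: Metadata) -> float:
    """Hypothetical validator: plausible author values have 2 to 4 words."""
    words = len(metadata["PersonalAuthor"].split())
    return 1.0 if 2 <= words <= 4 else 0.0


doc = "Intrepidity, Iron Will, and Intellect\nMatthew H. Fath"
layout, metadata, confidence = classify_post_hoc(doc, toy_templates, toy_validate)
```

The cost of this approach is one extraction and one validation per template, in exchange for not needing a separate geometric classifier.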

Experimental Design

• How effective is post-hoc classification?
• Selected several hundred documents recently added to DTIC collection
  – Visually classified by humans
    • comparing to 4 most common layouts from studies of earlier documents
    • discarded documents not in one of those classes
    • 167 documents remained
• Applied all templates, validated extracted metadata, selected highest confidence as the validator's choice
• Compared validator's preferred layout to human choices

Automatic vs. Human Classifications

• Post-hoc classifier agreed with human on 74% of cases

Rows: manually assigned class. Columns: layout chosen by the validator.

Manual class  Validator: au  Validator: eagle  Validator: rand  Validator: title  Total manual
au            86             0                 0                0                 86
eagle         0              8                 33               4                 45
rand          0              0                 8                4                 12
title         0              0                 1                23                24
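The agreement figure can be recomputed from the table as the diagonal of the confusion matrix divided by the total number of documents. The short sketch below does that arithmetic using the counts from the table; the variable names are illustrative.

```python
classes = ["au", "eagle", "rand", "title"]

# Rows: manually assigned class. Columns: validator's choice, in the same order.
confusion = {
    "au":    [86, 0, 0, 0],
    "eagle": [0, 8, 33, 4],
    "rand":  [0, 0, 8, 4],
    "title": [0, 0, 1, 23],
}

# Documents where the validator's choice matches the human label.
agreed = sum(confusion[c][i] for i, c in enumerate(classes))
total = sum(sum(row) for row in confusion.values())
agreement = agreed / total

# Per-class agreement shows where the disagreements are concentrated.
per_class = {c: confusion[c][i] / sum(confusion[c])
             for i, c in enumerate(classes)}
```

This gives 125 agreements out of 167 documents, a little under 75%, in line with the roughly 74% reported. Most disagreements come from documents manually labeled eagle for which the validator preferred rand (33 of 45).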

Post hoc Classification

• Problem:
  – WYSIWYG extraction often results in extra words in extracted data
    • E.g., in author field ("Name of Candidate", "Major")
  – Not desired in final output
    • post-processing to remove these is anticipated but not yet implemented
  – Artificially reduces validator scores
    • extra words are not part of phrase dictionary
• Solution:
  – Post-processing must be done prior to validation

Re-interpreting the experiment

• Subjected author metadata to simulated post-processing
  – scripts to remove
    • known extraneous phrases specific to the document layouts
    • military ranks and other honorifics
• Agreement between post-hoc classifier and human classification rose to 99%
  – far exceeds our best a priori classifiers to date
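A post-processing script of this kind might look like the sketch below, which strips layout-specific phrases and honorifics from an author value before it is validated. The phrase and honorific lists are hypothetical examples drawn from the sample record in this deck; the actual scripts used in the experiment are not shown in the slides.

```python
import re

# Hypothetical lists for illustration; real lists would be per layout.
EXTRANEOUS_PHRASES = ["Name of Candidate:"]
HONORIFICS = ["Lieutenant Colonel", "Major", "Mr."]


def clean_author(value: str) -> str:
    """Remove known extraneous phrases and honorifics, then tidy whitespace."""
    for phrase in EXTRANEOUS_PHRASES + HONORIFICS:
        # Lookarounds keep "Major" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        value = re.sub(pattern, " ", value, flags=re.IGNORECASE)
    return " ".join(value.split())


cleaned = clean_author("Name of Candidate: Major Matthew H. Fath")
```

For the sample record, `cleaned` is "Matthew H. Fath", a three-word value close to the collection average length for PersonalAuthor.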

Conclusions

• A statistical model of existing metadata is a very useful tool for validating metadata extracted from new documents

• Validation can be used to classify documents and select the right template for the automated extraction process
