Digital Access of Handwritten Documents


  • Digital Access of Handwritten Documents

    Venu Govindaraju, Anurag Bhardwaj, Huaigu Cao

    Venu@cubs.buffalo.edu

  • Outline

    Recognition: Postal Application, Paradigms

    Search: OCR accuracy (Fusion, Lexicon Reduction, Statistical Topic Models); Document Search (Word Spotting)

  • Challenge of Handwriting

  • Motivation

    Vast, irreplaceable, culturally vital legacy collections of historical documents are competing ineffectively for attention with billions of digital documents.

    Historical archives are thus threatened with neglect, perceived irrelevance, and, eventually, oblivion.

    Threat: "If it's not in Google, it doesn't exist!"

    [Baird 2003]

  • Postal Context (138 million records)

    ZIP Code: 30% of ZIP Codes contain a single street name, 5% contain a single primary number, and 2% contain a single add-on

    Maximum number of records returned is 3,071

    LDR accuracy (%) by lexicon size:

    Lex     Top 1   Top 2
    10      96.5    98.7
    100     89.2    94.1
    1000    75.3    86.3

  • Paradigms

    Lexicon Driven Recognition (LDR): context provides a ranked lexicon to a lexicon-driven OCR

    Lexicon Free Recognition (LFR): lexicon-free OCR via segmentation, recognition, and post-processing

  • Lexicon Free (LFR)

    [Figure: segmentation graph; each edge carries character hypotheses with confidences, e.g., i[.8], l[.8], u[.5], v[.2] over segments 1 to 3, and w[.7] over segments 1 to 4]

    - Image from segment 1 to 3 is a "u" with 0.5 confidence
    - Image from segment 1 to 4 is a "w" with 0.7 confidence
    - Image from segment 1 to 5 is a "w" with 0.6 confidence and an "m" with 0.3 confidence

    Find the best path in the graph from segment 1 to 8
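    The best path can be found in a single forward pass over the segmentation graph. Below is a minimal Python sketch, assuming the recognizer has already emitted character hypotheses as (start, end, char, confidence) tuples and that a path's confidence is the product of its edge confidences; the function name and the toy hypothesis list are illustrative, not from the original system.

```python
# Minimal sketch of lexicon-free best-path decoding over a
# segmentation graph; hypotheses are (start, end, char, conf) tuples.
from collections import defaultdict

def best_path(hypotheses, first, last):
    """Return (confidence, text) of the best path from `first` to `last`,
    scoring a path by the product of its edge confidences."""
    best = {first: (1.0, "")}          # best (score, decoded text) per node
    by_start = defaultdict(list)
    for start, end, char, conf in hypotheses:
        by_start[start].append((end, char, conf))
    for i in sorted(by_start):         # edges only go forward, so one pass
        if i not in best:
            continue                   # node i is unreachable
        score, text = best[i]
        for end, char, conf in by_start[i]:
            cand = (score * conf, text + char)
            if end not in best or cand[0] > best[end][0]:
                best[end] = cand
    return best.get(last)

# Toy hypotheses echoing the slide: "w" over segments 1-4 with 0.7, etc.
hyps = [(1, 3, "u", 0.5), (1, 4, "w", 0.7), (1, 5, "w", 0.6),
        (1, 5, "m", 0.3), (4, 6, "o", 0.5), (5, 6, "o", 0.5),
        (6, 7, "r", 0.4), (7, 8, "d", 0.8)]
print(best_path(hyps, 1, 8))  # approximately (0.112, 'word')
```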

  • Lexicon Driven (LDR)

    [Figure: match graph for lexicon entry "word" over segments 1 to 9; each edge carries a character-to-span distance, e.g., w[5.0], o[7.6], r[6.3], d[4.9]]

    Find the best way of accounting for the characters w, o, r, d by consuming all segments 1 to 8

    Distance between the first character "w" of lexicon entry "word" and the image between:
    - segments 1 and 4 is 5.0
    - segments 1 and 3 is 7.2
    - segments 1 and 2 is 7.6

    [Kim & Govindaraju, TPAMI 1997]
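    A hedged sketch of the matching idea: dynamic programming that assigns consecutive segment spans to the characters of a lexicon entry while minimizing total distance. The `dist(char, i, j)` callback and the `max_span` cap are assumptions for illustration, not the exact formulation of [Kim & Govindaraju, TPAMI 1997].

```python
# Sketch of lexicon-driven matching: consume all segments with the
# characters of one lexicon entry, minimizing summed distance.
import math

def match_word(word, n_points, dist, max_span=4):
    """dist(ch, i, j): distance between character `ch` and the image
    between segmentation points i and j (lower is better)."""
    INF = math.inf
    # cost[k][i]: best cost of matching the first k chars up to point i
    cost = [[INF] * (n_points + 1) for _ in range(len(word) + 1)]
    cost[0][1] = 0.0                   # start at segmentation point 1
    for k, ch in enumerate(word):
        for i in range(1, n_points + 1):
            if cost[k][i] == INF:
                continue
            for j in range(i + 1, min(i + max_span, n_points) + 1):
                c = cost[k][i] + dist(ch, i, j)
                cost[k + 1][j] = min(cost[k + 1][j], c)
    return cost[len(word)][n_points]   # e.g. points 1..8 for "word"
```

    Running this for every lexicon entry and sorting by cost gives the ranked output that LDR returns.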

  • Interactive Models (LDR): 2-way interaction

    a) Amherst  b) Buffalo  c) Boston

    a) San Jose  b) Buffalo  c) Washington  d) None of the above

  • Search for Handwritten Documents

    Lexicon       Good Quality     Historical      Medical
                  10K     1K       10K     1K      4K
    Top 1 (%)     57      67       12      28      20
    Top 3 (%)     69      72       22      44      27
    Top 10 (%)    74      75       32      72      42

    Lexicons are typically large (>5K); around 70% accuracy is needed

    Strategy: reduce lexicon size using topic categorization (DAS 06, 08) and use the Top-N choices returned by the OCR (ICDAR 07)

    [Milewski & Govindaraju, DAS 2006] [Farooq et al., DAS 2008] [Cao & Govindaraju, ICDAR 2007]

  • Outline

    Recognition: Postal Application, Paradigms

    Search: OCR accuracy (Fusion, Lexicon Reduction, Statistical Topic Models); Document Search (Word Spotting)

  • Fusion of Recognizers: Type III

    Two recognizers score each class in the same trial:

               LDR    LFR
    Amherst    5.6    .52    $f_1(s_1^1, s_1^2)$
    Buffalo    7.4    .81    $f_2(s_2^1, s_2^2)$

    Identification task: choose the class with the best combined score,
    $S = \arg\max_{i=1,\dots,N} f(s_i^1, s_i^2)$

    Verification task: for a claimed identity (e.g., Amherst with scores 5.6 and .52), accept if $f(s^1, s^2) > \theta$, otherwise reject
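    A small sketch of the two decisions, assuming a combination function f over one class's two scores and that larger combined values are better (a distance-like LDR score would be negated first); all names here are illustrative.

```python
# Sketch of Type III identification and verification decisions.
def identify(score_pairs, f):
    """score_pairs: {class_name: (s1, s2)} for one trial; returns the
    class with the best combined score."""
    return max(score_pairs, key=lambda c: f(*score_pairs[c]))

def verify(s1, s2, f, theta):
    """Accept the claimed class iff the combined score clears theta."""
    return f(s1, s2) > theta

# e.g. with f = lambda s1, s2: -s1 + 10 * s2 (negating the distance-like
# LDR score), identify({"Amherst": (5.6, .52), "Buffalo": (7.4, .81)}, f)
# returns "Buffalo".
```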

  • Traditional Fusion Rules

    Sum rule: $f(s^1, s^2) = s^1 + s^2$

    Weighted sum rule: $f(s^1, s^2) = w_1 s^1 + w_2 s^2$

    Product rule: $f(s^1, s^2) = s^1 s^2$

    Max rule: $f(s^1, s^2) = \max(s^1, s^2)$

    Rank-based methods: $r_i^1 = \mathrm{rank}(s_i^1, \{s_1^1, \dots, s_N^1\})$, combined as $f(s_i^1, s_i^2) = r_i^1 + r_i^2$ or $f(s_i^1, s_i^2) = P(r_i^1, r_i^2 \mid \mathrm{gen})$
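    The fixed rules above are one-liners; here is a hedged sketch with illustrative weights (the rank-based variant replaces scores by their within-trial ranks):

```python
# Classical fixed fusion rules over two recognizers' scores s1, s2.
def sum_rule(s1, s2):
    return s1 + s2

def weighted_sum(s1, s2, w1=0.5, w2=0.5):   # weights are illustrative
    return w1 * s1 + w2 * s2

def product_rule(s1, s2):
    return s1 * s2

def max_rule(s1, s2):
    return max(s1, s2)

def rank_based(scores1, scores2):
    """Replace each class's scores by their within-trial ranks and sum
    them; a higher combined rank means a better class."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        r = [0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return [a + b for a, b in zip(ranks(scores1), ranks(scores2))]
```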

  • Likelihood Ratio: Verification Tasks

    [Figure: scatter plot of recognizer score 1 vs. recognizer score 2 for the impostor and genuine classes]

    Two classes, impostor and genuine: a pattern classification task

    $f_{lr}(s^1, s^2) = \dfrac{p_{gen}(s^1, s^2)}{p_{imp}(s^1, s^2)}$

    Minimum risk criterion: the optimal decision boundaries coincide with the contours of the likelihood ratio function, $f_V = f_{lr}$

    Metaclassification with NN, SVM, etc. is also possible

    [Prabhakar & Jain, 2002] [Nandakumar, Jain & Dass, 2008]
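    A sketch of the likelihood-ratio combination, estimating the genuine and impostor score densities with Gaussian kernel density estimates; the training arrays and the threshold are assumed inputs, and KDE is just one reasonable density estimator here.

```python
# Likelihood-ratio score fusion for verification via 2-D KDE.
import numpy as np
from scipy.stats import gaussian_kde

def make_lr(gen_scores, imp_scores):
    """gen_scores, imp_scores: (N, 2) arrays of (s1, s2) training pairs."""
    p_gen = gaussian_kde(gen_scores.T)   # density of genuine score pairs
    p_imp = gaussian_kde(imp_scores.T)   # density of impostor score pairs
    def f_lr(s1, s2):
        pt = np.array([[s1], [s2]])
        return float(p_gen(pt) / p_imp(pt))
    return f_lr

# Decision: accept when f_lr(s1, s2) > theta, with theta tuned on
# validation data to the desired operating point on the ROC.
```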

  • Optimal Combination Functions

    Identification task results (top choice correct rate):

    LFR is correct: 54.8%
    LDR is correct: 77.2%
    Both are correct: 48.9%
    Either is correct: 83.0%

    Likelihood Ratio: 69.8%
    Weighted Sum: 81.6%

    The LR combination $f_V = f_{LR}$ is worse than the single matcher

    Verification task results: ROC

    [Tulyakov & Govindaraju, IJPRAI 2009]

  • Independence of Scores

    In a single trial, the combined score for class $k$ may depend on the scores of all classes:

    $S_k = f(s_k^1, s_k^2, \dots, s_k^M; \{s_i^1, s_i^2, \dots, s_i^M\}_{i \ne k})$

               LDR    LFR
    Amherst    5.6    .52    $f_1(s_1^1, s_1^2)$
    Buffalo    7.4    .81    $f_2(s_2^1, s_2^2)$

    [Tulyakov & Govindaraju, IJPRAI 2009]

  • Dependencies

    OCR scores for the same classes shift with image quality:

    A    B    C        A    B    C
    .95  .89  .76      .80  .54  .43

    $\arg\max_j \prod_k \dfrac{p(s_j^k, t_j^k \mid C_{gen})}{p(s_j^k, t_j^k \mid C_{imp})}$

    [Tulyakov & Govindaraju, IJPRAI 2009]

  • Iterative Methods

    Initialize a combination function; for each identification trial, take the scores from that trial and update the function so that the genuine score beats every impostor score

    Likelihood Ratio: $f(s^1, \dots, s^M) = \dfrac{p_{gen}(s^1, \dots, s^M)}{p_{imp}(s^1, \dots, s^M)}$

    Sum of Logistic Functions: $f(s^1, \dots, s^M) = \sum_j \dfrac{1}{1 + e^{-(\lambda_{0,j} + \lambda_{1,j} s^1 + \dots + \lambda_{M,j} s^M)}}$

    Methods: Likelihood Ratio, Best Impostor Function, Sum of Logistic Functions

    Top choice correct rate (%) on LFR & LDR:

    Likelihood Ratio                  69.84
    Weighted Sum                      81.58
    Best Impostor Likelihood Ratio    80.07
    Logistic Sum                      81.43
    Neural Network                    81.67

    [Tulyakov & Govindaraju, IJPRAI 2009]
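    The sum-of-logistic-functions combination and one plausible iterative update are sketched below; the perceptron-style rule (push the genuine trial score above the best impostor's) is an assumption consistent with the bullets above, not necessarily the exact update of the cited paper.

```python
# Sum of logistic functions as a combination f, trained iteratively.
import numpy as np

def logistic_sum(s, W, b):
    """s: (M,) score vector; W: (J, M) weights; b: (J,) biases."""
    return float(np.sum(1.0 / (1.0 + np.exp(-(W @ s + b)))))

def train_step(W, b, gen_s, best_imp_s, lr=0.01):
    """If the genuine score does not beat the best impostor in this
    trial, nudge f up at gen_s and down at best_imp_s (gradient step)."""
    if logistic_sum(gen_s, W, b) <= logistic_sum(best_imp_s, W, b):
        for s, sign in ((gen_s, +1.0), (best_imp_s, -1.0)):
            z = 1.0 / (1.0 + np.exp(-(W @ s + b)))   # sigmoids, (J,)
            grad = z * (1.0 - z)                     # sigmoid derivative
            W += sign * lr * np.outer(grad, s)
            b += sign * lr * grad
    return W, b
```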

  • Search Engine for Handwritten Forms

    Pre-Hospital Care Report (PCR): 250,000 filed a year in WNY; 50,000 filed in a day in NYC; PDAs are not popular

    OHR issues: loosely constrained writing style, large lexicons, heterogeneous data

    6,700 carbon forms stored at 300 DPI; 1,000 PCR forms ground-truthed

  • Search Engine for Medical Forms

    Find all people who reported asthma problems in NY

    How many people with high blood pressure are on medication X?

    Is there an epidemic breaking out?

  • Lexicon Reduction

    Large lexicon (>5K) for handwritten medical documents

    Lexicon-free ICR features → topic category → lexicon reduced to ~2.5K → lexicon-driven recognition with improved performance

    [Milewski, Bhardwaj & Govindaraju, IJDAR 2009]

  • ICR Features Index

[Milewski, Bhardwaj & Govindaraju, IJDAR 2009]

  • Topic Features

    $\mathrm{cohesion}(w_a, w_b) = \dfrac{f(w_a, w_b)}{\sqrt{f(w_a) \cdot f(w_b)}}$

    DIGESTIVE-SYSTEM topic:

    FQ    CHSN    PHRASE
    30    0.72    PAIN INCIDENT
    5     0.31    PAIN TRANSPORTED
    42    0.54    PAIN CHEST
    52    0.81    STOMACH PAIN
    9     0.25    HOME PAIN
    6     0.43    VOMITING ILLNESS

    [Milewski, Bhardwaj & Govindaraju, IJDAR 2009]
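    A tiny sketch of the cohesion computation from raw counts, assuming f(w) is a unigram frequency and f(wa, wb) a co-occurrence frequency, with the square-root normalization shown above:

```python
# Cohesion of a candidate phrase from unigram and bigram counts.
import math

def cohesion(f_ab, f_a, f_b):
    """f_ab: co-occurrence count of (wa, wb); f_a, f_b: unigram counts."""
    return f_ab / math.sqrt(f_a * f_b)

# e.g. a pair seen together 52 times with unigram counts 80 and 52
# scores cohesion(52, 80, 52) of about 0.81, as for STOMACH PAIN above.
```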

  • Topic Categorization (Chu-Carroll et al., 1999)

    $B_{t,c} = \dfrac{A_{t,c}}{\sqrt{\sum_{e=1}^{n} A_{t,e}^2}}$

    $\mathrm{IDF}(t) = \log_2 \dfrac{n}{c(t)}$

    $X_{t,c} = \mathrm{IDF}(t) \cdot B_{t,c}$

    Cosine similarity between the trained topic vectors and the test document
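    A sketch of these three steps plus the final cosine scoring, with NumPy; the matrix layout (terms by topics) and the small epsilon guards are assumptions:

```python
# Topic categorization: L2-normalize term-topic counts, weight by IDF,
# then score a test document by cosine similarity to each topic vector.
import numpy as np

def topic_vectors(A):
    """A: (terms, topics) count matrix -> IDF-weighted vectors X[t, c]."""
    n = A.shape[1]                                    # number of topics
    B = A / np.sqrt((A ** 2).sum(axis=1, keepdims=True) + 1e-12)
    df = (A > 0).sum(axis=1)                          # c(t)
    idf = np.log2(n / np.maximum(df, 1))
    return idf[:, None] * B

def categorize(doc_vec, X):
    """doc_vec: (terms,) term vector of the test document."""
    sims = (doc_vec @ X) / (np.linalg.norm(doc_vec)
                            * np.linalg.norm(X, axis=0) + 1e-12)
    return int(np.argmax(sims))                       # best topic index
```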

  • Results

                CLT to RLT    CL to RL    CLT to ALT    CLT to SLT
    HR          7.48%         7.42%       17.58%        7.42%
    Error Rate  10.78%        10.88%      24.53%        10.21%

    C: complete lexicon, R: reduced lexicon, A: category given, S: synthetic features, T: truth present

    [Milewski, Bhardwaj & Govindaraju, IJDAR 2009]

  • Statistical Topic Modeling

    Train a maximum entropy topic categorization model

    Generate the topic distribution of the test document

    Use the topic distribution to score each topic as a new prior

    Compute the posterior probability of word recognition

    Improves accuracy from 32% to 40% on the IAM dataset

    Input word image → noisy output (Toggle 0.92, Google 0.90, Noodle 0.70, ...) → correction model → corrected output (Google 0.96, Toggle 0.72, Noodle 0.58, ...)

    [Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]

  • Statistical Topic Modeling: Correction Model

    The recognizer's noisy output gives $P(\text{word-image} \mid \text{term})$; the corrected output ranks candidates by $P(\text{term} \mid \text{word-image})$:

    $P(\text{term} \mid \text{word-image}) \propto P(\text{word-image} \mid \text{term}) \times P(\text{term}) = P(\text{word-image} \mid \text{term}) \times \sum_i P(\text{term} \mid LM_i) \, P(LM_i)$

    Noisy output: Toggle 0.92, Google 0.90, Noodle 0.70, ... → Corrected output: Google 0.96, Toggle 0.72, Noodle 0.58, ...

    [Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
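    A minimal sketch of the rescoring step, assuming `candidates` maps each OCR candidate term to P(word-image | term), `lms` is a list of per-topic unigram models, and `topic_dist` is the document's P(LM_i); the smoothing constant is illustrative:

```python
# Rescore OCR candidates with topic-mixture unigram priors.
def rescore(candidates, lms, topic_dist):
    """Returns normalized P(term | word-image) for each candidate."""
    rescored = {}
    for term, p_img in candidates.items():
        # P(term) = sum_i P(term | LM_i) * P(LM_i), smoothed
        p_term = sum(topic_dist[i] * lms[i].get(term, 1e-9)
                     for i in range(len(lms)))
        rescored[term] = p_img * p_term
    z = sum(rescored.values()) or 1.0
    return {t: p / z for t, p in rescored.items()}

# e.g. rescore({"Toggle": 0.92, "Google": 0.90, "Noodle": 0.70}, ...)
# can promote "Google" to the top choice, as on the slide.
```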

  • Language Model: $P(t \mid LM_i)$

    Category $c_1$ documents → category $c_1$ language model $LM_1$: P(eye|c1) = 0.92, P(brain|c1) = 0.90, ..., P(china|c1) = 0.09

    Category $c_2$ documents → category $c_2$ language model $LM_2$: P(trade|c2) = 0.82, P(bank|c2) = 0.78, ..., P(eye|c2) = 0.1

    [Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]

  • Topic Distribution: $P(LM_i)$

    $P(c \mid d) = \dfrac{e^{\sum_i \lambda_i f_i(d, c)}}{\sum_{c'} e^{\sum_i \lambda_i f_i(d, c')}}$

    Train the maximum entropy model to fit the $\lambda_i$

    $f_i$ is a feature (e.g., normalized word counts)
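    The posterior is a softmax over weighted features; a small sketch of the evaluation step (training the $\lambda_i$ is a separate optimization):

```python
# Maximum-entropy topic posterior P(c | d) as a softmax.
import numpy as np

def maxent_posterior(features, lam):
    """features: (classes, n_features) values f_i(d, c) for one document;
    lam: (n_features,) learned weights. Returns P(c | d) per class."""
    logits = features @ lam
    logits -= logits.max()          # subtract max for numerical stability
    expd = np.exp(logits)
    return expd / expd.sum()
```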

    [Figure: sample handwritten word images with top recognition choices: I 0.80 / T 0.65 / H 0.35; JULY 0.90 / FULLY 0.75 / DULY 0.65; CAVE 0.70 / HAVE 0.55 / HAS 0.15; DECEIVED 0.95 / RECEIVED 0.55 / PERCEIVED 0.30; FAVOR 0.70 / YOUR 0.55 / COLOR 0.15]