
Using OCR for Advanced Risk Measurement in Loan Business

Dirk Thomas, Commerzbank AG

momentum BARCELONA 16, 1st November 2016


Challenges in Loan Business are tackled using OCR Technology

Documents

› On a daily basis, a huge number of client-related documents

– Are handed in as hardcopies, scanned and saved as images such as PDF / TIFF

– Or are already provided as images via online channels

› Image formats do not allow direct machine-based reading of numbers and text for further processing and storage in databases

OCR

› By means of OCR (Optical Character Recognition) technology within the respective software packages, the original information can be regained

› Extraction and normalization of specific information, e.g. turnover from annual reports, requires a specific set-up of the OCR software and recursive document training
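As an illustration of the extraction and normalization step, the following minimal Python sketch pulls a turnover figure out of OCR text and normalizes it to a number. The label list, the regular expressions and the function names `normalize_amount` and `extract_turnover` are assumptions made for this example; they do not describe the set-up of any particular OCR software package.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical labels under which a turnover figure may appear (assumption).
TURNOVER_LABELS = ("turnover", "revenue", "sales")


def normalize_amount(raw: str) -> Decimal:
    """Turn '1.234.567,89' or '1,234,567.89' into Decimal('1234567.89').

    Heuristic: a final separator followed by one or two digits is read as
    the decimal separator; every other '.' or ',' is a thousands separator.
    """
    compact = raw.replace(" ", "")
    match = re.fullmatch(r"(\d[\d.,]*?)(?:[.,](\d{1,2}))?", compact)
    if match is None:
        raise ValueError(f"not a number: {raw!r}")
    whole = re.sub(r"[.,]", "", match.group(1))
    fraction = match.group(2) or "0"
    return Decimal(f"{whole}.{fraction}")


def extract_turnover(ocr_text: str) -> Optional[Decimal]:
    """Return the first number that follows a turnover label, if any.

    Caveat: a year printed directly after the label would be picked up
    instead of the amount; a real set-up needs document-specific rules.
    """
    labels = "|".join(re.escape(label) for label in TURNOVER_LABELS)
    pattern = re.compile(
        rf"(?:{labels})\D{{0,20}}?(\d[\d.,]*\d|\d)", re.IGNORECASE
    )
    match = pattern.search(ocr_text)
    return normalize_amount(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_turnover("Turnover (EUR): 1.234.567,89"))
```

The separator heuristic is deliberately simple: amounts with three decimal places would be misread, which is one reason why such rules have to be trained and tested per document type.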

Major aspects

Data extraction with OCR has two major aspects:

› To replace manual data capture (already well known in bank transaction services) for efficiency purposes (operational costs)

› To extract additional information for advanced analytics in risk management and marketing (loss prevention and growth potential)

Development of an OCR landscape to identify the main document types and the most suitable OCR method


OCR Landscape – more than 100 document types for OCR identified

Documents

› CoBa documents, standardized: documents for customers, CoBa solvency documents

› External documents, standardized: real estate, solvency documents, others

› External documents, individual: real estate, solvency documents, others

› Examples: account transfer, SEPA data, salary statement, BWA, VBA, guarantor disclosure, financial statement, Schufa, object data, cadastre, commercial register, death certificate, lease contract, valuation, org chart, signature

› Number of document types (x) shown in the diagram: 50, 25, 20

OCR / business logic

› Centralized and decentralized scanning

› OCR calculation kernels

› Types: spreadsheets, text, charts, handwriting, etc.

› Middleware ZSR

Systems

› CRM

› RISK

› Backoffice

› Static Data

› Collaterals

› Limits

› Rating

Different document types need different OCR methods

3 major document types require different extraction strategies (free-form and zone-based extraction)

› Form-based

› Semi-form

› Unstructured

[Figure: sample document images for the three types; one carries the label "Net income"]

Sources, in reading order: BdB Germany, Wikipedia, none, 3 x Commerzbank, Wikipedia, none, ADAC
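The difference between zone-based and free-form extraction can be sketched in a few lines of Python. The sketch assumes that an OCR engine has already returned words with page coordinates; the `Word` type, the coordinate values and both function names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page


def extract_zone(words: List[Word], zone: Tuple[float, float, float, float]) -> str:
    """Zone-based: return the text found inside a fixed rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = zone
    inside = [w for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
    inside.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)


def extract_after_anchor(
    words: List[Word], anchor: str, line_tolerance: float = 5.0
) -> Optional[str]:
    """Free-form: locate the anchor word anywhere on the page and return
    the nearest word to its right on the same line."""
    for candidate in words:
        if candidate.text.lower() != anchor.lower():
            continue
        right = [
            w for w in words
            if w.x > candidate.x and abs(w.y - candidate.y) <= line_tolerance
        ]
        if right:
            return min(right, key=lambda w: w.x).text
    return None


if __name__ == "__main__":
    # Same content, but the second layout places the line further down the page.
    form = [Word("Net", 10, 100), Word("income", 40, 100), Word("2,150.00", 200, 100)]
    shifted = [Word("Net", 10, 250), Word("income", 40, 250), Word("2,150.00", 200, 250)]
    zone = (180, 90, 300, 110)
    print(repr(extract_zone(form, zone)), repr(extract_zone(shifted, zone)))
    print(extract_after_anchor(form, "income"), extract_after_anchor(shifted, "income"))
```

A fixed zone works as long as the layout never moves, which is the case for form-based documents; the anchor search tolerates a shifted layout and is therefore closer to what semi-form and unstructured documents require.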


Scanned documents in the e-Archive typically represent, among others, the biggest data set in a bank: Big Data

The appropriate ship to transport the Commerzbank e-Archive printouts* is the “Hamburg Express”

* Basis: a bundle of paper with 500 pages weighs 2.49 kg and has a volume of about 0.003 m³, while a 20-foot container (Twenty-foot Equivalent Unit, 1 TEU) has a volume of 33.1 m³
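The footnote figures allow a quick back-of-the-envelope calculation: about 11,033 bundles, or roughly 5.5 million pages, fit into one TEU by volume. The short Python sketch below reproduces this arithmetic; it ignores packing losses and container weight limits, and the function names are chosen for this example only. The total page count of the archive is not stated here, so `containers_needed` takes it as an input.

```python
import math

# Figures from the footnote.
PAGES_PER_BUNDLE = 500
BUNDLE_MASS_KG = 2.49
BUNDLE_VOLUME_M3 = 0.003
TEU_VOLUME_M3 = 33.1


def bundles_per_teu() -> int:
    """Whole bundles that fit into one TEU by volume alone."""
    return math.floor(TEU_VOLUME_M3 / BUNDLE_VOLUME_M3)


def pages_per_teu() -> int:
    """Printed pages per TEU."""
    return bundles_per_teu() * PAGES_PER_BUNDLE


def paper_mass_per_teu_kg() -> float:
    """Mass of the paper in one fully packed TEU."""
    return bundles_per_teu() * BUNDLE_MASS_KG


def containers_needed(pages: int) -> int:
    """TEUs required for a given number of printed pages."""
    bundles = math.ceil(pages / PAGES_PER_BUNDLE)
    return math.ceil(bundles / bundles_per_teu())


if __name__ == "__main__":
    print(bundles_per_teu(), pages_per_teu(), round(paper_mass_per_teu_kg(), 2))
```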


OCR architecture allows machine-based end-to-end processing, including the decision

Using OCR, customer-related documents and certificates are

› processed fast, with reduced or no manual interaction, and

› extended data extraction can be used for advanced analytics

Components shown in the architecture diagram:

› Input: hardcopies (branch process), images (online process)

› Document examples: salary statement, current account, legitimation, income access statement, quarterly numbers, second-hand car advertisement, etc.

› Central / decentral scanning

› Classification / routing

› OCR computer

› e-Archive

› Middleware

› Business logic

› Decision engine

› Exception handling (disqualified documents)

› Advanced analytics (parallel architecture), network analysis

› Interface / deliveries

› RISK systems

› CRM systems

› Client
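The routing idea behind this architecture, with automatic processing where possible and exception handling for disqualified documents, could look roughly like the following Python sketch. The document types, the required fields and the queue names are hypothetical; the sketch does not reproduce the actual business logic or decision engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    doc_id: str
    doc_type: str  # result of the classification step
    fields: Dict[str, str] = field(default_factory=dict)  # OCR output


# Hypothetical per-type rule: which fields must be present and non-empty
# before a document may go to the decision engine without manual work.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "salary_statement": ["net_income"],
    "quarterly_numbers": ["turnover"],
}


def route(doc: Document, decision_queue: List[Document],
          exception_queue: List[Document]) -> str:
    """Send a document to the decision queue or to exception handling."""
    required = REQUIRED_FIELDS.get(doc.doc_type)
    unknown_type = required is None
    incomplete = not unknown_type and any(
        not doc.fields.get(name) for name in required
    )
    if unknown_type or incomplete:
        exception_queue.append(doc)  # disqualified: needs manual interaction
        return "exception"
    decision_queue.append(doc)
    return "decision"


if __name__ == "__main__":
    decisions: List[Document] = []
    exceptions: List[Document] = []
    print(route(Document("d1", "salary_statement", {"net_income": "2150.00"}),
                decisions, exceptions))
    print(route(Document("d2", "salary_statement", {}), decisions, exceptions))
```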


Unlike classic IT development, OCR requires continuous training, testing and set-up in a lab environment

› OCR training configuration can be ported directly into production

› OCR configuration and business logic are trained for each document type as a pair

OCR Laboratory: elements of the training cycle (numbered 1 to 5 in the diagram)

› e-Archive

› Sample selection of representative sample cases

› Images to OCR software

› OCR training computer

› Business logic

› Result file

› Control

› Test results

› Set-up / programming

› Testing
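One way to picture the control step of such a training cycle is to compare the extracted values of the result file with manually verified control values for the same sample cases. The Python sketch below computes a per-field accuracy and lists the fields that would need another round of set-up and testing. The data layout, the function names and the 95 % threshold are assumptions for illustration, not the procedure used in the lab.

```python
from typing import Dict, List

Record = Dict[str, str]  # field name -> value for one sample case


def field_accuracy(results: List[Record], control: List[Record]) -> Dict[str, float]:
    """Share of sample cases, per field, where the extracted value
    equals the manually verified control value."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for extracted, expected in zip(results, control):
        for name, value in expected.items():
            totals[name] = totals.get(name, 0) + 1
            if extracted.get(name) == value:
                hits[name] = hits.get(name, 0) + 1
    return {name: hits.get(name, 0) / totals[name] for name in totals}


def fields_needing_retraining(accuracy: Dict[str, float],
                              threshold: float = 0.95) -> List[str]:
    """Fields below the (assumed) quality threshold, sorted by name."""
    return sorted(name for name, value in accuracy.items() if value < threshold)


if __name__ == "__main__":
    results = [
        {"net_income": "2150.00", "employer": "ACME"},
        {"net_income": "1980.50", "employer": "ACNE"},  # OCR misread
    ]
    control = [
        {"net_income": "2150.00", "employer": "ACME"},
        {"net_income": "1980.50", "employer": "ACME"},
    ]
    accuracy = field_accuracy(results, control)
    print(accuracy, fields_needing_retraining(accuracy))
```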


Generation of the entity network/graph is based on data extraction from text documents using machine learning

A graph consists of nodes and edges/links. In our particular case

– Nodes represent entities / companies

– Edges represent the kind of relation between the entities (owner, customer, supplier, competitor etc.)

Both pieces of information can be extracted from the processed document, so one document may hold the names of two entities and their type of relation
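A minimal Python sketch of such a graph, built from (entity, relation, entity) triples as they might come out of the extraction step, is shown below. The class name `EntityGraph`, its methods and the company names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple


class EntityGraph:
    """Directed graph: nodes are entities, edges carry a relation type."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()
        # source entity -> list of (relation type, target entity)
        self.edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_relation(self, source: str, relation: str, target: str) -> None:
        """Add one extracted triple; duplicates from several documents are ignored."""
        self.nodes.update((source, target))
        if (relation, target) not in self.edges[source]:
            self.edges[source].append((relation, target))

    def related(self, source: str, relation: str) -> List[str]:
        """All entities linked to `source` by the given relation type."""
        return [t for r, t in self.edges.get(source, []) if r == relation]


if __name__ == "__main__":
    graph = EntityGraph()
    # Each triple stands for one relation found in one processed document.
    graph.add_relation("Company A", "owner", "Company B")
    graph.add_relation("Company A", "supplier", "Company C")
    graph.add_relation("Company A", "owner", "Company B")  # seen again
    print(sorted(graph.nodes), graph.related("Company A", "owner"))
```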

Machine-based learning starts with annotated documents

Machine-based data extraction from unstructured text starts with annotated documents

[Diagram elements: database (extracted text data), documents, annotation, annotated documents, training documents, feature generation, training, model, further graph-based data to enrich the network]

Named Entity Recognition is based on machine learning models trained on human-annotated documents

Finding entities in text in this manner does not rely on a classical keyword search; hence entities are found without knowing their specific names
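Why no list of names is needed becomes clearer when looking at the kind of features such a model can be trained on. The Python sketch below generates features that describe the shape and the context of a token rather than its identity. The feature set, the suffix list and the function name `token_features` are assumptions typical of feature-based entity recognition and are not taken from the system described here.

```python
from typing import Dict, List, Union

# Hypothetical legal-form suffixes that often follow a company name.
COMPANY_SUFFIXES = {"inc.", "ag", "gmbh", "ltd.", "corp."}

Feature = Union[bool, str]


def token_features(tokens: List[str], i: int) -> Dict[str, Feature]:
    """Describe token i by its shape and neighbours, not by a name lookup."""
    token = tokens[i]
    previous = tokens[i - 1] if i > 0 else ""
    following = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "prev_lower": previous.lower(),
        "next_is_company_suffix": following.lower() in COMPANY_SUFFIXES,
    }


if __name__ == "__main__":
    sentence = "Shares of Examplia Inc. rose".split()
    # "Examplia" is an invented name: no keyword list could contain it,
    # but its features resemble those of annotated company names.
    print(token_features(sentence, 2))
```

A classifier trained on annotated documents learns which feature combinations indicate an entity, so a previously unseen name with the same pattern can still be recognized.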


[Org chart nodes: Walt Disney, Hollywood Records, Miramax Films, Touchstone Pictures, ABC Entertainment, Disneyland Resort, Starwave, Hyperion Books, Pixar, ABC Family Worldwide Inc., Disney Channels Worldwide, ABC Studios, A+E Networks, ESPN Inc., Disney Interactive, Disney-ABC Television Group, Disney Publishing Worldwide]

Sample extraction from a US newspaper already shows that company org charts can be generated

The Walt Disney Company org chart above was generated by machine reading of a US newspaper, extracting companies and their ownership relations