Using OCR for Advanced Risk Measurement in Loan Business
Dirk Thomas, Commerzbank AG
momentum BARCELONA 16, 1st November 2016
Challenges in Loan Business are tackled using OCR Technology
› On a daily basis, a huge number of client-related documents
– are handed in as hard copies, then scanned and saved as images such as PDF / TIFF
– or are already provided as images via online channels
› Image formats do not allow direct machine-based reading of numbers and text for further processing and storage in databases
Documents
› By means of OCR (Optical Character Recognition) technology in suitable software packages, the original information can be recovered
› Extraction and normalization of specific information, e.g. turnover from annual reports, requires a specific set-up of the OCR software and recursive document training
OCR
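The extraction and normalization step mentioned above can be sketched as follows. This is a minimal illustration, not the bank's set-up: the label list, the unit words and the function name extract_turnover are assumptions, and a trained OCR configuration would replace the single regular expression used here.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed labels and unit words; a real set-up would be trained per document type.
LABELS = ("Umsatzerlöse", "Turnover", "Revenue")
UNIT_FACTORS = {
    "EUR": Decimal(1),
    "TEUR": Decimal(1_000),
    "Mio. EUR": Decimal(1_000_000),
}

PATTERN = re.compile(
    r"(?:" + "|".join(re.escape(label) for label in LABELS) + r")"
    r"\D*?"  # skip separators such as ": " between label and figure
    r"(?P<number>\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?)"  # German number format
    r"\s*(?P<unit>Mio\. EUR|TEUR|EUR)?"
)


def extract_turnover(ocr_text: str) -> Optional[Decimal]:
    """Return the first turnover figure found in OCR text, normalized to euros."""
    match = PATTERN.search(ocr_text)
    if match is None:
        return None
    # "1.234.567,89" -> "1234567.89": drop thousands dots, turn the decimal comma into a dot.
    number = match.group("number").replace(".", "").replace(",", ".")
    factor = UNIT_FACTORS[match.group("unit") or "EUR"]
    return Decimal(number) * factor


print(extract_turnover("Umsatzerlöse: 1.234.567,89 EUR"))
print(extract_turnover("Turnover 12,5 Mio. EUR"))
```

Normalizing to a single unit and number format is what makes the value usable for storage in a database; the sketch takes the first figure after the label and would need extra rules for documents that print a year or a note number next to it.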
Data Extraction with OCR has two major aspects:
› To replace manual data capture (already well known in bank transaction services) for efficiency purposes (operational costs)
› To extract additional information for advanced analytics in risk management and marketing (loss prevention and growth potential)
Major aspects
Development of an OCR landscape to identify the main document types and the most suitable OCR method
OCR Landscape – more than 100 document types for OCR identified
[Diagram with three columns: Documents, OCR / business logic, Systems]
Documents
› CoBa documents, standardized: documents for customers, CoBa solvency documents
› External documents, standardized: real estate, solvency documents, others
› External documents, individual: real estate, solvency documents, others
› Examples: account transfer, SEPA data, salary statement, BWA, VBA, guarantor disclosure, financial statement, Schufa, object data, cadastre, commercial register, death certificate, lease contract, valuation, org chart, signature
› Number of document types (figures shown in the diagram): 50, 25, 20
OCR / business logic
› Centralized and decentralized scanning
› OCR calculation kernels
› Types: spreadsheets, text, charts, handwriting, etc.
› Middleware ZSR
Systems
› CRM, RISK, Backoffice, Static Data, Collaterals, Limits, Rating
Different document types need different OCR methods
Free-form and zone-based extraction
Three major document types require different extraction strategies
Sources (in reading order): BdB Germany, Wikipedia, none, 3 x Commerzbank, Wikipedia, none, ADAC
[Figure: sample documents for each document type, with the label "Net income"]
Form-based
Semi-form
Unstructured
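The difference between zone-based and free-form extraction can be sketched on OCR output given as words with page coordinates. The Word type, the coordinates and both function names are assumptions made for illustration; they are not taken from any particular OCR product.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge on the page
    y: float  # top edge on the page


def zone_based(words: Sequence[Word], zone: Tuple[float, float, float, float]) -> str:
    """Form-based documents: read whatever lies inside a fixed rectangle."""
    x0, y0, x1, y1 = zone
    hits = [w for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
    return " ".join(w.text for w in sorted(hits, key=lambda w: (w.y, w.x)))


def free_form(words: Sequence[Word], label: str) -> Optional[str]:
    """Unstructured documents: find the label anywhere, return the word that follows it."""
    label_tokens = label.lower().split()
    ordered = sorted(words, key=lambda w: (w.y, w.x))  # reading order
    n = len(label_tokens)
    for i in range(len(ordered) - n):
        window = [w.text.lower() for w in ordered[i:i + n]]
        if window == label_tokens:
            return ordered[i + n].text
    return None


page = [
    Word("Net", 10, 100), Word("income", 40, 100), Word("1,250", 120, 100),
    Word("Total", 10, 120), Word("assets", 50, 120), Word("9,800", 120, 120),
]
print(zone_based(page, (100, 90, 200, 110)))
print(free_form(page, "Net income"))
```

Zone-based reading only works when the field sits at a known position, as on a standardized form. The free-form variant tolerates a changing layout but depends on the label being recognized, so a semi-structured document would plausibly combine both.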
Scanned documents in the e-Archive typically represent, among others, the biggest data set in a bank: Big Data
The appropriate ship to transport the Commerzbank e-Archive printouts* is the “Hamburg Express”
* Basis: a bundle of 500 pages of paper weighs 2.49 kg and has a volume of about 0.003 m³, while a 20-foot container (Twenty-foot Equivalent Unit, 1 TEU) holds 33.1 m³
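The arithmetic behind the footnote can be written out as a short calculation. The constants come from the footnote; the page count in the example is purely hypothetical, since the slide does not state the size of the e-Archive, and the sketch ignores packing losses inside a container.

```python
import math

# Constants taken from the footnote.
PAGES_PER_BUNDLE = 500
BUNDLE_MASS_KG = 2.49
BUNDLE_VOLUME_M3 = 0.003
TEU_VOLUME_M3 = 33.1


def printout_footprint(pages: int) -> dict:
    """Estimate mass, volume and container count for a given number of printed pages."""
    bundles = math.ceil(pages / PAGES_PER_BUNDLE)
    volume_m3 = bundles * BUNDLE_VOLUME_M3
    return {
        "bundles": bundles,
        "mass_kg": bundles * BUNDLE_MASS_KG,
        "volume_m3": volume_m3,
        "teu": math.ceil(volume_m3 / TEU_VOLUME_M3),  # whole containers only
    }


# Hypothetical archive size, chosen only to exercise the calculation.
result = printout_footprint(1_000_000_000)
print(result)
```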
OCR architecture allows machine-based end-to-end processing incl. decision
Using OCR:
› customer-related documents and certificates are processed fast, with reduced or no manual interaction, and
› extended data extraction can be used for advanced analytics
[Architecture diagram; component labels]
› Client: hard copies (branch process), images (online process)
› Central / decentral scanning
› Classification / routing
› OCR computer, e-Archive
› Middleware
› Business logic, decision engine
› Interface / deliveries, RISK systems, CRM systems
› Exception handling (disqualified docs)
› Advanced analytics (parallel architecture), network analysis
Example documents: salary statement, accounts current, legitimation, income access statement, quarterly numbers, second-hand car advert, etc.
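The end-to-end flow named in the architecture (classification, extraction, decision, exception handling) can be sketched as a small routing function. Everything concrete here is an assumption for illustration: the keyword classifier, the two document types, the field names and the queue names do not come from the slides.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Document:
    doc_id: str
    text: str  # OCR output
    doc_type: Optional[str] = None
    fields: Dict[str, str] = field(default_factory=dict)


# Hypothetical keyword classifier: one trigger phrase per document type.
KEYWORDS = {"salary_statement": "gross salary", "account_statement": "account balance"}


def classify(doc: Document) -> Optional[str]:
    lowered = doc.text.lower()
    for doc_type, keyword in KEYWORDS.items():
        if keyword in lowered:
            return doc_type
    return None


def amount_after(label: str) -> Callable[[str], Dict[str, str]]:
    """Build an extractor that returns the first number following the label."""
    pattern = re.compile(re.escape(label) + r"\D*(\d+(?:\.\d+)?)", re.IGNORECASE)

    def extractor(text: str) -> Dict[str, str]:
        match = pattern.search(text)
        return {label: match.group(1)} if match else {}

    return extractor


EXTRACTORS = {
    "salary_statement": amount_after("gross salary"),
    "account_statement": amount_after("account balance"),
}


def process(doc: Document) -> str:
    """Route a document to the decision engine or to exception handling."""
    doc.doc_type = classify(doc)
    extractor = EXTRACTORS.get(doc.doc_type)
    if extractor is None:
        return "exception_handling"  # unknown type: disqualified document
    doc.fields = extractor(doc.text)
    if not doc.fields:
        return "exception_handling"  # known type, but nothing readable
    return "decision_engine"


salary = Document("1", "Gross salary: 3200.00 EUR")
print(process(salary), salary.fields)
print(process(Document("2", "Second hand car advert")))
```

Keeping the extractor lookup keyed by document type mirrors the idea that each document type has its own OCR configuration, and any document that cannot be classified or read falls through to manual exception handling rather than reaching a decision.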
Unlike classic IT development, OCR requires continuous training, testing and set-up in a lab environment
OCR Laboratory
› OCR training configuration can be ported directly into production
› OCR configuration and business logic are trained for each document type as a pair
[Diagram: training cycle with five numbered steps; labels: e-Archive, sample selection, representative sample cases, images to OCR software, OCR training computer, business logic, result file, control, test results, set-up/programming, testing]
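One pass of such a training cycle can be sketched as scoring candidate configurations against representative sample cases. The regular-expression "configurations", the field name net_income and the accuracy target are assumptions; a real OCR package would expose its own configuration format.

```python
import re
from typing import Callable, Dict, List, Tuple

Extractor = Callable[[str], Dict[str, str]]
Sample = Tuple[str, Dict[str, str]]  # (OCR text, expected field values)


def make_extractor(pattern: str) -> Extractor:
    """A stand-in for an OCR configuration: one regex capturing one field."""
    compiled = re.compile(pattern)

    def extract(text: str) -> Dict[str, str]:
        match = compiled.search(text)
        return {"net_income": match.group(1)} if match else {}

    return extract


def field_accuracy(extract: Extractor, samples: List[Sample]) -> float:
    """Share of expected fields that the extractor reproduces exactly."""
    correct = total = 0
    for text, expected in samples:
        actual = extract(text)
        for name, value in expected.items():
            total += 1
            correct += actual.get(name) == value
    return correct / total if total else 0.0


def training_cycle(candidates: Dict[str, Extractor], samples: List[Sample],
                   target: float) -> Tuple[str, float]:
    """Test candidates in order; stop at the first one that reaches the target."""
    best_name, best_acc = "", -1.0
    for name, extract in candidates.items():
        acc = field_accuracy(extract, samples)
        if acc > best_acc:
            best_name, best_acc = name, acc
        if acc >= target:
            break
    return best_name, best_acc


samples = [
    ("Net income: 120", {"net_income": "120"}),
    ("Net income 95", {"net_income": "95"}),
]
candidates = {
    "v1": make_extractor(r"Net income: (\d+)"),
    "v2": make_extractor(r"Net income:? (\d+)"),
}
print(training_cycle(candidates, samples, target=1.0))
```

Because the winning candidate is just a named configuration, the same object that passed the test could be deployed unchanged, which is the property the first bullet above describes.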
Generation of the entity network/graph is based on data extraction from text documents using machine learning
A graph consists of nodes and edges/links. In our particular case:
– Nodes represent entities / companies
– Edges represent the kind of relation between the entities (owner, customer, supplier, competitor etc.)
Both kinds of information can be extracted from the processed document, so one document may hold the names of two entities and their type of relation
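The node and edge structure described above can be sketched as a small typed graph that merges relations extracted from several documents. The company names are invented placeholders and the class names are assumptions for this illustration.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Relation(NamedTuple):
    """One extraction result: two entities and the kind of relation between them."""
    source: str
    kind: str  # e.g. "owner", "customer", "supplier", "competitor"
    target: str


class EntityGraph:
    def __init__(self) -> None:
        self.nodes = set()
        self.edges = defaultdict(set)  # source -> {(kind, target), ...}

    def add(self, relation: Relation) -> None:
        self.nodes.update((relation.source, relation.target))
        self.edges[relation.source].add((relation.kind, relation.target))

    def related(self, source: str, kind: str) -> List[str]:
        """All entities linked to source by the given kind of relation."""
        return sorted(t for k, t in self.edges.get(source, ()) if k == kind)


# Placeholder relations, as if extracted from three different documents.
extracted = [
    Relation("Alpha GmbH", "owner", "Beta AG"),
    Relation("Alpha GmbH", "supplier", "Gamma KG"),
    Relation("Alpha GmbH", "owner", "Beta AG"),  # same fact found in a second document
]

graph = EntityGraph()
for relation in extracted:
    graph.add(relation)

print(sorted(graph.nodes))
print(graph.related("Alpha GmbH", "owner"))
```

Storing edges in sets means the same relation found in several documents is kept once, so the network grows only when a document contributes a new entity or a new link.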
Machine-based learning starts with annotated documents
Machine-based data extraction from unstructured text starts with annotated documents
[Diagram: training pipeline; labels: database (extracted text data), documents, annotation, annotated documents, training documents, feature generation, training, model, further graph-based data to enrich the network]
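The feature generation step in this pipeline can be sketched for token-level entity recognition. The BIO label scheme, the feature names and the sample sentence are assumptions; the slides do not say which features or which learning algorithm were used, and no model is trained here.

```python
from typing import Dict, List, Tuple

AnnotatedSentence = List[Tuple[str, str]]  # (token, label), labels in BIO style


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Describe one token by its shape and its neighbours, not by a name list."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        "is_upper": token.isupper(),
        "suffix3": token[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def build_training_set(annotated: List[AnnotatedSentence]):
    """Turn human-annotated sentences into feature dicts and target labels."""
    features, labels = [], []
    for sentence in annotated:
        tokens = [token for token, _ in sentence]
        for i, (_, label) in enumerate(sentence):
            features.append(token_features(tokens, i))
            labels.append(label)
    return features, labels


# Invented sentence with hand-assigned labels (ORG = organisation).
annotated_docs = [[
    ("Alpha", "B-ORG"), ("GmbH", "I-ORG"), ("owns", "O"),
    ("Beta", "B-ORG"), ("AG", "I-ORG"),
]]

features, labels = build_training_set(annotated_docs)
print(features[0])
print(labels)
```

Features such as capitalization, a legal-form token next door or a neighbouring verb like "owns" are what would let a trained classifier flag a company it has never seen, in contrast to a lookup against a list of known names.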
Named Entity Recognition is based on machine learning models trained on human-annotated documents
Finding entities in text in this manner does not rely on a classical keyword search, hence entities are found without knowing their specific names
[Org chart figure; nodes: Walt Disney, Disney-ABC Television Group, ABC Studios, ABC Entertainment, ABC Family Worldwide Inc., Disney Channels Worldwide, A+E Networks, ESPN Inc., Disney Interactive, Disney Publishing Worldwide, Hyperion Books, Disneyland Resort, Hollywood Records, Miramax Films, Touchstone Pictures, Pixar, Starwave]
Sample extraction from a US newspaper already shows that company org charts can be generated
The Walt Disney Company org chart above was generated from machine reading of a US newspaper, extracting companies and their owner relations