Oracle – Big Data THE INTELLIGENCE LIFE-CYCLE and Schema-Last Approach Dr Neil Brittliff PhD

Oracle – Big DataTHE INTELLIGENCE LIFE-CYCLEand Schema-Last Approach

Dr Neil Brittliff PhD

A little about myself… Awarded a PhD at the University of Canberra in March this year for my work in

the Big Data space

Currently employed as Data Scientist within the Australian Government

Have been employed by 5 law enforcement agencies

Developed Cryptographic Software to support the Australian Medicare System

First used Oracle products back in 1986

Worked in the IT industry since 1982

Resides in Canberra (capital of Australia)

Canberra is the only capital city in Australia that is not named after a person

Interests

Tennis (play) / Cricket (watch)

Bushwalking and camping

Piano Playing (very bad)

Making stuff out of wood

Enjoys the art of Programming (prefers the ‘C’ language)

Pushing the limits of the Raspberry Pi

2

University of Canberra - 2015

Talk Structure 3

Motivation

Principles and Constraints

Intelligence Life-Cycle Collect & Collate

Analyse & Produce

Report & Disseminate

Motivation

Research What is a Schema

The Problem with ETL

Data Cleansing verses Data Triage

A New Architecture Oracle Big Data

The Schema-Last Approach

Indexing Technologies and Exploitation

User Reaction

Observations and Opportunities


National Criminal Intelligence 4

The Law Enforcement community are also in the business of collecting and

analysing criminal intelligence and data, and where possible, sharing that resulting information…

To do this, they need rich, contemporary, and comprehensive criminal intelligence…

The National Criminal Intelligence Fusion Capability, which brings together

subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.

Fusion capability identifies the threats and vulnerabilities through the use of data.

It brings together, monitors and analyses data and information from Customs, other law enforcement, Government agencies and industry to build an intelligence picture of serious and organised crime in Australia.


Australian Institute of Criminology

5

• While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue has not been adequately resolved.

• Over many years, there have been a variety of different ideas put forward in relation to addressing the increasing volume of data, such as data mining.

Darren Quick and Kim-Kwang Raymond ChooAustralian Institute of Criminology September 2014


Objectives 6

Support the Australian Intelligence Criminal Model

Simple Interface to exploit the data

Data ingestion must be simple to do

and minimise transformation Support the large variety of data sources

Fast ingestion and retrieval times

Enable exact and fuzzy searching

Support ‘Identity Resolution’

Support metadata

Main the data’s integrity

Preserve Data-Lineage/Provenance

Reproduce the ingested data sourceexactly!

We don’t want this!


The Intelligence Life-Cycle

7

Plan, prioritise & direct

Collect & collate

Report & disseminate

Analyse & produce

Evaluate & review


Intelligence – Data Source Classification

8

Low95%

High5%

Data SOURCE CLASSIFICATION

Low HighVelocity

VarietyVolumeVeracity

Value

Collect

& c

ollate

An

aly

se &

pro

du

ce


Some Definitions: 9

That a major problem for the data scientist is to flatten the bumps as a result of the heterogeneity of data. Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The twitter experience.

Collect

& C

ollate

Schema is from the Greek word meaning ‘form' or ‘figure' and is a formal representation of data model which has integrity constraints controlling permissible data values.

Data munging or sometimes referred to as data wrangling means taking data that’s storedin one format and changing it into another format.

Analyse

AnalyseStorage

Schema Application 10

Sch

em

a F

irst

Raw Data

Triage

Cleanse

Raw Data

Storage

Sch

em

a L

ast

Schema

Schema


Data Cleansing … 11

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.

“Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done.”

Microsoft: 2012

Collect

& C

ollate


Data Sources – Always Increasing

12

Gap

Collect

& C

ollate


Data Cleansing - Doesn’t WORK

13

“Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems. ” David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.

“Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization.” N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014.

“that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.” Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007.

Collect

& C

ollate


Data Cleansing – Loss of Format

14

Input Date Cleansed Date

Comment

20 July 2014 20-07-2014 Australian Date

July-20-2014 20-07-2014 American Format(mmm-dd-yyyy)

2014-20-07 20-07-2014 Arabic Format (right to left)

20-07-14 20-07-2014 Data Ambiguity

July 2014 01-07-2014 Imputed Value

"If you torture the data long enough, it will confess.“

Clifton R. Musser

Collect

& C

ollate


ETL vs Triage 15

Initiate

Extract

Determine

Suitability?

Transform

n

Assessment?

Load

Report

Complete

n

Initiate

Triage

Load

Suitability?

Application

n

Verify?

Fuse

Resolve

Complete

n

Collect

& C

ollate

ETL Triage


We did our research … 16


Oracle’s BDA(Big Data Appliance)

17

Collect

& C

ollate


Data Storage/Collation 18

Store the Data Semantically Built on an defined taxonomy/ontology Perfect to capture metadata

Searched for the perfect Triple-Store

Subject Predicate Object

Triple

GraphList

Collect

& C

ollate


The Architecture 19

Collect & Collate Analyse & Produce

Set Store

Hbase

Historical

Data

NewData

RD

F /

Mod

ellin

g

Feeds

Data

Exp

lora

tion

Sem

an

tic S

tore

Disseminate

Index

IIR

Index

SOLR

BDA

Pala

nti

r

Searc

h A

ssis

tan

t

Data Flow

Data

Exp

loit

ati

on

SPARQL

R Language

Apache PIG


Schema Last … 20

‘Triaged’ Data

First NameMiddle NameLast Name

Schema

Full-Name

Street NumberStreet NameSuburbStatePostcode

Full-Address

Collect

& C

ollate

Models


ACC Search Engines – ‘Smackdown’

21

Feature SOLR IIR

License Apache License Commercial

Storage Inverted List Third-partyDatabase

Support Google Like search NextRelease

Score Model Inverse Document Frequency

NormalizedScore

Result Pagination

Homophone Support Can use synonym support

Phoneme Search

Spread indexes across multiple nodes

Schema-less Support

Programming Interface Rest SOAP - API

Geo-spatial

Collect

& C

ollate


Collect & Collation Tool 22

Collect

& C

ollate


Bongo – Exploration 23

An

aly

se &

Pro

du

ce


Palantir – Semantic Interface 24

Rep

ort

& D

issem

inate

User Reaction 25

Time to Triage

< 1 Hour

> 1 Hour < 24 Hour

> 24 Hours

General Size % - Megabytes

< 1 > 1 < 100> 100 < 1000> 1000

• Developed a Palantir Plugin to search the Fusion Data Holding

• Bulk Matching was a great success

• In general, user reaction has been positive

• Time to Triage was usually under an hour where cleansing could take weeks!!!

Australian Crime Commission 2015


Ingestion Rate –The Improvement

26

Collect

& C

ollate


Observations… 27

The Bulk Matcher Performance and Reliability

Interaction with Palantir Configuration over Customisation Search for the ‘Single Source of Truth’

Golden Record

Acceptance of the Schema Last Approach Overwhelmed by Search Results


Further Reading and Contacts

28

Strategic Thinking in Criminal IntelligenceJerry H RatcliffeThe Federation Press – 2009 ISBN 978 186287 734-4

Intelligence-Led PolicingJerry RatcliffeRoutledge – 2008ISBN 978-1-843292-339-8

Data MatchingConcepts and Techniques and Record Linkage, Entity Resolution, and Duplicate DetectionPeter ChristenSpringer – 2012ISBN 978-3-642-31163-5

Foundations of Semantic Web TechnologiesPascal Hitzler, Markus Krötzsch, Sebastian RudolphCRC Press – 2010ISBN 978-1-4200-9050-5

Big Data – A revolution that will transform how we live, work, and thinkViktor Mayer-Schönberger and Kenneth CukierHMH – 2013ISBN 978-0-544-00269-2

Sharma The Schema Last Approach to Data Fusion Neil Brittliff and Dharmendra Sharma The Schema Last Approach to Data Fusion AusDM 2014

A Triple Store Implementation to support Tabular Data Neil Brittliff and Dharmendra Sharma AusDM 2014

Australian Institute of Criminology http://www.aic.gov.au

University of Canberrahttp://www.Canberra.edu.au

Documents

Oracle – Big Data THE INTELLIGENCE LIFE-CYCLE and Schema-Last Approach Dr Neil Brittliff PhD