Oracle – Big Data: The Intelligence Life-Cycle and Schema-Last Approach
Dr Neil Brittliff PhD
A little about myself…
Awarded a PhD at the University of Canberra in March this year for my work in the Big Data space
Currently employed as Data Scientist within the Australian Government
Have been employed by 5 law enforcement agencies
Developed Cryptographic Software to support the Australian Medicare System
First used Oracle products back in 1986
Worked in the IT industry since 1982
Resides in Canberra (capital of Australia)
Canberra is the only capital city in Australia that is not named after a person
Interests
Tennis (play) / Cricket (watch)
Bushwalking and camping
Piano Playing (very bad)
Making stuff out of wood
Enjoys the art of Programming (prefers the ‘C’ language)
Pushing the limits of the Raspberry Pi
University of Canberra - 2015
Talk Structure
Motivation
Principles and Constraints
Intelligence Life-Cycle
    Collect & Collate
    Analyse & Produce
    Report & Disseminate
Motivation
Research
    What is a Schema
    The Problem with ETL
    Data Cleansing versus Data Triage
A New Architecture
    Oracle Big Data
    The Schema-Last Approach
    Indexing Technologies and Exploitation
User Reaction
Observations and Opportunities
National Criminal Intelligence
The Law Enforcement community are also in the business of collecting and analysing criminal intelligence and data, and where possible, sharing the resulting information…
To do this, they need rich, contemporary, and comprehensive criminal intelligence…
The National Criminal Intelligence Fusion Capability brings together subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.
Fusion capability identifies the threats and vulnerabilities through the use of data.
It brings together, monitors and analyses data and information from Customs, other law enforcement, Government agencies and industry to build an intelligence picture of serious and organised crime in Australia.
Australian Institute of Criminology
• While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue has not been adequately resolved.
• Over many years, there have been a variety of different ideas put forward in relation to addressing the increasing volume of data, such as data mining.
Darren Quick and Kim-Kwang Raymond Choo, Australian Institute of Criminology, September 2014
Objectives
Support the Australian Criminal Intelligence Model
Simple Interface to exploit the data
Data ingestion must be simple to do and minimise transformation
Support the large variety of data sources
Fast ingestion and retrieval times
Enable exact and fuzzy searching
Support ‘Identity Resolution’
Support metadata
Maintain the data’s integrity
Preserve Data-Lineage/Provenance
Reproduce the ingested data source exactly!
We don’t want this!
The Intelligence Life-Cycle
Plan, prioritise & direct
Collect & collate
Analyse & produce
Report & disseminate
Evaluate & review
Intelligence – Data Source Classification
Data Source Classification: Low 95%, High 5%
Scales from Low to High: Velocity, Variety, Volume, Veracity, Value
Collect & collate
Analyse & produce
Some Definitions:
A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data. Jimmy Lin and Dmitriy Ryaboy, Scaling Big Data Mining Infrastructure: The Twitter Experience.
Schema is from the Greek word meaning ‘form’ or ‘figure’ and is a formal representation of a data model which has integrity constraints controlling permissible data values.
Data munging, sometimes referred to as data wrangling, means taking data that’s stored in one format and changing it into another format.
Collect & Collate
Schema Application
Schema First: Raw Data -> Cleanse -> Schema -> Storage -> Analyse
Schema Last: Raw Data -> Triage -> Storage -> Schema -> Analyse
Data Cleansing …
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.
“Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done.”
Microsoft: 2012
Collect & Collate
Data Sources – Always Increasing
Gap
Collect & Collate
Data Cleansing - Doesn’t WORK
“Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems.” David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.
“Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization.” N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014.
“that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.” Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007.
Collect & Collate
Data Cleansing – Loss of Format
Input Date | Cleansed Date | Comment
20 July 2014 | 20-07-2014 | Australian Date
July-20-2014 | 20-07-2014 | American Format (mmm-dd-yyyy)
2014-20-07 | 20-07-2014 | Arabic Format (right to left)
20-07-14 | 20-07-2014 | Data Ambiguity
July 2014 | 01-07-2014 | Imputed Value
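The loss of format shown in the table can be reproduced with a short sketch. The format list and the function names `cleanse` and `triage` are assumptions made for illustration and are not taken from the talk's tooling. `cleanse` returns only the normalised date, so every input ends up looking the same; `triage` keeps the raw string and records which interpretation was applied, so ambiguity and imputation remain visible.

```python
from datetime import datetime

# Hypothetical list of candidate formats, tried in order (an assumption,
# chosen only so that the five inputs from the table can be parsed).
CANDIDATE_FORMATS = [
    ("%d %B %Y", "day month-name year"),
    ("%B-%d-%Y", "month-name-day-year"),
    ("%Y-%d-%m", "year-day-month"),
    ("%d-%m-%y", "day-month-two-digit-year"),
    ("%B %Y", "month-name year (day missing)"),
]


def cleanse(raw):
    """Schema-first cleansing: return only the normalised dd-mm-yyyy string."""
    for fmt, _label in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue
    return None


def triage(raw):
    """Triage: keep the raw value and store the interpretation beside it."""
    for fmt, label in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        return {
            "raw": raw,
            "interpreted": parsed.strftime("%d-%m-%Y"),
            "matched_format": label,
        }
    return {"raw": raw, "interpreted": None, "matched_format": None}


inputs = ["20 July 2014", "July-20-2014", "2014-20-07", "20-07-14", "July 2014"]
for value in inputs:
    print(cleanse(value), triage(value))
```

After cleansing, "July 2014" is indistinguishable from a record that really said 1 July 2014; the triaged record still shows that the day was never supplied.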
"If you torture the data long enough, it will confess.“
Clifton R. Musser
Collect & Collate
ETL vs Triage
ETL: Initiate, Extract, Determine Suitability?, Transform, Assessment?, Load, Report, Complete
Triage: Initiate, Triage, Load, Suitability?, Application, Verify?, Fuse, Resolve, Complete
Collect & Collate
We did our research …
Oracle’s BDA (Big Data Appliance)
Collect & Collate
Data Storage/Collation
Store the Data Semantically
Built on a defined taxonomy/ontology
Perfect to capture metadata
Searched for the perfect Triple-Store
Triple: Subject, Predicate, Object
Graph, List
Collect & Collate
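As a minimal sketch of the subject/predicate/object idea, the code below turns one tabular row into triples and keeps the source file as metadata on the same subject. The class `TripleStore`, the function `row_to_triples`, and the identifiers `rec:1` and `feed-a.csv` are hypothetical names used only for illustration; this is a toy in-memory structure, not the store described in the talk.

```python
from collections import defaultdict


def row_to_triples(row_id, row, source):
    """Turn one tabular row into (subject, predicate, object) triples.

    Values are kept as the raw strings received; nothing is cleansed.
    """
    triples = [(row_id, "source", source)]  # provenance stored as metadata
    for column, value in row.items():
        triples.append((row_id, column, value))
    return triples


class TripleStore:
    """Toy in-memory store indexed by subject and by (predicate, object)."""

    def __init__(self):
        self.by_subject = defaultdict(list)
        self.by_predicate_object = defaultdict(set)

    def add(self, triple):
        subject, predicate, obj = triple
        self.by_subject[subject].append((predicate, obj))
        self.by_predicate_object[(predicate, obj)].add(subject)

    def subjects(self, predicate, obj):
        """Return the subjects that carry the given predicate/object pair."""
        return sorted(self.by_predicate_object.get((predicate, obj), set()))


store = TripleStore()
row = {"First Name": "Jane", "Last Name": "Citizen", "Suburb": "Canberra"}
for triple in row_to_triples("rec:1", row, "feed-a.csv"):
    store.add(triple)

print(store.subjects("Last Name", "Citizen"))
```

Because every column becomes a predicate, a new data source with unfamiliar columns can be loaded without changing the storage layout.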
The Architecture
Collect & Collate
Analyse & Produce
Set Store
HBase
Historical Data
New Data
RDF / Modelling
Feeds
Data Exploration
Semantic Store
Disseminate
Index IIR
Index SOLR
BDA
Palantir
Search Assistant
Data Flow
Data Exploitation
SPARQL
R Language
Apache Pig
Schema Last …
‘Triaged’ Data -> Schema
First Name, Middle Name, Last Name -> Full-Name
Street Number, Street Name, Suburb, State, Postcode -> Full-Address
Collect & Collate
Models
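The mapping from triaged columns to Full-Name and Full-Address can be sketched as a schema that is applied when the data is read rather than when it is loaded. The names `PERSON_SCHEMA` and `apply_schema` are hypothetical, and joining the parts with a single space is an assumption; the stored record is left untouched.

```python
# Hypothetical schema: each model attribute lists the triaged columns it is
# composed from, in order. Column names follow the slide.
PERSON_SCHEMA = {
    "Full-Name": ["First Name", "Middle Name", "Last Name"],
    "Full-Address": ["Street Number", "Street Name", "Suburb", "State", "Postcode"],
}


def apply_schema(record, schema):
    """Build a model view of a triaged record without altering the record."""
    view = {}
    for attribute, columns in schema.items():
        parts = []
        for column in columns:
            value = (record.get(column) or "").strip()
            if value:  # missing or empty columns are simply skipped
                parts.append(value)
        if parts:
            view[attribute] = " ".join(parts)
    return view


triaged = {
    "First Name": "Jane",
    "Last Name": "Citizen",
    "Suburb": "Canberra",
    "State": "ACT",
    "Postcode": "2600",
}
print(apply_schema(triaged, PERSON_SCHEMA))
```

A different model can be applied to the same stored record later by passing a different schema, which is the practical benefit of deferring the schema.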
ACC Search Engines – ‘Smackdown’
Feature SOLR IIR
License Apache License Commercial
Storage Inverted List Third-party Database
Support Google Like search Next Release
Score Model Inverse Document Frequency Normalized Score
Result Pagination
Homophone Support Can use synonym support
Phoneme Search
Spread indexes across multiple nodes
Schema-less Support
Programming Interface Rest SOAP - API
Geo-spatial
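The "Inverse Document Frequency" score model listed in the comparison can be illustrated with the textbook formula log(N / df). This is a sketch of the general idea only; it is not the exact formula used by SOLR or IIR, and the function name and sample documents are made up for illustration.

```python
import math


def inverse_document_frequency(term, documents):
    """Textbook IDF: log(N / df), where df counts documents containing term."""
    doc_freq = sum(1 for tokens in documents if term in tokens)
    if doc_freq == 0:
        return 0.0  # assumption: a term that never occurs contributes nothing
    return math.log(len(documents) / doc_freq)


# Each document is represented as a set of lower-cased tokens.
documents = [
    {"john", "smith", "canberra"},
    {"jane", "smith", "sydney"},
    {"john", "citizen", "canberra"},
    {"mary", "smith", "perth"},
]

for term in ("smith", "citizen"):
    print(term, round(inverse_document_frequency(term, documents), 3))
```

A rare token such as an unusual surname scores higher than a common one, so matches on it rank nearer the top of the results.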
Collect & Collate
Collect & Collation Tool
Collect & Collate
Bongo – Exploration
Analyse & Produce
Palantir – Semantic Interface
Report & Disseminate
User Reaction
Time to Triage: < 1 Hour, > 1 Hour < 24 Hours, > 24 Hours
General Size % - Megabytes: < 1, > 1 < 100, > 100 < 1000, > 1000
• Developed a Palantir Plugin to search the Fusion Data Holding
• Bulk Matching was a great success
• In general, user reaction has been positive
• Time to Triage was usually under an hour, whereas cleansing could take weeks!
Australian Crime Commission 2015
Ingestion Rate – The Improvement
Collect & Collate
Observations…
The Bulk Matcher
Performance and Reliability
Interaction with Palantir
Configuration over Customisation
Search for the ‘Single Source of Truth’
Golden Record
Acceptance of the Schema Last Approach
Overwhelmed by Search Results
Further Reading and Contacts
Strategic Thinking in Criminal Intelligence, Jerry H Ratcliffe, The Federation Press, 2009, ISBN 978 186287 734-4
Intelligence-Led Policing, Jerry Ratcliffe, Routledge, 2008, ISBN 978-1-843292-339-8
Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Peter Christen, Springer, 2012, ISBN 978-3-642-31163-5
Foundations of Semantic Web Technologies, Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph, CRC Press, 2010, ISBN 978-1-4200-9050-5
Big Data – A revolution that will transform how we live, work, and think, Viktor Mayer-Schönberger and Kenneth Cukier, HMH, 2013, ISBN 978-0-544-00269-2
The Schema Last Approach to Data Fusion, Neil Brittliff and Dharmendra Sharma, AusDM 2014
A Triple Store Implementation to support Tabular Data, Neil Brittliff and Dharmendra Sharma, AusDM 2014
Australian Institute of Criminology http://www.aic.gov.au
University of Canberra http://www.Canberra.edu.au