Neo4Dogs Innovation Intelligent Systems Software Engineering Graph Cafe, Teknologihuset, Oslo, 27.06.2014 Totto-14 @javatotto / [email protected]

Neo4Dogs - a data quality platform approach with SolrCloud and graphs

DESCRIPTION

I'll give a talk on how we've used Neo4j at NKK for data quality analysis and corrections, as well as breed analysis and more. Throughput (100-9 000 dogs/second) and query rate (200-20 000 queries/second) are the important metrics. :)

Page 1: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

Neo4Dogs

Innovation

Intelligent Systems

Software Engineering

Graph Cafe, Teknologihuset, Oslo, 27.06.2014

Totto-14

@javatotto / [email protected]

Page 2: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

A Global Leader

AMERICAS

EUROPE

ASIA

Bringing our customers' projects to life and boosting their performance through technology and innovation

€ +1 633 m REVENUES in 2013

+20 000 EMPLOYEES in 2013

+20 COUNTRIES

Page 3: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

R&D and Innovation

For 30 years, Altran has had a close relationship with innovation.

Where creative ideas become a reality, Altran consultants step up to transform ideas into innovative solutions that can enable technological progress.

In this way, Altran has contributed to major technological advances in recent decades: speed, precision, security, communication, practicality, interoperability, artificial intelligence...

AEGT: the world's most powerful electric car

Altran was responsible for designing and engineering the electric transmission on this car, capable of reaching speeds of 300 km/h.

Solar Impulse: the first plane to fly on solar energy alone

Since 2003, Altran experts have dedicated their skills to bringing about this formidable technical and human achievement.

The Airport of the Future: outlining a ‘friend-lean’ space in 2040

Altran develops revolutionary concepts for airports responding to long-term changes in the industry.

Page 4: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

Agenda

● Situation analysis

● From dog register via case management to dog-hub

● The platform

● Performance and some metrics

Page 5: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

Initial analysis

● From register to case management – over 20 years of legacy

– Dog information spread across 30+ relational tables

– 2-3 weeks of work to retrieve «a dog» with some info (every time)

– «impossible» to store new types of data/information on a dog

– Data was hidden/unavailable to people -> «data rot»

– Cascading costs of change and new features

● Recognized the need for a different approach

– But how to get out of the squeeze was not obvious

– Limited technical skills, system knowledge and functional knowledge

– No time, capacity or money to do a «full rewrite»

● We selected a bottom-up, data-first platform approach, with strong capabilities for continuous data quality processes and strong support for semi-structured data.

Page 6: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

From dog register to case management to dog hub

● Quick and easy access to individual dogs

● Scale - 10 to 50 integrations with other systems (hub)

● Handling individual dogs of «questionable» data quality

● Easily extendable to store more data on any individual

– Semi-structured strategy for persistence/storage
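A minimal sketch of what a semi-structured record could look like, to illustrate the "easily extendable" point. It is not NKK's actual data model: the class name, field names and the sample ID are all hypothetical. The idea shown is a small fixed core plus an open-ended attribute map, so new kinds of data can be stored on a dog without a schema change.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DogRecord:
    # Small fixed core that every integration can rely on (hypothetical names).
    dog_id: str
    name: str
    breed: str
    # Open-ended attributes: new kinds of data land here without a schema change.
    extra: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize core and extra fields into one flat JSON document."""
        doc = dict(self.extra)
        # Core fields are written last so an extra key can never overwrite them.
        doc.update({"dog_id": self.dog_id, "name": self.name, "breed": self.breed})
        return json.dumps(doc, sort_keys=True)


dog = DogRecord("EXAMPLE-001", "Balder", "Dunker")  # made-up sample values
dog.extra["hip_score"] = "A"                         # a new data type, added ad hoc
dog.extra["shows"] = [{"year": 2013, "place": 2}]
print(dog.to_json())
```

A flat JSON document of this kind is also easy to hand to a search index, which fits the search-and-lookup role the deck assigns to SolrCloud.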

Page 7: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

Top level architecture

Page 8: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

The platform we built

● Dog search & lookup

– SolrCloud with "json_full"

● DogPopulationService

– Pedigree, population structure, breed data

– Data error, data deviation, data missing -> DogFixer

● DogIDMapper (multi-source, multi-master, map different ID-schemes)

● DogCrawler

– Is it possible to find additional data to fix this individual?

● DogFixer

– Is it possible to statistically find the right answer?

– Manual process in some corner-cases / difficult cases

● DogServiceREST

– verify & merge, writeback updates

– «tailing» datasources of dog information
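To make the pedigree and "data missing" items above more concrete, here is a small sketch of a pedigree walk over parent links. It is an illustration only: the real platform uses Neo4j and its code is not public, so the in-memory dictionary, the dog IDs and both function names are assumptions. The walk collects ancestors generation by generation and counts unknown parent slots, the kind of gap a component like DogFixer would be asked to resolve.

```python
from collections import deque

# Hypothetical parent links: dog id -> (sire id, dam id); None means unknown.
PARENTS = {
    "pup": ("sire", "dam"),
    "sire": ("grandsire_a", "granddam_a"),
    "dam": ("grandsire_a", None),  # unknown dam: a data gap
    "grandsire_a": (None, None),
    "granddam_a": (None, None),
}


def ancestors(dog_id, max_generations):
    """Return {ancestor id: generation of first appearance}, breadth-first."""
    found = {}
    queue = deque([(dog_id, 0)])
    while queue:
        current, generation = queue.popleft()
        if generation == max_generations:
            continue  # do not look beyond the requested depth
        for parent in PARENTS.get(current, (None, None)):
            if parent is not None and parent not in found:
                found[parent] = generation + 1
                queue.append((parent, generation + 1))
    return found


def missing_parent_slots(dog_id, max_generations):
    """Count unknown parent slots among the distinct dogs in the pedigree."""
    dogs = {dog_id, *ancestors(dog_id, max_generations - 1)}
    return sum(
        parent is None
        for dog in dogs
        for parent in PARENTS.get(dog, (None, None))
    )


print(ancestors("pup", 2))
print(missing_parent_slots("pup", 2))
```

In a relational schema each generation of this walk costs additional joins, whereas in a graph store it is a traversal along parent relationships.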

Page 9: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

Some numbers

● 2 mill reqs/hour

● 10 mill reqs/24 hours

● Breed calculations went from taking «months» to «instant»

– 200-500 joins per individual, 1000/year, 10 years = 2-8 sec

● Latency: 0.2 sec, 99.7% of reqs

● DogIDMapper: 4000 dogs/sec

● DogGraph: 3000 dogs/sec

● DogFixer: 10-15 dogs/sec

● DogCrawler: 100-200 dogs/sec
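The figures above are easier to compare once converted to common units. The short calculation below is a sketch of that arithmetic. Two things in it are assumptions rather than facts from the slides: reading the hourly figure as a peak hour, and the population size of one million dogs used for the batch-time estimate.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(requests, seconds):
    """Average request rate over a period."""
    return requests / seconds


def hours_to_process(dogs, dogs_per_second):
    """Wall-clock hours for one full pass at a steady throughput."""
    return dogs / dogs_per_second / SECONDS_PER_HOUR


hourly_rate = per_second(2_000_000, SECONDS_PER_HOUR)   # about 556 requests/sec
daily_rate = per_second(10_000_000, SECONDS_PER_DAY)    # about 116 requests/sec
print(f"hourly figure: {hourly_rate:.0f} req/s, daily figure: {daily_rate:.0f} req/s")

POPULATION = 1_000_000  # assumed population size, not stated in the slides
for component, rate in [("DogIDMapper", 4000), ("DogGraph", 3000),
                        ("DogCrawler", 100), ("DogFixer", 10)]:
    hours = hours_to_process(POPULATION, rate)
    print(f"{component}: {hours:.2f} h at {rate} dogs/sec")
```

Under these assumptions the busiest hour runs at nearly five times the daily average rate, and a full pass through the slowest component takes about a day while the fastest finishes in minutes.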

Page 10: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

Handle huge spikes

Page 11: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

And survive «issues» with low latency

Page 12: Neo4Dogs - a data quality platform approach with SolrCloud and graphs

Try it out:

* http://dogsearch.nkk.no

* http://dogpopulation.nkk.no/

* http://dogpopulation.nkk.no/ras/?breed=Dunker

* http://dogpopulation.nkk.no/dogpopulation/concurrent/executor/status

* Code: by request :)