Graph Connect Europe: From Zero To Import

From Zero to Graph: Import
Mark Needham (@markhneedham)
7th May 2015

Neo Technology, Inc Confidential#graphconnect

Chicago Crime dataset

The goal

Chicago Crime CSV file -> imported into -> Neo4j

Exploring the data

LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row
LIMIT 1

Sketch a rough initial model

Import a sample: Crimes

LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description,
  caseNumber: row.`Case Number`,
  arrest: row.Arrest,
  domestic: row.Domestic
});

Import a sample: Crime Types

LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:CrimeType { name: row.`Primary Type` });

Import a sample: Crimes -> Crime Types

LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID, description: row.Description })
MATCH (crimeType:CrimeType { name: row.`Primary Type` })
MERGE (crime)-[:TYPE]->(crimeType);

Add indexes

CREATE INDEX ON :Label(property)

CREATE INDEX ON :Crime(id);
CREATE INDEX ON :Location(name);
CREATE INDEX ON :CrimeType(name);
...

Periodic Commit

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description })

•  Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
•  USING PERIODIC COMMIT flushes the transaction after a certain number of rows
•  Default is 1000 rows but it’s configurable
•  Currently only works with LOAD CSV
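The batching idea behind periodic commit can be illustrated outside Neo4j. The following is a conceptual sketch in plain Python, not Neo4j's implementation: the names `batches`, `import_rows` and the `commit` callback are hypothetical, and the batch size of 1000 simply mirrors the default mentioned above. The point is that only one batch of rows is held in memory at a time.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(rows: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def import_rows(rows: Iterable[dict],
                commit: Callable[[List[dict]], None],
                size: int = 1000) -> int:
    """Hand each batch to `commit` and return the number of commits made.

    `commit` is a hypothetical stand-in for committing a transaction.
    """
    commits = 0
    for chunk in batches(rows, size):
        commit(chunk)
        commits += 1
    return commits
```

Passing a generator as `rows` keeps memory use proportional to the batch size rather than to the size of the file.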

LOAD CSV in summary
•  ETL power tool
•  Built into Neo4j since version 2.1
•  Can load data from any URL
•  Good for medium-sized data (up to 10M rows)

Bulk loading an initial data set
•  Introducing the Neo4j Import Tool
•  Find it in the bin folder of your Neo4j download
•  Used to load large initial data sets
•  Skips the transactional layer of Neo4j and writes store files directly

Expects files in a certain format

Nodes
  :ID(Crime) | :LABEL | description
  :ID(Beat) | :LABEL

Relationships
  :START_ID(Crime) | :END_ID(Beat) | :TYPE
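To make the header layout concrete, here is a small sketch that builds node and relationship files with those headers using Python's standard `csv` module. The helper `to_csv`, the data rows, and the relationship type `ON_BEAT` are all made up for illustration; only the header fields come from the slide.

```python
import csv
import io
from typing import List


def to_csv(header: List[str], rows: List[List[str]]) -> str:
    """Render a header row plus data rows as comma-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Node files: one ID column (with its ID space), a label, then properties.
crimes = to_csv([":ID(Crime)", ":LABEL", "description"],
                [["1", "Crime", "EXAMPLE DESCRIPTION"]])  # illustrative row
beats = to_csv([":ID(Beat)", ":LABEL"],
               [["0101", "Beat"]])                        # illustrative row

# Relationship file: start ID, end ID, relationship type.
crimes_beats = to_csv([":START_ID(Crime)", ":END_ID(Beat)", ":TYPE"],
                      [["1", "0101", "ON_BEAT"]])         # hypothetical type name
```

The ID spaces in parentheses (`Crime`, `Beat`) are what tie the relationship columns back to the matching node files.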

What we have…

Translation Phase required

Chicago Crime CSV file -> Translation Phase -> Neo4j ready CSV files

Spark all the things

Chicago Crime CSV file -> processed by -> Spark Job -> spits out -> Neo4j ready CSV files -> imported into -> Neo4j

The Spark Job
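The job's source code did not survive in this text, so the following is only a rough stand-in for the kind of translation it performs, written in plain Python rather than Spark. It assumes the `ID`, `Description` and `Primary Type` columns used in the LOAD CSV queries; the function name `translate` and the rows in `SAMPLE` are hypothetical. The key step is deduplication: each crime becomes one node, but each primary type must appear only once in its node file.

```python
import csv
import io
from typing import List, Tuple

# Made-up sample rows using column names from the LOAD CSV queries.
SAMPLE = (
    "ID,Description,Primary Type\n"
    "1,FROM BUILDING,THEFT\n"
    "2,SIMPLE,BATTERY\n"
    "3,OVER $500,THEFT\n"
)


def translate(source_csv: str) -> Tuple[List[list], List[list], List[list]]:
    """Split raw rows into crime nodes, distinct type nodes, and relationships."""
    crimes, rels = [], []
    primary_types = {}  # dict keys deduplicate and keep first-seen order
    for row in csv.DictReader(io.StringIO(source_csv)):
        crimes.append([row["ID"], "Crime", row["Description"]])
        primary_types[row["Primary Type"]] = None
        rels.append([row["ID"], row["Primary Type"], "TYPE"])
    type_nodes = [[name, "CrimeType"] for name in primary_types]
    return crimes, type_nodes, rels


crime_rows, type_rows, rel_rows = translate(SAMPLE)
```

A distributed job would do the same thing with a distinct operation over the column instead of an in-memory dict.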

Submitting the Spark Job

./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar

real 1m25.506s
user 8m2.183s
sys 0m24.267s

The generated files

$ ls -1 /tmp/*.csv
/tmp/beats.csv
/tmp/crimeDates.csv
/tmp/crimes.csv
/tmp/crimesBeats.csv
/tmp/crimesDates.csv
/tmp/crimesLocations.csv
/tmp/crimesPrimaryTypes.csv
/tmp/dates.csv
/tmp/locations.csv
/tmp/primaryTypes.csv

Importing into Neo4j

DATA=/tmp
NEO=./neo4j-enterprise-2.2.1
$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace

IMPORT DONE in 36s 208ms

This talk brought to you by…

And that’s it…
