From Zero to Graph: Import, Mark Needham (@markhneedham), 7th May 2015

Graph Connect Europe: From Zero To Import


Page 1: Graph Connect Europe: From Zero To Import

From Zero to Graph: Import
Mark Needham (@markhneedham)
7th May 2015

Page 2: Graph Connect Europe: From Zero To Import

Neo Technology, Inc Confidential #graphconnect

Chicago Crime dataset

Page 3: Graph Connect Europe: From Zero To Import

Chicago Crime dataset

Page 4: Graph Connect Europe: From Zero To Import

The goal

Chicago Crime CSV file -> imported into -> Neo4j

Page 5: Graph Connect Europe: From Zero To Import

Exploring the data

Page 6: Graph Connect Europe: From Zero To Import

Exploring the data

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row
LIMIT 1

Page 7: Graph Connect Europe: From Zero To Import

Exploring the data

Page 8: Graph Connect Europe: From Zero To Import

Exploring the data

Page 9: Graph Connect Europe: From Zero To Import

Sketch a rough initial model

Page 10: Graph Connect Europe: From Zero To Import

Import a sample: Crimes

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description,
  caseNumber: row.`Case Number`,
  arrest: row.Arrest,
  domestic: row.Domestic
});

Page 11: Graph Connect Europe: From Zero To Import

Import a sample: Crime Types

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:CrimeType { name: row.`Primary Type` });

Page 12: Graph Connect Europe: From Zero To Import

Import a sample: Crimes -> Crime Types

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID, description: row.Description })
MATCH (crimeType:CrimeType { name: row.`Primary Type` })
MERGE (crime)-[:TYPE]->(crimeType);

Page 13: Graph Connect Europe: From Zero To Import

Add indexes

CREATE INDEX ON :Label(property)

Page 14: Graph Connect Europe: From Zero To Import

Add indexes

CREATE INDEX ON :Label(property)

CREATE INDEX ON :Crime(id);
CREATE INDEX ON :Location(name);
CREATE INDEX ON :CrimeType(name);
...

Page 15: Graph Connect Europe: From Zero To Import

Periodic Commit

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description })

Page 16: Graph Connect Europe: From Zero To Import

Periodic Commit

•  Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
•  USING PERIODIC COMMIT flushes the transaction after a certain number of rows
•  Default is 1000 rows but it’s configurable
•  Currently only works with LOAD CSV
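The batching idea behind periodic commit can be illustrated outside Neo4j. The sketch below is only an analogy in plain Python, not Neo4j's implementation: the function name `commit_in_batches` is hypothetical, and each yielded list stands in for one committed transaction, so only one batch of rows is held in memory at a time.

```python
def commit_in_batches(rows, batch_size=1000):
    """Yield lists of at most batch_size rows; each list stands in for one transaction."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch      # "commit" a full batch
            batch = []       # start a fresh transaction
    if batch:
        yield batch          # "commit" the final partial batch


# 2500 rows with the default batch size of 1000
batch_sizes = [len(batch) for batch in commit_in_batches(range(2500))]
print(batch_sizes)  # [1000, 1000, 500]
```

With 2500 input rows the work is split into three batches instead of one large transaction.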

Page 17: Graph Connect Europe: From Zero To Import

LOAD CSV in summary

•  ETL power tool
•  Built into Neo4j since version 2.1
•  Can load data from any URL
•  Good for medium size data (up to 10M rows)

Page 18: Graph Connect Europe: From Zero To Import

Bulk loading an initial data set

•  Introducing the Neo4j Import Tool
•  Find it in the bin folder of your Neo4j download
•  Used to load large initial data sets
•  Skips the transactional layer of Neo4j and writes store files directly

Page 19: Graph Connect Europe: From Zero To Import

Expects files in a certain format

Nodes
:ID(Crime)   :LABEL   description
:ID(Beat)   :LABEL

Relationships
:START_ID(Crime)   :END_ID(Beat)   :TYPE
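A minimal sketch of files in this layout, written with Python's csv module. The column headers and the file names crimes.csv, beats.csv and crimesBeats.csv come from the slides; the sample rows, the label values, the `ON_BEAT` relationship type and the function name `write_import_files` are hypothetical and only illustrate the shape of the data.

```python
import csv
import os
import tempfile


def write_import_files(directory):
    """Write tiny node and relationship CSV files using the headers from the slide."""
    files = {
        "crimes.csv": [
            [":ID(Crime)", ":LABEL", "description"],
            ["1001", "Crime", "SIMPLE"],           # hypothetical sample rows
            ["1002", "Crime", "OVER $500"],
        ],
        "beats.csv": [
            [":ID(Beat)", ":LABEL"],
            ["0624", "Beat"],
        ],
        "crimesBeats.csv": [
            [":START_ID(Crime)", ":END_ID(Beat)", ":TYPE"],
            ["1001", "0624", "ON_BEAT"],           # hypothetical relationship type
            ["1002", "0624", "ON_BEAT"],
        ],
    }
    paths = {}
    for name, rows in files.items():
        path = os.path.join(directory, name)
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        paths[name] = path
    return paths


out_dir = tempfile.mkdtemp()
paths = write_import_files(out_dir)

# Read one file back to check the header row and the relationship rows.
with open(paths["crimesBeats.csv"], newline="") as handle:
    rel_rows = list(csv.reader(handle))
```

Each relationship row refers to node IDs from the two node files, scoped by the ID group named in the header (Crime or Beat).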

Page 20: Graph Connect Europe: From Zero To Import

What we have…

Page 21: Graph Connect Europe: From Zero To Import

Translation Phase required

Chicago Crime CSV file -> Translation Phase -> Neo4j ready CSV files

Page 22: Graph Connect Europe: From Zero To Import

Spark all the things

Chicago Crime CSV file -> processed by -> Spark Job -> spits out -> Neo4j ready CSV files -> imported into -> Neo4j
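The Spark job's code does not survive in this transcript. As a rough stand-in, here is a minimal plain-Python sketch of the same kind of translation, with no Spark involved. It assumes the source file has a `Beat` column alongside the `ID` and `Description` columns used in the Cypher queries; the sample rows, the `ON_BEAT` relationship type and the function name `translate` are hypothetical.

```python
import csv
import io

# Hypothetical extract of the source file; real rows have many more columns.
SAMPLE = """ID,Description,Primary Type,Beat
1001,SIMPLE,BATTERY,0624
1002,OVER $500,THEFT,0624
1003,TO VEHICLE,CRIMINAL DAMAGE,1112
"""


def translate(source):
    """Split flat crime rows into crime nodes, beat nodes and crime-to-beat relationships."""
    crimes, beats, crimes_beats = [], [], []
    seen_beats = set()
    for row in csv.DictReader(source):
        crimes.append([row["ID"], "Crime", row["Description"]])
        if row["Beat"] not in seen_beats:      # emit each distinct beat once
            seen_beats.add(row["Beat"])
            beats.append([row["Beat"], "Beat"])
        crimes_beats.append([row["ID"], row["Beat"], "ON_BEAT"])
    return crimes, beats, crimes_beats


crimes, beats, crimes_beats = translate(io.StringIO(SAMPLE))
```

The key step is de-duplication: every crime becomes one node row and one relationship row, while a beat shared by many crimes is written only once. Header rows in the format the import tool expects would be added when the lists are written to files.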

Page 23: Graph Connect Europe: From Zero To Import

The Spark Job

Page 24: Graph Connect Europe: From Zero To Import

The Spark Job

Page 25: Graph Connect Europe: From Zero To Import

Submitting the Spark Job

./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar

real 1m25.506s
user 8m2.183s
sys 0m24.267s

Page 27: Graph Connect Europe: From Zero To Import

The generated files

$ ls -1 /tmp/*.csv
/tmp/beats.csv
/tmp/crimeDates.csv
/tmp/crimes.csv
/tmp/crimesBeats.csv
/tmp/crimesDates.csv
/tmp/crimesLocations.csv
/tmp/crimesPrimaryTypes.csv
/tmp/dates.csv
/tmp/locations.csv
/tmp/primaryTypes.csv

Page 28: Graph Connect Europe: From Zero To Import

Importing into Neo4j

DATA=/tmp
NEO=./neo4j-enterprise-2.2.1

$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace

IMPORT DONE in 36s 208ms

Page 29: Graph Connect Europe: From Zero To Import

This talk brought to you by…

Page 30: Graph Connect Europe: From Zero To Import

And that’s it…