EMEA Marketing - March 2015!
From Zero to Graph: Import
Mark Needham (@markhneedham)
7th May 2015
Neo Technology, Inc Confidential #graphconnect
Chicago Crime dataset
The goal
Chicago Crime CSV file -> imported into -> Neo4j
Exploring the data
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
RETURN row
LIMIT 1
Sketch a rough initial model
Import a sample: Crimes
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description,
  caseNumber: row.`Case Number`,
  arrest: row.Arrest,
  domestic: row.Domestic
});
Import a sample: Crime Types
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MERGE (:CrimeType { name: row.`Primary Type` });
Import a sample: Crimes -> Crime Types
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
WITH row LIMIT 100
MATCH (crime:Crime { id: row.ID, description: row.Description })
MATCH (crimeType:CrimeType { name: row.`Primary Type` })
MERGE (crime)-[:TYPE]->(crimeType);
Add indexes
CREATE INDEX ON :Label(property)
CREATE INDEX ON :Crime(id);
CREATE INDEX ON :CrimeType(name);
CREATE INDEX ON :Location(name);
...
Periodic Commit
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv" AS row
MERGE (crime:Crime { id: row.ID, description: row.Description });
• Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
• USING PERIODIC COMMIT flushes the transaction after a certain number of rows
• Default is 1000 rows but it’s configurable
• Currently only works with LOAD CSV
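The batching idea behind USING PERIODIC COMMIT can be illustrated outside Neo4j. What follows is a minimal Python sketch of "flush after every N rows", not Neo4j's implementation. The names `batches`, `load_in_batches` and the `write_batch` callback are hypothetical and stand in for whatever commits one transaction's worth of rows.

```python
from itertools import islice


def batches(rows, size=1000):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(rows, write_batch, size=1000):
    """Hand rows to write_batch one chunk at a time and count the chunks.

    write_batch is a hypothetical callback standing in for "commit one
    transaction"; only one chunk of rows is held in memory at a time.
    """
    commits = 0
    for chunk in batches(rows, size):
        write_batch(chunk)
        commits += 1
    return commits
```

With 2,500 rows and the default size of 1000, `write_batch` is called three times, so memory use is bounded by the chunk size rather than by the size of the file.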
LOAD CSV in summary
• ETL power tool
• Built into Neo4j since version 2.1
• Can load data from any URL
• Good for medium-sized data (up to 10M rows)
Bulk loading an initial data set
• Introducing the Neo4j Import Tool
• Find it in the bin folder of your Neo4j download
• Used to load large initial data sets
• Skips the transactional layer of Neo4j and writes store files directly
Expects files in a certain format
Nodes
  :ID(Crime)  :LABEL  description
  :ID(Beat)   :LABEL
Relationships
  :START_ID(Crime)  :END_ID(Beat)  :TYPE
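To make the header layout concrete, here is a small Python sketch that writes one node file per label and one relationship file using the headers above. The data values (`c1`, `b1`, `EXAMPLE DESCRIPTION`) and the relationship type `ON_BEAT` are made up for illustration, and `write_csv` is a hypothetical helper, not part of the import tool.

```python
import csv
import io


def write_csv(header, rows):
    """Return CSV text: the header row followed by the data rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Node files: an ID column scoped to an ID space, a label, then properties.
crimes = write_csv([":ID(Crime)", ":LABEL", "description"],
                   [["c1", "Crime", "EXAMPLE DESCRIPTION"]])  # made-up row
beats = write_csv([":ID(Beat)", ":LABEL"],
                  [["b1", "Beat"]])                           # made-up row

# Relationship file: start and end IDs refer back to the ID spaces above.
crimes_beats = write_csv([":START_ID(Crime)", ":END_ID(Beat)", ":TYPE"],
                         [["c1", "b1", "ON_BEAT"]])  # ON_BEAT is hypothetical

print(crimes_beats)
```

The ID space in parentheses is what lets a relationship row say which node file each ID belongs to.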
What we have…
Chicago Crime CSV file -> Translation Phase (required) -> Neo4j ready CSV files
Spark all the things
Chicago Crime CSV file -> processed by -> Spark Job -> spits out -> Neo4j ready CSV files -> imported into -> Neo4j
The Spark Job
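The kind of translation such a job performs can be sketched in plain Python. This is an illustration only: it swaps Spark for the standard library and is not the talk's actual job. The column names `ID`, `Description` and `Primary Type` and the `TYPE` relationship come from the LOAD CSV examples earlier. The sample rows, the function name `translate`, and the use of the type name as the `CrimeType` node ID are assumptions.

```python
import csv
import io

# Made-up sample in the shape of the source file (three columns only).
SAMPLE = (
    "ID,Description,Primary Type\n"
    "1,EXAMPLE A,THEFT\n"
    "2,EXAMPLE B,BATTERY\n"
    "3,EXAMPLE C,THEFT\n"
)


def translate(rows):
    """Split source rows into crime nodes, distinct type nodes, and relationships."""
    crimes, types, rels = [], [], []
    seen_types = set()
    for row in rows:
        crime_id, crime_type = row["ID"], row["Primary Type"]
        crimes.append([crime_id, "Crime", row["Description"]])
        if crime_type not in seen_types:  # emit each type node only once
            seen_types.add(crime_type)
            types.append([crime_type, "CrimeType"])
        rels.append([crime_id, crime_type, "TYPE"])
    return crimes, types, rels


crimes, types, rels = translate(csv.DictReader(io.StringIO(SAMPLE)))
print(len(crimes), len(types), len(rels))
```

The step that matters is deduplication: every crime row yields a node and a relationship, but each distinct type must appear only once in its node file.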
Submitting the Spark Job
./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar

real 1m25.506s
user 8m2.183s
sys 0m24.267s
The generated files
$ ls -1 /tmp/*.csv
/tmp/beats.csv
/tmp/crimeDates.csv
/tmp/crimes.csv
/tmp/crimesBeats.csv
/tmp/crimesDates.csv
/tmp/crimesLocations.csv
/tmp/crimesPrimaryTypes.csv
/tmp/dates.csv
/tmp/locations.csv
/tmp/primaryTypes.csv
Importing into Neo4j
DATA=/tmp
NEO=./neo4j-enterprise-2.2.1
$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace

IMPORT DONE in 36s 208ms
This talk brought to you by…
And that’s it…