Graphs are a perfect solution to organize information and to determine the relatedness of content. Neo4j Developer Evangelist Kenny Bastani will discuss using Neo4j to perform document classification and text classification using a graph database. Kenny will demonstrate how to build a scalable architecture for classifying natural language text using a graph-based algorithm called Hierarchical Pattern Recognition. This approach encompasses a set of techniques familiar to Deep Learning practitioners.
(graphs)-[:are]->(everywhere)
Document Classification with Neo4j
© All Rights Reserved 2014 | Neo Technology, Inc.
@kennybastani
Neo4j Developer Evangelist
Agenda
• Introduction to Neo4j
• Introduction to Graph-based Document Classification
• Graph-based Hierarchical Pattern Recognition
• Generating a Vector Space Model for Recommendations
• Graphify for Neo4j
• U.S. Presidential Speech Transcript Analysis
Introduction to Neo4j
The Property Graph Data Model
[Figure: nodes John, Sally, and Graph Databases Book, connected by two Friend Of relationships and two Has Read relationships]
Node: name: John, age: 27
Node: name: Sally, age: 32
Node: title: Graph Databases, authors: Ian Robinson, Jim Webber
Relationship: FRIEND_OF, since: 01/09/2013
Relationship: HAS_READ, on: 2/03/2013, rating: 5
Relationship: HAS_READ, on: 02/09/2013, rating: 4
Relationship: FRIEND_OF, since: 01/09/2013
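As a minimal sketch, the property graph model can be imitated with plain Python dictionaries: nodes and relationships both carry key-value properties, and a query is a walk over relationships. The `nodes` and `relationships` structures and the `neighbors` helper are illustrative names, not part of Neo4j. Only the two FRIEND_OF relationships are modeled, on the assumption that they run in both directions between John and Sally; HAS_READ relationships would be stored the same way.

```python
# Nodes: each has an id and a dictionary of properties.
nodes = {
    "john": {"name": "John", "age": 27},
    "sally": {"name": "Sally", "age": 32},
    "book": {"title": "Graph Databases", "authors": "Ian Robinson, Jim Webber"},
}

# Relationships: directed, typed, and able to carry their own properties.
relationships = [
    {"start": "john", "type": "FRIEND_OF", "end": "sally", "since": "01/09/2013"},
    {"start": "sally", "type": "FRIEND_OF", "end": "john", "since": "01/09/2013"},
]

def neighbors(node_id, rel_type):
    """Return the properties of every node reached from node_id via rel_type."""
    return [
        nodes[rel["end"]]
        for rel in relationships
        if rel["start"] == node_id and rel["type"] == rel_type
    ]

friend_names = [node["name"] for node in neighbors("john", "FRIEND_OF")]
print(friend_names)
```

In Neo4j itself the same lookup would be expressed as a Cypher pattern match rather than a list comprehension.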
The Relational Table Model
Customers
143 Alice
Accounts
326 $100
725 $632
981 $212
Customer_Accounts
143 981
143 725
143 326
The Neo4j Browser
http://localhost:7474/
Neo4j Browser - finding help
Execute Cypher, Visualize
Introduction to Document Classification
Document Classification
Automatically assign a document to one or more classes
Documents may be classified according to their subjects or according to other attributes
Automatically classify unlabeled documents into a set of relevant classes using labeled training data
Example Use Cases for Document Classification
Sentiment Analysis for Movie Reviews
Scenario: A movie website allows users to submit reviews describing what they either liked or disliked about a particular movie.
Problem: The user reviews are unstructured text. How do I automatically generate a score indicating whether a review was positive or negative?
Solution: Train a natural language parsing model on a dataset of previous reviews that have been labeled as either positive or negative.
Recommend Relevant Tags
Scenario: A Q/A website allows users to submit questions and receive answers from other users.
Problem: Users sometimes do not know which tags to apply to their questions to make them easier to discover and answer.
Solution: Automatically recommend the most relevant tags for a question by classifying its text with a model trained on previous questions.
Recommend Similar Articles
Scenario: A news website provides hundreds of new articles a day to users on a broad range of topics.
Problem: The site needs to increase user engagement and time spent on the site.
Solution: Train natural language parsing models for daily articles in order to provide recommendations for highly relevant articles at the bottom of each page.
How Automated Document Classification Works
Step 1: Create a Training Dataset
Supervised Learning
Assign a set of labels that describes the document’s text
[Figure: documents connected to Label X, Label Y, and Label Z]
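A training dataset of this kind can be sketched as a list of documents, each paired with the labels assigned to it. The review texts, the `positive`/`negative` labels, and the `group_by_label` helper below are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict

# Hypothetical training data: each document's text is paired with its labels.
training_data = [
    {"text": "The plot was gripping and the acting superb.", "labels": ["positive"]},
    {"text": "A dull script and a wasted cast.", "labels": ["negative"]},
    {"text": "Superb soundtrack, gripping finale.", "labels": ["positive"]},
]

def group_by_label(dataset):
    """Collect the texts that belong to each label."""
    by_label = defaultdict(list)
    for doc in dataset:
        for label in doc["labels"]:
            by_label[label].append(doc["text"])
    return dict(by_label)

docs_by_label = group_by_label(training_data)
print({label: len(texts) for label, texts in docs_by_label.items()})
```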
Step 2: Train a Natural Language Parsing Model
Deep Learning
State machines represent predicates that evaluate to 0 or 1 for a text match
Deep feature representations are selected and learned using an evolutionary algorithm
State machines map to classes of document labels that matched text during training
[Figure: a hierarchy of nodes (p = State Machine) mapped to Class X, Class Y, and Class Z]
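The idea of a state machine acting as a 0/1 predicate can be sketched with regular expressions, which are themselves compiled to state machines. This is an illustration only: the patterns are fixed by hand here, whereas HPR learns them, and the names `matches` and `pattern_classes` are hypothetical.

```python
import re
from collections import defaultdict

def matches(pattern, text):
    """Predicate: 1 if the pattern matches anywhere in the text, otherwise 0."""
    return 1 if re.search(pattern, text, flags=re.IGNORECASE) else 0

# Hypothetical hand-written patterns standing in for learned features.
patterns = [r"\bgripping\b", r"\bdull\b", r"\bthe \w+ was\b"]

training_data = [
    ("The plot was gripping.", ["positive"]),
    ("The script was dull.", ["negative"]),
]

# Map each pattern to the classes of the documents it matched during training.
pattern_classes = defaultdict(set)
for text, labels in training_data:
    for pattern in patterns:
        if matches(pattern, text):
            pattern_classes[pattern].update(labels)

for pattern, classes in pattern_classes.items():
    print(pattern, sorted(classes))
```

A pattern that matches documents from several classes, such as the last one, carries less information about any single class than one that matches only one.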
Step 3: Classify Unlabeled Documents
The natural language parsing model is used to classify other unlabeled documents
[Figure: an unlabeled document compared by cos(θ) with Class X, Class Y, and Class Z, with scores 0.99, 0.67, and 0.01]
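The classification step can be sketched as ranking classes by the cosine of the angle between the document's feature vector and each class vector. The class vectors and the document vector below are made-up numbers over four hypothetical features, and `classify` is an illustrative helper, not Graphify's API.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical class vectors over four features.
class_vectors = {
    "X": [3, 1, 0, 0],
    "Y": [1, 2, 1, 0],
    "Z": [0, 0, 0, 4],
}

def classify(doc_vector):
    """Return (class, score) pairs sorted from most to least similar."""
    scores = {label: cosine(doc_vector, vec) for label, vec in class_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = classify([2, 1, 0, 0])
for label, score in ranking:
    print(label, round(score, 2))
```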
Hierarchical Pattern Recognition (HPR)
What is Hierarchical Pattern Recognition (HPR)?
HPR is a graph-based deep learning algorithm I created that learns deep feature representations in linear time.
I created the algorithm to do graph-based traversals using a hierarchy of finite state machines (FSMs).
It is designed for scalable performance in polynomial (P) time.
Influences & Inspirations
Ray Kurzweil (Pattern Recognition Theory of Mind) + Jeff Hawkins (Hierarchical Temporal Memory) = Hierarchical Pattern Recognition
[Figure: a hierarchy of pattern nodes (p) mapped to classes X, Y, and Z]
How does feature extraction work?
HPR uses a probabilistic model in combination with an evolutionary algorithm to generate hierarchies of deep feature representations.
“Deep” feature representations are learned and associated with the labels of the documents in which each feature was discovered.
The feature hierarchy is translated into a Vector Space Model for classification on feature vectors generated from unlabeled text.
[Figure: Hierarchical Pattern Recognition, a hierarchy of pattern nodes (p) mapped to classes X, Y, and Z]
Graph-based feature learning
Learning new features from matches on training data
Cost Function for the Generations of Features
Reproduction occurs after a threshold of matches has been exceeded for a feature.
After replication, the cost function is applied to increase that threshold every time the feature reproduces.
The cost function is defined in terms of two values:
the current threshold on the feature node, and
the minimum threshold, which I chose as 5 for new features.
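The reproduction rule can be sketched as follows. The exact cost function is not given in the text, so `example_cost` is purely a placeholder assumption (it adds the minimum threshold to the current one); the `Feature` class, its child naming, and the choice to keep a cumulative match count are likewise hypothetical. Only the minimum threshold of 5 and the "reproduce when matches exceed the threshold, then raise the threshold" behavior come from the description.

```python
MIN_THRESHOLD = 5  # minimum threshold for new features, as stated in the text

def example_cost(current_threshold, min_threshold):
    """Placeholder cost function (an assumption, NOT the original formula)."""
    return current_threshold + min_threshold

class Feature:
    def __init__(self, pattern):
        self.pattern = pattern
        self.matches = 0
        self.threshold = MIN_THRESHOLD
        self.children = []

    def record_match(self, cost=example_cost):
        """Count a match; reproduce once the match count exceeds the threshold."""
        self.matches += 1
        if self.matches > self.threshold:
            child = Feature(f"{self.pattern}/{len(self.children) + 1}")
            self.children.append(child)
            # Raise the threshold so the next reproduction needs more matches.
            self.threshold = cost(self.threshold, MIN_THRESHOLD)
            return child
        return None

feature = Feature("the")
for _ in range(11):
    feature.record_match()
print(len(feature.children), feature.threshold)
```

Whatever the real cost function is, passing it in as the `cost` argument keeps the reproduction logic unchanged.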
Vector Space Model
Generating Feature Vectors
The natural language parsing model created during training can be turned into a global feature index.
This global feature index is a list of Neo4j internal IDs for every feature in the hierarchy.
Using that global feature index, a multi-dimensional vector space is created with one dimension for each feature in the hierarchy, so every feature vector has a length equal to the number of features.
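A minimal sketch of turning a global feature index into a feature vector is shown below. The node IDs are invented, `to_feature_vector` is a hypothetical helper, and the use of binary components (1 for a matched feature, 0 otherwise) is an assumption; a real implementation might weight features differently.

```python
# Hypothetical global feature index: one Neo4j internal ID per feature.
global_feature_index = [12, 57, 103, 240]

def to_feature_vector(matched_feature_ids, feature_index):
    """Return a vector with one dimension per indexed feature: 1 if matched, else 0."""
    matched = set(matched_feature_ids)
    return [1 if feature_id in matched else 0 for feature_id in feature_index]

# A document whose text matched features 103 and 12.
vector = to_feature_vector([103, 12], global_feature_index)
print(vector)  # [1, 0, 1, 0]
```

Because every vector is built against the same index, vectors for different documents and classes are directly comparable dimension by dimension.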
Relevance Rankings
“Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents.” - Wikipedia
Vector-based Cosine Similarity Measure
In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:
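The standard cosine formula for two vectors can be written out as follows, for a document vector d and a query vector q with n components each (the symbol names are chosen here for illustration):

```latex
\cos\theta
  = \frac{\mathbf{d} \cdot \mathbf{q}}{\lVert \mathbf{d} \rVert \, \lVert \mathbf{q} \rVert}
  = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^{2}} \; \sqrt{\sum_{i=1}^{n} q_i^{2}}}
```

The numerator is the dot product of the two vectors and the denominator is the product of their lengths, so the result depends only on direction, not on magnitude.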
Cosine Similarity & Vector Space Model
Vector-based Cosine Similarity Measure
“The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.”
via Wikipedia
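The three reference points in that range can be checked with a few lines of code. The vectors are arbitrary examples and `cosine_similarity` is an illustrative helper that assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

same = cosine_similarity([1, 2], [2, 4])         # same direction, close to 1
independent = cosine_similarity([1, 0], [0, 3])  # orthogonal, exactly 0
opposite = cosine_similarity([1, 2], [-1, -2])   # opposite direction, close to -1

print(round(same, 6), round(independent, 6), round(opposite, 6))
```

When every component of both vectors is non-negative, as with counts or 0/1 feature indicators, the dot product cannot be negative, so the similarity stays between 0 and 1.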
Graphify for Neo4j
Graphify for Neo4j
Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition.
https://github.com/kbastani/graphify
Example Project
Head over to the GitHub project page and clone it to your local machine.
Follow the directions listed in the README.md to install the extension.
Navigate to the /examples directory of the project.
Run:
examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java
U.S. Presidential Speech Transcript Analysis
Identify the Political Affiliation of a Presidential Speech
This example ingests a set of presidential speech transcripts during the training phase, each labeled according to the author of the speech. After the training models are built, unlabeled presidential speeches are classified in the test phase.
The Presidents
• Ronald Reagan
  • labels: liberal, republican, ronald-reagan
• George H.W. Bush
  • labels: conservative, republican, bush41
• Bill Clinton
  • labels: liberal, democrat, bill-clinton
• George W. Bush
  • labels: conservative, republican, bush43
• Barack Obama
  • labels: liberal, democrat, barack-obama
Training
Each president in the example has 6 speeches to analyze.
4 of the speeches are used to build a natural language parsing model.
2 of the speeches are used to test the validity of that model.
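The split can be sketched as below. The speech identifiers are placeholders, and taking the first four speeches for training is an assumption; the text does not say how the four are chosen.

```python
# Placeholder speech identifiers; the real example uses speech transcripts.
presidents = ["ronald-reagan", "bush41", "bill-clinton", "bush43", "barack-obama"]
speeches = {p: [f"{p}-speech-{i}" for i in range(1, 7)] for p in presidents}

def split(speech_list, train_count=4):
    """Return (training, testing) slices of one president's speeches."""
    return speech_list[:train_count], speech_list[train_count:]

training, testing = {}, {}
for president, items in speeches.items():
    training[president], testing[president] = split(items)

total_train = sum(len(v) for v in training.values())
total_test = sum(len(v) for v in testing.values())
print(total_train, total_test)  # 20 10
```

With five presidents this gives 20 training speeches and 10 held-out test speeches in total.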
Get Similar Labels/Classes
Ronald Reagan
Class Similarity
republican 0.7182046285385341
liberal 0.644281223102398
democrat 0.4854114595950056
conservative 0.4133639188595147
bill-clinton 0.4057969121945167
barack-obama 0.323947855372623
bush41 0.3222644898334092
bush43 0.3161309849153592
George H.W. Bush
Class Similarity
conservative 0.7032274806766954
republican 0.6047256274615608
liberal 0.4439742461594541
democrat 0.39114918238853674
bill-clinton 0.3234223107986785
ronald-reagan 0.3222644898334092
barack-obama 0.2929260544514002
bush43 0.29106733975087984
Bill Clinton
Class Similarity
democrat 0.8375678825642422
liberal 0.7847858060182163
republican 0.5561860529059708
conservative 0.45365774896422445
barack-obama 0.4507676679770066
ronald-reagan 0.4057969121945167
bush43 0.365042482383354
bush41 0.3234223107986785
George W. Bush
Class Similarity
conservative 0.820636570272315
republican 0.7056890956512284
liberal 0.5075788396061254
democrat 0.4505424322086937
bill-clinton 0.365042482383354
barack-obama 0.33801949243378965
ronald-reagan 0.3161309849153592
bush41 0.29106733975087984
Barack Obama
Class Similarity
democrat 0.7668017370739147
liberal 0.7184792203867296
republican 0.4847680475425114
bill-clinton 0.4507676679770066
conservative 0.4149264161292232
bush43 0.33801949243378965
ronald-reagan 0.323947855372623
bush41 0.2929260544514002
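One way to read these tables is to compare each president's similarity to the two party classes and pick the larger. The sketch below copies the `republican` and `democrat` rows from the five tables; the `closest_party` helper is an illustrative name, not part of Graphify.

```python
# Similarity of each president's class to the two party classes, copied from the tables.
party_similarity = {
    "ronald-reagan": {"republican": 0.7182046285385341, "democrat": 0.4854114595950056},
    "bush41": {"republican": 0.6047256274615608, "democrat": 0.39114918238853674},
    "bill-clinton": {"republican": 0.5561860529059708, "democrat": 0.8375678825642422},
    "bush43": {"republican": 0.7056890956512284, "democrat": 0.4505424322086937},
    "barack-obama": {"republican": 0.4847680475425114, "democrat": 0.7668017370739147},
}

def closest_party(scores):
    """Return the party label with the highest similarity score."""
    return max(scores, key=scores.get)

predicted = {president: closest_party(scores) for president, scores in party_similarity.items()}
for president, party in predicted.items():
    print(president, party)
```

For all five presidents the closer party class matches the party label assigned during training.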
Get involved in the Neo4j community
http://stackoverflow.com/questions/tagged/neo4j
http://groups.google.com/group/neo4j
https://github.com/neo4j/neo4j/issues
http://neo4j.meetup.com/
(Thank You)
Get in touch
Twitter: www.twitter.com/kennybastani
LinkedIn: www.linkedin.com/in/kennybastani
GitHub: www.github.com/kbastani