
Page 1

PROCESSING LARGE GRAPHS

Ernesto Damiani and Paolo Ceravolo
[email protected]

Università degli Studi di Milano
Dipartimento di Informatica

Page 2

PROCESSING LARGE GRAPHS

‣ Processing large graphs is challenging

‣ Data Management

‣ The graph is partitioned across multiple nodes, so traversing the graph implies managing I/O operations

‣ Data Analytics

‣ Only a subset of the statistical tools we typically rely on can be applied to graphs

‣ Most SNA (social network analysis) algorithms require visiting the entire graph

‣ Unlike images, graphs cannot easily be encoded as fixed-size matrices

Page 3

DATA MANAGEMENT

‣ When evaluating a query on a large dataset, one wants to partition and distribute the data to multiple machines

‣ The goal is to efficiently evaluate queries in parallel

‣ When MapReduce tasks are chained, large subgraphs may pass from one node to another

‣ Communication overhead may outweigh parallelisation gains

‣ Pregel uses a master-workers model organised in steps and supersteps (a minimal sketch follows this list)

‣ Latency arises while waiting for a superstep to finish

‣ Bottlenecks arise when graphs have high density or when nodes have high centrality values
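The superstep mechanics can be sketched in a few lines of Python. This is a single-machine illustration only (real Pregel partitions vertices across workers and exchanges messages over the network); the function name and the compute callback signature are assumptions made for this sketch.

```python
# Minimal single-machine sketch of a Pregel-style superstep loop.
# It only shows the synchronisation structure that causes the latency above.
def run_supersteps(vertices, edges, compute, max_steps=30):
    # vertices: {id: state}; edges: {id: [neighbour ids]}
    # compute(state, neighbours, messages, send) -> (new_state, halted)
    inbox = {v: [] for v in vertices}
    active = set(vertices)
    for _ in range(max_steps):
        if not active:
            break                               # every vertex voted to halt
        outbox = {v: [] for v in vertices}
        send = lambda target, msg: outbox[target].append(msg)
        for v in list(active):
            vertices[v], halted = compute(vertices[v], edges[v], inbox[v], send)
            if halted:
                active.discard(v)
        # barrier: messages sent in this superstep become visible only in the
        # next one, so every worker must wait for the slowest vertex
        inbox = outbox
        active |= {v for v, msgs in inbox.items() if msgs}
    return vertices
```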

Page 4

GRAPHS ENHANCING MACHINE LEARNING

‣ Feature extraction and selection helps us take raw data and create a suitable subset and format for training our machine learning models

‣ Representing the context may be crucial to avoid creating models that overfit

‣ For example, we can map features to nodes in a graph, create relationships based on feature similarity, and then compute the centrality of features (see the sketch after this list). Feature relationships can be defined by their ability to preserve cluster densities of data points

‣ In [1], PageRank and other SNA metrics are used to weight the nodes that other ML methods then assess

1. Fakhraei, Shobeir, James Foulds, Madhusudana Shashanka, and Lise Getoor. "Collective spammer detection in evolving multi-relational social networks." In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1769-1778. ACM, 2015.
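A minimal sketch of the feature-graph idea from the list above, using NumPy and NetworkX. The synthetic data and the correlation threshold of 0.3 are arbitrary illustrations for this sketch, not taken from [1].

```python
import numpy as np
import networkx as nx

# Toy sketch: build a graph over features, connect features whose columns are
# strongly correlated, and use PageRank as a feature-importance score.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                    # 200 samples, 6 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)   # feature 3 mirrors feature 0

corr = np.corrcoef(X, rowvar=False)              # feature-to-feature correlation
G = nx.Graph()
G.add_nodes_from(range(X.shape[1]))
for i in range(X.shape[1]):
    for j in range(i + 1, X.shape[1]):
        if abs(corr[i, j]) > 0.3:                # arbitrary similarity threshold
            G.add_edge(i, j, weight=abs(corr[i, j]))

scores = nx.pagerank(G, weight="weight")         # centrality as feature weight
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```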

Page 5

GRAPH EMBEDDINGS

‣ Graph embeddings are the transformation of property graphs to a vector or a set of vectors

‣ Embedding should capture the graph topology, vertex-to-vertex relationship, and other relevant information about graphs, subgraphs, and vertices

‣ The more properties the embedding encodes, the better the results that can be obtained in later tasks

‣ Vertex embeddings: We encode each vertex (node) with its own vector representation

‣ Graph embeddings: Here we represent the whole graph with a single vector

Page 6

GRAPH EMBEDDINGS

‣ Graph embeddings are the transformation of property graphs to a vector or a set of vectors

Page 7

GRAPH EMBEDDINGS

‣ In [1] a more sensible classification is proposed

‣ Factorisation-based methods represent the connections between nodes in the form of a matrix and factorise this matrix to obtain the embedding (a sketch follows this list)

‣ Random-walk-based methods use random walks to mine the neighbourhood of nodes

‣ Deep-learning-based methods use deep autoencoders to generate an embedding model that can capture non-linearity in graphs

1. Goyal, Palash, and Emilio Ferrara. "Graph embedding techniques, applications, and performance: A survey." Knowledge-Based Systems 151 (2018): 78-94.
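As a hedged illustration of the factorisation-based family mentioned in the list above, the sketch below factorises the plain adjacency matrix with a truncated SVD. The methods surveyed in [1] (e.g. HOPE, Graph Factorization) factorise more refined proximity matrices, so this shows only the general shape of the approach.

```python
import numpy as np
import networkx as nx

# Factorisation-based embedding sketch: truncated SVD of the adjacency matrix,
# keeping the top-d singular directions as node coordinates.
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)                 # |V| x |V| adjacency matrix

d = 8                                    # embedding dimension
U, s, Vt = np.linalg.svd(A)
embedding = U[:, :d] * np.sqrt(s[:d])    # one d-dimensional vector per node
print(embedding.shape)                   # (34, 8) for the karate club graph
```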

Page 8

GRAPH EMBEDDINGS

1. Goyal, Palash, and Emilio Ferrara. "Graph embedding techniques, applications, and performance: A survey." Knowledge-Based Systems 151 (2018): 78-94.

Page 9

GRAPH EMBEDDINGS

1. Goyal, Palash, and Emilio Ferrara. "Graph embedding techniques, applications, and performance: A survey." Knowledge-Based Systems 151 (2018): 78-94.

Page 10

GRAPH EMBEDDINGS

1. Goyal, Palash, and Emilio Ferrara. "Graph embedding techniques, applications, and performance: A survey." Knowledge-Based Systems 151 (2018): 78-94.

Page 11

GRAPH EMBEDDINGS

‣ GEM is a Python library (https://github.com/palash1992/GEM) that provides implementations of

‣ (i) Locally Linear Embedding, (ii) Laplacian Eigenmaps, (iii) Graph Factorization, (iv) HOPE, (v) SDNE, (vi) node2vec

‣ For node2vec, it uses a C++ implementation and exposes a Python interface

‣ Graphs are stored using the DiGraph format of the Python package NetworkX (see the sketch below)

‣ A NetworkX API for Neo4j is available: https://neo4j.com/graphconnect-2018/session/networkx-api-graph-algorithms

1. Goyal, Palash, and Emilio Ferrara. "Graph embedding techniques, applications, and performance: A survey." Knowledge-Based Systems 151 (2018): 78-94.
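A sketch of how the pieces above fit together: build a NetworkX DiGraph and hand it to one of GEM's embedders. The HOPE constructor and learn_embedding arguments follow the GEM README and may differ across versions, so treat them as assumptions rather than a verified API.

```python
import networkx as nx

# GEM consumes graphs stored as NetworkX DiGraph objects.
G = nx.DiGraph()
G.add_edges_from([(0, 1), (1, 2), (2, 0), (2, 3), (3, 1)])

# Hedged call to one of GEM's embedders (HOPE); argument names are taken from
# the GEM README and may vary between versions of the library.
from gem.embedding.hope import HOPE

model = HOPE(d=4, beta=0.01)          # d = embedding dimension
Y, t = model.learn_embedding(graph=G, is_weighted=False, no_python=True)
print(Y.shape)                        # one 4-dimensional vector per node
```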

Page 12

NODE2VEC

‣ The node2vec framework learns low-dimensional representations for nodes in a graph by optimizing a neighborhood preserving objective

‣ The objective is flexible, and the algorithm accommodates various definitions of network neighborhoods by simulating biased random walks

‣ Specifically, it provides a way of balancing the exploration-exploitation tradeoff that in turn leads to representations obeying a spectrum of equivalences from homophily to structural equivalence

After transitioning to node v from t, the return hyperparameter p and the in-out hyperparameter q control the probability of the walk staying inward and revisiting nodes (t), staying close to the preceding nodes (x1), or moving outward farther away (x2, x3).
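The role of p and q can be made concrete with a short sketch of the biased second-order walk (a simplified reimplementation, not the node2vec reference code): the unnormalised weight of stepping from the current node to a candidate x is 1/p if x is the previous node, 1 if x is a neighbour of the previous node, and 1/q otherwise.

```python
import random
import networkx as nx

def next_step(G, prev, cur, p=1.0, q=1.0):
    """Sample the next node of a node2vec-style walk after moving prev -> cur."""
    neighbours = list(G.neighbors(cur))
    weights = []
    for x in neighbours:
        if x == prev:                    # distance 0 from prev: return
            weights.append(1.0 / p)
        elif G.has_edge(x, prev):        # distance 1 from prev: stay close
            weights.append(1.0)
        else:                            # distance 2 from prev: move outward
            weights.append(1.0 / q)
    return random.choices(neighbours, weights=weights, k=1)[0]

def node2vec_walk(G, start, length=10, p=1.0, q=1.0):
    walk = [start, random.choice(list(G.neighbors(start)))]
    while len(walk) < length:
        walk.append(next_step(G, walk[-2], walk[-1], p, q))
    return walk

G = nx.karate_club_graph()
print(node2vec_walk(G, start=0, length=10, p=0.5, q=2.0))
```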

Page 13

WORD2VEC

‣ Graph embeddings are the transformation of property graphs to a vector or a set of vectors

Page 14

WORD2VEC

‣ Graph embeddings are the transformation of property graphs to a vector or a set of vectors

‣ Continuous Bag Of Words (CBOW)

Page 15

WORD2VEC

‣ Graph embeddings are the transformation of property graphs to a vector or a set of vectors

‣ Skip-Gram model
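A DeepWalk-style sketch of how word2vec connects to graph embeddings, assuming the gensim library (gensim >= 4 uses vector_size; older releases used size): random walks play the role of sentences and node ids play the role of words, while sg=0 selects CBOW and sg=1 selects the skip-gram model.

```python
import random
import networkx as nx
from gensim.models import Word2Vec      # assumes gensim >= 4.0

# Random walks over the graph are treated as sentences, node ids as words,
# and word2vec learns one vector per node.
G = nx.karate_club_graph()
walks = []
for _ in range(20):
    for start in G.nodes():
        walk, cur = [start], start
        for _ in range(9):
            cur = random.choice(list(G.neighbors(cur)))
            walk.append(cur)
        walks.append([str(n) for n in walk])     # gensim expects token strings

# sg=0 -> CBOW (predict a node from its context), sg=1 -> skip-gram
model = Word2Vec(walks, vector_size=16, window=5, sg=1, min_count=1, epochs=5)
print(model.wv["0"].shape)                       # (16,)
```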

Page 16

T-SNE

‣ t-Distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for nonlinear dimensionality reduction, well-suited for embedding high-dimensional data in a low-dimensional space of two or three dimensions for visualization

PCA vs. t-SNE (figure comparing the two projections)
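A minimal scikit-learn sketch of the comparison in the figure above: project the same stand-in embedding vectors to 2-D with PCA and with t-SNE.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Project high-dimensional (node) embeddings to 2-D for plotting.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))            # stand-in for 300 embedding vectors

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
print(X_pca.shape, X_tsne.shape)          # (300, 2) (300, 2)
```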

Page 17

GRAPH2VEC

‣ Intuition tells us that the best accuracy can be obtained when the entire graph (or, alternatively, its subgraphs) is embedded, because this preserves the structure

‣ Factorisation-based methods can encode the entire graph

‣ Graph Kernels are widely adopted approaches to measure similarities among graphs

‣ A Graph Kernel measures the similarity between a pair of graphs by recursively decomposing them into atomic substructures (e.g., walk, shortest paths, graphlets etc.)

‣ Formally:
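The formula on the original slide is not reproduced here; a substructure-counting (R-convolution style) graph kernel of the kind just described is commonly written as follows, where Φ(G) is the vector of counts of atomic substructures occurring in G.

```latex
K(G, G') = \Phi(G)^{\top}\,\Phi(G')
```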

Page 18

GRAPH2VEC

Page 19

GRAPH2VEC

‣ Graph Kernels, by design, ignore subgraph similarities and consider each of the subgraphs as an individual feature

‣ They suffer from diagonal dominance, that is, a given graph is similar to itself but not to any other graph in the dataset

Page 20

GRAPH2VEC

‣ Deep Graph Kernels aim at capturing the relationship between substructures

‣ Formally:
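The slide's formula is again not reproduced; in the usual Deep Graph Kernels formulation it takes the form below, which matches the description of M given in the next bullet.

```latex
K(G, G') = \Phi(G)^{\top}\, M \,\Phi(G')
```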

‣ where M is a |V| x |V| positive semi-definite matrix that encodes the relationships between substructures, and V is the vocabulary of substructures obtained from the training data

‣ Deep learning algorithms are used to compute M

Page 21

CORRELATION IS NOT CAUSATION

‣ Simpson’s Paradox

Page 22

Fatal Errors

1. O1,Workout [36] — μ 65min
2. O2,Workout [39] — μ 57min

3. O1,Workout:M [20] — μ 80min
4. O1,Workout:P [16] — μ 46min
5. O2,Workout:M [10] — μ 85min
6. O2,Workout:P [29] — μ 48min
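Reading the bracketed numbers as group sizes, the reversal can be checked directly: each subgroup mean is higher under O2, yet the overall mean is higher under O1 because O1 contains proportionally more of the long-duration M cases (the meaning of M and P is not spelled out on the slide).

```latex
\mu_{O1} = \frac{20 \cdot 80 + 16 \cdot 46}{36} \approx 65\,\text{min}
\qquad
\mu_{O2} = \frac{10 \cdot 85 + 29 \cdot 48}{39} \approx 57\,\text{min}
```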

Page 23

Fatal Errors

1. Load:F1 [100] ⊃ Load:F1 ⪼ Fix [12] — 12%
2. Load:F2 [80] ⊃ Load:F2 ⪼ Fix [2] — 2.5%
3. Load:F3 [30] ⊃ Load:F3 ⪼ Fix [2] — 6%

Page 24

Fatal Errors

1. Load:F1:M [70] ⊃ Load:F1:M ⪼ Fix [10] — 14%
2. Load:F1:P [30] ⊃ Load:F1:P ⪼ Fix [2] — 6%
3. Load:F2:M [10] ⊃ Load:F2:M ⪼ Fix [2] — 20%
4. Load:F2:P [70] ⊃ Load:F2:P ⪼ Fix [0] — 0%
5. Load:F3:M [10] ⊃ Load:F3:M ⪼ Fix [1] — 10%
6. Load:F3:P [20] ⊃ Load:F3:P ⪼ Fix [1] — 5%

Page 25

Fatal Errors

1. Load:M [90] ⊃ Load:M ⪼ Fix [13] — 14%
2. Load:P [120] ⊃ Load:P ⪼ Fix [3] — 2%

Page 26

VALIDATION

‣ In ML, validation is required to assess the quality of the learning stage

‣ Overfitting and underfitting are both undesirable behaviours that must be countered; both lead to poor predictions on new data (a minimal check is sketched at the end of this list)

‣ They both depend on:

‣ Bias originates from erroneous assumptions in the learning process

‣ Variance originates from sensitivity to fluctuations in the training data

‣ At data or model level!
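A minimal scikit-learn sketch of this validation idea (the dataset and model are illustrative assumptions): compare the training score with a held-out score; a large gap signals overfitting, while two low scores signal underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative check: a very deep tree tends to overfit (train >> validation
# score), while a depth-1 tree tends to underfit (both scores low).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, None):                          # underfit vs. overfit
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, model.score(X_tr, y_tr), model.score(X_va, y_va))
```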