Large graph analysis

Paola Vocca - Università della Tuscia


Outline

• Large graphs: Social graphs, web graphs …
• Extremal measures: Diameter, centrality, eccentricity, average distance, separation degree
• Exact and approximate algorithms and data structures
  o Exact
  o Sampling
  o Data stream model: Probabilistic estimator

3/16/2017 Big Data: Tecnologie, metodologie e applicazioni per l'analisi dei dati massivi


Graphs

• A graph represents relations between «things» or entities
• Nodes or vertices represent the entities
• Edges represent the relations

Relation: Who is the master of whom?


Bow tie structure of the Web Graph

• An AltaVista crawl of 200 million pages and 1.5 billion links.
• A giant strongly connected component containing 28% of the nodes.

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: experiments and models. Computer Networks, 33(1–6):309–320, 2000.


Graphs

Dolphin interactions


Graphs example


Graph Datasets

Hyperlinks (the Web)

Social graphs (Facebook, Twitter, LinkedIn,…)

Email logs, phone call logs, messages

Commerce transactions (Amazon purchases)

Road networks

Communication networks

Protein interactions


Properties

Directed/Undirected

Snapshot or with time dimension (dynamic)

One or more types of entities (people, pages, products)

Metadata associated with nodes or edges (labels)

Some graphs are really large: billions of edges for Facebook and Twitter graphs


Mining the graph

Connected/Strongly connected components

Eccentricity e(v): the maximum distance d(v, u) over all nodes u.

Radius r(G): the minimum eccentricity over all nodes. A node u is central if e(u) = r(G), and the center of G is the set of all central nodes.

Diameter (the longest shortest s-t path)

Effective diameter (90th percentile of the pairwise distances)

Distance distribution (number of pairs at each distance)

Average distance

Degree distribution

Clustering coefficient: ratio of the number of closed triplets to the number of all connected triplets, open and closed (each triangle contributes three closed triplets).

Centrality

…..
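The eccentricity, radius, diameter and center defined above can be computed on a small graph with one BFS per node. The following is a minimal sketch, assuming a connected, undirected, unweighted graph stored as an adjacency dictionary; the toy path graph and the function names are illustrative and not part of the original slides.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricities(adj):
    """e(v) = max over u of d(v, u); assumes the graph is connected."""
    return {v: max(bfs_distances(adj, v).values()) for v in adj}

# Hypothetical toy graph: the path a - b - c - d - e
adj = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}

ecc = eccentricities(adj)
radius = min(ecc.values())      # minimum eccentricity
diameter = max(ecc.values())    # maximum eccentricity
center = sorted(v for v, e in ecc.items() if e == radius)
print(ecc, radius, diameter, center)
```

Running one BFS per node costs O(mn) overall, which is fine for a toy graph but is exactly the cost that becomes prohibitive on very large graphs.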


…Mining the link structure

• Centrality (who are the most important nodes?)
• Similarity of nodes (link prediction, targeted ads, friend/product recommendations, metadata completion)
• Communities: sets of nodes that are more tightly related to each other than to others
• Cover: a set of nodes with good coverage (facility location, influence maximization)


Connected components

Number of connected components: 2

Eccentricity

[Figure: example graph with each node labeled by its eccentricity; two nodes have Ecc = 2 and ten nodes have Ecc = 3.]

Diameter

Diameter is 3



Distance distribution

#nodes: 13
#edges: 27
#pairs: 78

Distance 1: 27 (number of edges)
Distance 2: 33
Distance 3: 18

[Figure: example graph with nodes numbered 1 to 13.]

Average distance

#nodes: 13
#edges: 27
#pairs: 78

$\mathit{AvgDist} = \frac{1}{78}\,(27 \cdot 1 + 33 \cdot 2 + 18 \cdot 3) \approx 1.88$

[Figure: example graph with nodes numbered 1 to 13.]

Triangles

[Figure: example graph with nodes numbered 1 to 13; one closed triangle is highlighted.]

• Social graphs have many more closed triangles than random graphs
• “Communities” have more closed triangles
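Counting closed triangles is also what a clustering coefficient is built on. The sketch below counts triangles and connected triplets on a tiny undirected graph and derives the global clustering coefficient as 3 × triangles / triplets; this particular definition, the toy graph and the function names are assumptions made for illustration.

```python
from itertools import combinations

def triangles_and_triplets(adj):
    """Count closed triangles and connected triplets (pairs of neighbours around a node)."""
    closed = 0      # every triangle is seen once per corner, i.e. three times
    triplets = 0
    for v, neigh in adj.items():
        for a, b in combinations(sorted(neigh), 2):
            triplets += 1
            if b in adj[a]:
                closed += 1
    return closed // 3, triplets

def global_clustering(adj):
    """3 * triangles / connected triplets (0.0 if there are no triplets)."""
    tri, triplets = triangles_and_triplets(adj)
    return 3 * tri / triplets if triplets else 0.0

# Hypothetical toy graph: triangle 1-2-3 with a pendant node 4 attached to 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

tri, triplets = triangles_and_triplets(adj)
cc = global_clustering(adj)
print(tri, triplets, cc)
```

Here node 3 contributes two open triplets (1-3-4 and 2-3-4), which is why the coefficient is below 1 even though the graph contains a triangle.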


Communities

[Figure: example graph with nodes numbered 1 to 13, with two communities labeled "Star Wars" and "Ninjago".]

Communities in Les Miserables Network


The network of interactions between major characters in the novel Les Miserables by Victor Hugo


Centrality

• Which are the most important nodes?
  – Depends on the criteria and what we want to model
• Degree (in/out): largest number of followers, friends. Easy to compute locally. Spammable.
• PageRank: your importance/reputation recursively depends on that of your friends
• Betweenness: your value as a “hub”, that is, being on a shortest path between many pairs
• Closeness: centrally located, able to quickly reach/infect many nodes; usually the inverse of the average distance to all other nodes (the inverse of eccentricity is the worst-case variant)
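The recursive flavour of PageRank can be made concrete with a few rounds of power iteration. This is a minimal sketch, not the production algorithm: the damping factor 0.85, the fixed number of iterations, the treatment of dangling nodes and the toy link graph are all assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration on a directed graph given as {node: [successors]}.

    Every successor must itself be a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without out-links spreads its rank evenly over all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: a cycle a -> b -> c -> a, plus d pointing to a
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(links)
print(ranks)
```

Node d has no incoming links, so it only keeps the baseline (1 - damping) / n, while a collects rank from both c and d.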


Centrality: example

[Figure: example graph with nodes numbered 1 to 13.]

• Central nodes with respect to all criteria

Random graphs and real graphs

• Experiments show that statistical measures in real complex networks differ significantly from those of randomly generated graphs
• Biological networks, social networks, the Internet and the Web have similar measures:
  o High aggregation degree (clustering coefficient)
  o Low separation degree (average distance)


Degree distribution

• In a random graph (each link is created according to a uniform probability distribution) all nodes have the same importance: the probability that $v$ is connected to $v'$ is the same as the probability that it is connected to $v''$
• The probability that a node has degree exactly $k$ is

$P(k) = \frac{\text{number of nodes with degree } k}{\text{number of nodes of the graph}}$

Random graphs vs. real graphs
• In random graphs, nodes with a low or a high degree are rare
• Most of the nodes have a degree close to the average
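The empirical degree distribution P(k) is simply a normalized histogram of node degrees. A minimal sketch, assuming an undirected graph stored as an adjacency dictionary (the star graph below is a made-up example with one hub):

```python
from collections import Counter

def degree_distribution(adj):
    """Return {k: P(k)}, where P(k) is the fraction of nodes with degree k."""
    n = len(adj)
    counts = Counter(len(neigh) for neigh in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy graph: a star with hub 0 and leaves 1..4
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

P = degree_distribution(adj)
print(P)
```

Even this tiny example shows the pattern typical of hubs: most nodes have degree 1, and a single node carries a much larger degree.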


Degree distribution

• Hubs act as shortcuts on paths, thus influencing the separation degree

Small-world effect: very few nodes with many connections, which ensure the speed of transmission of information (or gossip ...) in the network

• Social networks: a few individuals with many friends (celebrities)
• Web: a few websites with lots of links
• Metabolic networks: a few metabolites participating in many metabolic processes
• Internet: works well because any two computers are connected by no more than 10 or 20 "hops" (physical links)

The small-world effect is crucial:
• in the study of epidemic spreading
• in the service-optimization problem in telecommunications networks
• in the study of neuronal interconnections in the brain
• in ecological networks, etc.

Small-world effect

• A network evolves over time, and a new node joining the network is not equally likely to connect to every existing node
• It is more likely that the new node connects to an already well-connected node than to an isolated one: preferential attachment

Why?

Hubs are:
• the reason for network resiliency: when a node fails, it is rarely a hub, so connectivity remains "guaranteed"
• but also the natural target of deliberate attacks!
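Preferential attachment is easy to simulate. The sketch below is a simplified growth model in the spirit of Barabási and Albert, under assumptions that are not in the slides: each new node adds exactly one edge, the target is chosen with probability proportional to its current degree, and the process starts from a single edge. The function name and the seed are illustrative.

```python
import random

def preferential_attachment(n, seed=0):
    """Grow a graph on nodes 0..n-1; each new node links to one existing node
    chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]          # each node appears once per incident edge
    for new in range(2, n):
        target = rng.choice(endpoints)   # degree-proportional choice
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

edges = preferential_attachment(1000)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

print(max(degree.values()), sum(degree.values()) / len(degree))
```

Sampling uniformly from the list of edge endpoints is a common trick: a node of degree d appears d times in the list, so it is picked with probability proportional to d.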


Small world effect

Kevin Bacon Number    # of People
0                     1
1                     3303
2                     381495
3                     1383150
4                     356429
5                     30815
6                     3640
7                     584
8                     116
9                     26
10                    1

Total number of linkable actors: 2159560
Weighted total of linkable actors: 6522634
Average Kevin Bacon number: 3.020

How good a center is Kevin Bac…

Facebook

• Boldi, Rosa, and Vigna. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In WWW 2011.
• The average distance of Facebook (721.1M nodes and 68.7G edges) is 4.7 and the diameter is 41.

Neighbourhood function

• The neighbourhood function $N(t)$ of a graph returns, for each $t \in \mathbb{N}$, the number of pairs of nodes $(x, y)$ such that $y$ is reachable from $x$ in at most $t$ steps.
• All the previous measures on graphs can be derived from the computation of $N(t)$.
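To see how other measures follow from the neighbourhood function, the sketch below starts from a list of values N(0), N(1), ... and derives the distance distribution, the average distance and the diameter. It assumes that N(t) counts ordered pairs within at most t steps, including the pairs (x, x); the input values are the ordered-pair version of the 13-node example (27, 33 and 18 unordered pairs at distances 1, 2 and 3), and the function names are illustrative.

```python
def distance_counts(N):
    """counts[t] = number of ordered pairs at distance exactly t (t >= 1),
    obtained by differencing consecutive values of N."""
    return {t: N[t] - N[t - 1] for t in range(1, len(N)) if N[t] > N[t - 1]}

def average_distance(N):
    """Mean distance over all ordered pairs of distinct, mutually reachable nodes."""
    counts = distance_counts(N)
    return sum(t * c for t, c in counts.items()) / sum(counts.values())

def diameter(N):
    """Largest distance at which new pairs still appear."""
    return max(distance_counts(N))

# N(0) = 13 (each node reaches itself), then add 2*27, 2*33 and 2*18 ordered pairs
N = [13, 13 + 54, 13 + 54 + 66, 13 + 54 + 66 + 36]

counts = distance_counts(N)
avg = average_distance(N)
diam = diameter(N)
print(counts, round(avg, 2), diam)
```

The last value, 169, equals 13 squared, which confirms that every ordered pair is reachable in this connected example.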


Real graphs are huge


The Internet 2003

Graph colors:
• Asia Pacific – Red
• Europe/Middle East/Central Asia/Africa – Green
• North America – Blue
• Latin America and Caribbean – Yellow
• RFC1918 IP Addresses – Cyan
• Unknown – White

Figure by the Opte Project (www.opte.org).

The vertices are “class C subnets”: groups of computers with similar Internet addresses, usually managed by a single organization. The connections represent the routes taken by data packets as they hop between subnets. The geometric positions of the vertices are chosen simply to give a pleasing layout and are not related to geographic position.

Internet 2010


Internet 2015

Graph colors (one per region):
• North America (ARIN)
• Europe (RIPE)
• Latin America (LACNIC)
• Asia Pacific (APNIC)
• Africa (AFRINIC)
• “Backbone” (highly connected networks)

Exact computations

How can one compute the distance distribution?

• Weighted graphs:
  o Dijkstra (single-source: $O(n^2)$)
  o Floyd-Warshall (all-pairs: $O(n^3)$)
• Unweighted graphs:
  o a single BFS solves the single-source version of the problem: $O(m)$
  o if we repeat it from every source: $O(mn)$
• Matrix multiplication:
  o $O(n^{(3+\omega)/2} \log n)$, where $\omega$ is the exponent of matrix multiplication
  o U. Zwick. All pairs shortest paths using bridging sets and rectangular matrix multiplication. J. ACM, 49(3):289–317, 2002.

Still too expensive.

Computing on Very Large Graphs

Sampling

Approximation

Probabilistic counter

General algorithm design principles:

Keep total computation/communication/storage “linear” in the size of the data

Parallelize (minimize chains of dependencies)

Localize dependencies


Sampling

• Strategy
  o Sample at random a source x
  o Compute a full BFS from x
• It is an unbiased estimator only for undirected and connected graphs
  o It still relies on BFS...
  o ...which is not cache friendly
  o ...and not compression friendly

The average distance can be obtained by sampling, for each node, only $O\!\left(\frac{\log n}{\varepsilon^2}\right)$ random nodes rather than all $n$ nodes, with an error of $\varepsilon$, which reduces the time complexity to $O\!\left(\frac{\log n}{\varepsilon^2}\,(n \log n + m)\right)$.

David Eppstein and Joseph Wang. 2001. Fast approximation of centrality. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms (SODA '01). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 228-229.
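The sampling strategy (pick random sources, run a full BFS from each, average the observed distances) can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of the cited paper: the graph is undirected, connected and unweighted, the number of samples k is passed in directly rather than derived from log n and ε, and the function names are made up.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_average_distance(adj, k, seed=0):
    """Estimate the average distance from full BFS runs at k random sources."""
    rng = random.Random(seed)
    sources = rng.sample(sorted(adj), k)
    total = pairs = 0
    for s in sources:
        dist = bfs_distances(adj, s)
        total += sum(dist.values())
        pairs += len(dist) - 1          # exclude the source itself
    return total / pairs

# Hypothetical toy graph: a cycle on 6 nodes
n = 6
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

estimate = sampled_average_distance(adj, k=2)
print(estimate)
```

On a cycle every node sees the same distances (1, 1, 2, 2, 3), so any sample gives the exact average of 1.8; on real graphs the estimate varies with the sample, which is where the error bound ε comes in.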


Sampling

Sampling algorithms
• Sampling by random node selection (RN): RN has been shown not to retain the power-law degree distribution
  o Vertex choice can be made proportional to its PageRank
  o Random Degree Node (RDN) sampling has even more bias towards high-degree nodes
• Sampling by random edge selection: sampled graphs will be very sparsely connected, will thus have a large diameter, and will not respect community structure
• Sampling by exploration
  o Random Node Neighbor (RNN)
  o Random Walk (RW)
  o Random Jump (RJ)

Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA, 631-636.

Diffusion

Basic idea
• ANF: Christopher R. Palmer, Phillip B. Gibbons, and Christos Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '02). ACM, New York, NY, USA, 81-90.
• HyperANF: Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
• Let $B_t(x)$ be the ball of radius $t$ about $x$ (the set of nodes at distance $\le t$ from $x$)
• Clearly $B_0(x) = \{x\}$
• Moreover $B_{t+1}(x) = \bigcup_{x \to y} B_t(y) \cup \{x\}$
• So computing $B_{t+1}$ starting from $B_t$ just needs a single (sequential) scan of the graph
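The diffusion recurrence can be tried out with exact Python sets before any probabilistic counter is introduced. The sketch below assumes a directed graph given as successor lists and treats each ball as the set of nodes within distance at most t; the function name and the toy path are illustrative. It returns N(t), the sum of the ball sizes, for each t.

```python
def neighbourhood_function(succ, max_t):
    """Exact diffusion: B_{t+1}(x) = {x} united with B_t(y) for every arc x -> y.
    Returns [N(0), ..., N(max_t)], where N(t) is the sum over x of |B_t(x)|."""
    balls = {x: {x} for x in succ}                 # B_0(x) = {x}
    N = [sum(len(b) for b in balls.values())]
    for _ in range(max_t):
        new_balls = {}
        for x, out in succ.items():                # one sequential scan of the arcs
            ball = {x}
            for y in out:
                ball |= balls[y]
            new_balls[x] = ball
        balls = new_balls
        N.append(sum(len(b) for b in balls.values()))
    return N

# Hypothetical toy graph: the directed path a -> b -> c -> d
succ = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}

N = neighbourhood_function(succ, 3)
print(N)
```

Each ball here is an explicit set, so memory grows with the number of reachable nodes per node; the only operations used on a ball are union and size, which is what makes it possible to swap in an approximate counter.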


Example

[Figure: a node x with successors x1, x2 and x3, showing the balls B_1(x1), B_1(x2), B_1(x3) and B_2(x).]

Easy but expensive

• Every set requires 𝑶(𝒏) bits, hence 𝑶(𝒏𝟐) bits overall

• Too many!

• What about using approximated sets?

• We need probabilistic counters, with just two primitives: add and size

ANF and HyperANF

• ANF: uses the probabilistic counter of Flajolet & Martin (1985); implemented in the SNAP framework
• HyperANF: uses HyperLogLog counters [Flajolet et al., 2007]; implemented in the WebGraph framework to study the web graph
  o With 40 bits you can count up to 4 billion with a standard deviation of 6%

Probabilistic counter: Streaming model

Sequence of elements from some domain

<x1, x2, x3, x4, ..... >

Bounded storage:

working memory << stream size

usually $O(\log^k n)$ or $O(n^\alpha)$ for $\alpha < 1$

Fast processing time per stream element


Counting Distinct Elements

Keys occur multiple times; we want to count the number of distinct keys in the stream.

Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

In this example:

Number of distinct keys is $n = 6$

Number of stream elements is 11

Distinct Elements: Approximate Counting

Exact counting of $n$ distinct elements requires a structure of size $\Omega(n)$.

We are often happy with an approximate count obtained using a small working memory.

We want to be able to compute and maintain a small sketch $s(N)$ of the set $N$ of distinct items seen so far.

Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

$N = \{32, 12, 14, 7, 6, 4\}$

Wanted: Distinct Elements Sketch

Small size: $|s(N)| \ll |N| = n$

Can query $s(N)$ to get a good estimate $\hat{n}(s)$ of $n$ (small relative error)

Streaming: for a new element $x$, it is easy to compute $s(N \cup \{x\})$ from $s(N)$ and $x$

Mergeability: if $N_1$ and $N_2$ are (possibly overlapping) sets, then we can compute the union sketch $s(N_1 \cup N_2)$ from $s(N_1)$ and $s(N_2)$

MinHash Sketch: [Flajolet & Martin 85, …]

Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

$h(x) \sim U[0,1]$: $h$ is a random hash function from keys to uniform random numbers in $[0,1]$

Maintain the Min-Hash value $y$:

Initialize $y \leftarrow 1$

Processing an element with key $x$: $y \leftarrow \min\{y, h(x)\}$

Distinct Elements: Approximate Counting

x      32    12    14    32    7     12    32    7     6     12    4
h(x)   0.45  0.35  0.74  0.45  0.21  0.35  0.45  0.21  0.14  0.35  0.92
y      0.45  0.35  0.35  0.35  0.21  0.21  0.21  0.21  0.14  0.14  0.14
n      1     2     3     3     4     4     4     4     5     5     6

Initially $y = 1$ and $n = 0$.

The minimum hash value $y = \min h(x)$ is non-increasing and unaffected by repeated elements.

Precise relation: $E[y] = \frac{1}{n+1}$

Distinct Elements: Approximate Counting

How does the minimum hash $y$ give information on the number of distinct elements $n$?

[Figure: the interval from 0 to 1 with the minimum hash value marked.]

The expectation of the minimum is $E[\min h(x)] = \frac{1}{n+1}$

A single value gives only limited information. To boost information, we maintain $k \ge 1$ values.
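One way to maintain k values is to keep k independent minima, one per hash function, and invert the relation E[y] = 1/(n+1) on their mean. The sketch below is an illustration under explicit assumptions: the hash family built on `hashlib.blake2b` with a per-index salt, the class name `KMinsSketch` and the simple estimator `1/mean - 1` are choices made here, not the construction used by ANF or HyperANF (which rely on Flajolet-Martin and HyperLogLog counters).

```python
import hashlib

def h(x, i):
    """Assumed hash family: map key x to a pseudo-uniform number in [0, 1]
    using the i-th salted hash."""
    digest = hashlib.blake2b(repr(x).encode(), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big") / 2**64

class KMinsSketch:
    """Keeps the minimum hash value under each of k hash functions."""

    def __init__(self, k):
        self.y = [1.0] * k                       # initialise every minimum to 1

    def add(self, x):
        self.y = [min(y, h(x, i)) for i, y in enumerate(self.y)]

    def merge(self, other):
        """Sketch of the union of two sets: element-wise minimum."""
        merged = KMinsSketch(len(self.y))
        merged.y = [min(a, b) for a, b in zip(self.y, other.y)]
        return merged

    def size(self):
        """Estimate n by inverting E[y] = 1/(n+1) on the mean of the k minima."""
        mean = sum(self.y) / len(self.y)
        return 1.0 / mean - 1.0

stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]
sketch = KMinsSketch(k=64)
for x in stream:
    sketch.add(x)
print(sketch.size())
```

The sketch supports exactly the two primitives needed by the diffusion computation: `add` (or `merge`) and `size`. Repeated keys leave it unchanged, and merging two sketches gives the sketch of the union, so the set union in the ball recurrence can be replaced by `merge`.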


Advantages

• Scalability: a minimum of 20 bytes per node

• On a 2TiB machine, 100 billion nodes

• The algorithms can be implemented on scalable architectures for big data (Hadoop, Spark)

Comparing approaches

THANK YOU!!
