COMP6237 – Data Mining and Networks, Markus Brede

COMP6237 – Data Mining and Networks

Markus Brede

[email protected]

Lecture slides available here:

http://users.ecs.soton.ac.uk/mb8/stats/datamining.html


Outline

● Why?
  – The WWW is a major application of data mining
    ● Search engines, etc.
  – Social networks are a fertile source of information for data mining
● Agenda:
  – Applications
  – Some graph theory
  – The structure of the WWW – can we understand it?
  – Preferential attachment and implications
  – Summary

Applications ...

● In the last decade the structure of many real world networked systems has been captured and described – often networks from very different contexts share common properties.

● Roughly, applications can be classified as follows:
  – Technological networks
    ● Internet, telephone networks
    ● Power grids, infrastructure networks
  – Biological networks
    ● Ecological/biochemical/neural
  – Social networks
  – Networks of information
    ● WWW, citation networks, ...

Power Grids

Vulnerability to attacks? Distributed energy generation? Electric vehicles?

Airline Transportation Networks

Understand flows of people → analysis of migration, epidemics.

Trade Networks (1)

● Nodes = countries
● Links = $ trade flows between countries

[Figure: trade between Australia and China; link labels: “textiles, toys, car parts, ...” and “iron ore, coal, gas, ...”; b=1.3]

Understand patterns in economic activity → better understanding of economic growth.

Trade Networks (2)

M. Brede and F. Boschetti, Lect. Notes of the Inst. f. Comp. Sci. 4, 1093-1104 (2009)

Ecosystems

Ecosystems and Food Webs

Vulnerability of ecosystems?

Gene Regulatory Networks

[Figure labels: Gene 1, RNA 1, Protein 1; Gene 2, RNA 2, Protein 2]

Better understanding of diseases, etc.

“Climate Networks”

nodes = grid cells, links = significant correlations in T-field

A. A. Tsonis and K. L. Swanson, Phys. Rev. Lett. 100, 228502 (2008)

Konigsberg Bridge Problem

● Graph theory goes back to Euler (1736) and the Konigsberg bridge problem
● Find a walk through the city that crosses each bridge once and only once

Konigsberg Bridge Problem (2)

Some Formalities

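• Graph G = (V,E), V … set of vertices, E … set of edges

undirected graph: edges are unordered pairs {u,v}

directed graph: edges are ordered pairs (u,v)

path from u to v, P(u,v): sequence of edges (u,a), (a,b), …, (x,v)

Eulerian path: path that traverses every edge exactly once

cycle: closed path, i.e. a path that returns to a vertex it has already visited

length of a path: number of edges traversed along it

adjacency matrix A = (a_ij); for the directed example graph on nodes 1–4, with a_ij = 1 if there is an edge from j to i:

$$A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

(A^n)_kl … number of paths of length n between k and l (with the convention above, from l to k)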
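The path-counting property of matrix powers can be checked on the example graph. This is a minimal sketch: it assumes the matrix on this slide reads row by row as (0 0 0 1), (1 0 1 1), (0 1 0 0), (0 0 1 0), and that a_ij = 1 marks an edge from j to i, which matches the in-lists on the following slide. The helper names `mat_mul` and `mat_pow` are illustrative.

```python
# Directed example graph: A[i][j] = 1 if there is an edge from node j+1 to node i+1
# (assumed convention; nodes 1..4 are stored at indices 0..3).
A = [
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]


def mat_mul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(X, n):
    """Return X**n for an integer n >= 1 by repeated multiplication."""
    result = X
    for _ in range(n - 1):
        result = mat_mul(result, X)
    return result


A2 = mat_pow(A, 2)
# With this convention, A2[k][l] counts walks of length 2 that start at
# node l+1 and end at node k+1.
walks_1_to_3 = A2[2][0]  # the single walk 1 -> 2 -> 3
walks_4_to_2 = A2[1][3]  # the single walk 4 -> 1 -> 2
```

Strictly speaking the powers count walks, which may revisit vertices and edges; for short lengths on small graphs the distinction is easy to check by hand.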

How to “Store a Network”?

● Adjacency matrices (as in previous slide)
● Adjacency lists
  – In-lists: 1: 4; 2: 1,3,4; 3: 2; 4: 3
  – Out-lists: 1: 2; 2: 3; 3: 2,4; 4: 1,2
  – Both?
● Adjacency matrix vs adjacency lists
  – Memory: expensive (matrix) vs efficient (lists)
  – Direct access: fast (matrix) vs slow (lists)
  – Neighbour access: slow (matrix) vs fast (lists)
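As a small illustration of the list representation, the sketch below builds both in-lists and out-lists from an edge list. The edge list is read off the out-lists above; the function name `build_lists` is illustrative.

```python
from collections import defaultdict

# Directed edges (source, target) of the 4-node example, read off the out-lists.
edges = [(1, 2), (2, 3), (3, 2), (3, 4), (4, 1), (4, 2)]


def build_lists(edge_list):
    """Return (out_lists, in_lists) as dictionaries mapping node -> list of neighbours."""
    out_lists, in_lists = defaultdict(list), defaultdict(list)
    for u, v in edge_list:
        out_lists[u].append(v)  # u points to v
        in_lists[v].append(u)   # v is pointed to by u
    return dict(out_lists), dict(in_lists)


out_lists, in_lists = build_lists(edges)
```

Keeping both lists doubles the memory but makes both "who do I link to?" and "who links to me?" a direct lookup.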


Network Metrics

● How would we “measure” a network?
  – Number of nodes/density of connections
    ● Sparse vs. dense?
  – Number of components?
  – Importance of nodes (centrality)?
  – “Distances”: average path lengths/diameters, large vs. small?
  – Heterogeneity
    ● Degree distributions? Narrow vs. broad?
  – Mixing
    ● Assortative vs. disassortative
  – Local structure
    ● Local coherence (clustering), motifs ...

Components

● First question might be: is the network connected?

● Maximal sets of connected nodes are called components/clusters.
  – Here we have three components, (1,2,3,4) (5,6) (7)
● How to find components? E.g. breadth-first search.
● The typical network we are interested in will have one giant component and a number of very small other components; one often focuses on the giant component and ignores the rest.
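A breadth-first search for components might be sketched as follows. The edges inside each component are not given on the slide, so the adjacency structure below is a hypothetical choice that merely reproduces the three components (1,2,3,4), (5,6), (7).

```python
from collections import deque


def connected_components(adj):
    """Return the components of an undirected graph given as {node: set of neighbours}."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search from every node not yet assigned to a component.
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(comp))
    return components


# Hypothetical edges, chosen only to match the components named on the slide.
adj = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}, 5: {6}, 6: {5}, 7: set()}
components = connected_components(adj)
giant = max(components, key=len)
```

Each node and edge is visited once, so the cost grows linearly with the size of the network.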


Strong/Weak Components

● A bit more subtle for directed networks

● Weak components: ignore direction of links
  – E.g. (1,2,3,4,5,6,7) is one component
● Strong components: maximal sets of nodes that can all be reached from each other along directed links.
  – (1,2,3,4), (5), (6), (7) are separate components
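One simple way to find strong components is to intersect, for each node, the set of nodes it can reach with the set of nodes that can reach it. The sketch below does this; the directed edges are a hypothetical choice that reproduces the components listed above, and the approach is deliberately naive (Tarjan's or Kosaraju's algorithm would be the efficient alternative).

```python
def reachable(adj, start):
    """Set of nodes reachable from `start` by following links in `adj`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def strong_components(out_adj):
    """Strong components of a directed graph given as {node: list of out-neighbours}."""
    nodes = set(out_adj) | {v for vs in out_adj.values() for v in vs}
    # Reverse every link to obtain the in-neighbour lists.
    in_adj = {u: [] for u in nodes}
    for u, vs in out_adj.items():
        for v in vs:
            in_adj[v].append(u)
    remaining, comps = set(nodes), []
    for u in sorted(nodes):
        if u in remaining:
            # Nodes that u can reach AND that can reach u.
            comp = reachable(out_adj, u) & reachable(in_adj, u)
            comps.append(sorted(comp))
            remaining -= comp
    return comps


# Hypothetical directed edges: a cycle 1->2->3->4->1 plus 4->5, 5->6, 7->6.
out_adj = {1: [2], 2: [3], 3: [4], 4: [1, 5], 5: [6], 7: [6]}
strong = strong_components(out_adj)
```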


Degrees

● A first go at characterising a network might be to count how many neighbours each node has (degree)
  → this gives a sequence, the degree sequence
● For large networks, we might be interested in a statistical characterisation, i.e. degree distributions
● If the network is directed, we have in-degree and out-degree distributions

[Figure: example graph on nodes 1–5 with degree sequence (1,3,2,3,1)]
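Computing a degree sequence and the corresponding degree distribution takes only a few lines. The edge list below is an assumption: the figure's edges are not recoverable, so these five edges are simply one graph whose degree sequence is (1,3,2,3,1).

```python
from collections import Counter

# Hypothetical undirected edges giving the degree sequence (1,3,2,3,1).
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5)]


def degree_sequence(edge_list):
    """Degrees of all nodes, ordered by node label."""
    deg = Counter()
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return [deg[n] for n in sorted(deg)]


def degree_distribution(degrees):
    """Fraction P(k) of nodes that have degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: counts[k] / n for k in sorted(counts)}


degrees = degree_sequence(edges)
P = degree_distribution(degrees)
```

Note that nodes without any edge never appear in an edge list, so isolated nodes would have to be added separately.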

Degree Distributions – Examples

● How would the following networks “look”?

[Plot: P(k) against k] E.g.: regular graph

Degree Distributions – Examples (2)

● Homogeneous degree dist.

[Plot: P(k) against k, with an example network]

All nodes roughly “equal”

Degree Distributions – Examples (3)

● Heterogeneous degree dist.'s

[Plot: P(k) against k, with an example network]

Nodes with very different properties present, e.g. some with very small degrees and some with very large degrees

Degree Distributions

● How to measure this unevenness in degrees?
  – E.g. by the variance of the degree distribution (or other measures we use to characterise distributions)
  – For real world examples these distributions often follow a power law with cut-off (later):

$$P(k) \propto k^{-\gamma} f(k)$$

    (e.g. $f(k)=\exp(-k)$ or a similarly fast decaying function)
  → Can often use the exponent $\gamma$ to characterise the distribution.

Distances

● There is a simple way to define distances on graphs: the minimum number of “hops” it takes to get from one node to another
● E.g.: d(1,2)=1; d(1,5)=3, etc.
● Set the distance between disconnected nodes to infinity, e.g. d(7,5)=∞


APLs and Diameters

● If we have a given network, we can characterise its “extent” by
  – The average shortest path length (APL)
    ● i.e. we calculate the shortest path length between all pairs of nodes and average over it
    ● For very large networks we might want to sample only a selection of pairs of nodes
  – Its diameter
    ● i.e. the maximum distance between any pair of nodes
    ● Quite difficult to calculate, since it is an “extremal” property
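Both quantities follow from running a breadth-first search from every node. The sketch below is one straightforward way to do it, assuming an undirected graph with at least one edge stored as `{node: set of neighbours}`; pairs in different components (distance infinity) are simply left out of the average. The 4-node path graph is a hypothetical test case.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def apl_and_diameter(adj):
    """Average shortest path length and diameter over all connected pairs."""
    lengths = []
    for u in adj:
        dist = bfs_distances(adj, u)
        # Each pair is counted twice (u->v and v->u), which leaves the mean unchanged.
        lengths.extend(d for v, d in dist.items() if v != u)
    return sum(lengths) / len(lengths), max(lengths)


# Hypothetical example: the path graph 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
apl, diameter = apl_and_diameter(path)
```

For very large networks the loop over all sources can be replaced by a loop over a random sample of sources, as suggested above.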

Distances

● Milgram's small-world experiment (1960s)
  – Milgram sent 96 packages to randomly selected individuals from the phone directory in Omaha; each package contained the name of a target individual and their address in Boston
  – Individuals were asked to pass the package on to someone they knew on a first-name basis who might be closer to the target
  – 18 packages reached the target
  – Mean path length was 5.9! (the famous “6 degrees of separation”)
  – This is quite remarkable also in terms of navigation …

Milgram's Experiment

Transitivity/Clustering

● In maths transitivity usually implies: a ~ b and b ~ c → a ~ c

● In network science one often uses “~” = connected by an edge, i.e. if a is connected to b and b to c then also a should be connected to c

● Perfect transitivity only in cliques, partial transitivity is of more interest

[Figure: an open triad and a closed triad on nodes a, b, c]

Localised Clustering Coefficients

• Clustering coefficient of a node
  – CC(node) = fraction of pairs of friends that are friends of each other

$$CC = \frac{2 \cdot \#\text{links between friends}}{k(k-1)}$$

• Example: [Figure: Markus with friends David, John, Brian and Peter]
  – k(Markus)=4, #links between friends=2, CC(Markus)=4/(4*3)=1/3
• In the context of social networks missing triadic closures are called structural holes
  – Might impede flow of information or give power to a node
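The Markus example can be reproduced in a few lines. The slide only states that two links exist among the four friends; which pairs are linked is an assumption here (David–John and Brian–Peter).

```python
def local_clustering(adj, node):
    """CC = 2 * (#links between neighbours) / (k * (k - 1)) for one node."""
    friends = adj[node]
    k = len(friends)
    if k < 2:
        return 0.0  # convention chosen here for nodes with fewer than two neighbours
    # Count each linked pair of friends once (u < v).
    links = sum(1 for u in friends for v in friends if u < v and v in adj[u])
    return 2 * links / (k * (k - 1))


# Hypothetical friendship network; only k=4 and "2 links between friends"
# are taken from the slide.
adj = {
    "Markus": {"David", "John", "Brian", "Peter"},
    "David": {"Markus", "John"},
    "John": {"Markus", "David"},
    "Brian": {"Markus", "Peter"},
    "Peter": {"Markus", "Brian"},
}
cc_markus = local_clustering(adj, "Markus")
cc_david = local_clustering(adj, "David")
```

Averaging `local_clustering` over all nodes gives one common network-level clustering coefficient.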

Motifs in Complex Networks

(from Milo R et al. (2002). Science 298 (5594): 824–827. )

Homophily

High school friendship: James Moody, Race, school integration, and friendship segregation in America, American Journal of Sociology 107, 679-716 (2001).

Nodes are colour coded by race (black, white, other)

Mixing is clearly not random.

Most “social” characteristics in social networks mix like this.

How do we measure such mixing patterns?

Assortment by Degree

● Of particular interest is mixing by degree.
● To what extent does degree dictate the position of edges?

$$r = \frac{\sum_{ij} \left(a_{ij} - k_i k_j / 2L\right) k_i k_j}{\sum_{ij} \left(k_i \delta_{ij} - k_i k_j / 2L\right) k_i k_j}$$

r>0 (assortative network), r<0 (disassortative network)

Tends to have core-periphery structure
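The formula for r can be evaluated directly from an edge list. The sketch below assumes an undirected simple graph and uses the identities $\sum_{ij} a_{ij} k_i k_j = 2\sum_{\text{edges}} k_u k_v$, $\sum_{ij} k_i\delta_{ij} k_i k_j = \sum_i k_i^3$ and $\sum_{ij} (k_i k_j)^2 = (\sum_i k_i^2)^2$; the function name is illustrative.

```python
def degree_assortativity(edges):
    """Degree assortativity r of an undirected simple graph given as an edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_L = 2 * len(edges)
    # sum_ij a_ij k_i k_j: every undirected edge appears twice in the double sum.
    s_edges = 2 * sum(deg[u] * deg[v] for u, v in edges)
    s2 = sum(k * k for k in deg.values())   # sum_i k_i^2
    s3 = sum(k ** 3 for k in deg.values())  # sum_i k_i^3
    numerator = s_edges - s2 * s2 / two_L
    denominator = s3 - s2 * s2 / two_L
    return numerator / denominator


# A star (one hub, three leaves) is as disassortative as a network can be.
star = [(0, 1), (0, 2), (0, 3)]
r_star = degree_assortativity(star)

# A path 1 - 2 - 3 - 4 is mildly disassortative.
r_path = degree_assortativity([(1, 2), (2, 3), (3, 4)])
```

For a regular graph all degrees are equal, the denominator vanishes and r is undefined; the sketch does not guard against that case.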

Mining Network Structures

● Suppose we have a data set for a network and want to data mine it for “peculiarities”. How to go about it?
  – OK, we can measure various quantities, e.g. those explained on the preceding slides
  – We might then conclude that the network has an APL of 3.5, a clustering coefficient of 0.2 and is quite heterogeneous
  – Problem:
    ● Is 3.5 small? Does 0.2 mean the network is highly clustered?
    ● Some of these quantities might be interdependent (e.g. the fact that the network is heterogeneous might cause the small size or similar)
● Need some kind of reference model

Reference Models (1)

● Various approaches:
  – Randomization: “rewire” edges in such a way that certain properties are preserved, but all other correlations are destroyed.
  – E.g.: a rewiring that preserves the degree distribution (potential problems with ergodicity)
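One common degree-preserving randomization is the double-edge swap: pick two edges (a,b) and (c,d) and replace them by (a,d) and (c,b). The sketch below is a minimal version for undirected simple graphs; the function name, the rejection rule for self-loops and duplicate edges, and the cap on attempts are choices made here, not taken from the slides.

```python
import random


def rewire(edges, n_swaps, seed=0):
    """Randomise an undirected simple graph by double-edge swaps.

    Every node keeps its degree; swaps that would create a self-loop
    or a duplicate edge are rejected.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible reconnections at random
        if len({a, b, c, d}) < 4:
            continue  # the edges share a node: swapping could create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# Hypothetical example: randomise a ring of 10 nodes.
ring = [(i, (i + 1) % 10) for i in range(10)]
randomised = rewire(ring, n_swaps=20)
```

Measuring, say, the clustering coefficient on many such randomised copies gives a baseline against which the value of the original network can be judged.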


● Build (analytical) models of networks in which some properties are fixed, but the rest is random
  – Exponential random graphs
    ● Define some ensemble of graphs P(G).
    ● Suppose on average a graph has some property x
      → MaxEnt approach: demand that the entropy of the distribution is maximised subject to constraints
    ● This leads to exponential distributions of the type

$$P(G) = \frac{1}{Z} \exp(-bH(G)) \quad\text{with}\quad H(G) = \sum_i b_i x_i(G)$$

  – Used in the quantitative social sciences
  – Analytically very challenging, often simulated.


● Third option:
  – Compare to a certain set of standard models
  – Typically used: random graphs (Erdos-Renyi random graphs)

E.g.: consider a set of N nodes and connect each pair of nodes with probability p.
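The construction just described translates almost literally into code. This is a minimal sketch with an illustrative function name; it loops over all N(N-1)/2 pairs, which is fine for small N but too slow for very large sparse graphs.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Connect each of the n*(n-1)/2 node pairs independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


# Example: 100 nodes, connection probability 0.05.
edges = erdos_renyi(100, 0.05, seed=1)
```

On average such a graph has p·N(N-1)/2 edges, i.e. a mean degree of p(N-1).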

Erdos-Renyi Random Graphs

● First studied by Solomonoff and Rapoport, but named after Paul Erdos and Alfred Renyi to honour their contributions in the 1950s and 60s
● One of the best studied graph models: in spite of its simplicity it has some interesting properties

Summary

● So, roughly we can characterise a network by:
  – Its degree distribution
    ● Broad: diversity of nodes, some with many neighbours and others with few neighbours present
    ● Narrow: all nodes have roughly the same number of neighbours
  – Its APL/diameter
    ● Large/small: many/few hops required to get from A to B
  – Clustering
    ● Its local cohesiveness: are friends of friends typically friends of each other?
  – Mixing patterns
    ● Are high degree nodes neighbours of each other?
● … and we have some idea how to analyse networks

A reason for “network theory”

● Most of these systems share certain structures
  – Why is this so? (Universal principles?)
  – What does it imply for system function?
● A “structural” perspective on systems
  – Simple interactions but complex structure?
● What models of networks do we have?
  – Understand what shapes system structure
  – How does structure relate to function?

Some prominent network types

● Spatial grids: regular, large, cliquish
● Small worlds: ~regular, small, cliquish
● Random graphs: ~regular, small, not cliquish
● Scale-free networks: v. heterogeneous, ultrasmall

Scale-free Networks

Barabasi and Albert, Science 286, 509-512 (1999)

[Figure: partial map of the internet (Jan 2015); nodes are IP addresses, length of lines indicative of delay in connections between nodes]

Scale-free Networks

Often hub nodes strongly influence system behaviour

Exponent $\gamma$ usually in [2,3]

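Scale-free Networks

$$P(k) \propto k^{-\gamma}$$

● Degree distribution
  – Why scale-free?
    ● Power laws have no inherent length scale
    ● “Self-similarity”: $f(\alpha x) = c\, f(x)$ (can take c copies of the system at scale x to generate the system at scale $\alpha x$)
  – Heavy tails

$$E[k] = \int_{k_{min}}^{\infty} k\, p(k)\, dk = C \int_{k_{min}}^{\infty} k^{-\gamma+1}\, dk = \frac{C}{2-\gamma}\left[k^{2-\gamma}\right]_{k_{min}}^{\infty} = \frac{1-\gamma}{2-\gamma}\, k_{min}$$

(finite for $\gamma > 2$, using the normalisation $C = (\gamma-1)\, k_{min}^{\gamma-1}$)

    ● For $\gamma \le 2$: $E[k] = \infty$
    ● For $\gamma \le 3$: $V[k] = \infty$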

http://www.wired.com/2004/10/tail/

An aside: Detecting Power Laws in Data

● Estimating the power law exponent from data:
  – Bad: fit a line on log-log axes using least squares regression
  – Plot the complementary CDF $P(X>x)$, i.e. if $P(x) \propto x^{-\alpha}$ then $P(X>x) \propto x^{-(\alpha-1)}$
  – Use MLE:

$$\alpha = 1 + N \left[\sum_{i=1}^{N} \ln\frac{k_i}{k_{min}}\right]^{-1}$$

(cf. Clauset, Shalizi, Newman (2007))
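The maximum likelihood estimate above is a one-liner once the data are restricted to values at or above k_min. The sketch below implements that formula as written (the continuous-variable form); the function name and the choice to discard values below `k_min` are assumptions made here.

```python
import math


def powerlaw_mle(values, k_min):
    """alpha = 1 + N / sum(ln(k_i / k_min)), using only the values with k_i >= k_min."""
    tail = [k for k in values if k >= k_min]
    n = len(tail)
    log_sum = sum(math.log(k / k_min) for k in tail)
    return 1.0 + n / log_sum


# Sanity check: if every observation equals e * k_min, each log term is 1,
# so the estimate is 1 + N / N = 2.
alpha_check = powerlaw_mle([math.e] * 5, k_min=1.0)
```

The estimate is undefined if all observations equal `k_min` (the sum of logarithms is zero), and choosing `k_min` itself is a separate problem that the sketch does not address.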

Scale-free Networks

• Mechanisms:
➔ Preferential attachment (Barabasi and Albert 1999, Price 1965): new vertices preferentially form links to old vertices with already high degrees, i.e. an old vertex v with degree d(v) is chosen with probability proportional to d(v):

$$P(v) = \frac{d(v)}{\sum_{v'} d(v')}$$

● In principle, this result was already known from Herbert Simon's work: power laws arise from a “rich get richer” effect
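A growth process with preferential attachment is easy to simulate. The sketch below assumes the simplest case, in which every new node forms exactly one link and the seed network is a single link between nodes 0 and 1; both the seed and the function name are choices made here.

```python
import random


def preferential_attachment(n_nodes, seed=0):
    """Grow a network in which each new node links to one existing node,
    chosen with probability proportional to that node's degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]  # assumed seed network: a single link
    # Every node appears in `endpoints` once per link end, so a uniform draw
    # from this list is a degree-proportional draw of a node.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


edges = preferential_attachment(2000)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
fraction_degree_one = sum(1 for k in degree.values() if k == 1) / len(degree)
```

The list-of-endpoints trick avoids recomputing the normalising sum of degrees at every step, so growing N nodes takes time proportional to N.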

Preferential Attachment

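● Is there a simple way to understand this result?
● How many links at time t?

$$L(t+1) = L(t) + 1 \quad\text{(every new node forms one connection)} \quad\Rightarrow\quad L(t) = L_0 + t$$

● Consider the number of nodes with degree 1 at time t+1 of the evolution

$$\langle N_1 \rangle(t+1) = \underbrace{\langle N_1 \rangle(t)}_{\text{already there}} + \underbrace{1}_{\text{new node}} - \underbrace{\frac{\langle N_1 \rangle(t)}{\sum_k k \langle N_k(t) \rangle}}_{\text{new node links with degree-1 node}} = \langle N_1 \rangle(t) + 1 - \frac{\langle N_1 \rangle(t)}{2L}$$

(using the handshake lemma $\sum_i k_i = 2L$)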


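● For large t we expect $\langle N_1 \rangle(t) = n_1 t$
● Hence:

$$n_1 (t+1) = n_1 t + 1 - \underbrace{\frac{n_1 t}{2(L_0+t)}}_{\to\, n_1/2 \text{ for } t \to \infty} \quad\Rightarrow\quad n_1 = 2/3$$

● What about nodes of degree k?

$$\langle N_k \rangle(t+1) = \underbrace{\langle N_k \rangle(t)}_{\text{already there}} + \underbrace{\frac{(k-1)\langle N_{k-1} \rangle}{2L}}_{\text{nodes of degree } k-1 \text{ become nodes of degree } k} - \underbrace{\frac{k \langle N_k \rangle}{2L}}_{\text{nodes of degree } k \text{ become nodes of degree } k+1}$$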


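● Assuming again that $\langle N_k \rangle(t) = n_k t$ for $t \gg 1$:

$$n_k (t+1) = n_k t + \tfrac{1}{2}(k-1)\, n_{k-1} - \tfrac{1}{2} k\, n_k$$

$$n_k = \frac{k-1}{2+k}\, n_{k-1} = \frac{(k-1)(k-2)}{(2+k)(1+k)}\, n_{k-2}$$

$$n_k = \frac{(k-1)(k-2)(k-3)(k-4)(k-5)(k-6) \cdots 2 \cdot 1}{(2+k)(1+k)k(k-1)(k-2)(k-3) \cdots 5 \cdot 4}\, n_1$$

$$n_k = \frac{3 \cdot 2 \cdot 1}{k(k+1)(k+2)} \cdot \frac{2}{3} \propto k^{-3} \quad\text{(for large k)}$$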
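The closed form can be double-checked against the recursion with exact rational arithmetic. The sketch below assumes the recursion $n_k = \frac{k-1}{k+2} n_{k-1}$ with $n_1 = 2/3$, as derived above, and compares it with $n_k = 4/(k(k+1)(k+2))$; the function names are illustrative.

```python
from fractions import Fraction


def n_k_recursive(k_max):
    """Iterate n_k = (k-1)/(k+2) * n_{k-1}, starting from n_1 = 2/3."""
    n = {1: Fraction(2, 3)}
    for k in range(2, k_max + 1):
        n[k] = Fraction(k - 1, k + 2) * n[k - 1]
    return n


def n_k_closed(k):
    """Closed form (3*2*1)/(k(k+1)(k+2)) * 2/3 = 4/(k(k+1)(k+2))."""
    return Fraction(4, k * (k + 1) * (k + 2))


n = n_k_recursive(50)
matches = all(n[k] == n_k_closed(k) for k in n)
partial_sum = sum(n.values())  # approaches 1 as k_max grows
```

The partial sums telescope to $1 - 2/((K+1)(K+2))$, so the $n_k$ form a properly normalised degree distribution with a $k^{-3}$ tail.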


Consequences of Scale-Free Degree Distributions

● (BA or random) SF networks are generally “ultra-small”, i.e. $d \propto \log \log N$ (i.e. very quick to traverse)
● Clustering?
  – BA or random SF NWs are not cliquish (in the sense that CC → const.>0 for N>>1)
  – E.g. RANs or some networks grown by optimization have high clustering

Attack Tolerance and Percolation

● Can reformulate this somewhat:
  – Does a connecting path from top to bottom exist?
  – Does a giant component (a component containing a finite fraction of all nodes) exist?
  – Alternatively: measure average path lengths
● Take a given network
  – One usually argues the (e.g. transport) system is more or less functional if it has a giant component
  – Remove some fraction q of nodes (links)
    ● At random: simulates random failure
    ● Targeting certain nodes (typically hub nodes): simulates targeted attacks
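The removal experiment can be sketched with two small helpers: one that measures the largest component after deleting a set of nodes, and one that picks the highest-degree nodes as attack targets. The names and the star-shaped test network are illustrative assumptions; a random failure would simply pass a random sample of nodes as `removed`.

```python
def giant_component_size(adj, removed=frozenset()):
    """Size of the largest component after deleting the nodes in `removed`."""
    seen, best = set(removed), 0  # removed nodes are treated as already visited
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def targeted_attack(adj, n_remove):
    """The n_remove nodes with the highest initial degree."""
    return set(sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:n_remove])


# Hypothetical example: a star with hub 0 and leaves 1..5.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
intact = giant_component_size(star)
after_leaf_failure = giant_component_size(star, removed={1})
after_hub_attack = giant_component_size(star, removed=targeted_attack(star, 1))
```

In this extreme example losing a leaf barely matters while losing the hub shatters the network, which is the qualitative contrast between random failure and targeted attack.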

Attack Tolerance and Percolation

● Percolation:
  – One of the simplest models of (2nd order) phase transitions
  – Consider a 2d lattice with empty and occupied sites (links=bonds), occupy sites (bonds) with some probability p
  – Existence of a sharp threshold

Attack Tolerance of SF NWs

● Scale free networks are very robust to random attacks, but extremely vulnerable to targeted attacks

(Albert and Barabasi 2000)

A Side Note on Preferential Attachment and Predicting Preferences

● Preferentially choosing what others have chosen before is quite a common phenomenon when making choices

→ This can “obscure” quality signals and introduces a lot of noise

● Chance events (i.e. who came first, was spotted first, … etc.) strongly influence what gets “locked in” as popular later

● Makes it difficult for newcomers

→ … can also be exploited when social networks are known (i.e. choice of a node is likely to be similar to that of neighbours)

Prediction in Artificial Cultural Markets

● Salganik, Dodds and Watts did a very nice study on this (Science 311, 854 (2006)):
  – Created an “artificial music market”, comprising songs from bands unknown to participants.
  – Participants form two groups:
    ● A reference group: people see band names and song titles. They can pick songs to listen to, then give each a rating and have a choice to download the song
      – Gives a measure of quality of the songs ...
    ● A socially influenced group: people additionally see how many times others have downloaded songs before
      – Two scenarios: songs presented in random order (A) or ordered by popularity (B)

– 8 parallel worlds

Unevenness/Predictability

● Social influence makes prediction much harder; especially so when choices are presented ranked

● Nevertheless there is a correlation between popularity and quality: bad songs never became very popular, good ones never very unpopular. Most affected: the middle range of qualities.

Summary

● What is a network?
  – Application areas
  – Network representation
  – Degree (distribution), components, path lengths, clustering, etc.
● Scale-free networks
  – What are they?
  – Preferential attachment
  – Preferential attachment and prediction
