COMP6237 – Data Mining and Networks Markus Brede [email protected] Lecture slides available here: http://users.ecs.soton.ac.uk/mb8/stats/datamining.html


Outline

● Why?
  – The WWW is a major application of data mining
    ● Search engines, etc.
  – Social networks are a fertile source of information for data mining
● Agenda:
  – Applications
  – Some graph theory
  – The structure of the WWW – can we understand it?
  – Preferential attachment and implications
  – Summary

Applications ...

● In the last decade the structure of many real-world networked systems has been captured and described. Networks from very different contexts often share common properties.
● Roughly, applications can be classified as follows:
  – Technological networks
    ● Internet, telephone networks
    ● Power grids, infrastructure networks
  – Biological networks
    ● Ecological/biochemical/neural
  – Social networks
  – Networks of information
    ● WWW, citation networks, ...

Power Grids

Vulnerability to attacks? Distributed energy generation? Electric vehicles?

Airline Transportation Networks

Understand flows of people → analysis of migration, epidemics.

Trade Networks (1)

● Nodes = countries
● Links = $ trade flows between countries

[Figure: trade between Australia and China; link labels: “textiles, toys, car parts, ...” and “iron ore, coal, gas, ...”; figure label: b=1.3]

Understand patterns in economic activity → better understanding of economic growth.

Trade Networks (2)

M. Brede and F. Boschetti, Lect. Notes of the Inst. f. Comp. Sci. 4, 1093-1104 (2009)

Ecosystems

Ecosystems and Food Webs

Vulnerability of ecosystems?

Gene Regulatory Networks

[Figure: schematic with nodes Gene 1, RNA 1, Protein 1 and Gene 2, RNA 2, Protein 2]

Better understanding of diseases, etc.

“Climate Networks”

nodes = grid cells, links = significant correlations in T-field

A. A. Tsonis and K. L. Swanson, Phys. Rev. Lett. 100, 228502 (2008)

Konigsberg Bridge Problem

● Graph theory goes back to Euler (1736) and the Konigsberg bridge problem
● Find a walk through the city that crosses each bridge once and only once
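
Euler's answer can be checked mechanically. A connected multigraph has a walk that uses every edge exactly once only if it has zero or two vertices of odd degree. The sketch below applies that degree test to the seven bridges; the labels A to D for the four land masses and the function name are assumptions made for illustration, and connectivity is assumed rather than checked.

```python
from collections import Counter

def eulerian_walk_possible(edges):
    """Degree test for a connected multigraph: a walk crossing every edge
    exactly once requires 0 or 2 vertices of odd degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd in (0, 2)

# Assumed encoding: A = island, B and C = river banks, D = eastern land mass.
konigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]

print(eulerian_walk_possible(konigsberg))  # all four land masses have odd degree
```

All four land masses have odd degree (5, 3, 3, 3), so no such walk exists.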

Konigsberg Bridge Problem (2)

Some Formalities

• Graph G = (V,E): V … set of vertices, E … set of edges

undirected graph: edges are unordered pairs {u,v}

directed graph: edges are ordered pairs (u,v)

path from u to v, P(u,v): sequence of edges (u,a), (a,b), …, (x,v)

Eulerian path: path that visits every edge exactly once

cycle: path that contains at least one vertex twice

length of a path: number of edges traversed along it

adjacency matrix $A = (a_{ij})$; for the directed example graph on nodes 1, 2, 3, 4:

$$A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

$(A^n)_{kl}$ … number of paths of length n from k to l
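
The path-counting property of matrix powers can be tried on the 4×4 example matrix. The sketch below is a plain-Python illustration with helper names chosen here; the "paths" it counts may revisit vertices (walks). Which index is the start and which the end depends on the orientation convention of A: the in-lists given on the next slide suggest that row i marks the nodes with a link into i.

```python
A = [[0, 0, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(M, n):
    """n-th power of M by repeated multiplication (n >= 0)."""
    size = len(M)
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, M)
    return result

A2 = matpow(A, 2)  # entry [k][l] counts length-2 walks between nodes k+1 and l+1
for row in A2:
    print(row)
```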

How to “Store a Network”?

● Adjacency matrices (as in previous slide)
● Adjacency lists
  – In-lists
    ● 1: 4; 2: 1,3,4; 3: 2; 4: 3
  – Out-lists
    ● 1: 2; 2: 3; 3: 2,4; 4: 1,2
  – Both?
● Adjacency matrix vs. adjacency lists

                        Adjacency matrix    Adjacency lists
  – Memory:             expensive           efficient
  – Direct access:      fast                slow
  – Neighbour access:   slow                fast

[Figure: the directed example graph on nodes 1, 2, 3, 4]
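
As a small sketch of the two representations, the in-lists and out-lists can be derived from the example adjacency matrix. The code assumes the convention that a_ij = 1 denotes a link from node j to node i, which is the reading that reproduces the lists above; the function names are illustrative.

```python
A = [[0, 0, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

def in_lists(A):
    """For each node i, the nodes j with a link j -> i (row i of A)."""
    return {i + 1: [j + 1 for j, a in enumerate(row) if a]
            for i, row in enumerate(A)}

def out_lists(A):
    """For each node j, the nodes i with a link j -> i (column j of A)."""
    n = len(A)
    return {j + 1: [i + 1 for i in range(n) if A[i][j]] for j in range(n)}

print(in_lists(A))
print(out_lists(A))
```

A lookup of whether a particular link exists is a single index operation on the matrix but a scan of a list, while enumerating the neighbours of a node is a scan of a full row or column of the matrix but a direct read of a list.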

Network Metrics

● How would we “measure” a network?
  – Number of nodes/density of connections
    ● Sparse vs. dense?
  – Number of components?
  – Importance of nodes (centrality)?
  – “Distances”: average path lengths/diameters, large vs. small?
  – Heterogeneity
    ● Degree distributions? Narrow vs. broad?
  – Mixing
    ● Assortative vs. disassortative
  – Local structure
    ● Local coherence (clustering), motifs ...

Components

● A first question might be: is the network connected?
● Maximal sets of connected nodes are called components/clusters.
  – Here we have three components: (1,2,3,4), (5,6), (7)
● How to find components? E.g. breadth-first search.
● The typical network we are interested in will have one giant component and a number of very small other components; one often focuses on the giant component and ignores the rest.

[Figure: example network with nodes 1 to 7]
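
A minimal sketch of component finding by breadth-first search follows. The edge list is hypothetical (the links of the figure are not recoverable from the text) and was chosen only so that the result matches the three components named above.

```python
from collections import deque

def connected_components(nodes, edges):
    """Components of an undirected graph, found by repeated breadth-first search."""
    neighbours = {v: set() for v in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            u = queue.popleft()
            component.append(u)
            for w in sorted(neighbours[u]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components

nodes = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (2, 3), (3, 4), (1, 4), (5, 6)]  # hypothetical links
print(connected_components(nodes, edges))
```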

Strong/Weak Components

● A bit more subtle for directed networks
● Weak components: ignore direction of links
  – E.g. (1,2,3,4,5,6,7) is one component
● Strong components: maximal sets of nodes that can all be reached from each other along the direction of the links
  – (1,2,3,4), (5), (6), (7) are separate components

[Figure: directed example network with nodes 1 to 7]

Degrees

● A first go at characterising a network might be to count how many neighbours each node has (its degree)
  → this gives a sequence, the degree sequence
● For large networks, we might be interested in a statistical characterisation, i.e. degree distributions
● If the network is directed, we have in-degree and out-degree distributions

[Figure: example network with nodes 1 to 5 and degree sequence (1,3,2,3,1)]

Degree Distributions – Examples

● How would the following networks “look”?

[Figure: degree distribution P(k) versus k. E.g.: regular graph]

Degree Distributions – Examples (2)

● Homogeneous degree dist.

All nodes roughly “equal”

[Figure: degree distribution P(k) versus k, with an example network]

Degree Distributions – Examples (3)

● Heterogeneous degree dist.'s

Nodes with very different properties present, e.g. some with very small degrees and some with very large degrees

[Figure: degree distribution P(k) versus k, with an example network]

Degree Distributions

● How to measure this unevenness in degrees?
  – E.g. by the variance of the degree distribution (or other measures we use to characterise distributions)
  – For real-world examples these distributions often follow a power law with cut-off (later):

    $$P(k) \propto k^{-\gamma} f(k)$$

    (e.g. $f(k) = \exp(-k)$ or a similarly fast-decaying function)

    → Can often use the exponent $\gamma$ to characterise the distribution.

Distances

● There is a simple way to define distances on graphs: the minimum number of “hops” it takes to get from one node to another
● E.g.: d(1,2)=1; d(1,5)=3, etc.
● Set the distance between disconnected nodes to infinity, e.g. d(7,5)=∞

[Figure: example network with nodes 1 to 7]

APLs and Diameters

● If we have a given network, we can characterise its “extent” by
  – The average shortest path length
    ● i.e. we calculate the shortest path length between all pairs of nodes and average over them
    ● For very large networks we might want to sample only a selection of pairs of nodes
  – Its diameter
    ● i.e. the maximum distance between any pair of nodes
    ● Quite difficult to calculate, since it is an “extremal” property
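
Both quantities can be sketched with one breadth-first search per source node. In the illustration below the function names and the optional `sources` argument (for sampling start nodes in large networks) are assumptions of this sketch, and unreachable pairs are simply skipped rather than counted as infinite. A diameter obtained from a sample of sources is only a lower bound.

```python
from collections import deque

def bfs_distances(neighbours, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in neighbours[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def apl_and_diameter(neighbours, sources=None):
    """Average shortest path length and largest distance over reachable pairs.
    Pass a sample of nodes as `sources` to approximate on large networks."""
    total, pairs, diameter = 0, 0, 0
    for u in (neighbours if sources is None else sources):
        for v, d in bfs_distances(neighbours, u).items():
            if v != u:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter

# Hypothetical example: a chain 1 - 2 - 3 - 4
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(apl_and_diameter(chain))
```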

Distances

● Milgram's small-world experiment (1960s)
  – Milgram sent 96 packages to randomly selected individuals from the phone directory in Omaha; each package contained the name of a target individual and their address in Boston
  – Individuals were asked to pass the package on to someone they knew on a first-name basis who might be closer to the target
  – 18 reached the target
  – The mean length of paths was 5.9! (the famous “6 degrees of separation”)
  – This is quite remarkable also in terms of navigation …

Milgram's Experiment

Transitivity/Clustering

● In maths transitivity usually implies: a ~ b and b ~ c → a ~ c
● In network science one often uses “~” = connected by an edge, i.e. if a is connected to b and b to c, then a should also be connected to c
● Perfect transitivity occurs only in cliques; partial transitivity is of more interest

[Figure: an open triad and a closed triad on nodes a, b, c]

Localised Clustering Coefficients

• Clustering coefficient of a node
  – CC(node) = fraction of pairs of friends that are friends of each other

    $$CC = \frac{2 \cdot \#\text{links between friends}}{k(k-1)}$$

• Example (Markus with friends David, John, Brian and Peter):
  – k(Markus) = 4
  – #links between friends = 2
  – CC(Markus) = 4/(4·3) = 1/3
• In the context of social networks, missing triadic closures are called structural holes
  – Might impede the flow of information or give power to a node
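
The example can be reproduced with a few lines of code. Which two pairs of Markus's friends are linked is not recoverable from the text, so the two friend-to-friend links below (David–John and Brian–Peter) are an assumption; any two such links give the same value.

```python
from itertools import combinations

def clustering_coefficient(neighbours, node):
    """CC = 2 * (#links between friends) / (k * (k - 1)); 0 if k < 2."""
    friends = neighbours[node]
    k = len(friends)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(friends, 2) if b in neighbours[a])
    return 2 * links / (k * (k - 1))

edges = [("Markus", "David"), ("Markus", "John"),
         ("Markus", "Brian"), ("Markus", "Peter"),
         ("David", "John"), ("Brian", "Peter")]  # last two links are assumed

neighbours = {}
for u, v in edges:
    neighbours.setdefault(u, set()).add(v)
    neighbours.setdefault(v, set()).add(u)

print(clustering_coefficient(neighbours, "Markus"))
```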

Motifs in Complex Networks

(from Milo R et al. (2002). Science 298 (5594): 824–827. )

Homophily

High school friendship: James Moody, Race, school integration, and friendship segregation in America, American Journal of Sociology 107, 679-716 (2001).

Nodes are colour coded by race (black, white, other)

Mixing is clearly not random.

Most “social” characteristics in social networks mix like this.

How do we measure such mixing patterns?

Assortment by Degree

● Of particular interest is mixing by degree.
● To what extent does degree dictate the position of edges?

$$r = \frac{\sum_{ij} \left( a_{ij} - k_i k_j / 2L \right) k_i k_j}{\sum_{ij} \left( k_i \delta_{ij} - k_i k_j / 2L \right) k_i k_j}$$

● r > 0: assortative network
● r < 0: disassortative network

Tends to have core-periphery structure
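
The coefficient r can be evaluated directly from an edge list, because the double sums collapse to sums over edges and over powers of the degrees. The sketch below assumes a simple undirected graph without self-loops; the function name and the two test graphs are chosen here for illustration.

```python
def degree_assortativity(edges):
    """r from the formula above for a simple undirected graph.
    Undefined (division by zero) if all degrees are equal."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    two_L = 2 * len(edges)
    # sum_ij a_ij k_i k_j: every undirected edge appears as (i,j) and (j,i)
    s_edges = 2 * sum(degree[u] * degree[v] for u, v in edges)
    s2 = sum(k ** 2 for k in degree.values())   # sum_i k_i^2
    s3 = sum(k ** 3 for k in degree.values())   # sum_ij k_i delta_ij k_i k_j
    expected = s2 * s2 / two_L                  # sum_ij (k_i k_j)^2 / 2L
    return (s_edges - expected) / (s3 - expected)

star = [(0, 1), (0, 2), (0, 3)]   # hub linked to three leaves
path = [(1, 2), (2, 3), (3, 4)]
print(degree_assortativity(star), degree_assortativity(path))
```

A star, where the hub connects only to degree-1 nodes, is the extreme disassortative case with r = −1.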

Mining Network Structures

● Suppose we have a data set for a network and want to data mine it for “peculiarities”. How to go about it?
  – OK, we can measure various quantities, e.g. those explained on the preceding slides
  – We might then conclude that the network has an APL of 3.5, a clustering coefficient of 0.2 and is quite heterogeneous
  – Problem:
    ● Is 3.5 small? Does 0.2 mean the network is highly clustered?
    ● Some of these quantities might be interdependent (e.g. the fact that the network is heterogeneous might cause the small size or similar)
● Need some kind of reference model

Reference Models (1)

● Various approaches:
  – Randomization: “rewire” edges in such a way that certain properties are preserved, but all other correlations are destroyed.
  – E.g.: a rewiring that preserves the degree distribution (potential problems with ergodicity)
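
One common way to realise such a rewiring is the double edge swap: pick two edges (a,b) and (c,d) and replace them by (a,d) and (c,b), rejecting swaps that would create self-loops or duplicate links. The sketch below is a simplified illustration with names chosen here; it keeps the stored orientation of each edge fixed, and it makes no claim about how well the swaps explore the space of graphs.

```python
import random

def degrees(edges):
    """Degree of every node appearing in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def rewire(edges, n_swaps, seed=0):
    """Attempt n_swaps double edge swaps; every node keeps its degree."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if a == d or c == b:
            continue  # swap would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # swap would create a duplicate link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

ring = [(i, (i + 1) % 10) for i in range(10)]  # hypothetical test graph
shuffled = rewire(ring, 100, seed=1)
print(degrees(shuffled) == degrees(ring))
```

Measuring, say, the clustering coefficient on many rewired copies gives a baseline against which the value of the original network can be judged.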

Reference Models (2)

● Build (analytical) models of networks in which some properties are fixed, but the rest is random
  – Exponential random graphs
    ● Define some ensemble of graphs P(G).
    ● Suppose on average a graph has some property x
      → MaxEnt approach: demand that the entropy of the distribution is maximised subject to constraints
    ● This leads to exponential distributions of the type

      $$P(G) = \frac{1}{Z} \exp(-b H(G)) \quad \text{with} \quad H(G) = \sum_i b_i x_i(G)$$

  – Used in the quantitative social sciences
  – Analytically very challenging, often simulated.

Reference Models (3)

● Third option:
  – Compare to a certain set of standard models
  – Typically used: random graphs (Erdos-Renyi random graphs)

E.g.: consider a set of N nodes and connect each pair of nodes with probability p.
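
The construction translates almost literally into code. The following is a minimal sketch (function name and seed parameter are illustrative) that visits every pair of nodes once and keeps it with probability p; it is quadratic in N and therefore only suitable for small graphs.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Edge list of a random graph: each of the n(n-1)/2 pairs is linked
    independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

edges = erdos_renyi(100, 0.05, seed=1)
print(len(edges))  # fluctuates around p * n * (n - 1) / 2
```

The expected number of links is p·N(N−1)/2, so the expected degree of a node is p(N−1).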

Erdos-Renyi Random Graphs

● First studied by Solomonoff and Rapoport, but named after Paul Erdos and Alfred Renyi to honour their contributions in the 1950s and 60s
● One of the best-studied graph models; in spite of its simplicity it has some interesting properties

Summary

● So, roughly we can characterise a network by:
  – Its degree distribution
    ● Broad: diversity of nodes, some with many neighbours and others with few neighbours present
    ● Narrow: all nodes have roughly the same number of neighbours
  – Its APL/diameter
    ● Large/small: many/few hops required to move from A to B
  – Clustering
    ● Its local cohesiveness: are friends of friends typically friends of each other?
  – Mixing patterns
    ● Are high-degree nodes neighbours of each other?
● … and we have some idea how to analyse networks

A reason for “network theory”

● Most of these systems share certain structures
  – Why is this so? (Universal principles?)
  – What does it imply for system function?
● A “structural” perspective on systems
  – Simple interactions but complex structure?
● What models of networks do we have?
  – Understand what shapes system structure
  – How does structure relate to function?

Some prominent network types

● Spatial grids: regular, large, cliquish
● Small worlds: ~regular, small, cliquish
● Random graphs: ~regular, small, not cliquish
● Scale-free networks: v. heterogeneous, ultrasmall

Scale-free Networks

Barabasi and Albert, Science 286, 509-512 (1999)

[Figure: partial map of the internet (Jan 2015). Nodes are IP addresses; the length of lines is indicative of the delay in connections between nodes]

Scale-Free Networks

Often hub nodes strongly influence system behaviour

Exponent γ usually in [2,3]

[Figure annotations: “small”, “large”]

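Scale-free Networks

● Degree distribution: $P(k) \propto k^{-\gamma}$
  – Why scale-free?
    ● Power laws have no inherent length scale
    ● “Self-similarity”: $f(\alpha x) = c\, f(x)$ (can take c copies of the system at scale x to generate the system at scale αx)
  – Heavy tails

    $$E[k] = \int_{k_{\min}}^{\infty} k\, p(k)\, dk = C \int_{k_{\min}}^{\infty} k^{-\gamma+1}\, dk = \frac{C}{2-\gamma} \left[ k^{2-\gamma} \right]_{k_{\min}}^{\infty} = \frac{1-\gamma}{2-\gamma}\, k_{\min}$$

    (the last step holds for $\gamma > 2$ and uses the normalisation $C = (\gamma-1)\, k_{\min}^{\gamma-1}$)

    ● For $\gamma \le 2$: $E[k] = \infty$
    ● For $\gamma \le 3$: $V[k] = \infty$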

http://www.wired.com/2004/10/tail/

An aside: Detecting Power Laws in Data

● Estimating the power law exponent from data:
  – Bad: fit a line on log-log axes using least squares regression
  – Plot the complementary CDF P(X>x), i.e. if $P(x) \propto x^{-\alpha}$ then $P(X > x) \propto x^{-(\alpha-1)}$
  – Use MLE:

    $$\alpha = 1 + N \left[ \sum_{i=1}^{N} \ln \frac{k_i}{k_{\min}} \right]^{-1}$$

(cf. Clauset, Shalizi, Newman (2007))
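
The MLE formula is easy to try on synthetic data. The sketch below draws samples from a continuous power law by inverse-transform sampling and then applies the estimator; the function names, the true exponent 2.5 and the sample size are choices made for this illustration, and the data are treated as continuous rather than as integer degrees.

```python
import math
import random

def powerlaw_mle(samples, k_min):
    """alpha = 1 + N / sum(ln(k_i / k_min)) over all samples with k_i >= k_min."""
    tail = [k for k in samples if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)

def sample_powerlaw(alpha, k_min, n, seed=0):
    """n draws from the continuous density p(x) ~ x^(-alpha), x >= k_min,
    via the inverse of the complementary CDF."""
    rng = random.Random(seed)
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

data = sample_powerlaw(alpha=2.5, k_min=1.0, n=50000)
print(powerlaw_mle(data, k_min=1.0))  # should be close to 2.5
```

In practice k_min is not known in advance and has to be chosen as well, which this sketch does not address.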

Scale-free Networks

• Mechanisms:
  ➔ Preferential attachment (Barabasi and Albert 1999, Price 1965): new vertices preferentially form links to old vertices with already high degrees

    $$P(v) = \frac{d(v)}{\sum_{v'=1}^{N} d(v')} = \frac{d(v)}{2N}$$

    (d(v) is the degree of old vertex v)

● In principle, this result was already known from Herbert Simon's work: power laws arise from a “rich get richer” effect
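
A small simulation makes the mechanism concrete. In the sketch below every new node forms one link, and the target is drawn with probability proportional to its current degree by sampling uniformly from a list in which each node appears once per link end. The starting graph (two linked nodes), the function name and the network size are assumptions of this illustration.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a network node by node; each new node links to one old node
    chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]  # node v appears d(v) times in this list
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

edges = preferential_attachment(20000)
degree = Counter(v for e in edges for v in e)
frac_deg1 = sum(1 for d in degree.values() if d == 1) / len(degree)
print(frac_deg1, max(degree.values()))
```

The fraction of degree-1 nodes should come out near the value n1 = 2/3 obtained in the derivation on the following slides, and a few hubs with very large degree appear.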

Preferential Attachment

● Is there a simple way to understand this result?
● How many links at time t?

$$L(t+1) = L(t) + 1 \quad \text{(every new node forms one connection)}$$

$$L(t) = L_0 + t$$

● Consider the number of nodes with degree 1 at time t+1 of the evolution

$$\langle N_1 \rangle(t+1) = \underbrace{\langle N_1 \rangle(t)}_{\text{already there}} + \underbrace{1}_{\text{new node}} - \underbrace{\frac{\langle N_1 \rangle(t)}{\sum_k k \langle N_k(t) \rangle}}_{\text{new node links with degree-1 node}}$$

$$= \langle N_1 \rangle(t) + 1 - \frac{\langle N_1 \rangle(t)}{2L} \quad \text{(handshake lemma: } \textstyle\sum_i k_i = 2L \text{)}$$

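Preferential Attachment

● For large t we expect $\langle N_1 \rangle(t) = n_1 t$
● Hence:

$$n_1 (t+1) = n_1 t + 1 - \underbrace{\frac{n_1 t}{2(L_0 + t)}}_{\to\, n_1/2 \text{ for } t \to \infty}$$

$$n_1 = 2/3$$

● What about nodes of degree k?

$$\langle N_k \rangle(t+1) = \underbrace{\langle N_k \rangle(t)}_{\text{already there}} + \underbrace{\frac{(k-1) \langle N_{k-1} \rangle}{2L}}_{\text{nodes of degree } k-1 \text{ become nodes of degree } k} - \underbrace{\frac{k \langle N_k \rangle}{2L}}_{\text{nodes of degree } k \text{ become nodes of degree } k+1}$$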

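Preferential Attachment

● Assuming again that $\langle N_k \rangle(t) = n_k t$ for $t \gg 1$

$$\langle N_k \rangle(t+1) = \underbrace{\langle N_k \rangle(t)}_{\text{already there}} + \underbrace{\frac{(k-1) \langle N_{k-1} \rangle}{2L}}_{\text{nodes of degree } k-1 \text{ become nodes of degree } k} - \underbrace{\frac{k \langle N_k \rangle}{2L}}_{\text{nodes of degree } k \text{ become nodes of degree } k+1}$$

$$n_k (t+1) = n_k t + \tfrac{1}{2} (k-1)\, n_{k-1} - \tfrac{1}{2} k\, n_k$$

$$n_k = \frac{k-1}{2+k}\, n_{k-1} = \frac{(k-1)(k-2)}{(2+k)(1+k)}\, n_{k-2}$$

$$n_k = \frac{(k-1)(k-2)(k-3)(k-4)(k-5)(k-6) \cdots 2 \cdot 1}{(2+k)(1+k)\,k\,(k-1)(k-2)(k-3) \cdots 5 \cdot 4}\, n_1$$

$$n_k = \frac{3 \cdot 2 \cdot 1}{k(k+1)(k+2)} \cdot \frac{2}{3} \propto k^{-3} \quad \text{(for large } k\text{)}$$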
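
The closed form can be checked against the recursion with exact rational arithmetic. The sketch below (helper names chosen here) iterates n_k = (k−1)/(k+2) · n_{k−1} from n_1 = 2/3 and compares the result with 4/(k(k+1)(k+2)), which is the last line above with the constants multiplied out.

```python
from fractions import Fraction

def n_k_recursive(k_max):
    """Iterate n_k = (k-1)/(k+2) * n_{k-1}, starting from n_1 = 2/3."""
    n = {1: Fraction(2, 3)}
    for k in range(2, k_max + 1):
        n[k] = Fraction(k - 1, k + 2) * n[k - 1]
    return n

def n_k_closed(k):
    """Closed form (3*2*1)/(k(k+1)(k+2)) * 2/3 = 4/(k(k+1)(k+2))."""
    return Fraction(4, k * (k + 1) * (k + 2))

n = n_k_recursive(50)
print(all(n[k] == n_k_closed(k) for k in n))
print(float(sum(n.values())))  # approaches 1 as k_max grows
```

The n_k sum to 1 in the limit, as a degree distribution should, and for large k the product k³·n_k tends to the constant 4.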

Consequences of Scale-Free Degree Distributions

● (BA or random) SF networks are generally “ultra-small”, i.e. $d \propto \log \log N$ (i.e. very quick to traverse)
● Clustering?
  – BA or random SF NWs are not cliquish (in the sense that CC → const. > 0 for N ≫ 1)
  – E.g. RANs or some networks grown by optimization have high clustering

Attack Tolerance and Percolation

● Can reformulate this somewhat:
  – Does a connecting path from top to bottom exist?
  – Does a giant component (a component containing a finite fraction of all nodes) exist?
  – Alternatively: measure average path lengths
● Take a given network
  – One usually argues the (e.g. transport) system is more or less functional if it has a giant component
  – Remove some fraction q of nodes (links)
    ● At random: simulates random failure
    ● Targeting certain nodes (typically hub nodes): simulates targeted attacks
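
The removal experiment can be sketched as follows: delete a set of nodes, either the highest-degree ones or a random sample, and measure the size of the largest remaining component. All function names are illustrative, and the star graph used at the end is a deliberately extreme toy case rather than a realistic network.

```python
import random
from collections import deque

def largest_component_size(nodes, edges):
    """Size of the largest connected component (0 for an empty graph)."""
    neighbours = {v: [] for v in nodes}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for w in neighbours[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

def remove_nodes(nodes, edges, removed):
    """Graph without the removed nodes and their links."""
    removed = set(removed)
    kept = [v for v in nodes if v not in removed]
    return kept, [(u, v) for u, v in edges
                  if u not in removed and v not in removed]

def hubs(nodes, edges, n_remove):
    """The n_remove nodes of highest degree (targeted attack)."""
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(nodes, key=lambda v: degree[v], reverse=True)[:n_remove]

def random_nodes(nodes, n_remove, seed=0):
    """A uniformly random set of n_remove nodes (random failure)."""
    return random.Random(seed).sample(list(nodes), n_remove)

nodes = list(range(11))
star = [(0, i) for i in range(1, 11)]  # toy network: hub 0 and ten leaves
print(largest_component_size(*remove_nodes(nodes, star, hubs(nodes, star, 1))))
print(largest_component_size(*remove_nodes(nodes, star, [5])))
```

Removing the hub shatters the star into isolated nodes, whereas removing a leaf barely changes the largest component.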

Attack Tolerance and Percolation

● Percolation:
  – One of the simplest models of (2nd order) phase transitions
  – Consider a 2d lattice with empty and occupied sites (links = bonds); occupy sites (bonds) with some probability p
  – Existence of a sharp threshold
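
A minimal site-percolation sketch on a square lattice follows: occupy each site with probability p and test by breadth-first search whether occupied nearest-neighbour sites connect the top row to the bottom row. Lattice size, number of trials and all names are assumptions of this illustration.

```python
import random
from collections import deque

def spans_top_to_bottom(grid):
    """True if occupied sites form a nearest-neighbour path from row 0
    to the last row."""
    n_rows, n_cols = len(grid), len(grid[0])
    queue = deque((0, c) for c in range(n_cols) if grid[0][c])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if r == n_rows - 1:
            return True
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= rr < n_rows and 0 <= cc < n_cols
                    and grid[rr][cc] and (rr, cc) not in seen):
                seen.add((rr, cc))
                queue.append((rr, cc))
    return False

def random_lattice(n, p, seed=0):
    """n x n lattice with each site occupied independently with probability p."""
    rng = random.Random(seed)
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]

def spanning_probability(n, p, trials=200, seed=0):
    """Fraction of random lattices that contain a spanning path."""
    hits = sum(spans_top_to_bottom(random_lattice(n, p, seed + t))
               for t in range(trials))
    return hits / trials

for p in (0.4, 0.6, 0.8):
    print(p, spanning_probability(30, p))
```

For site percolation on the square lattice the threshold lies near p ≈ 0.593, so the spanning probability should rise steeply between the first and the last of these values, and more sharply on larger lattices.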

Attack Tolerance of SF NWs

● Scale-free networks are very robust to random failures, but extremely vulnerable to targeted attacks

(Albert and Barabasi 2000)

A Side Note on Preferential Attachment and Predicting Preferences

● Preferentially choosing what others have chosen before is quite a common phenomenon when making choices

→ This can “obscure” quality signals and introduces a lot of noise

● Chance events (i.e. who came first, was spotted first, … etc.) strongly influence what gets “locked in” as popular later

● Makes it difficult for newcomers

→ … can also be exploited when social networks are known (i.e. choice of a node is likely to be similar to that of neighbours)

Prediction in Artificial Cultural Markets

● Salganik, Dodds and Watts did a very nice study on this (Science 311, 806 (2006)):
  – Created an “artificial music market”, comprising songs from bands unknown to participants.
  – Participants form two groups:
    ● A reference group: people see band names and song titles. They can pick songs to listen to, then give each a rating and have the choice to download the song
      – Gives a measure of quality of the songs ...
    ● A socially influenced group: people additionally see how many times others have downloaded songs before
      – Two scenarios: presented in random order (A) or ordered by popularity (B)
      – 8 parallel worlds

Unevenness/Predictability

● Social influence makes prediction much harder, especially so when choices are presented ranked
● Nevertheless there is a correlation between popularity and quality: bad songs never became very popular, good ones never very unpopular. Most affected: the middle range of qualities.

Summary

● What is a network?
  – Application areas
  – Network representation
  – Degree (distribution), components, path lengths, clustering, etc.
● Scale-free networks
  – What are they?
  – Preferential attachment
  – Preferential attachment and prediction