P2P systems: epidemic scheduling, content placement and user profiling


Laurent Massoulié

Thomson, Paris Research Lab

Outline

Epidemic schemes for live streaming

– Rate-optimality

– Delay-optimality

Content placement

– Optimisation framework

– Adaptive replication and 3/4-competitivity

User profiling

– Spectral clustering

– Linear programming


Context

P2P systems for live streaming on the Internet

– PPLive, CoolStreaming, Sopcast, TVants, TVUPlay, Joost…

Network constraints

● Graph connecting nodes

● Capacities assigned to edges

Achievable broadcast rate [Edmonds, 73]:

● Equals the maximal number of edge-disjoint spanning trees that can be packed in the graph

● Coincides with the minimum over receivers of the max-flow (= min-cut) between source and receiver
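The max-flow characterisation can be checked numerically on a small graph. The sketch below is only an illustration: the function names `max_flow` and `broadcast_rate`, the dict-of-dicts graph encoding and the example capacities are assumptions, and the max-flow routine is a plain Edmonds-Karp implementation rather than anything prescribed by the slides.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow; capacity is a dict {u: {v: edge capacity}}."""
    # Residual graph with zero-capacity reverse edges where needed.
    residual = {source: {}, sink: {}}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def broadcast_rate(capacity, source, receivers):
    """Minimum over receivers of the max-flow from the source."""
    return min(max_flow(capacity, source, r) for r in receivers)


# Hypothetical 3-node example: source "s" and receivers "a", "b".
example = {"s": {"a": 2, "b": 1}, "a": {"b": 1}, "b": {"a": 1}}
rate = broadcast_rate(example, "s", ["a", "b"])
```

In this example the max-flow is 3 towards "a" and 2 towards "b", so the broadcast rate is 2, which matches a packing of two trees (s→a→b, and s→a together with s→b).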

Random useful chunk selection and Edmonds’ theorem [LM, A. Twigg, C. Gkantsidis & P. Rodriguez]

Based on local information

No explicit construction of spanning trees

[Figure: example of chunk exchange; chunk sets 1 4 and 1 2 4 5 7 8, chunk 5, question marks on pending transfers]

When the injection rate at the source is strictly feasible, the Markov process is ergodic.

Chunks are then successfully broadcast with bounded delay.

Network with access (node) constraints

● Scarce resource: access capacity

● Complete communication graph: everyone can send to anyone

● Bound on maximum streaming rate λ:

Let $c_i$ = uplink b/w of node i, and $c_s$ that of the source s. Necessary condition for feasibility:

$\lambda \le \lambda^* := \min\left( c_s,\ \frac{1}{N} \sum_i c_i \right)$

Deprived Peer / Random Useful Chunk [LM, A. Twigg, C. Gkantsidis & P. Rodriguez]

[Figure: sender’s packets 1 2 4 5 7 8; potential receiver 1 holds 1 5 7 8; potential receiver 2 holds 1 4; packet 5 shown separately]

Source policy: sends “fresh” packets if any (fresh = not sent yet to anyone)
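A minimal sketch of one transmission decision under this scheme follows. It assumes that a neighbor's deprivation is measured by the number of the sender's packets it is missing; the slide does not spell out this measure, and the function names, the dict encoding of neighbors and the tie-breaking rule are all hypothetical.

```python
import random


def useful_chunks(sender_chunks, receiver_chunks):
    """Chunks held by the sender that the receiver is missing."""
    return set(sender_chunks) - set(receiver_chunks)


def most_deprived_peer(sender_chunks, neighbors):
    """Neighbor missing the most of the sender's chunks (assumed deprivation measure).

    neighbors maps a peer id to the set of chunks it holds; ties are broken
    by dict iteration order, which is an arbitrary choice.
    """
    return max(neighbors,
               key=lambda p: len(useful_chunks(sender_chunks, neighbors[p])))


def deprived_peer_random_useful(sender_chunks, neighbors, rng=random):
    """Pick the most deprived neighbor, then a uniformly random useful chunk."""
    target = most_deprived_peer(sender_chunks, neighbors)
    candidates = sorted(useful_chunks(sender_chunks, neighbors[target]))
    return target, (rng.choice(candidates) if candidates else None)


def fresh_chunk(source_chunks, already_sent, rng=random):
    """Source policy: a chunk not yet sent to anyone, or None if there is none."""
    fresh = sorted(set(source_chunks) - set(already_sent))
    return rng.choice(fresh) if fresh else None


# Chunk sets taken from the slide's example.
rng = random.Random(0)
target, chunk = deprived_peer_random_useful(
    {1, 2, 4, 5, 7, 8}, {"r1": {1, 5, 7, 8}, "r2": {1, 4}}, rng)
```

With the slide's chunk sets, receiver 2 is missing four of the sender's packets (2, 5, 7, 8) against two for receiver 1, so it is selected under this deprivation measure.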

Deprived Peer / Random Useful Chunk (ctd)

Neighborhood management:

– Periodically add a random neighbor and suppress the least deprived neighbor

– Fixed neighborhood sizes

Main result

When λ < λ*, the Markov process is ergodic.

Hence all packets are received at all nodes after a time that is bounded in probability.

Multiple commodities

● Several sources s

● Dedicated receiver sets V(s), which can overlap

● Sources are not receivers

● Nodes cannot relay commodities they don’t consume

Multiple commodities

Necessary conditions for feasibility:

$\lambda_s \le c_s, \quad s \in S$

$\sum_{s \in K} \lambda_s \left( |V(s)| - 1 \right) \le \sum_{u \in \bigcup_{s \in K} V(s)} c_u, \quad K \subseteq S$

Bundled most deprived / random useful: do not distinguish between commodities when

– measuring deprivation

– choosing random useful packet

System is ergodic when the conditions hold with strict inequality

Symmetric Networks (c1 = c2 = ... = cN = 1 chunk / sec )

Previous lower bound on delay reads log2(N)

Achievable [J. Mundinger & R. Weber]:

[Figure: dissemination tree rooted at the source, with levels labelled t, t-1 (2 nodes), t-2 (4 nodes), t-3 (8 nodes), and label t+1]

Makes use of log2(N) trees; not robust against churn
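The log2(N) figure follows from a doubling argument: with unit upload capacity, each node holding a chunk can pass it to at most one new node per time slot, so the number of holders at most doubles per slot. The sketch below counts the slots needed under that assumption; the function name and the convention that the first holder counts as one of the nodes are illustrative choices.

```python
def slots_to_reach_all(n_nodes):
    """Slots needed for a chunk to reach n_nodes holders, doubling each slot."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    holders, slots = 1, 0  # one initial holder
    while holders < n_nodes:
        # Every current holder uploads the chunk to one new node.
        holders = min(2 * holders, n_nodes)
        slots += 1
    return slots


# Slot counts for the tree sizes shown on the next slide.
delays = {n: slots_to_reach_all(n) for n in (4, 8, 16, 32)}
```

For powers of two this returns exactly log2(N), and ceil(log2(N)) in general.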

A look at the corresponding trees

[Figure: the corresponding trees for N=4, N=8, N=16 and N=32]

Random target / latest useful packet

[Figure: a sender holding packets 1 2 4 5 7 8 picks a random target; the receiver holds packets 1 2 3 8; the latest useful packet is marked]
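The selection rule can be sketched as follows, assuming packets are numbered in order of creation so that "latest" means the highest number. The function names and the dict encoding of peers are hypothetical.

```python
import random


def latest_useful_packet(sender_packets, receiver_packets):
    """Highest-numbered packet the sender holds and the receiver lacks."""
    useful = set(sender_packets) - set(receiver_packets)
    return max(useful) if useful else None


def random_target_latest_useful(sender_packets, peers, rng=random):
    """Pick a target uniformly at random, then the latest packet useful to it.

    peers maps a peer id to the set of packets it holds.
    """
    target = rng.choice(sorted(peers))
    return target, latest_useful_packet(sender_packets, peers[target])


# Packet sets taken from the slide's example.
packet = latest_useful_packet({1, 2, 4, 5, 7, 8}, {1, 2, 3, 8})
```

With the slide's packet sets, the useful packets are 4, 5 and 7, so packet 7 is sent.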

Random target / latest useful packet

For arbitrary injection rate λ < 1 and constant x > 0, each peer receives a fraction 1 - 1/x of packets in time log2(N) + O(x).

I.e.: diffusion at rates arbitrarily close to optimal is feasible with optimal delay (plus a constant).

[T. Bonald, LM, F. Mathieu, D. Perino & A. Twigg]

Open questions

Delay optimality in heterogeneous environments

Cost optimality

Convergence time scale


Problem statement

• N users

• Storage capacity: m objects

• Service capacity: B requests

• Local accesses are free

• Request rate: $\nu_f$ for object f

• Request duration: 1

• Aim: minimize number of lost requests

Optimal placement structure

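Let $M_f$ = number of replicas of object f

Schedulable region: request rates $x_f$ verifying

$x_f \le B M_f \ \ \forall f, \qquad \sum_f x_f \le N B$

(the bound $B M_f$ is multiplied by K if objects can be split into K sub-objects of size 1/K)

Effective arrival rates, with $\nu_f$ the request rate for object f:

$x_f = \nu_f \left( 1 - \frac{M_f}{N} \right)$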
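A small numerical check of these conditions can be written as below. It is a sketch that assumes the effective rate of object f is its request rate times the fraction 1 - M_f/N of users without a local copy, and that splitting into K sub-objects multiplies the per-object bound B·M_f by K. The function and parameter names are hypothetical.

```python
def effective_rates(request_rates, replicas, n_users):
    """x_f = nu_f * (1 - M_f / N): requests that cannot be served locally."""
    return [nu * (1 - m_f / n_users)
            for nu, m_f in zip(request_rates, replicas)]


def is_schedulable(x, replicas, n_users, service_capacity, split=1):
    """Check x_f <= split * B * M_f for every f, and sum_f x_f <= N * B."""
    per_object_ok = all(x_f <= split * service_capacity * m_f
                        for x_f, m_f in zip(x, replicas))
    total_ok = sum(x) <= n_users * service_capacity
    return per_object_ok and total_ok


# Hypothetical example: N = 100 users, B = 1, two objects.
x = effective_rates([100.0, 10.0], [50, 5], n_users=100)
whole = is_schedulable(x, [50, 5], n_users=100, service_capacity=1)
halved = is_schedulable(x, [50, 5], n_users=100, service_capacity=1, split=2)
```

In this example the second object violates its per-object bound (about 9.5 against 5) when stored whole, and the system becomes schedulable once objects are split in two.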

Hot/Warm/Cold partition

Sort objects according to popularity (request rate $\nu_f$ for object f): $\nu_1 \ge \nu_2 \ge \dots$

Replicate everywhere ($M_f = N$) top popular objects 1,…,f(1)

Partial replication of objects f(1)+1,…,f(2):

$\nu_f \left( 1 - \frac{M_f}{N} \right) = B M_f$

No replication of objects for f > f(2)

f(1) and f(2): such that “warm objects” generate requests at rate BN, and all memory is used
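Solving the warm-object condition ν_f (1 - M_f/N) = B·M_f for M_f gives M_f = N·ν_f / (N·B + ν_f). The snippet below evaluates this closed form and checks it against the condition; it is only a sketch, with hypothetical function names and example values, and it ignores the rounding of M_f to an integer.

```python
def warm_replicas(nu, n_users, service_capacity):
    """Replica count M solving nu * (1 - M / N) = B * M (real-valued)."""
    return n_users * nu / (n_users * service_capacity + nu)


def warm_residual(nu, m_f, n_users, service_capacity):
    """Left side minus right side of the warm-object condition."""
    return nu * (1 - m_f / n_users) - service_capacity * m_f


# Hypothetical example: N = 100 users, B = 1, request rate 100.
m_warm = warm_replicas(100.0, n_users=100, service_capacity=1)
```

Here the object ends up with 50 replicas, so half of its requests are served locally and the other half exactly saturate the service capacity of its replicas.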

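Adaptive replication

Replication policy:

– Create new replica for object f after each dropped request

– Remove object chosen at random

Ignoring object-specific capacity constraints, caricature dynamics (with $\nu_f$ the request rate for object f):

$\frac{dM_f}{dt} = \nu_f \left( 1 - \frac{M_f}{N} \right) p_{loss} - Cst$

Equilibrium:

$M_f = N \left( 1 - \frac{C}{\nu_f} \right)$, where C is s.t. $\sum_f M_f = mN$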

Adaptive replication (ctd)

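Compare to full replication of only top popular objects, i.e.

$M^*_f = N, \quad f = 1, \dots, m$

Then reductions to offered rates verify (with $\nu_f$ the request rate for object f)

$\sum_f \nu_f \frac{M_f}{N} \ge \frac{3}{4} \sum_{f \le m} \nu_f$

“Value of foresight” is less than 25%...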
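The comparison can be explored numerically. The sketch below assumes the equilibrium takes the form M_f/N = max(0, 1 - C/ν_f), with C chosen so that the replica fractions sum to m; the positive part, the bisection search and all names are assumptions made for illustration rather than details given on the slides.

```python
def equilibrium_fractions(rates, m, iterations=200):
    """Fractions y_f = M_f / N = max(0, 1 - C / nu_f) with sum_f y_f = m."""
    if not 0 < m < len(rates) or min(rates) <= 0:
        raise ValueError("need 0 < m < number of objects and positive rates")

    def total(c):
        return sum(max(0.0, 1.0 - c / nu) for nu in rates)

    # total() decreases from len(rates) at C = 0 to 0 at C = max(rates).
    lo, hi = 0.0, max(rates)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if total(mid) > m:
            lo = mid
        else:
            hi = mid
    c = (lo + hi) / 2
    return [max(0.0, 1.0 - c / nu) for nu in rates]


def foresight_ratio(rates, m):
    """Rate reduction of the adaptive equilibrium relative to top-m replication."""
    y = equilibrium_fractions(rates, m)
    adaptive = sum(nu * y_f for nu, y_f in zip(rates, y))
    best = sum(sorted(rates, reverse=True)[:m])
    return adaptive / best


# Hypothetical catalogue: two popular and two less popular objects, m = 2.
ratio = foresight_ratio([1.0, 1.0, 0.5, 0.5], m=2)
```

For this catalogue the equilibrium constant is C = 1/3, the popular objects are replicated at 2/3 of the users and the others at 1/3, and the ratio is 5/6, which is above the 3/4 bound.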


User profiling

Aim: predict tastes of users

Applications:

– Further optimization of placement

– Recommender Systems

Netflix dataset

17,770 movies, rated by 480,000 users

The planted partition model

Users partitioned into clusters k=1,…,K

Each pair of users (i,j): conflict level C(i,j) in [0,1] (e.g., fraction of movies rated differently)

Statistical assumptions:

– C(i,j) independent over i<j

– E(C(i,j)) = $b_{kl}$ D/N if users i, j belong to clusters k, l

A spectral algorithm

Step 1: find suitable “de-noised” descriptors of users

Form normalized eigenvectors x(1),…,x(K) associated with the K largest (in absolute value) eigenvalues of the conflict matrix

To each user i, assign the vector $z_i = (x_i(1), \dots, x_i(K))$

A spectral algorithm

Step 2: do crude clustering on descriptors

Pick a random set of A users u(1),…,u(A)

Repeatedly identify the pair with closest descriptors (in L2 norm) and remove one of them, until only K users are left, say v(1),…,v(K)

Cluster the nodes according to proximity of their descriptors to the cluster exemplars v(1),…,v(K)
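Step 2 can be sketched in plain Python as below, taking the descriptors from Step 1 as given. Which member of the closest pair is dropped is not specified on the slide; the sketch drops the second one, and the function name, argument names and toy descriptors are hypothetical.

```python
import math
import random


def crude_clustering(descriptors, k, a, rng=random):
    """Cluster users from their descriptors z_i (a list of coordinate tuples).

    Samples a users, prunes them to k exemplars by repeatedly dropping one
    member of the closest pair, then assigns each user to its nearest exemplar.
    """
    if not k <= a <= len(descriptors):
        raise ValueError("need k <= a <= number of users")
    exemplars = rng.sample(range(len(descriptors)), a)
    while len(exemplars) > k:
        # Closest pair in L2 norm; the second member is dropped (arbitrary).
        _, _, drop = min(
            (math.dist(descriptors[u], descriptors[v]), u, v)
            for i, u in enumerate(exemplars)
            for v in exemplars[i + 1:]
        )
        exemplars.remove(drop)
    labels = [
        min(range(k), key=lambda c: math.dist(z, descriptors[exemplars[c]]))
        for z in descriptors
    ]
    return exemplars, labels


# Toy descriptors forming two well-separated groups.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
exemplars, labels = crude_clustering(toy, k=2, a=6, rng=random.Random(0))
```

On the toy data the pruning keeps one exemplar per group, because within-group pairs are always closer than cross-group pairs, and the labels then separate the first three users from the last three.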

Theorem

Assume that

– Fixed number K of clusters, each of size Ω(N)

– Matrix ($b_{kl}$) has full rank K

– D ≥ C log(N) for some constant C

Then with probability 1-o(1), the algorithm correctly partitions a fraction 1-o(1) of nodes for suitable A (1 ≪ A ≪ $D^{1/2}$)

Main tool: control of spectral structure of E-R graph adjacency matrix when average degree D ≥ C log(N) [Feige-Ofek]

Open question

Brute force Maximum Likelihood: retrieves clusters when D>>1

Efficient procedure under this assumption?


Another algorithmic version of Netflix

Objective: for user n, find inference of all unknown ratings that maximizes number of users fully agreeing with user n

NP-hard (badly so)

Probabilistic model

– Users belong to clusters k=1,…,K, with sizes a(k) N

– Within a cluster, identical ratings (i.i.d., +1 or -1 w.p. ½ for each movie, F movies in total)

– Each rating of each user: revealed w.p. p

Proposed algorithm (inspiration: compressive sensing; see [Decoding by linear programming, Candes & Tao])

Consider user 1

For suitable cost function g, determine full rating vectors X(n), compatible with known ratings (i.e. $P_n X(n) = Y(n)$), that minimize

(I) $\sum_{n \ne 1} g\left( X(n) - X(1) \right)$

A proxy to (intractable) minimization of

(II) $\sum_{n \ne 1} \mathbf{1}_{X(n) \ne X(1)}$

Conditions for optimality

Assume optimum of (II): “clustered” reconstruction X**(n) such that X**(n)=X**(1) for all indices n ∈ A

Then optimum of (I) is such that X*(n)=X*(1), n ∈ A

provided:

[Equation garbled in extraction: a condition involving g, w, X**(n), the index set A and Im $P_n'$]

Application to probabilistic model

Necessary condition for hidden cluster to be optimal:

$a(k) \ge \sup_{l \ne k} a(l) \exp(-F p^2 / 2)$

Sufficient condition for LP algorithm to retrieve hidden cluster, under choice g = |.|:

$a(k) \ge \sum_{l \ne k} a(l) \exp(-F p^2 / 2)$

Differ by factor at most K-1

Outlook

Clustering

– Robustness of proposed schemes to statistical modeling assumptions

– Efficient (distributed?) implementations
