Influence at Scale: Distributed Computation of Complex

Influence at Scale: Distributed Computation of Complex Contagion in Networks

Brendan Lucier, Joel Oren, Yaron Singer Microsoft Research, University of Toronto, Harvard University

Motivation

• Information requirements: how much of the graph do we need to examine? (query complexity).• Efficient parallel computation: how to distribute computation; maintain high-volume intermediate data.

The FrameworkThe influence model: The Independent Cascade Model [KKT’03]• Input: edge-weighted graph 𝐺 = (𝑉, 𝐸,p), initial seeds 𝑆0 ⊆ 𝑉.• At every synchronous step 𝑡 > 0, every node previously infected node

𝑢 ∈ 𝑆𝑡 − 𝑆𝑡−1 successfully infects an uninfected neighbor 𝑣 w.p. 𝑝𝑢𝑣.𝑆𝑡+1 -- set of infected nodes at step 𝑡 + 1.

• The influence function: 𝑓 𝑆0 = 𝐸[𝑆𝑡]

The MRC parallel computation model [KSV’10]• Inspired by the MapReduce paradigm.• Synchronous round computation on 𝑁 𝑘; 𝑣 on tuples. On every round:

1. Map: apply local transformation on tuples in a streaming fashion.2. Reduce: polytime computation on aggregates of tuples with same

key.• Constraintaints: 𝑁1−𝜖 machines and space per machine; log𝑐𝑁 rounds.

• Graph access model: Link server.

The Information Requirements of Influence Estimation

• Question: how much of the graph to we need to know in order to estimate the influence?

• Concretely: what is the query complexity of approximating 𝑓(𝑆0)?

Theorem: Let 𝜖 ∈ 0,1

3, 𝛿 ∈ [0,

1

2]. For large enough 𝑛 = |𝑉|, ∃𝐺 = (𝑉, 𝐸, 𝒑)

for which getting an estimate 𝑓(𝑆0) of 𝑓(𝑆0) s.t. 1 − 𝜖 𝑓 𝑆0 ≤ 𝑓 𝑆0 ≤ 𝑓(𝑆0) with confidence ≥ 1 − 𝛿 requires Ω((1 − 2𝛿) 𝑛 link server queries.

Main algorithm: Intermediate Layer; Estimating the Expectation

Main intuitions:1. High/low influence few/many samples needed,

2. Can compute the Riemann integral for CDF (𝑓 𝑆0 = 0

∞1 − 𝐹(𝐼) 𝑑𝐼)

Lemma (informal):w.h.p., VerifyGuess returns True if 𝜏 > 𝑖𝑛𝑓𝑙𝑢𝑒𝑛𝑐𝑒.

Main Algorithm: Top Layer; Iterating on Candidate Values

Theorem: For any 𝜖 ∈ (0,1

4), InfEst provides a (1 + 8𝜖)-approx. w.h.p.

Algorithm: Bottom Layer; Bounded SamplesHighlights: Input: Seeds 𝑆0 ⊆ 𝑉, link-server oracle 𝑄(⋅), 𝐿- # samples, bound 𝑡.1. Simultaneous prob-BFS in MapReduce, one layer per round.2. Finalize a sample when at least 𝑡 nodes infected.3. Return fraction of samples that reached bound.

Theorem (informal): 1) If 𝐿 is suff. small, # of tuples is Ω(𝑚1.5 log4 𝑚).2) If 𝑑𝑖𝑎𝑚 𝐺 = log𝑐 𝑛, algorithm takes polylog rounds.

Experimental Results

1

2

5

3

6

7

4

𝐺

𝑏

𝑎

𝑥

𝑐 𝑑

𝑆0

𝑝𝑥𝑏

𝑝𝑥𝑎𝑝𝑏𝑑𝑝𝑏𝑐

𝑝𝑑𝑎

L-S𝑄(𝑢)

𝑁+(𝑢)

𝑝

Running timeApprox. ratios (Epinions)

Heuristics(Epinions)

Scalable estimation of the influence of individuals in massive social networks

Documents

Influence at Scale: Distributed Computation of Complex