1
Influence at Scale: Distributed Computation of Complex Contagion in Networks Brendan Lucier, Joel Oren, Yaron Singer Microsoft Research, University of Toronto, Harvard University Motivation β€’ Information requirements: how much of the graph do we need to examine? (query complexity). β€’ Efficient parallel computation: how to distribute computation; maintain high-volume intermediate data. The Framework The influence model: The Independent Cascade Model [KKT’03] β€’ Input: edge-weighted graph = (, ,p), initial seeds 0 βŠ† . β€’ At every synchronous step >0, every node previously infected node ∈ βˆ’ βˆ’1 successfully infects an uninfected neighbor w.p. . +1 -- set of infected nodes at step +1. β€’ The influence function: 0 = [ ] The MRC parallel computation model [KSV’10] β€’ Inspired by the MapReduce paradigm. β€’ Synchronous round computation on ; on tuples. On every round: 1. Map: apply local transformation on tuples in a streaming fashion. 2. Reduce: polytime computation on aggregates of tuples with same key. β€’ Constraintaints: 1βˆ’ machines and space per machine; log rounds. β€’ Graph access model: Link server. The Information Requirements of Influence Estimation β€’ Question: how much of the graph to we need to know in order to estimate the influence? β€’ Concretely: what is the query complexity of approximating ( 0 )? Theorem: Let ∈ 0, 1 3 , ∈ [0, 1 2 ]. For large enough = ||, βˆƒ = (, , ) for which getting an estimate ( 0 ) of ( 0 ) s.t. 1βˆ’ 0 ≀ 0 ≀ ( 0 ) with confidence β‰₯1βˆ’ requires Ξ©((1 βˆ’ 2) link server queries. Main algorithm: Intermediate Layer; Estimating the Expectation Main intuitions: 1. High/low influence few/many samples needed, 2. Can compute the Riemann integral for CDF ( 0 = 0 ∞ 1 βˆ’ () ) Lemma (informal):w.h.p., VerifyGuess returns True if > . Main Algorithm: Top Layer; Iterating on Candidate Values Theorem: For any ∈ (0, 1 4 ), InfEst provides a (1 + 8)-approx. w.h.p. Algorithm: Bottom Layer; Bounded Samples Highlights: Input: Seeds 0 βŠ† , link-server oracle (β‹…), - # samples, bound . 1. Simultaneous prob-BFS in MapReduce, one layer per round. 2. Finalize a sample when at least nodes infected. 3. Return fraction of samples that reached bound. Theorem (informal): 1) If is suff. small, # of tuples is Ξ©( 1.5 log 4 ). 2) If = log , algorithm takes polylog rounds. Experimental Results 1 2 5 3 6 7 4 0 L-S () + () Running time Approx. ratios (Epinions) Heuristics (Epinions) Scalable estimation of the influence of individuals in massive social networks

Influence at Scale: Distributed Computation of Complex

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Influence at Scale: Distributed Computation of Complex

Influence at Scale: Distributed Computation of Complex Contagion in Networks

Brendan Lucier, Joel Oren, Yaron Singer Microsoft Research, University of Toronto, Harvard University

Motivation

β€’ Information requirements: how much of the graph do we need to examine? (query complexity).β€’ Efficient parallel computation: how to distribute computation; maintain high-volume intermediate data.

The FrameworkThe influence model: The Independent Cascade Model [KKT’03]β€’ Input: edge-weighted graph 𝐺 = (𝑉, 𝐸,p), initial seeds 𝑆0 βŠ† 𝑉.β€’ At every synchronous step 𝑑 > 0, every node previously infected node

𝑒 ∈ 𝑆𝑑 βˆ’ π‘†π‘‘βˆ’1 successfully infects an uninfected neighbor 𝑣 w.p. 𝑝𝑒𝑣.𝑆𝑑+1 -- set of infected nodes at step 𝑑 + 1.

β€’ The influence function: 𝑓 𝑆0 = 𝐸[𝑆𝑑]

The MRC parallel computation model [KSV’10]β€’ Inspired by the MapReduce paradigm.β€’ Synchronous round computation on 𝑁 π‘˜; 𝑣 on tuples. On every round:

1. Map: apply local transformation on tuples in a streaming fashion.2. Reduce: polytime computation on aggregates of tuples with same

key.β€’ Constraintaints: 𝑁1βˆ’πœ– machines and space per machine; log𝑐𝑁 rounds.

β€’ Graph access model: Link server.

The Information Requirements of Influence Estimation

β€’ Question: how much of the graph to we need to know in order to estimate the influence?

β€’ Concretely: what is the query complexity of approximating 𝑓(𝑆0)?

Theorem: Let πœ– ∈ 0,1

3, 𝛿 ∈ [0,

1

2]. For large enough 𝑛 = |𝑉|, βˆƒπΊ = (𝑉, 𝐸, 𝒑)

for which getting an estimate 𝑓(𝑆0) of 𝑓(𝑆0) s.t. 1 βˆ’ πœ– 𝑓 𝑆0 ≀ 𝑓 𝑆0 ≀ 𝑓(𝑆0) with confidence β‰₯ 1 βˆ’ 𝛿 requires Ξ©((1 βˆ’ 2𝛿) 𝑛 link server queries.

Main algorithm: Intermediate Layer; Estimating the Expectation

Main intuitions:1. High/low influence few/many samples needed,

2. Can compute the Riemann integral for CDF (𝑓 𝑆0 = 0

∞1 βˆ’ 𝐹(𝐼) 𝑑𝐼)

Lemma (informal):w.h.p., VerifyGuess returns True if 𝜏 > 𝑖𝑛𝑓𝑙𝑒𝑒𝑛𝑐𝑒.

Main Algorithm: Top Layer; Iterating on Candidate Values

Theorem: For any πœ– ∈ (0,1

4), InfEst provides a (1 + 8πœ–)-approx. w.h.p.

Algorithm: Bottom Layer; Bounded SamplesHighlights: Input: Seeds 𝑆0 βŠ† 𝑉, link-server oracle 𝑄(β‹…), 𝐿- # samples, bound 𝑑.1. Simultaneous prob-BFS in MapReduce, one layer per round.2. Finalize a sample when at least 𝑑 nodes infected.3. Return fraction of samples that reached bound.

Theorem (informal): 1) If 𝐿 is suff. small, # of tuples is Ξ©(π‘š1.5 log4 π‘š).2) If π‘‘π‘–π‘Žπ‘š 𝐺 = log𝑐 𝑛, algorithm takes polylog rounds.

Experimental Results

1

2

5

3

6

7

4

𝐺

𝑏

π‘Ž

π‘₯

𝑐 𝑑

𝑆0

𝑝π‘₯𝑏

𝑝π‘₯π‘Žπ‘π‘π‘‘π‘π‘π‘

π‘π‘‘π‘Ž

L-S𝑄(𝑒)

𝑁+(𝑒)

𝑝

Running timeApprox. ratios (Epinions)

Heuristics(Epinions)

Scalable estimation of the influence of individuals in massive social networks