Beyond Models: Forecasting Complex Network Processes Directly from Data

Bruno Ribeiro, Carnegie Mellon University, Pittsburgh, PA, USA ([email protected])
Minh X. Hoang, University of California, Santa Barbara, CA, USA ([email protected])
Ambuj K. Singh, University of California, Santa Barbara, CA, USA ([email protected])

ABSTRACT
Complex network phenomena – such as information cascades in online social networks – are hard to fully observe, model, and forecast. In forecasting, a recent trend has been to forgo the use of parsimonious models in favor of models with increasingly large degrees of freedom that are trained to learn the behavior of a process from historical data. Extrapolating this trend into the future, eventually we would renounce models altogether. But is it possible to forecast the evolution of a complex stochastic process directly from the data, without a model? In this work we show that model-free forecasting is possible. We present SED, an algorithm that forecasts process statistics based on relationships of statistical equivalence, using two general axioms and historical data. To the best of our knowledge, SED is the first method that can perform axiomatic, model-free forecasts of complex stochastic processes. Our simulations using simple and complex evolving processes, and tests performed on a large real-world dataset, show promising results.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database applications—Data mining

General Terms
Algorithms; Measurement

Keywords
Model-free Forecasting; Cascade Forecast

1. INTRODUCTION
Complex networked processes – such as information cascades and the spread of influence and viruses over online social networks – are hard to fully observe, model, and forecast. Self-reinforcing and latent effects can drive random and deterministic amplifications that are hard to predict without aggregate, large-population models. For instance, the way people react to social and monetary incentives can affect how information cascades spread over online social networks [12]; individuals also have evolving social interests [19, 24].

In our work we focus on forecasting statistics of complex network processes, such as the distribution of sizes of an epidemic process over a network. Recently, the availability of large-sample historical data – that details the evolution of similar processes in the past – has started a trend of forgoing parsimonious models in favor of models with increasingly large degrees of freedom that are trained over this historical data; these include latent Markovian infection models [24], auto-regressive models [20, 21], distance-based models for time series [14, 37], and classification-based approaches such as logistic regression [4] and naïve Bayes classifiers [13]. Hidden Markov models have also been proposed to replace parsimonious population models in ecology research [26].

Extrapolating this trend into the future, eventually we could renounce models altogether. But is this possible? That is, can we forecast statistics of complex network processes directly from the data, without the help of models? Such a forecasting algorithm would be the ultimate data-driven method.

Contributions
In this work, we propose SED (the Statistical Equivalence Digraph algorithm), an algorithm that forecasts statistics of complex networked processes using general axioms that do not entail a model. Our algorithm works by extracting the statistical information contained in the historical data through two axioms that define relationships of statistical equivalence between stochastic processes. To the best of our knowledge, our algorithm is the first that can perform (true) model-free forecasts of network processes. Using simulation results and also a large real-world dataset, we show that SED is able to accurately forecast a variety of metrics under complex scenarios. We also prove that accurate unbiased forecasting is limited to time horizons less than twice that of the training data for a class of seemingly simple infection processes. This proof further motivates our quest for forecasting directly from the data.

An important property of SED is its ability to adjust predictions according to the amount of evidence. Consider forecasting the distribution of sizes of hashtag cascades #A and #B on Twitter:

• We observe 20 hashtag seeds of #A (seeds are network nodes that are infected independently from other already infected nodes): ten seeds get cascades of size one and ten get cascades of size at least ten.


• We also observe two seeds of #B, generating cascades of size one and of size greater than ten, respectively.

Should we make the same forecast for #A and #B? As #B has fewer samples than #A, we are more uncertain about #B's behavior. The lack of evidence prompts SED to assign greater uncertainty when forecasting metrics of #B, automatically adjusting forecasts to the amount of evidence.

Outline
Section 2 provides some definitions used throughout the paper. Section 3 explains the SED algorithm, using the prediction of cascade statistics as an example. Section 4 presents the theory behind our approach. Section 5 tests the accuracy of SED using simulated data and shows an application of SED to forecast complex cascades over a large online social network dataset. Finally, Section 6 discusses the related work and Section 7 presents our conclusions.

2. DEFINITIONS
We first give a few definitions used throughout this work. In this paper we often simplify our exposition by describing what are really general stochastic processes as an infection process on a network, a.k.a. a cascade process.

Definition 2.1. [Cascade Process] $\Psi^{(i)} \in \Omega$ is a cascade process with ID $i$, $i = 1, \ldots, |\Omega|$. For each cascade process $\Psi^{(i)} \in \Omega$ we observe a set of sample paths. The random variable $X^{(i)}_{\Delta t}$ is a measure over the sample paths of $\Psi^{(i)}$ over the relative time interval $[0, \Delta t]$, where time zero is the time relative to the beginning of the cascade process. For instance, $X^{(i)}_{\Delta t}$ may be a random variable that gives the size of a cascade of $\Psi^{(i)}$ over the time window $[0, \Delta t]$.

The definition of $X^{(i)}_{\Delta t}$ allows us to be conservative and "mistakenly" merge two or more independent cascades of process $\Psi^{(i)}$ into a single larger cascade. This property is particularly useful when there is ambiguity about which independent cascade seed is responsible for infecting a set of nodes.

Definition 2.2. [Forecasting] The process to forecast, $\Psi^{(0)}$, has $k_0$ observations $x^{(0,1)}_{\Delta t_1}, \ldots, x^{(0,k_0)}_{\Delta t_1} \sim X^{(0)}_{\Delta t_1}$ over the time window $[0, \Delta t_1]$, $\Delta t_1 > 0$, where time zero represents the time at which the process starts. A forecast is an estimate of a statistic of $X^{(0)}_{\Delta t_2}$, where $\Delta t_2 > \Delta t_1$. A forecast assumes that all processes in $\Omega$ have been observed for at least $\Delta t_2$ time.

For instance, if $\Psi^{(0)}$ is a viral marketing campaign, we may want to forecast, using the observations of the first day of the campaign, the average number of infected nodes per infected seed after a week, a statistic that predicts the performance of the campaign in its first week per user exposure.

Definition 2.3. [Historical Data] The historical data of process $\Psi^{(i)}$, $i = 0, \ldots$, defined as $\mathcal{X}^{(i)}_{\Delta t} = \{x^{(i,1)}_{\Delta t}, \ldots, x^{(i,k_i)}_{\Delta t} \sim X^{(i)}_{\Delta t}\}$, is the set of all available observations of process $\Psi^{(i)}$ over a time interval of length $\Delta t$.

In what follows we illustrate the SED algorithm. The theory behind the algorithm is explained in Section 4.

3. ALGORITHM
In this section we present the SED algorithm, focusing on the "how-to" of the solution. Our description of SED focuses on cascade processes to simplify the exposition; the algorithm for general networked processes is presented in Section 4, along with the justifications and the theory behind SED.

Algorithm 1 The SED Algorithm

Input: Historical data $\mathcal{X}^{(i)}_{\Delta t}$, $i = 1, \ldots, |\Omega|$, $\Delta t = \Delta t_1, \Delta t_2$, and $\mathcal{X}^{(0)}_{\Delta t_1}$;
  $\alpha$: probability amplifier parameter;
  $n$: bootstrap resample size;
  $m$: number of bootstrap resamples;
  Stat: statistic of interest of $X^{(0)}_{\Delta t_2}$.
Output: Set of predicted observations of $\mathrm{Stat}(X^{(0)}_{\Delta t_2})$

1: for $i := 0$ to $|\Omega|$ do
2:   $p_i \leftarrow \sum_{j=0, j \neq i}^{|\Omega|} P_{\mathrm{Kuiper}}(\mathcal{X}^{(i)}_{\Delta t_1}, \mathcal{X}^{(j)}_{\Delta t_1}) \,/\, |\Omega|$
3:   $w(i \Rightarrow 0) \leftarrow P_{\mathrm{Kuiper}}(\mathcal{X}^{(i)}_{\Delta t_1}, \mathcal{X}^{(0)}_{\Delta t_1}) \,/\, p_i$
4: end for
5: $\vec{W} \leftarrow [w(1 \Rightarrow 0)^{\alpha}, \ldots, w(|\Omega| \Rightarrow 0)^{\alpha}]$
6: $\vec{W} \leftarrow \vec{W} / \|\vec{W}\|_1$
7: $Y \leftarrow [\;]$
8: for $j = 1$ to $m$ do
9:   Sample $i$ with probability $\vec{W}_i$, $i \in \{1, \ldots, |\Omega|\}$
10:  $\hat{\mathcal{X}}^{(0)}_{\Delta t_2} \leftarrow \mathrm{BootstrapSampling}(\mathrm{src} = \mathcal{X}^{(i)}_{\Delta t_2}, \mathrm{size} = n)$
11:  $Y[j] \leftarrow \mathrm{Stat}(\hat{\mathcal{X}}^{(0)}_{\Delta t_2})$
12: end for
13: return $Y$


The SED algorithm (Algorithm 1) takes as input the set of observations of each process $\Psi^{(i)}$, $i = 1, \ldots, |\Omega|$, over time interval lengths $\Delta t_1 > 0$ and $\Delta t_2 > \Delta t_1$, and observations of the process we want to forecast, $i = 0$, over interval length $\Delta t_1$. The goal is to forecast a statistic of $\Psi^{(0)}$ over time interval length $\Delta t_2$. The algorithm also requires a parameter $\alpha$ that we later describe how to optimize automatically. For now it is enough to know that a large value of $\alpha$ indicates that the process to forecast behaves like an outlier in our dataset. Two other parameters, $n$ and $m$, are related to bootstrap sampling; $n = m = 100$ should suffice for most applications, but larger values always give more accurate results. The final input is the statistic we wish to forecast.

Step 2 of Algorithm 1 applies a two-sample Kuiper's test [16] to get the probabilities that the observations of two processes $i$ and $j$, $i \neq j$, come from the same underlying distribution. Step 3 computes the equivalence score $w(i \Rightarrow 0)$ between process $\Psi^{(i)}$ and the process to forecast, $\Psi^{(0)}$. We call this equivalence score the Equivalence Odds Ratio (EOR), described in detail in Section 4.

With the EOR computed, we build a weight vector $\vec{W}$ at step 5 of Algorithm 1 and normalize it using its L1-norm at step 6. The probability amplifier $\alpha$ (step 5) is a real number that is chosen automatically by another algorithm described later. The next steps (9-11) forecast the statistic of interest using a two-step bootstrapping process with $m$ and $n$ resamples. These statistics may include, but are not limited to, the Complementary Cumulative Distribution Function (CCDF), mean, standard deviation, and second moment of cascade sizes. We start by randomly sampling a process index $i$ according to the weights in $\vec{W}$ (step 9). At step 10 we perform bootstrap sampling from the long-term observations $\mathcal{X}^{(i)}_{\Delta t_2}$ to obtain a bootstrap resample estimate $\hat{\mathcal{X}}^{(0)}_{\Delta t_2}$, where $|\hat{\mathcal{X}}^{(0)}_{\Delta t_2}| = n$. Finally, the statistics of the estimated $\hat{\mathcal{X}}^{(0)}_{\Delta t_2}$, e.g., its mean, are computed and stored in the vector $Y$ (step 11). By repeating steps 9 to 11 $m$ times, we obtain the bootstrapped statistics, which provide the forecast of the statistic of interest over $X^{(0)}_{\Delta t_2}$.
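To make Algorithm 1 concrete, below is a minimal Python sketch of SED. It is an illustration under stated assumptions, not the authors' implementation: SciPy ships no two-sample Kuiper test, so the sketch substitutes the two-sample Kolmogorov-Smirnov test (the paper prefers Kuiper, but any two-sample test fits the framework; see Theorem 4.1), and the variable names (hist_dt1, hist_dt2, obs_dt1) are ours.

```python
import numpy as np
from scipy.stats import ks_2samp  # stand-in for the two-sample Kuiper test

def sed_forecast(hist_dt1, hist_dt2, obs_dt1, stat=np.mean,
                 alpha=1.0, n=100, m=100, rng=None):
    """Sketch of Algorithm 1 (SED). hist_dt1/hist_dt2: lists of observation
    arrays for processes 1..|Omega| over [0, dt1] and [0, dt2]; obs_dt1:
    observations of the process to forecast over [0, dt1]."""
    rng = rng or np.random.default_rng()
    samples = [obs_dt1] + list(hist_dt1)          # index 0 is the target process
    # Steps 1-4: equivalence odds ratios w(i => 0) from two-sample p-values
    # (only i >= 1 is needed; w(0 => 0) is never used).
    w = np.empty(len(hist_dt1))
    for i in range(1, len(samples)):
        pvals = [ks_2samp(samples[i], samples[j]).pvalue
                 for j in range(len(samples)) if j != i]
        p_i = np.mean(pvals)                      # average p-value of process i
        w[i - 1] = ks_2samp(samples[i], samples[0]).pvalue / p_i
    # Steps 5-6: amplify and L1-normalize the weight vector.
    W = w ** alpha
    W /= W.sum()
    # Steps 7-12: two-step bootstrap of the statistic of interest.
    Y = []
    for _ in range(m):
        i = rng.choice(len(hist_dt2), p=W)        # sample a historical process
        resample = rng.choice(hist_dt2[i], size=n, replace=True)
        Y.append(stat(resample))
    return np.array(Y)                            # forecast samples of Stat
```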

3.1 Choosing Amplifier α and Sample Size n
To complete the algorithm, we also need to choose the amplification factor $\alpha$ and the sample size $n$. Larger values of $n$ make the forecast more accurate, so $n$ is limited only by computational constraints. In our experiments, however, we would like to compare our prediction with the ground truth of $X^{(0)}_{\Delta t_2}$; thus, we set $n = |\mathcal{X}^{(0)}_{\Delta t_2}|$ to make the comparison easy. Finally, we determine $\alpha$ heuristically by minimizing the squared error in SED's prediction of the value of $\mathrm{Stat}(X^{(0)}_{\Delta t_1})$, of which we can easily get an estimate with $\mathrm{Stat}(\mathcal{X}^{(0)}_{\Delta t_1})$.
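A hedged sketch of this heuristic, reusing sed_forecast from the snippet above: candidate α values are scored by the squared error between SED's forecast of the statistic at Δt₁ (where the ground truth for the target process is observable) and the statistic of the observed data. The candidate grid is our assumption, not the paper's.

```python
import numpy as np

def choose_alpha(hist_dt1, obs_dt1, stat=np.mean,
                 candidates=(0.5, 1.0, 2.0, 4.0, 8.0, 16.0)):
    """Pick alpha minimizing the squared error of SED's forecast of
    Stat(X_{dt1}), whose ground truth Stat(obs_dt1) we can compute."""
    target = stat(obs_dt1)
    errors = []
    for a in candidates:
        # "Forecast" the dt1 statistic itself: hist_dt1 plays both roles.
        y = sed_forecast(hist_dt1, hist_dt1, obs_dt1, stat=stat, alpha=a)
        errors.append((np.mean(y) - target) ** 2)
    return candidates[int(np.argmin(errors))]
```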

4. THEORY
The framework of classical statistics, as developed by pioneers like R. A. Fisher [6], describes stochastic processes using models as building blocks. For instance, a process can be a Galton-Watson process, Markovian, auto-regressive, or a hierarchical combination of processes and non-parametric distributions. In 1933 Kolmogorov [15] laid the modern axiomatic foundations of probability theory, which eliminated the necessity of models, although models remain useful in Kolmogorov's framework. To the best of our knowledge, our SED method is the first method able to perform forecasting using axioms rather than models.

Roadmap: In what follows (Section 4.1) we detail our axiomatic forecast framework and our SED method. Section 4.2 takes an in-depth look at the problem, providing forecast accuracy bounds for a class of stochastic processes and proving that in some scenarios even the best model cannot make accurate long-term forecasts of seemingly simple processes.

4.1 SED: Axiomatic Forecasting
In what follows, we describe our SED method. SED performs forecasts using the following two axioms in lieu of a model. For the sake of conciseness and clarity, the definition of dynamic networked processes is given later in the text, in Definition 4.1.

Axiom 4.1. The state of any dynamic networked process is unique at any time $t \geq 0$.

Axiom 4.1 guarantees that at any given time $t$ there is no ambiguity about the true state of the process, even if this true state is not observable.

Axiom 4.2. Let $\Omega$ be a set of processes with historical data and let $\Psi^{(0)}$ be the process that we wish to forecast. Each process $\Psi' \in \Omega'$, where $\Omega' = \Omega \cup \{\Psi^{(0)}\}$, with the exception of at most one process, is equivalent to one and only one other process in $\Omega'$ besides itself according to a measure $\phi : S \to \mathbb{R}$, where $S$ is the appropriate $\sigma$-algebra.

Note that Axioms 4.1 and 4.2 do not entail a model. To illustrate the use of Axiom 4.2 in our method, consider the following example. $\Omega'$ is a set of epidemic processes over the same graph $G = (V, E)$. We wish to forecast another epidemic $\Psi^{(0)}$ using multiple independent sample paths of the processes in $\Omega'$. Let $\phi(\sigma)$, as defined in Axiom 4.2, count the number of infected nodes in a sample path $\sigma \in S$. The example also works if $\phi$ is another non-trivial metric, such as the cumulative number of infected nodes, the Wiener index, or the reproduction number at time $t > 0$. Axiom 4.2 guarantees that the set of all observed infection processes $\Omega'$ is such that for any process $\Psi' \in \Omega'$ there is one and only one other process $\Psi'' \in \Omega'$ for which the distribution of the number of infected nodes is exactly the same.

Unless stated otherwise, in what follows we simplify our exposition by assuming that $\Omega'$ has an even number of elements. Our method can also be readily extended to support $\phi : S \to \mathbb{R}^n$, $n > 1$, to incorporate complex features of the process. As we will see later, the scenario with $n > 1$ requires the use of kernel two-sample tests [10] for the forecast. In what follows we consider $\phi : S \to \mathbb{R}$ unless stated otherwise.

Intuition: Axiom 4.2 goes to the heart of what it means to forecast the evolution of a process using historical data. We can only predict the future if, somehow, the present mimics what has happened in the past. When aided by models, there is a notion of behavior distance (e.g., a likelihood function) that helps us find which processes in the past look similar to the process we are trying to forecast, so we can use them to predict the future. Because models often have trouble assigning measures of similarity between two significantly different processes, many modern methods only consider "close enough" distances, creating a manifold that is then used in the forecast. One of the challenges in manifold learning is determining a close enough distance at which the model empirically works without being fooled by the data. Axiom 4.2 makes this distance precisely zero, which brings some advantages. First, the forecast does not need to learn parameters, often a computationally intensive task. Second, in a complex scenario it is probably better to state ignorance about process behavior than to be very wrong. Network processes can be hard to forecast: in Section 4.2 we show that the statistical information used in the forecast of a seemingly simple process cannot be extrapolated by models beyond the time horizon of the dataset ($\Delta t_2$).

Choice of φ: The measure $\phi$ can be thought of as a "fingerprint" of process behavior. With a bad choice of $\phi$, too many "unrelated" processes have similar (but not equivalent) statistics, which in turn makes it hard to empirically distinguish them. In these cases SED assigns similar weights to most processes, and forecasting is performed conservatively (large confidence intervals). If the process to forecast has too few observations, SED also acts conservatively, to reflect the uncertainty in the limited number of observations.

Violating Axiom 4.2: If no process in the historical data is equivalent to the process we wish to forecast (a reality in some scenarios), then one of two things can happen. If the process to forecast and most processes in the historical data have similar, but not equal, counterparts, then Axiom 4.2 provides a robust metric of process similarity. It is robust because it does not assume how processes behave and cannot be fooled by the data; this shows well in our experiments with a real dataset. If most processes in the historical data are equal and the process to forecast is an outlier, SED will forecast in an uninformative way, with large confidence intervals. This outlier status is easy to spot, and stating the lack of confidence in the prediction is highly desirable.


Formal Definitions: In what follows, we use Random Marked Point Processes (RMPPs) to formalize the networked process. Note that RMPPs allow us to avoid describing how the process evolves; thus, no model is described. Let $G = (V, E)$ be a directed graph with node set $V$ and edge set $E \subseteq V \times V$. Let $L_V$ and $L_E$ denote arbitrary sets of node and edge labels that will be used to define the state of the process at any time. For example, in an epidemic process $L_V = \{\text{infected}, \text{susceptible}\}$ and $L_E = \emptyset$. The following generalization takes into account the complexity of real-world cascading processes.

Definition 4.1. [Cascade Process] Let $Z^{(i)}_k \subseteq V \times L_V \times E \times L_E$ denote the state of the $i$-th process after $k > 0$ events. Let $W^{(i)}_k > 0$ be the time between the $k$-th and the $(k+1)$-st events. A dynamic networked process $\Psi^{(i)}$ is a simple Random Marked Point Process (RMPP) $\Psi^{(i)} = \{(Z^{(i)}_k, W^{(i)}_k)\}_{k \in \mathbb{Z}^\star}$ operating over $V \times L_V \times E \times L_E$, where $\mathbb{Z}^\star$ is the set of non-negative integers. By convention, $0 = T^{(i)}_0 < T^{(i)}_1 < \cdots$ are the successive times at which the process evolves, with $W^{(i)}_k := T^{(i)}_{k+1} - T^{(i)}_k$, $k \geq 0$.

Note that the dynamic networked process need not follow the edges of $G$. RMPPs are versatile stochastic models that allow arbitrary spatial and temporal correlations in process evolution. Also note that in real life we only have a finite number of realizations of the processes and, thus, we may not be able to distinguish some of them. In what follows, we define the measure $\phi$ precisely and introduce the notion of $\phi$-stochastic equivalence.

Definition 4.2 (φ-stochastic Equivalence). We define two stochastic processes $\Psi^{(a)}$ and $\Psi^{(b)}$ as equivalent according to a Lebesgue measure $\phi : S \to \mathbb{R}$ over the time window $[0, \Delta t]$, $\Delta t \in (0, \infty)$, if $X^{(a)}_{\Delta t} \overset{d}{=} X^{(b)}_{\Delta t}$, where $X^{(k)}_{\Delta t} \equiv \phi(\sigma^{(\Delta t)}_k)$ is a random variable, $\sigma^{(\Delta t)}_k = \{(Z^{(k)}_i, W^{(k)}_i) : 0 \leq T^{(k)}_i \leq \Delta t\}$ is a sample path of the stochastic process $\Psi^{(k)}$, $k \in \{a, b\}$, $S$ is the appropriate $\sigma$-algebra, and the operator $\overset{d}{=}$ denotes that the random variables are equal in distribution.

In what follows, we use Axioms 4.1 and 4.2 and the definition of $\phi$-stochastic equivalence to build an equivalence digraph representing the equivalence relationships between all processes in $\Omega'$.

4.1.1 Stochastic Equivalence Digraph
Axioms 4.1 and 4.2 state that any cascade process $\Psi \in \Omega'$ – e.g., the process of spreading a Twitter hashtag over the Twitter network – has one and only one $\phi$-stochastic equivalent process $\Psi' \in \Omega'$. To create our forecast, all we need to do is identify this process. This idea yields a simple statement wherein lies the answer to our practical forecasting problem:

Theorem 4.1. Any two-sample hypothesis test can be used to determine whether two processes $\Psi^{(i)}, \Psi^{(j)} \in \Omega'$ are $\phi$-stochastic equivalent over a time interval $\Delta t > 0$.

Proof. Axiom 4.1 states that sample paths are simple and thus admit a stochastic equivalence measure $\phi$. From Axiom 4.2, we know that all processes in $\Omega'$ have a $\phi$-stochastic equivalent process (except at most one process if $|\Omega'|$ is odd). The sets of sample paths of $\Psi^{(i)}$ and $\Psi^{(j)}$ entail sets of independent observations of $X^{(i)}_{\Delta t}$ and $X^{(j)}_{\Delta t}$. Thus, $\phi$-stochastic equivalence yields $X^{(i)}_{\Delta t} \overset{d}{=} X^{(j)}_{\Delta t}$. A two-sample hypothesis test assesses whether independent observations of the random variables $X^{(i)}_{\Delta t}$ and $X^{(j)}_{\Delta t}$ are drawings from the same underlying distribution. Hence, a two-sample hypothesis test is a tool to determine whether $\Psi^{(i)}$ and $\Psi^{(j)}$ are $\phi$-stochastic equivalent at time $\Delta t > 0$.

Interestingly, a test with low statistical power forces us to make conservative forecasts with large confidence intervals. The forecast uncertainty reflects the lack of statistical information; in such a case we can identify the problem, as the test assigns all processes in $\Omega'$ a similar p-value. An interesting consequence of Axiom 4.2 and Theorem 4.1 is our ability to create a complete digraph whose nodes are the processes in $\Omega'$ and whose edge weights represent the probability that they are $\phi$-stochastic equivalent. We call the weights of the edges in this digraph the Equivalence Odds Ratio (EOR), defined as follows.

Definition 4.3. [Equivalence Odds Ratio (EOR)] The EOR, denoted $w_t(i \Rightarrow j)$, $i \neq j$, under $n$ observations $\mathcal{X}^{(i)}_{\Delta t} = \{x^{(i,1)}_{\Delta t}, \ldots, x^{(i,n)}_{\Delta t}\}$ of $X^{(i)}_{\Delta t}$ and $m$ observations $\mathcal{X}^{(j)}_{\Delta t} = \{x^{(j,1)}_{\Delta t}, \ldots, x^{(j,m)}_{\Delta t}\}$ of $X^{(j)}_{\Delta t}$, is

$$w_t(i \Rightarrow j) = \frac{|\Omega'| \, P_{\mathrm{Kuiper}}(\mathcal{X}^{(j)}_{\Delta t}, \mathcal{X}^{(i)}_{\Delta t})}{\sum_{\Psi^{(h)} \in \Omega' \setminus \{\Psi^{(j)}\}} P_{\mathrm{Kuiper}}(\mathcal{X}^{(i)}_{\Delta t}, \mathcal{X}^{(h)}_{\Delta t})} \qquad (1)$$

where $P_{\mathrm{Kuiper}}(\cdot, \cdot)$ is the p-value of a two-sample Kuiper's test [16, 32], $\mathcal{X}^{(h)}$ are the observations of process $\Psi^{(h)}$, $\Omega'$ is as defined in Axiom 4.2, and $X^{(i)}_{\Delta t}$, $X^{(j)}_{\Delta t}$ are as in Definition 4.2.

The Equivalence Odds Ratio (EOR) $w_t(i \Rightarrow j)$ is the ratio between $P_{\mathrm{Kuiper}}(\mathcal{X}^{(i)}_{\Delta t}, \mathcal{X}^{(j)}_{\Delta t})$ and the average p-value when $i$ is compared to all other cascades in $\Omega' \setminus \{\Psi^{(j)}\}$. The EOR is not to be confused with a likelihood-ratio test, which compares the likelihood of two models given the data; in our framework there are no models.

Definition 4.4 (Stochastic Equivalence Digraph). Let $G_{\mathcal{X}_{\Delta t}}$ be a simple (no loops) complete digraph that connects all processes in $\Omega'$ with directed edge weights. For any two cascade processes $\Psi^{(i)}, \Psi^{(j)} \in \Omega'$, $i \neq j$, the weight of edge $i \to j$ is the EOR $w_t(i \Rightarrow j)$.

An EOR $w_t(i \Rightarrow j) > 1$ suggests that process $\Psi^{(i)}$ is $w_t(i \Rightarrow j)$ times more likely to be $\phi$-stochastic equivalent to $\Psi^{(j)}$ than a random process selected from $\Omega' \setminus \{\Psi^{(j)}\}$. Note that the equivalence odds ratio is not symmetric, i.e., there can be processes $\Psi^{(i)}$ and $\Psi^{(j)}$ such that $w_t(i \Rightarrow j) \neq w_t(j \Rightarrow i)$. Also note that Definition 4.4 does not impose the one-to-one equivalence required by Axiom 4.2: the one-to-one equivalence is unnecessary and computationally expensive to enforce if $\Omega'$ is large, so we assume we can safely approximate it through probabilistic matching. The Kuiper test in $P_{\mathrm{Kuiper}}$ can also be replaced by any two-sample hypothesis test, such as the Kolmogorov-Smirnov test, and our method can be extended to include $P_{\mathrm{kernel}}$ [10], which allows the use of a multi-dimensional $\phi$ through kernel two-sample tests. In one dimension, we choose Kuiper's test over the Kolmogorov-Smirnov test because the former gives equal importance to all domain values while the latter tends to be most sensitive around the median value [16, 32].

Next, we show how to use the stochastic equivalence digraph $G_{\mathcal{X}_{\Delta t}}$ to make predictions of cascade characteristics.


4.1.2 Forecasting Procedure in Time
Using the stochastic equivalence digraph $G_{\mathcal{X}_{\Delta t_1}}$ we can forecast $\Psi^{(i)}$ using an estimate of $P[X^{(i)}_{\Delta t_2} > x]$, $0 < \Delta t_1 < \Delta t_2$, obtained by the mixture

$$P[X^{(i)}_{\Delta t_2} > x] = \frac{1}{C} \sum_{\Psi^{(j)} \in \Omega' \setminus \{\Psi^{(i)}\}} w_{\Delta t_1}(j \Rightarrow i) \, P[X^{(j)}_{\Delta t_2} > x],$$

where $C = \sum_{j \in \Omega'} w_{\Delta t_1}(j \Rightarrow i)$ is a normalization constant. The above equation gives the first iteration of Sinkhorn's algorithm, which finds probabilistic pairwise matches in a weighted graph [31]; for computational reasons we limit our matching to a single iteration. Sinkhorn's algorithm converges to a doubly stochastic matrix that defines the probability of pairwise matchings in a weighted graph, widely used in soft matching problems [40]. If the SED adjacency matrix is irreducible, then the probabilistic matching is unique [31]. To increase estimation accuracy when $\Omega'$ is large, we can add an exponential amplification factor:

$$P[X^{(i)}_{\Delta t_2} > x] = \frac{1}{C_\alpha} \sum_{\Psi^{(j)} \in \Omega' \setminus \{\Psi^{(i)}\}} w_{\Delta t_1}(j \Rightarrow i)^{\alpha} \, P[X^{(j)}_{\Delta t_2} > x],$$

where $C_\alpha = \sum_{j \in \Omega'} w_{\Delta t_1}(j \Rightarrow i)^{\alpha}$ and $\alpha$ is chosen so as to minimize a regret function over the estimated distribution $P[X^{(i)}_{\Delta t'_2} > x]$ for $\Delta t'_2 \approx \Delta t_1$. Note that even for historical cascades $\Psi^{(j)} \in \Omega' \setminus \{\Psi^{(i)}\}$, we do not have the true function $P[X^{(j)}_{\Delta t_2} > x]$ that is used inside the sum. Therefore, we estimate $P[X^{(j)}_{\Delta t_2} > x]$ by bootstrapping the observations of $X^{(j)}_{\Delta t_2}$ in the historical cascades. In Section 5.2 we show that the estimates have good accuracy in practice.
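As an illustration of the amplified mixture, the short sketch below (our own, in the same hypothetical setting as the earlier snippets) estimates the CCDF of the target process from the empirical CCDFs of the historical processes; w is assumed to be the vector of EOR weights w(j ⇒ i) computed at Δt₁.

```python
import numpy as np

def mixture_ccdf(x, hist_dt2, w, alpha=1.0):
    """Estimate P[X_{dt2} > x] for the target process as the w^alpha-weighted
    mixture of the empirical CCDFs of the historical processes."""
    wa = np.asarray(w, dtype=float) ** alpha
    wa /= wa.sum()                       # the normalization constant C_alpha
    # Empirical CCDF of process j at x: fraction of its observations above x.
    ccdfs = np.array([np.mean(np.asarray(obs) > x) for obs in hist_dt2])
    return float(wa @ ccdfs)
```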

4.2 A Big Data Forecast Paradox
In what follows we present a seemingly simple forecasting problem and prove that, under conditions common to large social networks, no model or procedure is capable of extrapolating accurate unbiased cascade statistics beyond the time horizon of the historical data. This inability of models to extrapolate cascade statistics beyond what is already in the historical (training) data indicates that model-free forecasts can do as well as forecasts with perfect models.

Moreover, we also show the following paradox: as the size of a power law network scales, increasing both the historical data and the maximum cascade sizes, forecasts beyond the historical data horizon get more inaccurate while forecasts within the horizon get more accurate. This paradox poses a great challenge for big data analytics, where short-term forecasts can get increasingly better as the system and the historical data grow while long-term forecasts get worse.

This paradox happens because the present is a biased view of the future. More precisely, in the historical data we are more likely to see at least one infection from a cascade that will be larger in the future than from one that will be smaller. This bias is related to the inspection paradox [36] and depends on the observation window. The bias is so hard to correct for power law distributions that it denies us the ability to forecast beyond historical data horizons, where we have not yet recorded the bias. Moreover, our results are not confined to power laws; we consider all distributions (Types I, II, and III). The results apply to statistics such as the unbiased average of cascade sizes and the cascade size distribution.
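The observation bias is easy to reproduce numerically. The sketch below is our own construction, not an experiment from the paper: final cascade sizes are drawn from a heavy-tailed distribution, each event independently lands in the observation window [0, T] with probability 1/c (as in Lemma 4.1 below), and conditioning on a cascade being seen at least once in [0, T] visibly inflates the average final size.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n = 4.0, 200_000
# Final cascade sizes at time cT: heavy-tailed, integer, support >= 1.
sizes = np.ceil(rng.pareto(1.5, size=n) + 1.0).astype(int)
# Each of a cascade's events lands in [0, T] independently w.p. T/(cT) = 1/c,
# so the cascade appears in the historical data iff at least one event does.
observed = rng.binomial(sizes, 1.0 / c) > 0
print("population mean size:          ", sizes.mean())
print("mean size of observed cascades:", sizes[observed].mean())
# The second mean is systematically larger: large future cascades are more
# likely to show up at least once in [0, T] (inspection paradox).
```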

Model: We collect data during the time interval $[0, T]$ and seek to forecast statistics for the interval $[0, cT]$, $c > 1$. Consider $n$ independent cascades. Cascade infections arrive according to a constant-rate Poisson process with possibly distinct rates in the interval $[0, cT]$. The size of a cascade at time $cT$ has distribution $\Lambda(cT) \sim \theta$, where $\theta$ is a distribution with support $\{1, \ldots, W\}$. Our initial goal is to estimate an unbiased average of the cascade sizes over the interval $[0, cT]$, $c > 1$. At first inspection the forecasting problem looks like a simple task, and the impossibility results stated above seem surprising. We choose the Poisson process in our illustration precisely because its simplicity allows us to clearly understand why the forecasting problem is hard.

Our use of the Poisson process is also of interest because it connects the axiomatic framework with the classic model-based one. Poisson processes are arguably among the simplest and most widely used stochastic processes. Let $N_t$ be the number of events in the time interval $[0, t]$. A Poisson process $\{\Lambda(\Delta t)\}$ with constant rate $\lambda$ is defined as an arrival process that satisfies the following three axioms:

A1. $\lim_{\epsilon \to 0} P[\Lambda(\Delta t + \epsilon) - \Lambda(\Delta t) = 1] = \lambda\epsilon + o(\epsilon)$, and $\lim_{\Delta t \to 0} P[N_{t + \Delta t} - N_t > 1] = o(\Delta t)$; that is, no two events can happen at the same time;

A2. for all $t, s > 0$, $N_{t+s} - N_t$ is independent of the history up to $t$, $\{N_u, u \leq t\}$;

A3. for all $t, s > 0$, $N_{t+s} - N_t$ is independent of $t$.

An equivalent constructive way to define the same process is: $P[N_t \geq k] = P[\sum_{i=1}^{k} X_i < t]$, where $\{X_1, \ldots, X_k\}$ are independent and identically distributed (i.i.d.) exponential random variables with parameter $\lambda$.
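The constructive definition can be checked numerically; the sketch below (our own) estimates P[N_t ≥ k] by summing i.i.d. exponential gaps and compares it against the Poisson tail.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
lam, t, k, trials = 2.0, 3.0, 5, 100_000
# N_t >= k  iff  the sum of the first k Exp(lam) inter-arrival gaps is < t.
gaps = rng.exponential(1.0 / lam, size=(trials, k))
print("via exponential gaps:", np.mean(gaps.sum(axis=1) < t))
print("via Poisson(lam*t)  :", 1.0 - poisson.cdf(k - 1, lam * t))
```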

Forecast Problem Definition: Let $O_T$ denote the set of indices of the cascades in the historical data observed during the interval $[0, T]$. The forecasting problem is defined as follows: we observe the cascades during the time window $[0, T]$ and wish to forecast the average number of events we will observe in the window $[0, cT]$, $m_{cT} = \sum_{i=1}^{n} N^{(i)}_{cT} / n$.

4.2.1 Forecast Accuracy Bounds
In the following theorem, we prove that forecasting the average number of events in our mixture Poisson process is a hard problem.

Theorem 4.2. Let $\{N^{(i)}_{cT}\}_{i=1}^{n}$, $N^{(i)}_{cT} \sim (\theta_1, \ldots, \theta_W)$, be a set of observed events of a mixture of $n$ Poisson processes during the time interval $[0, cT]$. Let $|O_T|$ denote the number of processes with at least one arrival in the interval $[0, T]$. Assume we are given the best unbiased forecast function $m_{cT|T}$ of the average number of events $m_{cT}$, where $m_{cT|T}$ takes as input the observations over the time interval $[0, T]$. Then, the following conditions regarding the mean squared error $\mathrm{MSE}(m_{cT|T}) = E[(m_{cT|T} - m_{cT})^2]$ hold:

1. If $\theta_W$ decreases faster than exponentially in $W$, i.e., $-\log \theta_W = \omega(W)$, then $\mathrm{MSE}(m_{cT|T}) = \Omega(1/|O_T|)$.

2. If $\theta_W$ decreases exponentially in $W$, i.e., $\log \theta_W = W \log b + o(W)$ for some $0 < b < 1$, then
(a) $\log[\mathrm{MSE}(m_{cT|T})] = \Omega(W - \log |O_T|)$, provided $c > 1 + 1/b$,
(b) $\mathrm{MSE}(m_{cT|T}) = \Omega(W/|O_T|)$, provided $c = 1 + 1/b$,
(c) $\mathrm{MSE}(m_{cT|T}) = \Omega(1/|O_T|)$, provided $c < 1 + 1/b$.

3. If $\theta_W$ decreases more slowly than exponentially, i.e., $-\log \theta_W = o(W)$, then
(a) $\log[\mathrm{MSE}(m_{cT|T})] = \Omega(W - \log |O_T|)$, provided $c > 2$,
(b) $\mathrm{MSE}(m_{cT|T}) = \omega(1/|O_T|)$, provided $c = 2$ and $\sum_{j=1}^{W} j^2 \theta_j = \omega(1)$,
(c) $\mathrm{MSE}(m_{cT|T}) = \Omega(1/|O_T|)$, provided either $c < 2$, or $c = 2$ and $\sum_{j=1}^{W} j^2 \theta_j = O(1)$.

Proof of Theorem 4.2 (outline). Our proof shows that forecasting mixtures of Poisson processes can be mapped into the set size distribution estimation problem; we then use our results on the hardness of estimating set size distributions from sampling [23] to prove the theorem. We start by showing how to map forecasting mixtures of Poisson processes into the set size distribution estimation problem. Let $P[N^{(i)}_T = k \mid N^{(i)}_{cT}]$ be the probability that $k$ events happen in the time window $[0, T]$ given that the number of events at time $cT$ is $N^{(i)}_{cT}$, $c > 1$. This distribution is given in the following lemma.

Lemma 4.1. If $N_{cT}$ is the number of events of a Poisson process at time $cT$ and $N'_T$ is the number of events that happened during $[0, T]$, $c > 1$, then $P[N'_T = k \mid N_{cT}] = \binom{N_{cT}}{k} c^{-k} (1 - 1/c)^{N_{cT} - k}$.

Proof. A Poisson process is defined as an arrival process that satisfies the three axioms listed above. Axiom A3 states that $\forall t, s > 0$, $N_{t+s} - N_t$ is independent of $t$. Thus, the count of events in the time interval $[0, T]$ is independent of the number of events in the time interval $(T, cT]$ (Axiom A2). Our Poisson process is time-homogeneous, and thus the probability that an event lands in $[0, T]$ is proportional to $T/(cT) = 1/c$ (Axiom A1). As the numbers of events in $[0, T]$ and $(T, cT]$ are independent, we have $P[N'_T = k \mid N_{cT}] = \binom{N_{cT}}{k} c^{-k} (1 - 1/c)^{N_{cT} - k}$.

In what follows, we show that the timestamps of past events in the interval $[0, T]$ are not relevant to the forecast.

Lemma 4.2. The event counts in the historical data provide all the statistical information collected in the interval $[0, T]$ w.r.t. the counts $\{N^{(i)}_{cT}\}_{i=1}^{n}$ in the interval $[0, cT]$.

Proof. Axiom A2 states that the exact timestamps of the events in the interval $[0, T]$ give no statistical information about future events in $(T, cT]$.

We are now ready for the theorem that connects the forecast with an estimation problem that is known to be hard.

Theorem 4.3. The problem of forecasting any function $g(\{N^{(i)}_{cT}\}_{i=1}^{n})$ of the mixture Poisson sample paths described in Section 4.2 is equivalent to the set size estimation problem [23], where elements are randomly sampled from a collection of non-overlapping sets and we seek to recover the original set size distribution from the samples.

Proof. We prove the theorem by mapping the forecasting problem into the set size estimation problem. Let $\{N^{(i)}_{cT}\}_{i=1}^{n}$ be the sizes of $n$ sets whose elements are sampled independently with probability $1/c > 0$, leading to sampled set sizes that are binomially distributed as in Lemma 4.1. Finally, Lemma 4.2 shows that these non-zero sampled set sizes are the only information available to forecast the process.

Theorem 4.3 shows that the present is a biased view of the future, as the set size problem suffers from the inspection paradox [23]. In what follows we prove our main theorem.

Proof of Theorem 4.2. Using Theorem 4.3 we can construct a perfect map between the forecasting problem and the set size distribution problem. Theorem 4.2 follows from applying Theorem 4.3 of our previous work (Murai et al. [23]) to this mapping, where we analytically compute the inverse of the Fisher information matrix of the data, giving rise to the error bounds in the theorem (through the Cramér-Rao bound).

Theorem 4.2 shows clear limitations of forecasting with models. In realistic scenarios, where event sizes are large and have heavier-than-exponential distributions, no algorithm can obtain accurate unbiased forecasts of the average number of events beyond the time interval $[T, 2T]$ if the dataset is limited to the time horizon $[0, T]$. Fortunately, SED is not impacted by this problem, as it limits its forecasts to the time horizons in the historical data. These results inspire the following apparent paradox.

4.2.2 A Big Network Data Paradox
When more data means worse forecasts: Theorem 4.2 creates an apparent paradox: forecasting can become more inaccurate as the dataset $O_T$ grows. The forecast is not just more inaccurate; it simply breaks down. As $O_T$ grows while the data time horizon remains the same, $[0, T]$, there is nearly zero statistical (Fisher) information to forecast beyond the horizon $[T, 2T]$.

To showcase the impact of this paradox, consider a growing network where the distribution of the size of events (cascade sizes) is Pareto with parameter $\beta > 1$. Our results hold generally, but for illustration we use a scenario where the number of seeds increases proportionally with network size. In this scenario the relationship between the maximum number of events per seed $W$ and the number of seeds $|O_T| \gg 1$ is $E[W] \approx \sqrt[\beta]{|O_T|}\,\Gamma(\beta - 1)$, a known result from extreme value theory [28]. Theorem 4.2, case 3(a), shows that the MSE is lower bounded by $\log[\mathrm{MSE}(m_{cT|T})] = \Omega(W - \log |O_T|)$. As $E[W] \propto \sqrt[\beta]{|O_T|}$ and $|O_T|$ increases, so does $W$. We can think of the MSE lower bound as growing roughly as $\exp(\sqrt[\beta]{|O_T|})/|O_T|$.

Thus, if the training data $O_T$ has time horizon $[0, T]$ and event (cascade) sizes have a power law distribution, Theorem 4.2, cases 3(b-c), shows that as the network grows the estimate in the interval $[T, 2T]$ gets more accurate. However, Theorem 4.2, case 3(a), shows that as the network grows, the estimation error in the interval $(2T, cT]$, $c > 2$, grows exponentially with network size.

5. RESULTS
In this section we present our results. We start with our simulation results in Section 5.1; Section 5.2 shows our results over a large Twitter dataset.

5.1 Forecasting Using Simulated Data
In this section, we test the forecasts of SED on synthetic datasets generated by two distinct models: the Poisson process and the Galton-Watson process. Recall that $\Omega' = \Omega \cup \{\Psi^{(0)}\}$ is the union of the historical processes $\Omega$ and the process to forecast, $\Psi^{(0)}$.


Figure 1 (panels (a) ID 10, (b) ID 60, (c) ID 160): Prediction for mixture Poisson processes (Scenario 1) of the future CCDF, average, standard deviation, and second moment metrics. SED predictions match well the true future distributions and clearly outperform uninformed predictions.

SED forecasts metrics of $\Psi^{(0)}$ at time interval $\Delta t_2$ using data from time interval $\Delta t_1$, $\Delta t_1 < \Delta t_2$. In addition to SED, we also have two naive baseline predictors: uninformed prediction and K-means prediction. Uninformed prediction returns the bootstrapped statistics at time interval $\Delta t_2$ of a randomly selected process in $\Omega$. K-means prediction, in contrast, first clusters the set of historical cascades using the K-means algorithm and Euclidean distances between the CCDFs of cascade metrics (probability density functions, PDFs, give similar results). The number of clusters $k$ is chosen using the elbow method, without making the clusters too small ($k$ equals 10 for the two synthetic datasets and 20 for the Twitter dataset). After that, given a new cascade $\Psi^{(0)}$, K-means prediction finds the cluster it belongs to and returns the bootstrapped statistics at time interval $\Delta t_2$ of a randomly selected process in that cluster.

For evaluation, we use violin plots to compare the predicted statistics with the bootstrap statistics from the ground-truth empirical distribution of $\Psi^{(0)}$ at $\Delta t_2$. A violin plot is a box plot combined with a kernel density estimate of the probability density function, giving a more detailed view of the data's variance. The box extends from the first quartile $Q_1$ to the third quartile $Q_3$ of the data, with a line at the median. The whiskers extend from the box to $Q_1 - 1.5 \cdot \mathrm{IQR}$ and $Q_3 + 1.5 \cdot \mathrm{IQR}$, where $\mathrm{IQR} = Q_3 - Q_1$ is the inter-quartile range. The violin plots show the spread.

Figure 2 (panels (a) ID 0, (b) ID 10, (c) ID 60): Prediction for Galton-Watson processes (Scenario 2) of the future CCDF, average, standard deviation, and second moment metrics. SED predictions again match well the future distributions and are clearly superior to uninformed predictions.

5.1.1 Scenario 1: Mixture Poisson Processes
Our mixture Poisson process is a Poisson process with random rate $\lambda$ that has support $(0, W]$, $W > 0$.

Data Generation: The simulated historical data $\mathcal{H}$ is generated through simulations from time zero until $\Delta t_2 = 1$ month from two sets of mixture Poisson processes. The first set has $n_1 = 100$ processes with rate distribution $\lambda_1 \propto \lambda^{-2}$ and the second set has $n_2 = 100$ processes with rate distribution $\lambda_2 \propto \lambda^{-3}$. Poisson rates are measured in number of events per month. Each process has 1,000 infection seeds. The process to forecast, $\Psi^{(0)}$, belongs to one of the two sets and was observed from time zero until $\Delta t_1 = \Delta t_2/4$, approximately one week. In what follows we test how well SED can forecast $\Psi^{(0)}$; recall from Section 4.2 that estimating the Poisson rates is hard due to observation bias.
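A hedged sketch of this data generation (our own reading of the setup): rates are drawn by inverse-CDF sampling from a density proportional to λ^(-2) or λ^(-3); because such a density is not normalizable at 0, the sketch assumes a small lower cutoff, which is our addition.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_rates(n, power, lo=0.01, hi=10.0):
    """Inverse-CDF sampling from f(lambda) proportional to lambda^(-power)
    on [lo, hi]. The lower cutoff lo is our assumption (density diverges at 0)."""
    u = rng.uniform(size=n)
    k = 1.0 - power                     # exponent after integrating the density
    return (lo**k + u * (hi**k - lo**k)) ** (1.0 / k)

def simulate_process(rate, n_seeds=1000, months=1.0):
    """Per-seed event counts of one mixture Poisson process over [0, months]."""
    return rng.poisson(rate * months, size=n_seeds)

# 100 processes with rates ~ lambda^-2 plus 100 with rates ~ lambda^-3.
hist_dt2 = [simulate_process(r) for r in sample_rates(100, 2)] + \
           [simulate_process(r) for r in sample_rates(100, 3)]
# Counts in the first quarter of the month (dt1 = dt2/4), thinned as in Lemma 4.1.
hist_dt1 = [rng.binomial(x, 0.25) for x in hist_dt2]
```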

Results: For conciseness, Figure 1 shows only the results for three processes, one per column: processes 10 and 60 (type 1) and 160 (type 2). The remaining results can be found in our technical report [29]. Four forecasts are evaluated: the CCDF, mean, standard deviation, and second moment. As comparison baselines we use uninformed prediction – uniform matching weights – and matching using K-means with Euclidean distances, as the latter has been extensively and successfully used in mining and classifying time series [37]. For each process we present the true statistics at $\Delta t_1$ (Present) and $\Delta t_2$ (Future), and the predicted statistics at $\Delta t_2$: SED, uninformed, and K-means predictions. Figure 1 shows that SED is the method that best represents, consistently, the true statistics at time $\Delta t_2$ (Future) under all scenarios.

5.1.2 Scenario 2: Galton-Watson Process
The Galton-Watson process is a branching process $\{X_n\}$ such that $X_{n+1} = \sum_{j=1}^{X_n} \psi^{(n)}_j$, where $X_0 = 1$ and $\{\psi^{(n)}_j \sim \mathrm{Poisson}(\lambda) : n, j \in \mathbb{N}\}$ is a set of i.i.d. natural-number-valued random variables [22]. Intuitively, $\psi^{(n)}_j$ is the number of male children of the $j$-th descendant, and $X_n$ is the number of descendants in the $n$-th generation. The total number of people with the considered family name at time $t$ would be $\sum_{i=0}^{t} X_i$. A simple epidemic cascading on a network can be modeled as a Galton-Watson process.

To make this model harder to forecast, we replace the Poisson distribution of the number of children with a log-normal distribution. The purpose of the log-normal distribution is to skew the resulting distribution of the number of children, mimicking the real-life situation where the degree of a node in a network follows a heavy-tailed distribution. We call the values generated by the log-normal distribution the descendant effect of a node. The model we use is $X_{n+1} = \sum_{j=1}^{\lfloor X_n \rfloor} \psi^{(n)}_j$, where $X_0 = 1$ and $\{\psi^{(n)}_j \sim \text{log-normal}(\mu, \sigma^2) : n, j \in \mathbb{N}\}$. The floor operation $\lfloor X_n \rfloor$ yields an integer-valued number of descendants.

Cascade Process Evolution: For each information cascade, we use a birth rate $\gamma$ to generate new seeds over time. Given a single seed node ($X_0 = 1$), the total descendant effect up to time $t$ in a Galton-Watson process, i.e., $\sum_{i=0}^{t} X_i$, is used as the cascade size at time $t$ for this seed. As a result, we obtain a sample of the cascade size per seed. To add more complexity to the model, we allow the parameters to change over time: $\mu_t = \mu_0 + t \cdot \delta\mu$, $\sigma_t = \sigma_0 + t \cdot \delta\sigma$, $\gamma_t = \gamma_0 + t \cdot \delta\gamma$, where $\mu_t$, $\sigma_t$, $\gamma_t$ are the parameter values at time point $t$, and $\delta\mu$, $\delta\sigma$, $\delta\gamma$ are the amounts of change at each time step.

Table 1 shows our parameter settings. We create 9 different cascade types, each containing 10 cascades spanning a period of seven time units. These cascade types can be grouped into three big groups – small, medium, and large – based on their mean $\mu$ in the log-normal distribution. Larger cascades have a larger variation (larger $\sigma$) and a higher birth rate $\gamma$. For each of these big groups, we generate three subgroups with different evolutionary trends: increasing in size (inc), decreasing in size (dec), and constant size (const). Finally, we set $\Delta t_1 = 2$ time units and $\Delta t_2 = 7$ time units.

Results: Figure 2 shows the results for three processes: 0 (Small-inc), 10 (Small-dec), and 60 (Large-inc); we get similar results for the other processes. Note that the statistics of Present and Future are significantly different. Figure 2 shows that the uninformed and K-means forecasts are so far from the truth that their confidence intervals do not fit in our plot. Over the same scenario, SED accurately forecasts the mean, standard deviation, and second moment, as shown in the last three rows of Figure 2.

5.2 Forecasting Twitter Cascades
We study the Twitter dataset collected by Yang et al. [39] (http://snap.stanford.edu/data/twitter7.html), which contains 467 million Twitter posts from 20 million users covering a 7-month period from June 1st to December 31st, 2009.

Process Type | IDs    | μ0  | σ0  | γ0  | δμ   | δσ   | δγ
Small-inc    | 0..9   | 1.5 | 1   | 300 | +0.2 | +0.1 | +30
Small-dec    | 10..19 | 1.5 | 1   | 300 | -0.2 | -0.1 | -30
Small-const  | 20..29 | 1.5 | 1   | 300 |  0   |  0   |  0
Medium-inc   | 30..39 | 2   | 2   | 500 | +0.2 | +0.1 | +30
Medium-dec   | 40..49 | 2   | 2   | 500 | -0.2 | -0.1 | -30
Medium-const | 50..59 | 2   | 2   | 500 |  0   |  0   |  0
Large-inc    | 60..69 | 3   | 2.5 | 700 | +0.2 | +0.1 | +30
Large-dec    | 70..79 | 3   | 2.5 | 700 | -0.2 | -0.1 | -30
Large-const  | 80..89 | 3   | 2.5 | 700 |  0   |  0   |  0

Table 1: Galton-Watson process simulation parameters. μ0, σ0, γ0 are the initial parameter values (shared within each size group); δμ, δσ, δγ are the per-time-step changes.

The underlying Twitter follower-followee network is obtained from Kwak et al. [18]. The fact that this Twitter dataset is only a sample of all tweets does not invalidate it as a test of SED: SED is designed to make predictions over general dissemination processes, including sampled cascading processes such as our Twitter dataset.

Hashtag cascades
We are interested in the diffusion of a hashtag over the Twitter online social network. When an individual uses a hashtag in her tweet at a given time, if neither she nor any of her followees has tweeted the same hashtag for at least a week prior to the post timestamp, we consider her a seed. The timestamp of the first tweet is the timestamp of a node's hashtag "infection". Once the seed of a cascade has been identified, we start tracking the cascade over time. Note that our definition of seed is stringent, and we may merge almost unrelated cascades into a single cascade. By our RMPP definition, metrics over merged cascade processes are allowed and, thus, we can be conservative in defining independent cascades in our dataset.

If a hashtag has not been used for more than a week, we assume the cascade has ended; the first occurrence of the same hashtag after that marks the beginning of a new cascade. Due to this one-week time window, hashtags appearing for the first time in the dataset during the first and last weeks of the monitored time frame are ignored. The time difference between the first and last posts of an information cascade is defined as the duration of the cascade.

Cascade metric
In our case study, we are interested in the distribution of the number of infected descendants of a seed node for each hashtag in Twitter. In particular, given an infected seed node, we quantify how many of its descendants are "infected" by a hashtag while complying with the temporal infection order (infected children in the branching process must be infected after their parents). These quantities are obtained from the induced subgraph of the infected nodes over the original follower-followee network. Note that a user may have several followees who posted the same hashtag before she did. As a result, simply counting the number of infected descendants of an infected user does not guarantee the independence between samples required by our theory. To reduce the effect of these independence violations, we assume that the infected parents of an infected node are evenly responsible for the infection. Under this assumption we propose a normalized metric, as follows.


Figure 3: An example induced subgraph of infected nodes and the cascade metric. Node infection timestamps are shown in red.

Definition 5.1. Given a time window of length $T$, the number of normalized infected descendants (NNID) of a node $v$ is defined as:

$$\mathrm{NNID}(v) = \sum_{w \in N^T_{\mathrm{des}}(v)} \delta(v \to w)$$

where $N^T_{\mathrm{in}}(v)$ and $N^T_{\mathrm{des}}(v)$ are the sets of infected parents and infected descendants of node $v$ within the time window $T$, respectively; $\delta(v \to v) = 1$; and $\delta(v \to w)$ is the amount of infection spread from node $v$ to $w$ through all possible time-ordered paths, defined recursively as follows:

$$\delta(v \to w) = \frac{\sum_{u \in N^T_{\mathrm{in}}(w) \cap (N^T_{\mathrm{des}}(v) \cup \{v\})} \delta(v \to u)}{|N^T_{\mathrm{in}}(w)|}$$

We assume that if a node $v$ is infected at time point $t$, then after time point $t + T$ its effect on other nodes diminishes to zero. Thus, the set of infected descendants of node $v$ does not contain nodes infected after time $t + T$. To get a distribution of NNIDs for a cascade $c$ at a given time point $t$, we first extract the induced subgraph $G_{\mathrm{ind}} = (V_{\mathrm{ind}}, E_{\mathrm{ind}})$, where $V_{\mathrm{ind}}$ is the set of infected nodes of $c$ up to time $t$. In addition, given two infected nodes $u$ and $v$, if the edge $(u \to v) \in E$ and $t_{\mathrm{infect}}(v) - t_{\mathrm{infect}}(u) \leq T$, then $(u \to v) \in E_{\mathrm{ind}}$, where $t_{\mathrm{infect}}(v)$ is the timestamp at which $v$ is infected in $c$. Next, for each seed node of the cascade, i.e., each node with no infected parents within a time window $T$, we compute its NNID in $G_{\mathrm{ind}}$ using Definition 5.1. In the end, we obtain a sample of NNID values, one per seed in $c$, giving us an empirical distribution of the NNIDs of $c$.

Example 1. An induced subgraph of infected nodes is shown in Fig. 3. Given a time window $T = 7$, the descendant set of $C$ is $\{D, E, H\}$; the infected parent sets of $D$ and $E$ are $\{B, C\}$ and $\{C\}$, respectively. Thus, $\delta(C \to D) = 1/2$ and $\delta(C \to E) = 1$. The infection from $C$ can spread to $H$ following two different paths: $C \to D \to H$ and $C \to E \to H$. Thus, $\delta(C \to H) = (\delta(C \to D) + \delta(C \to E))/2 = 3/4$. Finally, the number of normalized descendants of $C$ is $\mathrm{NNID}(C) = \delta(C \to D) + \delta(C \to E) + \delta(C \to H) = 1/2 + 1 + 3/4 = 9/4$.

Similarly, the set of seed nodes with no infected parents within a time window $T = 7$ is $\{A, B, C, F, I\}$, which gives a sample of NNIDs $\{1, 3/4, 9/4, 1, 1\}$ accordingly.
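The NNID recursion is easy to implement. The sketch below hard-codes only the fragment of Fig. 3 that Example 1 describes (edges B→D, C→D, C→E, D→H, E→H, all within the window) and reproduces NNID(C) = 9/4; the full metric would first build the time-respecting induced subgraph as described above.

```python
# Infected-parent sets within the window (the fragment of Fig. 3 used above).
parents = {"D": {"B", "C"}, "E": {"C"}, "H": {"D", "E"}}

def descendants(v):
    """Nodes reachable from v along infection edges."""
    out, frontier = set(), {w for w, ps in parents.items() if v in ps}
    while frontier:
        out |= frontier
        frontier = {w for w, ps in parents.items() if ps & frontier} - out
    return out

def delta(v, w):
    """Amount of infection spread from v to w (Definition 5.1)."""
    if v == w:
        return 1.0
    allowed = descendants(v) | {v}
    return sum(delta(v, u) for u in parents[w] & allowed) / len(parents[w])

def nnid(v):
    return sum(delta(v, w) for w in descendants(v))

print(nnid("C"))   # 0.5 + 1.0 + 0.75 = 2.25 = 9/4
```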

Twitter Results
For our tests, we identify the Twitter hashtag cascades that last at least 8 weeks. Using this dataset, we predict the distribution of NNID (time window $T$ = 1 week) at week 8 ($\Delta t_2$) using data from week 2 ($\Delta t_1$). To guarantee that enough data is available for prediction, we only consider cascades that contain at least 300 infected users after 8 weeks and at least 30 infected users after one week. In the end, we are left with 1,050 cascades split among the different hashtags. As in the simulations, besides the CCDF plots we use violin plots to compare the predicted statistics (SED prediction and uninformed prediction) with the bootstrap statistics from the ground-truth sample at $\Delta t_2$. It is worth noting that the uninformed prediction simply returns the average statistics of all cascades in $\Omega$ at time $\Delta t_2$.

The forecasting results for five Twitter hashtags are shown in Figure 4. Our technical report shows all other hashtag forecasts left out of this paper due to space constraints [29]. SED forecasts frequently outperform uninformed and K-means predictions. For example, all #FORASARNEY statistics differ considerably from those of the uninformed and K-means predictions, whereas SED accurately forecasts the average CCDFs and all three statistics, as observed by contrasting the SED forecast violin plots against those of the ground truth (Future). For #H1N1, #FIREFOX, and #SOUNDCLOUD, the uninformed forecasts are rather uncertain of the values of the mean, the standard deviation, and the second moment, leading to significantly lengthened violin plots. The SED forecasts, on the other hand, are once again close to the ground truth. In the case of forecasting the mean of #JQUERY (last column, second row), the true mean (Future) is so close to the future average statistics (uninformed prediction) that SED shows no clear improvement over the naive approaches. We include this #JQUERY example to show that in hard-to-forecast cases SED does no worse than uninformed and K-means predictions. Note, however, that for the standard deviation and second moment of #JQUERY, SED is superior to uninformed and K-means.

It must be noted that the hashtag Twitter dataset we are using is noisy (cascades are not fully independent), making our forecasting task very difficult. Moreover, for many of the cascades in this dataset, the distribution at week 8 ($\Delta t_2$) is very close to the average prediction (uninformed prediction). Thus, more challenging datasets and further research are needed to further validate the practical aspects of our approach.

6. RELATED WORK

Analyzing and predicting network processes in general, and information cascades in particular, has attracted much attention in recent years. Multiple works focus on building theoretical models of dissemination processes (e.g., [8, 9, 38]). These models capture information diffusion at nodes through the network topology as well as user interests and the information content [7, 27, 38]. Wang et al. [35] use partial differential equations to predict information diffusion over both temporal and spatial dimensions. One challenge in this line of research is the gap between empirical data and theory: building parsimonious models is a complex task, and the resulting models are often not vetted against real data.

Most relevant to this paper are the works that predict cascade popularity. Some of these works predict the volume of aggregate activity, such as the number of votes on Digg stories [33] or Twitter hashtag usage [20]. A few works examine how information cascades grow in size depending on their content [19, 30] and user interests [2]. Other works make predictions using observations of a cascade during a given fixed time interval [17, 20, 34]. The cascade size prediction task is treated as a regression problem in a variety of works [3, 17, 33, 34]. Matsubara et al. [21] propose a parsimonious model that captures the rise and fall patterns in information diffusion. In lieu of predicting exact cascade sizes, many studies bin the cascade sizes and solve a binary classification problem of whether or not a cascade becomes viral [11, 13, 17] or whether it doubles in size [4].


Figure 4: Predictions of statistics of NNID for five example hashtags in Twitter: (a) FORASARNEY, (b) H1N1, (c) FIREFOX, (d) SOUNDCLOUD, (e) JQUERY.

Instead of modeling the diffusion process, Najar et al. [25] directly predict the final propagation state of the information given its initial state. Beyond cascade sizes, Cheng et al. [4] also predict structural features of the cascade sample paths. Our goal is different: we do not perform binary classification of the evolution of a sample path, nor do we only predict cascade sizes. We propose a general axiomatic forecasting framework that is not tied to a specific set of cascade features or a specific process and can be used off-the-shelf in any similar forecasting scenario.

Since we predict cascade statistics, our work also relates to research on fitting empirical data to parsimonious statistical models [1, 5]. While empirical data can often be readily fitted to well-known parsimonious models such as power laws, log-normals, or exponentials, there is no guarantee that the fitted model can predict the tail of the distribution or how the distribution changes with the observation window. Indeed, our experimental results on Twitter data show that cascade statistics can change significantly over time; hence the need to go beyond parsimonious models and design a data-driven statistical approach based on axiomatic forecasting.

7. CONCLUSIONS

In this work we propose SED, a new algorithm for forecasting statistics of complex networked processes. SED is the first of its kind: a (true) model-free approach that uses axioms rather than models to extract statistical information from the data, pointing to a promising new direction of axiomatic forecasting. More importantly, we provide the underlying theory behind SED's model-free axiomatic forecasting approach. We test SED on two synthetic datasets and a large Twitter dataset, showing that SED can accurately forecast a variety of statistics under complex scenarios.

Acknowledgements

We thank C. Faloutsos and D. Towsley for their helpful comments. This work was partially supported by NSF grants CNS-1065133 and IIS-1219254, and by ARL Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the NSF, ARL, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.


8. REFERENCES

[1] Lada A. Adamic. Zipf, power-laws, and Pareto - a ranking tutorial. Xerox Palo Alto Research Center, 2000.
[2] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In Proc. SIGKDD, 2006.
[3] Eytan Bakshy, Brian Karrer, and Lada A. Adamic. Social influence and the diffusion of user-created content. In Proc. EC, 2009.
[4] Justin Cheng, Lada Adamic, P. Alex Dow, Jon Michael Kleinberg, and Jure Leskovec. Can cascades be predicted? In Proc. WWW, 2014.
[5] Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51, 2009.
[6] R. A. Fisher. The Design of Experiments. Oliver & Boyd, 1935.
[7] W. Galuba, K. Aberer, and D. Chakraborty. Outtweeting the twitterers - predicting information cascades in microblogs. In Proc. WOSN, 2010.
[8] Michaela Goetz, Jure Leskovec, Mary McGlohon, and Christos Faloutsos. Modeling blog dynamics. In Proc. ICWSM, 2009.
[9] Amit Goyal, Francesco Bonchi, and Laks V. S. Lakshmanan. Learning influence probabilities in social networks. In Proc. WSDM, 2010.
[10] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 13(1):723-773, March 2012.
[11] Liangjie Hong, Ovidiu Dan, and Brian D. Davison. Predicting popular messages in Twitter. In Proc. WWW, 2011.
[12] Ting-Kai Huang, Bruno Ribeiro, Harsha V. Madhyastha, and Michalis Faloutsos. The socio-monetary incentives of online social network malware campaigns. In Proc. COSN, pages 259-270. ACM, 2014.
[13] Maximilian Jenders, Gjergji Kasneci, and Felix Naumann. Analyzing and predicting viral tweets. In Proc. WWW, 2013.
[14] Eamonn Keogh, Stefano Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In Proc. SIGKDD, pages 206-215, 2004.
[15] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea Publishing Co., 1950.
[16] N. H. Kuiper. Tests concerning random points on a circle. Proceedings of the Koninklijke Nederlandse Akademie, 1960.
[17] Andrey Kupavskii, Liudmila Ostroumova, Alexey Umnov, Svyatoslav Usachev, Pavel Serdyukov, Gleb Gusev, and Andrey Kustarev. Prediction of retweet cascade size over time. In Proc. CIKM, 2012.
[18] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proc. WWW, 2010.
[19] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proc. SIGKDD, 2009.
[20] Zongyang Ma, Aixin Sun, and Gao Cong. On predicting the popularity of newly emerging hashtags in Twitter. JASIST, 64, 2013.
[21] Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, and Christos Faloutsos. Rise and fall patterns of information diffusion: model and implications. In Proc. SIGKDD, pages 6-14, 2012.
[22] Fabricio Murai, Bruno Ribeiro, Don Towsley, and Krista Gile. Characterizing branching processes from sampled data. In Proc. WWW Companion, pages 805-811, 2013.
[23] Fabricio Murai, Bruno Ribeiro, Don Towsley, and Pinghui Wang. On set size distribution estimation and the characterization of large networks via sampling. IEEE Journal on Selected Areas in Communications, 31(6):1017-1025, 2013.
[24] Seth A. Myers, Chenguang Zhu, and Jure Leskovec. Information diffusion and external influence in networks. In Proc. SIGKDD, 2012.
[25] Anis Najar, Ludovic Denoyer, and Patrick Gallinari. Predicting information diffusion on social networks with partial knowledge. In Proc. WWW Companion, pages 1197-1204. ACM, 2012.
[26] Charles T. Perretti, Stephan B. Munch, and George Sugihara. Model-free forecasting outperforms the correct mechanistic model for simulated and experimental data. PNAS, 110(13):5253-7, March 2013.
[27] Daniel Ramage, Susan T. Dumais, and Daniel J. Liebling. Characterizing microblogs with topic models. In Proc. ICWSM, 2010.
[28] Sidney I. Resnick. Extreme Values, Regular Variation, and Point Processes. Springer Science & Business Media, 2007.
[29] Bruno Ribeiro, Minh X. Hoang, and Ambuj K. Singh. Beyond models: Forecasting complex network processes directly from data. http://www.cs.cmu.edu/~ribeiro/pdf/Ribeiro_etal_BeyondTR15.pdf.
[30] Daniel M. Romero, Chenhao Tan, and Johan Ugander. On the interplay between social and topical structure. In Proc. ICWSM, 2013.
[31] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343-348, 1967.
[32] M. A. Stephens. Use of the Kolmogorov-Smirnov, Cramér-von Mises and related statistics without extensive tables. Journal of the Royal Statistical Society, Series B, 32, 1970.
[33] Gabor Szabo and Bernardo A. Huberman. Predicting the popularity of online content. Communications of the ACM, 53, 2010.
[34] Oren Tsur and Ari Rappoport. What's in a hashtag?: content based prediction of the spread of ideas in microblogging communities. In Proc. WSDM, 2012.
[35] Feng Wang, Haiyan Wang, and Kuai Xu. Diffusive logistic model towards predicting information diffusion in online social networks. WINE, 2011.
[36] James R. Wilson. The inspection paradox in renewal-reward processes. Operations Research Letters, 2(1):27-30, April 1983.
[37] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40, November 2010.
[38] Jaewon Yang and Jure Leskovec. Modeling information diffusion in implicit networks. In Proc. ICDM, 2010.
[39] Jaewon Yang and Jure Leskovec. Patterns of temporal variation in online media. In Proc. WSDM, 2011.
[40] Ron Zass and Amnon Shashua. Probabilistic graph and hypergraph matching. In Proc. CVPR, pages 1-8. IEEE, 2008.