Sampling graphs efficiently: model-assisted designs and application to Twitter data

MAD Stat (TSE)

Antoine Rebecq

Université Paris X - INSEE

3/23/17

Antoine Rebecq Sampling designs for graphs

1 Statistics and networks
    Graphs and stats
    Methods - algorithms - models

2 Survey sampling
    Estimates
    Use of auxiliary information

3 Extending the sampling design
    Snowball sampling
    Adaptive sampling

4 Application to Twitter data
    The problem
    Results
    Model-assisted sampling

Section 1

Statistics and networks

Subsection 1

Graphs and stats

Graphs

A graph G is a set of vertices V and a set of edges E: $G = (V, E)$

Directed graphs

Statistics of interest - graphs

Size

Degree

Centrality

Clustering

Communities

. . .

Degree

$d_v$ = number of edges incident upon vertex $v$
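As a minimal sketch, assuming the graph is stored as an undirected edge list (the toy edges below are made up for illustration), degrees can be tallied like this:

```python
from collections import Counter

def degrees(edges):
    """Return {vertex: degree} for an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

# Hypothetical toy graph: a triangle (a, b, c) plus a pendant vertex d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
toy_degrees = degrees(toy_edges)
```

For a directed graph one would keep two counters (in-degree and out-degree) instead of one.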

Degree / scale-free property

Path lengths

Centrality

Measure of “importance” of a node.

Examples: Google PageRank, betweenness centrality (number of times a node acts as a bridge along the shortest path between two other nodes)

Betweenness centrality

Clustering

$$\text{Global clustering coefficient} = \frac{3 \cdot \text{number of triangles}}{\text{number of connected triplets}}$$

Local clustering coefficient of a vertex = how close its neighbours are to being a clique (complete graph).
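Both coefficients can be computed directly from adjacency sets. The sketch below assumes an undirected simple graph and counts a connected triplet as a pair of neighbours around a centre vertex; the helper names and the toy graph are illustrative only.

```python
from itertools import combinations

def adjacency(edges):
    """Build {vertex: set of neighbours} from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def local_clustering(adj, v):
    """Share of pairs of neighbours of v that are themselves linked."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def global_clustering(adj):
    """3 * triangles / connected triplets."""
    closed = 0    # closed triplets: every triangle is counted once per corner, i.e. 3 times
    triplets = 0  # all pairs of neighbours around each centre vertex
    for v, nbrs in adj.items():
        k = len(nbrs)
        triplets += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triplets if triplets else 0.0

# Hypothetical toy graph: a triangle (a, b, c) plus a pendant vertex d attached to c.
toy_adj = adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
toy_global = global_clustering(toy_adj)
toy_local_c = local_clustering(toy_adj, "c")
```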

Local clustering coefficient

The rise of “big graphs”

Example: the Graph500 benchmark (http://www.graph500.org). Size of data sets: up to a 1.1 PB adjacency list (human connectome size).

Subsection 2

Methods - algorithms - models

Methods for graph statistics

Algorithms (computer science, “big data”)

Model-based estimation

Sampling (“Design-based estimation”)

Computer science methods

Efficient algorithms (speed / memory).

Sometimes require sampling.

Model-based estimation

Famous graph models :

Erdős-Rényi

Price / Barabási-Albert (heavy-tailed degree distribution)

Watts-Strogatz / “small-world” (short path lengths)

Stochastic block models (communities)

Images from [8]

Model-based estimation: Erdős-Rényi (“random graphs”)
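For readers who want to generate such a random graph, here is a minimal sketch of the G(n, p) variant of the model, in which every possible edge is kept independently with probability p. The slides do not say which variant is meant, so this choice, the function name and the parameter values are assumptions.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Sample G(n, p): each of the n(n-1)/2 possible edges is kept with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

# Illustrative parameters only.
er_edges = erdos_renyi(n=100, p=0.05, seed=1)
er_mean_degree = 2 * len(er_edges) / 100  # expected value is p * (n - 1)
```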

Model-based estimation: Barabási-Albert (“preferential attachment”)

Model-based estimation : Watts-Strogatz (“small world”)

Model-based estimation : Stochastic Block Models

Sampling / Design-based estimation

Sampling: select a few vertices/edges and compute estimators using sample data. Very little exists about design-based statistical inference on networks (Kolaczyk 2009, [5]).

We try survey sampling methods used in official statistics institutes to make design-based inference about “big graphs”.

Section 2

Survey sampling

Subsection 1

Estimates

Horvitz-Thompson estimator

Population U (here vertices of the graph).

Assign each $k \in U$ an inclusion probability $\pi_k = P(k \in s)$

Classic unbiased estimator for totals and means: Horvitz-Thompson

$$T(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$$

$$\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$$
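In code, both estimators are a weighted sum over the sampled units, each unit being weighted by the inverse of its inclusion probability. A minimal sketch, with made-up sample values:

```python
def horvitz_thompson_total(sample_y, sample_pi):
    """Sum of y_k / pi_k over the sampled units."""
    return sum(y / pi for y, pi in zip(sample_y, sample_pi))

def horvitz_thompson_mean(sample_y, sample_pi, N):
    """Horvitz-Thompson total divided by the known population size N."""
    return horvitz_thompson_total(sample_y, sample_pi) / N

# Hypothetical sample of three units drawn from a population of N = 20.
ht_total_example = horvitz_thompson_total([2, 0, 5], [0.5, 0.25, 0.5])
ht_mean_example = horvitz_thompson_mean([2, 0, 5], [0.5, 0.25, 0.5], N=20)
```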

The variance of the Horvitz-Thompson estimator depends on the first- and second-order inclusion probabilities:

$$\pi_k = P(k \in s)$$

$$\pi_{kl} = P(k, l \in s)$$

$$V(T(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$$

Bernoulli sampling

Poisson sampling: for each $k \in U$, run a Bernoulli experiment with probability $\pi_k$ to decide whether to include unit $k$ in the sample.

Bernoulli sampling: $\forall k, \pi_k = p$
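A Poisson sample is easy to draw, and its Horvitz-Thompson variance is simple. Because the draws are independent, $\pi_{kl} = \pi_k \pi_l$ for $k \neq l$ and $\pi_{kk} = \pi_k$, so the double sum of the variance formula reduces to $\sum_{k \in U} \frac{1 - \pi_k}{\pi_k} y_k^2$. The sketch below uses a made-up population; all names are illustrative.

```python
import random

def poisson_sample(pi, rng):
    """Indices kept after one independent Bernoulli(pi[k]) draw per unit."""
    return [k for k, p in enumerate(pi) if rng.random() < p]

def ht_total(y, pi, sample):
    """Horvitz-Thompson total computed on the sampled indices."""
    return sum(y[k] / pi[k] for k in sample)

def poisson_ht_variance(y, pi):
    """Exact HT variance under independent draws: sum of (1 - pi_k) / pi_k * y_k^2."""
    return sum((1 - p) / p * yk ** 2 for yk, p in zip(y, pi))

rng = random.Random(0)
y = [1, 0, 3, 2, 0, 1]   # hypothetical population values
pi = [0.5] * len(y)      # Bernoulli sampling: the same p for every unit
s = poisson_sample(pi, rng)
estimate = ht_total(y, pi, s)
exact_variance = poisson_ht_variance(y, pi)
```

Note that the sample size is random under this design, which is why realized sample sizes differ slightly from their target.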

Subsection 2

Use of auxiliary information

Auxiliary information

If $\pi_k \propto y_k$ (with a fixed-size design), then $V(T(Y)_{HT}) = 0$

In practice, use an auxiliary variable $X$ that is well correlated with $Y$.

Stratified sampling

We write $U = U_1 \oplus U_2 \oplus \dots \oplus U_H$ and draw independent samples in each $U_h$.

Strata should be formed so that the dispersion of $y_k$ within each stratum is as low as possible.

Stratified sampling : Neyman allocation

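Given a set of strata and a sample size $n$, the variance is minimal for:

$$n_h = n \, \frac{N_h S_h}{\sum_{h'} N_{h'} S_{h'}}$$

where $N_h$ is the size of stratum $h$ and $S_h$ the standard deviation of $y$ within it.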
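As a sketch, the textbook form of the Neyman allocation, $n_h = n \, N_h S_h / \sum_{h'} N_{h'} S_{h'}$ with $S_h$ the within-stratum standard deviation of $y$, takes a few lines. The stratum sizes and dispersions below are invented for illustration; in practice the $S_h$ come from preliminary data.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample size n across strata proportionally to N_h * S_h."""
    weights = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: a small, highly dispersed one and a large, homogeneous one.
allocation = neyman_allocation(100, [1000, 9000], [3.0, 1.0])
```

The result is generally not an integer, so the allocation is rounded before drawing the samples.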

Calibrated estimator

Deville-Särndal, 1992 ([2]). Modification of the Horvitz-Thompson estimator to take auxiliary information into account.

Very similar to empirical likelihood methods ([7]).

Computing variances for calibrated estimators is easy.
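The simplest instance of calibration uses a single known total, the population size $N$. The design weights $d_k = 1/\pi_k$ are rescaled so that they sum exactly to $N$. This is only a sketch of that special case, with made-up weights; calibrating on several variables at once requires solving for multipliers and is not shown here.

```python
def calibrate_on_size(design_weights, N):
    """Rescale design weights d_k = 1 / pi_k so that they sum to the known size N."""
    factor = N / sum(design_weights)
    return [d * factor for d in design_weights]

def weighted_total(weights, values):
    """Estimated total: sum of w_k * y_k over the sample."""
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical sample of three units from a population of known size N = 10.
calibrated = calibrate_on_size([2.0, 2.0, 4.0], N=10)
calibrated_total = weighted_total(calibrated, [1, 0, 1])
```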

Section 3

Extending the sampling design

Official statistics

Measuring “hidden populations”

Community structure

When trying to measure the size of a community ($N_C$), edges can be used as auxiliary variables.

Snowball sampling

From now on, our sampling designs will include extensions: $s = s_0 \cup s_{ext}$

Subsection 1

Snowball sampling

Population U

Initial sample $s_0$

One-stage snowball extension: $s = A(s_0)$

Formally, we write:

$$B_i = \{i\} \cup \{j \in V, E_{ji} \neq \emptyset\}$$

$$A_i = \{i\} \cup \{j \in V, E_{ij} \neq \emptyset\}$$

$$s = A(s_0)$$
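The sets $A_i$ and $B_i$ and the one-stage extension can be sketched as follows. The code assumes that $E_{ij}$ denotes the edges going from $i$ to $j$, so that $A_i$ collects the successors of $i$ and $B_i$ its predecessors; this reading, the function names and the toy edges are assumptions.

```python
def successors_and_predecessors(edges):
    """Return (A, B) with A[i] = {i} + successors of i and B[i] = {i} + predecessors of i."""
    A, B = {}, {}
    for i, j in edges:  # directed edge i -> j
        A.setdefault(i, {i}).add(j)
        B.setdefault(j, {j}).add(i)
        A.setdefault(j, {j})
        B.setdefault(i, {i})
    return A, B

def snowball(s0, A):
    """One-stage extension: union of A[i] over the initial sample s0."""
    s = set()
    for i in s0:
        s |= A.get(i, {i})
    return s

# Hypothetical directed toy graph.
A, B = successors_and_predecessors([(1, 2), (1, 3), (2, 3), (4, 1)])
extended = snowball({1}, A)
```

A unit $i$ ends up in $s$ exactly when $B_i$ contains at least one unit of $s_0$, which is why the inclusion probabilities of the extended sample are driven by the sets $B_i$.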

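$$N_{C3} = \sum_{i \in s} \frac{z_i}{1 - \pi(B_i)}$$

where $\pi(B_i)$ is the probability that no unit of $B_i$ is drawn in the initial sample $s_0$:

$$\pi(B_i) = P(B_i \cap s_0 = \emptyset) = \prod_{k \in B_i} (1 - P(k \in s_0))$$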
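A minimal sketch of this estimator, assuming the initial sample $s_0$ is a Bernoulli sample with rate $p$. Under that assumption the probability that no unit of $B_i$ is drawn is $(1 - p)^{\#B_i}$, and unit $i$ is in the extended sample with probability $1 - (1 - p)^{\#B_i}$. The toy sets $B_i$ and indicators $z_i$ are invented.

```python
def exclusion_probability(B_i, p):
    """P(no unit of B_i is in s0) when s0 is a Bernoulli(p) sample."""
    return (1 - p) ** len(B_i)

def snowball_estimate(s, z, B, p):
    """Sum of z_i / (1 - pi(B_i)) over the extended sample s."""
    return sum(z[i] / (1 - exclusion_probability(B[i], p)) for i in s)

# Hypothetical toy graph: B[i] = {i} plus the predecessors of i.
B = {1: {1, 4}, 2: {1, 2}, 3: {1, 2, 3}, 4: {4}}
z = {1: 1, 2: 0, 3: 1, 4: 1}
snowball_example = snowball_estimate({1, 2, 3}, z, B, p=0.5)
```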

$$V(N_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\pi(B_i \cup B_j)} \gamma'_{ij}$$

where:

$$\gamma'_{ij} = \frac{\pi(B_i \cup B_j) - \pi(B_i)\pi(B_j)}{[1 - \pi(B_i)][1 - \pi(B_j)]}$$

Subsection 2

Adaptive sampling

Adaptive sampling (Thompson, [9])

Used in official statistics to measure the number of drug users or HIV-positive people

Sampling design often compared to the video game “Minesweeper”

Image from [10]

Once a unit bearing the characteristic of interest is found, its whole network is included in the sample.
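The expansion rule can be sketched as a breadth-first search. Starting from the initial sample, the neighbours of every unit bearing the characteristic are added, and the search goes on from the added units that bear it too. This follows the usual adaptive cluster sampling rule; the slides do not spell out the exact rule used, so treat it as an assumption. The toy path graph is made up.

```python
from collections import deque

def adaptive_sample(s0, neighbours, z):
    """Expand s0 by adding the neighbours of every sampled unit with z = 1."""
    sample = set(s0)
    queue = deque(i for i in s0 if z[i])
    while queue:
        i = queue.popleft()
        for j in neighbours[i]:
            if j not in sample:
                sample.add(j)
                if z[j]:            # keep expanding only through units of interest
                    queue.append(j)
    return sample

# Hypothetical path graph 0 - 1 - 2 - 3 - 4 - 5 with units of interest 1, 2 and 5.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
z = [0, 1, 1, 0, 0, 1]
adaptive_example = adaptive_sample({1}, neighbours, z)
```

Starting from a unit without the characteristic adds nothing, which is what keeps the cost of the design concentrated on the population of interest.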

Estimator :

$$N_{C4} = \sum_{k=1}^{K} \frac{n^*_{C_k} J_k}{\pi_{gk}}$$

where:

$K$ = number of networks

$y^*_k$ = total of $Y$ in the network $k$

$n^*_{C_k}$ = number of people with $y_k \geq 1$ in the network $k$

$J_k = 1\{k \in C\}$

$\pi_{gk}$ = probability that the initial sample intersects $k$
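A sketch of this estimator, assuming a Bernoulli initial sample with rate $p$, so that a network of size $n$ is intersected with probability $1 - (1 - p)^n$. The flag $J_k$ is passed in as a 0/1 value by the caller, and the network descriptions below are invented.

```python
def intersection_probability(size, p):
    """P(a Bernoulli(p) initial sample hits a network of the given size)."""
    return 1 - (1 - p) ** size

def adaptive_estimate(networks, p):
    """Sum of n_star * J / pi_g over networks given as (size, n_star, J) tuples."""
    total = 0.0
    for size, n_star, J in networks:
        if J:
            total += n_star / intersection_probability(size, p)
    return total

# Hypothetical networks: (network size, number of units with y >= 1, flag J).
adaptive_total = adaptive_estimate([(3, 3, 1), (1, 0, 0), (2, 2, 1)], p=0.5)
```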

When using an adaptive design, it is often better to use the Rao-Blackwell version of the previous estimator. It has a very simple closed form in the case of the stratified adaptive design:

$$N_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1-p)^{n_r}}$$

where $n_0 = \#s_0$ and $s_0 = \cup_r \{k \in s, \delta(k, C) = 1\}$ is the union of the sides of $C$.

Adaptive sampling - Variance

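$$V(N_{C4}) = \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{y_k y_{k'}}{\pi_{gkk'}} \left( \frac{\pi_{gkk'}}{\pi_{gk} \pi_{gk'}} - 1 \right)$$

where $\pi_{gkk'}$ is the probability that the initial sample intersects both networks $k$ and $k'$. With $\pi_{gk} = 1 - (1-p)^{n_{gk}}$, for $k \neq k'$:

$$\pi_{gkk'} = \pi_{gk} + \pi_{gk'} - 1 + (1-p)^{n_{gk} + n_{gk'}}$$

and $\pi_{gkk} = \pi_{gk}$.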

Variance estimation for the Rao-Blackwell estimator can be done by selecting $m$ samples:

$$V(N_{C5}) = V(N_{C4}) - \frac{1}{m-1} \sum_{i=1}^{m} (N_{C5i} - N_{C4})^2$$

Section 4

Application to Twitter data

Subsection 1

The problem

The Twitter graph

Twitter in 2013

Image from [1]

The Twitter API

Access to the Twitter data goes through an API (application programming interface), which limits the number of calls per hour.

Example : Star Wars : The Force Awakens

How many (real) users are behind the tweets talking about the new Star Wars movie?

Let's write:

$y_k$ = number of tweets @starwars by user $k$ between 7:48 and 10:48 PM EST on 10/29/15

$z_k = 1\{y_k \geq 1\}$

Goal: estimate $N_C = T(Z)$

Additionally, we write: $n_C = \sum_{k \in s} z_k$
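In code, the indicator $z_k$ and the sample count $n_C$ amount to thresholding the tweet counts. A tiny sketch with made-up user names and counts:

```python
def indicators(tweet_counts):
    """z_k = 1 if user k tweeted at least once in the window, else 0."""
    return {user: int(count >= 1) for user, count in tweet_counts.items()}

def sample_count(sample, z):
    """n_C: number of sampled users with z_k = 1."""
    return sum(z[user] for user in sample)

# Hypothetical tweet counts y_k for three users.
z_example = indicators({"u1": 0, "u2": 3, "u3": 1})
n_C_example = sample_count(["u1", "u2"], z_example)
```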

The Twitter graph

The Twitter graph ([6]) :

Is directed

Degree distribution is heavy-tailed

Has small path lengths

Sampling designs

1 Bernoulli sample

2 Stratified Bernoulli

3 Snowball over the stratified Bernoulli

4 Adaptive over the stratified Bernoulli

5 (Rao-Blackwell version of the adaptive estimator)

Stratification

U1 = Followers of official @starwars account

U2 = Rest of Twitter users

Stratification : Neyman allocation

Given some preliminary exploratory data, we get (for $n = 20000$):

$n_1 = 9700$

$n_2 = 10300$

Sample size - extension

Size of $s_0$: 1000 (so that the total sample size, with extensions, would be about $n = 20000$).

Calibration variables

N = Number of users in scope

Structure of number of followers

Number of verified users

. . .

Estimators

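$$N_{C1} = \frac{n_C}{p}$$

$$N_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$$

$$N_{C3} = \sum_{i \in s} \frac{z_i}{1 - \pi(B_i)}$$

$$N_{C4} = \sum_{k=1}^{K} \frac{n^*_{C_k} J_k}{\pi_{gk}}$$

$$N_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1-p)^{n_r}}$$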
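The first two estimators are plain expansion estimators. The sketch below assumes that $n_{C1}$ and $n_{C2}$ are the numbers of sampled users with $z_k = 1$ in each stratum and that $n_1$, $n_2$ are the stratum sample sizes; all figures are invented and unrelated to the Twitter results.

```python
def bernoulli_estimate(n_C, p):
    """N_C1: sample count of users of interest divided by the sampling rate p."""
    return n_C / p

def stratified_estimate(N, N1, n1, n2, n_C1, n_C2):
    """N_C2: each stratum count is expanded by its own factor N_h / n_h."""
    return N1 / n1 * n_C1 + (N - N1) / n2 * n_C2

# Hypothetical figures for illustration.
estimate_bernoulli = bernoulli_estimate(n_C=40, p=0.25)
estimate_stratified = stratified_estimate(N=1000, N1=200, n1=50, n2=100, n_C1=20, n_C2=5)
```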

Exclusion probabilities

$$\pi(B_i) = P(B_i \cap s_0 = \emptyset) = \prod_{k \in B_i} (1 - P(k \in s_0)) = q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$$
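With two strata, the product only depends on how many units of $B_i$ fall in each stratum. A minimal sketch, assuming $q_{S_h}$ is the probability that a unit of stratum $h$ is not drawn in the initial sample (one minus the stratum sampling rate); the sets and rates below are made up.

```python
def stratified_exclusion_probability(B_i, stratum_1, q1, q2):
    """q1 ** #(B_i in U1) * q2 ** #(B_i in U2) for a two-stratum Bernoulli design."""
    in_first = sum(1 for k in B_i if k in stratum_1)
    in_second = len(B_i) - in_first
    return q1 ** in_first * q2 ** in_second

# Hypothetical example: two units of B_i in U1, one in U2.
exclusion_example = stratified_exclusion_probability({1, 2, 3}, {1, 2}, q1=0.5, q2=0.9)
```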

Subsection 2

Results

Design       n        n_scope   n_0    N_C      CV      Deff (estimated)
Bernoulli    20013    3946      -      354121   0.231   1.04
Stratified   20094    9832      -      316889   0.097   0.68
1-snowball   159957   73570     1000   331097   0.031   0.60

Mean number of tweets @StarWars per user: 1.18 ± 0.07

This suggests that bots are not responsible for this very large number of tweets (see [4], [3])!

Adaptive sampling did not converge.

Subsection 3

Model-assisted sampling

Auxiliary information for the Barabási-Albert model (rows and columns: Degree, Centrality, Local clustering, Mean path, Max path; upper triangle only):

Degree: ++ - - - -
Centrality: - - - -
Local clustering: + +
Mean path: ++
Max path:

Future work

Combine all these (optimal allocations, etc.)

Asymptotics

Conclusion

Thank you !

http://nc233.com/madstat2017

@nc233

[1] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.

[2] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.

[3] Emilio Ferrara. "Manipulation and abuse on social media" by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsl., (Spring):4:1–4:9, April 2015.

[4] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.

[5] Eric D. Kolaczyk. Statistical analysis of network data. Springer, 2009.

[6] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network?: the structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd international conference on World wide web companion, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.

[7] Art B. Owen. Empirical likelihood. CRC Press, 2010.

[8] Tiago P. Peixoto. The graph-tool python library. figshare, 2014.

[9] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.

[10] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.