From coincidence to purposeful flow? Properties of transcendental information cascades

Markus Luczak-Roesch
University of Southampton, Web and Internet Science Group
@mluczak | http://sociam.org

Zooniverse

Serendipity through talk

Task and talk participation

[Figure: axes labelled Talk contributions and Classifications; annotation: 40.5%]

Community-level linguistic change

Project PH
  initial 10%: transit, star, day, aph, look, one, planet, like, possibl, dip
  most recent 10%: day, transit, httparchive. . . , possibl, star, kid, dip, look, planet, like

Project SF
  initial 10%: like, look, fish, sea, scallop, thing, imag, right, star, left
  most recent 10%: corallinealga, anemon, object, hermitcrab, bryozoan, stalkedtun, shrimp, left, cerianthid, sanddollar

Project NN
  initial 10%: field, record, one, use, enter, get, work, can, specimen, button
  most recent 10%: like, field, record, date, name, can, click, look, get, label

Stable domain specific vocabulary

Emerging domain specific vocabulary

Stable problem/error reporting

Dominance of microposts and implicit coordination

[Figure: per-project chart for PH, SG, SW, NN, GZ, CC, PF, SF, AP, WS; axis labels: Vocabulary shift (0 to 10) and Microposts; annotation: 91%]

Luczak-Roesch, M., Tinati, R., Simperl, E., Van Kleek, M., Shadbolt, N., & Simpson, R. (2014). Why won't aliens talk to us? Content and community dynamics in online citizen science. Proceedings of the Eighth AAAI Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014.

Networks within and out of the Zooniverse

Crowd mapping

Crisis response on social media

A qualitative investigation of crowdsourced disaster response

•  Haiti (Ushahidi, N=298)
   –  requests for help from identified local source
•  Congo (Ushahidi, N=102)
   –  information about the situation but not who is responsible for this information
   –  more non-local sources
•  Ebola (Twitter, N=298)
   –  comments
      •  tasteless jokes
      •  racist comments
      •  concern that the crisis could spread and call to governments to close the borders

Boundaries of crowdsourced disaster response

•  Wrong things go viral
•  Crowdsourcing informativeness of social media information not synchronized with crises*

[Figure: sentiment scale labelled negative, neutral, positive]

“When you tell a […] kid that is has got Ebola”

*Olteanu, A., Vieweg, S., & Castillo, C. (2015). What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proc. of 18th ACM Computer Supported Cooperative Work and Social Computing (CSCW’15), (No. EPFL-CONF-203562).

The future of disaster crowd work

Synchronization

Coordination

We can observe situations when online communication does not happen along explicit social ties (especially in critical situations when time to make decisions is scarce). Instead of talking explicitly with each other, people are broadcasting about the same event or topic.

Source: United Nations Development Programme, https://goo.gl/Z1uXdV, CC BY-NC-ND 2.0

“An informational cascade occurs when it is optimal for an individual, having observed the actions of those ahead of him, to follow the behavior of the preceding individual without regard to his own information.” [1]

[Figure reproduced from [2]]

[1] Bikhchandani, Sushil, David Hirshleifer, and Ivo Welch. "A theory of fads, fashion, custom, and cultural change as informational cascades." Journal of Political Economy (1992): 992-1026.
[2] Cheng, Justin, et al. "Can cascades be predicted?." Proceedings of the 23rd international conference on World wide web. International World Wide Web Conferences Steering Committee, 2014.

Boundaries of context-rich approaches

[Figure: labels Twitter, Facebook, Quora, ?; System A, System B, System C; Collective action?; time axis t]

Does the accumulated information propagation behaviour on the Web form giant purposeful processes?

Source: Michael Dales, https://goo.gl/IKXs4X, CC BY-NC 2.0

Discovering the algorithms of Social Machines

Socio-technical Computation

The computational capability embodied in cascades of information sharing activities on the Web that are not necessarily conditioned by system-specific or social network features but only time and inherent properties of pairs of resources.

Markus Luczak-Roesch, Ramine Tinati, Kieron O'Hara, and Nigel Shadbolt. 2015. Socio-technical Computation. In Proceedings of the 18th ACM Conference Companion on Computer Supported Cooperative Work & Social Computing (CSCW'15 Companion). ACM, New York, NY, USA, 139-142. http://doi.acm.org/10.1145/2685553.2698991

Content streams as automata [3]

[Figure: 2-state model (states HF and LF) and infinite-state model; plot with axes Time and Number of observed documents]

[3] Kleinberg, Jon. "Bursty and hierarchical structure in streams." Data Mining and Knowledge Discovery 7.4 (2003): 373-397.

Transcendental information cascades

[Figure: example cascade along a time axis t with nodes labelled #A, #A#B, #A#B#C, #B#D, #C]

Building transcendental information cascades

only local understanding of its use but also an abstract global view. This lets us propose a new model that we call transcendental information cascades. Informed by Kleinberg's work on bursty structures in document streams [2] it regards time as the only ascertainable condition for relationships between any pairs of resources, meaning that we focus on coincidence of information sharing activities rather than socially-determined conditionality.

In [20] we presented the initial definition of a transcendental information cascade as a 4-tuple $TC = (V, E, R, F)$. This 4-tuple represents a directed network consisting of a set of nodes $V$ and edges $E$, derived when applying a set of matching functions $F$ to a set of resources $R = \{r_1, r_2, \ldots, r_m\}$, $r_i = (u_i, t_i, c_i)$, where every $u_i$ is a unique identifier of a resource $r_i$ that was shared at the time $t_i$ with the content $c_i$. Nodes in the network are those resources from $R$ that contain a set $I_i$ of one or multiple cascade identifiers. A cascade identifier is any unique informational pattern that is recognized by applying a matching function to the content or any other inherent properties of a resource (e.g. simple string matching algorithms to identify keywords in content). Formally a matching function $f_k \in F$, $k \in \mathbb{N}$, $k \le n$ is defined as:

$$f_k(c_i) = \begin{cases} \{i_1, i_2, \ldots, i_x\} & \text{if } f_k \text{ matches patterns } \{i_1, i_2, \ldots, i_x\} \text{ in } c_i,\ x \in \mathbb{N} \\ \emptyset & \text{otherwise} \end{cases}$$
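As a rough illustration of the matching-function definition, the following Python sketch assumes hashtags and URIs as cascade identifiers. The names `match_hashtags`, `match_uris` and `identifiers`, as well as the regular expressions, are hypothetical choices made for this example and are not taken from the original work. Each function returns the set of patterns it finds, or the empty set otherwise, and the identifier set of a resource is taken as the union over all matching functions.

```python
import re
from typing import Callable, FrozenSet, Iterable

# A matching function maps the content c_i of a resource to a set of identifiers.
MatchingFunction = Callable[[str], FrozenSet[str]]

# Illustrative patterns only (assumptions, not the patterns used by the authors).
HASHTAG = re.compile(r"#\w+")
URI = re.compile(r"https?://\S+")


def match_hashtags(content: str) -> FrozenSet[str]:
    """Return all hashtags in the content (lower-cased), or the empty set."""
    return frozenset(tag.lower() for tag in HASHTAG.findall(content))


def match_uris(content: str) -> FrozenSet[str]:
    """Return all http(s) URIs in the content, or the empty set."""
    return frozenset(URI.findall(content))


def identifiers(content: str, functions: Iterable[MatchingFunction]) -> FrozenSet[str]:
    """Union of the identifiers found by every matching function."""
    found: FrozenSet[str] = frozenset()
    for f in functions:
        found |= f(content)
    return found


if __name__ == "__main__":
    text = "Possible transit? #Kepler #transit http://example.org/lightcurve"
    print(sorted(identifiers(text, [match_hashtags, match_uris])))
```

Swapping the list of functions passed to `identifiers` is what changes which cascade is derived from the same resources.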

Nodes $V$ and edges $E$ are then given as follows

$$V = \{v_1, v_2, \ldots, v_p\}, \quad v_y = (u_y, t_y, I_y),$$

$$E = \{e_1, e_2, \ldots, e_q\}, \quad e_z = (u_a, u_b, \Lambda_z)$$

with $I_i = \{i_1, i_2, \ldots, i_o\} = f_1(c_i) \cup f_2(c_i) \cup \ldots \cup f_n(c_i)$ being the result of the concatenation of all identifiers found by all matching functions (see footnote 2). An edge exists between any two nodes that share a unique subset of all the cascade identifiers that were found for them. Neither this subset nor any of its subsets is part of the identifiers found for any node that was created in the time period between when the two linked nodes were created.

$$\Lambda_z = \{\, i_r \mid i_r \in I_a \wedge i_r \in I_b,\ \forall i_r \rightarrow V' = \{ v_c \mid v_c = (u_c, t_c, I_c),\ i_r \in I_c \wedge t_a < t_c < t_b \} = \emptyset,\ v_c \in V,\ r \in \mathbb{N},\ r \le |I_b| \,\}$$

A node that contains a cascade identifier that was not detected for any other nodes before is called the identifier root. Beside this we call a node without any incoming edges a network root and a node that has no outgoing edges a stub. Our cascade model clearly yields different outputs depending on the data to hand (e.g. determined by the extent of the Web crawl), and the matching algorithms determining which cascade identifiers will be spotted (e.g. reuse of hashtags, URIs, quotes, images, or maybe exploiting wider semantics or sentiment) as depicted in Figure ??.

Footnote 2: Please note that [20] contains an unintentionally malformed equation for this as the wrong symbol was used to refer to the concatenation of the matching functions.

Fig. 1. Depending on the applied matching functions, different transcendental information cascade representations can be generated for the same input data.

A fictive example of a transcendental cascade based on our model is shown in Figure 2. Consider a system that features hashtags as an established form of identifying content patterns. The visualisation uses the following approach to represent distinct identifiers and time: Nodes are chronologically ordered alongside the horizontal dimension from left (the oldest node) to right (the most recent node); additionally nodes are ordered alongside the vertical dimension depending on the set of identifiers present in a node (each unique set is assigned to a distinct level). Consequently, the visualisation represents the content creation sequence (“#A”) - (“#A#B”) - (“#A”) - (“#A”) - (“#A#B#C”) - (“#C”) - (“#A”) - (“#B#D”) - (“#A”).

Fig. 2. Example of a cascade that emerges along five different identifiers. #A, #B, #A#B#C, #B#D and #C are fictive hashtags (or hashtag combinations respectively) treated as the identifying content patterns.
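The content creation sequence of the fictive example can be turned into a cascade with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: node numbers double as timestamps (so all timestamps are distinct), an edge links a node to the most recent earlier node carrying the same identifier, and the names `build_cascade` and `roots_and_stubs` are hypothetical.

```python
from collections import defaultdict

# Content creation sequence of the fictive example; node n is created at time n.
SEQUENCE = ["#A", "#A#B", "#A", "#A", "#A#B#C", "#C", "#A", "#B#D", "#A"]
NODES = [
    (n, n, frozenset("#" + tag for tag in s.split("#") if tag))
    for n, s in enumerate(SEQUENCE, start=1)
]


def build_cascade(nodes):
    """Derive labelled edges from (u, t, identifiers) triples with distinct t.

    An identifier links node a to a later node b only if no node created
    strictly between them carries that identifier (assumed reading of the
    edge definition).
    """
    edges = defaultdict(set)   # (u_a, u_b) -> identifiers shared along the edge
    last_seen = {}             # identifier -> most recent node containing it
    identifier_roots = []      # nodes that introduce a previously unseen identifier
    for u, _t, ids in sorted(nodes, key=lambda node: node[1]):
        if any(i not in last_seen for i in ids):
            identifier_roots.append(u)
        for i in ids:
            if i in last_seen:
                edges[(last_seen[i], u)].add(i)
        for i in ids:
            last_seen[i] = u
    return dict(edges), identifier_roots


def roots_and_stubs(node_ids, edges):
    """Network roots have no incoming edge, stubs have no outgoing edge."""
    sources = {a for a, _ in edges}
    targets = {b for _, b in edges}
    network_roots = [u for u in node_ids if u not in targets]
    stubs = [u for u in node_ids if u not in sources]
    return network_roots, stubs


if __name__ == "__main__":
    edges, identifier_roots = build_cascade(NODES)
    for (a, b), label in sorted(edges.items()):
        print(f"{a} -> {b}: {sorted(label)}")
    print("identifier roots:", identifier_roots)
    print("network roots, stubs:", roots_and_stubs([u for u, _, _ in NODES], edges))
```

Under these assumptions node 5 receives two incoming edges, one from node 4 labelled with #A and one from node 2 labelled with #B, because no node between 2 and 5 carries #B.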

In order to understand how edges are labelled we highlight the sub-graph involving the nodes 2, 3, 4, and 5. Conforming to our cascade model an edge exists between nodes 2 and 3












[Figure: Document stream along a time axis t (#A, #A#B, #A#B#C, #A, #A, #B, #A) and the corresponding Transcendental Information Cascade]

Capturing the unintended action resulting from information sharing activities of human collectives.

Temporal text/data mining

Figure 5: Decoding the collection

where $\delta(d_{t'j}, i) = 1$ if word $d_{t'j}$ is labeled as theme $i$; otherwise $\delta(d_{t'j}, i) = 0$. $W$ is the size of the sliding window in terms of time points.

$$\mathrm{NStrength}(i, t) = \frac{\mathrm{AStrength}(i, t)}{\sum_{j=1}^{k} \mathrm{AStrength}(j, t)} = \frac{\sum_{t' \in [t - \frac{W}{2},\, t + \frac{W}{2}]} \sum_{j=1}^{|d_{t'}|} \delta(d_{t'j}, i)}{\sum_{t' \in [t - \frac{W}{2},\, t + \frac{W}{2}]} |d_{t'}|}$$

The life cycle of each theme can then be modeled as the variation of the theme strengths over time.
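The normalized strength can be read as the share of words inside the sliding window that are labelled with theme i. A minimal Python sketch follows, assuming that every word has already been labelled with exactly one theme, that time points are consecutive integers starting at 0, and that the window is simply truncated at the boundaries of the collection. The function name `normalized_strength` and the input layout are hypothetical.

```python
from typing import List, Sequence


def normalized_strength(
    labels_by_time: Sequence[Sequence[int]], theme: int, t: int, window: int
) -> float:
    """Share of words labelled `theme` within [t - window/2, t + window/2].

    labels_by_time[t'] holds one theme label per word observed at time point t'.
    The window is clipped to the available time points (an assumption).
    """
    half = window // 2
    lo = max(0, t - half)
    hi = min(len(labels_by_time) - 1, t + half)
    in_window: List[int] = [
        label for tp in range(lo, hi + 1) for label in labels_by_time[tp]
    ]
    if not in_window:
        return 0.0
    return sum(1 for label in in_window if label == theme) / len(in_window)


if __name__ == "__main__":
    # Four time points; each inner list is the sequence of decoded theme labels.
    labels = [[0, 0, 1], [1, 1], [0], [2, 2, 2, 2]]
    life_cycle = [normalized_strength(labels, theme=1, t=t, window=2) for t in range(4)]
    print(life_cycle)
```

Evaluating the function for every time point, as in the usage example, gives the life cycle curve of one theme.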

The analysis of theme life cycles thus involves the following four steps: (1) Construct an HMM to model how themes shift between each other in the collection. (2) Estimate the unknown parameters of the HMM using the whole stream collection as observed example sequence. (3) Decode the collection and label each word with the hidden theme model from which it is generated. (4) For each trans-collection theme, analyze when it starts, when it terminates, and how it varies over time.

5. EXPERIMENTS AND RESULTS

5.1 Data Preparation

Two data sets are constructed to evaluate the proposed ETP discovery methods. The first, tsunami news data, consists of news articles about the event of Asia Tsunami dated Dec. 19 2004 to Feb. 8 2005. We downloaded 7468 news articles from 10 selected sources, with the keyword query "tsunami". As shown in Table 1, three of the sources are in Asia, two of them are in Europe and the rest are in the U.S.

News Source        Nation
BBC                UK
CNN                US
Economics Times    India
New York Times     US
Reuters            UK
Times of India     India
VOA                US
Washington Post    US
Washington Times   US
Xinhua News        China

Table 1: News sources of Asia Tsunami data set

The second data set consists of the abstracts in KDD conference proceedings from 1999 to 2004. All the abstracts were extracted from the full-text pdf files downloaded from the ACM digital library (footnote 2: http://www.acm.org/dl). 2 articles were excluded because they were not recognizable by the pdf2text software in Linux, giving us a total of 496 abstracts. The basic statistics of the two data sets are shown in Table 2. We intentionally did not perform stemming or stop word pruning in order to test the robustness of our algorithms.

Data Set       # of docs   AvgLength   Time range
Asia Tsunami   7468        505.24      12/19/04 - 02/08/05
KDD Abs.       496         169.50      1999-2004

Table 2: Basic information of data sets

On each data set, two experiments are designed: (1) Partition the collection into time intervals, discover the theme evolution graph and identify theme evolution threads. (2) Discover trans-collection themes and analyze their life cycles. The results are discussed below.

5.2 Experiments on Asia Tsunami

Since news reports on the same topic may appear earlier in one source but later in another (i.e., “report delay”), partitioning news articles into overlapping, as opposed to non-overlapping subcollections seems to be more reasonable. We thus partition our news data into 5 time intervals, each of which spans about two weeks and is half overlapping with the previous one. We use the mixture model discussed in Section 3 to extract the most salient themes in each time interval. We set the background parameter λB = 0.95 and number of themes in each time interval to be 6. The variation of λB is discussed later. Table 3 shows the top 10 words with the highest probabilities in each theme span. We see that most of these themes suggest meaningful subtopics in the context of the Asia tsunami event.
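The half-overlapping partition described in this passage can be sketched as follows. The excerpt does not give the exact interval boundaries, so this example assumes equal-length intervals that together cover the whole date range, with each interval starting halfway through the previous one; for the stated range and 5 intervals this yields 17-day windows. The function names are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def half_overlapping_intervals(start: datetime, end: datetime, n: int) -> List[Interval]:
    """Split [start, end] into n equal-length intervals, each overlapping
    the previous one by half of its length."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n half-overlapping windows of length L span L * (n + 1) / 2 in total.
    length = (end - start) * 2 / (n + 1)
    step = length / 2
    return [(start + i * step, start + i * step + length) for i in range(n)]


def intervals_containing(timestamp: datetime, intervals: List[Interval]) -> List[int]:
    """Indices of all intervals a document with this timestamp falls into."""
    return [k for k, (lo, hi) in enumerate(intervals) if lo <= timestamp <= hi]


if __name__ == "__main__":
    spans = half_overlapping_intervals(datetime(2004, 12, 19), datetime(2005, 2, 8), 5)
    for lo, hi in spans:
        print(lo.date(), "to", hi.date())
    print(intervals_containing(datetime(2005, 1, 1), spans))
```

Because the intervals overlap, a single article can belong to two subcollections, which is what absorbs the report delay between sources.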

Figure 6: Theme evolution graph for Asia Tsunami

With these theme spans, we use KL-divergence to further identify evolutionary transitions. Figure 6 shows a theme evolution graph discovered from Asia Tsunami data when the threshold for evolution distance is set to ξ = 12. From Figure 6, we can see several interesting evolution threads which are annotated with symbols.
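To make the thresholding step concrete, here is a small Python sketch of a KL-divergence test between two themes represented as word distributions. The excerpt states neither the direction of the divergence nor how zero probabilities are handled, so both are assumptions here: the divergence is taken from the later theme to the earlier one, and a small constant `epsilon` is added before renormalising. The names `kl_divergence` and `is_transition` are hypothetical.

```python
import math
from typing import Dict


def kl_divergence(p: Dict[str, float], q: Dict[str, float], epsilon: float = 1e-9) -> float:
    """D(p || q) over the joint vocabulary, with additive smoothing (assumed)."""
    vocab = set(p) | set(q)
    p_total = sum(p.get(w, 0.0) + epsilon for w in vocab)
    q_total = sum(q.get(w, 0.0) + epsilon for w in vocab)
    divergence = 0.0
    for w in vocab:
        pw = (p.get(w, 0.0) + epsilon) / p_total
        qw = (q.get(w, 0.0) + epsilon) / q_total
        divergence += pw * math.log(pw / qw)
    return divergence


def is_transition(earlier: Dict[str, float], later: Dict[str, float], xi: float) -> bool:
    """Treat two themes as linked when their distance is below the threshold xi."""
    return kl_divergence(later, earlier) < xi


if __name__ == "__main__":
    theme_l1 = {"aid": 0.5, "relief": 0.3, "children": 0.2}
    theme_l2 = {"aid": 0.4, "donation": 0.4, "relief": 0.2}
    print(kl_divergence(theme_l2, theme_l1))
    print(is_transition(theme_l1, theme_l2, xi=12.0))
```

The size of `epsilon` strongly affects the distance whenever the vocabularies differ, so any threshold has to be chosen together with the smoothing.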

The thread labeled with a may be about warning systems for tsunami. It is interesting to see that the nation covered by the thread seems to have evolved from the U.S. in period l1, to China in l2, and then to Japan in l3. In thread b, themes 3, 4, and 5 in period l1 indicate the aids and financial support from UN, from local area, and special aids for children, respectively. They all show an evolutionary transition to theme 2 (donation from UK) and theme 3 (aid from

[Excerpt and figures above reproduced from [5]]

“The key notion of TTM is burstiness – sudden increases in frequency of text fragments, and all TTM methods aim to model burstiness.” [4]

[4] Subašić, I., & Berendt, B. (2013). Story graphs: Tracking document set evolution using dynamic graphs. Intelligent Data Analysis, 17(1), 125-147.
[5] Mei, Q., & Zhai, C. (2005, August). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (pp. 198-207). ACM.

[Figure: matching functions F1 … Fn applied to a stream with time points t0 to t8, yielding cascades C11, C21, C22, C23 whose edges are labelled with time differences (e.g. t6 - t0, t2 - t1, t8 - t2)]

There is more than one “reality”

Analyzing low-level properties of the multiple states of a system that exist at the same time

Fig. 4. Overview of the results of the cascade comparison. Cascade size distribution and Wiener index are plotted on a log-log scale; identifier entropy is plotted with a log scale on the y-axis.

contain one or few identifiers equally distributed. Very large hashtag cascades in contrast become very fuzzy, meaning that even though loads of identifiers are covered (indicating much information) the informativeness of the entire cascade is very shallow. The other three entropy distribution profiles instead show that there is a more even distribution of information in non-trivial cascades with multiple identifiers, with the largest cascades still falling into the same category as the largest hashtag cascade.
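The two structural measures named in the caption can be illustrated with a short Python sketch. The excerpt does not give their formulas, so this example assumes that identifier entropy is the Shannon entropy of how often each identifier occurs across the nodes of a cascade, and that the Wiener index is the sum of shortest-path lengths over all node pairs of a connected cascade treated as an undirected graph. Function names and input formats are hypothetical.

```python
import math
from collections import Counter, defaultdict, deque
from typing import Hashable, Iterable, Set, Tuple


def identifier_entropy(node_identifier_sets: Iterable[Set[str]]) -> float:
    """Shannon entropy (bits) of identifier occurrences across all nodes."""
    counts = Counter(i for ids in node_identifier_sets for i in ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def wiener_index(edges: Iterable[Tuple[Hashable, Hashable]]) -> int:
    """Sum of shortest-path lengths over all unordered node pairs.

    Edge direction is ignored and the graph is assumed to be connected.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    adjacency = dict(adjacency)
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest hop distances from `source`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    return total // 2  # every pair was counted from both endpoints


if __name__ == "__main__":
    print(identifier_entropy([{"#A"}, {"#A", "#B"}, {"#C"}, {"#D"}]))
    print(wiener_index([(1, 2), (2, 3), (3, 4)]))
```

A cascade built around a single identifier has entropy zero under this reading, while many equally frequent identifiers push the value up.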

VI. DISCUSSION

In this section we reflect on the results of our study against the original questions asked, and then consider how our content-centric approach to cascade construction provides an alternative way to consider information flows on the Web.

A. Summary of Experiments

Our experiments show that it is possible to generate structurally different cascades from a single source dataset, depending on the pattern matching used. By exploring cascade sub-structures within each of the four resulting cascade datasets, we found that in comparison to cascades that use actual object identifiers (KID, APH, URIs), cascades which are based on hashtags tend to be either trivial (single identifier cascades) or consist of multiple roots (the origin of the cascade) that are merging and diverging so that they form one massive connected component.

For instance, in A1 cascades, there may be two hashtags, #A and #B, which originate in different, independent posts, by different users. However, over the course of the evolution of the cascade, these hashtags merge, most likely as a consequence of a user bringing them together in a single post. These hashtags then may become part of several merges and diverges, which can end up located within a single stub. As a consequence of this, information can be perceived as lost, as it does not remain present in a distinct cascade, but is subsumed by another one. This is reflected in Figure 4, where a large proportion of the node types are those that are either merging or diverging information.

In comparison to this, the results of cascade types A2 and A3 reveal cascades which are less structurally viral (a lower Wiener index), thus tending to form shorter chains of single or few identifier cascades. As a consequence, information is rarely lost or gained as cascades do not merge often. It is more likely that when a branch node is observed (for



[Figure: cascade motifs for Tags, URIs, and KID & APH cascades: single node motifs, long uniform paths, short uniform paths, long non-uniform paths]

[Figure: identifier entropy for Tags, URIs, and KID & APH cascades: varying profiles of increasing randomness with growing cascade size]

Cascade motifs as an indicator of state?

[Figure: matching functions F1 … Fn and cascades C11, C21, C22, C23 along a time axis t]

Formalising the multiple possible representations of a system at any time and their relationships.

Not all representing purposeful action but reflecting useful informational properties.

By focusing only on the coincidence of information occurrence, we can capture and analyse emergent collective action across system boundaries and independently of social network contexts.

Markus Luczak-Roesch
@mluczak
http://markus-luczak.de

Source: Giulia Forsythe, http://goo.gl/6hpZ0W, CC BY-NC-SA 2.0

References

•  Markus Luczak-Roesch, Ramine Tinati, Kieron O'Hara, and Nigel Shadbolt. 2015. Socio-technical Computation. In Proceedings of the 18th ACM Conference Companion on Computer Supported Cooperative Work & Social Computing (CSCW'15 Companion). ACM, New York, NY, USA, 139-142. http://doi.acm.org/10.1145/2685553.2698991
•  Markus Luczak-Roesch, Ramine Tinati, and Nigel Shadbolt. 2015. When Resources Collide: Towards a Theory of Coincidence in Information Spaces. To appear in WWW’15 Companion, May 18–22, 2015, Florence, Italy. http://dx.doi.org/10.1145/2740908.2743973