Bursty and Hierarchical Structure in Streams Jon Kleinberg Cornell University

in Structure - Indiana University Bloomington · iv.slis.indiana.edu/sw/data/kleinberg-stream-slides.pdf · sub-structure within a single topic. (Nested) bursts of activity surrounding


Page 1:

Bursty and Hierarchical Structure in Streams

Jon Kleinberg

Cornell University

Page 2:

Topics and Time

Documents can be organized by topic, but we also experience their arrival over time.

E-mail, news articles.

Research papers, on a slower time-scale.

(1) Temporal sub-structure within a single topic. (Nested) bursts of activity surrounding events.

(2) Time-line construction: enumeration of topics over time. [Allen 1995, Kumar et al. 1997, Swan-Allan 2000, Swan-Jensen 2000]

[Topic Detection and Tracking: Allan et al. 1998, Yang et al. 1998]

Develop techniques based on Markov source models for temporal text mining.

Page 3:

Mining E-mail

E-mail archives as a domain for data mining.

Raw material for historical research and legal proceedings. (Natl. Archives: > 10 million e-mail msgs from Clinton White House)

Personal archives can reach 10-100's MB of pure text.

Topic-based organization (automated folder management): [Helfman-Isbell 95, Cohen 96, Lewis-Knowles 97, Sahami et al. 98, Segal-Kephart 99, Horvitz 99, Rennie 00]

Flow of time exposes sub-structure in a coherent folder.

For example, folder on “grant proposals” contains multiple bursty periods corresponding to localized episodes.

E.g. “the process of gathering people for our large NSF ITR proposal.”

Page 4:

The role of time in narratives

... there seems something else in life besides time, something which may conveniently be called “value,” something which is measured not by minutes or hours but by intensity, so that when we look at our past it does not stretch back evenly but piles up into a few notable pinnacles, and when we look at the future it seems sometimes a wall, sometimes a cloud, sometimes a sun, but never a chronological chart.
- E.M. Forster, Aspects of the Novel (1928)

Anisochronies in narratives [Genette 1980, Chatman 1978]: non-uniform relation between time span of a story’s events and the time it takes to relate them.

Page 5:

Intensity? Notable Pinnacles? “I know a burst when I see one.” ??

[Figure: message # (0 to 140) plotted against minutes since 1/1/97 (1.4e+06 to 2.5e+06).]

Need a precise model:

Inspection not likely to give the full structure in the sequence.

Eventually want to perform burst detection for all terms in corpus.

Page 6:

Threshold-Based Methods

[Figure: # messages rcvd (0 to 8) plotted against days since 1/1/97 (900 to 1800).]

Swan-Allan [1999, 2000], Swan-Jensen [2000] introduced threshold-based methods.

Bin relevant messages by day.

Identify days in which number of relevant messages is above a computed threshold ($\chi^2$ or similar test).

Contiguous set of days above threshold constitutes an episode.
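The binning-and-threshold procedure on this slide can be sketched as follows. This is only an illustration: the threshold here is a simple mean plus `k` standard deviations, swapped in as a stand-in for the statistical test used in the cited work, and the function name and parameters are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def threshold_episodes(days, num_days, k=1.0):
    """Return (start_day, end_day) runs of days whose message count
    exceeds mean + k * std. `days` holds one integer day index per
    relevant message. The mean + k*std rule is an assumption made for
    illustration, not the test used by the original methods."""
    counts = Counter(days)
    series = [counts.get(day, 0) for day in range(num_days)]  # bin by day
    threshold = mean(series) + k * pstdev(series)

    episodes, start = [], None
    for day, count in enumerate(series):
        if count > threshold and start is None:
            start = day                       # an episode opens
        elif count <= threshold and start is not None:
            episodes.append((start, day - 1))  # the episode closes
            start = None
    if start is not None:                      # episode runs to the end
        episodes.append((start, num_days - 1))
    return episodes
```

A single quiet day splits an episode in two, which is the sparsity problem raised on the next slide.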

Page 7:

Threshold-Based Methods

[Figure: # messages rcvd (0 to 8) plotted against days since 1/1/97 (900 to 1800).]

Issues for threshold-based methods as a baseline:

E-mail folders quite sparse/noisy. E.g. in figure, no 7 consecutive days with non-zero # of messages.

We want to find episodes lasting several months (e.g. writing a proposal) as well as several days.

Multiple time scales? Bursts within bursts?

Page 8:

A Model for Bursty Streams

Want a source model for messages, determining arrival times.

[Figure: two exponential densities, $f(x) = \alpha e^{-\alpha x}$ and $f(x) = \beta e^{-\beta x}$.]

Simplest: exponential distribution. Gap $x$ in time until next message is distributed according to $f(x) = \alpha e^{-\alpha x}$. (“Memoryless” distribution.)

Expected gap value is $\alpha^{-1}$. Thus $\alpha$ is called the “rate” of message arrivals.
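For reference, the two properties quoted on this slide follow directly from the density $f(x) = \alpha e^{-\alpha x}$. The symbols $u$ and $v$ below are introduced only for this derivation.

```latex
\mathbb{E}[x] = \int_0^\infty x \,\alpha e^{-\alpha x}\, dx = \frac{1}{\alpha},
\qquad
\Pr[x > u + v \mid x > u]
  = \frac{e^{-\alpha (u+v)}}{e^{-\alpha u}}
  = e^{-\alpha v}
  = \Pr[x > v].
```

The first identity is why a larger rate means smaller expected gaps; the second is the memoryless property.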

Page 9:

A Model for Bursty Streams

[Figure: two-state automaton. Low state: gaps $x$ distributed at rate $\alpha$. High state: gaps $x$ distributed at rate $s\alpha$. State change with probability $p$.]

A model for message generation with persistent bursts: Markov source model [e.g. Anick-Mitra-Sondhi 1982, Scott 1998]

Low state $q_0$: gaps in time between message arrivals distributed according to exponential distribution with rate $\alpha$.

High state $q_1$: gaps distributed at rate $s\alpha$, where $s > 1$.

Before each message emission, state changes with probability $p$.

Consider $n+1$ messages, with positive gaps $x_1, \ldots, x_n$ between arrival times.

Most likely state sequence via Bayes’ Thm and dynamic programming.
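A minimal sketch of the dynamic program for the two-state model, assuming the sequence starts in the low state (the slide does not say so) and using hypothetical function and parameter names:

```python
import math

def two_state_bursts(gaps, alpha, s=2.0, p=0.1):
    """Most likely low/high (0/1) state sequence for a list of positive
    gaps, under exponential gap densities with rates alpha and s*alpha
    and a state change with probability p before each emission.
    Starting in the low state is an assumption of this sketch."""
    rates = (alpha, s * alpha)
    stay, switch = math.log(1 - p), math.log(p)

    def log_density(i, x):
        # ln of rate_i * exp(-rate_i * x)
        return math.log(rates[i]) - rates[i] * x

    best = [0.0, -math.inf]   # best log-likelihood of a path ending in each state
    back = []                 # back[t][i] = predecessor of state i at step t
    for x in gaps:
        new_best, ptr = [], []
        for i in (0, 1):
            cands = [best[j] + (stay if j == i else switch) for j in (0, 1)]
            j = 0 if cands[0] >= cands[1] else 1
            new_best.append(cands[j] + log_density(i, x))
            ptr.append(j)
        best = new_best
        back.append(ptr)

    # Trace the best path backwards from the better final state.
    state = 0 if best[0] >= best[1] else 1
    path = []
    for ptr in reversed(back):
        path.append(state)
        state = ptr[state]
    return path[::-1]
```

For example, `two_state_bursts([10]*5 + [0.1]*5 + [10]*5, alpha=0.1, s=100)` labels only the five short gaps as the high state.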

Page 10:

A Richer Model

Want to model bursts of greater and greater intensity: set of states representing arbitrarily small gap sizes.

[Figure: chain of states $q_0, q_1, q_2, q_3, \ldots, q_i, \ldots$; emissions at rate $s^i \alpha$ per state; transition probability $n^{-\gamma}$.]

Infinite state set $q_0, q_1, q_2, \ldots$

If $n$ gaps over time $T$, then average rate $n/T$.

“Base rate” at $q_0$ is $\alpha = n/T$.

Rates increase by factor of $s$: rate for $q_i$ is $s^i \alpha$.

Jumping from $q_i$ to $q_j$ ($j > i$) in one step has prob. $\sim n^{-\gamma (j-i)}$.

Page 11:

A Richer Model

[Figure: chain of states $q_0, q_1, q_2, q_3, \ldots, q_i, \ldots$; emissions at rate $s^i \alpha$ per state; transition probability $n^{-\gamma}$.]

Theorem: Let $\delta = \min_i x_i$. The maximum likelihood state sequence involves only states $q_0, \ldots, q_k$, where $k = \lceil 1 + \log_s T + \log_s \delta^{-1} \rceil$.

Using Theorem, can reduce to the finite-state case and apply dynamic programming. (Cf. Viterbi algorithm for Hidden Markov models.)
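The reduction to a finite state set can be sketched as a Viterbi-style minimum-cost computation. The sketch assumes the rates $s^i n/T$ from the previous slide, an upward transition cost of $\gamma \ln n$ per level (the negative log of $n^{-\gamma}$), free downward transitions, and a start in $q_0$; the last two are assumptions not stated on the slides, and all names are hypothetical.

```python
import math

def burst_states(gaps, s=2.0, gamma=1.0):
    """Minimum-cost state sequence for a list of positive gaps."""
    n, T = len(gaps), sum(gaps)
    # Highest state that can appear: k = ceil(1 + log_s T + log_s(1/delta)).
    k = math.ceil(1 + math.log(T, s) + math.log(1 / min(gaps), s))
    rates = [s ** i * n / T for i in range(k + 1)]

    def tau(i, j):
        # Moving up costs gamma * ln(n) per level; moving down is free
        # (an assumption of this sketch).
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    cost = [0.0] + [math.inf] * k    # assume the sequence starts in q_0
    back = []
    for x in gaps:
        new_cost, ptr = [], []
        for j in range(k + 1):
            i = min(range(k + 1), key=lambda a: cost[a] + tau(a, j))
            # -ln f_j(x) for the density f_j(x) = rate_j * exp(-rate_j * x)
            new_cost.append(cost[i] + tau(i, j) - math.log(rates[j]) + rates[j] * x)
            ptr.append(i)
        cost = new_cost
        back.append(ptr)

    # Trace back from the cheapest final state.
    state = min(range(k + 1), key=cost.__getitem__)
    path = []
    for ptr in reversed(back):
        path.append(state)
        state = ptr[state]
    return path[::-1]
```

This direct version takes time proportional to $n k^2$; it is meant to show the recurrence, not to be efficient.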

Page 12:

Hierarchical Structure

Define a burst of intensity $j$ to be a maximal interval in which optimal state sequence is in state $q_j$ or higher.

Bursts are naturally nested: each burst of intensity $j$ is contained in a unique burst of intensity $j-1$. Hierarchical tree structure.

[Figure: an optimal state sequence over time (states 0 to 3), the bursts it induces, and the corresponding tree representation.]
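The definition above translates directly into code. A minimal sketch, assuming the optimal state sequence is given as a list of integer state indices (the function name is hypothetical):

```python
def nested_bursts(states):
    """Return (intensity, start, end) for every maximal interval in which
    the state sequence stays at `intensity` or higher, for intensity >= 1.
    Indices are positions in `states`, with `end` inclusive."""
    bursts = []
    for j in range(1, max(states, default=0) + 1):
        start = None
        for t, q in enumerate(states):
            if q >= j and start is None:
                start = t                          # burst of intensity j opens
            elif q < j and start is not None:
                bursts.append((j, start, t - 1))   # burst closes
                start = None
        if start is not None:                      # burst runs to the end
            bursts.append((j, start, len(states) - 1))
    return bursts
```

Because each interval of intensity $j$ lies inside exactly one interval of intensity $j-1$, sorting the output by start index and intensity gives the tree in depth-first order.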

Page 13:

Experiments with an E-Mail Stream

As a proxy for folders, look at queries to e-mail archive.

Simple implementation of algorithm can build burst representation for a query in real-time.

Do spikes emerge in vicinity of recognizable events?

Example: stream of all messages containing the word “ITR.” (Large NSF program; applied for two proposals (large and small) with colleagues in academic year 1999-2000.)

[Figure: message # (0 to 140) plotted against minutes since 1/1/97 (1.4e+06 to 2.5e+06).]

Page 14:

[Figure: nested bursts at intensities 0 to 5, with boundaries marked at 10/28/99, 11/2, 11/9, 11/15, 11/16, 1/2/00, 1/5, 2/4, 2/14, 2/21, 7/10, 7/14, 10/31. Burst intervals shown: 10/28/99-2/21/00, 10/28-2/14, 10/28-11/16, 11/2-11/16, 11/9-11/15, 1/2-2/4, 1/2-1/5, 7/10/00-10/31/00, 7/10-7/14.]

Page 15:

[Figure: the same nested bursts (intensities 0 to 5, boundaries from 10/28/99 to 10/31), annotated with events:]

11/15: letter of intent deadline (large proposals)

1/5: pre-proposal deadline (large proposals)

2/14: full proposal deadline (small proposals)

4/17: full proposal deadline (large proposals)

7/11: unofficial notification (small proposal)

9/13: official announcement of awards

Page 16:

Query: “Prelim”

Example: stream of all messages containing the word “prelim.” (Cornell terminology for a non-final exam in an undergraduate course.)

E-mail archive spans four large courses, each with two prelims.

But in first course, almost all correspondence restricted to course e-mail account.

Three large courses, two prelims in each.

Page 17:

[Figure, panels a), b), c): Message # (0 to 400) plotted against minutes since 1/1/97 (200000 to 2.4e+06), with burst intensities 0 to 8. Labeled bursts: prelim 1 2/25/99, prelim 2 4/15/99, prelim 1 2/24/00, prelim 2 4/11/00, prelim 1 10/4/00, prelim 2 11/13/00.]

Page 18:

Enumerating Bursts for Time-Line Construction

Can enumerate bursts for every word in the corpus. Essentially one pass over an inverted index.

Weight of burst of intensity [formula lost in extraction].

Over history of a conference or journal, topics rise/fall in significance.

Using words as stand-ins for topic labels: What are the most prominent topics at different points in time?

Take words in paper titles over history of conference.

Compute bursts for each word; find those of greatest weight.

All words are considered. (Even stop-words.)

Page 19:

A Source Model for Batched Arrivals

[Figure: chain of states $q_0, q_1, q_2, q_3, \ldots, q_i$; fraction $p_0 s^i$ of relevant doc’s per state; transition probability $n^{-\gamma}$.]

$n$ batches of documents. Batch $t$ contains $d_t$ total, of which $r_t$ are relevant (e.g. contain fixed word).

Overall relevant fraction $p_0 = \sum_t r_t / \sum_t d_t$.

State $q_i$: expected fraction of relevant documents $p_i = p_0 s^i$.
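A minimal sketch of the batched model, assuming each batch's relevant count is scored with a binomial likelihood at the state's fraction $p_0 s^i$ (the slide gives only the expected fractions, so the binomial choice is an assumption here), with upward transitions costing $\gamma \ln n$ per level, free downward transitions, and a start in $q_0$. All names are hypothetical.

```python
import math

def batch_burst_states(r, d, s=2.0, gamma=1.0, k=1):
    """Minimum-cost state sequence over batches, where batch t has r[t]
    relevant documents out of d[t]. States 0..k have expected relevant
    fractions p0 * s**i. The binomial cost is an assumption of this sketch."""
    n = len(r)
    p0 = sum(r) / sum(d)                      # overall relevant fraction
    probs = [p0 * s ** i for i in range(k + 1)]
    if p0 <= 0 or probs[-1] >= 1:
        raise ValueError("need 0 < p0 and p0 * s**k < 1")

    def sigma(i, rt, dt):
        # -ln of the binomial probability of rt relevant out of dt
        p = probs[i]
        return -(math.log(math.comb(dt, rt))
                 + rt * math.log(p) + (dt - rt) * math.log(1 - p))

    def tau(i, j):
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    cost = [0.0] + [math.inf] * k             # assume a start in q_0
    back = []
    for rt, dt in zip(r, d):
        new_cost, ptr = [], []
        for j in range(k + 1):
            i = min(range(k + 1), key=lambda a: cost[a] + tau(a, j))
            new_cost.append(cost[i] + tau(i, j) + sigma(j, rt, dt))
            ptr.append(i)
        cost = new_cost
        back.append(ptr)

    state = min(range(k + 1), key=cost.__getitem__)
    path = []
    for ptr in reversed(back):
        path.append(state)
        state = ptr[state]
    return path[::-1]
```

With yearly batches of paper titles, `r[t]` would be the number of titles containing the word and `d[t]` the number of titles that year.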

Page 20:

Word            Interval of burst
grammars        1969 STOC - 1973 FOCS
automata        1969 STOC - 1974 STOC
languages       1969 STOC - 1977 STOC
machines        1969 STOC - 1978 STOC
recursive       1969 STOC - 1979 FOCS
classes         1969 STOC - 1981 FOCS
some            1969 STOC - 1980 FOCS
sequential      1969 FOCS - 1972 FOCS
equivalence     1969 FOCS - 1981 FOCS
programs        1969 FOCS - 1986 FOCS
program         1970 FOCS - 1978 STOC
on              1973 FOCS - 1976 STOC
complexity      1974 STOC - 1975 FOCS
problems        1975 FOCS - 1976 FOCS
relational      1975 FOCS - 1982 FOCS
logic           1976 FOCS - 1984 STOC
vlsi            1980 FOCS - 1986 STOC
probabilistic   1981 FOCS - 1986 FOCS
how             1982 STOC - 1988 STOC
parallel        1984 STOC - 1987 FOCS
algorithm       1984 FOCS - 1987 FOCS
graphs          1987 STOC - 1989 STOC
learning        1987 FOCS - 1997 FOCS
competitive     1990 FOCS - 1994 FOCS
randomized      1992 STOC - 1995 STOC
approximation   1993 STOC -
improved        1994 STOC - 2000 STOC
codes           1994 FOCS -
approximating   1995 FOCS -
quantum         1996 FOCS

Page 21:

Page 22:

Word              Interval of burst
data              1975 SIGMOD - 1979 SIGMOD
base              1975 SIGMOD - 1981 VLDB
application       1975 SIGMOD - 1982 SIGMOD
bases             1975 SIGMOD - 1982 VLDB
design            1975 SIGMOD - 1985 VLDB
relational        1975 SIGMOD - 1989 VLDB
model             1975 SIGMOD - 1992 VLDB
large             1975 VLDB - 1977 VLDB
schema            1975 VLDB - 1980 VLDB
theory            1977 VLDB - 1984 SIGMOD
distributed       1977 VLDB - 1985 SIGMOD
data              1980 VLDB - 1981 VLDB
statistical       1981 VLDB - 1984 VLDB
database          1982 SIGMOD - 1987 VLDB
nested            1984 VLDB - 1991 VLDB
deductive         1985 VLDB - 1994 VLDB
transaction       1987 SIGMOD - 1992 SIGMOD
objects           1987 VLDB - 1992 SIGMOD
object-oriented   1987 SIGMOD - 1994 VLDB
parallel          1989 VLDB - 1996 VLDB
object            1990 SIGMOD - 1996 VLDB
mining            1995 VLDB -
server            1996 SIGMOD - 2000 VLDB
sql               1996 VLDB - 2000 VLDB
warehouse         1996 VLDB
similarity        1997 SIGMOD -
approximate       1997 VLDB -
web               1998 SIGMOD
indexing          1999 SIGMOD
xml               1999 VLDB -

Page 23:

[Figure: bursty words plotted along a time axis from 1992 to 2002. arXiv, high energy physics theory (plot courtesy of Paul Ginsparg). Words shown, in order: string, gravity, two, topological, 2d, affine, kp, algebras, representations, quantum, groups, differential, algebra, lattice, duality, n=2, m(atrix, iib, m, matrix, m-theory, anti-de, n, large, x, ads, ads_3, holography, correspondence, ads/cft, type, branes, non-bps, non-commutative, randall-sundrum, brane-world, extra, holographic, noncommutative, brane, open, world, cosmological, tachyon, bulk, fuzzy, warped, d-branes, de, sitter.]

Page 24:

Some Observations

Many of the bursts contain significant number of batches with few/no relevant documents. (cf. threshold-based methods.)

Words with highest-weight bursts different from most frequent words.

Most frequent words in STOC/FOCS titles: of, for, the, and, a, on, in, complexity, algorithms, with, to, problems, time, parallel, algorithm, bounds, problem, graphs, an, lower

Bursty words almost always content-bearing. But content-bearing words not always bursty. E.g. “time” and “bounds” common throughout all years.

Burst weight represents balance between ubiquity and abruptness.

Relative rates of high and low states (parameter $s$) determines whether we find brief, intense bursts or longer, milder bursts.

Page 25:

Word          Interval of burst
depression    1930–1937
recovery      1930–1937
banks         1931–1934
democracy     1937–1941
wartime       1941–1947
production    1942–1943
fighting      1942–1945
japanese      1942–1945
war           1942–1945
peacetime     1945–1947
program       1946–1948
veterans      1946–1948
wage          1946–1949
housing       1946–1950
atomic        1947–1959
collective    1947–1961
aggression    1949–1955
defense       1951–1952
free          1951–1953
soviet        1951–1953
korea         1951–1954
communist     1951–1958
program       1954–1956
alliance      1961–1966
communist     1961–1967
poverty       1963–1969
propose       1965–1968
tonight       1965–1969
billion       1966–1969
vietnam       1966–1973

Page 26:

Some Observations

Is it the content that’s bursty, or just the time series?

Permutation test (see [Swan-Jensen 2000])

Start with full e-mail corpus, arrival times.

Shuffle messages via random permutation $\pi$: message $\pi(i)$ arrives at time $t_i$ (instead of message $i$).

Total weight of all bursts in shuffled corpus more than order of magnitude smaller than in true corpus (25K vs. 370K)

Almost no hierarchy in shuffled version: average of 16 words with depth [value lost], versus [value lost] in true corpus.
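The shuffling step of the permutation test can be sketched as below. The arrival times stay fixed while message contents are reassigned at random, so any burst structure that survives comes from the time series alone. Function names and the whitespace tokenization are assumptions made for illustration.

```python
import random

def shuffle_corpus(times, texts, seed=0):
    """Keep the arrival times, but reassign message texts to them by a
    random permutation. Returns a list of (time, text) pairs."""
    perm = list(range(len(texts)))
    random.Random(seed).shuffle(perm)
    return [(t, texts[perm[i]]) for i, t in enumerate(times)]

def word_stream(corpus, word):
    """Arrival times of the messages whose text contains `word`
    (lower-cased, whitespace-split matching is a simplification)."""
    return [t for t, text in corpus if word in text.lower().split()]
```

Running the same burst detector on `word_stream(shuffle_corpus(times, texts), w)` and on the unshuffled stream for each word `w` gives the two totals being compared on the slide.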

Page 27:

Further Related Work

Markov source models for time-series analysis
  Fraud detection, Web page requests [Scott 98, Scott-Smyth 02].

Piece-wise function approximation
  Long history in statistics [Hudson 1966, Hawkins 1976].
  Recent applications in data mining for trend and event detection [Keogh-Smyth 1997, Han et al. 1998, Mannila-Salmenkivi 2001]

Constructing trees from time series
  Waveform branches at local minima, leaves at local maxima. [Ehrich-Foith 1976, Shaw-DeFigueiredo 1990]
  Hierarchical HMMs [Fine-Singer-Tishby 1998, Murphy-Paskin 2001]

Visualization of news streams
  Wavelet Analysis [Miller et al. 98], ThemeRiver [Havre et al. 2000].

Page 28:

Further Directions

Web clickstream data
  Logs collected by Gay, Stefanone, Grace-Martin, Hembrooke 2000.
  80 undergraduates in two classes, early March to mid-May 2000, with consent.
  Bursts correspond to sudden rise in site traffic.
  Great difference between single-user bursts and bursts involving more than e.g. 10 distinct users.
  Many of the heaviest multi-user bursts involve URLs of on-line class reading assignments, just before and during discussion section.

Similar domains:
  Search engine query logs. (cf. Google Zeitgeist)
  Superposition of downloading and paper submission in the arXiv.

Page 29:

Open Questions

Data stream computation
  In a data stream model, find bursts of large weight for all items (e.g. all possible words) simultaneously.
  One pass, limited storage.

On-line algorithms
  Given a stream of e-mail messages / paper titles / paper downloads, how early, in an on-line setting, can a large-weight burst be identified?
  Detecting the emergence of significant new topics as they happen. (cf. first-story detection problem in TDT).

Page 30:

Reflections

The fact that we need tools to pre-screen our email for us just shows how information-overloaded our society has become.
– Slashdot posting 24 April 2002, 2:10 PM

Who the @#$! gets so much email they need to mine for text??!! dont change your email filtering, change your pathetic life!!
– Slashdot posting 24 April 2002, 6:02 PM

If only it were so simple ...

Increasingly able to measure personal activity at unprecedented levels of detail.

Coping with a world in which your on-line tools know more about you than you realize.