Corporate Development & Strategy Microsoft Confidential: Internal Use Only Bursty and Hierarchical Structure in Streams Jon Kleinberg ACM SIGKDD’02 Presented by Deng Cai 10/20/2004

Bursty and Hierarchical Structure in Streams


Page 1: Bursty and Hierarchical Structure in Streams

Corporate Development & Strategy

Microsoft Confidential: Internal Use Only

Bursty and Hierarchical Structure in Streams

Jon Kleinberg, ACM SIGKDD’02

Presented by Deng Cai, 10/20/2004

Page 2: Bursty and Hierarchical Structure in Streams

Outline

• The problem and main idea

• A Weighted Automaton Model
  – Two state
  – Infinite state

• Experiments
  – Email
  – Paper

• Thoughts

Page 3: Bursty and Hierarchical Structure in Streams

Main Idea

• Extract meaningful structure from document streams
• Burst of activity: certain features rise sharply in frequency as a topic emerges
• A formal approach for modeling such “bursts”
  – An infinite-state automaton
  – Bursts appear as state transitions
  – A nested representation of the set of bursts that imposes a hierarchical structure on the overall stream

Page 4: Bursty and Hierarchical Structure in Streams

Two Cases

• Email
  – Messages arrive over time
  – Try to find hierarchical structure
• Paper titles
  – Arrive in batches
  – Try to enumerate all the bursts (ranking bursts)

Page 5: Bursty and Hierarchical Structure in Streams

A Weighted Automaton Model: One State Model

• Generating model: $f(x) = \alpha e^{-\alpha x}$
  – $x$: the gap in time between two consecutive messages
  – Expectation: $E[x] = \alpha^{-1}$
  – $\alpha$: rate of message arrivals
• Why this model?

Page 6: Bursty and Hierarchical Structure in Streams

A Weighted Automaton Model: Two State Model

• Two-state automaton A with states q0 and q1
  – q0 emits gaps with density $f_0(x) = \alpha_0 e^{-\alpha_0 x}$
  – q1 emits gaps with density $f_1(x) = \alpha_1 e^{-\alpha_1 x}$, where $\alpha_1 > \alpha_0$

• A changes state with probability p, remaining in its current state with probability 1-p, independently of previous emissions and state changes.

• A begins in state q0. Before each message is emitted, A changes state with probability p. A message is then emitted, and the gap in time until the next message is determined by the distribution associated with A's current state.
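The generative process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_two_state` and the parameter values in the usage line are hypothetical, not taken from the paper.

```python
import random

def simulate_two_state(n, alpha0, alpha1, p, seed=0):
    """Simulate n messages from the two-state automaton A.

    Returns the visited states (0 for q0, 1 for q1) and the
    inter-arrival gaps drawn from the density of the current state.
    """
    rng = random.Random(seed)
    state = 0  # A begins in q0
    states, gaps = [], []
    for _ in range(n):
        # Before each message, A changes state with probability p.
        if rng.random() < p:
            state = 1 - state
        rate = alpha1 if state == 1 else alpha0
        states.append(state)
        # Gap until the next message: exponential with the state's rate.
        gaps.append(rng.expovariate(rate))
    return states, gaps

# Hypothetical parameters chosen only for illustration.
states, gaps = simulate_two_state(1000, alpha0=1.0, alpha1=5.0, p=0.05)
```

With alpha1 > alpha0, stretches spent in q1 produce visibly shorter gaps, which is what the model treats as a burst.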

Page 7: Bursty and Hierarchical Structure in Streams

A Weighted Automaton Model: Two State Model

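• Estimate a state sequence from a set of messages
  – Maximum likelihood
• n inter-arrival gaps: $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
• A state sequence: $\mathbf{q} = (q_{i_1}, q_{i_2}, \ldots, q_{i_n})$
• b denotes the number of state transitions in the sequence q

$$\Pr[\mathbf{q} \mid \mathbf{x}] = \frac{\Pr[\mathbf{q}]\, f_{\mathbf{q}}(\mathbf{x})}{\sum_{\mathbf{q}'} \Pr[\mathbf{q}']\, f_{\mathbf{q}'}(\mathbf{x})} = \frac{1}{Z} \left(\frac{p}{1-p}\right)^{b} (1-p)^{n} \prod_{t=1}^{n} f_{i_t}(x_t)$$

where Z is the normalizing constant $\sum_{\mathbf{q}'} \Pr[\mathbf{q}']\, f_{\mathbf{q}'}(\mathbf{x})$.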

Page 8: Bursty and Hierarchical Structure in Streams

A Weighted Automaton Model: Two State Model

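• Finding a state sequence q that maximizes the previous probability $\Pr[\mathbf{q} \mid \mathbf{x}]$ is equivalent to finding one that minimizes

$$-\ln \Pr[\mathbf{q} \mid \mathbf{x}] = b \ln\left(\frac{1-p}{p}\right) + \left(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\right) - n \ln(1-p) + \ln Z$$

• The last two terms do not depend on q, so this is equivalent to minimizing the following cost function:

$$c(\mathbf{q} \mid \mathbf{x}) = b \ln\left(\frac{1-p}{p}\right) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$$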
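As a rough sketch, the cost c(q | x) = b ln((1-p)/p) + sum of -ln f_{i_t}(x_t) can be evaluated directly for a candidate state sequence. The helper name `two_state_cost` is hypothetical, and the code assumes that b also counts a transition out of the initial state q0 before the first message.

```python
import math

def two_state_cost(states, gaps, alpha0, alpha1, p):
    """Cost c(q | x) of a 0/1 state sequence for the two-state model."""
    rates = (alpha0, alpha1)
    # b: number of state changes, treating q0 as the state before the
    # first message (an assumption about how b is counted).
    previous = [0] + list(states[:-1])
    b = sum(1 for prev, cur in zip(previous, states) if prev != cur)
    transition_term = b * math.log((1.0 - p) / p)
    # -ln f_i(x) = alpha_i * x - ln(alpha_i) for an exponential density.
    fit_term = sum(rates[q] * x - math.log(rates[q])
                   for q, x in zip(states, gaps))
    return transition_term + fit_term

cost = two_state_cost([0, 1], [1.0, 1.0], alpha0=1.0, alpha1=2.0, p=0.1)
```

A small p makes ln((1-p)/p) large, so each state change is expensive and the minimizing sequence changes state rarely.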

Page 9: Bursty and Hierarchical Structure in Streams

An Infinite-state Model

• Base state q0
  – Exponential density function $f_0$ with rate $\alpha_0 = \hat{g}^{-1} = n/T$ (so $\hat{g} = T/n$ is the average gap)
  – Consistent with completely uniform message arrivals
• State qi
  – Exponential density function $f_i$ with rate $\alpha_i = \hat{g}^{-1} s^i$
  – $s > 1$: scaling parameter
• The infinite sequence of states $q_0, q_1, \ldots$ models inter-arrival gaps that decrease geometrically from $\hat{g}$
• For every i and j, there is a cost $\tau(i, j)$ associated with a state transition from qi to qj
  – In this paper:

$$\tau(i, j) = \begin{cases} (j - i)\,\gamma \ln n & \text{if } j > i \\ 0 & \text{otherwise} \end{cases}$$
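The two ingredients of the infinite-state model, the rate of state q_i (which is (n/T) s^i) and the transition cost (which is (j - i) gamma ln n when moving up and 0 otherwise), are simple enough to write out. The function names below are hypothetical and the numbers in the usage lines are made up for illustration.

```python
import math

def state_rate(i, n, T, s):
    """Rate of state q_i: the base rate n/T scaled by s**i."""
    return (n / T) * s ** i

def transition_cost(i, j, n, gamma):
    """Cost of moving from q_i to q_j.

    Moving up costs (j - i) * gamma * ln(n); moving down or staying is free.
    """
    if j > i:
        return (j - i) * gamma * math.log(n)
    return 0.0

# Example: 100 gaps spread over 50 time units, scaling parameter s = 2.
base = state_rate(0, n=100, T=50.0, s=2.0)    # 2.0 messages per time unit
third = state_rate(3, n=100, T=50.0, s=2.0)   # 16.0 messages per time unit
up = transition_cost(0, 2, n=100, gamma=1.0)
down = transition_cost(2, 0, n=100, gamma=1.0)
```

Because only upward moves are charged, entering a burst must be justified by the data, while leaving one costs nothing.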

Page 10: Bursty and Hierarchical Structure in Streams

An Infinite-state Model

• This automaton, with its associated parameters s and $\gamma$, will be denoted as $A^*_{s,\gamma}$
• Given $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, find a state sequence $\mathbf{q} = (q_{i_1}, q_{i_2}, \ldots, q_{i_n})$ that minimizes the cost function:

$$c(\mathbf{q} \mid \mathbf{x}) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$$

  (with $i_0 = 0$, since the automaton starts in q0)

• As before, minimizing the first term is consistent with having few state transitions and transitions that span only a few distinct states, while minimizing the second term is consistent with passing through states whose rates agree closely with the inter-arrival gaps. Thus, the combined goal is to track the sequence of gaps as well as possible without changing state too much.

• Observe that the scaling parameter s controls the “resolution" with which the discrete rate values of the states are able to track the real-valued gaps; the parameter $\gamma$ controls the ease with which the automaton can change states.

• How to choose these two parameters?

Page 11: Bursty and Hierarchical Structure in Streams

An Infinite-state Model

• An optimal state sequence in $A^*_{s,\gamma}$ can be found by restricting to a number of states k that is a very small constant, always at most 25.

• This can be done by adapting the standard forward dynamic programming algorithm used for hidden Markov models to the model and cost function defined here.
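A minimal sketch of such a dynamic program is given below. It is a generic Viterbi-style minimization over k states, not the paper's exact procedure: the function name `optimal_states`, the default k = 5, and the default values of s and gamma are assumptions made for illustration.

```python
import math

def optimal_states(gaps, s=2.0, gamma=1.0, k=5):
    """Return a minimum-cost state index sequence for the given gaps."""
    n = len(gaps)
    total = sum(gaps)
    rates = [(n / total) * s ** i for i in range(k)]
    log_n = math.log(n)

    def tau(i, j):
        # Only upward transitions are charged.
        return (j - i) * gamma * log_n if j > i else 0.0

    def fit(i, x):
        # -ln f_i(x) for an exponential density with rate rates[i].
        return rates[i] * x - math.log(rates[i])

    # cost[j]: best cost of any sequence so far that ends in state j.
    # The automaton starts in q0, so only state 0 is reachable at cost 0.
    cost = [0.0] + [math.inf] * (k - 1)
    back = []
    for x in gaps:
        new_cost, pointers = [], []
        for j in range(k):
            best = min(range(k), key=lambda i: cost[i] + tau(i, j))
            new_cost.append(cost[best] + tau(best, j) + fit(j, x))
            pointers.append(best)
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final state.
    state = min(range(k), key=lambda j: cost[j])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Ten slow gaps, twenty fast gaps, ten slow gaps.
path = optimal_states([1.0] * 10 + [0.01] * 20 + [1.0] * 10)
```

The running time is O(n k^2), which stays small because k is a small constant.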

Page 12: Bursty and Hierarchical Structure in Streams

An Infinite-state Model

• We can formally define a burst of intensity j to be a maximal interval over which q is in a state of index j or higher.

• It follows that bursts exhibit a natural nested structure
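The definition of a burst of intensity j as a maximal interval spent in a state of index j or higher translates directly into code. The sketch below assumes the state sequence is given as a list of state indices; the function name `bursts` is hypothetical.

```python
def bursts(path):
    """Return (intensity, start, end) triples with inclusive end indices."""
    result = []
    for level in range(1, max(path, default=0) + 1):
        start = None
        for t, q in enumerate(path):
            if q >= level and start is None:
                start = t                       # a burst of this level opens
            elif q < level and start is not None:
                result.append((level, start, t - 1))  # the burst closes
                start = None
        if start is not None:                   # burst runs to the end
            result.append((level, start, len(path) - 1))
    return result

found = bursts([0, 1, 2, 2, 1, 0, 1, 0])
# found == [(1, 1, 4), (1, 6, 6), (2, 2, 3)]
```

In the example, the intensity-2 burst over positions 2 to 3 lies inside the intensity-1 burst over positions 1 to 4, which is the nesting the slide refers to.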

Page 13: Bursty and Hierarchical Structure in Streams

An Infinite-state Model

Page 14: Bursty and Hierarchical Structure in Streams

Experiments: Email (Hierarchical Structure)

• Saved email of the author (June 9, 1997 to Aug. 23, 2001)
• Total: 34,344 messages
• Subsets of the collection can be chosen by selecting all messages that contain a particular string or set of strings
  – ITR: the name of a large National Science Foundation program for which the author and his colleagues wrote two proposals in 1999-2000
  – Prelim: the term used at Cornell for (non-final) exams in undergraduate courses
• Questions to examine:
  – First, is it in fact the case that the appearance of messages containing particular words exhibits a “spike," in some informal sense, in the (temporal) vicinity of significant times such as deadlines, scheduled events, or unexpected developments?
  – Second, do the algorithms developed here provide a means for identifying this phenomenon?

Page 15: Bursty and Hierarchical Structure in Streams

ITR

Page 16: Bursty and Hierarchical Structure in Streams

ITR

Page 17: Bursty and Hierarchical Structure in Streams

prelim

Page 18: Bursty and Hierarchical Structure in Streams

prelim

Page 19: Bursty and Hierarchical Structure in Streams

Experiments: Paper Title (Enumerating Bursts)

• For every word w that appears in the collection, one computes all the bursts in the stream of messages containing w. This is combined with a method for computing a weight associated with each burst and then ranking the bursts by weight.

• This essentially provides a way to find the terms that exhibit the most prominent rising and falling pattern over a limited period of time.

• Extracting bursts in term usage from the titles of conference papers.

• Two distinct sources of data are used here:
  – The titles of all papers from the database conferences SIGMOD and VLDB for the years 1975-2001
  – The titles of all papers from the theory conferences STOC and FOCS for the years 1969-2001

Page 20: Bursty and Hierarchical Structure in Streams

The Automaton $B^*_{s,\gamma}$

• $A^*_{s,\gamma}$ is not suitable in this case, since it is fundamentally based on analyzing the distribution of inter-arrival gaps.

• Documents arrive in discrete batches; in each new batch of documents, some are relevant and some are irrelevant.

• The idea is thus to find an automaton model that generates batched arrivals, with particular fractions of relevant documents.

• A sequence of batched arrivals could be considered bursty if the fraction of relevant documents alternates between reasonably long periods in which the fraction is large and other periods in which it is small.

Page 21: Bursty and Hierarchical Structure in Streams

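The Automaton $B^*_{s,\gamma}$

• Base state q0 (batch t contains $d_t$ documents, of which $r_t$ are relevant)
  – $R = \sum_{t=1}^{n} r_t$, $D = \sum_{t=1}^{n} d_t$
  – $p_0 = R/D$

• State qi, $i > 0$
  – $p_i = p_0 s^i$
  – $q_i$ will only be defined for i such that $p_i \le 1$

• State qi produces a mixture of relevant and irrelevant documents according to a binomial distribution with probability pi.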
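To make the state definition concrete, the following sketch computes p_0 = R/D from the batch counts and lists the probabilities p_i = p_0 s^i for every state that is defined (p_i at most 1). The function name `batch_state_probs` and the sample counts are hypothetical.

```python
def batch_state_probs(r, d, s):
    """Probabilities p_i = p_0 * s**i for all i with p_i <= 1.

    r[t] is the number of relevant documents in batch t,
    d[t] is the total number of documents in batch t.
    """
    if s <= 1.0:
        raise ValueError("scaling parameter s must exceed 1")
    p0 = sum(r) / sum(d)  # overall fraction of relevant documents, R / D
    if p0 <= 0.0:
        raise ValueError("need at least one relevant document")
    probs = []
    i = 0
    while p0 * s ** i <= 1.0:
        probs.append(p0 * s ** i)
        i += 1
    return probs

# Made-up batches: 4 relevant documents out of 40, so p_0 = 0.1.
probs = batch_state_probs(r=[1, 2, 1], d=[10, 10, 20], s=2.0)
```

Here the states have probabilities 0.1, 0.2, 0.4 and 0.8; the next value, 1.6, exceeds 1, so only four states exist.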

Page 22: Bursty and Hierarchical Structure in Streams

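The Automaton $B^*_{s,\gamma}$

• Cost function (here $\mathbf{x}$ denotes the observed batches $(r_t, d_t)$):

$$c(\mathbf{q} \mid \mathbf{x}) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} \sigma(i_t, r_t, d_t)$$

  – If the automaton is in state qi when the t-th batch arrives, it incurs the cost

$$\sigma(i, r_t, d_t) = -\ln\left[\binom{d_t}{r_t}\, p_i^{r_t} (1 - p_i)^{d_t - r_t}\right]$$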
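The per-batch cost is the negative log of a binomial probability: sigma(i, r_t, d_t) = -ln[C(d_t, r_t) p_i^{r_t} (1 - p_i)^{d_t - r_t}]. A small sketch follows; the function name `sigma` takes the probability p_i directly rather than the index i, which is a simplification made here, and it assumes 0 < p_i < 1.

```python
import math

def sigma(p_i, r_t, d_t):
    """Negative log-likelihood of r_t relevant documents out of d_t
    under a binomial distribution with success probability p_i.

    Assumes 0 < p_i < 1 (p_i = 1 would need special handling).
    """
    # ln of the binomial coefficient C(d_t, r_t), via log-gamma.
    log_binom = (math.lgamma(d_t + 1)
                 - math.lgamma(r_t + 1)
                 - math.lgamma(d_t - r_t + 1))
    log_prob = (log_binom
                + r_t * math.log(p_i)
                + (d_t - r_t) * math.log(1.0 - p_i))
    return -log_prob

cost = sigma(0.5, r_t=1, d_t=2)  # -ln(2 * 0.5 * 0.5) = ln 2
```

A state whose p_i is close to the observed fraction r_t / d_t gives a small sigma, so the minimizing sequence follows the relevant fraction from batch to batch.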

Page 23: Bursty and Hierarchical Structure in Streams

Experiment (Paper Title)

• The main goal is to enumerate bursts of positive intensity, so the two-state automaton $B^2_{s,\gamma}$ is used.

• Given an optimal state sequence, bursts of positive intensity correspond to intervals in which the state is q1 rather than q0.

• Weight of the burst $[t_1, t_2]$:

$$\sum_{t=t_1}^{t_2} \left(\sigma(0, r_t, d_t) - \sigma(1, r_t, d_t)\right)$$
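The weight of a burst over batches t1 to t2 is the sum of sigma(0, r_t, d_t) - sigma(1, r_t, d_t), that is, the cost saved by explaining those batches with q1 instead of q0. In the sketch below the binomial coefficient cancels in the difference, so it never has to be computed. The function name `burst_weight`, the 0-based inclusive indexing, and the sample numbers are assumptions for illustration.

```python
import math

def burst_weight(r, d, p0, p1, t1, t2):
    """Sum over t in [t1, t2] of sigma(0, r_t, d_t) - sigma(1, r_t, d_t).

    Indices are 0-based and inclusive. Assumes 0 < p0 < p1 < 1.
    """
    weight = 0.0
    for t in range(t1, t2 + 1):
        # The ln C(d_t, r_t) terms cancel, leaving only the log-ratio terms.
        weight += r[t] * math.log(p1 / p0)
        weight += (d[t] - r[t]) * math.log((1.0 - p1) / (1.0 - p0))
    return weight

# One batch with 5 relevant titles out of 10, p0 = 0.25, p1 = 0.5.
w = burst_weight(r=[5], d=[10], p0=0.25, p1=0.5, t1=0, t2=0)
```

Sorting the bursts of all words by this weight gives the ranking used in the paper-title experiment.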

Page 24: Bursty and Hierarchical Structure in Streams

Experiment Result

Page 25: Bursty and Hierarchical Structure in Streams

Experiment Result

Page 26: Bursty and Hierarchical Structure in Streams

Thoughts

• Identifying whether a message belongs to a certain topic is the hard problem. (This paper avoids it.)
  – Text mining should handle this problem

• The results are not very impressive, and the model might not be so meaningful.
  – Why this generating model?
  – Why these parameters?
  – None of these choices can be mathematically proved (verified)