Systems Engineering and Engineering Management, The Chinese University of Hong Kong
Parameter Free Bursty Events Detection in Text Streams
Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu, Philip S Yu
VLDB 2005
Outline
Introduction
– Bursty events? Text streams? Etc.
A Possible Method
– Document pivot clustering
Proposed Work
– Feature pivot clustering
Results Highlight
Related Works
Summary & Future Work
Introduction (1 of 5)
Parameter Free Bursty Events Detection in Text Streams
Introduction (2 of 5)
Parameter Free Bursty Events Detection in Text Streams
– Text stream: a sequence of documents organized temporally
» E.g. news stories and e-mails
– Two kinds of stream: online vs. offline
» Online stream: open-ended
» Offline stream: has boundaries
Introduction (3 of 5)
Parameter Free Bursty Events Detection in Text Streams
– An event consists of a set of features that are useful for identifying (understanding) the event.
– A bursty event is an event that is “hot” in a specific period of time.
– We call the features that are used to identify a bursty event bursty features.
– E.g. the event “SARS” consists of the features “Outbreak, Atypic, Respire, …”
[Figure: number of news stories over time for an event, e.g. SARS]
Introduction (4 of 5)
Parameter Free Bursty Events Detection in Text Streams
– Given a text stream, find all of the bursty events
» In other words, find all of the bursty features (features that are “hot” in a specific period) and group them logically, such that the bursty features grouped together are useful for identifying an event.
Introduction (5 of 5)
Parameter Free Bursty Events Detection in Text Streams
– Parameter free: you do not need to tune the parameters yourself
» The framework is applicable to any corpus
» No fine tuning is necessary
» No parameter needs to be estimated
– Why is being parameter free useful?
» Without any prior knowledge about the information in a database, it is rather difficult to make any initial estimate
» In our problem, we are trying to identify the bursty events in a text stream. We do not have any prior knowledge about the information in the database. We do not know what it contains. We do not even know whether there is any burst. We do not know…
Problem Setting
Data archived
– Source: local news stories (South China Morning Post)
– Period: 2003-01-01 to 2004-12-31
Some major settings
– Offline detection
– News stories that are released on the same day (i.e. news stories that appear in the same issue of the newspaper) are grouped together as a batch
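The daily batching described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `stories` list, its `(date, text)` layout, the whitespace tokenizer, and both function names are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical input: (publication date, story text) pairs.
stories = [
    (date(2003, 3, 15), "sars outbreak in hospital"),
    (date(2003, 3, 15), "budget debate continues"),
    (date(2003, 3, 16), "sars outbreak spreads"),
]

def batch_by_day(stories):
    """Group stories released on the same day into one batch."""
    batches = defaultdict(list)
    for day, text in stories:
        batches[day].append(text)
    return dict(batches)

def daily_document_frequency(batches, feature):
    """For each day, count the stories in that day's batch that contain `feature`."""
    return {
        day: sum(feature in text.split() for text in texts)
        for day, texts in sorted(batches.items())
    }

batches = batch_by_day(stories)
sars_counts = daily_document_frequency(batches, "sars")
```

The per-day counts in `sars_counts` are the kind of "no. of docs containing the feature" series that the later slides plot over time.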
Document Pivot Clustering Approach (1 of 3)
A possible method (not our approach)
– Step 1
» Objective: group similar events together
» Method: use clustering to group similar documents together (e.g. K-Means)
– Step 2
» Objective: extract the keywords of each event
» Method: use feature selection (e.g. information gain)
[Figure: Step 1 clusters all news stories into groups (Group 1, Group 2, …); Step 2 extracts the key features of each group]
Document Pivot Clustering Approach (2 of 3)
Some difficulties
1. The most similar documents may not report the same event
– In our experiments, we found that the two documents that are most similar in terms of features may not necessarily report the same event
2. Clustering requires feature weighting (e.g. tf-idf)
– Feature weighting originated in IR. Its idea: features that appear in fewer documents in the domain are more useful (obtain higher weights).
– For clustering: features that appear in many documents in a certain period should obtain higher weights (the opposite of the IR weighting).
Document Pivot Clustering Approach (3 of 3)
Some difficulties (cont’d)
3. A long-running event may be broken down into several small pieces
– This phenomenon appears in many reported studies (esp. in TDT)
4. It is difficult to figure out the bursty features
– Assume clustering can determine bursty events. Even so, there can be many clusters that are not “hot” (important). Determining which clusters are “hot” is difficult (it may require a ranking function, which is difficult to derive).
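The feature-weighting difficulty (item 2) can be made concrete with a small calculation. The sketch below uses one common idf variant, log(N / df); the toy documents and the function name `idf` are assumptions for illustration only.

```python
import math

def idf(feature, documents):
    """Inverse document frequency, log(N / df): one common IR weighting variant."""
    df = sum(feature in doc for doc in documents)
    if df == 0:
        raise ValueError(f"{feature!r} does not occur in any document")
    return math.log(len(documents) / df)

# Hypothetical batch from a period in which "sars" is hot: it appears in
# most stories, while "budget" appears in only one.
documents = [
    {"sars", "outbreak", "hospital"},
    {"sars", "outbreak", "respire"},
    {"sars", "atypic"},
    {"budget", "debate"},
]

idf_sars = idf("sars", documents)      # log(4 / 3)
idf_budget = idf("budget", documents)  # log(4 / 1)
```

In this toy batch the hot feature receives the lower weight, which is the opposite of what bursty event detection needs.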
Feature Pivot Clustering Approach
Overview of the framework
– Step 1
» Identify the bursty features
– Step 2
» Group the bursty features into bursty events
– Step 3
» Determine the hot periods of the bursty events
[Figure: all features are extracted from all news stories; Step 1 identifies the bursty features; Step 2 clusters them into events (Event 1, Event 2, …); Step 3 determines the hot period of each event]
Identify the Bursty Features (1 of 7)
General idea
– Given a single feature, f, try to figure out whether it has any bursty period.
– If so, it is a bursty feature (in some specific periods)
[Figure: the distribution of a feature, f, among documents: number of documents containing f over time, with the bursty period marked]
Identify the Bursty Features (2 of 7)
Some more examples
[Figure: four plots of the number of documents containing the feature, f, over time: no burst; not a burst (stopword); burst without fading away; two bursts]
Identify the Bursty Features (3 of 7)
An obvious approach to discovering whether a feature is a bursty feature is to use a “threshold cut”
[Figure: the distribution of a feature, f, among documents: number of documents containing f over time, with a threshold line and the bursty period marked]
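A threshold cut is simple to state in code. The following is a minimal sketch; the daily counts, the threshold value, and the function name are all made up for the example.

```python
def bursty_days(counts, threshold):
    """Return the indices of the days whose document count exceeds `threshold`."""
    return [day for day, count in enumerate(counts) if count > threshold]

# Hypothetical daily counts of documents containing a feature.
event_feature = [2, 3, 40, 55, 38, 4, 2]
stopword = [480, 500, 490, 510, 495, 505, 500]

threshold = 30  # hand-picked for this example
event_burst = bursty_days(event_feature, threshold)  # [2, 3, 4]
stopword_burst = bursty_days(stopword, threshold)    # [0, 1, 2, 3, 4, 5, 6]
```

The threshold that isolates the burst of the first feature flags every day of the stop-word, which hints at why a single hand-picked value is hard to justify.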
Identify the Bursty Features (4 of 7)
Challenges
– Setting one single threshold for all features is impossible
Another attempt: set a “percentage cut”
– Figure out the relative difference between the max and min of the “no. of docs containing the feature”
[Figure: number of documents containing the feature, f, over time, for a stop-word and for a normal non-bursty feature, with a threshold line]
Identify the Bursty Features (5 of 7)
Challenges
– Setting a percentage cut is also impossible
» Different features have different distributions:
[Figure: two plots of the number of documents containing the feature, f, over time, with axis values 500 and 300]
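One way a percentage cut can fail is sketched below. The definition (max − min) / max, the sample counts, and the function name are assumptions chosen for illustration; the slides do not give an exact formula.

```python
def relative_difference(counts):
    """(max - min) / max of the daily document counts; 0.0 if the feature never appears."""
    highest, lowest = max(counts), min(counts)
    return (highest - lowest) / highest if highest else 0.0

# Hypothetical daily document counts.
bursty = [5, 6, 300, 280, 7]          # a real burst
rare_noise = [0, 1, 0, 3, 0]          # rare, non-bursty feature
stopword = [480, 500, 490, 510, 495]  # frequent everywhere

rel_bursty = relative_difference(bursty)      # 295 / 300
rel_noise = relative_difference(rare_noise)   # 3 / 3 = 1.0
rel_stopword = relative_difference(stopword)  # 30 / 510
```

The stop-word is correctly suppressed, but the rare feature with a few chance occurrences scores higher than the real burst, so no single percentage separates the two in this example.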
Identify the Bursty Features (6 of 7)
Our solution
– Treat each feature in the text stream as a probability distribution
– For each day, we compute the probability of observing the number of documents that contain a particular feature, f_j
» What we have:
N’ – no. of news stories in the stream
n’ – no. of news stories in a time window (one day)
K’ – no. of news stories that contain the specific feature
n’ – K’ – no. of news stories that do not contain the specific feature
» We can model the distribution of a feature in a time window (i.e. in a day) with a binomial distribution (the above four quantities are enough to compute it)
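For reference, the standard binomial probability mass function written with the quantities listed on this slide. The success probability p is not defined on the slide; reading it as the expected fraction of stories that contain the feature is an assumption.

```latex
P(K' \mid n', p) = \binom{n'}{K'} \, p^{K'} \, (1 - p)^{\,n' - K'}
```

Here n’ is the number of stories in the day and K’ the number of those stories that contain the feature, so n’ – K’ is the exponent of the "does not contain" term.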
Identify the Bursty Features (7 of 7)
– If, in any time window (day), the value of the binomial distribution (the probability of the number of documents that contain the feature) changes significantly, then it implies that the feature exhibits “abnormal” behavior
» The reason is that if the features are generated from an unknown (but fixed) probability distribution, then the value of the binomial distribution in each time window (each day) should be more or less constant
– Two reasons that it drops significantly:
» Suddenly very few documents contain the specific feature
We are not interested in this kind of observation, as it only tells us that the specific feature is NOT a bursty feature in the corresponding time window (day). It gives no insight into whether it is a bursty feature NOW.
» Suddenly many documents contain the specific feature
We are interested in this kind of feature
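The idea of watching for days on which unexpectedly many documents contain a feature can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: p is estimated as the pooled fraction of stories containing the feature, each day is scored by the binomial upper-tail probability, and the sample counts and function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k, n, p):
    """Probability of seeing k or more successes in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

def burst_scores(daily_counts, daily_totals):
    """Score each day by how unlikely its count (or a higher one) is.

    Assumption: p is the pooled fraction of stories containing the feature.
    A small score means that suddenly many documents contain the feature.
    """
    p = sum(daily_counts) / sum(daily_totals)
    return [upper_tail(k, n, p) for k, n in zip(daily_counts, daily_totals)]

# Hypothetical stream: 100 stories per day; the feature spikes on day 3.
counts = [2, 3, 2, 20, 3]
totals = [100, 100, 100, 100, 100]
scores = burst_scores(counts, totals)
```

Using only the upper tail mirrors the slide: days with suddenly very few documents are ignored, and only days with suddenly many documents receive a small score.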
Group the Bursty Features (1 of 2)
General idea
– Group the features that always appear together
» If the features always appear together, they should be discussing the same event
– Cluster the features
Challenge
– Should we group these two features together?
» Situation:
If feature B appears, feature A always appears also.
Feature A appears in 1,000 stories. Feature B appears in 200 stories.
» We claim that they should not be grouped together, as feature B is only a subset of feature A.
We want to group features at the “same level”
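The asymmetry in this situation, where the stories containing feature B are a subset of the stories containing feature A, shows up when the two conditional probabilities are compared. The story ids and the function name below are hypothetical; the counts are the ones from the slide.

```python
def conditional_probability(given, other):
    """P(other | given): fraction of the stories containing `given` that also contain `other`."""
    return len(given & other) / len(given)

# Hypothetical story ids: feature A appears in 1,000 stories; feature B
# appears in 200 stories, all of which also contain A.
stories_with_a = set(range(1000))
stories_with_b = set(range(200))

p_a_given_b = conditional_probability(stories_with_b, stories_with_a)  # 200 / 200 = 1.0
p_b_given_a = conditional_probability(stories_with_a, stories_with_b)  # 200 / 1000 = 0.2
```

Only one direction is high, so the two features are not at the “same level”. This check is just an illustration of the requirement, not the grouping method itself.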
Group the Bursty Features (2 of 2)
Our solution
– We try to find the probability that the features are grouped together, given the observed document distribution of the text stream
» Find the grouping of features that maximizes this probability (Expectation-Maximization, EM)
– Mathematically: [Equation]
Determine the Hot Periods
General idea
– The hot period is the period with the highest average probability that the bursty features appear together
Graphically
[Figure: document distribution over time]
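A deliberately simplified sketch of the general idea follows. It averages per-feature daily probabilities rather than a joint probability, and it returns only the single peak day rather than a period; both simplifications, the sample numbers, and the function name are assumptions, since the slide does not give the exact computation.

```python
def hottest_day(feature_probabilities):
    """Return (day index, average probability) for the day with the highest average.

    `feature_probabilities` maps each bursty feature of one event to a list of
    per-day probabilities (here: fraction of that day's stories containing it).
    """
    series = list(feature_probabilities.values())
    n_days = len(series[0])
    averages = [sum(s[day] for s in series) / len(series) for day in range(n_days)]
    best = max(range(n_days), key=averages.__getitem__)
    return best, averages[best]

# Hypothetical per-day probabilities for the features of one event.
event = {
    "sars":     [0.01, 0.10, 0.40, 0.30, 0.05],
    "outbreak": [0.02, 0.08, 0.30, 0.35, 0.04],
    "atypic":   [0.00, 0.06, 0.20, 0.10, 0.01],
}
day, average = hottest_day(event)
```

Extending the peak day to a contiguous hot period would need a further rule, which is left out here because the slide does not specify one.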
Results Highlight
Some events
Bursty Events    Bursty Features
SARS             Sars, Outbreak, Atypic, Respire, …
Legislation      Article, Yip, Law, Rally, …
Bird Flu         Bird, Flu
Taiwan Issue     Taiwan, Chen, Shu, Bian
Iraq War         Iraq, War, Saddam, …
Gas              Victim, Might, Accident, Gas
Related Works (1 of 2)
TDT (Topic Detection and Tracking) – Automatic techniques for locating topically related material in streams of data (Wayne 2000, p. 1487)
– Five major tasks: segmentation, tracking, detection, first story detection, linking
– Works well with the “document-pivot clustering” approach
» Tries to group similar documents to form an event (the event is not named, i.e. no need to extract or identify the main features of the event)
No need to figure out the “bursty features”
– Other interesting issue
» Our approach naturally combines the detection task and the linking task
Related Works (2 of 2)
Many other related works
– Vlachos et al., SIGMOD’04
» Bursts for online queries
– Smith, SIGIR’02
» Event detection
– Kleinberg, KDD’02
» Bursts and hierarchical structure
– Swan & Allan, SIGIR’00
» Time-varying features
– …
Summary & Future Work
Document Pivot Clustering vs. Feature Pivot Clustering
– Document Pivot Clustering: clustering is based on the content of the documents
– Feature Pivot Clustering: clustering is based on the distribution of features
Future work
– Try to apply the framework to the TDT dataset
» However, TDT contains selected news stories from multiple sources. The distribution of features may be affected.
» Moreover, the time period of TDT is relatively short. We do not know whether the change in the distribution of features is significant enough for us to analyze
– Try to assign the same feature to multiple events (more realistic)
» However, this may lead to many new issues, such as a “cycle” appearing, or some parameters needing to be introduced
Thank you very much
– The End –