Discovering Hot Topics using Twitter Streaming Data “Social Topics Detection and Geographic Clustering”
Hwi-Gang Kim, Seongjoo Lee, and Sunghyon Kyeong†
Mathematical Analytics Team, National Institute for Mathematical Sciences
2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining ASONAM 2013
Niagara Falls, Canada, August 25-28, 2013
†: corresponding author
Role of SNSs
• Informing breaking news (Twitter Journalism)
• Expressing one’s feelings and emotions
• Communication tool in daily life
• Research tools for studying
  - social behaviors,
  - human communication,
  - detection of a flu epidemic,
  - and text mining
In this study
• Twitter streaming API and MongoDB were used for data collection.
• We proposed a measure for the social hot topic detection of the day.
• Geographic communities were detected for the weather-related keywords and visualized using Google Fusion Tables.
Related Works
• Mei et al. (2006) applied probabilistic latent semantic indexing (PLSI) to discover spatiotemporal theme patterns on weblogs.
• Wang et al. (2007) proposed a location-aware topic model (LATM) to incorporate the relationship between locations and words.
• Yin et al. (2011) proposed Latent Geographical Topic Analysis (LGTA), a novel location-text joint model.
• In general, the EM algorithm takes a huge amount of computing time, and the previous studies did not directly classify locations by topic.
EM: expectation-maximization
Data collection
• Geo-tagged public statuses tweeted in the United States.
• A total of ~19 million geo-tagged Twitter statuses were obtained from March 23 to April 1, 2013.
• This period includes events such as a spring snowfall, the same-sex marriage issue before the US court, a World Cup qualifier match between the US and Mexico, basketball games, and Easter.
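As a minimal sketch of the filtering step, the snippet below keeps only statuses that carry a point coordinate inside a rough bounding box. It assumes statuses arrive as newline-delimited JSON with a GeoJSON-style `coordinates` field; the bounding box values and the function names are illustrative assumptions, not the collection code used in the study.

```python
import json

# Approximate bounding box of the contiguous US (an assumption for this sketch).
US_BBOX = (-125.0, 24.0, -66.0, 50.0)  # (min_lon, min_lat, max_lon, max_lat)


def is_geotagged_in_bbox(status, bbox=US_BBOX):
    """Return True if the status carries a point coordinate inside bbox."""
    coords = status.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return False
    lon, lat = coords["coordinates"]  # GeoJSON order: longitude, latitude
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def filter_statuses(lines):
    """Parse newline-delimited JSON and keep geo-tagged statuses in the box."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        status = json.loads(line)
        if is_geotagged_in_bbox(status):
            kept.append(status)
    return kept


# Hypothetical sample records, not data from the study.
sample = [
    '{"text": "snow again", "coordinates": {"type": "Point", "coordinates": [-90.2, 38.6]}}',
    '{"text": "no location", "coordinates": null}',
    '',
    '{"text": "abroad", "coordinates": {"type": "Point", "coordinates": [2.35, 48.85]}}',
]
kept = filter_statuses(sample)
print([s["text"] for s in kept])
```

A rectangular box is only a coarse filter; assigning each status to a US state would need a further lookup that this sketch leaves out.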
[Figure: Twitter streaming data in the US]
MongoDB Sharding
[Figure: MongoDB sharded cluster. A client application connects through MongoS; three config servers (C1, C2, C3), each a Mongod; three shards (Shard1, Shard2, Shard3), each a replica set of three Mongod instances.]
Word frequency
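$wf^{\omega} = \sum_{t \in T} \sum_{s \in S} f^{\omega}_{ts}$
where $f^{\omega}_{ts}$ is the frequency function for a word ($\omega$) in a US state ($s$) at time ($t$), and $wf^{\omega}$ is the total frequency of that word.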
The most frequently tweeted words are not social topics but emotional words expressing one’s feelings.
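The word frequency sum can be sketched in a few lines. The records, the whitespace tokenizer, and the variable names below are assumptions made for illustration; the study's own tokenization is not described on the slide.

```python
from collections import Counter, defaultdict

# Hypothetical records: (day, state, tweet text). Not data from the study.
records = [
    ("2013-03-24", "MO", "lol so much snow"),
    ("2013-03-24", "PA", "snow day lol"),
    ("2013-03-25", "MO", "love this weather lol"),
]

# f[word][(t, s)] is the count of `word` in state s on day t.
f = defaultdict(Counter)
for day, state, text in records:
    for word in text.lower().split():  # naive whitespace tokenizer (assumption)
        f[word][(day, state)] += 1

# wf[word] sums f over every day t and every state s.
wf = Counter({word: sum(cells.values()) for word, cells in f.items()})
print(wf.most_common(2))  # [('lol', 3), ('snow', 2)]
```

Even in this toy input the top-ranked word is an emotional expression rather than a topic word, which is the effect the slide describes.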
[Figure: top 5 words and Easter]
Distribution of Word Freq.
[Figure: distribution of word frequency; x-axis log10(word frequency), y-axis log10(counts); labeled words: lol, like, love, Easter.]
※ scale-free distribution
Ratio of Word Freq.
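The definition of a ratio of word frequency to measure social topics:
$R^{\omega}_{t} = \frac{F^{\omega}_{t} - F^{\omega}_{t-1}}{F^{\omega}_{t} + F^{\omega}_{t-1}}$
$F^{\omega}_{t} = \sum_{s \in S} f^{\omega}_{ts}$
$F^{\omega}_{t}$ is the time series function for a word ($\omega$) integrated over the spatial index ($s$).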
[Figure: ratio of word frequency from Mar/24 to Apr/1 for the words Easter, lol, like, and love; y-axis from -1.0 to 1.0.]
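The ratio of word frequency compares each day's count with the previous day's count, so it always lies between -1 and 1 for non-negative counts. A minimal sketch follows, assuming the daily counts have already been summed over states; the function name and the sample counts are hypothetical.

```python
def frequency_ratio(series):
    """Return R_t = (F_t - F_{t-1}) / (F_t + F_{t-1}) for consecutive days.

    The result has one entry fewer than `series`. An entry is None when
    both counts are zero, because the ratio is undefined there.
    """
    ratios = []
    for prev, curr in zip(series, series[1:]):
        total = curr + prev
        ratios.append((curr - prev) / total if total else None)
    return ratios


# Hypothetical daily counts, already summed over states.
easter = [10, 10, 30, 90]
lol = [1000, 1050, 1000, 1000]

print(frequency_ratio(easter))  # [0.0, 0.5, 0.5]
print(frequency_ratio(lol))     # values close to zero
```

A word whose count triples from one day to the next gets a ratio of 0.5 regardless of its absolute frequency, which is why a rare but rising word can stand out against a common but steady one.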
Social Topics by $R^{\omega}_{t}$
Topics             Top words in terms of frequency
Weather            H1={weather, snow, winter, cold, sick}
Daily life         H2={class, school, gym, lunch, job, jobs, tweetmyjobs}
Weekend            H3={bar, party, drinking, beer, movies, drunk, club}
US law             H4={gay, marriage}
Sports 1           H5={soccer, usa, mexico}
Sports 2           H6={basketball, chicago, bulls, lebron, miami, heat, kevin, leg, injury, michigan}
TV show            H7={thewalkingdead, walking, dead}
Easter             H8={easter, church, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, lord}
April Fools’ Day   H9={april, joke, fool}
Emotions           H10={lol, like, love, shit, fuck, haha, oh, ass}
Topic - Weather, H1
• According to US newspapers, heavy snow fell in about six states from the Midwest to the East, from Missouri to Pennsylvania, on March 24, 2013.
• The snowfall stopped on March 25. Interestingly, $R^{\omega}_{t}$ decreased dramatically for the word set H1 on March 26.
[Figure: ratio of word frequency from Mar/24 to Apr/1 for Weather, Snow, Winter, Cold, and Sick; y-axis from -0.6 to 0.6.]
Topic - Weekend, H3
[Figure: ratio of word frequency from Mar/24 to Apr/1 for Bar, Party, Drinking, Beer, Movies, Drunk, and Club; y-axis from -0.4 to 0.4.]
• Topic words during the weekend include entertainment words such as movies and party, but these are also used steadily during the week, albeit less frequently.
Topic - US Law, H4
• On March 26, the hot topic was the same-sex marriage issue before the US court, and we can see the corresponding rapid increase in $R^{\omega}_{t}$ on that day.
[Figure: ratio of word frequency from Mar/24 to Apr/1 for gay and marriage; y-axis from -0.8 to 0.8.]
Topic - Sports, H5
• As the US and Mexico played a World Cup qualifying match in Mexico on March 26, we found that $R^{\omega}_{t}$ for the topic ‘Sports 1’ peaked around the match.
[Figure: ratio of word frequency from Mar/24 to Apr/1 for Soccer, USA, and Mexico; y-axis from -0.8 to 0.8.]
Topic - Easter, H8
• On March 31, we can see that $R^{\omega}_{t}$ increases for Easter-related words such as easter, happy, bunny, egg(s), god, and jesus.
• This is expected, as Easter is one of the most celebrated Christian festivals in the US.
[Figure: ratio of word frequency from Mar/24 to Apr/1 for Easter, Blessed, Bunny, Jesus, Happy, Happyeaster, Basket, Candy, Egg, Eggs, God, and Lord; y-axis from -1.0 to 1.0.]
Topic - Emotions, H10
• The $R^{\omega}_{t}$ for emotional words showed only a small fluctuation ($|R^{\omega}_{t}| < 0.1$) even though these words rank higher in word frequency.
• This result suggests that the frequency of expressions of feelings and emotions is relatively constant over time.
[Figure: ratio of word frequency from Mar/24 to Apr/1 for lol, like, love, shit, fuck, haha, oh, and ass; y-axis from about -0.1 to 0.1, with $|R^{\omega}_{t}| < 0.1$ throughout.]
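The contrast between steady emotional words and bursty topic words can be sketched as a simple split on the largest absolute ratio per word. Using 0.1 as a cut-off is an assumption borrowed from the bound reported for emotional words; the slides do not state a decision rule, and the counts and function names below are hypothetical.

```python
def max_abs_ratio(series):
    """Largest |R_t| over consecutive days, skipping pairs with no counts."""
    best = 0.0
    for prev, curr in zip(series, series[1:]):
        total = prev + curr
        if total:
            best = max(best, abs(curr - prev) / total)
    return best


def split_by_fluctuation(daily_counts, threshold=0.1):
    """Partition words into (bursty, steady) by their maximum |R_t|."""
    bursty, steady = [], []
    for word, series in daily_counts.items():
        target = bursty if max_abs_ratio(series) >= threshold else steady
        target.append(word)
    return sorted(bursty), sorted(steady)


# Hypothetical daily counts, not taken from the study.
daily_counts = {
    "lol": [1000, 1040, 990, 1010],
    "love": [800, 790, 820, 805],
    "easter": [20, 25, 60, 400],
    "snow": [300, 120, 40, 35],
}
bursty, steady = split_by_fluctuation(daily_counts)
print(bursty, steady)  # ['easter', 'snow'] ['lol', 'love']
```

Note that a sharp drop counts as much as a sharp rise here, which matches the weather example where the ratio falls once the snowfall stops.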
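Geographic Clustering
• For each set of hot topic words $H_k$, we computed the spatiotemporal matrix for the k-th hot topic as follows:
$\Phi^{k}_{ts} = \sum_{\omega \in H_k} f^{\omega}_{ts}$
• Then we obtained the adjacency matrix from Pearson’s correlation coefficient between US states:
$A^{k}_{ij} = \mathrm{Corr}(\Phi^{k}_{\bullet i}, \Phi^{k}_{\bullet j})$
• Modularity ($Q$) was computed from the weighted graph using the Louvain community detection algorithm, which maximizes $Q$:
$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{s_i s_j}{2m} \right] \delta(C_i, C_j)$
where $s_i$ is the strength (sum of edge weights) of node $i$, $m$ is the total edge weight, $C_i$ is the community of node $i$, and $\delta$ is the Kronecker delta.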
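The pipeline from the spatiotemporal matrix to a modularity score can be sketched in plain Python. This is an illustration under stated assumptions: the matrix is called `phi` here, negative correlations are clipped to zero and self-loops are dropped (the slides do not say how these were handled), and the code only evaluates Q for a given partition rather than running the Louvain search itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def adjacency(phi):
    """Build A[i][j] = Corr(column i, column j) from phi[t][s].

    Negative correlations are clipped to 0 and the diagonal is set to 0;
    both choices are assumptions of this sketch.
    """
    n_states = len(phi[0])
    cols = [[row[s] for row in phi] for s in range(n_states)]
    return [[0.0 if i == j else max(0.0, pearson(cols[i], cols[j]))
             for j in range(n_states)] for i in range(n_states)]


def modularity(A, communities):
    """Q = (1/2m) * sum_ij [A_ij - s_i s_j / 2m] * delta(C_i, C_j)."""
    strength = [sum(row) for row in A]
    two_m = sum(strength)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            if communities[i] == communities[j]:
                q += a_ij - strength[i] * strength[j] / two_m
    return q / two_m


# Hypothetical topic counts: rows are days, columns are four states.
phi = [
    [5, 6, 1, 0],
    [9, 8, 1, 1],
    [2, 3, 7, 8],
    [1, 1, 9, 9],
]
A = adjacency(phi)
q_good = modularity(A, [0, 0, 1, 1])  # states grouped by similar time series
q_bad = modularity(A, [0, 1, 0, 1])   # states grouped against the pattern
print(round(q_good, 3), round(q_bad, 3))
```

A community detection routine such as Louvain would search over partitions for the one with the highest Q; here the grouping that follows the correlated states scores higher than the one that cuts across them.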
Types of Graph
1. What is degree?
2. Betweenness centrality?
3. Global/local network efficiency?
4. Modular structure
[Figure: example graphs with nodes 1 to 6: undirected binary graph, directed binary graph, directed weighted graph.]
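Adjacency matrix:
$A_{ij} = \begin{pmatrix}
0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}$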
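The first question on this slide, what degree is, can be illustrated directly from the adjacency matrix: a node's out-degree is its row sum and its in-degree is its column sum. The sketch below uses the matrix exactly as it appears in the slide text; the variable names are illustrative.

```python
# Adjacency matrix as transcribed from the slide (nodes 1..6 map to indices 0..5).
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
]

# Out-degree: number of edges leaving each node (row sums).
out_degree = [sum(row) for row in A]

# In-degree: number of edges entering each node (column sums).
in_degree = [sum(col) for col in zip(*A)]

# An undirected graph has a symmetric adjacency matrix.
n = len(A)
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

print(out_degree)    # [2, 3, 2, 1, 2, 1]
print(in_degree)     # [2, 2, 2, 1, 3, 1]
print(is_symmetric)  # False
```

In the transcribed matrix the entry for (2, 5) is 1 while the entry for (5, 2) is 0, so row and column sums differ for nodes 2 and 5. For an undirected graph the two sums would coincide and either one gives the degree.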
Network Analysis Ex.
• Co-authorship network formed by author list (Newman, PNAS 101 (2004) 5200-5205)
• Semantic network formed by free association (Steyvers, Cognitive Science 29 (2005) 41–78)
Conclusion
• The ratio of word frequency properly detected social hot topics of the day by identifying increasing or decreasing frequency of keywords in Twitter messages,
• while suppressing non-topic keywords such as frequently tweeted emotional words (e.g., lol, like, and love).
• The social topic detection method may be applied on a different time scale, e.g., hourly, monthly, or yearly.
• The geographic clustering based on a social topic appropriately reflected not only the pathway of the spring storm but also the properties of US geography.
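Changing the time scale only changes how timestamps are grouped before the ratio is computed. A minimal sketch of that grouping step follows, assuming each status has a parsed `datetime`; the format table, function name, and sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

# One strftime pattern per time scale; each pattern defines the bin key.
FORMATS = {
    "hourly": "%Y-%m-%d %H",
    "daily": "%Y-%m-%d",
    "monthly": "%Y-%m",
    "yearly": "%Y",
}


def bin_counts(timestamps, scale):
    """Count timestamps per bin at the requested time scale."""
    fmt = FORMATS[scale]
    return Counter(ts.strftime(fmt) for ts in timestamps)


# Hypothetical timestamps of statuses containing one keyword.
timestamps = [
    datetime(2013, 3, 31, 9, 15),
    datetime(2013, 3, 31, 9, 50),
    datetime(2013, 3, 31, 18, 5),
    datetime(2013, 4, 1, 8, 0),
]

print(bin_counts(timestamps, "hourly"))
print(bin_counts(timestamps, "daily"))
print(bin_counts(timestamps, "monthly"))
```

The binned counts can then be ordered by key and passed to the same ratio calculation used for daily data.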