Upload
nattiya-kanhabua
View
88
Download
0
Tags:
Embed Size (px)
Citation preview
Can Twitter & Co. Save Lives?
Nattiya Kanhabua, Avaré Stewart, Sara Romano
Ernesto Diaz-Aviles, Wolf Siberski, and Wolfgang Nejdl
L3S Research Center / Leibniz Universität Hannover, Germany
Research Seminar @MPII, Saarbrücken
22 October 2013
Motivation
• Numerous works use Twitter to infer the existence and magnitude of real-world events in real-time– Earthquake [Sakaki et al., 2010]– Predicting financial time series [Ruiz et al., 2012]– Influenza epidemics [Culotta, 2010; Lampos et al.,
2011; Paul et al., 2011]
Health related tweets
• User status updates or news related to public health are common in Twitter– I have the mumps...am I alone?
– my baby girl has a Gastroenteritis so great!! Please do not give it to meee
– #Cholera breaks out in #Dadaab refugee camp in #Kenya http://t.co/....
– As many as 16 people have been found infected with Anthrax in Shahjadpur upazila of the Sirajganj district in Bangladesh.
Data Collection• Official outbreak reports
– ~3,000 ProMED-mail reports from 2011– WHO reports have very small coverage
• Twitter data– ~1,200 health-related terms (i.e., infectious diseases,
their synonyms, pathogens and symptoms)– Over 112 millions of tweets from 2011
• Series of NLP tools including– OpenNLP (tokenization, sentence splitting, POS tagging)– OpenCalais (named entity recognition) – HeidelTime (temporal expression extraction)
Event Extraction
• An event is a sentence containing two entities– (1) medical condition and (2) geographic expression– A minimum requirement by domain experts
• A victim and the time of an event can be identified from the sentence itself, or its surrounding context
• Output: a set of event candidates
Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak
in Uganda since the beginning of July 2012
[Kanhabua et al., TAIA’ 12]
Message Filtering: Challenges• Ambiguity
– having several meanings– used in different contexts
• Incompleteness– missing or under-reported events– data processing errors
Message Filtering: Challenges• Ambiguity
– having several meanings– used in different contexts
• Incompleteness– missing or under-reported events– data processing errors
Category Example tweet
Literature A two hour train journey, Love In the Time of Cholera ...
Music Dengue Fever’s “Uku,” Mixed by Paul Dreux Smith Universal Audio...
Marketing Exclusive distributor of high quality #HIV/AIDS Blood & Urine and #Hepatitis #Self -testers.
General Identification of genotype 4 Hepatitis E virus binding proteins on swine liver cells: Hepatitis E virus...
Negative i dont have sniffles and no real coughing..well its coughing but not like an influenza cough.
Joke Thought I had Bieber Fever. Ends up I just had a combo of the mumps, mono, measles & the hershey squ...
Approach for Noisy Data
• MedISys1
– providing a list of negative keywords created by medical experts
• Urban Dictionary2
– a Web-based dictionary of slang, ethnic culture words or phrases
1http://medusa.jrc.it/medisys/homeedition/en/home.html2http://www.urbandictionary.com/
Approach for Noisy Data
• MedISys1
– providing a list of negative keywords created by medical experts
• Urban Dictionary2
– a Web-based dictionary of slang, ethnic culture words or phrases
1http://medusa.jrc.it/medisys/homeedition/en/home.html2http://www.urbandictionary.com/
Signal Generation: Challenges• Temporal Dynamics
– seasonal infectious diseases– rare and spontaneous outbreaks
• Location Dynamics– frequency and duration– levels of prevalence or severity
Signal Generation: Challenges• Temporal Dynamics
– seasonal infectious diseases– rare and spontaneous outbreaks
• Location Dynamics– frequency and duration– levels of prevalence or severity
[Rortais et al., 2010 in Journal of Food Research International]
Signal Generation: Challenges• Temporal Dynamics
– seasonal infectious diseases– rare and spontaneous outbreaks
• Location Dynamics– frequency and duration– levels of prevalence or severity
Signal Generation: Challenges• Temporal Dynamics
– seasonal infectious diseases– rare and spontaneous outbreaks
• Location Dynamics– frequency and duration– levels of prevalence or severity
[Emch et al., 2008 in International Journal of Health Geographics]
Temporal Diversity
• Refined Jaccard Index (RDJ-index)– average Jaccard similarity of all object pairs
• Note: lower RDJ corresponds to higher diversity• Problem: “All-Pair comparison”• Solution: Estimation algorithms with probabilistic
error bound guarantees[Deng et al., CIKM’
12]
ji
ji OOJSnn
RDJ ),()1(
2
nji 1
∩ UU
Jaccard similarity
Temporal Diversity
• Refined Jaccard Index (RDJ-index)– average Jaccard similarity of all object pairs
• Note: lower RDJ corresponds to higher diversity• Problem: “All-Pair comparison”• Solution: Estimation algorithms with probabilistic
error bound guarantees[Deng et al., CIKM’
12]
ji
ji OOJSnn
RDJ ),()1(
2
nji 1
∩ UU
Jaccard similarity
(1) Top-k terms(2) Entities
Approach• Personalized Tweet Ranking for Epidemic
Intelligence– Learning to rank and recommender systems– User's context as implicit criteria for recommendation
[Diaz-Aviles et al., ICWSM’ 12, Diaz-Aviles et al., WWW’
12]
Approach• Personalized Tweet Ranking for Epidemic
Intelligence– Learning to rank and recommender systems– User's context as implicit criteria for recommendation
[Diaz-Aviles et al., ICWSM’ 12, Diaz-Aviles et al., WWW’
12]
Conclusion
• Can Twitter & Co. Save Lives?– On a global level, we were able to generate
signals earlier than official reporting mechanisms.
– The ultimate answer depends on: how a health organization will use and react to information provided by our system.
Future Work
• Real-Time Analysis of Big and Fast Social Web Streams– Scalable, efficient methods for filtering and
generating signals in real-time
– Effective methods for aggregating and visualizing information in a meaningful way
Thank [email protected]
References• [Culotta, 2010] A. Culotta. Towards detecting influenza epidemics by analyzing twitter
messages. In Proceedings of the First Workshop on Social Media Analytics (SOMA’2010), 2010.
• [Diaz-Aviles et al., 2012a] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Towards personalized learning to rank for epidemic intelligence based on social media streams. In Proceedings of the 21st World Wide Web Conference (WWW ‘2012), 2012.
• [Diaz-Aviles et al., 2012b] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Epidemic intelligence for the crowd, by the crowd. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2012), 2012.
• [Kanhabua et al., 2012a] N. Kanhabua, Sara Romano, and A. Stewart, Identifying Relevant Temporal Expressions for Real-world Events, In SIGIR 2012 Workshop on Time-aware Information Access (TAIA'2012), 2012.
• [Kanhabua et al., 2012b] N. Kanhabua, Sara Romano, and A. Stewart and W. Nejdl. Supporting Temporal Analytics for Health Related Events in Microblogs. In Proceedings of CIKM'2012, 2012.
• [Kanhabua and Nejdl 2013] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time of Outbreaks. In Proceedings of the First International Web Observatory Workshop (WOW'2013) at WWW'2013, 2013.
• [Lampos et al., 2011] V. Lampos and N. Cristianini. Nowcasting events from the social web with statistical learning. ACM TIST, 3, 2011.
• [Paul et al., 2011] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2011), 2011.
• [Ruiz et al., 2012] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating financial time series with micro-blogging activity. In Proceedings of WSDM’2012, 2012.
• [Sakaki et al., 2010] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of WWW’2010, 2010.