Principal Investigator: Dr. Ming-Hsiang Tsou [email protected], (Geography), Co-PIs: Dr. Dipak K Gupta
(Political Science), Dr. Jean Marc Gawron (Linguistics), Dr. Brian Spitzberg (Communication), Dr. Li An (Geography).
San Diego State University, USA.
Mapping Ideas from Cyberspace to Realspace. Funded by NSF Cyber-
Enabled Discovery and Innovation (CDI) program. Award # 1028177.
(2010-2014) http://mappingideas.sdsu.edu/
Overview of the NSF-CDI project (Year-3) and Research Progress
• Ming-Hsiang (Ming) Tsou [email protected],
Professor (Geography), San Diego State University, PI of “Mapping Ideas” project.
Co-PIs:
Dr. Dipak K Gupta (Political Science), Dr. Jean Marc Gawron (Linguistics),
Dr. Brian Spitzberg (Communication), Dr. Li An (Geography)
Starting Date: October 1, 2010 (Four Years, $1.38M total)
Goal 1: Establish a new multidisciplinary
research framework to represent the
spatiotemporal diffusion of ideas and the semantic
web on the Internet.
Goal 2: Create effective visualization and
analysis methods for the dynamic geospatial
information landscape with three selected topics
(e.g. natural disasters, continuous threats for human
beings, and radical social movements).
Goal 3: Build domain-specific ontology, citation,
and (provocative) event knowledge bases with
thesaurus and citation networks for the three
selected topics and their Semantic Webs.
Goal 4: Develop theoretical model(s) capable of
integrating the individual (semantic usage, online
motivations) and societal (diffusion) motives and
practices associated with the spatiotemporal
diffusion of ideas.
Knowledge Discovery in Cyberspace (KDC)
Similar to the multidisciplinary research field, called “knowledge
discovery in databases (KDD)” (Fayyad et al. 1996), this emerging
research field, knowledge discovery in cyberspace (KDC), will
focus on how to handle and analyze very large information and
human messages collected from cyberspace and social media.
The purpose of KDC is to scale up our research capability of
handling millions of records and information items available in social
media (such as Twitter) or web pages (searched by Google, Yahoo, or
Bing search engines).
(Cited from: Ming-Hsiang Tsou & Michael Leitner (2013): Visualization of social media: seeing a
mirage or a message?, Cartography and Geographic Information Science, 40:2, 55-60 )
Goal 1: Establish a new multidisciplinary research framework
Triangular Knowledge Base (human centered); the three components are interdependent:
• Place (scale, space, context): San Diego, New York, 92119, SDSU, bus stops, Sea World…
• Time (dynamic): August 23, 2012 (snapshot), one week, two months, before / after, etc.
• Messages (content / function, who, what, how, media): tweets, web pages, emails, short messages
The Uniqueness of KDC
Cyberspace • Social Media (Twitter, Facebook, Flickr, Youtube)
• Web Pages, Weblogs, News, RSS, Emails, etc.
Target Data: • Tweets (keywords, regions, API types)
• Web Pages (keywords, web search engines)
Collected Data:
• SQL databases (tweet contents) • Excel files (web search results)
Preprocessed Data: • SQL-output tweets: remove errors and duplicates
• Geocoded Excel files: add lat/long
• Improve geolocation results.
Transformed Data: • Graphics, bar charts, word clouds, etc.
• Original point maps (each point represents one web page or one tweet).
Selection (research focus)
Collection (tools, APIs)
Preprocessing (noise reduction, data cleaning, region selection, time scale/series?)
Transformation (mapping + graphs)
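The preprocessing step listed above (removing erroneous and duplicated tweets) could be sketched roughly as follows. The record fields (`id`, `text`, `lat`, `lon`) and the validity rules are assumptions made for illustration; the slides do not describe the project's actual database schema or cleaning rules.

```python
def preprocess_tweets(records):
    """Drop malformed and duplicated tweet records (hypothetical schema)."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        tweet_id = rec.get("id")
        text = (rec.get("text") or "").strip()
        # Assumed rule: a record without an id or without text is an error.
        if tweet_id is None or not text:
            continue
        # Keep only the first occurrence of each tweet id.
        if tweet_id in seen_ids:
            continue
        lat, lon = rec.get("lat"), rec.get("lon")
        # Coordinates are optional, but must be in range when present.
        if lat is not None and lon is not None:
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                continue
        seen_ids.add(tweet_id)
        cleaned.append({**rec, "text": text})
    return cleaned


# Made-up sample records, not project data.
sample_records = [
    {"id": 1, "text": "Flu shots today ", "lat": 32.7, "lon": -117.1},
    {"id": 1, "text": "Flu shots today ", "lat": 32.7, "lon": -117.1},  # duplicate
    {"id": 2, "text": "", "lat": 40.7, "lon": -74.0},                   # empty text
    {"id": 3, "text": "Election night", "lat": 123.0, "lon": 0.0},      # bad latitude
    {"id": 4, "text": "Ice Age was fun"},                               # no coordinates
]
cleaned = preprocess_tweets(sample_records)
print([rec["id"] for rec in cleaned])
```

Keeping the first occurrence of each id is one possible policy; a real pipeline might instead keep the most recently collected copy.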
KDC: Knowledge Discovery in Cyberspace (7 steps)
Visualized Data: •Kernel Density maps, Differential KD maps, Point Density Maps. •Excel files (Web Search Results)
Pattern Recognition: • Decision Trees and Rules
• Nonlinear Regression and Classification Methods
• Example-based Methods (nearest-neighbor classification)
• Probabilistic Graphic Dependency Models
• Relational Learning Models
Knowledge Formalization: • Verification (City Mayor Maps, Movie tweets)
• Discovery
• Prediction (Election)
• Description (Outbreaks, election)
Explore/Compare Methods (select algorithms)
Information Mining (analyze space-time-information relationships)
Interpretation / Evaluation
All seven steps are systematic, algorithm-based procedures.
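As one illustration of the "Visualized Data" stage, the sketch below computes a Gaussian kernel density surface for two point sets and subtracts one from the other. It assumes that a "differential KD map" is the cell-by-cell difference of two normalized density surfaces; the slides do not give the exact formula, kernel, or bandwidth used, so the function names and parameters here are hypothetical.

```python
import math


def kernel_density(points, grid, bandwidth):
    """Gaussian kernel density estimate at each grid location.

    points and grid are lists of (x, y) tuples in planar units.
    """
    if not points:
        return [0.0] * len(grid)
    # Normalizing by the point count makes surfaces from different-sized
    # samples comparable.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    densities = []
    for gx, gy in grid:
        total = 0.0
        for px, py in points:
            dist_sq = (gx - px) ** 2 + (gy - py) ** 2
            total += math.exp(-dist_sq / (2.0 * bandwidth ** 2))
        densities.append(norm * total)
    return densities


def differential_density(points_a, points_b, grid, bandwidth):
    """Cell-by-cell difference of two density surfaces (assumed definition)."""
    dens_a = kernel_density(points_a, grid, bandwidth)
    dens_b = kernel_density(points_b, grid, bandwidth)
    return [a - b for a, b in zip(dens_a, dens_b)]


# Made-up coordinates: keyword A clusters near x=0, keyword B near x=10.
grid = [(0.0, 0.0), (10.0, 0.0)]
points_a = [(0.0, 0.0), (1.0, 0.0)]
points_b = [(10.0, 0.0), (9.0, 0.0)]
diff = differential_density(points_a, points_b, grid, bandwidth=2.0)
print(diff)
```

Positive cells are places where the first point set is relatively denser, and negative cells are places where the second one is.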
Real World
Cyber Information Space (BIG DATA) (Web Pages, Social Media, Weblogs, Forums, News)
Information Mining Tools
CyberDiscovery Tools
• Yahoo API
• Bing API
• Google API
Twitter_GeoSearch_Tool
• Search API
• Streaming API
VISION Ontological Analysis Platform WHO, Where, When, What - WHY? (Networks)
Spatial Visualization
Tools / Methods
Computational Linguistics Analysis
Tools / Methods
Place – Time – Messages
(content/functions)
New Theories (explanation) , New Models (simulation), New Knowledge
VISION: “V”isualizing “I”nformation “S”pace “I”n “O”ntological “N”etworks
Information Communication Channels in Cyberspace
• Web Pages ( Semi-Public Information Communication )
• Social Media (Twitter: Semi-Private Information Communication)
Web Pages: Use Web Search
Engines (Google, Yahoo, and Bing) to
retrieve up to 1,000 web pages per
keyword. Then analyze their contents
associated with their ranks and
geolocations.
Social Media (Tweets): Use Twitter
APIs to retrieve tweets based on
Keywords or #Hashtag and
geolocations (self-defined home-towns
or GPS locations).
Collect Web Page Contents, Ranks, and Locations: We developed the Cyber-Discovery Search Engine (retrieves up to 1,000 results from Yahoo or Bing).
Twitter – Spatial search API
Center: 41.961295, -93.281859
Radius: 180 miles
Search API Limitations:
1. A spatial search can only trace back up to seven days. (A regular search can trace back up to 14 days.)
2. Each search cannot return more than 1,500 tweets.
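To make the center-plus-radius idea concrete, here is a small local filter that keeps only points within 180 miles of the center given above. It does not call the Twitter API; it only illustrates the geometry of a radius search using the haversine great-circle distance. The sample tweets and their field names (`lat`, `lon`, `text`) are made up for this sketch.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(tweets, center, radius_miles):
    """Keep tweets whose coordinates fall inside the search circle."""
    lat0, lon0 = center
    return [t for t in tweets
            if haversine_miles(lat0, lon0, t["lat"], t["lon"]) <= radius_miles]


CENTER = (41.961295, -93.281859)  # center from the slide
RADIUS_MILES = 180                # radius from the slide

# Hypothetical geotagged tweets.
sample_tweets = [
    {"text": "caucus tonight", "lat": 41.5868, "lon": -93.6250},    # Des Moines
    {"text": "beach day", "lat": 32.7157, "lon": -117.1611},        # San Diego
]
nearby = within_radius(sample_tweets, CENTER, RADIUS_MILES)
print([t["text"] for t in nearby])
```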
Twitter APIs:
• REST API
• Stream API
• Search API
Web Pages Search Results vs. Tweets
Web page visualization maps (using Google or Yahoo search engine results, with web page IP addresses converted into lat/long using MaxMind lookup tables).
IP geolocation gives the “registration location of the Web server” (not the physical location of the machines). (What is the veracity of the geolocation?)
Comparison between the Bing and Yahoo engines (“Jerry Sanders” keyword; % in 12 different web page categories defined by our team members):
Bing search: more commercial, informational (wiki), and social media pages.
Yahoo search: more blogs, news, and educational pages.
(In general, though, the two engines show some similarities.)
Classifying different types of web pages and social media for content and linguistic analysis.
Spatial accuracy of web page categories based on IP address geo-conversion:
Highest: Educational 73.86%, Social Media 68.97%, Government 60.98%
Lowest: Blog 10.81%, Special Interest Group 12.81%, NGO 20.93%
Geolocation Accuracy in Different Keywords
• GREEN (correct) • Blue (incorrect) • Gray (unknown)
• Highest Spatial Accuracy • McGinn – 33.57%
• Lowest Spatial Accuracy • Santorum – 21.29%
• Highest N/A • Flu – 35.52%
Web Page Information Landscape (2012 Presidential Election)
Ming-Hsiang Tsou, Jiue-An Yang, Daniel Lusher, Su Han, Brian Spitzberg, Jean Mark Gawron, Dipak Gupta & Li An (2013): Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): a case study in 2012 US Presidential Election, Cartography and Geographic Information Science, DOI:10.1080/15230406.2013.799738
http://mappingideas.sdsu.edu/mapshowcase/election/webpage/election3.html
1) Select 30 major U.S. cities and, within a 17-mile radius of each, collect tweets with movie keywords (TED, Spider Man, etc.).
2) Compare the daily movie box office results with the number of tweets containing each movie keyword.
Twitter Case Study #1: 2012 Summer Comparing FIVE Movie Tweets & Box Office
Movie         Daily    Weekly
TED           0.8826   0.9989
Spider-Man    0.9409   0.9725
Ice Age       0.8895   0.9528
Dark Knight   0.9523   0.9375
Step Up       0.8931   0.8123
Five Movies Correlation Test
Daily: Daily_Tweets vs. Daily_Box_Revenue
Weekly: 8_to_13_days_before, one_week_before, release_day,
one_week_after, two_weeks_after, three_weeks_after,
four_weeks_after
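A correlation test of this kind can be sketched as below. The slides do not say which coefficient was used; this example assumes Pearson's r between a tweet-count series and a box-office series of equal length. The numbers in `daily_tweets` and `daily_box_revenue` are invented placeholders, not the project's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder series (one value per day); NOT real tweet or revenue counts.
daily_tweets = [1200, 5400, 4800, 3100, 1500, 1300, 1100]
daily_box_revenue = [0.5e6, 20.0e6, 18.5e6, 12.0e6, 5.0e6, 4.2e6, 3.9e6]

r = pearson_r(daily_tweets, daily_box_revenue)
print(round(r, 4))
```

The same function applies to the weekly comparison by passing one aggregated value per period (8 to 13 days before, one week before, release day, and so on) instead of one value per day.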
[Figure: Tweet_Daily and Box_Daily (TED). Horizontal axis: days relative to release (-13 to 28). Vertical axes: Box Office Revenue (0 to 25,000,000) and Tweets (0 to 60,000). Series: Tweet_Daily, Box_Daily. Markers: Release, One Week, Two Week.]
[Figure: Tweet_Weekly and Box_Weekly (TED). Horizontal axis: Week (releasing day as 0), from -2 to 4. Vertical axes: Box Office Revenue (0 to 90,000,000) and Tweets (0 to 180,000). Series: Box_weekly, Tweets_weekly.]
Case Study #2: 2012 Presidential Election (Tweets)
Sentiment Analysis (case study: 2012 Presidential Election), Before / After Hurricane Sandy
[Figure panels: Before Hurricane Sandy; After Hurricane Sandy]
User Behavior Analysis: Tweeting Pattern – by Hours
[Figure: Total Tweets (Z), 0 to 1.2, by Hour in Day (0 to 23). Series: President Election, Public Health, Environmental, Movie.]
Location Consistency
Consistent: 60%
Non-consistent: 20%
N/A: 20%
Locations are aggregated by States
Geo-tagged Location vs. Self-report Location
Domain GPS Tweet %
Elections 1.018
Health 4.360
Environmental 1.315
Movie 2.875
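The location-consistency comparison (geo-tagged location vs. self-reported location, aggregated by state) could be computed along these lines. The field names `gps_state` and `profile_state` are hypothetical, and the sketch assumes both locations have already been resolved to state codes, with `None` where no state could be determined.

```python
from collections import Counter


def location_consistency(tweets):
    """Return the percentage of tweets in each consistency class."""
    counts = Counter()
    for tweet in tweets:
        gps = tweet.get("gps_state")
        profile = tweet.get("profile_state")
        if not gps or not profile:
            # One of the two locations could not be resolved to a state.
            counts["n/a"] += 1
        elif gps.strip().upper() == profile.strip().upper():
            counts["consistent"] += 1
        else:
            counts["non-consistent"] += 1
    total = sum(counts.values())
    return {
        label: (100.0 * counts[label] / total if total else 0.0)
        for label in ("consistent", "non-consistent", "n/a")
    }


# Made-up sample, chosen only to exercise all three classes.
sample = [
    {"gps_state": "CA", "profile_state": "CA"},
    {"gps_state": "CA", "profile_state": "ca "},
    {"gps_state": "NY", "profile_state": "NY"},
    {"gps_state": "TX", "profile_state": "FL"},
    {"gps_state": "IA", "profile_state": None},
]
shares = location_consistency(sample)
print(shares)
```

Comparing at the state level tolerates small disagreements, such as a user tweeting from a neighboring city, that a finer-grained comparison would count as inconsistent.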
THE INNOVATION DIFFUSION MODEL (E. Rogers, 1962): Diffusion: “a special type of communication concerned with the spread of messages that are perceived as new ideas” (E. Rogers, p. 35).
Torsten Hägerstrand: 1953, doctoral dissertation, Culture Diffusion; 1967, Innovation Diffusion as a Spatial Process (book).
Developing new communication theories for cyberspace
social networks: DIFFUSION OF INNOVATIONS
Public (Mass Media, TV, News)
vs.
Private (Personal Communications)
Publication (2013): one special issue in the CaGIS journal; five refereed journal papers.
Human Dynamics
Mobile Technology
Spatial Science
SDSU: Human Dynamics in the Mobile Age
(HDMA) A Proposal for an Area of Research Excellence at San Diego State University, 2013.
Five Core Faculty: • Ming-Hsiang Tsou (Coordinator of HDMA, Professor of Geography) 2012 SDSU President's
Leadership Fund (PLF) Awards for Excellence.
• Sheldon Zhang (Chair of Sociology, Professor), expert in Human Trafficking, Transnational Organized
Crime, Juvenile Delinquency, Probation and Parole, etc.
• John Elder (Distinguished Professor of Public Health, the Director of IBACH) 2012 SDSU President's
Leadership Fund (PLF) Awards for Excellence, expert in randomized trials for chronic disease prevention
in Latino communities.
• Piotr Jankowski (Incoming Chair of Geography), expert in Spatial Decision Support Systems.
• Brian Spitzberg (Senate Distinguished Professor, School of Communication), expert in communication
theories, communicative (in)competence, obsessive relational intrusion.
along with 29 other funding faculty members from 14 different departments in 5 colleges.
Four new faculty lines will be hired in 2014 and 2015.
(Assistant or Associate Professor levels)
• One faculty line for Graduate School of Public Health is suggested for the Institute for Behavioral and Community Health [IBACH] to focus on Behavior-Environmental interactions in Community Health. (2014)
• One faculty line is suggested for the Department of Geography to focus on spatio-temporal data analytics. (2014)
• One faculty line is suggested for the Department of Sociology to focus on Urban Sociology with demography and spatial analysis techniques. (2015)
• One faculty line is suggested for the School of Communication to focus on strategic communication with Big Data and Social Media. (2015).