Principal Investigator: Dr. Ming-Hsiang Tsou (Geography). Co-PIs: Dr. Dipak K Gupta (Political Science), Dr. Jean Marc Gawron (Linguistics), Dr. Brian Spitzberg (Communication), Dr. Li An (Geography). San Diego State University, USA. Mapping Ideas from Cyberspace to Realspace. Funded by the NSF Cyber-Enabled Discovery and Innovation (CDI) program. Award # 1028177. (2010-2014) http://mappingideas.sdsu.edu/

Overview of the NSF-CDI project (Year-3) and Research Progress

Ming-Hsiang (Ming) Tsou, Professor (Geography), San Diego State University, PI of “Mapping Ideas” project.




Page 2

Starting Date: October 1, 2010 (Four Years, $1.38M total)

Goal 1: Establish a new multidisciplinary research framework to represent the spatiotemporal diffusion of ideas and the semantic web on the Internet.

Goal 2: Create effective visualization and analysis methods for the dynamic geospatial information landscape with three selected topics (e.g., natural disasters, continuous threats for human beings, and radical social movements).

Goal 3: Build domain-specific ontology, citation, and (provocative) event knowledge bases with thesaurus and citation networks for the three selected topics and their Semantic Webs.

Goal 4: Develop theoretical model(s) capable of integrating the individual (semantic usage, online motivations) and societal (diffusion) motives and practices associated with the spatiotemporal diffusion of ideas.

Page 3

Goal 1: Establish a new multidisciplinary research framework

Knowledge Discovery in Cyberspace (KDC)

Similar to the multidisciplinary research field called “knowledge discovery in databases (KDD)” (Fayyad et al. 1996), this emerging research field, knowledge discovery in cyberspace (KDC), will focus on how to handle and analyze very large information and human messages collected from cyberspace and social media.

The purpose of KDC is to scale up our research capability of handling millions of records and information items available in social media (such as Twitter) or web pages (searched by Google, Yahoo, or Bing search engines).

(Cited from: Ming-Hsiang Tsou & Michael Leitner (2013): Visualization of social media: seeing a mirage or a message?, Cartography and Geographic Information Science, 40:2, 55-60)

Page 4

The Uniqueness of KDC

Triangular Knowledge Base (Human centered). The three components are interdependent:

• Place (scale, space, context): San Diego, New York, 92119, SDSU, Bus Stops, Sea World…
• Time (dynamic): August 23, 2012 (snapshot), one week, two months, before / after, etc.
• Messages (content / function, who, what, how, media): tweets, web pages, emails, short messages

Page 5

KDC: Knowledge Discovery in Cyberspace (7 steps)

Data stages:

Cyberspace:
• Social Media (Twitter, Facebook, Flickr, YouTube)
• Web Pages, Weblogs, News, RSS, Emails, etc.

Target Data:
• Tweets (keywords, regions, API types)
• Web Pages (keywords, web search engines)

Collected Data:
• SQL databases (Tweet contents)
• Excel files (Web Search Results)

Preprocessed Data:
• SQL-output-Tweets: remove errors and duplicates
• Geocoded Excel files: add lat/long
• Improving geolocation results

Transformed Data:
• Graphics, Bar charts, WordCloud, etc.
• Original Point Maps (each point represents one web page or one tweet)

Steps 1 to 4:
1. Selection (research focus)
2. Collection (Tools, APIs)
3. Preprocessing (reduce noise, data cleaning, select regions, time scale/series ???)
4. Transformation (mapping + graphs)
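The preprocessing stage on this slide (removing errors and duplicated tweets from the SQL output) could look roughly like the sketch below. It is only an illustration: the row layout (`id`, `text` keys) and the rule for what counts as an "error" row are assumptions, not the project's actual schema or cleaning rules.

```python
def preprocess_tweets(rows):
    """Drop malformed rows and duplicate tweets, keeping the first occurrence.

    Assumption: each row is a dict with hypothetical keys "id" and "text".
    A row with a missing id or empty text is treated as an error row.
    """
    seen_ids = set()
    cleaned = []
    for row in rows:
        tweet_id = row.get("id")
        text = (row.get("text") or "").strip()
        if tweet_id is None or not text:
            continue  # error row: no id or no content
        if tweet_id in seen_ids:
            continue  # duplicate of a tweet already kept
        seen_ids.add(tweet_id)
        cleaned.append({"id": tweet_id, "text": text})
    return cleaned


# Made-up sample rows, for illustration only.
sample = [
    {"id": 1, "text": "Flu season is here"},
    {"id": 1, "text": "Flu season is here"},   # duplicate
    {"id": 2, "text": "   "},                  # empty text
    {"id": None, "text": "orphan row"},        # missing id
    {"id": 3, "text": "Election day"},
]
print(preprocess_tweets(sample))
```

Keying duplicates on the tweet id rather than on the text keeps retweets with identical wording, which may or may not be what a given study wants.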

Page 6

Transformed Data (continued from last page):
• Graphics, Bar charts, WordCloud, etc.
• Original Point Maps (each point represents one web page or one tweet)

Visualized Data:
• Kernel Density maps, Differential KD maps, Point Density Maps
• Excel files (Web Search Results)

Pattern Recognition:
• Decision Trees and Rules
• Nonlinear Regression and Classification Methods
• Example-based Methods (nearest-neighbor classification)
• Probabilistic Graphic Dependency Models
• Relational Learning Models

Knowledge Formalization:
• Verification (City Mayor Maps, Movie tweets)
• Discovery
• Prediction (Election)
• Description (Outbreaks, election)

Steps 5 to 7:
5. Explore/Compare Methods (Select algorithms)
6. Information Mining (Analyze Space-Time-Information relationships)
7. Interpretation / Evaluation

All seven steps are systematic, algorithm-based procedures.

Page 7

“V”isualizing “I”nformation “S”pace “I”n “O”ntological “N”etworks (VISION)

Real World

Cyber Information Space (BIG DATA) (Web Pages, Social Media, Weblogs, Forums, News)

Information Mining Tools:
• CyberDiscovery Tools: Yahoo API, Bing API, Google API
• Twitter_GeoSearch_Tool: Search API, Streaming API

VISION Ontological Analysis Platform: WHO, Where, When, What - WHY? (Networks)
• Spatial Visualization Tools / Methods
• Computational Linguistics Analysis Tools / Methods
• Place – Time – Messages (content/functions)

New Theories (explanation), New Models (simulation), New Knowledge

Page 8

Information Communication Channels in Cyberspace

• Web Pages (Semi-Public Information Communication)
• Social Media (Twitter: Semi-Private Information Communication)

Web Pages: Use Web Search Engines (Google, Yahoo, and Bing) to retrieve up to 1,000 web pages per keyword. Then analyze their contents in relation to their ranks and geolocations.

Social Media (Tweets): Use Twitter APIs to retrieve tweets based on keywords or #hashtags and geolocations (self-defined hometowns or GPS locations).

Page 9

Collect Web Page Contents, Ranks, and Locations: We develop a Cyber-Discovery Search Engine (retrieving up to 1,000 results from Yahoo or Bing).

Page 10

Twitter – Spatial search API

Center: 41.961295, -93.281859
Radius: 180 miles

Search API Limitations:
1. Spatial search can only trace back up to seven days. (Regular search can trace back up to 14 days.)
2. Each set of search results cannot exceed 1,500 tweets.

Twitter APIs:
• REST API
• Stream API
• Search API
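To make the center-plus-radius idea concrete, the sketch below checks locally whether a coordinate falls inside the 180-mile circle around the center given on this slide. It uses the haversine great-circle formula; this is an illustration of the geometry only and makes no claim about how Twitter's spatial search matches tweets internally. The function names and the Earth-radius constant are choices made for this example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(lat, lon, center=(41.961295, -93.281859), radius_miles=180.0):
    """True if (lat, lon) lies inside the search circle from the slide."""
    return haversine_miles(center[0], center[1], lat, lon) <= radius_miles


# Des Moines, Iowa is close to the center; Chicago is well outside 180 miles.
print(within_radius(41.5868, -93.6250))
print(within_radius(41.8781, -87.6298))
```

A local check like this is useful for re-filtering collected tweets when the stored location comes from a self-reported hometown that was geocoded after collection.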

Page 11

Web Pages Search Results vs. Tweets

Page 12

Web Page Visualization maps (using Google or Yahoo search engine results, converting Web Page IP addresses into Lat/Lon with MaxMind lookup tables).

IP geolocation: the “registration location of Web server” (not the physical location of machines). (What is the veracity of geolocation?)
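The IP-to-coordinate conversion described above amounts to a range lookup. The sketch below shows the idea with Python's standard `ipaddress` module and a tiny hand-made table. The table is hypothetical: it uses documentation-reserved address blocks and placeholder coordinates, and it does not reproduce MaxMind's file format or API.

```python
import ipaddress

# Hypothetical lookup table: (network block, (lat, lon)).
# The blocks are reserved documentation ranges; the coordinates are placeholders.
LOOKUP_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), (32.7757, -117.0719)),
    (ipaddress.ip_network("198.51.100.0/24"), (40.7128, -74.0060)),
]


def geolocate_ip(ip_string, table=LOOKUP_TABLE):
    """Return the (lat, lon) registered for the block containing the IP, or None."""
    ip = ipaddress.ip_address(ip_string)
    for network, latlon in table:
        if ip in network:
            return latlon
    return None  # unknown: the address is not covered by the table


print(geolocate_ip("192.0.2.17"))
print(geolocate_ip("203.0.113.5"))
```

Because the table records where an address block is registered, every page hosted in that block maps to the same point, which is one reason the accuracy question raised on the slide matters.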

Page 13

Classifying different types of web pages and social media for content and linguistic analysis

COMPARE between the Bing engine and the Yahoo engine (“Jerry Sanders” keyword: % in 12 different web page categories defined by our team members):

Bing search: more commercial and informational (wiki) pages, and social media.
Yahoo search: more blogs and news, and educational pages.
(But in general, there are some similarities between the two engines.)

Page 14

Spatial Accuracy of Web Page Categories based on the IP address geo-conversion.

Highest:
• Educational: 73.86%
• Social Media: 68.97%
• Government: 60.98%

Lowest:
• Blog: 10.81%
• Special Interest Group: 12.81%
• NGO: 20.93%

Page 15

Geolocation Accuracy in Different Keywords

Legend: • Green (correct) • Blue (incorrect) • Gray (unknown)

• Highest Spatial Accuracy: McGinn – 33.57%
• Lowest Spatial Accuracy: Santorum – 21.29%
• Highest N/A: Flu – 35.52%

Page 16

Web Page Information Landscape (2012 Presidential Election)

Ming-Hsiang Tsou, Jiue-An Yang, Daniel Lusher, Su Han, Brian Spitzberg, Jean Mark Gawron, Dipak Gupta & Li An (2013): Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): a case study in 2012 US Presidential Election, Cartography and Geographic Information Science, DOI:10.1080/15230406.2013.799738

Page 17

http://mappingideas.sdsu.edu/mapshowcase/election/webpage/election3.html

Page 18

Twitter Case Study #1: 2012 Summer
Comparing FIVE Movie Tweets & Box Office

1) Select 30 major U.S. cities, each with a 17-mile radius, and collect tweets with movie keywords (TED, Spider Man, etc.).
2) Compare the daily movie box office results with the number of tweets containing each movie keyword.

Page 19

Five Movies Correlation Test

Movie         Daily    Weekly
TED           0.8826   0.9989
Spider-Man    0.9409   0.9725
Ice Age       0.8895   0.9528
Dark Knight   0.9523   0.9375
Step Up       0.8931   0.8123

Daily: Daily_Tweets vs. Daily_Box_Revenue
Weekly: 8_to_13_days_before, one_week_before, release_day, one_week_after, two_weeks_after, three_weeks_after, four_weeks_after
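The slide does not say which correlation coefficient was used. Assuming Pearson's r, a daily tweets-versus-revenue comparison could be computed as in the sketch below. The two series are invented for illustration and are not the project's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily counts around a release day (not real data).
daily_tweets = [1200, 1500, 9800, 7600, 5200, 3100, 2500]
daily_box_revenue = [0, 0, 20_000_000, 17_500_000, 12_000_000, 6_000_000, 4_800_000]
print(round(pearson_r(daily_tweets, daily_box_revenue), 4))
```

The weekly variant would apply the same function after summing both series into the week bins listed on the slide.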

Page 20

[Figure: Tweet_Daily and Box_Daily (TED). Series: Tweet_Daily, Box_Daily. Axes: Box Office Revenue (0 to 25,000,000) and Tweets (0 to 60,000), plotted by day relative to release (-13 to 28). Markers: Release, One Week, Two Week.]

Page 21

[Figure: Tweet_Weekly and Box_Weekly (TED). Series: Box_weekly, Tweets_weekly. Axes: Box Office Revenue (0 to 90,000,000) and Tweets (0 to 180,000), plotted by week (releasing day as 0; -2 to 4).]

Page 23

Case Study #2: 2012 Presidential Election (Tweets)

[Figure panels: Before Hurricane Sandy; After Hurricane Sandy]

Page 24

Sentiment Analysis (case study: 2012 Presidential Election) (Before / After Hurricane Sandy)

Page 25

User Behavior Analysis: Tweeting Pattern – by Hours

[Figure: Total Tweets (Z), 0 to 1.2, by Hour in Day (0 to 23). Series: President Election, Public Health, Environmental, Movie.]

Page 26

Location Consistency

Geo-tagged Location vs. Self-report Location (locations are aggregated by States):
• Consistent: 60%
• Non-consistent: 20%
• N/A: 20%

Domain          GPS Tweet %
Elections       1.018
Health          4.360
Environmental   1.315
Movie           2.875
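A state-level consistency check between a tweet's geo-tagged location and the user's self-reported location could be sketched as follows. The labels mirror the three categories on the slide; the input format (pairs of state abbreviations, with `None` when either location could not be resolved) and the sample pairs are assumptions made for this example.

```python
from collections import Counter

LABELS = ("Consistent", "Non-consistent", "N/A")


def classify_consistency(geotag_state, profile_state):
    """Compare two state codes; 'N/A' when either one is missing."""
    if not geotag_state or not profile_state:
        return "N/A"
    if geotag_state.strip().upper() == profile_state.strip().upper():
        return "Consistent"
    return "Non-consistent"


def consistency_shares(pairs):
    """Return the share of each label over a list of (geotag, profile) state pairs."""
    counts = Counter(classify_consistency(g, p) for g, p in pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in LABELS}


# Hypothetical pairs, not project data.
pairs = [("CA", "ca"), ("CA", "NY"), ("TX", None), ("CA", "CA"), ("NY", "NY")]
print(consistency_shares(pairs))
```

Aggregating to states before comparing tolerates small GPS offsets and vague hometown strings, at the cost of hiding within-state moves.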

Page 27

Developing new communication theories for cyberspace social networks: DIFFUSION OF INNOVATIONS

THE INNOVATION DIFFUSION MODEL (E. Rogers, 1962): Diffusion: “a special type of communication concerned with the spread of messages that are perceived as new ideas” (E. Rogers, p. 35).

Torsten Hägerstrand: 1953, Doctoral Dissertation (Culture Diffusion); 1967, Innovation Diffusion as a Spatial Process (book).

Public (Mass Media, TV, News) vs. Private (Personal Communications)

Page 28

Publication (2013)

• One special issue in CaGIS journal
• Five refereed journal papers

Page 29

SDSU: Human Dynamics in the Mobile Age (HDMA)

A Proposal for an Area of Research Excellence at San Diego State University, 2013.

Human Dynamics, Mobile Technology, Spatial Science

Five Core Faculty:
• Ming-Hsiang Tsou (Coordinator of HDMA, Professor of Geography), 2012 SDSU President's Leadership Fund (PLF) Awards for Excellence.
• Sheldon Zhang (Chair of Sociology, Professor), expert in Human Trafficking, Transnational Organized Crime, Juvenile Delinquency, Probation and Parole, etc.
• John Elder (Distinguished Professor of Public Health, Director of IBACH), 2012 SDSU President's Leadership Fund (PLF) Awards for Excellence, expert in randomized trials for chronic disease prevention in Latino communities.
• Piotr Jankowski (Incoming Chair of Geography), expert in Spatial Decision Support Systems.
• Brian Spitzberg (Senate Distinguished Professor, School of Communication), expert in communication theories, communicative (in)competence, obsessive relational intrusion.

along with 29 other funding faculty members from 14 different departments in 5 colleges.

Page 30

Four new faculty lines will be hired in 2014 and 2015.

(Assistant or Associate Professor levels)

• One faculty line for Graduate School of Public Health is suggested for the Institute for Behavioral and Community Health [IBACH] to focus on Behavior-Environmental interactions in Community Health. (2014)

• One faculty line is suggested for the Department of Geography to focus on spatio-temporal data analytics. (2014)

• One faculty line is suggested for the Department of Sociology to focus on Urban Sociology with demography and spatial analysis techniques. (2015)

• One faculty line is suggested for the School of Communication to focus on strategic communication with Big Data and Social Media. (2015)