Mobility analysis from Twitter data
NTTS 2015 - satellite Workshop on Big Data
Twitter as data source
NoSQL Database
Filter by: Geo-referenced Only
México
Real-time Tweets
INEGI
Twitter
Why Twitter?
• Availability
• 1% of tweets available without cost
• Around 12 M accounts in Mexico
• 700,000 accounts are geo-referenced
• 150 M tweets collected since January 2014
Devices generating tweets in Mexico
Android
iPhone
Tweet collection infrastructure
Unix “Red Hat”
NoSQL Database “Elasticsearch”
Cluster (Hydra)
Big Data Layers
Proof of Concept
General Process
Every Day Collection
Store Geo-Referenced Tweets
15M
?
Set an Objective
Filter and Process
Generate outputs
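The "filter" step of this process can be sketched as follows. This is a minimal illustration, not INEGI's actual pipeline: it assumes tweets arrive as one JSON object per line and carry the `coordinates` field of the classic Twitter tweet object (a point stored as `[longitude, latitude]`, or null when the tweet has no exact location). The function name `geo_referenced` and the sample records are hypothetical.

```python
import json


def geo_referenced(lines):
    """Yield (tweet_id, lon, lat) for tweets that carry an exact point location."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        tweet = json.loads(line)
        coords = tweet.get("coordinates")
        # Tweets without a location have "coordinates": null
        if coords and coords.get("type") == "Point":
            lon, lat = coords["coordinates"]  # order is longitude, latitude
            yield tweet["id"], lon, lat


# Hypothetical sample records, for illustration only
sample = [
    '{"id": 1, "text": "hola", "coordinates": {"type": "Point", "coordinates": [-99.13, 19.43]}}',
    '{"id": 2, "text": "sin ubicacion", "coordinates": null}',
]
points = list(geo_referenced(sample))
print(points)
```

Only the first sample record survives the filter; the retained tuples are what would then be stored and processed.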
Topics
• Mobility
– Internal flows
– Tourism
– Border commuting
– National Roads Network: use of roads (planned)
– Urban influence zones (planned)
• Subjective wellness
– Based on text
– Based on emoticons
Geo-referenced Tweets 2014
DF
Internal mobility (from-to)
México State
To Mexico City
From Mexico City
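One possible way to derive from-to flows like these is sketched below. The slides do not state the method used, so this rests on a loud assumption: each user's "home" is the region where they tweet most often, and a flow is counted once per user for every other region they tweet from. The names `home_region` and `od_flows` and the sample observations are hypothetical.

```python
from collections import Counter, defaultdict


def home_region(regions):
    """Assumed home: the region where the user tweets most often."""
    return Counter(regions).most_common(1)[0][0]


def od_flows(observations):
    """observations: iterable of (user, region) pairs.

    Returns a Counter mapping (home, visited) to the number of users.
    """
    by_user = defaultdict(list)
    for user, region in observations:
        by_user[user].append(region)

    flows = Counter()
    for regions in by_user.values():
        home = home_region(regions)
        # Count each visited region once per user, excluding the home region
        for visited in set(regions) - {home}:
            flows[(home, visited)] += 1
    return flows


# Hypothetical observations, for illustration only
obs = [
    ("a", "México State"), ("a", "México State"), ("a", "DF"),
    ("b", "DF"), ("b", "DF"), ("b", "Puebla"),
    ("c", "DF"),
]
flows = od_flows(obs)
for (origin, destination), users in flows.items():
    print(origin, "->", destination, users)
```

Counting users rather than tweets keeps very active accounts from dominating a flow; other definitions of home (for example, night-time location) would be equally plausible.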
Where do we go when tweeting?
Internal Tourism
Origin of tourists visiting Guanajuato (1-3 February 2014)
Internal Tourism
Origin of tourists visiting Puebla (1-3 February 2014)
Use of Twitter in long weekends
Trips to Puebla and Guanajuato before, during and after the 1-3 February period
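A comparison of this kind can be sketched by counting distinct users seen at a destination in each period. This is an illustrative sketch only: the observation format `(user, day, region)`, the helper names, and the sample data are hypothetical; only the 1-3 February 2014 window comes from the slides.

```python
from collections import defaultdict
from datetime import date

# Long-weekend window taken from the slide
START, END = date(2014, 2, 1), date(2014, 2, 3)


def period(day):
    """Classify a date relative to the long weekend."""
    if day < START:
        return "before"
    if day > END:
        return "after"
    return "during"


def visitors_by_period(observations, destination):
    """Count distinct users tweeting from `destination` in each period."""
    users = defaultdict(set)
    for user, day, region in observations:
        if region == destination:
            users[period(day)].add(user)
    return {p: len(u) for p, u in users.items()}


# Hypothetical observations, for illustration only
obs = [
    ("a", date(2014, 1, 30), "Puebla"),
    ("b", date(2014, 2, 1), "Puebla"),
    ("c", date(2014, 2, 2), "Puebla"),
    ("c", date(2014, 2, 3), "Puebla"),
    ("d", date(2014, 2, 5), "Guanajuato"),
]
counts = visitors_by_period(obs, "Puebla")
print(counts)
```

Using sets means a user who tweets several times during the weekend is counted once for that period.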
Border commuting
• México
• USA
National Roads Network
Urban Influence zones
Subjective Wellness
• Complement of existing survey
– Subjective perceived wellness (monthly)
• Two approaches
– Based on emoticons (possible international comparability)
  • Netherlands experiments
– Based on text (diversity of analysis, regionalisms)
  • Text analysis infrastructure development
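The emoticon-based approach can be illustrated with a very small sketch. The emoticon sets below are placeholders chosen for this example, not the lists used by INEGI or in the Netherlands experiments, and the scoring rule (positive count minus negative count, then the share of positive tweets among those with a non-zero score) is an assumption.

```python
# Placeholder emoticon sets, assumed for illustration only
POSITIVE = {":)", ":-)", ":D", ";)"}
NEGATIVE = {":(", ":-(", ":'("}


def emoticon_score(text):
    """Positive emoticons minus negative emoticons in a whitespace-tokenised tweet."""
    tokens = text.split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg


def share_positive(texts):
    """Share of positive tweets among tweets with a non-zero emoticon score."""
    scored = [s for s in map(emoticon_score, texts) if s != 0]
    if not scored:
        return None  # no tweet carried a usable emoticon signal
    return sum(s > 0 for s in scored) / len(scored)


# Hypothetical tweets, for illustration only
texts = ["Buen día :)", "Qué tráfico :(", "Sin emoticon", "Gol :D :)"]
share = share_positive(texts)
print(share)
```

Because emoticons do not depend on language, an indicator built this way is easier to compare across countries than a text-based one, which is the advantage the slide points to.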
Methods and Tools
• Pioanalisis: tool for collecting the training set (crowdsourcing)
• Machine learning (supervised and unsupervised), Support Vector Machines, incremental learning
• Random forest, Latent Dirichlet Allocation (LDA)
• SOM neural networks (SOM: Self-Organizing Map)
• Classification methods: Naive Bayes, Support Vector Machines (SVM), KNN, word count
• Dictionaries: Spanish Emotion Lexicon (SEL), KNN, AFINN, WordNet, ANEW
Partnerships
• International
– UNECE
  • ICHEC
– UNSD
– LAMBDoop
– University of Pennsylvania
• National
– KioNetworks
  • Dattlas
– TecMilenio
– INFOTEC
– Centro Geo
– CIDE
– CIMAT
– Sectur
• Internal
– INEGI General Directorates
Conclusions
• We are in a discovery stage:
– Findings going from ‘interesting’ to ‘valuable’
• A lot of research is needed:
– … but we are gaining a lot of knowledge and experience
• Partnerships are a must
• Combining other big data sources is an imminent next step
• New challenges and threats will appear
– Cost increases?
– Legal issues?
– Re-engineering of methodologies and quality frameworks?
– Evolution of traditional statistics?
• A lot of etcetera?
New statistics production landscape?