Twitter Geolocation and Regional Classification via …people.seas.harvard.edu/~miriamcha/pubs/poster_cha_… · · 2015-12-17Develop Twitter geolocation and regional classification

Geolocation

Region and State Classification !  Develop Twitter geolocation and regional classification based on sparse coding and dictionary learning

Classification on Compressed Data

Twitter Geolocation and Regional Classification via Sparse Coding

Miriam Cha, Youngjune Gwon, and H.T. Kung

Approach

Objectives

Introduction Results

Conclusion and Future Work

Evaluation

SC gives decreased error by ~100 km compared to Raw throughout all reported amounts of labeled data

Sparse Coding Represent x as a linear combination of basis vector in D

Geolocation and Region/State Classification

!  As the amount of labeled training samples is limited in practice, such advantage of sparse coding is attractive

CMU GeoText Dataset !  Geo-tagged microblog corpus Ø  http://www.ark.cs.cmu.edu/GeoText/ Ø  377,616 Twitter messages from 9,475 users

within continental US over one week Ø  All user geolocations known (latitude,

longitude)

Experimental Methodology !  Three types of embedding schemes (binary bag-of-words, word-counts, word-sequence) !  For a fair comparison, we follow the identical experimental methodology as Eisenstein et al. (2010)

!  Twitter’s location service enables users to add location information to their Tweets !  Location information of message is useful (e.g., consumer marketing) Ø  However, < 3% of messages are geo-

tagged†

!  Machine learning can geolocate based on message content Ø  Best results are from supervised model-

based approaches !  We focus on unsupervised data-driven approach, namely sparse coding, to take advantage of abundance of unlabeled messages to geolocate

Twitter

Motivation

† C. Weidemann, Journal of Geoinformatics

Geolocation errors. Values are mean (median) in km

Performance comparison summary

Achieves 9% gain for region classification and 14% gain for

state classification

Our method applied to word-sequence vectors achieves the best performance

Voting-based Grid Selection Similarity matching in feature space (sparse code domain) •  Find k-Nearest Neighbors (kNNs) in sparse codes •  Look up reference locations of kNNs to compute

geocoordinate estimate

Ø  Estimate latitude and longitude with Twitter message content

Ø  Classify US state and region using message content

Over 270 million active users

Most popular microblog service

500 million tweets per day

man$I$so$wish$I$didn't$have$to$get$up$at$6am$on$saturdays$:($I$really$wanna$go!!$It$sounds$amazing!$

“Helena”$@USER_d32baf7f$

Reply& Retweet& Favorite& More&… 8:31$$AM$M$$1$Mar$2014$

Omg$I$just$realized$its$11!$I$goRa$get$some$sleep!$


Reply& Retweet& Favorite& More&… 2:10$$AM$M$$29$Feb$2014$

Before$I$started$school$ppl$only$wanted$to$do$stuff$on$sat$nights$but$now$that$sat$works$4$me$beRer$ppl$wanna$go$out$Friday!$


Reply& Retweet& Favorite& More&… 6:32$$AM$M$$28$Feb$2014$

(41.9233,&883.4538)&

Table 1: Geolocation errors. The values are mean (median) in km.Raw SC SC+PCA SC+Raw SC+voting SC+all

Binary 1024 (632) 879 (722) 748 (596) 861 (713) 825 (529) 707 (489)Word counts 1189 (1087) 1042 (887) 969 (802) 1022 (863) 998 (511) 926 (497)Word sequence – 767 (615) 706 (583) 671 (483) 715 (580) 581 (425)

Twitter Geolocation and Regional Classification via Sparse Coding

Miriam ChaHarvard University

Youngjune GwonHarvard University

H. T. KungHarvard University

Table 2: Performance comparison summaryGeolocation error Classification accuracyMean Median Region State

Our approach 581 425 67% 41%Eisenstein et al. 845 501 58% 27%W&B 967 479 – –Roller et al. 897 432 – –

Abstract

EvaluationWe evaluate our baseline methods and enhancements empir-ically, using the GEOTEXT dataset. This section will presenta comparative performance analysis against other text-basedgeolocation approaches of significant impact.

AcknowledgmentsThis material is based upon work supported by the National Sci-ence Foundation Graduate Research Fellowship under Grant No.DGE1144152 and gifts from the Intel Corporation.

Copyright c� 2015, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

Table 1: Geolocation errors. The values are mean (median) in km.Raw SC SC+PCA SC+regD SC+Raw SC+voting SC+all

Binary 1024 (632) 879 (722) 748 (596) 846 (701) 861 (713) 825 (529) 707 (489)Word counts 1189 (1087) 1042 (887) 969 (802) 1020 (843) 1022 (863) 998 (511) 926 (497)Word sequence – 767 (615) 706 (583) 735 (596) 671 (483) 715 (580) 581 (425)

Table 2: Performance comparison summaryGeolocation error Classification accuracyMean Median Region State

Our approach 581 425 67% 41%Eisenstein et al. 845 501 58% 27%W&B 967 479 – –Roller et al. 897 432 – –

Overall, our methods applied to word-sequence vectorsachieve the best performance. PCA whitening substantiallyimproves the baseline (SC), especially on patches takenfrom binary embedding. We have appended binary embed-ding for SC+Raw on patches from word sequence vectorand achieved the most impressive performance gain. Each ofthese enhancements individually has decreased geolocationerrors for all embedding schemes. When all enhancementsare applied, the combined scheme achieves the best resultsfor each embedding scheme.

Figure 3 depicts the general trend that sparse coding out-performs Raw as a function of the amount of words in thelabeled document. We observe the general trend that as morelabeled datasets are available, geolocation error decreasesfor both Raw and sparse coding. When labeled dataset islimited, the result for sparse coding with N = 64 is par-ticularly striking, showing decreased geolocation errors byapproximately 100 km compared to Raw, throughout all re-ported amounts of labeled data. Note that SC (w/ N = 64)at 2 ⇥ 10

6 labeled data performs about the same as Raw at3.7⇥ 10

6. As the amount of labeled training samples is lim-ited in practice, such advantage of sparse coding is attractive.

500

600

700

800

900

1000

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0M

edia

n lo

catio

n er

ror (

km)

Total # of words in labeled dataset (x 106)

Median location error vs. labeled dataset amount

Raw (binary bag-of-words)SC (w/ N=32)SC (w/ N=64)

Figure 3: Median location error vs. size of labeled dataset.

Performance comparison. Table 2 presents a summary thatcompares the geolocation and regional classification perfor-mances of our approach and the previous work. We haveachieved a 9% gain for region classification over Eisensteinet al., and a 14% gain for state classification.

Conclusion and Future WorkWe have shown that twitter geolocation and regional clas-sification can benefit from unsupervised, data-driven ap-proaches such as sparse coding and dictionary learning.Such approaches are particularly suited to microblog scenar-ios where labeled data can be limited. However, a straight-forward application for sparse coding would only producesuboptimal solutions. As we have demonstrated, competitiveperformance is attainable only when all of our enhancementsteps are applied. In particular, we find that raw feature aug-mentation and an algorithmic enhancement by voting-basedgrid selection have been significant in reducing errors. To thebest of our knowledge, the use of sparse coding in text-basedgeolocation and the proposed enhancements are novel. Ourfuture work includes a hybrid method that can leverage thestrengths of data-driven and model-based approaches and itsevaluation in both US and worldwide data.

AcknowledgmentsThis material is based upon work supported by the National Sci-ence Foundation Graduate Research Fellowship under Grant No.DGE1144152 and gifts from the Intel Corporation.

ReferencesAharon, M.; Elad, M.; and Bruckstein, A. 2006. K-SVD: An Al-gorithm for Designing Overcomplete Dictionaries for Sparse Rep-resentation. IEEE Trans. on Signal Processing.Chen, S. S.; Donoho, D. L.; and Saunders, M. A. 2001. AtomicDecomposition by Basis Pursuit. SIAM Rev.

Eisenstein, J.; O’Connor, B.; Smith, N. A.; and Xing, E. P. 2010.A Latent Variable Model for Geographic Lexical Variation. InEMNLP.GeoText. 2010. CMU Geo-tagged Microblog Corpus. http:

//www.ark.cs.cmu.edu/GeoText/.Hong, L.; Ahmed, A.; Gurumurthy, S.; Smola, A. J.; and Tsiout-siouliklis, K. 2012. Discovering Geographical Topics in the TwitterStream. In WWW.Olshausen, B. A., and Field, D. J. 1997. Sparse Coding with anOvercomplete Basis Set: Strategy Employed by V1? Vision re-

search.Roller, S.; Speriosu, M.; Rallapalli, S.; Wing, B.; and Baldridge, J.2012. Supervised Text-based Geolocation Using Language Modelson an Adaptive Grid. In EMNLP.Sinnott, R. W. 1984. Virtues of the Haversine. Sky and Telescope.Tibshirani, R. 1994. Regression Shrinkage and Selection via Lasso.Journ. of Royal Statistical Society.Tropp, J., and Gilbert, A. 2007. Signal Recovery From RandomMeasurements Via Orthogonal Matching Pursuit. IEEE Trans. on

Information Theory.Wing, B. P., and Baldridge, J. 2011. Simple Supervised DocumentGeolocation with Geodesic Grids. In ACL.

Conclusion !  Twitter geolocation and regional classification can benefit from unsupervised, data-driven approaches such as sparse coding and dictionary learning !  Achieve competitive performance standing at 581 km mean and 425 km median location errors for GeoText dataset !  Our classification accuracy for 4-way US regions is a 9% improvement over the best topical model in prior literature, and the 14% for 48-way US states

Future Work !  Evaluation in both US and worldwide data !  Hybrid method that can leverage the strengths of data-driven and model-based approaches

Acknowledgement This work is based upon work supported by the National Science

Foundation Graduate Research Fellowship under Grant No. DGE1144152 and gifts from the Intel Corporation

Documents

Twitter Geolocation and Regional Classification via …people.seas.harvard.edu/~miriamcha/pubs/poster_cha_… · · 2015-12-17Develop Twitter geolocation and regional classification