
Twitter Geolocation and Regional Classification via Sparse Coding

Miriam Cha, Harvard University

Youngjune Gwon, Harvard University

H. T. Kung, Harvard University

    Abstract

We present a data-driven approach for Twitter geolocation and regional classification. Our method is based on sparse coding and dictionary learning, an unsupervised method popular in computer vision and pattern recognition. Through a series of optimization steps that integrate information from both feature and raw spaces, and enhancements such as PCA whitening, feature augmentation, and voting-based grid selection, we lower geolocation errors and improve classification accuracy over previously known results on the GEOTEXT dataset.

Introduction

Location information of a user provides crucial context for building useful applications and platforms for social networking. It is well known that one can geolocate a document generated online by identifying geographically varying characteristics of its text. In particular, a variety of linguistic features such as toponyms and geographic and dialectal elements can be used to create a content-based geolocation system.

In recent years, there has been growing interest in leveraging social networking and microblogging as a special sensor network to detect natural disasters, terrorism, and social unrest. Social media research has experimented with supervised topical models trained on Twitter messages (known as tweets, of 140 characters or fewer) for geolocation. However, predicting accurate geocoordinates using only microblog text data is a hard problem. Microblog text can deviate significantly from normal usage of a language. The processing and decomposition of casual tweets often require a complex topical model whose parameters must be trained meticulously with a large labeled dataset. The fact that only 3% of tweets are geotagged indicates the difficulty of supervised learning due to the limited availability of labeled training examples. Moreover, learning algorithms that use raw data directly, without computing feature representations, can severely limit prediction performance.

In this paper, we present a new approach to the Twitter geolocation problem, where we want to estimate the geocoordinates of a Twitter user solely from the associated tweet text. Our approach is based on unsupervised representational learning via sparse coding (1997). We show that sparse coding trained on unlabeled tweets leads to a powerful representational mapping for text-based geolocation, comparable to topical models for lexical variation. Such a mapping is particularly effective when used in a similarity-search heuristic for location estimation such as k-Nearest Neighbors (k-NN). When we apply a series of enhancements consisting of PCA whitening, regional dictionary learning, feature augmentation, and voting-based grid selection, we achieve competitive performance, standing at 581 km mean and 425 km median location errors on the GEOTEXT dataset. Our classification accuracy for 4-way US regions is a 9% improvement over the best topical model in the prior literature, and 14% for 48-way US states.

Related work

Eisenstein et al. (2010) propose a latent variable model with two sources of lexical variation: topic and region. The model describes a sophisticated generative process, based on multivariate Gaussians and Latent Dirichlet Allocation (LDA), for predicting an author's geocoordinates from tweet text via variational inference. Hong et al. (2012) extend the idea to model regions and topics jointly. Wing & Baldridge (2011) describe supervised models for geolocation. They experimented with geodesic grids of different resolutions and obtained a lower median location error than Eisenstein et al. Roller et al. (2012) present an improvement over Wing & Baldridge with adaptive grid schemes based on k-d trees and uniform partitioning.

Data

We use the CMU GEOTEXT dataset (2010). GEOTEXT is a geo-tagged microblog corpus comprising 377,616 tweets by 9,475 users from the 48 contiguous US states and Washington, D.C. Each document in the dataset is the concatenation of all tweets by a single user, whose location is provided as GPS-assigned latitude and longitude values (geocoordinates). A document is a sequence of integers ranging from 1 to 5,216, where each number gives a position in vocab, the table of unique words appearing in the dataset. While recognizing its limitation to US Twitter users only, we choose GEOTEXT for fair comparison with the prior approaches mentioned in Related work that use the same dataset.
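To make this document encoding concrete, the following sketch decodes one GEOTEXT-style document back into words. The toy vocabulary and variable names are our illustration, not the dataset's actual file layout:

```python
# Toy illustration of GEOTEXT's document encoding: each document is a
# sequence of 1-based indices into vocab (the real dataset has V = 5,216).
vocab = ["hello", "snow", "beach", "y'all"]   # stand-in vocabulary

document = [4, 1, 2]                          # one user's (toy) document

words = [vocab[i - 1] for i in document]      # 1-based index -> 0-based lookup
print(words)                                  # ["y'all", 'hello', 'snow']
```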


Approach

Our approach is essentially semi-supervised learning. We are concerned with the shortage of labeled training examples in practice, and hence pay special attention to the unsupervised part, consisting of sparse coding and dictionary learning with unlabeled data.

Preliminaries

Notation. Let our vocabulary vocab be an array of words with size $V = |\texttt{vocab}|$. A document containing $W$ words can be represented as a binary bag-of-words vector $\mathbf{w}_{BW} \in \{0, 1\}^V$ embedded onto vocab. The $i$th element of $\mathbf{w}_{BW}$ is 1 if the word vocab[$i$] appears in the text. We denote by $\mathbf{w}_{WC} \in \{0, 1, 2, \dots\}^V$ a word-counts vector, where the $i$th element is the number of times the word vocab[$i$] appears. We can also use a word-sequence vector $\mathbf{w}_{WS} \in \{1, \dots, V\}^W$; that is, $w_i$, the $i$th element of $\mathbf{w}_{WS}$, represents the $i$th word in the text, which is vocab[$w_i$]. We use a patch $\mathbf{x} \in \mathbb{R}^N$, a subvector taken from $\mathbf{w}_{BW}$, $\mathbf{w}_{WC}$, or $\mathbf{w}_{WS}$, as the input to sparse coding. For an $m$-class classification task, we use a label $l \in \{c_1, c_2, \dots, c_m\}$, where $c_j$ is the label value for class $j$. For the geolocation task, the label $l = (lat, lon)$ is the geocoordinate information.
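For illustration, the following sketch (toy vocabulary; all names are ours) constructs $\mathbf{w}_{WC}$ and $\mathbf{w}_{BW}$ from a word-sequence vector:

```python
import numpy as np

# Toy vocabulary; in GEOTEXT, V = 5,216.
vocab = ["the", "beach", "snow", "taxi", "y'all"]
V = len(vocab)

# Word-sequence vector w_WS: each entry indexes vocab (the paper uses
# 1-based indices; we use 0-based indices here for Python).
w_WS = np.array([1, 0, 1, 4])            # "beach the beach y'all"

# Word-counts vector w_WC: entry i counts occurrences of vocab[i].
w_WC = np.bincount(w_WS, minlength=V)    # [1, 2, 0, 0, 1]

# Binary bag-of-words vector w_BW: entry i is 1 if vocab[i] appears at all.
w_BW = (w_WC > 0).astype(int)            # [1, 1, 0, 0, 1]
```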

Sparse coding. We use sparse coding as the basic means to extract features from text. Given an input patch $\mathbf{x} \in \mathbb{R}^N$, sparse coding solves for a representation $\mathbf{y} \in \mathbb{R}^K$ in the following optimization problem:

$$\min_{\mathbf{y}} \; \|\mathbf{x} - D\mathbf{y}\|_2^2 + \lambda\,\psi(\mathbf{y}) \qquad (1)$$

Here, $\mathbf{x}$ is interpreted as a linear combination of basis vectors in an overcomplete dictionary $D \in \mathbb{R}^{N \times K}$ ($K > N$). The solution $\mathbf{y}$ contains the linear-combination coefficients, regularized by the sparsity-inducing second term of Eq. (1) for some $\lambda > 0$.

Sparse coding employs the $\ell_0$- or $\ell_1$-norm of $\mathbf{y}$ as $\psi(\cdot)$. Because the $\ell_0$-norm of a vector is the number of its nonzero elements, it precisely fits the regularization purpose of sparse coding. Finding the $\ell_0$-minimum solution for Eq. (1), however, is known to be NP-hard. The $\ell_1$-minimization known as Basis Pursuit (2001) or LASSO (1994) is often preferred. More recently, $\ell_0$-based greedy algorithms such as Orthogonal Matching Pursuit (OMP) (2007) have been shown to run fast. Throughout this paper, our default choice for sparse coding is OMP.
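As a concrete illustration, the sketch below encodes one patch with scikit-learn's OMP-based sparse_encode. This is our choice of solver implementation, not necessarily the paper's, and the random dictionary is only a stand-in for a learned one:

```python
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
N, K = 64, 256                       # patch length N, dictionary size K (K > N)

# Random unit-norm dictionary; sklearn stores atoms as rows, i.e. D^T
# relative to Eq. (1).
D = rng.standard_normal((K, N))
D /= np.linalg.norm(D, axis=1, keepdims=True)

x = rng.standard_normal((1, N))      # one input patch (stand-in for embedded text)

# OMP greedily selects atoms until the l0 budget (8 nonzeros) is met.
y = sparse_encode(x, D, algorithm="omp", n_nonzero_coefs=8)

print(np.count_nonzero(y))           # <= 8 nonzero coefficients
print(np.linalg.norm(x - y @ D))     # reconstruction error ||x - Dy||_2
```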

How can we learn a dictionary $D$ of basis vectors for sparse coding? This is done by an unsupervised, data-driven process incorporating two separate optimizations. It first computes a sparse code for each (unlabeled) training example using the current dictionary. Then, the reconstruction error from the computed sparse codes is used to update each basis vector in the dictionary. We use the K-SVD algorithm (2006) for dictionary learning.
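scikit-learn does not ship K-SVD, but its MiniBatchDictionaryLearning follows the same alternating scheme (a sparse coding step, then a dictionary update), so a hedged sketch of this unsupervised stage might look like:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 64))   # unlabeled training patches, one per row

# Alternates sparse coding of each mini-batch with a dictionary update --
# the same two-step loop K-SVD uses (K-SVD itself is not in scikit-learn).
learner = MiniBatchDictionaryLearning(
    n_components=256,                 # K basis vectors
    transform_algorithm="omp",
    transform_n_nonzero_coefs=8,
    random_state=0,
)
codes = learner.fit_transform(X)      # sparse codes, shape (5000, 256)
D = learner.components_               # learned dictionary, shape (256, 64)
```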

Baseline methods

We consider two localization tasks. The first task is to estimate the geocoordinates given tweets of unknown location. The second task is to perform multiclass classification that predicts a region (Northeast, Midwest, South, West) or state in the US. In the following, we present our baseline methods for these two tasks.

Geolocation. We use the following steps for geolocation.

    1. (Text embedding) perform binary, word-counts, or word-sequence embedding of raw tweet text, using vocab;

2. (Unsupervised learning) use unlabeled data patches to learn basis vectors in a dictionary $D$;

3. (Feature extraction) do sparse coding with labeled data patches $\{(\mathbf{x}^{(1)}, l^{(1)}), (\mathbf{x}^{(2)}, l^{(2)}), \dots\}$, using the learned dictionary $D$, to yield sparse codes $\{\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \dots\}$;

4. (Feature pooling) perform max pooling over a group of $M$ sparse codes from a user to obtain a pooled sparse code $\mathbf{z}$ such that the $k$th element of $\mathbf{z}$ is $z_k = \max(y_{1,k}, y_{2,k}, \dots, y_{M,k})$, where $y_{i,k}$ is the $k$th element of $\mathbf{y}_i$, the $i$th sparse code in the pooling group (see the sketch after this list);

5. (Tabularization of known locations) build a lookup table of reference geocoordinates associated with the pooled sparse codes $\mathbf{z}$ (from labeled data patches).
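Step 4's max pooling reduces to an elementwise maximum; a minimal sketch, assuming a user's $M$ sparse codes are stacked as the rows of an array:

```python
import numpy as np

def max_pool(sparse_codes: np.ndarray) -> np.ndarray:
    """Step 4: elementwise max over a user's M sparse codes.

    sparse_codes has shape (M, K); the pooled code z has shape (K,),
    with z_k = max(y_{1,k}, ..., y_{M,k}).
    """
    return sparse_codes.max(axis=0)
```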

When tweets from unknown geocoordinates arrive, the geolocation pipeline comprises preprocessing, feature extraction via sparse coding (without labels), and max pooling. Using the resulting pooled sparse code of the tweets, we find the k pooled sparse codes in the lookup table that are closest in cosine similarity, and take the average geocoordinates of these k-nearest neighbors (k-NNs).
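The location estimate itself can be sketched as follows. The paper specifies cosine similarity and k-NN averaging but no particular implementation; the function and argument names here are ours:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def estimate_location(z_query, table_codes, table_coords, k=5):
    """k-NN geolocation: average the geocoordinates of the k reference
    pooled sparse codes most similar (in cosine) to the query code.

    table_codes:  (n_refs, K) pooled sparse codes with known locations
    table_coords: (n_refs, 2) rows of (lat, lon)
    """
    sims = cosine_similarity(z_query.reshape(1, -1), table_codes).ravel()
    nearest = np.argsort(sims)[-k:]            # indices of the k most similar
    return table_coords[nearest].mean(axis=0)  # mean (lat, lon) of the k-NNs
```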

Region and state classification. We employ the following method for classification.

1. Repeat steps 1 to 4 from the geolocation method;

2. (Supervised classifier training) use the pooled sparse codes $\mathbf{z}$ as features to train a classifier such as a multiclass SVM or softmax regression.

When tweets from an unknown region or state arrive, the classification pipeline consists of preprocessing, feature extraction via sparse coding (without labels), and max pooling. The resulting pooled sparse code of the tweets is fed to the classifier trained in step 2 to predict the class label.
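A minimal sketch of step 2's supervised stage, assuming scikit-learn (our choice; the paper does not name a library). LogisticRegression with its multinomial objective serves as softmax regression, and LinearSVC as a linear multiclass SVM; all data below are toy stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
Z = rng.random((9475, 256))              # pooled sparse codes, one row per user
labels = rng.integers(0, 4, size=9475)   # toy 4-way region labels

# Either classifier from step 2: multinomial logistic regression is softmax
# regression; LinearSVC is a one-vs-rest multiclass linear SVM.
softmax_clf = LogisticRegression(max_iter=1000).fit(Z, labels)
svm_clf = LinearSVC().fit(Z, labels)

predictions = softmax_clf.predict(Z[:10])  # class labels for new pooled codes
```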

Enhancements

PCA whitening. Principal component analysis (PCA) is a dimensionality reduction technique that can speed up unsupervised learning such as sparse coding. A related procedure called whitening is also helpful: it makes the input less redundant and thus sparse coding more effective in classification. We precondition $n$ input patches taken from $\mathbf{w}_{BW}$
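A minimal sketch of such a whitening preconditioner, using scikit-learn's PCA with whiten=True (our choice of implementation; the dimensions are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((5000, 512))           # n input patches as rows

# PCA whitening: project onto principal components and rescale each to unit
# variance, decorrelating inputs before dictionary learning / sparse coding.
pca = PCA(n_components=64, whiten=True)
X_white = pca.fit_transform(X)        # shape (5000, 64)

print(np.round(np.cov(X_white, rowvar=False)[:3, :3], 2))  # ~ identity
```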