Twitter Geolocation and Regional Classification via Sparse gyj/pubs/twitter-paper.pdf · can be used…

  • Published on

  • View

  • Download


<ul><li><p>Twitter Geolocation and Regional Classification via Sparse Coding</p><p>Miriam ChaHarvard University</p><p>Youngjune GwonHarvard University</p><p>H. T. KungHarvard University</p><p>Abstract</p><p>We present a data-driven approach for Twitter geoloca-tion and regional classification. Our method is based onsparse coding and dictionary learning, an unsupervisedmethod popular in computer vision and pattern recog-nition. Through a series of optimization steps that inte-grate information from both feature and raw spaces, andenhancements such as PCA whitening, feature augmen-tation, and voting-based grid selection, we lower geolo-cation errors and improve classification accuracy frompreviously known results on the GEOTEXT dataset.</p><p>IntroductionLocation information of a user provides crucial context tobuild useful applications and platforms for social network-ing. It has been known that one can geolocate a documentgenerated online by identifying geographically varying char-acteristics of the text. In particular, a variety of linguistic fea-tures such as toponyms, geographic and dialectic elementscan be used to create a content-based geolocation system.</p><p>In recent years, there is a growing interest to leverage so-cial networking and microblogging as a special sensor net-work to detect natural disasters, terrorism, and social un-rests. Social media research has experimented with super-vised topical models trained on Twitter messages (known astweets of 140 characters or fewer) for geolocation. However,predicting accurate geocoordinates using only microblogtext data is a hard problem. Microblog text can deviate sig-nificantly from normal usage of a language. The process-ing and decomposition of casual tweets often require a com-plex topical model, whose parameters must be trained metic-ulously with large labeled dataset. The fact that only 3%of tweets are geotagged indicates the difficulty of super-vised learning due to limited availability of labeled train-ing examples. Moreover, learning algorithms that use rawdata directly without computing feature representations canseverely limit the prediction performance.</p><p>In this paper, we present a new approach for the Twittergeolocation problem, where we want to estimate the geoco-ordinates of a Twitter user solely based on associated tweettext. Our approach is based on unsupervised representational</p><p>Copyright c 2015, Association for the Advancement of ArtificialIntelligence ( All rights reserved.</p><p>learning via sparse coding (1997). We show that sparse cod-ing trained on unlabeled tweets leads to a powerful repre-sentational mapping for text-based geolocation comparableto topical models for lexical variation. Such mapping is par-ticularly effective when it is used in a similarity searchingheuristic in location estimation such as k-Nearest Neighbors(k-NNs). When we apply a series of enhancements consist-ing of PCA whitening, regional dictionary learning, featureaugmentation, and voting-based grid selection, we achievecompetitive performance standing at 581 km mean and 425km median location errors for the GEOTEXT dataset. Ourclassification accuracy for 4-way US Regions is a 9% im-provement over the best topical model in prior literature, andthe 14% for 48-way US States.</p><p>Related workEisenstein et al. (2010) propose a latent variable model withtwo sources of lexical variation: topic and region. The modeldescribes a sophisticated generative process from multivari-ate Gaussian and Latent Dirichlet Allocation (LDA) for pre-dicting an authors geocoordinates from tweet text via vari-ational inference. Hong et al. (2012) extends the idea tomodel regions and topics jointly. Wing &amp; Baldridge (2011)describe supervised models for geolocation. They have ex-perimented with geodesic grids having different resolutionsand obtained a lower median location error than Eisensteinet al. Roller et al. (2012) present an improvement over Wing&amp; Baldridge with adaptive grid schemes based on k-d treesand uniform partitioning.</p><p>DataWe use the CMU GEOTEXT dataset (2010). GEOTEXT is ageo-tagged microblog corpus comprising 377,616 tweets by9,475 users from 48 contiguous US states and WashingtonD.C. Each document in the dataset is concatenation of alltweets by a single user whose location information is pro-vided as GPS-assigned latitude and longitude values (geo-coordinates). The document is a sequence of integer num-bers ranging 1 to 5,216, where each number represents theposition in vocab, the table of uniquely appearing words inthe dataset. While recognizing its limitation to US Twitterusers only, we choose GEOTEXT for fair comparison withthe prior approaches mentioned in related work that use thesame dataset.</p><p>Proceedings of the Ninth International AAAI Conference on Web and Social Media</p><p>582</p></li><li><p>ApproachOur approach is essentially semi-supervised learning. Weconcern the shortage of labeled training examples in prac-tice, and hence pay special attention on the unsupervisedpart consisting of sparse coding and dictionary learning withunlabeled data.</p><p>PreliminariesNotation. Let our vocabulary vocab be an array of wordswith size V = |vocab|. A document containing W wordscan be represented as a binary bag-of-words vector wBW {0, 1}V embedded onto vocab. The ith element in wBWis 1 if the word vocab[i] has appeared in the text. We de-note a word-counts vector wWC {0, 1, 2, . . . }V , wherethe ith element in wWC now is the number of times theword vocab[i] has appeared. We can also use a word-sequence vector wWS {1, . . . , V }W . That is, wi, the ithelement in wWS represents the ith word in the text, which isvocab[wi]. We use patch x RN , a subvector taken fromwBW , wWC , or wWS for an input to sparse coding. For m-class classification task, we use label l {c1, c2, . . . , cm},where cj is the label value for class j. For geolocation task,label l = (lat, lon) is the geocoordinate information.Sparse coding. We use sparse coding as the basic means toextract features from text. Given an input patch x RN ,sparse coding solves for a representation y RK in thefollowing optimization problem:</p><p>minyxDy22 + (y) (1)</p><p>Here, x is interpreted as a linear combination of basis vec-tors in an overcomplete dictionary D RNK (K N ).The solution y contains the linear combination coefficientsregularized by the sparsity-inducing second term of Eq. (1)for some &gt; 0.</p><p>Sparse coding employs `0- or `1-norm of y as (.). Be-cause `0-norm of a vector is the number of its nonzero el-ements, it precisely fits the regularization purpose of sparsecoding. Finding the `0-minimum solution for Eq. (1), how-ever, is known to be NP-hard. The `1-minimization knownas Basis Pursuit (2001) or LASSO (1994) is often preferred.Recently, it is known that the `0-based greedy algorithmssuch as Orthogonal Matching Pursuit (OMP) (2007) can runfast. Throughout this paper, our default choice for sparsecoding is OMP.</p><p>How can we learn a dictionary D of basis vectors forsparse coding? This is done by an unsupervised, data-drivenprocess incorporating two separate optimizations. It firstcomputes sparse code for each training example (unlabeled)using the current dictionary. Then, the reconstruction errorfrom the computed sparse codes is used to update each ba-sis vector in the dictionary. We use K-SVD algorithm (2006)for dictionary learning.</p><p>Baseline methodsWe consider two localization tasks. The first task is to esti-mate the geocoordinates given tweets of unknown location.The second task is to perform multiclass classification thatpredicts a region (Northeast, Midwest, South, West) or state</p><p>in the US. In the following, we present our baseline methodsfor these two tasks.Geolocation. We use the following steps for geolocation.</p><p>1. (Text embedding) perform binary, word-counts, or word-sequence embedding of raw tweet text, using vocab;</p><p>2. (Unsupervised learning) use unlabeled data patches tolearn basis vectors in a dictionary D;</p><p>3. (Feature extraction) do sparse coding with labeled datapatches {(x(1), l(1)), (x(2), l(2)), . . . }, using the learneddictionary D and yield sparse codes {y(1),y(2), . . . };</p><p>4. (Feature pooling) perform max pooling over a groupof M sparse codes from a user to obtain pooledsparse code z such that the kth element in z, zk =max(y1,k, y2,k, . . . , yM,k) where yi,k is the kth elementfrom yi, the ith sparse code in the pooling group;</p><p>5. (Tabularization of known locations) build a lookup tableof reference geocoordinates associated with pooled sparsecodes z (from labeled data patches).</p><p>When tweets from unknown geocoordinates arrive, thegeolocation pipeline comprises preprocessing, feature ex-traction via sparse coding (without labels), and max pool-ing. Using the resulting pooled sparse codes of the tweets,we find the k pooled sparse codes from the lookup table thatare closest in cosine similarity. We take the average geoco-ordinates of the k-NNs.Region and state classification. We employ the followingmethod for classification.</p><p>1. Repeat steps 1 to 4 from the geolocation method;</p><p>2. (Supervised classifier training) use pooled sparse codes zas features to train a classifier such as multiclass SVM andsoftmax regression.</p><p>When tweets from an unknown region or state arrive, theclassification pipeline consists of preprocessing, feature ex-traction via sparse coding (without labels), and max pooling.The resulting pooled sparse codes of the tweets are appliedto the trained classifier from step 2 to predict the class label.</p><p>EnhancementsPCA whitening. Principal components analysis (PCA) is adimensionality reduction technique that can speed up unsu-pervised learning such as sparse coding. A similar procedurecalled whitening is also helpful by making input less redun-dant and thus sparse coding more effective in classification.We precondition n input patches taken from wBW , wWC ,and wWS for sparse coding by PCA whitening:</p><p>1. Remove patch mean x(i) := x(i) 1nn</p><p>i=1 x(i);</p><p>2. Compute covariance matrix C= 1n1n</p><p>i=1x(i)x(i)&gt;;</p><p>3. Do PCA [U,]=eig(C);</p><p>4. Compute xPCAwhite = ( + I)1/2U&gt;x, where is asmall positive value for regularization.</p><p>Regional dictionary learning (regD). In principle, dictio-nary learning is an unsupervised process. However, labelinformation of training examples can be used beneficially</p><p>583</p></li><li><p>lat7lon7) </p><p>lat20lon20)</p><p>latiloni)</p><p>lat21lon21)</p><p>lat43lon43)lat44lon44)lat45lon45)</p><p>Figure 1: Voting-based grid selection scheme</p><p>to train dictionary. We consider the following enhancement.Using regional labels of training examples, we learn sepa-rate dictionaries D1, D2, D3, and D4 for the four US re-gions, where each Di is trained using data from the respec-tive region. Simple concatenation forms the final dictionaryD = [D1D2D3D4]. Thus, the resulting basis vectors aremore representative for their respective regions.Raw feature augmentation. Sparse code y serves as thefeature for raw patch input x. Information-theoreticallyspeaking, x and y are approximately equivalent since onerepresentation is a transform of the other. However, due toreasons such as insufficient samples for dictionary trainingor presence of noise in x, there is a possibility of many-to-one ambiguity. In other words, raw data belonging to dif-ferent classes may have very similar sparse code. To mit-igate this problem, we consider a simple enhancement byappending the raw patch vector x to its sparse code (or,pooled sparse code z) and use the combined vector as theaugmented feature.Voting-based grid selection. Figure 1 illustrates our voting-based grid selection scheme. With patches with known geo-coordinates, we compute their pooled sparse codes and con-struct a lookup table with their corresponding geocoordi-nates. When a patch from unknown geocoordinates arrives,we compute its pooled sparse code, and find the k pooledsparse codes from the lookup table that are closest in cosinesimilarity. Each k-NN casts a vote to its corresponding grid.We identify the grid which receives the most votes and takethe average of geocoordinates in the selected grid.</p><p>!"#$</p><p>vocab #$</p><p> %</p><p>&amp;!"'#lat'lon$(</p><p>vocab </p><p> % </p><p>)</p><p>*</p><p>+)</p><p>,</p><p>+</p><p>Figure 2: Full system pipeline</p><p>Full system pipelineIn Figure 2, we present the full system pipeline for geoloca-tion and classification via sparse coding.</p><p>EvaluationWe evaluate our baseline methods and enhancements empir-ically, using the GEOTEXT dataset. This section will presenta comparative performance analysis against other text-basedgeolocation approaches of significant impact.</p><p>ExperimentsDataset processing. For each test instance, we cut thedataset into five folds such that fold=user id%5, fol-lowing Eisenstein et al. (2010). Folds 14 are used for train-ing, and fold 5 for testing. We have converted the entire textdata for each user to binary (BW), word-counts (WC), andword-sequence (WS) vectors. From these vectors, we sparsecode patches of a configurable size N taken from the input.We use patch sizes of N = 32, 64 with small or no overlapfor feature extraction.Training. In unsupervised learning, we precondition patcheswith PCA whitening prior to sparse coding. The parametersfor sparse coding are experimentally determined. We haveused a dictionary size K 10N , sparsity 0.1N T 0.4N (T is number of nonzero elements in sparse code y).For max pooling, we use pooling factors M in 10s.</p><p>In supervised learning, we have trained linear multiclassSVM and softmax classifiers, using max pooled sparse codesas features. In our results, we report the best average perfor-mance of the two. We have exhaustively built the lookuptable of reference geocoordinates for k-NN. This results upto 7,500 entries (depending on how much of labeled datasetis allowed for supervised learning). In reducing predictionerrors, we find k-NN works well for 20 k 30. We usehigher k (200 k 250) with voting-based grid selec-tion scheme. We have varied grid sizes suggested by Rolleret al. (2012) and decided on 5 for most purposes.Metrics. For geolocation, we use the mean and median dis-tance errors between the predicted and ground-truth geoco-ordinates in kilometers. We note that it is required to ap-proximate the great-circle distance between any two loca-tions on the surface of earth. We use the Haversine formula(1984). For classification tasks, we adopt multiclass classifi-cation accuracy as the evaluation metric.</p><p>Results and discussionGeolocation. In Table 1, we present the geolocation errorsof our baseline method and enhancements. We call our base-line method SC (sparse coding). Note that Raw meansk-NN search based on vectors wBW and wWC . We omitRaw for word-sequence embedding wWS because there isno logic in comparing word-sequence vectors of two dif-ferent documents. For our baseline with each enhancement,we have SC+PCA (PCA whitened patches), SC+regD(regional dictionary learning for sparse coding), SC+Raw(raw feature augmentation), SC+voting (voting-based gridselection), and SC+all (combining all enhancements).</p><p>584</p></li><li><p>Table 1: Geolocation errors. The values are mean (median) in km.Raw SC SC+PCA SC+regD SC+Raw SC+voting SC+all</p><p>Binary 1024 (632) 879 (722) 748 (...</p></li></ul>


View more >