geolocation twitter network text geotagging

Twitter User Geolocation Using a Unified Text and Network Prediction Model

Afshin Rahimi, Trevor Cohn and Timothy Baldwin
Department of Computing and Information Systems, The University of Melbourne

OVERVIEW

Task: Where does @ShvwnK live?

Input: user, concatenated tweet text, mention-list

Output: latitude/longitude (known for training users, predicted for test users)

Datasets: three Twitter geolocation datasets (number of users in parentheses): GeoText (9.5K), Twitter-US (450K) and Twitter-World (1.4M).

TEXT-BASED MODEL

Logistic regression with ℓ1 regularisation over a k-d tree discretisation of latitude/longitude.

[Figures: top features of NYC; use of “upstate” in the U.S.]
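The discretisation step can be illustrated with a minimal sketch. It assumes a k-d tree built by recursive median splits that alternate between latitude and longitude, stopping once a region holds at most `bucket_size` users; the function name `kd_discretise` and the `bucket_size` parameter are hypothetical and not taken from the poster. The resulting region ids would then serve as class labels for the logistic regression.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def kd_discretise(points: List[Point], bucket_size: int) -> List[int]:
    """Assign each point a region id using recursive median splits.

    Hypothetical helper: the poster does not specify how the k-d tree
    is built, so alternating axes and median splits are assumptions.
    """
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")

    labels = [0] * len(points)
    next_label = 0

    def split(indices: List[int], depth: int) -> None:
        nonlocal next_label
        # Small enough: this set of points becomes one region (one class).
        if len(indices) <= bucket_size:
            for i in indices:
                labels[i] = next_label
            next_label += 1
            return
        axis = depth % 2  # 0 = latitude, 1 = longitude
        ordered = sorted(indices, key=lambda i: points[i][axis])
        mid = len(ordered) // 2
        split(ordered[:mid], depth + 1)
        split(ordered[mid:], depth + 1)

    split(list(range(len(points))), 0)
    return labels


# Two users near New York and two near Los Angeles (illustrative values).
users = [(40.7, -74.0), (40.8, -73.9), (34.0, -118.2), (34.1, -118.3)]
regions = kd_discretise(users, bucket_size=2)
```

Because splits follow the median, densely populated areas end up with smaller regions than sparse ones, which is the usual motivation for a k-d tree over a uniform grid.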

NETWORK-BASED MODEL

Label propagation in a collapsed network:

• Build the graph using @-mentions.

• Use training nodes as seed (labelled samples).

• Infer the test labels by Modified Adsorption (Talukdar and Crammer, 2009).

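$$
\arg\min_{\hat{Y}} \; c(\hat{Y}) = \sum_{l} \Big[ \mu_1 \underbrace{(Y_l - \hat{Y}_l)^{T} S \, (Y_l - \hat{Y}_l)}_{\text{match seed}} + \mu_2 \underbrace{\hat{Y}_l^{T} L \, \hat{Y}_l}_{\text{smooth labels}} \Big]
$$

[Figure: label propagation example with values 0.7, 0.5 and 0.01 and a “new label estimate” annotation.]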
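The two terms of the objective (match the seed labels, smooth labels across edges) can be sketched as a simple iterative update. This is a simplified illustration, not full Modified Adsorption: it omits the per-node random-walk probabilities and the regularisation toward a dummy label that the original algorithm uses. It assumes that S is 1 for seed nodes and 0 otherwise and that L is the graph Laplacian of a symmetric weighted graph; the function name `propagate` and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (node_a, node_b, weight), undirected


def propagate(
    edges: List[Edge],
    seeds: Dict[str, int],
    n_labels: int,
    mu1: float = 1.0,
    mu2: float = 1.0,
    iterations: int = 50,
) -> Dict[str, List[float]]:
    """Jacobi-style updates for a two-term (seed + smoothness) objective."""
    neighbours: Dict[str, List[Tuple[str, float]]] = {}
    for a, b, w in edges:
        neighbours.setdefault(a, []).append((b, w))
        neighbours.setdefault(b, []).append((a, w))
    for node in seeds:
        neighbours.setdefault(node, [])

    # Seeds start at their known label; all other nodes start at zero.
    scores = {node: [0.0] * n_labels for node in neighbours}
    for node, label in seeds.items():
        scores[node][label] = 1.0

    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbours.items():
            total = [0.0] * n_labels
            norm = 0.0
            if node in seeds:  # "match seed" term, weighted by mu1
                total[seeds[node]] += mu1
                norm += mu1
            for other, w in nbrs:  # "smooth labels" term, weighted by mu2
                norm += mu2 * w
                for k in range(n_labels):
                    total[k] += mu2 * w * scores[other][k]
            updated[node] = [t / norm for t in total] if norm > 0 else scores[node]
        scores = updated
    return scores


# Test user "x" is strongly tied to a user in region 0, weakly to region 1.
result = propagate(
    edges=[("a", "x", 0.7), ("b", "x", 0.01)],
    seeds={"a": 0, "b": 1},
    n_labels=2,
)
predicted = max(range(2), key=lambda k: result["x"][k])
```

Each update sets a node's score to a weighted average of its seed label (if any) and its neighbours' current scores, which is the stationary condition of the two-term objective for that node.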

FROM @-MENTION TO COLLAPSED NETWORK

[Figure: the @-mention network and the collapsed network with text dongle nodes. Legend: labelled nodes, unlabelled nodes, mentioned nodes, text dongle nodes, celebrity.]

UNIFIED MODEL: NETWORK & TEXT

• For connected users, network-based models are more accurate.

• For disconnected users (about 20% of the nodes), text-based models are more accurate.

• Solution: Utilise both text and network!

• For each test node, attach a text dongle node carrying text-based predictions.

• Add the text dongle nodes to seed nodes (like training nodes).

• Use Modified Adsorption to infer the labels.
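The dongle construction in the bullets above can be sketched as a small graph transformation. The function name `attach_text_dongles`, the `::text` naming scheme and the default edge weight of 1.0 are all assumptions made for illustration; the poster does not state how dongle edges are weighted.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (node_a, node_b, weight)


def attach_text_dongles(
    edges: List[Edge],
    seeds: Dict[str, int],
    text_predictions: Dict[str, int],
    weight: float = 1.0,  # assumed; not specified in the source
) -> Tuple[List[Edge], Dict[str, int]]:
    """Give each test user a seed neighbour holding its text-based label."""
    new_edges = list(edges)
    new_seeds = dict(seeds)
    for user, label in text_predictions.items():
        dongle = f"{user}::text"  # hypothetical naming scheme
        new_edges.append((user, dongle, weight))
        new_seeds[dongle] = label  # dongles act like training nodes
    return new_edges, new_seeds


# "t2" has no @-mention edges, so only its dongle connects it to a label.
graph_edges = [("train1", "t1", 1.0)]
graph_seeds = {"train1": 3}
all_edges, all_seeds = attach_text_dongles(
    graph_edges, graph_seeds, text_predictions={"t1": 3, "t2": 7}
)
```

With this augmentation a disconnected test user still receives a label during propagation, because its only neighbour is the dongle that carries the text-based prediction.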

“CELEBRITIES” DON’T GEOLOCATE

• “Celebrities” (highly mentioned users) are connected from everywhere.

• They connect lots of people.

• Solution: Remove users with more than T mentions.

• Results in sparser graphs (tractable inference) and more accurate geolocation.
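The celebrity filter can be sketched as follows. The sketch assumes that "mentions" are counted as the number of distinct users who mention a handle; the poster does not say whether repeated mentions by the same user count separately. The function name `remove_celebrities` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Mention = Tuple[str, str]  # (author, mentioned_handle)


def remove_celebrities(
    mentions: List[Mention], threshold: int
) -> Tuple[List[Mention], Set[str]]:
    """Drop every handle mentioned by more than `threshold` distinct users."""
    mentioners: Dict[str, Set[str]] = defaultdict(set)
    for author, mentioned in mentions:
        mentioners[mentioned].add(author)

    celebrities = {
        handle for handle, who in mentioners.items() if len(who) > threshold
    }
    # Keep only edges that do not touch a celebrity at either end.
    kept = [
        (author, mentioned)
        for author, mentioned in mentions
        if author not in celebrities and mentioned not in celebrities
    ]
    return kept, celebrities


raw = [("u1", "star"), ("u2", "star"), ("u3", "star"), ("u1", "u2")]
kept_edges, dropped = remove_celebrities(raw, threshold=2)
```

In this toy input, `star` is mentioned by three distinct users and is removed with T = 2, leaving only the edge between `u1` and `u2`.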

TUNING T (TWITTER-US)

[Figure: mean error (in km; left axis, 700 to 860) and graph size (# edges; right axis, 10^5 to 10^9) plotted against the celebrity threshold T (# of mentions; 2, 5, 15, 50, 500, 5k).]

Decreasing T results in: sparser graph, lower mean error.

RESULTS

State-of-the-art results over all three datasets!

[Figure: mean error (km; axis from 600 to 1600) on GeoText, Twitter-US and Twitter-World, ordered from smaller to larger dataset. Series: Network-based Model (this work); Unified Model (this work); Network-based: Rahimi et al. (NAACL 2015); Text-based: Rahimi et al. (NAACL 2015); Text-based: Wing and Baldridge (EMNLP 2014); Text-based: Cha et al. (ICWSM 2015).]
