Overview of the 2014 ALTA Shared Task: Identifying Expressions of Locations in Tweets
Diego Molla (Macquarie University), Sarvnaz Karimi (CSIRO)
ALTA 2014, Melbourne, Australia
The 2013 Task The Tweet Data Kaggle in Class Evaluation Results
Contents
The 2014 ALTA Shared Task
The Tweet Data
Kaggle in Class
Evaluation Results
2014 ALTA Shared Task Diego Molla, Sarvnaz Karimi 2/21
The 2014 Shared Task
Task: Identify Expressions of Locations in Tweets
Categories: student, open
Prize: $500 (IBM Research Shared Task Student Prize)
Framework: Kaggle in Class
Student Category
- All members are university students.
- No members are full-time employed.
- No members have a PhD.
Open Category
- Any other teams.
Identify Expressions of Locations in Tweets
Tweet: France and Germany join the US and UK in advising their nationals in Libya to leave immediately http://bbc.in/1rVmrDJ
Location: France, Germany, US, UK, Libya

Tweet: Dutch investigators not going to MH17 crash site in eastern Ukraine due to security concerns, OSCE monitors say
Location: MH17 crash site, eastern Ukraine

Tweet: Seeing early signs of potential flash flooding with stationary storms near St. Marys, Tavistock, Cambridge #onstorm pic.twitter.com/BtogIxgQ5G
Location: St. Marys, Tavistock, Cambridge
Motivation
1. When people discuss events, often they mention the location.
2. In the case of emergencies, such locations are very useful.
3. Recommender systems can use location information to improve their recommendations.
http://rt.com/usa/new-jersey-flooded-sandy-575/
http://static.echonest.com/DukeListens/event_mapping_at_last_fm.html
Location Expressions in Tweets
What is a location?
Any specific mention of a country, city, suburb, or point of interest (POI).
- Macquarie Centre.
- Ryde Hospital.
Where can we find location mentions?
- In the text.
- In hashtags: #Australia.
- In URLs: http://abc.net.au/melbourne/.
- In mentions: @Australia.
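The four places listed above can be separated with a few regular expressions before any location tagging is attempted. The following is only a minimal sketch: the function name `split_tweet`, the patterns, and the sample tweet are illustrative assumptions and are not taken from the shared task materials.

```python
import re

# Simplified patterns (assumptions); real tweets may need more careful handling.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def split_tweet(tweet):
    """Separate a tweet into the four places where a location may appear."""
    urls = URL_RE.findall(tweet)
    rest = URL_RE.sub(" ", tweet)          # remove URLs before looking for # and @
    hashtags = HASHTAG_RE.findall(rest)    # hashtag bodies without the leading '#'
    mentions = MENTION_RE.findall(rest)    # user names without the leading '@'
    rest = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", rest))
    return {
        "text": rest.split(),
        "hashtags": hashtags,
        "mentions": mentions,
        "urls": urls,
    }


# Hypothetical tweet built from the examples on this slide.
parts = split_tweet(
    "Floods near Ryde Hospital #Australia @Australia http://abc.net.au/melbourne/"
)
```

Each of the four lists can then be searched for location words separately, for example by splitting the URL path on `/`.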
Related Work
Named entity recognition in Twitter
- LabelledLDA for NER and PoS on tweets (Ritter et al. 2011).
- TwiNER: Unsupervised, using external sources (e.g. Wikipedia) for NER on tweets (Li et al. 2012).
Location extraction
- Twitcident: Using NER to identify location information on tweets (Abel et al. 2012).
- Ensemble classifiers to predict home locations of tweets (Mahmud et al. 2012).
- NER tools, used out of the box vs. re-trained on tweets (Lingad et al. 2013).
Tweet Collection
Source
- From Lingad et al. (2013).
- Tweets from late 2010 to late 2012.
- Augmented with additional tweets.
- Several annotations; only location mentions were used for the ALTA shared task.
Size
- Originally, 3,220 tweets.
- Available for the ALTA shared task: 3,047.
- After removing duplicates: 3,003.
Data Contents
Data for training and development
- Tweet IDs.
- Location mentions.
- Tweet download script.
Copyright restrictions
- Twitter does not allow the distribution of tweets.
- The shared task participants were asked to download the tweets themselves.
- Depending on the network status and changes by Twitter and Twitter users, specific tweets might not be available for download.
Data Format
Format of location mentions
- All multi-word terms split into their single words.
- Word duplicates are numbered.
- All punctuation marks are removed, including #.
- Words are lowercased.
- Data in a CSV file.
Examples
- Tweet ID1, france germany us uk libya
- Tweet ID2, australia australia2 australia3
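The formatting rules above can be sketched in a few lines of Python. This is an illustration only, not the organisers' script: the function name `encode_locations` is hypothetical, and it assumes that punctuation is deleted rather than replaced by a space and that duplicates are numbered in order of appearance starting from 2.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation mark, including '#'.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def encode_locations(mentions):
    """Turn a list of location mentions into the space-separated word format."""
    seen = Counter()
    words = []
    for mention in mentions:
        cleaned = mention.translate(_STRIP_PUNCT).lower()
        for word in cleaned.split():        # split multi-word terms
            seen[word] += 1
            # The first occurrence is kept as-is; later ones get a counter suffix.
            words.append(word if seen[word] == 1 else f"{word}{seen[word]}")
    return " ".join(words)


row1 = encode_locations(["France", "Germany", "US", "UK", "Libya"])
row2 = encode_locations(["#Australia", "Australia", "Australia"])
row3 = encode_locations(["St. Marys", "Tavistock", "Cambridge"])
```

Under these assumptions, `row1` and `row2` reproduce the two example rows above; each string would then be written next to its tweet ID in the CSV file.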
Kaggle in Class
Kaggle
- Kaggle offers a Web-based framework for data-driven competitions.
- A large base of potential participants.
- Potentially large prizes for the participants.
- Fee-based for the organisers; free for the participants.
Kaggle in Class
- Free for organisers and participants.
- Limited user support by Kaggle.
- Used by course-based competitions.
ALTA Shared Task in Kaggle in Class
Features of Kaggle in Class
- Public leaderboard: all participants can submit and compare with other participants.
- Automated evaluation: organisers can choose among several evaluation metrics.
- Public and private partitions: a private partition of the test data is withheld for the final ranking.
  - Public: 501 tweets.
  - Private: 502 tweets.
- Discussion forum: for communication among participants.
Evaluation Metric
Mean F1-Score
- Compute recall and precision of each individual word.
- This allows evaluation of partially correct location mentions.
$$F_1 = \frac{2pr}{p + r}$$
Example
- Target: senegal senegal2
- System output: senegal christchurch brighton
- p = 1/3
- r = 1/2
- F1 = 0.4
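The metric can be reproduced with a short Python sketch. This is not Kaggle's implementation: the function names are hypothetical, and the handling of empty targets and outputs (score 1.0 when both are empty, 0.0 when nothing matches) is an assumption.

```python
def f1_score(target, output):
    """Word-level F1 for one tweet; both arguments are space-separated words."""
    target_words = set(target.split())
    output_words = set(output.split())
    if not target_words and not output_words:
        return 1.0  # assumption: nothing to find and nothing predicted
    correct = len(target_words & output_words)
    if correct == 0:
        return 0.0  # avoids division by zero when p + r == 0
    p = correct / len(output_words)   # precision
    r = correct / len(target_words)   # recall
    return 2 * p * r / (p + r)


def mean_f1(targets, outputs):
    """Average the per-tweet F1 over all tweets, in matching order."""
    scores = [f1_score(t, o) for t, o in zip(targets, outputs)]
    return sum(scores) / len(scores)


# The example from the slide: one correct word out of three predicted.
example = f1_score("senegal senegal2", "senegal christchurch brighton")
```

Using sets is safe here because repeated words are numbered in the data format (senegal, senegal2), so every word in a row is distinct.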
Conclusions
- Kaggle in Class, a useful means to run the shared task.
- Few participants, but very active.
  - 168 runs in the combined 4 teams.
- Participants (read the Proceedings!) used a combination of:
  1. sequence labellers,
  2. feature engineering, and
  3. combined classifiers.
Results
Team        Category  Public  Private
MQ          Student   0.781   0.792
AUT NLP     Open      0.748   0.747
Yarra       Student   0.768   0.732
JK Rowling  Open      0.751   0.726