1

Click here to load reader

Towards an Active Learning System for Company Name Disambiguation in Microblog Streams

Embed Size (px)

DESCRIPTION

In this paper we describe the collaborative participation of UvA \& UNED at RepLab 2013. We propose an active learning approach for the filtering subtask, using features based on the detected semantics in the tweet (using Entity Linking with Wikipedia), as well as tweet-inherent features such as hashtags and usernames. The tweets manually inspected during the active learning process is at most 1\% of the test data. While our baseline does not perform well, we can see that active learning does improve the results.

Citation preview

Page 1: Towards an Active Learning System for Company Name Disambiguation in Microblog Streams

Active Learning for Filtering

Maria-Hendrike Peetz, Damiano Spina, Maarten de Rijke, Julio Gonzalo

ISLA, Informatics Institute University of Amsterdam http://ilps.science.uva.nl

Lessons Learnt

UvA and UNED at RepLab 2013

• Filtering models worked well on the trial data

• Active learning with 1% improves weak classifiers

• There was not enough training data for language dependent training.

NLP & IR Group UNED

http://nlp.uned.es

Combine expert knowledge with computational effort by asking the right questions.

Annotating few examples is ok.

@nokia: My phone stopped working.

Data changes over time.

@microsoft

2012

2013

1. Feature representation

2. Model Training/Update

4. Candidate Selection

5. User Feedback

Training DatasetModel

3. Classification

Test Dataset

1. Feature representation

new labeled

instance

Feature representation Model Training/Update

Candidate SelectionClassification

Sample tweets, annotate them, and update the model.

With a decent baseline, annotating 15 tweets per entity is enough.

1. Analysts need to keep up to date anyway.

2. They have the ultimate responsibility for the result.

Results

Margin/uncertainty sampling:

Selecting examples close to the margin means sampling examples where the classifier is less confident.

• Naive Bayes

• Tweet metadata (always)

• Entity linking of the tweets (BoE)

• Retrain NB with every new instance.

• Higher weight to newly annotated instances.

Select 1% of tweets for annotation:

On average that are 15 tweets per entity, in the language dependent case: 10 English, 5 Spanish

4.2 Submitted runs

Table 5 shows the results of our o�cial runs with respect to accuracy, reliability(R), sensitivity (S), and F(R,S), the F1 Measure of Reliability and Sensitivity. Wecan see that our baselines as well as the best performing system performs worsethan a simple baseline. The provided baseline selects the class of an instancebased on the class of the closest (using Jaccard similarity) instance in the trainingset.

We can, however, see some interesting improvements. For a start, active learn-ing helps. We can see that the use of 1% annotation improves the results for allfour metrics. Secondly, building a language-independent model performs betterthan building two language-dependent models per entity. Finally, we can see thatthe class imbalance also holds in the test set, as assigning the majority class forstrongly skewed data performs much better than using active learning alone.

Table 4. Results of the o�cial runs.

run id accuracy R S F(R,S)

baseline 0.8714 0.4902 0.3200 0.3255UvA UNED filtering 1 0.2785 0.1635 0.1258 0.0730UvA UNED filtering 2 0.2847 0.2050 0.1441 0.0928UvA UNED filtering 3 0.5657 0.2040 0.2369 0.1449UvA UNED filtering 4 0.6360 0.2386 0.2782 0.1857UvA UNED filtering 5 0.7745 0.6486 0.1833 0.1737UvA UNED filtering 6 0.8155 0.6780 0.2187 0.2083

Table 5. Results of the o�cial runs.

run id accuracy R S F(R,S)

Jaccard 0.8714 0.4902 0.3200 0.3255BoE+lang. independent 0.2785 0.1635 0.1258 0.0730BoE+lang. dependent 0.2847 0.2050 0.1441 0.0928BoE+lang. independent + AL 0.5657 0.2040 0.2369 0.1449BoE+lang. dependent + AL 0.6360 0.2386 0.2782 0.1857BoE+lang. independent + AL + majority 0.7745 0.6486 0.1833 0.1737BoE+lang. dependent + AL + majority 0.8155 0.6780 0.2187 0.2083

5 Conclusions

We have presented an active learning approach to company name disambigua-tion in tweets. For this classification task, we found that active learning does

? ? ? ? ?

1 1

0 0

Task: Identify tweets that are relevant to a company.

[email protected] [email protected]