Active Learning for Entity Filtering in Microblog Streams
Acknowledgments: This research was supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement nr 312827 (VOX-Pol), the Netherlands Organisation for Scientific Research (NWO) under project nrs 727.011.005, 612.001.116, HOR-11-10, 640.006.013, 612.066.930, CI-14-25, SH-322-15, Amsterdam Data Science, the Dutch national program COMMIT, the ESF Research Network Program ELIAS, the Elite Network Shifts project funded by the Royal Dutch Academy of Sciences (KNAW), the Netherlands eScience Center under project nr 027.012.105, the Yahoo! Faculty Research and Engagement Program, the Microsoft Research PhD program and the HPC Fund.
38th International ACM SIGIR Conference on Research and Development in Information Retrieval Santiago, Chile. August 9–13, 2015
Results

Passive Learning Baseline

Run                          Accuracy  F(R,S)
Best system at RepLab 2013   0.91      0.49
Passive SVM                  0.92      0.46
Effectiveness of Active Learning
Initial Training Reduction
Margin Sampling performs significantly better than all density-based approaches, which are sensitive to outliers
Conclusions
• An active learning scenario for entity filtering is feasible
• Less annotation is needed when annotation is done on the fly
• Comparing state-of-the-art sampling methods, margin sampling works best
Code available at http://damiano.github.io/al-ef
Setup

• Binary classification problem
• Support vector machines
• Simulated feedback from ground truth
• RepLab 2013 dataset
• Evaluation metrics: Accuracy, Reliability & Sensitivity

http://nlp.uned.es/replab2013
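A minimal sketch of this setup, with toy tweets and plain TF-IDF features rather than the RepLab 2013 data (labels and texts below are invented for illustration):

```python
# Sketch: binary SVM classification of tweets as related/unrelated to an entity.
# Toy data; the actual RepLab 2013 setup uses richer features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

tweets = [
    "new update released for the app",     # related (hypothetical label)
    "eating an apple pie at home",         # unrelated
    "great keynote by the company today",  # related
    "pie recipes for the weekend",         # unrelated
]
labels = [1, 0, 1, 0]  # 1 = related to the entity of interest, 0 = unrelated

vec = TfidfVectorizer()
X = vec.fit_transform(tweets)        # feature representation
clf = LinearSVC().fit(X, labels)     # train the SVM

print(clf.predict(vec.transform(["another app update is out"]))[0])
```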
Damiano Spina [email protected]
Maria-Hendrike Peetz [email protected]
Maarten de Rijke [email protected]
[Figure: % of manually annotated test data vs. % of training data, for Random Sampling (left) and Margin Sampling (right)]
The cost of training the initial model can be substantially reduced
Active Learning

1. Feature representation (training and test datasets)
2. Model training/update (on the training dataset)
3. Classification (of the test dataset)
4. Candidate selection
5. User feedback (✓/✗); each newly labeled instance is added to the training dataset
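A minimal sketch of this active learning loop with scikit-learn, using margin sampling for candidate selection and simulating user feedback from ground-truth pool labels (function and variable names are illustrative, not the released code):

```python
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_loop(X_seed, y_seed, X_pool, y_pool, budget=10):
    """Iteratively grow the training set: train, score the pool,
    pick the least-certain instance, and add its (simulated) label."""
    X_l, y_l = list(X_seed), list(y_seed)
    remaining = list(range(len(X_pool)))
    clf = LinearSVC()
    for _ in range(budget):
        clf.fit(np.asarray(X_l), np.asarray(y_l))          # 2. model training/update
        margins = np.abs(clf.decision_function(X_pool[remaining]))  # 3. classification
        pick = remaining[int(np.argmin(margins))]          # 4. margin-based candidate selection
        X_l.append(X_pool[pick])                           # 5. feedback simulated from ground truth
        y_l.append(y_pool[pick])
        remaining.remove(pick)
    clf.fit(np.asarray(X_l), np.asarray(y_l))
    return clf
```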
• Random Sampling (RS)
• Margin Sampling (MS)
• Margin*Density
• MS + reranking based on density
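These selection strategies can be sketched as scoring functions over the unlabeled pool. The density definition below (mean cosine similarity to the rest of the pool) is one common choice and an assumption here, not necessarily the paper's exact formulation:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def margin(clf, pool):
    # Distance to the SVM decision boundary; small = uncertain.
    return np.abs(clf.decision_function(pool))

def density(pool):
    # Mean cosine similarity to the rest of the pool; high = representative.
    return cosine_similarity(pool).mean(axis=1)

def select_random(pool, rng):
    # Random Sampling (RS): uniform choice from the pool.
    return int(rng.integers(len(pool)))

def select_margin(clf, pool):
    # Margin Sampling (MS): pick the instance closest to the boundary.
    return int(np.argmin(margin(clf, pool)))

def select_margin_density(clf, pool):
    # Margin*Density: prefer uncertain instances that also lie in dense regions.
    return int(np.argmin(margin(clf, pool) / (density(pool) + 1e-12)))
```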
User feedback for updating the classification model
The Entity Filtering Task
Filter out tweets that are not related to a given entity of interest
Margin Sampling significantly outperforms Random Sampling