Active Learning for Entity Filtering in Microblog Streams
Acknowledgments: This research was supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement nr 312827 (VOX-Pol), the Netherlands Organisation for Scientific Research (NWO) under project nrs 727.011.005, 612.001.116, HOR-11-10, 640.006.013, 612.066.930, CI-14-25, SH-322-15, Amsterdam Data Science, the Dutch national program COMMIT, the ESF Research Network Program ELIAS, the Elite Network Shifts project funded by the Royal Dutch Academy of Sciences (KNAW), the Netherlands eScience Center under project nr 027.012.105, the Yahoo! Faculty Research and Engagement Program, the Microsoft Research PhD program and the HPC Fund.
38th International ACM SIGIR Conference on Research and Development in Information Retrieval Santiago, Chile. August 9–13, 2015
Results

Passive Learning Baseline

Run                          Accuracy  F(R,S)
Best system at RepLab 2013   0.91      0.49
Passive SVM                  0.92      0.46
Effectiveness of Active Learning
Initial Training Reduction
Margin Sampling performs significantly better than all density-based approaches, which are sensitive to outliers
Conclusions
• An active learning scenario for entity filtering is feasible
• Less annotation is needed when annotation is done on the fly
• Comparing state-of-the-art sampling methods, margin sampling works best
Code available at http://damiano.github.io/al-ef
Setup

• Binary classification problem
• Support vector machines
• Simulated feedback from ground truth
• RepLab 2013 dataset
• Evaluation metrics: Accuracy, Reliability & Sensitivity

http://nlp.uned.es/replab2013
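A minimal sketch of this setup, with toy tweets and plain TF-IDF features rather than the RepLab 2013 data (labels and texts below are invented for illustration):

```python
# Sketch: binary SVM classification of tweets as related/unrelated to an entity.
# Toy data; the actual RepLab 2013 setup uses richer features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

tweets = [
    "new update released for the app",     # related (hypothetical label)
    "eating an apple pie at home",         # unrelated
    "great keynote by the company today",  # related
    "pie recipes for the weekend",         # unrelated
]
labels = [1, 0, 1, 0]  # 1 = related to the entity of interest, 0 = unrelated

vec = TfidfVectorizer()
X = vec.fit_transform(tweets)        # feature representation
clf = LinearSVC().fit(X, labels)     # train the SVM

print(clf.predict(vec.transform(["another app update is out"]))[0])
```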
Damiano Spina [email protected]
Maria-Hendrike Peetz [email protected]
Maarten de Rijke [email protected]
[Figure: % of manually annotated test data vs. % of training data, for Random Sampling (left) and Margin Sampling (right)]
The cost of training the initial model can be substantially reduced
Active Learning

1. Feature representation (training and test datasets)
2. Model training/update (on the training dataset)
3. Classification (of the test dataset)
4. Candidate selection
5. User feedback (✓/✗); each newly labeled instance is added to the training dataset
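A minimal sketch of this active learning loop with scikit-learn, using margin sampling for candidate selection and simulating user feedback from ground-truth pool labels (function and variable names are illustrative, not the released code):

```python
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_loop(X_seed, y_seed, X_pool, y_pool, budget=10):
    """Iteratively grow the training set: train, score the pool,
    pick the least-certain instance, and add its (simulated) label."""
    X_l, y_l = list(X_seed), list(y_seed)
    remaining = list(range(len(X_pool)))
    clf = LinearSVC()
    for _ in range(budget):
        clf.fit(np.asarray(X_l), np.asarray(y_l))          # 2. model training/update
        margins = np.abs(clf.decision_function(X_pool[remaining]))  # 3. classification
        pick = remaining[int(np.argmin(margins))]          # 4. margin-based candidate selection
        X_l.append(X_pool[pick])                           # 5. feedback simulated from ground truth
        y_l.append(y_pool[pick])
        remaining.remove(pick)
    clf.fit(np.asarray(X_l), np.asarray(y_l))
    return clf
```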
• Random Sampling (RS)
• Margin Sampling (MS)
• Margin*Density
• MS + reranking based on density
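These selection strategies can be sketched as scoring functions over the unlabeled pool. The density definition below (mean cosine similarity to the rest of the pool) is one common choice and an assumption here, not necessarily the paper's exact formulation:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def margin(clf, pool):
    # Distance to the SVM decision boundary; small = uncertain.
    return np.abs(clf.decision_function(pool))

def density(pool):
    # Mean cosine similarity to the rest of the pool; high = representative.
    return cosine_similarity(pool).mean(axis=1)

def select_random(pool, rng):
    # Random Sampling (RS): uniform choice from the pool.
    return int(rng.integers(len(pool)))

def select_margin(clf, pool):
    # Margin Sampling (MS): pick the instance closest to the boundary.
    return int(np.argmin(margin(clf, pool)))

def select_margin_density(clf, pool):
    # Margin*Density: prefer uncertain instances that also lie in dense regions.
    return int(np.argmin(margin(clf, pool) / (density(pool) + 1e-12)))
```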
User feedback for updating the classification model
The Entity Filtering Task
Filter out tweets that are not related to a given entity of interest
Margin Sampling significantly outperforms Random Sampling