Interactive Deduplication using Active Learning
Sunita Sarawagi and Anuradha Bhamidipaty
Presented by Doug Downey


Page 1: Interactive Deduplication using Active Learning

Interactive Deduplication using Active Learning

Sunita Sarawagi and Anuradha Bhamidipaty

Presented by Doug Downey

Page 2: Interactive Deduplication using Active Learning

Active Learning for de-duplication

• De-duplication systems try to learn a function:

• Where D is the data set.
  – f is learned using a labeled training data set.
  – Normally, D is large, so many sets Lp are possible.

• Choosing a representative & useful Lp is hard.

• Instead of a fixed set Lp, in Active Learning the learner interactively chooses pairs from D × D to be labeled and added to Lp.

f : D × D → {dup, nondup}

Lp ⊂ D × D

Page 3: Interactive Deduplication using Active Learning

The ALIAS de-duplicator

• Input

– Set Dp of pairs of data records represented as feature vectors (features might include edit distance, soundex, etc.).

– Initial set Lp of some elements of Dp labeled as duplicates or non-duplicates.
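To make the feature-vector input concrete, here is a minimal Python sketch of turning a record pair into features. The specific measures (difflib's similarity ratio standing in for approximate edit distance, word overlap, 3-gram overlap) are illustrative stand-ins, not ALIAS's exact feature set.

```python
# Illustrative pair-to-feature-vector sketch; not the paper's exact features.
from difflib import SequenceMatcher

def ngrams(s, n=3):
    # character n-grams of s (whole string if shorter than n)
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def pair_features(a, b):
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    grams_a, grams_b = ngrams(a.lower()), ngrams(b.lower())
    return [
        SequenceMatcher(None, a, b).ratio(),                       # approx. edit similarity
        len(words_a & words_b) / max(len(words_a | words_b), 1),   # word overlap
        len(grams_a & grams_b) / max(len(grams_a | grams_b), 1),   # 3-gram overlap
    ]

f = pair_features("Sunita Sarawagi", "S. Sarawagi")
```

Each element lies in [0, 1]; a classifier over such vectors is what ALIAS trains.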

• Set T = Lp. Loop until user satisfaction:
  – Train classifier C using T.
  – Use C to choose a set S of instances from Dp for labeling.
  – Get labels for S from user, and set T = T ∪ S.
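The ALIAS loop can be sketched as runnable Python. Everything here is a toy stand-in: the "classifier" is a 1-D threshold over pair-similarity scores, `oracle` plays the user, and uncertainty is just distance from the threshold.

```python
# Toy ALIAS-style loop: threshold "classifier", user replaced by an oracle.
def train(T):
    # place the threshold halfway between the highest-scoring known
    # non-dup and the lowest-scoring known dup
    dups = [x for x, y in T if y == "dup"]
    nondups = [x for x, y in T if y == "nondup"]
    return (min(dups) + max(nondups)) / 2

def most_uncertain(Dp, threshold, labeled):
    # pick the unlabeled instance whose score is closest to the boundary
    return min((x for x in Dp if x not in labeled),
               key=lambda x: abs(x - threshold))

Dp = [0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9]   # pair similarity scores
oracle = lambda x: "dup" if x >= 0.6 else "nondup"

T = [(0.1, "nondup"), (0.9, "dup")]           # initial labeled set Lp
for _ in range(3):                            # "until user satisfaction"
    C = train(T)
    s = most_uncertain(Dp, C, {x for x, _ in T})
    T.append((s, oracle(s)))                  # T = T ∪ S, with |S| = 1
```

Each round the boundary tightens toward the true split at 0.6 while labeling only the most informative pairs.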

Page 4: Interactive Deduplication using Active Learning

The ALIAS de-duplicator

Page 5: Interactive Deduplication using Active Learning

Active Learning

• How do we choose the set S of instances to label?
• Idea: Choose most uncertain instances.

• We’re given that +’s and –’s can be separated by some point, and assume that the probability of + vs – varies linearly between labeled examples r and b.

• The point m is
  – maximally uncertain,
  – also the point that reduces our “confusion region” the most.
  – So choose m!
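The linear-probability argument can be checked numerically. The labeled positions r and b and the candidate points below are made up for illustration; uncertainty is measured as p(1−p), which peaks at p = 0.5, i.e. at the midpoint m.

```python
# If P(dup) rises linearly from 0 at the nearest labeled non-dup r to 1 at
# the nearest labeled dup b, uncertainty p*(1-p) peaks at the midpoint.
r, b = 0.2, 0.8                       # labeled non-dup and dup positions

def uncertainty(x):
    p = (x - r) / (b - r)             # linear P(dup) between r and b
    return p * (1 - p)

candidates = [0.3, 0.4, 0.5, 0.6, 0.7]
m = max(candidates, key=uncertainty)   # the midpoint between r and b
```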

Page 6: Interactive Deduplication using Active Learning

Measuring Uncertainty with Committees

• Train a committee of several slightly different versions of a classifier.

• Uncertainty(x) = entropy_committee(x)
• Form committees by
  – Randomizing model parameters
  – Partitioning training data
  – Partitioning attributes
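A minimal sketch of the committee idea: here, "slightly different versions of a classifier" are threshold classifiers with randomly jittered parameters, and an instance's uncertainty is the entropy of the committee's votes on it. The numbers are illustrative only.

```python
# Committee via parameter randomization; uncertainty = vote entropy.
import math, random

random.seed(0)
base_threshold = 0.5
committee = [base_threshold + random.uniform(-0.1, 0.1) for _ in range(5)]

def vote_entropy(x):
    votes = ["dup" if x >= t else "nondup" for t in committee]
    ent = 0.0
    for label in ("dup", "nondup"):
        p = votes.count(label) / len(votes)
        if p > 0:
            ent -= p * math.log2(p)
    return ent
```

Instances far from the boundary get unanimous votes (entropy 0); instances near it split the committee and score high.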

Page 7: Interactive Deduplication using Active Learning

Methods for Forming Committees

Page 8: Interactive Deduplication using Active Learning

Committee Size

Page 9: Interactive Deduplication using Active Learning

Representativeness of an Instance

• We need informative instances, not just uncertain ones.

• Solution: sample n of the kn most uncertain instances, weighted by uncertainty.
  – k = 1 → no sampling
  – kn = all data → full sampling
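The partial-sampling rule can be sketched as follows; the toy uncertainty function and instance pool are illustrative, not from the paper.

```python
# Partial sampling: keep the k*n most uncertain instances, then draw n of
# them with probability proportional to uncertainty, so the batch is both
# uncertain and representative of dense regions.
import random

random.seed(1)

def select_batch(instances, uncertainty, n=1, k=5):
    top = sorted(instances, key=uncertainty, reverse=True)[:k * n]
    weights = [uncertainty(x) for x in top]
    return random.choices(top, weights=weights, k=n)

instances = [i / 10 for i in range(11)]
batch = select_batch(instances, lambda x: 1 - abs(x - 0.5))
```

With k = 1 the `top` slice has exactly n elements (no sampling); with kn equal to the data size, every instance can be drawn (full sampling).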

• Why not use information gain?

Page 10: Interactive Deduplication using Active Learning

Sampling for Representativeness

Page 11: Interactive Deduplication using Active Learning

Evaluation – Different Classifiers

• Decision Trees & Naïve Bayes:
  – Committees of 5 via parameter randomization
• SVMs:
  – Uncertainty = distance from separator

• Start with one dup, one non-dup, add a new training example each round (n = 1), partial sampling (k = 5).

• Similarity functions: 3-gram match, % overlapping words, approx. edit distance, special handling of #s/nulls.

• Data sets:
  – Bibliography: 32131 citation pairs from Citeseer, 0.5% duplicates.
  – Address: 44850 pairs, 0.25% duplicates.
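The SVM notion of uncertainty as distance from the separator can be sketched with a fixed linear separator; the weights, bias, and instance pool below are made up for illustration.

```python
# Margin-based uncertainty for a linear separator w·x + b = 0:
# the least-certain instance is the one with smallest |w·x + b| / ||w||.
import math

w, b = [2.0, -1.0], 0.5               # illustrative separator

def distance(x):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))

pool = [[0.1, 0.9], [0.4, 1.3], [1.0, 0.2]]
most_uncertain = min(pool, key=distance)   # the point nearest the hyperplane
```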

Page 12: Interactive Deduplication using Active Learning

Evaluation – different classifiers

Page 13: Interactive Deduplication using Active Learning

Evaluation – different classifiers

Page 14: Interactive Deduplication using Active Learning

Value of Active Learning

Page 15: Interactive Deduplication using Active Learning

Value of Active Learning

Page 16: Interactive Deduplication using Active Learning

Example Decision Tree

Page 17: Interactive Deduplication using Active Learning

Conclusions

• Active Learning improves performance over random selection.
  – Uses two orders of magnitude less training data.
  – Note: not due just to change in +/– mix.

• In these experiments, Decision Trees outperformed SVMs and Naïve Bayes.