Address Standardization with Latent Semantic Association

Authors : Honglei Guo, Huijia Zhu, Zhili Guo, XiaoXun Zhang, and Zhong Su

Publication : KDD’09

Advisor : Chia-Hui Chang

Presenter : Chia-Yi Huang

2010/08/12

Outline

Introduction
Related Works
Latent Semantic Association Method
Address Standardization Using LaSA Model and Informative Sampling
Experiments
Conclusions

Introduction

Motivation
Approaches
Related Works

Introduction

Address data are highly irregular.
◦ Most of them are generated by different people at different times.

Addresses should be converted to a standard, consistent format.
◦ Ex: “1101 Kitchawan Road, Route 134, Yorktown Heights, N.Y. 10598”
◦ [House No. : 1101], [Street : Kitchawan Road], [Route : Route 134], [City : Yorktown Heights], [State : N.Y.], [Zip : 10598]

Introduction (cont.)

Latent semantic association (LaSA)
◦ To minimize human effort and augment the size of the labeled training data set.

The address standardization model is learned from LaSA features and informative samples.

Latent Semantic Association Method

Virtual Context Document
Learning LaSA Model

Latent Semantic Association Model

To minimize human effort, we expect to use $p_s(x, y)$ to approximate $p_t(x, y)$.
◦ $X$ : feature space used to represent word instances.
◦ $Y$ : set of semantic labels.
◦ $p_s(x, y)$, $p_t(x, y)$ : the underlying distributions of the labeled training data set and the target data set.

The LaSA model $\theta_{s,t}$ captures latent semantic association among words from the unlabeled domain data.
◦ Better augments the training data set.
◦ Enhances the estimate of the distribution so that it better approximates the real domain distribution.

Learning LaSA Model from Virtual Context Documents

Virtual Context Document
◦ Given a word $x_i$, the virtual context document of $x_i$ is $vd_{x_i} = \{F(x_i^{s_1}), \ldots, F(x_i^{s_k}), \ldots, F(x_i^{s_n})\}$
◦ $F(x_i^{s_k})$ : context feature set of $x_i$ in the address sample $s_k$, $1 \le k \le n$.
◦ $n$ : total number of samples in the corpus that contain $x_i$.
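The construction of a virtual context document can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not list the exact context features, so the sketch assumes they are simply the neighbouring tokens tagged with their relative offset. The function names and the second toy address are hypothetical.

```python
from collections import defaultdict

def context_features(tokens, i, window=3):
    """Assumed context feature set F(x_i^{s_k}): neighbours of token i, tagged by offset."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats.append(f"w[{off:+d}]={tokens[j]}")
    return feats

def build_virtual_context_documents(samples, window=3):
    """Merge the context features of every occurrence of a word into one document per word."""
    vdocs = defaultdict(list)
    for tokens in samples:
        for i, word in enumerate(tokens):
            vdocs[word].extend(context_features(tokens, i, window))
    return dict(vdocs)

# Toy corpus: the first sample comes from the slides, the second is made up.
samples = [
    ["1101", "Kitchawan", "Road", "Yorktown", "Heights"],
    ["12", "Main", "Road", "Yorktown"],
]
vdocs = build_virtual_context_documents(samples, window=1)
print(vdocs["Road"])
```

Each word ends up with a single document that pools its contexts from all n samples in which it occurs, which is what lets a topic model later group words that appear in similar contexts.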

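Learning LaSA Model from Virtual Context Documents (cont.)

Given $vd_{x_i} = \{f_1, \ldots, f_j, \ldots, f_m\}$, each context feature $f_j$ is weighted by its pointwise mutual information with $x_i$:

$Weight(f_j, x_i) = \log_2 \frac{P(f_j, x_i)}{P(f_j) P(x_i)}$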
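The feature weight above can be computed from simple counts, as in the sketch below. The slides do not say how the probabilities are estimated, so this sketch assumes maximum-likelihood estimates over (feature, word) occurrences; `pmi_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def pmi_weights(vdocs):
    """Weight(f_j, x_i) = log2( P(f_j, x_i) / (P(f_j) * P(x_i)) ), estimated from counts."""
    pair_counts, feat_counts, word_counts = Counter(), Counter(), Counter()
    total = 0
    for word, feats in vdocs.items():
        for f in feats:
            pair_counts[(f, word)] += 1
            feat_counts[f] += 1
            word_counts[word] += 1
            total += 1
    weights = {}
    for (f, word), c in pair_counts.items():
        p_joint = c / total
        p_f = feat_counts[f] / total
        p_w = word_counts[word] / total
        weights[(f, word)] = math.log2(p_joint / (p_f * p_w))
    return weights

# Toy virtual context documents (made-up data).
vdocs = {
    "Road": ["w[-1]=Main", "w[+1]=Yorktown"],
    "Street": ["w[-1]=Main", "w[+1]=Boston"],
}
w = pmi_weights(vdocs)
print(w[("w[+1]=Yorktown", "Road")], w[("w[-1]=Main", "Road")])
```

A feature shared by many words, such as the preceding "Main" here, receives a lower weight than a feature that co-occurs with only one word.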

Learning LaSA Model

Latent Dirichlet allocation (LDA) imposes a Dirichlet distribution on the topic mixture weights corresponding to the documents in the corpus.
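For reference, the standard LDA generative process can be written as below. The symbols $\alpha$, $\beta$, $\theta_d$, $z_{d,n}$ and $w_{d,n}$ follow common LDA notation and are not taken from the slides; in the LaSA setting each document $d$ would presumably be the virtual context document of one word.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic mixture weights of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic of the } n\text{-th feature in } d \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
  && \text{observed context feature}
\end{align*}
```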


Learning LaSA Model (cont.)

Address Standardization Using LaSA Model and Informative Sampling

RRM Classifier
Latent Semantic Association Feature
Informative Sampling

Address Standardization Using LaSA Model

View address standardization as a sequential classification problem.
◦ Employs the Robust Risk Minimization (RRM) classifier.

Latent Semantic Association Feature
◦ Frequency : 10
◦ Number of topics N : 50
◦ Context view window size : {-3, 3}
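One way to picture the per-token input of such a sequential classifier is sketched below. It is an assumption-heavy illustration: the RRM classifier itself is not implemented, `token_features` is a hypothetical helper, and `word_topic` is a hypothetical lookup standing in for the latent topic that a trained LaSA model would associate with each word.

```python
def token_features(tokens, j, word_topic, window=3):
    """Features for token j: words in a [-window, +window] view plus their LaSA topics, if known."""
    feats = {}
    for off in range(-window, window + 1):
        k = j + off
        if 0 <= k < len(tokens):
            word = tokens[k]
            feats[f"word[{off:+d}]"] = word
            topic = word_topic.get(word)
            if topic is not None:
                feats[f"lasa[{off:+d}]"] = topic
    return feats

tokens = ["1101", "Kitchawan", "Road", ",", "Yorktown", "Heights"]
word_topic = {"Road": 7, "Heights": 12}  # hypothetical topic ids, not real model output
feats = token_features(tokens, 1, word_topic)
print(feats)
```

The classifier would then assign one label (House No., Street, City, and so on) to each token given such features; a word unseen in the labeled data can still contribute through its topic id.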

Informative Sampling

The informative sample selection method uses a variant of uncertainty sampling.

The more uncertain fragments a sample contains, the more informative the sample is.

Given an address sample $S_i = \{tok_j\}_{j=1}^{N}$,
◦ $tok_j$ : the jth token unit in $S_i$

The confidence score of $S_i$ is defined in terms of:
◦ $Score(tok_j)$ : confidence score of $tok_j$ in $S_i$
◦ $TokNum(S_i)$ : total number of token units in $S_i$
◦ $UncNum(S_i)$ : the number of uncertain units in $S_i$

Token units with a low confidence score (i.e. $Score(tok_j) \le \alpha$) are considered uncertain units.
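The selection idea can be sketched as below. The sample-level confidence formula is not given in the text above, so this sketch does not reproduce it: as a stated assumption, it ranks samples by the fraction of uncertain units, UncNum(Si) / TokNum(Si). All function names and scores are hypothetical.

```python
def uncertain_count(token_scores, alpha):
    """UncNum(S_i): number of token units whose confidence score is <= alpha."""
    return sum(1 for s in token_scores if s <= alpha)

def uncertainty_ratio(token_scores, alpha):
    """Assumed informativeness proxy: UncNum(S_i) / TokNum(S_i)."""
    if not token_scores:
        return 0.0
    return uncertain_count(token_scores, alpha) / len(token_scores)

def select_informative(samples, alpha, k):
    """Return the ids of the k samples with the largest share of uncertain units."""
    ranked = sorted(samples, key=lambda sid: uncertainty_ratio(samples[sid], alpha), reverse=True)
    return ranked[:k]

# Made-up per-token confidence scores for three unlabeled address samples.
scores = {
    "s1": [0.95, 0.90, 0.99],
    "s2": [0.40, 0.92, 0.35, 0.88],
    "s3": [0.55, 0.97, 0.91],
}
picked = select_informative(scores, alpha=0.6, k=2)
print(picked)
```

The selected samples would be sent for manual annotation and added to the training set, so annotation effort is spent where the current model is least sure.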


Informative Sampling (cont.)

Experiments

Data set

Experiments (cont.)

Performance Enhancement by LaSA model
◦ Relative F-measure enhancement
◦ Relative Error Reduction
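The two measures named above are not defined on the slide. The sketch below uses their common definitions as an assumption, with the error taken as 1 − F; the function names and the numbers are made up for illustration and are not results from the paper.

```python
def relative_f_enhancement(f_base, f_new):
    """(F_new - F_base) / F_base: relative gain in F-measure over the baseline."""
    return (f_new - f_base) / f_base

def relative_error_reduction(f_base, f_new):
    """(err_base - err_new) / err_base, with err = 1 - F (assumed definition)."""
    err_base = 1.0 - f_base
    err_new = 1.0 - f_new
    return (err_base - err_new) / err_base

# Made-up F-measures for a baseline and an improved model.
gain = relative_f_enhancement(0.80, 0.90)
reduction = relative_error_reduction(0.80, 0.90)
print(round(gain, 3), round(reduction, 3))
```

With these toy values the gain is roughly 12.5% while the error reduction is roughly 50%, which shows why error reduction is the more telling figure when the baseline F-measure is already high.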

Experiments (cont.)

Training Data Reduction by LaSA Feature


Cumulative impact of LaSA model and informative sampling


Conclusions

The LaSA-Info method achieves more than a 45% reduction in error over the state-of-the-art RRM trained on the same material.

Compared to the supervised learning method, the present approach requires only 5% as much annotated data to achieve the same level of performance.