Address Standardization with Latent Semantic Association
Authors : Honglei Guo, Huijia Zhu, Zhili Guo, XiaoXun Zhang, and Zhong Su
Publication : KDD’09
Advisor : Chia-Hui Chang
Presenter : Chia-Yi Huang
2010/08/12
2
Outline
Introduction
Related Works
Latent Semantic Association Method
Address Standardization Using LaSA Model and Informative Sampling
Experiments
Conclusions
3
Introduction
Motivation
Approaches
Related Works
4
Introduction
Address data are highly irregular.
◦ Most of them are generated by different people at different times.
Addresses should be converted to a standard, consistent format.
◦ Ex: “1101 Kitchawan Road, Route 134, Yorktown Heights, N.Y. 10598”
◦ [House No. : 1101], [Street : Kitchawan Road], [Route : Route 134], [City : Yorktown Heights], [State : N.Y.], [Zip : 10598]
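The target format in the example can be pictured as a labeled token sequence that is then merged into fields. The following is a minimal sketch; the tokenization, the per-token labels, and the helper name `group_fields` are assumptions made for illustration and are not taken from the slides.

```python
# Hypothetical token/label pairs for the example address on the slide.
LABELED_TOKENS = [
    ("1101", "House No."),
    ("Kitchawan", "Street"),
    ("Road", "Street"),
    ("Route", "Route"),
    ("134", "Route"),
    ("Yorktown", "City"),
    ("Heights", "City"),
    ("N.Y.", "State"),
    ("10598", "Zip"),
]


def group_fields(labeled_tokens):
    """Merge consecutive tokens that share a label into one (label, value) field."""
    fields = []
    for token, label in labeled_tokens:
        if fields and fields[-1][0] == label:
            # Same label as the previous token: extend that field's value.
            fields[-1] = (label, fields[-1][1] + " " + token)
        else:
            fields.append((label, token))
    return fields


STANDARDIZED = group_fields(LABELED_TOKENS)
```

Keeping one label per token is what lets the task be treated later in the deck as sequential classification.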
5
Introduction (cont.)
Latent semantic association (LaSA)
◦ Minimizes human effort and augments the size of the labeled training data set.
The address standardization model is learned from LaSA features and informative samples.
6
Latent Semantic Association Method
Virtual Context Document
Learning LaSA Model
7
Latent Semantic Association Model
To minimize human effort, we want to use ps(x, y) to approximate pt(x, y).
◦ X : feature space used to represent word instances.
◦ Y : set of semantic labels.
◦ ps(x, y), pt(x, y) : the underlying distributions of the labeled training data set and the target data set.
The LaSA model θs,t captures latent semantic association among words from the unlabeled domain data.
◦ Better augments the training data set.
◦ Enhances the estimate of the distribution so that it better approximates the real domain distribution.
8
Learning LaSA Model from Virtual Context Documents
Virtual Context Document
◦ Given a word xi, the virtual context document of xi, vdxi, collects the context feature sets of xi from every address sample that contains it.
◦ F(xi^sk) : context feature set of xi in the address sample sk, 1 ≤ k ≤ n.
◦ n : total number of samples in the corpus that contain xi.
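One way to picture a virtual context document is as the concatenation of a word's context features over all samples that contain it. The sketch below assumes that context features are position-tagged neighbouring tokens within a window of 3 (the window size listed later in the deck); the feature encoding and the function names are illustrative assumptions, not the authors' exact definition.

```python
from collections import defaultdict


def context_features(tokens, index, window=3):
    """Return position-tagged neighbours of tokens[index] inside the window."""
    features = []
    for offset in range(-window, window + 1):
        pos = index + offset
        if offset != 0 and 0 <= pos < len(tokens):
            features.append(f"{offset:+d}:{tokens[pos]}")
    return features


def build_virtual_documents(samples, window=3):
    """Map each word to the context features gathered from every sample containing it."""
    docs = defaultdict(list)
    for tokens in samples:
        for i, word in enumerate(tokens):
            docs[word].extend(context_features(tokens, i, window))
    return dict(docs)


# Toy corpus of two tokenized address samples (made up for illustration).
samples = [
    ["1101", "Kitchawan", "Road"],
    ["25", "Main", "Road"],
]
virtual_docs = build_virtual_documents(samples, window=3)
```

Here the word "Road" occurs in both samples, so its virtual document pools the context from both of them.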
9
Learning LaSA Model from Virtual Context Documents (cont.)
Given vdxi = {f1, …, fj, …, fm}:
Weight(fj, xi) = log2 { P(fj, xi) / (P(fj) P(xi)) }
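The weight above is the pointwise mutual information between a context feature and a word. A minimal sketch follows, assuming the probabilities are estimated from raw co-occurrence counts over all virtual context documents; the estimation scheme, the toy data, and the name `pmi_weights` are assumptions.

```python
import math
from collections import Counter


def pmi_weights(virtual_docs):
    """Compute Weight(f, x) = log2(P(f, x) / (P(f) * P(x))) from co-occurrence counts."""
    pair_counts = Counter()
    feature_counts = Counter()
    word_counts = Counter()
    for word, features in virtual_docs.items():
        for feature in features:
            pair_counts[(feature, word)] += 1
            feature_counts[feature] += 1
            word_counts[word] += 1
    total = sum(pair_counts.values())
    weights = {}
    for (feature, word), count in pair_counts.items():
        p_joint = count / total
        p_feature = feature_counts[feature] / total
        p_word = word_counts[word] / total
        weights[(feature, word)] = math.log2(p_joint / (p_feature * p_word))
    return weights


# Toy virtual context documents (made up for illustration).
toy_docs = {
    "Road": ["prev:Main", "prev:Main"],
    "Street": ["prev:Main", "prev:Elm"],
}
weights = pmi_weights(toy_docs)
```

A feature that appears only with one word (here "prev:Elm" with "Street") receives a positive weight, while a feature spread across many words is weighted down.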
10
Learning LaSA Model
Latent Dirichlet allocation (LDA) imposes a Dirichlet distribution on the topic mixture weights corresponding to the documents in the corpus.
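For reference, the standard LDA generative process can be written as below. The symbols α (Dirichlet prior) and β (per-topic feature distributions) follow common LDA notation and do not appear on the slides; in this setting each document d would be a virtual context document.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic mixture weights of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assigned to the } n\text{-th feature of } d \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
  && \text{feature drawn from the chosen topic}
\end{align*}
```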
11
Learning LaSA Model (cont.)
12
Address Standardization Using LaSA Model and Informative Sampling
RRM Classifier
Latent Semantic Association Feature
Informative Sampling
13
Address Standardization Using LaSA Model
Address standardization is viewed as a sequential classification problem.
◦ Employs the Robust Risk Minimization (RRM) classifier.
Latent Semantic Association Feature
◦ Frequency : 10
◦ Number of topics N : 50
◦ Context view window size : {-3, 3}
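A rough sketch of how a latent topic could enter the classifier's feature set is shown below. It assumes each word is mapped to its most probable topic and that this topic id is added for the current token and for its neighbours in the {-3, 3} window; the slides do not spell out this encoding, so the functions, the feature strings, and the three-topic toy table (the deck uses N = 50) are all hypothetical.

```python
def lasa_feature(word, topic_probs, unknown="LaSA=UNK"):
    """Return a feature naming the word's most probable latent topic."""
    probs = topic_probs.get(word)
    if not probs:
        return unknown
    best_topic = max(range(len(probs)), key=probs.__getitem__)
    return f"LaSA=topic{best_topic}"


def token_features(tokens, index, topic_probs, window=3):
    """Features for one token: the word itself, its topic, and neighbours' topics."""
    features = [
        f"word={tokens[index]}",
        lasa_feature(tokens[index], topic_probs),
    ]
    for offset in range(-window, window + 1):
        pos = index + offset
        if offset != 0 and 0 <= pos < len(tokens):
            features.append(f"{offset:+d}:{lasa_feature(tokens[pos], topic_probs)}")
    return features


# Hypothetical word-to-topic distributions with 3 topics.
TOPIC_PROBS = {
    "Road": [0.1, 0.8, 0.1],
    "Street": [0.2, 0.7, 0.1],
    "Yorktown": [0.9, 0.05, 0.05],
}
feats = token_features(["Kitchawan", "Road", "Yorktown"], 1, TOPIC_PROBS)
```

In this toy table "Road" and "Street" fall into the same topic, which is the kind of association that lets a model generalize from one word to the other with less labeled data.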
14
Informative Sampling
The informative sample selection method uses a variant of uncertainty sampling.
The more uncertain fragments a sample contains, the more informative the sample is.
Given an address sample Si = {tokj}, j = 1, …, N,
◦ tokj : jth token unit in Si
The confidence score of Si is defined in terms of:
◦ Score(tokj) : confidence score of tokj in Si
◦ TokNum(Si) : total number of token units in Si
◦ UncNum(Si) : the number of uncertain units in Si
Token units with a low confidence score (i.e. Score(tokj) ≤ α) are considered uncertain units.
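The selection idea can be sketched as follows. The threshold test Score(tokj) ≤ α comes from the slide, but the ranking criterion used here, UncNum(Si) / TokNum(Si), is only a stand-in for the sample confidence score, which is not reproduced in this text; the value of α, the toy scores, and the function names are likewise assumptions.

```python
def uncertain_units(scores, alpha):
    """UncNum: number of token units whose confidence score is at most alpha."""
    return sum(1 for score in scores if score <= alpha)


def uncertainty_ratio(scores, alpha):
    """UncNum / TokNum, used here as an assumed proxy for informativeness."""
    if not scores:
        return 0.0
    return uncertain_units(scores, alpha) / len(scores)


def select_informative(samples, alpha, k):
    """Return the ids of the k samples with the highest share of uncertain units."""
    ranked = sorted(
        samples,
        key=lambda sample_id: uncertainty_ratio(samples[sample_id], alpha),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical per-token confidence scores for three address samples.
token_scores = {
    "s1": [0.95, 0.90, 0.92],
    "s2": [0.40, 0.85, 0.30, 0.99],
    "s3": [0.55, 0.97, 0.88],
}
picked = select_informative(token_scores, alpha=0.6, k=2)
```

Samples selected this way would be sent for annotation first, since they contain the tokens the current model is least sure about.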
15
Informative Sampling (cont.)
16
Experiments
Data set
17
Experiments (cont.)
Performance Enhancement by LaSA model
◦ Relative F-measure enhancement
◦ Relative Error Reduction
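The two measures named above are commonly computed as in the sketch below. It assumes the usual definitions, with error taken as 1 − F; the slides do not state the exact formulas, and the numbers in the example are made up, not results from the paper.

```python
def relative_f_enhancement(f_baseline, f_new):
    """Relative gain in F-measure over the baseline."""
    return (f_new - f_baseline) / f_baseline


def relative_error_reduction(f_baseline, f_new):
    """Relative drop in error, with error assumed to be 1 - F."""
    err_baseline = 1.0 - f_baseline
    err_new = 1.0 - f_new
    return (err_baseline - err_new) / err_baseline


# Illustrative values only.
gain = relative_f_enhancement(0.80, 0.90)
reduction = relative_error_reduction(0.80, 0.90)
```

Under these definitions a small absolute F-measure gain on an already strong baseline can correspond to a large relative error reduction.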
18
Experiments (cont.)
Training Data Reduction by LaSA Feature
19
Cumulative impact of LaSA model and informative sampling
20
Cumulative impact of LaSA model and informative sampling (cont.)
21
Conclusions
The LaSA-Info method achieves more than a 45% reduction in error over the state-of-the-art RRM trained on the same material.
Compared to the supervised learning method, the present approach requires only 5% as much annotated data to achieve the same level of performance.