A Two-Stage Approach to Domain Adaptation for Statistical Classifiers
Jing Jiang & ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
What is domain adaptation?
Example: named entity recognition
[Figure: an NER classifier for persons, locations, organizations, etc., trained on labeled New York Times data and tested on unlabeled New York Times data; standard supervised learning achieves 85.5%.]
Example: named entity recognition
[Figure: the same NER task, but labeled New York Times data is not available; the classifier is trained on labeled Reuters data and tested on unlabeled New York Times data, achieving only 64.1% in this non-standard (realistic) setting.]
Domain difference → performance drop

  Setting     Train            Test             F1
  ideal       New York Times   New York Times   85.5%
  realistic   Reuters          New York Times   64.1%
Another NER example (gene name recognizer)

  Setting     Train   Test    F1
  ideal       mouse   mouse   54.1%
  realistic   fly     mouse   28.1%
Other examples

- Spam filtering: public email collection → personal inboxes
- Sentiment analysis of product reviews: digital cameras → cell phones; movies → books

Can we do better than standard supervised learning?
Domain adaptation: to design learning methods that are aware of the training and test domain difference.
How do we solve the problem in general?
Observation 1: domain-specific features

Fly gene names: wingless, daughterless, eyeless, apexless, …
- they describe phenotype, following fly gene nomenclature
- so the feature "-less" is weighted high

Other organisms use names like CD38, PABPC5, …
- is the feature still useful for them? No!
Observation 2: generalizable features

[Figure: example sentences, garbled in the original, in which a gene name X occurs in contexts like "… X is expressed in …" in each domain; only the gene names differ across organisms.]

The feature "X be expressed" is useful in every organism.
General idea: two-stage approach
[Figure: the feature space shared by the source domain and the target domain, split into generalizable features and domain-specific features.]
Goal
[Figure: the source domain, the target domain, and their shared feature space.]
Regular classification
[Figure: a classifier trained on the source domain over the full feature space.]
Generalization: to emphasize generalizable features in the trained model
[Figure: Stage 1, shown over the feature space shared by the source and target domains.]
Adaptation: to pick up domain-specific features for the target domain
[Figure: Stage 2, shown over the feature space shared by the source and target domains.]
Regular semi-supervised learning
[Figure: the source domain, the target domain, and their shared feature space.]
Comparison with related work
- We explicitly model generalizable features; previous work models them implicitly [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
- We do not need labeled target data, but we need multiple source (training) domains; some work requires labeled target data [Daumé III 2007].
- We have a 2nd stage of adaptation, which uses semi-supervised learning; previous work does not incorporate semi-supervised learning [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
Implementation of the two-stage approach with logistic regression classifiers
Logistic regression classifiers

[Figure: the text "… and wingless are expressed in …" is mapped to a vector x of p binary features (e.g. "-less", "X be expressed"), which is scored against a weight vector w_y for each class y.]

p(y | x; w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)
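As a sketch of the softmax formulation above, with a small hand-made weight matrix (the numbers and class layout are illustrative, not from the talk):

```python
import numpy as np

def predict_proba(x, W):
    """p(y | x; w) = exp(w_y^T x) / sum_{y'} exp(w_{y'}^T x)."""
    scores = W @ x                 # one score w_y^T x per class y
    scores -= scores.max()         # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# illustrative: 2 classes (gene vs. not-gene), 3 binary features,
# e.g. "-less", "X be expressed", and one other feature
W = np.array([[2.1, 0.4, -0.9],    # weight vector w_0
              [0.2, -0.3, 4.5]])   # weight vector w_1
x = np.array([1.0, 1.0, 0.0])      # the "-less" and "X be expressed" features fire
probs = predict_proba(x, W)        # a proper distribution over the classes
```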
Learning a logistic regression classifier

ŵ = argmin_w [ −(1/N) Σ_{i=1}^{N} log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) + λ ||w||² ]

- the first term is the (negative) log likelihood of the training data
- the second term is a regularization term: it penalizes large weights and controls model complexity
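The regularized objective above can be minimized by plain gradient descent; this is a minimal sketch, where the toy data, learning rate, and step count are illustrative assumptions, not the talk's actual training setup:

```python
import numpy as np

def nll_l2(W, X, y, lam):
    """(1/N) negative log likelihood + lam * ||W||^2, as in the slide."""
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean() + lam * (W ** 2).sum()

def fit(X, y, num_classes, lam=0.01, lr=0.5, steps=200):
    """Gradient descent on the L2-regularized logistic regression objective."""
    N, p = X.shape
    W = np.zeros((num_classes, p))
    for _ in range(steps):
        scores = X @ W.T
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)      # predicted p(y | x_i)
        Y = np.eye(num_classes)[y]             # one-hot labels
        W -= lr * ((P - Y).T @ X / N + 2 * lam * W)
    return W

# toy data: last column is a constant bias feature
X = np.array([[1., 0., 1.],
              [1., 1., 1.],
              [0., 1., 1.],
              [0., 0., 1.]])
y = np.array([0, 0, 1, 1])
W_hat = fit(X, y, num_classes=2)
```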
Generalizable features in weight vectors

[Figure: weight vectors w_1, w_2, …, w_K learned from K source domains D_1, D_2, …, D_K; the entries for generalizable features are similar across the domains, while the entries for domain-specific features vary.]
We want to decompose w in this way

[Figure: w is written as the sum of two vectors: one with non-zero entries only for the h generalizable features, and one carrying the remaining, domain-specific weights.]
Feature selection matrix A

[Figure: A is an h × p 0/1 matrix with a single 1 per row; z = Ax selects the h generalizable features of the feature vector x.]
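A minimal sketch of the selection matrix; the particular feature indices are made up for illustration:

```python
import numpy as np

def selection_matrix(generalizable_idx, p):
    """h x p 0/1 matrix with one 1 per row, so z = A @ x keeps only
    the chosen generalizable features of x."""
    A = np.zeros((len(generalizable_idx), p))
    for row, f in enumerate(generalizable_idx):
        A[row, f] = 1.0
    return A

p = 8                                   # total number of features
A = selection_matrix([2, 4, 7], p)      # h = 3 generalizable features
x = np.array([0, 1, 0, 0, 1, 0, 1, 0], dtype=float)
z = A @ x                               # = (x[2], x[4], x[7])
```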
Decomposition of w

[Figure: w^T x = v^T z + u^T x, where v holds the weights for the generalizable features (applied to z = Ax) and u holds the weights for the domain-specific features.]
Decomposition of w

w^T x = v^T z + u^T x
      = v^T (Ax) + u^T x
      = (A^T v)^T x + u^T x

⇒ w = A^T v + u
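The identity w = A^T v + u, and the equivalence of the two scoring forms, can be checked numerically; here A, v, u, and x are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
p, h = 8, 3
A = np.zeros((h, p))
A[[0, 1, 2], [2, 4, 7]] = 1.0          # selects generalizable features 2, 4, 7

v = rng.normal(size=h)                 # shared weights for generalizable features
u = rng.normal(size=p)                 # domain-specific weights
x = rng.integers(0, 2, size=p).astype(float)   # a binary feature vector

w = A.T @ v + u                        # the slide's decomposition
z = A @ x                              # generalizable part of x
lhs = w @ x                            # w^T x
rhs = v @ z + u @ x                    # v^T z + u^T x
```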
Decomposition of w: shared by all domains vs. domain-specific

[Figure: the same decomposition with the selection matrix A written out explicitly; the component A^T v is shared by all domains, while u is domain-specific.]

w = A^T v + u
Framework for generalization

Fix A, optimize:

(v̂, {û_k}) = argmin_{v, {u_k}} [ −(1/K) Σ_{k=1}^{K} (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k) + λ_s Σ_{k=1}^{K} ||u_k||² + λ ||v||² ]

- the likelihood term covers the labeled data from the K source domains
- λ_s >> 1: to penalize domain-specific features
Framework for adaptation

Fix A, optimize:

(v̂, û_t, {û_k}) = argmin_{v, u_t, {u_k}} [ −(1/K) Σ_{k=1}^{K} (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k) − (1/m) Σ_{i=1}^{m} log p(y_i^t | x_i^t; A, v, u_t) + λ_s Σ_{k=1}^{K} ||u_k||² + λ_t ||u_t||² + λ ||v||² ]

- the second likelihood term covers the m pseudo-labeled target-domain examples
- λ_t = 1 << λ_s: to pick up domain-specific features in the target domain
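The pseudo-labeled target-domain examples come from a bootstrapping step: classify the unlabeled target data with the current model and keep the most confident predictions. A sketch of that selection step (the shape of the data and the selection-by-top-m rule are illustrative; the paper gives the exact procedure):

```python
import numpy as np

def pseudo_label(proba, m):
    """Return the indices and predicted labels of the m target-domain
    examples the current model is most confident about."""
    confidence = proba.max(axis=1)          # max class probability per example
    labels = proba.argmax(axis=1)
    top = np.argsort(-confidence)[:m]       # m most confident examples
    return top, labels[top]

# illustrative class probabilities on 4 unlabeled target examples
proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.20, 0.80],
                  [0.60, 0.40]])
idx, lab = pseudo_label(proba, m=2)
```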
How to find A? (1)

- Joint optimization:

  (Â, v̂, {û_k}) = argmin_{A, v, {u_k}} [ −(1/K) Σ_{k=1}^{K} (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k) + λ_s Σ_{k=1}^{K} ||u_k||² + λ ||v||² ]

- Alternating optimization
How to find A? (2)

- Domain cross validation
  - Idea: train on (K − 1) source domains and test on the held-out source domain
  - Approximation:
    - w_f^k: weight for feature f learned from domain k
    - w_f^k̄: weight for feature f learned from the other domains
    - rank features by Σ_{k=1}^{K} w_f^k · w_f^k̄
  - See paper for details
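A sketch of the ranking score Σ_k w_f^k · w_f^k̄. How w_f^k̄ is computed is an assumption here (leave-one-out mean of the other domains' weights); see the paper for the exact definition:

```python
import numpy as np

def rank_generalizable(W):
    """Rank features by sum_k w_f^k * w_f^kbar, where w_f^kbar is
    approximated as the mean weight of f over the other K-1 domains."""
    K, p = W.shape
    scores = np.zeros(p)
    for k in range(K):
        w_bar = (W.sum(axis=0) - W[k]) / (K - 1)   # leave-one-out mean
        scores += W[k] * w_bar
    return np.argsort(-scores)                      # most generalizable first

# feature 0 ("expressed") is useful in every domain;
# feature 1 ("-less") is weighted high only in the fly domain
W = np.array([[1.5, 0.05],
              [1.8, 0.10],
              [1.2, 2.00]])
order = rank_generalizable(W)
```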
Intuition for domain cross validation

[Figure: weight vectors learned from domains D_1, …, D_{k−1} and from D_k (fly). The feature "expressed" receives a high weight in every domain (e.g. 1.5, 1.8), while "-less" receives a high weight only in the fly domain (2.0) and a near-zero weight elsewhere (0.05, 0.1).]
Experiments

- Data set
  - BioCreative Challenge Task 1B
  - Gene/protein name recognition
  - 3 organisms/domains: fly, mouse and yeast
- Experiment setup
  - 2 organisms for training, 1 for testing
  - F1 as performance measure
Experiments: Generalization
0.4700.1950.654DA-2(domain CV)
0.4250.1530.627DA-1 (joint-opt)
0.4160.1290.633BL
Y+F�
MM+Y�
FF+M�
YMethod
SourceDomain
TargetDomain
SourceDomain
TargetDomain using generalizable features is effective
domain cross validation is more effective than joint optimization
F: fly M: mouse Y: yeast
Experiments: Adaptation

  Method     F+M → Y   M+Y → F   Y+F → M
  BL-SSL     0.633     0.241     0.458
  DA-2-SSL   0.759     0.305     0.501

(F: fly, M: mouse, Y: yeast)

- domain-adaptive bootstrapping is more effective than regular bootstrapping
Experiments: Adaptation
domain-adaptive SSL is more effective, especially with a small number of pseudo labels
Conclusions and future work

- Two-stage domain adaptation
  - Generalization: outperformed standard supervised learning
  - Adaptation: outperformed standard bootstrapping
- Two ways to find generalizable features
  - Domain cross validation is more effective
- Future work
  - Single source domain?
  - Setting the parameters h and m
References
- S. Ben-David, J. Blitzer, K. Crammer & F. Pereira. Analysis of representations for domain adaptation. NIPS 2007.
- J. Blitzer, R. McDonald & F. Pereira. Domain adaptation with structural correspondence learning. EMNLP 2006.
- H. Daumé III. Frustratingly easy domain adaptation. ACL 2007.
Thank you!