A Two-Stage Approach to Domain Adaptation for Statistical Classifiers
Jing Jiang & ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
What is domain adaptation?
Example: named entity recognition
[Figure: an NER classifier for persons, locations, organizations, etc., trained on labeled New York Times data and tested on unlabeled New York Times data; standard supervised learning achieves 85.5%.]
Example: named entity recognition
[Figure: the same NER task, but labeled New York Times data is not available; the classifier is trained on labeled Reuters data and tested on unlabeled New York Times data, achieving only 64.1% in this non-standard (realistic) setting.]
Domain difference → performance drop

  Setting     Train            Test             F1
  ideal       New York Times   New York Times   85.5%
  realistic   Reuters          New York Times   64.1%
Another NER example (gene name recognizer)

  Setting     Train   Test    F1
  ideal       mouse   mouse   54.1%
  realistic   fly     mouse   28.1%
Other examples

- Spam filtering: public email collection → personal inboxes
- Sentiment analysis of product reviews: digital cameras → cell phones; movies → books

Can we do better than standard supervised learning?
Domain adaptation: to design learning methods that are aware of the training and test domain difference.
How do we solve the problem in general?
Observation 1: domain-specific features

Fly gene names: wingless, daughterless, eyeless, apexless, …
- they describe phenotype, following fly gene nomenclature
- so the feature "-less" is weighted high

Other organisms use names like CD38, PABPC5, …
- is the feature still useful for them? No!
Observation 2: generalizable features

[Figure: example sentences, garbled in the original, in which a gene name X occurs in contexts like "… X is expressed in …" in each domain; only the gene names differ across organisms.]

The feature "X be expressed" is useful in every organism.
General idea: two-stage approach
[Figure: the feature space shared by the source domain and the target domain, split into generalizable features and domain-specific features.]
Goal
[Figure: the source domain, the target domain, and their shared feature space.]
Regular classification
[Figure: a classifier trained on the source domain over the full feature space.]
Generalization: to emphasize generalizable features in the trained model
[Figure: Stage 1, shown over the feature space shared by the source and target domains.]
Adaptation: to pick up domain-specific features for the target domain
[Figure: Stage 2, shown over the feature space shared by the source and target domains.]
Regular semi-supervised learning
[Figure: the source domain, the target domain, and their shared feature space.]
Comparison with related work
- We explicitly model generalizable features; previous work models them implicitly [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
- We do not need labeled target data, but we need multiple source (training) domains; some work requires labeled target data [Daumé III 2007].
- We have a 2nd stage of adaptation, which uses semi-supervised learning; previous work does not incorporate semi-supervised learning [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
Implementation of the two-stage approach with logistic regression classifiers
Logistic regression classifiers

[Figure: the text "… and wingless are expressed in …" is mapped to a vector x of p binary features (e.g. "-less", "X be expressed"), which is scored against a weight vector w_y for each class y.]

p(y | x; w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)
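As a sketch of the softmax formulation above, with a small hand-made weight matrix (the numbers and class layout are illustrative, not from the talk):

```python
import numpy as np

def predict_proba(x, W):
    """p(y | x; w) = exp(w_y^T x) / sum_{y'} exp(w_{y'}^T x)."""
    scores = W @ x                 # one score w_y^T x per class y
    scores -= scores.max()         # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# illustrative: 2 classes (gene vs. not-gene), 3 binary features,
# e.g. "-less", "X be expressed", and one other feature
W = np.array([[2.1, 0.4, -0.9],    # weight vector w_0
              [0.2, -0.3, 4.5]])   # weight vector w_1
x = np.array([1.0, 1.0, 0.0])      # the "-less" and "X be expressed" features fire
probs = predict_proba(x, W)        # a proper distribution over the classes
```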
Learning a logistic regression classifier

ŵ = argmin_w [ −(1/N) Σ_{i=1}^{N} log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) + λ ||w||² ]

- the first term is the (negative) log likelihood of the training data
- the second term is a regularization term: it penalizes large weights and controls model complexity
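The regularized objective above can be minimized by plain gradient descent; this is a minimal sketch, where the toy data, learning rate, and step count are illustrative assumptions, not the talk's actual training setup:

```python
import numpy as np

def nll_l2(W, X, y, lam):
    """(1/N) negative log likelihood + lam * ||W||^2, as in the slide."""
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean() + lam * (W ** 2).sum()

def fit(X, y, num_classes, lam=0.01, lr=0.5, steps=200):
    """Gradient descent on the L2-regularized logistic regression objective."""
    N, p = X.shape
    W = np.zeros((num_classes, p))
    for _ in range(steps):
        scores = X @ W.T
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)      # predicted p(y | x_i)
        Y = np.eye(num_classes)[y]             # one-hot labels
        W -= lr * ((P - Y).T @ X / N + 2 * lam * W)
    return W

# toy data: last column is a constant bias feature
X = np.array([[1., 0., 1.],
              [1., 1., 1.],
              [0., 1., 1.],
              [0., 0., 1.]])
y = np.array([0, 0, 1, 1])
W_hat = fit(X, y, num_classes=2)
```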
Generalizable features in weight vectors

[Figure: weight vectors w_1, w_2, …, w_K learned from K source domains D_1, D_2, …, D_K; the entries for generalizable features are similar across the domains, while the entries for domain-specific features vary.]
We want to decompose w in this way

[Figure: w is written as the sum of two vectors: one with non-zero entries only for the h generalizable features, and one carrying the remaining, domain-specific weights.]
Feature selection matrix A

[Figure: A is an h × p 0/1 matrix with a single 1 per row; z = Ax selects the h generalizable features of the feature vector x.]
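A minimal sketch of the selection matrix; the particular feature indices are made up for illustration:

```python
import numpy as np

def selection_matrix(generalizable_idx, p):
    """h x p 0/1 matrix with one 1 per row, so z = A @ x keeps only
    the chosen generalizable features of x."""
    A = np.zeros((len(generalizable_idx), p))
    for row, f in enumerate(generalizable_idx):
        A[row, f] = 1.0
    return A

p = 8                                   # total number of features
A = selection_matrix([2, 4, 7], p)      # h = 3 generalizable features
x = np.array([0, 1, 0, 0, 1, 0, 1, 0], dtype=float)
z = A @ x                               # = (x[2], x[4], x[7])
```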
Decomposition of w

[Figure: w^T x = v^T z + u^T x, where v holds the weights for the generalizable features (applied to z = Ax) and u holds the weights for the domain-specific features.]
Decomposition of w

w^T x = v^T z + u^T x
      = v^T (Ax) + u^T x
      = (A^T v)^T x + u^T x

⇒ w = A^T v + u
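The identity w = A^T v + u, and the equivalence of the two scoring forms, can be checked numerically; here A, v, u, and x are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
p, h = 8, 3
A = np.zeros((h, p))
A[[0, 1, 2], [2, 4, 7]] = 1.0          # selects generalizable features 2, 4, 7

v = rng.normal(size=h)                 # shared weights for generalizable features
u = rng.normal(size=p)                 # domain-specific weights
x = rng.integers(0, 2, size=p).astype(float)   # a binary feature vector

w = A.T @ v + u                        # the slide's decomposition
z = A @ x                              # generalizable part of x
lhs = w @ x                            # w^T x
rhs = v @ z + u @ x                    # v^T z + u^T x
```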
Decomposition of w: shared by all domains vs. domain-specific

[Figure: the same decomposition with the selection matrix A written out explicitly; the component A^T v is shared by all domains, while u is domain-specific.]

w = A^T v + u
Framework for generalization

Fix A, optimize:

(v̂, {û_k}) = argmin_{v, {u_k}} [ −(1/K) Σ_{k=1}^{K} (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k) + λ_s Σ_{k=1}^{K} ||u_k||² + λ ||v||² ]

- the likelihood term covers the labeled data from the K source domains
- λ_s >> 1: to penalize domain-specific features
Framework for adaptation

Fix A, optimize:

(v̂, û_t, {û_k}) = argmin_{v, u_t, {u_k}} [ −(1/K) Σ_{k=1}^{K} (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k) − (1/m) Σ_{i=1}^{m} log p(y_i^t | x_i^t; A, v, u_t) + λ_s Σ_{k=1}^{K} ||u_k||² + λ_t ||u_t||² + λ ||v||² ]

- the second likelihood term covers the m pseudo-labeled target-domain examples
- λ_t = 1 << λ_s: to pick up domain-specific features in the target domain
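The pseudo-labeled target-domain examples come from a bootstrapping step: classify the unlabeled target data with the current model and keep the most confident predictions. A sketch of that selection step (the shape of the data and the selection-by-top-m rule are illustrative; the paper gives the exact procedure):

```python
import numpy as np

def pseudo_label(proba, m):
    """Return the indices and predicted labels of the m target-domain
    examples the current model is most confident about."""
    confidence = proba.max(axis=1)          # max class probability per example
    labels = proba.argmax(axis=1)
    top = np.argsort(-confidence)[:m]       # m most confident examples
    return top, labels[top]

# illustrative class probabilities on 4 unlabeled target examples
proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.20, 0.80],
                  [0.60, 0.40]])
idx, lab = pseudo_label(proba, m=2)
```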
How to find A? (1)

- Joint optimization:

  (Â, v̂, {û_k}) = argmin_{A, v, {u_k}} [ −(1/K) Σ_{k=1}^{K} (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k) + λ_s Σ_{k=1}^{K} ||u_k||² + λ ||v||² ]

- Alternating optimization
How to find A? (2)

- Domain cross validation
  - Idea: train on (K − 1) source domains and test on the held-out source domain
  - Approximation:
    - w_f^k: weight for feature f learned from domain k
    - w_f^k̄: weight for feature f learned from the other domains
    - rank features by Σ_{k=1}^{K} w_f^k · w_f^k̄
  - See paper for details
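A sketch of the ranking score Σ_k w_f^k · w_f^k̄. How w_f^k̄ is computed is an assumption here (leave-one-out mean of the other domains' weights); see the paper for the exact definition:

```python
import numpy as np

def rank_generalizable(W):
    """Rank features by sum_k w_f^k * w_f^kbar, where w_f^kbar is
    approximated as the mean weight of f over the other K-1 domains."""
    K, p = W.shape
    scores = np.zeros(p)
    for k in range(K):
        w_bar = (W.sum(axis=0) - W[k]) / (K - 1)   # leave-one-out mean
        scores += W[k] * w_bar
    return np.argsort(-scores)                      # most generalizable first

# feature 0 ("expressed") is useful in every domain;
# feature 1 ("-less") is weighted high only in the fly domain
W = np.array([[1.5, 0.05],
              [1.8, 0.10],
              [1.2, 2.00]])
order = rank_generalizable(W)
```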
Intuition for domain cross validation

[Figure: weight vectors learned from domains D_1, …, D_{k−1} and from D_k (fly). The feature "expressed" receives a high weight in every domain (e.g. 1.5, 1.8), while "-less" receives a high weight only in the fly domain (2.0) and a near-zero weight elsewhere (0.05, 0.1).]
Experiments

- Data set
  - BioCreative Challenge Task 1B
  - Gene/protein name recognition
  - 3 organisms/domains: fly, mouse and yeast
- Experiment setup
  - 2 organisms for training, 1 for testing
  - F1 as performance measure
Experiments: Generalization
0.4700.1950.654DA-2(domain CV)
0.4250.1530.627DA-1 (joint-opt)
0.4160.1290.633BL
Y+F�
MM+Y�
FF+M�
YMethod
SourceDomain
TargetDomain
SourceDomain
TargetDomain using generalizable features is effective
domain cross validation is more effective than joint optimization
F: fly M: mouse Y: yeast
Experiments: Adaptation

  Method     F+M → Y   M+Y → F   Y+F → M
  BL-SSL     0.633     0.241     0.458
  DA-2-SSL   0.759     0.305     0.501

(F: fly, M: mouse, Y: yeast)

- domain-adaptive bootstrapping is more effective than regular bootstrapping
Experiments: Adaptation
domain-adaptive SSL is more effective, especially with a small number of pseudo labels
Conclusions and future work

- Two-stage domain adaptation
  - Generalization: outperformed standard supervised learning
  - Adaptation: outperformed standard bootstrapping
- Two ways to find generalizable features
  - Domain cross validation is more effective
- Future work
  - Single source domain?
  - Setting the parameters h and m
References
- S. Ben-David, J. Blitzer, K. Crammer & F. Pereira. Analysis of representations for domain adaptation. NIPS 2007.
- J. Blitzer, R. McDonald & F. Pereira. Domain adaptation with structural correspondence learning. EMNLP 2006.
- H. Daumé III. Frustratingly easy domain adaptation. ACL 2007.
Thank you!