Word representations: A simple and general method for semi-supervised learning
Page 1: Word representations: A simple and general method for semi-supervised learning

Word representations: A simple and general method for semi-supervised learning

Joseph Turian

with Lev Ratinov and Yoshua Bengio

Goodies: http://metaoptimize.com/projects/wordreprs/

Page 2: Word representations: A simple and general method for semi-supervised learning

2

Sup model

Sup data

Supervised training

Page 3: Word representations: A simple and general method for semi-supervised learning

3

Sup model

Sup data

Supervised training

Semi-sup training?

Page 4: Word representations: A simple and general method for semi-supervised learning

4

Sup model

Sup data

Supervised training

Semi-sup training?

Page 5: Word representations: A simple and general method for semi-supervised learning

5

Sup model

Sup data

Supervised training

Semi-sup training?

More feats

Page 6: Word representations: A simple and general method for semi-supervised learning

6

Sup model

Sup data

More feats

Sup model

Sup data

More feats

sup task 1

sup task 2

Page 7: Word representations: A simple and general method for semi-supervised learning

7

Semi-sup model

Unsup data

Sup data

Joint semi-sup

Page 8: Word representations: A simple and general method for semi-supervised learning

8

Semi-sup model

Unsup model

Unsup data

Sup data

unsup pretraining

semi-sup fine-tuning

Page 9: Word representations: A simple and general method for semi-supervised learning

9

Unsup model

Unsup data

unsup training

unsup feats

Page 10: Word representations: A simple and general method for semi-supervised learning

10

Semi-sup model

Unsup data

unsup training

Sup training

Sup data

unsup feats

Page 11: Word representations: A simple and general method for semi-supervised learning

11

Unsup data

unsup training

unsup feats

sup task 1 sup task 2 sup task 3

Page 12: Word representations: A simple and general method for semi-supervised learning

12

What unsupervised features are most useful in NLP?

Page 13: Word representations: A simple and general method for semi-supervised learning

13

Natural language processing

• Words, words, words
• Words, words, words
• Words, words, words

Page 14: Word representations: A simple and general method for semi-supervised learning

14

How do we handle words?

• Not very well

Page 15: Word representations: A simple and general method for semi-supervised learning

15

“One-hot” word representation

• |V| = |vocabulary|, e.g. 50K for PTB2

[Diagram: word -1, word 0, word +1 as one-hot vectors (total input size 3*|V|), weight matrix of size (3*|V|) x m, output of size m giving a Pr dist over labels]
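As a concrete illustration of this setup (not from the talk), the window is three one-hot vectors of length |V| concatenated and multiplied by a (3*|V|) x m weight matrix. A minimal numpy sketch with a toy vocabulary and an assumed label count m:

```python
# Sketch (illustrative only) of the one-hot window representation on this slide.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]            # toy vocabulary; the slide's |V| is ~50K
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
m = 4                                                  # number of labels, assumed for the example

def one_hot_window(words):
    """Concatenate one-hot vectors for (word -1, word 0, word +1) -> length 3*|V|."""
    x = np.zeros(3 * V)
    for slot, w in enumerate(words):
        x[slot * V + index[w]] = 1.0
    return x

x = one_hot_window(["the", "cat", "sat"])              # shape (3*|V|,)
W = np.random.randn(3 * V, m)                          # the (3*|V|) x m weight matrix
scores = x @ W
probs = np.exp(scores) / np.exp(scores).sum()          # Pr dist over the m labels
print(probs)
```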

Page 16: Word representations: A simple and general method for semi-supervised learning

16

One-hot word representation

• 85% of vocab words occur as only 10% of corpus tokens

• Bad estimate of Pr(label|rare word)

[Diagram: word 0 as a one-hot vector of size |V|, weight matrix of size |V| x m, output of size m]

Page 17: Word representations: A simple and general method for semi-supervised learning

17

Approach

Page 18: Word representations: A simple and general method for semi-supervised learning

18

Approach

• Manual feature engineering

Page 19: Word representations: A simple and general method for semi-supervised learning

19

Approach

• Manual feature engineering

Page 20: Word representations: A simple and general method for semi-supervised learning

20

Approach

• Induce word reprs over large corpus, unsupervised

• Use word reprs as word features for supervised task

Page 21: Word representations: A simple and general method for semi-supervised learning

21

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs

Page 22: Word representations: A simple and general method for semi-supervised learning

22

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs

Page 23: Word representations: A simple and general method for semi-supervised learning

23

Distributional representations

[Diagram: matrix F with W rows (size of vocab) and C columns]

e.g. F_{w,v} = Pr(v follows word w), or F_{w,v} = Pr(v occurs in same doc as w)

Page 24: Word representations: A simple and general method for semi-supervised learning

24

Distributional representations

[Diagram: matrix F of size W (size of vocab) x C, mapped to a matrix f of size W x d]

g(F) = f, e.g. g = LSI/LSA, LDA, PCA, ICA, random transform; C >> d
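A minimal sketch of g(F) = f under toy assumptions (the corpus, the choice of bigram counts for F, and d are all illustrative), using truncated SVD as the g:

```python
# Sketch of g(F) = f: build a co-occurrence matrix F and reduce its C columns to d dimensions
# via truncated SVD (LSA/LDA/PCA/ICA/random projection would slot in the same way).
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]   # toy corpus, illustrative only
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# F[w, v]: how often v follows word w (cf. F_{w,v} = Pr(v follows word w) on the previous slide)
F = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for w, v in zip(sent, sent[1:]):
        F[idx[w], idx[v]] += 1
F /= np.maximum(F.sum(axis=1, keepdims=True), 1)       # row-normalize counts into Pr(v follows w)

d = 3                                                   # target dimensionality, d << C
U, S, Vt = np.linalg.svd(F, full_matrices=False)
f = U[:, :d] * S[:d]                                    # d-dimensional distributional word reprs
print(f.shape)                                          # (|vocab|, d)
```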

Page 25: Word representations: A simple and general method for semi-supervised learning

25

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs

Page 26: Word representations: A simple and general method for semi-supervised learning

26

Class-based word repr

• |C| classes, hard clustering

[Diagram: word 0 represented by |V|+|C| entries (word identity plus class), weight matrix of size (|V|+|C|) x m, output of size m]

Page 27: Word representations: A simple and general method for semi-supervised learning

27

Class-based word repr

• Hard vs. soft clustering
• Hierarchical vs. flat clustering

Page 28: Word representations: A simple and general method for semi-supervised learning

28

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
  – Brown (hard, hierarchical) clustering
  – HMM (soft, flat) clustering
• Distributed word reprs

Page 29: Word representations: A simple and general method for semi-supervised learning

29

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
  – Brown (hard, hierarchical) clustering
  – HMM (soft, flat) clustering
• Distributed word reprs

Page 30: Word representations: A simple and general method for semi-supervised learning

30

Brown clustering

• Hard, hierarchical class-based LM
• Brown et al. (1992)
• Greedy technique for maximizing bigram mutual information
• Merge words by contextual similarity

Page 31: Word representations: A simple and general method for semi-supervised learning

31

Brown clustering

(image from Terry Koo)

cluster(chairman) = `0010’
2-prefix(cluster(chairman)) = `00’
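Reading prefix features off the cluster bit-string is a simple string operation; a small sketch, where the path for "chairman" comes from the slide and the feature names and prefix lengths (cf. the next slide) are illustrative:

```python
# Sketch: turning a Brown cluster bit-string into prefix features.
cluster = {"chairman": "0010"}            # path from the slide; real paths can be longer

def prefix_features(word, lengths=(2, 4, 6, 10, 20)):
    path = cluster[word]
    # For a path shorter than p, the "p-prefix" is just the whole path.
    return {f"cluster-pre{p}": path[:p] for p in lengths}

print(prefix_features("chairman"))
# e.g. {'cluster-pre2': '00', 'cluster-pre4': '0010', 'cluster-pre6': '0010', ...}
```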

Page 32: Word representations: A simple and general method for semi-supervised learning

32

Brown clustering

• Hard, hierarchical class-based LM
• 1000 classes
• Use prefixes = 4, 6, 10, 20

Page 33: Word representations: A simple and general method for semi-supervised learning

33

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
  – Brown (hard, hierarchical) clustering
  – HMM (soft, flat) clustering
• Distributed word reprs

Page 34: Word representations: A simple and general method for semi-supervised learning

34

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs

Page 35: Word representations: A simple and general method for semi-supervised learning

35

Distributed word repr

• k- (low) dimensional, dense representation
• “word embedding” matrix E of size |V| x k

[Diagram: word 0 as a k-dimensional embedding, weight matrix of size k x m, output of size m]

Page 36: Word representations: A simple and general method for semi-supervised learning

36

Sequence labeling w/ embeddings

[Diagram: word -1, word 0, word +1 each looked up in a shared |V| x k embedding matrix (tied weights), concatenated to a 3*k input, weight matrix of size (3*k) x m, output of size m]

“word embedding” matrix E of size |V| x k
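A minimal sketch of this lookup-and-concatenate input layer (toy vocabulary, toy k and m); the same E is reused for every window position, which is what the tied weights refer to:

```python
# Sketch of the embedding-based window model: one shared |V| x k lookup table ("tied weights"),
# looked up three times and concatenated. Sizes and words are toy values.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
index = {w: i for i, w in enumerate(vocab)}
k, m = 5, 4                                   # embedding dim and label count, illustrative
E = np.random.randn(len(vocab), k)            # the |V| x k "word embedding" matrix

def window_input(words):
    """Look up and concatenate embeddings for (word -1, word 0, word +1) -> length 3*k."""
    return np.concatenate([E[index[w]] for w in words])

x = window_input(["the", "cat", "sat"])       # shape (3*k,)
W = np.random.randn(3 * k, m)                 # the (3*k) x m weight matrix
print((x @ W).shape)                          # scores for the m labels
```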

Page 37: Word representations: A simple and general method for semi-supervised learning

37

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
  – Collobert + Weston (2008)
  – HLBL embeddings (Mnih + Hinton, 2007)

Page 38: Word representations: A simple and general method for semi-supervised learning

38

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
  – Collobert + Weston (2008)
  – HLBL embeddings (Mnih + Hinton, 2007)

Page 39: Word representations: A simple and general method for semi-supervised learning

39

Collobert + Weston 2008

[Diagram: 5-word window w1 w2 w3 w4 w5, 50-dim embeddings concatenated to a 50*5 input, hidden layer of 100 units, single output score; a corrupted window w1 w2 w3 w4 w5 is scored the same way]

score(true window) > μ + score(corrupted window)
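A sketch of that ranking criterion as a hinge loss; the scorer below is a stand-in for the 50*5 → 100 → 1 network, and corrupting the window by swapping the center word is an assumption:

```python
# Sketch of the ranking criterion: the observed window should score at least mu above a
# corrupted window. Embeddings, weights, and the corruption rule are toy/assumed.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
k = 50
E = rng.normal(size=(len(vocab), k))                       # toy 50-dim embeddings
w_out = rng.normal(size=5 * k)                             # stand-in scoring weights

def score(window):
    x = np.concatenate([E[vocab.index(w)] for w in window])   # 5 words * 50 dims
    return float(w_out @ x)

def ranking_loss(window, mu=1.0):
    corrupted = list(window)
    corrupted[2] = rng.choice(vocab)                        # corrupt: swap in a random word
    # hinge loss: zero once score(true) > mu + score(corrupted)
    return max(0.0, mu - (score(window) - score(corrupted)))

print(ranking_loss(["the", "cat", "sat", "on", "mat"]))
```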

Page 40: Word representations: A simple and general method for semi-supervised learning

40

50-dim embeddings: Collobert + Weston (2008)
t-SNE vis by van der Maaten + Hinton (2008)

Page 41: Word representations: A simple and general method for semi-supervised learning

41

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
  – Collobert + Weston (2008)
  – HLBL embeddings (Mnih + Hinton, 2007)

Page 42: Word representations: A simple and general method for semi-supervised learning

42

Log bilinear Language Model (LBL)

w1 w2 w3 w4 w5

Linear prediction of w5

Pr = exp(predict · target) / Z
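A toy sketch of the LBL prediction step under assumed sizes: one linear map per context position produces a predicted target representation, and every candidate w5 is scored by a dot product and normalized by Z:

```python
# Sketch of the log-bilinear LM prediction: linearly predict the target embedding from the
# context embeddings, then Pr(w5) = exp(predict . target) / Z over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
V, k, n_ctx = 6, 8, 4                         # vocab size, embedding dim, context length (toy)
E = rng.normal(size=(V, k))                   # word embeddings
C = rng.normal(size=(n_ctx, k, k))            # one linear map per context position

context = [0, 3, 2, 5]                        # indices of w1..w4
predict = sum(C[i] @ E[w] for i, w in enumerate(context))   # linear prediction of w5's repr

scores = E @ predict                          # predict . target for every candidate w5 ...
probs = np.exp(scores) / np.exp(scores).sum() # ... exponentiated and normalized by Z
print(probs.round(3))
```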

Page 43: Word representations: A simple and general method for semi-supervised learning

43

HLBL

• HLBL = hierarchical (fast) training of LBL
• Mnih + Hinton (2009)

Page 44: Word representations: A simple and general method for semi-supervised learning

44

Approach

• Induce word reprs over large corpus, unsupervised
  – Brown: 3 days
  – HLBL: 1 week, 100 epochs
  – C&W: 4 weeks, 50 epochs

• Use word reprs as word features for supervised task

Page 45: Word representations: A simple and general method for semi-supervised learning

45

Unsupervised corpus

• RCV1 newswire
• 40M tokens (vocab = all 270K types)

Page 46: Word representations: A simple and general method for semi-supervised learning

46

Supervised Tasks

• Chunking (CoNLL, 2000)
  – CRF (Sha + Pereira, 2003)
• Named entity recognition (NER)
  – Averaged perceptron (linear classifier)
  – Based upon Ratinov + Roth (2009)

Page 47: Word representations: A simple and general method for semi-supervised learning

47

Unsupervised word reprs as features

Word = “the”

Embedding = [-0.2, …, 1.6]

Brown cluster = 1010001100
(cluster 4-prefix = 1010, cluster 6-prefix = 101000, …)

Page 48: Word representations: A simple and general method for semi-supervised learning

48

Unsupervised word reprs as features

Orig X = {pos-2=“DT”: 1, word-2=“the”: 1, ...}

X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1}

X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …, word-2-dim49: 1.6, ...}
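A small sketch of this augmentation (feature-name formats and the filler embedding values are assumptions; only the endpoints -0.2 and 1.6 and the Brown path appear on the slides):

```python
# Sketch: augmenting the supervised feature dict with Brown-prefix indicators and
# real-valued embedding dimensions, mirroring the examples above.
brown_path = {"the": "1010001100"}
embedding = {"the": [-0.2] + [0.0] * 48 + [1.6]}             # 50-dim; middle values are filler

def add_word_repr_feats(x, word, pos=-2, prefixes=(4, 6, 10, 20)):
    feats = dict(x)
    path = brown_path[word]
    for p in prefixes:
        feats[f"class{pos}-pre{p}={path[:p]}"] = 1           # binary Brown-prefix indicator
    for d, val in enumerate(embedding[word]):
        feats[f"word{pos}-dim{d:02d}"] = val                  # real-valued embedding feature
    return feats

orig = {"pos-2=DT": 1, "word-2=the": 1}
print(len(add_word_repr_feats(orig, "the")))                 # 2 orig + 4 prefix + 50 embedding feats
```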

Page 49: Word representations: A simple and general method for semi-supervised learning

49

Embeddings: Normalization

E = σ * E / stddev(E)
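A one-liner in numpy; σ = 0.1 is the value the summary slide reports, and the matrix here is random toy data:

```python
# Sketch of the scaling rule E = sigma * E / stddev(E).
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(scale=3.0, size=(1000, 50))     # toy embedding matrix
sigma = 0.1
E = sigma * E / E.std()                        # rescale so the overall stddev is sigma
print(E.std())                                 # ~0.1
```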

Page 50: Word representations: A simple and general method for semi-supervised learning

50

Embeddings: Normalization (Chunking)

Page 51: Word representations: A simple and general method for semi-supervised learning

51

Embeddings: Normalization (NER)

Page 52: Word representations: A simple and general method for semi-supervised learning

52

Repr capacity (Chunking)

Page 53: Word representations: A simple and general method for semi-supervised learning

53

Repr capacity (NER)

Page 54: Word representations: A simple and general method for semi-supervised learning

54

Test results (Chunking)

[Chart: Chunking test F1 (y-axis 93–95.5) for baseline, HLBL, C&W, Brown, C&W+Brown, Suzuki+Isozaki (08) 15M, and Suzuki+Isozaki (08) 1B]

Page 55: Word representations: A simple and general method for semi-supervised learning

55

Test results (NER)

[Chart: NER test F1 (y-axis 84–91) for Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown, All, and All+nonlocal]

Page 56: Word representations: A simple and general method for semi-supervised learning

56

MUC7 (OOD) results (NER)

[Chart: MUC7 out-of-domain NER F1 (y-axis 66–84) for Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown, All, and All+nonlocal]

Page 57: Word representations: A simple and general method for semi-supervised learning

57

Test results (NER)

[Chart: NER test F1 (y-axis 88–91) for Lin+Wu (09) 3.4B, Suzuki+Isozaki (08) 37M, Suzuki+Isozaki (08) 1B, All+nonlocal 37M, and Lin+Wu (09) 700B]

Page 58: Word representations: A simple and general method for semi-supervised learning

Test results

• Chunking: C&W = Brown
• NER: C&W < Brown
• Why?

58

Page 59: Word representations: A simple and general method for semi-supervised learning

59

Word freq vs word error (Chunking)

Page 60: Word representations: A simple and general method for semi-supervised learning

60

Word freq vs word error (NER)

Page 61: Word representations: A simple and general method for semi-supervised learning

61

Summary

• Both Brown + word emb can increase acc of near-SOTA system

• Combining can improve accuracy further
• On rare words, Brown > word emb
• Scale parameter σ = 0.1
• Goodies: http://metaoptimize.com/projects/wordreprs/

Word features! Code!

Page 62: Word representations: A simple and general method for semi-supervised learning

62

Difficulties with word embeddings

• No stopping criterion during unsup training
• More active features (slower sup training)
• Hyperparameters
  – Learning rate for model
  – (optional) Learning rate for embeddings
  – Normalization constant
• vs. Brown clusters, which have few hyperparams

Page 63: Word representations: A simple and general method for semi-supervised learning

63

HMM approach

• Soft, flat class-based repr
• Multinomial distribution over hidden states = word representation
• 80 hidden states
• Huang and Yates (2009)
• No results with HMM approach yet