Word representations: A simple and general method for semi-supervised learning
Page 1: Word representations: A simple and general method for semi-supervised learning

Word representations: A simple and general method for semi-supervised learning

Joseph Turian

with Lev Ratinov and Yoshua Bengio

Goodies: http://metaoptimize.com/projects/wordreprs/

Page 2: Word representations: A simple and general method for semi-supervised learning

2

Sup model

Sup data

Supervised training

Page 3: Word representations: A simple and general method for semi-supervised learning

3

Sup model

Sup data

Supervised training

Semi-sup training?

Page 4: Word representations: A simple and general method for semi-supervised learning

4

Sup model

Sup data

Supervised training

Semi-sup training?

Page 5: Word representations: A simple and general method for semi-supervised learning

5

Sup model

Sup data

Supervised training

Semi-sup training?

More feats

Page 6: Word representations: A simple and general method for semi-supervised learning

6

Sup model

Sup data

More feats

Sup model

Sup data

More feats

sup task 1

sup task 2

Page 7: Word representations: A simple and general method for semi-supervised learning

7

Semi-sup model

Unsup data

Sup data

Joint semi-sup

Page 8: Word representations: A simple and general method for semi-supervised learning

8

Semi-sup model

Unsup model

Unsup data

Sup data

unsup pretraining

semi-sup fine-tuning

Page 9: Word representations: A simple and general method for semi-supervised learning

9

Unsup model

Unsup data

unsup training

unsup feats

Page 10: Word representations: A simple and general method for semi-supervised learning

10

Semi-sup model

Unsup data

unsup training

Sup training

Sup data

unsup feats

Page 11: Word representations: A simple and general method for semi-supervised learning

11

Unsup data

unsup training

unsup feats

sup task 1 sup task 2 sup task 3

Page 12: Word representations: A simple and general method for semi-supervised learning

12

What unsupervised features are most useful in NLP?

Page 13: Word representations: A simple and general method for semi-supervised learning

13

Natural language processing

• Words, words, words
• Words, words, words
• Words, words, words

Page 14: Word representations: A simple and general method for semi-supervised learning

14

How do we handle words?

• Not very well

Page 15: Word representations: A simple and general method for semi-supervised learning

15

“One-hot” word representation

• |V| = |vocabulary|, e.g. 50K for PTB2

[Diagram: word -1, word 0, word +1 as one-hot vectors (total input size 3*|V|), weight matrix of size (3*|V|) x m, output of size m giving a Pr dist over labels]
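As a concrete illustration of this setup (not from the talk), the window is three one-hot vectors of length |V| concatenated and multiplied by a (3*|V|) x m weight matrix. A minimal numpy sketch with a toy vocabulary and an assumed label count m:

```python
# Sketch (illustrative only) of the one-hot window representation on this slide.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]            # toy vocabulary; the slide's |V| is ~50K
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
m = 4                                                  # number of labels, assumed for the example

def one_hot_window(words):
    """Concatenate one-hot vectors for (word -1, word 0, word +1) -> length 3*|V|."""
    x = np.zeros(3 * V)
    for slot, w in enumerate(words):
        x[slot * V + index[w]] = 1.0
    return x

x = one_hot_window(["the", "cat", "sat"])              # shape (3*|V|,)
W = np.random.randn(3 * V, m)                          # the (3*|V|) x m weight matrix
scores = x @ W
probs = np.exp(scores) / np.exp(scores).sum()          # Pr dist over the m labels
print(probs)
```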

Page 16: Word representations: A simple and general method for semi-supervised learning

16

One-hot word representation

• 85% of vocab words occur as only 10% of corpus tokens

• Bad estimate of Pr(label|rare word)

[Diagram: word 0 as a one-hot vector of size |V|, weight matrix of size |V| x m, output of size m]

Page 17: Word representations: A simple and general method for semi-supervised learning

17

Approach

Page 18: Word representations: A simple and general method for semi-supervised learning

18

Approach

• Manual feature engineering

Page 19: Word representations: A simple and general method for semi-supervised learning

19

Approach

• Manual feature engineering

Page 20: Word representations: A simple and general method for semi-supervised learning

20

Approach

• Induce word reprs over large corpus, unsupervised

• Use word reprs as word features for supervised task

Page 21: Word representations: A simple and general method for semi-supervised learning

21

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs

Page 22: Word representations: A simple and general method for semi-supervised learning

22

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs

Page 23: Word representations: A simple and general method for semi-supervised learning

23

Distributional representations

[Diagram: matrix F with W rows (size of vocab) and C columns]

e.g. F_{w,v} = Pr(v follows word w), or F_{w,v} = Pr(v occurs in same doc as w)

Page 24: Word representations: A simple and general method for semi-supervised learning

24

Distributional representations

[Diagram: matrix F of size W (size of vocab) x C, mapped to a matrix f of size W x d]

g(F) = f, e.g. g = LSI/LSA, LDA, PCA, ICA, random transform; C >> d
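A minimal sketch of g(F) = f under toy assumptions (the corpus, the choice of bigram counts for F, and d are all illustrative), using truncated SVD as the g:

```python
# Sketch of g(F) = f: build a co-occurrence matrix F and reduce its C columns to d dimensions
# via truncated SVD (LSA/LDA/PCA/ICA/random projection would slot in the same way).
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]   # toy corpus, illustrative only
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# F[w, v]: how often v follows word w (cf. F_{w,v} = Pr(v follows word w) on the previous slide)
F = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for w, v in zip(sent, sent[1:]):
        F[idx[w], idx[v]] += 1
F /= np.maximum(F.sum(axis=1, keepdims=True), 1)       # row-normalize counts into Pr(v follows w)

d = 3                                                   # target dimensionality, d << C
U, S, Vt = np.linalg.svd(F, full_matrices=False)
f = U[:, :d] * S[:d]                                    # d-dimensional distributional word reprs
print(f.shape)                                          # (|vocab|, d)
```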

Page 25: Word representations: A simple and general method for semi-supervised learning

25

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs

Page 26: Word representations: A simple and general method for semi-supervised learning

26

Class-based word repr

• |C| classes, hard clustering

[Diagram: word 0 represented by |V|+|C| entries (word identity plus class), weight matrix of size (|V|+|C|) x m, output of size m]

Page 27: Word representations: A simple and general method for semi-supervised learning

27

Class-based word repr

• Hard vs. soft clustering
• Hierarchical vs. flat clustering

Page 28: Word representations: A simple and general method for semi-supervised learning

28

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
  – Brown (hard, hierarchical) clustering
  – HMM (soft, flat) clustering
• Distributed word reprs

Page 29: Word representations: A simple and general method for semi-supervised learning

29

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
  – Brown (hard, hierarchical) clustering
  – HMM (soft, flat) clustering
• Distributed word reprs

Page 30: Word representations: A simple and general method for semi-supervised learning

30

Brown clustering

• Hard, hierarchical class-based LM
• Brown et al. (1992)
• Greedy technique for maximizing bigram mutual information
• Merge words by contextual similarity

Page 31: Word representations: A simple and general method for semi-supervised learning

31

Brown clustering

(image from Terry Koo)

cluster(chairman) = `0010’
2-prefix(cluster(chairman)) = `00’
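Reading prefix features off the cluster bit-string is a simple string operation; a small sketch, where the path for "chairman" comes from the slide and the feature names and prefix lengths (cf. the next slide) are illustrative:

```python
# Sketch: turning a Brown cluster bit-string into prefix features.
cluster = {"chairman": "0010"}            # path from the slide; real paths can be longer

def prefix_features(word, lengths=(2, 4, 6, 10, 20)):
    path = cluster[word]
    # For a path shorter than p, the "p-prefix" is just the whole path.
    return {f"cluster-pre{p}": path[:p] for p in lengths}

print(prefix_features("chairman"))
# e.g. {'cluster-pre2': '00', 'cluster-pre4': '0010', 'cluster-pre6': '0010', ...}
```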

Page 32: Word representations: A simple and general method for semi-supervised learning

32

Brown clustering

• Hard, hierarchical class-based LM
• 1000 classes
• Use prefixes = 4, 6, 10, 20

Page 33: Word representations: A simple and general method for semi-supervised learning

33

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
  – Brown (hard, hierarchical) clustering
  – HMM (soft, flat) clustering
• Distributed word reprs

Page 34: Word representations: A simple and general method for semi-supervised learning

34

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs

Page 35: Word representations: A simple and general method for semi-supervised learning

35

Distributed word repr

• k- (low) dimensional, dense representation
• “word embedding” matrix E of size |V| x k

[Diagram: word 0 as a k-dimensional embedding, weight matrix of size k x m, output of size m]

Page 36: Word representations: A simple and general method for semi-supervised learning

36

Sequence labeling w/ embeddings

[Diagram: word -1, word 0, word +1 each looked up in a shared |V| x k embedding matrix (tied weights), concatenated to a 3*k input, weight matrix of size (3*k) x m, output of size m]

“word embedding” matrix E of size |V| x k
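A minimal sketch of this lookup-and-concatenate input layer (toy vocabulary, toy k and m); the same E is reused for every window position, which is what the tied weights refer to:

```python
# Sketch of the embedding-based window model: one shared |V| x k lookup table ("tied weights"),
# looked up three times and concatenated. Sizes and words are toy values.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
index = {w: i for i, w in enumerate(vocab)}
k, m = 5, 4                                   # embedding dim and label count, illustrative
E = np.random.randn(len(vocab), k)            # the |V| x k "word embedding" matrix

def window_input(words):
    """Look up and concatenate embeddings for (word -1, word 0, word +1) -> length 3*k."""
    return np.concatenate([E[index[w]] for w in words])

x = window_input(["the", "cat", "sat"])       # shape (3*k,)
W = np.random.randn(3 * k, m)                 # the (3*k) x m weight matrix
print((x @ W).shape)                          # scores for the m labels
```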

Page 37: Word representations: A simple and general method for semi-supervised learning

37

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
  – Collobert + Weston (2008)
  – HLBL embeddings (Mnih + Hinton, 2007)

Page 38: Word representations: A simple and general method for semi-supervised learning

38

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
  – Collobert + Weston (2008)
  – HLBL embeddings (Mnih + Hinton, 2007)

Page 39: Word representations: A simple and general method for semi-supervised learning

39

Collobert + Weston 2008

[Diagram: 5-word window w1 w2 w3 w4 w5, 50-dim embeddings concatenated to a 50*5 input, hidden layer of 100 units, single output score; a corrupted window w1 w2 w3 w4 w5 is scored the same way]

score(true window) > μ + score(corrupted window)
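A sketch of that ranking criterion as a hinge loss; the scorer below is a stand-in for the 50*5 → 100 → 1 network, and corrupting the window by swapping the center word is an assumption:

```python
# Sketch of the ranking criterion: the observed window should score at least mu above a
# corrupted window. Embeddings, weights, and the corruption rule are toy/assumed.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
k = 50
E = rng.normal(size=(len(vocab), k))                       # toy 50-dim embeddings
w_out = rng.normal(size=5 * k)                             # stand-in scoring weights

def score(window):
    x = np.concatenate([E[vocab.index(w)] for w in window])   # 5 words * 50 dims
    return float(w_out @ x)

def ranking_loss(window, mu=1.0):
    corrupted = list(window)
    corrupted[2] = rng.choice(vocab)                        # corrupt: swap in a random word
    # hinge loss: zero once score(true) > mu + score(corrupted)
    return max(0.0, mu - (score(window) - score(corrupted)))

print(ranking_loss(["the", "cat", "sat", "on", "mat"]))
```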

Page 40: Word representations: A simple and general method for semi-supervised learning

40

50-dim embeddings: Collobert + Weston (2008)
t-SNE vis by van der Maaten + Hinton (2008)

Page 41: Word representations: A simple and general method for semi-supervised learning

41

Less sparse word reprs?

• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
  – Collobert + Weston (2008)
  – HLBL embeddings (Mnih + Hinton, 2007)

Page 42: Word representations: A simple and general method for semi-supervised learning

42

Log bilinear Language Model (LBL)

w1 w2 w3 w4 w5

Linear prediction of w5

Pr = exp(predict · target) / Z
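A toy sketch of the LBL prediction step under assumed sizes: one linear map per context position produces a predicted target representation, and every candidate w5 is scored by a dot product and normalized by Z:

```python
# Sketch of the log-bilinear LM prediction: linearly predict the target embedding from the
# context embeddings, then Pr(w5) = exp(predict . target) / Z over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
V, k, n_ctx = 6, 8, 4                         # vocab size, embedding dim, context length (toy)
E = rng.normal(size=(V, k))                   # word embeddings
C = rng.normal(size=(n_ctx, k, k))            # one linear map per context position

context = [0, 3, 2, 5]                        # indices of w1..w4
predict = sum(C[i] @ E[w] for i, w in enumerate(context))   # linear prediction of w5's repr

scores = E @ predict                          # predict . target for every candidate w5 ...
probs = np.exp(scores) / np.exp(scores).sum() # ... exponentiated and normalized by Z
print(probs.round(3))
```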

Page 43: Word representations: A simple and general method for semi-supervised learning

43

HLBL

• HLBL = hierarchical (fast) training of LBL
• Mnih + Hinton (2009)

Page 44: Word representations: A simple and general method for semi-supervised learning

44

Approach

• Induce word reprs over large corpus, unsupervised
  – Brown: 3 days
  – HLBL: 1 week, 100 epochs
  – C&W: 4 weeks, 50 epochs

• Use word reprs as word features for supervised task

Page 45: Word representations: A simple and general method for semi-supervised learning

45

Unsupervised corpus

• RCV1 newswire
• 40M tokens (vocab = all 270K types)

Page 46: Word representations: A simple and general method for semi-supervised learning

46

Supervised Tasks

• Chunking (CoNLL, 2000)
  – CRF (Sha + Pereira, 2003)
• Named entity recognition (NER)
  – Averaged perceptron (linear classifier)
  – Based upon Ratinov + Roth (2009)

Page 47: Word representations: A simple and general method for semi-supervised learning

47

Unsupervised word reprs as features

Word = “the”

Embedding = [-0.2, …, 1.6]

Brown cluster = 1010001100
(cluster 4-prefix = 1010, cluster 6-prefix = 101000, …)

Page 48: Word representations: A simple and general method for semi-supervised learning

48

Unsupervised word reprs as features

Orig X = {pos-2=“DT”: 1, word-2=“the”: 1, ...}

X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1}

X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …, word-2-dim49: 1.6, ...}
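A small sketch of this augmentation (feature-name formats and the filler embedding values are assumptions; only the endpoints -0.2 and 1.6 and the Brown path appear on the slides):

```python
# Sketch: augmenting the supervised feature dict with Brown-prefix indicators and
# real-valued embedding dimensions, mirroring the examples above.
brown_path = {"the": "1010001100"}
embedding = {"the": [-0.2] + [0.0] * 48 + [1.6]}             # 50-dim; middle values are filler

def add_word_repr_feats(x, word, pos=-2, prefixes=(4, 6, 10, 20)):
    feats = dict(x)
    path = brown_path[word]
    for p in prefixes:
        feats[f"class{pos}-pre{p}={path[:p]}"] = 1           # binary Brown-prefix indicator
    for d, val in enumerate(embedding[word]):
        feats[f"word{pos}-dim{d:02d}"] = val                  # real-valued embedding feature
    return feats

orig = {"pos-2=DT": 1, "word-2=the": 1}
print(len(add_word_repr_feats(orig, "the")))                 # 2 orig + 4 prefix + 50 embedding feats
```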

Page 49: Word representations: A simple and general method for semi-supervised learning

49

Embeddings: Normalization

E = σ * E / stddev(E)
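A one-liner in numpy; σ = 0.1 is the value the summary slide reports, and the matrix here is random toy data:

```python
# Sketch of the scaling rule E = sigma * E / stddev(E).
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(scale=3.0, size=(1000, 50))     # toy embedding matrix
sigma = 0.1
E = sigma * E / E.std()                        # rescale so the overall stddev is sigma
print(E.std())                                 # ~0.1
```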

Page 50: Word representations: A simple and general method for semi-supervised learning

50

Embeddings: Normalization (Chunking)

Page 51: Word representations: A simple and general method for semi-supervised learning

51

Embeddings: Normalization (NER)

Page 52: Word representations: A simple and general method for semi-supervised learning

52

Repr capacity (Chunking)

Page 53: Word representations: A simple and general method for semi-supervised learning

53

Repr capacity (NER)

Page 54: Word representations: A simple and general method for semi-supervised learning

54

Test results (Chunking)

[Chart: Chunking test F1 (y-axis 93–95.5) for baseline, HLBL, C&W, Brown, C&W+Brown, Suzuki+Isozaki (08) 15M, and Suzuki+Isozaki (08) 1B]

Page 55: Word representations: A simple and general method for semi-supervised learning

55

Test results (NER)

[Chart: NER test F1 (y-axis 84–91) for Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown, All, and All+nonlocal]

Page 56: Word representations: A simple and general method for semi-supervised learning

56

MUC7 (OOD) results (NER)

[Chart: MUC7 out-of-domain NER F1 (y-axis 66–84) for Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown, All, and All+nonlocal]

Page 57: Word representations: A simple and general method for semi-supervised learning

57

Test results (NER)

[Chart: NER test F1 (y-axis 88–91) for Lin+Wu (09) 3.4B, Suzuki+Isozaki (08) 37M, Suzuki+Isozaki (08) 1B, All+nonlocal 37M, and Lin+Wu (09) 700B]

Page 58: Word representations: A simple and general method for semi-supervised learning

Test results

• Chunking: C&W = Brown
• NER: C&W < Brown
• Why?

58

Page 59: Word representations: A simple and general method for semi-supervised learning

59

Word freq vs word error (Chunking)

Page 60: Word representations: A simple and general method for semi-supervised learning

60

Word freq vs word error (NER)

Page 61: Word representations: A simple and general method for semi-supervised learning

61

Summary

• Both Brown + word emb can increase acc of near-SOTA system

• Combining can improve accuracy further
• On rare words, Brown > word emb
• Scale parameter σ = 0.1
• Goodies: http://metaoptimize.com/projects/wordreprs/

Word features! Code!

Page 62: Word representations: A simple and general method for semi-supervised learning

62

Difficulties with word embeddings

• No stopping criterion during unsup training
• More active features (slower sup training)
• Hyperparameters
  – Learning rate for model
  – (optional) Learning rate for embeddings
  – Normalization constant
• vs. Brown clusters, which have few hyperparams

Page 63: Word representations: A simple and general method for semi-supervised learning

63

HMM approach

• Soft, flat class-based repr
• Multinomial distribution over hidden states = word representation
• 80 hidden states
• Huang and Yates (2009)
• No results with HMM approach yet