
Page 1

Learning Data Representations with “Partial Supervision”

Ariadna Quattoni

Page 2

Outline

• Motivation: low-dimensional representations
• Principal Component Analysis
• Structural Learning
• Vision applications
• NLP applications
• Joint Sparsity
• Vision applications

Page 3

Outline

• Motivation: low-dimensional representations
• Principal Component Analysis
• Structural Learning
• Vision applications
• NLP applications
• Joint Sparsity
• Vision applications

Page 4

Semi-Supervised Learning

Core task: learn a function $F: X \to Y$.

• "Raw" feature space: $X = \mathbb{R}^d$
• Output space: $Y = \{-1, +1\}$
• Labeled dataset (small): $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
• Classical setting: a large unlabeled dataset $U = \{x_1, x_2, \ldots, x_u\}$
• Partial supervision setting: a large partially labeled dataset $U = \{(x_1, c_1), (x_2, c_2), \ldots, (x_u, c_u)\}$

Page 5

Semi-Supervised Learning: Classical Setting

Learn a representation $G: X \to X'$ from the unlabeled dataset, then train a classifier $F: G(X) \to Y$ on the labeled dataset.

Dimensionality reduction: $X' = \mathbb{R}^h$ with $h \ll d$, and $G(x) = \theta x$ with $\theta \in \mathbb{R}^{h \times d}$.

Page 6

Semi-Supervised Learning: Partial Supervision Setting

Learn a representation $G: X \to X'$ from the unlabeled dataset plus the partial supervision, then train a classifier $F: G(X) \to Y$ on the labeled dataset.

As before, $X' = \mathbb{R}^h$ with $h \ll d$, and $G(x) = \theta x$ with $\theta \in \mathbb{R}^{h \times d}$.

Page 7

Why is “learning representations” useful?

Infer the intrinsic dimensionality of the data.

Learn the “relevant” dimensions.

Infer the hidden structure.

Page 8

Example: Hidden Structure

20 symbols: $S = \{s_1, s_2, \ldots, s_{20}\}$

4 topics: $T_1 = \{s_1, \ldots, s_5\}$, $T_2 = \{s_6, \ldots, s_{10}\}$, $T_3 = \{s_{11}, \ldots, s_{15}\}$, $T_4 = \{s_{16}, \ldots, s_{20}\}$

Generate a datapoint: choose a topic $T$, then sample 3 symbols from $T$.

A subset of 3 symbols, e.g. $\{s_1, s_{10}, s_{11}\}$, is represented as the vector $x = [x_1 = \tfrac{1}{3}, 0, \ldots, 0, x_{10} = \tfrac{1}{3}, x_{11} = \tfrac{1}{3}, 0, \ldots, x_{20} = 0]$.

[Figure: the 20 × 20 data covariance matrix.]
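A minimal numpy sketch of this generative process (the variable names, the uniform topic choice, and sampling with replacement are my assumptions; the slide does not pin them down):

```python
import numpy as np

rng = np.random.default_rng(0)
n_symbols, n_topics, n_points = 20, 4, 1000
# Topic k owns 5 consecutive symbols (T1 = {s1..s5}, ..., T4 = {s16..s20}).
topics = [np.arange(5 * k, 5 * k + 5) for k in range(n_topics)]

X = np.zeros((n_points, n_symbols))
for i in range(n_points):
    t = rng.integers(n_topics)               # choose a topic T
    symbols = rng.choice(topics[t], size=3)  # sample 3 symbols from T (with replacement)
    for s in symbols:
        X[i, s] += 1.0 / 3.0                 # bag-of-symbols with mass 1/3 each

cov = np.cov(X, rowvar=False)  # 20 x 20; its block structure follows the topics
```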

Page 9

Example: Hidden Structure

Number of latent dimensions = 4. Map each $x$ to the topic that generated it with the linear function $G(x) = \theta x$.

Here $\theta$ is the $4 \times 20$ projection matrix whose $k$-th row is the indicator vector of topic $T_k$: ones in the columns of $T_k$'s symbols, zeros elsewhere. Applying $\theta$ to a datapoint sums the symbol mass that falls in each topic, giving the latent representation; a point whose three symbols all come from $T_1$ maps to the topic vector $[1, 0, 0, 0]$.

[Figure: the projection matrix, a datapoint, and its latent representation.]
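A small self-contained sketch of that projection (names assumed, as above):

```python
import numpy as np

# Rows of theta indicate which symbols belong to which topic (T1..T4).
theta = np.zeros((4, 20))
for k in range(4):
    theta[k, 5 * k : 5 * k + 5] = 1.0

x = np.zeros(20)
x[[0, 1, 3]] = 1.0 / 3.0  # three symbols, all drawn from topic T1
print(theta @ x)          # latent representation: [1., 0., 0., 0.]
```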

Page 10

Outline

• Motivation: low-dimensional representations
• Principal Component Analysis
• Structural Learning
• Vision applications
• NLP applications
• Joint Sparsity
• Vision applications

Page 11

Classical Setting: Principal Component Analysis

Rows of $\theta$ as a "basis": $x' = \sum_{j=1}^{4} z_j r_j$, where $r_j$ is the $j$-th row of $\theta$.

An example generated by topic $T_1$ is reconstructed with $z = [\tfrac{1}{3}, 0, 0, 0]$; by $T_2$, with $z = [0, \tfrac{1}{3}, 0, 0]$; by $T_3$, with $z = [0, 0, \tfrac{1}{3}, 0]$; by $T_4$, with $z = [0, 0, 0, \tfrac{1}{3}]$.

Low reconstruction error: $\|x - x'\|_2^2$ is small.

Page 12

Minimum Error Formulation

Approximate high-dimensional $x$ with low-dimensional $x'$:

$$\hat{x}_n = \sum_{i=1}^{m} z_{ni} u_i + \sum_{i=m+1}^{d} b_i u_i$$

Error: $J = \frac{1}{U} \sum_{n=1}^{U} \|x_n - \hat{x}_n\|^2$, with an orthonormal basis: $u_i^{\top} u_j = 0$ for $i \neq j$.

Solution: $S u_i = \lambda_i u_i$, where $S$ is the data covariance matrix.

Distortion: $J = \sum_{i=m+1}^{d} \lambda_i$
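A minimal numpy sketch of this solution, via eigendecomposition of the data covariance (function and variable names are mine):

```python
import numpy as np

def pca(X, m):
    """Project rows of X onto the m leading eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)               # center the data
    S = np.cov(Xc, rowvar=False)          # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]     # sort descending
    U = eigvecs[:, order[:m]]             # top-m eigenvectors u_1..u_m
    distortion = eigvals[order[m:]].sum() # J = sum of discarded eigenvalues
    return Xc @ U, distortion
```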

Page 13

Principal Component Analysis: 2D Example

[Figure: 2D data, its principal directions, and the projection error; in the new basis the variables are uncorrelated.]

For uncorrelated variables, $u_i = e_i$ and $\lambda_i = \mathrm{var}(x_i)$: PCA simply cuts dimensions according to their variance. For PCA to find a better basis, the variables must be correlated.

Page 14

Outline

• Motivation: low-dimensional representations
• Principal Component Analysis
• Structural Learning
• Vision applications
• NLP applications
• Joint Sparsity
• Vision applications

Page 15

Partial Supervision Setting [Ando & Zhang, JMLR 2005]

Unlabeled dataset + partial supervision → create auxiliary tasks → structure learning → representation $G: X \to X'$.

Page 16

Partial Supervision Setting

Unlabeled data + partial supervision:
• Images with associated natural-language captions.
• Video sequences with associated speech.
• Documents with keywords.

How could the partial supervision help?
• It is a hint for discovering important features.
• Use the partial supervision to define "auxiliary tasks".
• Discover feature groupings that are useful for these tasks.

Sometimes "auxiliary tasks" can be defined from unlabeled data alone, e.g. auxiliary tasks for word tagging that predict substructures.

Page 17

Auxiliary Tasks:

Machine learning papers:
• keywords: machine learning, dimensionality reduction
• keywords: linear embedding, spectral methods, distance learning

Computer vision papers:
• keywords: object recognition, shape matching, stereo

Mask the occurrences of the keywords in the documents.
Auxiliary task: predict "object recognition" from the document content.
Core task: is it a vision or a machine learning article?

Page 18

Auxiliary Tasks

Partially labeled dataset: $U = \{(x_1, c_1), (x_2, c_2), \ldots, (x_u, c_u)\}$

For each keyword $k$, define an auxiliary dataset $D_k = \{(x_1, y_1), (x_2, y_2), \ldots, (x_u, y_u)\}$ with

$$y_i = \begin{cases} 1 & \text{if keyword } k \text{ appears in } c_i \\ -1 & \text{otherwise} \end{cases}$$
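A small sketch of this labeling step (the data structures are my assumptions):

```python
def auxiliary_labels(captions, keyword):
    """One auxiliary task per keyword: +1 if the caption mentions it, else -1."""
    return [1 if keyword in caption else -1 for caption in captions]

# One binary task per keyword, all over the same u examples.
captions = ["team celebrates gold medal", "news conference in Cairo"]
tasks = {k: auxiliary_labels(captions, k) for k in ["team", "medal"]}
```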

Page 19

Structure Learning

Learning with no prior knowledge: hypothesis space $F = \{f_w(x) = w \cdot x\}$; the best hypothesis is learned from examples, $\hat{f} = \arg\min_{f \in F} L(f, D)$.

Learning with prior knowledge: restrict the hypothesis space to $F(\theta) = \{f_v(x) = v \cdot \theta x\}$.

Learning from auxiliary tasks: choose $F(\theta)$ using hypotheses learned for related tasks.

Page 20

Learning Good Hypothesis Spaces

Class of linear predictors: $f(v, x) = v^{\top} \theta x$, where $\theta$ is an $h \times d$ matrix of structural parameters. Goal: find the problem-specific parameters $v_j$ and the shared $\theta$ that minimize the joint loss

$$\min_{\theta, v_1, \ldots, v_m} \sum_{j=1}^{m} \left[ L(v_j, \theta, D_j) + \mathrm{reg}(v_j) \right] + \mathrm{reg}(\theta)$$

where $L(v_j, \theta, D_j)$ is the loss on training set $D_j$.

Page 21

Algorithm Step 1: Train Classifiers for the Auxiliary Tasks

$$w_j^* = \arg\min_{w} \sum_i l(f(w, x_i), y_i) + \frac{C}{2} \|w\|_2^2$$

Page 22

Algorithm Step 2: PCA on the Classifier Coefficients

$$W = [w_1^*, w_2^*, \ldots, w_m^*]$$

Compute $\theta \in \mathbb{R}^{h \times d}$ from the covariance matrix $W W^{\top}$ by taking its first $h$ eigenvectors. The result is a linear subspace of dimension $h$: a good low-dimensional approximation to the space of coefficients.

Page 23

Algorithm Step 3: Training on the Core Task

Project the data: $q(x) = \theta x$.

$$v^* = \arg\min_{v} \sum_i l(f(v, q(x_i)), y_i) + \frac{C}{2} \|v\|_2^2$$

This is equivalent to training the core task in the original $d$-dimensional space with the parameter constraint $w_{core}^* = \theta^{\top} v^*$. Structural learning thus defines a new representation: a subspace constraint on the weights.
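A compact numpy sketch of the three steps, using ridge-regularized least squares in place of the unspecified loss $l$ (that substitution, and all names, are my assumptions):

```python
import numpy as np

def ridge(X, y, C):
    """w = argmin sum_i (w.x_i - y_i)^2 + (C/2)||w||^2  (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + (C / 2) * np.eye(d), X.T @ y)

def structural_learning(X_unlab, aux_labels, X_core, y_core, h, C=1.0):
    # Step 1: train one linear classifier per auxiliary task.
    W = np.column_stack([ridge(X_unlab, y_aux, C) for y_aux in aux_labels])
    # Step 2: PCA on the coefficients -> shared structure theta (h x d).
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    theta = eigvecs[:, np.argsort(eigvals)[::-1][:h]].T
    # Step 3: train the core task on the projected data q(x) = theta x.
    v = ridge(X_core @ theta.T, y_core, C)
    return theta, v  # predict the core task with (theta.T @ v) . x
```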
Page 24

Example

Object = { letter, letter, letter }

An object

abC

Page 25

Example

The same object seen in a different font

Abc

Page 26

Example

The same object seen in a different font

ABc

Page 27

Example

The same object seen in a different font

abC

Page 28

Example

6 letters (topics), 5 fonts per letter (symbols): 30 symbols, hence 30 features. 20 words ("objects"), e.g. "ABC", "ADE", "BCF", "ABD".

An object such as "acE" is encoded as a 30-dimensional indicator vector:
acE → [1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 1]

Auxiliary task: recognize an object (word).

Page 29

PCA on the data cannot recover the latent structure.

[Figure: the 30 × 30 covariance matrix of the data.]

Page 30

PCA on the coefficients can recover the latent structure.

[Figure: weight matrix W for the auxiliary tasks. Rows are features (fonts) and columns are auxiliary tasks; one column holds the parameters for the object "BCD". The topics (letters) appear as groups of rows.]

Page 31

PCA on the coefficients can recover the latent structure.

[Figure: the 30 × 30 covariance matrix of W over the features (fonts). Each block of correlated variables corresponds to a latent topic, i.e. a letter.]

Page 32

Outline

• Motivation: low-dimensional representations
• Principal Component Analysis
• Structural Learning
• Vision applications
• NLP applications
• Joint Sparsity
• Vision applications

Page 33

News domain

[Images: figure skating, ice hockey, Golden Globes, Grammys]

Dataset: news images from the Reuters website. Problem: predicting news topics from images.

Page 34

Learning visual representations using images with captions

The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics.

Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad.

Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine.

Senior Hamas leader Khaled Meshaal (2nd-R), is surrounded by his bodyguards after a news conference in Cairo February 8, 2006.

Jim Scherr, the US Olympic Committee's chief executive officer seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet,

U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival.

Auxiliary task: predict "team" from image content.

Page 35

Learning visual topics

The word "games" might contain the visual topics: medals, people, pavement.
The word "demonstrations" might contain visual topics such as people.

Auxiliary tasks share visual topics: different words can share topics, and each topic can be observed under different appearances.

Page 36

Experimental Results

Page 37

Outline

• Motivation: low-dimensional representations
• Principal Component Analysis
• Structural Learning
• Vision applications
• NLP applications
• Joint Sparsity
• Vision applications

Page 38

Chunking

• Named entity chunking: "Jane lives in New York and works for Bank of New York." → Jane = PER, New York = LOC, Bank of New York = ORG.
• Syntactic chunking: "But economists in Europe failed to predict that …" → NP, PP, NP, VP, SBAR.

Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, …, Outside.

Page 39

Example input vector representation

"… lives in New York …"

For the occurrence of "in": curr-"in" = 1, left-"lives" = 1, right-"New" = 1.
For the occurrence of "New": curr-"New" = 1, left-"in" = 1, right-"York" = 1.

Input vector x:
• High-dimensional vectors.
• Most entries are 0.
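A small sketch of this featurization (the dictionary-of-indicators encoding is my assumption):

```python
def window_features(words, i):
    """Sparse indicator features for the word occurrence at position i."""
    feats = {f'curr-"{words[i]}"': 1}
    if i > 0:
        feats[f'left-"{words[i - 1]}"'] = 1
    if i < len(words) - 1:
        feats[f'right-"{words[i + 1]}"'] = 1
    return feats

print(window_features("lives in New York".split(), 1))
# {'curr-"in"': 1, 'left-"lives"': 1, 'right-"New"': 1}
```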

Page 40

Algorithmic Procedure

1. Create m auxiliary problems.
2. Assign auxiliary labels to the unlabeled data.
3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
4. Fix θ, and minimize the empirical risk on the labeled data for the target task.

Predictor: $f(x) = w^{\top} x + v^{\top} \theta x$, so $\theta x$ acts as additional features.

Page 41

Example auxiliary problems

Is the current word "New"? Is the current word "day"? Is the current word "IBM"? Is the current word "computer"? …

Split the feature vector into the current-word block (1) and the left-word and right-word context blocks (2). Predict block 1 from block 2, compute the shared θ, and add θx as new features.

Page 42

Experiments (CoNLL-03 named entity)

• 4 classes: LOC, ORG, PER, MISC.
• Labeled data: news documents; 204K words (English), 206K words (German).
• Unlabeled data: 27M words (English), 35M words (German).
• Features: a slight modification of ZJ03. Words, POS, char types, 4 chars at the beginning/ending in a 5-word window; words in a 3-chunk window; labels assigned to two words on the left; bi-gram of the current word and left label; labels assigned to previous occurrences of the current word.
• No gazetteer. No hand-crafted resources.

Page 43

Auxiliary problems

# of aux. problems | Auxiliary labels | Features used for learning auxiliary problems
1000               | Previous words   | All but previous words
1000               | Current words    | All but current words
1000               | Next words       | All but next words

In total: 3 × 1000 = 3000 auxiliary problems.

Page 44

Syntactic chunking results (CoNLL-00)

method     | description                | F-measure
supervised | baseline                   | 93.60
ASO-semi   | + unlabeled data           | 94.39 (+0.79)
Co/self    | oracle, + unlabeled data   | 93.66
KM01       | SVM combination            | 93.91
CM03       | perceptron in two layers   | 93.74
ZDJ02      | Reg. Winnow                | 93.57
ZDJ02+     | + full parser (ESG) output | 94.17

ASO-semi exceeds the previous best systems.

Page 45

Other experiments

Confirmed effectiveness on:
• POS tagging
• Text categorization (2 standard corpora)

Page 46

Outline

• Motivation: low-dimensional representations
• Principal Component Analysis
• Structural Learning
• Vision applications
• NLP applications
• Joint Sparsity
• Vision applications

Page 47

Notation

Collection of tasks: $D = \{D_1, D_2, \ldots, D_m\}$, where $D_k = \{(x_1^k, y_1^k), \ldots, (x_{n_k}^k, y_{n_k}^k)\}$ with $x \in \mathbb{R}^d$ and $y \in \{-1, +1\}$.

Joint sparse approximation: collect the task coefficients into a $d \times m$ matrix

$$W = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,m} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,m} \\ \vdots & & & \vdots \\ w_{d,1} & w_{d,2} & \cdots & w_{d,m} \end{bmatrix}$$

Page 48

Single Task Sparse Approximation

Consider learning a single sparse linear classifier of the form $f(x) = w \cdot x$; we want a few features with non-zero coefficients. Recent work suggests using L1 regularization:

$$\arg\min_{w} \sum_{(x,y) \in D} l(f(x), y) + Q \sum_{j=1}^{d} |w_j|$$

The first term is the classification error; the L1 penalty discourages non-sparse solutions.

Donoho [2004] proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
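For illustration, a sparse linear classifier of this form can be trained with off-the-shelf tools; a sketch with scikit-learn (the logistic loss, the synthetic data, and all parameter values are my assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))

# L1 penalty; C plays the role of 1/Q in the objective above.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print((clf.coef_ != 0).sum(), "non-zero coefficients out of 50")
```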

Page 49

Joint Sparse Approximation

Setting: joint sparse approximation with one linear classifier per task, $f_k(x) = w_k \cdot x$:

$$\arg\min_{w_1, \ldots, w_m} \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) + Q\, R(w_1, \ldots, w_m)$$

The first term is the average loss on each training set $k$; the penalty $R$ penalizes solutions that utilize too many features.

Page 50

Joint Regularization Penalty

How do we penalize solutions that use too many features? The direct choice is

$$R(W) = \#\{\text{non-zero rows of } W\}$$

where a row of $W$ holds the coefficients of one feature across all tasks, and a column holds the coefficients of one classifier. This penalty, however, would lead to a hard combinatorial problem (greedy approximations such as Simultaneous Orthogonal Matching Pursuit exist).
Page 51

Joint Regularization Penalty

We will use an L1-∞ norm [Tropp 2006]:

$$R(W) = \sum_{i=1}^{d} \max_{k} |W_{ik}|$$

This norm combines:
• An L1 norm on the maximum absolute values of the coefficients across tasks, which promotes sparsity: use few features.
• An L∞ norm on each row, which promotes non-sparsity within a row: share features.

The combination yields a solution in which only a few features are used, but the features that are used contribute to solving many classification problems.
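The penalty itself is one line of numpy (a sketch; W is the d × m coefficient matrix defined above):

```python
import numpy as np

def l1_inf(W):
    """L1-inf norm: sum over rows (features) of the max |coef| across tasks."""
    return np.abs(W).max(axis=1).sum()
```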

Page 52

Joint Sparse Approximation

Using the L1-∞ norm we can rewrite the objective function as:

$$\min_{W} \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) + Q \sum_{i=1}^{d} \max_{k} |W_{ik}|$$

For any convex loss this is a convex objective. For the hinge loss, $l(f(x), y) = \max(0, 1 - y f(x))$, the optimization problem can be expressed as a linear program.

Page 53

Joint Sparse Approximation

Linear program formulation (hinge loss). Objective:

$$\min_{W, \varepsilon, t} \; \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{j=1}^{|D_k|} \varepsilon_j^k + Q \sum_{i=1}^{d} t_i$$

Max-value constraints: for $k = 1{:}m$ and $i = 1{:}d$,

$$-t_i \le w_{ik} \le t_i$$

Slack variable constraints: for $k = 1{:}m$ and $j = 1{:}|D_k|$,

$$y_j^k f_k(x_j^k) \ge 1 - \varepsilon_j^k \quad \text{and} \quad \varepsilon_j^k \ge 0$$

Page 54

An efficient training algorithm

The LP formulation can be optimized using standard LP solvers; it is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions. We might also want a more general optimization algorithm that can handle arbitrary convex losses.

We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints. Its total cost is on the order of $O(dm \log(dm))$.
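The talk does not spell that algorithm out. As a rough stand-in, here is a plain subgradient step on the penalized objective (hinge loss plus the L1-∞ penalty); this is not the authors' method, only an illustration of the objective:

```python
import numpy as np

def subgradient_step(W, tasks, Q, lr):
    """One subgradient step on sum_k avg hinge loss + Q * L1-inf penalty."""
    G = np.zeros_like(W)  # W is d x m
    for k, (X, y) in enumerate(tasks):  # X: n_k x d, y: n_k in {-1,+1}
        margins = y * (X @ W[:, k])
        active = margins < 1.0  # examples where the hinge is active
        G[:, k] -= (X[active] * y[active, None]).sum(axis=0) / len(y)
    # Subgradient of Q * sum_i max_k |W_ik|: sign at each row's argmax entry.
    rows = np.arange(W.shape[0])
    cols = np.abs(W).argmax(axis=1)
    G[rows, cols] += Q * np.sign(W[rows, cols])
    return W - lr * G
```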

Page 55

Outline

• Motivation: low-dimensional representations
• Principal Component Analysis
• Structural Learning
• Vision applications
• NLP applications
• Joint Sparsity
• Vision applications

Page 56

10 news topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.

Learn a representation using labeled data from 9 topics:
• Learn the matrix $W$ using our transfer algorithm.
• Define the set of relevant features to be $R = \{r : \max_k |w_{rk}| > 0\}$.
• Train a classifier for the 10th held-out topic using the relevant features $R$ only.
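A sketch of the feature-selection step (numpy; names and the tolerance are my assumptions):

```python
import numpy as np

def relevant_features(W, tol=1e-8):
    """Features whose coefficients are non-zero for some training topic."""
    return np.flatnonzero(np.abs(W).max(axis=1) > tol)

# Train the held-out topic on X[:, relevant_features(W)] only.
```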

Page 57

Results

[Figure: asymmetric transfer. Average AUC (0.52 to 0.72) vs. number of training samples (4 to 140) for the baseline representation and the transferred representation.]

Page 58

Future Directions

Joint Sparsity Regularization to control inference time.

Learning representations for ranking problems.