
Advanced Machine Learning

Mehryar Mohri, Courant Institute & Google Research

Structured Prediction

Structured output: $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$.

Loss function: decomposable, $L \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$.

• Example: Hamming loss, $L(y, y') = \frac{1}{l} \sum_{k=1}^{l} 1_{y_k \neq y'_k}$.

• Example: edit-distance loss, $L(y, y') = \frac{1}{l}\, d_{\text{edit}}(y_1 \cdots y_l,\; y'_1 \cdots y'_l)$.
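As a concrete illustration, here is a minimal Python sketch of these two decomposable losses; the function names and the simple dynamic-programming edit distance are illustrative choices, not from the slides.

```python
def hamming_loss(y, y_prime):
    """Normalized Hamming loss: fraction of positions where the labels differ."""
    assert len(y) == len(y_prime)
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)

def edit_distance_loss(y, y_prime):
    """Edit-distance loss, normalized by the length of the reference y."""
    m, n = len(y), len(y_prime)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if y[i - 1] == y_prime[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n] / m

# Example on POS tag sequences: hamming_loss("DNVDN", "DNVDA") == 0.2
```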

Examples

Pronunciation modeling.

Part-of-speech tagging.

Named-entity recognition.

Context-free parsing.

Dependency parsing.

Machine translation.

Image segmentation.

Examples: NLP Tasks

Pronunciation: "I have formulated a" → "ay hh ae v f ow r m y ax l ey t ih d ax".

POS tagging: "The thief stole a car" → "D N V D N".

Context-free parsing / dependency parsing: [Figure: constituency tree (S, VP, NP nodes) and dependency tree rooted at "root" over "The thief stole a car" with tags D N V D N.]

Examples: Image Segmentation

[Figure: image segmentation examples.]

Predictors

Family $H$ of scoring functions mapping from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$.

For any $h \in H$, prediction based on highest score:
$$\forall x \in \mathcal{X}, \quad \mathsf{h}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} h(x, y).$$

Decomposition as a sum modeled by factor graphs.

Factor Graph Examples

Pairwise Markov network decomposition (factor graph with variable nodes 1, 2, 3 and factors $f_1$, $f_2$):
$$h(x, y) = h_{f_1}(x, y_1, y_2) + h_{f_2}(x, y_2, y_3).$$

Other decomposition:
$$h(x, y) = h_{f_1}(x, y_1, y_3) + h_{f_2}(x, y_1, y_2, y_3).$$

Factor Graphs

$G = (V, F, E)$: factor graph.

$N(f)$: neighborhood of $f$.

$\mathcal{Y}_f = \prod_{k \in N(f)} \mathcal{Y}_k$: substructure set cross-product at $f$.

Decomposition:
$$h(x, y) = \sum_{f \in F} h_f(x, y_f).$$

More generally, example-dependent factor graph, $G_i = G(x_i, y_i) = (V_i, F_i, E_i)$.

Linear Hypotheses

Feature decomposition ⇒ hypothesis decomposition.

• Example: bigram decomposition,
$$\Phi(x, y) = \sum_{s=1}^{l} \phi(x, s, y_{s-1}, y_s),$$
$$h(x, y) = w \cdot \Phi(x, y) = \sum_{s=1}^{l} \underbrace{w \cdot \phi(x, s, y_{s-1}, y_s)}_{h_s(x,\, y_{s-1},\, y_s)}.$$

• Illustration: x = "his cat ate the fish", y = "D N V D N", position k = 4: local feature vector $\phi(x, 4, y_3, y_4)$.
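A minimal sketch of this bigram decomposition, assuming a user-supplied local feature map `phi(x, s, y_prev, y_s)` that returns a NumPy vector; the names are illustrative, not from the slides.

```python
import numpy as np

def score(w, phi, x, y):
    """h(x, y) = w · Φ(x, y), with Φ decomposed over bigram positions."""
    total = 0.0
    y_prev = None  # stands for y_0: no previous label at the first position
    for s, y_s in enumerate(y, start=1):
        total += w @ phi(x, s, y_prev, y_s)  # local term h_s(x, y_{s-1}, y_s)
        y_prev = y_s
    return total
```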

Structured Prediction Problem

Training data: sample drawn i.i.d. from $\mathcal{X} \times \mathcal{Y}$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in \mathcal{X} \times \mathcal{Y}.$$

Problem: find hypothesis $h \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ in $H$ with small expected loss:
$$R(h) = \operatorname*{E}_{(x, y) \sim D}\big[L(\mathsf{h}(x), y)\big].$$

• learning guarantees?

• role of factor graph?

• better algorithms?

Outline

Generalization bounds.

Algorithms.

Learning Guarantees

Standard multi-class learning bounds:

• number of classes is exponential!

Structured prediction bounds:

• covering number bounds: Hamming loss, linear hypotheses (Taskar et al., 2003).

• PAC-Bayesian bounds (randomized algorithms) (David McAllester, 2007).

Can we derive learning guarantees for general hypothesis sets and general loss functions?

Covering Number Bound (Taskar et al., 2003)

Theorem: fix $\rho > 0$. Then, with probability at least $1 - \rho$ over the choice of a sample $S$ of size $m$, the following holds for any hypothesis $h \colon (x, y) \mapsto w \cdot \Phi(x, y)$:
$$\operatorname*{E}_{(x,y) \sim D}\big[L_H(h, x, y)\big] \leq \frac{1}{m} \sum_{i=1}^{m} \sup_{f \in F^{\rho}_S(h)} L_H(f, x_i, y_i) + O\!\left( \sqrt{\frac{1}{m} \frac{R^2 \|w\|^2}{\rho^2} \Big(\log m + \log l + \log \max_k |\mathcal{Y}_k|\Big)} \right),$$
where
$$F^{\rho}_S(h) = \Big\{ f \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R} \;\Big|\; \forall y \in \mathcal{Y}, \forall i \in [1, m], \; |f(x_i, y) - h(x_i, y)| \leq \rho\, H(y, y_i) \Big\}.$$

Factor Graph Complexity (Cortes, Kuznetsov, MM, Yang, 2016)

Empirical factor graph complexity for hypothesis set $H$ and sample $S = (x_1, \ldots, x_m)$:
$$\widehat{\mathfrak{R}}^G_S(H) = \frac{1}{m} \operatorname*{E}_{\epsilon}\left[ \sup_{h \in H} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} \sqrt{|F_i|}\, \epsilon_{i,f,y}\, h_f(x_i, y) \right]
= \operatorname*{E}_{\epsilon}\left[ \sup_{h \in H} \frac{1}{m} \underbrace{\big(\ldots \epsilon_{i,f,y} \ldots\big) \cdot \big(\ldots \sqrt{|F_i|}\, h_f(x_i, y) \ldots\big)}_{\text{correlation with random noise}} \right].$$

Factor graph complexity:
$$\mathfrak{R}^G_m(H) = \operatorname*{E}_{S \sim D^m}\Big[ \widehat{\mathfrak{R}}^G_S(H) \Big].$$

Margin

Definition: the margin of $h$ at a labeled point $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is
$$\rho_h(x, y) = \min_{y' \neq y} \big( h(x, y) - h(x, y') \big).$$

• error when $\rho_h(x, y) \leq 0$.

• small margin interpreted as low confidence.

Loss Function

Assumptions:

• bounded: $\max_{y, y'} L(y, y') \leq M$ for some $M > 0$.

• definite: $L(y, y') = 0 \Rightarrow y = y'$.

Consequence:
$$L(\mathsf{h}(x), y) = L(\mathsf{h}(x), y)\, 1_{\rho_h(x, y) \leq 0}.$$

Empirical Margin Losses

For any $\rho > 0$,
$$\widehat{R}^{\mathrm{add}}_{S,\rho}(h) = \operatorname*{E}_{(x,y) \sim S}\left[ \Phi_M\!\left( \max_{y' \neq y} \left( L(y', y) - \frac{h(x, y) - h(x, y')}{\rho} \right) \right) \right],$$
$$\widehat{R}^{\mathrm{mult}}_{S,\rho}(h) = \operatorname*{E}_{(x,y) \sim S}\left[ \Phi_M\!\left( \max_{y' \neq y} L(y', y) \left( 1 - \frac{h(x, y) - h(x, y')}{\rho} \right) \right) \right].$$

[Plot of $\Phi_M(u)$: zero for $u \leq 0$, increasing in $u$, capped at $M$.]

Generalization Bounds (Cortes, Kuznetsov, MM, Yang, 2016)

Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds for all $h \in H$:
$$R(h) \leq \widehat{R}^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt{2}}{\rho}\, \mathfrak{R}^G_m(H) + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
$$R(h) \leq \widehat{R}^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt{2}\, M}{\rho}\, \mathfrak{R}^G_m(H) + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

• tightest margin bounds for structured prediction.

• data-dependent.

• improves upon the bound of (Taskar et al., 2003) by log terms (in the special case they study).

Linear Hypotheses

Hypothesis set used by most convex structured prediction algorithms (StructSVM, M3N, CRF):
$$H_p = \Big\{ (x, y) \mapsto w \cdot \Psi(x, y) : w \in \mathbb{R}^N, \|w\|_p \leq \Lambda_p \Big\},$$
with $p \geq 1$ and $\Psi(x, y) = \sum_{f \in F} \Psi_f(x, y_f)$.

Complexity Bounds

Bounds on the factor graph complexity of linear hypothesis sets:
$$\widehat{\mathfrak{R}}^G_S(H_1) \leq \frac{\Lambda_1 r_\infty \sqrt{s \log(2N)}}{m},
\qquad
\widehat{\mathfrak{R}}^G_S(H_2) \leq \frac{\Lambda_2 r_2 \sqrt{\sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i|}}{m},$$
with
$$r_q = \max_{i, f, y} \|\Psi_f(x_i, y)\|_q,
\qquad
s = \max_{j \in [1, N]} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i|\, 1_{\Psi_{f,j}(x_i, y) \neq 0}.$$

Key Term

Sparsity parameter:
$$s \leq \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i| \leq \sum_{i=1}^{m} |F_i|^2 d_i \leq m \max_i |F_i|^2 d_i,
\quad \text{where } d_i = \max_{f \in F_i} |\mathcal{Y}_f|.$$

• factor graph complexity in $O\big(\sqrt{\log(N) \max_i |F_i|^2 d_i / m}\big)$ for hypothesis set $H_1$.

• key term: average factor graph size.

NLP Applications

Features:

• $\Psi_{f,j}$ is often a binary function, non-zero for a single pair $(x, y) \in \mathcal{X} \times \mathcal{Y}_f$.

• example: presence of the n-gram indexed by $j$ at position $f$ of the output, with input sentence $x_i$.

• complexity term only in $O\big(\max_i |F_i| \sqrt{\log(N)/m}\big)$.

Theory Takeaways

Key generalization terms:

• average size of the factor graphs.

• empirical margin loss.

But is learning with very complex hypothesis sets (large factor graph complexity) possible?

• richer families needed for difficult NLP tasks.

• but the generalization bound indicates a risk of overfitting.

Voted Risk Minimization (VRM) theory (Cortes, Kuznetsov, MM, Yang, 2016).

Outline

Generalization bounds.

Algorithms.

Surrogate Loss

Lemma: for any $u \in \mathbb{R}_+$, let $\Phi_u \colon \mathbb{R} \to \mathbb{R}$ be an upper bound on $v \mapsto u\, 1_{v \leq 0}$. Then, the following upper bound holds for any $h \in H$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$:
$$L(\mathsf{h}(x), y) \leq \max_{y' \neq y} \Phi_{L(y', y)}\big(h(x, y) - h(x, y')\big).$$

Proof: if $\mathsf{h}(x) \neq y$, then the following holds:
$$\begin{aligned}
L(\mathsf{h}(x), y) &= L(\mathsf{h}(x), y)\, 1_{\rho_h(x, y) \leq 0} \\
&\leq \Phi_{L(\mathsf{h}(x), y)}\big(\rho_h(x, y)\big) \\
&= \Phi_{L(\mathsf{h}(x), y)}\big(h(x, y) - \max_{y' \neq y} h(x, y')\big) \\
&= \Phi_{L(\mathsf{h}(x), y)}\big(h(x, y) - h(x, \mathsf{h}(x))\big) \\
&\leq \max_{y' \neq y} \Phi_{L(y', y)}\big(h(x, y) - h(x, y')\big).
\end{aligned}$$

Φ-Choices

Different algorithms:

• StructSVM: $\Phi_u(v) = \max(0, u(1 - v))$.

• M3N: $\Phi_u(v) = \max(0, u - v)$.

• CRF: $\Phi_u(v) = \log(1 + e^{u - v})$.

• StructBoost: $\Phi_u(v) = u e^{-v}$ (Cortes, Kuznetsov, MM, Yang, 2016).
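A small Python sketch of these four surrogate choices, written as plain functions of u (the loss value $L(y', y)$) and v (the score difference $h(x, y) - h(x, y')$); the function names are illustrative.

```python
import math

def phi_structsvm(u, v):
    # Slack rescaling hinge: max(0, u(1 - v)).
    return max(0.0, u * (1.0 - v))

def phi_m3n(u, v):
    # Margin rescaling hinge: max(0, u - v).
    return max(0.0, u - v)

def phi_crf(u, v):
    # Soft-max (logistic) surrogate: log(1 + e^{u - v}).
    return math.log1p(math.exp(u - v))

def phi_structboost(u, v):
    # Exponential surrogate: u e^{-v}.
    return u * math.exp(-v)
```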

Algorithms

StructSVM.

Maximum Margin Markov Networks (M3N).

Conditional Random Fields (CRF).

Regression for Learning Transducers (RLT).

Linear Prediction

Features: function $\Phi \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^N$.

Hypothesis set: functions of the form
$$\mathsf{h}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} w \cdot \Phi(x, y),$$
where the vector $w$ is learned from data.

Formulation:

• scoring functions.

• multi-class classification $\mathsf{h} \colon \mathcal{X} \to \mathcal{Y}$.

• margin: $\rho_w(x_i, y_i) = w \cdot \Phi(x_i, y_i) - \max_{y \neq y_i} w \cdot \Phi(x_i, y)$.

Multi-Class SVM

Optimization problem:
$$\min_{w} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \max_{y \neq y_i} \Big( 1 - w \cdot \big[ \Phi(x_i, y_i) - \Phi(x_i, y) \big] \Big)_+.$$

Decision function:
$$x \mapsto \operatorname*{argmax}_{y \in \mathcal{Y}} w \cdot \Phi(x, y).$$

SVMStruct (Tsochantaridis et al., 2005)

Optimization problem (StructSVM):
$$\min_{w} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \max_{y \neq y_i} L(y_i, y) \max\Big( 0,\; 1 - w \cdot \underbrace{\big[ \Phi(x_i, y_i) - \Phi(x_i, y) \big]}_{=\, \delta\Psi(x_i, y_i, y)} \Big).$$

• solution based on iteratively solving a QP and adding the most violating constraint.

• no specific assumption on the loss.

• use of kernels.

M3N (Taskar et al., 2003)

Optimization problem:
$$\min_{w} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \max_{y \neq y_i} \max\Big( 0,\; L(y_i, y) - w \cdot \underbrace{\big[ \Phi(x_i, y_i) - \Phi(x_i, y) \big]}_{=\, \delta\Psi(x_i, y_i, y)} \Big).$$

• $\mathcal{Y}$ assumed to have a graph structure with a Markov property, typically a chain or a tree.

• loss assumed decomposable in the same way.

• polynomial-time algorithm using the graphical model structure.

• use of kernels.
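To make the difference between the two objectives concrete, here is a minimal NumPy sketch of the per-example terms (slack rescaling for StructSVM, margin rescaling for M3N), assuming a small finite candidate set `ys`, a feature map `Phi`, and a loss `L`; all names are illustrative, not from the slides.

```python
import numpy as np

def structsvm_term(w, Phi, L, x_i, y_i, ys):
    """max_{y != y_i} L(y_i, y) * max(0, 1 - w·[Φ(x_i,y_i) - Φ(x_i,y)])."""
    return max(
        L(y_i, y) * max(0.0, 1.0 - w @ (Phi(x_i, y_i) - Phi(x_i, y)))
        for y in ys if y != y_i
    )

def m3n_term(w, Phi, L, x_i, y_i, ys):
    """max_{y != y_i} max(0, L(y_i, y) - w·[Φ(x_i,y_i) - Φ(x_i,y)])."""
    return max(
        max(0.0, L(y_i, y) - w @ (Phi(x_i, y_i) - Phi(x_i, y)))
        for y in ys if y != y_i
    )
```

In practice the max over y is not computed by enumeration but by loss-augmented decoding over the structured output set.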

Equivalent Formulations

Optimization problems:
$$\min_{w, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{s.t.} \quad w \cdot \big[ \Phi(x_i, y_i) - \Phi(x_i, y) \big] \geq 1 - \frac{\xi_i}{L(y, y_i)}, \;\; \xi_i \geq 0, \;\; \forall i \in [1, m],\; y \neq y_i;$$

$$\min_{w, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{s.t.} \quad w \cdot \big[ \Phi(x_i, y_i) - \Phi(x_i, y) \big] \geq L(y, y_i) - \xi_i, \;\; \xi_i \geq 0, \;\; \forall i \in [1, m],\; y \neq y_i.$$

Dual Problem

Optimization problem:
$$\max_{\alpha \geq 0} \; \sum_{i,\, y \neq y_i} \alpha_{iy} - \frac{1}{2} \sum_{\substack{i,\, y \neq y_i \\ j,\, y' \neq y_j}} \alpha_{iy} \alpha_{jy'} \big\langle \delta\Psi_i(y), \delta\Psi_j(y') \big\rangle
\quad \text{s.t.} \quad \sum_{y \neq y_i} \frac{\alpha_{iy}}{L(y_i, y)} \leq \frac{C}{m}, \;\; \forall i \in [1, m],$$
where $\delta\Psi_i(y) = \Phi(x_i, y_i) - \Phi(x_i, y)$.

Can use a PDS kernel.

Optimization Solution (Tsochantaridis et al., 2005)

Cutting plane method: number of steps in $\mathrm{poly}\big(\frac{1}{\epsilon}, C, \max_{y, i} L(y, y_i)\big)$.

• start with empty constraint sets $S_i = \emptyset$, $i = 1 \ldots m$.

• do until no new constraint:

• for $i = 1 \ldots m$ do

• find the most violating constraint: $\widehat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} \underbrace{L(y, y_i)\big[ 1 - w \cdot [\Phi(x_i, y_i) - \Phi(x_i, y)] \big]}_{=\, \xi_i(y)}$.

• if $\xi_i(\widehat{y}) > \max_{y \in S_i} \xi_i(y) + \epsilon$: $S_i \leftarrow S_i \cup \{\widehat{y}\}$.

• dual solution $\alpha$ for $\cup_{i=1}^{m} S_i$.
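A schematic Python sketch of this cutting-plane loop, assuming helper routines `most_violated(w, i)` (loss-augmented argmax), `xi(w, i, y)` (the violation value defined above), and `solve_qp(S)` returning an updated weight vector; these helpers are placeholders, not part of the slides.

```python
def cutting_plane(m, eps, most_violated, xi, solve_qp, w0):
    """Iteratively add the most violating constraint per example and re-solve the QP."""
    S = [set() for _ in range(m)]    # working constraint sets S_i
    w = w0
    added = True
    while added:                     # loop until no new constraint is added
        added = False
        for i in range(m):
            y_hat = most_violated(w, i)
            # reference violation over the current working set (0 if empty, by assumption)
            current = max((xi(w, i, y) for y in S[i]), default=0.0)
            if xi(w, i, y_hat) > current + eps:
                S[i].add(y_hat)
                w = solve_qp(S)      # dual/QP solution over the union of the S_i
                added = True
    return w, S
```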

CRF = Cond. Maxent Model (Lafferty et al., 2001)

Definition: conditional probability distribution over the outputs $y \in \mathcal{Y}$:
$$p_w(y \mid x) = \frac{\exp\big(w \cdot \Phi(x, y)\big)}{Z_w(x)},
\quad \text{with} \quad Z_w(x) = \sum_{y \in \mathcal{Y}} \exp\big(w \cdot \Phi(x, y)\big).$$

• $\mathcal{Y}$ assumed to have a graph structure with a Markov property, typically a chain or a tree.

CRF

Optimization problem (CRFs):
$$\min_{w} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \log \sum_{y \in \mathcal{Y}} \exp\Big( L(y_i, y) - w \cdot \big[ \Phi(x_i, y_i) - \Phi(x_i, y) \big] \Big).$$

• comparison with M3N: max (M3N) vs. soft-max (CRF).

• smooth optimization problem, solutions in $O(C \log(1/\epsilon))$.

Features

Definitions:

• output alphabet $\Delta$, $|\Delta| = r$.

• input: $x = x_1 \cdots x_l$.

• output: $y = y_1 \cdots y_l \in \Delta^l$.

Decomposition (bigram case):
$$\Phi(x, y) = \sum_{k=1}^{l} \phi(x, k, y_{k-1}, y_k).$$

Prediction

Computation:
$$\operatorname*{argmax}_{y \in \Delta^l} w \cdot \Phi(x, y) = \operatorname*{argmax}_{y \in \Delta^l} \sum_{k=1}^{l} w \cdot \phi(x, k, y_{k-1}, y_k).$$

• exponentially many possible outputs.

Solution:

• cast as a single-source shortest-distance problem in an acyclic directed graph with $(r^2 l + r)$ edges.

• linear-time algorithms: standard acyclic shortest-distance algorithm (Lawler) or the Viterbi algorithm.
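A compact Viterbi sketch over the trellis described on the next slide, assuming a local score function `local(x, k, y_prev, y_k)` equal to $w \cdot \phi(x, k, y_{k-1}, y_k)$; the names and the use of `None` for the initial label $y_0$ are illustrative choices.

```python
def viterbi(x, l, labels, local):
    """argmax over y in labels^l of sum_k local(x, k, y_{k-1}, y_k)."""
    best = {y: local(x, 1, None, y) for y in labels}   # best score of a length-1 prefix
    back = []                                          # backpointers for positions 2..l
    for k in range(2, l + 1):
        new_best, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[yp] + local(x, k, yp, y))
            new_best[y] = best[prev] + local(x, k, prev, y)
            ptr[y] = prev
        best = new_best
        back.append(ptr)
    y_last = max(labels, key=lambda y: best[y])        # best final label
    path = [y_last]
    for ptr in reversed(back):                         # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path)), best[y_last]
```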

Directed Graph

[Figure: trellis with vertices $(y_j, k)$ for labels $y_1, \ldots, y_r$ and positions $k = 1, \ldots, l$, a source vertex labeled with $y_0 = \epsilon$, and edges from $(y_{k-1}, k-1)$ to $(y_k, k)$ weighted by $w \cdot \phi(x, k, y_{k-1}, y_k)$.]

Estimation

Key term in the gradient computation:
$$\nabla_w F(w) = \frac{1}{m} \sum_{i=1}^{m} \operatorname*{E}_{y \sim p_w[\cdot \mid x_i]}\big[\Phi(x_i, y)\big] - \operatorname*{E}_{(x, y) \sim S}\big[\Phi(x, y)\big] + \lambda w.$$

Computation:
$$\begin{aligned}
\operatorname*{E}_{y \sim p_w[\cdot \mid x_i]}\big[\Phi(x_i, y)\big]
&= \sum_{y \in \Delta^l} p_w[y \mid x_i]\, \Phi(x_i, y) \\
&= \sum_{y \in \Delta^l} p_w[y \mid x_i] \Big[ \sum_{k=1}^{l} \phi(x_i, k, y_{k-1}, y_k) \Big] \\
&= \sum_{k=1}^{l} \sum_{(y, y') \in \Delta^2} \Bigg[ \sum_{\substack{y_{k-1} = y \\ y_k = y'}} p_w[y \mid x_i] \Bigg] \phi(x_i, k, y, y').
\end{aligned}$$

Flow Computation

Decomposition:
$$p_w(y \mid x_i) = \frac{\exp\big(w \cdot \Phi(x_i, y)\big)}{Z_w(x_i)},
\quad \text{with} \quad
\exp\big(w \cdot \Phi(x_i, y)\big) = \prod_{k=1}^{l} \underbrace{\exp\big(w \cdot \phi(x_i, k, y_{k-1}, y_k)\big)}_{a(k,\, y_{k-1},\, y_k)}.$$

Flow: sum of the weights of all paths going through a given transition.

• linear-time computation.

• two single-source shortest-distance algorithms.

• computational cost in $O(r^2 l)$.

Directed Graph

[Figure: same trellis, annotated with the forward weight $\alpha((y_{k-1}, k-1))$, the backward weight $\beta((y_k, k))$, and the transition weight $a(k, y_{k-1}, y_k)$.]

Computation

Single-source shortest-distance problems in the $(+, \times)$ semiring:

• $\alpha(q)$: sum of the weights of all paths from the initial vertex to $q$.

• $\beta(q)$: sum of the weights of all paths from $q$ to the final vertex.

• linear-time algorithms for acyclic graphs.

Partition function $Z_w(x_i)$: sum of the weights of all accepting paths, $\beta((y_0, 0))$.

Formula:
$$\sum_{\substack{y_{k-1} = y \\ y_k = y'}} p_w[y \mid x_i] = \frac{\alpha((y, k-1)) \cdot a(k, y, y') \cdot \beta((y', k))}{\beta((y_0, 0))}.$$
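A minimal NumPy forward-backward sketch of this flow computation on a chain, assuming the transition weights are given as an initial vector plus a list of r-by-r matrices; this is an illustration of the formula above, not code from the slides.

```python
import numpy as np

def chain_marginals(a_init, a):
    """Pairwise marginals and partition function on a chain via forward-backward.

    a_init: length-r vector, a_init[y] = exp(w·φ(x, 1, y_0, y)).
    a: list of l-1 r-by-r matrices, a[k][y, y'] = exp(w·φ(x, k+2, y, y')).
    """
    r, l = len(a_init), len(a) + 1
    alpha = np.zeros((l, r))
    beta = np.ones((l, r))
    alpha[0] = a_init
    for k in range(1, l):                      # forward weights α
        alpha[k] = alpha[k - 1] @ a[k - 1]
    for k in range(l - 2, -1, -1):             # backward weights β
        beta[k] = a[k] @ beta[k + 1]
    Z = alpha[-1].sum()                        # partition function Z_w(x)
    # marginal probability of each transition (y at position k+1, y' at position k+2)
    marginals = [np.outer(alpha[k], beta[k + 1]) * a[k] / Z for k in range(l - 1)]
    return marginals, Z
```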

RLT (Cortes, MM, Weston, 2005)

Definition: formulated as a regression problem.

• learning transduction (regression).

• prediction: finding the pre-image.

[Diagram: $f \colon X^* \to Y^*$; feature maps $\Phi_X \colon X^* \to F_X$ and $\Phi_Y \colon Y^* \to F_Y$; regression $g \colon F_X \to F_Y$; pre-image $\Phi_Y^{-1}$.]

RLT

Optimization problem:
$$\operatorname*{argmin}_{\mathbf{W} \in \mathbb{R}^{N_2 \times N_1}} F(\mathbf{W}) = \lambda \|\mathbf{W}\|_F^2 + \sum_{i=1}^{m} \big\| \mathbf{W} M_{x_i} - M_{y_i} \big\|^2,
\quad \text{with } M_{x_i} = \Phi_X(x_i), \; M_{y_i} = \Phi_Y(y_i).$$

• generalized ridge regression problem.

• closed-form solution, single matrix inversion.

• can be generalized to encoding constraints.

• use of kernels.

Solution

Primal:
$$\mathbf{W} = M_Y M_X^\top \big( M_X M_X^\top + \lambda I \big)^{-1}.$$

Dual:
$$\mathbf{W} = M_Y \big( K_X + \lambda I \big)^{-1} M_X^\top.$$

Regression solution: $g(x) = \mathbf{W} M_x$.
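A NumPy sketch of the two closed forms, assuming `MX` is the $N_1 \times m$ matrix whose columns are the $\Phi_X(x_i)$ and `MY` the $N_2 \times m$ matrix whose columns are the $\Phi_Y(y_i)$; the variable names are illustrative.

```python
import numpy as np

def rlt_primal(MX, MY, lam):
    """W = M_Y M_X^T (M_X M_X^T + λI)^{-1}  (inverts an N1 x N1 matrix)."""
    n1 = MX.shape[0]
    return MY @ MX.T @ np.linalg.inv(MX @ MX.T + lam * np.eye(n1))

def rlt_dual(MX, MY, lam):
    """W = M_Y (K_X + λI)^{-1} M_X^T, with K_X = M_X^T M_X  (inverts an m x m matrix)."""
    m = MX.shape[1]
    KX = MX.T @ MX
    return MY @ np.linalg.inv(KX + lam * np.eye(m)) @ MX.T

# The two forms agree; the dual form is preferable when m << N1 or when only K_X is available.
```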

Prediction

Prediction using kernels:
$$\begin{aligned}
f(x) &= \operatorname*{argmin}_{y \in Y^*} \big\| \mathbf{W} M_x - M_y \big\|^2 \\
&= \operatorname*{argmin}_{y \in Y^*} \big[ M_y^\top M_y - 2 M_y^\top \mathbf{W} M_x \big] \\
&= \operatorname*{argmin}_{y \in Y^*} \big[ M_y^\top M_y - 2 M_y^\top M_Y (K_X + \lambda I)^{-1} M_X^\top M_x \big] \\
&= \operatorname*{argmin}_{y \in Y^*} \big[ K_Y(y, y) - 2 (K_Y^y)^\top (K_X + \lambda I)^{-1} K_X^x \big],
\end{aligned}$$
with
$$K_Y^y = \begin{bmatrix} K_Y(y, y_1) \\ \vdots \\ K_Y(y, y_m) \end{bmatrix}
\quad \text{and} \quad
K_X^x = \begin{bmatrix} K_X(x, x_1) \\ \vdots \\ K_X(x, x_m) \end{bmatrix}.$$

Example: N-gram kernel

Definition: for any two strings $y_1$ and $y_2$,
$$k_n(y_1, y_2) = \sum_{|u| = n} |y_1|_u\, |y_2|_u,$$
where $|y|_u$ denotes the number of occurrences of $u$ in $y$.
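A direct Python sketch of this kernel; the function names are illustrative.

```python
from collections import Counter

def ngram_counts(y, n):
    """Counts of all contiguous substrings of length n in y."""
    return Counter(y[i:i + n] for i in range(len(y) - n + 1))

def ngram_kernel(y1, y2, n):
    """k_n(y1, y2) = sum over n-grams u of |y1|_u * |y2|_u."""
    c1, c2 = ngram_counts(y1, n), ngram_counts(y2, n)
    return sum(c1[u] * c2[u] for u in c1 if u in c2)

# ngram_kernel("abab", "ab", 2) == 2  ("ab" occurs twice in "abab" and once in "ab")
```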

Pre-Image Problem

Example: pre-image for n-gram features.

• find a sequence with matching n-gram counts.

• use a de Bruijn graph and an Euler circuit.

[Figure: de Bruijn graphs over vertices such as ab, bb, bc, with edge multiplicities given by n-gram counts (e.g., |abc| = 2, |abb| = 3), and the corresponding pre-image string x.]

Existence

Theorem: the vector $z$ of n-gram counts admits a pre-image iff for every vertex $q$ of the directed graph $G_z$,
$$\text{in-degree}(q) = \text{out-degree}(q).$$

Proof: direct consequence of Euler's theorem (1736).

Pre-Image Problem

Example: predicted bigram count vector $z = (0, 1, 0, 0, 0, 2, 1, 1, 0)^\top$.

• de Bruijn graph $G_z$ over the alphabet $\{a, b, c\}$.

• Euler circuit: $x = bcbca$.

Algorithm (Cortes, MM, Weston, 2005)

Euler(q)
 1  path ← ε
 2  for each unmarked edge e leaving q do
 3      Mark(e)
 4      path ← e · Euler(dest(e)) · path
 5  return path

• proof of correctness non-trivial.

• linear-time algorithm.
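A runnable Python version of this recursion under simple assumptions (the graph is given as a dict mapping each vertex to a list of successor vertices, and it satisfies the existence condition above); this is an adaptation for illustration, not the authors' code.

```python
def euler(q, adj):
    """Recursion above with edges represented by destination vertices.

    adj: dict mapping each vertex to a mutable list of outgoing neighbors;
         an edge is 'marked' by popping it from the list.
    Returns the sequence of vertices visited after q along an Eulerian path.
    """
    path = []
    while adj.get(q):             # unmarked edges leaving q
        dest = adj[q].pop()       # Mark(e)
        path = [dest] + euler(dest, adj) + path
    return path

# Example: bigram de Bruijn graph with Euler circuit starting at b
adj = {"b": ["c", "c"], "c": ["b", "a"], "a": []}
print("b" + "".join(euler("b", adj)))   # "bcbca", one valid pre-image
```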

Uniqueness

In general, the pre-image is not unique.

• example: $x = bcbcca$ / $bccbca$ (same bigram counts). [Figure: de Bruijn graph over $\{a, b, c\}$.]

The set of strings with a unique pre-image is regular (Kontorovich, 2004).

Generalized Euler Circuit

Extensions:

• round the components of the predicted count vector.

• cost of one extra or missing count for an n-gram: one local insertion or deletion.

• potentially more pre-image candidates: possibly use an n-gram model to select the most likely candidate.

• regression errors and potential absence of a pre-image: restart Euler at every vertex for which not all edges are marked.

Illustration

[Figure: two de Bruijn graphs over $\{a, b, c\}$, with pre-images $x = bccbca / bcbcca$ and $x = bcba$.]

RLT

Benefits:

• regression formulation of structured prediction problems.

• simple algorithm.

• can be generalized to regression with constraints (Cortes, MM, Weston, 2007).

Drawbacks:

• input-output features not natural (but constraints).

• pre-image problem for arbitrary PDS kernels?

Conclusion

Structured prediction theory:

• tightest margin guarantees for structured prediction.

• general loss functions, data-dependent bounds.

• key notion of factor graph complexity.

• additionally, tightest margin bounds for standard classification.

References

Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble Methods for Structured Prediction. In ICML, 2014.

Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Structured Prediction Theory Based on Factor Graph Complexity. In NIPS, 2016.

Corinna Cortes, Mehryar Mohri, and Jason Weston. A General Regression Framework for Learning String-to-String Mappings. In Predicting Structured Data. The MIT Press, 2007.

John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML, 2001.

David McAllester. Generalization Bounds and Consistency. In Predicting Structured Data. The MIT Press, 2007.

Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3), 2002.

Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-Margin Markov Networks. In NIPS, 2003.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 6, 2005.