Deep Neural Networks Are Our Friends
Wang Ling
Outline

● Part I - Neural Networks are our friends
○ Numbers are our friends
○ Operators are our friends
○ Functions are our friends
○ Parameters are our friends
○ Cost Functions are our friends
○ Optimizers are our friends
○ Gradients are our friends
○ Computation Graphs are our friends

● Part II - Into Deep Learning
○ Nonlinear Neural Models
○ Multilayer Perceptrons
○ Using Discrete Variables
○ Example Applications
Numbers are our friends
Numbers are our friends
Abby Cadabby
How many apples does Abby have?
Numbers are our friends
4
Abby Cadabby
Numbers are our friends
● Types of Numbers:
○ Integers : 5
○ Rationals : 1/2
○ Reals : 1.4e10 ...
Operators are our friends
4
Bert
Operators are our friends
41
Bert
If Abby has 4 apples, and gives Bert 1 apple, how many apples will
Abby have?
Operators are our friends
3 1
Bert
Operators are our friends
● Arithmetic Operators
○ Addition : 23 + 12 = 35
○ Subtraction : 31 - 15 = 16
○ Multiplication : 4 x 5 = 20
○ Division : 20 / 5 = 4
Functions are our friends
41
Functions are our friends
4
5?
1
If Bert always returns 3 bananas for each apple, how many bananas will
Abby receive for 2 apples?
Functions are our friends
y = 3x
● Input, x - Number of Apples given by Abby
Functions are our friends
y = 3x
● Input, x - Number of Apples given by Abby
● Output, y - Number of Bananas received by Abby
Functions are our friends
4
5?
1
y = 3x
Functions are our friends
4
5?
1
y = 3x , x =1
Functions are our friends
4
53
1
y = 3x , x = 1
y = 3
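The slides' exchange rule can be written as a one-line function; a minimal sketch, using the slides' y = 3x:

```python
# Bert's exchange rule from the slides: y = 3x
# (x apples in, y bananas out).
def f(x):
    return 3 * x

print(f(1))  # → 3, the case worked on the slide
print(f(2))  # → 6, answering "how many bananas for 2 apples?"
```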
Functions are our friends
y = 3x

Functions are our friends
y = 3x

Cookie Monster

Functions are our friends
y = 3x    y = ??

Functions are our friends
y = ??
0
1
Functions are our friendsy = ??
0
1
16
5
Functions are our friendsy = ??
0
1
16
5
20
6
Functions are our friendsy = ??
0
1
16
5
20
6
?
3
If Abby gives Cookie Monster 3 apples, how many bananas
does she get?
Parameters are our friends
y = 3x + 1
● Input
● Output
Parameters are our friends
y = wx + b
● Input
● Output
● Parameters

Input - Fixed, comes from data
Parameters - Need to be estimated
Parameters are our friendsy = wx + b
0
1
16
5
20
6
?
3
Data
Parameters are our friendsy = wx + b
0
1
16
5
20
6
?
3
Parameters are our friends
y = wx + b

x  y
1  0
5  16
6  20
3  ?
Parameters are our friends
y = wx + bx y
1 0
5 16
6 20
Data Model
Parameters are our friends
y = wx + bx y
1 0
5 16
6 20
Data Model
How to find the parameters w and b?
Parameters are our friends
y = wx + b

Data:
x  y
1  0
5  16
6  20

Model Candidate 1: y = 1x + 0
x  y   ŷ
1  0   1
5  16  5
6  20  6
Parameters are our friends
y = wx + bx y
1 0
5 16
6 20
Data ModelModel
Candidate 1x y ŷ
1 0 1
5 16 5
6 20 6
Model Candidate 2 x y ŷ
1 0 4
5 16 12
6 20 14
y = 1x + 0
y = 2x + 2
Parameters are our friends
y = wx + b

Data:
x  y
1  0
5  16
6  20

Model Candidate 1: y = 1x + 0
x  y   ŷ
1  0   1
5  16  5
6  20  6

Model Candidate 2: y = 2x + 2
x  y   ŷ
1  0   4
5  16  12
6  20  14

Which one is better?
Cost functions are our friends
ŷn = wxn + b

n  x  y
0  1  0
1  5  16
2  6  20
Data ModelModel
Candidate 1x y ŷ
1 0 1
5 16 5
6 20 6
Model Candidate 2 x y ŷ
1 0 4
5 16 12
6 20 14
y = 1x + 0
y = 2x + 2
Cost functions are our friends
yn = wxn + bn x y
0 1 0
1 5 16
2 6 20
Data ModelModel
Candidate 1x y ŷ
1 0 1
5 16 5
6 20 6
Model Candidate 2 x y ŷ
1 0 4
5 16 12
6 20 14
y = 1x + 0
y = 2x + 2
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²
Cost functions are our friends
yn = wxn + bn x y
0 1 0
1 5 16
2 6 20
Data ModelModel
Candidate 1
Model Candidate 2 x y ŷ
1 0 4
5 16 12
6 20 14
y = 1x + 0
y = 2x + 2
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

n  x  y   ŷ  (y-ŷ)²
0  1  0   1  1
1  5  16  5
2  6  20  6
Cost functions are our friends
yn = wxn + bn x y
0 1 0
1 5 16
2 6 20
Data ModelModel
Candidate 1
Model Candidate 2 x y ŷ
1 0 4
5 16 12
6 20 14
y = 1x + 0
y = 2x + 2
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

n  x  y   ŷ  (y-ŷ)²
0  1  0   1  1
1  5  16  5  121
2  6  20  6
Cost functions are our friends
yn = wxn + bn x y
0 1 0
1 5 16
2 6 20
Data ModelModel
Candidate 1
Model Candidate 2 x y ŷ
1 0 4
5 16 12
6 20 14
y = 1x + 0
y = 2x + 2
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

n  x  y   ŷ  (y-ŷ)²
0  1  0   1  1
1  5  16  5  121
2  6  20  6  196
Cost functions are our friends
yn = wxn + bn x y
0 1 0
1 5 16
2 6 20
Data ModelModel
Candidate 1
Model Candidate 2 x y ŷ
1 0 4
5 16 12
6 20 14
y = 1x + 0
y = 2x + 2
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

n  x  y   ŷ  (y-ŷ)²
0  1  0   1  1
1  5  16  5  121
2  6  20  6  196
C(1,0) = 318
Cost functions are our friends
yn = wxn + bn x y
0 1 0
1 5 16
2 6 20
Data ModelModel
Candidate 1
n x y ŷ (y-ŷ)
0 1 0 1 1
1 5 16 5 121
2 6 20 6 196
Model Candidate 2
y = 1x + 0
y = 2x + 2
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

C(1,0) = 318

Candidate 2 errors:
n  x  y   ŷ   (y-ŷ)²
0  1  0   4   16
1  5  16  12  16
2  6  20  14  36
C(2,2) = 68
Cost functions are our friends
yn = wxn + bn x y
0 1 0
1 5 16
2 6 20
Data ModelModel
Candidate 1
Model Candidate 2
y = 1x + 0
y = 2x + 2
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

C(1,0) = 318
C(2,2) = 68
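The cost comparison above is easy to reproduce; a minimal sketch of the slides' sum-of-squared-errors cost on the data {(1,0), (5,16), (6,20)}:

```python
# The slides' cost C(w,b) = sum over the data of (y - ŷ)²,
# where ŷ = w*x + b is the model's prediction.
data = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in data)

print(cost(1, 0))  # → 318  (Candidate 1)
print(cost(2, 2))  # → 68   (Candidate 2: the better model)
```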
Cost functions are our friends
ŷn = wxn + b

n  x  y
0  1  0
1  5  16
2  6  20
Data Model
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²
Cost functions are our friends
yn = wxn + bn x y
0 1 0
1 5 16
2 6 20
Data Model
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²
How to find the parameters w and b?
Optimizers are our friends
ŷn = wxn + b

n  x  y
0  1  0
1  5  16
2  6  20
Data Model
Cost
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

Optimizer
arg min w,b∈[-∞,∞] C(w,b)
Optimizers are our friends

Optimizer
arg min w,b∈[-∞,∞] C(w,b)
w
b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w
b
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w
b
2
2
68
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w1,b1 = 3,2 : C(w1,b1) = ?
w
b
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w1,b1 = 3,2 : C(w1,b1) = 26
n x y ŷ (y-ŷ)
0 1 0 5 25
1 5 16 17 1
2 6 20 20 0
C(3,2) 26
w
b
2
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68w1,b1 = 3,2 : C(w1,b1) = 26
n x y ŷ (y-ŷ)
0 1 0 5 25
1 5 16 17 1
2 6 20 20 0
C(3,2) 26
w
b
2
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26w2,b2 = 4,2 : C(w2,b2) = ??
w
b
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26w2,b2 = 4,2 : C(w2,b2) = 136
w
b
n x y ŷ (y-ŷ)
0 1 0 6 36
1 5 16 22 64
2 6 20 26 36
C(4,2) 136
2
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w
b
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26w2,b2 = 3,3 : C(w2,b2) = 41
w
b
n x y ŷ (y-ŷ)
0 1 0 6 36
1 5 16 18 4
2 6 20 21 1
C(3,3) 41
2
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w
b
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26w2,b2 = 3,1 : C(w2,b2) = 17
w
b
n x y ŷ (y-ŷ)
0 1 0 4 16
1 5 16 16 0
2 6 20 19 1
C(3,1) 17
2
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w2,b2 = 3,1 : C(w2,b2) = 17
w
b
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w2,b2 = 3,1 : C(w2,b2) = 17
w
b
w3,b3 = 3,0 : C(w3,b3) = 14

n  x  y   ŷ   (y-ŷ)²
0  1  0   3   9
1  5  16  15  1
2  6  20  18  4
C(3,0) = 14
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w3,b3 = 3,0 : C(w3,b3) = 14
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 3,-1 : C(w4,b4) = 17
n x y ŷ (y-ŷ)
0 1 0 2 4
1 5 16 14 4
2 6 20 17 9
C(3,-1) 17
2
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 2,0 : C(w4,b4) = 104
n x y ŷ (y-ŷ)
0 1 0 2 4
1 5 16 10 36
2 6 20 12 64
C(2,0) 104
2
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 4,0 : C(w4,b4) = 48

n  x  y   ŷ   (y-ŷ)²
0  1  0   4   16
1  5  16  20  16
2  6  20  24  16
C(4,0) = 48
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w3,b3 = 3,0 : C(w3,b3) = 14
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w?,b? = 4,-2 : C(w?,b?) = ??
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
n x y ŷ (y-ŷ)
0 1 0 2 4
1 5 16 18 4
2 6 20 22 4
C(4,-2) 12
2
w?,b? = 4,-2 : C(w?,b?) = 12
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w3,b3 = 3,0 : C(w3,b3) = 14
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w3,b3 = 3,0 : C(w3,b3) = 14
Search Problem
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 3.01,0 : C(w4,b4) = 13.73

n  x  y   ŷ      (y-ŷ)²
0  1  0   3.01   9.06
1  5  16  15.05  0.90
2  6  20  18.06  3.76
C(3.01,0) = 13.73
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w*,b* = 4,-2 : C(w*,b*) = 12
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w*,b* = 4,-2 : C(w*,b*) = 12
y = wx + b
Optimizers are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
w*,b* = 4,-4 : C(w*,b*) = 0
y = wx + b
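The arg min above can be sketched as a brute-force search; the integer grid [-10, 10] is an assumption for illustration (the slides search by trial and error over all reals):

```python
# Brute-force sketch of arg min C(w,b) over a small integer grid.
data = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in data)

best = min(((w, b) for w in range(-10, 11) for b in range(-10, 11)),
           key=lambda p: cost(*p))
print(best, cost(*best))  # → (4, -4) 0
```

The line y = 4x - 4 passes exactly through all three data points, so the minimum cost is 0.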
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
Should be used sparingly
y = wx + b
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
hwhw = 1
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
hwhw = 1C(w0+hw,b0) = C(3,2) = 26
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
hw = 1
C(w0+hw,b0) = C(3,2) = 26
rw = (C(w0+hw,b0) - C(w0,b0)) / hw
rw = (C(3,2) - C(2,2)) / 1 = -42
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
hw = 1, r = -42
hw = 0.1, r = -98
hw = 0.01, r ≈ -103.4
hw = 0.001, r ≈ -103.9
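The shrinking-step-size slope estimates above can be reproduced with a finite-difference sketch:

```python
# Finite-difference slope of the cost along w, as on the slides:
# r = (C(w+h, b) - C(w, b)) / h for shrinking step sizes h.
data = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in data)

def finite_diff_w(w, b, h):
    return (cost(w + h, b) - cost(w, b)) / h

print(finite_diff_w(2, 2, 1))     # → -42.0
print(finite_diff_w(2, 2, 1e-6))  # ≈ -104, the true partial derivative ∂C/∂w at (2,2)
```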
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
hw = 1, r = -42
hw = 0.1, r = -98
hw = 0.01, r ≈ -103.4
hw = 0.001, r ≈ -103.9
hw → 0, r = ∂C/∂w (w0,b0)
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
∂C/∂w = ∂[∑n (yn - ŷn)²]/∂w
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
∂C/∂w = ∂[∑n (yn - ŷn)²]/∂w = ∑n -2(yn - ŷn)xn
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
∂C/∂w = ∂[∑n (yn - ŷn)²]/∂w = ∑n -2(yn - ŷn)xn

hw → 0, rw = ∂C/∂w (w0,b0) = -104

n  x  y   ŷ   (y-ŷ)  -2(y-ŷ)x
0  1  0   4   -4     8
1  5  16  12  4      -40
2  6  20  14  6      -72
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w
b
y = wx + b
w0,b0 = 2,2 : C(w0,b0) = 68
2
2
68
∂C/∂w = ∂[∑n (yn - ŷn)²]/∂w = ∑n -2(yn - ŷn)xn
∂C/∂b = ∂[∑n (yn - ŷn)²]/∂b = ∑n -2(yn - ŷn)
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw → 0, rw = ∂C/∂w (w0,b0) = -104
hb → 0, rb = ∂C/∂b (w0,b0) = -12

n  x  y   ŷ   (y-ŷ)  -2(y-ŷ)
0  1  0   4   -4     8
1  5  16  12  4      -8
2  6  20  14  6      -12
Gradients are our friendsOptimizer
arg min C(w,b)w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
∂w(w0,b0)hw → 0, rw = = -104
∂C
∂w(w0,b0)hb → 0, rb = = -12
∂C
w
b
y = wx + b
w1 = w0 - η·rw
b1 = b0 - η·rb        η → Learning Rate
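The update rule above, iterated, is gradient descent; a minimal sketch starting from the slides' (w,b) = (2,2). The learning rate 0.01 and iteration count are assumptions chosen to converge on this data:

```python
# Gradient descent on C(w,b) = sum (y - (w*x + b))², starting at (2, 2).
data = [(1, 0), (5, 16), (6, 20)]

w, b = 2.0, 2.0
eta = 0.01  # learning rate (assumed)
for _ in range(5000):
    dw = sum(-2 * (y - (w * x + b)) * x for x, y in data)
    db = sum(-2 * (y - (w * x + b)) for x, y in data)
    w -= eta * dw
    b -= eta * db

print(round(w, 2), round(b, 2))  # → 4.0 -4.0
```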
Gradients are our friends
y = 4x - 4
Data
0
1
16
5
20
6
?
3
Gradients are our friends
y = 4x - 4
Data
0
1
16
5
20
6
8
3
Computation Graphs are our friends
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

y = wx + b

∂C/∂w = ∑n 2(yn - ŷn)xn
∂C/∂b = ∑n 2(yn - ŷn)

Easy!
Computation Graphs are our friends
Harder!
y = wx + b + tanh(yx + b)²
Computation Graphs are our friends
Computation Graphs can
compute gradients for you!
y = wx + b + tanh(yx + b)²
Computation Graphs are our friends
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)²

y = wx + b

∂C/∂w = ∑n 2(yn - ŷn)xn
∂C/∂b = ∑n 2(yn - ŷn)
Computation Graphs are our friends
C(w,b) = ∑(yn-ŷn)n∈{0,1,2}
2
∂C
∂w=
∂(ŷn-yn)
∂ynn
= ∑-2(ŷn-yn)xn n
2
= ∑-2(ŷn-yn) n
y = wx + b
∂yn
∂w
2
∑
∂C
∂b=
∂(ŷn-yn)
∂ynn
∂yn
∂b∑
Computation Graphs are our friends
C(w,b) = ∑(yn-ŷn)n∈{0,1,2}
2
∂C
∂w=
∂(ŷn-yn)
∂ynn
2
y = wx + b
∂yn
∂w
2
∑
∂C
∂b=
∂(ŷn-yn)
∂ynn ∂b∑ ∂yn
Computation Graphs are our friends
C(w,b) = ∑(yn-ŷn)n∈{0,1,2}
2
∂C
∂w=
∂(ŷn-yn)
∂ynn
2
y = o + bo = wx
∂yn
∂w
2
∑
∂C
∂b=
∂(ŷn-yn)
∂ynn ∂b∑ ∂yn
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂ynn
2
c = dd = y - ŷy = o + bo = wx
∂yn
∂w
2
∑
∂C
∂b=
∂(ŷn-yn)
∂ynn ∂b∑ ∂yn
2
∂(ŷn-yn)
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
2
c = dd = y - ŷy = o + bo = wx
∂on
∂w∑
∂C
∂b=
∂(ŷn-yn)
∂ynn ∂b∑ ∂yn
2
∂dn
∂yn
∂yn
∂on
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
c = dd = y - ŷy = o + bo = wx
∂on
∂w∑
∂C
∂b
2
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Computation Graphs are our friends
C(w,b) = ∑n∈{0,1,2} cn
c = d²,  d = y - ŷ,  y = o + b,  o = wx

∂C/∂w = ∑n (∂cn/∂dn)(∂dn/∂yn)(∂yn/∂on)(∂on/∂w)
∂C/∂b = ∑n (∂cn/∂dn)(∂dn/∂yn)(∂yn/∂b)

Power 2
Sub
Add
Product
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
c = dd = y - ŷy = o + bo = wx
∂on
∂w∑
∂C
∂b
2
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
Add
Product
forward(x,y) → z
backward(x,y,dz) → dx,dy
Sub
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
c = dd = y - ŷy = o + bo = wx
∂on
∂w∑
∂C
∂b
2
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
Add
Product
forward(x,y) : return x - y
backward(x,y,dz) : return dz, -dz
Sub
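The forward/backward interface above can be sketched as a small class; a minimal version of the Sub node z = x - y, where backward receives dz (the gradient of the cost with respect to z) and returns the gradients passed on to its two inputs:

```python
# The slides' Sub operation: z = x - y.
# ∂z/∂x = 1 and ∂z/∂y = -1, so backward returns (dz, -dz).
class Sub:
    def forward(self, x, y):
        return x - y

    def backward(self, x, y, dz):
        return dz, -dz

op = Sub()
print(op.forward(12, 16))         # → -4
print(op.backward(12, 16, -8.0))  # → (-8.0, 8.0)
```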
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
c = dd = y - ŷy = o + bo = wx
∂on
∂w∑
∂C
∂b
2
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
Add
Product
forward(x,y) : return x - y
backward(x,y,dz) : return dz, -dz
Sub
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
c = dd = y - ŷy = o + bo = wx
∂on
∂w∑
∂C
∂b
2
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
Add
Product
forward(x,y) : return x - y
backward(x,y,dz) : return dz, -dz    (∂dn/∂yn = 1, ∂dn/∂ŷn = -1)

Sub
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
c = dd = y - ŷy = o + bo = wx
∂on
∂w∑
∂C
∂b
2
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
Add
Product
o
w x
Product
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
c = dd = y - ŷ
∂on
∂w∑
∂C
∂b
2
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
o
w x
Product
b
Add
y
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0,1,2}
∂C
∂w=
∂cn
∂dnn
∂on
∂w∑
∂C
∂b
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
o
w x
Product
b
Add
y ŷ
d c
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0}
∂C
∂w=
∂cn
∂dnn
∂on
∂w∑
∂C
∂b
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0}
∂C
∂w=
∂cn
∂dnn
∂on
∂w∑
∂C
∂b
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
Input
Computation Graphs are our friends
C(w,b) = ∑cnn∈{0}
∂C
∂w=
∂cn
∂dnn
∂on
∂w∑
∂C
∂b
∂dn
∂yn
∂yn
∂on
= ∂cn
∂dnn
∑ ∂dn
∂yn
∂yn
∂b
Power 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
Input
Parameters
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:
1-Initialize inputs
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:
1-Initialize inputs
2-Initialize variables
Variables
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables
Variables
2 values: x and dx
0,0
0,0
0,00,0 0,0
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
0,0
0,00,0 0,0
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
0,0
0,00,0 0,0
1st
2nd
3rd4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables
10,0
0,0
0,00,0 0,0
1st
2nd
3rd4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables
10,0
12,0
0,00,0 0,0
1st
2nd
3rd4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
0,0
0,00,0 0,0
1st
2nd
3rd4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
2,0
0,00,0 0,0
1st
2nd
3rd4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables
10,0
2,0
0,00,0 0,0
1st
2nd
3rd4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
0,0
0,00,0 0,0
Computation Graphs are our friendsPower 2
Sub
o
Add
y
d c Id CForward:
1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
0,0
0,00,0 0,0
Computation Graphs are our friends
o
y
d c CForward:
1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
0,0
0,00,0 0,0
1st
2nd
3rd4th 5th
Computation Graphs are our friendsPower 2
Sub
o
Add
y
d c Add CForward:
1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
0,0
0,00,0 0,0
g0,0
Add
s 0,0
Computation Graphs are our friends
o
y
d c CForward:
1-Initialize inputs2-Initialize variables3-Topological Sort variables
0,0
0,0
0,00,0 0,0
g0,0
s 0,0
1st
2nd
3th
4th
5th 6th 7th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:
1-Initialize inputs
2-Initialize variables
3-Topological Sort variables
4-For each variable in topological order, run the forward method of all operations that link to them
0,0
0,0
0,00,0 0,0
1st
2nd
3rd
4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them
10,0
0,0
0,00,0 0,0
1st
2nd
3rd
4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them
10,0
12,0
0,00,0 0,0
1st
2nd
3rd
4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them
10,0
12,0
-4,00,0 0,0
1st
2nd
3rd
4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them
10,0
12,0
-4,016,0 0,0
1st
2nd
3rd
4th 5th
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them
10,0
12,0
-4,016,0
1st
2nd
3rd
4th 5th16,0
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them
5-Set gradients to final variables
10,0
12,0
-4,016,0
1st
2nd
3rd
4th 5th16,1
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,0
12,0
-4,016,0
1st
2nd
3rd
4th 5th16,1
∂C
∂c C=c =1
dc = dC ∂C
∂c
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,0
12,0
-4,016,1
1st
2nd
3rd
4th 5th16,1
∂C
∂c C=c =1
dc = dC ∂C
∂c
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,0
12,0
-4,016,1
1st
2nd
3rd
4th 5th16,1
c = d2
dd = dc ∂c
∂d
∂c
∂d= 2d
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,0
12,0
-4,016,1
1st
2nd
3rd
4th 5th16,1
c = d2
dd = dc ∂c
∂d
∂c
∂d= 2 x -4
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,0
12,0
-4,016,1
1st
2nd
3rd
4th 5th16,1
c = d2
dd = dc ∂c
∂d
∂c
∂d= -8
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,0
12,0
-4,-816,1
1st
2nd
3rd
4th 5th16,1
c = d2
dd = dc ∂c
∂d
∂c
∂d= -8
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,0
12,0
-4,-816,1
1st
2nd
3rd
4th 5th16,1
d = y - ŷ ∂d
∂y= 1
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,0
12,-8
-4,-816,1
1st
2nd
3rd
4th 5th16,1
d = y - ŷ ∂d
∂y= 1
dy = dd ∂d
∂y
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,-8
12,-8
-4,-816,1
1st
2nd
3rd
4th 5th16,1
y = o + b
∂y
∂o= 1
do = dy ∂y
∂o
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,-8
12,-8
-4,-816,1
1st
2nd
3rd
4th 5th16,1
y = o + b
∂y
∂o= 1
∂y
∂b= 1
bt+1 = b - dy ∂y
∂b
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,-8
12,-8
-4,-816,1
1st
2nd
3rd
4th 5th16,1
y = o + b
∂y
∂o= 1
∂y
∂b= 1
bt+1 = b - dy ∂y
∂b
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,-8
12,-8
-4,-816,1
1st
2nd
3rd
4th 5th16,1
y = o + b
∂y
∂o= 1
∂y
∂b= 1
bt+1 = b - ∂c
∂d
∂d∂y
∂y∂b
∂C
∂c
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,-8
12,-8
-4,-816,1
1st
2nd
3rd
4th 5th16,1
y = o + b
∂y
∂o= 1
∂y
∂b= 1
bt+1 = b - ∂C
∂b
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52
2
Forward:1-Initialize inputs2-Initialize variables3-Topological Sort variables4-For each variable in topological
order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables6-run the operations backward method
in reverse order (Backward)10,-8
12,-8
-4,-816,1
1st
2nd
3rd
4th 5th16,1
o = wx
∂o
∂w= x
wt+1 = w - do ∂o
∂w
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52.8
2.2
Forward:
1-Initialize inputs
2-Initialize variables
3-Topological Sort variables
4-For each variable in topological order, run the forward method of all operations that link to them (Forward)
5-Set gradients to final variables
6-Run the operations' backward method in reverse order (Backward)
7-Update parameters

10,-8
12,-8
-4,-8  16,1
1st
2nd
3rd
4th 5th16,1
o = wx
∂o
∂w= x
wt+1 = w - do ∂o
∂w
Computation Graphs are our friendsPower 2
Sub
o
w x
Product
b
Add
y ŷ
d c Id C
16
52.8
2.210,-8
12,-8
-4,-816,1 16,1
o = wx
∂o
∂w= x
wt+1 = w - do ∂o
∂w
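The whole walkthrough above, for the single data point (x=5, ŷ=16) at (w,b) = (2,2), can be sketched end to end; the variable names mirror the graph nodes (o, y, d, c), and each backward line is one chain-rule step:

```python
# One forward + backward pass through the slides' graph:
# o = wx, y = o + b, d = y - ŷ, c = d².
x, target, w, b = 5, 16, 2.0, 2.0

# Forward pass
o = w * x         # 10
y = o + b         # 12
d = y - target    # -4
c = d ** 2        # 16 (the cost)

# Backward pass: dv holds ∂c/∂v for each variable v.
dc = 1.0          # gradient of the final variable is set to 1
dd = dc * 2 * d   # Power 2: ∂c/∂d = 2d  → -8
dy = dd * 1       # Sub:     ∂d/∂y = 1   → -8
db = dy * 1       # Add:     ∂y/∂b = 1   → -8
do = dy * 1       # Add:     ∂y/∂o = 1   → -8
dw = do * x       # Product: ∂o/∂w = x   → -40

print(c, dw, db)  # → 16.0 -40.0 -8.0
```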
Existing Tools:
- Tensorflow ( https://www.tensorflow.org )
- Torch ( https://github.com/torch/nn )
- CNN ( https://github.com/clab/cnn )
- JNN ( https://github.com/wlin12/JNN )
- Theano ( http://deeplearning.net/software/theano/ )
Into Deep Learning
Nonlinear Neural Models
y = 4x - 4
Data
0
1
16
5
20
6
?
3
Nonlinear Neural Models
Data
0
1
16
5
20
6
?
3
There is a limit to the number of bananas I can give you
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
Data
[plot: x vs y, with the line y = 4x-4]
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data
x
y y = 4x-4
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data
[plot: x vs y, with the line y = 2x+3]
Model Problem
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data
x
y y = 2x+3
Model Problem
Underfitting
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data
[plot: x vs y]
y = ???
Can we learn arbitrary functions?
Nonlinear Neural Models
y = (w1x + b1)s1 + (w2x+b2)s2
Use different linear functions depending on the value of x?
Nonlinear Neural Models
y = (w1x + b1)s1 + (w2x + b2)s2
s1 - 1 if x < 6 and 0 otherwise
s2 - 1 if x >= 6 and 0 otherwise
Nonlinear Neural Models
y = (w1x + b1)s1 + (w2x+b2)s2
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data
y = (4x - 4)s1 + (0x + 20)s2
s1 - 1 if x < 6 and 0 otherwise
s2 - 1 if x >= 6 and 0 otherwise
Nonlinear Neural Models
s = σ(wx + b)
σ(t) = 1 / (1 + e^-t)
Nonlinear Neural Models
s = σ(1000x)
x = 0.1 then σ(1000x) ≈ 1
x = -0.1 then σ(1000x) ≈ 0
Nonlinear Neural Models
s = σ(1000x)
x = 0.1 then σ(1000x) ≈ 1
x = -0.1 then σ(1000x) ≈ 0
Nonlinear Neural Models
s = σ(1000x - 6000)
x = 6.1 then σ(1000x - 6000) ≈ 1
x = 5.9 then σ(1000x - 6000) ≈ 0
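The step-like behavior of this steep sigmoid is easy to check numerically; a minimal sketch (the exp is guarded so large negative arguments don't overflow):

```python
import math

# Numerically stable logistic sigmoid σ(t) = 1 / (1 + e^-t).
def sigmoid(t):
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    return math.exp(t) / (1 + math.exp(t))

# σ(1000x - 6000) behaves like a step at x = 6:
print(round(sigmoid(1000 * 6.1 - 6000), 6))  # → 1.0
print(round(sigmoid(1000 * 5.9 - 6000), 6))  # → 0.0
```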
Nonlinear Neural Models
y = (w1x + b1)s1 + (w2x + b2)s2
s1 = σ(w3x + b3)
s2 = σ(w4x + b4)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data    y = (4x - 4)s1 + (0x + 20)s2
s1 = σ(-1000x + 6000)
s2 = σ(1000x - 6000)
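The gated model above can be sketched directly: below x = 6 the s1 gate is on and the model follows 4x - 4; above it the s2 gate switches the output to the constant 20.

```python
import math

# y = (4x - 4)·σ(-1000x + 6000) + 20·σ(1000x - 6000)
def sigmoid(t):
    # numerically stable logistic sigmoid
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    return math.exp(t) / (1 + math.exp(t))

def y(x):
    s1 = sigmoid(-1000 * x + 6000)  # ≈ 1 when x < 6
    s2 = sigmoid(1000 * x - 6000)   # ≈ 1 when x > 6
    return (4 * x - 4) * s1 + 20 * s2

print(y(5), y(9), y(11))  # → 16.0 20.0 20.0, matching the data table
```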
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data    y = (4x - 4)s1 + (0x + 20)s2
s1 = σ(-1000x + 6000)
s2 = σ(1000x - 6000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (16)s1 + (0x+20)s2
s1 = (-1000x + 6000)s2 = (1000x - 6000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (16)s1 + (20)s2
s1 = (-1000x + 6000)s2 = (1000x - 6000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (16)s1 + (20)s2
s1 = (1000)s2 = (1000x - 6000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (16)s1 + (20)s2
s1 = (1000)s2 = (-1000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (16)1 + (20)0
s1 = (1000)s2 = (-1000)
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = 16
s1 = (1000)s2 = (-1000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (4x - 4)s1 + (0x+20)s2
s1 = (-1000x + 6000)s2 = (1000x - 6000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (32)s1 + (0x+20)s2
s1 = (-1000x + 6000)s2 = (1000x - 6000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (32)s1 + (20)s2
s1 = (-1000x + 6000)s2 = (1000x - 6000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (32)s1 + (20)s2
s1 = (-3000)s2 = (1000x - 6000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (32)s1 + (20)s2
s1 = (-3000)s2 = (3000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = (32)0 + (20)1
s1 = (-3000)s2 = (3000)
Nonlinear Neural Models
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data y = 20
s1 = (-3000)s2 = (3000)
Nonlinear Neural Models
Data
0
1
16
5
20
6
?
3
If you give me too many apples, I will give them to...
Nonlinear Neural Models
Data
0
1
16
5
20
6
?
3
Count Von Count
Nonlinear Neural Models
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
Data
y = (4x - 4)s1 + (0x + 20)s2
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (4x - 4)s1 + (0x + 20)s2
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (4x - 4)s1 + (0x + 20)s2 + (0x + 1)s3
s1 = σ(-1000x + 6000)
s2 = ????
s3 = σ(1000x - 15000)
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (4x - 4)s1 + (0x + 20)s2 + (0x + 1)s3
s1 = σ(-1000x + 6000)
s2 = not s1 and not s3
s3 = σ(1000x - 15000)
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x + b2)s2 + (w3x + b3)s3
s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x + b2)s2 + (w3x + b3)s3
s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)
Layer 1 Perceptron
Layer 1 Perceptron
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x + b2)s2 + (w3x + b3)s3
s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)
Layer 2 Perceptron
Layer 1 Perceptron
Layer 1 Perceptron
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (4x - 4)s1 + (0x + 20)s2 + (0x + 1)s3
s1 = σ(-1000x + 6000)
s2 = not s1 and not s3
s3 = σ(1000x - 15000)
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
s1 = (-1000x + 6000)
s2 = (-1000s1 - 1000s3 + 500)
s3 = (1000x - 15000)
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (40)s1 + (20)s2 + (1)s3
s1 = (-1000x + 6000)
s2 = (-1000s1 - 1000s3 + 500)
s3 = (1000x - 15000)
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (40)s1 + (20)s2 + (1)s3
s1 = (-5000) = 0
s2 = (-1000s1 - 1000s3 + 500)
s3 = (-4000) = 0
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (40)s1 + (20)s2 + (1)s3
s1 = (-5000) = 0
s2 = (-1000s1 - 1000s3 + 500)
s3 = (-4000) = 0
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (40)s1 + (20)s2 + (1)s3
s1 = (-5000) = 0
s2 = (500)
s3 = (-4000) = 0
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (40)s1 + (20)s2 + (1)s3
s1 = (-5000) = 0
s2 = (500) = 1
s3 = (-4000) = 0
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (40)0 + (20)1 + (1)0
s1 = (-5000) = 0
s2 = (500) = 1
s3 = (-4000) = 0
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = 20
s1 = (-5000) = 0
s2 = (500) = 1
s3 = (-4000) = 0
Multilayer Perceptrons
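The arithmetic on the slides above can be checked directly. Below is a minimal Python sketch (the helper names step and forward are ours; the weights are the slides') of the three-unit network with hard-step activations:

```python
# Minimal sketch of the slides' 3-unit network with hard-step activations.
# The helper names (step, forward) are ours; the weights come from the slides.

def step(z):
    # Hard threshold: positive pre-activation -> 1, otherwise -> 0.
    return 1 if z > 0 else 0

def forward(x):
    s1 = step(-1000 * x + 6000)              # fires when x < 6
    s3 = step(1000 * x - 15000)              # fires when x > 15
    s2 = step(-1000 * s1 - 1000 * s3 + 500)  # fires when neither s1 nor s3 does
    return (4 * x - 4) * s1 + 20 * s2 + 1 * s3

print(forward(11))  # middle region: y = 20
print(forward(19))  # right region:  y = 1
print(forward(1))   # left region:   y = 4*1 - 4 = 0
```

Running it reproduces the table rows: x = 1 gives 0, x = 5 gives 16, x = 11 gives 20, x = 19 gives 1.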
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
s1 = (-1000x + 6000)
s2 = (-1000s1 - 1000s3 + 500)
s3 = (1000x - 15000)
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (72)s1 + (20)s2 + (1)s3
s1 = (-1000x + 6000)
s2 = (-1000s1 - 1000s3 + 500)
s3 = (1000x - 15000)
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (72)s1 + (20)s2 + (1)s3
s1 = (-13000) = 0
s2 = (-1000s1 - 1000s3 + 500)
s3 = (4000) = 1
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (72)s1 + (20)s2 + (1)s3
s1 = (-13000) = 0
s2 = (0 - 1000 + 500)
s3 = (4000) = 1
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (72)s1 + (20)s2 + (1)s3
s1 = (-13000) = 0
s2 = (-500) = 0
s3 = (4000) = 1
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = (72)0 + (20)0 + (1)1
s1 = (-13000) = 0
s2 = (-500) = 0
s3 = (4000) = 1
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
y = 1
s1 = (-13000) = 0
s2 = (-500) = 0
s3 = (4000) = 1
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
x
y
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s1 = (w4x + b4)
s2 = (w5s1 + w6s3 + b5)
s3 = (w7x + b6)
Layer 2 Perceptron
Layer 1 Perceptron
Layer 1 Perceptron
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s1 = (w4x + b4)
s2 = (w5s1 + w6s3 + b5)
s3 = (w7x + b6)
x
s1
s3
s2
w4x
b4
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s1 = (w4x + b4)
s2 = (w5s1 + w6s3 + b5)
s3 = (w7x + b6)
x
s2
w4x
b4
w7x
b6
s1
s3
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s1 = (w4x + b4)
s2 = (w5s1 + w6s3 + b5)
s3 = (w7x + b6)
x
s2
s1
s3
w5s1   w6s3
b5
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
x
s2
s1
s3
x < 6   x > 15
!(x > 15) & !(x < 6)
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
x
s2
s1
s3
x < 6   x > 15
x∈[6,15]
Multilayer Perceptrons
x
s2
s1
s3
x < 6   x > 15
x∈[6,15]
s4
x∈]-∞,6] & ]15,∞]
Multilayer Perceptrons
x
s5
s1
s2
x < 6   x > 15
x∈[6,15]
s3 x > 2
s4 x < 3
s7
s6
s7
x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3]
Multilayer Perceptrons
x
s5
s1
s2
x < 6   x > 15
x∈[6,15]
s3 x > 2
s4 x < 3
s7
s6
s7
x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3]
Input
Layer 1 (Input Features)
Layer 2 (And and Or Combinations)
Multilayer Perceptrons
x
s5
s1
s2
x < 6   x > 15
x∈[6,15]
s3 x > 2
s4 x < 3
s7
s6
s7
x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3]
Input
Layer 1 (Input Features)
Layer 2 (And and Or Combinations)
And(s1,s2) = (1000s1 + 1000s2 - 1500)
Or(s1,s2) = (1000s1 + 1000s2 - 500)
Multilayer Perceptrons
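With binary inputs, the And/Or units above can be verified exhaustively. A small sketch (the function names are ours; the weights and thresholds are the slide's):

```python
# The slide's And/Or perceptrons: a hard-step unit over a weighted sum.
# Threshold 1500 requires both inputs on; 500 requires at least one.

def step(z):
    return 1 if z > 0 else 0

def and_unit(a, b):
    return step(1000 * a + 1000 * b - 1500)

def or_unit(a, b):
    return step(1000 * a + 1000 * b - 500)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_unit(a, b), or_unit(a, b))
```

The loop prints the full truth tables for both gates.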
x
s5
s1
s2
s3
s4
s7
s6
s7
Input
Layer 1 (Input Features)
Layer 2 (And and Or Combinations)
Layer 3 (Xor Combinations)
s8
s9
sa
sb
Multilayer Perceptrons
x
s5
s1
s2
s3
s4
s7
s6
s7
Input
Layer 1 (Input Features)
Layer 2 (And and Or Combinations)
Layer 3 (Xor Combinations)
s8
s9
sa
sb
Xor(s1,s2) = Or(And(s1,!s2), And(!s1,s2))
Multilayer Perceptrons
x
s5
s1
s2
s3
s4
s7
s6
s7
Input
Layer 1 (Input Features)
Layer 2 (And and Or Combinations)
Layer 3 (Xor Combinations)s8
s9
sa
sb
Xor(s1,s2) = Or(s5, s6)
Multilayer Perceptrons
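Xor is not linearly separable, which is why it needs the extra layer: And units over the (negated) inputs, then an Or on top. A sketch following Xor(s1,s2) = Or(And(s1,!s2), And(!s1,s2)) (helper names are ours):

```python
# Xor built from the slide's And/Or step units; negating a 0/1 value is 1 - value.

def step(z):
    return 1 if z > 0 else 0

def and_unit(a, b):
    return step(1000 * a + 1000 * b - 1500)

def or_unit(a, b):
    return step(1000 * a + 1000 * b - 500)

def xor_unit(a, b):
    # Xor(s1,s2) = Or(And(s1, !s2), And(!s1, s2))
    return or_unit(and_unit(a, 1 - b), and_unit(1 - a, b))

print([xor_unit(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))])  # [0, 1, 1, 0]
```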
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
x
y
Universal approximator
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
x
y
but...
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 9 20
4 11 20
5 15 1
6 19 1
Data
x
y
No guarantee that the best function will
be found
Multilayer Perceptrons
x
s5
s1
s2
x > 1   x < 2
x∈]-∞,1]
s3 x < 5
s4 x < 6
s7
s6
x∈[5,6[ x∈[6,∞]
n x y
0 1 0
1 5 16
2 6 20
y
Multilayer Perceptrons
x
s5
s1
s2
x > 1   x < 2
x∈]-∞,1]
s3 x < 5
s4 x < 6
s7
s6
x∈[5,6[ x∈[6,∞]
n x y
0 1 0
1 5 16
2 6 20
y = 0s5 + 16s6 + 20s7
y
Multilayer Perceptrons
x
s5
s1
s2
x > 1   x < 2
x∈]-∞,1]
s3 x < 5
s4 x < 6
s7
s6
x∈[5,6[ x∈[6,∞]
n x y
0 1 0
1 5 16
2 6 20
y
y = 0s5 + 16s6 + 20s7
Multilayer Perceptrons
x
s5
s1
s2
x > 1   x < 2
x∈]-∞,1]
s3 x < 5
s4 x < 6
s7
s6
x∈[5,6[ x∈[6,∞]
n x y
0 1 0
1 5 16
2 6 20
Overfitting
y = 0s5 + 16s6 + 20s7
Multilayer Perceptrons
y
Model Problem
Task Complexity
Model Complexity
Multilayer Perceptrons
Task Complexity
Model Complexity
Underfitting
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Linear Regression
MLP 1 Layer
MLP 2 Layer
MLP 3 Layer
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Linear Regression
Linear Regression + more features
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Linear Regression
MLP 1 Layer
MLP 2 Layer
MLP 3 Layer
Sentiment analysis
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Linear Regression
MLP 1 Layer
MLP 2 Layer
MLP 3 Layer
Sentiment analysis
Machine Translation
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Data
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Data
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Data
Multilayer Perceptrons
y
n x y
0 1 0
1 5 16
2 6 20
y y
Multilayer Perceptrons
y
n x y
0 1 0
1 5 16
2 6 20
3 2 4
y y
Multilayer Perceptrons
n x y
0 1 0
1 5 16
2 6 20
3 2 4
y y
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Model Bias
Multilayer Perceptrons
Task Complexity
Model Complexity
Overfitting
Underfitting
Happy Zone
Model Bias
L1 & L2 Regularization
Stochastic Dropout (Srivastava et al, 2014)
Model Structure (CNN, RNNs)
Multilayer Perceptrons
Regularization
C(w,b) = ∑n∈{0,1,2} (yn - ŷn)² + β(w + b)
β = Regularization constant
Multilayer Perceptrons
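The deck lists L1 & L2 regularization; a concrete (assumed) L2 variant of the cost above, for the linear model y = wx + b on the table's first three rows, can be sketched as follows (the function name cost is ours):

```python
# L2-regularized squared-error cost for a linear model y = w*x + b.
# beta is the deck's "regularization constant": it trades data fit
# against parameter magnitude (here the assumed L2 penalty w**2 + b**2).

def cost(w, b, data, beta):
    squared_error = sum((y - (w * x + b)) ** 2 for x, y in data)
    penalty = beta * (w ** 2 + b ** 2)
    return squared_error + penalty

data = [(1, 0), (5, 16), (6, 20)]  # the (x, y) rows n = 0, 1, 2
print(cost(4, -4, data, beta=0.0))  # perfect fit, no penalty: 0.0
print(cost(4, -4, data, beta=0.1))  # same fit, now pays 0.1 * (16 + 16)
```

With beta > 0, two models that fit the data equally well are ranked by how small their parameters are.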
x
s5
s1
s2
x > 1   x < 2
x∈]-∞,1]
s3 x < 5
s4 x < 6
s7
s6
x∈[5,6[ x∈[6,∞]
y
Regularization
Multilayer Perceptrons
x
s5
s1
s2
x > 1   nothing
x∈]-∞,1]
s3 nothing
s4 x < 6
s7
s6
nothing x∈[6,∞]
y
Regularization
Multilayer Perceptrons
x
s5
s1
s2
x > 1   nothing
x∈]-∞,1]
s3 nothing
s4 x < 6
s7
s6
nothing x∈[6,∞]
y
Regularization
Find solutions that require less effort
Multilayer Perceptrons
x
s5
s1
s2
x > 1   x < 2
x∈]-∞,1]
s3 x < 5
s4 x < 6
s7
s6
x∈[5,6[ x∈[6,∞]
y
Stochastic Dropout (Srivastava et al, 2014)
Multilayer Perceptrons
Stochastic Dropout (Srivastava et al, 2014)
x
s5
s1
s2
x > 1   0
x∈]-∞,1]
s3 x < 5
s4 x < 6
s7
s6
0 0
y
Multilayer Perceptrons
Stochastic Dropout (Srivastava et al, 2014)
x
s5
s1
s2
x > 1   0
x∈]-∞,1]
s3 x < 5
s4 x < 6
s7
s6
0 0
y
Find robust models
Multilayer Perceptrons
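The zeroed-out units in the diagram are exactly what dropout does at training time: each unit is kept with some probability and zeroed otherwise. A plain-Python sketch (function name and keep probability are ours):

```python
import random

def dropout(activations, keep_prob, rng):
    # Zero each unit independently with probability 1 - keep_prob, and scale
    # survivors by 1/keep_prob ("inverted dropout") so the expected value
    # of each activation is unchanged.
    return [a * (1.0 / keep_prob) if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], keep_prob=0.5, rng=rng))
```

At test time no units are dropped; the 1/keep_prob scaling during training is what makes that consistent.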
Model Structure
Weighted sum of linear functions VS MLP
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
Multilayer Perceptrons
Model Structure
Weighted sum of linear functions VS MLP
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
Convolutional Vs RNNs
Multilayer Perceptrons
s1 = (w4x + b4)
s2 = (w5s1 + w6s3 + b5)
s3 = (w7x + b6)
x
s2
s1
s3
w5s1   w6s3
b5
Representation
Multilayer Perceptrons
s1 = (W3x + b3)
s2 = (W4s1 + b4)
Representation
s1
s2
2
1
1xx
s2
s1
s3
Multilayer Perceptrons
Representation
s1
s2
1000
1000
100
x
s1 = (Ws2 + b)
Multilayer Perceptrons
Representation
s1
s2
1000
1000
100
x
s1 = (Ws2 + b)
TensorFlow Code
s1 = tf.matmul(x, W1) + b1
s1 = tf.nn.sigmoid(s1)
s2 = tf.matmul(s1, W2) + b2
s2 = tf.nn.sigmoid(s2)
Multilayer Perceptrons
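For readers without TensorFlow, the same two sigmoid layers can be sketched in numpy so the shapes are explicit. Assuming, per the diagram, a 100-unit input x and two 1000-unit layers, with placeholder random weights (not trained values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 100))            # input layer: 100 units
W1 = rng.normal(size=(100, 1000)) * 0.01
b1 = np.zeros(1000)
W2 = rng.normal(size=(1000, 1000)) * 0.01
b2 = np.zeros(1000)

s1 = sigmoid(x @ W1 + b1)                # first layer: 1000 units
s2 = sigmoid(s1 @ W2 + b2)               # second layer: 1000 units
print(s1.shape, s2.shape)                # (1, 1000) (1, 1000)
```

Each line mirrors one line of the TensorFlow snippet: a matmul plus bias, then an elementwise sigmoid.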
Using Discrete Variables
Data
0
1
16
5
20
6
?
3
Using Discrete Variables
Data
0
1
16
5
20
6
?
3
?
Using Discrete Variables
x
s5
s1
s2
s3
s4
s7
s6
y
Number of fruit to offer
Number of fruit received
Using Discrete Variables
x
y
Number of fruit to offer
Number of fruit received
s1
s2
Using Discrete Variables
x
y
Number of fruit to offer
u
Type of fruit to offer
v
Number of fruit received
Type of fruit received
s1
s2
Using Discrete Variables
x
y
Number of fruit to offer
u
Type of fruit to offer
v
Number of fruit received
Type of fruit received
s1
s2
u∈{Apple, Banana, Coconut}
v∈{Apple, Banana, Coconut}
Using Discrete Variables
Lookup Tables
e1 e2 e3 e4
Apple 0.1 -0.4 0.2 0.5
Banana 0.4 1.4 -1.0 0.1
Coconut 1.1 0.9 1.1 0.5
u
V = 3
Using Discrete Variables
Lookup Tables
e1 e2 e3 e4
Apple 0.1 -0.4 0.2 0.5
Banana 0.4 1.4 -1.0 0.1
Coconut 1.1 0.9 1.1 0.5
u
Embedding for u Size = 4
V = 3
Using Discrete Variables
Lookup Tables
e1 e2 e3 e4
Apple 0.1 -0.4 0.2 0.5
Banana 0.4 1.4 -1.0 0.1
Coconut 1.1 0.9 1.1 0.5
u
Embedding for u
Banana
Size = 4
V = 3
Using Discrete Variables
Lookup Tables
e1 e2 e3 e4
0 0.1 -0.4 0.2 0.5
1 0.4 1.4 -1.0 0.1
2 1.1 0.9 1.1 0.5
u
Embedding for u
1
Size = 4
V = 3
Using Discrete Variables
Lookup Tables
u
Embedding for u
1
Lookup
Size = 4
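A lookup table is just row selection in a V x size embedding matrix. A sketch using the slide's values (the dict/function names are ours):

```python
# Embedding lookup: map a discrete symbol to its row in a V x size matrix.
# V = 3 symbols, embedding size = 4; values copied from the slide's table.

vocab = {"Apple": 0, "Banana": 1, "Coconut": 2}
embeddings = [
    [0.1, -0.4, 0.2, 0.5],   # Apple
    [0.4, 1.4, -1.0, 0.1],   # Banana
    [1.1, 0.9, 1.1, 0.5],    # Coconut
]

def lookup(word):
    # Symbol -> integer id -> embedding row.
    return embeddings[vocab[word]]

print(lookup("Banana"))  # [0.4, 1.4, -1.0, 0.1]
```

The rows themselves are parameters: they are trained by gradient descent like any other weight.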
Using Discrete Variables
x
y
Number of fruit to offer
u
Type of fruit to offer
v
Number of fruit received
Type of fruit received
s1
s2
u∈{Apple, Banana, Coconut}
v∈{Apple, Banana, Coconut}
eu
Lookup
Using Discrete Variables
Softmax
V = 3
Apple Banana Coconut
w1 0.1 -0.4 0.2
w2 0.4 1.4 -1.0
w3 1.1 0.9 1.1
w4 1.3 0.1 0.4
Using Discrete Variables
Softmax
Input vector Size = 4
V = 3
Apple Banana Coconut
w1 0.1 -0.4 0.2
w2 0.4 1.4 -1.0
w3 1.1 0.9 1.1
w4 1.3 0.1 0.4
Using Discrete Variables
Softmax
Input vector Size = 4
logits Size = V
V = 3
Apple Banana Coconut
w1 0.1 -0.4 0.2
w2 0.4 1.4 -1.0
w3 1.1 0.9 1.1
w4 1.3 0.1 0.4
Using Discrete Variables
Softmax
Input Vector
Logits
V = 3
Apple Banana Coconut
w1 0.1 -0.4 0.2
w2 0.4 1.4 -1.0
w3 1.1 0.9 1.1
w4 1.3 0.1 0.4
s1
s2
s3
s4
d1
d2
d3
1 -1 -2
Using Discrete Variables
Softmax
Input Vector
Logits
V = 3
Apple Banana Coconut
w1 0.1 -0.4 0.2
w2 0.4 1.4 -1.0
w3 1.1 0.9 1.1
w4 1.3 0.1 0.4
s1
s2
s3
s4
d1
d2
d3
1 -1 -2
p1
p2
p3
0.84 0.11 0.04
Using Discrete Variables
Softmax
Input Vector
Logits
V = 3
Apple Banana Coconut
w1 0.1 -0.4 0.2
w2 0.4 1.4 -1.0
w3 1.1 0.9 1.1
w4 1.3 0.1 0.4
s1
s2
s3
s4
d1
d2
d3
1 -1 -2
p1
p2
p3
0.84 0.11 0.04
Apple
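The logits-to-probabilities step can be reproduced exactly. A pure-Python sketch over the slide's logits 1, -1, -2 (the function name is ours):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, -1.0, -2.0])
print([round(p, 2) for p in probs])  # [0.84, 0.11, 0.04] -> argmax is Apple
```

The largest logit always wins the argmax, so the model predicts Apple.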
Using Discrete Variables
x
y
Number of fruit to offer
u
Type of fruit to offer
v
Number of fruit received
Type of fruit received
s1
s2
u∈{Apple, Banana, Coconut}
v∈{Apple, Banana, Coconut}
eu
Softmax
Lookup
Example Applications
Window-based Tagging (Collobert et al, 2011)
Abby likes to eat apples and bananas
NNP VBZ TO VB NNS CC NNS
Example Applications
Window-based Tagging (Collobert et al, 2011)
Abby likes to eat apples and bananas
e-2 e-1 e0 e1 e2
Example Applications
Window-based Tagging (Collobert et al, 2011)
Abby likes to eat apples and bananas
e-2 e-1 e0 e1 e2 Word Embeddings
Non-Linear Layer 1
s1
s2 Non-Linear Layer 2
Example Applications
Window-based Tagging (Collobert et al, 2011)
Abby likes to eat apples and bananas
e-2 e-1 e0 e1 e2 Word Embeddings
Non-Linear Layer 1
s1
s2 Non-Linear Layer 2
VB Softmax
Example Applications
Window-based Tagging (Collobert et al, 2011)
Abby likes to eat apples and bananas
e-2 e-1 e0 e1 e2 Word Embeddings
Non-Linear Layer 1
s1
s2 Non-Linear Layer 2
VB Softmax
Example Applications
Window-based Tagging (Collobert et al, 2011)
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
Context
Predict
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
e-4 e-3 e-2 e-1
s1
s2
Softmax
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
<s>
0.2
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
0.2 0.1
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
0.2 0.1 0.3
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
0.2 0.1 0.3 0.5 0.7 0.4 0.2 → 0.000378
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas 0.000378
Abby dislikes to drink apples and bananas 0.00012
John does to eat coconuts and bananas 0.00003
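The sentence score is the product of the per-word probabilities the model assigns given each word's context. A sketch (the function name is ours; the probabilities are the slide's illustrative per-word numbers, and the exact headline score depends on the figures used):

```python
# Rescoring: a candidate translation's score is the product of the
# probability the model gives each word in its left-to-right context.

def sentence_score(word_probs):
    score = 1.0
    for p in word_probs:
        score *= p
    return score

probs = [0.2, 0.1, 0.3, 0.5, 0.7, 0.4, 0.2]  # one probability per word
print(sentence_score(probs))
```

Longer sentences multiply in more factors, so in practice these scores are summed in log space to avoid underflow.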
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
Context
Predict
Translation
Source
Abby gosta de comer maçãs e bananas
Example Applications
Translation Rescoring (Devlin et al, 2014)
Abby likes to eat apples and bananas
Translation
maçãs
e-4 e-3 e-2 e-1
s1
s2
f-1
Example Applications
Translation Rescoring (Devlin et al, 2014)
Translation Score (BLEU) Arabic - English Chinese - English
Best Rescored System 52.8 34.7
1st OpenMT12 49.5 32.6
Hierarchical 43.4 30.1
Deep Neural Networks are our friends?
Convolutional Neural Network
Deep Neural Networks are our friends?
Convolutional Neural Network
x1 x2 x3 x4
x5 x6 x7 x8
x9 x10 x11 x12
x13 x14 x15 x16
4x4 image
Deep Neural Networks are our friends?
Convolutional Neural Network
x1 x2 x3 x4
x5 x6 x7 x8
x9 x10 x11 x12
x13 x14 x15 x16
4x4 image
z1
z1 = w1x1 + w2x2 + ... + w9x11
Deep Neural Networks are our friends?
Convolutional Neural Network
x1 x2 x3 x4
x5 x6 x7 x8
x9 x10 x11 x12
x13 x14 x15 x16
4x4 image
z1 z2
z2 = w1x2 + w2x3 + ... + w9x12
Deep Neural Networks are our friends?
Convolutional Neural Network
x1 x2 x3 x4
x5 x6 x7 x8
x9 x10 x11 x12
x13 x14 x15 x16
4x4 image
z1 z2
z3 z4
Deep Neural Networks are our friends?
Convolutional Neural Network
x1 x2 x3 x4
x5 x6 x7 x8
x9 x10 x11 x12
x13 x14 x15 x16
4x4 image
z1 z2
z3 z4
z1
z2
z3
z4
y
Is this a cat?
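Sliding the same 3x3 weight window over the 4x4 image is what produces the 2x2 feature map z1..z4. A sketch (the image values and kernel are placeholders, not from the slides):

```python
# 2D convolution (valid padding): slide a k x k filter over an n x n image,
# reusing the same weights at every position -> an (n-k+1) x (n-k+1) map.

def conv2d(image, kernel):
    n, k = len(image), len(kernel)
    out = []
    for i in range(n - k + 1):
        row = []
        for j in range(n - k + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]  # picks out each window's centre pixel

print(conv2d(image, kernel))  # [[6, 7], [10, 11]]
```

Weight sharing is the point: the filter has 9 parameters no matter how large the image is.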